Hadoop for Developers

Duration: 4 days

We’re very happy to present this course from our partners, Jumping Bean, a company with significant experience delivering high quality courses focussing on Apache Hadoop.

Apache Hadoop is an open-source framework that has been designed to handle large data sets across clusters of computers. It manages distributed storage and processing of data by breaking it up and spreading it across cluster nodes.

This Apache Hadoop course that we are presenting, and very much looking forward to, will give participants an in-depth look at fundamental concepts and processes that are integral to getting Apache Hadoop working efficiently and solidly for businesses of all sizes.

Developed by Doug Cutting and Michale J., Hadoop makes it less costly to store and process data, spreading the processing requirements amongst different computers to make complex calculations.

One of the most impressive uses of Apache Hadoop is the National Security Agency (NSA) in the United States of America using it to detect and foil terrorist and cyber attacks. Police also use it to catch criminals while credit card companies also use it to identify fraudulent transactions.

Below is an outline for this 3 day course which gives a thorough breakdown of all the topics that will be covered. For more information, do not hesitate to contact us directly, we'd love to hear from you.

Please find the course objectives below:

Understanding Bigdata & Hadoop

  • Introduction to Big Data & Big Data Challenges
  • Limitations & Solutions of Big Data Architecture
  • Hadoop & its Features
  • Hadoop Ecosystem
  • Hadoop 2.x Core Components
  • Hadoop Storage: HDFS (Hadoop Distributed File System)
  • Hadoop Processing: MapReduce Framework
  • Different Hadoop Distributions

Hadoop Architecture & HDFS

  • Hadoop 2.x Cluster Architecture
  • Federation and High Availability Architecture
  • Typical Production Hadoop Cluster
  • Hadoop Cluster Modes
  • Common Hadoop Shell Commands
  • Hadoop 2.x Configuration Files
  • Single Node Cluster & Multi-Node Cluster set up
  • Basic Hadoop Administration

Hadoop MapReduce Framework

  • Traditional way vs MapReduce way
  • Why MapReduce
  • YARN Components
  • YARN Architecture
  • YARN MapReduce Application Execution Flow
  • YARN Workflow
  • Anatomy of MapReduce Program
  • Input Splits, Relation between Input Splits and HDFS Blocks
  • MapReduce: Combiner & Partitioner
  • Demo of Health Care Dataset
  • Demo of Weather Dataset

Advanced Hadoop MapReduce

  • Counters
  • Distributed Cache
  • MRunit
  • Reduce Join
  • Custom Input Format
  • Sequence Input Format
  • XML file Parsing using MapReduce

Apache Pig

  • Introduction to Apache Pig
  • MapReduce vs Pig
  • Pig Components & Pig Execution
  • Pig Data Types & Data Models in Pig
  • Pig Latin Programs
  • Shell and Utility Commands
  • Pig UDF & Pig Streaming
  • Testing Pig scripts with Punit
  • Aviation use-case in PIG
  • Pig Demo of Healthcare Dataset

Apache Hive

  • Introduction to Apache Hive
  • Hive vs Pig
  • Hive Architecture and Components
  • Hive Metastore
  • Limitations of Hive
  • Comparison with Traditional Database
  • Hive Data Types and Data Models
  • Hive Partition
  • Hive Bucketing
  • Hive Tables (Managed Tables and External Tables)
  • Importing Data
  • Querying Data & Managing Outputs
  • Hive Script & Hive UDF
  • Retail use case in Hive
  • Hive Demo on Healthcare Dataset

Advanced Apache Hive & HBase

  • Hive QL: Joining Tables, Dynamic Partitioning
  • Custom MapReduce Scripts
  • Hive Indexes and views
  • Hive Query Optimizers
  • Hive Thrift Server
  • Hive UDF
  • Apache HBase: Introduction to NoSQL Databases and HBase
  • HBase v/s RDBMS
  • HBase Components
  • HBase Architecture
  • HBase Run Modes
  • HBase Configuration
  • HBase Cluster Deployment

Advanced Apache Hbase

  • HBase Data Model
  • HBase Shell
  • HBase Client API
  • Hive Data Loading Techniques
  • Apache Zookeeper Introduction
  • ZooKeeper Data Model
  • Zookeeper Service
  • HBase Bulk Loading
  • Getting and Inserting Data
  • HBase Filters

Processing Distributed Data with Apache Spark

  • What is Spark
  • Spark Ecosystem
  • Spark Components
  • What is Scala
  • Why Scala
  • SparkContext
  • Spark RDD

Oozie & Hadoop Project

  • Oozie
  • Oozie Components
  • Oozie Workflow
  • Scheduling Jobs with Oozie Scheduler
  • Demo of Oozie Workflow
  • Oozie Coordinator
  • Oozie Commands
  • Oozie Web Console
  • Oozie for MapReduce
  • Combining flow of MapReduce Jobs
  • Hive in Oozie
  • Hadoop Project Demo
  • Hadoop Talend Integration
For an onsite course please contact us