Course Outline

  • Section 1: Introduction to Hadoop
    • Hadoop history, concepts
    • ecosystem
    • distributions
    • high-level architecture
    • Hadoop myths
    • Hadoop challenges
    • hardware / software
    • labs: first look at Hadoop
  • Section 2: HDFS Overview
    • concepts (horizontal scaling, replication, data locality, rack awareness)
    • architecture (NameNode, Secondary NameNode, DataNode)
    • data integrity
    • future of HDFS: NameNode HA, Federation
    • labs: interacting with HDFS
  • Section 3: MapReduce Overview
    • MapReduce concepts
    • daemons: JobTracker / TaskTracker
    • phases: driver, mapper, shuffle/sort, reducer
    • thinking in MapReduce
    • future of MapReduce (YARN)
    • labs: running a MapReduce program
  • Section 4: Pig
    • Pig vs Java MapReduce
    • Pig Latin language
    • user defined functions
    • understanding Pig job flow
    • basic data analysis with Pig
    • complex data analysis with Pig
    • multiple datasets with Pig
    • advanced concepts
    • lab: writing Pig scripts to analyze / transform data
  • Section 5: Hive
    • Hive concepts
    • architecture
    • SQL support in Hive
    • data types
    • table creation and queries
    • Hive data management
    • partitions & joins
    • text analytics
    • labs (multiple): creating Hive tables and running queries, joins, using partitions, using text analytics functions
  • Section 6: BI Tools for Hadoop
    • BI tools and Hadoop
    • overview of the current BI tools landscape
    • choosing the best tool for the job


Prerequisites

  • programming background with databases / SQL
  • basic knowledge of Linux (able to navigate the Linux command line and edit files with vi / nano)

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided.

Students will need the following

Duration: 21 hours

