Course Outline


  • Introduction to Cloud Computing and Big Data solutions
  • Overview of Apache Hadoop Features and Architecture

Setting up Hadoop

  • Planning a Hadoop cluster (on-premise, cloud, etc.)
  • Selecting the OS and Hadoop distribution
  • Provisioning resources (hardware, network, etc.)
  • Downloading and installing the software
  • Sizing the cluster for flexibility

Working with HDFS

  • Understanding the Hadoop Distributed File System (HDFS)
  • Overview of HDFS Command Reference
  • Accessing HDFS
  • Performing Basic File Operations on HDFS
  • Using S3 as a complement to HDFS
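The basic file operations above map directly onto the `hdfs dfs` shell. A few representative commands (illustrative only: they assume a running cluster, an `hdfs` client on the PATH, and placeholder paths and file names):

```shell
# Create a directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put salaries.csv /user/hadoop/input/

# List and inspect files
hdfs dfs -ls /user/hadoop/input
hdfs dfs -cat /user/hadoop/input/salaries.csv

# Remove a file, bypassing the trash
hdfs dfs -rm -skipTrash /user/hadoop/input/salaries.csv
```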

Overview of MapReduce

  • Understanding Data Flow in the MapReduce Framework
  • Map, Shuffle, Sort and Reduce
  • Demo: Computing Top Salaries
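The data flow above can be sketched in plain Python. This is a local simulation, not actual Hadoop code, and the sample records are made up; it only shows how the Top Salaries demo decomposes into map, shuffle, sort and reduce steps:

```python
from collections import defaultdict

# Sample (employee, department, salary) records, standing in for input splits.
records = [
    ("alice", "eng", 120_000),
    ("bob", "eng", 95_000),
    ("carol", "sales", 88_000),
    ("dave", "sales", 102_000),
    ("erin", "eng", 130_000),
]

# Map phase: emit (key, value) pairs -- here (department, salary).
mapped = [(dept, salary) for _, dept, salary in records]

# Shuffle phase: group all values that share a key.
groups = defaultdict(list)
for dept, salary in mapped:
    groups[dept].append(salary)

# Sort phase: keys reach the reducers in sorted order.
# Reduce phase: one call per key -- here, keep the top salary.
top_salaries = {dept: max(salaries) for dept, salaries in sorted(groups.items())}

print(top_salaries)  # {'eng': 130000, 'sales': 102000}
```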

Working with YARN

  • Understanding resource management in Hadoop
  • Working with ResourceManager, NodeManager, Application Master
  • Scheduling jobs under YARN
  • Scheduling for large numbers of nodes and clusters
  • Demo: Job scheduling
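Job scheduling under YARN is commonly driven by the CapacityScheduler, configured in capacity-scheduler.xml. A minimal sketch, assuming two hypothetical queues named dev and prod:

```xml
<!-- capacity-scheduler.xml (fragment): two sibling queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,prod</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
```

Capacities of the queues under a parent must sum to 100; jobs are then submitted to a queue, e.g. with `-Dmapreduce.job.queuename=dev`.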

Integrating Hadoop with Spark

  • Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
  • Understanding Resilient Distributed Datasets (RDDs)
  • Creating an RDD
  • Implementing RDD Transformations
  • Demo: Implementing a Text Search Program for Movie Titles
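A feel for RDD transformations does not require a cluster. The sketch below uses a hypothetical MiniRDD class (not part of Spark) whose map, filter and collect mirror the PySpark RDD methods of the same names, applied to the movie-title search demo; the titles are made-up sample data:

```python
# A minimal, purely local stand-in for Spark's RDD API, for illustration only.
class MiniRDD:
    def __init__(self, data):
        self._data = list(data)

    def map(self, fn):          # transformation: returns a new RDD
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):     # transformation: returns a new RDD
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):          # action: returns the results to the driver
        return list(self._data)

movie_titles = ["The Matrix", "Spark of Genius", "Inside Out", "Sparkle"]

# Text search: keep titles containing "spark", case-insensitively.
matches = (MiniRDD(movie_titles)
           .map(str.lower)
           .filter(lambda t: "spark" in t)
           .collect())

print(matches)  # ['spark of genius', 'sparkle']
```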

Managing a Hadoop Cluster

  • Monitoring Hadoop
  • Securing a Hadoop cluster
  • Adding and removing nodes
  • Running a performance benchmark
  • Tuning a Hadoop cluster to optimize performance
  • Backup, recovery and business continuity planning
  • Ensuring high availability (HA)

Upgrading and Migrating a Hadoop Cluster

  • Assessing workload requirements
  • Upgrading Hadoop
  • Moving from on-premise to cloud and vice versa
  • Recovering from failures


Summary and Conclusion


Requirements

  • System administration experience
  • Experience with Linux command line
  • An understanding of big data concepts

Audience

  • System administrators
  • DBAs

Duration

  35 hours


Related Courses

Python and Spark for Big Data (PySpark)

  21 hours

Introduction to Graph Computing

  28 hours

Apache Spark MLlib

  35 hours

Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP

  21 hours

Hortonworks Data Platform (HDP) for Administrators

  21 hours

Apache Ambari: Efficiently Manage Hadoop Clusters

  21 hours

Impala for Business Intelligence

  21 hours

Data Analysis with Hive/HiveQL

  7 hours

Spark for Developers

  21 hours

Magellan: Geospatial Analytics on Spark

  14 hours

Alluxio: Unifying Disparate Storage Systems

  7 hours

Apache Spark SQL

  7 hours

A Practical Introduction to Stream Processing

  21 hours

Big Data Analytics in Health

  21 hours