Course Outline

Module 1. Introduction to Hadoop

  • The Hadoop Distributed File System (HDFS)
  • The Read Path and The Write Path
  • Managing Filesystem Metadata
  • The Namenode and the Datanode
  • Namenode High Availability
  • Namenode Federation
  • The Command-Line Tools
  • Understanding REST Support

Module 2. Introduction to MapReduce

  • Analyzing the Data with Hadoop
  • Map and Reduce Pattern
  • Java MapReduce
  • Scaling Out
  • Data Flow
  • Developing Combiner Functions
  • Running a Distributed MapReduce Job

Module 3. Planning a Hadoop Cluster

  • Picking a Distribution and Version of Hadoop
  • Versions and Features
  • Hardware Selection
  • Master and Worker Hardware Selection
  • Cluster Sizing
  • Operating System Selection and Preparation
  • Deployment Layout
  • Setting up Users, Groups, and Privileges
  • Disk Configuration
  • Network Design

Module 4. Installation and Configuration

  • Installing Hadoop
  • Configuration: An Overview
  • The Hadoop XML Configuration Files
  • Environment Variables and Shell Scripts
  • Logging Configuration
  • Managing HDFS
  • Optimization and Tuning
  • Formatting the Namenode
  • Creating a /tmp Directory
  • Thinking About Namenode High Availability
  • The Fencing Options
  • Automatic Failover Configuration
  • Format and Bootstrap the Namenodes
  • Namenode Federation

Module 5. Understanding Hadoop I/O

  • Data Integrity in HDFS  
  • Understanding Codecs
  • Compression and Input Splits
  • Using Compression in MapReduce
  • The Serialization Mechanism
  • File-Based Data Structures
  • The SequenceFile Format
  • Other File Formats and Column-Oriented Formats

Module 6. Developing a MapReduce Application

  • The Configuration API 
  • Setting Up the Development Environment
  • Managing Configuration
  • GenericOptionsParser, Tool, and ToolRunner
  • Writing a Unit Test with MRUnit
  • The Mapper and Reducer
  • Running Locally on Test Data 
  • Testing the Driver
  • Running on a Cluster
  • Packaging and Launching a Job
  • The MapReduce Web UI
  • Tuning a Job

Module 7. Identity, Authentication, and Authorization

  • Managing Identity
  • Kerberos and Hadoop
  • Understanding Authorization

Module 8. Resource Management

  • What Is Resource Management?
  • HDFS Quotas
  • MapReduce Schedulers
  • Anatomy of a YARN Application Run
  • Resource Requests
  • Application Lifespan
  • YARN Compared to MapReduce 1
  • Scheduling in YARN
  • Scheduler Options
  • Capacity Scheduler Configuration
  • Fair Scheduler Configuration
  • Delay Scheduling
  • Dominant Resource Fairness

Module 9. MapReduce Types and Formats

  • MapReduce Types
  • The Default MapReduce Job
  • Defining the Input Formats
  • Managing Input Splits and Records
  • Text Input and Binary Input
  • Managing Multiple Inputs
  • Database Input (and Output)
  • Output Formats
  • Text Output and Binary Output
  • Managing Multiple Outputs
  • The Database Output

Module 10. Using MapReduce Features

  • Using Counters
  • Reading Built-in Counters
  • User-Defined Java Counters
  • Understanding Sorting
  • Using the Distributed Cache

Module 11. Cluster Maintenance and Troubleshooting

  • Managing Hadoop Processes
  • Starting and Stopping Processes with Init Scripts
  • Starting and Stopping Processes Manually
  • HDFS Maintenance Tasks
  • Adding a Datanode
  • Decommissioning a Datanode
  • Checking Filesystem Integrity with fsck
  • Balancing HDFS Block Data
  • Dealing with a Failed Disk
  • MapReduce Maintenance Tasks 
  • Killing a MapReduce Job
  • Killing a MapReduce Task
  • Managing Resource Exhaustion

Module 12. Monitoring

  • The Available Hadoop Metrics
  • The Role of SNMP
  • Health Monitoring
  • Host-Level Checks
  • HDFS Checks
  • MapReduce Checks

Module 13. Backup and Recovery

  • Data Backup
  • Distributed Copy (distcp)
  • Parallel Data Ingestion
  • Namenode Metadata

Duration: 21 Hours