Course Outline
Module 1. Introduction to Hadoop
- The Hadoop Distributed File System (HDFS)
- The Read Path and The Write Path
- Managing Filesystem Metadata
- The Namenode and the Datanode
- Namenode High Availability
- Namenode Federation
- The Command-Line Tools
- Understanding REST Support
Module 2. Introduction to MapReduce
- Analyzing the Data with Hadoop
- Map and Reduce Pattern
- Java MapReduce
- Scaling Out
- Data Flow
- Developing Combiner Functions
- Running a Distributed MapReduce Job
Module 3. Planning a Hadoop Cluster
- Picking a Distribution and Version of Hadoop
- Versions and Features
- Hardware Selection
- Master and Worker Hardware Selection
- Cluster Sizing
- Operating System Selection and Preparation
- Deployment Layout
- Setting up Users, Groups, and Privileges
- Disk Configuration
- Network Design
Module 4. Installation and Configuration
- Installing Hadoop
- Configuration: An Overview
- The Hadoop XML Configuration Files
- Environment Variables and Shell Scripts
- Logging Configuration
- Managing HDFS
- Optimization and Tuning
- Formatting the Namenode
- Creating a /tmp Directory
- Thinking about Namenode High Availability
- The Fencing Options
- Automatic Failover Configuration
- Format and Bootstrap the Namenodes
- Namenode Federation
Module 5. Understanding Hadoop I/O
- Data Integrity in HDFS
- Understanding Codecs
- Compression and Input Splits
- Using Compression in MapReduce
- The Serialization Mechanism
- File-Based Data Structures
- The SequenceFile Format
- Other File Formats and Column-Oriented Formats
Module 6. Developing a MapReduce Application
- The Configuration API
- Setting Up the Development Environment
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Writing a Unit Test with MRUnit
- The Mapper and Reducer
- Running Locally on Test Data
- Testing the Driver
- Running on a Cluster
- Packaging and Launching a Job
- The MapReduce Web UI
- Tuning a Job
Module 7. Identity, Authentication, and Authorization
- Managing Identity
- Kerberos and Hadoop
- Understanding Authorization
Module 8. Resource Management
- What Is Resource Management?
- HDFS Quotas
- MapReduce Schedulers
- Anatomy of a YARN Application Run
- Resource Requests
- Application Lifespan
- YARN Compared to MapReduce 1
- Scheduling in YARN
- Scheduler Options
- Capacity Scheduler Configuration
- Fair Scheduler Configuration
- Delay Scheduling
- Dominant Resource Fairness
Module 9. MapReduce Types and Formats
- MapReduce Types
- The Default MapReduce Job
- Defining the Input Formats
- Managing Input Splits and Records
- Text Input and Binary Input
- Managing Multiple Inputs
- Database Input (and Output)
- Output Formats
- Text Output and Binary Output
- Managing Multiple Outputs
- The Database Output
Module 10. Using MapReduce Features
- Using Counters
- Reading Built-in Counters
- User-Defined Java Counters
- Understanding Sorting
- Using the Distributed Cache
Module 11. Cluster Maintenance and Troubleshooting
- Managing Hadoop Processes
- Starting and Stopping Processes with Init Scripts
- Starting and Stopping Processes Manually
- HDFS Maintenance Tasks
- Adding a Datanode
- Decommissioning a Datanode
- Checking Filesystem Integrity with fsck
- Balancing HDFS Block Data
- Dealing with a Failed Disk
- MapReduce Maintenance Tasks
- Killing a MapReduce Job
- Killing a MapReduce Task
- Managing Resource Exhaustion
Module 12. Monitoring
- The Available Hadoop Metrics
- The Role of SNMP
- Health Monitoring
- Host-Level Checks
- HDFS Checks
- MapReduce Checks
Module 13. Backup and Recovery
- Data Backup
- Distributed Copy (distcp)
- Parallel Data Ingestion
- Namenode Metadata
Testimonials
The fact that all the data and software were ready to use on an already prepared VM, provided by the trainer on external disks.
vyzVoice