Course Outline

1: HDFS (17%)

  • Explain the role of HDFS Daemons
  • Describe the standard operation of an Apache Hadoop cluster, encompassing both data storage and processing functions.
  • Identify current computing system features that drive the need for systems like Apache Hadoop.
  • Classify the primary objectives of HDFS Design.
  • Determine the appropriate use case for HDFS Federation within a given scenario.
  • Identify the components and daemons of an HDFS HA-Quorum cluster.
  • Analyze the role of HDFS security mechanisms, including Kerberos.
  • Select the optimal data serialization choice for a specific scenario.
  • Describe the pathways for file read and write operations.
  • Identify the commands required to manipulate files using the Hadoop File System Shell.
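To make the block-storage and write-path objectives above concrete, here is a minimal sketch in plain Python (not Hadoop code) of how a file is split into fixed-size blocks and how replicas are spread across racks. The 128 MB block size and replication factor of 3 are the HDFS defaults in CDH 5; the rack names and the simplified placement policy below are illustrative assumptions.

```python
# Sketch: split a file into HDFS-style blocks and pick replica locations.
# 128 MB blocks and replication factor 3 are the CDH 5 defaults;
# the rack names and simplified placement policy are hypothetical.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

def place_replicas(racks, local_rack):
    """Simplified default placement: one replica on the writer's rack,
    the remaining two together on a single remote rack."""
    remote = next(r for r in racks if r != local_rack)
    return [(local_rack, 1), (remote, 2)]  # (rack, replica count)

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                   # -> 3 blocks
print(blocks[-1] // (1024 * 1024))   # -> 44 (MB in the final partial block)
```

Note that the last block occupies only as much space as it needs; HDFS does not pad partial blocks to the full block size.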

2: YARN and MapReduce version 2 (MRv2) (17%)

  • Understand the impact on cluster settings when upgrading from Hadoop 1 to Hadoop 2.
  • Comprehend the deployment of MapReduce v2 (MRv2 / YARN), including all associated YARN daemons.
  • Understand the core design strategy for MapReduce v2 (MRv2).
  • Determine how YARN manages resource allocations.
  • Identify the workflow of a MapReduce job executing on YARN.
  • Identify the specific files that must be modified and how to do so when migrating a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) on YARN.
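One way to see how YARN "manages resource allocations" is to look at how the scheduler normalizes a container request: memory asks are rounded up to a multiple of the minimum allocation and capped at the maximum. The sketch below uses 1024 MB and 8192 MB, mirroring the usual `yarn.scheduler.minimum-allocation-mb` and `yarn.scheduler.maximum-allocation-mb` defaults; it is an illustration of the rounding rule, not YARN source code.

```python
# Sketch of YARN container-request normalization: requests are rounded
# up to a multiple of the minimum allocation and capped at the maximum.
# 1024/8192 MB mirror the usual YARN scheduler defaults.
import math

MIN_ALLOC_MB = 1024
MAX_ALLOC_MB = 8192

def normalize_request(requested_mb,
                      min_alloc=MIN_ALLOC_MB,
                      max_alloc=MAX_ALLOC_MB):
    granted = math.ceil(requested_mb / min_alloc) * min_alloc
    return min(granted, max_alloc)

print(normalize_request(1500))   # -> 2048 (rounded up to 2 x 1024)
print(normalize_request(9000))   # -> 8192 (capped at the maximum)
```

This is why a job that asks for slightly more memory than a multiple of the minimum allocation ends up consuming a whole extra allocation unit on the cluster.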

3: Hadoop Cluster Planning (16%)

  • Highlight key considerations for selecting hardware and operating systems to host an Apache Hadoop cluster.
  • Analyze the options available when selecting an operating system.
  • Understand kernel tuning and disk swapping processes.
  • Identify an appropriate hardware configuration for a given scenario and workload pattern.
  • Determine the necessary ecosystem components for a cluster to meet SLA requirements in a specific scenario.
  • Cluster sizing: Identify workload specifics, including CPU, memory, storage, and disk I/O, based on a scenario and execution frequency.
  • Disk Sizing and Configuration: Understand JBOD versus RAID, SANs, virtualization, and disk sizing requirements within a cluster.
  • Network Topologies: Understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario.
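The cluster-sizing bullet above often comes down to back-of-the-envelope arithmetic. The sketch below assumes the HDFS default replication factor of 3; the 25% scratch-space overhead for intermediate MapReduce output and the 12 TB of disk per node are common planning assumptions, not fixed rules.

```python
# Back-of-the-envelope cluster sizing: raw disk needed for a data set.
# Replication factor 3 is the HDFS default; the 25% scratch-space
# overhead and 12 TB per node are illustrative planning assumptions.
import math

def raw_storage_tb(data_tb, replication=3, temp_overhead=0.25):
    """Raw disk required: every block is replicated, plus scratch space
    for intermediate MapReduce output."""
    return data_tb * replication * (1 + temp_overhead)

def nodes_needed(data_tb, per_node_tb=12, **kwargs):
    return math.ceil(raw_storage_tb(data_tb, **kwargs) / per_node_tb)

print(raw_storage_tb(100))   # 100 TB of data -> 375.0 TB raw
print(nodes_needed(100))     # -> 32 nodes at 12 TB of disk each
```

Growth rate matters as much as the starting point: the same arithmetic applied to projected data volume a year out usually drives the actual hardware order.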

4: Hadoop Cluster Installation and Administration (25%)

  • Identify how the cluster handles disk and machine failures in a given scenario.
  • Analyze logging configuration and the format of logging configuration files.
  • Understand the fundamentals of Hadoop metrics and cluster health monitoring.
  • Identify the function and purpose of available tools for cluster monitoring.
  • Install all ecosystem components in CDH 5, including (but not limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig.
  • Identify the function and purpose of available tools for managing the Apache Hadoop file system.

5: Resource Management (10%)

  • Understand the overall design goals of each of the Hadoop schedulers.
  • Determine how the FIFO Scheduler allocates cluster resources in a given scenario.
  • Determine how the Fair Scheduler allocates cluster resources under YARN in a given scenario.
  • Determine how the Capacity Scheduler allocates cluster resources in a given scenario.
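The key difference among the three schedulers is how capacity is divided. The sketch below shows the core idea behind the Fair Scheduler: cluster capacity split among active queues in proportion to their weights (FIFO instead gives the whole cluster to the oldest job; the Capacity Scheduler reserves a fixed percentage per queue). The queue names and weights are hypothetical.

```python
# Sketch of weighted fair-share allocation, the core idea behind the
# Fair Scheduler: capacity is divided among active queues in proportion
# to their weights. Queue names and weights are hypothetical.
def fair_shares(total_mb, queue_weights):
    total_weight = sum(queue_weights.values())
    return {queue: total_mb * weight / total_weight
            for queue, weight in queue_weights.items()}

shares = fair_shares(100_000, {"research": 1, "production": 3})
print(shares)   # -> {'research': 25000.0, 'production': 75000.0}
```

If the "research" queue goes idle, the Fair Scheduler recomputes shares over the remaining active queues, so "production" would temporarily receive the full capacity.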

6: Monitoring and Logging (15%)

  • Understand the functions and features of Hadoop’s metric collection capabilities.
  • Analyze the NameNode and JobTracker Web UIs.
  • Understand methods for monitoring cluster Daemons.
  • Identify and monitor CPU usage on master nodes.
  • Describe methods for monitoring swap and memory allocation across all nodes.
  • Identify methods for viewing and managing Hadoop’s log files.
  • Interpret a log file.
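Interpreting a log file mostly means recognizing the Log4j layout that Hadoop daemons emit: timestamp, level, logger class, message. The sketch below parses one such line with a regular expression; the sample line itself is made up for illustration.

```python
# Sketch: parse a Log4j-style line of the kind Hadoop daemons emit
# (timestamp, level, logger class, message). The sample line is made up.
import re

LOG4J = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+): (?P<msg>.*)"
)

line = ("2014-06-01 12:00:03,117 WARN "
        "org.apache.hadoop.hdfs.server.namenode.NameNode: "
        "Low on available disk space")

m = LOG4J.match(line)
print(m.group("level"))   # -> WARN
print(m.group("logger"))  # -> org.apache.hadoop.hdfs.server.namenode.NameNode
```

Filtering on the level field (WARN and above) is usually the first step when triaging a daemon's log.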

Requirements

  • Foundational Linux administration skills
  • Basic programming proficiency

Duration: 35 Hours
