Course Outline

Introduction

Principles of Distributed Computing

  • Apache Spark
  • Hadoop

Principles of Data Serialization

  • How a data object is passed over the network
  • Serialization of objects
  • Serialization approaches
    • Thrift
    • Protocol Buffers
    • Apache Avro
      • data structure
      • size, speed, format characteristics
      • persistent data storage
      • integration with dynamic languages
      • dynamic typing
      • schemas
        • untagged data
        • change management

Data Serialization and Distributed Computing

  • Avro as a subproject of Hadoop
    • Java serialization
    • Hadoop serialization
    • Avro serialization

Using Avro with

  • Hive (AvroSerDe)
  • Pig (AvroStorage)

Porting Existing RPC Frameworks

Summary and Conclusion

Requirements

  • A general familiarity with distributed computing.

Duration

  14 hours
