Course Outline

Introduction

Principles of Distributed Computing

  • Apache Spark
  • Hadoop

Principles of Data Serialization

  • How a data object is passed over the network
  • Serialization of objects
  • Serialization approaches
    • Thrift
    • Protocol Buffers
    • Apache Avro
      • data structure
      • size, speed, format characteristics
      • persistent data storage
      • integration with dynamic languages
      • dynamic typing
      • schemas
        • untagged data
        • change management

Data Serialization and Distributed Computing

  • Avro as a subproject of Hadoop
    • Java serialization
    • Hadoop serialization
    • Avro serialization

Using Avro with

  • Hive (AvroSerDe)
  • Pig (AvroStorage)

Porting Existing RPC Frameworks

Summary and Conclusion

Requirements

  • A general familiarity with distributed computing.

Duration: 14 hours
