Course Outline


In-Depth Review of Scala Programming

  • Syntax and structure
  • Flow control and functions

Spark Internals

  • Resilient Distributed Datasets (RDD)
  • From Spark script to execution graph to cluster

Overview of Spark Streaming

  • Streaming architecture
  • Intervals in streaming
  • Fault tolerance

Preparing the Development Environment

  • Installing and configuring Apache Spark
  • Installing and configuring the Scala IDE
  • Installing and configuring JDK

Spark Streaming Beginner to Advanced

  • Working with key/value RDDs
  • Filtering RDDs
  • Improving Spark scripts with regular expressions
  • Sharing data on a cluster
  • Working with network data sets
  • Implementing breadth-first search (BFS) algorithms
  • Creating Spark driver scripts
  • Tracking in real time with scripts
  • Writing continuous applications
  • Streaming linear regression
  • Using Spark Machine Learning Library

Spark and Clusters

  • Bundling dependencies and Spark scripts using the SBT tool
  • Using Amazon EMR to illustrate clusters
  • Optimizing by partitioning RDDs
  • Using Spark logs

Integration in Spark Streaming

  • Integrating Apache Kafka and working with Kafka topics
  • Integrating Apache Flume and working with pull-based/push-based Flume configurations
  • Writing a custom receiver class
  • Integrating Cassandra and exposing data as real-time services

In Production

  • Packaging an application and running it with spark-submit
  • Troubleshooting, tuning, and debugging Spark jobs and clusters

Summary and Conclusion


Requirements

  • Programming and scripting experience

Audience

  • Software Engineers

Duration

  • 21 hours

