Course Outline

spark.mllib: data types, algorithms, and utilities

  • Data types
  • Basic statistics
    • summary statistics
    • correlations
    • stratified sampling
    • hypothesis testing
    • streaming significance testing
    • random data generation
  • Classification and regression
    • linear models (SVMs, logistic regression, linear regression)
    • naive Bayes
    • decision trees
    • ensembles of trees (Random Forests and Gradient-Boosted Trees)
    • isotonic regression
  • Collaborative filtering
    • alternating least squares (ALS)
  • Clustering
    • k-means
    • Gaussian mixture
    • power iteration clustering (PIC)
    • latent Dirichlet allocation (LDA)
    • bisecting k-means
    • streaming k-means
  • Dimensionality reduction
    • singular value decomposition (SVD)
    • principal component analysis (PCA)
  • Feature extraction and transformation
  • Frequent pattern mining
    • FP-growth
    • association rules
    • PrefixSpan
  • Evaluation metrics
  • PMML model export
  • Optimization (developer)
    • stochastic gradient descent
    • limited-memory BFGS (L-BFGS) high-level APIs for ML pipelines

  • Overview: estimators, transformers and pipelines
  • Extracting, transforming and selecting features
  • Classification and regression
  • Clustering
  • Advanced topics


Knowledge of one of the following:

  • Java
  • Scala
  • Python
  • SparkR.
  35 Hours


Related Courses

Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP

 21 hours

This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and

Alluxio: Unifying Disparate Storage Systems

 7 hours

Alluxio is an open-source virtual distributed storage system that unifies disparate storage systems and enables applications to interact with data at memory speed. It is used by companies such as Intel, Baidu and Alibaba. In this instructor-led,

Big Data Analytics in Health

 21 hours

Big data analytics involves the process of examining large amounts of varied data sets in order to uncover correlations, hidden patterns, and other useful insights. The health industry has massive amounts of complex heterogeneous medical and

Hadoop and Spark for Administrators

 35 hours

Apache Hadoop is a popular data processing framework for processing large data sets across many computers. This instructor-led, live training (online or onsite) is aimed at system administrators who wish to learn how to set up, deploy and manage

Apache Spark for .NET Developers

 21 hours

Apache Spark is a distributed processing engine for analyzing very large data sets. It can process data in batches and real-time, as well as carry out machine learning, ad-hoc queries, and graph processing. .NET for Apache Spark is a free,

Apache Spark Fundamentals

 21 hours

Apache Spark is an analytics engine designed to distribute data across a cluster in order to process it in parallel. It contains modules for streaming, SQL, machine learning and graph processing. This instructor-led, live training (online or

Apache Spark in the Cloud

 21 hours

Apache Spark's learning curve is slowly increasing at the begining, it needs a lot of effort to get the first return. This course aims to jump through the first tough part. After taking this course the participants will understand the

Spark for Developers

 21 hours

OBJECTIVE: This course will introduce Apache Spark. The students will learn how  Spark fits  into the Big Data ecosystem, and how to use Spark for data analysis.  The course covers Spark shell for interactive data analysis, Spark

Apache Spark SQL

 7 hours

Spark SQL is Apache Spark's module for working with structured and unstructured data. Spark SQL provides information about the structure of the data as well as the computation being performed. This information can be used to perform

Hortonworks Data Platform (HDP) for Administrators

 21 hours

Hortonworks Data Platform (HDP) is an open-source Apache Hadoop support platform that provides a stable foundation for developing big data solutions on the Apache Hadoop ecosystem. This instructor-led, live training (online or onsite) introduces

A Practical Introduction to Stream Processing

 21 hours

Stream Processing refers to the real-time processing of "data in motion", that is, performing computations on data as it is being received. Such data is read as continuous streams from data sources such as sensor events, website user

Magellan: Geospatial Analytics on Spark

 14 hours

Magellan is an open-source distributed execution engine for geospatial analytics on big data. Implemented on top of Apache Spark, it extends Spark SQL and provides a relational abstraction for geospatial analytics. This instructor-led, live

SMACK Stack for Data Science

 14 hours

SMACK is a collection of data platform softwares, namely Apache Spark, Apache Mesos, Apache Akka, Apache Cassandra, and Apache Kafka. Using the SMACK stack, users can create and scale data processing platforms. This instructor-led, live training

Scaling Data Pipelines with Spark NLP

 14 hours

Spark NLP is an open source library, built on Apache Spark, for natural language processing with Python, Java, and Scala. It is widely used for enterprise and industry verticals, such as healthcare, finance, life science, and recruiting. This

Apache Spark Streaming with Scala

 21 hours

Scala is a condensed version of Java for large scale functional and object-oriented programming. Apache Spark Streaming is an extended component of the Spark API for processing big data sets as real-time streams. Together, Spark Streaming and