Course Outline

spark.mllib: data types, algorithms, and utilities

  • Data types
  • Basic statistics
    • summary statistics
    • correlations
    • stratified sampling
    • hypothesis testing
    • streaming significance testing
    • random data generation
  • Classification and regression
    • linear models (SVMs, logistic regression, linear regression)
    • naive Bayes
    • decision trees
    • ensembles of trees (Random Forests and Gradient-Boosted Trees)
    • isotonic regression
  • Collaborative filtering
    • alternating least squares (ALS)
  • Clustering
    • k-means
    • Gaussian mixture
    • power iteration clustering (PIC)
    • latent Dirichlet allocation (LDA)
    • bisecting k-means
    • streaming k-means
  • Dimensionality reduction
    • singular value decomposition (SVD)
    • principal component analysis (PCA)
  • Feature extraction and transformation
  • Frequent pattern mining
    • FP-growth
    • association rules
    • PrefixSpan
  • Evaluation metrics
  • PMML model export
  • Optimization (developer)
    • stochastic gradient descent
    • limited-memory BFGS (L-BFGS)

spark.ml: high-level APIs for ML pipelines

  • Overview: estimators, transformers and pipelines
  • Extracting, transforming and selecting features
  • Classification and regression
  • Clustering
  • Advanced topics

Requirements

Knowledge of one of the following:

  • Java
  • Scala
  • Python
  • SparkR

Duration: 35 hours
 
