Course Outline


  • Overview of Spark and Hadoop features and architecture
  • Understanding big data
  • Python programming basics

Getting Started

  • Setting up Python, Spark, and Hadoop
  • Understanding data structures in Python
  • Understanding PySpark API
  • Understanding HDFS and MapReduce

Integrating Spark and Hadoop with Python

  • Implementing Spark RDD in Python
  • Processing data using MapReduce
  • Creating distributed datasets in HDFS
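As a rough illustration of the MapReduce model named in this module, the sketch below counts words in plain Python, with no cluster required. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical and are not taken from the course material. In PySpark, the same pattern is usually written with the RDD transformations `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across machines; here it is a local dict.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, standing in for lines read from a file in HDFS.
sample_lines = ["spark and hadoop", "spark with python"]
counts = word_count(sample_lines)
```

Keeping the three phases as separate functions mirrors how the framework splits the work: the map and reduce steps are supplied by the user, while the shuffle is handled by the runtime.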

Machine Learning with Spark MLlib

Processing Big Data with Spark Streaming

Working with Recommender Systems

Working with Kafka, Sqoop, and Flume

Apache Mahout with Spark and Hadoop


Summary and Next Steps


Requirements

  • Experience with Spark and Hadoop
  • Python programming experience


Audience

  • Data scientists
  • Developers

Duration: 21 hours


Related Courses

Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP

 21 hours

This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and

Apache Spark MLlib

 35 hours

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative

Scaling Data Analysis with Python and Dask

 14 hours

Dask is a flexible and high-performance Python library for parallel computing. It scales and accelerates big data processing with other Python-based data science libraries, such as Pandas, NumPy, and Scikit-Learn. This instructor-led, live

Data Analysis with Python, Pandas, and Numpy

 14 hours

Pandas is a Python package that provides data structures for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data.

Accelerating Python Pandas Workflows with Modin

 14 hours

Modin is a parallel data frame system designed to speed up Pandas workflows. It can be used to handle large datasets, leveraging Ray or Dask as the backend framework for distributed computing in Python. This instructor-led, live training (online

Machine Learning with Python and Pandas

 14 hours

Pandas is a Python library for data manipulation and analysis. Using Pandas, users can perform predictive analysis through machine learning. This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use Pandas

FARM (FastAPI, React, and MongoDB) Full Stack Development

 14 hours

FARM (FastAPI, React, and MongoDB) is similar to MERN, but performs faster with Python and FastAPI replacing Node.js and Express as the backend. FastAPI is a high-performance Python web framework used by top companies, such as Microsoft, Uber, and

Developing APIs with Python and FastAPI

 14 hours

FastAPI is an open source, high-performance web framework for building APIs with Python. It is used by many large companies, such as Uber, Netflix, and Microsoft. This instructor-led, live training (online or onsite) is aimed at developers who

Web application development with Flask

 14 hours

This practical course is addressed to Python developers who want to create and maintain their first web applications. It is also addressed to people who are already familiar with other web frameworks such as Django or Web2py, and want to learn

Advanced Flask

 14 hours

Flask is a micro-framework for developing web applications in Python. Unlike many other frameworks, Flask has few dependencies on external libraries, making it lightweight and fast. This instructor-led, live training (online or onsite) is

Build REST APIs with Python and Flask

 14 hours

Flask is a micro-framework for developing web services in Python. Unlike many other frameworks, Flask has few dependencies on external libraries, making it lightweight and fast. This instructor-led, live training (online or onsite) is aimed

Introduction to Graph Computing

 28 hours

Many real-world problems can be described in terms of graphs, for example the Web graph, the social network graph, the train network graph, and the language graph. These graphs tend to be extremely large; processing them requires a specialized set

Game Development with PyGame

 7 hours

PyGame is an open source library of Python modules for developing game applications and programs. It is lightweight, easy to use, and compatible with any operating system or platform. This instructor-led, live training (online or onsite) is aimed

Scientific Computing with Python SciPy

 7 hours

SciPy is an open source Python library for scientific, mathematical, and technical computing. It is built on the NumPy extension, providing a wide range of functionalities for performing complex numerical operations. This instructor-led, live

Python and Spark for Big Data (PySpark)

 21 hours

Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python. In this