Course Outline

  1. Big data fundamentals
    • Big Data and its role in the corporate world
    • The phases of development of a Big Data strategy within a corporation
    • Explain the rationale underlying a holistic approach to Big Data
    • Components needed in a Big Data Platform
    • Big data storage solution
    • Limits of Traditional Technologies
    • Overview of database types
    • The four dimensions of Big Data
  2. Big data impact on business
    • Business importance of Big Data
    • Challenges of extracting useful data
    • Integrating Big data with traditional data
  3. Big data storage technologies
    • Overview of big data technologies
      • Data storage models
      • Hadoop
      • Hive
      • Cassandra
      • MongoDB
    • Choosing the right big data technology
  4. Processing big data
    • Connecting to databases and extracting data
    • Transforming and preparing data for processing
    • Using Hadoop MapReduce for processing distributed data
    • Monitoring and executing Hadoop MapReduce jobs
    • Hadoop distributed file system building blocks
    • MapReduce and YARN
    • Handling streaming data with Spark
  5. Big data analysis tools and technologies
    • Programming Hadoop with Pig Latin language
    • Querying big data with Hive
    • Mining data with Mahout
    • Visualizing and reporting tools
  6. Big data in business
    • Managing and establishing Big Data needs
    • Business importance of Big Data
    • Selecting the right big data tools for the problem

 

Data Warehousing Concepts

  • What is a Data Warehouse?
  • Difference between OLTP and Data Warehousing
  • Data Acquisition
  • Data Extraction
  • Data Transformation
  • Data Loading
  • Data Marts
  • Dependent vs Independent Data Marts
  • Database design

ETL Testing Concepts

  • Introduction
  • Software development life cycle
  • Testing methodologies
  • ETL Testing Work Flow Process
  • ETL Testing Responsibilities in DataStage

Big Data Fundamentals

  • Big Data and its role in the corporate world
  • The phases of development of a Big Data strategy within a corporation
  • Explain the rationale underlying a holistic approach to Big Data
  • Components needed in a Big Data Platform
  • Big data storage solution
  • Limits of Traditional Technologies
  • Overview of database types

NoSQL Databases

Hadoop

MapReduce

Apache Spark

 

Requirements

Delegates should have an awareness of, and some experience with, storage tools, as well as an awareness of handling large data sets.

  14 Hours
 

Related Courses

Apache Accumulo Fundamentals

 21 hours

Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval. It is based on the design of Google's BigTable and is powered by Apache Hadoop, Apache Zookeeper, and Apache Thrift. This

Apache Airflow

 21 hours

Apache Airflow is a platform for authoring, scheduling and monitoring workflows. This instructor-led, live training (online or onsite) is aimed at data scientists who wish to use Apache Airflow to build and manage end-to-end data pipelines. By

Apache Drill

 21 hours

Apache Drill is a schema-free, distributed, in-memory columnar SQL query engine for Hadoop, NoSQL and other Cloud and file storage systems. The power of Apache Drill lies in its ability to join data from multiple data stores using a single query.

Apache Drill Performance Optimization and Debugging

 7 hours

Apache Drill is a schema-free, distributed, in-memory columnar SQL query engine for Hadoop, NoSQL and other Cloud and file storage systems. The power of Apache Drill lies in its ability to join data from multiple data stores using a single

Apache Drill Query Optimization

 7 hours

Apache Drill is a schema-free, distributed, in-memory columnar SQL query engine for Hadoop, NoSQL and other Cloud and file storage systems. The power of Apache Drill lies in its ability to join data from multiple data stores using a single query.

Apache Hama

 14 hours

Apache Hama is a framework based on the Bulk Synchronous Parallel (BSP) computing model and is primarily used for Big Data analytics. In this instructor-led, live training, participants will learn the fundamentals of Apache Hama as they step

Apache Arrow for Data Analysis across Disparate Data Sources

 14 hours

Apache Arrow is an open-source in-memory data processing framework. It is often used together with other data science tools for accessing disparate data stores for analysis. It integrates well with other technologies such as GPU databases, machine

Big Data & Database Systems Fundamentals

 14 hours

The course is part of the Data Scientist skill set (Domain: Data and Technology).

Data Vault: Building a Scalable Data Warehouse

 28 hours

Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its

Data Virtualization with Denodo Platform

 14 hours

Denodo is a data virtualization platform for managing big data, logical data warehouses, and enterprise data operations. This instructor-led, live training (online or onsite) is aimed at architects, developers, and administrators who wish to use

Dremio for Self-Service Data Analysis

 21 hours

Dremio is an open-source "self-service data platform" that accelerates the querying of different types of data sources. Dremio integrates with relational databases, Apache Hadoop, MongoDB, Amazon S3, ElasticSearch, and other data sources.

Apache Druid for Real-Time Data Analysis

 21 hours

Apache Druid is an open-source, column-oriented, distributed data store written in Java. It was designed to quickly ingest massive quantities of event data and execute low-latency OLAP queries on that data. Druid is commonly used in business

Apache Kylin: From Classic OLAP to Real-Time Data Warehouse

 14 hours

Apache Kylin is an extreme, distributed analytics engine for big data. In this instructor-led live training, participants will learn how to use Apache Kylin to set up a real-time data warehouse. By the end of this training, participants will

Zeppelin for Interactive Data Analytics

 14 hours

Apache Zeppelin is a web-based notebook for capturing, exploring, visualizing and sharing Hadoop and Spark based data. This instructor-led, live training introduces the concepts behind interactive data analytics and walks participants through the