SMACK Stack for Data Science Training Course
The SMACK stack is a collection of data platform tools: Apache Spark, Apache Mesos, Akka, Apache Cassandra, and Apache Kafka. With the SMACK stack, users can build and scale their data processing platforms.
This instructor-led training session (conducted either online or at your location) is designed for data scientists looking to utilize the SMACK stack in building big data processing solutions.
Upon completion of this course, participants will be able to:
- Create a data pipeline architecture for handling large-scale data processing.
- Construct cluster infrastructure using Apache Mesos and Docker.
- Analyze data through Spark and Scala.
- Handle unstructured data with Apache Cassandra.
Course Format
- Engaging lectures combined with discussions.
- A multitude of exercises and practical activities.
- Practical implementation in a live lab setting.
Customization Options for the Course
- To arrange for customized training, please contact us to discuss your requirements.
Course Outline
Introduction
SMACK Stack Overview
- What is Apache Spark? Apache Spark features
- What is Apache Mesos? Apache Mesos features
- What is Akka? Akka features
- What is Apache Cassandra? Apache Cassandra features
- What is Apache Kafka? Apache Kafka features
Scala Language
- Scala syntax and structure
- Scala control flow
Preparing the Development Environment
- Installing and configuring the SMACK stack
- Installing and configuring Docker
Akka
- Using actors
Apache Cassandra
- Creating a database for read operations
- Working with backups and recovery
Connectors
- Creating a stream
- Building an Akka application
- Storing data with Cassandra
- Reviewing connectors
Apache Kafka
- Working with clusters
- Creating, publishing, and consuming messages
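As a rough illustration of the publish and consume topic above, the following toy model mimics how a single-partition topic behaves as an append-only log in which each consumer group tracks its own read offset. It is a plain Python sketch, not the Kafka client API; the class name `TopicLog` and its methods are hypothetical and exist only for this example.

```python
class TopicLog:
    """Toy in-memory stand-in for one topic partition (not a Kafka client)."""

    def __init__(self):
        self._records = []    # append-only list; the list index acts as the offset
        self._next_offset = {}  # consumer group -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def consume(self, group, max_records=10):
        """Return the next batch for a group and advance that group's offset."""
        start = self._next_offset.get(group, 0)
        batch = self._records[start:start + max_records]
        self._next_offset[group] = start + len(batch)
        return batch


log = TopicLog()
for message in ("order-1", "order-2", "order-3"):
    log.publish(message)

print(log.consume("billing", max_records=2))  # first two records
print(log.consume("billing"))                 # the remaining record
print(log.consume("audit"))                   # a new group starts from offset 0
```

Messages stay in the log after being read, so a second consumer group can replay them independently of the first.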
Apache Mesos
- Allocating resources
- Running clusters
- Working with Apache Aurora and Docker
- Running services and jobs
- Deploying Spark, Cassandra, and Kafka on Mesos
Apache Spark
- Managing data flows
- Working with RDDs and dataframes
- Performing data analysis
Troubleshooting
- Handling failure of services and errors
Summary and Conclusion
Requirements
- An understanding of data processing systems
Audience
- Data Scientists
Testimonials (1)
very interactive...
Richard Langford
Course - SMACK Stack for Data Science
Related Courses
Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
21 Hours
This course targets developers and data scientists interested in integrating AI into their applications. It places particular emphasis on Data Analysis, Distributed AI, and Natural Language Processing.
Anaconda Ecosystem for Data Scientists
14 Hours
This instructor-led, live training in the UAE (online or onsite) is aimed at data scientists who wish to use the Anaconda ecosystem to capture, manage, and deploy packages and data analysis workflows in a single platform.
By the end of this training, participants will be able to:
- Install and configure Anaconda components and libraries.
- Understand the core concepts, features, and benefits of Anaconda.
- Manage packages, environments, and channels using Anaconda Navigator.
- Use Conda, R, and Python packages for data science and machine learning.
- Get to know some practical use cases and techniques for managing multiple data environments.
Big Data Business Intelligence for Telecom and Communication Service Providers
35 Hours
Overview
Communications service providers (CSPs) are under pressure to cut costs and boost average revenue per user (ARPU) while maintaining a top-notch customer experience, even as data volumes continue to rise. Global mobile data traffic is projected to grow at a compound annual growth rate of 78 percent through 2016, reaching 10.8 exabytes per month.
At the same time, CSPs are generating vast amounts of data, including call detail records (CDR), network information, and customer details. Companies that fully leverage this data gain a competitive advantage. A recent survey by The Economist Intelligence Unit found that companies using data-driven decision-making experience a 5-6% productivity boost. However, 53% of businesses use only half of their valuable data, and one-fourth of respondents noted that significant amounts of useful data are overlooked. The sheer volume of data makes manual analysis impractical, and most legacy software systems struggle to keep up, so valuable data is discarded or ignored.
With Big Data & Analytics’ high-speed, scalable big data software, CSPs can analyze all their data for improved decision-making in less time. Various Big Data products and techniques offer an end-to-end platform for collecting, preparing, analyzing, and presenting insights from big data. Application areas include network performance monitoring, fraud detection, customer churn prevention, and credit risk analysis. These Big Data & Analytics solutions can handle terabytes of data, but implementing such tools requires a new type of cloud-based database system like Hadoop or massive-scale parallel computing processors (KPU, etc.).
This course on Big Data BI for Telco covers all the emerging areas where CSPs are investing to enhance productivity and open up new revenue streams. It provides a comprehensive 360-degree overview of Big Data BI in the telecommunications sector so that decision-makers and managers can gain a broad understanding of the potential benefits of Big Data BI for productivity and revenue growth.
Course Objectives
The primary goal of this course is to introduce new Big Data business intelligence techniques across four sectors of Telecom Business (Marketing/Sales, Network Operations, Financial Operations, and Customer Relationship Management). Students will be introduced to the following:
- An introduction to Big Data, including the 4Vs (volume, velocity, variety, veracity) from a Telco perspective
- The differences between Big Data analytics and legacy data analytics
- Justifying the use of Big Data within a Telco context
- Familiarity with the Hadoop ecosystem and its tools, such as Hive, Pig, and Spark, and an understanding of when and how they are used to address Big Data challenges
- The process of extracting data for analysis using an integrated Hadoop dashboard approach to ease business analysts' pain points in collecting and analyzing data
- Basic insights into analytics, visualization, and predictive analytics specific to Telco
- Customer churn analytics and how Big Data can reduce customer churn and dissatisfaction – case studies included
- Analyzing network failures and service issues using Network metadata and IPDR
- Financial analysis for fraud detection, waste reduction, and ROI estimation from sales and operational data
- Solving customer acquisition challenges through targeted marketing, customer segmentation, and cross-selling based on sales data
- A summary of all Big Data analytic products and their roles in the Telco analytics landscape
- Steps to introduce Big Data Business Intelligence into your organization
Target Audience
- Network operations managers, financial managers, CRM managers, and senior IT managers within the Telco CIO office
- Telco business analysts
- CFO office managers/analysts
- Operational managers
- QA managers
Data Science Programme
245 Hours
The unprecedented surge in information and data has pushed our capacity for innovation to new heights. Today, the Data Scientist is among the most sought-after roles across industries.
Our approach goes beyond theoretical learning; we provide practical, industry-relevant skills that connect academic knowledge with real-world demands.
This 7-week curriculum can be customized according to your specific industry needs. For more details or to learn about our offerings, please contact us or visit the Nobleprog Institute website.
Audience:
This program is designed for postgraduate-level individuals as well as anyone with the necessary prerequisite skills, which will be assessed through an evaluation and interview process.
Delivery:
The course delivery combines Instructor-Led Classroom sessions and Instructor-Led Online sessions. Typically, the first week involves classroom-led instruction, weeks 2 to 6 are conducted in a virtual classroom setting, and the seventh week returns to classroom-led instruction.
Data Science for Big Data Analytics
35 Hours
Big data refers to extensive and intricate datasets that conventional data processing applications cannot effectively manage. The challenges associated with big data encompass capturing the data, storing it, analyzing it, searching through it, sharing it, transferring it, visualizing it, querying it, updating it, and ensuring information privacy.
Data Science essential for Marketing/Sales professionals
21 Hours
This course is designed for Marketing and Sales Professionals looking to delve deeper into the application of data science within Marketing and Sales domains. It offers comprehensive coverage of various data science techniques utilized for "upselling," "cross-selling," market segmentation, branding, and customer lifetime value (CLV).
Difference Between Marketing and Sales - How are marketing and sales different?
In simple terms, sales can be described as a process that focuses on individuals or small groups. Conversely, marketing targets larger audiences or the general public. Marketing encompasses research (identifying customer needs), product development (creating innovative products), and promotion (through advertisements) to raise consumer awareness about the product. Essentially, marketing involves generating leads or prospects. Once the product is launched in the market, it falls on the sales team to convince customers to make a purchase. Sales revolves around converting these leads into actual purchases and orders. Marketing aims for long-term goals, whereas sales focuses on short-term objectives.
Introduction to Graph Computing
28 Hours
In this instructor-led, live training in the UAE, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
- Understand how graph data is persisted and traversed.
- Select the best framework for a given task (from graph databases to batch processing frameworks).
- Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
- View real-world big data problems in terms of graphs, processes and traversals.
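To make the idea of modeling relationships as a graph and processing them in parallel traversals more concrete, here is a minimal sketch of a Pregel-style, vertex-centric computation: single-source shortest paths evaluated in synchronized supersteps. It is plain Python running on one machine, not the GraphX or Pregel API; the function name `pregel_shortest_paths` and the sample edge list are assumptions made for illustration.

```python
from collections import defaultdict


def pregel_shortest_paths(edges, source):
    """Single-source shortest paths using Pregel-style supersteps.

    edges: iterable of (src, dst, weight) tuples describing a directed graph.
    Returns (distances, number_of_supersteps).
    """
    # Persist the graph as an adjacency list: vertex -> [(neighbor, weight), ...]
    out_edges = defaultdict(list)
    vertices = {source}
    for src, dst, weight in edges:
        out_edges[src].append((dst, weight))
        vertices.update((src, dst))

    distance = {v: float("inf") for v in vertices}
    inbox = {source: [0]}  # superstep 0: only the source has a message
    supersteps = 0

    # Each loop iteration is one superstep; stop when no messages are in flight.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved, so it notifies its neighbors.
                distance[vertex] = best
                for neighbor, weight in out_edges[vertex]:
                    outbox[neighbor].append(best + weight)
        inbox = outbox
        supersteps += 1
    return distance, supersteps


# Hypothetical sample graph: b is cheaper to reach via c (1 + 2) than directly (4).
sample_edges = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
distances, steps = pregel_shortest_paths(sample_edges, "a")
print(distances, steps)
```

In a distributed framework, the inner loop over vertices is what gets spread across many machines, with messages exchanged between supersteps; this sketch only shows the control flow.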
Jupyter for Data Science Teams
7 Hours
This instructor-led, live training in the UAE (online or onsite) introduces the idea of collaborative development in data science and demonstrates how to use Jupyter to track and participate as a team in the "life cycle of a computational idea". It walks participants through the creation of a sample data science project based on top of the Jupyter ecosystem.
By the end of this training, participants will be able to:
- Install and configure Jupyter, including the creation and integration of a team repository on Git.
- Use Jupyter features such as extensions, interactive widgets, multiuser mode and more to enable project collaboration.
- Create, share and organize Jupyter Notebooks with team members.
- Choose from Scala, Python, or R to write and execute code against big data systems such as Apache Spark, all through the Jupyter interface.
Kaggle
14 Hours
This instructor-led, live training in the UAE (online or onsite) is aimed at data scientists and developers who wish to learn and build their careers in Data Science using Kaggle.
By the end of this training, participants will be able to:
- Learn about data science and machine learning.
- Explore data analytics.
- Learn about Kaggle and how it works.
MATLAB Fundamentals, Data Science & Report Generation
35 Hours
In the initial segment of this training program, we delve into the basics of MATLAB and its dual role as a programming language and an integrated platform. This section introduces participants to MATLAB syntax, arrays and matrices, data visualization techniques, script creation, and object-oriented programming concepts.
The second part focuses on utilizing MATLAB for tasks such as data mining, machine learning, and predictive analytics. To offer a clear and practical understanding of MATLAB's capabilities compared to other tools like spreadsheets, C, C++, and Visual Basic, we provide relevant comparisons throughout the session.
In the final segment, participants will learn how to enhance their workflow by automating data processing tasks and report generation using MATLAB.
Throughout the course, practical application of these concepts is emphasized through hands-on exercises in a lab setting. By the conclusion of the training, participants will have a comprehensive understanding of MATLAB's functionalities and be equipped to tackle real-world data science challenges as well as streamline their work processes through automation.
Progress assessments are integrated into the course to monitor learning outcomes.
Course Format
- The curriculum encompasses both theoretical and practical exercises, including case studies, code review, and hands-on implementation.
Note
- Practical sessions will be based on pre-arranged sample data report templates. For any specific needs, please reach out to us for arrangements.
Accelerating Python Pandas Workflows with Modin
14 Hours
This instructor-led, live training in the UAE (online or onsite) is aimed at data scientists and developers who wish to use Modin to build and implement parallel computations with Pandas for faster data analysis.
By the end of this training, participants will be able to:
- Set up the necessary environment to start developing Pandas workflows at scale with Modin.
- Understand the features, architecture, and advantages of Modin.
- Know the differences between Modin, Dask, and Ray.
- Perform Pandas operations faster with Modin.
- Implement the entire Pandas API and functions.
Python Programming for Finance
35 Hours
Python is a widely used programming language in the financial sector, adopted by major investment banks and hedge funds for developing various financial applications, from core trading systems to risk management tools.
In this instructor-led live training session, participants will learn how to leverage Python to create practical solutions for specific finance-related challenges.
By the end of this course, participants will be able to:
- Grasp the basics of the Python programming language
- Download, install, and maintain the optimal development tools for building financial applications in Python
- Select and apply appropriate Python packages and techniques to manage, visualize, and analyze financial data from diverse sources (CSV, Excel, databases, web, etc.)
- Create applications that address issues related to asset allocation, risk analysis, investment performance, and more
- Debug, integrate, deploy, and optimize a Python application
Audience
- Developers
- Analysts
- Quants
Course Format
- The course includes lectures, discussions, exercises, and extensive hands-on practice.
Note
- This training is designed to address key challenges faced by finance professionals. If you have a specific topic, tool, or technique that you would like to cover in more detail, please contact us to arrange for additional content.
GPU Data Science with NVIDIA RAPIDS
14 Hours
This instructor-led, live training in the UAE (online or onsite) is aimed at data scientists and developers who wish to use RAPIDS to build GPU-accelerated data pipelines, workflows, and visualizations, applying machine learning algorithms, such as XGBoost, cuML, etc.
By the end of this training, participants will be able to:
- Set up the necessary development environment to build data models with NVIDIA RAPIDS.
- Understand the features, components, and advantages of RAPIDS.
- Leverage GPUs to accelerate end-to-end data and analytics pipelines.
- Implement GPU-accelerated data preparation and ETL with cuDF and Apache Arrow.
- Learn how to perform machine learning tasks with XGBoost and cuML algorithms.
- Build data visualizations and execute graph analysis with cuXfilter and cuGraph.
Python and Spark for Big Data (PySpark)
21 Hours
In this instructor-led, live training in the UAE, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
- Learn how to use Spark with Python to analyze Big Data.
- Work on exercises that mimic real world cases.
- Use different tools and techniques for big data analysis using PySpark.
Apache Spark MLlib
35 Hours
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy to use. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, along with lower-level optimization primitives and higher-level pipeline APIs.
The library is divided into two main packages:
- spark.mllib includes the original API constructed using RDDs.
- spark.ml offers a more sophisticated API based on DataFrames, facilitating the creation of ML pipelines.
Audience
This course is tailored for engineers and developers looking to leverage an integrated Machine Learning Library within Apache Spark.