Course Outline

Each session lasts 2 hours

Day-1: Session-1: Business Overview: Why Big Data Business Intelligence in Government

  • Case Studies from NIH, DoE
  • Big Data adoption rates in Government Agencies and how they are aligning future operations around Big Data Predictive Analytics
  • Wide-scale Application Areas in DoD, NSA, IRS, USDA, etc.
  • Integrating Big Data with Legacy data
  • Foundational understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business rule generation for fraud detection
  • Threat detection and profiling
  • Cost-benefit analysis for Big Data implementation

Day-1: Session-2: Introduction to Big Data-1

  • Core characteristics of Big Data: volume, variety, velocity, and veracity. MPP architecture for volume handling.
  • Data Warehouses – static schema, slowly evolving datasets
  • MPP Databases such as Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
  • Hadoop Based Solutions – no constraints on dataset structure.
  • Typical pattern: load data into HDFS, crunch it with MapReduce, retrieve results from HDFS
  • Batch processing – suited for analytical/non-interactive tasks
  • Velocity handling: streaming data and CEP (Complex Event Processing)
  • Typical choices – CEP products (e.g., Infostreams, Apama, MarkLogic, etc.)
  • Less production-ready options – Storm/S4
  • NoSQL databases (columnar and key-value) – best suited as analytical adjuncts to data warehouses/databases

Day-1: Session-3: Introduction to Big Data-2

NoSQL solutions

  • KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MongoDB, DovetailDB
  • KV Store (Hierarchical) - GT.M, Caché
  • KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store - Gigaspaces, Coord, Apache River
  • Object Database - ZopeDB, db4o, Shoal
  • Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML databases, ThruDB, CloudKit, Persevere, Riak (Basho), Scalaris
  • Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning issues in Big Data

  • RDBMS – static structure/schema, does not support agile, exploratory environments.
  • NoSQL – semi-structured, providing enough structure to store data without requiring an exact schema beforehand.
  • Data cleaning issues
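
As a minimal illustration of the data cleaning issues listed above, the sketch below normalizes a few semi-structured records with pandas; the agency names, fields, and values are invented for the example and are not part of the course material.

    import pandas as pd

    # Semi-structured records often arrive with missing fields and inconsistent formats.
    records = [
        {"agency": "USDA ", "claim_amount": "1,200.50", "filed": "2014-03-01"},
        {"agency": "usda",  "claim_amount": None,       "filed": "2014-03-05"},
        {"agency": "IRS",   "claim_amount": "980",      "filed": "2014-03-07"},
    ]
    df = pd.DataFrame(records)

    # Normalize categorical text (trim whitespace, unify case) so "USDA " and "usda" match.
    df["agency"] = df["agency"].str.strip().str.upper()

    # Coerce numeric strings (strip thousands separators); unparseable values become NaN.
    df["claim_amount"] = pd.to_numeric(
        df["claim_amount"].str.replace(",", "", regex=False), errors="coerce"
    )

    # Parse date strings into real datetimes for range queries and aggregation.
    df["filed"] = pd.to_datetime(df["filed"])

    print(df.dtypes)
    print(df)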

Day-1: Session-4: Introduction to Big Data-3: Hadoop

  • When to select Hadoop?
  • STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration)
  • SEMI-STRUCTURED data – challenging to handle with traditional solutions (DW/DB)
  • Warehousing data takes a huge effort, and the result remains static even after implementation
  • For variety & volume of data, processed on commodity hardware – HADOOP
  • Commodity H/W required to create a Hadoop Cluster

Introduction to MapReduce/HDFS

  • MapReduce – distributes computation across multiple servers
  • HDFS – makes data locally available to the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Interpreting the data is the developer's responsibility
  • Programming MapReduce means working in Java (with its pros and cons) and manually loading data into HDFS – a sketch of the pattern follows
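
The bullet above refers to the native Java API; as a minimal sketch, the same map/reduce pattern can be expressed in Python via Hadoop Streaming, which pipes HDFS data through stdin/stdout. The file names and word-count task below are illustrative assumptions, not part of the course material.

    # --- mapper.py --- emits (word, 1) pairs; Hadoop Streaming feeds input splits to stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # --- reducer.py --- Hadoop Streaming delivers mapper output sorted by key,
    # so all counts for one word arrive as a contiguous run.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

In practice the input is first loaded by hand with hdfs dfs -put, and the job is launched with the hadoop-streaming jar, passing the two scripts as the -mapper and -reducer arguments; exact jar paths vary by distribution.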

Day-2: Session-1: Big Data Ecosystem - Building Big Data ETL: Universe of Big Data Tools - Which one to use and when?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) built on top of Hadoop
  • Random access to data but with restrictions (max 1 PB)
  • Not ideal for ad-hoc analytics; good for logging, counting, and time-series (see the sketch after this list)
  • Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g., log data) into HDFS
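
A minimal sketch of the keyed, random-access pattern HBase serves well, using the Python happybase client against an HBase Thrift server; the host, table, and column names are illustrative assumptions.

    import happybase

    connection = happybase.Connection("hbase-thrift-host", port=9090)
    table = connection.table("web_logs")

    # Write one row: the row key encodes entity + timestamp, a common time-series pattern.
    table.put(b"server01|2014-06-01T12:00:00", {b"metrics:requests": b"1532"})

    # Random (keyed) read back of that single row.
    row = table.row(b"server01|2014-06-01T12:00:00")
    print(row[b"metrics:requests"])

    # Range scan over a key prefix - the access pattern HBase is good at
    # (logging, counting, time series), as opposed to ad-hoc analytics.
    for key, data in table.scan(row_prefix=b"server01|"):
        print(key, data)

    connection.close()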

Day-2: Session-2: Big Data Management System

  • Managing moving parts and compute node startup/failure: ZooKeeper – configuration/coordination/naming services (a minimal sketch follows this list)
  • Complex pipelines/workflows: Oozie – manages workflows, dependencies, and daisy-chaining of jobs
  • Deploy, configure, cluster management, upgrade, etc. (sys admin): Ambari
  • In Cloud: Whirr
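
As a minimal sketch of the coordination role ZooKeeper plays, the snippet below publishes a shared configuration value and registers an ephemeral worker node using the Python kazoo client; the znode paths and values are invented for the example.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Publish one piece of pipeline configuration that every compute node can read.
    zk.ensure_path("/etl/config")
    if zk.exists("/etl/config/batch_size"):
        zk.set("/etl/config/batch_size", b"500")
    else:
        zk.create("/etl/config/batch_size", b"500")

    # Ephemeral, sequential node: it disappears automatically if this worker dies,
    # which is how peers notice compute node start/failure.
    zk.ensure_path("/etl/workers")
    zk.create("/etl/workers/worker-", b"", ephemeral=True, sequence=True)

    value, _stat = zk.get("/etl/config/batch_size")
    print(value.decode())

    zk.stop()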

Day-2: Session-3: Predictive Analytics in Business Intelligence-1: Fundamental Techniques & Machine Learning-based BI

  • Introduction to Machine learning
  • Learning classification techniques
  • Bayesian Prediction – preparing training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Network
  • The Big Data large-variable problem – Random Forest (RF); see the sketch after this list
  • The Big Data automation problem – multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytics tool – Treeminer
  • Agile learning
  • Agent-based learning
  • Distributed learning
  • Introduction to Open source Tools for predictive analytics: R, RapidMiner, Mahout
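
As a minimal sketch of the Random Forest approach flagged above for the large-variable problem, the snippet below trains scikit-learn's RandomForestClassifier on synthetic data with many mostly uninformative features; the sizes and parameters are illustrative only.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # 500 features, only 20 of which carry signal - a high-dimensional setting
    # where RF's built-in feature subsampling helps.
    X, y = make_classification(n_samples=5000, n_features=500, n_informative=20,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    model.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Variable importance highlights which of the 500 inputs actually matter.
    top = sorted(enumerate(model.feature_importances_), key=lambda p: p[1], reverse=True)[:5]
    print("top features:", top)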

Day-2: Session-4: Predictive Analytics Ecosystem-2: Common Predictive Analytics Problems in Government

  • Insight analytics
  • Visualization analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation engines
  • Pattern detection
  • Rule/scenario discovery – failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text analytics
  • Technology-assisted review
  • Fraud analytics
  • Real-time analytics

Day-3: Session-1: Real-Time and Scalable Analytics over Hadoop

  • Why common analytics algorithms fail on Hadoop/HDFS
  • Apache Hama – Bulk Synchronous Parallel (BSP) distributed computing
  • Apache Spark – in-memory cluster computing for real-time analytics (a minimal sketch follows this list)
  • CMU GraphLab – graph-based asynchronous approach to distributed computing
  • KNN p-Tree Algebra-based approach from Treeminer for reduced hardware operating cost
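
A minimal PySpark sketch of the kind of scalable aggregation Spark runs over HDFS-resident data; the HDFS path, column names, and application name are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("threat-event-counts").getOrCreate()

    # Read semi-structured JSON events straight from HDFS.
    events = spark.read.json("hdfs:///data/security/events/*.json")

    # Count events per source IP per hour - a typical scalable analytics step.
    counts = (events
              .withColumn("hour", F.date_trunc("hour", F.to_timestamp("event_time")))
              .groupBy("source_ip", "hour")
              .count()
              .orderBy(F.desc("count")))

    counts.show(20)
    spark.stop()

Such a script would typically be launched on the cluster with spark-submit rather than run locally.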

Day-3: Session-2: Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
  • Predictive coding and technology-assisted review (TAR)
  • Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
  • Faster indexing through HDFS – velocity of data
  • NLP (Natural Language Processing) – various techniques and open source products; a minimal sketch follows this list
  • eDiscovery in foreign languages – technology for foreign language processing
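
As a minimal sketch of basic NLP steps used in eDiscovery pipelines (tokenization, part-of-speech tagging, named-entity chunking), the snippet below uses the open source NLTK toolkit on an invented sentence; real pipelines apply the same steps at document-collection scale.

    import nltk

    # One-time downloads of the small models NLTK needs for these steps.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("maxent_ne_chunker")
    nltk.download("words")

    text = "The contractor invoiced USDA twice for the same shipment in March."

    tokens = nltk.word_tokenize(text)          # split into words
    tagged = nltk.pos_tag(tokens)              # label each token with a part of speech
    entities = nltk.ne_chunk(tagged)           # group tokens into named entities

    print(tagged)
    print(entities)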

Day-3: Session-3: Big Data BI for Cyber Security – understanding the full 360-degree view, from rapid data collection to threat identification

  • Understanding basics of security analytics – attack surface, security misconfiguration, host defenses
  • Network infrastructure / large data pipe / response ETL for real-time analytics
  • Prescriptive vs. predictive – fixed rule-based vs. auto-discovery of threat rules from metadata

Day-3: Session-4: Big Data in USDA: Applications in Agriculture

  • Introduction to IoT (Internet of Things) for agriculture – sensor-based Big Data and control
  • Introduction to Satellite imaging and its application in agriculture
  • Integrating sensor and image data for soil fertility, cultivation recommendation, and forecasting
  • Agriculture insurance and Big Data
  • Crop Loss forecasting

Day-4: Session-1: Fraud Prevention BI from Big Data in Government – Fraud Analytics:

  • Basic classification of Fraud analytics – rule-based vs predictive analytics
  • Supervised vs. unsupervised machine learning for fraud pattern detection (an unsupervised sketch follows this list)
  • Vendor fraud/overcharging for projects
  • Medicare and Medicaid fraud – fraud detection techniques for claim processing
  • Travel reimbursement frauds
  • IRS refund frauds
  • Case studies and live demos will be provided wherever data is available.
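
A minimal sketch of the unsupervised side: an Isolation Forest from scikit-learn flags claims whose feature pattern is unusual, without needing labeled fraud examples. The features, values, and contamination rate are invented for the example.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)

    # Synthetic claim features: [amount, days_to_payment, prior_claims_this_year]
    normal_claims = rng.normal(loc=[900, 30, 2], scale=[200, 5, 1], size=(1000, 3))
    odd_claims = np.array([[9500, 2, 14], [8200, 1, 11]])      # inflated, rushed, repeated
    claims = np.vstack([normal_claims, odd_claims])

    model = IsolationForest(contamination=0.01, random_state=0)
    labels = model.fit_predict(claims)          # -1 marks anomalies, 1 marks normal

    flagged = np.where(labels == -1)[0]
    print("claims flagged for review:", flagged)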

Day-4: Session-2: Social Media Analytics – Intelligence Gathering and Analysis

  • Big Data ETL API for extracting social media data
  • Text, image, metadata, and video
  • Sentiment analysis from social media feeds
  • Contextual and non-contextual filtering of social media feeds
  • Social Media Dashboard to integrate diverse social media sources
  • Automated profiling of social media accounts
  • Live demo of each analytic will be given through the Treeminer Tool.

Day-4: Session-3: Big Data Analytics in Image Processing and Video Feeds

  • Image Storage techniques in Big Data – Storage solutions for data exceeding petabytes
  • LTFS and LTO
  • GPFS-LTFS (Layered storage solution for Big image data)
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation (see the sketch after this list)
  • Motion tracking
  • 3-D image reconstruction
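
As a minimal sketch of one image-analytics fundamental, the snippet below segments an image into foreground objects with OpenCV using Otsu thresholding and connected components; the input file name is an illustrative assumption.

    import cv2

    image = cv2.imread("field_photo.jpg")                    # BGR image from disk
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)           # single-channel intensity
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)              # suppress sensor noise

    # Otsu's method picks a global threshold separating foreground from background.
    _, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Connected components give one label per segmented object.
    num_objects, labels = cv2.connectedComponents(mask)
    print("objects found:", num_objects - 1)                 # label 0 is the background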

Day-4: Session-4: Big Data applications in NIH:

  • Emerging areas of Bioinformatics
  • Meta-genomics and Big Data mining issues
  • Big Data Predictive analytic for Pharmacogenomics, Metabolomics, and Proteomics
  • Big Data in downstream Genomics processes
  • Application of Big data predictive analytics in Public health

Big Data Dashboard for quick accessibility and display of diverse data:

  • Integration of existing application platform with Big Data Dashboard
  • Big Data management
  • Case Study of Big Data Dashboard: Tableau and Pentaho
  • Using Big Data apps to push location-based services in Government
  • Tracking system and management

Day-5: Session-1: How to justify a Big Data BI implementation within an organization:

  • Defining ROI for Big Data implementation
  • Case studies of analyst time saved in data collection and preparation – productivity gains
  • Case studies of savings from avoided licensed-database costs
  • Revenue gain from location-based services
  • Savings from fraud prevention
  • An integrated spreadsheet approach to estimating expenses vs. revenue gains/savings from a Big Data implementation (a worked sketch follows)
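
A minimal sketch of that spreadsheet-style calculation expressed in Python; every figure is a placeholder to be replaced with an agency's own estimates.

    annual_costs = {
        "hardware_and_storage": 250_000,
        "software_and_support": 150_000,
        "staff_and_training": 300_000,
    }
    annual_gains = {
        "analyst_time_saved": 280_000,
        "licensed_database_savings": 120_000,
        "fraud_prevented": 400_000,
        "location_based_services_revenue": 90_000,
    }

    total_cost = sum(annual_costs.values())
    total_gain = sum(annual_gains.values())
    roi = (total_gain - total_cost) / total_cost    # net gain relative to spend

    print(f"annual cost: ${total_cost:,}")
    print(f"annual gain: ${total_gain:,}")
    print(f"ROI: {roi:.0%}")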

Day-5: Session-2: Step-by-step procedure for migrating a legacy data system to a Big Data system:

  • Understanding practical Big Data Migration Roadmap
  • What important information is needed before architecting a Big Data implementation
  • Different ways of calculating volume, velocity, variety, and veracity of data
  • How to estimate data growth
  • Case studies

Day-5: Session-4: Review of Big Data vendors and their products; Q&A session:

  • Accenture
  • APTEAN (Formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • MU Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Requirements

  • Basic knowledge of business operations and data systems within the participant's government domain
  • Basic understanding of SQL/Oracle or relational databases
  • Basic understanding of Statistics (at the spreadsheet level)

Duration: 35 hours
