Big Data (TDAI006)

Learn how big data is driving organizational change, and get to know the essential analytical tools and techniques. Understand what big data is, how it will impact your business, and which tools and systems big data scientists and engineers use.


Chapter 1. Defining Big Data


  • In-Class Discussion
  • Gartner's Definition of Big Data
  • More Definitions of Big Data
  • Transforming Data into Business Information
  • Challenges Posed by Big Data
  • Processing Big Data
  • Apache Hadoop
  • The Cloud and Big Data
  • The CAP Theorem
  • Summary


Chapter 2. Hadoop Overview


  • The Client-Server Processing Pattern
  • Apache Hadoop
  • Apache Hadoop Logo
  • Typical Hadoop Applications
  • Hadoop Clusters
  • Hadoop Distributions
  • Hadoop's Main Components
  • HDFS
  • HDFS Blocks
  • YARN
  • Hadoop-based Systems for Data Analysis
  • MapReduce
  • Similarity with SQL Aggregation Operations
  • Distributed Computing Economics
  • Discussion: Divide and Conquer
  • Apache Pig
  • Pig Latin
  • Running Pig
  • Pig Latin Script Example
  • What is Hive?
  • Hive's Value Proposition
  • Who uses Hive?
  • What Hive Does Not Have
  • HiveQL
  • Working with Hive Tables
  • What is HBase?
  • HBase vs RDBMS
  • Interfacing with HBase
  • HBase Table Design Digest
  • A Cell's Value Versioning
  • Creating and Populating a Table in HBase Shell
  • Getting a Cell's Value
  • Counting Rows in an HBase Table
  • Summary


Chapter 3. Big Data Analytics in the Cloud


  • Data is King
  • Big Data Stores in the Cloud 
  • Example: AWS Simple Storage Service (S3) 
  • MapReduce (and Hadoop) in the Cloud 
  • Information and Data Security
  • Data-at-rest Security Examples
  • Example of Object Encryption in S3
  • One S3 Use Case: Backup and Archiving
  • Data Analytics Services in the Cloud
  • Analytics Services with AWS
  • AWS EMR: Software Configuration Screen
  • AWS EMR: Hardware Configuration Screen
  • Big Data Analytics Solutions from Google Cloud
  • Google Data Processing and Analytics Pipelines
  • Google BigQuery
  • Machine Learning
  • Microsoft Azure ML Studio
  • Machine Learning Pipeline
  • Summary


Chapter 4. Techniques for Making Big Data Small


  • What is Data Science?
  • Data Science, Machine Learning, AI?
  • Making Big Data Small
  • Descriptive Statistics
  • Correlation
  • Reducing the Number of Data Attributes
  • Lasso Regularization
  • Sampling Examples
  • Data Compression
  • Summary


Chapter 5. Introduction to Apache Spark


  • What is Apache Spark?
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Shell
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs MapReduce
  • The Resilient Distributed Dataset (RDD)
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • Example: Using Random Forests with Spark MLlib
  • The Output (the “Confusion” matrix)
  • Dumping the Trained Model
  • Clustering
  • Finding Centroids Example
  • Using the kMeans Module with Spark MLlib
  • Printing the Centroids
  • GraphX