This document provides an overview of big data concepts and related technologies. It discusses what big data is and how Apache Hadoop uses MapReduce for distributed storage and processing of large datasets. Key components of the Hadoop ecosystem are described, including HDFS for storage and YARN for resource management. Apache Spark is presented as an alternative to Hadoop's MapReduce, notable for its in-memory computing and support for stream processing, and as a technology that can complement Hadoop. Elasticsearch is introduced as a NoSQL database for full-text search. Apache Kafka is summarized as a system for publishing and processing streams of records. The data engineering processes of acquiring, preparing, and analyzing data are outlined for both legacy and big data systems.
2. Big Data
■ Generic term for very large amounts of data (structured + unstructured), on the order of petabytes.
– Facebook Data
– Google Data
– NASA Data
■ Not “really” Big Data
– Most back-office data, e.g. employee records
– Banking transaction data
– But the same concepts can be applied.
3. Apache Hadoop
■ Big Data does not mean Hadoop.
■ Apache Hadoop is an open-source software framework for distributed storage
and processing of very large datasets on commodity hardware, using the MapReduce
programming model.
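To make the MapReduce programming model more concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are assumptions chosen for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "hadoop stores big data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps could in principle be spread across many nodes, which is the property the framework relies on.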
4. Components of Hadoop Ecosystem
■ Hadoop Distributed File System (HDFS)
■ Distributed Batch Processing (MapReduce)
■ Resource Management (YARN)
6. Apache Spark
■ Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce
cluster computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a function
across the data, reduce the results of the map, and store reduction results on disk.
Spark's RDDs function as a working set for distributed programs that offers a
(deliberately) restricted form of distributed shared memory.
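The idea of an in-memory working set can be sketched with a toy class. `ToyRDD` below is a hypothetical stand-in written for this example, not Spark's implementation; it only borrows method names (`map`, `filter`, `cache`, `reduce`, `collect`) to show lazy transformations and reuse of a cached intermediate result instead of re-reading the source.

```python
import functools


class ToyRDD:
    """Hypothetical toy dataset: lazy transformations plus optional caching."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable producing records
        self._use_cache = False
        self._cached = None        # in-memory working set, filled on first use

    def map(self, fn):
        # Transformation: nothing is computed until an action is called.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        # Transformation: also lazy.
        return ToyRDD(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        # Ask for the computed records to be kept in memory after first use.
        self._use_cache = True
        return self

    def collect(self):
        # Action: materialize the records, reusing the cache when available.
        if self._use_cache and self._cached is not None:
            return self._cached
        result = list(self._compute())
        if self._use_cache:
            self._cached = result
        return result

    def reduce(self, fn):
        # Action: fold all records into a single value.
        return functools.reduce(fn, self.collect())


reads = {"count": 0}


def load():
    """Stand-in for an expensive read from disk; counts how often it runs."""
    reads["count"] += 1
    return [1, 2, 3, 4, 5, 6]


evens = ToyRDD(load).filter(lambda x: x % 2 == 0).cache()
total = evens.reduce(lambda a, b: a + b)          # first action reads the source
squares = evens.map(lambda x: x * x).collect()    # second action reuses the cache
print(total, squares, reads["count"])
```

In this sketch the source is read once even though two results are computed from it, whereas a chain of separate MapReduce jobs would write and re-read intermediate data on disk between steps.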
7. Spark & Hadoop
■ Spark and Hadoop do not necessarily compete with one another; rather, they
complement each other.
Apache Hadoop                                      | Apache Spark
MapReduce & YARN based system                      | Spark & RDD
Mostly for batch processing                        | Can be used for stream processing
Optimized for cheap hardware                       | RAM-heavy operations
API is not easy (need to be a rockstar Java dev)   | Easy API (Python, R, Scala, Java)
8. Elasticsearch
■ NoSQL full-text search database
■ Not ACID compliant, unlike Oracle or MySQL
■ Designed for distributed computing rather than centralized computation
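Full-text search engines such as Elasticsearch are built around an inverted index, which maps each term to the documents that contain it. The sketch below is a toy in-memory version in plain Python, not the Elasticsearch API; the names `tokenize`, `build_index`, and `search`, and the sample documents, are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop stores data in HDFS",
    2: "Spark processes data in memory",
    3: "Kafka streams records",
}
index = build_index(docs)
hits = search(index, "data in")
print(hits)
```

A query only has to look up its terms in the index rather than scan every document, which is why this structure suits search-heavy workloads. A distributed engine would additionally split the index into shards across nodes.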
9. SQL vs NoSQL (broadly)
NoSQL Databases                        | Relational Databases
Designed for performance               | Designed for integrity
No relational schema                   | Predefined schema
Various storage models                 | Stored as individual records
Better for a lot of reads and writes   | Better for a lot of search and query
Auto-sharding                          | Sharding is not implicitly supported
Limited query abilities                | Well-defined, advanced and standardized query language
11. Apache Kafka
• Publish and subscribe to streams of records.
• Store streams of records in a fault-tolerant way.
• Process streams of records as they occur.
• E.g. LinkedIn notifications
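The publish, store, and process capabilities above can be illustrated with a toy append-only log in plain Python. This is not the Kafka client API: `ToyLog`, `publish`, `poll`, and the consumer names are hypothetical, and the sketch keeps everything in one process with no replication, so it shows the offset-based reading model but not fault tolerance.

```python
class ToyLog:
    """Hypothetical in-memory topic: an append-only list of records."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return the offset at which it was stored."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for event in ["alice viewed your profile", "bob sent a message"]:
    log.publish(event)

email_batch = log.poll("email-service")                  # reads both records
push_batch_1 = log.poll("push-service", max_records=1)   # reads only the first
log.publish("carol endorsed you")
email_batch_2 = log.poll("email-service")                # only the new record
push_batch_2 = log.poll("push-service")                  # resumes where it stopped
print(email_batch_2, push_batch_2)
```

Records stay in the log after being read, and each subscriber tracks its own position, so several services can process the same stream independently and at their own pace.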
12. Data Engineering
Acquiring
• Data Gathering
• Sampling
• Data Ingestion
Data Preparation
• Data Wrangling
• Data Cleaning
• Transformation
Data Analysis
• Statistical Analysis
• Data Modeling
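The three stages above can be sketched end to end on a tiny dataset. The example below is only an illustration using the Python standard library; the sample CSV, the cleaning rules, and the function names `acquire`, `prepare`, and `analyze` are assumptions, not part of any particular tool.

```python
import csv
import io
import statistics

# Hypothetical raw input with a missing value, stray whitespace, and a bad value.
RAW_CSV = """name,age
Ann,34
Bob,
 carl ,41
Dee,abc
"""


def acquire(text):
    """Acquiring: ingest raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows):
    """Preparation: drop rows with unusable ages and normalize the fields."""
    cleaned = []
    for row in rows:
        age = (row["age"] or "").strip()
        if not age.isdigit():
            continue  # data cleaning: skip missing or non-numeric ages
        cleaned.append({"name": row["name"].strip().title(), "age": int(age)})
    return cleaned


def analyze(rows):
    """Analysis: compute simple summary statistics over the cleaned rows."""
    ages = [row["age"] for row in rows]
    return {"count": len(ages), "mean_age": statistics.mean(ages)}


summary = analyze(prepare(acquire(RAW_CSV)))
print(summary)
```

Keeping each stage as a separate function mirrors the outline: the ingestion source or the cleaning rules can change without touching the analysis step.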
13. Data Engineering – Legacy
Acquiring
• SQL insert statements
• CSV file dumping
Data Preparation
• ETL tools
• IBM DataStage
• Talend
Analytics
• Excel-based analytics
• SAS-based analytics
• IBM Watson
14. Data Engineering – Big Data Stack
Data Acquisition
• Data Ingestion
• Sqoop – Batch Ingestion
• Kafka – Stream Ingestion
Data Prep
• Trifacta Data Wrangling
• Apache Spark
Data Analytics
• DataRobot
• SAS + R based Analytics
• IBM Watson
Editor's Notes
Components of Hadoop Architecture
Core components :
The Distributed Storage Framework – Hadoop Distributed File System
Distributed Processing Framework - MapReduce and YARN
Other supporting frameworks:
Integration Frameworks – Sqoop, Flume
Management Frameworks – Ambari, ZooKeeper, Oozie
Development Frameworks – Pig, Hive, HBase, HCatalog
Business Intelligence and Reporting Frameworks – Third-party tools and applications such as SAS, MicroStrategy, Splunk, the Microsoft BI stack, etc.
Sources:
Intel Distribution for Apache Hadoop: https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/09/24/securing-big-data-for-the-enterprise-project-rhino-and-the-intel-distribution-for-apache-hadoop-idh
Big Data Open Source Technology Stack: http://datakulfi.wordpress.com/2013/03/27/big-data-open-source-technology-landscape/
Apache Hadoop Official: http://hadoop.apache.org/
Where this all fits: http://blog.syncsort.com/wp-content/uploads/2013/11/strata_diagram.png