Big data overview

Big Data
■ Generic term for very large volumes of data (structured and unstructured), typically at petabyte scale.
– Facebook Data
– Google Data
– NASA Data
■ Not “Really” Big Data
– Most back-office data, e.g. employee records
– Banking transaction data
– However, the same concepts can still be applied.

Apache Hadoop
■ Big Data does not mean Hadoop.
■ Apache Hadoop is an open-source software framework for distributed storage
and processing of large datasets on commodity hardware, using the MapReduce
programming model.
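
The MapReduce programming model named above can be sketched in plain Python. This is a single-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example. In a real cluster the framework distributes each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data on commodity hardware"])
print(counts)
```

The point of the model is that `map_phase` and `reduce_phase` only ever see one record or one key at a time, which is what lets a framework run them in parallel on separate machines.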

Components of Hadoop Ecosystem
■ Hadoop Distributed File System (HDFS)
■ Distributed Batch Processing (MapReduce)
■ Resource Management (YARN)

Apache Spark
■ Spark and its RDDs (resilient distributed datasets) were developed in 2012 in response
to limitations in the MapReduce cluster computing paradigm, which forces a particular
linear dataflow structure on distributed programs: MapReduce programs read input data
from disk, map a function across the data, reduce the results of the map, and store the
reduction results on disk.
■ Spark's RDDs function as a working set for distributed programs, offering a
(deliberately) restricted form of distributed shared memory.

Spark & Hadoop
■ Spark and Hadoop do not necessarily compete with one another; rather, they complement
each other.

Apache Hadoop                                    | Apache Spark
MapReduce & YARN based system                    | Spark & RDD
Mostly for batch processing                      | Can also be used for stream processing
Optimized for cheap hardware                     | RAM-heavy operations
API is not easy (need to be a rockstar Java dev) | Easy API (Python, R, Scala, Java)

Elasticsearch
■ NoSQL full-text search database
■ Not ACID compliant, unlike Oracle or MySQL
■ Designed for distributed rather than centralized computation
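
Full-text search engines of this kind are built around an inverted index, a mapping from each term to the documents that contain it. The sketch below is a toy illustration of that idea in plain Python and is not Elasticsearch's API; `build_index`, `search` and the sample documents are assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Spark keeps data in memory",
    2: "Hadoop stores data on disk",
    3: "Kafka streams records",
}
index = build_index(docs)
hits = search(index, "data memory")
print(hits)
```

A lookup touches only the entries for the query terms rather than scanning every document, which is why this structure suits search-heavy workloads.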

SQL vs NoSQL ( broadly )
NoSQL Databases                      | Relational Databases
Designed for performance             | Designed for integrity
No relational schema                 | Predefined schema
Various storage models               | Stored as individual records
Better for a lot of reads and writes | Better for a lot of search and query
Auto-sharding                        | Sharding is not implicitly supported
Limited query abilities              | Well-defined, advanced and standardized query language
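
The "predefined schema" versus "no relational schema" contrast can be illustrated with Python's built-in `sqlite3` module on the relational side and a plain dictionary standing in for a document store. This is a minimal sketch: the `employee` table, its columns and the sample records are assumptions made for this example, and a dictionary is only a stand-in for a real NoSQL database.

```python
import sqlite3

# Relational side: the schema is declared up front and enforced on every write.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept TEXT NOT NULL)"
)
conn.execute("INSERT INTO employee (id, name, dept) VALUES (1, 'Asha', 'Finance')")

try:
    # This row omits the required dept column, so the database rejects it.
    conn.execute("INSERT INTO employee (id, name) VALUES (2, 'Ben')")
    schema_enforced = False
except sqlite3.IntegrityError:
    schema_enforced = True

# A standardized query language is available for search and filtering.
rows = conn.execute("SELECT name FROM employee WHERE dept = 'Finance'").fetchall()

# Document-style side: each record carries its own shape, nothing is enforced.
documents = {}
documents[1] = {"name": "Asha", "dept": "Finance"}
documents[2] = {"name": "Ben", "skills": ["spark", "kafka"]}  # no dept field

print(schema_enforced, rows, sorted(documents))
```

The relational table trades flexibility for integrity, while the document store accepts any shape and leaves validation to the application.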

Apache Kafka
• Publish and subscribe to streams of records.
• Store streams of records in a fault-tolerant way.
• Process streams of records as they occur.
• E.g. LinkedIn notifications
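
The publish/subscribe behaviour listed above can be pictured as an append-only log in which every consumer tracks its own read position. The sketch below is a toy in-memory model and not the Kafka client API; the `TopicLog` class, its methods and the consumer names are hypothetical, and it leaves out the replication that provides fault tolerance in a real cluster.

```python
class TopicLog:
    """Hypothetical in-memory stand-in for a single stream of records."""

    def __init__(self):
        self._records = []   # append-only list of published records
        self._offsets = {}   # next unread position for each consumer

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer):
        """Return all records this consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


log = TopicLog()
log.publish("alice viewed your profile")
log.publish("bob sent a connection request")

first = log.poll("email-service")    # both records published so far
log.publish("carol endorsed you")
second = log.poll("email-service")   # only the record published since the last poll
late = log.poll("mobile-push")       # a new consumer starts from the beginning

print(first, second, late)
```

Because records stay in the log after being read, a consumer that subscribes later can still process the full history, which is what separates this model from a queue that deletes messages on delivery.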

Data Engineering
Acquiring
• Data Gathering
• Sampling
• Data Ingestion
Data Preparation
• Data Wrangling
• Data Cleaning
• Transformation
Data Analysis
• Statistical Analysis
• Data Modeling
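
The three stages above can be sketched end to end using only the Python standard library. This is an illustrative toy under stated assumptions: the `RAW_CSV` sample, its column names and the `acquire`, `prepare` and `analyze` functions are invented for this example and stand in for the much larger tools used in practice.

```python
import csv
import io
import statistics

# Assumed sample input with a missing value, a stray space and a bad value.
RAW_CSV = """employee,salary
asha, 52000
ben,
carol,61000
dan,not available
"""


def acquire(text):
    """Acquiring: ingest raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows):
    """Data preparation: drop rows without a numeric salary, convert the rest."""
    cleaned = []
    for row in rows:
        value = (row["salary"] or "").strip()
        if value.isdigit():
            cleaned.append({"employee": row["employee"].strip(), "salary": int(value)})
    return cleaned


def analyze(rows):
    """Data analysis: compute simple summary statistics."""
    salaries = [row["salary"] for row in rows]
    return {"count": len(salaries), "mean": statistics.mean(salaries)}


summary = analyze(prepare(acquire(RAW_CSV)))
print(summary)
```

Keeping each stage as a separate function mirrors the slide: any one stage can be swapped for a different tool without touching the others.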

Data Engineering – Legacy
Acquiring
• SQL insert statements
• CSV file dumping
Data Preparation
• ETL tools
• IBM DataStage
• Talend
Analytics
• Excel based analytics
• SAS based analytics
• IBM Watson

Data Engineering – Big Data Stack
Data Acquisition
• Data Ingestion
• Sqoop – Batch Ingestion
• Kafka – Stream Ingestion
Data Prep
• Trifacta Data Wrangling
• Apache Spark
Data Analytics
• DataRobot
• SAS + R based Analytics
• IBM Watson
