Big Data – Overview
beCloudReady
Big Data
■ Generic term for large amounts of data (structured + unstructured), at petabyte scale.
– Facebook Data
– Google Data
– NASA Data
■ Not “Really” Big Data
– Most back-office data, e.g. employee records
– Banking transaction data
– But the same concepts can be applied.
Apache Hadoop
■ Big Data does not mean Hadoop.
■ Apache Hadoop is an open-source software framework for distributed storage and
processing of big data datasets using the MapReduce programming model on
commodity hardware.
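The MapReduce model described above can be sketched in plain Python (a toy word count, not Hadoop itself): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In real Hadoop the map and reduce tasks run on different machines and the intermediate results go through HDFS, but the dataflow is the same.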
Components of Hadoop Ecosystem
■ Hadoop Distributed File System (HDFS)
■ Distributed Batch Processing (MapReduce)
■ Resource Management (YARN)
Apache Spark
■ Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce
cluster computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a function
across the data, reduce the results of the map, and store the reduction results on disk.
Spark's RDDs function as a working set for distributed programs that offers a
(deliberately) restricted form of distributed shared memory.
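The key RDD idea can be sketched in plain Python (this is a toy, not actual Spark): transformations like `map` and `filter` are only recorded, and nothing executes until an action such as `collect` is called.

```python
class ToyRDD:
    """Minimal sketch of Spark's lazy-transformation model (not real Spark)."""

    def __init__(self, data):
        self._data = data   # in Spark this data would be partitioned across nodes
        self._ops = []      # recorded transformations (the lineage)

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def filter(self, fn):
        self._ops.append(("filter", fn))
        return self

    def collect(self):
        """Action: only now are the recorded transformations executed."""
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD([1, 2, 3, 4, 5])
out = rdd.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(out)  # [30, 40, 50]
```

Because the lineage of transformations is kept, Spark can recompute lost partitions instead of checkpointing everything to disk, which is what makes the in-memory working set practical.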
Spark & Hadoop
■ Spark and Hadoop do not necessarily compete with one another; rather, they complement
each other.

Apache Hadoop | Apache Spark
MapReduce & YARN based system | Spark & RDD
Mostly for batch processing | Can be used for stream processing
Optimized for cheap hardware | RAM-heavy operations
Not an easy API (need to be a rockstar Java dev) | Easy API (Python, R, Scala, Java)
Elasticsearch
■ NoSQL full-text search database
■ Not ACID compliant like Oracle or MySQL
■ Designed for distributed rather than centralized computation
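The structure at the heart of full-text search is the inverted index: each term maps to the documents that contain it. A toy version in plain Python (not Elasticsearch, which adds analysis, scoring, and sharding on top):

```python
from collections import defaultdict

# Hypothetical sample documents for illustration.
docs = {
    1: "distributed search and analytics",
    2: "relational databases are ACID compliant",
    3: "distributed computing at scale",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))

print(search("distributed"))  # [1, 3]
print(search("acid"))         # [2]
```

A query is then a cheap dictionary lookup rather than a scan over every document, which is why this layout suits search-heavy, distributed workloads.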
SQL vs NoSQL ( broadly )
NoSQL Databases | Relational Databases
Designed for performance | Designed for integrity
No relational schema | Predefined schema
Various storage models | Stored as individual records
Better for a lot of reads and writes | Better for a lot of search and query
Auto-sharding | Sharding is not implicitly supported
Limited query abilities | Well-defined, advanced and standardized query language
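The schema contrast above can be shown concretely. Here `sqlite3` stands in for the relational side (a predefined schema that rejects invalid rows), while a plain list of dicts stands in for a schemaless document store; the table and field names are made up for illustration.

```python
import sqlite3

# Relational side: the schema is declared up front and enforced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")

# A row violating NOT NULL is rejected by the database.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (2, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Document-store side: heterogeneous records coexist with no schema check.
documents = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "likes": ["spark", "kafka"]},  # different shape, no error
]

print(rejected)        # True
print(len(documents))  # 2
```

The trade-off is exactly the table's first row: the relational engine spends effort guaranteeing integrity, while the document model trades that guarantee for flexibility and write speed.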
Apache Kafka
• Publish and subscribe to streams of records.
• Store streams of records in a fault-tolerant way.
• Process streams of records as they occur.
• E.g. LinkedIn notifications
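The three bullets above can be sketched as a toy publish/subscribe log in plain Python (not Kafka itself): producers append records to a topic's append-only log, and each consumer reads from its own offset, so the same stream can be consumed independently and replayed.

```python
from collections import defaultdict

class ToyBroker:
    """Minimal sketch of Kafka's topic/offset model (no persistence, no network)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only log of records
        self.offsets = defaultdict(int)   # (topic, consumer) -> next read position

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def poll(self, topic, consumer):
        """Return this consumer's unread records and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.publish("notifications", "user A liked your post")
broker.publish("notifications", "user B sent a message")

print(broker.poll("notifications", "mobile-app"))  # both records
print(broker.poll("notifications", "mobile-app"))  # [] - already consumed
```

Real Kafka adds what the toy omits: the log is partitioned and replicated across brokers for fault tolerance, and offsets are durable, which is how it stores streams safely while serving many independent consumers.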
Data Engineering
Acquiring
• Data Gathering
• Sampling
• Data Ingestion

Data Preparation
• Data Wrangling
• Data Cleaning
• Transformation

Data Analysis
• Statistical Analysis
• Data Modeling
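The preparation and analysis stages above can be sketched end to end in plain Python; the records and field names here are made up for illustration.

```python
# Acquired raw records: inconsistent formatting, one unparseable value.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "bob", "age": "not-a-number"},  # dirty value
    {"name": "Carol", "age": "29"},
]

def clean(record):
    """Preparation: normalize fields; return None for rows that cannot be repaired."""
    try:
        age = int(record["age"])
    except ValueError:
        return None  # cleaning step: drop unparseable rows
    return {"name": record["name"].strip().title(), "age": age}

# Wrangling + cleaning + transformation in one pass.
prepared = [r for r in (clean(rec) for rec in raw) if r is not None]

# Analysis: a trivial statistic over the prepared data.
mean_age = sum(r["age"] for r in prepared) / len(prepared)

print(prepared[0])  # {'name': 'Alice', 'age': 34}
print(mean_age)     # 31.5
```

At big-data scale the same shape of pipeline runs in tools like Spark or Trifacta rather than a list comprehension, but the stages - acquire, prepare, analyze - are the same.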
Data Engineering – Legacy
Acquiring
• SQL insert statements
• CSV file dumping

Data Preparation
• ETL tools: IBM DataStage, Talend

Analytics
• Excel-based analytics
• SAS-based analytics
• IBM Watson
Data Engineering – Big Data Stack
Data Acquisition
• Data Ingestion
• Sqoop – batch ingestion
• Kafka – stream ingestion

Data Prep
• Trifacta data wrangling
• Apache Spark

Data Analytics
• DataRobot
• SAS + R based analytics
• IBM Watson


Editor's Notes

• #5 Components of Hadoop Architecture
  Core components:
  – Distributed Storage Framework – Hadoop Distributed File System (HDFS)
  – Distributed Processing Framework – MapReduce and YARN
  Other supporting frameworks:
  – Integration Frameworks – Sqoop, Flume
  – Management Frameworks – Ambari, ZooKeeper, Oozie
  – Development Frameworks – Pig, Hive, HBase, HCatalog
  – Business Intelligence and Reporting Frameworks – third-party tools and applications like SAS, MicroStrategy, Splunk, the Microsoft BI stack, etc.
  Sources:
  – Intel Distribution for Apache Hadoop: https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/09/24/securing-big-data-for-the-enterprise-project-rhino-and-the-intel-distribution-for-apache-hadoop-idh
  – Big Data Open Source Technology Stack: http://datakulfi.wordpress.com/2013/03/27/big-data-open-source-technology-landscape/
  – Apache Hadoop Official: http://hadoop.apache.org/
  – Where this all fits: http://blog.syncsort.com/wp-content/uploads/2013/11/strata_diagram.png