Big Data 
Distributions and Ecosystem
BigData 
• BigData 3Vs 
– Volume 
– Velocity 
– Variety 
– (Veracity ~ Accuracy)
BigData ≈ Hadoop
• Started from the Google white paper on Map-Reduce 
• Open Source 
• Apache Software Foundation 
• Current version 2.2.0 
• Second generation 
• Set of tools around the core Map-Reduce engine 
• http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
Distributions
Distribution Comparison 
• Different approaches to near-real-time analytics (Hortonworks vs. Cloudera) 
• Different approaches to cluster management 
• Different levels of “Open Source” – risk of vendor lock-in 
• Proprietary components – MapR-FS (MapR) 
NOTE: The BigData space is evolving rapidly.
We use
Ecosystem
• Map-Reduce paradigm (a minimal WordCount sketch follows this list) 
• HDFS – Hadoop Distributed File System 
• YARN – Yet Another Resource Negotiator – opens the cluster to non-MapReduce computational models 
• (Tez) – aims to bridge the gap between the batch and near-real-time operational models
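To make the Map-Reduce paradigm concrete, here is a minimal WordCount sketch against the Hadoop 2.x Java API. The class name and the input/output paths are illustrative, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would be submitted with `hadoop jar wordcount.jar WordCount <input> <output>`; YARN allocates the containers that run the map and reduce tasks.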
PIG 
• Scripting language for Map-Reduce (see the Java sketch after this list) 
• Procedural language 
• Typical use cases: 
– standard extract-transform-load (ETL) data pipelines 
– research on raw data 
– iterative processing of data
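As a rough illustration of Pig in an ETL-style pipeline, here is a sketch that drives Pig Latin from Java through PigServer. It runs in local mode; the access.log file and its two-column layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; ExecType.MAPREDUCE would submit jobs to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical space-separated log: ip url
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS cnt;");

        // Writes the aggregated relation to the 'hits_out' directory.
        pig.store("hits", "hits_out");
        pig.shutdown();
    }
}
```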
HIVE 
• SQL-like syntax 
• Data warehouse – ad-hoc queries 
• Declarative language 
• On top of Map-Reduce (a JDBC sketch follows below)
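Since Hive exposes its SQL-like dialect through HiveServer2, one common way to run ad-hoc queries from Java is plain JDBC. A minimal sketch, assuming HiveServer2 listens on the default port 10000 and that a hypothetical web_logs table exists:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver (hive-jdbc on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // The query compiles down to Map-Reduce (or Tez) jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```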
PIG vs. HIVE
HBase 
• Real-time access to data 
• Non-relational (NoSQL) database 
• Column-oriented database 
• Runs on top of HDFS 
• Fault tolerant, flexible, highly available (see the client sketch below)
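A minimal sketch of real-time reads and writes through the HBase Java client (the HBase 1.x+ API is assumed; the 'metrics' table and its 'd' column family are hypothetical and must already exist):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Write one cell: row "host1", column family "d", qualifier "cpu".
            Put put = new Put(Bytes.toBytes("host1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"), Bytes.toBytes("0.42"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("host1")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```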
Apache Storm 
• Distributed real-time computation system 
• Adds real-time data processing capabilities to Apache Hadoop 
• “Stream processing” 
• Runs on top of YARN 
• Usually part of a “λ (lambda) architecture” (a minimal topology sketch follows below)
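To show what stream processing looks like in code, here is a minimal word-count topology sketch using the Storm 1.x Java API (org.apache.storm packages; older releases used backtype.storm). The spout fabricates sentences in place of a real source such as Kafka, and the topology runs in an in-process LocalCluster for testing.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordCountTopologySketch {

    // Spout: endlessly emits an example sentence (stand-in for a real stream source).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence and keeps a running in-memory word count.
    public static class WordCountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("sentence").split(" ")) {
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("counts", new WordCountBolt(), 2).shuffleGrouping("sentences");

        // In-process cluster for local testing; a real deployment uses StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```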
Apache Mahout 
• Scalable machine learning for Hadoop 
• Based on Map-Reduce 
• Algorithms (a recommender sketch follows this list): 
– Collaborative filtering 
– Clustering 
– Classification 
– Frequent itemset mining
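As an illustration of the collaborative-filtering part, here is a sketch using Mahout's in-process "Taste" recommender API (the MapReduce-based jobs cover the distributed case). The ratings.csv file, with "userID,itemID,preference" lines, is hypothetical.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: CSV lines of "userID,itemID,preference"
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: similar users, then their preferred items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```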
Apache Flume 
• Streams data into Hadoop 
• Collects and aggregates events 
• Guarantees data delivery 
• Scales horizontally
Apache Sqoop 
• Moves data between Hadoop and structured datastores – relational databases 
• Imports into HDFS, HBase, or Hive
Apache ZooKeeper 
• Distributed configuration service 
• Synchronization service 
• Naming registry 
• Reliable, simple, ordered (a client sketch follows below)
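A minimal sketch of ZooKeeper used as a shared configuration store via its Java client (the znode path and value are hypothetical; a local server on port 2181 is assumed):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (here a single local server).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Publish a configuration value as a persistent znode.
        zk.create("/demo-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read (and watch) the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```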
Apache Ambari 
• Management console for a Hadoop cluster 
• Monitoring 
• In the ASF incubation phase
Apache Oozie 
• Workflow engine 
• DAG = Directed Acyclic 
Graph 
• Integrates with: 
– MapReduce 
– PIG 
– Hive 
– Sqoop
Typical use case
Apache Falcon 
• Simplifies data management and pipeline processing 
• Automates movement and processing of datasets 
• Data replication 
• Data eviction 
• Coordination and scheduling
Apache Knox 
• Authentication for Hadoop 
• Hadoop security 
• Expected to run in a DMZ environment 
• Hadoop cluster protected by a firewall


Editor's Notes

  • #3 What does the term Big Data mean? What does it try to tackle? A raw definition.
  • #5 The term BigData is commonly associated with Hadoop.