Big Data 
Distributions and Ecosystem
BigData 
• BigData 3Vs 
– Volume 
– Velocity 
– Variety 
– (Veracity ~ Accuracy)
BigData ≈ Hadoop
• Started from the Google white paper on Map-Reduce 
• Open Source 
• Apache Software Foundation 
• Current version 2.2.0 
• Second generation 
• Set of tools around the core Map-Reduce engine 
• http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
Distributions
Distribution Comparison 
• Different approaches to near-real-time analytics (Hortonworks vs. Cloudera) 
• Different approaches to cluster management 
• Different levels of “Open Source” – risk of vendor lock-in 
• Proprietary components – MapR-FS (MapR) 
NOTE: The BigData space is evolving rapidly.
We use
Ecosystem
• Map-Reduce paradigm (a minimal WordCount sketch follows this list) 
• HDFS – Hadoop Distributed File System 
• YARN – Yet Another Resource Negotiator – opens the cluster to non-MapReduce computational models 
• (Tez) – aims to bridge the gap between the batch and near-real-time operational models
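To make the Map-Reduce paradigm concrete, here is a minimal WordCount sketch against the Hadoop 2.x Java API. The class name and the input/output paths are illustrative, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would be submitted with `hadoop jar wordcount.jar WordCount <input> <output>`; YARN allocates the containers that run the map and reduce tasks.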
PIG 
• Scripting language for Map-Reduce (see the Java sketch after this list) 
• Procedural language 
• Typical use cases: 
– standard extract-transform-load (ETL) data pipelines 
– research on raw data 
– iterative processing of data
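As a rough illustration of Pig in an ETL-style pipeline, here is a sketch that drives Pig Latin from Java through PigServer. It runs in local mode; the access.log file and its two-column layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; ExecType.MAPREDUCE would submit jobs to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical space-separated log: ip url
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS cnt;");

        // Writes the aggregated relation to the 'hits_out' directory.
        pig.store("hits", "hits_out");
        pig.shutdown();
    }
}
```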
HIVE 
• SQL-like syntax 
• Data warehouse – ad-hoc queries 
• Declarative language 
• On top of Map-Reduce (a JDBC sketch follows below)
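Since Hive exposes its SQL-like dialect through HiveServer2, one common way to run ad-hoc queries from Java is plain JDBC. A minimal sketch, assuming HiveServer2 listens on the default port 10000 and that a hypothetical web_logs table exists:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver (hive-jdbc on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // The query compiles down to Map-Reduce (or Tez) jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```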
PIG vs. HIVE
HBase 
• Real-time access to data 
• Non-relational (NoSQL) database 
• Column-oriented database 
• Runs on top of HDFS 
• Fault tolerant, flexible, highly available (see the client sketch below)
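A minimal sketch of real-time reads and writes through the HBase Java client (the HBase 1.x+ API is assumed; the 'metrics' table and its 'd' column family are hypothetical and must already exist):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Write one cell: row "host1", column family "d", qualifier "cpu".
            Put put = new Put(Bytes.toBytes("host1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"), Bytes.toBytes("0.42"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("host1")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```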
Apache Storm 
• Distributed real-time computation system 
• Adds real-time data processing capabilities to Apache Hadoop 
• “Stream processing” 
• Runs on top of YARN 
• Usually part of a “λ (lambda) architecture” (a minimal topology sketch follows below)
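To show what stream processing looks like in code, here is a minimal word-count topology sketch using the Storm 1.x Java API (org.apache.storm packages; older releases used backtype.storm). The spout fabricates sentences in place of a real source such as Kafka, and the topology runs in an in-process LocalCluster for testing.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordCountTopologySketch {

    // Spout: endlessly emits an example sentence (stand-in for a real stream source).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence and keeps a running in-memory word count.
    public static class WordCountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("sentence").split(" ")) {
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("counts", new WordCountBolt(), 2).shuffleGrouping("sentences");

        // In-process cluster for local testing; a real deployment uses StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```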
Apache Mahout 
• Scalable machine learning for Hadoop 
• Based on Map-Reduce 
• Algorithms (a recommender sketch follows this list): 
– Collaborative filtering 
– Clustering 
– Classification 
– Frequent itemset mining
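As an illustration of the collaborative-filtering part, here is a sketch using Mahout's in-process "Taste" recommender API (the MapReduce-based jobs cover the distributed case). The ratings.csv file, with "userID,itemID,preference" lines, is hypothetical.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: CSV lines of "userID,itemID,preference"
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: similar users, then their preferred items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```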
Apache Flume 
• Streams data into Hadoop 
• Collects and aggregates events 
• Guarantees data delivery 
• Scales horizontally
Apache Sqoop 
• Moves data between Hadoop and structured datastores – relational databases 
• Imports into HDFS, HBase, or Hive
Apache ZooKeeper 
• Distributed configuration service 
• Synchronization service 
• Naming registry 
• Reliable, simple, ordered (a client sketch follows below)
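A minimal sketch of ZooKeeper used as a shared configuration store via its Java client (the znode path and value are hypothetical; a local server on port 2181 is assumed):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (here a single local server).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Publish a configuration value as a persistent znode.
        zk.create("/demo-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read (and watch) the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```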
Apache Ambari 
• Management console for a Hadoop cluster 
• Monitoring 
• In the ASF incubation phase
Apache Oozie 
• Workflow engine 
• DAG = Directed Acyclic 
Graph 
• Integrates with: 
– MapReduce 
– PIG 
– Hive 
– Sqoop
Typical use case
Apache Falcon 
• Simplifies data management and pipeline processing 
• Automates movement and processing of datasets 
• Data replication 
• Data eviction 
• Coordination and scheduling
Apache Knox 
• Authentication for Hadoop 
• Hadoop security 
• Expected to run in a DMZ environment 
• Hadoop cluster protected by a firewall


Editor's Notes

  • #3 What does the term Big Data mean? What does it try to tackle? A raw definition.
  • #5 The term BigData is commonly associated with Hadoop.