Hadoop Summit 2014 - recap

Hadoop Summit 2014
What’s cookin?
Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com

More hard work during the
night

Overview
• YARN
• Tez
• Spark
• BlinkDB
• Summingbird
• Storm
• ML

YARN
• Support workloads other than MapReduce

YARN
• Allow other apps to ‘go distributed’ on top of HDFS

Tez
• Execution engine on YARN
• Complex graphs of tasks for processing data
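
To make the "graph of tasks" idea concrete, here is a minimal Python sketch. It is not the Tez API: the task names, the toy data and the `run_dag` helper are all made up for illustration. Each task runs only after its upstream tasks and receives their outputs, without writing intermediate results back to storage between steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def scan_users():
    return [("u1", "dk"), ("u2", "nl")]          # (user, country), toy data


def scan_visits():
    return [("u1", 3), ("u2", 5), ("u1", 2)]     # (user, page views), toy data


def join(users, visits):
    country_of = dict(users)
    return [(country_of[user], views) for user, views in visits]


def aggregate(rows):
    totals = {}
    for country, views in rows:
        totals[country] = totals.get(country, 0) + views
    return totals


def run_dag(tasks, deps):
    """Run every task after its upstream tasks, feeding it their outputs."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[upstream] for upstream in deps[name]]
        results[name] = tasks[name](*inputs)
    return results


tasks = {"scan_users": scan_users, "scan_visits": scan_visits,
         "join": join, "aggregate": aggregate}
# Each key maps to the tasks it depends on (its incoming edges).
deps = {"scan_users": [], "scan_visits": [],
        "join": ["scan_users", "scan_visits"], "aggregate": ["join"]}

result = run_dag(tasks, deps)["aggregate"]
print(result)
```

A real engine would schedule independent vertices (here the two scans) in parallel on cluster containers; the sketch only shows the dependency ordering.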

Tez
• Hive can use Tez since version 0.13; Pig on Tez followed in Pig 0.14
• 2-3x performance increase compared to older Hive
and Pig versions
• Tez does performance optimisations and resource
management across the cluster
• Reuses containers and JVMs: effective for short
queries, e.g. in Hive
• Runs multiple jobs at the same time

Spark
The new kid on the block

BlinkDB
Interactive queries on Very Large Data, based on
sampling

BlinkDB
• Offline sampling module
• Compute data samples, based on a ‘storage budget’
• Store samples on disk and in memory
• Sample selection module
• Select the right samples for an incoming query
• Query execution in parallel
• Answers are augmented by error and confidence bounds
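
The idea of answering from a sample and attaching an error bound can be sketched in a few lines of Python. This is not BlinkDB's algorithm: it assumes a plain uniform sample, synthetic data, and a normal-approximation 95% confidence interval, and the function name `approximate_mean` is made up for illustration.

```python
import math
import random
import statistics


def approximate_mean(sample, z=1.96):
    """Estimate a population mean from a uniform sample.

    Returns (estimate, margin); z=1.96 gives an approximate 95% interval.
    """
    estimate = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate, margin


rng = random.Random(42)
population = [rng.uniform(0, 100) for _ in range(200_000)]  # synthetic "table"
sample = rng.sample(population, 2_000)                      # precomputed 1% sample

estimate, margin = approximate_mean(sample)
exact = statistics.fmean(population)                        # the full scan, for comparison
print(f"approx mean = {estimate:.2f} +/- {margin:.2f} (exact {exact:.2f})")
```

The query touches 1% of the rows; a larger sample (a bigger storage budget) shrinks the margin roughly with the square root of the sample size.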

BlinkDB
• BlinkDB was demonstrated live at VLDB 2012
on a 100-node Amazon EC2 cluster, answering a
range of queries on 17 TB of data in less than 2
seconds (over 200x faster than Hive), within an
error of 2-10%.

Summingbird
• Write MapReduce programs that look like native
Java or Scala collection transformations
• Platform-agnostic
• Execute on a number of distributed MapReduce
platforms, like Scalding (Hadoop) or Storm
• The same code can run for batch and streaming
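
A rough Python sketch of "write the logic once, run it in batch or streaming mode". This is not Summingbird code (Summingbird is a Scala library); the functions `tokenize`, `run_batch` and `run_streaming` are hypothetical stand-ins for the job definition and the two platforms.

```python
from collections import Counter


def tokenize(sentence):
    """The job logic, defined once: turn a sentence into (word, 1) pairs."""
    return [(word, 1) for word in sentence.lower().split()]


def run_batch(sentences):
    """'Batch platform': process the whole data set, return the final counts."""
    counts = Counter()
    for sentence in sentences:
        for word, n in tokenize(sentence):
            counts[word] += n
    return counts


def run_streaming(sentences):
    """'Streaming platform': emit updated counts after every incoming sentence."""
    counts = Counter()
    for sentence in sentences:
        for word, n in tokenize(sentence):
            counts[word] += n
        yield Counter(counts)  # snapshot of the running totals


data = ["Hadoop on YARN", "Storm on YARN"]
batch_counts = run_batch(data)
*_, stream_counts = run_streaming(iter(data))  # keep only the last snapshot
print(batch_counts == stream_counts)
```

Because the counts are combined with an associative sum, both modes converge on the same totals once the stream has seen the same input.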

Summingbird
• Word count in pure Scala
• Word count in Summingbird

Summingbird
• ‘Strongly encourages’ the lambda architecture

Storm (on YARN)
Stream data processing on Hadoop.
Storm recap:
• Processes unbounded streams of tuples
• Basic primitives are spouts and bolts
• A spout is a source of streams
• A bolt processes streams and may emit new streams
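
The spout/bolt model can be imitated with Python generators. This is only a single-process analogy, not the Storm API: the names `sentence_spout`, `split_bolt` and `count_bolt` and the sample sentences are made up for illustration.

```python
import itertools


def sentence_spout():
    """Spout: an unbounded source of tuples (here an endless cycle of sentences)."""
    for sentence in itertools.cycle(["storm on yarn", "spark streaming"]):
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the "topology": spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

# The stream never ends, so only take the first few output tuples.
first_seven = list(itertools.islice(topology, 7))
print(first_seven)
```

In Storm the same wiring is distributed: each spout and bolt runs as many parallel tasks across the cluster, and tuples are routed between them.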

Storm Alternatives
Spark Streaming

Machine Learning
Sparse Data Representation
uid1: url1, url2, url4, url6, url7, url8
uid2: url2, url3, url5, url9, url10, url11
uid1: 11010111000
uid2: 01101000111
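
The mapping between the two forms above can be reproduced with a short Python sketch. It assumes the bit positions correspond to url1 through url11 in order, which is inferred from the 11-bit vectors on the slide; the helper names are made up for illustration.

```python
def to_bit_vector(urls, vocabulary):
    """Dense form: one bit per known URL, '1' if the user visited it."""
    visited = set(urls)
    return "".join("1" if url in visited else "0" for url in vocabulary)


def to_sparse(bits, vocabulary):
    """Sparse form: list only the URLs whose bit is set."""
    return [url for url, bit in zip(vocabulary, bits) if bit == "1"]


vocabulary = [f"url{i}" for i in range(1, 12)]  # url1 .. url11 (assumed column order)

visits = {
    "uid1": ["url1", "url2", "url4", "url6", "url7", "url8"],
    "uid2": ["url2", "url3", "url5", "url9", "url10", "url11"],
}

vectors = {uid: to_bit_vector(urls, vocabulary) for uid, urls in visits.items()}
for uid, bits in vectors.items():
    print(f"{uid}: {bits}")
```

With millions of URLs and only a handful of visits per user, almost every bit is 0, which is why storing only the visited URLs (the sparse form) is so much cheaper.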

Machine Learning
Options on Hadoop
• Python with UDFs
• MLlib
• Mahout
• SparkR

Mahout
• A scalable machine learning library
The Mahout community decided to move its codebase onto […] systems
that offer a richer programming model and more efficient execution than
Hadoop MapReduce.
Mahout will therefore reject new MapReduce algorithm implementations
from now on.
We are building our future implementations on top of a DSL […].
Programs written in this DSL are automatically optimized and executed in
parallel on Apache Spark.
https://mahout.apache.org/

Machine Learning
Trends
• Sparse data representation
• Deep learning
• Anomaly detection

and a lot more…
(come talk to us :))
