Hadoop Summit 2014 - recap

UserReport

Technology Education

Overview
• YARN
• Tez
• Spark
• BlinkDB
• Summingbird
• Storm
• ML

YARN
• Support workloads other than MapReduce

YARN
• Allow other apps to ‘go distributed’ on top of HDFS

Tez
• Execution engine on YARN
• Complex graphs of tasks for processing data
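
To make "graphs of tasks" concrete, here is a minimal sketch in plain Python. It is not the Tez API: `run_dag`, the task names and the toy data are all made up for illustration. Each task runs only after the tasks it depends on have produced their output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run each task once all of its upstream tasks have finished.

    tasks: name -> function taking a dict of upstream results
    deps:  name -> set of upstream task names
    (Hypothetical helper, not part of Tez.)
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical three-vertex graph: two scans feeding one join.
tasks = {
    "scan_users": lambda _: [("u1", "Ann"), ("u2", "Bob")],
    "scan_visits": lambda _: [("u1", "url1"), ("u1", "url2"), ("u2", "url3")],
    "join": lambda up: [
        (name, url)
        for uid, name in up["scan_users"]
        for vid, url in up["scan_visits"]
        if uid == vid
    ],
}
deps = {
    "scan_users": set(),
    "scan_visits": set(),
    "join": {"scan_users", "scan_visits"},
}
dag_results = run_dag(tasks, deps)
```

The whole graph is described up front, so an engine can see every stage before running anything. A chain of separate MapReduce jobs does not give it that view.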

Tez
• Hive can use Tez since version 0.13; Pig gained Tez support in version 0.14
• 2-3x performance increase compared to older Hive and Pig versions
• Tez does performance optimisations and resource management across the cluster
• Reuses containers and JVMs: effective for short queries in e.g. Hive
• Multiple jobs at the same time

BlinkDB
Interactive queries on Very Large Data, based on
sampling

BlinkDB
• Offline sampling module
  • Compute data samples, based on a ‘storage budget’
  • Store samples on disk and in memory
• Sample selection module
  • Select the right samples for an incoming query
• Query execution in parallel
• Answers are augmented by error and confidence bounds
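
The idea of answering from a sample and reporting an error bound can be sketched as follows. This is not BlinkDB code: `approximate_mean`, the synthetic data and the normal-approximation interval are assumptions chosen for illustration.

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval when z = 1.96 (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)
    return mean, margin


# Hypothetical data set: 200,000 synthetic values between 0 and 100.
data_rng = random.Random(42)
population = [data_rng.uniform(0, 100) for _ in range(200_000)]

# Only 5% of the rows are read to produce the answer.
estimate, margin = approximate_mean(population, sample_size=10_000)
```

A larger sample narrows the margin but takes longer to scan. That is the accuracy-versus-latency trade-off the sample selection step has to make.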

BlinkDB
• BlinkDB was demonstrated live at VLDB 2012 on a 100-node Amazon EC2 cluster,
answering a range of queries on 17 TB of data in less than 2 seconds
(over 200x faster than Hive), within an error of 2-10%.

Summingbird
• Write MapReduce programs that look like native
Java or Scala collection transformations
• Platform-agnostic
• Execute on a number of distributed MapReduce
platforms, like Scalding (Hadoop) or Storm
• The same code can run for batch and streaming
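
The "same code for batch and streaming" idea can be sketched in plain Python. This is not the Summingbird API: `count_words`, `run_batch` and `run_streaming` are hypothetical names. The job logic is written once and two small runners apply it to a finished data set and to a stream of events.

```python
from collections import Counter


def count_words(sentence):
    """The job logic, written once: map a sentence to partial word counts."""
    return Counter(sentence.lower().split())


def run_batch(sentences):
    """Batch mode: process the whole data set and return the final counts."""
    total = Counter()
    for sentence in sentences:
        total += count_words(sentence)
    return total


def run_streaming(sentence_stream):
    """Streaming mode: yield a snapshot of the running counts after every event."""
    total = Counter()
    for sentence in sentence_stream:
        total += count_words(sentence)
        yield Counter(total)


sentences = ["to be or not to be", "to see or not to see"]
batch_counts = run_batch(sentences)
stream_counts = list(run_streaming(iter(sentences)))
```

Adding counters is associative, so partial results can be merged in any grouping. That is why the batch run and the streaming run end up with the same totals.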

Summingbird
• Word-count in pure Scala
• In Summingbird

Summingbird
• ‘Strongly encourages’ the lambda architecture

Storm (on YARN)
Stream data processing on Hadoop.
Storm recap:
• Processes unbounded streams of tuples.
• Basic primitives are spouts and bolts
• A spout is a source of streams.
• A bolt processes streams and may emit new streams
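
The spout and bolt roles can be mimicked with Python generators. This is only a sketch of the concepts, not Storm's API: the function names are hypothetical, and the spout here is finite while a real one is unbounded.

```python
def sentence_spout():
    """Spout: a source of tuples (finite here, unbounded in a real topology)."""
    for sentence in ["storm on yarn", "storm on hadoop"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the topology: spout -> split bolt -> count bolt.
emitted = list(count_bolt(split_bolt(sentence_spout())))
```

Each stage handles one tuple at a time and passes its output straight on, so nothing waits for the input to end.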

Machine Learning
Sparse Data Representation
uid1: url1, url2, url4, url6, url7, url8
uid2: url2, url3, url5, url9, url10, url11
uid1: 11010111000
uid2: 01101000111
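
The conversion from URL lists to bit vectors shown above can be reproduced with a short sketch. The helper names are hypothetical, and the vocabulary is assumed to be url1 through url11 in order.

```python
def to_bit_vector(urls, vocabulary):
    """Dense form: one '1' or '0' per URL in the vocabulary, in order."""
    visited = set(urls)
    return "".join("1" if url in visited else "0" for url in vocabulary)


def to_sparse_indices(bits):
    """Sparse form: keep only the 1-based positions of the set bits."""
    return [i for i, bit in enumerate(bits, start=1) if bit == "1"]


# Assumed vocabulary: url1 .. url11.
vocabulary = [f"url{i}" for i in range(1, 12)]
visits = {
    "uid1": ["url1", "url2", "url4", "url6", "url7", "url8"],
    "uid2": ["url2", "url3", "url5", "url9", "url10", "url11"],
}
vectors = {uid: to_bit_vector(urls, vocabulary) for uid, urls in visits.items()}
sparse = {uid: to_sparse_indices(bits) for uid, bits in vectors.items()}
```

With millions of URLs almost every bit is zero, so storing only the set positions takes far less space than the dense vector.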

Machine Learning
Options on Hadoop
• Python with UDF
• MLlib
• Mahout
• SparkR

Mahout
• A scalable machine learning library
The Mahout community decided to move its codebase onto […] systems
that offer a richer programming model and more efficient execution than
Hadoop MapReduce.
Mahout will therefore reject new MapReduce algorithm implementations
from now on.
We are building our future implementations on top of a DSL […].
Programs written in this DSL are automatically optimized and executed in
parallel on Apache Spark.
https://mahout.apache.org/

Machine Learning
Trends
• Sparse data representation
• Deep learning
• Anomaly detection

1. Hadoop Summit 2014 What’s cookin? Eric Eijkelenboom & Martin Olsen - UserReport - www.userreport.com

2. Hard work during the day

3. A quick bite to eat

4. More hard work during the night

5. The End

9. YARN cluster architecture

12. Spark The new kid on the block

21. Storm Alternatives Spark Streaming

26. and a lot more… (come talk to us :))
