From the dive bars of Silicon Valley to the World Tour
Carlo Curino
PhD + PostDoc in databases
“we did all of this 30 years ago!”
Yahoo! Research
“webscale, webscale, webscale!”
Microsoft – CISL
“enterprise + cloud + search engine + big-data”
My perspective
Cluster as an embedded system (map-reduce)
single-purpose clusters
General purpose cluster OS (YARN, Mesos, Omega, Corona)
standardizing access to computational resources
Real-time OS for the cluster !? (Rayon)
predictable resource allocation
Cluster stdlib (REEF)
factoring out common functionalities
Agenda
Cluster as an Embedded System
“the era of map-reduce only clusters”
Purpose-built technology
Within large web companies
Well targeted mission (process webcrawl)
à scale and fault tolerance
The origin
Google leading the pack
Google File System + MapReduce (2003/2004)
Open-source and parallel efforts
Yahoo! Hadoop ecosystem HDFS + MR (2006/2007)
Microsoft Scope/Cosmos (2008) (more than MR)
In-house growth
What was the key to success for Hadoop?
In-house growth (following Hadoop story)
Access, access, access…
All the data sit in the DFS
Trivial to use massive compute power
à lots of new applications
But… everything has to be MR
Cast any computation as map-only job
MPI, graph processing, streaming, launching web-servers!?!
Popularization
Everybody wants Big Data
Insight from raw data is cool
Outside MS and Google, Big-Data == Hadoop
Hadoop as catch-all big-data solution (and cluster manager)
Not just massive in-house clusters
New challenges?
New deployment environments
Small clusters (10s of machines)
Public Cloud
New deployment challenges
Small clusters
Efficiency matters more than scalability
Admin/tuning done by mere mortals
Cloud
Untrusted users (security)
Users are paying (availability, predictability)
Users are unrelated to each other (performance isolation)
Classic MapReduce
Classic Hadoop (1.0) Architecture
[Diagram: a Client submits Job1 to the JobTracker, which contains the Scheduler; three TaskTrackers run the Map and Reduce tasks]
Handles resource management
Global invariants
fairness/capacity
Determines
who runs / resources / where
Manages MapReduce application flow
maps before reducers, re-run upon failure, etc.
What are the key shortcomings of (old) Hadoop?
Hadoop 1.0 Shortcomings
Programming model rigidity
JobTracker manages resources
JobTracker manages application workflow (data dependencies)
Performance and Availability
Map vs Reduce slots lead to low cluster utilization (~70%)
JobTracker had too much to do: scalability concern
JobTracker is a single point of failure
Hadoop 1.0 Shortcomings (similar to original MR)
General purpose cluster OS
“Cluster OS (YARN)”
YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*
Request-based central scheduler
Mesos (2011, UCB, open-sourced, tested at Twitter)*
Offer-based two level scheduler
Omega (2013, Google, simulation?)*
Shared-state-based scheduling
Corona (2013, Facebook, production)
YARN-like but offer-based
Four proposals
* all three won best-paper or best-student-paper awards
YARN
[Diagram: Hadoop 1 World vs Hadoop 2 World, top to bottom]
Users
Application Frameworks: Hive / Pig and ad-hoc apps (both worlds)
Programming Model(s): MR v1 (Hadoop 1) | MR v2, Tez, Giraph, Storm, Dryad, REEF, ... (Hadoop 2)
Cluster OS (Resource Management): Hadoop 1.x (MapReduce) (Hadoop 1) | YARN (Hadoop 2)
File System: HDFS 1 (Hadoop 1) | HDFS 2 (Hadoop 2)
Hardware
A new architecture for Hadoop
Decouples resource management from programming model
(MapReduce is an “application” running on YARN)
YARN (or Hadoop 2.x)
YARN (Hadoop 2) Architecture
[Diagram: a Client submits Job1 to the Resource Manager, which contains the Scheduler; NodeManagers host the App Master and its Tasks; the App Master negotiates access to more resources with the Resource Manager (ResourceRequest)]
Flexibility, Performance and Availability
Multiple Programming Models
Central components do less → scale better
Easier High-Availability (e.g., RM vs AM)
Why does this matter?
System       Jobs/Day      Tasks/Day    Cores pegged
Hadoop 1.0   77k           4M           3.2
YARN         125k (150k)   12M (15M)    6 (10)
Anything else you can think of?
Maintenance, Upgrade, and Experimentation
Run with multiple framework versions (at one time)
Trying out a new idea is as easy as launching a job
Anything else you can think of?
Real-time OS for the cluster (?)
“predictable resource allocation”
YARN (Cosmos, Mesos, and Corona)
support instantaneous scheduling invariants (fairness/capacity)
maximize cluster throughput (eye to locality)
Current trends
New applications (require “gang” and dependencies)
Consolidation of production/test clusters + Cloud
(SLA jobs mixed with best-effort jobs)
Motivation
Job/Pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie)
Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza)
Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph)
Example Use Cases
In a consolidated cluster:
Time-based SLAs for production jobs (completion deadline)
Good latency for best-effort jobs
High cluster utilization/throughput
(Support rich applications: gang and skylines)
High-Level Goals
Decompose time-based SLAs into
resource definition: via RDL
predictable resource allocation: planning + scheduling
Divide and Conquer time-based SLAs
Expose to planner application needs
time: start (s), finish (f)
resources: capacity (w), total parallelism (h),
minimum parallelism (l), min lease duration (t)
Resource Definition Language (RDL) 1/2
Skylines / pipelines:
dependencies: among atomic allocations
(ALL, ANY, ORDER)
Resource Definition Language (RDL) 2/2
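The two RDL slides can be read as a small data model: atomic allocations carrying (s, f, w, h, l, t), combined by a dependency operator. The sketch below is one way to write that down. The class and method names (`RdlSketch`, `Atom`, `Request`, `isGang`, `deadline`) are hypothetical and are not the YARN reservation API; treating "minimum parallelism equals total parallelism" as a gang is an assumption drawn from the Giraph use case above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical data model for an RDL request; not the actual YARN classes. */
final class RdlSketch {
    /** Dependency operators among atomic allocations, as named on the slide. */
    enum Dependency { ALL, ANY, ORDER }

    /** One atomic allocation carrying the slide's parameters s, f, w, h, l, t. */
    static final class Atom {
        final long start;            // s: earliest start time
        final long finish;           // f: deadline
        final int capacity;          // w: capacity
        final int totalParallelism;  // h: total parallelism
        final int minParallelism;    // l: minimum parallelism
        final long minLease;         // t: minimum lease duration

        Atom(long start, long finish, int capacity,
             int totalParallelism, int minParallelism, long minLease) {
            if (finish <= start) {
                throw new IllegalArgumentException("f must come after s");
            }
            if (minParallelism < 1 || minParallelism > totalParallelism) {
                throw new IllegalArgumentException("need 1 <= l <= h");
            }
            this.start = start;
            this.finish = finish;
            this.capacity = capacity;
            this.totalParallelism = totalParallelism;
            this.minParallelism = minParallelism;
            this.minLease = minLease;
        }

        /** Assumption: a gang is an atom whose minimum and total parallelism coincide. */
        boolean isGang() {
            return minParallelism == totalParallelism;
        }
    }

    /** Several atoms tied together by one dependency operator. */
    static final class Request {
        final Dependency dependency;
        final List<Atom> atoms;

        Request(Dependency dependency, Atom... atoms) {
            this.dependency = dependency;
            this.atoms = Arrays.asList(atoms);
        }

        /** Latest finish time over all atoms: the deadline of the whole request. */
        long deadline() {
            long latest = Long.MIN_VALUE;
            for (Atom a : atoms) {
                latest = Math.max(latest, a.finish);
            }
            return latest;
        }
    }
}
```

With time in minutes, "50 concurrent containers for 3 hours" would be `new RdlSketch.Atom(0, 180, 1, 50, 50, 180)`, and a two-stage pipeline would wrap two atoms in a `Request` with `Dependency.ORDER`.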
Important classes
Framework semantics: Perforator modeling of Scope/Hive
Machine Learning: gang + bounded iterations (PREDict)
Periodic jobs: history-based resource definition
Coming up with RDL specs
prediction
Planning vs Scheduling
[Diagram: queue hierarchy and reservation components]
Queue hierarchy:
  Root 100%
    Staging 15%
    Production 60%
      J1 10%
      J2 40%
      J3 10%
    Post 5%
    Best Effort 20%
Component labels: Resource Definition, Planning (coarse but time-aware), Plan, Plan Follower, Scheduling (fine-grained but time-oblivious), Preemption, Sharing Policy, ReservationService, Resource Manager, Sys Model / Feedback
Some example run:
lots of queues for gridmix
Microsoft pipelines
Dynamic queues
Improves
production job SLAs
best-effort jobs latency
cluster utilization and throughput
Comparing against Hadoop CapacityScheduler
Under promise, over deliver
Plan for late execution, and run as early as you can
Greedy Agent
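The planning half of "plan for late execution" can be sketched as a backward scan: fill the time steps closest to the deadline first, so that earlier steps stay free for other reservations. This is an illustration under simplifying assumptions (discrete time steps, one resource dimension, no minimum lease); `GreedyAgentSketch` and `placeLate` are hypothetical names, and running early is left to the scheduler at run time.

```java
/** Hypothetical "place as late as possible" planner; not the YARN agent code. */
final class GreedyAgentSketch {
    /**
     * Places work container-steps inside [start, finish), latest steps first.
     *
     * @param free           free containers per time step (not modified)
     * @param start          first usable step, inclusive
     * @param finish         deadline step, exclusive
     * @param work           total container-steps needed
     * @param maxParallelism most containers usable in a single step
     * @return containers reserved per step, or null if the work does not fit
     */
    static int[] placeLate(int[] free, int start, int finish,
                           int work, int maxParallelism) {
        if (start < 0 || finish > free.length || start > finish) {
            throw new IllegalArgumentException("window outside the plan");
        }
        int[] plan = new int[free.length];
        int remaining = work;
        // Walk backwards from the deadline, taking as much as each step allows.
        for (int t = finish - 1; t >= start && remaining > 0; t--) {
            int take = Math.min(remaining, Math.min(free[t], maxParallelism));
            plan[t] = take;
            remaining -= take;
        }
        return remaining == 0 ? plan : null;
    }
}
```

For example, with four free containers in each of four steps, a request for six container-steps with parallelism up to four lands entirely in the last two steps.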
Coping with imperfections (system)
compensate RDL based on black-box models of overheads
Coping with Failures (system)
re-plan (move/kill allocations) in response to system-observable resource issues
Coping with Failures/Misprediction (user)
continue in best-effort mode when reservation expires
re-negotiate existing reservations
Dealing with “Reality”
Sharing Policy: CapacityOverTimePolicy
constraints: instantaneous max and running average
e.g., no user can exceed an instantaneous 30% allocation, and
an average of 10% in any 24h period of time
single partial scan of plan: O(|alloc| + |window|)
User Quotas (trade-off flexibility to fairness)
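The two constraints above (an instantaneous cap and a cap on the running average) can be checked in one pass with a sliding-window sum. The sketch below assumes a user's plan is an array of per-step cluster shares and that nothing was allocated before step 0. `CapacityOverTimeSketch` and `accepts` are hypothetical names, not the YARN `CapacityOverTimePolicy` class.

```java
/** Hypothetical quota check in the spirit of the slide; not the YARN class. */
final class CapacityOverTimeSketch {
    private static final double EPS = 1e-9; // tolerance for floating-point sums

    /**
     * @param alloc   the user's share of the cluster per time step, each in [0, 1]
     * @param instMax cap on any single step (e.g. 0.30)
     * @param avgMax  cap on the mean over any window of consecutive steps (e.g. 0.10)
     * @param window  window length in steps (e.g. 24 for hourly steps)
     * @return true if the plan respects both caps
     */
    static boolean accepts(double[] alloc, double instMax, double avgMax, int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be positive");
        }
        double sum = 0.0;
        for (int i = 0; i < alloc.length; i++) {
            if (alloc[i] > instMax + EPS) {
                return false; // instantaneous cap violated
            }
            sum += alloc[i];
            if (i >= window) {
                sum -= alloc[i - window]; // slide the window forward by one step
            }
            // Steps before index 0 count as zero, so early windows are
            // still averaged over the full window length.
            if (sum / window > avgMax + EPS) {
                return false; // running-average cap violated
            }
        }
        return true;
    }
}
```

With hourly steps, `accepts(plan, 0.30, 0.10, 24)` encodes the slide's example: eight hours at 30% in a day is allowed, nine hours is not.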
Introduce Admission Control and Time-based SLAs (YARN-1051)
New ReservationService API (to reserve resources)
Agents + Plan + SharingPolicy to organize future allocations
Leverage underlying scheduler
Future Directions
Work with MSR-India on RDL estimates for Hive and MR
Advanced agents for placement ($$-based and optimal algos)
Enforcing decisions (Linux Containers, Drawbridge, Pacer)
Conclusion
Cluster stdlib: REEF
“factoring out recurring components”
Dryad                        DAG computations
Tez                          DAG computations (focus on interactive and Hive support)
Storm                        stream processing
Spark                        interactive / in-memory / iterative
Giraph                       graph processing, Bulk Synchronous Parallel (à la Pregel)
Impala                       scalable, interactive, SQL-like query
HoYA                         HBase on YARN
Stratosphere                 parallel iterative computations
REEF, Weave, Spring-Hadoop   meta-frameworks to help build apps
Focusing on YARN: many applications
Lots of repeated work
Communication
Configuration
Data and Control Flow
Error handling / fault-tolerance
Common “better than hadoop” tricks:
Avoid Scheduling overheads
Control Excessive disk IO
Are YARN/Mesos/Omega enough?
The Challenge
[Diagram: SQL / Hive, Machine Learning, and other frameworks, each built directly on YARN / HDFS]
SQL / Hive:
• Fault Tolerance
• Row/Column Storage
• High Bandwidth Networking
Machine Learning:
• Fault Awareness
• Local data caching
• Low Latency Networking
REEF in the Stack
[Diagram: SQL / Hive, Machine Learning, and other frameworks on top of REEF, which sits on YARN / HDFS]
REEF in the Stack (Future)
[Diagram: the same stack, with an Operator API and Library and a Logical Abstraction added alongside REEF]
REEF
[Diagram: the YARN architecture (Client, Job1, Resource Manager with Scheduler, NodeManagers, App Master, Tasks; the App Master negotiates access to more resources via ResourceRequest), followed by the same cluster running REEF: the REEF RT and Driver, plus Evaluators hosting Tasks and services on the NodeManagers]
Diagram annotations:
Name-based
User control flow logic
Retains State!
User data crunching logic
Fault-detection
Injection-based checkable configuration
Event-based control flow
REEF: Computation and Data Management
Extensible Control Flow Data Management Services
Storage
Network
State Management
Job Driver: Control plane implementation. User code executed on YARN’s Application Master.
Activity: User code executed within an Evaluator.
Evaluator: Execution environment for Activities. One Evaluator is bound to one YARN Container.
Control Flow is centralized in the Driver
Evaluator, Tasks configuration and launch
Error Handling is centralized in the Driver
All exceptions are forwarded to the Driver
All APIs are asynchronous
Support for:
Caching / Checkpointing / Group communication
Example Apps Running on REEF
MR, asynchronous PageRank, ML regressions, PCA, distributed shell, …
REEF Summary
(Open-sourced with Apache License)
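The summary's points about centralized control flow, centralized error handling, and asynchronous APIs suggest an event-handler style for the Driver. The sketch below only illustrates that style: `DriverSketch`, its handler names, and the retry rule are hypothetical and are not the REEF API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical event-driven driver; a stand-in, not the REEF API. */
final class DriverSketch {
    /** Decisions taken so far, in order, so the control flow can be inspected. */
    final List<String> log = new ArrayList<>();
    private final Map<String, Integer> attempts = new HashMap<>();
    private final int maxAttempts;

    DriverSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Event: an evaluator was granted. The driver reacts by launching a task on it. */
    void onEvaluatorAllocated(String evaluatorId) {
        attempts.merge(evaluatorId, 1, Integer::sum);
        log.add("submit task on " + evaluatorId);
    }

    /** Event: a task threw. The exception arrives here and the driver decides what to do. */
    void onTaskFailed(String evaluatorId, Exception cause) {
        if (attempts.getOrDefault(evaluatorId, 0) < maxAttempts) {
            attempts.merge(evaluatorId, 1, Integer::sum);
            log.add("retry task on " + evaluatorId);
        } else {
            log.add("give up on " + evaluatorId + ": " + cause.getMessage());
        }
    }
}
```

Because every decision is made in one place, a retry policy like this one can change without touching the code that runs inside the evaluators.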
Big-Data Systems
Ongoing focus
Future work
Leverage high-level app semantics
Coordinate tiered-storage and scheduling
Conclusions
Adding Preemption to YARN,
and open-sourcing it to Apache
Limited mechanisms to “revise current schedule”
Patience
Container killing
To enforce global properties
Leave resources fallow (e.g., CapacityScheduler) → low utilization
Kill containers (e.g., FairScheduler) → wasted work
(Old) new trick
Support work-preserving preemption
(via) checkpointing → more than preemption
State of the Art
Changes throughout YARN
[Diagram: Client, Job1, RM with Scheduler, and three NodeManagers hosting the App Master and its Tasks]
PreemptionMessage {
Strict { Set<ContainerID> }
Flexible { Set<ResourceRequest>,
Set<ContainerID> }
}
Collaborative application
Policy-based binding for Flexible
preemption requests
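One way to picture the policy-based binding: the application must release every container in the strict set, but for a flexible request it chooses which containers to give back. The sketch below simplifies the flexible `Set<ResourceRequest>` to a single amount of memory and uses "least progress first" as the policy. Both are assumptions, and `PreemptionSketch`, `Container`, and `choose` are hypothetical names rather than YARN types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical application-side binding of a preemption request; not YARN code. */
final class PreemptionSketch {
    /** A running container as seen by the application. */
    static final class Container {
        final String id;
        final int memoryMb;
        final double progress; // fraction of work done, in [0, 1]

        Container(String id, int memoryMb, double progress) {
            this.id = id;
            this.memoryMb = memoryMb;
            this.progress = progress;
        }
    }

    /**
     * @param running    containers currently held by the application
     * @param strict     ids that must be released (no choice)
     * @param flexibleMb memory the RM wants back on top of the strict set
     * @return ids to release: all strict ones, then the least-advanced
     *         containers until flexibleMb is covered
     */
    static Set<String> choose(List<Container> running, Set<String> strict, int flexibleMb) {
        Set<String> release = new LinkedHashSet<>(strict);
        List<Container> candidates = new ArrayList<>();
        for (Container c : running) {
            if (!strict.contains(c.id)) {
                candidates.add(c);
            }
        }
        // Policy (an assumption): sacrifice the work that is cheapest to redo.
        candidates.sort(Comparator.comparingDouble((Container c) -> c.progress));
        int freed = 0;
        for (Container c : candidates) {
            if (freed >= flexibleMb) {
                break;
            }
            release.add(c.id);
            freed += c.memoryMb;
        }
        return release;
    }
}
```

A different policy, such as preferring containers with checkpointed state, would only change the comparator.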
Use of Preemption
Context:
Outdated information
Delayed effects of actions
Multi-actor orchestration
Interesting type of preemption:
RM declarative request
AM binds it to containers
When can I preempt?
tag safe UDFs or
user-saved state
@Preemptable
public class MyReducer{
…
}
Common Checkpoint Service
WriteChannel cwc = cs.create();
cwc.write(…state…);
CheckpointID cid = cs.commit(cwc);
ReadChannel crc = cs.open(cid);
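The create / write / commit / open sequence above can be imitated with an in-memory store, which makes the contract visible: state becomes readable only once it is committed. The sketch below is a stand-in with hypothetical names (`InMemoryCheckpoints` and an `int` in place of `CheckpointID`). It is not the Hadoop checkpoint service.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a checkpoint service; not the Hadoop API. */
final class InMemoryCheckpoints {
    /** Buffer being written; invisible to readers until it is committed. */
    static final class WriteChannel {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        void write(byte[] state) {
            buffer.write(state, 0, state.length);
        }
    }

    private final Map<Integer, byte[]> committed = new HashMap<>();
    private int nextId = 0;

    /** Opens a fresh, uncommitted channel. */
    WriteChannel create() {
        return new WriteChannel();
    }

    /** Publishes the buffered bytes and returns the id used to reopen them. */
    int commit(WriteChannel channel) {
        int id = nextId++;
        committed.put(id, channel.buffer.toByteArray());
        return id;
    }

    /** Returns a copy of the committed state; throws if the id was never committed. */
    byte[] open(int id) {
        byte[] data = committed.get(id);
        if (data == null) {
            throw new IllegalArgumentException("unknown checkpoint " + id);
        }
        return data.clone();
    }
}
```

A preempted task would write its partial state, commit, and exit. The rescheduled task would call `open` with the saved id and resume from there.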
[Chart: memory utilization for CapacityScheduler + Unreservation + Preemption, compared with CapacityScheduler (allow overcapacity) and CapacityScheduler (no overcapacity)]
[Diagram: the YARN architecture (Client, Job1, RM with Scheduler, NodeManagers, App Master, Tasks) annotated with the related JIRA issues: MR-5192, MR-5194, MR-5197, MR-5189, MR-5176, YARN-569, MR-5196]
(Metapoint) Experience contributing to Apache
Engaging with OSS
talk with active developers
show early/partial work
small patches
ok to leave things unfinished
With @Preemptable
tag imperative code with semantic property
Generalize this trick
expose semantic properties to platform (@PreserveSortOrder)
allow platforms to optimize execution (map-reduce pipelining)
REEF seems the logical place to do this.
Tagging UDFs
(Basic) Building block for:
Efficient preemption
Dynamic Optimizations (task splitting, efficiency improvements)
Fault Tolerance
Other uses for Checkpointing
61

More Related Content

What's hot

Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
Leveraging open source for big data stack
Leveraging open source for big data stackLeveraging open source for big data stack
Leveraging open source for big data stackFlytxt
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache HadoopSuman Saurabh
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data WarehousingThomas Kejser
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...CloudxLab
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solrboorad
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataKaran Desai
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyNati Shalom
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworksAmal Targhi
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 

What's hot (20)

Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Leveraging open source for big data stack
Leveraging open source for big data stackLeveraging open source for big data stack
Leveraging open source for big data stack
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data Warehousing
 
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial...
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworks
 
Big data frameworks
Big data frameworksBig data frameworks
Big data frameworks
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Motivation for big data
Motivation for big dataMotivation for big data
Motivation for big data
 

Viewers also liked

Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big DataeXascale Infolab
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learningananth
 
Big Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesBig Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesYellowfin
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (13)

Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Smile
SmileSmile
Smile
 
Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learning
 
Big Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesBig Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer Stories
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
What is big data?
What is big data?What is big data?
What is big data?
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to The Evolution of Big Data Frameworks

عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" BioinformaticsBrian Repko
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 

Similar to The Evolution of Big Data Frameworks (20)

عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" Bioinformatics
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Hive paris
Hive parisHive paris
Hive paris
 

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictioneXascale Infolab
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...eXascale Infolab
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex GraphseXascale Infolab
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapeXascale Infolab
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...eXascale Infolab
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...eXascale Infolab
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceanseXascale Infolab
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutioneXascale Infolab
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataeXascale Infolab
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataeXascale Infolab
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingeXascale Infolab
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...eXascale Infolab
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingeXascale Infolab
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)eXascale Infolab
 

The Evolution of Big Data Frameworks

  • 1. From the dive bars of Silicon Valley to the World Tour. Carlo Curino
  • 2. My perspective. PhD + PostDoc in databases: “we did all of this 30 years ago!” Yahoo! Research: “webscale, webscale, webscale!” Microsoft – CISL: “enterprise + cloud + search engine + big-data”
  • 3. Agenda. Cluster as an embedded system (map-reduce): single-purpose clusters. General-purpose cluster OS (YARN, Mesos, Omega, Corona): standardizing access to computational resources. Real-time OS for the cluster!? (Rayon): predictable resource allocation. Cluster stdlib (REEF): factoring out common functionalities.
  • 4. Cluster as an embedded system: “the era of map-reduce only clusters”
  • 5. The origin. Purpose-built technology: within large web companies; well-targeted mission (process webcrawl) → scale and fault tolerance. Google leading the pack: Google File System + MapReduce (2003/2004). Open-source and parallel efforts: Yahoo! Hadoop ecosystem HDFS + MR (2006/2007); Microsoft Scope/Cosmos (2008) (more than MR).
  • 6. In-house growth: what was the key to success for Hadoop?
  • 7. In-house growth (following Hadoop story). Access, access, access: all the data sit in the DFS; trivial to use massive compute power → lots of new applications. But everything has to be MR: cast any computation as a map-only job (MPI, graph processing, streaming, launching web-servers!?!).
  • 8. Popularization. Everybody wants Big Data: insight from raw data is cool. Outside MS and Google, Big-Data == Hadoop: Hadoop as catch-all big-data solution (and cluster manager).
  • 9. New challenges? Not just massive in-house clusters. New deployment environments: small clusters (10s of machines); public cloud.
  • 10. New deployment challenges. Small clusters: efficiency matters more than scalability; admin/tuning done by mere mortals. Cloud: untrusted users (security); users are paying (availability, predictability); users are unrelated to each other (performance isolation).
  • 14. Classic Hadoop (1.0) Architecture. [Diagram: a Client submits Job1 to the JobTracker (with its Scheduler); TaskTrackers run Map and Reduce tasks.] The JobTracker handles resource management (global invariants: fairness/capacity; determines who runs / resources / where) and manages the MapReduce application flow (maps before reducers, re-run upon failure, etc.).
  • 15. Hadoop 1.0 Shortcomings: what are the key shortcomings of (old) Hadoop?
  • 16. Hadoop 1.0 Shortcomings (similar to original MR). Programming model rigidity: JobTracker manages resources; JobTracker manages application workflow (data dependencies). Performance and availability: Map vs Reduce slots lead to low cluster utilization (~70%); JobTracker had too much to do (scalability concern); JobTracker is a single point of failure.
  • 17. General-purpose cluster OS: “Cluster OS (YARN)”
  • 18. Four proposals. YARN (2008-2013, Hadoop 2.x, production at Yahoo!, GA)*: request-based central scheduler. Mesos (2011, UCB, open-sourced, tested at Twitter)*: offer-based two-level scheduler. Omega (2013, Google, simulation?)*: shared-state-based scheduling. Corona (2013, Facebook, production): YARN-like but offer-based. (* all three were best-papers or best-student-paper)
  • 19. [Diagram: software stack, Hadoop 1 World vs Hadoop 2 World. Layers: Users; Application Frameworks (Hive / Pig, ad-hoc apps); Programming Model(s) (Hadoop 1: MR v1; Hadoop 2: MR v2, Tez, Giraph, Storm, Dryad, REEF, ...); Cluster OS (Resource Management) (Hadoop 1: Hadoop 1.x (MapReduce); Hadoop 2: YARN); File System (HDFS 1; HDFS 2); Hardware.]
  • 20. YARN (or Hadoop 2.x). A new architecture for Hadoop: decouples resource management from programming model (MapReduce is an “application” running on YARN).
  • 21. YARN (Hadoop 2) Architecture. [Diagram: a Client submits Job1 to the Resource Manager (with its Scheduler); NodeManagers host the App Master and Tasks.] Negotiate access to more resources (ResourceRequest).
  • 22. Why does this matter? Flexibility, performance and availability: multiple programming models; central components do less → scale better; easier high-availability (e.g., RM vs AM).
      System       Jobs/Day      Tasks/Day    Cores pegged
      Hadoop 1.0   77k           4M           3.2
      YARN         125k (150k)   12M (15M)    6 (10)
  • 23. Anything else you can think of?
  • 24. Anything else you can think of? Maintenance, upgrade, and experimentation: run with multiple framework versions (at one time); trying out a new idea is as easy as launching a job.
  • 25. Real-time OS for the cluster (?): “predictable resource allocation”
  • 26. Motivation. YARN (Cosmos, Mesos, and Corona): support instantaneous scheduling invariants (fairness/capacity); maximize cluster throughput (eye to locality). Current trends: new applications (require “gang” and dependencies); consolidation of production/test clusters + cloud (SLA jobs mixed with best-effort jobs).
  • 27. Example Use Cases. Job/pipeline with SLAs: 200 CPU hours by 6am (e.g., Oozie). Service: daily ebb/flows, reserve capacity accordingly (e.g., Samza). Gang: I need 50 concurrent containers for 3 hours (e.g., Giraph).
  • 28. High-Level Goals. In a consolidated cluster: time-based SLAs for production jobs (completion deadline); good latency for best-effort jobs; high cluster utilization/throughput; (support rich applications: gang and skylines).
  • 29. Divide and conquer time-based SLAs. Decompose time-based SLAs into: resource definition (via RDL); predictable resource allocation (planning + scheduling).
  • 30. Resource Definition Language (RDL) 1/2. Expose application needs to the planner. Time: start (s), finish (f). Resources: capacity (w), total parallelism (h), minimum parallelism (l), min lease duration (t).
  • 31. Resource Definition Language (RDL) 2/2. Skylines / pipelines: dependencies among atomic allocations (ALL, ANY, ORDER).
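The two RDL slides above can be read as a small expression tree: atoms carrying the six parameters, combined by ALL, ANY, or ORDER. The following is a minimal sketch of that reading, not the actual RDL implementation. All type and method names (`RdlSketch`, `Atom`, `Composite`, `minDuration`) are hypothetical, and so are two interpretations: that `w` is total work in container-hours, and that ALL children may overlap in time while ORDER children run back to back.

```java
import java.util.List;

final class RdlSketch {
    /** Composition operators named on the slide. */
    enum Operator { ALL, ANY, ORDER }

    /** A node of an RDL expression: an atom or a composite. */
    interface Expr {}

    /**
     * Atomic allocation. Field letters follow the slide; reading w as
     * total container-hours is an assumption of this sketch.
     */
    record Atom(long s, long f, long w, int h, int l, long t) implements Expr {
        Atom {
            if (s >= f) throw new IllegalArgumentException("start must precede finish");
            if (l < 1 || l > h) throw new IllegalArgumentException("need 1 <= l <= h");
            if (t < 1 || t > f - s) throw new IllegalArgumentException("lease must fit in [s, f)");
        }
    }

    record Composite(Operator op, List<Expr> children) implements Expr {
        Composite {
            if (children.isEmpty()) throw new IllegalArgumentException("no children");
            children = List.copyOf(children);
        }
    }

    /** Lower bound on wall-clock duration under the assumptions above. */
    static long minDuration(Expr e) {
        if (e instanceof Atom a) {
            long atFullParallelism = (a.w() + a.h() - 1) / a.h(); // ceil(w / h)
            return Math.max(a.t(), atFullParallelism);
        }
        Composite c = (Composite) e;
        long result = (c.op() == Operator.ANY) ? Long.MAX_VALUE : 0;
        for (Expr child : c.children()) {
            long d = minDuration(child);
            switch (c.op()) {
                case ALL: result = Math.max(result, d); break;   // children may overlap
                case ANY: result = Math.min(result, d); break;   // cheapest alternative
                case ORDER: result += d; break;                  // strictly sequential
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The gang use case from slide 27: 50 concurrent containers for 3 hours.
        Atom gang = new Atom(0, 24, 150, 50, 50, 3);
        Expr pipeline = new Composite(Operator.ORDER, List.of(gang, gang));
        System.out.println(minDuration(gang));
        System.out.println(minDuration(pipeline));
    }
}
```

Keeping atoms and composites behind one `Expr` interface lets a planner walk a request recursively without knowing how deep the pipeline is.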
  • 32. Coming up with RDL specs (prediction). Important classes: framework semantics (Perforator modeling of Scope/Hive); machine learning (gang + bounded iterations, PREDict); periodic jobs (history-based resource definition).
  • 33. Planning vs Scheduling. [Diagram: queue hierarchy Root 100% → Staging 15%, Production 60% (J1 10%, J2 40%, J3 10%), Post 5%, Best Effort 20%. Components: Resource Definition, ReservationService, Sharing Policy, Plan, Plan Follower, Preemption, Resource Manager, Sys Model / Feedback.] Planning is coarse but time-aware; scheduling is fine-grained but time-oblivious.
  • 34. Dynamic queues. Some example run: lots of queues for gridmix; Microsoft pipelines.
  • 35. Comparing against Hadoop CapacityScheduler. Improves: production job SLAs; best-effort jobs latency; cluster utilization and throughput.
  • 36. Greedy Agent. Under promise, over deliver: plan for late execution, and run as early as you can. [Chart: axis in GB.]
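One possible reading of "plan for late execution" is a backward greedy pass: fill the latest time steps before the deadline first, leaving earlier capacity free for other reservations. The sketch below illustrates only that idea under simplifying assumptions (discrete time steps, a single resource dimension, no minimum parallelism or lease duration); `LatePlacementSketch` and `placeLate` are hypothetical names and this is not the agent from the talk. The "run as early as you can" half is left to the scheduler and is not shown.

```java
import java.util.Arrays;

final class LatePlacementSketch {
    /**
     * Places `demand` container-steps inside [s, f), as late as possible.
     *
     * @param free   containers still unreserved at each time step of the plan
     * @param h      maximum parallelism usable in any single step
     * @return per-step allocation, or null if the demand does not fit
     */
    static int[] placeLate(int[] free, int s, int f, int demand, int h) {
        int[] alloc = new int[free.length];
        int remaining = demand;
        // Walk backwards from the deadline so early capacity stays available.
        for (int step = f - 1; step >= s && remaining > 0; step--) {
            int take = Math.min(remaining, Math.min(h, free[step]));
            alloc[step] = take;
            remaining -= take;
        }
        return remaining == 0 ? alloc : null;
    }

    public static void main(String[] args) {
        int[] free = {10, 10, 4, 10};
        System.out.println(Arrays.toString(placeLate(free, 0, 4, 12, 6)));
    }
}
```

Returning null on an infeasible request models admission control: the reservation is rejected up front instead of being accepted and missed.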
  • 37. Dealing with “Reality”. Coping with imperfections (system): compensate RDL based on black-box models of overheads. Coping with failures (system): re-plan (move/kill allocations) in response to system-observable resource issues. Coping with failures/misprediction (user): continue in best-effort mode when reservation expires; re-negotiate existing reservations.
  • 38. User Quotas (trade-off flexibility to fairness). Sharing Policy: CapacityOverTimePolicy constrains instantaneous max and running avg, e.g., no user can exceed an instantaneous 30% allocation and an average of 10% in any 24h period of time. Single partial scan of plan: O(|alloc| + |window|).
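The two constraints on this slide (an instantaneous cap and a running average over a window) can be checked in one pass with a sliding-window sum. The following is a minimal sketch of that check, assuming the plan is discretized into equal time steps and that time before the plan starts counts as zero usage. `QuotaCheckSketch` and `withinQuota` are hypothetical names; this is not the code of the actual CapacityOverTimePolicy.

```java
final class QuotaCheckSketch {
    /**
     * @param share      the user's fraction of the cluster at each time step
     * @param maxInstant instantaneous cap, e.g. 0.30
     * @param maxAvg     cap on the running average, e.g. 0.10
     * @param window     window length in steps, e.g. 24 for hourly steps
     */
    static boolean withinQuota(double[] share, double maxInstant, double maxAvg, int window) {
        final double eps = 1e-9; // tolerance for floating-point accumulation
        double windowSum = 0.0;
        for (int i = 0; i < share.length; i++) {
            if (share[i] > maxInstant + eps) return false;
            windowSum += share[i];
            if (i >= window) windowSum -= share[i - window]; // slide the window
            if (windowSum / window > maxAvg + eps) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        double[] day = new double[24];
        for (int hour = 0; hour < 8; hour++) day[hour] = 0.30; // 8h at the instantaneous cap
        System.out.println(withinQuota(day, 0.30, 0.10, 24));
    }
}
```

Eight hours at 30% averages exactly 10% over 24 hours, so the example sits on the boundary of the slide's quota; a ninth hour at 30% would violate the running average while still respecting the instantaneous cap.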
  • 39. Conclusion. Introduce admission control and time-based SLAs (YARN-1051): new ReservationService API (to reserve resources); Agents + Plan + SharingPolicy to organize future allocations; leverage underlying scheduler. Future directions: work with MSR-India on RDL estimates for Hive and MR; advanced agents for placement ($$-based and optimal algos); enforcing decisions (Linux Containers, Drawbridge, Pacer).
  • 40. Cluster stdlib: REEF “factoring out recurring components”
  • 41. Focusing on YARN: many applications. Dryad: DAG computations. Tez: DAG computations (focus on interactive and Hive support). Storm: stream processing. Spark: interactive / in-memory / iterative. Giraph: graph processing, Bulk Synchronous Parallel (à la Pregel). Impala: scalable, interactive, SQL-like query. HoYA: HBase on YARN. Stratosphere: parallel iterative computations. REEF, Weave, Spring-Hadoop: meta-frameworks to help build apps.
  • 42. Are YARN/Mesos/Omega enough? Lots of repeated work: communication; configuration; data and control flow; error handling / fault-tolerance. Common “better than Hadoop” tricks: avoid scheduling overheads; control excessive disk IO.
  • 43. The Challenge. [Diagram: SQL / Hive, …, Machine Learning on top of YARN / HDFS.] Fault tolerance; row/column storage; high-bandwidth networking.
  • 44. The Challenge. [Diagram: SQL / Hive, …, Machine Learning on top of YARN / HDFS.] Fault awareness; local data caching; low-latency networking.
  • 45. The Challenge. [Diagram: SQL / Hive, …, Machine Learning on top of YARN / HDFS.]
  • 46. REEF in the Stack. [Diagram: SQL / Hive, …, Machine Learning; REEF; YARN / HDFS.]
  • 47. REEF in the Stack (Future). [Diagram: SQL / Hive, …, Machine Learning; Logical Abstraction; Operator API and Library; REEF; YARN / HDFS.]
  • 48. REEF. [Diagram: Client, Job1, Resource Manager with Scheduler, NodeManagers hosting the App Master and Tasks.] Negotiate access to more resources (ResourceRequest).
  • 49. REEF. [Diagram: Client, Job1, Resource Manager with Scheduler, NodeManagers hosting the REEF RT Driver and Evaluators, each Evaluator running a Task plus services.] Labels: name-based; user control flow logic (retains state!); user data crunching logic; fault detection; injection-based checkable configuration; event-based control flow.
  • 50. REEF: Computation and Data Management. Extensible control flow. Data management services: storage, network, state management. Job Driver: control plane implementation; user code executed on YARN’s Application Master. Activity: user code executed within an Evaluator. Evaluator: execution environment for Activities; one Evaluator is bound to one YARN Container.
  • 51. REEF Summary (open-sourced with Apache License). Control flow is centralized in the Driver: Evaluator and Task configuration and launch. Error handling is centralized in the Driver: all exceptions are forwarded to the Driver. All APIs are asynchronous. Support for: caching / checkpointing / group communication. Example apps running on REEF: MR, asynch PageRank, ML regressions, PCA, distributed shell, …
  • 52. Conclusions. Big-data systems: ongoing focus. Future work: leverage high-level app semantics; coordinate tiered-storage and scheduling.
  • 53. Adding Preemption to YARN, and open-sourcing it to Apache
  • 54. State of the Art. Limited mechanisms to “revise current schedule”: patience; container killing. To enforce global properties: leave resources fallow (e.g., CapacityScheduler) → low utilization; kill containers (e.g., FairScheduler) → wasted work. (Old) new trick: support work-preserving preemption (via) checkpointing → more than preemption.
  • 55. Changes throughout YARN. [Diagram: Client, Job1, RM with Scheduler, NodeManagers hosting the App Master and Tasks.] PreemptionMessage { Strict { Set<ContainerID> } Flexible { Set<ResourceRequest>, Set<ContainerID> } }. Collaborative application: policy-based binding for Flexible preemption requests. Use of preemption context: outdated information; delayed effects of actions; multi-actor orchestration. Interesting type of preemption: RM makes a declarative request, AM binds it to containers.
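The "RM declarative request, AM binds it to containers" idea on this slide can be illustrated with plain data types: strict containers are always given back, while a flexible request only states how much to free and the application master picks the victims by its own policy. This is a minimal sketch that deliberately does not use the real YARN classes. Every name here (`PreemptionSketch`, `Container`, `chooseReleases`) is hypothetical, the ResourceRequest set is reduced to a single memory figure, the RM's suggested container set is ignored, and "least progress first" is only one example policy.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PreemptionSketch {
    /** A running container as seen by the application master (hypothetical type). */
    record Container(String id, int memoryMb, double progress) {}

    /** Simplified stand-in for the message on the slide. */
    record PreemptionMessage(Set<String> strictIds, int flexibleMemoryMb, Set<String> suggestedIds) {}

    /** Binds the declarative request to concrete containers, losing as little work as possible. */
    static Set<String> chooseReleases(PreemptionMessage msg, List<Container> running) {
        // Strict part: no choice, these containers must go.
        Set<String> release = new LinkedHashSet<>(msg.strictIds());

        // Flexible part: free the requested memory from the least-advanced tasks.
        List<Container> byLeastProgress = new ArrayList<>(running);
        byLeastProgress.sort(Comparator.comparingDouble(Container::progress));
        int freedMb = 0;
        for (Container c : byLeastProgress) {
            if (freedMb >= msg.flexibleMemoryMb()) break;
            if (release.contains(c.id())) continue; // already released by the strict part
            release.add(c.id());
            freedMb += c.memoryMb();
        }
        return release;
    }

    public static void main(String[] args) {
        List<Container> running = List.of(
            new Container("c1", 1024, 0.9),
            new Container("c2", 1024, 0.1),
            new Container("c3", 2048, 0.5),
            new Container("c4", 1024, 0.2));
        PreemptionMessage msg = new PreemptionMessage(Set.of("c1"), 2048, Set.of("c3"));
        System.out.println(chooseReleases(msg, running));
    }
}
```

In the example the RM suggested `c3`, but the sketch frees the same 2048 MB from `c2` and `c4`, which have made less progress; that freedom is what makes the flexible request cheaper for the application than a strict one.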
  • 56. Changes throughout YARN. [Diagram: Client, Job1, RM with Scheduler, NodeManagers hosting the MR AM and Tasks.] When can I preempt? Tag safe UDFs or user-saved state: @Preemptable public class MyReducer{ … } Common Checkpoint Service: WriteChannel cwc = cs.create(); cwc.write(…state…); CheckpointID cid = cs.commit(cwc); ReadChannel crc = cs.open(cid);
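The create / write / commit / open sequence quoted on this slide can be tried out against an in-memory stand-in. The sketch below reuses the slide's names (`WriteChannel`, `ReadChannel`, `CheckpointID`, `create`, `commit`, `open`), but everything beyond those names is an assumption: the byte-array signatures, the `readAll` helper, and the in-memory map are illustrative and are not the Hadoop implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class CheckpointSketch {
    record CheckpointID(long value) {}

    /** Buffers state until the checkpoint is committed. */
    static final class WriteChannel {
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        void write(byte[] state) { buffer.write(state, 0, state.length); }
    }

    /** Read-only view of a committed checkpoint (readAll is a hypothetical helper). */
    static final class ReadChannel {
        private final byte[] data;
        ReadChannel(byte[] data) { this.data = data; }
        byte[] readAll() { return data.clone(); }
    }

    /** In-memory stand-in; a real service would write to durable storage. */
    static final class CheckpointService {
        private final Map<CheckpointID, byte[]> committed = new HashMap<>();
        private long nextId = 0;

        WriteChannel create() { return new WriteChannel(); }

        /** Only committed channels become visible to readers. */
        CheckpointID commit(WriteChannel channel) {
            CheckpointID id = new CheckpointID(nextId++);
            committed.put(id, channel.buffer.toByteArray());
            return id;
        }

        ReadChannel open(CheckpointID id) {
            byte[] data = committed.get(id);
            if (data == null) throw new IllegalArgumentException("unknown checkpoint " + id);
            return new ReadChannel(data);
        }
    }

    public static void main(String[] args) {
        CheckpointService cs = new CheckpointService();
        WriteChannel cwc = cs.create();
        cwc.write("partial-sum=42".getBytes(StandardCharsets.UTF_8));
        CheckpointID cid = cs.commit(cwc);   // task can now be preempted safely
        ReadChannel crc = cs.open(cid);      // a restarted task resumes from here
        System.out.println(new String(crc.readAll(), StandardCharsets.UTF_8));
    }
}
```

Separating `create` from `commit` means a task killed mid-write leaves no half-written checkpoint behind, which is the property work-preserving preemption relies on.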
  • 57. [Chart: memory utilization for CapacityScheduler + Unreservation + Preemption, CapacityScheduler (allow overcapacity), and CapacityScheduler (no overcapacity).]
  • 58. (Metapoint) Experience contributing to Apache. [Diagram: YARN architecture annotated with JIRA issues MR-5192, MR-5194, MR-5197, MR-5189, MR-5176, YARN-569, MR-5196.] Engaging with OSS: talk with active developers; show early/partial work; small patches; ok to leave things unfinished.
  • 59. Tagging UDFs. With the @Preemptable tag: imperative code with a semantic property. Generalize this trick: expose semantic properties to the platform (@PreserveSortOrder); allow platforms to optimize execution (map-reduce pipelining). REEF seems the logical place to do this.
  • 60. Other uses for Checkpointing. (Basic) building block for: efficient preemption; dynamic optimizations (task splitting, efficiency improvements); fault tolerance.