Shared slides-edbt-keynote-03-19-13

Daniel Abadi
Daniel AbadiAssociate Professor
Data Loading Challenges when
Transforming Hadoop into a
Parallel Database System
Daniel Abadi
Yale University
March 19th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Lessons learned from commercializing
HadoopDB into Hadapt
Ideas for overcoming the loading challenge
(invisible loading)
EDBT 2013 paper on invisible loading that is
published alongside this talk can be found at:
http://cs-
www.cs.yale.edu/homes/dna/papers/invisibleloadi
ng.pdf
Big Data
Data is growing (currently) faster than Moore’s Law
– Scale becoming a bigger concern
– Cost becoming a bigger concern
– Unstructured data “variety” becoming a bigger concern
Increasing desire for incremental scale out on
commodity hardware
– Fault tolerance a bigger concern
– Dealing with unpredictable performance a bigger concern
Parallel database systems are the traditional solution
A lot of excitement around Hadoop as a tool to help with
these problems (initiated by Yahoo and Facebook)
Hadoop/MapReduce
Data is partitioned across N machines
– Typically stored in a distributed file system (GFS/HDFS)
On each machine n, apply a function, Map, to each data item d
– Map(d)  {(key1,value1)} “map job”
– Sort output of all map jobs on n by key
– Send (key1,value1) pairs with same key value to same machine
(using e.g., hashing)
On each machine m, apply reduce function to (key1, {value1})
pairs mapped to it
– Reduce(key1,{value1})  (key2,value2) “reduce job”
Optionally, collect output and present to user
Map Workers
Example
Count occurrences of the word “Italy” and “DBMS” in all documents
map(d):
words = split(d,’ ‘)
foreach w in words:
if w == ‘Italy’
emit (‘Italy’, 1)
if w == ‘DBMS’
emit (‘DBMS’, 1)
reduce(key, valueSet):
count = 0
for each v in valueSet:
count += v
emit (key,count)
Data Partitions
Reduce Workers
Italy reducer DBMS reducer
(DBMS,1)(Italy, 1)
Relational Operators In MR
Straightforward to implement relational
operators in MapReduce
– Select: simple filter in Map Phase
– Project: project function in Map Phase
– Join: Map produces tuples with join key as
key; Reduce performs the join
Query plans can be implemented as a
sequence of MapReduce jobs (e.g Hive)
Similarities Between Hadoop
and Parallel Database Systems
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and nonrelational
queries (DBMS via UDFs)
Differences
Hadoop can operate on in-situ data, without
requiring transformation or loading
Schemas:
– Hadoop doesn’t require them, DBMSs do
– Easy to write simple MR programs over raw data
Indexes
– Hadoop’s built in support is primitive
Declarative vs imperative programming
Hadoop uses a run-time scheduler for fine-grained
load balancing
Hadoop checkpoints intermediate results for fault
tolerance
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
Hardware Setup
100 node cluster
Each node
– 2.4 GHz Code 2 Duo Processors
– 4 GB RAM
– 2 250 GB SATA HDs (74 MB/Sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Loading
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
10 nodes 25 nodes 50 nodes 100
nodes
Time(seconds)
Vertica
DBMS-X
Hadoop
Join Task
0
200
400
600
800
1000
1200
1400
1600
10 nodes 25 nodes 50 nodes 100 nodes
Time(seconds)
Vertica
DBMS-X
Hadoop
UDF Task
0
200
400
600
800
1000
1200
10 nodes 25 nodes 50 nodes 100
nodes
Time(seconds)
DBMS
Hadoop
DBMS clearly doesn’t scaleCalculate
PageRank
over a set of
HTML
documents
Performed
via a UDF
Scalability
Except for DBMS-X load time and UDFs
all systems scale near linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
Fault Tolerance and Cluster
Heterogeneity Results
0
20
40
60
80
100
120
140
160
180
200
Fault tolerance Slowdown tolerance
PercentageSlowdown
DBMS
Hadoop
Database systems restart entire
query upon a single node failure,
and do not adapt if a node is
running slowly
Benchmark Conclusions
Hadoop is consistently more scalable
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop is consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needs better support for compression and direct
operation on compressed data
– Needs better support for indexing
– Needs better support for co-partitioning of datasets
Best of Both Worlds Possible?
Connector
Problems With the Connector
Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
Adding DBMS Technology to
Hadoop
Option 1:
– Keep Hadoop’s storage and build parallel executor on
top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
– Use relational storage on each node
Better support for traditional database management
(including updates and deletes)
Accelerates “time to complete system”
We chose this option for HadoopDB
HadoopDB Architecture
SMS Planner
HadoopDB Experiments
VLDB 2009 paper ran same Web analytics
benchmark as SIGMOD 2009 paper
Used PostgreSQL as the DBMS storage
layer
Join Task
0
200
400
600
800
1000
1200
1400
1600
10 nodes 25 nodes 50 nodes
Vertica
DBMS-X
Hadoop
HadoopDB
•HadoopDB much faster than Hadoop
•In 2009, didn’t quite match the database
systems in performance (but see next slide)
•Hadoop start-up costs
•PostgreSQL join performance
•Fault tolerance/run-time scheduling
TPC-H Benchmark Results
•More advanced storage and better
join algorithms can lead to better
performance (even beating database
systems for some queries)
UDF Task
0
100
200
300
400
500
600
700
800
10 nodes 25 nodes 50 nodes
Time(seconds)
DBMS
Hadoop
HadoopDB
Fault Tolerance and Cluster
Heterogeneity Results
0
20
40
60
80
100
120
140
160
180
200
Fault tolerance Slowdown tolerance
PercentageSlowdown
DBMS
Hadoop
HadoopDB
Hadapt
Goal: Turn HadoopDB into enterprise-
grade software so it can be used for real
analytical applications
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Speaker has a significant financial interest in Hadapt
Hadapt: Lessons Learned
Location matters
– Impossible to scale quickly in New Haven,
Connecticut
– Company was forced to move to Cambridge
(near other successful DB companies)
Personally painful for me
Hadapt: Lessons Learned
Business side matters
– Need intelligence and competence across the
whole company (not just in engineering)
Decisions have to made all the time that have
enormous effects on the company
– Technical founders often find themselves
doing non-technical tasks more than half the
time
Hadapt: Lessons Learned
Good enterprise software: more time on
minutia than innovative new features
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Disaster Recovery
– Installer, upgrader, downgrader (!)
– Documentation
Hadapt: Lessons Learned
Customer input to the design process
critical
– Hadapt had no plans to integrate text search
into the product until request from a customer
– Today ~75% of Hadapt customers use the
text search feature
Hadapt: Lessons Learned
Installation and load processes are critical
– They are the customer’s first two interactions
with your product
– Lead to the invisible loading idea
Discussed in the remaining slides of this
presentation
If only we could make loading
invisible …
Review: Analytic Workflows
Assume data already in file system (HDFS)
Two workflows
– Workflow #1: In-situ querying
e.g. Map Reduce, HIVE / Pig, Impala, etc
Latency optimization (time to first query answer)
File
System
Query
EngineData Retrieval
Query Result
Submit Query
Review: Analytic Workflows
Assume data already in file system (HDFS)
Two workflows
– Workflow #2: Querying after bulk load
E.g. RDBMS, Hadapt
Storage and retrieval optimization for throughput
(repeated queries)
Other benefits: data validation and quality control
File
System
Query
Engine
1. Bulk Load
Query Result
2. Submit Query
Storage
Engine
Immediate Versus Deferred
Gratification
Use Case Example
Context-Dependent Schema Awareness
Different analysts know the schema of
what they are looking for and don’t care
about other log messages
Message field
varies depending
on application
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
File
System
Query
Engine
Partial Load 1
Query 1 Result
Submit Query 1
Storage
Engine
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
File
System
Query
Engine
Partial Load 2
Query 2 Result
Submit Query 2
Storage
Engine
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
Benefits
– Eliminates manual workflows on
data modeling and bulk load scripting
– Optimizes query performance over time
– Boosts productivity and performance
Interface and User Experience
Design
Structured parser interface for mapper
– getAttribute(int index)
– Similar to PigLatin
Minimize load’s impact on query performance
– Load a subset of rows (based on HDFS file splits)
– Load size tunable based on query performance
impact
Optional: load a subset of columns
– Useful for columnar file formats
Physical Tuning: Incremental
Reorganization
Performance Study
Conclusion
Invisible loading: Immediate gratification
+ long term performance benefits
Applicability beyond HDFS
Future work
– Automatic data schema / format recognition
– Schema evolution support
– Decouple load from query
Workload-sensitive loading strategies
1 of 45

Recommended

From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real... by
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...Daniel Abadi
5.6K views30 slides
Hadoop and Graph Data Management: Challenges and Opportunities by
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesDaniel Abadi
3.9K views36 slides
Beckman abadi-5min-pres by
Beckman abadi-5min-presBeckman abadi-5min-pres
Beckman abadi-5min-presDaniel Abadi
1.6K views4 slides
Daniel Abadi: VLDB 2009 Panel by
Daniel Abadi: VLDB 2009 PanelDaniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 PanelDaniel Abadi
1.5K views9 slides
Boston Hadoop Meetup, April 26 2012 by
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Daniel Abadi
1.2K views15 slides
Jstorm introduction-0.9.6 by
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6longda feng
8.7K views44 slides

More Related Content

What's hot

Hadoop introduction , Why and What is Hadoop ? by
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
71.4K views34 slides
Presentation on Hadoop Technology by
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
349 views14 slides
SQL-on-Hadoop Tutorial by
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
10.7K views148 slides
Hadoop bigdata overview by
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
1.5K views24 slides
Hadoop 101 by
Hadoop 101Hadoop 101
Hadoop 101EMC
8.6K views50 slides
Overview of Hadoop and HDFS by
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
425 views44 slides

What's hot(20)

Hadoop introduction , Why and What is Hadoop ? by sudhakara st
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st71.4K views
Presentation on Hadoop Technology by OpenDev
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
OpenDev349 views
SQL-on-Hadoop Tutorial by Daniel Abadi
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
Daniel Abadi10.7K views
Hadoop bigdata overview by harithakannan
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
harithakannan1.5K views
Hadoop 101 by EMC
Hadoop 101Hadoop 101
Hadoop 101
EMC8.6K views
Seminar Presentation Hadoop by Varun Narang
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang81.5K views
Apache Hadoop by Ajit Koti
Apache HadoopApache Hadoop
Apache Hadoop
Ajit Koti5.1K views
Introduction to Hadoop and MapReduce by eakasit_dpu
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
eakasit_dpu2.1K views
Seminar_Report_hadoop by Varun Narang
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
Varun Narang12.3K views
Performance Issues on Hadoop Clusters by Xiao Qin
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
Xiao Qin5K views
Apache Hadoop - Big Data Engineering by BADR
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
BADR499 views
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database by Edureka!
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
Edureka!104.5K views
Cloud Computing: Hadoop by darugar
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
darugar7K views
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011) by Hari Shankar Sreekumar
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Bringing OLTP woth OLAP: Lumos on Hadoop by DataWorks Summit
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit3.9K views
Hadoop MapReduce Framework by Edureka!
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
Edureka!4.1K views

Viewers also liked

Consistency Tradeoffs in Modern Distributed Database System Design by
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
6.4K views11 slides
CAP, PACELC, and Determinism by
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and DeterminismDaniel Abadi
6.4K views20 slides
Invisible loading by
Invisible loadingInvisible loading
Invisible loadingDaniel Abadi
1.9K views23 slides
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs by
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs Daniel Abadi
1.6K views38 slides
VLDB 2009 Tutorial on Column-Stores by
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresDaniel Abadi
21.4K views161 slides
The Power of Determinism in Database Systems by
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database SystemsDaniel Abadi
4.8K views24 slides

Viewers also liked(9)

Consistency Tradeoffs in Modern Distributed Database System Design by Arinto Murdopo
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
Arinto Murdopo6.4K views
CAP, PACELC, and Determinism by Daniel Abadi
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
Daniel Abadi6.4K views
Invisible loading by Daniel Abadi
Invisible loadingInvisible loading
Invisible loading
Daniel Abadi1.9K views
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs by Daniel Abadi
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Daniel Abadi1.6K views
VLDB 2009 Tutorial on Column-Stores by Daniel Abadi
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
Daniel Abadi21.4K views
The Power of Determinism in Database Systems by Daniel Abadi
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
Daniel Abadi4.8K views
Column-Stores vs. Row-Stores: How Different are they Really? by Daniel Abadi
Column-Stores vs. Row-Stores: How Different are they Really?Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?
Daniel Abadi8.3K views
CAP Theorem - Theory, Implications and Practices by Yoav Francis
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
Yoav Francis6.9K views
Graph database Use Cases by Max De Marzi
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
Max De Marzi58.9K views

Similar to Shared slides-edbt-keynote-03-19-13

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook by
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
22.4K views24 slides
Hadoop by kamran khan by
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
59 views30 slides
Big Data and Hadoop by
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
253 views24 slides
Cppt Hadoop by
Cppt HadoopCppt Hadoop
Cppt Hadoopchunkypandey12
27 views31 slides
Cppt by
CpptCppt
Cpptchunkypandey12
126 views31 slides
Cppt by
CpptCppt
Cpptchunkypandey12
147 views31 slides

Similar to Shared slides-edbt-keynote-03-19-13(20)

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook by Amr Awadallah
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah22.4K views
Big Data and Hadoop by Mr. Ankit
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Mr. Ankit253 views
Hadoop a Natural Choice for Data Intensive Log Processing by Hitendra Kumar
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
Hitendra Kumar3.8K views
Hadoop hive presentation by Arvind Kumar
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
Arvind Kumar5.5K views
Introduccion a Hadoop / Introduction to Hadoop by GERARDO BARBERENA
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
GERARDO BARBERENA401 views
Big data and hadoop overvew by Kunal Khanna
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
Kunal Khanna773 views
DWH & big data architecture approaches by Luxoft
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approaches
Luxoft1.2K views
Владимир Слободянюк «DWH & BigData – architecture approaches» by Anna Shymchenko
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
Anna Shymchenko287 views
Hadoop Administration pdf by Edureka!
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
Edureka!32.9K views
Hadoop training in bangalore by TIB Academy
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
TIB Academy17 views
عصر کلان داده، چرا و چگونه؟ by datastack
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
datastack534 views
Hadoop_Its_Not_Just_Internal_Storage_V14 by John Sing
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
John Sing4.2K views
Distributed Systems Hadoop.pptx by AlAmin638189
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
AlAmin6381896 views

Recently uploaded

NTGapps NTG LowCode Platform by
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform Mustafa Kuğu
287 views30 slides
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...James Anderson
142 views32 slides
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...ShapeBlue
48 views17 slides
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...ShapeBlue
105 views15 slides
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...ShapeBlue
74 views17 slides
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsShapeBlue
172 views13 slides

Recently uploaded(20)

NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu287 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson142 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue48 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue105 views
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue74 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue172 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10110 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue86 views
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ... by ShapeBlue
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
Live Demo Showcase: Unveiling Dell PowerFlex’s IaaS Capabilities with Apache ...
ShapeBlue52 views
The Role of Patterns in the Era of Large Language Models by Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li74 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue191 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue110 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue114 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue121 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue97 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 views

Shared slides-edbt-keynote-03-19-13

  • 1. Data Loading Challenges when Transforming Hadoop into a Parallel Database System Daniel Abadi Yale University March 19th, 2013 Twitter: @daniel_abadi
  • 2. Overview of Talk Motivation for HadoopDB Overview of HadoopDB Lessons learned from commercializing HadoopDB into Hadapt Ideas for overcoming the loading challenge (invisible loading) EDBT 2013 paper on invisible loading that is published alongside this talk can be found at: http://cs- www.cs.yale.edu/homes/dna/papers/invisibleloadi ng.pdf
  • 3. Big Data Data is growing (currently) faster than Moore’s Law – Scale becoming a bigger concern – Cost becoming a bigger concern – Unstructured data “variety” becoming a bigger concern Increasing desire for incremental scale out on commodity hardware – Fault tolerance a bigger concern – Dealing with unpredictable performance a bigger concern Parallel database systems are the traditional solution A lot of excitement around Hadoop as a tool to help with these problems (initiated by Yahoo and Facebook)
  • 4. Hadoop/MapReduce Data is partitioned across N machines – Typically stored in a distributed file system (GFS/HDFS) On each machine n, apply a function, Map, to each data item d – Map(d)  {(key1,value1)} “map job” – Sort output of all map jobs on n by key – Send (key1,value1) pairs with same key value to same machine (using e.g., hashing) On each machine m, apply reduce function to (key1, {value1}) pairs mapped to it – Reduce(key1,{value1})  (key2,value2) “reduce job” Optionally, collect output and present to user
  • 5. Map Workers Example Count occurrences of the word “Italy” and “DBMS” in all documents map(d): words = split(d,’ ‘) foreach w in words: if w == ‘Italy’ emit (‘Italy’, 1) if w == ‘DBMS’ emit (‘DBMS’, 1) reduce(key, valueSet): count = 0 for each v in valueSet: count += v emit (key,count) Data Partitions Reduce Workers Italy reducer DBMS reducer (DBMS,1)(Italy, 1)
  • 6. Relational Operators In MR Straightforward to implement relational operators in MapReduce – Select: simple filter in Map Phase – Project: project function in Map Phase – Join: Map produces tuples with join key as key; Reduce performs the join Query plans can be implemented as a sequence of MapReduce jobs (e.g Hive)
  • 7. Similarities Between Hadoop and Parallel Database Systems Both are suitable for large-scale data processing – I.e. analytical processing workloads – Bulk loads – Not optimized for transactional workloads – Queries over large amounts of data – Both can handle both relational and nonrelational queries (DBMS via UDFs)
  • 8. Differences Hadoop can operate on in-situ data, without requiring transformation or loading Schemas: – Hadoop doesn’t require them, DBMSs do – Easy to write simple MR programs over raw data Indexes – Hadoop’s built in support is primitive Declarative vs imperative programming Hadoop uses a run-time scheduler for fine-grained load balancing Hadoop checkpoints intermediate results for fault tolerance
  • 9. SIGMOD 2009 Paper Benchmarked Hadoop vs. 2 parallel database systems – Mostly focused on performance differences – Measured differences in load and query time for some common data processing tasks – Used Web analytics benchmark whose goal was to be representative of tasks that: Both should excel at Hadoop should excel at Databases should excel at
  • 10. Hardware Setup 100 node cluster Each node – 2.4 GHz Code 2 Duo Processors – 4 GB RAM – 2 250 GB SATA HDs (74 MB/Sec sequential I/O) Dual GigE switches, each with 50 nodes – 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring
  • 11. Loading 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) Vertica DBMS-X Hadoop
  • 12. Join Task 0 200 400 600 800 1000 1200 1400 1600 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) Vertica DBMS-X Hadoop
  • 13. UDF Task 0 200 400 600 800 1000 1200 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) DBMS Hadoop DBMS clearly doesn’t scaleCalculate PageRank over a set of HTML documents Performed via a UDF
  • 14. Scalability Except for DBMS-X load time and UDFs all systems scale near linearly BUT: only ran on 100 nodes As nodes approach 1000, other effects come into play – Faults go from being rare, to not so rare – It is nearly impossible to maintain homogeneity at scale
  • 15. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  • 16. Benchmark Conclusions Hadoop is consistently more scalable – Checkpointing allows for better fault tolerance – Runtime scheduling allows for better tolerance of unexpectedly slow nodes – Better parallelization of UDFs Hadoop is consistently less efficient for structured, relational data – Reasons mostly non-fundamental – Needs better support for compression and direct operation on compressed data – Needs better support for indexing – Needs better support for co-partitioning of datasets
  • 17. Best of Both Worlds Possible? Connector
  • 18. Problems With the Connector Approach Network delays and bandwidth limitations Data silos Multiple vendors Fundamentally wasteful – Very similar architectures Both partition data across a cluster Both parallelize processing across the cluster Both optimize for local data processing (to minimize network costs)
  • 19. Unified System Two options: – Bring Hadoop technology to a parallel database system Problem: Hadoop is more than just technology – Bring parallel database system technology to Hadoop Far more likely to have impact
  • 20. Adding DBMS Technology to Hadoop Option 1: – Keep Hadoop’s storage and build parallel executor on top of it Cloudera Impala (which is sort of a combination of Hadoop++ and NoDB research projects) Need better Storage Formats (Trevni and Parquet are promising) Updates and Deletes are hard (Impala doesn’t support them) – Use relational storage on each node Better support for traditional database management (including updates and deletes) Accelerates “time to complete system” We chose this option for HadoopDB
  • 23. HadoopDB Experiments VLDB 2009 paper ran same Web analytics benchmark as SIGMOD 2009 paper Used PostgreSQL as the DBMS storage layer
  • 24. Join Task 0 200 400 600 800 1000 1200 1400 1600 10 nodes 25 nodes 50 nodes Vertica DBMS-X Hadoop HadoopDB •HadoopDB much faster than Hadoop •In 2009, didn’t quite match the database systems in performance (but see next slide) •Hadoop start-up costs •PostgreSQL join performance •Fault tolerance/run-time scheduling
  • 25. TPC-H Benchmark Results •More advanced storage and better join algorithms can lead to better performance (even beating database systems for some queries)
  • 26. UDF Task 0 100 200 300 400 500 600 700 800 10 nodes 25 nodes 50 nodes Time(seconds) DBMS Hadoop HadoopDB
  • 27. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop HadoopDB
  • 28. Hadapt Goal: Turn HadoopDB into enterprise- grade software so it can be used for real analytical applications – Launched with $1.5 million in seed money in 2010 – Raised an additional $8 million in 2011 – Raised an additional $6.75 million in 2012 Speaker has a significant financial interest in Hadapt
  • 29. Hadapt: Lessons Learned Location matters – Impossible to scale quickly in New Haven, Connecticut – Company was forced to move to Cambridge (near other successful DB companies) Personally painful for me
  • 30. Hadapt: Lessons Learned Business side matters – Need intelligence and competence across the whole company (not just in engineering) Decisions have to made all the time that have enormous effects on the company – Technical founders often find themselves doing non-technical tasks more than half the time
  • 31. Hadapt: Lessons Learned Good enterprise software: more time on minutia than innovative new features – SQL coverage – Failover for high availability – Authorization / authentication – Disaster Recovery – Installer, upgrader, downgrader (!) – Documentation
  • 32. Hadapt: Lessons Learned Customer input to the design process critical – Hadapt had no plans to integrate text search into the product until request from a customer – Today ~75% of Hadapt customers use the text search feature
  • 33. Hadapt: Lessons Learned Installation and load processes are critical – They are the customer’s first two interactions with your product – Lead to the invisible loading idea Discussed in the remaining slides of this presentation
  • 34. If only we could make loading invisible …
  • 35. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #1: In-situ querying e.g. Map Reduce, HIVE / Pig, Impala, etc Latency optimization (time to first query answer) File System Query EngineData Retrieval Query Result Submit Query
  • 36. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #2: Querying after bulk load E.g. RDBMS, Hadapt Storage and retrieval optimization for throughput (repeated queries) Other benefits: data validation and quality control File System Query Engine 1. Bulk Load Query Result 2. Submit Query Storage Engine
  • 38. Use Case Example Context-Dependent Schema Awareness Different analysts know the schema of what they are looking for and don’t care about other log messages Message field varies depending on application
  • 39. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries File System Query Engine Partial Load 1 Query 1 Result Submit Query 1 Storage Engine
  • 40. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries File System Query Engine Partial Load 2 Query 2 Result Submit Query 2 Storage Engine
  • 41. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries Benefits – Eliminates manual workflows on data modeling and bulk load scripting – Optimizes query performance over time – Boosts productivity and performance
  • 42. Interface and User Experience Design Structured parser interface for mapper – getAttribute(int index) – Similar to PigLatin Minimize load’s impact on query performance – Load a subset of rows (based on HDFS file splits) – Load size tunable based on query performance impact Optional: load a subset of columns – Useful for columnar file formats
  • 45. Conclusion Invisible loading: Immediate gratification + long term performance benefits Applicability beyond HDFS Future work – Automatic data schema / format recognition – Schema evolution support – Decouple load from query Workload-sensitive loading strategies