Data Loading Challenges when
Transforming Hadoop into a
Parallel Database System
Daniel Abadi
Yale University
March 19th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Lessons learned from commercializing
HadoopDB into Hadapt
Ideas for overcoming the loading challenge
(invisible loading)
EDBT 2013 paper on invisible loading that is
published alongside this talk can be found at:
http://cs-www.cs.yale.edu/homes/dna/papers/invisibleloading.pdf
Big Data
Data is growing (currently) faster than Moore’s Law
– Scale becoming a bigger concern
– Cost becoming a bigger concern
– Unstructured data “variety” becoming a bigger concern
Increasing desire for incremental scale out on
commodity hardware
– Fault tolerance a bigger concern
– Dealing with unpredictable performance a bigger concern
Parallel database systems are the traditional solution
A lot of excitement around Hadoop as a tool to help with
these problems (initiated by Yahoo and Facebook)
Hadoop/MapReduce
Data is partitioned across N machines
– Typically stored in a distributed file system (GFS/HDFS)
On each machine n, apply a function, Map, to each data item d
– Map(d) → {(key1,value1)} “map job”
– Sort output of all map jobs on n by key
– Send (key1,value1) pairs with same key value to same machine
(using e.g., hashing)
On each machine m, apply reduce function to (key1, {value1})
pairs mapped to it
– Reduce(key1,{value1}) → (key2,value2) “reduce job”
Optionally, collect output and present to user
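The shuffle step described above (sort each map output by key, then send pairs with the same key to the same machine by hashing) can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop's partitioner: the names `partition` and `shuffle`, the use of `zlib.crc32` as the hash, and the choice of 4 reducers are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick the reduce machine for a key using a stable hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs from every map machine to a reducer, grouped by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:             # one list of pairs per map machine
        for key, value in sorted(pairs):  # sort each map output by key
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Two map machines; both emitted a pair for the key "Italy".
map_outputs = [[("Italy", 1), ("DBMS", 1)], [("Italy", 1)]]
reducers = shuffle(map_outputs, num_reducers=4)
```

Because the reducer index depends only on the key, every value for a given key ends up on one machine, which is what lets the reduce function see the full value set.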
Example
Count occurrences of the words “Italy” and “DBMS” in all documents
map(d):
    words = split(d, ' ')
    foreach w in words:
        if w == 'Italy'
            emit ('Italy', 1)
        if w == 'DBMS'
            emit ('DBMS', 1)
reduce(key, valueSet):
    count = 0
    for each v in valueSet:
        count += v
    emit (key, count)
[Figure: data partitions feed the map workers, which send (Italy, 1) pairs to the Italy reducer and (DBMS, 1) pairs to the DBMS reducer]
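For readers who want to run the example, here is a minimal single-process Python version of the same map and reduce functions. The driver `run`, which sorts and groups the pairs in memory, stands in for the framework and is an assumption of this sketch, as are the sample documents.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Emit (word, 1) for each occurrence of 'Italy' or 'DBMS'."""
    for word in document.split(" "):
        if word in ("Italy", "DBMS"):
            yield (word, 1)

def reduce_fn(key, values):
    """Sum the counts collected for one key."""
    count = 0
    for v in values:
        count += v
    return (key, count)

def run(documents):
    """Hypothetical driver: map every document, sort by key, reduce each group."""
    pairs = [pair for d in documents for pair in map_fn(d)]
    pairs.sort(key=itemgetter(0))
    return [
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

documents = ["Italy hosts a DBMS conference", "DBMS research in Italy", "Italy"]
result = dict(run(documents))
```

With these three sample documents, `result` maps "Italy" to 3 and "DBMS" to 2.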
Relational Operators In MR
Straightforward to implement relational
operators in MapReduce
– Select: simple filter in Map Phase
– Project: project function in Map Phase
– Join: Map produces tuples with join key as
key; Reduce performs the join
Query plans can be implemented as a
sequence of MapReduce jobs (e.g., Hive)
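As a rough illustration of the three operators, the Python sketch below runs a select-plus-project in a map function and a join split across map (tag each row with its join key) and reduce (combine matching rows). The table names `visits` and `users`, their columns, and the in-memory grouping loop are invented for the example and do not come from the talk.

```python
from collections import defaultdict

def map_select_project(row):
    """Select: keep rows with revenue > 100. Project: emit only (user, revenue)."""
    if row["revenue"] > 100:
        yield (row["user"], row["revenue"])

def map_tag(table_name, row, join_key):
    """Join, map side: emit the join key as the key, tagging the row with its table."""
    yield (row[join_key], (table_name, row))

def reduce_join(key, tagged_rows):
    """Join, reduce side: pair every 'visits' row with every 'users' row for this key."""
    left = [row for table, row in tagged_rows if table == "visits"]
    right = [row for table, row in tagged_rows if table == "users"]
    for l in left:
        for r in right:
            yield {**l, **r}

visits = [{"user": "ann", "revenue": 150}, {"user": "bob", "revenue": 40}]
users = [{"user": "ann", "country": "IT"}, {"user": "bob", "country": "US"}]

# Stand-in for the shuffle: group tagged rows by join key.
groups = defaultdict(list)
for name, table in (("visits", visits), ("users", users)):
    for row in table:
        for key, tagged in map_tag(name, row, "user"):
            groups[key].append(tagged)

joined = [out for key in sorted(groups) for out in reduce_join(key, groups[key])]
selected = [pair for row in visits for pair in map_select_project(row)]
```

A multi-operator query plan would chain such jobs, feeding the output of one as the input of the next.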
Similarities Between Hadoop
and Parallel Database Systems
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and nonrelational
queries (DBMS via UDFs)
Differences
Hadoop can operate on in-situ data, without
requiring transformation or loading
Schemas:
– Hadoop doesn’t require them, DBMSs do
– Easy to write simple MR programs over raw data
Indexes
– Hadoop’s built in support is primitive
Declarative vs imperative programming
Hadoop uses a run-time scheduler for fine-grained
load balancing
Hadoop checkpoints intermediate results for fault
tolerance
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
Hardware Setup
100 node cluster
Each node
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– 2 × 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Loading
[Figure: bar chart of load time in seconds (axis from 0 to 50,000) for Vertica, DBMS-X, and Hadoop on 10, 25, 50, and 100 nodes]
Join Task
[Figure: bar chart of join task time in seconds (axis from 0 to 1,600) for Vertica, DBMS-X, and Hadoop on 10, 25, 50, and 100 nodes]
UDF Task
Calculate PageRank over a set of HTML documents
Performed via a UDF
[Figure: bar chart of UDF task time in seconds (axis from 0 to 1,200) for DBMS and Hadoop on 10, 25, 50, and 100 nodes]
DBMS clearly doesn’t scale
Scalability
Except for DBMS-X load time and UDFs
all systems scale near linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
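A back-of-the-envelope calculation shows why faults stop being rare as the cluster grows. The sketch below assumes independent node failures and an illustrative per-node failure probability of 0.1% over the lifetime of one query; both the independence assumption and the number are made up for illustration and are not measurements from the benchmark.

```python
def p_any_failure(num_nodes, p_node):
    """Probability that at least one of num_nodes independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** num_nodes

# Assumed per-node failure probability during one query: 0.1%.
P_NODE = 0.001

for n in (10, 100, 1000):
    print(f"{n:>5} nodes: P(at least one failure) = {p_any_failure(n, P_NODE):.3f}")
```

Under these assumptions the chance that a query sees at least one failure rises from about 1% on 10 nodes to well over half on 1000 nodes, which matters most for a system that restarts the whole query on any failure.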
Fault Tolerance and Cluster Heterogeneity Results
[Figure: bar chart of percentage slowdown (axis from 0 to 200) for DBMS and Hadoop under two tests: fault tolerance and slowdown tolerance]
Database systems restart the entire query upon a single node failure, and do not adapt if a node is running slowly
Benchmark Conclusions
Hadoop is consistently more scalable
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop is consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needs better support for compression and direct
operation on compressed data
– Needs better support for indexing
– Needs better support for co-partitioning of datasets
Best of Both Worlds Possible?
Connector
Problems With the Connector
Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
Adding DBMS Technology to
Hadoop
Option 1:
– Keep Hadoop’s storage and build parallel executor on
top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
– Use relational storage on each node
Better support for traditional database management
(including updates and deletes)
Accelerates “time to complete system”
We chose this option for HadoopDB
HadoopDB Architecture
SMS Planner
HadoopDB Experiments
VLDB 2009 paper ran same Web analytics
benchmark as SIGMOD 2009 paper
Used PostgreSQL as the DBMS storage
layer
Join Task
[Figure: bar chart of join task time (axis from 0 to 1,600) for Vertica, DBMS-X, Hadoop, and HadoopDB on 10, 25, and 50 nodes]
• HadoopDB much faster than Hadoop
• In 2009, didn’t quite match the database systems in performance (but see next slide)
• Hadoop start-up costs
• PostgreSQL join performance
• Fault tolerance/run-time scheduling
TPC-H Benchmark Results
• More advanced storage and better join algorithms can lead to better performance (even beating database systems for some queries)
UDF Task
[Figure: bar chart of UDF task time in seconds (axis from 0 to 800) for DBMS, Hadoop, and HadoopDB on 10, 25, and 50 nodes]
Fault Tolerance and Cluster Heterogeneity Results
[Figure: bar chart of percentage slowdown (axis from 0 to 200) for DBMS, Hadoop, and HadoopDB under two tests: fault tolerance and slowdown tolerance]
Hadapt
Goal: Turn HadoopDB into enterprise-
grade software so it can be used for real
analytical applications
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Speaker has a significant financial interest in Hadapt
Hadapt: Lessons Learned
Location matters
– Impossible to scale quickly in New Haven,
Connecticut
– Company was forced to move to Cambridge
(near other successful DB companies)
Personally painful for me
Hadapt: Lessons Learned
Business side matters
– Need intelligence and competence across the
whole company (not just in engineering)
Decisions have to be made all the time that have
enormous effects on the company
– Technical founders often find themselves
doing non-technical tasks more than half the
time
Hadapt: Lessons Learned
Good enterprise software: more time on
minutiae than innovative new features
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Disaster Recovery
– Installer, upgrader, downgrader (!)
– Documentation
Hadapt: Lessons Learned
Customer input to the design process
critical
– Hadapt had no plans to integrate text search
into the product until request from a customer
– Today ~75% of Hadapt customers use the
text search feature
Hadapt: Lessons Learned
Installation and load processes are critical
– They are the customer’s first two interactions
with your product
– Led to the invisible loading idea
Discussed in the remaining slides of this
presentation
If only we could make loading
invisible …
Review: Analytic Workflows
Assume data already in file system (HDFS)
Two workflows
– Workflow #1: In-situ querying
e.g. Map Reduce, HIVE / Pig, Impala, etc
Latency optimization (time to first query answer)
[Figure: Workflow #1. A query is submitted to the query engine, which retrieves data from the file system and returns the query result]
– Workflow #2: Querying after bulk load
E.g. RDBMS, Hadapt
Storage and retrieval optimization for throughput
(repeated queries)
Other benefits: data validation and quality control
[Figure: Workflow #2. Components: file system, storage engine, query engine. Step 1: bulk load. Step 2: submit query, then query result]
Immediate Versus Deferred
Gratification
Use Case Example
Context-Dependent Schema Awareness
Different analysts know the schema of
what they are looking for and don’t care
about other log messages
Message field
varies depending
on application
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
[Figure: Query 1 is submitted to the query engine; alongside the Query 1 result, Partial Load 1 moves data from the file system into the storage engine]
[Figure: Query 2 is submitted to the query engine; alongside the Query 2 result, Partial Load 2 moves more data from the file system into the storage engine]
Benefits
– Eliminates manual workflows on
data modeling and bulk load scripting
– Optimizes query performance over time
– Boosts productivity and performance
Interface and User Experience
Design
Structured parser interface for mapper
– getAttribute(int index)
– Similar to PigLatin
Minimize load’s impact on query performance
– Load a subset of rows (based on HDFS file splits)
– Load size tunable based on query performance
impact
Optional: load a subset of columns
– Useful for columnar file formats
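The design points above can be pictured with a small Python sketch in which each query loads a bounded number of not-yet-loaded splits as a side effect, so later queries find more of their data already parsed. This is a toy model under stated assumptions, not Hadapt's implementation: `ParsedRow`, `InvisibleLoader`, `splits_per_query`, and the pipe-delimited rows are hypothetical, lists of lines stand in for HDFS file splits, a dictionary stands in for the storage engine, and `get_attribute` mirrors the slide's getAttribute(int index) in Python naming.

```python
class ParsedRow:
    """Hypothetical structured-parser view of one delimited line."""
    def __init__(self, line, delimiter="|"):
        self._fields = line.split(delimiter)

    def get_attribute(self, index):
        return self._fields[index]

class InvisibleLoader:
    """Toy model: queries piggy-back a partial load of raw file splits."""
    def __init__(self, splits, splits_per_query=1):
        self.splits = splits                      # stand-in for HDFS file splits
        self.splits_per_query = splits_per_query  # tunable load size per query
        self.loaded = {}                          # split id -> parsed rows ("storage engine")

    def run_query(self, predicate, column):
        budget = self.splits_per_query
        results = []
        for split_id, lines in enumerate(self.splits):
            if split_id in self.loaded:
                rows = self.loaded[split_id]      # already loaded: skip parsing
            else:
                rows = [ParsedRow(line) for line in lines]
                if budget > 0:                    # load only a subset of rows per query
                    self.loaded[split_id] = rows
                    budget -= 1
            results.extend(r.get_attribute(column) for r in rows if predicate(r))
        return results

splits = [["1|error|disk"], ["2|info|boot"], ["3|error|net"]]
loader = InvisibleLoader(splits, splits_per_query=1)
is_error = lambda row: row.get_attribute(1) == "error"

first = loader.run_query(is_error, column=2)
loaded_after_first = len(loader.loaded)
second = loader.run_query(is_error, column=2)
loaded_after_second = len(loader.loaded)
```

Both queries return the same answer, while the number of loaded splits grows by one per query; raising `splits_per_query` trades more per-query overhead for faster convergence to a fully loaded state.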
Physical Tuning: Incremental
Reorganization
Performance Study
Conclusion
Invisible loading: Immediate gratification
+ long term performance benefits
Applicability beyond HDFS
Future work
– Automatic data schema / format recognition
– Schema evolution support
– Decouple load from query
Workload-sensitive loading strategies

More Related Content

What's hot

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersXiao Qin
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoopdarugar
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 

What's hot (20)

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 

Viewers also liked

Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and DeterminismDaniel Abadi
 
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs Daniel Abadi
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresDaniel Abadi
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database SystemsDaniel Abadi
 
Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Daniel Abadi
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesYoav Francis
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 

Viewers also liked (9)

Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
 
Invisible loading
Invisible loadingInvisible loading
Invisible loading
 
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
 
Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 

Similar to Shared slides-edbt-keynote-03-19-13

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approachesLuxoft
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangaloreTIB Academy
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 

Similar to Shared slides-edbt-keynote-03-19-13 (20)

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approaches
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
hadoop
hadoophadoop
hadoop
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Shared slides-edbt-keynote-03-19-13

  • 1. Data Loading Challenges when Transforming Hadoop into a Parallel Database System Daniel Abadi Yale University March 19th, 2013 Twitter: @daniel_abadi
  • 2. Overview of Talk Motivation for HadoopDB Overview of HadoopDB Lessons learned from commercializing HadoopDB into Hadapt Ideas for overcoming the loading challenge (invisible loading) EDBT 2013 paper on invisible loading that is published alongside this talk can be found at: http://cs-www.cs.yale.edu/homes/dna/papers/invisibleloading.pdf
  • 3. Big Data Data is growing (currently) faster than Moore’s Law – Scale becoming a bigger concern – Cost becoming a bigger concern – Unstructured data “variety” becoming a bigger concern Increasing desire for incremental scale out on commodity hardware – Fault tolerance a bigger concern – Dealing with unpredictable performance a bigger concern Parallel database systems are the traditional solution A lot of excitement around Hadoop as a tool to help with these problems (initiated by Yahoo and Facebook)
  • 4. Hadoop/MapReduce Data is partitioned across N machines – Typically stored in a distributed file system (GFS/HDFS) On each machine n, apply a function, Map, to each data item d – Map(d) → {(key1,value1)} “map job” – Sort output of all map jobs on n by key – Send (key1,value1) pairs with same key value to same machine (using e.g., hashing) On each machine m, apply reduce function to (key1, {value1}) pairs mapped to it – Reduce(key1,{value1}) → (key2,value2) “reduce job” Optionally, collect output and present to user
  • 5. Example Count occurrences of the words “Italy” and “DBMS” in all documents
      map(d):
        words = split(d, ' ')
        foreach w in words:
          if w == 'Italy'
            emit ('Italy', 1)
          if w == 'DBMS'
            emit ('DBMS', 1)
      reduce(key, valueSet):
        count = 0
        for each v in valueSet:
          count += v
        emit (key, count)
      [Diagram labels: Data Partitions, Map Workers, Reduce Workers (Italy reducer, DBMS reducer); pairs shown: (Italy, 1), (DBMS, 1)]
  • 6. Relational Operators In MR Straightforward to implement relational operators in MapReduce – Select: simple filter in Map Phase – Project: project function in Map Phase – Join: Map produces tuples with join key as key; Reduce performs the join Query plans can be implemented as a sequence of MapReduce jobs (e.g., Hive)
  • 7. Similarities Between Hadoop and Parallel Database Systems Both are suitable for large-scale data processing – I.e. analytical processing workloads – Bulk loads – Not optimized for transactional workloads – Queries over large amounts of data – Both can handle both relational and nonrelational queries (DBMS via UDFs)
  • 8. Differences Hadoop can operate on in-situ data, without requiring transformation or loading Schemas: – Hadoop doesn’t require them, DBMSs do – Easy to write simple MR programs over raw data Indexes – Hadoop’s built in support is primitive Declarative vs imperative programming Hadoop uses a run-time scheduler for fine-grained load balancing Hadoop checkpoints intermediate results for fault tolerance
  • 9. SIGMOD 2009 Paper Benchmarked Hadoop vs. 2 parallel database systems – Mostly focused on performance differences – Measured differences in load and query time for some common data processing tasks – Used Web analytics benchmark whose goal was to be representative of tasks that: Both should excel at Hadoop should excel at Databases should excel at
  • 10. Hardware Setup 100 node cluster Each node – 2.4 GHz Core 2 Duo Processors – 4 GB RAM – Two 250 GB SATA HDs (74 MB/Sec sequential I/O) Dual GigE switches, each with 50 nodes – 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring
  • 11. Loading [Bar chart: load time in seconds (axis 0 to 50,000) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
  • 12. Join Task [Bar chart: time in seconds (axis 0 to 1600) at 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
  • 13. UDF Task 0 200 400 600 800 1000 1200 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) DBMS Hadoop DBMS clearly doesn’t scaleCalculate PageRank over a set of HTML documents Performed via a UDF
  • 14. Scalability Except for DBMS-X load time and UDFs all systems scale near linearly BUT: only ran on 100 nodes As nodes approach 1000, other effects come into play – Faults go from being rare, to not so rare – It is nearly impossible to maintain homogeneity at scale
  • 15. Fault Tolerance and Cluster Heterogeneity Results [Bar chart: percentage slowdown (axis 0 to 200) for fault tolerance and slowdown tolerance, DBMS vs. Hadoop] Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  • 16. Benchmark Conclusions Hadoop is consistently more scalable – Checkpointing allows for better fault tolerance – Runtime scheduling allows for better tolerance of unexpectedly slow nodes – Better parallelization of UDFs Hadoop is consistently less efficient for structured, relational data – Reasons mostly non-fundamental – Needs better support for compression and direct operation on compressed data – Needs better support for indexing – Needs better support for co-partitioning of datasets
  • 17. Best of Both Worlds Possible? Connector
  • 18. Problems With the Connector Approach Network delays and bandwidth limitations Data silos Multiple vendors Fundamentally wasteful – Very similar architectures Both partition data across a cluster Both parallelize processing across the cluster Both optimize for local data processing (to minimize network costs)
  • 19. Unified System Two options: – Bring Hadoop technology to a parallel database system Problem: Hadoop is more than just technology – Bring parallel database system technology to Hadoop Far more likely to have impact
  • 20. Adding DBMS Technology to Hadoop Option 1: – Keep Hadoop’s storage and build parallel executor on top of it Cloudera Impala (which is sort of a combination of Hadoop++ and NoDB research projects) Need better Storage Formats (Trevni and Parquet are promising) Updates and Deletes are hard (Impala doesn’t support them) – Use relational storage on each node Better support for traditional database management (including updates and deletes) Accelerates “time to complete system” We chose this option for HadoopDB
  • 23. HadoopDB Experiments VLDB 2009 paper ran same Web analytics benchmark as SIGMOD 2009 paper Used PostgreSQL as the DBMS storage layer
  • 24. Join Task [Bar chart: axis 0 to 1600 at 10, 25, and 50 nodes for Vertica, DBMS-X, Hadoop, and HadoopDB] • HadoopDB much faster than Hadoop • In 2009, didn’t quite match the database systems in performance (but see next slide) • Hadoop start-up costs • PostgreSQL join performance • Fault tolerance/run-time scheduling
  • 25. TPC-H Benchmark Results • More advanced storage and better join algorithms can lead to better performance (even beating database systems for some queries)
  • 26. UDF Task [Bar chart: time in seconds (axis 0 to 800) at 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
  • 27. Fault Tolerance and Cluster Heterogeneity Results [Bar chart: percentage slowdown (axis 0 to 200) for fault tolerance and slowdown tolerance, DBMS vs. Hadoop vs. HadoopDB]
  • 28. Hadapt Goal: Turn HadoopDB into enterprise-grade software so it can be used for real analytical applications – Launched with $1.5 million in seed money in 2010 – Raised an additional $8 million in 2011 – Raised an additional $6.75 million in 2012 Speaker has a significant financial interest in Hadapt
  • 29. Hadapt: Lessons Learned Location matters – Impossible to scale quickly in New Haven, Connecticut – Company was forced to move to Cambridge (near other successful DB companies) Personally painful for me
  • 30. Hadapt: Lessons Learned Business side matters – Need intelligence and competence across the whole company (not just in engineering) Decisions have to be made all the time that have enormous effects on the company – Technical founders often find themselves doing non-technical tasks more than half the time
  • 31. Hadapt: Lessons Learned Good enterprise software: more time on minutiae than innovative new features – SQL coverage – Failover for high availability – Authorization / authentication – Disaster Recovery – Installer, upgrader, downgrader (!) – Documentation
  • 32. Hadapt: Lessons Learned Customer input to the design process critical – Hadapt had no plans to integrate text search into the product until request from a customer – Today ~75% of Hadapt customers use the text search feature
  • 33. Hadapt: Lessons Learned Installation and load processes are critical – They are the customer’s first two interactions with your product – Lead to the invisible loading idea Discussed in the remaining slides of this presentation
  • 34. If only we could make loading invisible …
  • 35. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #1: In-situ querying e.g. MapReduce, Hive / Pig, Impala, etc. Latency optimization (time to first query answer) [Diagram labels: File System, Query Engine, Data Retrieval, Submit Query, Query Result]
  • 36. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #2: Querying after bulk load E.g. RDBMS, Hadapt Storage and retrieval optimization for throughput (repeated queries) Other benefits: data validation and quality control [Diagram labels: File System, Storage Engine, Query Engine, 1. Bulk Load, 2. Submit Query, Query Result]
  • 38. Use Case Example Context-Dependent Schema Awareness Different analysts know the schema of what they are looking for and don’t care about other log messages Message field varies depending on application
  • 39. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries [Diagram labels: File System, Storage Engine, Query Engine, Partial Load 1, Submit Query 1, Query 1 Result]
  • 40. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries [Diagram labels: File System, Storage Engine, Query Engine, Partial Load 2, Submit Query 2, Query 2 Result]
  • 41. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries Benefits – Eliminates manual workflows on data modeling and bulk load scripting – Optimizes query performance over time – Boosts productivity and performance
  • 42. Interface and User Experience Design Structured parser interface for mapper – getAttribute(int index) – Similar to PigLatin Minimize load’s impact on query performance – Load a subset of rows (based on HDFS file splits) – Load size tunable based on query performance impact Optional: load a subset of columns – Useful for columnar file formats
  • 45. Conclusion Invisible loading: Immediate gratification + long term performance benefits Applicability beyond HDFS Future work – Automatic data schema / format recognition – Schema evolution support – Decouple load from query Workload-sensitive loading strategies
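
The piggy-backing idea from the workload-based loading slides can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not Hadapt's or the paper's implementation: the class `InvisibleLoader`, its `parse` helper, and the `splits_per_query` knob are hypothetical names, file splits are modeled as in-memory lists of comma-delimited lines, and the "storage engine" is a plain dictionary. Each query scans all data, reading already-loaded splits from the store and parsing the rest in situ, and keeps the parsed rows for a limited number of splits as a side effect.

```python
from dataclasses import dataclass, field


@dataclass
class InvisibleLoader:
    """Toy model of invisible loading; every name here is hypothetical."""
    splits: list                 # raw file splits, each a list of delimited lines
    splits_per_query: int = 1    # tunable load size per query (assumption)
    loaded: dict = field(default_factory=dict)  # split index -> parsed rows

    @staticmethod
    def parse(line):
        # Stand-in for a structured parser: split a raw line into attributes.
        return line.split(",")

    def query(self, predicate):
        """Answer a query over all data, loading a few more splits on the way."""
        budget = self.splits_per_query
        result = []
        for i, split in enumerate(self.splits):
            if i in self.loaded:
                rows = self.loaded[i]      # already in the toy storage engine
            else:
                rows = [self.parse(line) for line in split]  # in-situ scan
                if budget > 0:
                    self.loaded[i] = rows  # piggy-back: keep the parsed rows
                    budget -= 1
            result.extend(r for r in rows if predicate(r))
        return result


splits = [["a,1", "b,2"], ["c,3", "a,4"], ["a,5"]]
loader = InvisibleLoader(splits, splits_per_query=1)
first = loader.query(lambda r: r[0] == "a")
second = loader.query(lambda r: r[0] == "a")
```

Every query returns a complete answer, while the number of loaded splits grows by at most `splits_per_query` per query, so the load cost is spread across the workload instead of being paid up front. Loading a subset of columns, as the interface slide suggests for columnar formats, is not modeled here.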