Sponsored By:
Big Data Warehousing Meetup
Today’s Topic: Real-Time
Interactive Queries in Hadoop
June 10, 2013
WELCOME!
Joe Caserta
Founder & President, Caserta Concepts
7:00 Networking
Grab a slice of pizza and a drink...
7:15 Joe Caserta
President, Caserta Concepts
Author, Data Warehouse ETL Toolkit
Welcome
About the Meetup and about Caserta Concepts
7:30 Elliott Cordo
Principal Consultant, Caserta Concepts
Intro to Real-Time Queries in
Hadoop
7:50 Abhijit Lele
Solutions Engineer at Hortonworks
Deep dive into Hortonworks
STINGER
8:10 -
9:00
More Networking
Tell us what you’re up to…
Agenda
About the BDW Meetup
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded
data nerds
• Opportunities to collaborate on exciting
projects
• Next BDW Meetup: September 16.
• Topic: Real-World Use Case and
Solution in Financial Sector
• Want to present your idea/solution?
Contact joe@casertaconcepts.com
About Caserta Concepts
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Industries Served
• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)
Founded in 2001
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data
Ecosystems
Focused
Expertise
Client Portfolio
Finance
& Insurance
Retail/eCommerce
& Manufacturing
Education
& Services
Implementation Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting
Database
BI/Visualization/
Analytics
Master Data Management
Big Data
Analytics
Storm
Opportunities
Does this word cloud excite you?
Speak with us about our open positions: jobs@casertaconcepts.com
Contacts
Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com
Bob Eilbacher
VP, Sales & Marketing
P: (855) 755-2246 x 345
E: bob@casertaconcepts.com
Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Why talk about Interactive Queries?
[Architecture diagram: source systems (ERP, Finance, Legacy) feed two paths via ETL. One path loads a horizontally scalable Big Data Cluster optimized for analytics: HDFS (nodes N1–N5) running MapReduce, Pig/Hive, and Mahout, backing search/data analytics, Big Data BI, and a NoSQL database (Cassandra) for canned reporting. The other path loads a traditional EDW for ad-hoc/canned reporting with traditional BI.]
REAL-TIME INTERACTIVE QUERIES IN
HADOOP
Elliott Cordo
Principal Consultant, Caserta Concepts
The most ambiguous term of all...
REAL-TIME
What do we mean by this?
• Real-time ingest/processing
• Very Recent Data
• Real-time/interactive queries
• Fast Queries
What are the acceptable latencies, and how are they measured?
Would we categorize 1-minute latency from transaction occurrence to
query availability as real-time?
What is our threshold for query latency?
Let’s explore both aspects
Plumbing: Ingest and Processing
• Let’s assume “freshness” of data is important for our “real-time”
requirement.
• Micro-batch: bring in data in incremental batches as quickly as
possible. Highly optimized ETL!
• Streaming: Continuously push messages into Hadoop
Micro-batch
Use our familiar tools, and build REALLY good ETL:
Traditional:
• Informatica
• Talend
Big Data tools:
• Sqoop
• Pig
• Hive
How “real-time” can we be? → On the scale of minutes
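The micro-batch pattern above can be sketched in a few lines. This illustrative Python (the table, column, and watermark names are hypothetical, not from the deck) polls a source for rows newer than the last-seen watermark — the same incremental-extract idea behind Sqoop’s incremental append mode:

```python
def extract_increment(source_rows, last_value):
    """Pull only rows added since the previous run (watermark on 'id')."""
    batch = [r for r in source_rows if r["id"] > last_value]
    new_watermark = max((r["id"] for r in batch), default=last_value)
    return batch, new_watermark

# Each micro-batch run picks up where the last one left off.
orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 5}]
batch, wm = extract_increment(orders, last_value=1)  # rows 2 and 3
```

The tighter the polling loop and the lighter each batch, the closer this gets to “real-time” — but never below the batch interval itself, hence minutes.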
Streaming
Micro-batch is too slow; we need to push data into our analytics
system at low latency!
Several classes of products fit this requirement:
Stream data collection
• Flume
Complex Event Processors:
• Esper
• Streambase
Distributed Computation Systems:
• Storm
• Akka
How “real-time” are we now? → Minutes to milliseconds!
A quick Storm example:
A CONTINUOUS FEED OF DATA!
Are we fast enough yet to be considered “real-time”?
[Topology diagram: message queues feed spouts with stock trades, market prices, and CRUD operations; bolts calculate trade price vs. market, look up customer profiles, and calculate CRM opportunities — emitting the next tuple continuously and publishing new “topics” back to a message queue.]
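The tuple-at-a-time idea behind that topology can be sketched as follows. Real Storm spouts and bolts are Java classes wired into a topology; this is only a minimal Python analogue, and the symbols, prices, and field names are illustrative assumptions:

```python
# A "spout" emits one tuple at a time; a "bolt" enriches each tuple
# against lookup state (here: current market prices).
market_prices = {"ACME": 100.0}  # state held by the bolt (illustrative)

def trade_spout(trades):
    for trade in trades:          # emit the next tuple, continuously
        yield trade

def price_vs_market_bolt(tuple_in):
    mkt = market_prices[tuple_in["symbol"]]
    return {**tuple_in, "vs_market": tuple_in["price"] - mkt}

results = [price_vs_market_bolt(t) for t in trade_spout(
    [{"symbol": "ACME", "price": 101.5}, {"symbol": "ACME", "price": 98.0}])]
```

Because each tuple is processed the moment it arrives, latency is per-event (milliseconds) rather than per-batch (minutes).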
So we have options for fresh data,
Now queries need some attention!
• Hadoop is a batch system → queries are dispatched as
MapReduce jobs
• Simple queries take around a minute or two
• Complex queries (joins and aggregation) can take much
longer
So how have we run queries in Hadoop up
until now?
• Hive – compiles SQL code into MapReduce
• Pig – suited for data transformation, but query-capable!
• Third-Party Tools – such as Datameer
• HBase → low latency!
• But the query language is spartan
• Low query flexibility!
• Anticipate your query and materialize it!
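“Anticipate your query and materialize it” means designing the HBase row key around the one access path you expect. This sketch uses a plain dict as a stand-in for an HBase table — the entity and key names are hypothetical — to show both the fast point lookup and the flexibility you give up:

```python
# HBase-style access sketch: the "table" is keyed by a composite row key
# designed for the one query you anticipate (sales by customer + date).
table = {}

def row_key(customer, date):
    return f"{customer}#{date}"   # composite key = the query path

def put(customer, date, amount):
    table[row_key(customer, date)] = {"amount": amount}

def get(customer, date):          # fast point lookup by the designed key...
    return table.get(row_key(customer, date))

put("cust42", "2013-06-10", 250)
# ...but "total sales by date, across all customers" would need a full
# scan or a second, differently keyed table: low query flexibility.
```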
[Chart: NoSQL databases scale to far greater data volume than an RDBMS, but at the cost of query flexibility.]
MPP Connectors
• Massively Parallel Processing: horizontally scalable database platforms (columnar
under the hood) that present themselves relationally
• Many have built sophisticated integrations to Hadoop
• Use MPP managed tables and Hadoop in same query
• Ship data to MPP from Hadoop for faster queries
Downstream Relational Databases
• Move aggregate data out to relational data marts using
ETL
• Both this solution and MPP connectors suffer from a few
problems:
• Batch/Micro-batch Latency
• Processing and imposition of a relational model → loss of agility
• Majority of data left behind in Hadoop
[Diagram: ETL moves aggregates from Hadoop out to a relational data mart.]
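The aggregate-and-load pattern above can be sketched with the standard library, using `sqlite3` as a stand-in for the relational data mart (the table and category names are illustrative):

```python
import sqlite3
from collections import defaultdict

# ETL sketch: aggregate detail rows (stand-in for Hadoop output), then
# load only the summary into a relational data mart.
detail = [("books", 10.0), ("books", 15.0), ("toys", 7.5)]

totals = defaultdict(float)
for category, amount in detail:   # the "T" of ETL: aggregate
    totals[category] += amount

mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE sales_summary (category TEXT, total REAL)")
mart.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
```

Note how the slide’s drawbacks show up here: the load is a scheduled batch (latency), a relational schema is imposed, and the detail rows stay behind in Hadoop.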
Dremel
• Research published by Google in 2010
• Main Features
• Fast/interactive ad-hoc queries
• Scaling to trillions of records, petabytes of data
• Relies on its own processing model outside MapReduce
• Leverages a special columnar storage format
• Foundation for Google BigQuery
Inspired by Dremel
• There are several new query engines!
• Drill (Incubator)
• Stinger (Hortonworks)
• Impala (Cloudera)
• All process outside the MapReduce framework
• Evolutions or extensions of Hive!
• Several MPP features have been adopted to deal with
things like query planning and join optimization such as
collocation and broadcasting.
• Also note that to achieve the ultimate performance some
structure will need to be imposed on the data:
• ORC File
• Parquet
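The reason columnar formats like ORC and Parquet matter for these engines can be shown with a toy sketch (pure Python, with made-up fields): an analytic query that touches one column only has to scan that column’s values, not every full record.

```python
# Row vs. columnar layout for an analytic scan over one column.
rows = [{"id": i, "price": float(i), "note": "x" * 100} for i in range(1000)]

# Row layout: summing 'price' still walks every full record,
# dragging the wide 'note' field through memory along the way.
row_sum = sum(r["price"] for r in rows)

# Columnar layout: each column stored contiguously; the scan touches
# only the 'price' array (which, being same-typed, also compresses well).
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_sum = sum(columns["price"])
```

Same answer either way — but on disk, the columnar version reads a fraction of the bytes, which is where much of the “interactive” speedup comes from.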
Now we have something that can provide
us “Real-time” in Hadoop
• At least most of the time
• Queries are significantly faster but
not always instantaneous
• Simple selects → a couple of seconds
• Join queries → tens of seconds
Where is this going?
• So is the roadmap for these engines to become a
Hadoop MPP?
• Likely not
• Are we ready to build an EDW on Hadoop?
• For the right use case we are getting there!
Consider as a supplement to MPP or relational
EDW.
• Will there be a “winner” in the open source
race?
• Maybe, or they will evolve and find their own
strengths, niches
What we do know about these new
engines
• They are made to fit a need for fast queries
on large sets of data!
• They present an exciting feature for the
Hadoop ecosystem
Editor's Notes

• #11 Alternative NoSQL: HBase, Cassandra, Druid, VoltDB