Sponsored By:
Big Data Warehousing Meetup
Today’s Topic: Real-Time
Interactive Queries in Hadoop
June 10, 2013
WELCOME!
Joe Caserta
Founder & President, Caserta Concepts
7:00 Networking
Grab a slice of pizza and a drink...
7:15 Joe Caserta
President, Caserta Concepts
Author, Data Warehouse ETL Toolkit
Welcome
About the Meetup and about Caserta Concepts
7:30 Elliott Cordo
Principal Consultant, Caserta Concepts
Intro to Real-Time Queries in
Hadoop
7:50 Abhijit Lele
Solutions Engineer at Hortonworks
Deep dive into Hortonworks
STINGER
8:10 -
9:00
More Networking
Tell us what you’re up to…
Agenda
About the BDW Meetup
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded
data nerds
• Opportunities to collaborate on exciting
projects
• Next BDW Meetup: September 16.
• Topic: Real-World Use Case and
Solution in Financial Sector
• Want to present your idea/solution?
Contact joe@casertaconcepts.com
About Caserta Concepts
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Industries Served
• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)
Founded in 2001
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data
Ecosystems
Focused
Expertise
Client Portfolio
Finance
& Insurance
Retail/eCommerce
& Manufacturing
Education
& Services
Implementation Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting
Database
BI/Visualization/
Analytics
Master Data Management
Big Data
Analytics
Storm
Opportunities
Does this word cloud excite you?
Speak with us about our open positions: jobs@casertaconcepts.com
Contacts
Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com
Bob Eilbacher
VP, Sales & Marketing
P: (855) 755-2246 x 345
E: bob@casertaconcepts.com
Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Why talk about Interactive Queries?
[Architecture diagram: source systems (ERP, Finance, Legacy) feed two paths via ETL. One path loads a horizontally scalable Big Data Cluster optimized for analytics: HDFS (nodes N1–N5) running MapReduce, Pig/Hive, and Mahout, backing search/data analytics, Big Data BI, and a NoSQL database (Cassandra) for canned reporting. The other path loads a traditional EDW for ad-hoc/canned reporting with traditional BI.]
REAL-TIME INTERACTIVE QUERIES IN
HADOOP
Elliott Cordo
Principal Consultant, Caserta Concepts
The most ambiguous term of all...
REAL-TIME
What do we mean by this?
• Real-time ingest/processing
• Very Recent Data
• Real-time/interactive queries
• Fast Queries
What are the acceptable latencies, and how are they measured?
Would we categorize 1-minute latency from transaction occurrence to
query availability as real-time?
What is our threshold for query latency?
Let’s explore both aspects
Plumbing: Ingest and Processing
• Let’s assume “freshness” of data is important for our “real-time”
requirement.
• Micro-batch: bring in data in incremental batches as quickly as
possible. Highly optimized ETL!
• Streaming: Continuously push messages into Hadoop
Micro-batch
Use our familiar tools, and build REALLY good ETL:
Traditional:
• Informatica
• Talend
Big Data tools:
• Sqoop
• Pig
• Hive
How “real-time” can we be? → On the scale of minutes
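The micro-batch pattern above can be sketched in a few lines. This illustrative Python (the table, column, and watermark names are hypothetical, not from the deck) polls a source for rows newer than the last-seen watermark — the same incremental-extract idea behind Sqoop’s incremental append mode:

```python
def extract_increment(source_rows, last_value):
    """Pull only rows added since the previous run (watermark on 'id')."""
    batch = [r for r in source_rows if r["id"] > last_value]
    new_watermark = max((r["id"] for r in batch), default=last_value)
    return batch, new_watermark

# Each micro-batch run picks up where the last one left off.
orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 5}]
batch, wm = extract_increment(orders, last_value=1)  # rows 2 and 3
```

The tighter the polling loop and the lighter each batch, the closer this gets to “real-time” — but never below the batch interval itself, hence minutes.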
Streaming
Micro-batch is too slow; we need to push data into our analytics
system at low latency!
Several classes of products fit this requirement:
Stream data collection
• Flume
Complex Event Processors:
• Esper
• Streambase
Distributed Computation Systems:
• Storm
• Akka
How “real-time” are we now? → Minutes to milliseconds!
A quick Storm example:
A CONTINUOUS FEED OF DATA!
Are we fast enough yet to be considered “real-time”?
[Topology diagram: message queues feed spouts with stock trades, market prices, and CRUD operations; bolts calculate trade price vs. market, look up customer profiles, and calculate CRM opportunities — emitting the next tuple continuously and publishing new “topics” back to a message queue.]
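The tuple-at-a-time idea behind that topology can be sketched as follows. Real Storm spouts and bolts are Java classes wired into a topology; this is only a minimal Python analogue, and the symbols, prices, and field names are illustrative assumptions:

```python
# A "spout" emits one tuple at a time; a "bolt" enriches each tuple
# against lookup state (here: current market prices).
market_prices = {"ACME": 100.0}  # state held by the bolt (illustrative)

def trade_spout(trades):
    for trade in trades:          # emit the next tuple, continuously
        yield trade

def price_vs_market_bolt(tuple_in):
    mkt = market_prices[tuple_in["symbol"]]
    return {**tuple_in, "vs_market": tuple_in["price"] - mkt}

results = [price_vs_market_bolt(t) for t in trade_spout(
    [{"symbol": "ACME", "price": 101.5}, {"symbol": "ACME", "price": 98.0}])]
```

Because each tuple is processed the moment it arrives, latency is per-event (milliseconds) rather than per-batch (minutes).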
So we have options for fresh data,
Now queries need some attention!
• Hadoop is a batch system → queries are dispatched as
MapReduce jobs
• Simple queries take around a minute or two
• Complex queries (joins and aggregation) can take much
longer
So how have we run queries in Hadoop up
until now?
• Hive – compiles SQL code into MapReduce
• Pig – suited for data transformation, but query-capable!
• Third-Party Tools – such as Datameer
• HBase → low latency!
• But the query language is spartan
• Low query flexibility!
• Anticipate your query and materialize it!
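“Anticipate your query and materialize it” means designing the HBase row key around the one access path you expect. This sketch uses a plain dict as a stand-in for an HBase table — the entity and key names are hypothetical — to show both the fast point lookup and the flexibility you give up:

```python
# HBase-style access sketch: the "table" is keyed by a composite row key
# designed for the one query you anticipate (sales by customer + date).
table = {}

def row_key(customer, date):
    return f"{customer}#{date}"   # composite key = the query path

def put(customer, date, amount):
    table[row_key(customer, date)] = {"amount": amount}

def get(customer, date):          # fast point lookup by the designed key...
    return table.get(row_key(customer, date))

put("cust42", "2013-06-10", 250)
# ...but "total sales by date, across all customers" would need a full
# scan or a second, differently keyed table: low query flexibility.
```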
[Chart: NoSQL databases scale to far greater data volume than an RDBMS, but at the cost of query flexibility.]
MPP Connectors
• Massively Parallel Processing: horizontally scalable database platforms (columnar
under the hood) that present themselves relationally
• Many have built sophisticated integrations to Hadoop
• Use MPP managed tables and Hadoop in same query
• Ship data to MPP from Hadoop for faster queries
Downstream Relational Databases
• Move aggregate data out to relational data marts using
ETL
• Both this solution and MPP connectors suffer from a few
problems:
• Batch/Micro-batch Latency
• Processing and imposition of a relational model → loss of agility
• Majority of data left behind in Hadoop
[Diagram: ETL moves aggregates from Hadoop out to a relational data mart.]
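The aggregate-and-load pattern above can be sketched with the standard library, using `sqlite3` as a stand-in for the relational data mart (the table and category names are illustrative):

```python
import sqlite3
from collections import defaultdict

# ETL sketch: aggregate detail rows (stand-in for Hadoop output), then
# load only the summary into a relational data mart.
detail = [("books", 10.0), ("books", 15.0), ("toys", 7.5)]

totals = defaultdict(float)
for category, amount in detail:   # the "T" of ETL: aggregate
    totals[category] += amount

mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE sales_summary (category TEXT, total REAL)")
mart.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
```

Note how the slide’s drawbacks show up here: the load is a scheduled batch (latency), a relational schema is imposed, and the detail rows stay behind in Hadoop.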
Dremel
• Research published by Google in 2010
• Main Features
• Fast/interactive ad-hoc queries
• Scaling to trillions of records, petabytes of data
• Relies on its own processing model outside MapReduce
• Leverages a special columnar storage format
• Foundation for Google BigQuery
Inspired by Dremel
• There are several new query engines!
• Drill (Incubator)
• Stinger (Hortonworks)
• Impala (Cloudera)
• All process outside the MapReduce framework
• Evolutions or extensions of Hive!
• Several MPP features have been adopted to deal with
things like query planning and join optimization such as
collocation and broadcasting.
• Also note that to achieve the ultimate performance some
structure will need to be imposed on the data:
• ORC File
• Parquet
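The reason columnar formats like ORC and Parquet matter for these engines can be shown with a toy sketch (pure Python, with made-up fields): an analytic query that touches one column only has to scan that column’s values, not every full record.

```python
# Row vs. columnar layout for an analytic scan over one column.
rows = [{"id": i, "price": float(i), "note": "x" * 100} for i in range(1000)]

# Row layout: summing 'price' still walks every full record,
# dragging the wide 'note' field through memory along the way.
row_sum = sum(r["price"] for r in rows)

# Columnar layout: each column stored contiguously; the scan touches
# only the 'price' array (which, being same-typed, also compresses well).
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_sum = sum(columns["price"])
```

Same answer either way — but on disk, the columnar version reads a fraction of the bytes, which is where much of the “interactive” speedup comes from.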
Now we have something that can provide
us “Real-time” in Hadoop
• At least most of the time
• Queries are significantly faster but
not always instantaneous
• Simple selects → a couple of seconds
• Join queries → tens of seconds
Where is this going?
• So is the roadmap for these engines to become a
Hadoop MPP?
• Likely not
• Are we ready to build an EDW on Hadoop?
• For the right use case we are getting there!
Consider as a supplement to MPP or relational
EDW.
• Will there be a “winner” in the open source
race?
• Maybe, or they will evolve and find their own
strengths, niches
What we do know about these new
engines
• They are made to fit a need for fast queries
on large sets of data!
• They present an exciting feature for the
Hadoop ecosystem
Editor's Notes

• #11 Alternative NoSQL: HBase, Cassandra, Druid, VoltDB