Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup


During the Big Data Warehousing Meetup, we discussed options for enabling real-time, interactive queries that support business-intelligence functionality on Hadoop. Hortonworks also provided a deep-dive demo of Stinger. You can access that slideshow here:

If you would like more information, please don't hesitate to contact us at Or, visit our website at

Published in: Technology, Business

  • Alternative NoSQL: HBase, Cassandra, Druid, VoltDB

    1. Sponsored By:
        Big Data Warehousing Meetup
        Today’s Topic: Real Time Interactive Queries IN HADOOP
        June 10, 2013
    2. WELCOME!
        Joe Caserta, Founder & President, Caserta Concepts
    3. Agenda
        • 7:00 Networking: Grab a slice of pizza and a drink...
        • 7:15 Welcome: Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit. About the Meetup and about Caserta Concepts
        • 7:30 Intro to Real-Time Queries in Hadoop: Elliott Cordo, Principal Consultant, Caserta Concepts
        • 7:50 Deep dive into Hortonworks STINGER: Abhijit Lele, Solutions Engineer at Hortonworks
        • 8:10 - 9:00 More Networking: Tell us what you’re up to…
    4. About the BDW Meetup
        • Big Data is a complex, rapidly changing landscape
        • We want to share our stories and hear about yours
        • Great networking opportunity for like-minded data nerds
        • Opportunities to collaborate on exciting projects
        • Next BDW Meetup: September 16.
        • Topic: Real-World Use Case and Solution in Financial Sector
        • Want to present your idea/solution? Contact
    5. About Caserta Concepts
        • Founded in 2001
        • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
        • Focused Expertise: Big Data Analytics, Data Warehousing, Business Intelligence, Strategic Data Ecosystems
        • Industries Served: Financial Services, Healthcare / Insurance, Retail / eCommerce, Digital Media / Marketing, K-12 / Higher Education
    6. Client Portfolio
        • Finance & Insurance
        • Retail/eCommerce & Manufacturing
        • Education & Services
    7. Implementation Expertise & Offerings
        • Strategic Roadmap/Assessment/Consulting
        • Database
        • BI/Visualization/Analytics
        • Master Data Management
        • Big Data Analytics
        • Storm
    8. Opportunities
        Does this word cloud excite you? Speak with us about our open positions:
    9. Contacts
        • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227. E: joe@casertaconcepts.com
        • Bob Eilbacher, VP, Sales & Marketing. P: (855) 755-2246 x345. E: bob@casertaconcepts.com
        • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267. E: elliott@casertaconcepts.com
        • info@casertaconcepts.com
        • 1(855)
    10. Why talk about Interactive Queries?
        [Architecture diagram; labels: ERP, Finance, Legacy; ETL; Traditional EDW; Traditional BI (Ad-Hoc/Canned Reporting); Big Data Cluster (Horizontally Scalable Environment - Optimized for Analytics); Hadoop Distributed File System (HDFS) with nodes N1, N2, N3, N4, N5; Mahout, MapReduce, Pig/Hive; NoSQL Database (Cassandra); Search/Data Analytics; Big Data BI (Canned Reporting)]
    11. REAL-TIME INTERACTIVE QUERIES IN HADOOP
        Elliott Cordo, Principal Consultant, Caserta Concepts
    12. The most ambiguous term of all: REAL-TIME
        What do we mean by this?
        • Real-time ingest/processing
        • Very recent data
        • Real-time/interactive queries
        • Fast queries
        What are the acceptable latencies, and how are they measured?
        Would we categorize a 1-minute latency from transaction occurrence to query availability as real-time?
        What is our threshold for query latency?
        Let’s explore both aspects.
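The two latencies this slide separates can be measured independently. Below is a minimal Python sketch; the timestamps, the 12-second query time, and both thresholds are hypothetical values chosen for illustration, not figures from the talk.

```python
from datetime import datetime, timedelta


def ingest_latency(occurred_at: datetime, queryable_at: datetime) -> timedelta:
    """Data freshness: time from transaction occurrence to query availability."""
    return queryable_at - occurred_at


def within_threshold(latency: timedelta, threshold: timedelta) -> bool:
    """True when a measured latency meets the agreed threshold."""
    return latency <= threshold


# Hypothetical measurements and thresholds, for illustration only.
occurred_at = datetime(2013, 6, 10, 19, 30, 0)
queryable_at = datetime(2013, 6, 10, 19, 31, 0)
freshness = ingest_latency(occurred_at, queryable_at)
query_latency = timedelta(seconds=12)

print("fresh enough:", within_threshold(freshness, timedelta(minutes=5)))
print("fast enough:", within_threshold(query_latency, timedelta(seconds=10)))
```

A system can pass one check and fail the other, which is why the slide treats freshness and query speed as separate requirements.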
    13. Plumbing: Ingest and Processing
        • Let’s assume “freshness” of data is important for our “Real-time” requirement.
        • Micro-batch: bring in data in incremental batches as quickly as possible. Highly optimized ETL!
        • Streaming: continuously push messages into Hadoop
    14. Micro-batch
        Use our familiar tools, and build REALLY good ETL:
        Traditional:
        • Informatica
        • Talend
        Big Data tools:
        • Sqoop
        • Pig
        • Hive
        How “Real-time” can we be? → On the scale of minutes
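The incremental part of micro-batching is usually a watermark: each run pulls only rows changed since the previous run. The sketch below shows that idea in plain Python. It does not use any of the tools listed above, and the `updated_at` column and `extract_increment` helper are hypothetical names.

```python
def extract_increment(source_rows, watermark):
    """Return rows changed since `watermark` and the advanced watermark."""
    batch = [row for row in source_rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in batch), default=watermark)
    return batch, new_watermark


# Hypothetical source table; updated_at is an integer timestamp for simplicity.
source_rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 105},
]

watermark = 0
first_batch, watermark = extract_increment(source_rows, watermark)

# New data arrives between micro-batch runs.
source_rows.append({"id": 3, "updated_at": 110})
second_batch, watermark = extract_increment(source_rows, watermark)

print(len(first_batch), len(second_batch), watermark)
```

Data freshness is then bounded by how often the batch runs plus how long each run takes, which is why this approach lands on the scale of minutes.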
    15. Streaming
        Micro-batch is too slow; we need to push data into our analytics system at low latency!
        Several classes of products fit this requirement:
        Stream data collection:
        • Flume
        Complex Event Processors:
        • Esper
        • StreamBase
        Distributed Computation Systems:
        • Storm
        • Akka
        How “Real-time” are we now? → Minutes to milliseconds!
    16. A quick Storm example: CONTINUOUS FEED OF DATA!
        [Topology diagram; labels: Message Queues; Next tuple, Next tuple, Next tuple…; Market Prices; Stock Trades; Calc Trade Price vs Market; Lookup Customer Profile; Calc CRM Opportunities; CRUD Operations; New “Topic” on Message Queue]
        Are we fast enough yet to be considered “Real-time”?
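The "Calc Trade Price vs Market" step in the diagram can be illustrated without Storm itself. The sketch below is plain Python rather than the Storm API: a generator stands in for a bolt that handles one tuple at a time, and the message fields (`type`, `symbol`, `price`) are assumptions made for the example.

```python
def trade_vs_market(messages):
    """Consume a message stream; emit one result per trade as soon as it arrives."""
    latest_market_price = {}
    for msg in messages:
        if msg["type"] == "market_price":
            # Keep only the most recent market price per symbol.
            latest_market_price[msg["symbol"]] = msg["price"]
        elif msg["type"] == "trade" and msg["symbol"] in latest_market_price:
            market = latest_market_price[msg["symbol"]]
            yield {
                "symbol": msg["symbol"],
                "trade_price": msg["price"],
                "market_price": market,
                "difference": msg["price"] - market,
            }


# Hypothetical feed standing in for tuples read from a message queue.
feed = [
    {"type": "market_price", "symbol": "XYZ", "price": 100.0},
    {"type": "trade", "symbol": "XYZ", "price": 101.5},
    {"type": "market_price", "symbol": "XYZ", "price": 102.0},
    {"type": "trade", "symbol": "XYZ", "price": 101.0},
]

results = list(trade_vs_market(iter(feed)))
for result in results:
    print(result)
```

Each trade is compared against whatever market price was current when it arrived, with no batch boundary in between.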
    17. So we have options for fresh data. Now queries need some attention!
        • Hadoop is a batch system → Queries are dispatched as MapReduce jobs
        • Simple queries take around a minute or two
        • Complex queries (joins and aggregation) can take much longer
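To make "queries are dispatched as MapReduce jobs" concrete, here is a minimal single-process Python sketch of how a query such as `SELECT region, SUM(amount) ... GROUP BY region` breaks into map, shuffle, and reduce phases. The records and field names are hypothetical. A real job adds scheduling, disk I/O, and network transfer between phases, which is where much of the latency comes from.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs: one per input record."""
    for rec in records:
        yield rec["region"], rec["amount"]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's values: here, SUM(amount)."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input records.
records = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]

totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```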
    18. So how have we run queries in Hadoop up until now?
        • Hive – compiles SQL code into MapReduce
        • Pig – suited for data transformation, but query capable!
        • Third-party tools, such as Datameer
        • HBase → Low latency!
            • But the query language is spartan
            • Low query flexibility!
            • Anticipate your query and materialize it!
        [Chart labels: RDBMS, NoSQL, Volume, Query Flexibility]
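"Anticipate your query and materialize it" means precomputing the answer under a row key that the expected query can fetch directly. The sketch below uses a plain Python dict as a stand-in for a key-value table; it is not the HBase client API, and the `customer#day` key layout is an assumption for illustration.

```python
from collections import defaultdict


def materialize_daily_totals(events):
    """Precompute total amount per customer per day, keyed for direct lookup."""
    table = defaultdict(int)
    for event in events:
        row_key = f"{event['customer']}#{event['day']}"
        table[row_key] += event["amount"]
    return dict(table)


# Hypothetical raw events.
events = [
    {"customer": "c1", "day": "2013-06-10", "amount": 40},
    {"customer": "c1", "day": "2013-06-10", "amount": 60},
    {"customer": "c2", "day": "2013-06-10", "amount": 25},
]

daily_totals = materialize_daily_totals(events)

# The anticipated query is a single key lookup: low latency.
print(daily_totals.get("c1#2013-06-10"))

# An unanticipated query (total per day across customers) has no matching key
# and falls back to scanning every row: the flexibility cost.
print(sum(v for k, v in daily_totals.items() if k.endswith("#2013-06-10")))
```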
    19. MPP Connectors
        • Massively Parallel Processing: horizontally scalable database platforms (columnar under the hood) that present themselves relationally
        • Many have built sophisticated integrations with Hadoop
        • Use MPP-managed tables and Hadoop in the same query
        • Ship data to MPP from Hadoop for faster queries
    20. Downstream Relational Databases
        • Move aggregate data out to relational data marts using ETL
        • Both this solution and MPP connectors suffer from a few problems:
            • Batch/micro-batch latency
            • Processing and imposition of a relational model → loss of agility
            • Majority of data left behind in Hadoop
        [Diagram labels: ETL, Relational Data Mart]
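The downstream pattern can be sketched in a few lines: aggregate the detail data, then load only the aggregate into a relational table. The example below uses Python's built-in `sqlite3` module as a stand-in for the data mart; the `daily_sales` table and the event fields are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical detail events that would live in Hadoop.
detail_events = [
    {"day": "2013-06-09", "product": "A", "qty": 2},
    {"day": "2013-06-09", "product": "B", "qty": 3},
    {"day": "2013-06-10", "product": "A", "qty": 1},
    {"day": "2013-06-10", "product": "A", "qty": 3},
]


def aggregate_daily(events):
    """ETL transform step: roll detail rows up to (day, product) totals."""
    totals = Counter()
    for event in events:
        totals[(event["day"], event["product"])] += event["qty"]
    return totals


def load_mart(conn, totals):
    """ETL load step: write only the aggregate into the relational mart."""
    conn.execute("CREATE TABLE daily_sales (day TEXT, product TEXT, qty INTEGER)")
    conn.executemany(
        "INSERT INTO daily_sales VALUES (?, ?, ?)",
        [(day, product, qty) for (day, product), qty in totals.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_mart(conn, aggregate_daily(detail_events))
rows = conn.execute(
    "SELECT day, SUM(qty) FROM daily_sales GROUP BY day ORDER BY day"
).fetchall()
print(rows)
```

The mart answers day-level questions quickly, but the individual events never leave the source, so any question below the chosen grain has to go back to Hadoop.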
    21. Dremel
        • Research published by Google in 2010
        • Main features:
            • Fast/interactive ad-hoc queries
            • Scaling to trillions of records, petabytes of data
            • Relies on its own processing model outside MapReduce
            • Leverages a special columnar storage
        • Foundation for Google BigQuery
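The benefit of columnar storage is that a query touching one field reads only that field's values. The sketch below shows the idea for flat records in Python. It is a simplification: Dremel's actual format also encodes nested and repeated fields, which this example does not attempt. The records and field names are hypothetical.

```python
def to_columns(rows):
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same fields.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical row-oriented records.
rows = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 7},
    {"user": "c", "country": "US", "amount": 3},
]

columns = to_columns(rows)

# SUM(amount) scans a single contiguous column and ignores the others.
total_amount = sum(columns["amount"])
print(total_amount)
```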
    22. Inspired by Dremel
        • There are several new query engines!
            • Drill (Incubator)
            • Stinger (Hortonworks)
            • Impala (Cloudera)
        • All process outside the MapReduce framework
        • Evolution or extension of Hive!
        • Several MPP features have been adopted to deal with things like query planning and join optimization, such as collocation and broadcasting.
        • Also note that to achieve the ultimate performance, some structure will need to be imposed on the data:
            • ORC File
            • Parquet
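Broadcasting, one of the join optimizations mentioned above, copies a small table to every node so that each partition of the large table can be joined locally with no shuffle. Here is a minimal single-process Python sketch of a broadcast hash join; the tables, the `cust_id` key, and the function name are hypothetical, and this is not any particular engine's implementation.

```python
def broadcast_hash_join(fact_partitions, dim_rows, key):
    """Inner-join each fact partition against a broadcast copy of the small table."""
    # Built once; in a cluster this lookup table would be sent to every node.
    dim_lookup = {dim[key]: dim for dim in dim_rows}
    joined = []
    for partition in fact_partitions:  # each partition is processed independently
        for fact in partition:
            dim = dim_lookup.get(fact[key])
            if dim is not None:
                joined.append({**fact, **dim})
    return joined


# Hypothetical small dimension table and partitioned large fact table.
customers = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Globex"},
]
order_partitions = [
    [{"cust_id": 1, "amount": 10}, {"cust_id": 3, "amount": 7}],
    [{"cust_id": 2, "amount": 5}],
]

joined_rows = broadcast_hash_join(order_partitions, customers, "cust_id")
for row in joined_rows:
    print(row)
```

Collocation reaches the same goal differently: both tables are partitioned on the join key in advance, so matching rows already sit on the same node.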
    23. Now we have something that can provide us “Real-time” in Hadoop
        • At least most of the time
        • Queries are significantly faster but not always instantaneous
            • Simple selects → A couple of seconds
            • Join queries → Tens of seconds
    24. Where is this going?
        • So is the roadmap for these engines to be a Hadoop MPP?
            • Likely not
        • Are we ready to build an EDW on Hadoop?
            • For the right use case we are getting there! Consider it as a supplement to an MPP or relational EDW.
        • Will there be a “winner” in the open source race?
            • Maybe, or they will evolve and find their own strengths and niches
    25. What we do know about these new engines
        • They are made to fit a need for fast queries on large sets of data!
        • They present an exciting feature for the Hadoop ecosystem