Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup

During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. Also, Hortonworks provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks

If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.

  • Alternative NoSQL: Hbase, Cassandra, Druid, VoltDB

Presentation Transcript

  • Sponsored By:
    Big Data Warehousing Meetup
    Today’s Topic: Real Time Interactive Queries IN HADOOP
    June 10, 2013
  • WELCOME!
    Joe Caserta, Founder & President, Caserta Concepts
  • Agenda
    • 7:00 Networking: Grab a slice of pizza and a drink...
    • 7:15 Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit: Welcome; About the Meetup and about Caserta Concepts
    • 7:30 Elliott Cordo, Principal Consultant, Caserta Concepts: Intro to Real-Time Queries in Hadoop
    • 7:50 Abhijit Lele, Solutions Engineer at Hortonworks: Deep dive into Hortonworks STINGER
    • 8:10-9:00 More Networking: Tell us what you’re up to…
  • About the BDW Meetup
    • Big Data is a complex, rapidly changing landscape
    • We want to share our stories and hear about yours
    • Great networking opportunity for like-minded data nerds
    • Opportunities to collaborate on exciting projects
    • Next BDW Meetup: September 16. Topic: Real-World Use Case and Solution in Financial Sector
    • Want to present your idea/solution? Contact joe@casertaconcepts.com
  • About Caserta Concepts
    • Founded in 2001
    • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
    • Focused Expertise: Big Data Analytics; Data Warehousing; Business Intelligence; Strategic Data Ecosystems
    • Industries Served: Financial Services; Healthcare / Insurance; Retail / eCommerce; Digital Media / Marketing; K-12 / Higher Education
  • Client Portfolio
    • Finance & Insurance
    • Retail/eCommerce & Manufacturing
    • Education & Services
  • Implementation Expertise & Offerings
    • Strategic Roadmap/Assessment/Consulting
    • Database
    • BI/Visualization/Analytics
    • Master Data Management
    • Big Data Analytics
    • Storm
  • Opportunities
    Does this word cloud excite you? Speak with us about our open positions: jobs@casertaconcepts.com
  • Contacts
    • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227, E: joe@casertaconcepts.com
    • Bob Eilbacher, VP, Sales & Marketing. P: (855) 755-2246 x345, E: bob@casertaconcepts.com
    • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267, E: elliott@casertaconcepts.com
    • info@casertaconcepts.com, 1 (855) 755-2246, www.casertaconcepts.com
  • Why talk about Interactive Queries?
    [Architecture diagram; labels: ERP, Finance, Legacy; ETL; Traditional EDW; Traditional BI; Ad-Hoc/Canned Reporting; Big Data Cluster (Horizontally Scalable Environment - Optimized for Analytics); Hadoop Distributed File System (HDFS) on nodes N1-N5; MapReduce, Pig/Hive, Mahout; NoSQL Database (Cassandra); ETL; Search/Data Analytics; Canned Reporting; Big Data BI]
  • REAL-TIME INTERACTIVE QUERIES IN HADOOP
    Elliott Cordo, Principal Consultant, Caserta Concepts
  • The most ambiguous term of all: REAL-TIME
    What do we mean by this?
    • Real-time ingest/processing
      • Very Recent Data
    • Real-time/interactive queries
      • Fast Queries
    What are the acceptable latencies and how are they measured?
    Would we categorize 1 minute latency from transaction occurrence to query availability as real-time?
    What is our threshold for query latency?
    Let’s explore both aspects
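The two aspects above can be measured separately. The following is a minimal sketch, assuming hypothetical timestamps and thresholds (a 1-minute freshness target and a 10-second query target); none of the function names or numbers come from the talk.

```python
from datetime import datetime, timedelta

def freshness_latency(event_time: datetime, available_time: datetime) -> timedelta:
    """Time from transaction occurrence to the row being queryable."""
    return available_time - event_time

def query_latency(submitted: datetime, first_result: datetime) -> timedelta:
    """Time from query submission to the first result."""
    return first_result - submitted

def is_real_time(latency: timedelta, threshold: timedelta) -> bool:
    """'Real-time' only has meaning relative to an agreed threshold."""
    return latency <= threshold

# Hypothetical measurements for illustration only.
t0 = datetime(2013, 6, 10, 19, 30, 0)
ingest = freshness_latency(t0, t0 + timedelta(seconds=45))
query = query_latency(t0, t0 + timedelta(seconds=90))

FRESHNESS_OK = is_real_time(ingest, timedelta(minutes=1))   # data is fresh enough
QUERY_OK = is_real_time(query, timedelta(seconds=10))       # query is too slow
```

Keeping the two measurements apart makes it clear that a system can meet one target and miss the other.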
  • Plumbing: Ingest and Processing
    • Let’s assume “freshness” of data is important for our “Real-time” requirement.
    • Micro-batch: bring in data in incremental batches as quickly as possible. Highly optimized ETL!
    • Streaming: Continuously push messages into Hadoop
  • Micro-batch
    Use our familiar tools, and build REALLY good ETL:
    • Traditional:
      • Informatica
      • Talend
    • Big Data tools:
      • Sqoop
      • Pig
      • Hive
    How “Real-time” can we be? → On the scale of minutes
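One common way to build a micro-batch load is to track a high-water mark and pull only rows newer than it on each run. The sketch below uses an in-memory SQLite table as a stand-in for a source system; the `trades` table, its columns, and the `extract_increment` helper are assumptions made for illustration, not anything shown in the talk or part of Sqoop, Pig, or Hive.

```python
import sqlite3

def extract_increment(conn, last_id):
    """Return rows newer than the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, symbol, qty FROM trades WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark

# Hypothetical source table standing in for an operational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO trades (symbol, qty) VALUES (?, ?)",
    [("AAA", 100), ("BBB", 50)],
)

# First micro-batch picks up everything so far.
batch1, watermark = extract_increment(conn, 0)

# A new transaction arrives; the next micro-batch picks up only that row.
conn.execute("INSERT INTO trades (symbol, qty) VALUES ('CCC', 75)")
batch2, watermark = extract_increment(conn, watermark)
```

How close this gets to "real-time" depends on how often the extract runs and how long each run takes.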
  • Streaming
    Micro-batch is too slow; we need to push data into our analytics system at low latency!
    Several classes of products fit this requirement:
    • Stream data collection:
      • Flume
    • Complex Event Processors:
      • Esper
      • StreamBase
    • Distributed Computation Systems:
      • Storm
      • Akka
    How “Real-time” are we now? → Minutes to milliseconds!
  • A quick Storm example: CONTINUOUS FEED OF DATA!
    [Topology diagram; labels: Market Prices; Stock Trades; Message Queues; Next tuple, Next tuple, Next tuple…; Calc Trade Price vs Market; Lookup Customer Profile; Calc CRM Opportunities; CRUD Operations; New “Topic” on Message Queue]
    Are we fast enough yet to be considered “Real-time”?
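The tuple-at-a-time flow in the diagram can be imitated in plain Python. This is only a conceptual sketch and does not use Storm's API: `trade_spout` and `price_vs_market_bolt` are hypothetical names borrowed from Storm's spout/bolt vocabulary, and the prices are made-up sample data.

```python
from collections import deque

def trade_spout(queue):
    """Emit tuples one at a time from a message queue (hypothetical source)."""
    while queue:
        yield queue.popleft()

def price_vs_market_bolt(trades, market_prices):
    """For each trade tuple, compute the difference from the market price."""
    for symbol, trade_price in trades:
        yield symbol, round(trade_price - market_prices[symbol], 2)

# Made-up sample data for illustration.
market_prices = {"AAA": 10.00, "BBB": 25.50}
queue = deque([("AAA", 10.25), ("BBB", 25.00), ("AAA", 9.90)])

# Each tuple flows through the pipeline as soon as it is emitted;
# nothing waits for a batch to fill up.
results = list(price_vs_market_bolt(trade_spout(queue), market_prices))
```

In a real topology the spout and bolts would run in parallel across a cluster; the generators here only show the per-tuple processing model.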
  • So we have options for fresh data. Now queries need some attention!
    • Hadoop is a batch system → Queries are dispatched as MapReduce jobs
    • Simple queries take around a minute or two
    • Complex queries (joins and aggregation) can take much longer
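To see why even a simple query becomes a multi-phase batch job, here is a rough in-process sketch of how an aggregate such as `SELECT symbol, SUM(qty) FROM trades GROUP BY symbol` decomposes into map, shuffle, and reduce steps. It is plain Python, not the Hadoop API, and the table and column names are assumptions.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit (key, value) pairs, here (symbol, qty)."""
    for symbol, qty in rows:
        yield symbol, qty

def shuffle(pairs):
    """Shuffle: group all values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values, here SUM(qty)."""
    return {key: sum(values) for key, values in groups.items()}

# Made-up sample rows standing in for a 'trades' table.
rows = [("AAA", 100), ("BBB", 50), ("AAA", 25)]
totals = reduce_phase(shuffle(map_phase(rows)))
```

On a cluster each phase is scheduled as distributed tasks, so the fixed cost of launching the job is paid even when the data is small.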
  • So how have we run queries in Hadoop up until now?
    • Hive: compiles SQL code into MapReduce
    • Pig: suited for data transformation, but query capable!
    • Third-Party Tools: such as Datameer
    • HBase → Low Latency!
      • But query language is Spartan
      • Low query flexibility!
      • Anticipate your query and materialize it!
    [Chart: RDBMS vs. NoSQL; axes: Volume, Query Flexibility]
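"Anticipate your query and materialize it" usually means precomputing the answer and storing it under a row key that matches the expected lookup. The sketch below imitates that with a sorted key-value list; the `symbol|day` key layout and both helper functions are hypothetical and do not use the HBase client API.

```python
from bisect import bisect_left
from collections import defaultdict

def materialize(trades):
    """Precompute SUM(qty) per (symbol, day) into a sorted key-value list."""
    totals = defaultdict(int)
    for symbol, day, qty in trades:
        totals[f"{symbol}|{day}"] += qty  # composite row key (assumed design)
    return sorted(totals.items())

def prefix_scan(table, prefix):
    """Return all entries whose row key starts with the given prefix."""
    keys = [key for key, _ in table]
    start = bisect_left(keys, prefix)
    matches = []
    for key, value in table[start:]:
        if not key.startswith(prefix):
            break
        matches.append((key, value))
    return matches

# Made-up sample data for illustration.
trades = [
    ("AAA", "2013-06-10", 100),
    ("AAA", "2013-06-10", 25),
    ("AAA", "2013-06-11", 40),
    ("BBB", "2013-06-10", 50),
]
table = materialize(trades)
aaa_by_day = prefix_scan(table, "AAA|")
```

The lookup is fast because the work was done at write time, which is also why a question the key design did not anticipate (for example, totals per day across all symbols) is awkward to answer.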
  • MPP Connectors
    • Massively Parallel Processing: horizontally scalable database platforms (columnar under the hood) that present themselves relationally
    • Many have built sophisticated integrations to Hadoop
      • Use MPP managed tables and Hadoop in the same query
      • Ship data to MPP from Hadoop for faster queries
  • Downstream Relational Databases
    • Move aggregate data out to relational data marts using ETL
    • Both this solution and MPP connectors suffer from a few problems:
      • Batch/Micro-batch Latency
      • Processing and imposition of relational model → loss of agility
      • Majority of data left behind in Hadoop
    [Diagram: ETL → Relational Data Mart]
  • Dremel
    • Research published by Google in 2010
    • Main Features
      • Fast/interactive ad-hoc queries
      • Scaling to trillions of records, petabytes of data
      • Relies on its own processing model outside MapReduce
      • Leverages a special columnar storage
    • Foundation for Google BigQuery
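The benefit of columnar storage for analytic queries can be shown with a toy example. This sketch only contrasts row and column layouts for flat records; it does not attempt the nested-record encoding described in the Dremel paper, and the sample data and names are assumptions.

```python
# Made-up sample records in a row-oriented layout.
rows = [
    {"symbol": "AAA", "qty": 100, "price": 10.25},
    {"symbol": "BBB", "qty": 50, "price": 25.00},
    {"symbol": "AAA", "qty": 25, "price": 9.90},
]

def to_columns(records):
    """Pivot row records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}

columns = to_columns(rows)

# SELECT SUM(qty): a column store reads only the 'qty' values,
# while a row store has to read every field of every record.
total_qty = sum(columns["qty"])
values_read_columnar = len(columns["qty"])
values_read_row_store = sum(len(r) for r in rows)
```

With three columns the saving is a factor of three; on wide tables where a query touches a few columns, the gap is far larger.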
  • Inspired by Dremel
    • There are several new query engines!
      • Drill (Incubator)
      • Stinger (Hortonworks)
      • Impala (Cloudera)
    • All process outside the MapReduce framework
    • Evolution or extension of Hive!
    • Several MPP features, such as co-location and broadcasting, have been adopted to deal with things like query planning and join optimization.
    • Also note that to achieve the ultimate performance some structure will need to be imposed on the data:
      • ORC File
      • Parquet
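Broadcasting, one of the join optimizations mentioned above, copies a small table to every node so that each partition of the large table can be joined locally without moving the large table. A minimal single-process sketch follows; the function name and sample tables are hypothetical, and it is not taken from any of the engines listed.

```python
def broadcast_hash_join(fact_partitions, small_table):
    """Join each partition of a large table against a full copy of a small one."""
    lookup = dict(small_table)  # the copy that would be broadcast to every node
    joined = []
    for partition in fact_partitions:  # each partition would run on its own node
        for customer_id, amount in partition:
            if customer_id in lookup:  # inner join: drop unmatched rows
                joined.append((customer_id, lookup[customer_id], amount))
    return joined

# Made-up sample data: a small dimension table and a partitioned fact table.
customers = [(1, "Acme"), (2, "Globex")]
trade_partitions = [[(1, 500), (2, 75)], [(1, 20), (3, 10)]]

result = broadcast_hash_join(trade_partitions, customers)
```

Co-location takes the opposite approach: both tables are partitioned on the join key ahead of time so matching rows already sit on the same node.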
  • Now we have something that can provide us “Real-time” in Hadoop
    • At least most of the time
    • Queries are significantly faster but not always instantaneous
      • Simple selects → A couple of seconds
      • Join queries → Tens of seconds
  • Where is this going?
    • So is the roadmap for these engines to be a Hadoop MPP? → Likely not
    • Are we ready to build an EDW on Hadoop? → For the right use case we are getting there! Consider as a supplement to MPP or relational EDW.
    • Will there be a “winner” in the open source race? → Maybe, or they will evolve and find their own strengths, niches
  • What we do know about these new engines
    • They are made to fit a need for fast queries on large sets of data!
    • They present an exciting feature for the Hadoop ecosystem