Discardable In-Memory Materialized Queries With Hadoop

4,603 views

Published on

What to do with all that memory in a Hadoop cluster? Should we load all of our data into memory to process it?

The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. This proposed solution requires a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ).

In this session we will talk through how we can build on existing Hadoop facilities to deliver three key underlying concepts that enable this approach.

Published in: Technology
0 Comments
10 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,603
On SlideShare
0
From Embeds
0
Number of Embeds
481
Actions
Shares
0
Downloads
74
Comments
0
Likes
10
Embeds 0
No embeds

No notes for slide
  • This software does not yet exist. I hope and believe that it will. I don’t believe in promoting vaporware! I want to have a discussion about where Hadoop is going so as many of us can shape it. This seems like a great time and place to have it.
    My background is in database. I think like a database guy. I’ve been thinking really hard about how to bring the “good stuff” from databases over to Hadoop
    I have a deep knowledge of BI, and what BI tools need from their underlying data system, having written Mondrian
  • Hadoop today brings a lot of brute force.

    It can be more effective if it were a bit smarter. We cannot make effective use of memory unless we are a bit smarter.
  • Lots of memory now. But still not enough
  • Bringing all kinds of data into the Hadoop tent - Batch, interactive, streaming, iterative
    Declarative, algebraic, transparent - Applications use different copies of data without knowing it
    Making Hadoop smarter – Adding brains to compliment Hadoop’s formidable brawn

  • Discardable In-Memory Materialized Queries With Hadoop

    1. 1. Page1 © Hortonworks Inc. 2014 Discardable, In-Memory Materialized Query for Hadoop Julian Hyde Julian Hyde June 3rd, 2014
    2. 2. Page2 © Hortonworks Inc. 2014 About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Optiq (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development)
    3. 3. Page3 © Hortonworks Inc. 2014 Before we get started… The bad news • This software is not available to download • I am a database bigot • I am a BI bigot The good news • I believe in Hadoop • Now is a great time to discuss where Hadoop is going
    4. 4. Page4 © Hortonworks Inc. 2014 Hadoop today Brute force Hadoop brings a lot of CPU, disk, IO Yarn, Tez, Vectorization are making Hadoop faster How to use that brute force is left to the application Business Intelligence Best practice is to pull data out of Hadoop • Populate enterprise data warehouse • In-memory analytics • Custom analytics, e.g. Lambda architecture Ineffective use of memory Opportunity to make Hadoop smarter
    5. 5. Page5 © Hortonworks Inc. 2014 Brute force + diplomacy
    6. 6. Page6 © Hortonworks Inc. 2014 Hardware trends Typical Hadoop server configuration Lots of memory - but not enough for all data More heterogeneous mix (disk + SSD + memory) What to do about memory? Year 2009 2014 Cores 4 – 8 24 Memory 8 GB 128 GB SSD None 1 TB Disk 4 x 1 TB 12 x 4 TB Disk : memory 512 : 1 384 : 1
    7. 7. Page7 © Hortonworks Inc. 2014 Dumb use of memory - Buffer cache Example: 50 TB data 1 TB memory Full-table scan Analogous to virtual memory • Great while it works, but… • Table scan nukes the cache! Table (disk) Buffer cache (memory) Table scan Query operators (Tez, or whatever) Pre-fetch
    8. 8. Page8 © Hortonworks Inc. 2014 “Document-oriented” analysis Operate on working sets small enough to fit in memory Analogous to working on a document (e.g. a spreadsheet) • Works well for problems that fit into memory (e.g. some machine-learning algorithms) • If your problem grows, you’re out of luck Working set (memory) Table (disk) Interactive user
    9. 9. Page9 © Hortonworks Inc. 2014 Smarter use of memory - Materialized queries In-memory materialized queries Tables on disk
    10. 10. Page10 © Hortonworks Inc. 2014 Census data Census table – 300M records Which state has the most males? Brute force – read 150M records Smarter – read 50 records CREATE TABLE Census (id, gender, zipcode, state, age); SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE TABLE CensusSummary AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, c FROM CensusSummary WHERE gender = ‘M’ ORDER BY c DESC LIMIT 1;
    11. 11. Page11 © Hortonworks Inc. 2014 Materialized view – Automatic smartness A materialized view is a table that is declared to be identical to a given query Optimizer rewrites query on “Census” to use “CensusSummary” instead Even smarter – read 50 records CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1;
    12. 12. Page12 © Hortonworks Inc. 2014 Indistinguishable from magic Run query #1 System creates materialized view in background Related query #2 runs faster SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(NULLIF(gender, ‘F’)) AS males, COUNT(NULLIF(gender, ‘M’)) AS females FROM Census GROUP BY state HAVING females > males;
    13. 13. Page13 © Hortonworks Inc. 2014 Materialized views - Classic Classic materialized view (Oracle, DB2, Teradata, MSSql) 1. A table defined using a SQL query 2. Designed by DBA 3. Storage same as a regular table 1. On disk 2. Can define indexes 4. DB populates the table 5. Queries are rewritten to use the table** 6. DB updates the table to reflect changes to source data (usually deferred)* *Magic required SELECT t.year, AVG(s.units) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) GROUP BY t.year; CREATE MATERIALIZED VIEW SalesMonthZip AS SELECT t.year, t.month, c.state, c.zipcode, COUNT(*), SUM(s.units), SUM(s.price) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId) GROUP BY t.year, t.month, c.state, c.zipcode;
    14. 14. Page14 © Hortonworks Inc. 2014 Materialized views - DIMMQ DIMMQs - Discardable, In-memory Materialized Queries Differences with classic materialized views 1. May be in-memory 2. HDFS may discard – based on DDM (Distributed Discardable Memory) 3. Lifecycle support: 1. Assume table is populated 2. Don’t populate & maintain 3. User can flag as valid, invalid, or change definition (e.g. date range) 4. HDFS may discard 4. More design options: 1. DBA specifies 2. Retain query results (or partial results) 3. An agent builds MVs based on query traffic
    15. 15. Page15 © Hortonworks Inc. 2014 DIMMQ compared to Spark RDDs It’s not “either / or” • Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon • Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible Spark RDD DIMMQ Access By reference By algebraic expression Sharing Within session Across sessions Execution model Built-in External Recovery Yes No Discard Failure or GC Failure or cost-based Native organization Memory Disk Language Scala (or other JVM language) Language-independent
    16. 16. Page16 © Hortonworks Inc. 2014 Data independence This is not just about SQL standards compliance! Materialized views are supposed to be transparent in creation, maintenance and use. If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded Data independence Ability to move data around and not tell your application Replicas Redundant copies Moving between disk and memory Sort order, projections (à la Vertica), aggregates (à la Microstrategy) Indexes, and other weird data structures
    17. 17. Page17 © Hortonworks Inc. 2014 Implementing DIMMQs Relational algebra Apache Optiq (just entered incubator – yay!) Algebra, rewrite rules, cost model Metadata Hive: “CREATE MATERIALIZED VIEW” Definitions of materialized views in HCatalog HDFS - Discardable Distributed Memory (DDM) Off-heap data in memory-mapped files Discard policy Build in-memory, replicate to disk; or vice versa Central namespace Evolution of existing components
    18. 18. Page18 © Hortonworks Inc. 2014 Tiled queries in distributed memory Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk
    19. 19. Page19 © Hortonworks Inc. 2014 An adaptive system Ongoing activities: • Agent suggests new MVs • MVs are built in background • Ongoing query activity uses MVs • User marks MVs as invalid due to source data changes • HDFS throws out MVs that are not pulling their weight Dynamic equilibrium DIMMQs continually created & destroyed System moves data around to adapt to changing usage patterns
    20. 20. Page20 © Hortonworks Inc. 2014 Lambda architecture From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
    21. 21. Page21 © Hortonworks Inc. 2014 Lambda architecture in Hadoop via DIMMQs Use DIMMQs for materialized historic & streaming data MV on disk MV in key- value store Hadoop Query via SQL
    22. 22. Page22 © Hortonworks Inc. 2014 Variations on a theme Materialized queries don’t have to be in memory Materialized queries don’t need to be discardable Materialized queries don’t need to be accessed via SQL Materialized queries allow novel data structures to be described Maintaining materialized queries - build on Hive ACID Fine-grained invalidation Streaming into DIMMQs In-memory tables don’t have to be materialized queries Data aging – Older data to cheaper storage Discardable In-memory Algebraic
    23. 23. Page23 © Hortonworks Inc. 2014 Lattice Space of possible materialized views A star schema, with mandatory many-to-one relationships Each view is a projected, filtered aggregation • Sales by zipcode and quarter in 2013 • Sales by state in Q1, 2012 Lattice gathers stats • “I used MV m to answer query q and avoided fetching r rows” • Cost of MV = construction effort + memory * time • Utility of MV = query processing effort saved Recommends & builds optimal MVs CREATE LATTICE SalesStar AS SELECT * FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId);
    24. 24. Page24 © Hortonworks Inc. 2014 Conclusion Broadening the tent – batch, interactive BI, streaming, iterative Declarative, algebraic, transparent Seamless movement between memory and disk Adding brains to complement Hadoop’s brawn
    25. 25. Page25 © Hortonworks Inc. 2014 Thank you! My next talk: “Cost-based query optimization in Hive” 4:35pm Wednesday @julianhyde http://hortonworks.com/blog/dmmq/ http://hortonworks.com/blog/ddm/ http://incubator.apache.org/projects/optiq.html

    ×