1. HHS-2479: Challenges of SQL on Hadoop
A story from the trenches
Scott C. Gray (sgray@us.ibm.com)
Senior Architect and STSM, IBM Open Platform with Apache Hadoop
© 2015 IBM Corporation
2. Why SQL on Hadoop?
 Why are you even asking? This should be obvious by now! :)
 Hadoop is designed for any data
   Doesn't impose any structure
   Extremely flexible
 At the lowest levels it is API-based
   Requires strong programming expertise
   Steep learning curve
   Even simple operations can be tedious
 Why not use SQL where its strengths shine?
   Familiar, widely used syntax
   Separation of what you want vs. how to get it
   Robust ecosystem of tools
3. SQL Engines Everywhere!
 SQL engines are springing up everywhere and maturing at an incredible pace!
 In some cases, the richness of SQL in these engines matches or surpasses that of traditional data warehouses
   <ShamelessPlug>e.g. IBM's Big SQL</ShamelessPlug>
 Robust SQL plus inexpensive, easily expandable, and reliable clustering leads to a deep burning desire…
4. The Data Model Plop
 Ditch that pricey data warehouse and plop it on Hadoop!
 Problem solved, right?
[Diagram: a partitioned data warehouse ($$$) being dropped wholesale onto a Hadoop cluster ($)]
5. Whoa…Hold On There, Buckaroo
 Your plan may just work! Or…it may not
 Even moving your application from one traditional data warehouse to another requires:
   Planning
   Tuning
   An intimate understanding of architectural differences between products
 Hadoop's architecture adds another level of potential impedance mismatch that needs to be considered as well
[Diagram: the same migration, now with a question mark hanging over the Hadoop cluster]
6. Challenges of SQL on Hadoop
 Hadoop's architecture presents significant challenges to matching the functionality of a traditional data warehouse
 Everything is a tradeoff though!
   Hadoop's architecture helps solve problems that have challenged data warehouses
   It opens data processing well beyond just relational
 This presentation will discuss a (very small) subset of the following challenges and ways in which some projects are addressing them
   Data placement
   Indexing
   Data manipulation
   Security
   File formats out the wazoo (that's a technical term)
   Caching (buffer pools)
   Optimization and data ownership
   Competing workloads
7. Disclaimer(s)
 This presentation is not designed to scare you or dishearten you
   Only to educate you and make you think and plan
   And to teach you about the bleeding-edge technologies that will solve all your problems!
 I work on one of these SQL engines (IBM's Big SQL)
   I wouldn't be doing so if it weren't solving real problems for real customers
 There are a LOT of technologies out there
   I'm sure I'm missing at least one of your favorites. Sorry.
   Call me out when I'm wrong. I want to learn too!
8. Data Placement
 Most DWs rely heavily on controlled data placement
   Data is explicitly partitioned across the cluster
   A particular node "owns" a known subset of the data
   Partitioning tables on the same key(s) and on the same nodes allows for co-located processing (see the sketch below)
 The fundamental design of HDFS explicitly implements "random" data placement
   No matter which node writes a block, there is no guarantee a copy will live on that node
   Rebalancing HDFS can move blocks around
 So, no co-located processing without bending over backwards (more on this later)
[Diagram: partitions A, B, and C each holding local slices of T1 and T2 beneath a query coordinator, contrasted with HDFS]
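For contrast, here is a minimal sketch of what controlled placement looks like in a traditional shared-nothing warehouse, using DB2-style DISTRIBUTE BY HASH syntax (the table and column names are hypothetical). Because both tables hash-distribute on the join key, each partition can join its local rows with no network shuffle:

    -- Both tables are hash-distributed on the same key, so rows that
    -- join with each other always land on the same partition.
    CREATE TABLE ORDERS (ORDER_ID INT NOT NULL, CUST_ID INT, AMOUNT DECIMAL(10,2))
      DISTRIBUTE BY HASH (ORDER_ID);
    CREATE TABLE LINE_ITEMS (ORDER_ID INT NOT NULL, ITEM_ID INT, QTY INT)
      DISTRIBUTE BY HASH (ORDER_ID);

    -- This join is fully co-located: no data moves between nodes.
    SELECT O.ORDER_ID, SUM(L.QTY)
    FROM ORDERS O JOIN LINE_ITEMS L ON O.ORDER_ID = L.ORDER_ID
    GROUP BY O.ORDER_ID;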
9. Query Processing Without Data Placement
 Without co-location the options for join processing are limited
 Redistribution join
   DB engines read and filter "local" blocks for each table
   Records with the same key are shipped to the same node to be joined
   In the worst case both joined tables are moved in their entirety!
   Doesn't really work well for non-equijoins (!=, <, >, etc.)
 Broadcast/hash join
   Smaller, or heavily filtered, tables are shipped to all other nodes
   An in-memory hash table is used for very fast joins
   Can still lead to a lot of network traffic to move the small table
   Tricks like bloom filters can help optimize these types of joins (see the sketch below)
[Diagrams: data flows for a broadcast join and a hash join across DB engines]
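As a hedged illustration in Hive (the table names are hypothetical), the MAPJOIN hint asks the engine to broadcast the small table to every node and build the in-memory hash table there, avoiding a full redistribution of the large table:

    -- Broadcast the (small) dimension table to all nodes; each node
    -- hash-joins its local FACT blocks against the in-memory copy.
    SELECT /*+ MAPJOIN(d) */ d.region, SUM(f.amount)
    FROM fact f JOIN dim d ON f.dim_id = d.dim_id
    GROUP BY d.region;

    -- Newer Hive versions can choose this plan automatically:
    SET hive.auto.convert.join=true;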
10. Data Placement – Explicit Placement Policies
 HDFS
   HDFS has supported a pluggable data placement policy for some time now
   This could be used to keep blocks for specific tables "together"
   HDFS doesn't know the data in the blocks, so it would be an "all or nothing" policy
    • A full copy of both tables together on a given host
    • Can be more granular by placing Hive-style partitions together (next slide)
   What do you do when a host "fills up"?
   I'm not aware of any SQL engine that leverages this feature now
 HBase (e.g. HBASE-10576)
   HBase today takes advantage of HDFS write behavior such that table regions are "most likely" local
   There are projects underway to cause the HBase balancer to split tables (regions) together
   This nicely solves the problem of a host "filling up"
   Obviously, this is restricted to HBase storage only
11. Data Placement – Partitioning Without Placement
 Without explicit placement, the next best thing is reducing the amount of data to be scanned
 In Hive, "partitioning" allows for subdividing data by the value in a set of columns (see the sketch below)
   Queries only access the directories required to satisfy the query
   Typically cannot be taken advantage of when joining on the partitioning column
   Scanning a lot of partitions can be quite expensive!
 Other platforms, like JethroData, similarly allow for range partitioning into directories
   Allows for more control over the number of directories/data
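A minimal Hive sketch (the table is hypothetical): each distinct SALE_DATE value becomes its own HDFS directory, and a query that filters on it only scans the matching directories:

    -- Each partition lives in its own directory, e.g.
    --   /warehouse/sales/sale_date=2015-06-01/
    CREATE TABLE sales (order_id INT, amount DECIMAL(10,2))
    PARTITIONED BY (sale_date STRING);

    -- Partition pruning: only the 2015-06-01 directory is read.
    SELECT SUM(amount) FROM sales WHERE sale_date = '2015-06-01';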
12. Avoid-the-Join: Nested Data Types
 One way to avoid the cost of joins is to physically nest related data
   E.g. store data as nested JSON, Avro, etc.
   Each department row contains all employees
 Apache Drill allows this with no schema provided!
   Dynamic JSON schema discovery
 Impala is adding language support to simplify such queries
   An ARRAY-of-STRUCT implicitly treated as a table
   Aggregates can be applied to arrays

Big SQL example:

    CREATE HADOOP TABLE DEPARTMENT (
      DEPT_ID   INT NOT NULL,
      DEPT_NAME VARCHAR(30) NOT NULL,
      ...
      EMPLOYEES ARRAY<STRUCT<
        EMP_ID:   INT,
        EMP_NAME: VARCHAR(30),
        SALARY:   DECIMAL(10,2)
        ...
      >>
    )
    ROW FORMAT SERDE 'com.myco.MyJsonSerDe'

    SELECT D.DEPT_NAME, SUM(E.SALARY)
    FROM DEPARTMENT D, UNNEST(D.EMPLOYEES) AS E
    GROUP BY D.DEPT_NAME

Drill example:

    SELECT DEPT_NAME, SUM(E.SALARY)
    FROM (SELECT D.DEPT_NAME, FLATTEN(D.EMPLOYEES) E
          FROM `myfile.json` D)
    GROUP BY DEPT_NAME
13. Avoid-the-Join: Gotchas
 While avoiding the cost of the joins, nested data types have some downsides:
   The row size can become very large
   Most storage formats must completely read the entire row even when the complex column is not being used
   You are no longer relational!
    • It becomes expensive to slice the data another way
14. Indexing, It's a Challenge!
 HDFS' random block placement is problematic for traditional indexing
   An index is typically just a data file organized by the indexed columns
   Each block in the index file will, of course, be randomly scattered
   Each index entry will point to data in the base data, which is ALSO randomly scattered!
 This sort of "global" index will work for smaller point or range queries
   Network I/O costs grow as the scan range increases on the index
 Many SQL engines allow users to just drop data files into a directory to make it available
   How does the index know it needs to be updated?
[Diagram: data (D) and index (I) blocks randomly scattered across the cluster]
15. Indexing and Hadoop Legacy
 Hive-derived database engines use standard Hadoop classes to allow access to any data
   InputFormat – used to interpret, split, and read a given file type
   OutputFormat – used to write data into a given file type
 These interfaces are great!
   They were established with the very first version of Hadoop (MapReduce, specifically)
   They are ubiquitous
   You can turn literally any file format into a table!
 But…the interface lacks any kind of "seek" operation!
   A feature necessary to implement an index
16. Hive "Indexes"
 Hive has supported indexing since 0.8, but they are barely indexes in the traditional sense
   They are limited in utility
   No other Hive-derived SQL solution uses them
 The index table contains
   One row for each [index-values, block-offset] pair
   A set of bits for each row in the block (1 = the row contains the indexed values)
 This sort of index is useful for
   Indexing any file type, regardless of format
   Skipping base table blocks that don't contain matching values
   Avoiding interpretation of data in rows that don't match the index
 You still have to read each matching block in its entirety (up to the last "1")

    CREATE INDEX IDX1 ON T1 (A, B) ROW FORMAT DELIMITED

    A    B          Block Offset   Bits
    CA   San Jose   6371541        011010010000…
    CA   San Jose   4718461        110100000111…
    CA   Berkeley   1747665        110000000011…
    NY   New York   1888828        1111111100001…

(Full index syntax is sketched below.)
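For reference, a hedged sketch of the fuller Hive syntax (hypothetical table T1; the slide's shortened form omits the handler clause). Hive indexes must be built explicitly and go stale as new data arrives:

    -- 'COMPACT' names Hive's built-in compact index handler.
    CREATE INDEX IDX1 ON TABLE T1 (A, B)
    AS 'COMPACT' WITH DEFERRED REBUILD;

    -- The index is empty until it is (re)built, and must be rebuilt
    -- by hand after new files are added to the table.
    ALTER INDEX IDX1 ON T1 REBUILD;

    -- Ask the optimizer to actually consult indexes when filtering.
    SET hive.optimize.index.filter=true;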
17. Block Level Indexing and Synopsis
 The latest trend in indexing is with "smarter" file formats
   Exemplified by Parquet and ORC
 These formats typically
   Store data in a compressed columnar(-ish) format
   Store indexing and/or statistical information within each block
   Can be configured with search criteria prior to reading data (see the ORC sketch below)
   Index and data are intimately tied together and always in sync
 Optimizations include
   Skipping of blocks that do not match your search criteria
   Quickly seeking within a block to data matching your search criteria
 You still have to at least "peek" at every block
   Fetching a single row out of a billion will still take some time
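A hedged ORC illustration (the table is hypothetical; the property names are real ORC/Hive settings, though exact availability depends on your Hive version). The format keeps min/max statistics every row-index stride and can add per-column bloom filters, so scans can skip whole stripes:

    CREATE TABLE events (user_id BIGINT, event_type STRING, ts TIMESTAMP)
    STORED AS ORC
    TBLPROPERTIES (
      'orc.create.index' = 'true',            -- min/max row-group indexes
      'orc.row.index.stride' = '10000',       -- rows per index entry
      'orc.bloom.filter.columns' = 'user_id'  -- per-column bloom filters
    );

    -- With predicate pushdown enabled, stripes whose min/max (or bloom
    -- filter) rule out user_id = 42 are never decompressed.
    SET hive.optimize.ppd=true;
    SELECT * FROM events WHERE user_id = 42;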
18. Indexing in HBase
 All HBase tables are inherently partitioned and indexed on row key (see the sketch below)
   Provides near-RDBMS levels of performance for fetching on row key (yay!!)
   At a non-negligible cost in writes due to index maintenance (boo!!)
   And requires persistent servers (memory, CPU) instead of just simple flat files
 Today HBase has no native secondary index support (it's coming!)
   But there are many solutions that will provide them for you…
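As a hedged illustration of what row-key access buys you, here is how a SQL-on-HBase layer such as Apache Phoenix maps a primary key onto the HBase row key (the table is hypothetical):

    -- The PRIMARY KEY becomes the HBase row key, so this table is
    -- physically sorted and indexed on ORDER_ID.
    CREATE TABLE ORDERS (
      ORDER_ID  BIGINT NOT NULL PRIMARY KEY,
      CUST_NAME VARCHAR,
      AMOUNT    DECIMAL(10,2)
    );

    -- Point lookup: translates to a single HBase "get", not a scan.
    SELECT * FROM ORDERS WHERE ORDER_ID = 12345;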
19. Secondary Indexes in HBase
 Most secondary index solutions for HBase store the index in another HBase table
   The index is automatically maintained via HBase co-processors (kind of like triggers)
   There is a measurable cost to index maintenance
 Big SQL is exploring using co-processors to store index data outside of HBase
   E.g. using a Lucene index
   Stored locally with each region server

    CREATE INDEX IDX1 ON T1 (C2, C3)

    T1
    Row Key   C1      C2       C3
    12345     Frank   Martin   44
    12346     Mary    Lee      22

    T1_IDX1
    Row Key     Pointer
    Martin|44   12345
    Lee|22      12346
20. Secondary Indexes in HBase
 Global Index (e.g. Phoenix, Big SQL)
   Index table is maintained like a regular HBase table (regions are randomly scattered)
   Index data likely not co-located with base data
   Good for point or small scan queries
   Suffers from "network storm" during large index scans
 Local Index (e.g. Phoenix)
   Custom HBase balancer ensures index data is co-located with base data
    • There is a small chance that it will be remote
   No network hop to go from index to base data
   BUT, for a given index key all index region servers must be polled
    • Potentially more expensive for single-row lookups
(Both variants are sketched below.)
[Diagrams: global index regions scattered independently across region servers vs. local index regions paired with their base table regions]
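In Phoenix, as a hedged sketch, the two layouts are one keyword apart (table and column names are hypothetical):

    -- Global index: its own HBase table, regions scattered independently.
    CREATE INDEX T1_IDX1 ON T1 (C2, C3);

    -- Local index: index shards kept alongside the base table's regions.
    CREATE LOCAL INDEX T1_IDX2 ON T1 (C2, C3);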
21. Data Manipulation (DML) in a Non-Updateable World
 Another "gotcha" of HDFS is that it only supports write and append
   Modifying data is difficult without the ability to update data!
   Variable block length and block append (HDFS-3689) may allow some crude modification features
 As a result, you'll notice very few SQL solutions support DML operations
 Those that do support it have to bend over backwards to accommodate the file system
   Modifications are logged next to the original data
   Reads of the original data are merged
22. Hive ACID Tables
 Hive 0.14 introduced "ACID" tables (which aren't quite ACID yet!)
 Modifications are logged, in row order, next to the base data files (syntax sketched below)
 During read, delta file changes are "merged" into the base data
 Minor compaction merges delta files together; major compaction rebuilds the base data
 Not suitable for OLTP
   A single-row update still scans all base data and produces one delta file
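A hedged sketch of the Hive 0.14 syntax (the table is hypothetical; the feature also requires the transactional lock manager and compactor to be configured in hive-site.xml). ACID tables must be bucketed and stored as ORC:

    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Each statement writes a delta directory next to the base files;
    -- readers merge deltas on the fly until compaction rewrites them.
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    DELETE FROM accounts WHERE id = 43;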
23. Data Modification in HBase
 HBase takes a similar approach
   Changes are logged to the write-ahead log (WAL)
   Final view of the row is cached in memory
   Base data (HFILEs) are periodically rebuilt by merging changes
 HBase achieves OLTP levels of performance by caching changes in memory
 HBase supports "UPSERT" semantics (see the sketch below)
   It is still difficult (costly) to implement SQL UPDATE semantics
[Diagram: a region server with its write-ahead log (WAL), in-memory cache, and HFILEs]
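A hedged Phoenix illustration of why UPSERT is the natural verb here (reusing the hypothetical ORDERS table from the earlier sketch): an HBase Put blindly overwrites whatever cells exist, whereas true SQL UPDATE semantics (fail if the row is missing) would require an extra read first:

    -- Insert-or-overwrite in one shot: maps directly to an HBase Put,
    -- no prior read of the row is needed.
    UPSERT INTO ORDERS (ORDER_ID, CUST_NAME, AMOUNT)
    VALUES (12345, 'Frank Martin', 44.00);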
24. Security
 Most Hadoop SQL solutions are built on "open" data formats. This is good!
   E.g. delimited, JSON, Parquet, ORC, etc.
   Data can be manipulated by different tools (SQL, Pig, Spark, etc.)
   Use the right tool for the right job!
 But, there lurks a significant security challenge!
   HDFS has a very coarse, file-level security model
   (Good) SQL solutions provide very fine-grained access controls
    • Object level, column level, row level, data masking rules, etc.
   Providing a consistent view of security across technologies is a huge challenge
25. Common SQL Security Models
 Impersonation
   Optional mode for Hive
   All operations executed as the connected user
   HDFS provides security enforcement
   Pros
    • Permissions are always "in sync"
   Cons
    • GRANT/REVOKE is impossible
    • Fine-grained access control is impossible
    • Permissions managed outside of SQL
 Server Based
   Provided by most SQL solutions
   All operations executed as the server owner
   Server owner is typically a privileged user
   Pros
    • "Traditional" SQL security model
    • GRANT/REVOKE is supported
    • Fine-grained access control supported
   Cons
    • Data is owned by the server
    • Users must be granted explicit permission
26. Permission Synchronization
 Solutions like Apache Sentry attempt to bridge the gap between security models
   The SQL server runs as a single privileged user
   GRANT/REVOKE operations are also translated to HDFS permission changes
   Object ownership translates to file ownership
 But, this model falls apart quickly

A table-level grant has a clean HDFS translation:

    GRANT SELECT ON T1 TO BOB
      →  hdfs dfs -setfacl -m user:bob:r-- /path/to/T1

Finer-grained controls have no HDFS equivalent:

    GRANT SELECT ON T1 (C1, C2, C3) TO BOB
      →  ????

    CREATE PERMISSION BRANCH_A ON HR_STAFF
      FOR ROWS WHERE (
        VERIFY_ROLE_FOR_USER(SESSION_USER, 'BRANCH_A_ROLE') = 1
        AND HR.STAFF.BRANCH_NAME = 'Branch_A')
      ENFORCED FOR ALL ACCESS ENABLE
      →  ????
27. Security Delegation
 New projects are popping up to solve this problem!!
 Dedicated scan-and-filter I/O engines
   Client applications provide scan-and-filter criteria
   Engine enforces security policies (from the Hive metastore)
   Also provides other optimizations (caching, performance optimizations)
 LLAP (Live Long and Process)
   Developed as part of Hive
   In-memory vectorized predicate evaluation
   Data caching
 RecordService
   Currently part of Sentry
   C++ based runtime (from Impala), highly optimized
[Diagram: MR, Pig, Spark, and SQL clients all funneled through a Hive I/O engine on each compute/data node before reaching HDFS]
28. Oh, There's So Much More!!
 Other areas and technologies I would have liked to have covered ("Sir Not-Appearing-In-This-Film"):
   File formats out the wazoo
    • Trade-offs in a proprietary format vs. being completely agnostic
   Caching
    • How file formats and compression make efficient caching difficult
   Schema discovery and schema-less querying
    • What if the data doesn't have a rigid schema? (Hint: Drill it)
   Optimization and data ownership
    • How do you optimize a query if you have no statistics? (dynamic vs. static optimization)
   Competing workloads
    • How does the database deal with competing workloads from other Hadoop tools?
29. Thank You!
 Thanks for putting up with me
 Queries? (Optimized of course!)