SQL on Hadoop: Defining the New Generation of Analytic Databases. Strata Conference, February 2013

SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

  1. SQL on Hadoop: Defining the New Generation of Analytic Databases. Strata Conference, February 2013
  2. Speaker Bio: Carl Steinbach
      - Currently: Engineer at Citus Data; PMC Chair, Committer -- Apache Hive Project; @cwsteinbach on Twitter
      - Formerly: Cloudera, Informatica, NetApp, Oracle
  3. This is going to sound strange, but… I used to think databases were boring
  4. Why?
      - Undergrad at MIT 1997-2001
      - Number of Database Classes: 0
      - Number of Database Faculty Members: 0
      - My Conclusion: Databases are a Dead Field
  5. Things Changed Over the Next Couple of Years
      - I got a job!
      - Database Group Formed at MIT (2003)
          - Mike Stonebraker
          - Sam Madden
      - New Class: 6.830 Database Systems (2005)
  6. What Changed?
      - Web-scale Data
      - New DB Research: Columnar Storage, NoSQL
      - MPP Analytic Databases Gained Market Traction
      - GFS (’03) and MapReduce (’04) Papers
      - Apache Hadoop v0.1.0 released in 2006
  7. What’s Good About Hadoop?
      - Commodity Storage
      - Scale-out
      - Flexibility
          - MapReduce
          - Multi-structured Data
  8. What’s Bad About Hadoop?
      - MapReduce!
      - No Schemas!
      - Missing Features: Optimizer, Indexes, Views
      - Incompatibility with Existing Tools: BI, ETL, IDEs
  9. Apache Hive Solved Many of These Problems
      [Architecture diagram]
      - User Client: Hive CLI; ETL, BI, SQL IDE; Hive ODBC/JDBC
      - HiveServer2: SQL to MapReduce Compiler; Rule Based Optimizer; MR Plan Execution Coordinator
      - Hive MetaStore: Table to Files; Table to Format
      - Arrow labels: SQL Queries; Catalog Metadata
      - Three nodes, each stacking Map/Reduce, Hive Operators, Hive SerDes, and an HDFS datanode
  10. But Other Problems Remained
      - MapReduce: Latency Overhead
      - Many Missing Features:
          • ANSI SQL
          • Cost Based Optimizer
          • UDFs
          • Data Types
          • Security
          • …
  11. One Solution: Separate MPP DB Cluster
      [Diagram: an MPP Database Cluster (one MPP Master Node with a Global Query Executor, plus four MPP Worker Nodes, each with a Local Query Executor) drawn apart from a Hadoop Cluster of four HDFS datanodes.]
  12. One Solution: Separate MPP DB Cluster
      [Diagram: the same MPP Master Node and four MPP Worker Nodes above the four HDFS datanodes, annotated "Pull Data to Work" and "IO Bottleneck".]
  13. Better Solution: A New Architecture for SQL on Hadoop
      [Diagram: the MPP Master Node's Global Query Executor above four Local Query Executors, each paired with an HDFS datanode, annotated "Push Work to Data" and "Maintain Data Locality".]
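
The contrast between "Pull Data to Work" and "Push Work to Data" on the slides above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `datanodes` contents, the function names, and the "rows moved" counter are all hypothetical and do not come from the talk or from any real system. It counts how many rows would cross the network when a grouped sum is computed each way.

```python
# Hypothetical data: each HDFS datanode holds a few (key, value) rows.
datanodes = {
    "datanode-1": [("a", 1), ("b", 2), ("a", 3)],
    "datanode-2": [("b", 4), ("c", 5)],
    "datanode-3": [("a", 6), ("c", 7), ("c", 8)],
}


def pull_data_to_work(nodes):
    """Separate MPP cluster: ship every raw row, then aggregate remotely."""
    rows_moved = 0
    totals = {}
    for rows in nodes.values():
        for key, value in rows:
            rows_moved += 1  # every raw row crosses the network
            totals[key] = totals.get(key, 0) + value
    return totals, rows_moved


def push_work_to_data(nodes):
    """Executors next to the data: aggregate locally, ship partial sums."""
    rows_moved = 0
    totals = {}
    for rows in nodes.values():
        local = {}
        for key, value in rows:
            # This loop stands in for work done on the datanode itself.
            local[key] = local.get(key, 0) + value
        rows_moved += len(local)  # only per-key partial sums are shipped
        for key, partial in local.items():
            totals[key] = totals.get(key, 0) + partial
    return totals, rows_moved


pulled_totals, pulled_rows = pull_data_to_work(datanodes)
pushed_totals, pushed_rows = push_work_to_data(datanodes)
```

Both functions return the same totals; in this toy data set the pushed variant ships 6 partial rows instead of 8 raw rows, and the gap widens as the number of rows per key grows.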
  14. The New Architecture in Detail: CitusDB
      [Diagram: the CitusDB Master Node (Hadoop Metadata, Metadata, Distributed Query Planner, Distributed Query Executor) is connected to the HDFS NameNode and to PostgreSQL Tools and ODBC/JDBC Clients. Below it are three nodes, each stacking a Local Query Planner, a Local Query Executor, Foreign Data Wrappers, and an HDFS datanode.]
  15. The New Architecture in Detail: CitusDB
      [Same diagram, with the label "Metadata Sync".]
      Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
  16. The New Architecture in Detail: CitusDB
      [Same diagram, with the label "User Query".]
      Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
  17. The New Architecture in Detail: CitusDB
      [Same diagram, with the label "Local Queries".]
      Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
  18. The New Architecture in Detail: CitusDB
      [Same diagram, with the label "Local Results".]
      Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
  19. The New Architecture in Detail: CitusDB
      [Same diagram, with the label "Query Results".]
      Step 5) The Master Node merges the partial result sets and returns the final result to the user.
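
Steps 1 to 5 above describe a scatter-gather pattern, which the following toy sketch walks through for a single aggregate (an average). It is not CitusDB code: the `Worker` and `Master` classes, the block ids, and the in-memory "metadata" dictionary are hypothetical stand-ins chosen for illustration, and a real planner handles far more than one aggregate.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker process co-located with one datanode."""
    name: str
    blocks: dict  # block id -> list of numeric values stored locally

    def run_fragment(self, block_id):
        # Step 4: run the fragment query locally, return a partial result.
        values = self.blocks[block_id]
        return sum(values), len(values)


class Master:
    """Hypothetical master node that plans, dispatches, and merges."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self.metadata = {}  # block id -> name of the worker holding it

    def sync_metadata(self):
        # Step 1: learn where each block lives (stands in for asking
        # the NameNode for file system metadata).
        self.metadata = {
            block_id: worker.name
            for worker in self.workers.values()
            for block_id in worker.blocks
        }

    def average(self):
        # Step 2: the user query arrives (here, "average of all values").
        # Step 3: one fragment per block, sent to the worker that owns it.
        partials = [
            self.workers[owner].run_fragment(block_id)
            for block_id, owner in sorted(self.metadata.items())
        ]
        # Step 5: merge the partial (sum, count) pairs into the final answer.
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count


workers = [
    Worker("datanode-1", {"blk_1": [1, 2, 3]}),
    Worker("datanode-2", {"blk_2": [4, 5]}),
]
master = Master(workers)
master.sync_metadata()
result = master.average()  # 3.0
```

An average is used because it shows why workers return partial results rather than final ones: each worker sends a (sum, count) pair, and only the master can combine them correctly.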
  20. CitusDB: Standing on the Shoulders of Giants
      [Two columns joined by a "+"]
      - Left: Mature, Battle-tested; Enterprise Class Features; Has an Elephant Mascot
      - Right: Proven Scalability; Cost Effectiveness; Has an Elephant Mascot
  21. Leveraging PostgreSQL Performance: Cost-based Query Optimizer

          postgres=# EXPLAIN SELECT
              customer.c_custkey,
              sum((lineitem.l_extendedprice * (1::numeric - lineitem.l_discount)))
              ….
            -> Sort (cost=282459.19..282599.52 rows=56134 width=182)
                 Sort Key: customer.c_custkey, customer.c_name
                 Sort Method: external merge Disk: 17192kB
              ….
            -> Hash Join (cost=39666.61..257246.25 rows=56134 width=16)
                 Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                 -> Seq Scan on lineitem_102022 lineitem (cost=0.00..190571.11)
  22. Leveraging PostgreSQL Features: More than 300 Built-in Functions
      QUOTE_LITERAL, REGR_SLOPE, COS, GREATEST, QUOTE_IDENT, SET_BYTE,
      STRING_TO_ARRAY, ENUM_RANGE, EXTRACT, REGR_SXY, REGR_R2, XMLFOREST,
      CONVERT_TO, NTH_VALUE, DIV, OVERLAPS, LAG, LAG,
      DATE_TRUNC, SIN, BTRIM, FLOOR, PI, FORMAT,
      TO_DATE, TRANSACTION_TIMESTAMP, LOWER, SQRT, TRUNC, ARRAY_AGG,
      LOWER_INC, REGR_SYY, CONCAT, RTRIM, STRIP, LTRIM,
      CHAR_LENGTH, IS FALSE, ARRAY_FILL, REGR_AVGY, XMLAGG, BETWEEN,
      CURRENT_TIMESTAMP, BROADCAST, JUSTIFY_DAYS, IS DISTINCT, UPPER, BOX,
      ARRAY_LENGTH, ISCLOSED, VAR_POP, TIMEOFDAY, COVAR_POP, CURRVAL,
      REPEAT, VAR_SAMP, OCTET_LENGTH, LN, NETMASK, LOCALTIME,
      UPPER, QUERY_TO_XML, STATEMENT_TIMESTAMP, TO_CHAR, FIRST_VALUE, LPAD,
      CASE, GET_BIT, TAN, TRUNC, LOWER_INF, REGR_AVGX,
      BOOL_AND, IS NOT UNKNOWN, ARRAY_APPEND, ISNULL, REGR_COUNT, DATE_PART,
      CORR, ENUM_LAST, XMLCOMMENT, SCHEMA_TO_XML, SET_MASKLEN, ARRAY_TO_STRING,
      XPATH_EXISTS, NUMNODE, REGEXP_MATCHES, COALESCE, NOW, EXTRACT,
      RADIUS, SPLIT_PART, CONVERT_FROM, ENUM_FIRST, ISOPEN, UPPER_INC,
      MOD, REPLACE, XPATH, BIT_AND, REGR_COUNT, TRANSLATE,
      AREA, EVERY, AT TIME ZONE, RADIANS, NOW, SQRT,
      ATAN2, IS TRUE, RANDOM, SUM, MIN, NOT LIKE,
      REGEXP_REPLACE, RPAD, CEILING, TRIM, TO_HEX, LOG,
      DECODE, NOW, WIDTH, STDDEV_POP, GET_BYTE, DATE_TRUNC,
      BOOL_OR, REGR_SXX, ROUND, LSEG, XML_IS_WELL_FORMED, VARIANCE,
      CUME_DIST, PATH, COVAR_SAMP, STRING_AGG, LASTVAL, UNNEST,
      OVERLAY, PERCENT_RANK, HOSTMASK, PCLOSE, HEIGHT, ANY,
      POINT, IN, ARRAY_DIMS, MASKLEN, DENSE_RANK, LOCALTIMESTAMP,
      JUSTIFY_INTERVAL, CURRENT_DATE, CURSOR_TO_XML, LIKE, SETVAL, LENGTH,
      POWER, UPPER_INF, GENERATE_SUBSCRIPTS, POSITION, LAST_VALUE, INITCAP,
      IS NOT TRUE, XMLAGG, PG_SLEEP, VAR_POP, STRPOS, SIGN,
      FORMAT, GENERATE_SERIES, STDDEV_SAMP, DENSE_RANK, COT, SUBSTR,
      REVERSE, REGR_INTERCEPT, SIMILAR TO, DATABASE_TO_XML, ARRAY_CAT, STDDEV,
      IS NOT FALSE, DIAMETER, NOTNULL, HOST, TO_ASCII, ABS,
      ROW_TO_JSON, ROW_NUMBER, SUBSTRING, SETSEED, ISFINITE, SOME,
      SET_BIT, ARRAY_NDIMS, REGEXP_SPLIT_TO_ARRAY, TO_TIMESTAMP, NOT, MD5
  23. Leveraging PostgreSQL Features
      - Extensible, Rich Type System
      - Pluggable Format Handlers
      - Security
      - Internationalization
      - Connectivity: ODBC, JDBC
      - Ecosystem Add-Ons: PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc)
  24. Where are We Headed? Distributed. SQL. Anywhere.
      [Diagram: the CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) above three nodes, each with a Local Query Planner, a Local Query Executor, and a Foreign Data Wrapper. The three wrappers sit on different backends: HDFS on a Hadoop Datanode, mongod on a MongoDB Shard, and an RDBMS on an RDBMS server.]
  25. Defining the New Generation of Distributed Analytic Databases
      - SQL → Ease of Use, Increased Productivity
      - Real-time responsiveness → Faster
      - Data Locality → Proven Scalability
      - Schema-on-Read → Flexibility, Lower Cost
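
The "Schema-on-Read" item above can be made concrete with a small sketch. The idea illustrated here is that the raw file is stored untouched and a schema is applied only when a query reads it, so two queries can read the same bytes through different schemas. Everything named below (`RAW_FILE`, `read_with_schema`, the two schemas) is a hypothetical example, not part of CitusDB or PostgreSQL.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no declared schema.
RAW_FILE = "1,alice,9.50\n2,bob,12.25\n3,carol,3.00\n"


def read_with_schema(raw_text, schema):
    """Apply (column name, converter) pairs to each raw line at read time."""
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a schema may cover only
        # the leading columns it cares about.
        yield {
            name: convert(field)
            for (name, convert), field in zip(schema, fields)
        }


# Two different schemas over the same bytes, chosen per query.
orders_schema = [("id", int), ("customer", str), ("amount", float)]
ids_only_schema = [("id", int)]

total = sum(row["amount"] for row in read_with_schema(RAW_FILE, orders_schema))
ids = [row["id"] for row in read_with_schema(RAW_FILE, ids_only_schema)]
```

No load or conversion step runs before the first query, which is one way to read the "Flexibility, Lower Cost" claim on the slide; the trade-off is that parsing cost is paid on every read.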
  26. Where Are We At?
      - CitusDB SQL on Hadoop is in Open Beta
      - Download our Binary Packages
      - Or Use Our EC2 AMI
      - http://citusdata.com/docs/sql-on-hadoop
  27. We’re Hiring: http://citusdata.com/job
  28. For questions and more information: info@citusdata.com, (650) 566-9010
