SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
Transcript

  • 1. SQL on Hadoop: Defining the New Generation of Analytic Databases. Strata Conference, February 2013
  • 2. Speaker Bio: Carl Steinbach. Currently: Engineer at Citus Data; PMC Chair and Committer, Apache Hive Project; @cwsteinbach on Twitter. Formerly: Cloudera, Informatica, NetApp, Oracle.
  • 3. This is going to sound strange, but… I used to think databases were boring.
  • 4. Why? Undergrad at MIT, 1997-2001. Number of Database Classes: 0. Number of Database Faculty Members: 0. My Conclusion: Databases are a Dead Field.
  • 5. Things Changed Over the Next Couple of Years. I got a job! Database Group Formed at MIT (2003): Mike Stonebraker, Sam Madden. New Class: 6.830 Database Systems (2005).
  • 6. What Changed? Web-scale Data. New DB Research: Columnar Storage, NoSQL. MPP Analytic Databases Gained Market Traction. GFS (’03) and MapReduce (’04) Papers. Apache Hadoop: v0.1.0 released in 2006.
  • 7. What’s Good About Hadoop? Commodity Storage. Scale-out. Flexibility: MapReduce, Multi-structured Data.
  • 8. What’s Bad About Hadoop? MapReduce! No Schemas! Missing Features: Optimizer, Indexes, Views. Incompatibility with Existing Tools: BI, ETL, IDEs.
  • 9. Apache Hive Solved Many of These Problems. [Diagram labels: User Client (Hive CLI; ETL, BI, SQL IDE; Hive ODBC/JDBC); SQL Queries; HiveServer2 (SQL to MapReduce Compiler, Rule Based Optimizer, MR Plan Execution Coordinator); Catalog Metadata; Hive MetaStore (Table to Files, Table to Format); three Map/Reduce tasks, each with Hive Operators and Hive SerDes over an HDFS datanode.]
  • 10. But Other Problems Remained. MapReduce: Latency Overhead. Many Missing Features: ANSI SQL, Cost Based Optimizer, UDFs, Data Types, Security, …
  • 11. One Solution: Separate MPP DB Cluster. [Diagram: an MPP Database Cluster (an MPP Master Node with a Global Query Executor, and four MPP Worker Nodes, each with a Local Query Executor) shown apart from a Hadoop Cluster of four HDFS datanodes.]
  • 12. One Solution: Separate MPP DB Cluster. [Diagram: MPP Master Node (Global Query Executor) and four MPP Worker Nodes (Local Query Executors) above four HDFS datanodes, annotated "Pull Data to Work" and "IO Bottleneck".]
  • 13. Better Solution: A New Architecture for SQL on Hadoop. [Diagram: MPP Master Node (Global Query Executor) above four Local Query Executors, each paired with an HDFS datanode, annotated "Push Work to Data" and "Maintain Data Locality".]
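The contrast between "Pull Data to Work" (slide 12) and "Push Work to Data" (slide 13) can be illustrated with a toy model. This is only a sketch under stated assumptions: the `DATANODES` rows, the `is_large` filter, and both function names are hypothetical and are not taken from the talk or from CitusDB. It counts how many rows would cross the network in each design when a query keeps only a few rows.

```python
# Hypothetical data: three datanodes, each holding (order_id, amount) rows.
DATANODES = [
    [(1, 5.0), (2, 250.0), (3, 12.0)],
    [(4, 999.0), (5, 1.0)],
    [(6, 300.0), (7, 8.0), (8, 20.0), (9, 450.0)],
]


def is_large(row):
    """Hypothetical query predicate: keep orders of 100.0 or more."""
    return row[1] >= 100.0


def pull_data_to_work(datanodes):
    """Ship every row to a separate compute tier, then filter there."""
    shipped = [row for node in datanodes for row in node]
    matches = [row for row in shipped if is_large(row)]
    return matches, len(shipped)


def push_work_to_data(datanodes):
    """Filter on each datanode and ship only the matching rows."""
    shipped = [row for node in datanodes for row in node if is_large(row)]
    return shipped, len(shipped)


pull_matches, pull_shipped = pull_data_to_work(DATANODES)
push_matches, push_shipped = push_work_to_data(DATANODES)
print(f"pull: {pull_shipped} rows shipped, push: {push_shipped} rows shipped")
```

Both functions return the same matching rows; only the number of rows shipped differs, which is the IO bottleneck the slides point at.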
  • 14. The New Architecture in Detail: CitusDB. [Diagram: CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor), connected to the Hadoop HDFS NameNode (Hadoop Metadata) and to PostgreSQL Tools and ODBC/JDBC Clients; three worker nodes, each with a Local Query Planner, Local Query Executor, and Foreign Data Wrappers over an HDFS datanode.]
  • 15. The New Architecture in Detail: CitusDB. [Diagram as on slide 14, highlighting "Metadata Sync".] Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
  • 16. The New Architecture in Detail: CitusDB. [Diagram as on slide 14, highlighting "User Query".] Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
  • 17. The New Architecture in Detail: CitusDB. [Diagram as on slide 14, highlighting "Local Queries".] Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
  • 18. The New Architecture in Detail: CitusDB. [Diagram as on slide 14, highlighting "Local Results".] Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
  • 19. The New Architecture in Detail: CitusDB. [Diagram as on slide 14, highlighting "Query Results".] Step 5) The Master Node merges the partial result sets and returns the final result to the user.
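Steps 3 to 5 describe a scatter-gather pattern: fragment queries go out to the workers, partial results come back, and the master merges them. The following is a minimal sketch of that pattern, not CitusDB's implementation. The `SHARDS` data and the function names are hypothetical, threads stand in for worker processes, and the "query" is a fixed per-customer sum rather than parsed SQL.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the rows local to one worker.
SHARDS = [
    [("alice", 10.0), ("bob", 5.0)],
    [("alice", 2.5), ("carol", 7.0)],
    [("bob", 1.0), ("carol", 3.0)],
]


def run_fragment(rows):
    """Worker side (step 4): aggregate local rows into a partial result."""
    partial = defaultdict(float)
    for key, amount in rows:
        partial[key] += amount
    return dict(partial)


def merge_partials(partials):
    """Master side (step 5): combine the per-worker partial sums."""
    final = defaultdict(float)
    for partial in partials:
        for key, subtotal in partial.items():
            final[key] += subtotal
    return dict(final)


def run_query(shards):
    """Master side (step 3): send one fragment per shard, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(run_fragment, shards))
    return merge_partials(partials)


result = run_query(SHARDS)
print(result)
```

A sum merges cleanly because partial sums can simply be added. An aggregate such as an average would need each worker to return a sum and a count so the master can combine them correctly.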
  • 20. CitusDB: Standing on the Shoulders of Giants. [Two columns joined by "+".] Left: Mature, Battle-tested; Enterprise Class Features; Has an Elephant Mascot. Right: Proven Scalability; Cost Effectiveness; Has an Elephant Mascot.
  • 21. Leveraging PostgreSQL Performance: Cost-based Query Optimizer
      postgres=# EXPLAIN SELECT customer.c_custkey, sum((lineitem.l_extendedprice * (1::numeric - lineitem.l_discount)))
      ….
      -> Sort (cost=282459.19..282599.52 rows=56134 width=182)
            Sort Key: customer.c_custkey, customer.c_name
            Sort Method: external merge  Disk: 17192kB
      ….
      -> Hash Join (cost=39666.61..257246.25 rows=56134 width=16)
            Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
            -> Seq Scan on lineitem_102022 lineitem (cost=0.00..190571.11)
  • 22. Leveraging PostgreSQL Features: More than 300 Built-in Functions. QUOTE_LITERAL, REGR_SLOPE, COS, GREATEST, QUOTE_IDENT, SET_BYTE, STRING_TO_ARRAY, ENUM_RANGE, EXTRACT, REGR_SXY, REGR_R2, XMLFOREST, CONVERT_TO, NTH_VALUE, DIV, OVERLAPS, LAG, LAG, DATE_TRUNC, SIN, BTRIM, FLOOR, PI, FORMAT, TO_DATE, TRANSACTION_TIMESTAMP, LOWER, SQRT, TRUNC, ARRAY_AGG, LOWER_INC, REGR_SYY, CONCAT, RTRIM, STRIP, LTRIM, CHAR_LENGTH, IS FALSE, ARRAY_FILL, REGR_AVGY, XMLAGG, BETWEEN, CURRENT_TIMESTAMP, BROADCAST, JUSTIFY_DAYS, IS DISTINCT, UPPER, BOX, ARRAY_LENGTH, ISCLOSED, VAR_POP, TIMEOFDAY, COVAR_POP, CURRVAL, REPEAT, VAR_SAMP, OCTET_LENGTH, LN, NETMASK, LOCALTIME, UPPER, QUERY_TO_XML, STATEMENT_TIMESTAMP, TO_CHAR, FIRST_VALUE, LPAD, CASE, GET_BIT, TAN, TRUNC, LOWER_INF, REGR_AVGX, BOOL_AND, IS NOT UNKNOWN, ARRAY_APPEND, ISNULL, REGR_COUNT, DATE_PART, CORR, ENUM_LAST, XMLCOMMENT, SCHEMA_TO_XML, SET_MASKLEN, ARRAY_TO_STRING, XPATH_EXISTS, NUMNODE, REGEXP_MATCHES, COALESCE, NOW, EXTRACT, RADIUS, SPLIT_PART, CONVERT_FROM, ENUM_FIRST, ISOPEN, UPPER_INC, MOD, REPLACE, XPATH, BIT_AND, REGR_COUNT, TRANSLATE, AREA, EVERY, AT TIME ZONE, RADIANS, NOW, SQRT, ATAN2, IS TRUE, RANDOM, SUM, MIN, NOT LIKE, REGEXP_REPLACE, RPAD, CEILING, TRIM, TO_HEX, LOG, DECODE, NOW, WIDTH, STDDEV_POP, GET_BYTE, DATE_TRUNC, BOOL_OR, REGR_SXX, ROUND, LSEG, XML_IS_WELL_FORMED, VARIANCE, CUME_DIST, PATH, COVAR_SAMP, STRING_AGG, LASTVAL, UNNEST, OVERLAY, PERCENT_RANK, HOSTMASK, PCLOSE, HEIGHT, ANY, POINT, IN, ARRAY_DIMS, MASKLEN, DENSE_RANK, LOCALTIMESTAMP, JUSTIFY_INTERVAL, CURRENT_DATE, CURSOR_TO_XML, LIKE, SETVAL, LENGTH, POWER, UPPER_INF, GENERATE_SUBSCRIPTS, POSITION, LAST_VALUE, INITCAP, IS NOT TRUE, XMLAGG, PG_SLEEP, VAR_POP, STRPOS, SIGN, FORMAT, GENERATE_SERIES, STDDEV_SAMP, DENSE_RANK, COT, SUBSTR, REVERSE, REGR_INTERCEPT, SIMILAR TO, DATABASE_TO_XML, ARRAY_CAT, STDDEV, IS NOT FALSE, DIAMETER, NOTNULL, HOST, TO_ASCII, ABS, ROW_TO_JSON, ROW_NUMBER, SUBSTRING, SETSEED, ISFINITE, SOME, SET_BIT, ARRAY_NDIMS, REGEXP_SPLIT_TO_ARRAY, TO_TIMESTAMP, NOT, MD5
  • 23. Leveraging PostgreSQL Features. Extensible, Rich Type System. Pluggable Format Handlers. Security. Internationalization. Connectivity: ODBC, JDBC. Ecosystem Add-Ons: PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc.).
  • 24. Where Are We Headed? Distributed. SQL. Anywhere. [Diagram: CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) above three worker nodes, each with a Local Query Planner, Local Query Executor, and Foreign Data Wrapper; the three wrappers sit over HDFS (Hadoop Datanode), mongod (MongoDB Shard), and an RDBMS (RDBMS server), respectively.]
  • 25. Defining the New Generation of Distributed Analytic Databases. SQL → Ease of Use, Increased Productivity. Real-time responsiveness → Faster. Data Locality → Proven Scalability. Schema-on-Read → Flexibility, Lower Cost.
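To make "Schema-on-Read" concrete: the raw file is stored untouched, and a schema is applied only when the data is read. The sketch below is an illustration under assumptions, not how CitusDB or its foreign data wrappers work. `RAW_FILE`, both schemas, and `read_with_schema` are hypothetical names introduced here.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no schema attached.
RAW_FILE = "1,alice,10.50\n2,bob,3.25\n"

# Hypothetical schema: column names and the casts to apply at read time.
ORDERS_SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text, applying names and types only on read."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW_FILE, ORDERS_SCHEMA))

# A second hypothetical schema reads the same bytes differently;
# the stored file is not rewritten or reloaded.
NAMES_ONLY = [("order_id", str), ("customer", str)]
names = list(read_with_schema(RAW_FILE, NAMES_ONLY))
print(rows)
print(names)
```

The flexibility comes from being able to change or add schemas later without reloading the data. The trade-off is that parsing and casting happen on every read.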
  • 26. Where Are We At? CitusDB SQL on Hadoop is in Open Beta. Download our Binary Packages, or Use Our EC2 AMI: http://citusdata.com/docs/sql-on-hadoop
  • 27. We’re Hiring: http://citusdata.com/job
  • 28. For questions and more information: info@citusdata.com, (650) 566-9010
