SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

  1. SQL on Hadoop: Defining the New Generation of Analytic Databases
     Strata Conference, February 2013
  2. Speaker Bio: Carl Steinbach
     - Currently: Engineer at Citus Data; PMC Chair and Committer, Apache Hive Project; @cwsteinbach on Twitter
     - Formerly: Cloudera, Informatica, NetApp, Oracle
  3. This is going to sound strange, but… I used to think databases were boring.
  4. Why?
     - Undergrad at MIT, 1997-2001
     - Number of Database Classes: 0
     - Number of Database Faculty Members: 0
     - My Conclusion: Databases are a Dead Field
  5. Things Changed Over the Next Couple of Years
     - I got a job!
     - Database Group Formed at MIT (2003): Mike Stonebraker, Sam Madden
     - New Class: 6.830 Database Systems (2005)
  6. What Changed?
     - Web-scale Data
     - New DB Research: Columnar Storage, NoSQL
     - MPP Analytic Databases Gained Market Traction
     - GFS ('03) and MapReduce ('04) Papers
     - Apache Hadoop: v0.1.0 released in 2006
  7. What’s Good About Hadoop?
     - Commodity Storage
     - Scale-out
     - Flexibility: MapReduce, Multi-structured Data
  8. What’s Bad About Hadoop?
     - MapReduce!
     - No Schemas!
     - Missing Features: Optimizer, Indexes, Views
     - Incompatibility with Existing Tools: BI, ETL, IDEs
  9. Apache Hive Solved Many of These Problems
     [Architecture diagram. Labels:
     - User Client: Hive CLI; ETL, BI, SQL IDE; Hive ODBC/JDBC; SQL Queries
     - HiveServer2: SQL to MapReduce Compiler; Rule Based Optimizer; MR Plan Execution Coordinator
     - Hive MetaStore: Catalog Metadata; Table to Files; Table to Format
     - Three worker stacks, each: Map/Reduce, Hive Operators, Hive SerDes, HDFS datanode]
  10. But Other Problems Remained
     - MapReduce: Latency Overhead
     - Many Missing Features: ANSI SQL, Cost Based Optimizer, UDFs, Data Types, Security, …
  11. One Solution: Separate MPP DB Cluster
     [Diagram. MPP Database Cluster: an MPP Master Node (Global Query Executor) and four MPP Worker Nodes (each with a Local Query Executor). Hadoop Cluster: four HDFS datanodes.]
  12. One Solution: Separate MPP DB Cluster
     [Same diagram, with two labels added between the MPP worker nodes and the HDFS datanodes: "Pull Data to Work" and "IO Bottleneck".]
  13. Better Solution: A New Architecture for SQL on Hadoop
     [Diagram. An MPP Master Node (Global Query Executor) above four Local Query Executors, each paired with an HDFS datanode. Labels: "Push Work to Data" and "Maintain Data Locality".]
  14. The New Architecture in Detail: CitusDB
     [Architecture diagram. Labels:
     - CitusDB Master Node: Metadata; Distributed Query Planner; Distributed Query Executor
     - Clients: PostgreSQL Tools; ODBC/JDBC Clients
     - Hadoop Metadata: HDFS NameNode
     - Three worker stacks, each: Local Query Planner, Local Query Executor, Foreign Data Wrappers, HDFS datanode]
  15. The New Architecture in Detail: CitusDB
     [Same diagram, highlighting "Metadata Sync".]
     Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
  16. The New Architecture in Detail: CitusDB
     [Same diagram, highlighting "User Query".]
     Step 2) The user submits a SQL query to the CitusDB Master Node using the PostgreSQL CLI or a JDBC/ODBC app.
  17. The New Architecture in Detail: CitusDB
     [Same diagram, highlighting "Local Queries".]
     Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
  18. The New Architecture in Detail: CitusDB
     [Same diagram, highlighting "Local Results".]
     Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
  19. The New Architecture in Detail: CitusDB
     [Same diagram, highlighting "Query Results".]
     Step 5) The Master Node merges the partial result sets and returns the final result to the user.
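  20. CitusDB: Standing on the Shoulders of Giants
     [Two logos joined by "+"; the images are not preserved in this transcript. Each has its own column of attributes.]
     - First column: Mature, Battle-tested; Enterprise Class Features; Has an Elephant Mascot
     - Second column: Proven Scalability; Cost Effectiveness; Has an Elephant Mascot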
  21. Leveraging PostgreSQL Performance: Cost-based Query Optimizer
     postgres=# EXPLAIN SELECT
         customer.c_custkey,
         sum((lineitem.l_extendedprice * (1::numeric - lineitem.l_discount)))
     ….
     ->  Sort  (cost=282459.19..282599.52 rows=56134 width=182)
           Sort Key: customer.c_custkey, customer.c_name
           Sort Method: external merge  Disk: 17192kB
     ….
     ->  Hash Join  (cost=39666.61..257246.25 rows=56134 width=16)
           Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
           ->  Seq Scan on lineitem_102022 lineitem  (cost=0.00..190571.11)
  22. Leveraging PostgreSQL Features: More than 300 Built-in Functions
     QUOTE_LITERAL, REGR_SLOPE, COS, GREATEST, QUOTE_IDENT, SET_BYTE,
     STRING_TO_ARRAY, ENUM_RANGE, EXTRACT, REGR_SXY, REGR_R2, XMLFOREST,
     CONVERT_TO, NTH_VALUE, DIV, OVERLAPS, LAG, LAG,
     DATE_TRUNC, SIN, BTRIM, FLOOR, PI, FORMAT,
     TO_DATE, TRANSACTION_TIMESTAMP, LOWER, SQRT, TRUNC, ARRAY_AGG,
     LOWER_INC, REGR_SYY, CONCAT, RTRIM, STRIP, LTRIM,
     CHAR_LENGTH, IS FALSE, ARRAY_FILL, REGR_AVGY, XMLAGG, BETWEEN,
     CURRENT_TIMESTAMP, BROADCAST, JUSTIFY_DAYS, IS DISTINCT, UPPER, BOX,
     ARRAY_LENGTH, ISCLOSED, VAR_POP, TIMEOFDAY, COVAR_POP, CURRVAL,
     REPEAT, VAR_SAMP, OCTET_LENGTH, LN, NETMASK, LOCALTIME,
     UPPER, QUERY_TO_XML, STATEMENT_TIMESTAMP, TO_CHAR, FIRST_VALUE, LPAD,
     CASE, GET_BIT, TAN, TRUNC, LOWER_INF, REGR_AVGX,
     BOOL_AND, IS NOT UNKNOWN, ARRAY_APPEND, ISNULL, REGR_COUNT, DATE_PART,
     CORR, ENUM_LAST, XMLCOMMENT, SCHEMA_TO_XML, SET_MASKLEN, ARRAY_TO_STRING,
     XPATH_EXISTS, NUMNODE, REGEXP_MATCHES, COALESCE, NOW, EXTRACT,
     RADIUS, SPLIT_PART, CONVERT_FROM, ENUM_FIRST, ISOPEN, UPPER_INC,
     MOD, REPLACE, XPATH, BIT_AND, REGR_COUNT, TRANSLATE,
     AREA, EVERY, AT TIME ZONE, RADIANS, NOW, SQRT,
     ATAN2, IS TRUE, RANDOM, SUM, MIN, NOT LIKE,
     REGEXP_REPLACE, RPAD, CEILING, TRIM, TO_HEX, LOG,
     DECODE, NOW, WIDTH, STDDEV_POP, GET_BYTE, DATE_TRUNC,
     BOOL_OR, REGR_SXX, ROUND, LSEG, XML_IS_WELL_FORMED, VARIANCE,
     CUME_DIST, PATH, COVAR_SAMP, STRING_AGG, LASTVAL, UNNEST,
     OVERLAY, PERCENT_RANK, HOSTMASK, PCLOSE, HEIGHT, ANY,
     POINT, IN, ARRAY_DIMS, MASKLEN, DENSE_RANK, LOCALTIMESTAMP,
     JUSTIFY_INTERVAL, CURRENT_DATE, CURSOR_TO_XML, LIKE, SETVAL, LENGTH,
     POWER, UPPER_INF, GENERATE_SUBSCRIPTS, POSITION, LAST_VALUE, INITCAP,
     IS NOT TRUE, XMLAGG, PG_SLEEP, VAR_POP, STRPOS, SIGN,
     FORMAT, GENERATE_SERIES, STDDEV_SAMP, DENSE_RANK, COT, SUBSTR,
     REVERSE, REGR_INTERCEPT, SIMILAR TO, DATABASE_TO_XML, ARRAY_CAT, STDDEV,
     IS NOT FALSE, DIAMETER, NOTNULL, HOST, TO_ASCII, ABS,
     ROW_TO_JSON, ROW_NUMBER, SUBSTRING, SETSEED, ISFINITE, SOME,
     SET_BIT, ARRAY_NDIMS, REGEXP_SPLIT_TO_ARRAY, TO_TIMESTAMP, NOT, MD5
  23. Leveraging PostgreSQL Features
     - Extensible, Rich Type System
     - Pluggable Format Handlers
     - Security
     - Internationalization
     - Connectivity: ODBC, JDBC
     - Ecosystem Add-Ons: PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc.)
  24. Where are We Headed? Distributed. SQL. Anywhere.
     [Architecture diagram. Labels:
     - CitusDB Master Node: Metadata; Distributed Query Planner; Distributed Query Executor
     - Three worker stacks, each: Local Query Planner, Local Query Executor, Foreign Data Wrapper
     - Data sources under the workers: HDFS (Hadoop Datanode), mongod (MongoDB Shard), RDBMS (RDBMS server)]
  25. Defining the New Generation of Distributed Analytic Databases
     - SQL → Ease of Use, Increased Productivity
     - Real-time responsiveness → Faster
     - Data Locality → Proven Scalability
     - Schema-on-Read → Flexibility, Lower Cost
  26. Where Are We At?
     - CitusDB SQL on Hadoop is in Open Beta
     - Download our Binary Packages
     - Or Use Our EC2 AMI
     - http://citusdata.com/docs/sql-on-hadoop
  27. We’re Hiring
     http://citusdata.com/job
  28. For questions and more information: info@citusdata.com, (650) 566-9010
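
The five-step query flow on slides 15 to 19 (fragment queries go out to the workers, partial result sets come back, the master merges them) can be sketched in a few lines of Python. This is only an illustration of the scatter/gather pattern, not CitusDB's actual implementation: the shard names, the sample rows, and the functions `run_fragment`, `merge_partials`, and `run_query` are all hypothetical, and a thread pool stands in for network calls to worker nodes. The aggregate mirrors the revenue expression in the EXPLAIN example on slide 21, `sum(l_extendedprice * (1 - l_discount))` grouped by customer key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents: each "worker" holds the rows stored on its
# own datanode. Rows are (customer_key, extended_price, discount).
SHARDS = {
    "worker-1": [(1, 100.0, 0.10), (2, 50.0, 0.00)],
    "worker-2": [(1, 200.0, 0.50), (3, 80.0, 0.25)],
    "worker-3": [(2, 10.0, 0.00)],
}


def run_fragment(rows):
    """Worker side (step 4): aggregate local rows into a partial result set."""
    partial = defaultdict(float)
    for key, price, discount in rows:
        partial[key] += price * (1 - discount)
    return dict(partial)


def merge_partials(partials):
    """Master side (step 5): a sum of partial sums is the global sum."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def run_query(shards):
    """Master side (step 3): send one fragment per shard, then merge."""
    # Threads are an assumption standing in for remote worker processes.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, shards.values()))
    return merge_partials(partials)


result = run_query(SHARDS)
```

Only the small per-key partial sums cross the boundary between workers and master, while the raw rows stay where they are stored. That is the "push work to data" idea from slide 13, in contrast to the "pull data to work" IO bottleneck on slide 12. SUM merges this simply because it is decomposable; an aggregate such as AVG would need each worker to return both a sum and a count.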
