
SQL In/On/Around Hadoop

Chris Twogood, Vice President Product and Services Marketing
Fawad Qureshi, Principal Consultant, Big Data
Teradata



  1. SQL In/On/Around Hadoop – Hadoop Summit, 2015. Chris Twogood, Vice President Product and Services Marketing; Fawad Qureshi, Principal Consultant, Big Data; Teradata
  2. Over 12 SQL Interfaces for Hadoop: Apache Drill, Apache Hive, Apache Phoenix, Apache Spark SQL, Apache Tajo, Cloudera Impala, IBM Big SQL, Oracle Big Data SQL, Pivotal HAWQ, Presto, Splice Machine, SQLstream, Teradata QueryGrid. Source: Gartner Market Guide for Hadoop Distributions, 06 January 2015
  3. Query Processing on Hadoop – five approaches:
     • Raw Map Reduce
     • RDBMS On Top Of Hadoop
     • Query Engine Using HDFS Files
     • RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
     • Virtualization Layer Over All Data Sources
  4. Query Processing on Hadoop – Raw Map Reduce
     • Native Map Reduce processing
     • Direct commands to Hadoop and HDFS
     • "Data manipulation" more than "query processing"
     • Programming and Map Reduce skills required
     • Batch-processing focused
     • Full flexibility to operate on any data in HDFS
  5. Query Processing on Hadoop – RDBMS On Top of Hadoop
     • RDBMS installed on the Hadoop cluster
     • Proprietary data dictionary/metadata
     • Proprietary data format within HDFS files
     • Data types may be limited
     • SQL query engine; SQL language, but standards compatibility varies
     • Query engine maturity varies
     • Data not portable; cannot be read by other systems/engines
     • Example: Pivotal HAWQ (a hedged sketch follows)
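     To make the trade-off concrete, here is a minimal, hedged sketch of the HAWQ style (table and column names are hypothetical, not from the deck): the DDL and queries are ordinary SQL, but the resulting files in HDFS are in the engine's own format.

         -- Hypothetical HAWQ-style DDL: standard SQL on the surface,
         -- but the table's on-disk format in HDFS is proprietary,
         -- so other Hadoop engines cannot read the files directly.
         CREATE TABLE sales_fact (
             sale_id     BIGINT,
             store_id    INTEGER,
             sale_amount NUMERIC(12,2),
             sale_date   DATE
         )
         DISTRIBUTED BY (sale_id);  -- distribution clause typical of Greenplum-derived engines

         -- Queried like any RDBMS table:
         SELECT store_id, SUM(sale_amount) AS total_sales
         FROM sales_fact
         WHERE sale_date >= DATE '2015-01-01'
         GROUP BY store_id;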
  6. Query Processing on Hadoop – Query Engine Using HDFS Files
     • SQL query engine on the Hadoop cluster
     • Standard data dictionary/metadata (e.g., Hive)
     • Standard data format within HDFS files (e.g., ORC files)
     • Data types may be limited
     • SQL language, but standards compatibility varies
     • Query engine maturity varies
     • Data "portable": can be read by other systems/engines
     • Examples: IBM Big SQL, Cloudera Impala (a hedged sketch follows)
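     A hedged sketch of this pattern (table and column names are hypothetical): the table definition lives in the shared Hive metastore and the data is written as ORC files in HDFS, so any engine that understands both can query the same table.

         -- Hypothetical HiveQL: metadata goes to the shared metastore,
         -- data is stored as portable ORC files in HDFS.
         CREATE TABLE web_clicks (
             user_id  BIGINT,
             url      STRING,
             click_ts TIMESTAMP
         )
         STORED AS ORC;

         -- The same table can then be queried by Hive or, through the
         -- shared metastore, by other engines such as IBM Big SQL.
         SELECT user_id, COUNT(*) AS clicks
         FROM web_clicks
         GROUP BY user_id;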
  7. Query Processing on Hadoop – RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
     • External RDBMS sends (part of) each query to an engine on Hadoop
     • Standard data dictionary/metadata within the Hadoop cluster (e.g., Hive)
     • Standard data format within HDFS files (e.g., ORC)
     • Data types may be limited by the engine on Hadoop and the external RDBMS
     • SQL query engine capabilities are a combination of the external and Hadoop engines
     • Combines data and analytics in two systems
     • SQL language; standards compatibility generally high
     • Query engine generally mature
     • Data in Hadoop is "portable" and can be read by other systems/engines
     • Example: Teradata QueryGrid (syntax example on slide 14)
  8. Query Processing on Hadoop – Virtualization Layer Over All Data Sources
     • External virtualization software sends (part of) each query to an engine on Hadoop
     • Standard data dictionary/metadata within the Hadoop cluster (e.g., Hive)
     • Standard data format within HDFS files (e.g., ORC)
     • Data types may be limited by the engine on Hadoop and the external virtualization software
     • SQL capabilities are a combination of the external engines, the Hadoop engines, and the virtualization layer's limitations
     • Combines data and analytics in two systems
     • Adds an extra layer and/or data movement
     • SQL language; standards compatibility generally high
     • Query engine maturity and utilization of underlying engines vary
     • Data in Hadoop is "portable" and can be read by other engines
     • Example: Cisco Data Virtualization Platform (formerly Composite Software); a hedged sketch follows
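     A hedged sketch of the virtualization pattern (generic SQL; the view and source names are hypothetical and not Cisco-specific syntax): the layer exposes a single virtual view over two sources and splits each query between the underlying engines at run time.

         -- Hypothetical federated view spanning two sources:
         CREATE VIEW all_orders AS
             SELECT order_id, customer_id, order_amount
             FROM rdbms_source.orders           -- lives in the external RDBMS
             UNION ALL
             SELECT order_id, customer_id, order_amount
             FROM hive_source.orders_archive;   -- lives in Hive/HDFS

         -- Users query the view; the layer decides what each engine runs.
         SELECT customer_id, SUM(order_amount) AS total_spend
         FROM all_orders
         GROUP BY customer_id;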
  9. Shift from a Single Platform to an Ecosystem: the "Logical" Data Warehouse
     "We will abandon the old models based on the desire to implement for high-value analytic applications."
  10. Data Fabric Vision Enabled by QueryGrid: analytic flexibility to meet your business needs
     • Pick your best-of-breed technology:
       – Data types
       – Analytic engines
       – Economic options
       – File systems
       – Operating systems
     • With different characteristics:
       – CPU centric
       – I/O centric
       – Data volume centric
       – Workload characteristics and volume
       – Availability/DR
       – Service level agreements
     • Users direct their queries to a single cohesive data fabric
     • Focus on data and business questions, not on integrating separate systems
  11. Customer Value Based on Social Influence Use Case, spanning Hadoop, the Teradata Aster Database, and the Teradata Database (a hedged sketch follows):
     • Determine high-value customers based on history
     • Determine customer value based on social influence
     • Determine customer sentiment
     • Determine customer sphere of influence
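     A hedged sketch of what the multi-system part of this use case might look like, modeled on the QueryGrid foreign-table grammar shown on slide 14 (all table, column, and server names here are hypothetical):

         -- Hypothetical: join in-database customer history with an
         -- influence score computed on Hadoop; only the qualifying
         -- rows and columns move between systems.
         SELECT c.Customer_ID,
                c.Lifetime_Value,
                i.Influence_Score
         FROM TD_Customers c
         JOIN FOREIGN TABLE
              (SELECT Customer_ID, Influence_Score
               FROM Social_Influence)@Hadoop i
           ON c.Customer_ID = i.Customer_ID
         WHERE c.Lifetime_Value > 10000;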
  12. Teradata QueryGrid™: optimize, simplify, and orchestrate processing across and beyond the Teradata UDA
     • Automated and optimized work distribution through "push-down" processing across platforms
       – Minimize data movement; process data where it resides
       – Minimize data duplication
       – Transparently automate analytic processing and data movement between systems
       – Bi-directional data movement
     • Run the right analytic on the right platform: take advantage of specialized processing engines while operating as a cohesive analytic environment
     • Integrated processing, within and outside the UDA
     • Easy access to data and analytics through existing SQL skills and tools
  13. Teradata Database 15 – Teradata QueryGrid: leverage analytic resources, reduce data movement (Integrated Data Warehouse: Teradata Database; Data Platform: Hadoop)
     • Parallel bi-directional data transfer
     • Push-down processing
     • Native analytics on the target system
     • Easy configuration of server connections (see the sketch after this list)
     • Simplified server grammar
     • Adaptive Optimizer
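     The deck does not show the configuration step itself; as a rough, hedged sketch, in Teradata Database 15.0 a Hadoop connection is defined once as a named foreign server that queries can then reference with @. The clause names and values below are illustrative assumptions; the exact grammar is in the QueryGrid documentation.

         -- Illustrative foreign-server definition (options assumed):
         CREATE FOREIGN SERVER Hadoop
         USING
             hosttype('hadoop')                       -- illustrative name-value pairs
             server('hadoop-master.example.com')
             port('9083')
         DO IMPORT WITH SYSLIB.LOAD_FROM_HCATALOG,    -- table operators used by QueryGrid for Hadoop
         DO EXPORT WITH SYSLIB.LOAD_TO_HCATALOG;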
  14. Deep History – QueryGrid Teradata 15.00 Use Case
     (Diagram: years 1-5 of transaction history reside in the Teradata Database; years 5-10 in Hadoop.)

         SELECT Trans.Trans_ID
               ,Trans.Trans_Amount
         FROM TD_Transactions Trans
         WHERE Trans_Amount > 5000
         UNION
         SELECT *
         FROM FOREIGN TABLE
              (SELECT Trans_ID
                     ,Trans_Amount
               FROM Transaction_Hist
               WHERE Trans_Amount > 5000)@Hadoop Hist;

     – Pushes the "foreign table" SELECT down to Hive for execution
     – Imports only the required columns into Teradata
     – Allows predicate processing of conditions on non-partitioned columns
     – Uses the Hadoop cluster's resources for data qualification
  15. Adaptive Optimizer: incremental planning and execution of smaller query fragments
     • Most efficient overall query plan derived from reliable statistics
       – Statistics dynamically collected from foreign data
     • Incremental query plans generated for single- and multi-system queries
       – Consistent optimizer approach for queries within and between systems
       – Teradata systems "transfer" query plans between systems
     • A fully automatic optimizer feature: users don't have to change anything
     • Why? Unreliable statistics can result in less-than-optimal query plans; some analytic systems, like Hadoop, don't keep data statistics; and statistics are not designed for compatibility between databases.
     • How? Pull remote-server requests and single-row and scalar non-correlated sub-queries out of the main query; plan and execute them; plug the results into the main query; then plan and execute the main query. (A hedged example follows.)
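     A hedged example of the "how" (names hypothetical, foreign-table grammar as on slide 14): the scalar, non-correlated subquery against Hadoop can be pulled out and executed first, and its single-row result plugged into the main query before the main plan is built, so the plan rests on a known value rather than on missing Hadoop statistics.

         -- Hypothetical: the optimizer executes the scalar subquery on
         -- Hadoop first, then plans the main query with its result.
         SELECT Trans_ID, Trans_Amount
         FROM TD_Transactions
         WHERE Trans_Amount >
               (SELECT AVG(Trans_Amount)
                FROM FOREIGN TABLE
                     (SELECT Trans_Amount
                      FROM Transaction_Hist)@Hadoop Hist);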
  16. QueryGrid provides parallel transfer
  17. QueryGrid Architecture Advantages
     • Designed for analytics across the enterprise
       – Generalized architecture with tuned connections
       – Extends the Integrated Data Warehouse; not trying to be a general-purpose top-tier engine
     • Combines the core curated data warehouse with other data repositories (e.g., a data lake)
     • Minimizes layers and data movement
       – Combines processing engines without an extra control or virtualization layer
       – Pushes processing to the data
       – Moves only the data required when combining with data or analytics on the other system
  18. The evolution of the data warehouse:
     • Data Mart (1990s): "Just give me some data, and fast!"
     • EDW/IDW (2000s): "Give me good data, but do it efficiently!"
     • Logical Data Warehouse (2010s): "Give me all data: fast, simple, and effective!"
  19. Teradata QueryGrid: A Customer Example – Warranty Data Analysis
     • Complex, multi-structured data sets
     • Huge volumes
     • Retention of the data sets is a problem
     • Difficulty correlating the data sets
  20. Pareto Rule: the largest data set may not be the most complex
     (Chart: 80/20 splits of complexity, volume, and queries.)
  21. The Pareto Effect in Factory Test Data – share of total factory test data by table:
     • Result: 95.75%
     • Repair_Fail: 0.95%
     • Log: 0.83%
     • Repair_Order: 0.76%
     • Test_header: 0.25%
     • PSN: 0.13%
     • Repair_details: 0.03%
     • Flash_file: 0.02%
  22. Combining Hadoop and Relational Engines
