SQL On Hadoop
These slides were presented on May 18, 2016 at UAE Big data meetup.


  1. SQL ON HADOOP
     EMC Corporation. All rights reserved.
  2. AGENDA
     • Introduction
     • Hive
     • HAWQ
     • Impala
     • SparkSQL
     • HBase + Phoenix
     • Drill
     • Networking & Pizza
  3. INTRODUCTION: A SURVEY
     • How many developers?
  4. INTRODUCTION: A SURVEY
     • How many BI/SQL developers?
  5. INTRODUCTION: A SURVEY
     • How many business analysts/sales?
  6. INTRODUCTION: A SURVEY
     • How many have used Hadoop?
  7. INTRODUCTION: A SURVEY
     • How many have used SQL on Hadoop?
  8. WHAT IS HADOOP
     • Hadoop is an open-source framework for large-scale data storage and processing.
  9. ABOUT THE HOSTS
     • Application Workgroup in EMC
       – Focused on
         • Big data development/infrastructure
         • Application modernization
         • DevOps
  10. ABOUT THE HOSTS: APPLICATION WORKGROUP IN EMC
      • Fahim Kundi
        – 10+ years of experience in EDW and big data
      • Haden Pareira
        – Data engineer with 5+ years of Hadoop experience
      • Muhammad Ali
        – Data engineer with 2+ years of Hadoop experience
  11. WHAT IS HADOOP
  12. WHAT IS HADOOP
      • HDFS is a file system: it's all files
      • MapReduce requires strong programming skills
      • It's so difficult
  13. SQL ON HADOOP
      • SQL is well known in the analytics community
      • Faster and easier data insights
      • Allows SQL/BI developers to retain their expertise and create value from big data
  14. SQL ON HADOOP
      • Cloudera – Impala
      • Hortonworks – Hive/Tez
      • Pivotal – HAWQ … now HDB
      • MapR – Drill
      • IBM – Big SQL
  15. HIVE
  16. Hive and HAWQ, by Fahim Kundi
  17. CONTENTS
      • Hive Introduction
      • How Hive Works
      • Apache Tez
      • Hive with Tez vs MapReduce
      • ORC and Parquet Format
      • HAWQ Introduction
      • Query Optimizer
      • PXF
  18. HIVE INTRODUCTION (1)
      • Apache Hive provides a high-level query language and data warehouse features built on top of Hadoop.
      • It was initially developed at Facebook and made open source in 2008.
      • SQL-like query language called HQL.
      • Partitioning and bucketing for faster query processing.
      • Integration with visualization tools such as Tableau.
  19. HIVE INTRODUCTION (2)
      • Hive supports all the common primitive data types, such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING, and TIMESTAMP.
      • In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
  20. HOW HIVE WORKS (1)
      • The tables in Hive are similar to tables in a relational database.
      • Databases are composed of tables, which are made up of partitions.
      • Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
      • Internally, Hive queries are converted to MapReduce programs or Tez jobs.
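
As a rough illustration of the last point, the sketch below shows how a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` breaks down into map, shuffle, and reduce phases. The `emp` table, its rows, and the helper functions are hypothetical; Hive's actual compiler generates MapReduce or Tez tasks, not this Python code.

```python
from collections import defaultdict

# Hypothetical rows of an `emp` table: (name, dept)
ROWS = [("ann", "sales"), ("bob", "eng"), ("cy", "sales"), ("di", "eng"), ("ed", "ops")]

def map_phase(rows):
    """Emit (group key, 1) for every input row, like a mapper for COUNT(*)."""
    for _name, dept in rows:
        yield dept, 1

def shuffle(pairs):
    """Group mapper output by key, as the shuffle/sort step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values per key, like a reducer for COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(ROWS)))
print(result)
```
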
  21. HOW HIVE WORKS (2)
      • Within a particular database, data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory.
      • Each table can be subdivided into partitions that determine how data is distributed within subdirectories of the table directory.
      • Data within partitions can be further broken down into buckets.
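
The directory layout described on this slide can be sketched as follows. This is a simplified illustration, not Hive code: the warehouse root is assumed to be the common default, the `sales.orders` table and its partition columns are hypothetical, and the bucket function uses the integer key as its own hash value.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # assumed default warehouse location

def partition_dir(db, table, partition_spec):
    """Build the directory for one partition: <warehouse>/<db>.db/<table>/<col>=<val>/..."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(WAREHOUSE, f"{db}.db", table, *parts)

def bucket_for(key, num_buckets):
    """Simplified bucket assignment: hash of the bucketing column modulo the bucket count.
    For illustration, the integer key is used as its own hash value."""
    return key % num_buckets

path = partition_dir("sales", "orders", [("year", 2016), ("month", 5)])
bucket = bucket_for(1234, 8)
print(path, bucket)
```

A query that filters on `year` and `month` only needs to read files under the matching subdirectory, which is why partitioning speeds up query processing.
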
  22. APACHE TEZ (1)
      • Apache Tez is a new distributed execution framework targeted at data-processing applications on Hadoop.
      • Tez was developed by Hortonworks and is built on top of YARN (the resource management framework for Hadoop).
      • Tez generalizes MapReduce into a more powerful framework by creating a dataflow graph for the job executed by the user. (Example)
  23. APACHE TEZ (2)
      • The Tez API has the following components:
        – DAG (Directed Acyclic Graph): defines the overall job. One DAG object corresponds to one job.
        – Vertex: defines the user logic along with the resources and the environment needed to execute it. One Vertex corresponds to one step in the job.
        – Edge: defines the connection between producer and consumer vertices.
      • Tez is not meant directly for end users; rather, it enables developers to build end-user applications with much better performance and flexibility.
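
The DAG, vertex, and edge concepts can be illustrated with a small sketch. It does not use the Tez API (which is Java); the job and vertex names are hypothetical, and Python's standard `graphlib` module stands in for the scheduler that orders the steps.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two scans feed a join, which feeds an aggregation.
# Each key is a vertex; its set holds the producer vertices it consumes from (the edges).
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

# A valid execution order runs every producer before its consumers.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Expressing the job as one graph is what lets an engine run the whole pipeline as a single job, instead of chaining several separate map and reduce jobs.
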
  24. EXAMPLE OF HIVE WITH TEZ VS MAPREDUCE
  25. ORC FILE
      • ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads.
      • ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads.
      • ORC features:
        – Columnar format for complex data types
        – Built into Hive from 0.11
        – Support for Pig and MapReduce via HCat
        – Two levels of compression
          • Lightweight, type-specific
          • General
        – Built-in indexes
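
The two levels of compression can be sketched roughly as below. This is an illustration only: run-length encoding stands in for the lightweight type-specific step, `zlib` stands in for the general codec, and the column data is hypothetical. ORC's real encoders are more elaborate.

```python
import zlib

def run_length_encode(values):
    """Lightweight, type-aware step: collapse runs of equal values into [value, count]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Hypothetical column of country codes, stored together because the layout is columnar.
column = ["AE"] * 500 + ["PK"] * 300 + ["IN"] * 200
runs = run_length_encode(column)

# General-purpose step: compress the encoded bytes with a generic codec.
encoded = ";".join(f"{v}:{n}" for v, n in runs).encode()
compressed = zlib.compress(encoded)
print(runs, len(compressed))
```
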
  26. ORC FILE LAYOUT
  27. PARQUET
      • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
      • Parquet features:
        – Columnar file format
        – Supports nested data structures
        – Accessible by Hive, Spark, Pig, Drill, MR
        – Read/write in HDFS or the local file system
  28. PARQUET FILE LAYOUT
  29. ORC VS PARQUET
      • Major considerations for choosing ORC over Parquet:
        – Many of the performance improvements provided by the Stinger initiative depend on features of the ORC format, including a block-level index for each column. This can make I/O more efficient, because Hive can skip reading entire blocks of data when it determines that the predicate values are not present in them.
        – The Cost Based Optimizer can consider column-level metadata present in ORC files in order to generate the most efficient graph.
        – ACID transactions are only possible when using ORC as the file format.
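
The block-skipping idea in the first consideration can be sketched as follows. The block statistics and the `scan_equal` helper are hypothetical and much simpler than ORC's actual index structures; the point is only that min/max metadata lets a reader rule out a block without touching its rows.

```python
# Hypothetical per-block statistics for one column, plus the rows in each block.
blocks = [
    {"min": 1, "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]

def scan_equal(blocks, target):
    """Read only blocks whose [min, max] range could contain the target value."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block["min"] or target > block["max"]:
            continue  # skip the whole block without reading its rows
        blocks_read += 1
        matches.extend(v for v in block["rows"] if v == target)
    return matches, blocks_read

matches, blocks_read = scan_equal(blocks, 150)
print(matches, blocks_read)
```
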
  30. FILE SIZE COMPARISON
  31. HAWQ INTRODUCTION
      • HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS as its storage layer.
      • HAWQ's query processing evolved from the Greenplum Database query planner, and it does not rely on MapReduce under the hood.
      • HAWQ reads data from and writes data to HDFS natively.
      • It also has extensions (PXF) that allow it to interact with data contained in other services (HBase, Hive, Avro, etc.) that also reside in HDFS.
  32. HAWQ FEATURES
      • HAWQ provides all major features found in Greenplum Database:
        – SQL completeness: 2003 extensions
        – JDBC compliant
        – Robust query optimizer
        – Row- or column-oriented table storage
        – Parallel loading and unloading
        – Distributions
        – Multi-level partitioning
        – High-speed data redistribution
        – Views
        – External tables
        – Compression
        – Resource management
        – Security
        – Authentication
        – Management and monitoring
  33. HAWQ ARCHITECTURE
      (Diagram; labels recovered from the slide)
      • HAWQ Master: Parser, Query Optimizer, PXF, Local Storage
      • HAWQ Standby Master
      • Interconnect
      • Segment Host (shown twice): Query Executor, PXF, Segment [Segment …], Local Temp Storage, HDFS DataNode
      • HDFS NameNode, HDFS Secondary NameNode
  34. HAWQ PARALLEL QUERY OPTIMIZER
      • Turns a SQL query into an execution plan
      • Cost-based optimizer
      • Example plan operators (diagram): Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion, Seq Scan on nation
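
The Redistribute Motion and Broadcast Motion operators named in this plan move rows between segments so that joins can run locally. The sketch below is a loose illustration under assumptions: the segment count, the sample rows, and the modulo-based placement are hypothetical and are not HAWQ's implementation.

```python
NUM_SEGMENTS = 3  # assumed segment count for the illustration

def redistribute(rows, key_index):
    """Redistribute motion: send each row to the segment chosen by hashing its join key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[row[key_index] % NUM_SEGMENTS].append(row)
    return segments

def broadcast(rows):
    """Broadcast motion: every segment receives a full copy of a (small) table."""
    return [list(rows) for _ in range(NUM_SEGMENTS)]

orders = [(1, "o-a"), (2, "o-b"), (3, "o-c"), (4, "o-d")]  # (custkey, order id)
nation = [(0, "UAE"), (1, "PK")]                           # small dimension table

print(redistribute(orders, 0))
print(broadcast(nation))
```

Redistribution suits large tables, since each row is sent once; broadcasting is cheaper only when the table is small enough to copy to every segment.
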
  35. PIVOTAL EXTENSION FRAMEWORK (PXF)
      • PXF is a fast, extensible framework that connects HAWQ to an HDFS data store of choice exposing a parallel API
        – An advanced version of external tables
        – Enables combining HAWQ data and Hadoop data in a single query
        – Supports connectors for HDFS, HBase, and Hive
        – Provides an extensible framework API to enable custom connector development for any data source
      (Diagram: HDFS, HBase, and Hive connected through the Extension Framework)
  36. Muhammad Ali
      Image courtesy of Cloudera
  37. IMPALA OVERVIEW
      • Interactive query on top of Hadoop
      • ANSI-92 SQL standard
      • Native MPP query engine
      • Written in C++
  38. IMPALA OVERVIEW
      • Native to Hadoop
        – Blends with the ecosystem
        – Security
        – Hive MetaStore / HCatalog
        – Query existing HDFS data
      • Not as fault-tolerant as MapReduce
        – (or Hive or SparkSQL or …)
        – If a single node fails during a query, the whole query fails
        – But if it's 20x faster, you can rerun and still finish faster ;)
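
The rerun argument can be checked with a little arithmetic. This is a back-of-the-envelope sketch: the 60-minute batch time and 10% failure probability are made-up inputs, and each failed attempt is assumed to cost a full run and to fail independently of the others.

```python
def expected_runtime(single_run_minutes, failure_probability):
    """Expected total time when a failed query is rerun from scratch until it succeeds.
    Each attempt is assumed to fail independently and to cost a full run."""
    expected_attempts = 1 / (1 - failure_probability)
    return single_run_minutes * expected_attempts

batch = 60.0              # hypothetical fault-tolerant batch query, in minutes
interactive = batch / 20  # the slide's "20x faster" assumption
print(expected_runtime(interactive, 0.10))
```

Under these assumptions the failure probability would have to be very high before reruns cost more than the slower, fault-tolerant run.
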
  39. IMPALA ARCHITECTURE
      Image courtesy of Cloudera
  40. IMPALA: WHERE IT SHINES
      • Query execution times (small to medium size)
      • Parquet format
        – Compression
      • High concurrency – kills the competitors
      • Partitioning
      • Query optimizer (compute statistics!)
  41. IMPALA DEMO
  42. WHAT THE HELL IS KUDU!
      • Distributed columnar storage manager
      • Performance of Parquet
        – Great for analytical queries
      • Mutability of HBase
        – Supports UPDATE/DELETE, unlike Parquet
      • One common storage to rule them all!
        – (not exactly!)
  43. WHERE DO YOU POSITION KUDU?
  44. KUDU USE CASES
      • IoT use cases
        – High-velocity data
        – Same data read for analytical queries in near real time
      • Predictive modeling
        – Large datasets updated frequently
        – Retraining models
      • Time-series applications
        – Kudu offers compound keys and hash-based partitioning
        – Avoids hot spotting
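
Why hash-based partitioning avoids hot spotting for time-series data can be seen in a small simulation. It does not use Kudu; the sensor rows, the range boundaries, and the CRC32-based hash are hypothetical choices for the illustration.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4
# Hypothetical time-series rows: (sensor_id, timestamp), arriving in time order.
rows = [(f"sensor-{i % 50}", 1_463_500_000 + i) for i in range(1000)]

def range_partition(ts, boundaries):
    """Range partitioning on timestamp: new rows always fall in the latest range."""
    for idx, upper in enumerate(boundaries):
        if ts < upper:
            return idx
    return len(boundaries)

def hash_partition(sensor_id):
    """Hash partitioning on one column of the compound key spreads concurrent writes."""
    return zlib.crc32(sensor_id.encode()) % NUM_PARTITIONS

boundaries = [1_463_000_000, 1_463_200_000, 1_463_400_000]  # upper bounds of older ranges
by_range = Counter(range_partition(ts, boundaries) for _, ts in rows)
by_hash = Counter(hash_partition(sid) for sid, _ in rows)
print(by_range, by_hash)
```

With range partitioning on the timestamp alone, every new row lands in the most recent partition; hashing on the sensor ID spreads the same writes across several partitions.
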
  45. IMPALA DEMO
  46. SPARK
  47. 2 MIN INTRO TO SPARK
      • General-purpose distributed computing system
        – Multiple language support (Java, Scala, Python, and R)
        – Fault tolerance, data distribution, in-memory caching, etc.
      • RDD
        – Resilient distributed datasets
      • Operations
        – Transformations (define new RDDs)
        – Actions (return a value)
      • No nonsense
        – Up to 100x faster than MapReduce for in-memory workloads
        – Disk used only when it can't be avoided
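
The difference between transformations and actions can be illustrated with a toy class. This is not Spark or its API: `LazyDataset` is a hypothetical single-machine stand-in that only shows how transformations are recorded and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def map(self, fn):
        # Transformation: returns a new dataset and executes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset and executes nothing.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # Action: evaluates the recorded steps and returns the values.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        # Action: evaluates the recorded steps and returns the row count.
        return len(self.collect())

calls = []

def square(x):
    calls.append(x)  # record each invocation to show when work actually happens
    return x * x

squares = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
ran_before_action = list(calls)  # still empty: no transformation has executed
print(squares.collect())
```
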
  48. 2 MIN INTRO TO SPARK
      Image courtesy: Sachin Parmar, http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?
  49. SPARKSQL
  50. SPARKSQL
      • Structured data processing
        – Commonly known to us as tables
      • Integrated into the Spark programming model
      • Unified data access
      • Scalability
      • Support for HiveQL
      • Cache it!
  51. SPARKSQL
      • Two APIs
        – DataFrames
          • Data organized into named columns
          • Similar to tables
          • Can be constructed from structured data files, Hive, external DBs
        – DataSets
          • Experimental interface
          • Strongly typed; uses the SQL execution engine
          • Can be constructed from regular JVM objects
  52. SPARKSQL ARCHITECTURE
  53. DEMO: SPARKSQL ON HADOOP