H104: Harnessing the Hadoop Ecosystem
Optimizations in Apache Hive
Jason Huang, Senior Solutions Architect – Qubole, Inc.
May 12, 2015
NYC Data Summit Hadoop Day
A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive project.
Based in Mountain View, CA with offices in Bangalore,
India. Investors include Charles River, LightSpeed,
and Norwest Ventures.
2015 CNBC Disruptor 50 Companies – announced today!
World class product and engineering team from:
Hive – SQL on Hadoop
●  A system for managing and querying unstructured data as if it were
structured
●  Uses Map-Reduce for execution
●  HDFS for Storage (or Amazon S3)
●  Key Building Principles
●  SQL as a familiar data warehousing tool
●  Extensibility (Pluggable map/reduce scripts in the language of your
choice, Rich and User Defined Data Types, User Defined Functions)
●  Interoperability (Extensible Framework to support different file and data
formats)
●  Performance
Why Hive?
●  Problem : Unlimited data
●  Terabytes every day
●  Wide Adoption of Hadoop
●  Scalable/Available
●  But, Hadoop can be …
●  Complex
●  Different Paradigm
●  Map-Reduce hard to program
Qubole DataFlow Diagram
[Architecture diagram]
User access: Qubole UI via browser, SDK, or ODBC, reaching Qubole’s
AWS account over a REST API (HTTPS).
Qubole’s AWS account: an ephemeral web tier of web servers, an
encrypted result cache, an RDS instance holding Qubole user and
account configurations (encrypted credentials), and the default Hive
metastore.
Customer’s AWS account: ephemeral Hadoop clusters (master and slave
nodes with encrypted HDFS), managed by Qubole over SSH; data in
Amazon S3 with S3 Server Side Encryption; an optional custom Hive
metastore; optional data flow to other RDS or Redshift.
Encryption Options:
a)  Qubole can encrypt the result cache
b)  Qubole supports encryption of the ephemeral drives used for HDFS
c)  Qubole supports S3 Server Side Encryption
De-normalizing data:
Normalization:
-  models data tables with certain rules to deal with redundancy
-  normalizing creates multiple relational tables
-  requires joins at runtime to produce results
Joins are expensive operations and one of the most common causes of
performance issues. Because of this, it is often better to avoid highly
normalized table structures, since they require join queries at runtime to
derive the desired metrics.
Partitioning Tables:
Hive partitioning is an effective method to improve the query performance
on larger tables. The partition key is best chosen as a low-cardinality attribute.
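A minimal sketch of a date-partitioned table (table and column names are illustrative):

```sql
-- Partition on a low-cardinality attribute such as the date.
CREATE TABLE logs (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (dt STRING);

-- A query that filters on the partition key scans only the matching
-- partition directories rather than the whole table.
SELECT COUNT(*) FROM logs WHERE dt = '2015-05-12';
```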
Bucketing:
-  improves join performance when the bucket key and join keys match
-  distributes data into different buckets based on the hash of the
bucket key
-  reduces I/O scans during the join when the join happens on the same
keys (columns)
Note: set the bucketing flag (hive.enforce.bucketing) each time before
writing data to a bucketed table.
To leverage bucketing in a join, set hive.optimize.bucketmapjoin=true.
This hints to Hive to do a bucket-level join during the map stage.
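Putting the above together in a sketch (table names and the bucket count of 32 are illustrative; both sides are bucketed on the join key):

```sql
-- Bucket both sides of the join on the join key.
CREATE TABLE users (id BIGINT, name STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;
CREATE TABLE actions (user_id BIGINT, action STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Set before writing, so Hive produces the declared number of buckets.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE users SELECT id, name FROM users_staging;

-- Enable bucket-level map-side joins.
SET hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(u) */ u.name, a.action
FROM actions a JOIN users u ON (a.user_id = u.id);
```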
Map join:
Very efficient when the table on one side of a join is small enough to fit
in memory.
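For example, with a hypothetical small dimension table, a map join can be requested explicitly with a hint, or Hive can be left to convert qualifying joins automatically:

```sql
-- Explicit hint: load dim_region into memory on each mapper.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.region
FROM fact_orders f JOIN dim_region d ON (f.region_id = d.id);

-- Or let Hive convert joins automatically when one side is small.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- size threshold in bytes
```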
File Input Formats:
-  play a critical role in Hive performance
Text-based input formats (e.g. JSON):
-  not a good choice for a large production system where data volume is
very high
-  human-readable formats take a lot of space and carry parsing
overhead (e.g. JSON parsing)
To address these problems, Hive comes with columnar input formats such
as RCFile and ORC. Columnar formats reduce read operations in queries by
allowing each column to be accessed individually.
Other binary formats such as Avro, SequenceFile, and Thrift can be
effective in various use cases.
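A sketch of moving a text-format table to ORC (table names and the Snappy codec choice are illustrative):

```sql
-- Columnar ORC table with lightweight compression.
CREATE TABLE events_orc (
  user_id BIGINT,
  event   STRING,
  ts      STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Rewrite the existing text-format data into the ORC table.
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event, ts FROM events_text;
```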
Compress map/reduce output:
-  reduces intermediate data volume
-  reduces the amount of data transferred between mappers and reducers
over the network
Note: gzip-compressed files are not splittable – so apply with caution.
File size should not be larger than a few hundred megabytes
-  otherwise it can potentially lead to an imbalanced job
-  compression codec options include Snappy, LZO, and bzip2
For map output compression: set mapred.compress.map.output=true
For job output compression: set mapred.output.compress=true
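The settings above, combined with explicit codec choices, might look like this (the mapred.* names are the Hadoop 1-era property names current when this deck was written; the codec picks are illustrative):

```sql
-- Compress intermediate map output with a fast, splittable-friendly codec.
SET mapred.compress.map.output = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output.
SET mapred.output.compress = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;

-- Hive-level switches for intermediate and final output compression.
SET hive.exec.compress.intermediate = true;
SET hive.exec.compress.output = true;
```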
Parallel execution:
Hadoop can execute MapReduce jobs in parallel, and Hive queries whose
stages are independent of each other can take advantage of this
parallelism.
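Stage-level parallelism is opt-in; a sketch (the thread count shown is the usual default):

```sql
-- Allow independent stages of a query (e.g. the branches of a
-- UNION ALL) to run as concurrent MapReduce jobs.
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 8;
```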
Vectorization:
-  allows Hive to process a batch of rows in ORC format together instead
of processing one row at a time
Each batch consists of column vectors, which are usually arrays of
primitive types. Operations are performed on entire column vectors,
which improves instruction pipelining and cache usage.
To enable: set hive.vectorized.execution.enabled=true
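A minimal sketch (the table is assumed to be stored as ORC; names are illustrative):

```sql
SET hive.vectorized.execution.enabled = true;

-- Scans, filters, and aggregates over the ORC table are now processed
-- in batches of rows rather than one row at a time.
SELECT event, COUNT(*) FROM events_orc GROUP BY event;
```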
Sampling:
-  allows users to take a subset of a dataset and analyze it, without
having to analyze the entire data set
Hive offers a built-in TABLESAMPLE clause that allows you to sample
your tables.
TABLESAMPLE can sample at various granularity levels
-  return only subsets of buckets (bucket sampling)
-  HDFS blocks (block sampling)
-  first N records from each input split
Alternatively, you can implement your own UDF that filters out records
according to your sampling algorithm.
Sampling on Buckets:
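Hedged illustrations of the sampling granularities above (table names, the bucket count, and the sampling fractions are assumptions):

```sql
-- Bucket sampling: read 1 of 32 buckets of a table bucketed on id.
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 32 ON id);

-- Block sampling: roughly 1 percent of HDFS blocks.
SELECT * FROM logs TABLESAMPLE (1 PERCENT);

-- Row sampling: the first N rows from each input split.
SELECT * FROM logs TABLESAMPLE (100 ROWS);
```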
Unit Testing:
-  In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries
and more.
-  Verify the correctness of your whole HiveQL query without touching a
Hadoop cluster.
-  Executing a HiveQL query in local mode takes seconds, compared to
minutes, hours or days if it runs in Hadoop mode.
Various tools are available, e.g. HiveRunner, Hive_test and Beetest.
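Local mode itself is controlled by a few Hive settings; a sketch (the thresholds shown are the usual defaults, stated here as an assumption):

```sql
-- Let Hive run small queries locally instead of launching cluster jobs.
SET hive.exec.mode.local.auto = true;
-- Local mode applies only when total input is under these limits.
SET hive.exec.mode.local.auto.inputbytes.max = 134217728;  -- 128 MB
SET hive.exec.mode.local.auto.input.files.max = 4;
```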
Qubole Data Service
Use Cases and Additional Information
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform a lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because of stability, and rapidly
expanded big data usage by giving data access to users beyond
developers.
User and Query Growth:
-  rapid expansion of big data beyond developers (240 users out of a
600-person company)
Use Cases:
-  rapid expansion in use cases ranging from ETL, search, ad-hoc
querying, and product analytics
-  rock-solid infrastructure sees 50% fewer failures compared to AWS
Elastic MapReduce
-  enterprise-scale processing and data access
“We needed something that was reliable and easy to learn, set
up, use and put into production without the risk and high
expectations that come with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data in the cloud (from internal Oracle clusters)
because getting to analysis was much quicker than operating
infrastructure themselves. Used to answer client queries and power
client dashboards.
[Chart: Commands Per Month — the number of queries grew steadily from
Aug-13 through Feb-14, on a scale of 0 to 5,000 per month.]
Use Cases:
Segment audiences based on their behavior including such
topics as user pathway and multi-dimensional recency
analysis
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
strategies
Roles served: Operations Analyst, Marketing Ops Analyst, Data
Architect, Business Users, Product Support, Customer Support,
Developer, Sales Ops, Product Managers, Data Infrastructure.
Links for more information
http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/
http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest
http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html
http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm
