HIVE: Data Warehousing & Analytics on Hadoop

Zheng Shao
Zheng ShaoOpen Source Big Data
HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team
Why Another Data Warehousing System? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
What is HIVE? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Data Warehousing at Facebook Today Web Servers Scribe Servers Filers Hive on  Hadoop Cluster Oracle RAC Federated MySQL
Hive/Hadoop Usage @ Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hadoop Usage @ Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
HIVE: Components HDFS Hive CLI DDL Queries Browsing Map Reduce MetaStore Thrift API SerDe Thrift Jute JSON.. Execution Hive QL Parser Planner Mgmt. Web UI
Data Model Logical Partitioning Hash Partitioning Schema Library clicks HDFS MetaStore / hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 … Tables #Buckets=32 Bucketing Info Partitioning Cols
Dealing with Structured Data ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
MetaStore ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hive Query Language ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Running Custom Map/Reduce Scripts ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
(Simplified) Map Reduce Review Machine 2 Machine 1 <k1, v1> <k2, v2> <k3, v3> <k4, v4> <k5, v5> <k6, v6> <nk1, nv1> <nk2, nv2> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk1, nv6> Local Map <nk2, nv4> <nk2, nv5> <nk2, nv2> <nk1, nv1> <nk3, nv3> <nk1, nv6> Global Shuffle <nk1, nv1> <nk1, nv6> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk2, nv2> Local Sort <nk2, 3> <nk1, 2> <nk3, 1> Local Reduce
Hive QL – Join ,[object Object],[object Object],[object Object],[object Object],X = page_view user pv_users pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male pageid age 1 25 2 25 1 32
Hive QL – Join in Map Reduce page_view user pv_users Map Shuffle Sort Reduce key value 111 < 1, 1> 111 < 1, 2> 222 < 1, 1> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male key value 111 < 2, 25> 222 < 2, 32> key value 111 < 1, 1> 111 < 1, 2> 111 < 2, 25> key value 222 < 1, 1> 222 < 2, 32> pageid age 1 25 2 25 pageid age 1 32
Joins ,[object Object],[object Object],[object Object],[object Object],[object Object]
Join To Map Reduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hive Optimizations  – Merge Sequential Map Reduce Jobs ,[object Object],[object Object],A Map Reduce B C AB Map Reduce ABC key av bv 1 111 222 key av 1 111 key bv 1 222 key cv 1 333 key av bv cv 1 111 222 333
Hive QL – Group By ,[object Object],[object Object],[object Object],pv_users pageid age 1 25 2 25 1 32 2 25 pageid age count 1 25 1 2 25 2 1 32 1
Hive QL – Group By in Map Reduce pv_users Map Shuffle Sort Reduce pageid age 1 25 2 25 pageid age count 1 25 1 1 32 1 pageid age 1 32 2 25 key value <1,25> 1 <2,25> 1 key value <1,32> 1 <2,25> 1 key value <1,25> 1 <1,32> 1 key value <2,25> 1 <2,25> 1 pageid age count 2 25 2
Hive QL – Group By with Distinct ,[object Object],[object Object],page_view pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 2 111 9:08:20 pageid count_distinct_userid 1 2 2 1
Hive QL – Group By with Distinct in Map Reduce page_view Shuffle and Sort Reduce Map Reduce pageid count 1 1 2 1 pageid count 1 1 pageid userid time 1 111 9:08:01 2 111 9:08:13 pageid userid time 1 222 9:08:14 2 111 9:08:20 key v <1,111> <2,111> <2,111> key v <1,222> pageid count 1 2 pageid count 2 1
Group by Future optimizations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Inserts into Files, Tables and Local Files  ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hive Performance ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hadoop Challenges @ Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusion ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
1 of 28

Recommended

Apache Iceberg: An Architectural Look Under the Covers by
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
1.5K views24 slides
Apache Calcite (a tutorial given at BOSS '21) by
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
2.6K views88 slides
Polyglot persistence @ netflix (CDE Meetup) by
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Roopa Tangirala
5.1K views96 slides
Spark shuffle introduction by
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
50.7K views33 slides
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli... by
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
699 views20 slides
Iceberg: A modern table format for big data (Strata NY 2018) by
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
2K views34 slides

More Related Content

What's hot

Introduction to redis by
Introduction to redisIntroduction to redis
Introduction to redisTanu Siwag
832 views58 slides
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
4.3K views38 slides
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang by
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
5.9K views103 slides
Data profiling in Apache Calcite by
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
595 views21 slides
Hw09 Hadoop Development At Facebook Hive And Hdfs by
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
6.3K views24 slides
Presto query optimizer: pursuit of performance by
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
4.3K views31 slides

What's hot(20)

Introduction to redis by Tanu Siwag
Introduction to redisIntroduction to redis
Introduction to redis
Tanu Siwag832 views
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by HostedbyConfluent
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent4.3K views
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang by Databricks
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks5.9K views
Hw09 Hadoop Development At Facebook Hive And Hdfs by Cloudera, Inc.
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
Cloudera, Inc.6.3K views
Presto query optimizer: pursuit of performance by DataWorks Summit
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
DataWorks Summit4.3K views
Productizing Structured Streaming Jobs by Databricks
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks3.2K views
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea... by Impetus Technologies
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
10 Good Reasons to Use ClickHouse by rpolat
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse
rpolat372 views
Scalable Filesystem Metadata Services with RocksDB by Alluxio, Inc.
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
Alluxio, Inc.2.6K views
patroni-based citrus high availability environment deployment by hyeongchae lee
patroni-based citrus high availability environment deploymentpatroni-based citrus high availability environment deployment
patroni-based citrus high availability environment deployment
hyeongchae lee544 views
Making Apache Spark Better with Delta Lake by Databricks
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks5.4K views
Building robust CDC pipeline with Apache Hudi and Debezium by Tathastu.ai
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai2.7K views
Spark and S3 with Ryan Blue by Databricks
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks3.9K views
Apache kudu by Asim Jalis
Apache kuduApache kudu
Apache kudu
Asim Jalis6.3K views
Implementing High Availability Caching with Memcached by Gear6
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
Gear657.4K views
Future of Data Engineering by C4Media
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media1.7K views
Free Training: How to Build a Lakehouse by Databricks
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks3.4K views

Viewers also liked

Hadoop, Pig, and Twitter (NoSQL East 2009) by
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
143.2K views58 slides
introduction to data processing using Hadoop and Pig by
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
92.5K views32 slides
Pig, Making Hadoop Easy by
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
84.7K views16 slides
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop by
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
82.7K views40 slides
Integration of Hive and HBase by
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
99.6K views39 slides
Practical Problem Solving with Apache Hadoop & Pig by
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
237.2K views199 slides

Viewers also liked(18)

Hadoop, Pig, and Twitter (NoSQL East 2009) by Kevin Weil
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil143.2K views
introduction to data processing using Hadoop and Pig by Ricardo Varela
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela92.5K views
Pig, Making Hadoop Easy by Nick Dimiduk
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Nick Dimiduk84.7K views
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop by royans
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
royans82.7K views
Integration of Hive and HBase by Hortonworks
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
Hortonworks99.6K views
Practical Problem Solving with Apache Hadoop & Pig by Milind Bhandarkar
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar237.2K views
Introduction To Map Reduce by rantav
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav106.2K views
Hive Quick Start Tutorial by Carl Steinbach
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach139.8K views
Big Data Analytics with Hadoop by Philippe Julio
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio441.9K views
Big Data & Hadoop Tutorial by Edureka!
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
Edureka!90.2K views
Hadoop introduction , Why and What is Hadoop ? by sudhakara st
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st71.4K views
Seminar Presentation Hadoop by Varun Narang
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang81.5K views
Big data and Hadoop by Rahul Agarwal
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal82.3K views
Hadoop - Overview by Jay
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
Jay 8.2K views
Hadoop demo ppt by Phil Young
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
Phil Young40.6K views
A beginners guide to Cloudera Hadoop by David Yahalom
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
David Yahalom4.8K views
Hadoop Overview & Architecture by EMC
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC64.6K views

Similar to HIVE: Data Warehousing & Analytics on Hadoop

Hadoop Hive Talk At IIT-Delhi by
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
3.8K views37 slides
Hive Apachecon 2008 by
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
20.5K views34 slides
2008 Ur Tech Talk Zshao by
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk ZshaoJeff Hammerbacher
2.2K views36 slides
Hadoop and Hive by
Hadoop and HiveHadoop and Hive
Hadoop and HiveZheng Shao
6.9K views36 slides
Hive ICDE 2010 by
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
7.2K views51 slides
Hive by
HiveHive
HiveSrinath Reddy
587 views24 slides

Similar to HIVE: Data Warehousing & Analytics on Hadoop(20)

Hive Apachecon 2008 by athusoo
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
athusoo20.5K views
Hadoop and Hive by Zheng Shao
Hadoop and HiveHadoop and Hive
Hadoop and Hive
Zheng Shao6.9K views
Hive ICDE 2010 by ragho
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
ragho7.2K views
Hadoop Summit 2009 Hive by Zheng Shao
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
Zheng Shao1.8K views
Hadoop Summit 2009 Hive by Namit Jain
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
Namit Jain3.9K views
Hive Percona 2009 by prasadc
Hive Percona 2009Hive Percona 2009
Hive Percona 2009
prasadc4.6K views
Hive Training -- Motivations and Real World Use Cases by nzhang
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang20.2K views
Hive User Meeting August 2009 Facebook by ragho
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho27.4K views
Hive User Meeting 2009 8 Facebook by Zheng Shao
Hive User Meeting 2009 8 FacebookHive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 Facebook
Zheng Shao1.6K views
Hive @ Hadoop day seattle_2010 by nzhang
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang3K views
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain by Yahoo Developer Network
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Hadoop 101 for bioinformaticians by attilacsordas
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
attilacsordas3K views
[SSA] 04.sql on hadoop(2014.02.05) by Steve Min
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
Steve Min2K views
Stratosphere with big_data_analytics by Avinash Pandu
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
Avinash Pandu307 views

Recently uploaded

VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueShapeBlue
207 views54 slides
Initiating and Advancing Your Strategic GIS Governance Strategy by
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance StrategySafe Software
184 views68 slides
Business Analyst Series 2023 - Week 4 Session 7 by
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7DianaGray10
146 views31 slides
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...Jasper Oosterveld
35 views49 slides
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...The Digital Insurer
91 views52 slides
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...ShapeBlue
196 views62 slides

Recently uploaded(20)

VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue207 views
Initiating and Advancing Your Strategic GIS Governance Strategy by Safe Software
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance Strategy
Safe Software184 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10146 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue196 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue183 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue152 views
LLMs in Production: Tooling, Process, and Team Structure by Aggregage
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
Aggregage57 views
Optimizing Communication to Optimize Human Behavior - LCBM by Yaman Kumar
Optimizing Communication to Optimize Human Behavior - LCBMOptimizing Communication to Optimize Human Behavior - LCBM
Optimizing Communication to Optimize Human Behavior - LCBM
Yaman Kumar38 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue247 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue171 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue178 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue303 views
"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays36 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue224 views
"Node.js Development in 2024: trends and tools", Nikita Galkin by Fwdays
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin
Fwdays33 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue164 views

HIVE: Data Warehousing & Analytics on Hadoop

  • 1. HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team
  • 2.
  • 3.
  • 4. Data Warehousing at Facebook Today Web Servers Scribe Servers Filers Hive on Hadoop Cluster Oracle RAC Federated MySQL
  • 5.
  • 6.
  • 7. HIVE: Components HDFS Hive CLI DDL Queries Browsing Map Reduce MetaStore Thrift API SerDe Thrift Jute JSON.. Execution Hive QL Parser Planner Mgmt. Web UI
  • 8. Data Model Logical Partitioning Hash Partitioning Schema Library clicks HDFS MetaStore / hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 … Tables #Buckets=32 Bucketing Info Partitioning Cols
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. (Simplified) Map Reduce Review Machine 2 Machine 1 <k1, v1> <k2, v2> <k3, v3> <k4, v4> <k5, v5> <k6, v6> <nk1, nv1> <nk2, nv2> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk1, nv6> Local Map <nk2, nv4> <nk2, nv5> <nk2, nv2> <nk1, nv1> <nk3, nv3> <nk1, nv6> Global Shuffle <nk1, nv1> <nk1, nv6> <nk3, nv3> <nk2, nv4> <nk2, nv5> <nk2, nv2> Local Sort <nk2, 3> <nk1, 2> <nk3, 1> Local Reduce
  • 14.
  • 15. Hive QL – Join in Map Reduce page_view user pv_users Map Shuffle Sort Reduce key value 111 < 1, 1> 111 < 1, 2> 222 < 1, 1> pageid userid time 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 userid age gender 111 25 female 222 32 male key value 111 < 2, 25> 222 < 2, 32> key value 111 < 1, 1> 111 < 1, 2> 111 < 2, 25> key value 222 < 1, 1> 222 < 2, 32> pageid age 1 25 2 25 pageid age 1 32
  • 16.
  • 17.
  • 18.
  • 19.
  • 20. Hive QL – Group By in Map Reduce pv_users Map Shuffle Sort Reduce pageid age 1 25 2 25 pageid age count 1 25 1 1 32 1 pageid age 1 32 2 25 key value <1,25> 1 <2,25> 1 key value <1,32> 1 <2,25> 1 key value <1,25> 1 <1,32> 1 key value <2,25> 1 <2,25> 1 pageid age count 2 25 2
  • 21.
  • 22. Hive QL – Group By with Distinct in Map Reduce page_view Shuffle and Sort Reduce Map Reduce pageid count 1 1 2 1 pageid count 1 1 pageid userid time 1 111 9:08:01 2 111 9:08:13 pageid userid time 1 222 9:08:14 2 111 9:08:20 key v <1,111> <2,111> <2,111> key v <1,222> pageid count 1 2 pageid count 2 1
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.

Editor's Notes

  1. STYLE GUIDELINES The master cannot be changed, if you are going to place another logo on any slide, please place it in the lower right corner. Title sizes may not be tampered with. If your title is too long please shorten it. Please do not center the title and subtitle, everything is made to align with the Facebook logo above them. Always remember to use the correct slide type for what you’re using it for. If you’re looking to use half a slide with bullet points and the other half with a picture, pick the correct slide type.