Petabyte Scale Data Warehousing at Facebook Ning Zhang Data Infrastructure Facebook
Overview Motivations Data-driven model Challenges Data Infrastructure Hadoop & Hive In-house tools Hive Details Architecture Data model Query language Extensibility Research Problems
Motivations
Facebook is just a Set of Web Services …
 … at Large Scale The social graph is large 400 million monthly active users 250 million daily active users 160 million active objects (groups/events/pages) 130 friend connections per user on average 60 object (groups/events/pages) connections per user on average Activities on the social graph People spend 500 billion minutes per month on FB The average user creates 70 pieces of content each month 25 billion pieces of content are shared each month Millions of search queries per day Facebook is still growing fast New users, features, services …
Facebook is still growing and changing
Under the Hood Data flow from users’ perspective Clients (browser/phone/3rd party apps) → Web Services → Users Another big topic on the Web Services To complete the feedback loop … The developers want to know how a new app/feature is received by users (A/B testing) The advertisers want to know how their ads perform (dashboards/reports) Based on historical data, how to construct a model and predict the future (machine learning) Need data analytics!  Data warehouse: ETL, data processing, BI … Closing the loop: decision-making based on analyzing the data (users’ feedback)
Data-driven Business/R&D/Science … DSS (decision support systems) is not new, but the Web gives it new elements. “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.” -- Andreas Weigend, Harvard Business Review “The center of the universe has shifted from e-business to me-business.” -- same as above “Invariably, simple models and a lot of data trump more elaborate models based on less data.” -- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
Problems and Challenges Data-driven development/business  Huge amounts of log data/user data generated every day Need to analyze this data to feed back into development/business decisions Machine learning, report/dashboard generation, A/B testing And many more problems Scalability (more than petabytes) Availability (HA) Manageability (e.g., scheduling) Performance (CPU, memory, disk/network I/O) And many more…
Facebook Engineering Teams (backend) Facebook Infrastructure Building foundations that serve end users/applications OLTP workload Components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, … Facebook Data Infrastructure (data warehouse) Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc. OLAP workload Components include Hadoop, Hive, HDFS, Scribe, HBase, tools (ETL, UI, workflow management etc.) Other Engineering teams  Platform, search, site integrity, monetization, apps, growth, etc.
DI Key Challenges (I) – Scalability Data, data and more data 200 GB/day in March 2008 12 TB/day at the end of 2009 About 8x increase per year  Total size is 5 PB now (3x when considering replication) Same order as the Web (~25 billion indexable pages)
DI Key Challenges (II) – Performance Queries, queries and more queries More than 200 unique users query the data warehouse every day 7K queries/day at the end of 2009 25K queries/day now Workload is a mixture of ad-hoc queries and ETL/reporting queries Fast, faster and real-time Users expect faster response times on fresher data (e.g., fighting spam/fraud in near real-time) Sampling a subset of the data is not always good enough
Other Requirements Accessibility Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!) Schema discovery (more than 20K tables) Data exploration and visualization (learning the data by looking at it) Leverage existing prevalent and familiar tools (e.g., BI tools)  Flexibility Schemas change frequently (adding new columns, changing column types, different partitioning of tables, etc.) Data formats can differ (plain text, row store, column store, complex data types) Extensibility Easy to plug in user-defined functions, aggregations etc.  Data storage could be files, web services, “NoSQL stores”
Why not Existing Data Warehousing Systems? Cost of analysis and storage on proprietary systems does not support the trend towards more data Cost based on data size (15 PB costs a lot!) Expensive hardware and support Limited scalability does not support the trend towards more data Products designed decades ago (not suitable for petabyte-scale DW) ETL is a big bottleneck Long product development & release cycles User requirements change frequently (agile programming practice) Closed and proprietary systems
Let’s try Hadoop (MapReduce + HDFS) … Pros Superior availability/scalability/manageability (99.9%) Large and healthy open source community (popular in both industry and academic organizations)
But not quite … Cons: Programmability and Metadata Efficiency is not that great, but we can throw more hardware at it MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get results No schema Solution: Hive!
What is Hive? A system for managing and querying structured data built on top of Hadoop Map-Reduce for execution HDFS for storage RDBMS for metadata Key Building Principles: SQL is a familiar language for data warehousing Extensibility – Types, Functions, Formats, Scripts (connecting to HBase, Pig, Hypertable, Cassandra etc.) Scalability and Performance Interoperability (JDBC/ODBC/Thrift)
Hive: Familiar Schema Concepts
Column Data Types
Primitive types: integer types, float, string, date, boolean
Nest-able collections: array<any-type>, map<primitive-type, any-type>
User-defined types: structures with attributes, which can be of any type
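To make these schema concepts concrete, here is a minimal HiveQL sketch (not from the original slides) of a table for a hypothetical page-view log, combining primitive types, a collection, a map, a struct, and a date partition:

CREATE TABLE page_views (
  view_time    INT,
  userid       BIGINT,
  page_url     STRING,
  friends      ARRAY<BIGINT>,
  properties   MAP<STRING, STRING>,
  device       STRUCT<os:STRING, os_version:STRING>
)
PARTITIONED BY (ds STRING)                          -- one partition per day
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'    -- ctrl-A, Hive's default field delimiter
STORED AS SEQUENCEFILE;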
Optimizations Column Pruning Also pushed down to the scan in columnar storage (RCFile) Predicate Pushdown Not pushed below non-deterministic functions (e.g., rand()) Partition Pruning Sample Pruning Handling small files Merge while writing, CombineHiveInputFormat while reading Small jobs: SELECT * with partition predicates is executed directly in the client Restartability (work in progress)
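A hedged illustration of how some of these optimizations surface in a query (reusing the hypothetical page_views table sketched above):

SELECT page_url, count(1)
FROM page_views
WHERE ds = '2010-03-01' AND userid > 0
GROUP BY page_url;

Only the ds='2010-03-01' partition is scanned (partition pruning), only the page_url and userid columns need to be read when the table uses a columnar format such as RCFile (column pruning), and the deterministic predicate userid > 0 can be pushed down to the scan.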
Hive: Simplifying Hadoop Programming
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
vs.
hive> select key, count(1) from kv1 where key > 100 group by key;
MapReduce Scripts Example
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM
  (SELECT TRANSFORM(uhash, page_url, unix_time) USING 'page_url_to_id.py'
     AS (uhash, page_id, unix_time)
   FROM mylog
   DISTRIBUTE BY uhash
   SORT BY uhash, unix_time) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time) USING 'my_python_session_cutter.py'
  AS (uhash, session_info);
Hive Architecture
Hive: Making Optimizations Transparent  Joins: Joins try to reduce the number of map/reduce jobs needed. Memory efficient joins by streaming largest tables. Map Joins User specified small tables stored in hash tables on the mapper No reducer needed Aggregations: Map side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on Query SELECT count(1) FROM t; Load balancing for data skew
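A rough sketch of the map join described above, written with Hive's MAPJOIN hint (the small dim_users dimension table is hypothetical): the small table is loaded into an in-memory hash table on every mapper, so the join needs no reduce phase:

SELECT /*+ MAPJOIN(d) */ pv.page_url, d.country
FROM page_views pv
JOIN dim_users d ON (pv.userid = d.userid)
WHERE pv.ds = '2010-03-01';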
Hive: Making Optimizations Transparent Storage: Column oriented data formats Column and Partition pruning to reduce scanned data Lazy de-serialization of data Plan Execution Parallel Execution of Parts of the Plan
Hive: Open & Extensible Different on-disk storage (file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe … User-provided map/reduce scripts In any language, using stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime … User-defined Aggregation Functions Sum, Average … User-defined Table Functions Explode …
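As an example of this extensibility, a user-defined function is typically packaged in a jar and registered per session; the jar path and Java class name below are hypothetical placeholders:

ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.udf.NormalizeUrl';
SELECT normalize_url(page_url) FROM page_views WHERE ds = '2010-03-01';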
Hive: Interoperability with Other Tools JDBC Enables integration with JDBC-based SQL clients ODBC Enables integration with MicroStrategy Thrift Enables writing cross-language clients Main form of integration with the PHP-based Web UI
Powered by Hive
Usage in Facebook
Usage Types of Applications: Reporting E.g., daily/weekly aggregations of impression/click counts Measures of user engagement MicroStrategy reports Ad hoc Analysis E.g., how many group admins broken down by state/country Machine Learning (assembling training data) Ad Optimization E.g., user engagement as a function of user attributes Many others
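The ad hoc analysis example above ("group admins broken down by state/country") would look roughly like the following in HiveQL; the group_admins table and its columns are hypothetical:

SELECT country, count(DISTINCT admin_userid) AS num_admins
FROM group_admins
WHERE ds = '2010-03-01'
GROUP BY country;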
Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 13600 cores Raw Storage capacity ~ 17PB 8 cores + 12 TB per node 32 GB RAM per node Two level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top level rack switch 2 clusters One for adhoc users One for strict SLA jobs
Hive & Hadoop Usage @ Facebook Statistics per day: 800 TB of I/O per day 10K – 25K Hive jobs per day Hive simplifies Hadoop: New engineers go through a Hive training session Analysts (non-engineers) use Hadoop through Hive Most jobs are Hive jobs
Data Flow Architecture at Facebook [Diagram components: Web Servers, Scribe-Hadoop Cluster (Scribe-HDFS), Hive replication, Adhoc Hive-Hadoop Cluster, Production Hive-Hadoop Cluster, Oracle RAC, Federated MySQL]
Scribe-HDFS: 101 [Diagram: scribed daemons receive <category, msgs> and append to /staging/<category>/<file> on HDFS data nodes]
Scribe-HDFS: Near real time Hadoop clusters collocated with the web servers Network is the biggest bottleneck A typical cluster has about 50 nodes Stats: 50 TB/day of raw data logged 99% of the time data is available within 20 seconds
Warehousing at Facebook Instrumentation (PHP/Python etc.) Automatic ETL Continuously copy data into Hive tables Metadata Discovery (CoHive) Query (Hive) Workflow specification and execution (Chronos) Reporting tools Monitoring and alerting
Future Work Scaling in a Dynamic and Fast-Growing Environment Erasure codes for Hadoop Namenode scalability past 150 million objects Isolating ad hoc queries from jobs with strict deadlines Hive Replication Resource Sharing Pools for slots More scalable loading of data Incremental load of site data Continuous load of log data
Future Work Discovering Data from > 20K tables Collaborative Hive Finding Unused/rarely used Data
Future Dynamic Inserts into multiple partitions More join optimizations Persistent UDFs, UDAFs and UDTFs Benchmarks for monitoring performance IN, EXISTS and correlated sub-queries Statistics Materialized Views
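A rough sketch of what a dynamic partition insert could look like (the page_views_by_country table, partitioned by ds and country, and the staging table are hypothetical); the trailing columns of the SELECT supply the partition values:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE page_views_by_country PARTITION (ds, country)
SELECT view_time, userid, page_url, ds, country
FROM staging_page_views;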
Research Challenges Reducing response time for small/medium jobs Scaling from 20 thousand queries per day to 1 million queries per day Indexes on Hadoop, data mart strategy Near real-time query processing – pipelining MapReduce Distributed systems problems at large scale: Job scheduling: mixed throughput and response-time workloads Orchestrating commits on thousands of machines (scribe conf files) Cross data center replication and consistency Full SQL compliance Required by 3rd-party tools (e.g., BI) through ODBC/JDBC
Query Optimizations Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture Cost models in the MapReduce framework
Social Graph Every user sees a different, personalized stream of information (news feed) 130 friend + 60 object updates in real time Edge-rank: ranking of updates that should be shown on top Social graph is stored in distributed MySQL databases Data replication between data centers: an update to one data center should be replicated to other data centers as well How to partition a dense graph such that data transfer between partitions is minimized

Editor's Notes

  1. Motivations: - The problems we face - The role of the data infrastructure team at FB - Why we chose the current infrastructure
  2. List of apps, news feed, ads/notifications Dynamic web site What it boils down to is a set of web services, not a big deal
  3. -- As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. -- As of March 2009, there were 25.21 billion indexable web pages. Given an average size of 300 KB, the size of the indexable web is around 5000 petabytes. Estimated Google index size: 200 TB to 2 PB.
  6. 1 Gbit/s connectivity within a rack, 100 Mbit/s across racks? Are all disks 7200 RPM SATA?