WaterlooHiveTalk
Slide notes
  • Motivations: the problems we face; the role of the data infrastructure team at FB; why we chose the current infrastructure.
  • List of apps: news feed, ads/notifications. A dynamic web site: what it boils down to is a set of web services, not a big deal.
  • As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. As of March 2009, there were 25.21 billion indexable web pages; given an average size of 300 KB, the internet is around 5,000 petabytes. Google's index is estimated at 200 TB–2 PB.
  • 1 Gbit/s connectivity within a rack, 100 Mbit/s across racks? Are all disks 7,200 RPM SATA?
  • Transcript

    • 1. Petabyte Scale Data Warehousing at Facebook
      Ning Zhang
      Data Infrastructure
      Facebook
    • 2. Overview
      Motivations
      Data-driven model
      Challenges
      Data Infrastructure
      Hadoop & Hive
      In-house tools
      Hive Details
      Architecture
      Data model
      Query language
      Extensibility
      Research Problems
    • 3. Motivations
    • 4. Facebook is just a Set of Web Services …
    • 5. … at Large Scale
      The social graph is large
      400 million monthly active users
      250 million daily active users
      160 million active objects (groups/events/pages)
      130 friend connections per user on average
      60 object (groups/events/pages) connections per user on average
      Activities on the social graph
      People spend 500 billion minutes per month on FB
      Average user creates 70 pieces of content each month
      25 billion pieces of content are shared each month
      Millions of search queries per day
      Facebook is still growing fast
      New users, features, services …
    • 6. Facebook is still growing and changing
    • 7. Under the Hood
      Data flow from users’ perspective
      Clients (browser/phone/3rd-party apps) → Web Services → Users
      Another big topic on the Web Services
      To complete the feedback system …
      The developers want to know how a new app/feature is received by the users (A/B testing)
      The advertisers want to know how their ads perform (dashboard/reports)
      Based on historical data, how to construct a model and predict the future (machine learning)
      Need data analytics!
      Data warehouse: ETL, data processing, BI …
      Closing the loop: decision-making based on analyzing the data (users’ feedback)
    • 8. Data-driven Business/R&D/Science …
      Decision support systems (DSS) are not new, but the Web gives them new elements.
      “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.”
      -- by Andreas Weigend, Harvard Business Review
      “The center of the universe has shifted from e-business to me-business.”
      -- same as above
      “Invariably, simple models and a lot of data trump more elaborate models based on less data.”
      -- by Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
    • 9. Problems and Challenges
      Data-driven development/business
      Huge amount of log data/user data generated every day
      Need to analyze these data to feed back into development/business decisions
      Machine learning, report/dashboard generation, A/B testing
      And many more problems
      Scalability (more than petabytes)
      Availability (HA)
      Manageability (e.g., scheduling)
      Performance (CPU, memory, disk/network I/O)
      And many more…
    • 10. Facebook Engineering Teams (backend)
      Facebook Infrastructure
      Building foundations that serve end users/applications
      OLTP workload
      Components include MySQL, memcached, HipHop (PHP), thrift, Cassandra, Haystack, flashcache, …
      Facebook Data Infrastructure (data warehouse)
      Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc.
      OLAP workload
      Components include Hadoop, Hive, HDFS, scribe, HBase, tools (ETL, UI, workflow management etc.)
      Other Engineering teams
      Platform, search, site integrity, monetization, apps, growth, etc.
    • 11. DI Key Challenges (I) – scalability
      Data, data and more data
      200 GB/day in March 2008 → 12 TB/day at the end of 2009
      About 8x increase per year
      Total size is 5 PB now (x3 when considering replication)
      Same order as the Web (~25 billion indexable pages)
    • 12. DI Key Challenges (II) – Performance
      Queries, queries and more queries
      More than 200 unique users query on the data warehouse every day
      7K queries/day at the end of 2009
      25K queries/day now
      Workload is a mixture of ad-hoc queries and ETL/reporting queries.
      Fast, faster and real-time
      Users expect faster response time on fresher data (e.g., fighting with spam/fraud in near real-time)
      Sampling a subset of the data is not always good enough
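      For context, Hive's bucket sampling lets a query scan only a fraction of a table; a minimal sketch, assuming a hypothetical impressions table:
      hive> SELECT count(1) FROM impressions TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand()) s;
      Sampling speeds up exploration, but as noted above, signals such as spam/fraud may hide in the unsampled remainder.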
    • 13. Other Requirements
      Accessibility
      Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!)
      Schema discovery (more than 20K tables)
      Data exploration and visualization (learning the data by looking)
      Leverage existing prevalent and familiar tools (e.g., BI tools)
      Flexibility
      Schema changes frequently (adding new columns, changing column types, different partitions of tables, etc.)
      Data formats could be different (plain text, row store, column store, complex data types)
      Extensibility
      Easy to plug in user defined functions, aggregations etc.
      Data storage could be files, web services, “NoSQL stores”
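      As an illustration of the flexibility point, Hive can evolve a table's schema in place as a metadata-only operation; a sketch with a hypothetical user_actions table:
      hive> ALTER TABLE user_actions ADD COLUMNS (referrer STRING);
      hive> ALTER TABLE user_actions CHANGE props props MAP<STRING, STRING>;
      Existing data files are not rewritten; added columns simply read as NULL for old rows.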
    • 14. Why not Existing Data Warehousing Systems?
      Cost of analysis and storage on proprietary systems does not support the trend towards more data.
      Cost is based on data size (15 PB costs a lot!)
      Expensive hardware and support
      Limited scalability does not support the trend towards more data
      Products designed decades ago (not suitable for a petabyte DW)
      ETL is a big bottleneck
      Long product development & release cycle
      Users' requirements change frequently (agile programming practice)
      Closed and proprietary systems
    • 15. Let's try Hadoop (MapReduce + HDFS) …
      Pros
      Superior availability/scalability/manageability (99.9%)
      Large and healthy open source community (popular in both industry and academic organizations)
    • 16. But not quite …
      Cons: Programmability and Metadata
      Efficiency is not that great, but you can throw more hardware at it
      MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get results
      No schema
      Solution: Hive!
    • 17. What is Hive ?
      A system for managing and querying structured data built on top of Hadoop
      Map-Reduce for execution
      HDFS for storage
      RDBMS for metadata
      Key Building Principles:
      SQL is a familiar language for data warehouses
      Extensibility – Types, Functions, Formats, Scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.)
      Scalability and Performance
      Interoperability (JDBC/ODBC/thrift)
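      A minimal end-to-end sketch of these principles (reusing the kv1 table from the comparison on slide 27; the file path is hypothetical):
      hive> CREATE TABLE kv1 (key INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE kv1;
      hive> SELECT key, count(1) FROM kv1 GROUP BY key;
      The metadata lives in an RDBMS, the rows live in HDFS, and the GROUP BY is compiled into MapReduce jobs.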
    • 18. Hive: Familiar Schema Concepts
    • 19. Column Data Types
      Primitive Types
      integer types, float, string, date, boolean
      Nest-able Collections
      array<any-type>
      map<primitive-type, any-type>
      User-defined types
      structures with attributes, which can be of any type
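      A sketch of a table declaration using these types (all names hypothetical):
      hive> CREATE TABLE profiles (
              uid BIGINT,
              interests ARRAY<STRING>,
              props MAP<STRING, STRING>,
              address STRUCT<city:STRING, zip:INT>);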
    • Hive Query Language
      DDL
      {create/alter/drop} {table/view/partition}
      create table as select
      DML
      Insert overwrite
      QL
      Sub-queries in from clause
      Equi-joins (including Outer joins)
      Multi-table Insert
      Sampling
      Lateral Views
      Interfaces
      JDBC/ODBC/Thrift
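      For example, a multi-table insert scans the source once and populates several tables (table names hypothetical):
      hive> FROM page_views pv
            INSERT OVERWRITE TABLE views_by_page SELECT pv.page_id, count(1) GROUP BY pv.page_id
            INSERT OVERWRITE TABLE views_by_user SELECT pv.uhash, count(1) GROUP BY pv.uhash;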
    • 26. Optimizations
      Column Pruning
      Also pushed down to scan in columnar storage (RCFILE)
      Predicate Pushdown
      Not pushed below non-deterministic functions (e.g., rand())
      Partition Pruning
      Sample Pruning
      Handle small files
      Merge while writing
      CombineHiveInputFormat while reading
      Small Jobs
      SELECT * with partition predicates in the client
      Restartability (Work In Progress)
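      As a partition-pruning sketch: if a table is partitioned by a date column ds, the predicate below restricts the scan to a single partition's files (names hypothetical):
      hive> SELECT count(1) FROM impressions WHERE ds = '2010-03-31';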
    • 27. Hive: Simplifying Hadoop Programming
      $ cat > /tmp/reducer.sh
      uniq -c | awk '{print $2"\t"$1}'
      $ cat > /tmp/map.sh
      awk -F '\001' '{if($1 > 100) print $1}'
      $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
      $ bin/hadoop dfs -cat /tmp/largekey/part*
      vs.
      hive> select key, count(1) from kv1 where key > 100 group by key;
    • 28. MapReduce Scripts Examples
      add file page_url_to_id.py;
      add file my_python_session_cutter.py;
      FROM
      (SELECT TRANSFORM(uhash, page_url, unix_time)
      USING 'page_url_to_id.py'
      AS (uhash, page_id, unix_time)
      FROM mylog
      DISTRIBUTE BY uhash
      SORT BY uhash, unix_time) mylog2
      SELECT TRANSFORM(uhash, page_id, unix_time)
      USING 'my_python_session_cutter.py'
      AS (uhash, session_info);
    • 29. Hive Architecture
    • 30. Hive: Making Optimizations Transparent
      Joins:
      Joins try to reduce the number of map/reduce jobs needed.
      Memory-efficient joins by streaming the largest table.
      Map Joins
      User specified small tables stored in hash tables on the mapper
      No reducer needed
      Aggregations:
      Map side partial aggregations
      Hash-based aggregates
      Serialized key/values in hash tables
      90% speed improvement on Query
      SELECT count(1) FROM t;
      Load balancing for data skew
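      A sketch of the map-join hint (table names hypothetical): the small dimension table is built into an in-memory hash table on every mapper, so the join completes without a reduce phase.
      hive> SELECT /*+ MAPJOIN(p) */ v.uhash, p.page_type
            FROM page_views v JOIN dim_pages p ON (v.page_id = p.page_id);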
    • 31. Hive: Making Optimizations Transparent
      Storage:
      Column oriented data formats
      Column and Partition pruning to reduce scanned data
      Lazy de-serialization of data
      Plan Execution
      Parallel Execution of Parts of the Plan
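      A sketch of both ideas, assuming Hive's hive.exec.parallel setting and RCFile support (table name hypothetical):
      hive> set hive.exec.parallel=true;   -- run independent stages of the plan concurrently
      hive> CREATE TABLE clicks_rc (uhash STRING, page_id BIGINT) STORED AS RCFILE;   -- columnar format; pruned columns are never read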
    • 32. Hive: Open & Extensible
      Different on-disk storage (file) formats
      Text File, Sequence File, …
      Different serialization formats and data types
      LazySimpleSerDe, ThriftSerDe …
      User-provided map/reduce scripts
      In any language, use stdin/stdout to transfer data …
      User-defined Functions
      Substr, Trim, From_unixtime …
      User-defined Aggregation Functions
      Sum, Average …
      User-defined Table Functions
      Explode …
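      A sketch of plugging in a user-defined function (the jar and class names are hypothetical; mylog is the table from slide 28):
      hive> add jar /tmp/my_udfs.jar;
      hive> CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.udf.NormalizeUrl';
      hive> SELECT normalize_url(page_url) FROM mylog;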
    • 33. Hive: Interoperability with Other Tools
      JDBC
      Enables integration with JDBC based SQL clients
      ODBC
      Enables integration with MicroStrategy
      Thrift
      Enables writing cross-language clients
      Main form of integration with the PHP-based Web UI
    • 34. Powered by Hive
    • 35. Usage in Facebook
    • 36. Usage
      Types of Applications:
      Reporting
      E.g., daily/weekly aggregations of impression/click counts
      Measures of user engagement
      MicroStrategy reports
      Ad hoc Analysis
      E.g., how many group admins, broken down by state/country
      Machine Learning (Assembling training data)
      Ad Optimization
      E.g., user engagement as a function of user attributes
      Many others
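      The group-admins example above might look like this in HiveQL (table and column names hypothetical):
      hive> SELECT country, count(DISTINCT admin_uid) FROM group_admins GROUP BY country;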
    • 37. Hadoop & Hive Cluster @ Facebook
      Hadoop/Hive cluster
      13600 cores
      Raw Storage capacity ~ 17PB
      8 cores + 12 TB per node
      32 GB RAM per node
      Two level network topology
      1 Gbit/sec from node to rack switch
      4 Gbit/sec to top level rack switch
      2 clusters
      One for adhoc users
      One for strict SLA jobs
    • 38. Hive & Hadoop Usage @ Facebook
      Statistics per day:
      800TB of I/O per day
      10K – 25K Hive jobs per day
      Hive simplifies Hadoop:
      New engineers go through a Hive training session
      Analysts (non-engineers) use Hadoop through Hive
      Most jobs are Hive jobs
    • 39. Data Flow Architecture at Facebook
      [Architecture diagram: Web Servers feed a Scribe-Hadoop Cluster (Scribe-HDFS); data flows into the Production Hive-Hadoop Cluster and is replicated to the Adhoc Hive-Hadoop Cluster; Federated MySQL and Oracle RAC also connect to the warehouse]
    • 40. Scribe-HDFS: 101
      [Diagram: multiple scribed daemons send <category, msgs> and append to /staging/<category>/<file> on HDFS data nodes in the Scribe-HDFS cluster]
    • 41. Scribe-HDFS: Near-real-time Hadoop
      Clusters collocated with the web servers
      Network is the biggest bottleneck
      A typical cluster has about 50 nodes.
      Stats:
      50TB/day of raw data logged
      99% of the time data is available within 20 seconds
    • 42. Warehousing at Facebook
      Instrumentation (PHP/Python etc.)
      Automatic ETL
      Continuously copy data into Hive tables
      Metadata Discovery (CoHive)
      Query (Hive)
      Workflow specification and execution (Chronos)
      Reporting tools
      Monitoring and alerting
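      The continuous-copy step amounts to registering each newly landed chunk of log data as a partition of an existing table; a sketch with hypothetical names and paths:
      hive> ALTER TABLE raw_logs ADD PARTITION (ds='2010-03-31', hr='23') LOCATION '/staging/raw_logs/2010-03-31/23';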
    • 43. Future Work
      Scaling in a Dynamic and Fast Growing Environment
      Erasure codes for Hadoop
      Namenode scalability past 150 million objects
      Isolating Adhoc queries from jobs with strict deadlines
      Hive Replication
      Resource Sharing
      Pools for slots
      More scalable loading of data
      Incremental load of site data
      Continuous load of log data
    • 44. Future Work
      Discovering Data from > 20K tables
      Collaborative Hive
      Finding Unused/rarely used Data
    • 45. Future
      Dynamic Inserts into multiple partitions
      More join optimizations
      Persistent UDFs, UDAFs and UDTFs
      Benchmarks for monitoring performance
      IN, exists and correlated sub-queries
      Statistics
      Materialized Views
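      A sketch of the dynamic-insert idea above: partition values come from the data itself rather than being spelled out per statement (names hypothetical; settings as later adopted by Hive):
      hive> set hive.exec.dynamic.partition=true;
      hive> set hive.exec.dynamic.partition.mode=nonstrict;
      hive> INSERT OVERWRITE TABLE daily_counts PARTITION (ds)
            SELECT page_id, count(1), ds FROM page_views GROUP BY page_id, ds;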
    • 46. Research Challenges
      Reducing response time for small/medium jobs
      20 thousand queries per day → 1 million queries per day
      Indexes on Hadoop, data mart strategy
      Near real-time query processing – pipelining MapReduce
      Distributed systems problems at large scale:
      Job scheduling problem: mixed throughput and response time workloads
      Orchestrate commits on thousands of machines (scribe conf files)
      Cross data center replication and consistency
      Full SQL compliance
      Required by 3rd party tools (e.g., BI) through ODBC/JDBC.
    • 47. Query Optimizations
      Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture
      Cost models in the MapReduce framework
    • 48. Social Graph
      Every user sees a different, personalized stream of information (news feed)
      130 friend + 60 object updates in real time
      Edge-rank: ranking of updates that should be shown on the top
      Social graph is stored in distributed MySQL databases
      Data replication between data centers: an update to one data center should be replicated to other data centers as well
      How to partition a dense graph such that data transfer across partitions is minimized?
    • 49. Questions?