Published in: Technology
  • Motivations: - The problems we face - The role of data infrastructure team in FB - Why we chose the current infrastructure?
  • List of apps, news feed, ads/notifications. Dynamic web site. What it boils down to is a set of web services, not a big deal.
  • -- As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. -- As of March 2009, there were 25.21 billion indexable web pages. At an average page size of 300 KB, that is roughly 7,500 terabytes (7.5 PB). Google’s index size is estimated at 200 TB to 2 PB.
  • 1GB connectivity within a rack, 100MB across racks? Are all disks 7200 SATA?
  • WaterlooHiveTalk

    1. 1. Petabyte Scale Data Warehousing at Facebook<br />Ning Zhang<br />Data Infrastructure<br />Facebook<br />
    2. 2. Overview<br />Motivations<br />Data-driven model<br />Challenges<br />Data Infrastructure<br />Hadoop & Hive<br />In-house tools<br />Hive Details<br />Architecture<br />Data model<br />Query language<br />Extensibility<br />Research Problems<br />
    3. 3. Motivations <br />
    4. 4. Facebook is just a Set of Web Services …<br />
    5. 5. … at Large Scale<br />The social graph is large<br />400 million monthly active users<br />250 million daily active users<br />160 million active objects (groups/events/pages)<br />130 friend connections per user on average<br />60 object (groups/events/pages) connections per user on average<br />Activities on the social graph<br />People spent 500 billion minutes per month on FB<br />Average user creates 70 pieces of content each month<br />25 billion pieces of content are shared each month<br />Millions of search queries per day<br />Facebook is still growing fast<br />New users, features, services …<br />
    6. 6. Facebook is still growing and changing<br />
    7. 7. Under the Hood<br />Data flow from users’ perspective<br />Clients (browser/phone/3rd party apps) → Web Services → Users<br />Another big topic on the Web Services<br />To complete the feedback system …<br />The developers want to know how a new app/feature is received by the users (A/B test)<br />The advertisers want to know how their ads perform (dashboard/reports)<br />Based on historical data, how to construct a model and predict the future (machine learning)<br />Need data analytics! <br />Data warehouse: ETL, data processing, BI …<br />Closing the loop: decision-making based on analyzing the data (users’ feedback)<br />
    8. 8. Data-driven Business/R&D/Science …<br />DSS is not new but Web gives it new elements.<br />“In 2009, more data will be generated by individuals than the entire history of mankind through 2008.”<br />-- by Andreas Weigend, Harvard Business Review<br />“The center of the universe has shifted from e-business to me-business.”<br />-- same as above<br />“Invariably, simple models and a lot of data trump more elaborate models based on less data.” <br />-- by Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data<br />
    9. 9. Problems and Challenges<br />Data-driven development/business <br />Huge amount of log data/user data generated every day<br />Need to analyze this data to feed back into development/business decisions<br />Machine learning, report/dashboard generation, A/B testing<br />And many more problems<br />Scalability (more than petabytes)<br />Availability (HA)<br />Manageability (e.g., scheduling)<br />Performance (CPU, memory, disk/network I/O)<br />And many more…<br />
    10. 10. Facebook Engineering Teams (backend)<br />Facebook Infrastructure<br />Building foundations that serve end users/applications<br />OLTP workload<br />Components include MySQL, memcached, HipHop (PHP), thrift, Cassandra, Haystack, flashcache, …<br />Facebook Data Infrastructure (data warehouse)<br />Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc.<br />OLAP workload<br />Components include Hadoop, Hive, HDFS, scribe, HBase, tools (ETL, UI, workflow management etc.)<br />Other Engineering teams<br /> Platform, search, site integrity, monetization, apps, growth, etc. <br />
    11. 11. DI Key Challenges (I) – scalability<br />Data, data and more data<br />200 GB/day in March 2008 → 12 TB/day at the end of 2009<br />About 8x increase per year <br />Total size is 5 PB now (x3 when considering replication)<br />Same order as the Web (~25 billion indexable pages)<br />
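The growth figures on this slide can be checked with a little arithmetic. The sketch below assumes that "end of 2009" means December 2009 (21 months after March 2008) and uses decimal units; the variable names are illustrative only.

```python
# Daily ingest figures quoted on the slide.
start_gb_per_day = 200          # March 2008
end_gb_per_day = 12 * 1000      # end of 2009 (12 TB/day, decimal units)

# Assumption: "end of 2009" is taken as December 2009, i.e. 21 months later.
months = 21

total_growth = end_gb_per_day / start_gb_per_day    # overall growth factor
annual_growth = total_growth ** (12 / months)       # compound yearly factor

# Replicated footprint: 5 PB of data stored three times.
logical_pb = 5
replication_factor = 3
replicated_pb = logical_pb * replication_factor

print(f"overall growth:  {total_growth:.0f}x")
print(f"annual growth:   {annual_growth:.1f}x")
print(f"replicated size: {replicated_pb} PB")
```

Under these assumptions the compound rate lands at roughly 10x per year, the same order as the slide's "about 8x"; the exact value depends on which dates are used. The replicated footprint works out to 15 PB.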
    12. 12. DI Key Challenges (II) – Performance<br />Queries, queries and more queries<br />More than 200 unique users query the data warehouse every day<br />7K queries/day at the end of 2009<br />25K queries/day now<br />Workload is a mixture of ad-hoc queries and ETL/reporting queries.<br />Fast, faster and real-time<br />Users expect faster response time on fresher data (e.g., fighting spam/fraud in near real-time)<br />Sampling a subset of the data is not always good enough<br />
    13. 13. Other Requirements<br />Accessibility<br />Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!)<br />Schema discovery (more than 20K tables)<br />Data exploration and visualization (learning the data by looking)<br />Leverage existing prevalent and familiar tools (e.g., BI tools) <br />Flexibility<br />Schema changes frequently (adding new columns, changing column types, different partitions of tables, etc.)<br />Data formats could be different (plain text, row store, column store, complex data types)<br />Extensibility<br />Easy to plug in user defined functions, aggregations etc. <br />Data storage could be files, web services, “NoSQL stores”<br />
    14. 14. Why not Existing Data Warehousing Systems?<br />Cost of analysis and storage on proprietary systems does not support trends towards more data.<br />Cost based on data size (15 PB costs a lot!)<br />Expensive hardware and support<br /> Limited Scalability does not support trends towards more data<br />Product designed decades ago (not suitable for petabyte DW)<br />ETL is a big bottleneck<br />Long product development & release cycle<br />User requirements change frequently (agile programming practice)<br />Closed and proprietary systems<br />
    15. 15. Let's try Hadoop (MapReduce + HDFS) …<br />Pros<br />Superior in availability/scalability/manageability (99.9%)<br />Large and healthy open source community (popular in both industry and academic organizations)<br />
    16. 16. But not quite …<br />Cons: Programmability and Metadata<br />Efficiency not that great, but throw more hardware<br />MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get the results<br />No schema<br />Solution: Hive!<br />
    17. 17. What is Hive?<br />A system for managing and querying structured data built on top of Hadoop<br />Map-Reduce for execution<br />HDFS for storage<br />RDBMS for metadata<br />Key Building Principles:<br />SQL is a familiar language on data warehouses<br />Extensibility – Types, Functions, Formats, Scripts (connecting to HBase, Pig, Hypertable, Cassandra etc.)<br />Scalability and Performance<br />Interoperability (JDBC/ODBC/thrift)<br />
    18. 18. Hive: Familiar Schema Concepts<br />
    19. 19. Column Data Types<br /><ul><li>Primitive Types
    20. 20. integer types, float, string, date, boolean
    21. 21. Nest-able Collections
    22. 22. array<any-type>
    23. 23. map<primitive-type, any-type>
    24. 24. User-defined types
    25. 25. structures with attributes which can be of any-type</li></li></ul><li>Hive Query Language<br />DDL<br />{create/alter/drop} {table/view/partition}<br />create table as select<br />DML<br />Insert overwrite<br />QL<br />Sub-queries in from clause<br />Equi-joins (including Outer joins)<br />Multi-table Insert<br />Sampling<br />Lateral Views<br />Interfaces<br />JDBC/ODBC/Thrift<br />
    26. 26. Optimizations<br />Column Pruning<br />Also pushed down to scan in columnar storage (RCFILE)<br />Predicate Pushdown<br />Not pushed below non-deterministic functions (e.g., rand())<br />Partition Pruning<br />Sample Pruning<br />Handle small files<br />Merge while writing<br />CombineHiveInputFormat while reading<br />Small Jobs<br />SELECT * with partition predicates in the client <br />Restartability (Work In Progress)<br />
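The first three optimizations on this slide can be illustrated with a toy scan. This is a minimal sketch of the ideas, not Hive's implementation: the `scan` function, the dict-of-lists table layout, and the sample `page_views` data are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical table layout: partition value -> rows stored in that partition.
Table = Dict[str, List[dict]]

def scan(table: Table,
         partition_pred: Callable[[str], bool],
         columns: List[str],
         row_pred: Callable[[dict], bool]) -> Iterator[dict]:
    """Scan a partitioned table, reading as little as possible."""
    for partition, rows in table.items():
        # Partition pruning: skip whole partitions using only their name.
        if not partition_pred(partition):
            continue
        for row in rows:
            # Predicate pushdown: filter at the scan, before later operators.
            if not row_pred(row):
                continue
            # Column pruning: keep only the columns the query refers to.
            yield {c: row[c] for c in columns}

page_views = {
    "2010-03-01": [{"uid": 1, "url": "/a", "ms": 120},
                   {"uid": 2, "url": "/b", "ms": 80}],
    "2010-03-02": [{"uid": 1, "url": "/c", "ms": 300}],
}

result = list(scan(page_views,
                   partition_pred=lambda ds: ds == "2010-03-02",
                   columns=["uid", "ms"],
                   row_pred=lambda r: r["ms"] > 100))
print(result)
```

Only the second partition is touched, and the `url` column never leaves the scan.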
    27. 27. Hive: Simplifying Hadoop Programming<br />$ cat > /tmp/reducer.sh<br />uniq -c | awk '{print $2" "$1}'<br />$ cat > /tmp/map.sh<br />awk -F '\001' '{if($1 > 100) print $1}'<br />$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1 <br />$ bin/hadoop dfs -cat /tmp/largekey/part*<br />vs.<br />hive> select key, count(1) from kv1 where key > 100 group by key;<br />
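Both versions on this slide compute the same thing: a count per key, for keys greater than 100. The sketch below restates that logic in plain Python. It assumes the rows of `kv1` are delimited by Ctrl-A (`\x01`), Hive's default field separator; the sample rows are made up.

```python
from collections import Counter

def mapper(lines, delimiter="\x01"):
    """Emit the key (first field) of each row whose key is greater than 100."""
    for line in lines:
        key = line.split(delimiter)[0]
        if int(key) > 100:
            yield key

def reducer(keys):
    """Count occurrences of each key, like `uniq -c` on sorted input."""
    return Counter(keys)

# Hypothetical sample of kv1 rows: key and value separated by Ctrl-A.
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
counts = reducer(mapper(rows))
print(sorted(counts.items()))
```

The streaming job needs two scripts, a jar invocation, and a separate read of the output; the Hive query states the same filter and grouping in one line.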
    28. 28. MapReduce Scripts Examples<br />add file page_url_to_id.py;<br />add file my_python_session_cutter.py;<br />FROM<br /> (SELECT TRANSFORM(uhash, page_url, unix_time)<br />USING 'page_url_to_id.py'<br /> AS (uhash, page_id, unix_time)<br /> FROM mylog<br /> DISTRIBUTE BY uhash<br /> SORT BY uhash, unix_time) mylog2<br />SELECT TRANSFORM(uhash, page_id, unix_time)<br />USING 'my_python_session_cutter.py'<br />AS (uhash, session_info);<br />
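The slide does not show the scripts themselves. As a rough idea of what a session cutter could look like, here is a hypothetical sketch. It relies on two things: `TRANSFORM` hands rows to the script as tab-separated lines, and the query's `DISTRIBUTE BY` / `SORT BY` deliver each user's rows together in time order. The 30-minute gap and the `start:end:pages` format of `session_info` are assumptions, not taken from the talk.

```python
from itertools import groupby

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the slide

def cut_sessions(lines, gap=SESSION_GAP_SECONDS):
    """Yield 'uhash<TAB>session_info' lines from rows sorted by (uhash, unix_time)."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for uhash, group in groupby(rows, key=lambda r: r[0]):
        start = last = None
        pages = 0
        for _, page_id, unix_time in group:
            t = int(unix_time)
            if last is not None and t - last > gap:
                # Gap too long: close the current session and start a new one.
                yield f"{uhash}\t{start}:{last}:{pages}"
                start, pages = t, 0
            if start is None:
                start = t
            last = t
            pages += 1
        yield f"{uhash}\t{start}:{last}:{pages}"

# Hypothetical input, already sorted by uhash and unix_time.
sample = ["u1\t10\t1000", "u1\t11\t1100", "u1\t12\t5000", "u2\t10\t2000"]
sessions = list(cut_sessions(sample))
print(sessions)
```

Run as a real `TRANSFORM` script, the same function would be fed `sys.stdin` and its output printed line by line.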
    29. 29. Hive Architecture<br />
    30. 30. Hive: Making Optimizations Transparent <br />Joins:<br />Joins try to reduce the number of map/reduce jobs needed.<br />Memory efficient joins by streaming largest tables.<br />Map Joins<br />User specified small tables stored in hash tables on the mapper<br />No reducer needed<br />Aggregations:<br />Map side partial aggregations<br />Hash-based aggregates<br />Serialized key/values in hash tables<br />90% speed improvement on Query<br />SELECT count(1) FROM t;<br />Load balancing for data skew<br />
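Two of the techniques on this slide, map joins and map-side partial aggregation, are easy to show in miniature. The sketch below is a simplified illustration in plain Python, not Hive code; the function names and sample tables are hypothetical.

```python
from collections import defaultdict

def map_join(large_rows, small_rows, key):
    """Join inside the mapper: hash the small table, stream the large one."""
    lookup = defaultdict(list)
    for row in small_rows:            # small table fits in memory
        lookup[row[key]].append(row)
    for row in large_rows:            # large table is streamed, never buffered
        for match in lookup.get(row[key], []):
            yield {**match, **row}

def partial_count(rows, key):
    """Map-side partial aggregation: one hash-table entry per distinct key."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

def merge_counts(partials):
    """Reduce side: add up the partial counts produced by each mapper."""
    total = defaultdict(int)
    for partial in partials:
        for k, n in partial.items():
            total[k] += n
    return dict(total)

countries = [{"cc": "US", "name": "United States"}, {"cc": "CA", "name": "Canada"}]
clicks_a = [{"cc": "US", "ad": 1}, {"cc": "CA", "ad": 2}]   # rows seen by mapper A
clicks_b = [{"cc": "US", "ad": 3}]                          # rows seen by mapper B

joined = list(map_join(clicks_a, countries, "cc"))
totals = merge_counts([partial_count(clicks_a, "cc"), partial_count(clicks_b, "cc")])
print(joined)
print(totals)
```

The join needs no shuffle because every mapper holds the whole small table. With partial counts, each mapper sends one record per distinct key to the reducers instead of one per input row.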
    31. 31. Hive: Making Optimizations Transparent<br />Storage:<br />Column oriented data formats<br />Column and Partition pruning to reduce scanned data<br />Lazy de-serialization of data<br />Plan Execution<br />Parallel Execution of Parts of the Plan<br />
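Lazy de-serialization means a record is parsed only when a query actually reads it. The sketch below is a simplified, row-level version of the idea. The `LazyRow` class is hypothetical and `parse_count` exists only to make the effect visible.

```python
class LazyRow:
    """Holds a raw delimited record and parses fields only when asked."""

    def __init__(self, raw: str, delimiter: str = "\x01"):
        self._raw = raw
        self._delimiter = delimiter
        self._fields = None          # not split until first access
        self.parse_count = 0         # instrumentation for the example only

    def get(self, index: int) -> str:
        if self._fields is None:
            self._fields = self._raw.split(self._delimiter)
            self.parse_count += 1
        return self._fields[index]

rows = [LazyRow("1\x01alice\x01US"), LazyRow("2\x01bob\x01CA")]

# A query that only touches row 0 never pays to parse row 1.
first_name = rows[0].get(1)
print(first_name, rows[0].parse_count, rows[1].parse_count)
```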
    32. 32. Hive: Open & Extensible<br />Different on-disk storage (file) formats<br />Text File, Sequence File, …<br />Different serialization formats and data types<br />LazySimpleSerDe, ThriftSerDe …<br />User-provided map/reduce scripts<br />In any language, use stdin/stdout to transfer data …<br />User-defined Functions<br />Substr, Trim, From_unixtime …<br />User-defined Aggregation Functions<br />Sum, Average …<br />User-defined Table Functions<br />Explode …<br />
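The three kinds of user-defined function differ in their row cardinality. The sketch below shows the three shapes as a Python analogy. It is not Hive's Java API: the method names `iterate`, `merge`, and `terminate` are borrowed loosely from Hive's aggregation evaluators, and everything else is hypothetical.

```python
def trim_udf(value: str) -> str:
    """UDF: one value in, one value out, applied to every row."""
    return value.strip()

class AverageUdaf:
    """UDAF: many rows in, one value out, with mergeable partial state."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value: float) -> None:
        self.total += value
        self.count += 1

    def merge(self, other: "AverageUdaf") -> None:
        self.total += other.total
        self.count += other.count

    def terminate(self) -> float:
        return self.total / self.count if self.count else float("nan")

def explode_udtf(values):
    """UDTF: one row in, zero or more rows out."""
    for v in values:
        yield (v,)

# Two partial aggregates, as if computed on different mappers, then merged.
a, b = AverageUdaf(), AverageUdaf()
for x in (1, 2):
    a.iterate(x)
b.iterate(6)
a.merge(b)

trimmed = trim_udf("  hive ")
average = a.terminate()
exploded = list(explode_udtf([10, 20]))
print(trimmed, average, exploded)
```

The aggregate keeps a sum and a count rather than a running average, so partial results from different mappers can be merged.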
    33. 33. Hive: Interoperability with Other Tools<br />JDBC<br />Enables integration with JDBC based SQL clients<br />ODBC<br />Enables integration with MicroStrategy<br />Thrift<br />Enables writing cross language clients<br />Main form of integration with PHP based Web UI<br />
    34. 34. Powered by Hive<br />
    35. 35. Usage in Facebook<br />
    36. 36. Usage<br />Types of Applications:<br />Reporting <br />E.g., daily/weekly aggregations of impression/click counts<br />Measures of user engagement <br />MicroStrategy reports<br />Ad hoc Analysis<br />E.g., how many group admins broken down by state/country<br />Machine Learning (Assembling training data) <br />Ad Optimization<br />E.g., user engagement as a function of user attributes<br />Many others<br />
    37. 37. Hadoop & Hive Cluster @ Facebook<br />Hadoop/Hive cluster<br />13600 cores<br />Raw Storage capacity ~ 17PB<br />8 cores + 12 TB per node<br />32 GB RAM per node<br />Two level network topology<br />1 Gbit/sec from node to rack switch<br />4 Gbit/sec to top level rack switch<br />2 clusters<br />One for adhoc users<br />One for strict SLA jobs<br />
    38. 38. Hive & Hadoop Usage @ Facebook<br />Statistics per day:<br />800TB of I/O per day<br />10K – 25K Hive jobs per day<br />Hive simplifies Hadoop:<br />New engineers go through a Hive training session<br />Analysts (non-engineers) use Hadoop through Hive<br />Most jobs are Hive jobs<br />
    39. 39. Data Flow Architecture at Facebook<br />Scribe-HDFS<br />Web Servers<br />Scribe-Hadoop Cluster<br />Hive<br />replication<br />Adhoc Hive-Hadoop Cluster<br />Production Hive-Hadoop Cluster<br />Oracle RAC<br />Federated MySQL<br />
    40. 40. Scribe-HDFS: 101<br />[Diagram: several Scribed daemons forward <category, msgs> to Scribe-HDFS, which appends to /staging/<category>/<file> on HDFS Data Nodes]<br />
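The labels on this slide describe messages tagged with a category being appended to a per-category file under `/staging`. The sketch below mimics that flow on the local filesystem. It is not Scribe code: `append_messages`, the `current` file name, and the use of a temporary directory in place of HDFS are all assumptions made for illustration.

```python
import os
import tempfile

def append_messages(staging_root: str, category: str, msgs,
                    filename: str = "current") -> str:
    """Append a batch of messages to <staging_root>/<category>/<filename>."""
    category_dir = os.path.join(staging_root, category)
    os.makedirs(category_dir, exist_ok=True)
    path = os.path.join(category_dir, filename)
    with open(path, "a", encoding="utf-8") as f:
        for msg in msgs:
            f.write(msg + "\n")
    return path

# A local temporary directory stands in for the HDFS /staging directory.
staging = tempfile.mkdtemp()
append_messages(staging, "ad_clicks", ["click 1", "click 2"])
path = append_messages(staging, "ad_clicks", ["click 3"])

with open(path, encoding="utf-8") as f:
    logged = f.read().splitlines()
print(logged)
```

Because each batch is appended rather than written as a new file, data for a category accumulates in one place and downstream loaders can pick it up shortly after it is logged.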
    41. 41. Scribe-HDFS: Near real time Hadoop<br />Clusters collocated with the web servers<br />Network is the biggest bottleneck<br />Typical cluster has about 50 nodes.<br />Stats:<br />50TB/day of raw data logged<br />99% of the time data is available within 20 seconds<br />
    42. 42. Warehousing at Facebook<br />Instrumentation (PHP/Python etc.)<br />Automatic ETL<br />Continuously copy data to Hive tables<br />Metadata Discovery (CoHive)<br />Query (Hive)<br />Workflow specification and execution (Chronos)<br />Reporting tools<br />Monitoring and alerting<br />
    43. 43. Future Work<br />Scaling in a Dynamic and Fast Growing Environment<br />Erasure codes for Hadoop<br />Namenode scalability past 150 million objects<br />Isolating Adhoc queries from jobs with strict deadlines<br />Hive Replication<br />Resource Sharing<br />Pools for slots<br />More scalable loading of data<br />Incremental load of site data<br />Continuous load of log data<br />
    44. 44. Future Work<br />Discovering Data from > 20K tables<br />Collaborative Hive<br />Finding Unused/rarely used Data<br />
    45. 45. Future<br />Dynamic Inserts into multiple partitions<br />More join optimizations<br />Persistent UDFs, UDAFs and UDTFs<br />Benchmarks for monitoring performance<br />IN, exists and correlated sub-queries<br />Statistics<br />Materialized Views<br />
    46. 46. Research Challenges<br />Reducing response time for small/medium jobs<br />20 thousand queries per day → 1 million queries per day<br />Indexes on Hadoop, data mart strategy<br />Near real-time query processing – pipelining MapReduce<br />Distributed systems problems at large scale: <br />Job scheduling problem: mixed throughput and response time workloads<br />Orchestrate commits on thousands of machines (scribe conf files)<br />Cross data center replication and consistency<br />Full SQL compliance<br />Required by 3rd party tools (e.g., BI) through ODBC/JDBC.<br />
    47. 47. Query Optimizations<br />Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture<br />Cost models in the MapReduce framework<br />
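In a shared-nothing setting, these statistics are practical when each node can produce a small summary that merges cleanly with the others. The sketch below shows one naive way to do that, with fixed-width histograms and exact distinct sets. It is an illustration of the problem, not a method from the talk; the bucket width and sample data are arbitrary.

```python
from collections import Counter

def local_summary(values, bucket_width=10):
    """Per-node pass: a fixed-width histogram plus the set of distinct values."""
    hist = Counter((v // bucket_width) * bucket_width for v in values)
    return hist, set(values)

def merge(summaries):
    """Coordinator: histograms add, distinct sets union."""
    hist, distinct = Counter(), set()
    for h, d in summaries:
        hist.update(h)
        distinct |= d
    return hist, distinct

def approx_median_bucket(hist):
    """Return the lower bound of the bucket containing the median."""
    half = sum(hist.values()) / 2
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= half:
            return bucket

node_a = [3, 14, 15, 27]
node_b = [14, 31, 42]

hist, distinct = merge([local_summary(node_a), local_summary(node_b)])
median_bucket = approx_median_bucket(hist)
print(dict(hist), len(distinct), median_bucket)
```

The histogram stays small regardless of data size, which is why it merges cheaply. The exact distinct set does not, and that cost is part of what makes this a research problem at petabyte scale.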
    48. 48. Social Graph<br />Every user sees a different, personalized stream of information (news feed)<br />130 friend + 60 object updates in real time<br />Edge-rank: ranking of updates that should be shown on the top<br />Social graph is stored in distributed MySQL databases<br />Data replication between data centers: an update to one data center should be replicated to other data centers as well<br />How to partition a dense graph such that data transfer from different partitions is minimized.<br />
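The partitioning question in the last bullet is usually phrased in terms of the edge cut: the number of connections whose endpoints land in different partitions. The sketch below only measures the cut for a given assignment; finding a good assignment is the hard part. The graph and both assignments are made-up examples.

```python
def edge_cut(edges, assignment):
    """Count edges whose endpoints live in different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Hypothetical friendship graph: two tight clusters joined by one edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

by_cluster = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

cut_clustered = edge_cut(edges, by_cluster)
cut_alternating = edge_cut(edges, alternating)
print(cut_clustered, cut_alternating)
```

Keeping each cluster on one partition cuts a single edge, while a naive alternating assignment cuts five of the seven. In a dense graph such clean clusters are rare, which is what makes the problem difficult.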
    49. 49. Questions?<br />