Facebook Retrospective - Big data-world-europe-2012


A retrospective on building, running and using the Hadoop stack at Facebook.

Published in: Technology

    1. Data Infrastructure at Facebook: A Retrospective. Joydeep Sen Sarma, ex-Facebook DI Lead, Founder of Qubole
    2. Intro
       • File/database systems developer (ex-NetApp/Oracle)
       • Yahoo (2005-07), Facebook (2007-11)
       • @Facebook:
         – SysAdmin: operated massive Hadoop/Hive installs
         – Architect: conceived/wrote Apache Hive; made HBase@FB happen
         – Herded cats: first manager of the Data Infra team
         – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
         – Vested my stock options!
       • Founder, Qubole Inc. (2011-)
    3. What not to do: Yahoo
       • Want to add a ‘feed’ to the warehouse? Fill out a form, schmooze the PM, wait 2 months.
       • Want to justify a project? Take $100M, double count 5 times.
       • Hard to find out what data exists in the company; silos
       • Lots of grand architecture, but no progress
    4. Goals going in
       • Universal ability to log data and compute against it
       • Build infrastructure for data processing
         – Help people help themselves
         – Get out of the way
       • Done is better than perfect; move fast
         – Iterate, fix failures fast, do everything twice
    5. State of the Union
       • Sep 2007:
         – Use case: compute relationship strength between friends
         – Data sets: user graph, interaction and page-view logs
         – ~10TB cluster…
       • July 2011:
         – Ads reporting/data mining, News Feed ranking, spam classification, PYMK, search indexing, entitization, sentiment analysis, fraud analysis ..
         – ~10k queries a day, hundreds of users, scores concurrent
         – 50PB cluster, 15 engineers/ops in total manning it
    6. User Feedback
       • Ex-Yahoo Senior Director, Ads Product Mgmt.: "I haven't done SQL for ages - but I can use this stuff easily"
       • Ex-Yahoo Data Scientist: "This is so amazing. That all data is stored in one place and I can get access instantly without having to wait months and contact multiple groups/silos"
       • Ex-Paypal Fraud Analyst: "So much better data and infrastructure than I have ever had in the past"
    7. Key Highways
       • Hive
         – Centrally managed Hadoop service, no setup
         – SQL is easy; add scripts for map-reduce
         – Browser-based query wizards for SQL dummies
           • Download results to Excel
           • Schedule queries periodically with a few clicks
       • Scribe
         – Just log data using Scribe from any application
         – Dead simple to add attributes to user page views
         – Easy to pull data from RDBMS
    8. Key Highways
       • Simple workflow authoring system (Databee)
       • Reporting is easy
         – Provision MySQL data marts in hours
         – Easy self-service charting/dashboarding software
       • Data Explorer
         – Wiki-like system for documenting tables, columns, types
         – Keyword search; find table authors, users
         – Help people help people
    9. Democracies – Ugh!
       • “Democracy may not be the perfect … but it is better than the alternatives.”
       • “The family that poops together stays together”
    10. Maintaining Order
       • Hadoop Fair Scheduler
         – Guarantee resources to projects/users. Share excess capacity
       • Multiple compute tiers
         – Production, large ad-hoc, small ad-hoc, local-mode queries
       • Kill the bad guys
         – Code to hunt down bad queries/apps
         – Track CPU/disk usage; go after biggies
       • Ban assault rifles
         – Basic ACLs: can’t delete important tables, directories
    11. Why did we succeed? All Hail Data Consolidation
       • (9pm, FB Hack Night) Ads Engineering Director: “Hey Joy, I want to join user fb-currency purchases with friend request data to test a thesis – pointers?”
    12. Hadoop
       • Cheap
         – Can consolidate everything
         – We made it cheaper (RCFile, HDFS-RAID)
       • Reduces governance cost
         – Only worry about really, really large stuff
         – Fewer data replication processes to manage
       • Separates compute from storage
         – Most legacy vendors don’t get this
       • Disk-based analytic systems degrade gracefully
         – No tipping point (vs. in-memory only)
         – Ability to catch up, go back into the past (vs. real-time stream processing only)
    13. Things we missed
    14. Things we missed
       • SLOOOOOOW
         – Extensive work on FB Hadoop repo for faster scheduling
         – Make testing faster (approx. queries)
         – Watch @Qubole
       • SQL as rope
         – Need higher-level templates. Don’t need 10 versions of a 30-day moving average calculator
       • Duplication of queries/jobs
         – How to discover if there are existing summaries?
         – People help people, but still ..
       • Didn’t build enough APIs
    15. Final Words
       • It’s not the software, stupid
         – Software is easy to write and fix
         – Can be slow
       • It’s the service that matters
         – Making everything work seamlessly
         – Ability to fix/improve things FAST
    16. Q&A
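
The Fair Scheduler idea from slide 10 (guarantee resources to projects/users, share excess capacity) can be illustrated with a minimal Python sketch. This is not the Hadoop Fair Scheduler's actual algorithm or API: the function name `fair_share`, the pool names, and the slot counts are all hypothetical, and a real scheduler also deals with weights, preemption, and task locality.

```python
from typing import Dict, Tuple


def fair_share(total_slots: int, pools: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Toy allocation: guaranteed minimums first, then share the excess.

    `pools` maps a (hypothetical) pool name to (guaranteed_min, demand).
    """
    # Step 1: every pool gets its guaranteed minimum, capped by what it needs.
    alloc = {name: min(gmin, demand) for name, (gmin, demand) in pools.items()}
    remaining = total_slots - sum(alloc.values())
    if remaining < 0:
        raise ValueError("guaranteed minimums exceed cluster capacity")

    # Step 2: hand out excess slots one at a time to the pool that still
    # has unmet demand and currently holds the fewest slots.
    while remaining > 0:
        hungry = [n for n in pools if alloc[n] < pools[n][1]]
        if not hungry:
            break  # all demand satisfied; leftover capacity stays idle
        name = min(hungry, key=lambda n: (alloc[n], n))
        alloc[name] += 1
        remaining -= 1
    return alloc


if __name__ == "__main__":
    # Hypothetical pools: (guaranteed_min, demand) in task slots.
    example = {"ads": (40, 80), "adhoc": (10, 100), "prod": (20, 20)}
    print(fair_share(100, example))
```

In this toy version a pool never receives more than it asks for, so a lightly loaded production pool leaves its unused capacity available to ad-hoc users, which is the behaviour the slide describes.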