Data Infrastructure at Facebook: A Retrospective
Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder, Qubole

Intro
• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
  – SysAdmin: operated massive Hadoop/Hive installs
  – Architect: conceived/wrote Apache Hive. Made HBase@FB happen
  – Herded cats: first manager of Data Infra team
  – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
  – Vested my stock options!
• Founder, Qubole Inc. (2011-)

What not to do: Yahoo
• Want to add ‘feed’ in warehouse? Fill form, schmooze PM, wait 2 months.
• Want to justify project? Take $100M, double count 5 times.
• Hard to find out what data exists in the company. Silos.
• Lots of grand architecture, but no progress

Goals going in
• Universal ability to log data and compute against it
• Build infrastructure for data processing
  – Help people help themselves
  – Get out of the way
• Done is better than perfect. Move Fast.
  – Iterate, Fix Failures Fast, Do everything twice

State of the Union
• Sep, 2007:
  – Use Case: compute relationship strength between friends
  – Data Sets: user graph, interaction and page-view logs
  – ~10TB cluster…
• July, 2011:
  – Ads reporting/data-mining, News Feed ranking, Spam classification, PYMK, Search Indexing, Entitization, Sentiment Analysis, Fraud Analysis ..
  – ~10k queries a day, hundreds of users, scores concurrent
  – 50PB cluster, staffed by 15 engineers/ops in total

User Feedback
• Ex-Yahoo Senior Director, Ads Product Mgmt.: "I haven't done SQL for ages - but I can use this stuff easily"
• Ex-Yahoo Data Scientist: "This is so amazing. That all data is stored in one place and I can get access instantly without having to wait months and contact multiple groups/silos"
• Ex-Paypal Fraud Analyst: "So much better data and infrastructure than I have ever had in the past"

Key Highways
• Hive
  – Centrally managed Hadoop service, no setup
  – SQL is easy, add scripts for map-reduce
  – Browser-based query wizards for SQL dummies
    • Download results to Excel
    • Schedule queries periodically with a few clicks
• Scribe
  – Just log data using Scribe from any application
  – Dead simple to add attributes to user page views
  – Easy to pull data from RDBMS

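The "add scripts for map-reduce" point can be illustrated with a minimal sketch of a row-transforming script of the kind Hive can stream rows through. Hive passes rows to such a script as tab-separated lines on standard input and reads tab-separated lines back. The column layout (`user_id`, `url`) and the function name `transform` are assumptions made for illustration only, not anything taken from the Facebook deployment.

```python
from urllib.parse import urlparse


def transform(lines):
    """Map each 'user_id<TAB>url' row to 'user_id<TAB>domain'.

    The two-column layout is a hypothetical example, not a real schema.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        user_id, url = fields
        # netloc is empty when the value has no scheme/host part
        domain = urlparse(url).netloc or "unknown"
        yield f"{user_id}\t{domain}"
```

To use it as a streaming script, one would feed it `sys.stdin` and print each yielded row, then reference the script from a query through Hive's `TRANSFORM ... USING` clause. Keeping the logic in a plain generator makes it easy to test without a cluster.
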
Key Highways
• Simple Workflow authoring system (Databee)
• Reporting is easy
  – Provision MySQL Data-marts in hours
  – Easy self-service charting/dashboarding software
• Data Explorer
  – Wiki-like system for documenting tables, columns, types
  – Keyword Search, find table authors, users
  – Help people help people

Democracies – Ugh!
“Democracy may not be the perfect … but it is better than the alternatives.”
“The family that poops together stays together”

Maintaining Order
• Hadoop Fair Scheduler
  – Guarantee resources to projects/users. Share excess capacity
• Multiple Compute tiers
  – Production, Large Ad-hoc, Small Ad-hoc, Local-mode queries
• Kill the bad guys
  – Code to hunt down bad queries/apps
  – Track cpu/disk usage – go after biggies
• Ban assault rifles
  – Basic ACLs – can’t delete important tables, directories

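The "guarantee resources, share excess capacity" idea can be sketched roughly as follows. This is a simplified illustration, not the actual Hadoop Fair Scheduler algorithm: it assumes integer slots, assumes the guarantees sum to no more than the cluster capacity, and uses hypothetical pool names.

```python
def allocate_slots(capacity, pools):
    """Split `capacity` slots across pools.

    pools maps a pool name to (guaranteed_slots, demanded_slots).
    Assumes the guarantees together do not exceed capacity.
    """
    # Step 1: every pool gets its guarantee, capped by what it actually needs.
    alloc = {name: min(g, d) for name, (g, d) in pools.items()}
    spare = capacity - sum(alloc.values())

    # Step 2: hand out spare slots one at a time to the pool that still has
    # unmet demand and currently holds the fewest slots (ties broken by name).
    while spare > 0:
        hungry = [n for n, (_, d) in pools.items() if alloc[n] < d]
        if not hungry:
            break  # all demand satisfied; leftover capacity stays idle
        neediest = min(hungry, key=lambda n: (alloc[n], n))
        alloc[neediest] += 1
        spare -= 1
    return alloc
```

For example, with 100 slots and pools `{"ads": (40, 30), "feed": (30, 80), "adhoc": (10, 50)}`, the `ads` pool only needs 30 slots, so the unused part of its guarantee is shared and the result is 30, 35 and 35 slots respectively.
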
Why did we succeed? All Hail Data Consolidation
(9pm, FB Hack Night) Ads Engineering Director: “Hey Joy, I want to join user fb-currency purchases with friend request data to test a thesis – pointers?”

Hadoop
• Cheap
  – Can consolidate everything.
  – We made it cheaper (RCFile, HDFS-RAID)
• Reduces governance cost
  – Only worry about really, really large stuff.
  – Fewer data replication processes to manage
• Separates compute from storage
  – Most legacy vendors don’t get this
• Disk-based analytic systems degrade gracefully
  – No tipping point (vs. in-memory only)
  – Ability to catch up, go back into the past (vs. real-time stream processing only)

Things we missed
• SLOOOOOOW
  – Extensive work on FB Hadoop repo for faster scheduling
  – Make testing faster (approx. queries)
  – Watch @Qubole
• SQL as rope
  – Need higher-level templates. Don’t need 10 versions of a 30-day moving average calculator
• Duplication of queries/jobs
  – How to discover whether summaries already exist?
  – People help people, but still ..
• Didn’t build enough APIs

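One way to read the "higher-level templates" point is a single parameterized generator in place of many hand-written variants of the same query. The sketch below is an assumption about what such a template might look like, not a tool described in the talk. It emits a standard SQL window function, assumes the input table has exactly one row per day, and uses a hypothetical function name and argument names throughout.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def moving_average_sql(table, date_col, value_col, window_days=30):
    """Build a trailing moving-average query over a one-row-per-day table."""
    for name in (table, date_col, value_col):
        # Only accept plain identifiers so arbitrary SQL cannot be spliced in.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a simple identifier: {name!r}")
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    preceding = window_days - 1  # the window also includes the current row
    return (
        f"SELECT {date_col}, "
        f"AVG({value_col}) OVER (ORDER BY {date_col} "
        f"ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW) "
        f"AS {value_col}_avg_{window_days}d "
        f"FROM {table}"
    )
```

Calling `moving_average_sql("daily_revenue", "ds", "revenue")` yields one query with a 30-row trailing window; a 7-day variant is the same call with `window_days=7` rather than a separate hand-written query.
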
Final Words
• It’s not the software, stupid
  – Software is easy to write and fix
  – Can be slow
• It’s the service that matters
  – Making everything work seamlessly
  – Ability to fix/improve things FAST