2. Data Management at Facebook
Jeff Hammerbacher
Manager, Data
June 12, 2008
3. My Background
Thanks for Asking
hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
  ▪ 25 amazing data engineers and scientists with more on the way
  ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
4. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
  ▪ Web Tier (more than 10,000 servers)
  ▪ Memcached Tier (around 1,000 servers)
  ▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
  ▪ AJAX
  ▪ Photo and Video Services
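A common way the three tiers interact is a cache-aside read path: the web tier checks memcached first and falls back to MySQL on a miss. The deck does not spell this out, so the following is a hypothetical sketch, with plain dicts standing in for the memcached and MySQL tiers:

```python
# Hypothetical cache-aside read path; stores, keys, and values are
# illustrative stand-ins, not Facebook's actual code.
cache = {}                                  # "memcached tier"
database = {"user:42": {"name": "Alice"}}   # "MySQL tier"

def get(key):
    if key in cache:               # 1. try memcached first
        return cache[key]
    value = database.get(key)      # 2. on a miss, fall back to MySQL
    if value is not None:
        cache[key] = value         # 3. populate the cache for next time
    return value

print(get("user:42"))  # first call misses the cache and fills it
print(get("user:42"))  # second call is served from the cache
```

The point of the pattern is that repeated reads of hot data never touch the MySQL tier, which is why the memcached tier can be an order of magnitude smaller than the web tier.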
6. Services Infrastructure
Thrift, Mainly
Developing a Thrift service:
▪ Define your data structures
  ▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
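The first two steps above are captured in a Thrift interface definition file; a minimal hypothetical example (struct, service, and field names are illustrative, not from the deck):

```thrift
// Hypothetical example.thrift -- names are illustrative.
struct UserProfile {
  1: i64 id,
  2: string name,
}

service ProfileService {
  // One endpoint; the Thrift compiler generates client and
  // server stubs for each selected language.
  UserProfile getProfile(1: i64 id),
}
```

Running the Thrift compiler on this file (e.g. with `--gen py`) produces the stub code that the remaining steps build on.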
7. Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
  ▪ Networks, Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
  ▪ Network transport libraries
  ▪ Serialization libraries
  ▪ Code generation
  ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
8. Data Infrastructure
Offline Batch Processing
(Diagram: a Scribe Tier and a MySQL Tier feed a Data Collection Server, which loads an Oracle Database Server)
▪ “Data Warehousing”
  ▪ Began with Oracle database
  ▪ Schedule data collection via cron
  ▪ Collect data every 24 hours
  ▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
  ▪ Started at tens of GB in early 2006
  ▪ Up to about 1 TB per day by mid-2007
  ▪ Log files largest source of data growth
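A rough sketch of the hand-coded Python "ETL" style described above. The field layout and names are assumptions for illustration, not Facebook's actual scripts:

```python
from collections import defaultdict

def aggregate_page_views(log_lines):
    """Toy 'transform' step: parse tab-delimited log lines of the form
    timestamp<TAB>userid<TAB>page_url and count views per page, ready
    to be loaded into a warehouse table. (Hypothetical format.)"""
    counts = defaultdict(int)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed records rather than abort the run
        _, _, page_url = fields
        counts[page_url] += 1
    return dict(counts)

logs = [
    "2007-06-12T00:01:02\t42\t/index.php",
    "2007-06-12T00:01:05\t43\t/index.php",
    "2007-06-12T00:01:09\t42\t/profile.php",
]
print(aggregate_page_views(logs))  # {'/index.php': 2, '/profile.php': 1}
```

A cron entry would then run such a script every 24 hours over the day's logs and load the result into the warehouse. This style works at tens of GB but, as the next slides show, it strains once log volume reaches a TB per day.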
9. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
  ▪ Distributor.pl: distribute binaries to processing nodes
  ▪ C++ binaries: parse, agg, load
(Diagram: a Cheetah Master dispatches partitioned log files from a Filer to a Processing Tier)
10. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
  ▪ Limited filer bandwidth
  ▪ No centralized logfile metadata
  ▪ Writing a new Cheetah job requires writing C++ binaries
  ▪ Jobs are difficult to monitor and debug
  ▪ No support for ad hoc querying
  ▪ Not open source
12. Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
  ▪ CPU: Intel Xeon, dual socket, quad core (8 cores per box)
  ▪ Memory: 16 GB ECC DRAM
  ▪ Disk: 4 x 1 TB 7200 RPM SATA
  ▪ Network: 1 GbE
▪ Topology
  ▪ 320 nodes arranged into 8 racks of 40 nodes each
  ▪ 8 x 1 Gbps links out to the core switch
13. Anatomy of the Facebook Cluster
Recent Cluster Statistics
From May 2nd to May 21st:
▪ Total jobs: 8,794
▪ Total map tasks: 1,362,429
▪ Total reduce tasks: 86,806
▪ Average duration of a successful job: 296 s
▪ Average duration of a successful map: 81 s
▪ Average duration of a successful reduce: 678 s
14. Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
  ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, C++
  ▪ Allows for library reuse, faster development
  ▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
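In the Streaming model the mapper and reducer are ordinary scripts joined by pipes: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums consecutive lines per key. A word-count-style sketch in Python (the actual job wiring via the `hadoop-streaming` jar is omitted; names here are illustrative):

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word, as a Streaming mapper
    # would write to stdout.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Streaming hands the reducer mapper output sorted by key, so
    # consecutive lines with the same word can be summed in one pass.
    for word, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t", 1)[1]) for l in group)
        yield "%s\t%d" % (word, total)

# Simulate the shuffle phase with a local sort between the two stages.
mapped = sorted(mapper(["the cat", "the dog"]))
print(list(reducer(mapped)))  # ['cat\t1', 'dog\t1', 'the\t2']
```

Because the contract is just "lines in, lines out," the same pattern works in PHP or C++, which is exactly the library-reuse advantage the slide describes; the cost is the extra process and text-parsing overhead behind "eats way too much CPU."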
15. Initial Hadoop Applications
Unstructured Text Analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
  ▪ Understanding serialization format of wall post logs
  ▪ Common data operations: project, filter, join, group by
  ▪ Developed using Hadoop Streaming for rapid prototyping in Python
  ▪ Scheduling regular processing and recovering from failures
  ▪ Making it easy to regularly load new data
17. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
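The "train many randomized predictors on bootstrap samples, then average their votes" idea can be sketched in miniature. In this toy sketch each "tree" is reduced to a one-feature threshold stump on 2-D data, so it illustrates the bagging-plus-random-feature idea rather than implementing Breiman's full method:

```python
import random

def train_stump(sample):
    # sample: list of ((x0, x1), label) pairs with labels 0/1.
    # Randomize like a forest: pick one feature at random, then take
    # the threshold/orientation most accurate on this bootstrap sample.
    f = random.randrange(2)
    best = None
    for (x, _) in sample:
        for sign in (1, -1):
            t = x[f]
            correct = sum(1 for (xi, yi) in sample
                          if (1 if sign * (xi[f] - t) >= 0 else 0) == yi)
            if best is None or correct > best[0]:
                best = (correct, f, t, sign)
    _, f, t, sign = best
    return lambda x: 1 if sign * (x[f] - t) >= 0 else 0

def train_forest(data, n_trees=25):
    # Each predictor sees an independent bootstrap resample of the data.
    return [train_stump([random.choice(data) for _ in data])
            for _ in range(n_trees)]

def predict(forest, x):
    # Classification: majority vote over the ensemble.
    votes = sum(stump(x) for stump in forest)
    return 1 if 2 * votes >= len(forest) else 0

random.seed(0)
data = [((0.0, 0.1), 0), ((0.1, 0.0), 0), ((0.2, 0.3), 0), ((0.3, 0.2), 0),
        ((0.7, 0.8), 1), ((0.8, 1.0), 1), ((0.9, 0.7), 1), ((1.0, 0.9), 1)]
forest = train_forest(data)
print(predict(forest, (0.95, 0.95)), predict(forest, (0.05, 0.05)))
```

Any single stump is a weak, high-variance predictor, but the bootstrap resampling decorrelates them and the vote averages the variance away; for regression, the same scheme averages numeric predictions instead of voting.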
18. More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, or Application
▪ Regular aggregation of high volumes of log file data
  ▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
20. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
  ▪ Notifications
  ▪ News Feed story insertion
  ▪ Invitations
  ▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
21. More Hadoop Applications
Miscellaneous
▪ Experimentation Platform back end
  ▪ A/B Testing
  ▪ Champion/Challenger Testing
▪ Lots of internal analyses
▪ Export smaller data sets to R
▪ Ads: targeting optimization, fraud detection
▪ Search: index building, ranking optimization
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
22. Hive
Structured Data Management with Hadoop
▪ Hadoop:
  ▪ HDFS
  ▪ MapReduce
  ▪ Resource Manager
  ▪ Job Scheduler
▪ Hive:
  ▪ Logical data partitioning
  ▪ Metadata store
  ▪ Query operators
  ▪ Query language
25. Hive
Sample Queries
▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
      page_url STRING, referrer_url STRING, ip STRING)
  COMMENT 'This is the page view table'
  PARTITIONED BY(date DATETIME, country STRING)
  BUCKETED ON (userid) INTO 32 BUCKETS
  ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
  STORED AS COMPRESSED
  LOCATION '/user/facebook/warehouse/page_view';
▪ FROM pv_users
  INSERT INTO TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY(pv_users.gender)
  INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY(pv_users.age);
26. Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
27. Cassandra
Structured Storage over a P2P Network
▪ Conceptually: the BigTable data model on Dynamo infrastructure
▪ Design goals:
  ▪ High availability
  ▪ Incremental scalability
  ▪ Eventual consistency (trade consistency for availability)
  ▪ Optimistic replication
  ▪ Low total cost of ownership
  ▪ Minimal administrative overhead
  ▪ Tunable tradeoffs between consistency, durability, and latency
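The last bullet is often made concrete with the quorum arithmetic from the Dynamo design: with N replicas, a read from R replicas is guaranteed to overlap a write to W replicas whenever R + W > N. A minimal sketch of that rule (an illustration of the general Dynamo-style tradeoff, not Cassandra's actual API):

```python
def read_sees_latest_write(n_replicas, r, w):
    # Any R-replica read set and W-replica write set must intersect
    # when R + W > N, so the read observes the latest committed write.
    return r + w > n_replicas

# N = 3 replicas: strict quorums (R=2, W=2) stay consistent;
# dropping to R=1 lowers read latency but risks stale reads.
print(read_sees_latest_write(3, 2, 2))  # True
print(read_sees_latest_write(3, 1, 2))  # False
```

Choosing smaller R or W buys latency and availability at the cost of consistency, which is exactly the tunable tradeoff the slide lists.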