Data infrastructure at Facebook, with reference to the conference paper "Data warehousing and analytics infrastructure at Facebook"
Data warehouse
Hadoop - Hive - Scribe
The use of the latest internet technologies has resulted in large volumes of data, and one of the main challenges is storing and processing that data. The techniques used to manage this massive amount of data and to extract value from it are collectively called Big Data. Over recent years there has been rising interest in big data for social media analysis. Online social media have become important platforms across the world for sharing information. Facebook, one of the largest social media sites, receives millions of posts every day. One of the efficient technologies that deals with Big Data is Hadoop, which uses the MapReduce programming model for processing large data volumes. This presentation provides a survey of Hadoop and its role at Facebook, and a brief introduction to Hive.
Data infrastructure at Facebook
1. Republic of Tunisia
Ministry of Higher Education and Scientific Research
University of Monastir
Faculty of Sciences of Monastir
Data Infrastructure at Facebook
Elaborated by: Doukh Ahmed
SRA 2
2019 - 2020
2. WHAT WE WILL DO
Introduction
Facebook and Big Data
Storage systems at Facebook
Data Warehousing at Facebook
Conclusion
4. Introduction
If Facebook were a country, it would be the most populous nation on earth. Now in its 11th year, Facebook stands today as one of the most popular social networking sites, comprising 1.59 billion accounts, approximately one fifth of the world's total population.
With tens of millions of users and more than a billion page views every day, Facebook ends up accumulating massive amounts of data.
One of the challenges the company has faced since its early days is developing a scalable way of storing and processing all these bytes, since using this historical data is a very big part of how it can improve the user experience on Facebook.
5. About a year back (2010), Facebook began experimenting with an open source project called Hadoop.
Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm.
Facebook started by importing some interesting data sets into a relatively small Hadoop cluster and was quickly rewarded, as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements.
6. OS + Web server + Database + Programming language + Communication (servers / apps) = Data Infrastructure
7. What is Apache Hadoop?
Apache Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware, it has also found use on clusters of higher-end hardware.
Goals of HDFS?
• Very Large Distributed File System:
– 10K nodes, 100 million files, 10 - 100 PB
• Assumes Commodity Hardware:
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
8. • Optimized for Batch Processing :
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth.
• User Space, runs on heterogeneous OS
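The slides contain no code, so here is a minimal, self-contained sketch of the map-reduce model they describe. The sample log lines and function names are invented for illustration; real Hadoop distributes the map, shuffle, and reduce steps across HDFS blocks and cluster nodes rather than running them in one process.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)

def run_job(records):
    grouped = defaultdict(list)
    for record in records:                   # "map" step
        for key, value in map_phase(record):
            grouped[key].append(value)       # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())  # "reduce" step

if __name__ == "__main__":
    logs = ["user viewed page", "user liked page", "page viewed"]
    print(run_job(logs))  # {'user': 2, 'viewed': 2, 'page': 3, 'liked': 1}
```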
9. What is Apache Hive ?
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing
data query and analysis.
Hive gives a SQL-like interface to query data stored in various databases and file systems that
integrate with Hadoop.
Comparison with traditional databases:
The storage and querying operations of Hive closely resemble those of traditional databases. However, while Hive uses a SQL dialect, there are many differences in the structure and workings of Hive compared with relational databases.
The differences are mainly because Hive is built on top of the Hadoop ecosystem, and has to comply with the restrictions of Hadoop and MapReduce.
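As an illustration only (not taken from the slides or the paper), the following sketch queries a Hive table from Python with the third-party PyHive client. It assumes a running HiveServer2; the host, table, and column names are hypothetical.

```python
from pyhive import hive  # third-party client for HiveServer2

# Hypothetical connection details and table; adjust to your cluster.
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# Hive exposes a SQL-like language (HiveQL) that is compiled into
# distributed jobs over files stored in HDFS.
cursor.execute("""
    SELECT action, COUNT(*) AS cnt
    FROM page_view_log
    WHERE dt = '2020-01-01'
    GROUP BY action
    ORDER BY cnt DESC
    LIMIT 10
""")

for action, cnt in cursor.fetchall():
    print(action, cnt)

cursor.close()
conn.close()
```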
10. What is Apache HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System) or Alluxio, providing Bigtable-like capabilities for Hadoop.
That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
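For illustration, a minimal sketch of reading and writing sparse rows through the third-party happybase client (which talks to HBase's Thrift gateway). The gateway host, table, row keys, and column families are hypothetical, not Facebook's schema.

```python
import happybase  # Python client for HBase's Thrift gateway

# Hypothetical Thrift gateway and table.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("messages")

# HBase stores sparse rows: only the cells that exist are written,
# grouped under column families (here "meta" and "body").
table.put(b"user42-msg-0001", {
    b"meta:sender": b"user42",
    b"meta:timestamp": b"1577836800",
    b"body:text": b"hello",
})

# Point read by row key, then a short scan over a row-key prefix.
row = table.row(b"user42-msg-0001")
print(row[b"body:text"])

for key, data in table.scan(row_prefix=b"user42-"):
    print(key, data.get(b"meta:timestamp"))

connection.close()
```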
11. What is Scribe?
Scribe (log server) was a server for aggregating log data streamed in real time from a large number of servers. It was designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine.
Scribe was developed at Facebook and released in 2008 as open source.
Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph.
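A conceptual sketch of the store-and-forward behaviour described above, not Scribe's actual Thrift interface; every class and method name here is invented for illustration. Each node knows only its downstream neighbour and buffers locally when that neighbour is unreachable, which is what makes the chain robust to the failure of any single machine or link.

```python
import collections

class ScribeNodeSketch:
    """Toy model of one node in a Scribe-style forwarding graph."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream             # next node in the graph, or None (final sink)
        self.local_buffer = collections.deque()  # spill space while downstream is down
        self.stored = []                         # messages persisted at the sink
        self.available = True

    def log(self, category, message):
        entry = (category, message)
        if self.downstream is None:
            self.stored.append(entry)            # final aggregation point (e.g. HDFS writer)
        elif self.downstream.available:
            self._flush_buffer()
            self.downstream.log(category, message)
        else:
            self.local_buffer.append(entry)      # buffer until the neighbour recovers

    def _flush_buffer(self):
        while self.local_buffer:
            self.downstream.log(*self.local_buffer.popleft())

# Web server -> mid-tier aggregator -> central sink
sink = ScribeNodeSketch("central")
mid = ScribeNodeSketch("aggregator", downstream=sink)
web = ScribeNodeSketch("webserver", downstream=mid)

web.log("page_view", "user=42 url=/home")
mid.available = False                       # simulate a failed aggregator
web.log("page_view", "user=7 url=/profile") # buffered locally on the web server
mid.available = True
web.log("page_view", "user=9 url=/home")    # buffered entry is flushed first
print(len(sink.stored))                     # 3
```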
12. Apache ZooKeeper is a software project of the Apache Software Foundation. It is essentially a centralized service offering a hierarchical key-value store to distributed systems, used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
ZooKeeper was a sub-project of Hadoop but is now a top-level Apache project in its own right.
ZooKeeper was developed in order to fix the problems that occurred while deploying distributed big data applications. Some of the prime features of Apache ZooKeeper are:
Reliable System: The system is very reliable, as it keeps working even if a node fails.
Simple Architecture: The architecture of ZooKeeper is quite simple, as there is a shared hierarchical namespace which helps coordinate the processes.
Fast Processing: ZooKeeper is especially fast in "read-dominant" workloads (i.e. workloads in which reads are much more common than writes).
Scalable: The performance of ZooKeeper can be improved by adding nodes.
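For illustration, a minimal sketch of the hierarchical namespace and configuration/coordination use described above, using the third-party kazoo client for ZooKeeper. The ensemble address, znode paths, and payloads are hypothetical.

```python
from kazoo.client import KazooClient  # third-party Python client for ZooKeeper

# Hypothetical ensemble address and znode paths.
zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Shared hierarchical namespace: znodes look like filesystem paths and
# can hold small configuration payloads.
zk.ensure_path("/services/log-loader/workers")
if not zk.exists("/services/log-loader/config"):
    zk.create("/services/log-loader/config", b"batch_interval=300")

data, stat = zk.get("/services/log-loader/config")
print(data.decode(), "version:", stat.version)

# Ephemeral znodes disappear when the session ends, which is the usual
# building block for service discovery and leader election.
zk.create("/services/log-loader/workers/worker-", b"host-01",
          ephemeral=True, sequence=True)

# Watches notify the client when the children of a znode change.
@zk.ChildrenWatch("/services/log-loader/workers")
def on_workers_change(children):
    print("active workers:", children)

zk.stop()
```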
23. Semi-online Light Transaction Processing Databases (SLTP)
▪ Facebook Messages and Facebook Time Series
Immutable Data Store
▪ Photos, videos, etc.
Analytics Data Store
▪ Data Warehouse, Logs storage (this is what we will talk about!)
24. Size and Scale of Databases (total size and technology)
• Facebook Messages and Time Series Data: tens of petabytes
• Facebook Photos: high tens of petabytes (Haystack)
• Data Warehouse: hundreds of petabytes (this is what we will talk about!)
27. Fig 1 : Data Flow Architecture at Facebook
https://avishkarm.blogspot.com/2013/02/hadoop-architecture-and-its-usage-at.html
28. As shown in Figure 1, there are two sources of data:
1. The federated MySQL tier, which contains all the Facebook site related data.
2. The web tier, which generates all the log data.
And there are two different Hive-Hadoop clusters:
1. The production Hive-Hadoop cluster, which is used to execute jobs that need to adhere to very strict delivery deadlines.
2. The ad hoc Hive-Hadoop cluster, which is used to execute lower priority batch jobs as well as any ad hoc analysis that users want to do on historical data sets.
29. Data coming from the web servers
Is pushed to a set of Scribe-Hadoop (scribeh) clusters. These clusters comprise Scribe servers running on Hadoop clusters.
The Scribe servers aggregate the logs coming from different web servers and write them out as HDFS files in the associated Hadoop cluster.
More than 30 TB of data is transferred to the scribeh clusters every day.
In order to reduce cross data center traffic, the scribeh clusters are located in the data centers hosting the web tiers.
30. Data pushed to Scribe-Hadoop clusters
Is periodically compressed by copier jobs and transferred to the Hive-Hadoop clusters.
The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters; in this manner the log data gets moved to the Hive-Hadoop clusters.
At this point the data is mostly in the form of HDFS files. It gets published, either hourly or daily, in the form of partitions in the corresponding Hive tables through a set of loader processes, and then becomes available for consumption.
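A conceptual sketch of such a copier job (Facebook's real copiers are not public, and the compression step is omitted here): list files on the scribeh cluster, and copy any file not seen before to the warehouse cluster. It uses the standard `hadoop fs` and `hadoop distcp` commands, but the cluster URIs and paths are hypothetical.

```python
import subprocess
import time

SCRIBEH = "hdfs://scribeh-cluster:8020/logs/page_view"
WAREHOUSE = "hdfs://hive-hadoop-cluster:8020/staging/page_view"
INTERVAL_SECONDS = 10 * 60          # copiers run every 5-15 minutes
already_copied = set()

def list_files(uri):
    """Return file paths under an HDFS directory via `hadoop fs -ls`."""
    out = subprocess.run(["hadoop", "fs", "-ls", uri],
                         capture_output=True, text=True, check=True).stdout
    # Regular files start with "-" in the permissions column;
    # the last whitespace-separated field of each line is the path.
    return {line.split()[-1] for line in out.splitlines() if line.startswith("-")}

def copy_new_files():
    for path in sorted(list_files(SCRIBEH) - already_copied):
        # distcp performs the cross-cluster copy as a distributed job.
        subprocess.run(["hadoop", "distcp", path, WAREHOUSE], check=True)
        already_copied.add(path)

if __name__ == "__main__":
    while True:
        copy_new_files()
        time.sleep(INTERVAL_SECONDS)
```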
31. Data coming from the federated MySQL tier
Is loaded into the Hive-Hadoop clusters through daily scrape processes.
Scrape processes:
• Dump the desired data sets from the MySQL databases.
• Compress them on the source systems.
• Move them into the Hive-Hadoop cluster.
The scrapes need to be resilient to failures and also need to be designed such that they do not put too much load on the MySQL databases.
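A conceptual sketch of one such scrape, following the three steps above with standard tools (mysqldump, gzip, `hadoop fs -put`). Facebook's actual scrape code is not public; the hosts, database, table, and paths are hypothetical.

```python
import subprocess

SHARD_HOST = "mysql-shard-01.example.com"
DATABASE = "users_shard_01"
TABLE = "profiles"
LOCAL_DUMP = f"/tmp/{DATABASE}.{TABLE}.sql.gz"
HDFS_TARGET = f"hdfs://hive-hadoop-cluster:8020/scrapes/{DATABASE}/{TABLE}/"

def scrape_table():
    # 1. Dump the desired data set. --single-transaction keeps the read
    #    consistent without locking tables for the whole dump, limiting
    #    the load placed on the production database.
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "-h", SHARD_HOST, DATABASE, TABLE],
        stdout=subprocess.PIPE)
    # 2. Compress on the source system before shipping.
    with open(LOCAL_DUMP, "wb") as out:
        subprocess.run(["gzip"], stdin=dump.stdout, stdout=out, check=True)
    dump.wait()
    if dump.returncode != 0:
        raise RuntimeError("mysqldump failed; scrape must be retried")
    # 3. Move the compressed dump into the Hive-Hadoop cluster.
    subprocess.run(["hadoop", "fs", "-put", "-f", LOCAL_DUMP, HDFS_TARGET], check=True)

if __name__ == "__main__":
    scrape_table()
```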
32. The production & the ad hoc Hive-Hadoop clusters
Why does Facebook use these two types of clusters?
The ad hoc nature of user queries makes it dangerous to run production jobs in the same cluster: a badly written ad hoc job can hog the resources in the cluster, thereby starving the production jobs.
In the absence of sophisticated sandboxing techniques, separating the clusters for ad hoc and production jobs has become the practical choice for the company in order to avoid such scenarios.
34. Facebook now has multiple Hadoop clusters deployed, with the biggest having about 2500 CPU cores and 1 petabyte of disk space.
It loads over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and has hundreds of jobs running each day against these data sets.
The list of projects that use this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality.
Facebook uses the information generated by and from users to make decisions about improvements to the product. Hadoop has enabled the company to make better use of the data.
35. Because of the rapid adoption of Hadoop at Facebook:
Developers are free to write map-reduce programs in the language of their choice.
The company has embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as tables. Developers can explore the schemas and data of these tables much like they would with a traditional database; when they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset.
Operations on datasets can be written as map and reduce scripts, using standard query operators (like joins and group-bys), or as a mix of the two.
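To make the "map and reduce scripts" option concrete, here is a hedged sketch of a mapper/reducer pair in the Hadoop Streaming style, counting log events per action. The input field layout is hypothetical; the same aggregation could equally be written as the SQL group-by shown earlier, or the two styles can be mixed.

```python
#!/usr/bin/env python3
"""Hadoop Streaming style map and reduce scripts (illustration only)."""
import sys
from itertools import groupby

def mapper(lines):
    # Input: tab-separated log lines "user_id<TAB>action<TAB>url".
    # Emit "action<TAB>1" so the framework can group records by action.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            print(f"{fields[1]}\t1")

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so consecutive
    # lines with the same action can be summed with groupby.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for action, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{action}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```

With Hadoop Streaming, the same file would typically be passed as both the mapper and the reducer (the latter with a "reduce" argument); the exact streaming jar path depends on the installation.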
36. A lot of different components (Hadoop (HDFS and MapReduce), Hive, Scribe, HBase, ZooKeeper, ...) come together to provide a comprehensive platform for processing data at Facebook.
This infrastructure is used for many different types of jobs, each having different requirements.
Facebook's data infrastructure is built on open source technologies:
Data Infrastructure Overview = Hadoop + Hive + HBase + Scribe
GraphQL: created by Facebook to enable communication between applications and servers. This language is now used by a large number of companies.
Facebook is a company that has contributed enormously to the rise of Big Data by releasing its innovations as open source.