Data Infrastructure at Facebook
Republic of Tunisia
Ministry of Higher Education and Scientific Research
University of Monastir
Faculty of Sciences of Monastir
Elaborated by: Doukh Ahmed
SRA 2
2019-2020
WHAT WE WILL DO
Introduction
Facebook and Big Data
Storage systems at Facebook
Data Warehousing at Facebook
Conclusion
Introduction
With tens of millions of users and more than a billion page views every day, Facebook accumulates massive amounts of data.
One of the challenges it has faced since its early days is developing a scalable way of storing and processing all these bytes, since using this historical data is a very big part of how it can improve the user experience on Facebook.
Introduction
If Facebook were a country, it would be the most populous nation on earth. Running in its 11th year of success, Facebook stands today as one of the most popular social networking sites, comprising 1.59 billion accounts, approximately one fifth of the world's total population.
Around 2010, Facebook began experimenting with an open source project called Hadoop.
Hadoop provides a framework for large-scale parallel processing using a distributed file system and the map-reduce programming paradigm.
Facebook started by importing some interesting data sets into a relatively small Hadoop cluster and was quickly rewarded, as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements.
Data Infrastructure = OS + Web server + Database + Programming language + Communication between servers and apps
What is Apache Hadoop ?
Apache Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware, it has also found use on clusters of higher-end hardware.
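To make the map-reduce paradigm concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce steps be plain Python scripts reading from stdin and writing to stdout. The script names and the job submission line are illustrative assumptions, not taken from Facebook's setup.

#!/usr/bin/env python3
# mapper.py : emit "<word>\t1" for every word seen on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py : input arrives sorted by word, so counts can be summed per run of equal keys
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such a pair would typically be submitted with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs/ -output /wordcount/ ; Hadoop then runs many mapper instances in parallel across the cluster and routes every occurrence of a given word to exactly one reducer.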
Goals of HDFS ?
• Very Large Distributed File System :
– 10K nodes, 100 million files, 10 - 100 PB
• Assumes Commodity Hardware :
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing :
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth.
• Runs in user space, on heterogeneous operating systems
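As a small illustration of how an application typically reads and writes HDFS, the sketch below uses the third-party HdfsCLI Python package over WebHDFS; the namenode URL, user, and paths are placeholder assumptions, not Facebook's configuration.

# Minimal sketch of writing to and reading from HDFS over WebHDFS (HdfsCLI package).
# The namenode URL, user name, and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="analyst")

# Upload a local log file; HDFS splits it into blocks and replicates each block.
client.upload("/data/logs/2020-01-01/web.log", "web.log")

# List what landed in the directory.
print(client.list("/data/logs/2020-01-01"))

# Read the file back.
with client.read("/data/logs/2020-01-01/web.log") as reader:
    content = reader.read()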
What is Apache Hive ?
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing
data query and analysis.
Hive gives a SQL-like interface to query data stored in various databases and file systems that
integrate with Hadoop.
Comparison with traditional databases :
The storage and querying operations of Hive closely resemble those of traditional databases. However, although Hive offers a SQL dialect, there are many differences in how Hive is structured and how it works compared to relational databases.
The differences arise mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.
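To illustrate the SQL-like interface, here is a sketch that runs a HiveQL aggregation from Python using the PyHive client; the HiveServer2 host, the table, and the columns are invented for the example.

# Minimal sketch: querying a Hive table through HiveServer2 with PyHive.
# Host, port, table, and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-gateway.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL reads like SQL, but Hive compiles it down to jobs on the Hadoop cluster.
cursor.execute("""
    SELECT ds, COUNT(*) AS page_views
    FROM web_logs
    WHERE ds = '2020-01-01'
    GROUP BY ds
""")
print(cursor.fetchall())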
What is Apache HBase ?
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed File System) or Alluxio, providing Bigtable-like capabilities for Hadoop.
That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
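As a sketch of what this Bigtable-style data model looks like in practice, the snippet below writes and reads one sparse row with the happybase Python client; the Thrift gateway host, the table, and the column family are assumptions.

# Minimal sketch: one put and one get against HBase through its Thrift gateway,
# using the happybase client. Host, table, and column family are placeholders.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("messages")

# Rows are sparse: only the columns actually written for this row are stored.
table.put(b"user123#2020-01-01", {b"msg:body": b"hello", b"msg:sender": b"user456"})

row = table.row(b"user123#2020-01-01")
print(row[b"msg:body"])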
What is Scribe ?
Scribe (log server) was a server for aggregating log data streamed in real time from a large number of servers. It was designed to be scalable, extensible without client-side modification, and robust to failure of the network or of any specific machine.
Scribe was developed at Facebook and released as open source in 2008.
Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph.
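Clients talk to their local Scribe server over Thrift; below is a hedged sketch of sending one log line from Python. It follows the open-source Scribe interface (a LogEntry struct and a Log call), but the import path of the generated bindings and the category name are assumptions.

# Sketch: sending one log entry to a local Scribe server over Thrift.
# The module path of the Thrift-generated bindings and the category are assumptions.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe  # Thrift-generated Scribe bindings

socket = TSocket.TSocket("localhost", 1463)            # Scribe's default port
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = scribe.Client(protocol)

transport.open()
entry = scribe.LogEntry(category="web_clicks", message="user123 clicked ad 42\n")
result = client.Log(messages=[entry])                   # returns OK or TRY_LATER
transport.close()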
Apache ZooKeeper is a software project of the Apache Software Foundation. It is essentially a centralized service that offers distributed systems a hierarchical key-value store, used to provide a distributed configuration service, a synchronization service, and a naming registry for large distributed systems.
ZooKeeper was a sub-project of Hadoop but is now a top-level Apache project in its own right.
ZooKeeper was developed to address the coordination problems that commonly occur when deploying distributed big data applications. Some of its prime features are:
Reliable system: ZooKeeper keeps working even if a node fails.
Simple architecture: ZooKeeper exposes a shared hierarchical namespace, which helps coordinate processes.
Fast processing: ZooKeeper is especially fast for "read-dominant" workloads (i.e. workloads in which reads are much more common than writes).
Scalable: the performance of ZooKeeper can be improved by adding nodes.
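A small sketch of the hierarchical namespace in action, using the kazoo Python client; the ensemble addresses and znode paths are placeholders.

# Sketch: storing and reading a piece of shared configuration in ZooKeeper
# with the kazoo client. Ensemble addresses and znode paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# znodes form a hierarchical namespace, much like a small filesystem.
zk.ensure_path("/services/loader")
zk.create("/services/loader/config", b"batch_size=1000")

data, stat = zk.get("/services/loader/config")
print(data.decode(), "version:", stat.version)

zk.stop()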
Facebook and Big Data
Who generates this data
1.49 billion daily active users
500,000 new users every day (6 new profiles every second)
30 million users update their statuses at least once each day
More than 1 billion photos uploaded each month
More than 2.5 billion pieces of content shared each week
Data Usage
4 TB of new data added per day
210 TB of data scanned per day
80K compute hours per day
Storage systems at Facebook
Semi-online Light Transaction Processing Databases (SLTP)
• Facebook Messages and Facebook Time Series
Immutable Data Store
▪ Photos, videos, etc.
Analytics Data Store
▪ Data Warehouse, Logs storage
This is what we will talk about !
Size and Scale of Databases
• Facebook Messages and Time Series Data : total size of tens of petabytes
• Facebook Photos : total size of high tens of petabytes (technology: Haystack)
• Data Warehouse : total size of hundreds of petabytes <- this is what we will talk about !
Data Warehousing at Facebook
Data Flow Architecture
Fig 1 : Data Flow Architecture at Facebook
https://avishkarm.blogspot.com/2013/02/hadoop-architecture-and-its-usage-at.html
As shown in Figure 1, there are two sources of data:
1. The federated MySQL tier, which contains all the Facebook site-related data.
2. The web tier, which generates all the log data.
And there are two different Hive-Hadoop clusters:
1. The production Hive-Hadoop cluster, which is used to execute jobs that need to adhere to very strict delivery deadlines.
2. The ad hoc Hive-Hadoop cluster, which is used to execute lower-priority batch jobs as well as any ad hoc analysis that users want to run on historical data sets.
Data coming from the web servers
Is pushed to a set of Scribe-Hadoop (scribeh) clusters. These clusters consist of Scribe servers running on Hadoop clusters.
The Scribe servers aggregate the logs coming from the different web servers and write them out as HDFS files in the associated Hadoop cluster.
More than 30 TB of data is transferred to the scribeh clusters every day.
In order to reduce cross-data-center traffic, the scribeh clusters are located in the data centers hosting the web tiers.
Data pushed to the Scribe-Hadoop clusters
Is periodically compressed by copier jobs and transferred to the Hive-Hadoop clusters.
The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters; in this manner the log data gets moved to the Hive-Hadoop clusters.
At this point the data is mostly in the form of HDFS files. It gets published, either hourly or daily, as partitions in the corresponding Hive tables through a set of loader processes, and then becomes available for consumption.
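For illustration only, here is a rough sketch of what such a copier loop could look like: it discovers new files on the source cluster and hands them to hadoop distcp on a timer. The cluster URIs, the bookkeeping, and the 10-minute interval are invented placeholders, not Facebook's actual copier implementation.

# Illustrative sketch of a periodic "copier": discover new HDFS files on the
# scribeh cluster and copy them to the Hive-Hadoop cluster with distcp.
# Cluster URIs, paths, and the interval are placeholders.
import subprocess
import time

SRC = "hdfs://scribeh-cluster/logs/incoming"
DST = "hdfs://hive-warehouse-cluster/logs/staging"
already_copied = set()

while True:
    # List the files currently sitting on the source cluster.
    listing = subprocess.run(
        ["hadoop", "fs", "-ls", SRC],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    new_files = [line.split()[-1] for line in listing
                 if line.startswith("-") and line.split()[-1] not in already_copied]

    if new_files:
        # distcp runs the copy itself as a MapReduce job across the cluster.
        subprocess.run(["hadoop", "distcp"] + new_files + [DST], check=True)
        already_copied.update(new_files)

    time.sleep(10 * 60)  # the real copiers run every 5-15 minutes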
Data coming from the federated MySQL tier
Is loaded into the Hive-Hadoop clusters through daily scrape processes.
Scrape processes:
Dump the desired data sets from the MySQL databases.
Compress them on the source systems.
Move them into the Hive-Hadoop cluster.
The scrapes need to be resilient to failures and also need to be designed so that they do not put too much load on the MySQL databases.
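A hedged sketch of one such scrape step follows, assuming a plain mysqldump piped through gzip and then pushed into HDFS; the hosts, credentials, table, and paths are placeholders, not the actual scrape implementation.

# Illustrative sketch of a daily "scrape": dump one MySQL table, compress it
# on the source host, and move it into the warehouse cluster's HDFS.
# Host names, credentials, table, and paths are placeholders.
import subprocess
from datetime import date

ds = date.today().isoformat()
dump_file = f"/tmp/user_profiles.{ds}.sql.gz"

# Dump and compress on the source system to keep the network transfer small.
# --single-transaction avoids locking the table for the whole dump,
# limiting the load the scrape puts on the production database.
with open(dump_file, "wb") as out:
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "-h", "mysql-shard-01",
         "-u", "scraper", "facebook_site", "user_profiles"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
    dump.wait()

# Move the compressed dump into HDFS, partitioned by date so Hive can pick it up.
subprocess.run(["hadoop", "fs", "-mkdir", "-p",
                f"/warehouse/scrapes/user_profiles/ds={ds}"], check=True)
subprocess.run(["hadoop", "fs", "-put", dump_file,
                f"/warehouse/scrapes/user_profiles/ds={ds}/"], check=True)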
The production & the ad hoc Hive-Hadoop clusters
Why does Facebook use these two types of clusters ?
The ad hoc nature of user queries makes it dangerous to run production jobs in the same cluster. A badly written ad hoc job can hog the resources in the cluster, thereby starving the production jobs; in the absence of sophisticated sandboxing techniques, separating the clusters for ad hoc and production jobs has become the practical choice for the company to avoid such scenarios.
Conclusion
Facebook now has multiple Hadoop clusters deployed, the biggest having about 2,500 CPU cores and 1 petabyte of disk space.
It loads over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and runs hundreds of jobs each day against these data sets.
The list of projects using this infrastructure has proliferated, from those generating mundane statistics about site usage to others used to fight spam and determine application quality.
Facebook uses the information generated by and from users to make decisions about improvements to the product, and Hadoop has enabled the company to make better use of the data.
With the rapid adoption of Hadoop at Facebook:
Developers are free to write map-reduce programs in the language of their choice.
The company has embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as tables. Developers can explore the schemas and data of these tables much as they would with a traditional database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required data set.
Operations on data sets can be written as map and reduce scripts, using standard query operators (like joins and group-bys), or as a mix of the two.
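As an illustration of mixing the two styles, the sketch below embeds a custom Python script inside a HiveQL query via TRANSFORM and combines it with an ordinary GROUP BY; the table, the columns, and the helper script are hypothetical.

# Sketch: mixing SQL-style operators with a custom map script in Hive.
# Table, columns, and the helper script named below are hypothetical examples.
from pyhive import hive

QUERY = """
    -- the group-by is done with a plain SQL operator ...
    SELECT country, COUNT(*) AS signups
    FROM (
        -- ... over rows produced by a custom Python "map" script
        SELECT TRANSFORM (raw_line)
               USING 'python3 parse_signup.py'
               AS (country, user_id)
        FROM signup_logs
        WHERE ds = '2020-01-01'
    ) parsed
    GROUP BY country
"""

cursor = hive.Connection(host="hive-gateway.example.com", port=10000).cursor()
cursor.execute("ADD FILE parse_signup.py")   # ship the script to the cluster
cursor.execute(QUERY)
print(cursor.fetchall())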
A lot of different components (Hadoop (HDFS and MapReduce), Hive, Scribe, HBase, ZooKeeper, ...) come together to provide a comprehensive platform for processing data at Facebook.
This infrastructure is used for many different types of jobs, each with different requirements.
Any questions ?? !!
Editor's Notes
1. Lots of data is generated on Facebook.
2. The data infrastructure at Facebook is built on open-source technologies: Hadoop + Hive + HBase + Scribe. GraphQL was created by Facebook to enable communication between applications and servers; it is now used by a large number of companies. Facebook has contributed enormously to the rise of Big Data by releasing its innovations as open source.
3. Lots of data is generated on Facebook.
4. Lots of data is generated on Facebook.
5. Lots of data is generated on Facebook.