Facebook & Hadoop

A brief overview of the back end of the popular social networking site, Facebook.

  1. Leela Shivaranjani.U, III-BCA, Fatima College
  2. Lots of data is generated on Facebook:
     – 300+ million active users
     – 30 million users update their statuses at least once each day
     – More than 1 billion photos uploaded each month
     – More than 10 million videos uploaded each month
     – More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  3. Data Usage. Statistics per day:
     – 4 TB of compressed new data added
     – 135 TB of compressed data scanned
     – The data warehouse grows by roughly half a PB
  4. – Need to process multi-petabyte datasets.
     – Data may not have a strict schema.
     – It is expensive to build reliability into each application.
     – Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.
     – A common infrastructure is needed: efficient, reliable, and open source (Apache License).
  5. – Hadoop is a large-scale distributed batch-processing infrastructure.
     – Hadoop is designed to efficiently distribute large amounts of work across a set of machines. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores.
     – Hadoop processes large volumes of information efficiently by connecting many commodity computers together to work in parallel.
     – Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors.
  6. – Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, the input data set will likely not even fit on a single computer's hard drive, much less in memory.
     – Hadoop therefore includes a distributed file system that breaks up input data and sends fractions of the original data to several machines in the cluster to hold. The problem can then be processed in parallel using all of the machines in the cluster, and output results are computed as efficiently as possible.
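The slides show no code, but a minimal sketch of loading a file into the distributed file system may make this concrete. It uses Hadoop's Java FileSystem API; the class name and the local and HDFS paths below are hypothetical, and a real cluster would need its configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS, where the framework splits it
// into blocks and spreads them across the cluster's nodes automatically.
public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);      // handle to the cluster's default file system

        // Hypothetical paths, for illustration only.
        Path local = new Path("/tmp/weblogs.txt");
        Path remote = new Path("/user/demo/weblogs.txt");

        fs.copyFromLocalFile(local, remote);       // HDFS handles chunking and placement
        System.out.println("Loaded " + fs.getFileStatus(remote).getLen() + " bytes into HDFS");
    }
}
```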
  7. – Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel.
     – A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core machines.
     – Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster.
  8. – In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.
     – The Hadoop Distributed File System (HDFS) splits large data files into chunks that are managed by different nodes in the cluster.
     – In addition, each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable.
     – Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
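To observe the chunking and replication just described, one can ask HDFS where each block of a file lives. This sketch (again with a hypothetical path) uses FileSystem.getFileBlockLocations, which returns one entry per chunk, listing every host holding a replica.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list which nodes hold each replicated block of a file.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/weblogs.txt")); // hypothetical path

        // One BlockLocation per chunk; each lists the hosts holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```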
  9. Data is distributed across nodes at load time.
  10. – Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named "MapReduce."
      – In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different Mappers can be merged together.
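The slides name the model but include no code, so here is the classic word-count example sketched against Hadoop's Java MapReduce API as an illustration. The Mapper processes each input line in isolation; the Reducer merges the per-word counts that Hadoop groups together by key. The class and field names are my own.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: each input record (one line of text) is processed in isolation.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }

    // Reducer: Hadoop brings together all values emitted under the same key.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));  // emit (word, total)
        }
    }
}
```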
  11. – Separate nodes in a Hadoop cluster still communicate with one another. However, in contrast to more conventional distributed systems, where application developers explicitly marshal byte streams from node to node over sockets or through MPI (Message Passing Interface) buffers, communication in Hadoop is performed implicitly.
      – Pieces of data can be tagged with key names that inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
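To show how little communication code the developer writes, here is a hedged sketch of a driver for the word-count classes above. The input and output paths are hypothetical; note that nothing in the program opens a socket or routes data, since the shuffle that delivers each Mapper's output to the right Reducer, keyed on the words, happens entirely inside the framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths, for illustration only.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // All data transfer between Mappers and Reducers (the "shuffle")
        // is performed implicitly by Hadoop, keyed on the map output keys.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```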
  12. – One of the major benefits of using Hadoop, in contrast to other distributed systems, is its flat scalability curve.
      – A program written in a distributed framework other than Hadoop may require large amounts of refactoring when scaling from ten machines to one hundred or one thousand. This may require rewriting the program several times, and fundamental elements of its design may put an upper bound on the scale to which the application can grow.
  13. – Hadoop, however, is specifically designed to have a very flat scalability curve.
      – After a Hadoop program is written and functioning on ten nodes, very little work, if any, is required for that same program to run on a much larger amount of hardware. The underlying Hadoop platform manages the data and hardware resources and provides dependable performance growth proportionate to the number of machines available.
  14. Traditional RDBMS vs. MapReduce:

                  Traditional RDBMS            MapReduce
      Data        Structured                   Unstructured & structured
      Data size   Gigabytes                    Petabytes
      Access      Interactive & batch          Batch
      Updates     Read and write many times    Write once, read many times
      Structure   Static schema                Dynamic schema
      Integrity   High                         Low
      Scaling     Nonlinear                    Linear
  15. Prominent users of Hadoop: Facebook, Yahoo, eBay, Apple, Google, IBM, LinkedIn, Microsoft, Twitter, Amazon.com, Ancestry.com.
  16. Hadoop is used massively at Facebook:
      – In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage.
      – On July 27, 2011, they announced the data had grown to 30 PB.
      – On June 13, 2012, they announced the data had grown to 100 PB.
      – On November 8, 2012, they announced the warehouse was growing by roughly half a PB per day.
  17. Producing daily and hourly summaries over large amounts of data. These summaries are used for a number of different purposes within the company:
      – Reports based on these summaries are used by engineering and non-engineering functional teams to drive product decisions. They include reports on user growth, page views, and average time spent on the site.
      – Providing performance numbers for advertising campaigns run on Facebook.
      – Backend processing for site features such as "people and applications you may like."
      – Running ad hoc jobs over historical data. These analyses help answer questions from product groups and executive teams.
  18. – Hadoop serves as a de facto long-term archival store for the log datasets.
      – It is used to look up log events by specific attributes, which helps maintain the integrity of the site and protect users against spam bots.
