Lots of data is generated on Facebook
– 300+ million active users
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month
– More than 10 million videos uploaded each month
– More than 1 billion pieces of content (web links,
news, stories, blog posts, notes, photos, etc.) shared
Statistics per day:
– 4 TB of compressed new data added per day
– 135 TB of compressed data scanned per day
– data warehouse grows by roughly half a PB
Need to process multi-petabyte datasets
Data may not have strict schema
Expensive to build reliability into each application.
Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure
– Efficient, reliable, open source (Apache License)
Hadoop is a large-scale distributed batch processing infrastructure.
Hadoop is designed to efficiently distribute large amounts of work across
a set of machines.
While it can be used on a single machine, its true power lies in its ability
to scale to hundreds or thousands of computers, each with several processor cores.
Hadoop is designed to efficiently process large volumes of information
by connecting many commodity computers together to work in parallel.
Hadoop is written in the Java programming language and is a top-level
Apache project being built and used by a global community of contributors and users.
Hadoop is built to process "web-scale" data, on the
order of hundreds of gigabytes to terabytes or petabytes.
At this scale, it is likely that the input data set will not even fit on
a single computer's hard drive, much less in memory.
So Hadoop includes a distributed file system which breaks up
input data and sends fractions of the original data to several
machines in your cluster to hold.
This allows the problem to be processed in parallel using all
of the machines in the cluster, so that output results are computed as
efficiently as possible.
A theoretical 1000-CPU machine would cost a
very large amount of money, far more than 1,000
single-CPU or 250 quad-core machines.
Hadoop ties these smaller and more reasonably
priced machines together into a single cost-effective compute cluster.
In a Hadoop cluster, data is distributed to all the nodes of the
cluster as it is being loaded in.
The Hadoop Distributed File System (HDFS) will split large data
files into chunks which are managed by different nodes in the cluster.
In addition, each chunk is replicated across several
machines, so that a single machine failure does not result in any
data being unavailable.
Even though the file chunks are replicated and distributed
across several machines, they form a single namespace, so their
contents are universally accessible.
Data is distributed across nodes at load time.
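As a rough sketch of what this looks like to a programmer, the snippet below writes a small file into HDFS and reads it back through Hadoop's FileSystem Java API. The path and contents are invented for the example; HDFS decides where the blocks and their replicas are physically stored.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from the cluster's
        // configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // An ordinary-looking path in the single HDFS namespace; the
        // physical block placement and replication are handled by HDFS.
        Path file = new Path("/user/demo/hello.txt");

        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello from HDFS\n");
        }

        // The same path is readable from any node in the cluster,
        // wherever the blocks actually live.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(reader.readLine());
        }
    }
}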
Hadoop will not run just any program and distribute it
across a cluster.
Programs must be written to conform to a particular
programming model named "MapReduce."
In MapReduce, records are processed in isolation by
tasks called Mappers.
The output from the Mappers is then brought together
into a second set of tasks called Reducers, where results
from different mappers can be merged together.
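As a minimal sketch of the model, the classic word-count job below is written against Hadoop's org.apache.hadoop.mapreduce API; the class names are chosen for this example. The Mapper processes each input line in isolation and emits (word, 1) pairs, and the Reducer receives all counts emitted for a given word, from whichever mappers produced them, and sums them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each record (one line of input text) is processed in isolation.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // the word becomes the key of the output record
        }
    }
}

// Reducer: results for the same key are merged, regardless of which
// mapper produced them.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}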
Separate nodes in a Hadoop cluster still communicate with one another.
However, in contrast to more conventional distributed systems,
where application developers explicitly marshal byte streams
from node to node over sockets or through MPI (Message
Passing Interface) buffers, communication in Hadoop is performed implicitly.
Pieces of data can be tagged with key names which inform
Hadoop how to send related bits of information to a common
destination node.
Hadoop internally manages all of the data transfer and cluster topology issues.
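To make the implicit communication concrete, a hypothetical driver for the word-count sketch above is shown below. It only names the Mapper, Reducer, key/value types, and input/output paths; it never opens a socket or addresses another node, because Hadoop routes the keyed intermediate data itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The developer only declares what to run; Hadoop decides which
        // nodes run the tasks and moves the keyed intermediate data
        // from mappers to the reducer responsible for each key.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}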
One of the major benefits of using Hadoop in contrast to
other distributed systems is its flat scalability curve.
A program written in a distributed framework other than
Hadoop may require large amounts of refactoring when
scaling from ten to one hundred or one thousand machines.
This may involve having the program rewritten
several times; fundamental elements of its design may
also put an upper bound on the scale to which the
application can grow.
Hadoop, however, is specifically designed to have a
very flat scalability curve.
After a Hadoop program is written and functioning
on ten nodes, very little, if any, work is required for
that same program to run on a much larger amount of hardware.
The underlying Hadoop platform will manage the
data and hardware resources and provide
dependable performance growth proportionate to
the number of machines available.
Hadoop compared with a traditional RDBMS:
– Data: unstructured & structured (RDBMS: structured only)
– Access: batch (RDBMS: interactive & batch)
– Update pattern: write once, read many times (RDBMS: read and write many times)
Prominent Users of Hadoop
Hadoop is used extensively at Facebook.
In 2010 Facebook claimed that they had the largest Hadoop
cluster in the world with 21 PB of storage.
On July 27, 2011 they announced the data had grown to 30 PB.
On June 13, 2012 they announced the data had grown to 100 PB.
On November 8, 2012 they announced the warehouse grows by
roughly half a PB per day.
Producing daily and hourly summaries over large amounts of
data. These summaries are used for a number of different
purposes within the company:
Reports based on these summaries are used by engineering and
non-engineering functional teams to drive product decisions.
These summaries include reports on user growth, page
views, and average time spent on the site by users.
Providing performance numbers about advertisement campaigns
that are run on Facebook.
Backend processing for site features such as people and applications
you may like.
Running ad hoc jobs over historical data. These analyses help
answer questions from product groups and executive teams.
As a de facto long-term archival store for the log data.
Looking up log events by specific attributes, which
is used to maintain the integrity of the site and
protect users against spam bots.