An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
Big Data, Definition
No single standard definition…
"Big Data" is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
Information is powerful…
but it is how we use it that will
define us
Data Explosion
relational
text
audio
video
images
Big Data Era
- creates over 30 billion pieces of content per day
- stores 30 petabytes of data
- produces over 90 million tweets per day
Log Files
- Log files contain data.
- Each banking transaction should be logged at different levels.
How much log data does a banking solution generate per day?
Big Data: 3 V's
Big Data: 3 V's
volume
velocity
variety
Some make it 3 V's
What is driving the Big Data industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time processing
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Challenges
Big Data Challenges
Sorting 10 TB of data (an O(N log N) job):
- 1 node takes about 2.5 days
- 100 nodes take about 35 minutes
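A rough back-of-the-envelope check of these figures, assuming near-linear scaling across nodes (an idealization, since redistributing data between nodes adds overhead):

  2.5 days ≈ 60 hours ≈ 3,600 minutes
  3,600 minutes / 100 nodes ≈ 36 minutes, close to the quoted 35 minutes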
Big Data Challenges
Problem: "fat" servers imply high cost.
Solution: use cheap commodity nodes instead.
Problem: a large number of cheap nodes implies frequent failures.
Solution: leverage automatic fault tolerance.
Big Data Challenges
We need a new data-parallel programming model
for clusters of commodity machines.
What Technology Do We Have for Big Data?
MapReduce
MapReduce
Published in 2004 by Google
Popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
Word Count Example
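The word count slide itself is a diagram in the original deck; as a stand-in, here is a minimal sketch of the classic Hadoop word count job written against the standard org.apache.hadoop.mapreduce API (the input and output paths come from the command line and are placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, where both paths live in HDFS.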
MapReduce philosophy
- hide complexity
- make it scalable
- make it cheap
MapReduce was popularized by the Apache Hadoop project
Hadoop Overview
Open-source implementation of Google MapReduce and the Google File System (GFS)
First release in 2008 by Yahoo!
Wide adoption by Facebook, Twitter, Amazon, etc.
Everything Started by Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Hadoop Sub-Projects - 1
Hadoop Sub-Projects - 2
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Hadoop Distributed File System (HDFS) - 2
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
Hadoop Distributed File System (HDFS) - 3
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- HDFS is designed to carry on working without a noticeable interruption to the user in the face of hardware failure.
Where doesn't HDFS work well?
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
MapReduce and HDFS
HDFS Concepts - Blocks
64 MB, 128 MB, or 256 MB block size.
If the seek time is around 10ms, and the transfer rate is 100 MB/s,
then to make the seek time 1% of the transfer time, we need to
make the block size around 100 MB.
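Spelling out the arithmetic behind this rule of thumb, using the figures quoted above:

  seek time = 10 ms, and we want it to be 1% of the transfer time
  transfer time = 10 ms / 0.01 = 1 s
  block size ≈ transfer rate × transfer time = 100 MB/s × 1 s = 100 MB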
Anatomy of a File Read
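The read path is shown as a diagram in the deck: the client asks the namenode for block locations and then streams the blocks directly from the datanodes. Purely as a client-side illustration, here is a minimal sketch that reads a file out of HDFS through Hadoop's FileSystem API (the hdfs:// URI passed on the command line is a placeholder):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];  // e.g. an hdfs://... file URI (placeholder)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);  // the client contacts the namenode first
    InputStream in = null;
    try {
      // open() returns a stream that pulls the file's blocks from the datanodes
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}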
Anatomy of a File Write
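Correspondingly, a write asks the namenode to allocate blocks and then pipelines the data through a chain of datanodes. Here is a minimal client-side sketch that copies a local file into HDFS via the same FileSystem API (both paths are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];  // local file to copy (placeholder)
    String dst = args[1];       // destination hdfs://... URI (placeholder)

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // create() asks the namenode for new blocks; the data is then pipelined to the datanodes
    OutputStream out = fs.create(new Path(dst));
    try {
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      IOUtils.closeStream(out);  // closing flushes the last packet and completes the file
      IOUtils.closeStream(in);
    }
  }
}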
Replica Placement
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
Using Hadoop for:
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care
