Pravin Singh
introducing
BIG DATA
WHAT THE HECK IS BIG DATA?
Any collection of data sets so large and complex that it becomes difficult to process using current data management tools or traditional data processing applications.
Volume
• Exceeds physical limits of vertical scalability
Velocity
• Decision window small due to data change rate
Variety
• Many different formats make integration expensive
WHY SO LOW-COST?
Source: EMC
WHY SO FAST?
• Massive Parallel Processing
• Data Locality
• Optimized for write once – read many
• Sequential reads, not random access
Hello Hadoop!
You have an interesting name.
Hadoop Architecture
Source: Hortonworks
The Hadoop Zoo
HDFS
MapReduce
Pig Hive HCat Giraph Mahout
Zookeeper
The Real Simple Hadoop Architecture
MapReduce Engine
JobTracker
TaskTracker 1
TaskTracker 2
…
TaskTracker N
HDFS Cluster
NameNode
DataNode 1
DataNode 2
…
DataNode N
Hello HDFS!
Have we met before?
HDFS
My Data.txt
150 MB
64 MB
64 MB
22 MB
NameNode
64 MB  64 MB
64 MB  64 MB
22 MB  22 MB
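The split on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how HDFS is implemented: the function names `split_into_blocks` and `place_replicas` are made up for this example, the round-robin placement is an assumed policy, and the two copies per block simply mirror the diagram above.

```python
BLOCK_SIZE_MB = 64  # block size shown on the slide; real clusters configure this


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left
    return sizes


def place_replicas(num_blocks, data_nodes, replicas=2):
    """Assign each block to `replicas` distinct DataNodes, round-robin.

    Hypothetical placement policy, for illustration only.
    """
    if replicas > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(150)  # "My Data.txt", 150 MB
placement = place_replicas(len(blocks), ["DataNode 1", "DataNode 2", "DataNode 3"])
print(blocks)
print(placement)
```

In this sketch the returned dictionary plays the NameNode's role (it only records which DataNode holds which block), while the block contents themselves would live on the DataNodes.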
Hello MapReduce!
Have you lost some weight?
MapReduce
Input File
<Key, Value>
<Key, Value>
<Key, Value>
…
Shuffle & Sort
<Key, Value>
<Key, Value>
<Key, Value>
…
Result
MapReduce
Big Data for Dummies.txt
How many times do the words “Big data” and “Hadoop” show up?
MapReduce
<Big data, 7>
<Hadoop, 4>
<Big data, 9>
<Hadoop, 6>
<Big data, 3>
<Hadoop, 8>
<Big data, 7>
<Big data, 9>
<Big data, 3>
<Hadoop, 4>
<Hadoop, 6>
<Hadoop, 8>
<Big data, 7, 9, 3>
<Hadoop, 4, 6, 8>
<Big data, 19>
<Hadoop, 18>
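The same word count can be walked through in plain Python. This is a single-machine sketch of the map, shuffle & sort, and reduce steps, not Hadoop code: the function names are invented for this example, and the per-chunk counts are copied from the slide rather than computed from a real file.

```python
from collections import defaultdict


def map_chunk(text, terms):
    """Map step: emit one <term, count> pair per term for a chunk of text."""
    lowered = text.lower()
    return [(term, lowered.count(term.lower())) for term in terms]


def shuffle_and_sort(mapper_outputs):
    """Shuffle & sort step: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Mapper outputs as shown on the slide, one list per chunk of the input file.
mapper_outputs = [
    [("Big data", 7), ("Hadoop", 4)],
    [("Big data", 9), ("Hadoop", 6)],
    [("Big data", 3), ("Hadoop", 8)],
]

grouped = shuffle_and_sort(mapper_outputs)
totals = reduce_counts(grouped)
print(grouped)  # {'Big data': [7, 9, 3], 'Hadoop': [4, 6, 8]}
print(totals)   # {'Big data': 19, 'Hadoop': 18}
```

On a real cluster each `map_chunk` call would run on the node that already stores that chunk, which is the data locality point from the earlier slide.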
Let’s Play MapReduce!
’coz All Talk and No Play Makes Session a Dull Affair.
?
Questions. Comments. Feedback.
See you at the (Data) Lake Next Time.
THANK YOU!

