The presentation explores how we arrived at big data today and the technologies, like Hadoop, used to manage it. It ends with how we use Hadoop at our startup, Alchetron.com, plus some informative links at the end.
8. THIS GUY INVENTS THE WORLD WIDE WEB IN 1991
Sir Tim Berners-Lee invents the World Wide Web in 1991; now,
with the web, the amount of data generated
by mankind explodes!
13. In the next 20 years, computing will move to the microscopic level.
Computers won't be in our pockets but inside our bodies and minds.
This is where technology and biology will merge, which will
multiply and enhance our capabilities a thousand times.
30 years of mobile technology
16. With the invention of the internet plus small, inexpensive
storage devices, data creation explodes!
17. Data generation statistics
2.7 zettabytes of data exist in the digital universe today.
Facebook stores, accesses, and analyzes 50+ petabytes of
user-generated data.
Walmart handles more than 1 million customer transactions every hour,
which are imported into databases estimated to contain more than 2.5
petabytes of data.
More than 5 billion people are calling, texting, tweeting, and browsing on
mobile phones worldwide.
YouTube users upload 48 hours of new video every minute of the day.
In 2008, Google was already processing 20,000 terabytes of data (20 petabytes)
a day.
18. SO WHAT IS BIG DATA?
Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in
the last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few.
This data is big data.
24. HADOOP
Open Source Apache Project
Written in Java
Runs on
Linux, Mac OS X, Windows, and Solaris
Commodity hardware
25. Contents
• History of Hadoop
• The current applications of Hadoop
• Hadoop HDFS + MAP-REDUCE
• Other Hadoop projects
26. Fun fact about Hadoop
"The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell
and pronounce, meaningless, and not used
elsewhere: those are my naming criteria."
-- Doug Cutting, Hadoop project creator
27. History of Hadoop
Apache Nutch, an open-source web-search project led by Doug Cutting
Google publishes the "MapReduce" paper in 2004
"It is an important technique!"
Cutting extends Nutch with it
The great journey begins…
29. History of Hadoop
• Yahoo! deployed large scale science clusters in
2007.
• Tons of Yahoo! Research papers emerge:
– WWW
– CIKM
– SIGIR
• Yahoo! began running major production jobs
in Q1 2008.
31. HDFS
Namenodes and datanodes are simply machines that help the
client store data.
Metadata is stored on the namenode; the actual data is stored on
the datanodes.
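The namenode/datanode split above can be pictured with a toy model in plain Python. All class and variable names here are invented for illustration; real HDFS uses 128 MB blocks, replication, and heartbeats, none of which this sketch models.

```python
BLOCK_SIZE = 4  # bytes per block in this toy; real HDFS defaults to 128 MB

class NameNode:
    """Holds only metadata: which blocks make up a file, and where each block lives."""
    def __init__(self):
        self.file_to_blocks = {}   # filename -> list of block ids
        self.block_locations = {}  # block id -> datanode name

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

def put(namenode, datanodes, filename, data):
    # Split the file into fixed-size blocks and spread them across datanodes,
    # recording only the metadata on the namenode.
    block_ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}#blk{i // BLOCK_SIZE}"
        node = datanodes[(i // BLOCK_SIZE) % len(datanodes)]
        node.blocks[block_id] = data[i:i + BLOCK_SIZE]
        namenode.block_locations[block_id] = node.name
        block_ids.append(block_id)
    namenode.file_to_blocks[filename] = block_ids

nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
put(nn, dns, "f.txt", b"hello world")
```

A client reading the file would first ask the namenode for the block list and locations, then fetch each block from the datanode that holds it.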
32. A TaskTracker is a daemon that runs on a datanode; it is a node in
the cluster that accepts tasks (Map, Reduce, and Shuffle operations)
from a JobTracker.
A JobTracker is a daemon that runs on the namenode;
it farms out MapReduce tasks to specific nodes in the cluster,
ideally the nodes that hold the data, or at least nodes in the same
rack.
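The scheduling preference just described (a node holding the data first, then a node in the same rack, then anywhere) can be sketched as a small helper. This is a hypothetical illustration; the function and parameter names are made up and do not come from the Hadoop API.

```python
def pick_tasktracker(block_locations, trackers, rack_of):
    # block_locations: set of node names that store the block
    # trackers: candidate tasktracker node names, in preference order
    # rack_of: node name -> rack id
    # Prefer a tracker on a node that stores the block (data-local),
    # then any tracker in the same rack as a copy (rack-local), else any tracker.
    for t in trackers:
        if t in block_locations:
            return t, "data-local"
    racks_with_data = {rack_of[n] for n in block_locations}
    for t in trackers:
        if rack_of[t] in racks_with_data:
            return t, "rack-local"
    return trackers[0], "remote"

trackers = ["n1", "n2", "n3"]
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r9"}
```

Moving the computation to the data this way is cheaper than moving terabytes of data to the computation.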
35. Map-Reduce Architecture
Map-reduce is essentially a data-processing
engine.
To understand it in depth, some Java programming
experience helps, since Hadoop jobs are written in Java.
Let's walk through the architecture of map-
reduce.
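Real Hadoop jobs are written in Java against the MapReduce API, but the map → shuffle → reduce flow itself can be shown with a self-contained word-count sketch in plain Python. Everything here is a toy stand-in, not Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key; Hadoop performs this
    # sort-and-group step between the map and reduce phases
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

docs = ["big data big hadoop", "hadoop handles big data"]
counts = dict(reduce_phase(shuffle_phase(map_phase(docs))))
```

In a real cluster, map tasks run in parallel on the datanodes holding the input blocks, and the framework shuffles their output to the reduce tasks over the network.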
41. Nowadays (as per the latest job market)…
• Software Developer Intern - IBM - Somers, NY +3 locations- Agile development - Big data / Hadoop /
data analytics a plus
• Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data
processing system, big data analytics ... multiple technologies, including Hadoop
42. Other Hadoop Projects: The Ecosystem
•Hadoop Core
– Distributed File System
– MapReduce Framework
•Pig (initiated by Yahoo!)
– Parallel Programming Language and Runtime
•HBase (initiated by Powerset)
– Table storage for semi-structured data
•Zookeeper (initiated by Yahoo!)
– Coordinating distributed systems
•Hive (initiated by Facebook)
– SQL-like query language and metastore
44. A TYPICAL HADOOP CLUSTER HANDLES AND PROCESSES PETABYTES OF DATA
1,000 TB = 1 PETABYTE (APPROX.)
45. Nowadays…
Who uses Hadoop?
• Amazon/A9
• Alchetron
• Fox interactive media
• Google
• IBM
• Facebook
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
47. When you visit Alchetron.com,
you are interacting
with data processed
with Hadoop.
48. (Diagram: the Alchetron.com search index, built with Hadoop)