The presentation explores how we arrived at big data today and the technologies, like Hadoop, used to manage it. It ends with how we use Hadoop at our startup, Alchetron.com, plus some informative links at the end.
8. THIS GUY INVENTS THE WORLD WIDE WEB IN 1991
Sir Tim Berners-Lee invents the World Wide Web in 1991; now,
with the web, the amount of data generated
by mankind explodes!
13. In the next 20 years, computing will move to the microscopic level.
Computers won't be in our pockets but inside our bodies and minds.
This is where technology and biology will merge, which will
multiply and enhance our capabilities a thousand times.
30 years of mobile technology
16. With the invention of the internet plus small, inexpensive
storage devices, data creation explodes!
17. Data generation statistics
2.7 zettabytes of data exist in the digital universe today.
Facebook stores, accesses, and analyzes 50+ petabytes of
user-generated data.
Walmart handles more than 1 million customer transactions every hour,
which are imported into databases estimated to contain more than 2.5
petabytes of data.
More than 5 billion people are calling, texting, tweeting, and browsing on
mobile phones worldwide.
YouTube users upload 48 hours of new video every minute of the day.
In 2008, Google was already processing 20,000 terabytes of data (20 petabytes)
a day.
18. SO WHAT IS BIG DATA?
Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in
the last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few.
This data is big data.
24. HADOOP
Open Source Apache Project
Written in Java
Runs on
Linux, Mac OS X, Windows, and Solaris
Commodity hardware
25. Contents
• History of Hadoop
• The current applications of Hadoop
• Hadoop HDFS + MAP-REDUCE
• Other Hadoop projects
26. Fun fact about Hadoop
"The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell
and pronounce, meaningless, and not used
elsewhere: those are my naming criteria."
-- Doug Cutting, Hadoop project creator
27. History of Hadoop
Apache Nutch, an open-source web-search project led by Doug Cutting
Google publishes the "MapReduce" paper in 2004
"It is an important technique!"
Cutting extends Nutch with it
The great journey begins…
29. History of Hadoop
• Yahoo! deployed large scale science clusters in
2007.
• Tons of Yahoo! Research papers emerge:
– WWW
– CIKM
– SIGIR
• Yahoo! began running major production jobs
in Q1 2008.
31. HDFS
Namenodes and datanodes are simply machines that help the
client store data.
Metadata is stored on the namenode; the actual data is stored on
the datanodes.
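The namenode/datanode split above can be pictured with a toy model in plain Python. All class and variable names here are invented for illustration; real HDFS uses 128 MB blocks, replication, and heartbeats, none of which this sketch models.

```python
BLOCK_SIZE = 4  # bytes per block in this toy; real HDFS defaults to 128 MB

class NameNode:
    """Holds only metadata: which blocks make up a file, and where each block lives."""
    def __init__(self):
        self.file_to_blocks = {}   # filename -> list of block ids
        self.block_locations = {}  # block id -> datanode name

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

def put(namenode, datanodes, filename, data):
    # Split the file into fixed-size blocks and spread them across datanodes,
    # recording only the metadata on the namenode.
    block_ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}#blk{i // BLOCK_SIZE}"
        node = datanodes[(i // BLOCK_SIZE) % len(datanodes)]
        node.blocks[block_id] = data[i:i + BLOCK_SIZE]
        namenode.block_locations[block_id] = node.name
        block_ids.append(block_id)
    namenode.file_to_blocks[filename] = block_ids

nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
put(nn, dns, "f.txt", b"hello world")
```

A client reading the file would first ask the namenode for the block list and locations, then fetch each block from the datanode that holds it.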
32. A TaskTracker is a daemon that runs on a datanode; it is a node in
the cluster that accepts tasks (Map, Reduce, and Shuffle operations)
from a JobTracker.
A JobTracker is a daemon that runs on the namenode;
it farms out MapReduce tasks to specific nodes in the cluster,
ideally the nodes that hold the data, or at least nodes in the same
rack.
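The scheduling preference just described (a node holding the data first, then a node in the same rack, then anywhere) can be sketched as a small helper. This is a hypothetical illustration; the function and parameter names are made up and do not come from the Hadoop API.

```python
def pick_tasktracker(block_locations, trackers, rack_of):
    # block_locations: set of node names that store the block
    # trackers: candidate tasktracker node names, in preference order
    # rack_of: node name -> rack id
    # Prefer a tracker on a node that stores the block (data-local),
    # then any tracker in the same rack as a copy (rack-local), else any tracker.
    for t in trackers:
        if t in block_locations:
            return t, "data-local"
    racks_with_data = {rack_of[n] for n in block_locations}
    for t in trackers:
        if rack_of[t] in racks_with_data:
            return t, "rack-local"
    return trackers[0], "remote"

trackers = ["n1", "n2", "n3"]
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r9"}
```

Moving the computation to the data this way is cheaper than moving terabytes of data to the computation.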
35. Map-Reduce Architecture
Map-reduce is essentially a data-processing
engine.
To understand it in depth, some Java programming
experience helps, since Hadoop jobs are written in Java.
Let's walk through the architecture of map-
reduce.
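Real Hadoop jobs are written in Java against the MapReduce API, but the map → shuffle → reduce flow itself can be shown with a self-contained word-count sketch in plain Python. Everything here is a toy stand-in, not Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key; Hadoop performs this
    # sort-and-group step between the map and reduce phases
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

docs = ["big data big hadoop", "hadoop handles big data"]
counts = dict(reduce_phase(shuffle_phase(map_phase(docs))))
```

In a real cluster, map tasks run in parallel on the datanodes holding the input blocks, and the framework shuffles their output to the reduce tasks over the network.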
41. Nowadays (as per the latest job market)…
• Software Developer Intern - IBM - Somers, NY +3 locations- Agile development - Big data / Hadoop /
data analytics a plus
• Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data
processing system, big data analytics ... multiple technologies, including Hadoop
42. Other Hadoop Projects: The Ecosystem
•Hadoop Core
– Distributed File System
– MapReduce Framework
•Pig (initiated by Yahoo!)
– Parallel Programming Language and Runtime
•HBase (initiated by Powerset)
– Table storage for semi-structured data
•Zookeeper (initiated by Yahoo!)
– Coordinating distributed systems
•Hive (initiated by Facebook)
– SQL-like query language and metastore
44. A TYPICAL HADOOP CLUSTER HANDLES AND PROCESSES PETABYTES OF DATA
1,000 TB = 1 PETABYTE (APPROX.)
45. Nowadays…
Who uses Hadoop?
• Amazon/A9
• Alchetron
• Fox interactive media
• Google
• IBM
• Facebook
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
47. When you visit Alchetron.com,
you are interacting
with data processed
with Hadoop.
48. (Diagram: the Alchetron.com search index, built with Hadoop)