2. WHAT IS BIG DATA?
• Computer generated data
Application server logs (web sites, games)
Sensor data (weather, water, smart grids)
Images/videos (traffic, security cameras)
• Human generated data
Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
Blogs/Reviews/Emails/Pictures
• Social graphs
Facebook, LinkedIn, contacts
3. HOW MUCH DATA?
• Wayback Machine holds 2 PB of data, growing at 20 TB/month (2006)
• Google processes 20 PB a day (2008)
• “all words ever spoken by human beings” ~ 5 EB
• NOAA has ~1 PB climate data (2007)
• CERN’s LHC will generate 15 PB a year (2008)
4. WHY IS BIG DATA HARD (AND GETTING HARDER)?
• Data Volume
Unconstrained growth
Current systems don’t scale
• Data Structure
Need to consolidate data from multiple data sources in multiple formats across
multiple businesses
• Changing Data Requirements
Faster response times on fresher data
Sampling is not good enough and history is important
Increasing complexity of analytics
Users demand inexpensive experimentation
5. CHALLENGES OF BIG DATA
[Chart: data sizes from BYTE through KILOBYTE, MEGABYTE, GIGABYTE, TERABYTE, to PETABYTE]
• The VOLUME of data is growing exponentially
• The VELOCITY of data is increasing
7. We need tools built specifically for Big Data!
• Apache Hadoop
The MapReduce computational paradigm
Open source, scalable, fault-tolerant, distributed system
Hadoop lowers the cost of developing a distributed
system for data processing
8. WHAT IS HADOOP?
At Google, MapReduce operations are run on a special file system called the
Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and others at Yahoo! built an open-source implementation
modeled on the published GFS design and called it the Hadoop Distributed
File System (HDFS).
The software framework that supports HDFS, MapReduce, and other related
components is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
9. CONTD.
Software platform that lets one easily write and run
applications that process vast amounts of data
– MapReduce – offline computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
10. WHAT MAKES IT ESPECIALLY USEFUL
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across
clusters of commonly available computers (numbering in the thousands).
• Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
12. WHAT IS MAPREDUCE?
MapReduce is a programming model Google has used successfully in
processing its “big data” sets (~20 PB per day)
A map function extracts some intelligence from raw data.
A reduce function aggregates the data output by the map according to
some criteria.
Users specify the computation in terms of a map and a reduce function;
the underlying runtime system automatically parallelizes the computation
across large-scale clusters of machines, and also handles machine
failures, efficient communication, and performance issues.
13. HOW DOES MAPREDUCE WORK?
• The runtime partitions the input and provides it to different Map
instances;
• Map: (key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to
several Reduce functions so that each Reduce function gets the pairs
with the same key'.
• Each Reduce produces a single output file (or none).
• Map and Reduce are user-written functions (see the sketch below)
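This flow can be imitated in a few lines of single-process Java. The sketch below is a toy model for illustration only (the class name and the in-memory TreeMap standing in for the shuffle are our own, not Hadoop APIs); the real runtime performs the same grouping across a cluster of machines:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process model of the MapReduce data flow described above.
// The TreeMap plays the role of the shuffle: it groups every intermediate
// (key', value') pair by key' before the reduce step runs.
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("see bob throw", "see spot run");

        // Map phase: each input record is turned into (word, 1) pairs.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : input) {
            for (String word : record.split("\\s+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: each key sees all of its values and aggregates them.
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```

Run on the two sample lines used in the word-count slide further below, it prints bob 1, run 1, see 2, spot 1, throw 1.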
15. CLASSES OF PROBLEMS SOLVED BY MAPREDUCE
Benchmark for comparison: Jim Gray’s challenge on data-intensive
computing. Ex: “Sort”
Google uses it for word count, AdWords, PageRank, and indexing data.
Simple algorithms such as grep (see the sketch after this list), text indexing, reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations: demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extra-terrestrial
objects.
Expected to play a critical role in the Semantic Web and Web 3.0
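The simplest of these, distributed grep, needs only a mapper: each Map instance scans its share of the input and emits the matching lines, and with zero reduce tasks (job.setNumReduceTasks(0)) the map output is the final output. A minimal sketch against the org.apache.hadoop.mapreduce API follows; the configuration key "grep.pattern" is an illustrative name of our own, not a Hadoop built-in:

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only "distributed grep": emit only the input lines that match
// a regular expression; no reduce phase is needed.
public class GrepMapper extends Mapper<Object, Text, Text, NullWritable> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // The pattern is passed through the job configuration ("ERROR" default).
    pattern = Pattern.compile(
        context.getConfiguration().get("grep.pattern", "ERROR"));
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(value.toString()).find()) {
      context.write(value, NullWritable.get());
    }
  }
}
```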
16. MAPREDUCE ENGINE
• MapReduce requires a distributed file system and an engine that can
distribute, coordinate, monitor, and gather the results.
• Hadoop provides that engine through HDFS (the file system we discussed
earlier) and the JobTracker + TaskTracker system.
• JobTracker is simply a scheduler.
• A TaskTracker is assigned Map or Reduce tasks (or other operations); a Map
or Reduce task runs on the same node as its TaskTracker, and each task
runs in its own JVM on that node.
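For orientation only: in classic Hadoop 1.x deployments, clients and TaskTrackers locate the JobTracker through conf/mapred-site.xml. The localhost:9001 value below is the conventional single-node tutorial setting, not a required one:

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml: where the JobTracker (the scheduler) listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```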
17. WORD COUNT OVER A GIVEN SET OF WEB PAGES
Input documents:
  “see bob throw”
  “see spot run”
Map output:
  (see, 1) (bob, 1) (throw, 1)
  (see, 1) (spot, 1) (run, 1)
Reduce output, grouped by word:
  (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
Can we do word count in parallel?
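Yes, and in Hadoop this is essentially the canonical WordCount job. The sketch below follows the standard org.apache.hadoop.mapreduce API (Hadoop 2.x era); input and output paths are taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (offset, line) -> (word, 1) for every word in the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner pre-aggregates counts on the map side and cuts shuffle traffic; that is safe here because summing is associative and commutative.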
19. OTHER APPLICATIONS OF MAPREDUCE
• Distributed grep (as in Unix grep command)
• Count of URL Access Frequency
• Reverse Web-Link Graph: list of all source URLs associated with a given
target URL
• Inverted index: Produces <word, list(Document ID)> pairs (sketched below)
• Distributed sort
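As one concrete sketch, the inverted index maps each word to the document it appears in and lets the reducer collect the document list. The class names here are illustrative, and the input file name is used as a stand-in document ID (which assumes a FileInputFormat-based job):

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Map: emit (word, docId) for every word; the input file's name
  // serves as a stand-in document ID here.
  public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void setup(Context context) {
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
    }

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String w : value.toString().split("\\s+")) {
        if (w.isEmpty()) continue;
        word.set(w);
        context.write(word, docId);
      }
    }
  }

  // Reduce: collect the distinct document IDs seen for each word,
  // producing the <word, list(Document ID)> pairs named above.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> ids = new TreeSet<>();
      for (Text id : docIds) ids.add(id.toString());
      context.write(key, new Text(String.join(", ", ids)));
    }
  }
}
```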
20. HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant.
• Highly fault-tolerant and designed to be deployed on low-cost hardware.
• Provides high-throughput access to application data; suitable for
applications that have large data sets.
• Relaxes a few POSIX requirements to enable streaming access to file
system data.
• Part of the Apache Hadoop Core project. The project URL is
http://hadoop.apache.org/core/.
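A minimal client round trip through the org.apache.hadoop.fs.FileSystem API is sketched below. It assumes the client configuration already points at a running NameNode (via fs.defaultFS, e.g. hdfs://localhost:9000); the path /tmp/hello.txt is illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS, then read it back.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");            // illustrative path
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.writeBytes("hello, HDFS\n");
    }

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());               // -> hello, HDFS
    }
    fs.close();
  }
}
```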