BIG DATA & HADOOP
1. BIG DATA & HADOOP
2. WHAT IS BIG DATA?
• Computer-generated data
  – Application server logs (web sites, games)
  – Sensor data (weather, water, smart grids)
  – Images/videos (traffic, security cameras)
• Human-generated data
  – Twitter "Firehose" (50 million tweets/day, 1,400% growth per year)
  – Blogs/reviews/emails/pictures
• Social graphs
  – Facebook, LinkedIn, contacts
3. HOW MUCH DATA?
• The Wayback Machine has 2 PB, plus 20 TB/month (2006)
• Google processes 20 PB a day (2008)
• "All words ever spoken by human beings": ~5 EB
• NOAA has ~1 PB of climate data (2007)
• CERN's LHC will generate 15 PB a year (2008)
4. WHY IS BIG DATA HARD (AND GETTING HARDER)?
• Data volume
  – Unconstrained growth
  – Current systems don't scale
• Data structure
  – Need to consolidate data from multiple sources, in multiple formats, across multiple businesses
• Changing data requirements
  – Faster response time on fresher data
  – Sampling is not good enough, and history is important
  – Increasing complexity of analytics
  – Users demand inexpensive experimentation
5. CHALLENGES OF BIG DATA
• The VOLUME of data is growing exponentially (bytes → kilobytes → megabytes → gigabytes → terabytes → petabytes)
• The VELOCITY of data is increasing
6. BIG DATA VALUE
• Amazon: recommend what a customer should buy
• Facebook: friend suggestions
• Google: predict traffic usage, display relevant ads
7. We need tools built specifically for Big Data!
• Apache Hadoop
  – The MapReduce computational paradigm
  – An open-source, scalable, fault-tolerant, distributed system
• Hadoop lowers the cost of developing a distributed system for data processing
8. WHAT IS HADOOP?
• At Google, MapReduce operations run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and others at Yahoo! reverse-engineered GFS and called their version the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
• Hadoop is open source and distributed by Apache.
9. CONTD.
• Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data:
  – MapReduce: offline computing engine
  – HDFS: Hadoop Distributed File System
  – HBase (pre-alpha): online data access
10. WHAT MAKES IT SPECIALLY USEFUL
• Scalable: it can reliably store and process petabytes.
• Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
• Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.
11. HDFS ARCHITECTURE
[Architecture diagram: a single Namenode holds the filesystem metadata (file names, replica counts, and block lists, e.g. /home/foo/data with its replicas) and answers clients' metadata ops and block ops; Datanodes spread across racks (Rack 1, Rack 2) store the actual blocks, replicate them across racks, and serve client reads and writes directly.]
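To make the client read/write path in the diagram concrete, here is a minimal sketch against the standard org.apache.hadoop.fs.FileSystem API. The API calls are real Hadoop; the file path and class name are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up the cluster's config files
            FileSystem fs = FileSystem.get(conf);          // client handle; talks to the Namenode

            Path file = new Path("/user/demo/hello.txt");  // hypothetical path

            // Write: the Namenode allocates blocks; the bytes stream to Datanodes.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the Namenode returns block locations; data is read from Datanodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }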
12. WHAT IS MAPREDUCE?
• MapReduce is a programming model Google has used successfully in processing its "big data" sets (~20 PB per day).
• A map function extracts some intelligence from raw data.
• A reduce function aggregates, according to some guide, the data output by the map.
• Users specify the computation in terms of a map and a reduce function.
• The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient communication, and performance issues.
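Before looking at Hadoop itself, here is a minimal, Hadoop-free sketch of the same model in plain Java, applied to the word-count example used later in the deck; everything here is illustrative.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class MapReduceModel {
        public static void main(String[] args) {
            List<String> rawData = Arrays.asList("see bob throw", "see spot run");

            Map<String, Long> counts = rawData.stream()
                // "map": extract individual word records from each raw input line
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // "reduce": group records by key and aggregate each group
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

            // e.g. {bob=1, run=1, see=2, spot=1, throw=1} (map order unspecified)
            System.out.println(counts);
        }
    }

The appeal of the model is that, once a computation is phrased this way, the grouping step is the only place data must move between machines, which is exactly what the runtime parallelizes.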
13. HOW DOES MAPREDUCE WORK?
• The runtime partitions the input and provides it to different Map instances.
• Map: (key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to several Reduce functions, so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions.
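A skeletal sketch of those signatures in Hadoop's org.apache.hadoop.mapreduce API; the Mapper/Reducer base classes and generic type parameters are real Hadoop API, while MyMapper/MyReducer and their bodies are placeholder assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: (key, value) -> zero or more (key', value') pairs.
    // Type parameters: <input key, input value, output key, output value>.
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (key', value') pairs for one input record.
            context.write(new Text("example-key"), new IntWritable(1));
        }
    }

    // Reduce: (key', list of value') -> output. The framework guarantees that
    // all values sharing the same key' arrive at the same reduce() call.
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }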
14. [Dataflow diagram: large-scale data is split; each split goes to a Map task that emits <key, 1> pairs; a parse/hash step routes the <key, value> pairs so that identical keys reach the same reducer; the reducers (here, Count) produce <key, count> outputs in partition files P-0000, P-0001, P-0002.]
15. CLASSES OF PROBLEMS SOLVED BY MAPREDUCE
• Benchmark for comparing: Jim Gray's challenge on data-intensive computing (e.g., "Sort")
• Google uses it for word count, AdWords, PageRank, and indexing data
• Simple algorithms such as grep, text indexing, reverse indexing
• Bayesian classification: data-mining domain
• Facebook uses it for various operations: demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extraterrestrial objects
• Expected to play a critical role in the Semantic Web and Web 3.0
16. MAPREDUCE ENGINE
• MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
• Hadoop provides that engine through HDFS (the file system we discussed earlier) and the JobTracker + TaskTracker system.
• The JobTracker is simply a scheduler.
• A TaskTracker is assigned Map or Reduce tasks (or other operations); each Map or Reduce task runs on the same node as its TaskTracker, and each task runs in its own JVM on that node.
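A hedged driver sketch showing how a job reaches that engine; under the Hadoop 1.x engine described here, waitForCompletion() hands the job to the JobTracker, which farms tasks out to TaskTrackers. The input/output paths and class names are illustrative assumptions (the mapper and reducer are sketched under the word-count slide below).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordcount"); // Hadoop 2+; on 1.x: new Job(conf, "wordcount")

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // sketched on the next slide
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submits the job to the cluster scheduler (the JobTracker in
            // Hadoop 1.x) and blocks until all map and reduce tasks finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }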
17. WORD COUNT OVER A GIVEN SET OF WEB PAGES

    Input pages:     Map output:    Reduce output:
    see bob throw    (see, 1)       (bob, 1)
    see spot run     (bob, 1)       (run, 1)
                     (throw, 1)     (see, 2)
                     (see, 1)       (spot, 1)
                     (spot, 1)      (throw, 1)
                     (run, 1)

Can we do word count in parallel?
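A hedged implementation sketch of exactly this example, using Hadoop's org.apache.hadoop.mapreduce API. The API is real; the class names are illustrative, and in a real project the two public classes would live in separate .java files.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // "see bob throw" -> (see, 1), (bob, 1), (throw, 1)
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // (see, [1, 1]) -> (see, 2)
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

This answers the slide's question: the map calls run in parallel over splits of the pages, and the shuffle delivers all counts for one word to a single reduce call.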
18. THE MAPREDUCE FRAMEWORK (PIONEERED BY GOOGLE)
19. OTHER APPLICATIONS OF MAPREDUCE
• Distributed grep (as in the Unix grep command)
• Count of URL access frequency
• Reverse web-link graph: list of all source URLs associated with a given target URL
• Inverted index: produces <word, list(document ID)> pairs (see the sketch after this list)
• Distributed sort
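As one illustration, a hedged sketch of the inverted-index pattern mentioned above; the FileSplit lookup is real Hadoop API (valid for file-based input formats), while the class names and whitespace tokenization are illustrative assumptions.

    import java.io.IOException;
    import java.util.TreeSet;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Map: emit (word, documentId) for every word occurrence.
    class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Use the input file's name as the document ID.
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) context.write(new Text(word), new Text(docId));
            }
        }
    }

    // Reduce: collapse the document IDs for one word into a sorted, deduplicated list.
    class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> unique = new TreeSet<>();
            for (Text d : docIds) unique.add(d.toString());
            context.write(word, new Text(String.join(",", unique)));
        }
    }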
20. HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
• It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• It provides high-throughput access to application data and is suitable for applications that have large data sets.
• It relaxes a few POSIX requirements to enable streaming access to file system data.
• It is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
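HDFS also exposes where data lives, which is what lets MapReduce schedule tasks next to their blocks. A hedged sketch using the real FileSystem.getFileBlockLocations API to ask the Namenode where a file's blocks (and their replicas) are stored; the file path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; replace with a path that exists on your cluster.
            Path file = new Path("/user/demo/pages/part-00000");
            FileStatus status = fs.getFileStatus(file);

            // The Namenode answers this metadata query; no block data is moved.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }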
21. HDFS CONCLUSIONS
22. Thank you…!
