WHAT IS BIG DATA?
• Computer-generated data
  – Application server logs (web sites, games)
  – Sensor data (weather, water, smart grids)
  – Images/videos (traffic, security cameras)
• Human-generated data
  – Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
  – Blogs/reviews/emails/pictures
• Social graphs
  – Facebook, LinkedIn, contacts
HOW MUCH DATA?
• Wayback Machine has 2 PB + 20 TB/month (2006)
• Google processes 20 PB a day (2008)
• “All words ever spoken by human beings” ~ 5 EB
• NOAA has ~1 PB climate data (2007)
• CERN’s LHC will generate 15 PB a year (2008)
WHY IS BIG DATA HARD (AND GETTING HARDER)?
• Data volume
  – Unconstrained growth
  – Current systems don’t scale
• Data structure
  – Need to consolidate data from multiple data sources in multiple formats across multiple businesses
• Changing data requirements
  – Faster response time on fresher data
  – Sampling is not good enough and history is important
  – Increasing complexity of analytics
  – Users demand inexpensive experimentation
CHALLENGES OF BIG DATA
• The VOLUME of data is growing exponentially (scale shown: byte, kilobyte, megabyte, gigabyte, terabyte, petabyte)
• The VELOCITY of data is increasing
BIG DATA VALUE
• Examples: Google, Facebook, Amazon
• Recommend what a customer should buy
• Friend suggestions
• Predict traffic usage
• Display relevant ads
We need tools built specifically for Big Data!
• Apache Hadoop
  – The MapReduce computational paradigm
  – Open source, scalable, fault-tolerant, distributed system
• Hadoop lowers the cost of developing a distributed system for data processing
WHAT IS HADOOP?
• At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and others at Yahoo! built an open source counterpart modeled on Google's published description of GFS and called it Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
• It is open source and distributed by Apache.
CONTD.
• Software platform that lets one easily write and run applications that process vast amounts of data
  – MapReduce – offline computing engine
  – HDFS – Hadoop distributed file system
  – HBase (pre-alpha) – online data access
WHAT MAKES IT ESPECIALLY USEFUL
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across clusters of commonly available computers (in thousands).
• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.
WHAT IS MAP REDUCE?
• MapReduce is a programming model Google has used successfully in processing its “big-data” sets (~20 petabytes, i.e. 20,000 terabytes, per day).
• A map function extracts some intelligence from raw data.
• A reduce function aggregates, according to some guides, the data output by the map.
• Users specify the computation in terms of a map and a reduce function.
• The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient communications, and performance issues.
HOW DOES MAP REDUCE WORK?
• The runtime partitions the input and provides it to different Map instances.
• Map: (key, value) → (key’, value’)
• The runtime collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions.
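The flow on this slide can be sketched as a single-process toy in Python. This is only an illustration under stated assumptions: `run_mapreduce`, `map_url` and `reduce_sum` are hypothetical names, not part of Hadoop or any library, and a real runtime would run the Map and Reduce instances on different machines and write the results to files.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_maps=2):
    """Toy model of the MapReduce flow (hypothetical helper, not a Hadoop API)."""
    records = list(records)

    # 1. Partition the input into one split per Map instance.
    splits = [records[i::num_maps] for i in range(num_maps)]

    # 2. Each Map instance turns (key, value) into zero or more (key', value').
    intermediate = []
    for split in splits:
        for key, value in split:
            intermediate.extend(map_fn(key, value))

    # 3. Group the pairs so that each Reduce call sees every value for one key'.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # 4. Reduce each group to a single result.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


# User-written functions: count how often each URL appears in an access log.
def map_url(_line_no, url):
    yield (url, 1)


def reduce_sum(_url, counts):
    return sum(counts)


log = [(0, "/home"), (1, "/about"), (2, "/home"), (3, "/home")]
result = run_mapreduce(log, map_url, reduce_sum)
print(result)  # {'/about': 1, '/home': 3}
```

Only `map_url` and `reduce_sum` are specific to the problem; the partitioning and grouping steps stay the same for any job, which is the part the runtime takes over.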
[Figure: MapReduce data flow. Large-scale data splits → Map (parse-hash) emitting <key, 1> pairs (<key, value> pairs) → Reducers (say, Count) → output partitions P-0000, P-0001, P-0002 holding count1, count2, count3]
6/23/2010 Wipro Chennai 2011
CLASSES OF PROBLEMS SOLVED BY MAPREDUCE
• Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
• Google uses it for wordcount, adwords, pagerank, indexing data.
• Simple algorithms such as grep, text-indexing, reverse indexing
• Bayesian classification: data mining domain
• Facebook uses it for various operations: demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extra-terrestrial objects.
• Expected to play a critical role in semantic web and in web 3.0
MAPREDUCE ENGINE
• MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor and gather the results.
• Hadoop provides that engine through HDFS (the file system we discussed earlier) and the JobTracker + TaskTracker system.
• JobTracker is simply a scheduler.
• A TaskTracker is assigned a Map or Reduce (or other) operation; the Map or Reduce task runs on the same node as its TaskTracker, and each task runs in its own JVM on that node.
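The division of labour between scheduler and workers might be modelled roughly as below. This is a toy under loud assumptions: the Python classes only borrow the names JobTracker and TaskTracker from the slide and are not the Hadoop classes, `NodeDown` is a made-up exception, the round-robin placement is an illustrative choice, and tasks here are plain function calls rather than separate JVMs.

```python
class NodeDown(Exception):
    """Hypothetical signal that a worker node cannot run tasks."""


class TaskTracker:
    """Toy worker node; not the Hadoop class of the same name."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task, arg):
        if not self.healthy:
            raise NodeDown(self.name)
        return task(arg)


class JobTracker:
    """Toy scheduler: assigns tasks round-robin and reassigns them on failure."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, task, inputs):
        results, placement = [], []
        for i, arg in enumerate(inputs):
            # Start at the round-robin choice, then try the remaining trackers.
            for attempt in range(len(self.trackers)):
                tracker = self.trackers[(i + attempt) % len(self.trackers)]
                try:
                    results.append(tracker.run(task, arg))
                    placement.append((i, tracker.name))
                    break
                except NodeDown:
                    continue
            else:
                raise RuntimeError(f"no healthy tracker for input {i}")
        return results, placement


trackers = [
    TaskTracker("node-a"),
    TaskTracker("node-b", healthy=False),  # simulated machine failure
    TaskTracker("node-c"),
]
results, placement = JobTracker(trackers).run_job(
    lambda text: len(text.split()), ["see bob throw", "see spot run", "bob"]
)
print(results)    # [3, 3, 1]
print(placement)  # [(0, 'node-a'), (1, 'node-c'), (2, 'node-c')]
```

The second input is first offered to the failed node and then reassigned, which is the behaviour the scheduler is responsible for; the task function itself never has to know about it.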
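WORD COUNT OVER A GIVEN SET OF WEB PAGES
• Input pages: “see bob throw”, “see spot run”
• Map output, page 1: see 1, bob 1, throw 1
• Map output, page 2: see 1, spot 1, run 1
• Reduce output: bob 1, run 1, see 2, spot 1, throw 1
• Can we do word count in parallel?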
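One way to see that the answer is yes: each page can be counted independently, and the per-page counts are merged afterwards. The sketch below uses Python's standard `concurrent.futures` thread pool purely to show that structure; `map_count` and `word_count` are hypothetical names, and threads in CPython will not give a real speed-up for this CPU-bound work, whereas a cluster would run the map step on separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_count(page):
    """Map step: count the words of one page, independently of other pages."""
    return Counter(page.split())


def word_count(pages, workers=2):
    # Map phase: pages do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_count, pages))

    # Reduce phase: add up the per-page counts for each word.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(sorted(total.items()))


pages = ["see bob throw", "see spot run"]
counts = word_count(pages)
print(counts)  # {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The merged result matches the reduce output on the slide, and it does not depend on the order in which the pages were processed.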
OTHER APPLICATIONS OF MAPREDUCE
• Distributed grep (as in the Unix grep command)
• Count of URL access frequency
• Reverse web-link graph: list of all source URLs associated with a given target URL
• Inverted index: produces <word, list(Document ID)> pairs
• Distributed sort
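As an illustration of one item from this list, an inverted index could be expressed with a map function that emits <word, document ID> pairs and a reduce function that collects the IDs per word. This is a minimal single-process sketch; the function names, the document IDs `d1` and `d2`, and the lowercase whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)


def reduce_postings(_word, doc_ids):
    """Reduce step: turn the collected IDs into a sorted, duplicate-free list."""
    return sorted(set(doc_ids))


def build_inverted_index(docs):
    # Group the map output by word, standing in for the runtime's shuffle.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            groups[word].append(emitted_id)
    return {w: reduce_postings(w, ids) for w, ids in sorted(groups.items())}


docs = {"d1": "see bob throw", "d2": "see spot run"}
index = build_inverted_index(docs)
print(index["see"])  # ['d1', 'd2']
print(index["bob"])  # ['d1']
```

A reverse web-link graph has the same shape: the map step would emit (target URL, source URL) for each link, and the reduce step would collect the sources per target.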
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• It provides high-throughput access to application data and is suitable for applications that have large data sets.
• It relaxes a few POSIX requirements to enable streaming access to file system data.
• It is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.