Big Data & Hadoop

Get a quick idea of big data and Hadoop.

Big Data & Hadoop: Presentation Transcript

  • BIG DATA & HADOOP
  • WHAT IS BIG DATA?
    • Computer-generated data: application server logs (web sites, games), sensor data (weather, water, smart grids), images/videos (traffic, security cameras)
    • Human-generated data: Twitter “Firehose” (50 million tweets/day, 1,400% growth per year), blogs/reviews/emails/pictures
    • Social graphs: Facebook, LinkedIn, contacts
  • HOW MUCH DATA?
    • The Wayback Machine has 2 PB + 20 TB/month (2006)
    • Google processes 20 PB a day (2008)
    • “All words ever spoken by human beings” ≈ 5 EB
    • NOAA has ~1 PB of climate data (2007)
    • CERN’s LHC will generate 15 PB a year (2008)
  • WHY IS BIG DATA HARD (AND GETTING HARDER)?
    • Data volume: unconstrained growth; current systems don’t scale
    • Data structure: need to consolidate data from multiple data sources, in multiple formats, across multiple businesses
    • Changing data requirements: faster response time on fresher data; sampling is not good enough and history is important; increasing complexity of analytics; users demand inexpensive experimentation
  • CHALLENGES OF BIG DATA: [scale graphic: byte → kilobyte → megabyte → gigabyte → terabyte → petabyte] The VOLUME of data is growing exponentially, and the VELOCITY of data is increasing.
  • BIG DATA VALUE: [examples graphic: Google, Facebook, Amazon] Recommend what a customer should buy; suggest friends; predict traffic usage; display relevant ads.
  • We need tools built specifically for Big Data!
    • Apache Hadoop: the MapReduce computational paradigm in an open-source, scalable, fault-tolerant, distributed system
    • Hadoop lowers the cost of developing a distributed system for data processing
  • WHAT IS HADOOP?
    • At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose. GFS is not open source.
    • Doug Cutting and others at Yahoo! built an open-source counterpart of GFS, based on Google’s published design, and called it the Hadoop Distributed File System (HDFS).
    • The software framework that supports HDFS, MapReduce, and other related components is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.
  • CONTD.: a software platform that lets one easily write and run applications that process vast amounts of data
    • MapReduce: offline computing engine
    • HDFS: Hadoop Distributed File System
    • HBase (pre-alpha): online data access
  • WHAT MAKES IT ESPECIALLY USEFUL
    • Scalable: it can reliably store and process petabytes.
    • Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
    • Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
    • Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.
  • HDFS ARCHITECTURE: [architecture diagram: a NameNode keeps the metadata (file name and replica count, e.g. /home/foo/data, 6, ...); clients send metadata ops to the NameNode and perform block reads/writes directly against DataNodes, which are spread across racks (Rack 1, Rack 2) with block replication between them.]
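    The write path in this diagram can be driven from Java through Hadoop’s FileSystem API. Below is a minimal sketch, assuming a cluster whose configuration points at the NameNode; the path /home/foo/data is taken from the diagram and is only an example.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWriteExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(conf);       // client handle to the file system

                // Creating the file is a metadata op against the NameNode; the
                // bytes then stream as blocks to DataNodes, which replicate
                // them across racks.
                try (FSDataOutputStream out = fs.create(new Path("/home/foo/data"))) {
                    out.writeBytes("hello hdfs\n");
                }
            }
        }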
  • WHAT IS MAPREDUCE?
    • MapReduce is a programming model Google has used successfully in processing its “big-data” sets (~20 PB per day).
    • A map function extracts some intelligence from raw data.
    • A reduce function aggregates, according to some guides, the data output by the map.
    • Users specify the computation in terms of a map and a reduce function.
    • The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
    • The underlying system also handles machine failures, efficient communications, and performance issues.
  • HOW DOES MAPREDUCE WORK?
    • The runtime partitions the input and provides it to different Map instances.
    • Map: (key, value) → (key’, value’)
    • The runtime collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.
    • Each Reduce produces a single (or zero) file output.
    • Map and Reduce are user-written functions.
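    To make the partition/collect mechanics concrete, here is a toy simulation of the model in plain Java (no Hadoop involved); the input lines anticipate the word-count example two slides below.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.TreeMap;

        public class MiniMapReduce {
            public static void main(String[] args) {
                // The runtime partitions the input: here, two "splits".
                List<String> splits = List.of("see bob throw", "see spot run");

                // Map phase: each split is fed to a map function that emits
                // (key', value') pairs; here (word, 1).
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String split : splits) {
                    for (String word : split.split("\\s+")) {
                        pairs.add(Map.entry(word, 1));
                    }
                }

                // Shuffle: the runtime groups the pairs so that each reduce
                // call sees one key' together with all of its values.
                Map<String, List<Integer>> grouped = new TreeMap<>();
                for (Map.Entry<String, Integer> p : pairs) {
                    grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
                }

                // Reduce phase: aggregate each key's values (here, by summing).
                grouped.forEach((word, counts) ->
                        System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
            }
        }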
  • [dataflow diagram: large-scale data is split into chunks; parallel parse-hash Map tasks turn each <key, value> input into <key, 1> pairs; Reducers (say, Count) aggregate them into partitioned outputs P-0000, P-0001, P-0002 carrying count1, count2, count3.]
  • CLASSES OF PROBLEMS SOLVED BY MAPREDUCE
    • Benchmark for comparing: Jim Gray’s challenge on data-intensive computing, e.g. “Sort”
    • Google uses it for word count, AdWords, PageRank, and indexing data
    • Simple algorithms such as grep, text indexing, reverse indexing
    • Bayesian classification in the data-mining domain
    • Facebook uses it for various operations, e.g. demographics
    • Financial services use it for analytics
    • Astronomy: Gaussian analysis for locating extra-terrestrial objects
    • Expected to play a critical role in the semantic web and in Web 3.0
  • MAPREDUCE ENGINE
    • MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
    • Hadoop provides that engine through HDFS (the file system we discussed earlier) and the JobTracker + TaskTracker system.
    • JobTracker is simply a scheduler.
    • A TaskTracker is assigned a Map or Reduce task (or other operations); the Map or Reduce task runs on a node, and so does the TaskTracker; each task runs in its own JVM on that node.
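    A job reaches that engine through a user-written driver program. The sketch below uses the org.apache.hadoop.mapreduce.Job API; WordCountMapper and WordCountReducer are hypothetical names here, matching the word-count classes sketched after the next slide.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountDriver.class);
                job.setMapperClass(WordCountMapper.class);    // hypothetical; sketched below
                job.setReducerClass(WordCountReducer.class);  // hypothetical; sketched below
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                // The scheduler (JobTracker in classic MapReduce) hands the
                // individual map/reduce tasks to TaskTrackers on cluster nodes.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }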
  • WORD COUNT OVER A GIVEN SET OF WEB PAGES
    Input pages: “see bob throw” and “see spot run”
    Map output: <see, 1>, <bob, 1>, <throw, 1> and <see, 1>, <spot, 1>, <run, 1>
    Reduce output: <bob, 1>, <run, 1>, <see, 2>, <spot, 1>, <throw, 1>
    Can we do word count in parallel?
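    Word count is the canonical Hadoop example. A sketch of the two user-written functions against the org.apache.hadoop.mapreduce API follows; the class names match the hypothetical ones referenced in the driver above.

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // map: (byte offset, line) -> (word, 1) for every word in the line
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // e.g. ("see", 1), ("bob", 1), ("throw", 1)
                }
            }
        }

        // reduce: (word, [1, 1, ...]) -> (word, total count)
        class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));   // e.g. ("see", 2)
            }
        }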
  • THE MAPREDUCE FRAMEWORK (PIONEERED BY GOOGLE)
  • OTHER APPLICATIONS OF MAPREDUCE
    • Distributed grep (as in the Unix grep command)
    • Count of URL access frequency
    • Reverse web-link graph: list of all source URLs associated with a given target URL
    • Inverted index: produces <word, list(document ID)> pairs
    • Distributed sort
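    Distributed grep, for example, needs little more than a map function. A hedged sketch follows: "grep.pattern" is a made-up configuration key here, not a Hadoop built-in, and the job would set its number of reduce tasks to zero so the matching lines are written out directly.

        import java.io.IOException;
        import java.util.regex.Pattern;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Distributed grep: the map emits every line matching a regex.
        public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private Pattern pattern;

            @Override
            protected void setup(Context context) {
                // "grep.pattern" is an illustrative key the driver would set.
                pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                if (pattern.matcher(value.toString()).find()) {
                    context.write(value, NullWritable.get());   // emit the matching line
                }
            }
        }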
  • HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
    • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.
    • It is highly fault-tolerant and is designed to be deployed on low-cost hardware.
    • It provides high-throughput access to application data and is suitable for applications that have large data sets.
    • It relaxes a few POSIX requirements to enable streaming access to file system data.
    • It is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
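    Reading a file back through the same FileSystem API illustrates the streaming access mentioned above: open() is a metadata call to the NameNode, after which the bytes stream directly from the DataNodes holding the blocks. A minimal sketch, with the example path again assumed:

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadExample {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // Stream the file line by line rather than loading it whole.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(new Path("/home/foo/data"))))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }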
  • HDFS CONCLUSIONS
  • Thank you…!