Big Data & Hadoop

Get a quick idea of big data and Hadoop.


    Big Data & Hadoop: Presentation Transcript

    • BIG DATA & HADOOP
    • WHAT IS BIG DATA?
      • Computer-generated data: application server logs (web sites, games), sensor data (weather, water, smart grids), images/videos (traffic, security cameras)
      • Human-generated data: the Twitter "Firehose" (50 million tweets/day, growing 1,400% per year), blogs, reviews, emails, pictures
      • Social graphs: Facebook, LinkedIn, contacts
    • HOW MUCH DATA?
      • The Wayback Machine has 2 PB + 20 TB/month (2006)
      • Google processes 20 PB a day (2008)
      • "All words ever spoken by human beings" ~ 5 EB
      • NOAA has ~1 PB of climate data (2007)
      • CERN's LHC will generate 15 PB a year (2008)
    • WHY IS BIG DATA HARD (AND GETTING HARDER)?
      • Data volume: unconstrained growth; current systems don't scale
      • Data structure: need to consolidate data from multiple sources, in multiple formats, across multiple businesses
      • Changing data requirements: faster response times on fresher data; sampling is not good enough and history is important; increasing complexity of analytics; users demand inexpensive experimentation
    • CHALLENGES OF BIG DATA
      [Chart: data scale from byte through kilobyte, megabyte, gigabyte, and terabyte to petabyte]
      • The VOLUME of data is growing exponentially
      • The VELOCITY of data is increasing
    • BIG DATA VALUE (Google, Facebook, Amazon)
      • Recommend what a customer should buy
      • Friend suggestions
      • Predict traffic usage
      • Display relevant ads
    • We need tools built specifically for Big Data!
      • Apache Hadoop: the MapReduce computational paradigm; an open-source, scalable, fault-tolerant, distributed system
      • Hadoop lowers the cost of developing a distributed system for data processing
    • WHAT IS HADOOP?
      • At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose. GFS is not open source.
      • Doug Cutting and others at Yahoo! reverse-engineered GFS and called their version the Hadoop Distributed File System (HDFS).
      • The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.
    • CONTD.: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data:
      – MapReduce: offline computing engine
      – HDFS: Hadoop Distributed File System
      – HBase (pre-alpha): online data access
    • WHAT MAKES IT ESPECIALLY USEFUL
      • Scalable: it can reliably store and process petabytes.
      • Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
      • Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
      • Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
    • HDFS ARCHITECTURE
      [Diagram: a single Namenode holds the metadata (file name, replica count, e.g. /home/foo/data, 6, ...) and serves metadata ops and block ops; Datanodes in Rack 1 and Rack 2 hold the blocks and replicate them; clients contact the Namenode for metadata but read and write blocks directly to the Datanodes]
    • WHAT IS MAPREDUCE?
      • MapReduce is a programming model that Google has used successfully to process its "big data" sets (~20 PB per day).
      • A map function extracts some intelligence from raw data.
      • A reduce function aggregates, according to some guides, the data output by the map.
      • Users specify the computation in terms of a map and a reduce function.
      • The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
      • The underlying system also handles machine failures, efficient communication, and performance issues.
    • HOW DOES MAPREDUCE WORK?
      • The runtime partitions the input and provides it to different Map instances.
      • Map: (key, value) → (key', value')
      • The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets all pairs with the same key'.
      • Each Reduce produces a single (or zero) file output.
      • Map and Reduce are user-written functions. (A minimal sketch of the two functions follows below.)
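The slide's (key, value) → (key', value') contract maps directly onto Hadoop's Java API. Below is a minimal sketch of user-written Map and Reduce functions for the word-count example used later in this deck; the class names WordCountMapper and WordCountReducer are illustrative, while Mapper, Reducer, Text, and IntWritable are the standard org.apache.hadoop types.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (offset, line) -> (word, 1) for every word in the line.
public class WordCountMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // emit (word, 1)
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count). The runtime
// guarantees all values for the same key arrive at the same reducer.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);  // emit (word, sum)
  }
}
```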
    • [Diagram: large-scale data is split; each split is parsed and hashed by a Map instance into <key, 1> pairs; reducers (say, Count) then produce partitioned outputs P-0000, P-0001, P-0002 holding count1, count2, count3]
    • CLASSES OF PROBLEMS SOLVED BY MAPREDUCE
      • Benchmark for comparing: Jim Gray's challenge on data-intensive computing, e.g. "Sort"
      • Google uses it for word count, AdWords, PageRank, and indexing data
      • Simple algorithms such as grep, text indexing, and reverse indexing
      • Bayesian classification: the data-mining domain
      • Facebook uses it for various operations: demographics
      • Financial services use it for analytics
      • Astronomy: Gaussian analysis for locating extraterrestrial objects
      • Expected to play a critical role in the semantic web and Web 3.0
    • MAPREDUCE ENGINE
      • MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor, and gather the results.
      • Hadoop provides that engine through HDFS (the file system discussed earlier) and the JobTracker + TaskTracker system.
      • The JobTracker is simply a scheduler.
      • A TaskTracker is assigned a Map or Reduce task (or other operations); the Map or Reduce runs on the same node as its TaskTracker, and each task runs in its own JVM on that node. (A driver sketch showing how such a job is configured and submitted follows below.)
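For completeness, here is a sketch of the driver that configures a job and submits it to the cluster scheduler (the JobTracker in the Hadoop 1.x architecture this slide describes). WordCountDriver and the argument layout are assumptions; Job, FileInputFormat, and FileOutputFormat are the standard Hadoop classes, and the mapper/reducer are the ones sketched earlier (assumed to be in the same package).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: describes the job, then hands it to the scheduler, which
// assigns Map and Reduce tasks to TaskTrackers across the cluster.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // local pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until done
  }
}
```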
    • WORD COUNT OVER A GIVEN SET OF WEB PAGES
      Input documents: "see bob throw" and "see spot run"
      Map output: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
      Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
      Can we do word count in parallel?
    • THE MAPREDUCE FRAMEWORK (PIONEERED BY GOOGLE)
    • OTHER APPLICATIONS OF MAPREDUCE
      • Distributed grep (as in the Unix grep command)
      • Count of URL access frequency
      • Reverse web-link graph: list of all source URLs associated with a given target URL
      • Inverted index: produces <word, list(document ID)> pairs (see the sketch after this list)
      • Distributed sort
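As a second worked example, here is a hedged sketch of the inverted-index application from the list above, emitting <word, list(document ID)> pairs. InvertedIndexMapper and InvertedIndexReducer are illustrative names, and using the input file's name as the document ID is an assumption made for the sketch.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: (offset, line) -> (word, docId) for every word in the line.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
  private final Text word = new Text();
  private final Text docId = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Use the input file's name as the document ID (an assumption).
    docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, docId);
    }
  }
}

// Reduce: (word, [docId, ...]) -> (word, comma-separated distinct docIds).
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Set<String> docs = new HashSet<>();  // deduplicate document IDs
    for (Text v : values) {
      docs.add(v.toString());
    }
    context.write(key, new Text(String.join(",", docs)));
  }
}
```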
    • HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
      • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant.
      • Highly fault-tolerant, and designed to be deployed on low-cost hardware.
      • Provides high-throughput access to application data; suitable for applications that have large data sets.
      • Relaxes a few POSIX requirements to enable streaming access to file system data.
      • Part of the Apache Hadoop Core project: http://hadoop.apache.org/core/. (A short client sketch follows below.)
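A minimal sketch of a client talking to HDFS through the standard org.apache.hadoop.fs API, assuming fs.defaultFS is already configured (e.g. via core-site.xml on the classpath); the path /tmp/hello.txt is hypothetical. Note the read side is a plain stream, reflecting HDFS's emphasis on streaming access over POSIX semantics.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: write a file, then stream it back.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS (e.g. hdfs://namenode:9000) is normally picked up
    // from core-site.xml; FileSystem.get returns a client for it.
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // streaming read from the Datanodes
      }
    }
  }
}
```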
    • HDFS CONCLUSIONS
    • Thank you…!