  1. Nguyen Thanh Hai, Portal team, August 2012
  2. Agenda
     − Meet Hadoop
        − History
        − Data!
        − Data Storage and Analysis
        − What Hadoop is Not
     − The Hadoop Distributed File System
        − HDFS concept
        − Architecture
        − Goals
        − Command Line User Interface
     − MapReduce
        − Overview
        − How MapReduce works
     − Practice
        − Demo
        − Discussion
     - Copyright 2012 eXo Platform
  3. Meet Hadoop: History, Data!, Data Storage and Analysis, What Hadoop is Not
  4. History
  5. History
     - Hadoop got its start in Nutch. A few developers were attempting to build an open source web search engine and were having trouble managing computations running on even a handful of computers.
     - Once Google published its GFS and MapReduce papers, the route became clear: Google had devised systems to solve precisely the problems they were having with Nutch. So two of them started, half-time, to try to re-create these systems as a part of Nutch.
     - Around that time, Yahoo! got interested and quickly put together a team. They split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
  6. Data! We live in the data age
  7. Data! We live in the data age
  8. Data Storage and Analysis
     - While the storage capacity of hard drives has increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is only around 100 MB/s.
     - At that speed it takes hours to read all the data on a single drive, and writing is even slower.
  9. Data Storage and Analysis
     The obvious way:
     - Imagine we have 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
     - Using only one hundredth of each disk may seem wasteful, but we can store one hundred datasets, each of which is one terabyte, and provide shared access to them.
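The timing claims on the last two slides reduce to simple arithmetic. A quick sketch, using the slide's figures (1 TB of data, 100 MB/s per drive, 100 drives; these are the slide's round numbers, not measurements):

```python
# Back-of-the-envelope timing: reading 1 TB at 100 MB/s on one drive
# versus spreading the same data across 100 drives read in parallel.
ONE_TB_MB = 1_000_000   # 1 TB expressed in MB (decimal, as on the slide)
SPEED_MB_S = 100        # per-drive transfer speed, MB/s
DRIVES = 100

single_drive_s = ONE_TB_MB / SPEED_MB_S   # one drive reads everything
parallel_s = single_drive_s / DRIVES      # each drive reads 1/100th

print(f"one drive:  {single_drive_s / 3600:.1f} hours")   # ~2.8 hours
print(f"100 drives: {parallel_s / 60:.1f} minutes")       # ~1.7 minutes
```

The second number is the "under two minutes" on the slide; the first is why a single drive is the bottleneck.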
  10. Data Storage and Analysis
     The problems to solve:
     - The first: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is replication: redundant copies of the data are kept by the system so that in the event of failure, another copy is available.
     - The second: most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.
     With Hadoop:
     Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce.
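Why the first problem bites can be made concrete. If each drive fails independently with some probability over a given period, the chance that at least one of many drives fails grows quickly (the 2% per-period rate below is an illustrative assumption, not a measured figure):

```python
# Probability that at least one of n independent drives fails,
# given a per-drive failure probability p: 1 - (1 - p)^n.
def p_any_failure(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

print(f"{p_any_failure(1, 0.02):.1%}")     # one drive: 2.0%
print(f"{p_any_failure(100, 0.02):.1%}")   # 100 drives: ~86.7%
```

With 100 drives, what is rare for one drive becomes near-certain for the cluster, which is exactly what replication guards against.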
  11. What Hadoop is Not
     - It is not a substitute for a database. Hadoop stores data in files and does not index them. If you want to find something, you have to run a MapReduce job over all the data. This takes time, and means that you cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is too big for a database: with very large datasets, the cost of regenerating indexes is so high that you can't easily index changing data.
     - MapReduce is not always the best algorithm. MapReduce is a profound idea: take a simple functional programming operation and apply it, in parallel, to gigabytes or terabytes of data. But there is a price. For that parallelism, you need each MR operation to be independent of all the others. If you need to know everything that has gone before, you have a problem.
     - Hadoop and MapReduce are not a place to learn Java programming.
     - Hadoop is not an ideal place to learn networking error messages.
     - Hadoop clusters are not a place to learn Unix/Linux system administration.
  12. The Hadoop Distributed File System: HDFS Concept, Architecture, Goals, Command Line User Interface
  13. HDFS Concept
     Block:
     - A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks; disk blocks are normally 512 bytes.
     - HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
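The chunking rule above is easy to sketch. Using the 64 MB default from the slide (the 200 MB file size is an illustrative assumption):

```python
# How a file is split into HDFS block-sized chunks. Every block is
# BLOCK_SIZE bytes except possibly the last, which holds only the
# leftover data -- a small file does not occupy a full block.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the HDFS default on the slide

def split_into_blocks(file_size: int) -> list[int]:
    """Return the size of each block a file of file_size bytes occupies."""
    full, last = divmod(file_size, BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * full
    if last:
        sizes.append(last)      # final partial block, if any
    return sizes

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks))                              # 4 blocks: 64+64+64+8 MB
print(blocks[-1] // (1024 * 1024))              # the last block is 8 MB
```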
  14. HDFS Concept
     NameNode and DataNodes:
     - A Hadoop cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers).
     - The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. It executes filesystem namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
     - DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with a list of the blocks that they are storing.
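The division of labour above can be pictured with a toy model: the NameNode holds the file-to-block mapping, and rebuilds its picture of where each block lives from the DataNodes' periodic block reports. File names, block ids, and node names here are made up for illustration; this is not Hadoop's actual API.

```python
# A toy model of NameNode metadata: the namespace (file -> block ids)
# plus block locations learned from DataNode block reports.
from collections import defaultdict

namespace = {"/logs/part-0": ["blk_1", "blk_2"]}   # file -> ordered blocks

block_locations: dict[str, set] = defaultdict(set)

def receive_block_report(datanode: str, blocks: list[str]) -> None:
    """Record which blocks a DataNode says it is storing."""
    for b in blocks:
        block_locations[b].add(datanode)

# Three DataNodes report in; each block ends up with two replicas.
receive_block_report("dn1", ["blk_1", "blk_2"])
receive_block_report("dn2", ["blk_1"])
receive_block_report("dn3", ["blk_2"])

print(sorted(block_locations["blk_1"]))   # ['dn1', 'dn2']
```

The point of the design is that the namespace lives only on the master, while block locations are soft state that can always be reconstructed from the workers' reports.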
  15. Architecture
  16. Architecture
  17. HDFS Goals
     - Hardware Failure: an HDFS instance may consist of hundreds or thousands of server machines, each storing part of the filesystem's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
     - Large Data Sets: applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
     - "Moving Computation is Cheaper than Moving Data": a computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data is huge; it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
  18. Command Line User Interface
  19. MapReduce: Overview, How MapReduce Works
  20. Overview
     - Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
     - A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
     - The MapReduce framework consists of a single master JobTracker and one worker TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the workers, monitoring them, and re-executing the failed tasks. The workers execute the tasks as directed by the master.
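The map/sort/reduce flow described above can be imitated in a few lines. This is an in-memory sketch of the programming model only: the input splits and words are made up, and nothing here is distributed, scheduled, or fault-tolerant.

```python
# Minimal imitation of MapReduce: map each split to (word, 1) pairs,
# sort so equal keys are adjacent (the shuffle/sort step), then reduce
# each group of pairs to a single (word, count).
from itertools import groupby
from operator import itemgetter

def map_phase(split: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in split.split()]

def reduce_phase(word: str, counts) -> tuple[str, int]:
    return word, sum(counts)

splits = ["hadoop stores data", "hadoop analyzes data"]

# map every split, then sort the intermediate pairs by key
pairs = sorted(p for s in splits for p in map_phase(s))

result = dict(
    reduce_phase(word, (c for _, c in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result)   # {'analyzes': 1, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In the real framework the map calls run on different nodes, the sorted intermediate pairs travel over the network to the reducers, and the JobTracker restarts any task that fails; the per-record logic, however, is exactly this shape.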
  21. How MapReduce Works
  22. How MapReduce Works
  23. Practice: Demo, Discussion