Apache Hadoop


Published in: Education

  1. 1. Apache Hadoop. Presented by Darpan Dekivadiya (09BCE008)
  2. 2. What is Hadoop?
      • A framework for storing and processing big data on large clusters of commodity machines.
        o Up to 4,000 machines in a cluster
        o Up to 20 PB in a cluster
      • Open-source Apache project
      • High reliability implemented in software
        o Automated failover for data and computation
      • Implemented in Java
      28-10-2012
  3. 3. Hadoop development
      • Hadoop was created by Doug Cutting.
      • It is named after his son's toy elephant.
      • It was originally developed to support the Nutch search engine project.
      • Since then, many companies have adopted it and contributed to the project.
  4. 4. Hadoop Ecosystem
      • Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
      • Hadoop Common: The common utilities that support the other Hadoop subprojects.
      • HDFS: A distributed file system that provides high-throughput access to application data.
      • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
      • Pig: A high-level data-flow language and execution framework for parallel computation.
      • HBase: A scalable, distributed database that supports structured data storage for large tables.
  5. 5. [image-only slide]
  6. 6. Why Hadoop?
      • Need to process multi-petabyte datasets.
      • It is expensive to build reliability into each application.
      • Nodes fail every day.
        o Failure is expected, rather than exceptional.
        o The number of nodes in a cluster is not constant.
      • Need common infrastructure.
        o Efficient, reliable, open source (Apache License)
      • These goals are the same as Condor's, but
        o Workloads are I/O-bound rather than CPU-bound.
  7. 7. Hadoop History
      • Dec 2004 – Google MapReduce paper published
      • July 2005 – Nutch (search engine) uses MapReduce
      • Feb 2006 – Starts as a Lucene subproject
      • Apr 2007 – Yahoo! runs it on a 1000-node cluster
      • Jan 2008 – Becomes an Apache Top Level Project
      • May 2009 – Hadoop sorts a petabyte in 17 hours
      • Aug 2010 – World's largest Hadoop cluster at Facebook
        o 2900 nodes, 30+ PB
  8. 8. Who uses Hadoop?
      • Amazon/A9
      • Facebook
      • Google
      • IBM
      • Joost
      • Last.fm
      • New York Times
      • PowerSet
      • Veoh
      • Yahoo!
  9. 9. Applications of Hadoop
      • Search
        o Yahoo, Amazon, Zvents
      • Log processing
        o Facebook, Yahoo, ContextWeb, Joost, Last.fm
      • Recommendation systems
        o Facebook
      • Data warehouse
        o Facebook, AOL
      • Video and image analysis
        o New York Times, Eyealike
  10. 10. Who generates the data?
      • Facebook generates large volumes of data:
        o 500+ million active users
        o 30 billion pieces of content shared every month (news stories, photos, blogs, etc.)
      • The Yahoo search engine generates large volumes of data.
      • The Amazon S3 cloud service generates large volumes of data.
  11. 11. Data usage
      • Statistics per day:
        o 20 TB of compressed new data added
        o 3 PB of compressed data scanned
        o 20K jobs on the production cluster
        o 480K compute hours
      • The barrier to entry is significantly reduced:
        o New engineers go through a Hadoop/Hive training session.
        o 300+ people run jobs on Hadoop.
        o Analysts (non-engineers) use Hadoop through Hive.
  12. 12. HDFS: Hadoop Distributed File System
  13. 13. Based on Google File System
  14. 14. Redundant storage
  15. 15. Commodity Hardware
      • Typically a two-level architecture
        o Nodes are commodity PCs.
        o 20-40 nodes per rack
        o The default Apache Hadoop block size is 64 MB.
        o By comparison, relational databases typically store data in blocks of 4 KB to 32 KB.
  16. 16. How does HDFS maintain everything?
      • Two types of nodes
        o A single NameNode and a number of DataNodes
      • NameNode
        o Holds file names, permissions, modified flags, etc.
        o Data locations are exposed so that computations can …
      • DataNode
        o Stores and retrieves blocks when told to.
        o HDFS is built in Java; any machine that supports Java can run the NameNode or the DataNode software.
  17. 17. How does HDFS work?
  18. 18. NameNode and DataNode roles
      • The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
      • The DataNodes are responsible for serving read and write requests from the file system's clients.
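To make the block-to-DataNode mapping concrete, here is a minimal sketch in plain Java (no Hadoop dependency). It uses the 64 MB default block size quoted in the slides. The replication factor of 3 matches the usual HDFS default, but the round-robin placement, the class name `HdfsBlockSketch`, and the node names are illustrative assumptions, not the NameNode's actual placement policy.

```java
import java.util.ArrayList;
import java.util.List;

final class HdfsBlockSketch {
    // Default block size quoted in the slides: 64 MB.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;
    // Assumed replication factor (the usual HDFS default).
    static final int REPLICATION = 3;

    // Number of blocks needed for a file, rounding the last partial block up.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Hypothetical round-robin placement: block b goes to nodes b, b+1, b+2 (mod n).
    // Replicas are distinct only when there are at least REPLICATION DataNodes.
    static List<List<String>> placeBlocks(long fileSizeBytes, List<String> dataNodes) {
        List<List<String>> placement = new ArrayList<>();
        long blocks = blockCount(fileSizeBytes);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        System.out.println("blocks: " + blockCount(fileSize));
        System.out.println("placement: " + placeBlocks(fileSize, nodes));
    }
}
```

A 200 MB file needs four blocks here, the last one only partly full. Losing any single DataNode still leaves two replicas of every block, which is the sense in which reliability is handled in software.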
  19. 19. MapReduce: Google's MapReduce Technique
  20. 20. MapReduce Overview
      • Provides a clean abstraction for programmers writing distributed applications.
      • Factors many reliability concerns out of application logic.
      • A batch data processing system
      • Automatic parallelization and distribution
      • Fault tolerance
      • Status and monitoring tools
  21. 21. Programming Model
      • The programmer implements an interface of two functions:
        o map (in_key, in_value) -> (out_key, intermediate_value) list
        o reduce (out_key, intermediate_value list) -> out_value list
  22. 22. MapReduce Flow
  23. 23. Mapper (indexing example)
      • Input is the line number and the line itself.
      • Input 1: ("100", "I Love India")
      • Output 1: ("I", "100"), ("Love", "100"), ("India", "100")
      • Input 2: ("101", "I Love eBay")
      • Output 2: ("I", "101"), ("Love", "101"), ("eBay", "101")
  24. 24. Reducer (indexing example)
      • Input is a word and the list of line numbers it appears on.
      • Input 1: ("I", ["100", "101"])
      • Input 2: ("Love", ["100", "101"])
      • Input 3: ("India", ["100"])
      • Input 4: ("eBay", ["101"])
      • Output: each word is stored along with its line numbers.
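The indexing example can be simulated in a single process. The sketch below is plain Java, not the Hadoop MapReduce API: `map` and `reduce` follow the signatures from the Programming Model slide, and `run` stands in for the framework's shuffle step that groups intermediate values by key. The class and method names are illustrative assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // map(in_key, in_value) -> list of (out_key, intermediate_value):
    // emit one (word, lineNo) pair per word on the line.
    static List<Map.Entry<String, String>> map(String lineNo, String line) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, lineNo));
            }
        }
        return out;
    }

    // reduce(out_key, intermediate_value list) -> out_value list:
    // return the sorted line numbers for one word.
    static List<String> reduce(String word, List<String> lineNos) {
        List<String> sorted = new ArrayList<>(lineNos);
        Collections.sort(sorted);
        return sorted;
    }

    // Stand-in for the framework: run every mapper, group by key, run every reducer.
    static Map<String, List<String>> run(Map<String, String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> line : lines.entrySet()) {
            for (Map.Entry<String, String> kv : map(line.getKey(), line.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            index.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> lines = new LinkedHashMap<>();
        lines.put("100", "I Love India ");
        lines.put("101", "I Love eBay");
        System.out.println(run(lines));
    }
}
```

In a real cluster the mappers and reducers run on different machines and the grouping happens over the network; only the two user-supplied functions keep the same shape.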
  25. 25. Google PageRank example
      • Mapper
        o Input is a link and the HTML content of the page.
        o Output is a list of outgoing links, each paired with this page's PageRank contribution.
      • Reducer
        o Input is a link and a list of PageRank contributions from the pages linking to this page.
        o Output is the PageRank of this page, computed as a damped weighted sum of the input contributions.
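One PageRank iteration fits the same map/reduce shape. The sketch below assumes the textbook formulation, new_rank(p) = (1 - d)/N + d * Σ rank(q)/outdegree(q) over pages q linking to p, with damping d = 0.85. The slide only describes the computation loosely, so the formula, the class name `PageRankSketch`, and the three-page graph are assumptions for illustration. It also assumes every page appears as a key in the link map and has at least one outgoing link.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PageRankSketch {
    static final double DAMPING = 0.85; // assumed damping factor

    // Mapper: split this page's rank evenly across its outgoing links.
    static Map<String, Double> map(String page, double rank, List<String> outLinks) {
        Map<String, Double> contributions = new HashMap<>();
        for (String target : outLinks) {
            contributions.merge(target, rank / outLinks.size(), Double::sum);
        }
        return contributions;
    }

    // Reducer: combine the contributions received by one page.
    static double reduce(List<Double> contributions, int pageCount) {
        double sum = 0.0;
        for (double c : contributions) {
            sum += c;
        }
        return (1.0 - DAMPING) / pageCount + DAMPING * sum;
    }

    // One iteration: map every page, group contributions by target, reduce.
    static Map<String, Double> iterate(Map<String, List<String>> links, Map<String, Double> ranks) {
        Map<String, List<Double>> shuffled = new HashMap<>();
        for (String page : links.keySet()) {
            shuffled.put(page, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            map(e.getKey(), ranks.get(e.getKey()), e.getValue()).forEach((target, share) ->
                shuffled.computeIfAbsent(target, k -> new ArrayList<>()).add(share));
        }
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : shuffled.entrySet()) {
            next.put(e.getKey(), reduce(e.getValue(), links.size()));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", List.of("B", "C"));
        links.put("B", List.of("C"));
        links.put("C", List.of("A"));
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / links.size());
        }
        System.out.println(iterate(links, ranks));
    }
}
```

In practice the job is rerun, feeding each iteration's output back in as input, until the ranks stop changing appreciably.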
  26. 26. Contd.
      • Limited atomicity and transaction support.
        o HBase supports multiple batched mutations of single rows only.
        o Data is unstructured and untyped.
      • Not accessed or manipulated via SQL.
        o Programmatic access via Java, REST, or Thrift APIs.
        o Scripting via JRuby.
  27. 27. Introduction to HBase
  28. 28. Overview
      • HBase is an Apache open-source project whose goal is to provide storage for the Hadoop distributed computing environment.
      • Data is logically organized into tables, rows, and columns.
  29. 29. Outline
      • Data Model
      • Architecture and Implementation
      • Examples & Tests
  30. 30. Conceptual View
      • A data row has a sortable row key and an arbitrary number of columns.
      • A timestamp is assigned automatically if it is not set explicitly.
      • Columns are named <family>:<label>.

      Row key           Time stamp  Column "contents:"  Column "anchor:"
      "com.apache.www"  t12         "<html>…"
                        t11         "<html>…"
                        t10                             "anchor:apache.com" = "APACHE"
      "com.cnn.www"     t15                             "anchor:cnnsi.com" = "CNN"
                        t13                             "anchor:my.look.ca" = "CNN.com"
                        t6          "<html>…"
                        t5          "<html>…"
                        t3          "<html>…"
  31. 31. Physical Storage View
      • Physically, tables are stored on a per-column-family basis.
      • Empty cells are not stored in a column-oriented storage format.
      • Each column family is managed by an HStore.
      • HStore components shown in the figure: Memcache, Data MapFile (key/value), Index MapFile (index key).

      Row key           TS   Column "contents:"
      "com.apache.www"  t12  "<html>…"
                        t11  "<html>…"
      "com.cnn.www"     t6   "<html>…"
                        t5   "<html>…"
                        t3   "<html>…"

      Row key           TS   Column "anchor:"
      "com.apache.www"  t10  "anchor:apache.com" = "APACHE"
      "com.cnn.www"     t9   "anchor:cnnsi.com" = "CNN"
                        t8   "anchor:my.look.ca" = "CNN.com"
  32. 32. Row Ranges: Regions
      • Cells are sorted by row key and column ascending, and by timestamp descending.
      • Physically, tables are broken into row ranges (regions), each containing the rows from a start key to an end key.
      • [Figure: example table with row keys aaaa through aaae and timestamped "contents:" and "anchor:" cells, split into row ranges]
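Because regions are contiguous ranges of sorted row keys, finding the region for a row amounts to finding the greatest region start key that is not larger than the row key. The sketch below shows that lookup with a `TreeMap`. It is a minimal illustration, not HBase's actual client code; the class name `RegionLookupSketch`, the server names, and the split point are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionLookupSketch {
    // Region start key -> name of the server hosting that region.
    // A region covers [its start key, the next region's start key).
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // Find the region whose start key is the greatest key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> region = regionsByStartKey.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalArgumentException("no region covers row key " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "serverA");     // first region starts at the empty key
        table.addRegion("aaac", "serverB"); // hypothetical split point
        System.out.println(table.locate("aaab"));
        System.out.println(table.locate("aaad"));
    }
}
```

With the split at "aaac", rows aaaa and aaab fall in the first region and rows aaac through aaae in the second.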
  33. 33. Outline
      • Data Model
      • Architecture and Implementation
      • Examples & Tests
  34. 34. Three major components
      • The HBaseMaster
      • The HRegionServer
      • The HBase client
  35. 35. HBaseMaster
      • Assigns regions to HRegionServers.
        1. The ROOT region locates all the META regions.
        2. Each META region maps a number of user regions.
        3. The master assigns user regions to the HRegionServers.
      • Enables and disables tables and changes table schemas.
      • Monitors the health of each server.
      • [Figure: the master above a row of servers hosting the ROOT region, META regions, and USER regions]
  36. 36. HBase Client
  37. 37. HBase Client: ROOT Region
  38. 38. HBase Client: META Region
  39. 39. HBase Client: User Region (information cached)
  40. 40. Outline
      • Data Model
      • Architecture and Implementation
      • Examples & Tests
  41. 41. Create MyTable
      Table layout: Row Key | Timestamp | columnFamily1: | columnFamily2:

      // Describe the two column families, then create the table.
      HBaseAdmin admin = new HBaseAdmin(config);
      HColumnDescriptor[] column = new HColumnDescriptor[2];
      column[0] = new HColumnDescriptor("columnFamily1:");
      column[1] = new HColumnDescriptor("columnFamily2:");
      HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
      desc.addFamily(column[0]);
      desc.addFamily(column[1]);
      admin.createTable(desc);
  42. 42. Insert Values
      // Stage two cell updates for row "myRow", then commit them to the table.
      BatchUpdate batchUpdate = new BatchUpdate("myRow", timestamp);
      batchUpdate.put("columnFamily1:labela", Bytes.toBytes("labela value"));
      batchUpdate.put("columnFamily1:labelb", Bytes.toBytes("labelb value"));
      table.commit(batchUpdate);

      Row Key  Timestamp  columnFamily1:
      myRow    ts1        labela = "labela value"
               ts2        labelb = "labelb value"
  43. 43. Search
      Select value from table where key='com.apache.www' AND label='anchor:apache.com'

      Row key           Time stamp  Column "anchor:"
      "com.apache.www"  t10         "anchor:apache.com" = "APACHE"
      "com.cnn.www"     t9          "anchor:cnnsi.com" = "CNN"
                        t8          "anchor:my.look.ca" = "CNN.com"
      (Timestamps t12, t11, t6, t5, and t3 hold no "anchor:" values.)
  44. 44. Scanner Search
      Select value from table where anchor='cnnsi.com'

      Row key           Time stamp  Column "anchor:"
      "com.apache.www"  t10         "anchor:apache.com" = "APACHE"
      "com.cnn.www"     t9          "anchor:cnnsi.com" = "CNN"
                        t8          "anchor:my.look.ca" = "CNN.com"
      (Timestamps t12, t11, t6, t5, and t3 hold no "anchor:" values.)
  45. 45. Pig: Programming Language for the Hadoop Framework
  46. 46. Introduction
      • Pig was initially developed at Yahoo!
      • The Pig programming language is designed to handle any kind of data, hence the name.
      • Pig is made of two components:
        o The language itself, which is called Pig Latin.
        o The runtime environment where Pig Latin programs are executed.
  47. 47. Why Pig Latin?
      • MapReduce is very powerful, but:
        o It requires a Java programmer.
        o The user has to reinvent common functionality (join, filter, etc.).
      • Pig Latin was introduced for non-Java programmers.
      • Pig Latin is a data-flow language rather than a procedural or declarative one.
      • User code and existing binaries can be included almost anywhere.
      • Metadata is not required, but is used when available.
      • Support for nested types.
      • Operates on files in HDFS.
  48. 48. Pig Latin Overview
      • Pig provides a higher-level language, Pig Latin, that:
        o Increases productivity. In one test, 10 lines of Pig Latin ≈ 200 lines of Java; what took 4 hours to write in Java took 15 minutes in Pig Latin.
        o Opens the system to non-Java programmers.
        o Provides common operations like join, group, filter, and sort.
  49. 49. Load Data
      • The objects that Hadoop works on are stored in HDFS.
      • To access this data, the program must first tell Pig what file (or files) it will use.
      • That is done through the LOAD 'data_file' command.
      • If the data is stored in a file format that is not natively accessible to Pig, add a USING clause to the LOAD statement to specify a user-defined function that can read in and interpret the data.
  50. 50. Transform Data
      • The transform logic is where all the data manipulation happens.
      • For example:
        o FILTER out rows that are not of interest.
        o JOIN two sets of data files.
        o GROUP data to build aggregations.
        o ORDER results.
  51. 51. Example of a Pig Program
      • The program reads a file of Twitter feeds, selects only the tweets whose iso_language_code is en (English), groups them by the user who is tweeting, and displays the total number of retweets of that user's tweets.

      L = LOAD 'hdfs://node/tweet_data';
      FL = FILTER L BY iso_language_code == 'en';
      G = GROUP FL BY from_user;
      RT = FOREACH G GENERATE group, SUM(FL.retweets);
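For comparison, the same filter, group, and sum can be written against an in-memory list in plain Java. This is only a sketch of what the Pig program computes, not of how Pig executes it. The `Tweet` class, its field names (mirroring the Pig field names), and the sample data are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class RetweetSumSketch {
    // Hypothetical record shape mirroring the fields used in the Pig program.
    static final class Tweet {
        final String fromUser;
        final String isoLanguageCode;
        final int retweets;

        Tweet(String fromUser, String isoLanguageCode, int retweets) {
            this.fromUser = fromUser;
            this.isoLanguageCode = isoLanguageCode;
            this.retweets = retweets;
        }
    }

    // FILTER by language, GROUP by user, SUM the retweets per group.
    static Map<String, Integer> retweetsByUser(List<Tweet> tweets) {
        return tweets.stream()
            .filter(t -> "en".equals(t.isoLanguageCode))
            .collect(Collectors.groupingBy(
                t -> t.fromUser,
                TreeMap::new,
                Collectors.summingInt(t -> t.retweets)));
    }

    public static void main(String[] args) {
        List<Tweet> tweets = List.of(
            new Tweet("alice", "en", 3),
            new Tweet("bob", "fr", 10),
            new Tweet("alice", "en", 2),
            new Tweet("carol", "en", 0));
        System.out.println(retweetsByUser(tweets));
    }
}
```

The Java version only works on data that fits in one process; the Pig version is compiled into map and reduce tasks that run over files in HDFS.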
  52. 52. DUMP and STORE
      • The DUMP or STORE command generates the results of a Pig program.
      • DUMP sends the output to the screen, which is useful while debugging Pig programs.
      • DUMP can be used anywhere in a program to print intermediate result sets to the screen.
      • STORE writes the results of a running program to a file for further processing and analysis.
  53. 53. Pig Runtime Environment
      • The Pig runtime is used when a Pig program needs to run in the Hadoop environment.
      • There are three ways to run a Pig program:
        o Embedded in a script.
        o Embedded in a Java program.
        o From the Pig command line, called Grunt.
      • The Pig runtime environment translates the program into a set of map and reduce tasks and runs them.
      • This greatly simplifies the work associated with the analysis of large amounts of data.
  54. 54. What is Pig used for?
      • Web log processing.
      • Data processing for web search platforms.
      • Ad hoc queries across large data sets.
      • Rapid prototyping of algorithms for processing large data sets.
  55. 55. Hadoop@BIG: Statistics on Hadoop use at large organizations
  56. 56. Hadoop@Facebook
      • Production cluster
        o 4800 cores, 600 machines, 16 GB per machine (April 2009)
        o 8000 cores, 1000 machines, 32 GB per machine (July 2009)
        o 4 SATA disks of 1 TB each per machine
        o Two-level network hierarchy, 40 machines per rack
        o Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
      • Test cluster
        o 800 cores, 16 GB each
  57. 57. Hadoop@Yahoo
      • World's largest Hadoop production application.
      • The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.
      • Biggest contributor to Hadoop.
      • Converting all of its batch jobs to Hadoop.
  58. 58. Hadoop@Amazon
      • Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
      • The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
      • Amazon Elastic MapReduce is a new web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
  59. 59. Thank You