Hadoop
Transcript

  • 1. Hadoop – Saeed Iqbal, P11-6501, MSCS
  • 2. What Is Hadoop? (inspired by Google's GFS and MapReduce) • A system for processing mind-bogglingly large amounts of data. • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
  • 3. Hadoop Core • An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware. • Open-source software + commodity hardware = reduced IT costs. • Two core pieces: MapReduce (computation) and HDFS (storage).
  • 4. Hadoop, Why? • Need to process multi-petabyte datasets. • It is expensive to build reliability into each application. • Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant. • Need a common infrastructure: efficient, reliable, open source (Apache License). • The above goals are the same as Condor's, but these workloads are IO-bound, not CPU-bound.
  • 5. Hadoop History
    • 2004: Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
    • December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
    • January 2006: Doug Cutting joins Yahoo!.
    • February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
    • February 2006: Adoption of Hadoop by the Yahoo! Grid team.
    • April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
    • May 2006: Yahoo! sets up a Hadoop research cluster of 300 nodes.
    • May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
    • October 2006: Research cluster reaches 600 nodes.
    • December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
    • January 2007: Research cluster reaches 900 nodes.
    • April 2007: Research clusters: two clusters of 1,000 nodes.
    • April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
    • October 2008: Loading 10 terabytes of data per day onto research clusters.
    • March 2009: 17 clusters with a total of 24,000 nodes.
    • April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
  • 6. Who uses Hadoop? Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!, Twitter.
  • 7. How does HDFS work? Let's suppose we have a file of size 300 MB. (A side note repeated on these slides: MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.)
  • 8. How does HDFS work? HDFS splits the file into blocks. The size of each block is 128 MB (the diagram shows the 300 MB file as two 128 MB blocks and one 44 MB block).
  • 9. How does HDFS work? HDFS will keep 3 copies of each block. HDFS stores these blocks on datanodes and distributes the blocks to the DNs.
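    A quick worked example with the numbers above: splitting 300 MB into 128 MB blocks gives ceil(300 / 128) = 3 blocks (128 MB + 128 MB + 44 MB), and keeping 3 copies of each block means HDFS ends up placing 3 × 3 = 9 block replicas across the datanodes.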
  • 10. How does HDFS work? The Name Node tracks blocks and datanodes. (Diagram: a single Name Node coordinating many DNs, i.e. datanodes.)
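    One way to see this bookkeeping in practice is HDFS's fsck tool, which asks the Name Node to list a file's blocks and the datanodes holding each replica. A hedged sketch (the path /user/data/file.txt is a made-up example):

    $ hadoop fs -ls /user/data
    $ hdfs fsck /user/data/file.txt -files -blocks -locations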
  • 11. How does HDFS work? Sometimes a datanode will die. Not a problem: because each block has copies on other datanodes, the data stays available, and the Name Node re-replicates the missing blocks.
  • 12. Example for MapReduce. Page 1: "the weather is good". Page 2: "today is good". Page 3: "good weather is good".
  • 13. Map output. Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1).
  • 14. Reduce output. Worker 1: (the 1). Worker 2: (is 3). Worker 3: (weather 2). Worker 4: (today 1). Worker 5: (good 4).
  • 15. MapReduce Architecture
  • 16. MapReduce: Programming Model. Process data using special map() and reduce() functions. The map() function is called on every item in the input and emits a series of intermediate key/value pairs. All values associated with a given key are grouped together. The reduce() function is called on every unique key with its value list, and emits a value that is added to the output.
  • 17. MapReduce: Programming Model. (Diagram: the input lines "How now brown cow" and "How does it work now" flow through Map tasks that emit <word, 1> pairs; the framework groups the pairs by key, e.g. <How, 1 1> and <now, 1 1>; Reduce tasks sum each list, producing the output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.)
  • 18. MapReduce: Programming Model. More formally: Map(k1, v1) --> list(k2, v2); Reduce(k2, list(v2)) --> list(v2).
  • 19. MapReduce Life Cycle. (Diagram: you write a Map function and a Reduce function, then run the program as a MapReduce job.)
  • 20. Hadoop Environment. Hadoop has become the kernel of the distributed operating system for Big Data. The project includes these modules: • Hadoop Common: the common utilities that support the other Hadoop modules. • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data. • Hadoop YARN: a framework for job scheduling and cluster resource management. • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
  • 21. Hadoop Architecture. The ecosystem as a stack:
    Hue (Web Console) | Mahout (Data Mining) | Oozie (Job Workflow & Scheduling)
    Zookeeper (Coordination) | Sqoop/Flume (Data Integration) | Pig/Hive (Analytical Language)
    MapReduce Runtime (Dist. Programming Framework)
    HBase (Column NoSQL DB)
    Hadoop Distributed File System (HDFS)
  • 22. Zookeeper – Coordination Framework. (Same architecture diagram as slide 21, with Zookeeper highlighted.)
  • 23. What is ZooKeeper? • A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. • A set of tools to build distributed applications that can safely handle partial failures. • ZooKeeper was designed to store coordination data: status information, configuration, and location information.
  • 24. ZooKeeper • ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system.
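    A minimal sketch of working with znodes from the ZooKeeper Java client; the connection string, znode path, and payload below are made-up examples, not part of the original deck:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkConfigExample {
      public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is a placeholder)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // Create a znode holding a piece of shared configuration
        zk.create("/app-config", "db=primary".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Read it back, much like reading a small file
        Stat stat = new Stat();
        byte[] data = zk.getData("/app-config", false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");
        zk.close();
      }
    }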
  • 25. Flume / Sqoop – Data Integration Framework. (Same architecture diagram as slide 21, with Sqoop/Flume highlighted.)
  • 26. Flume • Flume is a distributed, reliable, and available data collection service. • It efficiently collects, aggregates, and moves large amounts of data. • Fault tolerant, with many failover and recovery mechanisms. • A one-stop solution for collecting data in all formats. • It has a simple and flexible architecture based on streaming data flows.
  • 27. Flume: High-Level Overview• Logical Node• Source• Sink
  • 28. Flume Architecture. (Diagram: multiple log sources feed Flume nodes, which forward events into HDFS.) December 2nd, 2012: Apache Flume 1.3.0 released.
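    What such an agent looks like in configuration, as a hedged sketch in the Flume NG properties format (the agent name, log path, and HDFS path are made-up examples):

    # One source tailing a log, one in-memory channel, one HDFS sink
    agent.sources = src1
    agent.channels = ch1
    agent.sinks = sink1
    agent.sources.src1.type = exec
    agent.sources.src1.command = tail -F /var/log/app.log
    agent.sources.src1.channels = ch1
    agent.channels.ch1.type = memory
    agent.sinks.sink1.type = hdfs
    agent.sinks.sink1.hdfs.path = hdfs://localhost/flume/events
    agent.sinks.sink1.channel = ch1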
  • 29. Sqoop • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. • Easy, parallel database import/export. • What do you want to do? • Import data from an RDBMS into HDFS. • Export data from HDFS back into an RDBMS.
  • 30. Sqoop Architecture & Example (as of March 2012). Sqoop sits between the RDBMS and HDFS:

    $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City ...
    $ hadoop fs -cat City/part-m-00000
    1,Kabul,AFG,Kabol,1780000
    2,Qandahar,AFG,Qandahar,237500
    3,Herat,AFG,Herat,186800
    4,Mazar-e-Sharif,AFG,Balkh,127800
    5,Amsterdam,NLD,Noord-Holland,731200
    ...
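    The reverse direction is a single command as well; a hedged sketch mirroring the import example above (the table and directory names are illustrative):

    $ sqoop export --connect jdbc:mysql://localhost/world --username root --table City --export-dir City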
  • 31. Pig / Hive – Analytical Language. (Same architecture diagram as slide 21, with Pig/Hive highlighted.)
  • 32. Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master. Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code. Many organizations have programmers who are skilled at writing code in scripting languages. Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce. Hive was initially developed at Facebook, Pig at Yahoo!.
  • 33. Hive. What is Hive? An SQL-like interface to Hadoop: a data warehouse infrastructure that provides easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop-compatible file systems. It uses MapReduce for execution and HDFS for storage. Hive Query Language: basic SQL (SELECT, FROM, JOIN, GROUP BY), equi-joins, multi-table insert, multi-group-by, batch queries. For example:

    SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid;
  • 34. Hive. (Diagram: SQL queries go into Hive, which turns them into MapReduce jobs.)
  • 35. Pig. Apache Pig is a platform to analyze large data sets. In simple terms, you have lots and lots of data on which you need to do some processing or analysis; one way is to write MapReduce code and then run that processing on the data. The other way is to write Pig scripts, which are in turn converted to MapReduce code that processes your data. Pig consists of two parts: • the Pig Latin language • the Pig engine. For example:

    A = LOAD 'a.txt' AS (id, name, age, ...);
    B = LOAD 'b.txt' AS (id, address, ...);
    C = JOIN A BY id, B BY id;
    STORE C INTO 'c.txt';
  • 36. Pig Latin & Pig Engine • Pig Latin is a scripting language which allows you to describe how data flows from one or more inputs should be read, how the data should be processed, and where it should be stored. • The flows can be simple or complex, with processing applied in between; data can be picked from multiple inputs. • We can say Pig Latin describes a directed acyclic graph where the edges are data flows and the nodes are operators that process the data. • Pig Engine: the job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
  • 37. Why is Pig required when we can code it all in MR? • Pig provides all standard data processing operations, such as sort, group, join, filter, order by, and union, right inside Pig Latin. • In MR we have to do lots of manual coding; Pig optimizes Pig Latin scripts while compiling them into MR jobs, creating an optimized version of MapReduce to run on Hadoop. • It takes much less time to write a Pig Latin script than to write the corresponding MR code. • Where Pig is useful: transactional ETL data pipelines (its most common use), research on raw data, and iterative processing.
  • 38. (Diagram: a Pig script goes into Pig, which compiles it into MapReduce jobs.)
  • 39. WordCount Example • Input: "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop" • For the given sample input the map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> • The reduce just sums up the values: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
  • 40. WordCount Example in MapReduce

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit <word, 1> for every token
          }
        }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get(); // sum the 1s emitted for this word
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }
  • 41. WordCount Example in Pig

    A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
    B = GROUP A BY token;
    C = FOREACH B GENERATE group, COUNT(A) AS count;
    DUMP C;
  • 42. WordCount Example in Hive

    CREATE TABLE wordcount (token STRING);
    LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
    SELECT token, count(*) FROM wordcount GROUP BY token;
  • 43. Hbase – Column NoSQL DB. (Same architecture diagram as slide 21, with HBase highlighted.)
  • 44. Structured-data vs Raw-data
  • 45. HBase • Apache HBase™ is the Hadoop database: a distributed, scalable big data store. • Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. • Coordinated by Zookeeper. • Low latency. • Random reads and writes. • Distributed key/value store. • Simple API: PUT, GET, DELETE, SCAN.
  • 46. Hbase. HBase is a type of "NoSQL" database. NoSQL? "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
  • 47. HBase Examples

    hbase> create 'mytable', 'mycf'
    hbase> list
    hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
    hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
    hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
    hbase> scan 'mytable'
    hbase> disable 'mytable'
    hbase> drop 'mytable'

    HBase reference: http://hbase.apache.org
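    The same put/get operations from the HBase Java client, as a hedged sketch (this uses the newer Connection/Table API rather than the client API current when this deck was written; the table and column names mirror the shell example above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGetExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
          // PUT: write val1 into row1, column mycf:col1
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
          table.put(put);
          // GET: read the cell back
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
        }
      }
    }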
  • 48. Oozie – Job Workflow & Scheduling. (Same architecture diagram as slide 21, with Oozie highlighted.)
  • 49. What is Oozie? Oozie is a server-based workflow scheduler system to manage Apache Hadoop jobs (e.g. loading data, storing data, analyzing data, cleaning data, running MapReduce jobs, etc.). It is a Java web application. Oozie is a workflow scheduler for Hadoop, triggered by time or by data availability. (Diagram: a workflow of Job 1 through Job 5, with time- and data-triggered starts.)
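    A skeleton of what such a workflow definition looks like, as a hedged sketch of a workflow.xml against the standard Oozie workflow schema (the workflow and action names are made up, and a real map-reduce action additionally needs a configuration block naming its mapper, reducer, and paths):

    <workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="wordcount"/>
      <action name="wordcount">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Workflow failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>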
  • 50. Mahout – Data Mining. (Same architecture diagram as slide 21, with Mahout highlighted.)
  • 51. What is Mahout? A machine-learning tool: distributed and scalable machine learning algorithms on the Hadoop platform, making it easier and faster to build intelligent applications. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
  • 52. Mahout Use Cases • Yahoo: spam detection • Foursquare: recommendations • SpeedDate.com: recommendations • Adobe: user targeting • Amazon: personalization platform
  • 53. Use Case Example. Predict what the user likes based on: • his/her historical behavior • the aggregate behavior of people similar to him/her.
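    A small sketch of that idea with Mahout's user-based collaborative filtering (the Taste API); the ratings file, neighborhood size, user id, and item count below are made-up examples:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // Historical behavior: one "userID,itemID,rating" line per preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // "People similar to him/her": similarity over users' rating vectors
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 items predicted for user 1 based on the neighborhood's behavior
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item);
        }
      }
    }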
  • 54. Recap – Hadoop Architecture. (Same architecture diagram as slide 21.)
  • 55. THANKS