Transcript of "Hadoop"

  1. Hadoop
Saeed Iqbal, P11-6501, MSCS
  2. What Is Hadoop? (inspired by Google)
• A system for processing mind-bogglingly large amounts of data.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
  3. Hadoop Core
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware.
• Open-source software + commodity hardware = reduced IT costs.
• Two core pieces: MapReduce for computation, HDFS for storage.
  4. Hadoop, Why?
• Need to process multi-petabyte datasets.
• Expensive to build reliability into each application.
• Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.
• Need common infrastructure that is efficient, reliable, and open source (Apache License).
• These goals are the same as Condor's, but Hadoop workloads are I/O-bound, not CPU-bound.
  5. Hadoop History
• 2004—Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006—Adoption of Hadoop by the Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006—Yahoo! set up a Hadoop research cluster of 300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters: 2 clusters of 1000 nodes.
• April 2008—Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day onto research clusters.
• March 2009—17 clusters with a total of 24,000 nodes.
• April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes) and the 100-terabyte sort in 173 minutes (on 3400 nodes).
  6. Who uses Hadoop?
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!, Twitter.com
  7. How does HDFS Work?
Suppose we have a file of size 300 MB. (The illustration's sample file contains this text: "MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN. The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.")
  8. How does HDFS Work?
HDFS splits the file into blocks. The size of each block is 128 MB, so the 300 MB file becomes two 128 MB blocks and one 44 MB block.
  9. How does HDFS Work?
HDFS keeps 3 copies of each block. It stores the blocks on DataNodes, distributing the copies across the cluster.
  10. How does HDFS Work?
The NameNode tracks blocks and DataNodes: it holds the metadata that maps every block to the DataNodes storing its copies. (Diagram: one NameNode coordinating a grid of DataNodes.)
  11. How does HDFS Work?
Sometimes a DataNode will die. Not a problem: the NameNode notices the lost node and re-replicates its blocks from the surviving copies, restoring the replication factor.
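The behavior above can be exercised programmatically. Below is a minimal sketch, not from the slides, of writing a file through the HDFS Java API; the NameNode URI, the path, and the fs.defaultFS / dfs.replication keys follow Hadoop 2.x conventions and are assumptions here.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        conf.set("dfs.replication", "3");                 // keep 3 copies of each block
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");    // hypothetical path
        // The client streams bytes; HDFS cuts them into blocks, places the blocks
        // on DataNodes, and the NameNode records where each copy lives.
        try (OutputStream out = fs.create(path)) {
            out.write("three hundred megabytes of data would go here".getBytes("UTF-8"));
        }
        System.out.println("replication = " + fs.getFileStatus(path).getReplication());
    }
}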
  12. Example for MapReduce
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
  13. Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)
  14. Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
  15. MapReduce Architecture (diagram)
  16. MapReduce: Programming Model
Process data using special map() and reduce() functions.
• The map() function is called on every item in the input and emits a series of intermediate key/value pairs.
• All values associated with a given key are grouped together.
• The reduce() function is called on every unique key and its value list, and emits a value that is added to the output.
  17. MapReduce: Programming Model
(Diagram: the inputs "How now brown cow" and "How does it work now" flow through map tasks, which emit <word, 1> pairs; the framework groups the pairs by key, e.g. <How, 1 1> and <now, 1 1>; reduce tasks then emit the final counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.)
  18. MapReduce: Programming Model
More formally:
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
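To make the model concrete, here is a plain-Java simulation of the map/group/reduce flow over the three pages from slide 12. It is illustrative only (no Hadoop classes), but it reproduces the counts shown on slide 14.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // The three pages from slide 12 (punctuation stripped for simplicity).
        String[] pages = {"the weather is good", "today is good", "good weather is good"};
        // map(): called on every item; emits intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String page : pages)
            for (String word : page.split(" "))
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
        // The framework groups all values associated with a given key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        // reduce(): called on every unique key with its value list; emits one value.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + " " + sum); // good 4, is 3, the 1, today 1, weather 2
        }
    }
}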
  19. MapReduce Life Cycle
(Diagram: write a map function and a reduce function, then run the program as a MapReduce job.)
  20. Hadoop Environment
Hadoop has become the kernel of the distributed operating system for Big Data. The project includes these modules:
• Hadoop Common: the common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
  21. Hadoop Architecture
(Stack diagram, top to bottom:)
Hue (Web Console) | Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
Zookeeper (Coordination) | Sqoop/Flume (Data Integration) | Pig/Hive (Analytical Language)
MapReduce Runtime (Dist. Programming Framework) | Hbase (Column NoSQL DB)
Hadoop Distributed File System (HDFS)
  22. Zookeeper – Coordination Framework
(The architecture stack diagram from slide 21, with Zookeeper highlighted.)
  23. What is ZooKeeper?
• A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
• A set of tools to build distributed applications that can safely handle partial failures.
• ZooKeeper was designed to store coordination data: status information, configuration, and location information.
  24. ZooKeeper
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system.
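As a concrete illustration, here is a minimal sketch using the ZooKeeper Java client. The ensemble address and the /app/config znode are assumptions for the example, and it assumes those paths do not already exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed ensemble with a 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // znodes form a hierarchical namespace, much like file-system paths.
        zk.create("/app", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "db=worlddb".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any process connected to the ensemble can read (and watch) the same register.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}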
  25. Flume / Sqoop – Data Integration Framework
(The architecture stack diagram from slide 21, with Sqoop/Flume highlighted.)
  26. FLUME
• Flume is a distributed, reliable, and available data collection service.
• It efficiently collects, aggregates, and moves large amounts of data.
• Fault tolerant, with many failover and recovery mechanisms.
• A one-stop solution for data collection in all formats.
• It has a simple and flexible architecture based on streaming data flows.
  27. Flume: High-Level Overview
• Logical Node
• Source
• Sink
  28. Flume Architecture
(Diagram: log sources feed Flume nodes, which forward and aggregate events into HDFS.)
December 2nd, 2012 - Apache Flume 1.3.0 released.
  29. Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
• Easy, parallel database import/export.
• What do you want to do?
  • Import data from an RDBMS into HDFS
  • Export data from HDFS back into an RDBMS
  30. Sqoop Architecture & Example
(Diagram, March 2012: Sqoop transfers bulk data between an RDBMS and HDFS.)

$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City ...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
  31. Pig / Hive – Analytical Language
(The architecture stack diagram from slide 21, with Pig/Hive highlighted.)
  32. Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master.
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code.
• Many organizations have programmers who are skilled at writing code in scripting languages.
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce. Hive was initially developed at Facebook, Pig at Yahoo!.
  33. Hive
What is Hive?
• An SQL-like interface to Hadoop.
• Data warehouse infrastructure that provides easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop-compatible file systems.
• Uses MapReduce for execution and HDFS for storage.
Hive Query Language
• Basic SQL: SELECT, FROM, JOIN, GROUP BY
• Equi-join, multi-table insert, multi-group-by
• Batch query, e.g.:
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid
  34. Hive
(Diagram: SQL statements go into Hive, which compiles them into MapReduce jobs.)
  35. Pig
Apache Pig is a platform to analyze large data sets. In simple terms, you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and run that processing over the data; the other way is to write Pig scripts, which are in turn converted into MapReduce code that processes your data.
Pig consists of two parts:
• The Pig Latin language
• The Pig engine

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
  36. Pig Latin & Pig Engine
• Pig Latin is a scripting language which lets you describe how data from one or more inputs should be read, how it should be processed, and where it should be stored.
• The flows can be simple or complex, with processing applied in between; data can be picked from multiple inputs.
• We can say Pig Latin describes a directed acyclic graph in which the edges are data flows and the nodes are operators that process the data.
• Pig Engine: the job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
  37. Why is Pig required when we can code it all in MapReduce?
• Pig provides all the standard data processing operations, such as sort, group, join, filter, order by, and union, right inside Pig Latin.
• In MapReduce we have to do lots of manual coding; Pig optimizes Pig Latin scripts while compiling them into MapReduce jobs.
• It creates an optimized version of the MapReduce jobs to run on Hadoop.
• It takes far less time to write a Pig Latin script than to write the corresponding MapReduce code.
• Where Pig is useful: transactional ETL data pipelines (the most common use), research on raw data, iterative processing.
  38. Pig
(Diagram: a Pig script goes into Pig, which compiles it into MapReduce jobs.)
  39. WordCount Example
• Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
• The reduce just sums up the values:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
  40. WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one); // emit <word, 1> for every token
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get(); // sum all counts for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
  41. WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage(' ') AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;
  42. WordCount Example By Hive

CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
  43. Hbase – Column NoSQL DB
(The architecture stack diagram from slide 21, with Hbase highlighted.)
  44. Structured-data vs Raw-data (comparison diagram)
  45. HBase
• Apache HBase™ is the Hadoop database: a distributed, scalable, big data store.
• Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al.
• Coordinated by Zookeeper
• Low latency; random reads and writes
• Distributed key/value store
• Simple API: PUT, GET, DELETE, SCAN
  46. Hbase
HBase is a type of "NoSQL" database. NoSQL? "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "data store" than a "database" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
  47. HBase Examples

hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

HBase reference: http://hbase.apache.org
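The same operations are available from Java. Here is a hedged sketch against the classic (0.9x-era) HBase client API, reusing the table and column family created in the shell session above; it assumes hbase-site.xml is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "mytable");
        // PUT: write val1 into column mycf:col1 of row1.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
        table.put(put);
        // GET: read the row back and print the stored value.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
        table.close();
    }
}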
  48. Oozie – Job Workflow & Scheduling
(The architecture stack diagram from slide 21, with Oozie highlighted.)
  49. What is Oozie?
• Oozie is a server-based workflow scheduler system to manage Apache Hadoop jobs (e.g. loading data, storing data, analyzing data, cleaning data, running MapReduce jobs, etc.).
• It is a Java web application.
• Oozie is a workflow scheduler for Hadoop. (Diagram: a trigger, by time or by data availability, starts a workflow of Jobs 1 through 5.)
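As an illustration, a workflow already deployed to HDFS can be submitted through Oozie's Java client API. A minimal sketch, in which the server URL, application path, and cluster addresses are all assumptions:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server URL.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");
        Properties conf = client.createConfiguration();
        // HDFS directory containing workflow.xml (hypothetical path).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wordcount-wf");
        conf.setProperty("jobTracker", "jobtracker:8021");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        // Submit and start the workflow; Oozie runs its actions in order.
        String jobId = client.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
        System.out.println("Status: " + client.getJobInfo(jobId).getStatus());
    }
}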
  50. Mahout – Data Mining
(The architecture stack diagram from slide 21, with Mahout highlighted.)
  51. What is Mahout?
• A machine-learning tool.
• Distributed and scalable machine learning algorithms on the Hadoop platform.
• Makes building intelligent applications easier and faster.
• Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
  52. Mahout Use Cases
• Yahoo: spam detection
• Foursquare: recommendations
• SpeedDate.com: recommendations
• Adobe: user targeting
• Amazon: personalization platform
  53. Use Case Example
Predict what the user likes based on:
• His/her historical behavior
• The aggregate behavior of people similar to him
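A minimal sketch of this idea using Mahout's Taste collaborative-filtering API: it assumes a hypothetical ratings.csv of userID,itemID,rating lines, finds the users most similar to a given user, and recommends items from their aggregate behavior.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Historical behavior: userID,itemID,rating rows (assumed file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // "People similar to him": the 10 nearest users by rating similarity.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 items for user 1, predicted from the neighborhood's aggregate behavior.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}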
  54. Recap – Hadoop Architecture
(The architecture stack diagram from slide 21.)
  55. THANKS