Published 2012-12-15: Hadoop Family and Ecosystem Introduction, by TCloud Computing
Transcript of "Hadoop Family and Ecosystem"

1. Hadoop Product Family and Ecosystem (PC Liao)
2. Agenda
• What is Big Data?
• What is the problem?
• Hadoop
  – Introduction to Hadoop
  – Hadoop components
  – What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
3. What is Big Data? A set of files. A database. A single file.
4. The Data-Driven World
• Modern systems have to deal with far more data than was the case in the past
  – Organizations are generating huge amounts of data
  – That data has inherent value, and cannot be discarded
• Examples: Yahoo – over 170 PB of data; Facebook – over 30 PB; eBay – over 5 PB
• Many organizations are generating data at a rate of terabytes per day
5. What is the problem?
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the computing power of a single machine – faster processors, more RAM
• Distributed systems evolved to allow developers to use multiple machines for a single job
  – At compute time, data is copied to the compute nodes
6. What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation:
  – Typical disk data transfer rate: 75 MB/sec
  – Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
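The back-of-the-envelope figure on slide 6 is easy to verify (assuming the slide's 75 MB/sec transfer rate):

```python
# Check the transfer-time claim: 100 GB read at a typical disk rate of 75 MB/sec.
disk_rate_mb_per_s = 75
data_mb = 100 * 1024                # 100 GB expressed in MB
transfer_minutes = data_mb / disk_rate_mb_per_s / 60
print(round(transfer_minutes, 1))   # about 22.8 minutes, i.e. "approx. 22 minutes"
```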
7. What is the problem?
• Failure of a component may cost a lot
• What do we need when a job fails?
  – Failure may result in a graceful degradation of application performance, but the entire system should not completely fail
  – Failure should not result in the loss of any data
  – Failure should not affect the outcome of the job
8. Big Data Solutions by Industries – the most common problems Hadoop can solve
9. Threat Analysis / Trade Surveillance
• Challenge: detecting threats in the form of fraudulent activity or attacks
  – Large data volumes involved
  – Like looking for a needle in a haystack
• Solution with Hadoop:
  – Parallel processing over huge datasets
  – Pattern recognition to identify anomalies, i.e. threats
• Typical industries: security, financial services
10. Big Data Use Case: Smart Protection Network
• Challenge: information accessibility and transparency problems for threat researchers due to the size and source of data (volume, variety and velocity)
• Size of data – overall:
  – Data sources: 20+; data fields: 1000+
  – Daily new records: 23 billion+; daily new data size: 4 TB+
• SPN Smart Feedback:
  – Feedback components: 26; data fields: 300+
  – Daily new file count: 6 million+; daily new records: 90 million+; daily new data size: 261 GB+
11. Index="vsapi" zbot
12. Recommendation Engine
• Challenge: using user data to predict which products to recommend
• Solution with Hadoop:
  – Batch processing framework: allows execution in parallel over large datasets
  – Collaborative filtering: collecting 'taste' information from many users and using it to predict what similar users like
• Typical industries: ISP, advertising
13. Walmart case: diapers + beer + Friday = revenue?
14. Hadoop!
15. Hadoop – inspired by Google
• The Apache Hadoop project was inspired by Google's MapReduce and Google File System papers
• An open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware = IT cost reduction
16. Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications
17. Hadoop Components
• Hadoop consists of two core components
  – The Hadoop Distributed File System (HDFS)
  – The MapReduce software framework
• There are many other projects based around core Hadoop
  – Often referred to as the 'Hadoop Ecosystem'
  – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  [Ecosystem stack diagram – see slide 28]
18. Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Two roles in HDFS
  – NameNode: records metadata
  – DataNode: stores data
19. How Files Are Stored: Example
• The NameNode holds metadata for the data files
• The DataNodes hold the actual blocks
• Each block is replicated three times on the cluster
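The storage scheme on slide 19 can be sketched in a few lines (hypothetical node names and a fixed 64 MB block size; real HDFS placement is rack-aware and far more involved):

```python
import itertools

def place_blocks(file_size_mb, block_size_mb=64,
                 datanodes=("dn1", "dn2", "dn3", "dn4"), replication=3):
    """Toy model: split a file into fixed-size blocks and assign each
    block to `replication` distinct DataNodes, round-robin style."""
    n_blocks = -(-file_size_mb // block_size_mb)   # ceiling division
    nodes = itertools.cycle(datanodes)
    return {b: [next(nodes) for _ in range(replication)]
            for b in range(n_blocks)}

placement = place_blocks(200)   # a 200 MB file -> 4 blocks, 3 replicas each
```

For a 200 MB file this yields four blocks, each stored on three distinct DataNodes, mirroring the three-way replication described on the slide.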
20. HDFS: Points to Note
• When a client application wants to read a file:
  – It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  – It then communicates directly with the DataNodes to read the data
21. Hadoop Components: MapReduce
• MapReduce is a method for distributing a task across multiple nodes
• It works like a Unix pipeline:
  – cat input | grep | sort | uniq -c | cat > output
  – Input | Map | Shuffle & Sort | Reduce | Output
22. Features of MapReduce
• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the 'housekeeping' away from the developer
  – The developer can concentrate simply on writing the Map and Reduce functions
23. Example: word count
• Word count is challenging over massive amounts of data
  – Using a single compute node would be too time-consuming
  – The number of unique words can easily exceed available RAM
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• More nodes, faster results
24. Word Count Example
• Map input – key: offset, value: line
  0: The cat sat on the mat
  22: The aardvark sat on the sofa
• Map output / shuffle – key: word, value: count
• Reduce output – key: word, value: sum of counts
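The flow on slide 24 can be simulated in plain Python (a conceptual sketch, not Hadoop code): map emits a (word, 1) pair per token, shuffle & sort groups the pairs by key, and reduce sums each group.

```python
from itertools import groupby

def map_phase(lines):
    # Map: for each input line, emit a (word, 1) pair per token
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_and_sort(pairs):
    # Shuffle & sort: bring all pairs with the same key together
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(count for _, count in vals) for word, vals in groups}

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
# counts["sat"] == 2, counts["the"] == 2, counts["aardvark"] == 1
```

In a real cluster each phase runs on many nodes in parallel; here the three functions simply run in sequence on the slide's sample input.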
25. The Hadoop Ecosystem
26. Growing Hadoop Ecosystem
• The term 'Hadoop' is usually taken to mean the combination of HDFS and MapReduce
• There are numerous other projects surrounding Hadoop, typically referred to as the 'Hadoop Ecosystem'
  – Zookeeper
  – Hive and Pig
  – HBase
  – Flume
  – Other ecosystem projects: Sqoop, Oozie, Hue, Mahout
27. The Ecosystem Is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
28. Relation Map
  – Hue (Web Console), Mahout (Data Mining)
  – Oozie (Job Workflow & Scheduling)
  – Zookeeper (Coordination)
  – Sqoop/Flume (Data Integration), Pig/Hive (Analytical Language)
  – MapReduce Runtime (Distributed Programming Framework)
  – HBase (Column NoSQL DB)
  – Hadoop Distributed File System (HDFS)
29. Zookeeper – Coordination Framework
30. What Is ZooKeeper?
• A centralized service for maintaining
  – Configuration information
  – Distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
31. Why Use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
32. ZooKeeper Architecture
  – All servers store a copy of the data (in memory)
  – A leader is elected at startup
  – Two roles: leader and follower
    • Followers service clients; all updates go through the leader
    • Update responses are sent when a majority of servers have persisted the change
  – HA support
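The majority-write rule on slide 32 can be stated in one line (a sketch of the idea only; ZooKeeper's actual atomic broadcast protocol, Zab, additionally handles ordering and leader failure):

```python
def update_committed(acks, ensemble_size):
    """An update is acknowledged to the client only once a majority
    of the ensemble has persisted the change."""
    return acks > ensemble_size // 2

# In a 5-server ensemble, 3 acks commit an update; 2 acks do not.
```

This majority rule is also why ensembles are usually run with an odd number of servers: a 5-server ensemble tolerates two failures, a 6-server ensemble still only two.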
33. HBase – Column NoSQL DB
34. Structured Data vs. Raw Data
35. HBase – inspired by Google
• Apache open-source project
• Inspired by Google's BigTable
• A non-relational, distributed database written in Java
• Coordinated by Zookeeper
36. Row & Column Oriented
37. HBase – Data Model
• Cells are "versioned"
• Table rows are sorted by row key
• Region: a row range [start-key : end-key]
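The three points of the data model on slide 37 can be captured in a toy class (purely illustrative; the names below are hypothetical, and this is not the HBase client API):

```python
from collections import defaultdict

class ToyHTable:
    """Sketch of the HBase data model: versioned cells, rows kept
    sorted by row key, and a 'region' as a row range [start:end)."""
    def __init__(self):
        self._rows = defaultdict(dict)   # row -> {column -> [(ts, value), ...]}

    def put(self, row, column, value, ts):
        # Each write adds a new timestamped version; old versions remain
        self._rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column):
        # Reads return the newest version of the cell
        return max(self._rows[row][column])[1]

    def region_scan(self, start_key, end_key):
        # A region is exactly such a sorted row range
        return [r for r in sorted(self._rows) if start_key <= r < end_key]

t = ToyHTable()
t.put("row-b", "cf:name", "old", ts=1)
t.put("row-b", "cf:name", "new", ts=2)
t.put("row-a", "cf:name", "x", ts=1)
# t.get("row-b", "cf:name") returns "new"; the ts=1 version is retained
```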
38. HBase Architecture
• Master server (HMaster)
  – Assigns regions to RegionServers
  – Monitors the health of RegionServers
• RegionServers
  – Contain regions and handle client read/write requests
39. HBase – Workflow
40. When to Use HBase
• You need random, low-latency access to the data
• The application has a variable schema, where each row is slightly different
• You need to add columns over time
• Most columns are NULL in each row
41. Flume / Sqoop – Data Integration Framework
42. What's the Problem with Data Collection?
• Data collection is currently a priori and ad hoc
• A priori: you decide what you want to collect ahead of time
• Ad hoc: each kind of data source goes through its own collection path
43. What Is Flume (and how can it help)?
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for data collection of all formats
44. Flume: High-Level Overview
• Logical node
• Source
• Sink
45. Flume Architecture – basic diagram
  – One master controls multiple nodes
46. Flume Architecture
  – Multiple masters control multiple nodes
47. An example flow
48. Flume / Sqoop – Data Integration Framework
49. Sqoop
• Easy, parallel database import/export
• What do you want to do?
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into an RDBMS
50. What Is Sqoop?
• A suite of tools that connect Hadoop and database systems
• Import tables from databases into HDFS for deep analysis
• Export MapReduce results back to a database for presentation to end users
• Provides the ability to import from SQL databases straight into your Hive data warehouse
51. How Sqoop Helps
• The problem
  – Structured data in traditional databases cannot be easily combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
  – Easy import of data from many databases to HDFS
  – Generates code for use in MapReduce applications
52. Sqoop – import process
53. Sqoop – export process
• Exports are performed in parallel using MapReduce
54. Why Sqoop?
• JDBC-based implementation
  – Works with many popular database vendors
• Auto-generation of tedious user-side code
  – Write MapReduce applications that work with your data, faster
• Integration with Hive
  – Allows you to stay in a SQL-based environment
55. Sqoop – Jobs
• Job management options, e.g.:
  sqoop job --create myjob -- import --connect xxxxxxx --table mytable
56. Pig / Hive – Analytical Language
57. Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
58. Hive – developed by Facebook
• What is Hive? An SQL-like interface to Hadoop
• A data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
  – MapReduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic SQL: SELECT, FROM, JOIN, GROUP BY
  – Equi-joins, multi-table insert, multi-group-by
  – Batch queries, e.g.:
    SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
59. Pig – initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Makes MapReduce programs simple to write
• Easy to understand and debug
  A = LOAD 'a.txt' AS (id, name, age, ...);
  B = LOAD 'b.txt' AS (id, address, ...);
  C = JOIN A BY id, B BY id;
  STORE C INTO 'c.txt';
60. Hive vs. Pig
                         Hive                            Pig
  Language               HiveQL (SQL-like)               Pig Latin, a scripting language
  Schema                 Table definitions stored        A schema is optionally
                         in a metastore                  defined at runtime
  Programmatic access    JDBC, ODBC                      PigServer
61. WordCount Example
• Input:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• The reduce just sums up the values:
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
62. WordCount Example in MapReduce

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }
  }

63. WordCount Example in Pig

  A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
  B = GROUP A BY token;
  C = FOREACH B GENERATE group, COUNT(A) AS count;
  DUMP C;

64. WordCount Example in Hive

  CREATE TABLE wordcount (token STRING);
  LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
  SELECT token, count(1) FROM wordcount GROUP BY token;
65. Oozie – Job Workflow & Scheduling
66. What Is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• Like cron for Hadoop: Job 1 -> Job 2 -> Job 3 -> Job 4 -> Job 5
67. Why Oozie?
• Why use Oozie instead of just cascading jobs one after another?
• Major flexibility
  – Start, stop, suspend, and re-run jobs
• Oozie allows you to restart from a failure
  – You can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes
68. High-Level Architecture
• Web service API
• A database stores:
  – Workflow definitions
  – Currently running workflow instances, including instance states and variables
  [Diagram: Oozie WS API -> Tomcat web-app -> DB; Hadoop/Pig/HDFS]
69. How a Workflow Is Triggered
• Time
  – Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00, ...)
• Time and data
  – Materialize your workflow every hour, but only run it when the input data is ready ("input data exists?" checks at 01:00, 02:00, 03:00, 04:00)
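The time trigger on slide 69 amounts to enumerating materialization instants (a sketch with a hypothetical helper; the data trigger would additionally gate each materialized run on its input existing):

```python
from datetime import datetime, timedelta

def materialization_times(start, end, minutes=15):
    """Enumerate the instants at which a coordinator with the given
    frequency would materialize a workflow run (inclusive of `end`)."""
    times, t = [], start
    while t <= end:
        times.append(t)
        t += timedelta(minutes=minutes)
    return times

runs = materialization_times(datetime(2012, 12, 15, 0, 15),
                             datetime(2012, 12, 15, 1, 0))
# four runs, as on the slide: 00:15, 00:30, 00:45, 01:00
```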
70. Example Workflow
71. Oozie Use Criteria
• You need to launch, control, and monitor jobs from your Java apps
  – Java client API / command-line interface
• You need to control jobs from anywhere
  – Web service API
• You have jobs that need to run every hour, day, or week
• You need to receive notification when a job is done
  – E-mail when a job is complete
72. Hue – Web Console
73. Hue – developed by Cloudera
• Hadoop User Experience
• Apache-licensed open-source project
• Hue is a web UI for Hadoop
• A platform for building custom applications with a nice UI library
74. Hue
• Hue comes with a suite of applications
  – File Browser: browse HDFS; change permissions and ownership; upload, download, view and edit files
  – Job Browser: view jobs, tasks, counters, logs, etc.
  – Beeswax: wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format
75. Hue: File Browser UI
76. Hue: Beeswax UI
77. Mahout – Data Mining
78. What Is Mahout?
• A machine-learning tool
• Distributed and scalable machine-learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
79. Why Mahout?
• The current state of ML libraries:
  – They lack community
  – They lack documentation and examples
  – They lack scalability
  – They are research-oriented
80. Mahout – Scale
• Scales to large datasets
  – Hadoop MapReduce implementations that scale linearly with data
• Scalable to support your business case
  – Mahout is distributed under a commercially friendly Apache Software license
• A scalable community: vibrant, responsive and diverse
81. Mahout – Four Use Cases
• Mahout machine-learning algorithms
  – Recommendation mining: takes users' behavior and finds items a specified user might like
  – Clustering: takes e.g. text documents and groups them based on related document topics
  – Classification: learns from existing categorized documents what documents of a specific category look like, and can assign unlabeled documents to the appropriate category
  – Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping-cart contents) and identifies which individual items typically appear together
82. Use Case Example
• Predict what the user likes based on
  – His/her historical behavior
  – The aggregate behavior of people similar to him/her
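The idea on slide 82 can be sketched as a tiny user-based collaborative filter (toy data and a hypothetical `recommend` helper; Mahout's distributed implementations are far more sophisticated):

```python
def recommend(user, likes):
    """Score items the target user has not seen by how strongly the
    users who liked them overlap with the target user's own likes."""
    mine = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        overlap = len(mine & items)          # similarity to the other user
        for item in items - mine:            # only unseen items are candidates
            scores[item] = scores.get(item, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)

likes = {"alice": {"diapers", "beer"},
         "bob":   {"diapers", "beer", "chips"},
         "carol": {"wine"}}
# recommend("alice", likes) ranks "chips" above "wine": bob shares two
# of alice's likes, while carol shares none.
```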
83. Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem
84. Recap – Hadoop Ecosystem
  – Hue (Web Console), Mahout (Data Mining)
  – Oozie (Job Workflow & Scheduling)
  – Zookeeper (Coordination)
  – Sqoop/Flume (Data Integration), Pig/Hive (Analytical Language)
  – MapReduce Runtime (Distributed Programming Framework)
  – HBase (Column NoSQL DB)
  – Hadoop Distributed File System (HDFS)
85. Questions?
86. Thank you!
