Introduction to the Hadoop Ecosystem (SEACON Edition)
 

Talk held at the SEACON 2013 on 17.05.2013 in Hamburg


    Presentation Transcript

    • Introduction to the Hadoop ecosystem
    • About me
    • About us
    • Why Hadoop?
    • Why Hadoop?
    • Why Hadoop?
    • Why Hadoop?
    • Why Hadoop?
    • Why Hadoop?
    • Why Hadoop?
    • How to scale data? (diagram: writes w1, w2, w3 and reads r1, r2, r3 spread across nodes)
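    The usual answer to the scaling question on this slide is to partition (shard) data across several nodes. A minimal sketch of hash-based routing, assuming the w1–w3 boxes in the diagram are write nodes (the class and node names are illustrative, not from the talk):

    ```java
    public class ShardRouter {
        private final String[] nodes;

        public ShardRouter(String... nodes) {
            this.nodes = nodes;
        }

        // Route a record key to one of the nodes by hashing it;
        // the same key always lands on the same node.
        public String nodeFor(String key) {
            int bucket = Math.floorMod(key.hashCode(), nodes.length);
            return nodes[bucket];
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter("w1", "w2", "w3");
            System.out.println(router.nodeFor("user42"));
        }
    }
    ```

    Reads for a key can be routed the same way (the r1–r3 boxes); rebalancing when nodes are added or fail is what schemes such as consistent hashing address, and is part of what makes manual sharding painful — which is where the "But…" slides below pick up.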
    • But…
    • But…
    • What is Hadoop?
    • What is Hadoop?
    • What is Hadoop?
    • What is Hadoop?
    • The Hadoop App Store (logo grid of ecosystem projects and vendors: HDFS, MapReduce, HCatalog, Pig, Hive, HBase, Ambari, Avro, Cassandra, Chukwa, Intel, Syncsort, Flume, SAP HANA, Hypertable, Impala, Mahout, Nutch, Oozie, Sqoop, Scribe, Tez, Vertica, Whirr, ZooKeeper, Hortonworks, Cloudera, MapR, EMC, IBM, Talend, Teradata, Pivotal, Informatica, Microsoft, Pentaho, Jaspersoft, Kognitio, Tableau, Splunk, Platfora, Rackspace, Karmasphere, Actuate, MicroStrategy)
    • Data Storage
    • Data Storage
    • Hadoop Distributed File System
    • Hadoop Distributed File System
    • HDFS Architecture
    • Data Processing
    • Data Processing
    • MapReduce
    • Typical large-data problem
    • MapReduce Flow (diagram: input splits (k1,v1)…(k6,v6) are mapped to pairs such as a:1 b:2 c:9 and a:3 c:2 b:7 c:8, shuffled by key into a:[1,3] b:[2,7] c:[2,8,9], and reduced to a:4 b:9 c:19)
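    The flow on this slide can be simulated in plain Java as a single-process sketch — the three phases only, not the Hadoop API — assuming whitespace-separated words:

    ```java
    import java.util.*;

    public class WordCountFlow {
        // Map phase: emit (word, 1) for every token in every input line.
        static List<Map.Entry<String, Integer>> map(List<String> lines) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String line : lines)
                for (String word : line.split("\\s+"))
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            return pairs;
        }

        // Shuffle phase: group all emitted values under their key.
        static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> p : pairs)
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            return grouped;
        }

        // Reduce phase: sum the values collected for each key.
        static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
            Map<String, Integer> counts = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
                counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
            return counts;
        }

        public static void main(String[] args) {
            List<String> input = Arrays.asList("a b c c", "c a b c");
            System.out.println(reduce(shuffle(map(input))));  // prints {a=2, b=2, c=4}
        }
    }
    ```

    In real Hadoop the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network, but the data flow is exactly this — which is what the next two slides implement with the mapred API.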
    • Combined Hadoop Architecture
    • Word Count Mapper in Java

      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
    • Word Count Reducer in Java

      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    • Scripting for Hadoop
    • Scripting for Hadoop
    • Apache Pig
    • Pig in the Hadoop ecosystem (layer diagram: Scripting on top of Metadata Management, the Distributed Programming Framework, and the Hadoop Distributed File System)
    • Pig Latin

      users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
      pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
      filteredUsers = FILTER users BY age >= 18 AND age <= 50;
      joinResult = JOIN filteredUsers BY name, pages BY user;
      grouped = GROUP joinResult BY url;
      summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
      sorted = ORDER summed BY clicks DESC;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10sites';
    • Pig Execution Plan
    • Try that with Java…
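    The point of this slide is that the nine-line Pig script grows considerably in Java. Even a condensed, in-memory sketch of the same pipeline — filter users by age, join on name, count clicks per URL, keep the top 10; class and field names follow the Pig example, and this is not MapReduce code — is already longer:

    ```java
    import java.util.*;
    import java.util.stream.*;

    public class Top10Sites {
        static class User {
            final String name; final int age;
            User(String name, int age) { this.name = name; this.age = age; }
        }
        static class PageView {
            final String user; final String url;
            PageView(String user, String url) { this.user = user; this.url = url; }
        }

        static List<Map.Entry<String, Long>> top10(List<User> users, List<PageView> pages) {
            // FILTER users BY age >= 18 AND age <= 50
            Set<String> eligible = users.stream()
                    .filter(u -> u.age >= 18 && u.age <= 50)
                    .map(u -> u.name)
                    .collect(Collectors.toSet());
            // JOIN ... BY name, GROUP BY url, COUNT clicks
            Map<String, Long> clicks = pages.stream()
                    .filter(p -> eligible.contains(p.user))
                    .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));
            // ORDER BY clicks DESC, LIMIT 10
            return clicks.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(10)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<User> users = Arrays.asList(new User("alice", 25), new User("bob", 70));
            List<PageView> pages = Arrays.asList(
                    new PageView("alice", "a.com"), new PageView("alice", "a.com"),
                    new PageView("bob", "b.com"), new PageView("alice", "c.com"));
            System.out.println(top10(users, pages));  // bob's clicks are filtered out
        }
    }
    ```

    And the slide's real point is stronger still: a faithful distributed version would need hand-written MapReduce jobs for the join and the grouping, which Pig compiles into an execution plan automatically.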
    • SQL for Hadoop
    • SQL for Hadoop
    • Apache Hive
    • Hive in the Hadoop ecosystem (layer diagram: Scripting and Query on top of Metadata Management, the Distributed Programming Framework, and the Hadoop Distributed File System)
    • Hive Architecture
    • Hive Example

      CREATE TABLE users (name STRING, age INT);
      CREATE TABLE pages (user STRING, url STRING);
      LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
      LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

      SELECT pages.url, count(*) AS clicks
      FROM users JOIN pages ON (users.name = pages.user)
      WHERE users.age >= 18 AND users.age <= 50
      GROUP BY pages.url
      SORT BY clicks DESC
      LIMIT 10;
    • Bringing it all together…
    • Online AdServing
    • AdServing Architecture
    • Getting started…
    • Hortonworks Sandbox
    • Hadoop Training