Introduction to the Hadoop Ecosystem (SEACON Edition)

Talk held at SEACON 2013 on 17 May 2013 in Hamburg.


Transcript

  • 1. Introduction to the Hadoop ecosystem
  • 2. About me
  • 3. About us
  • 4. Why Hadoop?
  • 5. Why Hadoop?
  • 6. Why Hadoop?
  • 7. Why Hadoop?
  • 8. Why Hadoop?
  • 9. Why Hadoop?
  • 10. Why Hadoop?
  • 11. How to scale data? (diagram: writes w1, w2, w3; reads r1, r2, r3)
  • 12. But…
  • 13. But…
  • 14. What is Hadoop?
  • 15. What is Hadoop?
  • 16. What is Hadoop?
  • 17. What is Hadoop?
  • 18. The Hadoop App Store: HDFS, MapReduce, HCatalog, Pig, Hive, HBase, Ambari, Avro, Cassandra, Chukwa, Intel, Syncsort, Flume, Hana, Hypertable, Impala, Mahout, Nutch, Oozie, Sqoop, Scribe, Tez, Vertica, Whirr, ZooKeeper, Hortonworks, Cloudera, MapR, EMC, IBM, Talend, Teradata, Pivotal, Informatica, Microsoft, Pentaho, Jaspersoft, Kognitio, Tableau, Splunk, Platfora, Rackspace, Karmasphere, Actuate, MicroStrategy
  • 19. Data Storage
  • 20. Data Storage
  • 21. Hadoop Distributed File System
  • 22. Hadoop Distributed File System
  • 23. HDFS Architecture
  • 24. Data Processing
  • 25. Data Processing
  • 26. MapReduce
  • 27. Typical large-data problem
  • 28. MapReduce Flow: input records (k1, v1) … (k6, v6); map emits a:1, b:2, c:9, a:3, c:2, b:7, c:8; shuffle groups a:[1, 3], b:[2, 7], c:[2, 8, 9]; reduce sums to a:4, b:9, c:19
  • 29. Combined Hadoop Architecture
  • 30. Word Count Mapper in Java

    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }
  • 31. Word Count Reducer in Java

    public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  • 32. Scripting for Hadoop
  • 33. Scripting for Hadoop
  • 34. Apache Pig
  • 35. Pig in the Hadoop ecosystem (stack diagram: Scripting / Metadata Management / Distributed Programming Framework / Hadoop Distributed File System)
  • 36. Pig Latin

    users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
    pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
    filteredUsers = FILTER users BY age >= 18 AND age <= 50;
    joinResult = JOIN filteredUsers BY name, pages BY user;
    grouped = GROUP joinResult BY url;
    summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
    sorted = ORDER summed BY clicks DESC;
    top10 = LIMIT sorted 10;
    STORE top10 INTO 'top10sites';
  • 37. Pig Execution Plan
  • 38. Try that with Java…
  • 39. SQL for Hadoop
  • 40. SQL for Hadoop
  • 41. Apache Hive
  • 42. Hive in the Hadoop ecosystem (stack diagram: Scripting and Query / Metadata Management / Distributed Programming Framework / Hadoop Distributed File System)
  • 43. Hive Architecture
  • 44. Hive Example

    CREATE TABLE users (name STRING, age INT);
    CREATE TABLE pages (user STRING, url STRING);
    LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
    LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
    SELECT pages.url, count(*) AS clicks
    FROM users JOIN pages ON (users.name = pages.user)
    WHERE users.age >= 18 AND users.age <= 50
    GROUP BY pages.url
    SORT BY clicks DESC
    LIMIT 10;
  • 45. Bringing it all together…
  • 46. Online AdServing
  • 47. AdServing Architecture
  • 48. Getting started…
  • 49. Hortonworks Sandbox
  • 50. Hadoop Training
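
The map → shuffle → reduce flow from slides 28–31 can be sketched as a stand-alone Java program, without a Hadoop cluster or the Hadoop API. This is only an illustration of the data flow, assuming made-up sample input; the class and method names are hypothetical, not from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Minimal sketch of the word-count data flow: map -> shuffle/sort -> reduce.
public class WordCountFlow {

    // "Map" phase: emit a (word, 1) pair for every token, like WordCountMapper.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                pairs.add(Map.entry(tok.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "Shuffle" phase: group values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the grouped values per key, like WordCountReducer.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, values) ->
            counts.put(word, values.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a b a", "c b c");  // sample input split into two "records"
        Map<String, Integer> counts = reduce(shuffle(map(input)));
        System.out.println(counts);  // {a=2, b=2, c=2}
    }
}
```

In the real framework, the shuffle is performed by Hadoop between the map and reduce tasks; here it is just a grouping step, which is why the mapper and reducer on slides 30–31 never see each other directly.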