Introduction to the Hadoop ecosystem by Uwe Seiler

Apache Hadoop is one of the most popular solutions for today’s Big Data challenges. Hadoop offers a reliable and scalable platform for fail-safe storage of large amounts of data, as well as the tools to process that data. This presentation gives an overview of the Hadoop architecture and explains the options for integrating it with existing enterprise systems. Finally, it introduces the main tools for processing data: the scripting language layer Pig, the SQL-like query layer Hive, and the column-based NoSQL layer HBase.

Published in: Technology, News & Politics


  1. Introduction to the Hadoop ecosystem
  2. About me
  3. About us
  4. Why Hadoop?
  5. Why Hadoop?
  6. Why Hadoop?
  7. Why Hadoop?
  8. Why Hadoop?
  9. Why Hadoop?
  10. Why Hadoop?
  11. How to scale data?
  12. But…
  13. But…
  14. What is Hadoop?
  15. What is Hadoop?
  16. What is Hadoop?
  17. What is Hadoop?
  18. The Hadoop App Store: HDFS, MapReduce, HCatalog, Pig, Hive, HBase, Ambari, Avro, Cassandra, Chukwa, Flume, Impala, Mahout, Nutch, Oozie, Sqoop, Scribe, Tez, Vertica, Whirr, ZooKeeper; vendors including Cloudera, Hortonworks, MapR, EMC, IBM, Intel, Talend, Teradata, Pivotal, Informatica, Microsoft, SAP HANA, Pentaho, Jaspersoft, Kognitio, Tableau, Splunk, Platfora, Rackspace, Karmasphere, Actuate, MicroStrategy
  19. Data Storage
  20. Data Storage
  21. Hadoop Distributed File System
  22. Hadoop Distributed File System
  23. HDFS Architecture
  24. Data Processing
  25. Data Processing
  26. MapReduce
  27. Typical large-data problem
  28. MapReduce Flow: map emits (key, value) pairs, e.g. (a,1) (b,2) (c,9) from one mapper and (a,3) (c,2) (b,7) (c,8) from another; the shuffle groups them by key into a → [1, 3], b → [2, 7], c → [2, 8, 9]; reduce sums each group, yielding a → 4, b → 9, c → 19
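The shuffle-and-reduce arithmetic on slide 28 can be checked with a small plain-Java simulation; class and method names here are illustrative, not part of Hadoop's API:

```java
import java.util.*;

// Local simulation of the map -> shuffle -> reduce flow from slide 28,
// using the slide's example pairs.
public class MapReduceFlowDemo {

    // "Shuffle": group values by key, with keys sorted as Hadoop would sort them.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce": sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        // Combined map output of the two mappers from the slide:
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("c", 9),
                Map.entry("a", 3), Map.entry("c", 2), Map.entry("b", 7), Map.entry("c", 8));
        System.out.println(reduce(shuffle(mapOutput)));
        // {a=4, b=9, c=19}
    }
}
```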
  29. Jobs & Tasks
  30. Combined Hadoop Architecture
  31. Word Count Mapper in Java

      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
  32. Word Count Reducer in Java

      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
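Running the two classes on slides 31 and 32 requires a driver and a Hadoop cluster; as a minimal sketch, the same word-count pipeline can be reproduced locally in plain Java without any Hadoop types (class and method names here are illustrative):

```java
import java.util.*;

// Local, in-memory word count mirroring the Mapper/Reducer pair:
// tokenize each line into (word, 1) pairs, then group by word and sum.
public class LocalWordCount {

    static Map<String, Integer> wordCount(List<String> lines) {
        // "Map" phase: emit (word, 1) for every token, as WordCountMapper
        // does with StringTokenizer.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(Map.entry(tokenizer.nextToken(), 1));
            }
        }
        // "Shuffle" + "Reduce" phase: group by word and sum the ones,
        // as WordCountReducer does for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```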
  33. Scripting for Hadoop
  34. Scripting for Hadoop
  35. Apache Pig
  36. Pig in the Hadoop ecosystem (layers: Hadoop Distributed File System, Distributed Programming Framework, Metadata Management, Scripting)
  37. Pig Latin

      users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
      pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
      filteredUsers = FILTER users BY age >= 18 AND age <= 50;
      joinResult = JOIN filteredUsers BY name, pages BY user;
      grouped = GROUP joinResult BY url;
      summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
      sorted = ORDER summed BY clicks DESC;
      top10 = LIMIT sorted 10;
      STORE top10 INTO 'top10sites';
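Slide 39 asks "Try that with Java…" — a rough plain-Java analogue of the Pig script above, using the Streams API; the record types and field names here are assumptions, not from the deck:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java equivalent of the Pig top-10-sites script: filter users by
// age, join page views against them, count clicks per URL, take the top 10.
public class TopSites {

    record User(String name, int age) {}
    record PageView(String user, String url) {}

    static List<Map.Entry<String, Long>> top10Sites(List<User> users, List<PageView> pages) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filtered = users.stream()
                .filter(u -> u.age() >= 18 && u.age() <= 50)
                .map(User::name)
                .collect(Collectors.toSet());
        // JOIN by name/user, GROUP BY url, COUNT, ORDER BY clicks DESC, LIMIT 10
        return pages.stream()
                .filter(p -> filtered.contains(p.user()))
                .collect(Collectors.groupingBy(PageView::url, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .collect(Collectors.toList());
    }
}
```

Even this compact Streams version hides the join and grouping inside library calls; hand-written MapReduce jobs for the same query would be far longer, which is the point of slide 39.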
  38. Pig Execution Plan
  39. Try that with Java…
  40. SQL for Hadoop
  41. SQL for Hadoop
  42. Apache Hive
  43. Hive in the Hadoop ecosystem (layers: Hadoop Distributed File System, Distributed Programming Framework, Metadata Management, Scripting, Query)
  44. Hive Architecture
  45. Hive Example

      CREATE TABLE users(name STRING, age INT);
      CREATE TABLE pages(user STRING, url STRING);
      LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
      LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

      SELECT pages.url, count(*) AS clicks
      FROM users JOIN pages ON (users.name = pages.user)
      WHERE users.age >= 18 AND users.age <= 50
      GROUP BY pages.url
      SORT BY clicks DESC
      LIMIT 10;
  46. Bringing it all together…
  47. Online Advertising
  48. Getting started…
  49. Hortonworks Sandbox
  50. Hadoop Training
  51. The end…or the beginning?
