Hadoop in three use cases

  1. 2 December 2011 • Hadoop in Three Use Cases • Joey Echeverria | Solutions Architect • joey@cloudera.com | @fwiffo
  2. About Joey • Solutions Architect • 6 months • 3+ years • Local • ©2011 Cloudera, Inc. All Rights Reserved.
  3. Cloudera’s Distribution including Apache Hadoop • File System Mount: FUSE-DFS • UI Framework: HUE • SDK: HUE SDK • Workflow: Apache Oozie* • Scheduling: Apache Oozie* • Metadata: Apache Hive • Languages / Compilers: Apache Pig, Apache Hive • Data Integration: Apache Flume*, Apache Sqoop* • Fast Read/Write Access: Apache HBase • Coordination: Apache ZooKeeper • *Currently under incubation in the Apache Software Foundation
  4. Extract, Transform, and Load
  5. ETL before Hadoop • Difficult to maintain, not scalable • Diagram labels: Relational Databases, Logs, Files, Custom ETL Scripts, Enterprise Data Warehouse
  6. ETL before Hadoop • May be scalable, expensive • Diagram labels: Relational Databases, Logs, Files, Enterprise Data Warehouse (SQL: raw table → warehouse tables)
  7. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Enterprise Data Warehouse
  8. Steps • 1. In • 2. Process • 3. Out
  9. Flume
  10. Flume
  11. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Flume, Enterprise Data Warehouse
  12. HDFS
  13. HDFS • Diagram: the Client calls open("file.txt") on the NameNode, which replies with DataNodes 02, 06, 10; the Client then reads data from those DataNodes (the cluster shows DataNodes 01 to 12)
  14. HDFS • Distributed • Replication • Bulk I/O • Fault tolerant • Scalable • Append only • Not POSIX
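The read path on the HDFS slides (ask the NameNode where a file's blocks live, then fetch the data from the DataNodes) can be sketched as a toy in-memory model. This is only an illustration under loud assumptions: `ToyCluster`, the 4-byte block size, and the round-robin replica placement are all made up for the example and are not how real HDFS chooses nodes.

```python
BLOCK_SIZE = 4      # bytes; tiny on purpose (assumption, real blocks are far larger)
REPLICATION = 3     # copies kept of every block (assumption for this sketch)


class ToyCluster:
    """Hypothetical stand-in for a NameNode plus a set of DataNodes."""

    def __init__(self, datanode_ids):
        self.ids = sorted(datanode_ids)
        self.storage = {dn: {} for dn in self.ids}  # DataNode id -> {block id: bytes}
        self.metadata = {}                          # NameNode: path -> [(block id, holders)]
        self._next = 0

    def write(self, path, data):
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{offset // BLOCK_SIZE}"
            # Round-robin placement on REPLICATION distinct nodes (assumption).
            holders = [self.ids[(self._next + i) % len(self.ids)]
                       for i in range(REPLICATION)]
            self._next += 1
            for dn in holders:
                self.storage[dn][block_id] = data[offset:offset + BLOCK_SIZE]
            blocks.append((block_id, holders))
        self.metadata[path] = blocks

    def open(self, path):
        """NameNode step: return block locations only, never the data itself."""
        return self.metadata[path]

    def read(self, path, dead=frozenset()):
        """Client step: fetch each block from the first replica that is still alive."""
        out = b""
        for block_id, holders in self.open(path):
            live = [dn for dn in holders if dn not in dead]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            out += self.storage[live[0]][block_id]
        return out


cluster = ToyCluster([f"{n:02d}" for n in range(1, 13)])  # DataNodes 01 to 12
cluster.write("file.txt", b"hello hadoop!")
```

Because every block has three holders, `cluster.read("file.txt", dead={"01", "02"})` still returns the full file in this model, which is the fault-tolerance point of the replication bullet.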
  15. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Flume, HDFS, Enterprise Data Warehouse
  16. FUSE-DFS
  17. FUSE-DFS • FUSE – User space – File systems • FUSE-DFS – /hdfs – Mostly transparent
  18. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Flume, FUSE-DFS, HDFS, Enterprise Data Warehouse
  19. Sqoop
  20. Sqoop • SQL to Hadoop • Parallel import • File formats
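One common way to picture the "parallel import" bullet is that the range of a key column is divided into contiguous slices, and each slice is imported by its own task. The sketch below assumes an integer key; `split_ranges`, `import_queries`, and the table and column names are hypothetical helpers for illustration, not Sqoop's own code or API.

```python
def split_ranges(lo, hi, num_tasks):
    """Split the inclusive key range [lo, hi] into contiguous inclusive slices.

    Assumes lo <= hi. Returns at most num_tasks slices whose sizes differ
    by no more than one key.
    """
    total = hi - lo + 1
    n = min(num_tasks, total)
    base, extra = divmod(total, n)
    ranges, start = [], lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def import_queries(table, column, lo, hi, num_tasks):
    """Build one bounded SELECT per slice; each could run in a separate task."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {a} AND {column} <= {b}"
        for a, b in split_ranges(lo, hi, num_tasks)
    ]


for query in import_queries("orders", "id", 1, 10, 4):
    print(query)
```

The slices never overlap and together cover the whole range, so every row is imported exactly once regardless of how many tasks run.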
  21. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Sqoop, Flume, FUSE-DFS, HDFS, Enterprise Data Warehouse
  22. Pig
  23. Pig • Scripting language • Generates MapReduce jobs • Perl for Hadoop • Great for ETL • Example: A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int); B = GROUP A BY f1; C = FOREACH B GENERATE COUNT(A); DUMP C;
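For readers who do not know Pig, the intent of the script on this slide (load three integer fields, group the rows by f1, count the rows in each group) can be mirrored in a few lines of plain Python. This is a single-machine sketch, not what Pig generates; `count_by_f1` is a hypothetical name, and tab-separated input is assumed because tab is the default PigStorage delimiter.

```python
from collections import Counter


def count_by_f1(lines, delimiter="\t"):
    """Count rows per value of the first field, like GROUP BY f1 plus COUNT."""
    counts = Counter()
    for line in lines:
        # Parse all three fields so malformed rows fail loudly.
        f1, f2, f3 = (int(field) for field in line.rstrip("\n").split(delimiter))
        counts[f1] += 1
    return dict(counts)


rows = ["1\t2\t3", "4\t2\t1", "1\t8\t9"]
print(count_by_f1(rows))
```

Unlike the Pig example, this version also keeps the group key next to each count, which makes the result easier to read.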
  24. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Sqoop, Flume, FUSE-DFS, HDFS, Pig, Enterprise Data Warehouse
  25. Sqoop with connectors
  26. Sqoop with connectors • MySQL* • PostgreSQL* • Teradata* • Netezza* • Oracle* • Couchbase* • Microsoft SQL Server • VoltDB • *Cloudera certified connector
  27. ETL with Hadoop • Managed, flexible, scalable • Diagram labels: Relational Databases, Logs, Files, Sqoop (appears twice), Flume, FUSE-DFS, HDFS, Pig, Enterprise Data Warehouse
  28. Recommendations
  29. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs
  30. Flume
  31. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Flume
  32. HDFS
  33. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Flume, HDFS
  34. Sqoop
  35. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS
  36. Pig
  37. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS, Pig
  38. Mahout
  39. Mahout • Scalable machine learning algorithms – Collaborative Filtering – User and Item based recommenders – K-Means, Fuzzy K-Means clustering – Mean Shift clustering – Singular value decomposition – Complementary Naive Bayes classifier …
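To make the "item based recommenders" bullet concrete, here is a minimal co-occurrence recommender in plain Python: items that often appear together in users' histories are suggested to users who have one but not the other. It is a single-machine sketch with made-up data; `cooccurrence` and `recommend` are hypothetical names and this is not Mahout's API or its exact algorithm.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(user_items):
    """Count, for each ordered item pair, how many users have both items."""
    co = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(user, user_items, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    co = cooccurrence(user_items)
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


history = {
    "alice": ["a", "b"],
    "bob": ["a", "b", "c"],
    "carol": ["b", "c", "d"],
    "dave": ["a"],
}
print(recommend("alice", history))
```

In the recommendations pipeline on these slides, the history would come from the web logs and databases collected earlier, and the computed suggestions would be served back to the web application.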
  40. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS, Pig, Mahout
  41. MapReduce
  42. MapReduce • Diagram: map applies toOne() to each record, emitting a 1 per key; shuffle groups the values by key ([1,1,1,1], [1,1], [1,1], [1]); reduce applies count() to each group (4, 2, 2, 1)
  43. MapReduce • Distributed • Code to data • Reliable • Scalable
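The three phases named on the MapReduce diagram slide (map, shuffle, reduce) can be imitated on one machine to show the data flow. The function names `to_one` and `count` echo the toOne() and count() labels on the slide; the letter keys and everything else are assumptions for illustration, and none of Hadoop's distribution or fault tolerance is modeled.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit a (key, 1) pair for every input record."""
    return (record, 1)


def shuffle(pairs):
    """Shuffle: group all values that share a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def count(key, values):
    """Reduce: collapse one key's list of ones into a total."""
    return (key, sum(values))


def map_reduce(records):
    mapped = [to_one(record) for record in records]
    grouped = shuffle(mapped)
    return dict(count(key, values) for key, values in grouped.items())


records = ["a"] * 4 + ["b"] * 2 + ["c"] * 2 + ["d"]
print(shuffle(to_one(r) for r in records))  # the per-key lists of ones
print(map_reduce(records))                  # the per-key totals
```

On a real cluster the map calls run on the nodes that already hold the input blocks, which is the "code to data" bullet.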
  44. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS, Pig (appears twice), Mahout, MapReduce
  45. Oozie
  46. Oozie • Workflows • Coordinator – Triggers
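The two Oozie bullets can be read as: a workflow runs actions in dependency order, and a coordinator decides when a workflow should start. The sketch below models both ideas with the standard-library `graphlib` module (Python 3.9 or later). The action names, `run_workflow`, and `coordinator` are hypothetical; real Oozie workflows are defined in XML and none of that is shown here.

```python
from graphlib import TopologicalSorter


def run_workflow(actions, dependencies):
    """Run every action after the actions it depends on; return the run order.

    actions:      name -> zero-argument callable
    dependencies: name -> set of names that must finish first
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()
        order.append(name)
    return order


def coordinator(inputs_ready, start_workflow):
    """Trigger the workflow only when every required input is available."""
    if all(inputs_ready.values()):
        return start_workflow()
    return None


log = []
actions = {
    "sqoop_import": lambda: log.append("imported tables"),
    "pig_clean": lambda: log.append("cleaned data"),
    "mahout_train": lambda: log.append("built recommendations"),
}
dependencies = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "mahout_train": {"pig_clean"},
}

result = coordinator(
    {"logs_arrived": True, "db_snapshot_done": True},
    lambda: run_workflow(actions, dependencies),
)
print(result)
```

If any input flag is False, `coordinator` returns None and nothing runs, which is the trigger behaviour in miniature.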
  47. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS, Oozie, Pig (appears twice), Mahout, MapReduce
  48. HBase
  49. HBase • Key/value store • Data stored in HDFS • Access model is get/put/del – Plus range scans and versions • Random reads and writes for Hadoop
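The access model on the HBase slide (get, put, delete, plus range scans and versions) can be illustrated with a small in-memory table that keeps row keys sorted. `ToyTable` is a hypothetical class for this sketch only; it is not the HBase client API, it ignores column families, and it uses a simple counter in place of timestamps.

```python
import bisect


class ToyTable:
    """In-memory sorted key/value table with versioned values."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted so range scans are cheap
        self._rows = {}   # row key -> list of (version, value), newest last
        self._clock = 0   # stand-in for a timestamp (assumption)

    def put(self, key, value):
        self._clock += 1
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        self._rows[key].append((self._clock, value))

    def get(self, key, versions=1):
        """Return up to `versions` values for the key, newest first."""
        cells = self._rows.get(key, [])
        return [value for _, value in reversed(cells)][:versions]

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start, stop):
        """Return (key, newest value) for keys in [start, stop), in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key][-1][1]) for key in self._keys[lo:hi]]


table = ToyTable()
table.put("user1", "a")
table.put("user2", "b")
table.put("user1", "c")
table.put("item9", "x")
print(table.get("user1", versions=2))
print(table.scan("user", "uses"))
```

Keeping keys sorted is what makes the range scan a contiguous slice, which is why row key design matters so much for scan-heavy workloads.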
  50. Recommendations with Hadoop • Diagram labels: Customers, Web Application, Relational Databases, Logs, Sqoop, Flume, HDFS, HBase, Oozie, Pig (appears twice), Mahout, MapReduce
  51. Business Intelligence
  52. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs
  53. Flume
  54. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Flume
  55. HDFS
  56. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Flume, HDFS
  57. Sqoop
  58. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS
  59. Hive
  60. Hive • Data warehouse • Ad-hoc queries – Not real-time (minutes) • SQL • Tables • Joins
  61. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS, Hive
  62. MapReduce
  63. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS, Hive, MapReduce
  64. Oozie
  65. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS, Oozie, Hive, MapReduce
  66. HBase
  67. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS, HBase, Oozie, Hive, MapReduce
  68. Hive
  69. Hive for Business Intelligence • JDBC – JasperReports* – Pentaho* • ODBC – MicroStrategy*^ • * Vendor certified connector • ^ Cloudera certified connector
  70. Business Intelligence with Hadoop • Diagram labels: Analysts, BI / Analytics, Relational Databases, Logs, Sqoop, Flume, HDFS, Hive (appears twice), HBase, Oozie, MapReduce
  71. CDH • File System Mount: FUSE-DFS • UI Framework: HUE • SDK: HUE SDK • Workflow: Apache Oozie* • Scheduling: Apache Oozie* • Metadata: Apache Hive • Languages / Compilers: Apache Pig, Apache Hive • Data Integration: Apache Flume*, Apache Sqoop* • Fast Read/Write Access: Apache HBase • Coordination: Apache ZooKeeper • *Currently under incubation in the Apache Software Foundation
  72. What’s next? • Cloudera Training Videos • CDH Virtual Machines • Hadoop: The Definitive Guide, 2nd Edition • Cloudera University – Developer Training in Columbia, MD (Dec 13-16, Feb 13-16) – Administrator Training in Herndon, VA (Jan 4-6) – Private Training
  73. We’re Hiring! • http://www.cloudera.com/company/careers/ • Customer Operations – Customer Operations Engineer – Customer Operations Tools Developer • Customer Solutions – Solutions Architect • Engineering – Senior Data Integration Developer – Senior Distributed Systems Engineer – Senior UI Engineer – Software Quality Engineer – Technical Writer • IT/Operations – Systems Administrator