Hadoop in three use cases

  1. 2 December 2011 | Hadoop in Three Use Cases | Joey Echeverria | Solutions Architect | joey@cloudera.com | @fwiffo
  2. About Joey • Solutions Architect • 6 months • 3+ years • Local (©2011 Cloudera, Inc. All Rights Reserved.)
  3. Cloudera’s Distribution including Apache Hadoop • File System Mount: FUSE-DFS • UI Framework: HUE • SDK: HUE SDK • Workflow: Apache Oozie* • Scheduling: Apache Oozie* • Metadata: Apache Hive • Languages / Compilers: Apache Pig, Apache Hive • Data Integration: Apache Flume*, Apache Sqoop* • Fast Read/Write Access: Apache HBase • Coordination: Apache ZooKeeper (*currently under incubation in the Apache Software Foundation)
  4. Extract, Transform, and Load
  5. ETL before Hadoop: difficult to maintain, not scalable. Diagram: Relational Databases, Logs, Files; Custom ETL Scripts; Enterprise Data Warehouse
  6. ETL before Hadoop: may be scalable, expensive. Diagram: Relational Databases, Logs, Files; Enterprise Data Warehouse (SQL: raw table → warehouse tables)
  7. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Enterprise Data Warehouse
  8. Steps 1. In 2. Process 3. Out
  9. Flume
  10. Flume
  11. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Flume; Enterprise Data Warehouse
  12. HDFS
  13. HDFS read diagram: the Client sends open(“file.txt”) to the NameNode, which answers 02, 06, 10; the Client then receives data from three DataNodes. Twelve DataNodes are shown, labeled 01 to 12.
  14. HDFS • Distributed • Replication • Bulk I/O • Fault tolerant • Scalable • Append only • Not POSIX
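The read path drawn on the HDFS slides can be sketched as a toy model in Python. This is only an illustration, not the HDFS client API: the `DataNode`, `NameNode`, and `read_file` names, the block ids, and the file contents are all hypothetical, and the sketch assumes that the labels 02, 06, 10 name the DataNodes holding the file's blocks, which the slide does not state.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Holds block contents keyed by block id (hypothetical toy model)."""
    label: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Keeps metadata only: each file's ordered blocks and replica locations."""
    files: dict = field(default_factory=dict)

    def open(self, path):
        # Returns [(block_id, [datanode labels]), ...]; no file data passes through here.
        return self.files[path]


def read_file(path, namenode, datanodes):
    """Ask the NameNode where the blocks live, then pull bytes from the DataNodes."""
    chunks = []
    for block_id, replicas in namenode.open(path):
        for label in replicas:
            node = datanodes.get(label)
            if node is not None and block_id in node.blocks:
                chunks.append(node.blocks[block_id])
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


# Assumed layout: three blocks spread over DataNodes 02, 06, and 10.
datanodes = {
    "02": DataNode("02", {"blk_1": b"hello "}),
    "06": DataNode("06", {"blk_1": b"hello ", "blk_2": b"hadoop"}),
    "10": DataNode("10", {"blk_2": b"hadoop", "blk_3": b"!"}),
}
namenode = NameNode({"file.txt": [
    ("blk_1", ["02", "06"]),
    ("blk_2", ["06", "10"]),
    ("blk_3", ["10"]),
]})

contents = read_file("file.txt", namenode, datanodes)

# Losing DataNode 02 is tolerated because blk_1 has a second replica on 06.
survivors = {label: node for label, node in datanodes.items() if label != "02"}
contents_after_failure = read_file("file.txt", namenode, survivors)
```

The point of the split is that the NameNode serves only locations while the bulk data flows directly between the client and the DataNodes, and replication lets a read survive the loss of a node.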
  15. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Flume; HDFS; Enterprise Data Warehouse
  16. FUSE-DFS
  17. FUSE-DFS • FUSE – User space – File systems • FUSE-DFS – /hdfs – Mostly transparent
  18. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Flume, FUSE-DFS; HDFS; Enterprise Data Warehouse
  19. Sqoop
  20. Sqoop • SQL to Hadoop • Parallel import • File formats
  21. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Sqoop, Flume, FUSE-DFS; HDFS; Enterprise Data Warehouse
  22. Pig
  23. Pig • Scripting language • Generates MapReduce jobs • Perl for Hadoop • Great for ETL
      A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
      B = GROUP A BY f1;
      C = FOREACH B GENERATE COUNT(A);
      DUMP C;
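For readers who do not know Pig Latin, the group-and-count on the Pig slide can be mirrored in a few lines of plain Python. This is a rough single-machine analogy under assumed data: the `rows` values and the `count_by_f1` helper are hypothetical, and Pig would compile the same logic into MapReduce jobs rather than run it in one process.

```python
from collections import Counter

# Hypothetical rows standing in for the (f1, f2, f3) tuples loaded by PigStorage().
rows = [(1, 10, 100), (1, 11, 101), (2, 20, 200), (3, 30, 300), (1, 12, 102)]


def count_by_f1(rows):
    """GROUP rows BY f1, then COUNT the rows in each group."""
    return Counter(f1 for f1, _f2, _f3 in rows)


counts = count_by_f1(rows)
```

Unlike the slide's script, which emits only the counts, this sketch keeps each f1 value next to its count so the result is easier to read.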
  24. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Sqoop, Flume, FUSE-DFS; HDFS; Pig; Enterprise Data Warehouse
  25. Sqoop with connectors
  26. Sqoop with connectors • MySQL* • PostgreSQL* • Teradata* • Netezza* • Oracle* • Couchbase* • Microsoft SQL Server • VoltDB (*Cloudera certified connector)
  27. ETL with Hadoop: managed, flexible, scalable. Diagram: Relational Databases, Logs, Files; Sqoop, Flume, FUSE-DFS; HDFS; Pig; Sqoop; Enterprise Data Warehouse
  28. Recommendations
  29. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs
  30. Flume
  31. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Flume
  32. HDFS
  33. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Flume; HDFS
  34. Sqoop
  35. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS
  36. Pig
  37. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS; Pig
  38. Mahout
  39. Mahout • Scalable machine learning algorithms – Collaborative Filtering – User and Item based recommenders – K-Means, Fuzzy K-Means clustering – Mean Shift clustering – Singular value decomposition – Complementary Naive Bayes classifier …
  40. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS; Pig, Mahout
  41. MapReduce
  42. MapReduce diagram in three stages (the keys were graphical and are lost here). map: three tasks each apply toOne() and emit three key:1 pairs. shuffle: values are grouped by key into :[1,1,1,1], :[1,1], :[1,1], and :[1]. reduce: two tasks apply count(), producing :4 and :2 from the first and :2 and :1 from the second.
  43. MapReduce • Distributed • Code to data • Reliable • Scalable
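The map, shuffle, and reduce stages from the MapReduce slides can be sketched in a single Python process. The `to_one` and `count` functions echo the slide's toOne() and count() labels; the `map_reduce` driver and the letter keys are assumptions for illustration, since the slide's keys did not survive extraction, and nothing here is distributed.

```python
from collections import defaultdict


def to_one(record):
    """Map step: emit a (key, 1) pair for each input record."""
    yield (record, 1)


def count(key, values):
    """Reduce step: total the ones collected for a key."""
    return (key, sum(values))


def map_reduce(records, mapper, reducer):
    """Run map, then shuffle (group values by key), then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


# Nine hypothetical records, matching the group sizes 4, 2, 2, 1 on the slide.
records = ["a", "b", "a", "c", "a", "d", "b", "c", "a"]
totals = map_reduce(records, to_one, count)
```

In a real cluster the map calls run where the input blocks live ("code to data"), and the shuffle moves the grouped values across the network to the reducers.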
  44. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS; Pig, Mahout, MapReduce, Pig
  45. Oozie
  46. Oozie • Workflows • Coordinator – Triggers
  47. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS; Oozie; Pig, Mahout, MapReduce, Pig
  48. HBase
  49. HBase • Key/value store • Data stored in HDFS • Access model is get/put/del – Plus range scans and versions • Random reads and writes for Hadoop
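The access model on the HBase slide (get/put/del, plus range scans and versions) can be illustrated with a small in-memory table. This is a hedged sketch, not the HBase client API: the `TinyTable` class, its method signatures, and the sample keys are hypothetical, and real HBase persists its data in HDFS rather than in Python dictionaries.

```python
import bisect


class TinyTable:
    """Toy sorted key/value store with versioned cells (hypothetical model)."""

    def __init__(self, max_versions=3):
        self._keys = []   # row keys kept sorted so range scans are cheap
        self._rows = {}   # key -> list of (timestamp, value), newest first
        self._max_versions = max_versions

    def put(self, key, value, ts):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        versions = self._rows[key]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self._max_versions:]  # drop versions beyond the limit

    def get(self, key, versions=1):
        return [value for _ts, value in self._rows.get(key, [])[:versions]]

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start, stop):
        """Return (key, newest value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k][0][1]) for k in self._keys[lo:hi]]


table = TinyTable()
table.put("user1", "a", ts=1)
table.put("user1", "b", ts=2)
table.put("user2", "c", ts=1)
table.put("user3", "d", ts=1)

latest = table.get("user1")
history = table.get("user1", versions=2)
window = table.scan("user1", "user3")
table.delete("user2")
window_after_delete = table.scan("user1", "user3")
```

Keeping the row keys sorted is what makes a range scan a contiguous read, which is why key design matters so much for scan-heavy workloads.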
  50. Recommendations with Hadoop. Diagram: Customers, Web Application, Relational Databases, Logs; Sqoop, Flume; HDFS, HBase; Oozie; Pig, Mahout, MapReduce, Pig
  51. Business Intelligence
  52. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs
  53. Flume
  54. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Flume
  55. HDFS
  56. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Flume; HDFS
  57. Sqoop
  58. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS
  59. Hive
  60. Hive • Data warehouse • Ad-hoc queries – Not real-time (minutes) • SQL • Tables • Joins
  61. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS; Hive
  62. MapReduce
  63. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS; Hive, MapReduce
  64. Oozie
  65. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS; Oozie; Hive, MapReduce
  66. HBase
  67. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS, HBase; Oozie; Hive, MapReduce
  68. Hive
  69. Hive for Business Intelligence • JDBC – JasperReports* – Pentaho* • ODBC – MicroStrategy*^ (* Vendor certified connector; ^ Cloudera certified connector)
  70. Business Intelligence with Hadoop. Diagram: Analysts, BI / Analytics, Relational Databases, Logs; Sqoop, Flume; HDFS, Hive, HBase; Oozie; Hive, MapReduce
  71. CDH • File System Mount: FUSE-DFS • UI Framework: HUE • SDK: HUE SDK • Workflow: Apache Oozie* • Scheduling: Apache Oozie* • Metadata: Apache Hive • Languages / Compilers: Apache Pig, Apache Hive • Data Integration: Apache Flume*, Apache Sqoop* • Fast Read/Write Access: Apache HBase • Coordination: Apache ZooKeeper (*currently under incubation in the Apache Software Foundation)
  72. What’s next? • Cloudera Training Videos • CDH Virtual Machines • Hadoop: The Definitive Guide, 2nd Edition • Cloudera University – Developer Training in Columbia, MD (Dec 13-16, Feb 13-16) – Administrator Training in Herndon, VA (Jan 4-6) – Private Training
  73. We’re Hiring! • http://www.cloudera.com/company/careers/ • Customer Operations – Customer Operations Engineer – Customer Operations Tools Developer • Customer Solutions – Solutions Architect • Engineering – Senior Data Integration Developer – Senior Distributed Systems Engineer – Senior UI Engineer – Software Quality Engineer – Technical Writer • IT/Operations – Systems Administrator