Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.


Published on

ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.

Published in: Technology
  • Be the first to comment

Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.

  1. 1. Scalable ETL with Talend and Hadoop Talend,  Global  Leader  in  Open  Source  Integra7on  Solu7ons   Cédric Carbone – Talend CTO Twitter : @carbone
  2. 2. Why speaking about ETL with Hadoop Hadoop  is  complex   BI  consultants  don’t  have  the  skill  to  manipulate  Hadoop   ➜  Biggest  issue  in  Hadoop  project  is  to  find  skilled  Hadoop   Engineers   ➜  ETL  tool  like  Talend  can  help  to  democra7ze  Hadoop  
  3. 3. Trying  to  get  from  this…  
  4. 4. to  this…   Why Talend… Talend generates code that is executed within map reduce. This open approach removes the limitation of a proprietary “engine” to provide a truly unique and powerful set of tools for big data.
  5. 5. BIG Data Management Big  Data     Produc7on   RDBMS   Analy7cal  DB   NoSQL  DB   ERP/CRM   SaaS   Social  Media   Web  Analy7cs   Log  Files   RFID   Call  Data  Records   Sensors   Machine-­‐Generated   Big  Data  Management   Big  Data     Integra7on   Big Data Quality Big  Data     Consump7on   Mining   Analy7cs   Storage Processing Filtering Parsing Checking Search   Enrichment   Turn Big Data into actionable information
  6. 6. Two methods for inserting data quality into a big data job 1.  Pipelining:  as  part  of  the  load  process   2.  Load  the  cluster  than  implement  and  execute  a  data   quality  map  reduce  job  
  7. 7.             Extract  –  Transform  -­‐  Load   E-­‐T-­‐L
  8. 8.                 Extract  –  Improve/Cleanse  -­‐  Load   E-­‐              -­‐L DQ
  9. 9. Pipelining: data quality with big data CRM ERP Finance   DQ   DQ DQ   Big Data DQ Social Networking Mobile Devices DQ •  Use  tradi7onal  data  quality  tools   •  Once  and  done  
  10. 10. Big data alternative: Load and improve within the cluster   CRM DQ ERP   DQ   Finance Social Networking Mobile Devices Big Data •  Load  first,  improve  later   •  Complex  matching  cannot  be  done  outside  
  11. 11. One key DQ rules: Match Find  duplicates  within  Hadoop   ➜  Today’s  matching  algorithms  are  processor-­‐intensive   ➜  Tomorrow’s  matching  algorithms  could  be  more   precise,  more  intensive   ➜ 
  12. 12. What is Hadoop?  
  13. 13. What’s hadoop ➜  ➜  ➜  ➜  The  Apache™  Hadoop®  project  develops  open-­‐source   so^ware  for  reliable,  scalable,  distributed  compu7ng.   Java  framework  for  storage  and  running  data   transforma7on  on  large  cluster  of  commodity   hardware   Licensed  under  the  Apache  v2  license   Created  from  Google's  MapReduce,  BigTable  and   Google  File  System  (GFS)  papers  
  14. 14. Hadoop ecosystem Sqoop   ➜  Pig   ➜  Hive   ➜  Oozie   ➜  HCatalog   ➜  Mahout   ➜  ➜  HBase  
  15. 15. Talend for Big Data : Hadoop Story 4.0:  [April  2010]   Put  or  get  data  into   Haddop  through   HDFS  connectors   4.1:    [Oct  2010]   Hadoop  Query  (Hive)     Bulk  loald  &  fast  export  to   Hadoop  (Sqoop)   5.1:  [May  2012]   Metadata  (Hcatalog)   Deployement  &  Scheduling  (Oozie)   Embeded  into  HDP   HCatalog     4.2:  [May  2011]   Transforma7on   (Pig)   5.2:  [Oct  2012]   Visual  ELT  mapping  (Hive)   DataLineage  &  Impact  Analysis     5.0:  [Nov  2011]   Hbase  NoSQL.   Extend  our  tPig*   5.3  -­‐  [June  2013]   Visual  Pig  mapping   Machine  Learning  (Mahout)   Na7ve  MapReduce  Code  Gen    
  16. 16. Democratizing Integration with Data Integration tools for Big Data  
  17. 17. WordCount ➜  WordCount   •  Comes  with  Hadoop   •  “First”  demo  that  everyone  tries!   ➜  How-­‐to  in  Talend  Big  Data   •  Simple  read,  count,  load  results   •  No  coding,  just  drag-­‐n-­‐drop   •  Runs  remotely  
  18. 18. Map Reduce Data  Node  1   Data  Node  2  
  19. 19. Map Reduce
  20. 20. Map Reduce
  21. 21. Map Reduce
  22. 22. Map Reduce
  23. 23. In Java
  24. 24. WordCount: Howto with a Graphical ETL
  25. 25. Thank You!      
  26. 26. Let  us  show  you…  
  27. 27. Talend Open Studio
  28. 28. Generate Pure Map Reduce
  29. 29. Pig Latin Script generation FOREACH  tPigMap_1_out1_RESULT  GENERATE  $4  AS   Revenu  ,  $6  AS  Label  
  30. 30. HiveQL generation
  31. 31. HDFS Management and Sqoop
  32. 32. Apache Mahout Big  Data  can  also  be  a  blob  of   data  to  an  organiza7on   ➜  Apache  Mahout  provides   algorithms  to  understanding   data  –  data  mining   ➜  “You  don’t  know  what  you   don’t  know.”  and  mahout  will   tell  you.   ➜ 
  33. 33. Metadata Management ➜  Centralize  Metadata  repository  for  Hadoop  Cluster,   HDFS,  Hive…   •  Versioning   •  Impact  Analysis  and  Data  Lineage     ➜  HCatalog  accros     •  HDFS   •  Hive   •  Pig  
  34. 34. Thank You!      
  35. 35. Choose your Hadoop distro Widely  adopted   ➜  Management  tooling  is  not  OSS   ➜  Fully  OpenSource   ➜  Strong  Developer  ecosystem   ➜  More  proprietary   ➜  GTM  partner  with  AWS   ➜  ➜  A  lot  of  more  are  comming  
  36. 36. Choose your Hadoop distro Provide  tooling  :   ➜  For  installa7on   ➜  For  server  monitoring     But   ➜  No  GUI  for  parsing,   transforming,  easily   loading.  No  data   management    
  37. 37. Parse and Standardize Big  Data  is  not  always  structured   ➜  Correct  big  data  so  that  data  conforms  to  the  same   rules   ➜ 
  38. 38. Profiling & Monitor DQ
  39. 39. Implications for Integration DATA Challenge:  Informa7on  explosion  increases   complexity  of  integra7on  and  requires   governance  to  maintain  data  quality     Requirement:  Informa7on  processing     must  scale  
  40. 40. Implications for Integration APPLICATION     Challenge:  Brisle,  point-­‐to-­‐point   connec7ons  cannot  adapt  to  evolving   business  requirements,  new  channels,   and  quickly  changing  topologies     Requirement:  Applica7on  architecture   must  scale    
  41. 41. Implications for Integration PROCESS   Challenge:  Compe77ve  market  forces  drive   frequent  process  changes  and  increased   process  complexity     Requirement: Business  processes   must  scale  
  42. 42. Implications for Integration Challenge:  Interdependencies   across  data,  applica7ons  and   processes  require  more   resources  and  budget     Requirement:  Resources  and   skillsets  must  scale      
  43. 43. Integration at Any Scale True scalability for •  Any  integra7on  challenge   •  Any  data  volume   •  Any  project  size   Enables integration convergence  
  44. 44. Technology that Scales Java   STANDARDS-BASED Easy  to  learn,  flexible  to  adopt,   reduces  vendor  lock-­‐in   SQL   Camel   Map   Reduce   ……   CODE GENERATOR No  black-­‐box  engine  means  faster   maintenance  and  deployment   with  improved  quality     The  “engine”  for  Big  Data  is   “Hadoop”,  making  it  uniquely  run   at  infinite  scale.  
  45. 45. Technology Continuum… Google Big Query     ELT (SQL CodeGen) •  Terradata     •  Netezza •  Vertica         •  Visual Wizard •  Hive/Pig/MR CodeGen •  HDFS, Sqoop, Oozie…     ETL     •  Java Code •  Partionning •  Paralelisation NoSQL •  MongoDB •  Neo4J •  Cassandra •  Hbase •  Amazon Redshift
  46. 46. Talend Overview At a glance Talend  today   ➜  400  employees  in  7  countries  with  dual  HQ  in  Los  Altos,  CA   and  Paris,  France   ➜  Over  4,000  paying  customers  across  different  industry   ver7cals  and  company  sizes   Backed  by  Silver  Lake  Sumeru,  Balderton  Capital  and   Idinvest  Partners   ▶  Founded  in  2005   ▶  Offers  highly  scalable   integra7on  solu7ons   addressing  Data  Integra7on,   Data  Quality,  MDM,  ESB  and   BPM   ▶  Provides:     ➜  High  growth  through  a  proven  model   §  Subscrip7ons  including   24/7  support  and   indemnifica7on;     Brand     Awareness   20  million   Downloads   §  Worldwide  training  and   services   ▶  Recognized  as  the  open  source   leader  in  each  of  its  market   categories   Market   Momentum   AdopDon   1,000,000   Users   +50  New   Customers  /   Month   MoneDzaDon   4,000   Customers  
  47. 47. Talend’s Unique Integration Solution Best-ofBreed Solutions Data Quality Data Integration MDM ESB BPM + Studio   Repository   Deployment   Execu7on   Monitoring   Talend Unified Single  web-­‐based   Web-­‐based   Comprehensive   Platform monitoring  console   Eclipse-­‐based     deployment  &   ➜  Reduce  costs   Talend scheduling   user  interface   ➜  Eliminate  risk   5 Unified = 1 3 Same  container  for   Consolidated   ➜  Reuse  skills   Platform metadata  &  project   batch  processing,   Unique ➜  Economies  of  scale   informa7on   message  rou7ng  &   Integration services   ➜  Incremental  adop7on   2 Solution 4 Recognized  as  the  open  source  leader  in  each  of  its  market  category  by   all  industry  analysts  
  48. 48. Solutions that Scale Big  Data         Studio   Data    Quality         Repository   Data    Integra7on         Data     Integra7on     MDM   ESB   BPM           TALEND UNIFIED  PLATFORM           Deployment     Execu7on   Monitoring   UNIFIED PLATFORM A  shared  founda7on  and  toolset  increases  resource  reuse   CONVERGED INTEGRATION Use  for  any  data,  applica7on  and     process  project  
  49. 49. The 6 Dimensions of BIG Data Primary  challenges   ¾ Volume   ¾ Velocity   ¾ Variety     And  also   ¾ Complexity   ¾ Valida7on   ¾ Lineage  
  50. 50. What Is BIG Data? "Big  data"     is  informa7on     of  extreme  size,   diversity,  complexity   and  need  for  rapid   processing.   Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011 3,500  tweets     per  second   (June  2011)   1,000,000  transac7ons   per  day  at  Walmart   200  billion   intelligent  devices   200,000,000,000   2015 275  exabytes   of  data  flowing  over     the  Internet  each  day   275,000,000,000,000,000,000   2020
  51. 51. What is Big Data?  
  52. 52. How to define Big data is…. Hans  Rosling  –  uses  big  data  to  analyze  world  health  trends   Key  Takeaway  #1   volume, variety, velocity
  53. 53. Traditional Data Flows CRM   ERP   ETL   Data  Quality   Normalized   Data   Tradi7onal  Data   Warehouse   Finance   •  Scheduled–daily  or  weekly,   some7mes  more  frequently.     •  Volumes  rarely  exceed   terabytes   Business   Analyst   Warehouse   Administrator     Business   User   Execu7ves  
  54. 54. The new world of big data CRM   Social  Networking       ERP   Finance   Big Data
  55. 55. The new world of big data Social  Networking     CRM     ERP   Finance     Big Data Mobile  Devices  
  56. 56. The new world of big data Social  Networking     CRM     ERP   Mobile  Devices     Transac7ons     Finance   Network  Devices     Big Data Sensors  
  57. 57. Key  Takeaway  #2   Forces us to think differently
  58. 58. Data driven business enables data governance   information supports Information provides value to the business If  you  can't  rely  on  your  informa7on  then  the   result  can  be  missed  opportuni7es,  or  higher   costs.   Mashew  West  and  Julian  Fowler  (1999).  Developing  High  Quality  Data  Models.     The  European  Process  Industries  STEP  Technical  Liaison  Execu7ve  (EPISTLE).   decisions drives Your business
  59. 59. BIG data driven business BIG  data   enables governance   BIG   supports information Information provides value to the business If  you  can't  rely  on  your  informa7on  then  the   result  can  be  missed  opportuni7es,  or  higher   costs.   Mashew  West  and  Julian  Fowler  (1999).  Developing  High  Quality  Data  Models.     The  European  Process  Industries  STEP  Technical  Liaison  Execu7ve  (EPISTLE).   BIG   decisions drives BIG   business
  60. 60. How is big data integration being used? Use  Cases   •  •  •  •  •  •  •  •  •  •      Recommenda7on  Engine   Sen7ment  Analysis   Risk  Modeling   Fraud  Detec7on   Behavior  Analysis   Marke7ng  Campaign  Analysis   Customer  Churn  Analysis   Social  Graph  Analysis   Customer  Experience  Analy7cs   Network  Monitoring           BUT:  to  what  level  is  DQ  required  for  your  use  case?  
  61. 61. Key  Takeaway  #3   Define your use case