Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.

ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.


Transcript of "Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend."

  1. Scalable ETL with Talend and Hadoop
     Talend, Global Leader in Open Source Integration Solutions
     Cédric Carbone, Talend CTO. Twitter: @carbone, ccarbone@talend.com
  2. Why talk about ETL with Hadoop
     Hadoop is complex, and BI consultants don't have the skills to manipulate Hadoop.
     ➜ The biggest issue in a Hadoop project is finding skilled Hadoop engineers
     ➜ An ETL tool like Talend can help democratize Hadoop
  3. Trying to get from this…
  4. …to this. Why Talend?
     Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary "engine" to provide a truly unique and powerful set of tools for big data.
  5. BIG Data Management
     Big Data Production: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated data
     Big Data Management: Big Data Integration and Big Data Quality (storage, processing, filtering, parsing, checking, search, enrichment)
     Big Data Consumption: Mining, Analytics
     Turn Big Data into actionable information.
  6. Two methods for inserting data quality into a big data job
     1. Pipelining: as part of the load process
     2. Load the cluster, then implement and execute a data quality MapReduce job
  7. Extract - Transform - Load (E-T-L)
  8. Extract - Improve/Cleanse - Load (E-DQ-L)
  9. Pipelining: data quality with big data
     CRM, ERP, Finance, Social Networking and Mobile Devices each pass through DQ on their way into Big Data.
     • Use traditional data quality tools
     • Once and done
  10. Big data alternative: load and improve within the cluster
      CRM, ERP, Finance, Social Networking and Mobile Devices load straight into Big Data; DQ runs inside the cluster.
      • Load first, improve later
      • Complex matching cannot be done outside
  11. One key DQ rule: Match
      ➜ Find duplicates within Hadoop
      ➜ Today's matching algorithms are processor-intensive
      ➜ Tomorrow's matching algorithms could be more precise, more intensive
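      This is not how Talend's matching components are implemented; as a minimal sketch of why matching maps well onto Hadoop, the job below groups records by a blocking key (here, lowercased last name plus zip code) so that only records sharing the key reach the same reducer, which then flags groups containing more than one record as duplicate candidates. The class names, field positions and the semicolon delimiter are illustrative assumptions, not anything from the deck.

      // Minimal sketch (not Talend's implementation): group records by a blocking key
      // so that only records sharing the key are compared; the reducer flags groups
      // with more than one record as duplicate candidates.
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class DuplicateCandidates {

        public static class BlockingKeyMapper
            extends Mapper<LongWritable, Text, Text, Text> {
          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            String[] fields = line.toString().split(";");
            if (fields.length < 4) return;                 // skip malformed rows
            // Hypothetical layout: 0=id, 1=lastName, 2=firstName, 3=zip
            String blockingKey = fields[1].trim().toLowerCase() + "|" + fields[3].trim();
            context.write(new Text(blockingKey), line);
          }
        }

        public static class CandidateReducer
            extends Reducer<Text, Text, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<Text> records, Context context)
              throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : records) count++;
            if (count > 1) {                               // more than one record shares the key
              context.write(key, new IntWritable(count));
            }
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "duplicate candidates");
          job.setJarByClass(DuplicateCandidates.class);
          job.setMapperClass(BlockingKeyMapper.class);
          job.setReducerClass(CandidateReducer.class);
          job.setMapOutputKeyClass(Text.class);
          job.setMapOutputValueClass(Text.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

      Blocking keeps the number of comparisons per reducer small, which is what makes processor-intensive matching tractable at cluster scale.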
  12. What is Hadoop?
  13. What's Hadoop?
      ➜ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing
      ➜ A Java framework for storage and for running data transformations on large clusters of commodity hardware
      ➜ Licensed under the Apache v2 license
      ➜ Created from Google's MapReduce, BigTable and Google File System (GFS) papers
  14. Hadoop ecosystem
      ➜ Sqoop
      ➜ Pig
      ➜ Hive
      ➜ Oozie
      ➜ HCatalog
      ➜ Mahout
      ➜ HBase
  15. Talend for Big Data: the Hadoop story
      • 4.0 [April 2010]: put or get data into Hadoop through HDFS connectors
      • 4.1 [Oct 2010]: Hadoop query (Hive); bulk load and fast export to Hadoop (Sqoop)
      • 4.2 [May 2011]: transformation (Pig)
      • 5.0 [Nov 2011]: HBase NoSQL; extended our tPig* components
      • 5.1 [May 2012]: metadata (HCatalog); deployment and scheduling (Oozie); embedded into HDP
      • 5.2 [Oct 2012]: visual ELT mapping (Hive); data lineage and impact analysis
      • 5.3 [June 2013]: visual Pig mapping; machine learning (Mahout); native MapReduce code generation
  16. Democratizing Integration with Data Integration tools for Big Data
  17. WordCount
      ➜ WordCount
        • Comes with Hadoop
        • The "first" demo that everyone tries!
      ➜ How-to in Talend Big Data
        • Simple read, count, load results
        • No coding, just drag-and-drop
        • Runs remotely
  18.-22. Map Reduce: step-by-step diagrams of the job running across Data Node 1 and Data Node 2
  23. In Java
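      The Java source shown on this slide is an image and is not captured in the transcript. For reference, the canonical Hadoop WordCount that this slide sequence refers to looks roughly like the sketch below (standard example code, not necessarily the exact listing on the slide).

      // The canonical Hadoop WordCount, shown as a reference for what a hand-written
      // MapReduce job involves before a tool generates it for you.
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);                   // emit (word, 1) per token
            }
          }
        }

        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();  // add up the 1s per word
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

      The point of the next slide is that the same job can be assembled by drag-and-drop in Talend instead of being written by hand.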
  24. WordCount: how-to with a graphical ETL
  25. Thank You!
  26. Let us show you…
  27. Talend Open Studio
  28. Generate Pure MapReduce
  29. Pig Latin script generation
      FOREACH tPigMap_1_out1_RESULT GENERATE $4 AS Revenu, $6 AS Label
  30. HiveQL generation
  31. HDFS Management and Sqoop
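      The slide itself is a screenshot of the Studio components. For context, a minimal hand-written equivalent of an HDFS put/list/get using the Hadoop FileSystem API might look like the sketch below; the NameNode URI and paths are placeholders, and bulk loads from relational databases would normally go through the sqoop command-line tool rather than this API.

      // Minimal sketch of hand-written HDFS put/list/get with the FileSystem API;
      // Talend's HDFS components wrap calls like these. URIs and paths are placeholders.
      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsExample {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Assumed NameNode address; adjust to your cluster.
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

          // Put a local file into HDFS.
          fs.copyFromLocalFile(new Path("/tmp/orders.csv"), new Path("/data/orders/orders.csv"));

          // List what landed there.
          for (FileStatus status : fs.listStatus(new Path("/data/orders"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
          }

          // Get a file back out of HDFS.
          fs.copyToLocalFile(new Path("/data/orders/orders.csv"), new Path("/tmp/orders_copy.csv"));

          fs.close();
        }
      }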
  32. Apache Mahout
      ➜ Big Data can also be a blob of data to an organization
      ➜ Apache Mahout provides algorithms for understanding data (data mining)
      ➜ "You don't know what you don't know," and Mahout will tell you
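      The deck doesn't include any Mahout code; purely as an illustration of the kind of algorithm Mahout provides (a user-based recommender, matching the "Recommendation Engine" use case later in the deck), a minimal non-distributed sketch with Mahout's Taste API could look like this. The ratings.csv file and its userID,itemID,preference layout are assumptions.

      // Illustrative sketch only: a user-based recommender built with Mahout's Taste API.
      // ratings.csv is assumed to contain lines of the form userID,itemID,preference.
      import java.io.File;
      import java.util.List;
      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
      import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
      import org.apache.mahout.cf.taste.recommender.RecommendedItem;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.UserSimilarity;

      public class RecommenderExample {
        public static void main(String[] args) throws Exception {
          DataModel model = new FileDataModel(new File("ratings.csv"));
          UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
          UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
          Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

          // Top 3 recommendations for user 1.
          List<RecommendedItem> items = recommender.recommend(1, 3);
          for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
          }
        }
      }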
  33. Metadata Management
      ➜ Centralized metadata repository for the Hadoop cluster, HDFS, Hive…
        • Versioning
        • Impact analysis and data lineage
      ➜ HCatalog across HDFS, Hive and Pig
  34. Thank You!
  35. Choose your Hadoop distro (each bullet describes one of the distributions pictured on the slide)
      ➜ Widely adopted
      ➜ Management tooling is not OSS
      ➜ Fully open source
      ➜ Strong developer ecosystem
      ➜ More proprietary
      ➜ GTM partner with AWS
      ➜ Many more are coming
  36. Choose your Hadoop distro
      The distros provide tooling:
      ➜ For installation
      ➜ For server monitoring
      But:
      ➜ No GUI for parsing, transforming, easily loading; no data management
  37. Parse and Standardize
      ➜ Big Data is not always structured
      ➜ Correct big data so that data conforms to the same rules
  38. Profiling & Monitoring DQ
  39. Implications for Integration: DATA
      Challenge: the information explosion increases the complexity of integration and requires governance to maintain data quality.
      Requirement: information processing must scale.
  40. Implications for Integration: APPLICATION
      Challenge: brittle, point-to-point connections cannot adapt to evolving business requirements, new channels, and quickly changing topologies.
      Requirement: application architecture must scale.
  41. Implications for Integration: PROCESS
      Challenge: competitive market forces drive frequent process changes and increased process complexity.
      Requirement: business processes must scale.
  42. Implications for Integration
      Challenge: interdependencies across data, applications and processes require more resources and budget.
      Requirement: resources and skillsets must scale.
  43. Integration at Any Scale
      True scalability for:
      • Any integration challenge
      • Any data volume
      • Any project size
      Enables integration convergence.
  44. Technology that Scales
      STANDARDS-BASED (Java, SQL, Camel, MapReduce, …): easy to learn, flexible to adopt, reduces vendor lock-in.
      CODE GENERATOR: no black-box engine means faster maintenance and deployment with improved quality.
      The "engine" for Big Data is Hadoop, making it uniquely able to run at infinite scale.
  45. Technology Continuum…
      • Google BigQuery
      • ELT (SQL CodeGen): Teradata, Netezza, Vertica
      • Visual wizard; Hive/Pig/MR CodeGen; HDFS, Sqoop, Oozie…
      • ETL: Java code, partitioning, parallelization
      • NoSQL: MongoDB, Neo4J, Cassandra, HBase
      • Amazon Redshift
  46. Talend Overview: At a Glance
      Talend today:
      ➜ 400 employees in 7 countries with dual HQ in Los Altos, CA and Paris, France
      ➜ Over 4,000 paying customers across different industry verticals and company sizes
      ➜ Backed by Silver Lake Sumeru, Balderton Capital and Idinvest Partners
      ➜ High growth through a proven model: brand awareness (20 million downloads), adoption (1,000,000 users), market momentum (+50 new customers/month), monetization (4,000 customers)
      ▶ Founded in 2005
      ▶ Offers highly scalable integration solutions addressing Data Integration, Data Quality, MDM, ESB and BPM
      ▶ Provides subscriptions including 24/7 support and indemnification, plus worldwide training and services
      ▶ Recognized as the open source leader in each of its market categories
  47. Talend's Unique Integration Solution
      Best-of-breed solutions (Data Quality, Data Integration, MDM, ESB, BPM) plus the Talend Unified Platform:
      • Studio: comprehensive Eclipse-based user interface
      • Repository: same container for metadata and project information
      • Deployment: web-based deployment and scheduling
      • Execution: consolidated batch processing, message routing and services
      • Monitoring: single web-based monitoring console
      ➜ Reduce costs
      ➜ Eliminate risk
      ➜ Reuse skills
      ➜ Economies of scale
      ➜ Incremental adoption
      Recognized as the open source leader in each of its market categories by all industry analysts.
  48. Solutions that Scale
      Big Data, Data Quality, Data Integration, MDM, ESB and BPM share the Talend Unified Platform (Studio, Repository, Deployment, Execution, Monitoring).
      UNIFIED PLATFORM: a shared foundation and toolset increases resource reuse.
      CONVERGED INTEGRATION: use for any data, application and process project.
  49. The 6 Dimensions of BIG Data
      Primary challenges:
      • Volume
      • Velocity
      • Variety
      And also:
      • Complexity
      • Validation
      • Lineage
  50. What Is BIG Data?
      "Big data" is information of extreme size, diversity, complexity and need for rapid processing.
      Ted Friedman, Information Infrastructure and Big Data Projects Key Initiative Overview, July 2011
      • 3,500 tweets per second (June 2011)
      • 1,000,000 transactions per day at Walmart
      • 200 billion (200,000,000,000) intelligent devices by 2015
      • 275 exabytes (275,000,000,000,000,000,000 bytes) of data flowing over the Internet each day by 2020
  51. What is Big Data?
  52. How to define Big Data is…
      Hans Rosling uses big data to analyze world health trends.
      Key Takeaway #1: volume, variety, velocity
  53. Traditional Data Flows
      CRM, ERP and Finance feed ETL and Data Quality, which produce normalized data in a traditional data warehouse used by business analysts, warehouse administrators, business users and executives.
      • Scheduled daily or weekly, sometimes more frequently
      • Volumes rarely exceed terabytes
  54. The new world of big data
      CRM, Social Networking, ERP and Finance feed Big Data.
  55. The new world of big data
      CRM, Social Networking, ERP, Finance and Mobile Devices feed Big Data.
  56. The new world of big data
      CRM, Social Networking, ERP, Finance, Mobile Devices, Transactions, Network Devices and Sensors feed Big Data.
  57. Key Takeaway #2: forces us to think differently
  58. Data driven business
      Governance enables data; data supports information; information provides value to the business and drives decisions; decisions drive your business.
      "If you can't rely on your information then the result can be missed opportunities, or higher costs."
      Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
  59. BIG data driven business
      The same chain at big-data scale: governance enables BIG data; BIG data supports information; information provides value to the business and drives BIG decisions; BIG decisions drive BIG business.
      (Same West and Fowler quote and citation as the previous slide.)
  60. How is big data integration being used?
      Use cases:
      • Recommendation Engine
      • Sentiment Analysis
      • Risk Modeling
      • Fraud Detection
      • Behavior Analysis
      • Marketing Campaign Analysis
      • Customer Churn Analysis
      • Social Graph Analysis
      • Customer Experience Analytics
      • Network Monitoring
      BUT: to what level is DQ required for your use case?
  61. Key Takeaway #3: define your use case