Intro to Talend Open Studio for Data Integration

An overview of Talend Open Studio for Data Integration, along with some tips learned from building production jobs and a list of resources. Feel free to contact me for more information.

Published in: Technology

Transcript of "Intro to Talend Open Studio for Data Integration"

  1. Intro to Talend Open Studio for Data Integration
     Philip Yurchuk
     http://philip.yurchuk.com
  2. What is Talend?
     - Eclipse-based visual programming editor
     - Generates executable Java code
     - Jobs can run standalone or embedded (no special server)
     - Batch or interactive (user input)
  3. What is ETL?
     - Extract: suck up data
     - Transform: mess with it
     - Load: blow it out
     - Batch, integration, migration, etc.
  4. Extract from/load to where?
     - Over 600 components
     - Over 450 connectors
     - Allows multiple inputs/outputs in a single job
  5. Connectors
     - Flat files
       - Delimited (tab, CSV…)
       - XML
       - JSON
       - Excel
       - Positional
       - Apache HTTP logs, HL7...
     - Applications/Platforms
       - Alfresco
       - Microsoft Dynamics (CRM, AX)
       - SAP
       - Sage ERP X3
       - Salesforce
       - SugarCRM
  6. Connectors (continued)
     - Relational Databases
       - MySQL
       - PostgreSQL
       - MS SQL
       - Oracle
       - Many more
     - NoSQL/Columnar/OLAP/Other
       - Amazon Redshift
       - Greenplum
       - Hive
       - OLAP cubes
       - LDAP
       - VectorWise
       - Teradata
       - More in the Big Data edition
  7. How do we transport data?
     - File system
     - FTP
     - SFTP/SCP
     - Web service (SOAP, REST)
     - HTTP
     - Mail, POP
     - XMLRPC, Sockets, JMS, RSS...
  8. Other Components
     - Process data: join, filter, aggregate
     - Flow control: loops, job invocation
     - Logs, statistics
     - Code: Java, Groovy
       - On row data or standalone
       - Can load libraries
  9. Demo
  10. Nifty Components
      - FuzzyMatch: calculate Levenshtein distance or phonetic similarity
      - IntervalMatch: perform a lookup/join based on values falling within an interval
      - Replace, ReplaceList: search and replace, substitution
      - UniqRow: output distinct rows based on defined key columns
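
To make the Levenshtein distance mentioned for FuzzyMatch concrete, here is a minimal plain-Java sketch of the classic dynamic-programming algorithm. It is not Talend's internal implementation; the class and method names are made up for illustration, and phonetic matching is not covered.

```java
final class LevenshteinSketch {

    /** Number of single-character insertions, deletions, or substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        // prev holds the distances for the previous row of the DP table, curr the row being filled.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

A fuzzy lookup would then accept a reference row when its distance to the input value falls within a chosen range.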
  11. More Nifty Components
      - XMLMap: allows joins, column or row filtering, transformations, and multiple outputs
      - Normalize/Denormalize: split a delimited string into separate rows, or join the values of several rows into one delimited string
      - AggregateRow: GROUP BY; min, max, sum, and other functions used to aggregate rows on a column
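
As a rough illustration of what Normalize and Denormalize do to a single field, the following plain-Java sketch splits a delimited string into separate values and joins them back. The class and method names are hypothetical, and the real components work on whole data flows with more options than shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NormalizeSketch {

    /** Splits one delimited field into individual, trimmed values. */
    static List<String> normalize(String field, String separator) {
        List<String> values = new ArrayList<>();
        // Pattern.quote keeps separators such as "|" or "." from being read as regex syntax.
        for (String part : field.split(Pattern.quote(separator), -1)) {
            values.add(part.trim());
        }
        return values;
    }

    /** Joins individual values back into one delimited field. */
    static String denormalize(List<String> values, String separator) {
        return String.join(separator, values);
    }

    public static void main(String[] args) {
        List<String> tags = normalize("red, green,blue", ",");
        System.out.println(tags);                   // [red, green, blue]
        System.out.println(denormalize(tags, "|")); // red|green|blue
    }
}
```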
  12. Tips and Tricks
      - Use CamelCase job names for embedded jobs.
        - Or prefix them with the ETL phase and order of execution.
      - Whenever appropriate (esp. for inserting data), use the schema from the repository.
      - When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.
  13. Tips and Tricks
      - Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
      - On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to select only the fields you need. Propagating the schema is useful then.
  14. Tips and Tricks
      - Failure handling subjob:
        - It's an unconnected job (no triggers point to it).
        - Use LogCatcher to catch and record component failures.
        - Record the failure in a DB, file, email, etc.
        - Add a rollback component to undo DB changes if necessary. You may need to do this in the job itself if strategic placement is needed.
  15. Tips and Tricks
      - In Java expressions, use methods, not operators. E.g., concat(String) instead of the + operator, equals(Object) instead of ==.
      - Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
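
The advice on methods versus operators can be seen in a small standalone Java example. The class, method names, and sample values below are illustrative only; the point is that == compares object references, so two strings with the same text can still compare as different.

```java
final class StringComparisonSketch {

    /** Null-safe content comparison: calling equals on the literal avoids a NullPointerException. */
    static boolean isActive(String status) {
        return "ACTIVE".equals(status);
    }

    /** Joins two strings with concat(String); note that concat throws if either argument is null. */
    static String fullName(String first, String last) {
        return first.concat(" ").concat(last);
    }

    public static void main(String[] args) {
        // Simulates a value read from a row: same text, but a different object than the literal.
        String fromRow = new String("ACTIVE");

        System.out.println(fromRow == "ACTIVE");         // false: compares references
        System.out.println(fromRow.equals("ACTIVE"));    // true: compares contents
        System.out.println(isActive(null));              // false
        System.out.println(fullName("Ada", "Lovelace")); // Ada Lovelace
    }
}
```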
  16. Tips and Tricks
      - When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
      - On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to select only the fields you need. Propagating the schema is useful then.
  17. Tips and Tricks
      - Use a context for job variables.
        - Note that you can specify a type for each variable.
        - You can read values from a file or database, or pass in a context when the job is embedded in a Java application.
  18. Tips and Tricks
      - For multi-host deployment:
        - Export the job with a "bootstrap" context that has all variables, but populates only a context config location that is the same for all machines.
        - The context config file has all values required for that host, e.g. the test DB connection for the test machine.
        - Windows resolves a path with a leading "/" against the root of the current drive (typically the main system drive), so "/Data/" will translate to C:\Data\.
        - Be mindful of file permissions for sensitive context data (e.g., DB password).
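
The bootstrap idea can be sketched in plain Java: the job knows only the location of a per-host config file and reads every other value from it. This is an assumption-laden sketch, not how Talend's own context-loading components are implemented; the properties format and the keys db_host and db_port are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class ContextConfigSketch {

    /** Loads key=value pairs from the host-specific config file. */
    static Properties load(Path configFile) throws IOException {
        Properties context = new Properties();
        try (Reader reader = Files.newBufferedReader(configFile)) {
            context.load(reader);
        }
        return context;
    }

    /** Reads a typed (integer) variable, failing loudly if the host file does not define it. */
    static int intValue(Properties context, String key) {
        String raw = context.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("Missing context variable: " + key);
        }
        return Integer.parseInt(raw.trim());
    }
}
```

A caller would pass the same path on every machine, for example a file under "/Data/", and each host's copy of that file would hold its own connection values.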
  19. Tips and Tricks
      - Use "Bulk" output components when possible.
      - For transactional behavior:
        - Start the job with a DB connection component.
        - Check "use existing connection" in all relevant components.
        - Check "Die on error" in all relevant components.
        - End the job with a commit component.
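
The transactional recipe corresponds roughly to the following plain JDBC pattern: one shared connection, auto-commit off, stop at the first error, and commit only at the end. This is a sketch of the general idea using standard java.sql calls, not the code Talend generates; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class TransactionalLoadSketch {

    /** Runs every statement on one shared connection and commits only if all of them succeed. */
    static void runAll(Connection conn, List<String> sqlStatements) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // group all statements into a single transaction
        try (Statement stmt = conn.createStatement()) {
            for (String sql : sqlStatements) {
                stmt.executeUpdate(sql); // the first failure aborts the loop ("die on error")
            }
            conn.commit(); // plays the role of the final commit component
        } catch (SQLException e) {
            conn.rollback(); // undo everything done on this connection so far
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Sharing a single connection is what makes this work: statements run on separate connections would commit or fail independently of each other.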
  20. Room for Improvement
      - UI stability
      - Documentation
  21. Books
      - Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
      - Talend Open Studio Cookbook by Rick Daniel Barton
      - Big Data book coming…
  22. Talend Forge
      - http://www.talendforge.org/
      - Forum: super helpful
      - Exchange: free community components!
      - Tutorials
      - Bug tracker
      - Source code
  23. Talend Resources
      - http://www.talend.com/resources
      - Help Center
      - Knowledge Base
      - Webinars, screencasts
      - Tutorials
      - Docs are on the download page
      - And available by pressing F1 on a component
  24. Questions? Compliments? Consulting gigs?
      - Contact me:
        - philip@yurchuk.com
        - http://philip.yurchuk.com
        - http://www.linkedin.com/in/philipyurchuk/
  25. Thank You!