Intro to Talend Open Studio for Data Integration

An overview of Talend Open Studio for Data Integration, along with some tips learned from building production jobs and a list of resources. Feel free to contact me for more information.

Published in: Technology

  1. Intro to Talend Open Studio for Data Integration
     Philip Yurchuk
     http://philip.yurchuk.com
  2. What is Talend?
     • Eclipse-based visual programming editor
     • Generates executable Java code
     • Jobs can run standalone or embedded (no special server)
     • Batch or interactive (user input)
  3. What is ETL?
     • Extract: suck up data
     • Transform: mess with it
     • Load: blow it out
     • Batch, integration, migration, etc.
  4. Extract from/load to where?
     • Over 600 components
     • Over 450 connectors
     • Allows multiple inputs/outputs in a single job
  5. Connectors
     • Flat files
       • Delimited (tab, CSV…)
       • XML
       • JSON
       • Excel
       • Positional
       • Apache HTTP logs, HL7...
     • Applications/Platforms
       • Alfresco
       • Microsoft Dynamics (CRM, AX)
       • SAP
       • Sage ERP X3
       • Salesforce
       • SugarCRM
  6. Connectors (continued)
     • Relational Databases
       • MySQL
       • PostgreSQL
       • MS SQL
       • Oracle
       • Many more
     • NoSQL/Columnar/OLAP/Other
       • Amazon Redshift
       • Greenplum
       • Hive
       • OLAP cubes
       • LDAP
       • VectorWise
       • Teradata
       • More in Big Data ed.
  7. How do we transport data?
     • File system
     • FTP
     • SFTP/SCP
     • Web service (SOAP, REST)
     • HTTP
     • Mail, POP
     • XMLRPC, Sockets, JMS, RSS...
  8. Other Components
     • Process data: join, filter, aggregate
     • Flow control: loops, job invocation
     • Logs, statistics
     • Code: Java, Groovy
       • On row data or standalone
       • Can load libraries
  9. Demo
  10. Nifty Components
     • FuzzyMatch - calculate Levenshtein distance or phonetic similarity
     • IntervalMatch - perform lookup/join based on values falling within an interval
     • Replace, ReplaceList - search and replace, substitution
     • UniqRow - output distinct rows based on defined key columns
  11. More Nifty Components
     • XMLMap - allows joins, column or row filtering, transformations, and multiple outputs
     • Normalize/Denormalize - split delimited strings into columns or join columns into a string
     • AggregateRow - GROUP BY; min, max, sum, and other functions used to aggregate rows on a column
  12. Tips and Tricks
     • CamelCase job names for embedded jobs.
       • Or prefix with ETL phase and order of execution.
     • Whenever appropriate (esp. for inserting data), use the schema from the repository.
     • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.
  13. Tips and Tricks
     • Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
     • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
  14. Tips and Tricks
     • Failure handling subjob:
       • It's an unconnected job (no triggers point to it).
       • Use LogCatcher to catch and record component failures.
       • Record the failure in a DB, file, email, etc.
       • Add a rollback component to undo DB changes if necessary. You may need to do this in the job itself if strategic placement is needed.
  15. Tips and Tricks
     • In Java expressions, use methods, not operators. E.g., concat(String) instead of the + operator, equals(Object) instead of ==.
     • Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
  16. Tips and Tricks
     • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
     • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
  17. Tips and Tricks
     • Use a context for job variables.
       • Note you can specify a type for each variable.
       • You can read the context from a file or database, or pass one in if the job is embedded in a Java application.
  18. Tips and Tricks
     • For multi-host deployment:
       • Export the job with a “bootstrap” context that has all variables, but populates only a context config location that is the same for all machines.
       • The context config file has all values required for that host, e.g. the test DB connection for the test machine.
       • You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C:\Data
       • Be mindful of file permissions for sensitive context data (e.g., DB password).
  19. Tips and Tricks
     • Use “Bulk” output components when possible.
     • For transactional behavior:
       • Start the job with a DB connection.
       • Check “Use existing connection” in all relevant components.
       • Check “Die on error” in all relevant components.
       • End the job with a commit component.
  20. Room for Improvement
     • UI stability
     • Documentation
  21. Books
     • Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
     • Talend Open Studio Cookbook by Rick Daniel Barton
     • Big Data book coming…
  22. Talend Forge
     • http://www.talendforge.org/
     • Forum - super helpful
     • Exchange - free community components!
     • Tutorials
     • Bug tracker
     • Source code
  23. Talend Resources
     • http://www.talend.com/resources
     • Help Center
     • Knowledge Base
     • Webinars, screencasts
     • Tutorials
     • Docs are on the download page
     • And by pressing F1 on a component
  24. Questions? Compliments? Consulting gigs?
     • Contact me:
       • philip@yurchuk.com
       • http://philip.yurchuk.com
       • http://www.linkedin.com/in/philipyurchuk/
  25. Thank You!
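
The slide 15 tip about using methods rather than operators in Java expressions can be sketched in plain Java. This is only an illustration of the underlying language behavior, not code taken from the deck or generated by Talend: the class name ExpressionTips and the helpers sameStatus and fullName are hypothetical, and putting the known-good value on the left of equals is a common idiom assumed here, not something the slides prescribe.

```java
public class ExpressionTips {

    // Hypothetical helper: compares by content. Calling equals on the
    // expected value means a null row value gives false, not an exception.
    static boolean sameStatus(String actual, String expected) {
        return expected != null && expected.equals(actual);
    }

    // Hypothetical helper: builds a string with concat(String) calls.
    // Both arguments must be non-null, since concat throws on null.
    static String fullName(String first, String last) {
        return first.concat(" ").concat(last);
    }

    public static void main(String[] args) {
        // new String(...) stands in for a value read from a row at runtime:
        // same characters as the literal, but a different object.
        String fromRow = new String("ACTIVE");
        String literal = "ACTIVE";

        System.out.println(fromRow == literal);           // false: compares references
        System.out.println(fromRow.equals(literal));      // true: compares content
        System.out.println(sameStatus(null, literal));    // false: null handled safely
        System.out.println(fullName("Ada", "Lovelace"));  // Ada Lovelace
    }
}
```

The == comparison is the one that tends to bite in filter and mapping expressions, because values read from a file or database are separate objects even when their text matches a literal.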