Talend Open Studio: How I Learned To Relax And Enjoy ETL

A presentation given at the Denver Open Source Users Group on Tuesday, March 3, 2009.

Published in: Technology

  • I’m your presenter.
  • I work with a lot of open source technologies.
  • I’m also a Java developer.
  • I haven’t written a lot of Java in the past year. Mostly these days I write Groovy.
  • I participate in the local development community by serving on the boards of www.denveropensource.org and www.iasadenver.org.
  • The August Technology Group is my consulting firm. The group’s cardinality is not large.
  • I’m all those things, but I am not a DBA. Being a DBA is a valuable specialty that one has to pursue to the exclusion of being a developer. Notwithstanding that fact, we’re going to pretend to be DBAs in this talk.
  • As enterprise and web developers, our worlds are filled with data.
  • Unfortunately, the data is not always in the condition we’d like. It’s in the wrong place, it’s not clean, and it’s stored in containers that might not suit our purposes.
  • Businesses run with data in this condition, so it’s hardly worthless. There are some things it can’t do in this form, though.
  • To be useful for analysis purposes, data should be in one place, and should be structured in such a way that it cooperates with the analysis process, rather than the read-write processes required by the operation of the business.
  • We’re going to be looking at the free and open-source Talend Open Studio to help us with this problem.
  • We’ll also refer to some basic data warehousing theory covered in this book. It’s a great reference if you need to know the topic.


  • Talend wants to provide a lower-cost ETL tool set still capable of enterprise-grade work.
  • We’ll be looking at the free product in this presentation, but they also offer subscription products with features useful for team environments.

  • Your Talend jobs will generate code. We’ll be looking at the Java option, but you can have them generate Perl code as well. I suspect the Perl option doesn’t see many enterprise deployments.
  • Often text that appears in the UI will sound kind of French. To American audiences, this lends the program an air of elegance! :)
  • Their web site, http://www.talend.com, for further research.
  • Let’s look over the basic parts of Talend Open Studio that we’ll be interacting with.
  • We will focus our attention on the Job Designer and the Metadata Repository. The ability to add custom Java and Groovy code is a very important feature of the product, but we don’t have time to cover it here.
  • The Business Modeler is a very simple drawing program that is for documentation purposes only. You’re better off using OmniGraffle or Visio, and focusing your attention on the things Talend does well.
  • In the Job Designer, we’ll drag visual components from a palette, configure their properties, and connect them with data and event flows.
  • Here’s a screen shot of a simple job.
  • The Metadata Repository holds more than just schemas. It also holds actual database connections (potentially to disparate relational sources), input and output file definitions, and even WSDL files if you’re connecting to SOAP web services.
  • Let’s build a simple transform of an OPML file into an Excel spreadsheet containing a list of the RSS feeds and a text file listing the different types of links in the file.
  • Now, a very brief review of data warehousing by a non-DBA.
  • First there is the system or set of systems containing the data of interest.
  • Transactionality ensures consistency in the presence of write operations that may fail while in process. It works great when there are many small reads and writes. It doesn’t work well for large bulk writes.
  • We all know the great things normalization does for us, and we love it. However, a normalized schema is a lot harder for business users to query.
  • We really shouldn’t be running reports against a production system. Even if we don’t break it (and given enough time, we probably will), we will certainly slow it down. We should keep our hands off.
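
The OPML demo mentioned earlier in these notes is built visually in Talend, and the job itself is not reproduced here. As a rough hand-coded analogue of the same data flow, here is a minimal Python sketch. The sample OPML document, the function names, and the use of CSV in place of Excel (to stay within the standard library) are all assumptions made for illustration; they are not taken from the talk.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; a real OPML export would be read from a file.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="Tech">
      <outline text="Talend Blog" type="rss" xmlUrl="http://example.com/talend.xml"/>
      <outline text="Groovy News" type="rss" xmlUrl="http://example.com/groovy.xml"/>
    </outline>
    <outline text="Some Page" type="link" url="http://example.com/page"/>
  </body>
</opml>"""


def extract_outlines(opml_text):
    """Extract: return every <outline> element that carries a type attribute."""
    root = ET.fromstring(opml_text)
    return [o for o in root.iter("outline") if o.get("type")]


def feed_rows(outlines):
    """Transform: keep only RSS outlines, as (title, feed URL) rows."""
    return [(o.get("text", ""), o.get("xmlUrl", ""))
            for o in outlines if o.get("type") == "rss"]


def link_types(outlines):
    """Transform: the distinct outline types found in the file, sorted."""
    return sorted({o.get("type") for o in outlines})


def write_feeds_csv(rows, out):
    """Load: write the feed list; CSV stands in for the Excel output."""
    writer = csv.writer(out)
    writer.writerow(["title", "xml_url"])
    writer.writerows(rows)


outlines = extract_outlines(SAMPLE_OPML)
feeds = feed_rows(outlines)
types = link_types(outlines)

feeds_out = io.StringIO()          # stands in for the spreadsheet file
write_feeds_csv(feeds, feeds_out)
types_text = "\n".join(types) + "\n"  # stands in for the text file
```

Each function corresponds loosely to one component you would drag onto the Job Designer canvas: one input, two transforms, two outputs.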

  • The data warehouse knows nothing of substance other than what it gets from the system of record.
  • We only ever write to the data warehouse in large, bulk updates, so we don’t necessarily want transactions. They wouldn’t help us, but they would slow us down.
  • The tables in the warehouse are often very wide, containing lots of null fields as necessary. They are not particularly normalized, but they are easy to query.
  • It may be a “production” system of sorts, but when it goes down, lines of business do not close. Decision-making may be compromised, but business can continue. This doesn’t mean it will necessarily lack an uptime SLA, but it does mean that only internal users will be affected by downtime.
  • Let’s transform a transactional database into a data warehouse using Talend Open Studio.
  • Start with this. This is an actual schema from a startup I was involved in. No, you can’t look any closer. :)
  • This is what we want to produce. To be fair, this is just a part of the order schema. The whole ERD would take a bigger warehouse schema to represent it all.
  • Extracting means obtaining and copying data from disparate sources. These might just be queries against a relational DB, or they might be XML files, text lookup tables, Excel, Access, web services, web logs...
  • Extracted data should be validated. Make sure the data meets constraints, referential integrity is enforced, business rules are met, etc.
  • Because it comes from all over, different data models must be reconciled. Customer Support and Sales might both create account records; are they the same?
  • This extracted, cleaned, conformed data is transformed into a standard schema form, regardless of how the DB designer built the transactional database.
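  • The snowflake schema is the standard pattern of the data warehouse. (Strictly speaking, a fact table surrounded by denormalized dimensions that have no foreign keys of their own is usually called a star schema; a snowflake schema normalizes the dimensions further.)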
  • A fact table represents a single measurable business event. In this case, we are measuring an order. It has the measured quantities that apply to the fact’s grain—the kind of thing it is measuring—plus foreign keys to dimensions.
  • Dimensions supply detail about a fact. They are heavily denormalized, possibly 100 or 200 columns wide, and may have many nulls. Generally they do not have any foreign keys, so all columns are data values.
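
To make the extract, clean, and load steps and the fact and dimension tables above a little more concrete, here is a minimal Python sketch using the standard library's sqlite3 module. The source tables, sample rows, and variable names are hypothetical. Only the order_fact and user_dimension column names are borrowed from the slides, and only a subset of them. This is a hand-coded illustration of the idea, not what a Talend job generates.

```python
import sqlite3

# Hypothetical, heavily simplified system of record (normalized source tables).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, company_id INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                         subtotal REAL, shipping_cost REAL, placed_at TEXT);
    INSERT INTO companies VALUES (1, 'Acme');
    INSERT INTO users VALUES (10, ' alice ', 1), (11, 'bob', NULL);
    INSERT INTO orders VALUES
        (100, 10, 40.0, 5.0, '2009-03-01T10:00:00'),
        (101, 11, 20.0, 2.5, '2009-03-02T11:30:00'),
        (102, 99, 15.0, 1.0, '2009-03-02T12:00:00');  -- user 99 does not exist
""")

# Warehouse: one fact table pointing at one denormalized dimension.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE user_dimension (
        id INTEGER PRIMARY KEY,   -- surrogate key
        user_id INTEGER,          -- business key from the source system
        username TEXT,
        company TEXT              -- copied in, so no join is needed later
    );
    CREATE TABLE order_fact (
        user_id INTEGER REFERENCES user_dimension(id),
        order_timestamp TEXT,
        subtotal REAL,
        shipping_cost REAL,
        total REAL
    );
""")

# Extract: copy the data out of the system of record.
users = source.execute("""
    SELECT u.id, u.username, c.name
    FROM users u LEFT JOIN companies c ON c.id = u.company_id
    ORDER BY u.id
""").fetchall()
orders = source.execute(
    "SELECT user_id, placed_at, subtotal, shipping_cost FROM orders ORDER BY id"
).fetchall()

# Clean and load the dimension, remembering business key -> surrogate key.
surrogate = {}
for business_key, username, company in users:
    cur = warehouse.execute(
        "INSERT INTO user_dimension (user_id, username, company) VALUES (?, ?, ?)",
        (business_key, username.strip(), company),
    )
    surrogate[business_key] = cur.lastrowid

# Validate referential integrity, then load the facts in one bulk write.
rejected = [o for o in orders if o[0] not in surrogate]
fact_rows = [
    (surrogate[user_id], placed_at, subtotal, shipping, subtotal + shipping)
    for user_id, placed_at, subtotal, shipping in orders
    if user_id in surrogate
]
warehouse.executemany("INSERT INTO order_fact VALUES (?, ?, ?, ?, ?)", fact_rows)
warehouse.commit()

# Analysis runs against the warehouse, not the production system.
report = warehouse.execute("""
    SELECT d.username, d.company, SUM(f.total)
    FROM order_fact f JOIN user_dimension d ON d.id = f.user_id
    GROUP BY d.id ORDER BY d.username
""").fetchall()
```

The order that references a nonexistent user ends up in `rejected` instead of the fact table, which is the kind of check the cleaning step is for. What to do with rejected rows (log them, fix them, or hold them for review) is a business decision this sketch leaves open.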


  • After your transactional data is transformed into the warehouse, the offline snowflake schema is now available to standard tools that can perform analysis and reports more easily, and without stressing the production system.
  • Does visual programming really work? In general, I don’t believe it does. But this is not the general case; it is ETL. The problem domain is specialized enough that visual tools seem to be pretty effective.
  • Custom coding is always an option for ETL systems (in Groovy or any other language), but it’s probably not the right option. The constraints imposed by a tool will probably result in a simpler, more robust system.

  • ETL is not cheap.
  • This is an open source group, so we’ll stick to the open source options. The commercial offerings are surely worthy products that have their own strengths.


  • All images used in this presentation are either licensed from iStockphoto.com or are Creative Commons Attribution works from flickr.com.
  • Talend Open Studio: How I Learned To Relax And Enjoy ETL

    1. TALEND OPEN STUDIO OR, HOW I LEARNED TO RELAX AND ENJOY ETL Thursday, March 5, 2009
    2. TIM BERGLUND
    3. (no text)
    4. (no text)
    5. (no text)
    6. DOSUG
    7. (no text)
    8. DBA
    9. (no text)
    10. (no text)
    11. ACTUAL DATA: DISPERSED, IDIOSYNCRATIC, MESSY
    12. DATA THE BUSINESS LIKES: CENTRALIZED, CONSISTENT, ANSWERS QUESTIONS
    13. OUR SOFTWARE
    14. OUR REFERENCE
    15. ABOUT: AN OPEN-SOURCE STARTUP
    16. ABOUT: BASED IN FRANCE
    17. ABOUT: ENVISIONS LOW-COST ETL (EXTRACT, TRANSFORM, AND LOAD)
    18. ABOUT: FREE AND SUBSCRIPTION PRODUCTS
    19. ABOUT: ECLIPSE-BASED
    20. ABOUT: THEREFORE JAVA-BASED, BUT HAS STRANGE PERL OPTION
    21. ABOUT: HAS A FRENCH ACCENT
    22. ABOUT: http://www.talend.com
    23. BASIC COMPONENTS
    24. Business Modeler, Data Processing “Jobs”, Custom Java and Groovy Scripts, Metadata Repository
    25. BUSINESS MODELER
    26. JOB DESIGNER: VISUAL WORKSPACE WHERE DEVELOPMENT TAKES PLACE
    27. JOB DESIGNER
    28. METADATA REPOSITORY: DATABASE CONNECTIONS, INPUT FILES, OUTPUT FILES, SCHEMAS, WEB SERVICES
    29. DEMO, PLX
    30. AND NOW, SOME THEORY
    31. THE SYSTEM OF RECORD
    32. THE SYSTEM OF RECORD: TRANSACTIONAL
    33. THE SYSTEM OF RECORD: HEAVILY NORMALIZED (IDEALLY)
    34. THE SYSTEM OF RECORD: PRODUCTION SYSTEM
    35. THE DATA WAREHOUSE
    36. THE DATA WAREHOUSE: DERIVED
    37. THE DATA WAREHOUSE: NONTRANSACTIONAL
    38. THE DATA WAREHOUSE: DENORMALIZED
    39. THE DATA WAREHOUSE: IMPORTANT, BUT OFFLINE
    40. HOW TO DO IT
    41. (no text)
    42. (no text)
    43. EXTRACT, TRANSFORM, AND LOAD: EXTRACTING
    44. EXTRACT, TRANSFORM, AND LOAD: CLEANING
    45. EXTRACT, TRANSFORM, AND LOAD: CONFORMING
    46. EXTRACT, TRANSFORM, AND LOAD: THE SNOWFLAKE SCHEMA
    47. THE SNOWFLAKE SCHEMA: ONE FACT REFERENCING MANY DIMENSIONS
    48. THE FACT TABLE: order_fact: user_id (FK), shipping_location_id (FK), billing_location_id (FK), payment_method_id (FK), line_item_group_id (FK), order_timestamp (fact), total (fact), subtotal (fact), shipping_cost (fact)
    49. THE DIMENSION TABLE: user_dimension: id (PK), user_id (business key), username, first_name, last_name, company, show_only_same_mfg, show_nonzero_inventory, mailing_list_opt_in
    50. PUT TOGETHER: order_fact: user_id (FK), shipping_location_id (FK), billing_location_id (FK), payment_method_id (FK), line_item_group_id (FK), order_timestamp (fact), total (fact), subtotal (fact), shipping_cost (fact); user_dimension: id (PK), internal_user_id (business key), username, first_name, last_name, company, city, state, mailing_list_opt_in (other dimension tables on this slide are cropped and not recoverable)
    51. LET’S SEE ANOTHER DEMO!
    52. NOW WHAT? REPORTING, OLAP TOOLS, BUSINESS DASHBOARDS, DATA LIFE CYCLE OPTIONS
    53. A CRITIQUE: DOES VISUAL PROGRAMMING REALLY WORK?
    54. A CRITIQUE: WHY NOT JUST USE GROOVY?
    55. COMMERCIAL OPTIONS: THERE ARE MANY
    56. COMMERCIAL OPTIONS: THEY ALL PROBABLY INVOLVE GOLF
    57. COMMERCIAL OPTIONS: THIS IS DOSUG, SO WE’LL MOVE ON
    58. ACKNOWLEDGEMENTS: THANKS TO www.intellidata.net FOR THE TEST DATA!
    59. THANK YOU! TIM BERGLUND, AUGUST TECHNOLOGY GROUP, LLC, http://www.augusttechgroup.com, tim.berglund@augusttechgroup.com, @tlberglund
    60. PHOTO CREDITS: OIL DRUMS: HTTP://WWW.FLICKR.COM/PHOTOS/THE_JUSTIFIED_SINNER/2720599186/; JUNGLE: HTTP://WWW.FLICKR.COM/PHOTOS/LOLLYKNIT/1155225799/; FRENCH GARDEN: HTTP://WWW.FLICKR.COM/PHOTOS/NOMAD-PHOTOGRAPHY/23295537/; SNOWFLAKE: HTTP://WWW.FLICKR.COM/PHOTOS/JOHNCHARLTON/360919818/
