Talend Open Studio: How I Learned To Relax And Enjoy ETL
 

A presentation given at the Denver Open Source Users Group on Tuesday, March 3, 2009.

  • I’m your presenter.
  • I work with a lot of open source technologies.
  • I’m also a Java developer.
  • I haven’t written a lot of Java in the past year. Mostly these days I write Groovy.
  • I participate in the local development community by serving on the boards of www.denveropensource.org and www.iasadenver.org.
  • The August Technology Group is my consulting firm. The group’s cardinality is not large.
  • I’m all those things, but I am not a DBA. Being a DBA is a valuable specialty one has to pursue to the exclusion of being a developer. Notwithstanding that fact, we’re going to pretend to be DBAs in this talk.
  • As enterprise and web developers, our worlds are filled with data.
  • Unfortunately, the data is not always in the condition we’d like. It’s in the wrong place, it’s not clean, and it’s stored in containers that might not suit our purposes.
  • Businesses run with data in this condition, so it’s hardly worthless. There are some things it can’t do in this form, though.
  • To be useful for analysis purposes, data should be in one place, and should be structured in such a way that it cooperates with the analysis process, rather than the read-write processes required by the operation of the business.
  • We’re going to be looking at the free and open-source Talend Open Studio to help us with this problem.
  • We’ll also refer to some basic data warehousing theory covered in this book. It’s a great reference if you need to know the topic.
  • Talend wants to provide a lower-cost ETL tool set still capable of enterprise-grade work.
  • We’ll be looking at the free product in this presentation, but they also offer subscription products with features useful for team environments.
  • Your Talend jobs will generate code. We’ll be looking at the Java option, but you can have them generate Perl code as well. I suspect the Perl option doesn’t see many enterprise deployments.
  • Often text that appears in the UI will sound kind of French. To American audiences, this lends the program an air of elegance! :)
  • Their web site, for further research.
  • Let’s look over the basic parts of Talend Open Studio that we’ll be interacting with.
  • We will focus our attention on the Job Designer and the Metadata Repository. The ability to add custom Java and Groovy code is a very important feature of the product, but we don’t have time to cover it here.
  • The Business Modeler is a very simple drawing program that is for documentation purposes only. You’re better off using OmniGraffle or Visio, and focusing your attention on the things Talend does well.
  • In the Job Designer, we’ll drag visual components from a palette, configure their properties, and connect them with data and event flows.
  • Here’s a screenshot of a simple job.
  • The Metadata Repository holds more than just schemas. It also holds actual database connections (potentially to disparate relational sources), input and output file definitions, and even WSDL files if you’re connecting to SOAP web services.
  • Let’s build a simple transform of an OPML file into an Excel spreadsheet containing a list of the RSS feeds and a text file listing the different types of links in the file.
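For readers who want a feel for what the demo job does, here is a minimal hand-written Python sketch of the same kind of transform. It is not what Talend generates. The sample OPML document and the function names are hypothetical, and CSV output stands in for the Excel spreadsheet because the Python standard library has no Excel writer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample OPML; a feed reader export has the same general shape.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.0">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="News">
      <outline text="Example Blog" type="rss" xmlUrl="http://example.com/feed.xml"/>
      <outline text="Example Site" type="link" url="http://example.org/"/>
    </outline>
    <outline text="Other Feed" type="rss" xmlUrl="http://example.net/rss"/>
  </body>
</opml>
"""


def extract_feeds(opml_text):
    """Return (feeds, link_types) found in an OPML document.

    feeds is a list of (title, feed_url) tuples in document order;
    link_types is a sorted list of the distinct "type" attribute values.
    """
    root = ET.fromstring(opml_text)
    feeds = []
    types = set()
    for outline in root.iter("outline"):
        link_type = outline.get("type")
        if link_type:
            types.add(link_type)
        if outline.get("xmlUrl"):
            feeds.append((outline.get("text", ""), outline.get("xmlUrl")))
    return feeds, sorted(types)


def feeds_to_csv(feeds):
    """Render the feed list as CSV text (a stand-in for the Excel output)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "feed_url"])
    writer.writerows(feeds)
    return buffer.getvalue()


feeds, link_types = extract_feeds(SAMPLE_OPML)
print(feeds_to_csv(feeds))
print("\n".join(link_types))
```

In the Talend version of this job, the parsing, the two outputs, and the flow between them are components on the canvas rather than functions.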
  • Now, a very brief review of data warehousing by a non-DBA.
  • First there is the system or set of systems containing the data of interest.
  • Transactionality ensures consistency in the presence of write operations that may fail while in process. It works great when there are many small reads and writes. It doesn’t work well for large bulk writes.
  • We all know the great things normalization does for us, and we love it. However, a normalized schema is a lot harder for business users to query.
  • We really shouldn’t be running reports against a production system. Even if we don’t break it (and given enough time, we probably will), we will certainly slow it down. We should keep our hands off.
  • The data warehouse knows nothing of substance other than what it gets from the system of record.
  • We only ever write to the data warehouse in large, bulk updates, so we don’t necessarily want transactions. They wouldn’t help us, but they would slow us down.
  • The tables in the warehouse are often very wide, containing lots of null fields as necessary. The schema is not particularly normalized, but it is easy to query.
  • It may be a “production” system of sorts, but when it goes down, lines of business do not close. Decision-making may be compromised, but business can continue. This doesn’t mean it will necessarily lack an uptime SLA, but it does mean that only internal users will be affected by downtime.
  • Let’s transform a transactional database into a data warehouse using Talend Open Studio.
  • Start with this. This is an actual schema from a startup I was involved in. No, you can’t look any closer. :)
  • This is what we want to produce. To be fair, this is just a part of the order schema. The whole ERD would take a bigger warehouse schema to represent it all.
  • Extracting is obtaining and copying disparate data sources. Might just be queries from a relational DB, might be XML files, text lookup tables, Excel, Access, web services, web logs...
  • Extracted data should be validated. Make sure data meets constraints, referential integrity is enforced, business rules are met, etc.
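The cleaning step can be pictured as a filter that separates rows that pass every check from rows that fail at least one. The Python sketch below is only an illustration. The row layout, the two rules (a non-negative total and a user_id that references a known user), and the function name are assumptions, not rules from the talk.

```python
def validate_orders(orders, known_user_ids):
    """Split extracted order rows into valid rows and rejected rows.

    Each rejected entry is a (row, problems) pair so the reasons can be
    reported back to whoever owns the source data.
    """
    valid, rejected = [], []
    for row in orders:
        problems = []
        # Constraint check (hypothetical rule): totals may not be negative.
        if row.get("total") is None or row["total"] < 0:
            problems.append("total must be a non-negative number")
        # Referential integrity check: the order must point at a known user.
        if row.get("user_id") not in known_user_ids:
            problems.append("user_id does not reference a known user")
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical extracted rows.
orders = [
    {"order_id": 1, "user_id": 10, "total": 25.0},
    {"order_id": 2, "user_id": 99, "total": 10.0},
    {"order_id": 3, "user_id": 10, "total": -5.0},
]
valid, rejected = validate_orders(orders, known_user_ids={10, 11})
for row, problems in rejected:
    print(row["order_id"], problems)
```

Keeping the rejected rows together with their reasons, instead of silently dropping them, makes it possible to fix the problem at the source.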
  • Because it comes from all over, different data models must be reconciled. Customer Support and Sales might both create account records. Are they the same?
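One way to picture conforming is a merge keyed on some attribute both systems share. The Python sketch below assumes, purely for illustration, that a normalized email address identifies the same account in both systems. Real matching rules are a business decision and are usually messier. All names and sample records are hypothetical.

```python
def conform_accounts(support_accounts, sales_accounts):
    """Merge account records from two systems into one conformed list.

    Records are matched on a normalized email address (an assumption made
    for this sketch); each result remembers which systems it was seen in.
    """
    merged = {}
    sources = (("support", support_accounts), ("sales", sales_accounts))
    for source, accounts in sources:
        for account in accounts:
            key = account["email"].strip().lower()
            record = merged.setdefault(
                key, {"email": key, "name": account["name"], "sources": []}
            )
            record["sources"].append(source)
    return list(merged.values())


# Hypothetical records from the two departments.
support = [{"name": "Ann Lee", "email": "Ann@Example.com "}]
sales = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Bob Roy", "email": "bob@example.com"},
]
conformed = conform_accounts(support, sales)
for record in conformed:
    print(record["email"], record["sources"])
```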
  • This extracted, cleaned, conformed data is transformed into a standard schema form, regardless of how the DB designer built the transactional database.
  • The snowflake schema is the standard pattern of the data warehouse.
  • A fact table represents a single measurable business event. In this case, we are measuring an order. It has the measured quantities that apply to the fact’s grain (the kind of thing it is measuring), plus foreign keys to dimensions.
  • Dimensions supply detail about a fact. They are heavily denormalized, possibly 100 or 200 columns wide, and may have many nulls. Generally they do not have any foreign keys, so all columns are data values.
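The fact and dimension relationship described above can be tried out with an in-memory SQLite database. The sketch below uses a subset of the column names from the order_fact and user_dimension slides. The sample rows and the report query are hypothetical. As a side note, a single fact table surrounded by denormalized dimensions that carry no foreign keys of their own is the layout that dimensional-modeling literature usually calls a star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_dimension (
    id         INTEGER PRIMARY KEY,  -- surrogate key used by the warehouse
    user_id    INTEGER,              -- business key from the system of record
    username   TEXT,
    first_name TEXT,
    last_name  TEXT,
    company    TEXT
);
CREATE TABLE order_fact (
    user_id         INTEGER REFERENCES user_dimension(id),
    order_timestamp TEXT,
    total           REAL,
    subtotal        REAL,
    shipping_cost   REAL
);
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO user_dimension VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 501, "alee", "Ann", "Lee", "Acme"),
        (2, 502, "broy", "Bob", "Roy", "Acme"),
        (3, 503, "cdiaz", "Cy", "Diaz", "Globex"),
    ],
)
conn.executemany(
    "INSERT INTO order_fact VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2009-03-01T10:00:00", 100.0, 90.0, 10.0),
        (2, "2009-03-02T11:30:00", 50.0, 45.0, 5.0),
        (3, "2009-03-02T12:15:00", 25.0, 20.0, 5.0),
    ],
)

# A typical report: one join from the fact to a dimension, then aggregate.
rows = conn.execute("""
    SELECT d.company, COUNT(*) AS orders, SUM(f.total) AS revenue
    FROM order_fact AS f
    JOIN user_dimension AS d ON d.id = f.user_id
    GROUP BY d.company
    ORDER BY d.company
""").fetchall()
for company, order_count, revenue in rows:
    print(company, order_count, revenue)
```

The report needs only one join per dimension, which is the "easy to query" property the notes contrast with a normalized transactional schema.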
  • After your transactional data is transformed into the warehouse, the offline snowflake schema is now available to standard tools that can perform analysis and reports more easily, and without stressing the production system.
  • Does visual programming really work? In general, I don’t believe it does. But ETL is not the general case. The problem domain is specialized enough that visual tools seem to be pretty effective.
  • Custom coding is always an option for ETL systems (in Groovy or any other language), but it’s probably not the right option. The constraints imposed by a tool will probably result in a simpler, more robust system.
  • ETL is not cheap.
  • This is an open source group, so we’ll stick to the open source options. The commercial offerings are surely worthy products that have their own strengths.
  • All images used in this presentation are either licensed from iStockphoto.com or are Creative Commons Attribution works from flickr.com.

Talend Open Studio: How I Learned To Relax And Enjoy ETL Presentation Transcript

  • TALEND OPEN STUDIO OR, HOW I LEARNED TO RELAX AND ENJOY ETL Thursday, March 5, 2009
  • TIM BERGLUND
  • DOSUG
  • DBA
  • ACTUAL DATA: DISPERSED, IDIOSYNCRATIC, MESSY
  • DATA THE BUSINESS LIKES: CENTRALIZED, CONSISTENT, ANSWERS QUESTIONS
  • OUR SOFTWARE
  • OUR REFERENCE
  • ABOUT: AN OPEN-SOURCE STARTUP
  • ABOUT: BASED IN FRANCE
  • ABOUT: ENVISIONS LOW-COST ETL (EXTRACT, TRANSFORM, AND LOAD)
  • ABOUT: FREE AND SUBSCRIPTION PRODUCTS
  • ABOUT: ECLIPSE-BASED
  • ABOUT: THEREFORE JAVA-BASED, BUT HAS STRANGE PERL OPTION
  • ABOUT: HAS A FRENCH ACCENT
  • ABOUT: http://www.talend.com
  • BASIC COMPONENTS
  • Business Modeler, Data Processing “Jobs”, Custom Java and Groovy Scripts, Metadata Repository
  • BUSINESS MODELER
  • JOB DESIGNER: VISUAL WORKSPACE WHERE DEVELOPMENT TAKES PLACE
  • JOB DESIGNER
  • METADATA REPOSITORY: DATABASE CONNECTIONS, INPUT FILES, OUTPUT FILES, SCHEMAS, WEB SERVICES
  • DEMO, PLX
  • AND NOW, SOME THEORY
  • THE SYSTEM OF RECORD
  • THE SYSTEM OF RECORD: TRANSACTIONAL
  • THE SYSTEM OF RECORD: HEAVILY NORMALIZED (IDEALLY)
  • THE SYSTEM OF RECORD: PRODUCTION SYSTEM
  • THE DATA WAREHOUSE
  • THE DATA WAREHOUSE: DERIVED
  • THE DATA WAREHOUSE: NONTRANSACTIONAL
  • THE DATA WAREHOUSE: DENORMALIZED
  • THE DATA WAREHOUSE: IMPORTANT, BUT OFFLINE
  • HOW TO DO IT
  • EXTRACT, TRANSFORM, AND LOAD: EXTRACTING
  • EXTRACT, TRANSFORM, AND LOAD: CLEANING
  • EXTRACT, TRANSFORM, AND LOAD: CONFORMING
  • EXTRACT, TRANSFORM, AND LOAD: THE SNOWFLAKE SCHEMA
  • THE SNOWFLAKE SCHEMA: ONE FACT REFERENCING MANY DIMENSIONS
  • THE FACT TABLE: order_fact
    user_id (FK)
    shipping_location_id (FK)
    billing_location_id (FK)
    payment_method_id (FK)
    line_item_group_id (FK)
    order_timestamp (fact)
    total (fact)
    subtotal (fact)
    shipping_cost (fact)
  • THE DIMENSION TABLE: user_dimension
    id (PK)
    user_id (business key)
    username
    first_name
    last_name
    company
    show_only_same_mfg
    show_nonzero_inventory
    mailing_list_opt_in
  • PUT TOGETHER
    user_dimension: id (PK), internal_user_id (business key), username, first_name, last_name, company, city, state, mailing_list_opt_in
    order_fact: user_id (FK), shipping_location_id (FK), billing_location_id (FK), payment_method_id (FK), line_item_group_id (FK), order_timestamp (fact), total (fact), subtotal (fact), shipping_cost (fact)
  • LET’S SEE ANOTHER DEMO!
  • NOW WHAT? REPORTING OLAP TOOLS BUSINESS DASHBOARDS DATA LIFE CYCLE OPTIONS
  • A CRITIQUE: DOES VISUAL PROGRAMMING REALLY WORK?
  • A CRITIQUE: WHY NOT JUST USE GROOVY?
  • COMMERCIAL OPTIONS: THERE ARE MANY
  • COMMERCIAL OPTIONS: THEY ALL PROBABLY INVOLVE GOLF
  • COMMERCIAL OPTIONS: THIS IS DOSUG, SO WE’LL MOVE ON
  • ACKNOWLEDGEMENTS: THANKS TO www.intellidata.net FOR THE TEST DATA!
  • THANK YOU! TIM BERGLUND, AUGUST TECHNOLOGY GROUP, LLC, http://www.augusttechgroup.com, tim.berglund@augusttechgroup.com, @tlberglund
  • PHOTO CREDITS
    OIL DRUMS: HTTP://WWW.FLICKR.COM/PHOTOS/THE_JUSTIFIED_SINNER/2720599186/
    JUNGLE: HTTP://WWW.FLICKR.COM/PHOTOS/LOLLYKNIT/1155225799/
    FRENCH GARDEN: HTTP://WWW.FLICKR.COM/PHOTOS/NOMAD-PHOTOGRAPHY/23295537/
    SNOWFLAKE: HTTP://WWW.FLICKR.COM/PHOTOS/JOHNCHARLTON/360919818/