47. THE SNOWFLAKE SCHEMA
ONE FACT REFERENCING MANY DIMENSIONS
Thursday, March 5, 2009
48. THE FACT TABLE
order_fact
user_id (FK)
shipping_location_id (FK)
billing_location_id (FK)
payment_method_id (FK)
line_item_group_id (FK)
order_timestamp (fact)
total (fact)
subtotal (fact)
shipping_cost (fact)
49. THE DIMENSION TABLE
user_dimension
id (PK)
user_id (business key)
username
first_name
last_name
company
show_only_same_mfg
show_nonzero_inventory
mailing_list_opt_in
50. PUT TOGETHER
order_fact
user_id (FK)
shipping_location_id (FK)
billing_location_id (FK)
payment_method_id (FK)
line_item_group_id (FK)
order_timestamp (fact)
total (fact)
subtotal (fact)
shipping_cost (fact)
user_dimension
id (PK)
internal_user_id (business key)
username
first_name
last_name
company
mailing_list_opt_in
[Diagram: order_fact in the center, its foreign keys pointing at user_dimension and at other, partly hidden dimension tables. Legible fragments of those tables include city and state columns.]
56. COMMERCIAL OPTIONS
THEY ALL PROBABLY INVOLVE
GOLF
57. COMMERCIAL OPTIONS
THIS IS DOSUG, SO WE’LL MOVE ON
58. ACKNOWLEDGEMENTS
THANKS TO www.intellidata.net FOR THE TEST DATA!
59. THANK YOU!
TIM BERGLUND
AUGUST TECHNOLOGY GROUP, LLC
http://www.augusttechgroup.com
tim.berglund@augusttechgroup.com
@tlberglund
60. PHOTO CREDITS
OIL DRUMS: HTTP://WWW.FLICKR.COM/PHOTOS/THE_JUSTIFIED_SINNER/2720599186/
JUNGLE: HTTP://WWW.FLICKR.COM/PHOTOS/LOLLYKNIT/1155225799/
FRENCH GARDEN: HTTP://WWW.FLICKR.COM/PHOTOS/NOMAD-PHOTOGRAPHY/23295537/
SNOWFLAKE: HTTP://WWW.FLICKR.COM/PHOTOS/JOHNCHARLTON/360919818/
Editor's Notes
I’m your presenter.
I work with a lot of open source technologies.
I’m also a Java developer.
I haven’t written a lot of Java in the past year. Mostly these days I write Groovy.
I participate in the local development community by serving on the boards of www.denveropensource.org and www.iasadenver.org.
The August Technology Group is my consulting firm. The group’s cardinality is not large.
I’m all those things, but I am not a DBA. Being a DBA is a valuable specialty that one has to pursue to the exclusion of being a developer. Notwithstanding that fact, we’re going to pretend to be DBAs in this talk.
As enterprise and web developers, our worlds are filled with data.
Unfortunately, the data is not always in the condition we’d like. It’s in the wrong place, it’s not clean, and it’s stored in containers that might not suit our purposes.
Businesses run with data in this condition, so it’s hardly worthless. There are some things it can’t do in this form, though.
To be useful for analysis purposes, data should be in one place, and should be structured in such a way that it cooperates with the analysis process, rather than the read-write processes required by the operation of the business.
We’re going to be looking at the free and open-source Talend Open Studio to help us with this problem.
We’ll also refer to some basic data warehousing theory covered in this book. It’s a great reference if you need to know the topic.
Talend wants to provide a lower-cost ETL tool set still capable of enterprise-grade work.
We’ll be looking at the free product in this presentation, but they also offer subscription products with features useful for team environments.
Your Talend jobs will generate code. We’ll be looking at the Java option, but you can have them generate Perl code as well. I suspect the Perl option doesn’t see many enterprise deployments.
Often text that appears in the UI will sound kind of French. To American audiences, this lends the program an air of elegance! :)
Their web site, for further research.
Let’s look over the basic parts of Talend Open Studio that we’ll be interacting with.
We will focus our attention on the Job Designer and the Metadata Repository. The ability to add custom Java and Groovy code is a very important feature of the product, but we don’t have time to cover it here.
The Business Modeler is a very simple drawing program that is for documentation purposes only. You’re better off using OmniGraffle or Visio, and focusing your attention on the things Talend does well.
In the Job Designer, we’ll drag visual components from a palette, configure their properties, and connect them with data and event flows.
Here’s a screen shot of a simple job.
The Metadata Repository holds more than just schemas. It also holds actual database connections (potentially to disparate relational sources), input and output file definitions, and even WSDL files if you’re connecting to SOAP web services.
Let’s build a simple transform of an OPML file into an Excel spreadsheet containing a list of the RSS feeds and a text file listing the different types of links in the file.
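For comparison, here is roughly what that job does if you write it by hand. This is only a sketch, not what Talend generates: it assumes the usual OPML `outline` attributes (`type`, `text`, `xmlUrl`), the sample document is made up, and it writes CSV instead of a real Excel file so that it needs nothing beyond the Python standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample OPML; a real job would read the file from disk.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="News">
      <outline type="rss" text="Feed A" xmlUrl="http://example.com/a.xml"/>
      <outline type="rss" text="Feed B" xmlUrl="http://example.com/b.xml"/>
      <outline type="link" text="Site C" url="http://example.com/c"/>
    </outline>
  </body>
</opml>"""


def extract_feeds(opml_text):
    """Return (feeds, link_types) found in an OPML document."""
    root = ET.fromstring(opml_text)
    feeds = []
    link_types = set()
    for outline in root.iter("outline"):
        kind = outline.get("type")
        if kind is None:
            continue  # folder-style outlines carry no type attribute
        link_types.add(kind)
        if kind == "rss":
            feeds.append((outline.get("text"), outline.get("xmlUrl")))
    return feeds, sorted(link_types)


def feeds_to_csv(feeds):
    """Render the feed list as CSV text (a stand-in for the Excel output)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "xml_url"])
    writer.writerows(feeds)
    return buffer.getvalue()


feeds, link_types = extract_feeds(SAMPLE_OPML)
print(feeds_to_csv(feeds))
print("\n".join(link_types))
```

One input, two outputs: the same shape as the job we build in the designer.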
Now, a very brief review of data warehousing by a non-DBA.
First there is the system or set of systems containing the data of interest.
Transactionality ensures consistency in the presence of write operations that may fail while in progress. It works great when there are many small reads and writes. It doesn’t work well for large bulk writes.
We all know the great things normalization does for us, and we love it. However, a normalized schema is a lot harder for business users to query.
We really shouldn’t be running reports against a production system. Even if we don’t break it (and given enough time, we probably will), we will certainly slow it down. We should keep our hands off.
The data warehouse knows nothing of substance other than what it gets from the system of record.
We only ever write to the data warehouse in large, bulk updates, so we don’t necessarily want transactions. They wouldn’t help us, but they would slow us down.
The tables in the warehouse are often very wide, containing lots of null fields as necessary. They’re not particularly normalized, but they are easy to query.
It may be a “production” system of sorts, but when it goes down, lines of business do not close. Decision-making may be compromised, but business can continue. This doesn’t mean it will necessarily lack an uptime SLA, but it does mean that only internal users will be affected by downtime.
Let’s transform a transactional database into a data warehouse using Talend Open Studio.
Start with this. This is an actual schema from a startup I was involved in. No, you can’t look any closer. :)
This is what we want to produce. To be fair, this is just a part of the order schema. Representing the whole ERD would take a bigger warehouse schema.
Extracting means obtaining and copying data from disparate sources. These might just be queries against a relational DB, or they might be XML files, text lookup tables, Excel, Access, web services, web logs...
Extracted data should be validated. Make sure data meets constraints, referential integrity is enforced, business rules are met, etc.
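A minimal sketch of that kind of check, outside any ETL tool. The field names come from the order_fact slide; the rule that total must equal subtotal plus shipping_cost is an assumed business rule for illustration only, and amounts are assumed to be integer cents.

```python
def validate_orders(orders, known_user_ids):
    """Split extracted order rows into (valid, rejected) lists."""
    valid, rejected = [], []
    for order in orders:
        problems = []
        measures_ok = True
        # Constraint: every measure must be present and non-negative.
        for field in ("total", "subtotal", "shipping_cost"):
            value = order.get(field)
            if value is None or value < 0:
                problems.append(f"bad {field}")
                measures_ok = False
        # Referential integrity: the order must point at a known user.
        if order.get("user_id") not in known_user_ids:
            problems.append("unknown user_id")
        # Assumed business rule: the total must add up.
        if measures_ok and order["total"] != order["subtotal"] + order["shipping_cost"]:
            problems.append("total does not add up")
        if problems:
            rejected.append((order, problems))
        else:
            valid.append(order)
    return valid, rejected


orders = [
    {"user_id": 1, "total": 2500, "subtotal": 2000, "shipping_cost": 500},
    {"user_id": 7, "total": 1200, "subtotal": 1000, "shipping_cost": 200},
    {"user_id": 2, "total": 999, "subtotal": 1000, "shipping_cost": 200},
    {"user_id": 2, "total": None, "subtotal": 1000, "shipping_cost": 200},
]
valid, rejected = validate_orders(orders, known_user_ids={1, 2})
for order, problems in rejected:
    print(order["user_id"], problems)
```

Rejected rows keep their reasons, so someone can go fix the source rather than silently losing data.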
Because it comes from all over, different data models must be reconciled. Customer Support and Sales might both create account records—are they the same?
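One simple way to answer the "are they the same?" question is to agree on a match key. The sketch below assumes a normalized email address is good enough; that is an assumption for illustration, and the record layout and function names are hypothetical. Real conforming usually needs fuzzier matching than this.

```python
def match_key(record):
    """Hypothetical match key: the email address, trimmed and lower-cased."""
    return record["email"].strip().lower()


def conform_accounts(sales, support):
    """Merge account records from two systems into one entry per match key."""
    merged = {}
    for source, records in (("sales", sales), ("support", support)):
        for record in records:
            key = match_key(record)
            entry = merged.setdefault(key, {"email": key, "name": "", "sources": []})
            entry["sources"].append(source)
            # Keep the first non-empty name we come across.
            if not entry["name"] and record.get("name"):
                entry["name"] = record["name"]
    return merged


sales = [{"email": " Pat@Example.com ", "name": "Pat Lee"}]
support = [
    {"email": "pat@example.com", "name": ""},
    {"email": "sam@example.com", "name": "Sam"},
]
accounts = conform_accounts(sales, support)
for account in accounts.values():
    print(account)
```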
This extracted, cleaned, conformed data is transformed into a standard schema form, regardless of how the DB designer built the transactional database.
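The snowflake schema is a standard pattern of the data warehouse. Strictly speaking, one fact table ringed by denormalized dimensions is a star schema; a snowflake goes one step further and normalizes the dimensions into sub-dimensions.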
Each row of a fact table records a single measurable business event. In this case, we are measuring an order. The row holds the measured quantities that apply to the fact’s grain (the kind of thing it is measuring), plus foreign keys to dimensions.
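Here is the order_fact table from the slide as a small SQLite sketch. The column names are from the slide; the column types, the choice of integer cents for money, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_fact (
        user_id              INTEGER NOT NULL,  -- FK to a dimension
        shipping_location_id INTEGER NOT NULL,  -- FK
        billing_location_id  INTEGER NOT NULL,  -- FK
        payment_method_id    INTEGER NOT NULL,  -- FK
        line_item_group_id   INTEGER NOT NULL,  -- FK
        order_timestamp      TEXT    NOT NULL,  -- fact
        total                INTEGER NOT NULL,  -- fact, assumed to be cents
        subtotal             INTEGER NOT NULL,  -- fact, assumed to be cents
        shipping_cost        INTEGER NOT NULL   -- fact, assumed to be cents
    )
""")

# Made-up rows: one row per order, which is the grain of this table.
rows = [
    (1, 10, 10, 100, 1000, "2009-03-04 09:15:00", 2500, 2000, 500),
    (2, 11, 12, 101, 1001, "2009-03-04 17:40:00", 1200, 1000, 200),
    (1, 10, 10, 100, 1002, "2009-03-05 08:05:00", 4300, 4000, 300),
]
conn.executemany("INSERT INTO order_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# The measures roll up cleanly because every row has the same grain.
daily = conn.execute("""
    SELECT date(order_timestamp) AS day, COUNT(*), SUM(total)
    FROM order_fact
    GROUP BY day
    ORDER BY day
""").fetchall()
print(daily)
```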
Dimensions supply detail about a fact. They are heavily denormalized, possibly 100 or 200 columns wide, and may have many nulls. Generally they do not have any foreign keys, so all columns are data values.
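The dimension slide shows both an `id (PK)` and a `user_id (business key)`. A rough sketch of how a load step might map one to the other is below. The helper name `surrogate_key_for` is hypothetical, and the UNIQUE constraint on the business key is an assumption (it rules out keeping several historical versions of the same user).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_dimension (
        id                     INTEGER PRIMARY KEY,      -- surrogate key
        user_id                INTEGER NOT NULL UNIQUE,  -- business key
        username               TEXT,
        first_name             TEXT,
        last_name              TEXT,
        company                TEXT,
        show_only_same_mfg     INTEGER,
        show_nonzero_inventory INTEGER,
        mailing_list_opt_in    INTEGER
    )
""")


def surrogate_key_for(conn, business_key, attributes):
    """Return the dimension row id for a business key, inserting if needed."""
    row = conn.execute(
        "SELECT id FROM user_dimension WHERE user_id = ?", (business_key,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Columns we have no value for simply stay NULL.
    cursor = conn.execute(
        "INSERT INTO user_dimension (user_id, username, company) VALUES (?, ?, ?)",
        (business_key, attributes.get("username"), attributes.get("company")),
    )
    return cursor.lastrowid


first = surrogate_key_for(conn, 9001, {"username": "alice"})
again = surrogate_key_for(conn, 9001, {"username": "alice"})
other = surrogate_key_for(conn, 9002, {"username": "bob", "company": "Acme"})
print(first, again, other)
```

The fact table stores the surrogate `id`, never the business key, so the warehouse does not depend on how the source system numbers its users.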
After your transactional data is transformed into the warehouse, the offline snowflake schema is now available to standard tools that can perform analysis and reports more easily, and without stressing the production system.
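To give a feel for how simple the reporting side gets, here is a cut-down sketch with one fact table and one dimension in SQLite. The tables are trimmed to a few columns and the data is made up; amounts are assumed to be integer cents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_dimension (
        id       INTEGER PRIMARY KEY,
        username TEXT,
        company  TEXT
    );
    CREATE TABLE order_fact (
        user_id INTEGER REFERENCES user_dimension(id),
        total   INTEGER
    );
    INSERT INTO user_dimension VALUES
        (1, 'alice', 'Acme'), (2, 'bob', 'Acme'), (3, 'carol', 'Globex');
    INSERT INTO order_fact VALUES
        (1, 2500), (2, 1200), (3, 4300), (1, 700);
""")

# One join per dimension, then group by whatever attribute the report needs.
report = conn.execute("""
    SELECT d.company, COUNT(*) AS orders, SUM(f.total) AS revenue
    FROM order_fact AS f
    JOIN user_dimension AS d ON d.id = f.user_id
    GROUP BY d.company
    ORDER BY revenue DESC
""").fetchall()
print(report)
```

Every report against this schema has the same shape, which is a big part of why off-the-shelf analysis tools can work with it.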
In general, I don’t believe it does. But this is not the general case; it is ETL. The problem domain is specialized enough that visual tools seem to be pretty effective.
Custom coding is always an option for ETL systems (in Groovy or any other language), but it’s probably not the right option. The constraints imposed by a tool will probably result in a simpler, more robust system.
ETL is not cheap.
This is an open source group, so we’ll stick to the open source options. The commercial offerings are surely worthy products that have their own strengths.
All images used in this presentation are either licensed from iStockphoto.com or are Creative Commons Attribution works from flickr.com.