2. In the Beginning …
• “Put all your eggs in one basket, and …
watch the basket.”
Mark Twain
• “Data is only valuable if it can be accessed in
a timely fashion.”
An IMS/DC Axiom
3. Table of Contents
• An Introduction
• A Problem Sampler
– Diagnostician at Play
– A Little Dirty Data
– A SQL Query
• SSIS and ETL Options
– SSIS and Data Management
• BIDS, SSAS, and MDX
– New Tools, Growing Arsenal
• At Your Service …
4. David Maeda: An Introduction
• Completing an intense 10-week course on Microsoft
Business Intelligence technologies, i.e. SQL Server,
T-SQL, SSIS, SSAS, SSRS, and Visual Studio
interfaces.
• Broad background in IT including expertise in
database and transaction management systems.
• Experience includes leadership and project
management positions.
• An accomplished diagnostician and software
engineer.
5. Diagnostician At Play
• Earlier this year, I got a good deal on a nice fly reel intended
for 9 and 10 weight lines. While using the reel for striped
bass on the Roanoke River several weeks later, I noticed that
the drag did not tighten down to a point where it was
useful.
• An exchange of emails with the US distributor got me a new
one-way clutch bearing, but it did not fix the issue.
• Examining the parts diagram for the reel, I decided to add a
7 cent wave lock washer to the drag assembly. Tested reel on
the Roanoke. Problem resolved.
• Notified the distributor. After an evaluation, the fix was
adopted by the manufacturer several days later.
6. A Little Dirty Data Problem
• In dealing with a national organization, membership
information was found to have the following issues:
– 30% to 60% of the email addresses were bad
– 10% of the regular mail addresses were bad
– Inconsistent data formats in downloaded CSV files
– Multiple entries per member
• The Problem: How to work around the “questionable” data
and maintain effective membership communications with
the following criteria:
– Minimize expenses
– Require less than 4 hours per week, on average, to manage
7. A Little Dirty Data Problem
• The Solution:
o Design a database to allow
downloads to update
existing data without
affecting “local” data.
o The Members table is what
gets downloaded.
o The MemberExtension
table is the repository for
“local” data.
o Manage both tables via a
web-based user interface
(UI).
o UI is implemented with
PHP and JavaScript.
o Automate as much as
possible.
8. A Little Dirty Data Problem
• Implementation:
– A Nasty Surprise: CSV data as downloaded would not import cleanly
into MySQL, because MySQL LOAD DATA INFILE processing
requires certain characters to be escaped.
• A short Java program was written to transform the downloaded CSV file
into the necessary format prior to importing it into MySQL.
– Any downloaded data is considered “questionable”.
• MySQL LOAD DATA INFILE processing overlays existing records.
• Restrict downloaded updates to only affect the Members table.
– The Members and MemberExtension tables are synchronized as part
of the update process invoked from the UI.
• Every Members entry has a corresponding MemberExtension entry.
• A new MemberExtension will be created if necessary and initialized with
date and email info if present.
• Existing MemberExtension entries are not touched.
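The synchronization rules above can be sketched in Java. This is a minimal in-memory illustration, not the PHP/MySQL code behind the UI: the class `MemberSync`, the `Extension` fields, and the sample dates and addresses are all hypothetical stand-ins for the Members and MemberExtension tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the Members / MemberExtension pair. */
final class MemberSync {

    /** Locally maintained data; the field names are illustrative assumptions. */
    static final class Extension {
        final String createdDate;
        final String email;
        String localNotes = "";

        Extension(String createdDate, String email) {
            this.createdDate = createdDate;
            this.email = email;
        }
    }

    /**
     * Ensures every member id has an extension entry.
     * members maps member id to the downloaded email (null when absent).
     * Existing extension entries are never modified.
     * Returns the number of extension entries created.
     */
    static int synchronize(Map<Integer, String> members,
                           Map<Integer, Extension> extensions,
                           String today) {
        int created = 0;
        for (Map.Entry<Integer, String> m : members.entrySet()) {
            if (!extensions.containsKey(m.getKey())) {
                String email = m.getValue() == null ? "" : m.getValue();
                extensions.put(m.getKey(), new Extension(today, email));
                created++;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        // Sample data only: member 1 already has local data, member 2 is new.
        Map<Integer, String> members = new LinkedHashMap<>();
        members.put(1, "downloaded@example.org");
        members.put(2, null);

        Map<Integer, Extension> extensions = new LinkedHashMap<>();
        Extension existing = new Extension("2009-01-15", "verified@example.org");
        existing.localNotes = "prefers postal mail";
        extensions.put(1, existing);

        int created = synchronize(members, extensions, "2009-06-01");
        System.out.println("created=" + created);
        System.out.println("member 1 email=" + extensions.get(1).email);
    }
}
```

Because the download only ever overlays the members map, a bad address arriving in a later download cannot overwrite a locally verified one.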
9. A Little Dirty Data Problem
o A Utilitarian UI
• Apache
• HTML Frames
• AJAX
• PHP
10. A Little Dirty Data Problem
• In Summary:
– We were able to circumvent most of the dirty data issues by isolating
the “questionable” data.
– The MySQL RDBMS supports ad hoc SQL queries should the need
to alter tables, etc. arise.
– Expenses were minimized by:
• Using freely available components, i.e. Java, Apache 2.2, PHP 5, MySQL
5.2, and JavaScript.
• Using volunteer labor to write the ETL code.
– A download and update sequence takes less than 10 minutes.
– A typical request to update the email distribution takes less than 5
minutes.
– Managing the database and generating the necessary distribution
lists via the provided UI typically takes less than 4 hours per week.
11. A SQL Query
• In a recent phone interview, I was asked:
– How would you construct an SQL query to find the second highest sales
total?
• My answer was:
– Use a pair of nested queries. The inner query would ascertain the top 2
totals. The outer query would return the lower of the two totals.
• In T-SQL this looks something like (It may look somewhat different in
other SQL dialects):
SELECT TOP 1 orderid, (unitprice * quantity) AS totalsale
FROM [order details]
WHERE (unitprice * quantity) IN
(
    SELECT TOP 2 (unitprice * quantity) AS ordertotal
    FROM [order details]
    GROUP BY (unitprice * quantity)
    ORDER BY ordertotal DESC
)
ORDER BY totalsale ASC
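The same two-step logic (take the top two distinct totals, then return the lower of the two) can be mirrored outside SQL. The Java sketch below is only an illustration: the class name `SecondHighest` and the sample totals are hypothetical. Unlike the query, it returns an empty result when fewer than two distinct totals exist.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical helper mirroring the nested-query idea. */
final class SecondHighest {

    static Optional<Integer> secondHighest(List<Integer> totals) {
        // "Inner query": the two largest distinct totals, highest first.
        List<Integer> topTwo = totals.stream()
                .distinct()
                .sorted(Comparator.reverseOrder())
                .limit(2)
                .collect(Collectors.toList());
        // "Outer query": the lower of the two, if there are two.
        return topTwo.size() < 2 ? Optional.empty() : Optional.of(topTwo.get(1));
    }

    public static void main(String[] args) {
        // Sample totals only; 450 appears twice to show the effect of distinct().
        List<Integer> totals = Arrays.asList(120, 450, 450, 300, 75);
        System.out.println(secondHighest(totals).get()); // prints 300
    }
}
```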
12. ETL Options and SSIS
o All CSV files are not created equal. Neither are the
ETL tools used to prepare and load them into a
database. Compare:
• To the left is a more traditional approach (as used
for the Dirty Data problem).
• To the right is an approach utilizing Microsoft’s SSIS
facility.
o SSIS has Data Management applications beyond ETL.
Left panel (Java source; lines are cut off at the right edge of the slide):
package appCSV;
import java.io.*;
import java.util.StringTokenizer;
/**
 * @author Dave Maeda
 *
 * Class to convert csv field form
 * Invoke as: java appCSV.Convert
 *
 * Where: filename is the name of
 * ext is the file extension.
 *
 * Output: A file named <filename>.
 * Note: ext will default to "csv" if
 */
public class Convert
{
    private static void usage()
    {
        System.out.println("\n");
        System.out.println(" >> Usage:
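The kind of pre-load conversion described for the Dirty Data problem can be sketched as follows. This is not the Convert class from the slide, whose body is not shown. It assumes the troublesome character is the backslash, which MySQL LOAD DATA INFILE treats as its default escape character, and that Windows line endings should be normalized; the class name `CsvEscapeSketch` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical sketch of a CSV pre-processor for a MySQL bulk load. */
final class CsvEscapeSketch {

    /** Doubles each backslash so it loads as a literal backslash (assumed rule). */
    static String escapeLine(String line) {
        return line.replace("\\", "\\\\");
    }

    /** Copies CSV text line by line, escaping each line and writing '\n' endings. */
    static void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) { // readLine() drops \r\n or \n
            out.write(escapeLine(line));
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Sample row only: one field contains a backslash, the line ends in \r\n.
        StringWriter out = new StringWriter();
        convert(new StringReader("1,\"Smith\\Jones\",a@example.org\r\n"), out);
        System.out.print(out);
    }
}
```

Working on readers and writers rather than file names keeps the sketch easy to test; a command-line wrapper that takes a file name and extension, as the slide's usage text suggests, would sit on top of `convert`.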
13. Data Management 101: DID
• Three basic principles:
– Disclosure
• Viewing of data
– Who’s viewing your data and are they authorized to do so?
– Integrity
• Accuracy and currency of data
– Data is only meaningful if it is accurate and up to date.
– Durability
• Data loss prevention
– More data is lost to accidents than malicious actions.
14. BIDS, SSAS, and MDX
o Business Intelligence Development Studio (BIDS)
• Ships as part of MS SQL Server
o SQL Server Analysis Services (SSAS)
• OLAP store and engine
• Builds multi-dimensional cubes
o Multi-Dimensional eXpressions (MDX)
• Used to retrieve cube data
• Used in SSAS Calculations and KPIs
15. SSRS
o Web Enabled
• Report Management
• Distribution
o Charts
• Conditional Fonts
• Calculated Members
• Multiple Charting Options
• Custom Colors
o Tables
• Multiple Formatting Options
• Data
• Calculated Members
• Conditional Fonts
16. MOSS, PPS, Dashboards, and KPIs
o MOSS
• SharePoint Server
o PPS
• PerformancePoint Server
o Dashboard
• Scorecard
o KPIs
• Parameters
• Values
• Goals and Status
• Trends (not shown)
17. Excel Services
o Excel Local Client
• Parameters
• Pivot Table
• Associated Chart
o Excel Services
• MOSS
• PPS Dashboard
• PPS Report
– Parameters
– Chart
18. New Tools, Growing Arsenal
• Latest additions: BIDS, SSIS, SSAS, SSRS, and MDX
• Arsenal already includes:
– OS platforms: z/OS, Windows, Unix (AIX and Sun), and
Linux (Red Hat and SUSE)
– Databases: IMS, DB2, Oracle, MySQL, and SQL Server
– Languages: Assembler (IBM and Intel), C/C++, Java,
JavaScript, PHP, Smalltalk, SQL, and REXX.
– Core competencies: Leadership, process improvement,
team facilitation, interpersonal communications, client
relations, and project management.
19. At Your Service …
• David Maeda
– Software Engineer
• Business Intelligence Analyst
• Diagnostician/Programmer
– Hard working and Persevering
• Personal Integrity and High Standards
– Team Leader and Team Player
• “Your prime directive as a leader is to position your
team for success.”