Do you ELT on z?
  • This is a technical overview of InfoSphere Warehouse on System z
  • To support end-to-end business intelligence solutions, DB2® Data Warehouse Edition provides an infrastructure for warehouse building and maintenance, including tools for application design, deployment, execution, and administration. The SQL Warehousing Tool is integrated with the WebSphere® DataStage® ETL system to provide import and export capabilities between the two design environments. The SQL Warehousing Tool solves data integration problems in a DB2 data warehouse environment. Users can model logical flows of higher-level operations, which generate units of code that are organized inside execution plans. The tool provides a metadata system and an IDE to create, edit, and manage these flows, and a code generation system that understands the source graph and translates it into optimized code for execution. When the development of the flows is complete, the second part of the system comes into play: the packaging of all of the generated code and associated artifacts into a data warehouse application that can be deployed into various target runtime systems.
  • Data Server Component – DB2 for z/OS (not included in the offering; it is a prerequisite)
    Application Server Components – Linux on System z partition
    • WebSphere Application Server – manages/executes the SQW runtime processes; hosts the server for the browser-based Admin Console
    • SQW Runtime – execution engine for SQW jobs (control flows, data flows); runs in the WAS environment
    • Cubing Services engine – cube server providing cube access to MDX clients through ODBO and XMLA; cubing engine and cube optimizer ported to Linux on System z
    Design and Administration Client Components – Windows, Linux
    • Design Studio – Eclipse-based design tool; physical modeling based on Rational Data Architect (RDA); SQW data flow and control flow editor (extension of RDA); Cubing Services cube model editor and optimization tool (extension of RDA)
    • Admin Console – browser-based administration tool; common admin tasks, data sources, role-based security assignments; cube metadata management (import/export); cube server management (configure/start/stop, etc.); SQW runtime job administration

Do you ELT on z? – Presentation Transcript

  • Do you ELT on z? Mar 16, 2009
  • Agenda
    • Design Studio Overview
    • Development SQW Flows using Design Studio
      • Physical Data Modeling
      • Data Flows
      • Control Flows
    • Deployment & Runtime Management
  • Inhibitors to data warehousing on System z?
    • DB2 functionality?
      • DB2 V8 and V9 – great strides
    • Costs?
      • Specialty processors help – zIIPs
      • Lower cost licensing for new DB2 workloads – DB2 VUE
    • Perception?
      • Conventional wisdom over the last decade: use distributed systems for data warehouses
      • A lot of shops defied conventional wisdom!
    • Lack of skills and/or appropriate data warehouse building tooling on System z
      • Fewer people have skills on System z?
      • Recent graduates grew up using graphical tooling!
        • No green screen for them
      • IBM recognizes this and is porting tooling to System z
  • Data Movement and Transformation
    • Moving and transforming data is a key component to building a data warehouse
    • What do you use on System z?
      • Programs? COBOL? Rexx?
      • Database utilities? Import, Export?
      • SQL scripts?
      • Stored procedures?
      • FTPs?
      • ????
  • Alphabet Soup for Tools
    • ETL
    • ELT
    • ETLT
    • TELT
    • TETLT
    • What???
    • Just marketing-speak describing how and where the data movement and transformation activities occur:
      • Leverage a stand-alone transformation engine
      • Leverage a database engine for data transformations
    E = Extract, T = Transform, L = Load
  • ETL – Extract Transform Load
    • Usually a stand-alone server separate from any source or target systems
    • Referred to as an ETL engine that performs all extracts and transformations
    • Database agnostic
    • Usually a procedural design slant
    • May push some processing to source databases
    • May invoke database utilities
  • ELT – Extract Load Transform
    • Turns the relational database engine into a data movement and transformation engine
    • The work is usually done at the target system
    • Can source from multiple database types but typically supports a specific target
    • By definition, has a set-based design slant
    • May push some processing to source databases
    • May invoke database utilities
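    A minimal sketch of the ELT pattern in SQL (table and column names here are hypothetical, not from the presentation): rows are first extracted and loaded into a staging table, then the transformation runs as set-based SQL inside the target database engine.
      -- Hypothetical ELT transform step: the database engine does the work.
      INSERT INTO DWH.SALES_FACT (PD_ID, TIME_ID, SALES_AMOUNT)
      SELECT   S.PD_ID,
               S.MSR_PRD_ID,
               SUM(S.ITEM_PRICE * S.ITEM_QTY)      -- aggregation runs inside DB2
      FROM     STAGE.SALES_RAW S
      GROUP BY S.PD_ID, S.MSR_PRD_ID;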
  • Variations on a theme
    • ETLT, TETLT
      • ETL tools are now pushing some processing down into the source and/or target relational databases by generating SQL
    • TELT, TETLT
      • ELT tools can typically push SQL to remote databases
      • ELT tools can do some limited non-relational processing – typically by calling executables or scripts
  • Reality…
    • Don’t get caught up in the ETL vs ELT wars
      • Advantages to doing some work outside the database
      • Advantages to doing some work inside the database
    • Don’t build an ETL or an ELT system
      • Build an architected Population Subsystem
      • Apply the appropriate tools to the appropriate function for the available (and future) skill level
        • ETL tools when appropriate
        • ELT tools when appropriate
        • Other tools when appropriate
  • A Population Subsystem
    • Consider using an architected model for building population subsystems
      • Each logical layer performs a specific kind of function and processing is encapsulated to that layer
      • Logical layers are grouped into physical components and a staging of data occurs between physical components
      • Each physical component will be implemented with technical functions implemented by one or more technologies
  • Population Subsystem
    From IBM course DW130
  • Tools on System z (z/OS or Linux on System z)
    • Data Movement and Transformation
      • ETL: InfoSphere DataStage for Linux on System z
      • Data Cleansing: InfoSphere QualityStage for Linux on System z
      • ELT: InfoSphere Warehouse on System z – SQL Warehousing Tool
    • Expanded Sources
      • Heterogeneous data access: InfoSphere Federation Server
      • Classic Data Sources: InfoSphere Classic Federation Server for z/OS
    • Capturing Changes
      • SQL and Q-based data replication: InfoSphere Replication Server for z/OS
      • Classic data source replication: InfoSphere Classic Replication Server for z/OS
      • DB2 event publishing: InfoSphere Data Event Publisher for z/OS
      • Classic event publishing: InfoSphere Classic Data Event Publisher for z/OS
  • ELT on System z InfoSphere Warehouse on System z SQL Warehousing Tool (SQW)
  • InfoSphere Warehouse on System z – architecture overview
    • Data warehouse server: DB2 for z/OS with the Cubing Services engine, MQTs, cube metadata, and the control DB
    • Application server: WebSphere Application Server hosting the SQW Runtime and Administration functions – Linux on System z partition / IFL
    • Source systems: DB2 for z/OS, IMS, VSAM, other RDBMSs
    • Client layer
      • Design and admin clients: Design Studio (Eclipse), Admin Console (IE/Firefox) – Windows / Linux
      • BI / reporting tools and apps: Cognos 8 BI for System z, Excel, third parties / BPs
    • Connectivity: SQL over JDBC/DB2 Connect; MDX for cube access; MQT Advisor
  • Design Studio
  • InfoSphere Warehouse Design Studio – Key features
    • Integrated Development Environment (IDE) for DB2 Warehouse projects
    • Integrated consistent and interoperable tools for:
      • Connecting and browsing databases
      • Exploring data
      • Designing physical database models (reverse/forward engineering)
      • Designing OLAP objects
      • Designing data movement and transformation flows
    • The platform is Eclipse-based and can be easily extended with third-party or customer-developed plug-ins
  • Getting Started – The Workspace
    • A workspace is:
      • A directory on the local file system where the projects (i.e. Metadata) created in the Design Studio are stored as XML files.
      • The Design Studio GUI.
  • The Business Intelligence Perspective
    • The Business Intelligence Perspective is the default perspective in the Design Studio.
    • It contains the views which are useful during the development of a Data Warehousing project.
    • Views can be moved/stacked by using drag and drop. They can be maximized by double-clicking on their title.
    • Views can be closed and reopened. To add a new view to the perspective, use the menu bar “Window -> Show View”.
    Views in the BI perspective: Data Project Explorer, Data Source Explorer, Outline, Properties, Data Output, Problems, Editor(s), Object Palette
  • Data Source Explorer
    • Define live JDBC connections to relational sources
      • Could be nicknames if Federation Server is installed
    • Browse and work with objects in the live database
    • Live connection required for many Design Studio operations
      • Reverse engineering data models
      • Test execution of flows
      • Sampling data
  • Team component
    • Version and configuration management
    • Share resources with team via a repository
    • The Design Studio includes a CVS repository provider
    • Other repository providers can be used by installing the plugins provided by the repository vendors
      • Rational Clearcase
      • IBM CMVC
      • Merant PVCS Version Manager
      • … see eclipse.org community page for a list of the available plugins.
    (Diagram: team members check resources in and out of a shared repository)
  • Integration with other IBM Tools – Eclipse Shell Sharing
    • Share the core Eclipse components so that they are not duplicated across Eclipse-based products. Shell sharing eliminates the need to install a separate Eclipse platform for each product, saving disk space and avoiding duplication of components.
    • Supported products that shell share with the Design Studio:
      • Data Studio Developer (DSD) v2.1
      • Data Studio Administrator (DSA) v2.1
      • InfoSphere Data Architect (IDA) v7.5.1
      • Data Studio Optimization Expert for z/OS (DSOE) v2.1
      • Rational Application Developer (RAD) v7.5.1
      • Rational Software Architect (RSA) v7.5.1
  • Physical Data Modeling
  • Data Models
    • Two types of data models
      • Logical – The “Business” representation of data without regard to underlying DBMS
      • Physical – The representation of the data as it would appear in the DBMS
    • Design Studio supports development of physical data models
      • Metadata representation of actual objects that are present in the DBMS
      • Create from scratch
      • Reverse engineer from existing database or DDL
      • Physical data model required to provide database metadata to other SQW components
  • Data modeling overview
    • Design and modify database physical models (schema & storage design, as well as cubes, dimensions, hierarchies)
    • Key Features:
    • Create a new DB design from scratch
    • Reverse engineer from an existing Database Explorer connection or from DDL
    • Create overview diagrams
    • Modify the schema graphically or in the project tree
    • Compare DB objects with each other or with objects existing in the database
    • Analyze design (best practices, and dependencies), Validation
    • Generate DDL script & deploy (a sample of generated DDL follows this list)
    • Impact Analysis
    • DB2 Storage Modeling: Table Space, Buffer Pool, Partition
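    For illustration, a hedged sketch of the kind of DB2 for z/OS DDL the Design Studio can generate from a physical model (database, table space, and table names are hypothetical):
      -- Hypothetical generated DDL for a small fact table and its storage objects.
      CREATE TABLESPACE TSSALES IN DWHDB
        USING STOGROUP SYSDEFLT
        BUFFERPOOL BP1;
      CREATE TABLE DWH.SALES_FACT (
        PD_ID         INTEGER        NOT NULL,
        TIME_ID       INTEGER        NOT NULL,
        SALES_AMOUNT  DECIMAL(15,2),
        PRIMARY KEY (PD_ID, TIME_ID)
      ) IN DWHDB.TSSALES;
      -- DB2 for z/OS enforces the primary key through a unique index.
      CREATE UNIQUE INDEX DWH.XSALES_FACT
        ON DWH.SALES_FACT (PD_ID, TIME_ID);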
  • Design Studio vs InfoSphere Data Architect
    • Design Studio includes a subset of functionality provided in InfoSphere Data Architect (IDA).
    • Design Studio includes the physical data modeling and corresponding SQL generation capabilities to help you implement and modify your physical model.
    (Diagram: IDA-only capabilities – logical data modeling, naming model, glossary model, other non-LUW advanced physical data modeling, web publishing and reporting, mapping editor, UML–LDM transformation. Shared with Design Studio – Data Project Explorer, Database Explorer, complete physical data modeling for DB2, basic physical data modeling for others, impact analysis.)
  • Design Studio – Physical Data Model (Data Project tree): Physical data model → Database (one per model) → Schema → Diagrams logical folder (one per schema) with Diagrams, SQL statements logical folder (one per schema), and Tables with columns, primary key columns, primary key constraints, unique constraints, and indexes
  • Physical Data Model – Diagram (editor screenshot: hide/show palette, palette for selecting the element to create on the diagram, geometric shapes, drawing area)
  • Embedded Data Movement – SQL Warehousing Tool (SQW)
  • SQL Warehousing Tool (SQW)
    • Build and execute intra-warehouse (SQL-based) data movement and transformation services
    • Integrated Development Environment and metadata system
      • Model logical flows of higher-level operations
      • Generate code and create execution plans
      • Test and debug flows
      • Package generated code and artifacts into a data warehouse application
      • Integrate SQW Flows and DataStage jobs
    • Generate DB2 for z/OS-specific optimized SQL code (Data Flows)
      • DB2 for z/OS-specific operators
      • DB2 for z/OS-specific code generation
      • Optimization across queries
      • Predicate pushdown and move-around (illustrated after this list)
      • Removal of unnecessary columns
      • Staging tables handled automatically by the engine
    • Integrate SQL-based flows with non-database activities (Control Flows)
    • Sequence and manage activity flow
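    A hypothetical illustration of the predicate-pushdown idea (names are made up): a filter modeled downstream in the flow is folded into the SQL that scans the source, so fewer rows are materialized in staging.
      -- Flow modeled as: scan SALES -> stage -> filter -> insert.
      -- Generated form: the filter is applied directly at the source scan.
      INSERT INTO DWH.HIGH_VALUE_SALES (PD_ID, SALES_AMOUNT)
      SELECT PD_ID, SALES_AMOUNT
      FROM   DWH.SALES
      WHERE  SALES_AMOUNT > 1000;        -- predicate pushed into the source query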
  • Data Flows
  • Data flows
    • Data flows are flow models that represent data movement and transformation requirements
    • SQW Codegen translates the models into repeatable, SQL-based warehouse building processes
    • Data from source files and tables moves through a series of transformation steps, then loads or updates a target table or creates a file
  • Would you rather type this? ……………………… .
    SELECT SALES.OU_IP_ID AS STR_IP_ID,
           SALES.PD_ID AS PD_ID,
           SALES.MSR_PRD_ID AS TIME_ID,
           SALES.C_D_MKT_BSKT_TXN_ID AS NMBR_OF_MRKT_BSKTS,
           SALES.SUM_NBR_ITM AS NUMBER_OF_ITEMS,
           CASE WHEN SALES.M_BK_PD_SUB_DEPT_NM IN ('BATH AND SHOWER', 'CAMERAS')
                THEN SALES.BKP_SUM_NBR_ITMXPRC * DECIMAL(MARTS.RAND1N(5) + 123) / 100
                ELSE SALES.BKP_SUM_NBR_ITMXPRC * DECIMAL(MARTS.RAND1N(5) + 102) / 100
           END AS PRDCT_BK_PRC_AMUNT,
           CASE WHEN SALES.MIN_CG_PD_DEPT_NM IN ('TEEN BOYS', 'TEEN BOYS JEANS', 'DRESS FORMAL', 'MEN SHOES')
                THEN (DECIMAL(68 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
                WHEN SALES.MIN_CG_PD_DEPT_NM IN ('ELECTRICAL APPLIANCES', 'ELECTRONICS', 'COLORED TELEVISIONS', 'WOMEN SHOES')
                THEN (DECIMAL(77 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
                WHEN SALES.MIN_CG_PD_DEPT_NM IN ('HEALTH AND BEAUTY')
                THEN (DECIMAL(65 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
                ELSE (DECIMAL(72 - MARTS.RAND1N(5)) / 100) * SALES.SUM_CG_NBR_ITMX_PRC
           END AS CST_OF_GDS_SLD_CGS,
           SALES.SUM_NBR_ITMXSTM_PRC AS SALES_AMOUNT
    FROM SALES) …………
  • … or would you rather describe your logic at a higher level? A simple “Skills” star schema
  • … and have optimized SQL generated for you?
    INSERT INTO OLAPANL.STAR_FACT_TABLE
      (ID, COMPANY_ID, TIME_ID, SKILL_DETAILS_ID, NB_SKILLS)
    WITH INPUT_04 (COMPANY_NAME, TIME, ID, SKILL_CAT, SKILL_DETAILS, SKILL_ID)
    AS (
      SELECT
        COMPANY_NAME AS COMPANY_NAME,
        TIME AS TIME,
        ID AS ID,
        SKILL_CAT AS SKILL_CAT,
        SKILL_DETAILS AS SKILL_DETAILS,
        SKILL_ID AS SKILL_ID
      FROM
        TXTANL.IT_SKILLS_ASKED INPUT_0281),
    IN4_07 (ID, SKILLS_PER_OFFER)
    AS (
      SELECT
        INPUT_04.ID AS ID,
        COUNT(*) AS SKILLS_PER_OFFER
      FROM
        INPUT_04
      GROUP BY INPUT_04.ID)
    SELECT
      IN_07.ID AS ID,
      IN1_07.COMPANY_ID AS COMPANY_ID,
      IN2_07.TIME_ID AS TIME_ID,
      IN3_07.SKILL_DETAILS_ID AS SKILL_DETAILS_ID,
      DOUBLE(1) / DOUBLE(IN4_07.SKILLS_PER_OFFER) AS NB_SKILLS
    FROM
      TXTANL.IT_SKILLS_ASKED IN_07,
      OLAPANL.STAR_COMPANY IN1_07,
      OLAPANL.STAR_TIME IN2_07,
      OLAPANL.STAR_SKILL IN3_07,
      IN4_07
    WHERE (IN_07.SKILL_DETAILS = IN3_07.SKILL_DETAILS AND
      IN_07.COMPANY_NAME = IN1_07.COMPANY_NAME AND
      IN_07.TIME = IN2_07.TIME_DATE AND
      IN_07.ID = IN4_07.ID)
  • Data Flow Operators – most operators are the same as in the LUW version but generate DB2 for z/OS-specific SQL
    • Sources & Targets
      • Table Source (Local and Remote)
      • Table Target (Local and Remote)
      • Data Set Import
      • Data Set Export
      • SQL Query Source
      • Data Station
    • SQL Transformation Operators
      • Select List
      • Distinct
      • Group By
      • Order By
      • Table Join
      • Where (Filter)
      • Union
    • Warehouse Operators
      • Fact Key Replace
      • Key Lookup
      • Pivot
      • Unpivot
      • Splitter
    • Custom Transformations
      • Custom SQL
      • DB2 Table Functions
      • DB2 User Defined Functions
  • Data Flows that call DB2 for z/OS utilities
    • Data Set Import Operator
      • Invokes the LOAD utility to load data into a target table from a data set
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU
    • Data Set Export Operator
      • Invokes the UNLOAD utility to unload data from a table to a BSAM sequential data set
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU
    • Cross Loader Operator
      • Invokes the LOAD utility to load the output of a dynamic SQL SELECT statement directly into a table
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU (a call sketch follows this list)
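    For illustration only – a hedged sketch of such a utility invocation through SYSPROC.DSNUTILU, whose parameters are a utility ID, a restart option, the utility control statement, and an output return code (the utility ID, table, and field layout below are hypothetical):
      -- Hypothetical example: driving the LOAD utility through DSNUTILU.
      CALL SYSPROC.DSNUTILU(
        'SQWLOAD01',                      -- utility ID (made up)
        'NO',                             -- do not restart a prior execution
        'LOAD DATA INDDN SYSREC RESUME YES
           INTO TABLE DWH.SALES_STG
           (SALE_ID  POSITION(1:10)  INTEGER EXTERNAL,
            AMOUNT   POSITION(12:21) DECIMAL EXTERNAL)',
        ?);                               -- return code (output parameter)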
  • Execution Database
    • The DB2 for z/OS subsystem to which the generated SQL of a data flow is submitted
    • Source/target tables are local when they are in the execution subsystem, otherwise remote
    (Diagram: the Linux partition runs the Admin Console/SQW Runtime with its control data; DB2A is a distributed database, DB2C is the execution database; blue = source table, green = target table)
  • Local Source – Local Target
    • Both Source and Target tables are in the same DB2 for z/OS subsystem
      • SQL submitted to DB2C
      • No data flows outside of DB2C
  • Remote Source Table via JDBC – Local Target
    • Source table in a remote database accessed via a remote Table Source operator
      • A Java application running on Linux connects to DB2A and DB2C
      • Data flows through Linux
  • Remote Source DB2 via Cross Loader – Local Target
    • Source table in a remote DB2 database accessed via a remote Table Source operator, but loaded using the Cross Loader target operator
      • Cross loader utility invoked at DB2C
      • Uses DDF to access remote DB2 table over DRDA
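    For illustration (object names are hypothetical), the cross loader pattern pairs a cursor declared over the remote table – reached through DDF/DRDA via a three-part name – with LOAD ... INCURSOR in the utility control statement:
      -- Hypothetical cross loader control statement passed to SYSPROC.DSNUTILU.
      EXEC SQL
        DECLARE C1 CURSOR FOR
        SELECT SALE_ID, AMOUNT FROM DB2A.DWH.SALES   -- remote table over DRDA
      ENDEXEC
      LOAD DATA INCURSOR(C1) REPLACE
        INTO TABLE DWH.SALES_STG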
  • Local Source Table – Remote Target via JDBC
    • Source table in the local DB2 for z/OS database; the target is defined via a remote Table Target operator
      • A Java application running on Linux connects to DB2A and DB2C
      • Data flows through Linux
  • Remote Source Table – Remote Target Table via JDBC
    • Source and target tables are remote
      • A Java application running on Linux connects to DB2A and DB2C
      • Data flows through Linux
      • All SQL processing is in DB2 for z/OS at DB2C
  • Other data flow features
    • Variables
      • Variables can be used in Data Flows
      • Defer the definition of certain properties until a later phase in the life cycle.
        • File Names
        • Table Names
        • Database Schema Names
        • Many more
      • Generalize a Data Flow (see the sketch after this list)
    • Subflows
      • A subflow is a predefined set of operators that you can place inside a data flow.
      • Useful as a plug-in for multiple versions of the same or similar data flows
      • Containers or building blocks for complex flows (division of labor)
      • Blue ports represent subflow inputs and outputs
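    A purely hypothetical sketch of how a variable generalizes a data flow (the ${...} placeholders below are illustrative, not the actual SQW variable syntax): the schema and table qualifiers are left as variables and resolved at deployment or run time rather than hard-coded.
      -- Hypothetical: ${TARGET_SCHEMA} and ${SOURCE_TABLE} stand for data flow
      -- variables resolved by the SQW runtime before the SQL is executed.
      INSERT INTO ${TARGET_SCHEMA}.SALES_FACT (PD_ID, TIME_ID, SALES_AMOUNT)
      SELECT PD_ID, MSR_PRD_ID, SUM_NBR_ITMXSTM_PRC
      FROM   ${SOURCE_TABLE};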
  • Control Flows
  • Definition and simple example
    • A control flow is a flow model that sequences one or more data flows and integrates other data processing tasks and activities.
    • Control flows are the unit of execution.
    • This simple example processes two data flows in sequence. If either fails, an e-mail is sent to an administrator:
  • Control Flow Operators
    • Task-oriented operators (Do things)
      • Data flow
      • Subprocess
      • JCL Job
      • Command (DB2 Shell/FTP)
      • Secure Command
      • Secure FTP
      • Email
      • Period row generator
      • Load
      • Unload
      • Reorg
      • Runstats
      • Table Partition
      • Stored procedure
      • DataStage job sequence
      • DataStage parallel job
      • Custom SQL
  • Control Flow Operators
    • Flow control operators (Manage things)
      • Parallel Container
      • Start
      • End
      • Iterator/End Iterator
      • Continue
      • Break
      • Fail
      • File wait
      • Variable assignment
      • Variable comparison
      • File Write
  • CF Operator Introduction (1)
    • Exchange Operator
      • Switches the contents of a base table and its associated clone table (sketched after this list)
      • The exchange operation is not supported against a DB2 for z/OS Version 8 database
    • Table Partition Operator
      • Performs table partition operations:
          • Adding a partition
          • Rotating partitions
          • Changing partition boundaries
    • Runstats Operator
      • Updates the system catalog statistics for a DB2 for z/OS database through the DB2 RUNSTATS utility
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU
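    For illustration only – a hedged sketch of what the Exchange operator corresponds to at the SQL level, assuming DB2 9 for z/OS clone-table support (table names are hypothetical):
      -- Hypothetical sketch: once a clone has been defined for the base table,
      -- EXCHANGE flips which underlying data set the base table name refers to.
      ALTER TABLE DWH.SALES_FACT ADD CLONE DWH.SALES_FACT_CLONE;   -- one-time setup
      EXCHANGE DATA BETWEEN TABLE DWH.SALES_FACT AND DWH.SALES_FACT_CLONE;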
  • CF Operator Introduction (2)
    • Reorg Operator
      • Reorganizes a table space or an index in a DB2 for z/OS database through the DB2 REORG utility
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU
    • Unload Operator
      • Unloads data from an entire table space, or from selected tables and columns, to BSAM sequential data sets
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU
    • DB2 Online Utility Operator
      • Runs any DB2 for z/OS utility that can be invoked by the stored procedure DSNUTILU
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU (example control statements follow this list)
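    For illustration, hedged examples of the kind of utility control statements these operators would pass to SYSPROC.DSNUTILU (object names are hypothetical):
      RUNSTATS TABLESPACE DWHDB.TSSALES TABLE(ALL) INDEX(ALL)   -- Runstats operator
      REORG TABLESPACE DWHDB.TSSALES LOG NO                     -- Reorg operator
      UNLOAD TABLESPACE DWHDB.TSSALES                           -- Unload operator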
  • CF Operator Introduction (3)
    • Load Operator
      • Four load approaches:
          • Load from a z/OS data set
          • Load from a cursor
          • Load from a file on the client side (using FTP to transfer the file to a z/OS data set)
          • Load from a file on the client side (using FTP to transfer the file to a z/OS BatchPipes data set)
      • Calls the DB2-supplied stored procedure SYSPROC.DSNUTILU for approaches 1–3; calls ADMIN_JOB_SUBMIT, ADMIN_JOB_QUERY, ADMIN_JOB_FETCH, and ADMIN_JOB_CANCEL for approach 4
    • Command Operator (FTP)
      • An Advanced Options tab has been added; the user can specify the following additional options for z/OS file transfer:
        • Record length
        • Record format
        • Data transfer type
  • CF Operator Introduction (4)
    • JCL Operator
      • JCL: a control language used to identify a job to an operating system and to describe the job's requirements
      • The JCL Operator submits a job, queries the status of the job, fetches the output of the job, and purges the job
      • Supports three scenarios:
          • JCL on the local machine
          • JCL on the z/OS side
          • Newly edited JCL
      • Calls the DB2-supplied stored procedures ADMIN_JOB_SUBMIT, ADMIN_JOB_QUERY, ADMIN_JOB_FETCH, ADMIN_JOB_CANCEL, and ADMIN_DS_BROWSE
  • Deployment and Runtime
  • Deployment – the process of promoting a warehouse application from a development environment to test and production environments
    • Deployment preparation (“Prepare Application”) is done in the Design Studio
    • Deployment is done via the Administration Console
    (Diagram: Development – Design / Test → Prepare Application; Test – Deploy / Install → Manage / Execute; Production – Deploy / Install → Manage / Execute; the application package (zip file) carries DDL, the application profile, and EPGs)
  • InfoSphere Warehouse on System z
    • Administration Console – manage the runtime environment
        • Deploy data movement applications
        • Schedule, Execute, Monitor flows
        • Define and manage cube servers
        • Manage OLAP Metadata
        • Assign cubes to cube servers
  • Admin Console – SQW
    • Manage database connections
    • Deploy and manage SQW apps
    • Run and schedule control flows
    • Monitor run instances
  • Contact Info – z Warehouse SWAT Team
    • Mgr – Beth Hamel – [email_address]
      • Andy Perkins – [email_address]
      • Jonathan Sloan – [email_address]
      • Sundari Voruganti – [email_address]
      • Willie Favero – [email_address]