SmithGroup JJR | Technical Analysis of BI Environment
Version 1.0
Technical Analysis of BI Environment
© AIM Report Writing, 2017 Page 2 of 21
Executive Summary
At the beginning of the engagement with SmithGroup JJR, AIM Report Writing was asked to provide an initial, high-level technical analysis of their Business Intelligence (BI) environment. This document is the result of the weeks of investigation that followed. The following areas were agreed upon for this deliverable:
• Executive Summary
• Recommendations
• Data Warehouse Architecture
• SQL Server Best Practices
• Case for Hadoop: Indoor Positioning Study (POE)
• SQL Server / Database Discovery
• Data Warehouse and Data Marts
• Extract, Transform, and Load
• Appendix A | Microsoft Data Warehouse On-Premises Architecture
• Appendix B | Design Questions to Review
The current business intelligence and data warehouse environment at SmithGroup JJR includes 3 primary components: a
data warehouse, an ETL process, and a data mart. One additional storage location exists on separate SQL Server
resources. There are two main data sources stored in the cloud with the intent of future expansion in the cloud.
This business intelligence and data warehouse environment was analyzed in three areas: SQL Server best practices, the data warehouse and data marts, and the ETL process. In addition to this analysis, a database discovery and a case for Hadoop were completed.
In order to implement the following recommendations, we suggest following an Agile approach. This document's technical analysis has identified and defined groups of work, called Epics in Agile. Breaking the findings of this analysis into Epics provides a starting point for identifying scope and vision, user stories, and the backlog. These Agile artifacts provide the framework for defining the effort, sprints, and delivery to production.
During on-site meetings regarding the topics in this document, it was informally agreed that the initial two Epics for
development focus on the two areas listed below:
• SSIS Error Flows to replace TSQL Functions
• Dimensional Modeling and Star Schema
The team at SmithGroup JJR has started gathering use cases. These use cases will be used during the dimensional
modeling (star schema) development. These use cases serve 4 purposes:
• Identify entities, relationships, and attributes for the star schema conceptual model
• Develop the dimensional model (star schema conceptual model)
• Verify, once a draft conceptual model is complete, that the model can source the use cases
• Develop front-end requirements such as reports and dashboards
Future Epics need to address validation, scalability, transaction processing, and load metadata.
Recommendations
• SSIS Error Flows to replace TSQL Functions
o All Existing Source to Stage functions in ETL Database
o This needs to be investigated as to where and how this change would be implemented
o Alternative: T-SQL Error Control can produce the current failing row, but not a batch of failed rows like SSIS
• Dimensional Modeling and Star Schema
o Continue to Gather and Collect Use Cases
o Identify Entities (Dimensions), Relationships, and Attributes for Star Schema Conceptual Model
▪ Used to Obtain Stakeholder Consensus
o Create Flat Table Definitions of Entities (Dimensions) using Excel
o Model the Star Schema Conceptual Model using ERWin
o Continued Learning and Training on ERWin, possible Pluralsight, or Webinars (http://erwin.com/videos/)
• Architecture | Option 1, Shown Below in the Following Section, Data Warehouse Architecture
o Current use does not require an integrated cloud environment
▪ Users do not experience performance issues using a gateway with on-premises data
▪ This is especially relevant given that the on-premises environment should receive maximum effort
o Cloud analysis and analytics via Power BI, using the gateway and on-premises data
o Tableau users have access to on-premises data for analytics
o This approach allows a scalable roadmap for integrating the on-premises environment and the cloud in the future
• Hadoop | Reserve for Future Roadmap
o Low volume | Carl estimated 1 TB of data
▪ As an unwritten rule shared by practitioners, Hadoop needs at least 5 TB to justify the investment and achieve reasonable performance
o High to moderate investment for on-premises or cloud-based Hadoop
o Although volume, velocity, variety, and veracity are all considerations, sufficient volume is required to justify federation
• Naming of Business Resources with Industry-Standard Names
o "Data Lake" conventionally refers to Hadoop-based storage; rename this resource to something else
o The "Data Vault" is actually a data warehouse (not as big a concern as the naming conflict above)
• 2 Summary Tables from the Analysis Sections towards the end of this document.
o 1 | Data Warehouse and Data Marts
o 2 | Extract, Transform, and Load
1 | Data Warehouse and Data Marts
• Security | SQL Server security appears adequate and follows industry standards.
• Partitioning | Based on a review of development data only, we see no need for partitioning. No performance issues were reported.
• Alerting | Combining try-catch, database mail, and SQL Server Agent is highly recommended for alerting on issues.
• Indexing | Index discovery and an enterprise index strategy are recommended for production servers.
• Star Schema | There is currently no star schema; one is highly recommended.
• Conformed Dimensions | Since there is no star schema, there are no conformed dimensions.
• Scalability | Scalability appears to be a concern. A future-use plan for instances, files, and file groups is suggested.
• Exception Handling | Try-catch is not being used in functions and stored procedures. Adding it is highly recommended.
• Transaction Processing | The environment does not use transaction processing. It is recommended for future phases.
• SQL Views (Business Views) | A star schema is suggested to reduce the complexity of creating and managing SQL views for the business.
• Surrogate Keys | Integer surrogate keys are suggested for the star schema.
• Delta Loads | TSQL merge is adequate; however, the additional use of checksums should be considered. Load metadata is needed.
2 | Extract, Transform, and Load
• Load Meta Data | There is currently no tracking of load metadata; it is highly recommended.
• Environments and Environment Variables | Environments and environment variables are being used with success.
• Parameters | Parameters are being used with success.
• Logging | Logging with the SSIS framework is working; however, load metadata logging is suggested.
• Validation | Little to no validation is currently employed; adding it is highly recommended.
• Transaction Processing | Transaction processing is recommended at the SQL Server level (functions and stored procedures) rather than in SSIS.
• Package Sequencing | There are no reported errors or issues with the current package sequencing.
• Connection Managers | Connection managers are being used with success.
• Alerting | Alerting is not enabled in the solutions evaluated; it is highly recommended.
• Exception Handling | There is no exception handling in the packages evaluated; it is highly recommended.
• Checkpoints | Some restartability should be designed and implemented.
• Naming Conventions | Naming conventions are recommended.
Data Warehouse Architecture
Below, we provided diagrams representing the current architecture and 2 options for the next stage of the BI / Data
Warehouse Architecture. An “all-features” Microsoft On-Premises Architecture diagram can be found in Appendix A.
These diagrams are intended to help the decision makers compare their current architecture with possible phases. In
order to help clarify these phases, this section also includes information about the Current Error Control Design and
information about Integrating On-Premises and the Cloud.
The Current Architecture stages both enterprise and project / application specific data sources from both internal and external locations. Data sources intended for the data warehouse are staged first and then loaded. The data stored in the data warehouse becomes the source for the data mart. Data source examples include UltiPro, Vision, and Active Directory. Other data sources such as NSF, IPEDs, and RevIt are stored on other dedicated SQL Server storage. In the cloud, data sources from Indoor Positioning and Marquette indicate the slow adoption of integrating on-premises data with cloud data. With this identified, the options discussed include an all on-premises option and an integrated on-premises and cloud option.

[Diagram] Figure 1 | Current Architecture: enterprise systems (UltiPro, Active Directory, Vision, NewForma) pass through an ETL stage into the 3NF Data Vault data warehouse (UltiPro and Vision schemas) and then into the denormalized data mart; RevIt and NSF / IPEDs data reside on separate SQL Server storage; Blue Vision indoor positioning data flows through Azure Event Hub, Azure Streaming Analytics, and Azure Table Storage into Azure SQL Database, alongside normalized Marquette data in a second Azure SQL Database; Tableau and Power BI end users consume the data.
Phase 1, an all on-premises data warehouse design, dictates that all of the data structures and data be stored on internal company resources (no cloud). In this option, the star schema and cubes exist and remain on internal company resources; however, these resources and their content connect to cloud applications such as Power BI. This connection is facilitated by a gateway. In this option, the gateway is required to make multiple trips when sourcing data. Even so, this design gives end users of Power BI or Tableau a flexible and feasible option.
[Diagram] Figure 2 | Option 1, All On-Premises with Gateway: the current architecture extended with a star schema and cube analytics in the data mart; Power BI reaches the on-premises data over HTTP through a gateway (no VPN), while Tableau connects directly; the Azure components (Event Hub, Streaming Analytics, Table Storage, Azure SQL Database) continue to hold Blue Vision and Marquette data.
Phase 2, an integrated on-premises and cloud data warehouse, seeks to design a hybrid data warehouse providing the best of both the on-premises and cloud environments. In Option 2, SQL Server Integration Services loads data from the data mart star schema into Azure SQL Database, or directly into Azure SQL Server Analysis Services. This option differs from Option 1 in that on-premises data is copied to the cloud to be consumed by applications such as Power BI. For a data scientist or business analyst using Power BI, having the on-premises data in the cloud provides fast analysis together with external data already in the cloud. In the first option, the gateway must make multiple trips when sourcing data; in this option, the data already exists in the cloud, so gateway use is minimized.
[Diagram] Figure 3 | Option 2, Integrated On-Premises and Cloud: as in Option 1, the data mart gains a star schema and cube analytics, but the star schema data is also loaded into Azure SQL Database and Azure SSAS over a VPN through the gateway, so Power BI consumes cloud copies of the on-premises data alongside the existing Azure components (Event Hub, Streaming Analytics, Table Storage, Azure SQL Database).
In the next two diagrams, the flow of data is separated into five phases. These phases are Enterprise Source Systems,
Staging, Data Warehouse, Data Mart, and Star Schema. However, for this analysis, we are focusing on the first two
rectangles titled Enterprise Source Systems and Staging. The first diagram displays the current use of functions to extract
data from the source systems. The second diagram displays a possible use of SSIS Error Flows.
Functions: the diagram below represents the current flow of data, in which functions extract data from the source systems. These functions are intended to exist on the actual source system in a database named ETL, but in the case of UltiPro (a backup / restore process) the functions exist on the ETL database used by the data warehouse. These functions load the staging tables used in the downstream merge. The advantage of this design is that changes to the architecture can be implemented without affecting downstream objects such as SSIS. The concern with this design is that during a load failure, the specific rows that failed are not easily identifiable, so a detailed alert containing the failed rows cannot be generated.
[Diagram] Figure 4 | Current Architecture Using Functions: data flows from the Enterprise Source Systems (UltiPro, Active Directory, Vision) through Staging into the Data Warehouse, Data Mart, and future Star Schema / SSAS layers. SQL functions stored in the ETL database create the load tables from source; for UltiPro (a backup / restore process) the functions must be stored on Stage rather than on the source system, and Active Directory uses an SSIS plugin for connecting and extracting data. A merge statement loads inserts and updates into the data warehouse, and stored procedures on the data mart execute functions on the data warehouse to load the data mart. The star schema is to be designed and developed in future phases. Warning: using functions to pull the source data prevents using SSIS Data Flow Tasks, which means there is no error flow that stores failed rows for evaluation and fixing.
Error Flow: the diagram below represents the proposed flow of data using SSIS error flows. The use of SSIS is intended to replace the existing functions that load the staging tables. As shown in the diagram, SSIS can create an error flow to capture rows that fail the load process. This allows the details and cause of the failure to be emailed to the appropriate stakeholders. Once SSIS loads the staging tables and stores any row failures, the rest of the data flow remains the same as in the current diagram.
[Diagram] Figure 5 | Using SSIS Error Flows: the same flow as Figure 4, except that a Data Flow Task uses an SSIS data source to extract data from the source systems into the ETL database. The downstream merge, data mart stored procedures, and future star schema layers are unchanged. Notice: using the Data Flow Task allows the use of error flows, which means failed rows can be captured and stored for evaluation and fixing.
The options to integrate on-premises and cloud are diagrammed below. The full overview shows Site-to-Site and Point-to-Site VPNs as well as an HTTP connection. These three options provide different levels of security and IPSec standards. An additional option for the Site-to-Site VPN is ExpressRoute
(https://azure.microsoft.com/en-us/services/expressroute/).
ExpressRoute is a Microsoft Azure service that provides advanced scalability, increased reliability and speed, lower latency, and WAN integration. It is a paid, pay-per-use service.
[Diagram] Figure 6 | Integrate On-Premises and Cloud: a full overview of three connection options between on-premises SQL Server / workstations and the cloud gateway over the Internet: a Site-to-Site VPN (optionally ExpressRoute), which is secure, controlled, and offers better connectivity quality; a Point-to-Site VPN; and a plain HTTP connection.
SQL Server Best Practices
SQL Server best practices were discussed and explained during a meeting with the SmithGroup JJR DBA and Infrastructure teams. The practices were demonstrated on development servers so as not to jeopardize production SLAs. All decisions about whether and when to implement these best practices were left to SmithGroup JJR.
• NTFS Allocation Unit (AU) | Block size = 64 KB, alignment = 1024 KB (default is 4 KB). Use /L with Format on Windows 2012 and above.
• Max Degree of Parallelism (MAXDOP) | Set to the number of cores in a single CPU socket.
• DB Auto Growth | Set high for performance (100 MB to several GB).
• Cost Threshold for Parallelism | For OLTP, where the goal is to minimize parallelism and offer more concurrency, use 15-20 (up to 50 with modern CPUs). For DSS, OLAP, data warehouse, and test environments, consider leaving the default and managing parallelism with MAXDOP if concurrency is a problem.
• TempDB | 1:2 or 1:4 ratio of tempdb data files to cores; 1:1 ratio for large systems. Pre SQL Server 2016: use trace flags T1117 and T1118 to enable consistent auto growth. On flash arrays, enable the SORT_IN_TEMPDB index build option.
• Separate Data / Log Volumes | Tier 1; test to determine for Tier 2 flash arrays. Multiple volumes per file group to reduce latch contention; 4-8 files per file group. 3 volumes (TempDB, data / log files, and backups) for fast flash (under 1 ms response times).
• Max Server Memory | 90% of available server memory.
• Enable Instant File Initialization | Windows Server setting: Perform Volume Maintenance Tasks must be granted under Local Policies > User Rights Assignments.
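As an illustration of how two of the instance-level settings above might be applied, the following is a sketch only; the numbers shown are examples, and the actual values should be derived from SmithGroup JJR's hardware.

```sql
-- Sketch: apply MAXDOP and cost threshold for parallelism at the instance level.
-- The values 8 and 50 are examples, not recommendations for this environment.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 8;       -- example: 8 cores per socket
EXEC sp_configure 'cost threshold for parallelism', 50; -- example for modern CPUs
RECONFIGURE;
```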
Case for Hadoop: Indoor Positioning Study (POE)
During our initial meetings regarding the data sources at SmithGroup JJR, we identified one possible use case for Hadoop: the Indoor Positioning Study (POE). During our conversations, multiple questions were asked about Hadoop, such as what the minimum data size is and how to handle aggregates on unstructured data. Hadoop does not perform well on 5 TB or less. It is also worth noting that small files do not work well with Hadoop and should be combined into larger files. As for aggregates in Hadoop: if SmithGroup JJR were to use Azure Data Lake Store (ADLS), they could use HDInsight and Hive; if they use SQL Data Warehouse or SQL Server 2016, they could use PolyBase. Another option is Azure Data Lake Analytics / U-SQL to aggregate Hadoop data.
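To illustrate the PolyBase route mentioned above, the following sketch exposes files in Azure storage to ordinary T-SQL. All object names, the storage location, and the column schema are hypothetical, and a database-scoped credential is typically also required before the data source can be created.

```sql
-- Sketch: querying Azure-hosted files from SQL Server 2016+ via PolyBase.
-- Every name and path below is a placeholder.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://data@account.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.PositioningRaw
(   DeviceId   INT,
    SpaceId    INT,
    ReadingUtc DATETIME2 )
WITH (LOCATION = '/positioning/',
      DATA_SOURCE = AzureStorage,
      FILE_FORMAT = CsvFormat);

-- Aggregates can then be written as plain T-SQL over the external table:
SELECT SpaceId, COUNT(DISTINCT DeviceId) AS Devices
FROM dbo.PositioningRaw
GROUP BY SpaceId;
```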
Below are some questions from the Indoor Positioning Study (POE) documentation that describe what SmithGroup JJR would like to answer with this data source. The questions below are broad topics, each with more specific questions.
• How do people utilize space?
o What is the average dwell time by space?
o How does the number of people within a space vary over time?
o What are the most frequently used paths between spaces?
• How do people interact and collaborate?
o How much time do people spend in spaces occupied by other people?
o What is the average number of people in a collaborative space?
o How does job/organizational role impact collaboration?
• Person movement
o How often do people move between spaces?
o What is the average duration of rest (motion)?
Additional Questions:
• Exact location of a user within space
• Actual paths traveled between spaces
• Relationship between workspace and study subject (employee/organizational) measures, such as happiness or
productivity (what is an abstract term that captures these types of things?)
• Comparison of varied workspace configurations / designs / arrangements such as office /open / free assignment
• Integration with other technologies and data sources, such as space scheduling software, communication
software, galvanic skin response, implanted telemetry chips, health and dental records, etc.
After talking with Peter, he estimated the size of the Indoor Positioning Study (POE) data at SmithGroup JJR at most one terabyte. Since this is much less than the five-terabyte minimum for Hadoop clusters, we do not suggest implementing a Hadoop cluster for this use case.
SQL Server / Database Discovery
In order to complete a data discovery, we were provided 3 databases:
• DataVault
• DataMart
• ETL
We performed the data discovery using 2 different methods. The first was to create a web-based document of each database using Redgate's SQL Doc. The second was to use SQL Server DMVs and TSQL to create an Excel-based data dictionary. The files are included in the SharePoint folder along with this document. Also note that Shabhana provided the WBS Migration Changes to Datawarehouse Systems.pdf, where much of this information can also be found.
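The DMV / catalog-view method can be sketched as a query of this general shape, which lists every column with its type and nullability for export to Excel:

```sql
-- Sketch: a catalog-view query of the kind used to build the data dictionary.
SELECT  s.name  AS SchemaName,
        t.name  AS TableName,
        c.name  AS ColumnName,
        ty.name AS DataType,
        c.max_length,
        c.is_nullable
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types  ty ON ty.user_type_id = c.user_type_id
ORDER BY s.name, t.name, c.column_id;
```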
Finally, we collected information about the various data sources (both internal and external). The list of data sources is
as follows:
Internal
Enterprise Data Sources
• Vision | Enterprise Resource Planning software
• UltiPro | Human Resources
• SharePoint | Document Management and Collaboration
• Active Directory (AD) | Network / Domain Information
• NewForma | Project Metadata and RFIs
Project / Application Specific
• Revit Data Collector | Building Information Modeling (Model Statistics)
• CER
• WorkSim | Space Planning
• Indoor Positioning Study (POE) | Azure SQL for People Movement in Workspace
• Campus Project Data (Marquette) | Campus Planning & Space
External
• IPEDS | Public University Data
• National Science Foundation | Public Data for Funded Projects
• Bureau of Labor | Government Labor Statistics
• GIS | Topographical Data, Land Surveys
Data Warehouse and Data Marts
In order to analyze SmithGroup JJR’s Data Warehouse / Data Mart environment, we were provided 3 databases:
• DataVault
• DataMart
• ETL
Overview
The data warehouse (DataVault) and the data mart (DataMart) are the 2 databases that make up the SmithGroup JJR BI environment. The DataVault is a 3NF database. The DataMart is de-normalized and currently contains employee and project data. At this time, there is no star schema; however, there are plans to build one in the future. The DataVault stores source data under schemas named for the corresponding source systems, such as UltiPro and Vision. To complete the analysis below, a server and database discovery was completed as well.
Analysis
The areas for analysis for these 2 databases include the following topics:
• Security
• Scalability
• Partitioning
• Exception Handling
• Alerting
• Transaction Processing
• Indexing
• SQL Views (Business Views)
• Star Schema
• Surrogate Keys
• Conformed Dimensions
• Delta Loads (Merge SCD 1 and SCD 2, Checksums)
Security should always be the first concern in planning and deploying any data warehouse / data mart environment. In reviewing the defined roles, we found the following server roles: bulkadmin, dbcreator, diskadmin, processadmin, public, securityadmin, serveradmin, setupadmin, and sysadmin. There were no user-defined SQL Server roles. The sa account was enabled but not being used. There was no implementation of Row-Level Security or Role-Based Security. SQL schemas such as ultipro, vision, ad, and admin were used to scale and organize the various SQL Server objects.
Scalability is a very high priority for companies that want to deliver solutions that last 5 or more years after the initial deployment. Many of the server and database DMVs listed above help us determine scalability. For instance, using instances, partitioning, files and file groups, and synonyms can help make a system more scalable. Instances allow better resource management between different processes on the same server. They also allow us to separate load layers such as stage, consolidation, transformation, 3NF, star, and analytics. Since we only had access to development servers, we did not see any examples of instances, but we highly recommend them in production. We also looked at partitioning, which is discussed below. As for files and file groups, we have provided an Excel spreadsheet identifying the files and file groups and their current sizes. We also provided size information for all of the tables in the 3 databases we were asked to analyze. File and table sizes are important indicators for scalability and for where to set auto growth on your tables. The data and log files were on the same volume and had the following sizes:
FileName FileSizeMB SpaceUsedMB AvailableSpaceMB %FreeSpace
DataVault 3004 1636.69 1367.31 45.52
DataVault_log 36828.31 1256.45 35571.87 96.59
Looking at development, we were okay with these settings; however, the auto growth was not what we would recommend in production. Finally, synonyms are an easy way to manage server-to-server (physical, or instance) connections without taking the risk of using linked servers. We did notice 3 linked servers (FINANCIALDATA, SGJJR-SQL2ASCCM2012, and VISIONDEVDB). These linked servers were not part of the scope provided by SmithGroup JJR; however, we would warn against relying too heavily on linked servers.
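The file-size figures reported above can be produced with a query along these lines. This is a sketch; the division by 128 converts the 8 KB page counts that SQL Server reports into megabytes.

```sql
-- Sketch: per-file size and free-space report, run in the database of interest.
SELECT  name AS FileName,
        size / 128.0 AS FileSizeMB,                                   -- pages -> MB
        FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS SpaceUsedMB,
        (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS AvailableSpaceMB
FROM sys.database_files;
```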
Partitioning is a great way to manage reporting performance in a data warehouse / data mart environment. Currently, SmithGroup JJR is not using any partitioning strategy. Since we only had access to development data, it is hard to tell whether the sizes in our analysis represent real production sizes; however, given the database sizes we encountered in the scope of this analysis, we do not recommend partitioning at this time. The DataVault contained a total of 2,548,891 rows; the largest table, [vision].[ProjectFinancialsByPeriod], had 730,655 rows, and the section above shows a data size of 3004 MB for the DataVault.
Exception Handling in SQL Server (TSQL) is accomplished using TRY...CATCH blocks. We did our due diligence in verifying that there is no exception handling at the SQL Server level, and confirmed this with the different teams at SmithGroup JJR. Exception handling is recommended for future phases of development.
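A minimal sketch of the recommended pattern follows; the procedure, merge step, and log table names are hypothetical, not objects that exist in the current databases.

```sql
-- Sketch: TRY...CATCH around an existing load step, logging and re-raising errors.
CREATE OR ALTER PROCEDURE etl.LoadEmployee
AS
BEGIN
    BEGIN TRY
        EXEC etl.MergeEmployeeStageToVault;   -- hypothetical existing merge step
    END TRY
    BEGIN CATCH
        -- Record what failed and where before surfacing the error.
        INSERT INTO etl.ErrorLog (ErrorNumber, ErrorMessage, ErrorProcedure, LoggedAt)
        VALUES (ERROR_NUMBER(), ERROR_MESSAGE(), ERROR_PROCEDURE(), SYSDATETIME());
        THROW;   -- re-raise so SQL Agent and callers still see the failure
    END CATCH
END;
```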
Alerting in SQL Server is a combination of TRY...CATCH blocks, database mail, and SQL Server Agent. Database mail and SQL Server Agent allow us to define operators and alerts. Alerts can then be defined on performance conditions or on SQL Server events based on an error number raised from a TRY...CATCH block. Alerting is recommended for future phases of development.
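As a sketch of how these pieces combine, the CATCH block of a load procedure might send the failure details by Database Mail. The mail profile name and recipient below are placeholders, not existing configuration.

```sql
-- Sketch: CATCH-block fragment that emails failure details via Database Mail.
BEGIN CATCH
    DECLARE @msg NVARCHAR(2048) =
        CONCAT('Load failed in ', ERROR_PROCEDURE(),
               ' with error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE());

    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'DW Alerts',            -- placeholder mail profile
         @recipients   = 'bi-team@example.com',  -- placeholder recipient
         @subject      = 'Data warehouse load failure',
         @body         = @msg;

    THROW;   -- still surface the error after alerting
END CATCH
```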
Transaction Processing is the process of ensuring that data is written to disk before we commit a transaction and move on to the next step in the process. Transaction processing also provides a mechanism to roll back any data that has been written if the transaction fails before a commit can take place. It is a critical part of any design. Currently, the environment does not use transaction processing. Transaction processing is recommended for future phases of development.
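A minimal sketch of the commit-or-rollback pattern, using hypothetical load procedure names:

```sql
-- Sketch: wrapping related load steps in one explicit transaction.
BEGIN TRY
    BEGIN TRANSACTION;
        EXEC etl.LoadEmployeeStage;           -- hypothetical step 1
        EXEC etl.MergeEmployeeStageToVault;   -- hypothetical step 2
    COMMIT TRANSACTION;                        -- both steps persisted together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;                  -- undo any partial writes
    THROW;
END CATCH
```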
Indexing has a huge impact on server and query performance. DMV queries that identify unused indexes, and indexes that need to be reorganized or rebuilt, should be run on a regular basis. Index discovery and an enterprise index strategy are recommended.
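One common discovery query of this kind is sketched below. Because the usage statistics reset on restart, results should only be interpreted over a representative production workload period.

```sql
-- Sketch: indexes in the current database that are maintained but never read.
SELECT  OBJECT_NAME(s.object_id) AS TableName,
        i.name AS IndexName,
        s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats s
JOIN sys.indexes i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks + s.user_scans + s.user_lookups = 0   -- never read
  AND s.user_updates > 0;                                 -- but still maintained
```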
SQL Views (Business Views) can be used to denormalize and simplify data structures in the 3NF for reporting purposes.
Currently, both the DataVault and DataMart use SQL Views, however, the ETL database does not. There are 32 SQL
Views grouped into 5 different schemas (admin, api, dbo, lookup, and vision) in the DataVault database. The DataMart database has 10 SQL Views, all in the dbo schema. A star schema is suggested to reduce complexity in creating and
managing SQL Views for the business.
Star Schema is not used and is not being developed. It is highly recommended for future phases.
Surrogate Keys are used to provide referential integrity in a data warehouse / data mart that sources data from numerous systems, each with different keys defined for the same entity / attribute, such as Person / Social Security Number. Surrogate keys are employed in the SmithGroup JJR DataVault and DataMart; however, GUIDs have been used. This causes no issues in the data warehouse, but the reporting star schema should use integers for load and processing performance.
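A sketch of what the integer surrogate key might look like in a future dimension table; all table and column names here are hypothetical, with the source GUID retained as an alternate key back to the DataVault.

```sql
-- Sketch: integer surrogate key for a star-schema dimension.
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    EmployeeSourceId UNIQUEIDENTIFIER  NOT NULL,              -- GUID from DataVault
    EmployeeName     NVARCHAR(200)     NOT NULL
);

-- Keep the source GUID unique so loads can match back to the warehouse.
CREATE UNIQUE INDEX UX_DimEmployee_SourceId
    ON dbo.DimEmployee (EmployeeSourceId);
```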
Conformed Dimensions at the database level entail ensuring that the star schema has only 1 dimension for a specific entity, such as employee or region. Any data mart use of the employee or region entity needs to be sourced from the data warehouse and not reloaded with different logic and processes. Since there is no star schema for DataVault or DataMart, there are no dimensions to conform. We suggest a robust star schema for both the data warehouse and the data marts.
Delta Loads are both a performance issue and a management issue. Loading only the data that has changed since the last load can be implemented and managed in many ways; TSQL merge, checksums, and last-load-date tables can all be used to determine whether a row has changed since the table was last loaded. SmithGroup JJR uses TSQL merge, but not checksums or a last-load-date table. At today's data sizes, checksums and a stored last load date are not strictly necessary, but they are recommended for performance and scalability. These same mechanisms can also be used to implement slowly changing dimensions once a star schema is developed. There was an initial plan at SmithGroup JJR to use logging, error logging, and number tables to manage load metadata, but it was not implemented.
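A sketch of how a stored checksum could extend the existing merge pattern follows. The table and column names are hypothetical; note also that CHECKSUM can produce collisions, so HASHBYTES is a safer (if more expensive) alternative for change detection.

```sql
-- Sketch: delta load that skips unchanged rows via a stored row checksum.
MERGE dbo.Employee AS tgt
USING (SELECT EmployeeId, EmployeeName, Department,
              CHECKSUM(EmployeeName, Department) AS RowChecksum
       FROM stage.Employee) AS src
   ON tgt.EmployeeId = src.EmployeeId
WHEN MATCHED AND tgt.RowChecksum <> src.RowChecksum THEN
    UPDATE SET EmployeeName = src.EmployeeName,
               Department   = src.Department,
               RowChecksum  = src.RowChecksum,
               LastLoadDate = SYSDATETIME()          -- last-load-date tracking
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EmployeeId, EmployeeName, Department, RowChecksum, LastLoadDate)
    VALUES (src.EmployeeId, src.EmployeeName, src.Department,
            src.RowChecksum, SYSDATETIME());
```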
Extract, Transform, and Load
In order to analyze SmithGroup JJR’s SSIS environment, we were provided 4 SSIS Solutions for the following areas:
• Active Directory | 12 Packages
• Deltek Vision | 63 Packages
• Ultipro | 19 Packages
• NewForma | 21 Packages with 11 Disabled
Overview
The Active Directory load process uses the KingswaySoft Directory Services Integration toolkit to provide access to Active Directory data. Using this tool, the solution extracts Active Directory data for the following areas: computer, group, group member, and user. The master ActiveDirectory.dtsx package calls 3 sub-packages named Extract, Transform, and Load. As the names indicate, the Extract package takes data from the source system and temporarily stores it in the ETL database. Unlike the other, more complex ETL solutions, this solution has no tasks or data flow transformations in the Transform package; Transform.dtsx could possibly be disabled. The Load package calls stored procedures located in the ETL database, which then load the data warehouse.
The Active Directory design pattern uses the KingswaySoft Directory Services Integration toolkit to extract data from
the source system and place it into stage tables in the ETL database on the Data Warehouse server. Once the data is
staged there, ETL stored procedures in the same database load the staged data into the DataVault tables using the merge
statement.
The Deltek Vision solution follows a similar process, calling separate sub-packages for the Extract, Transform, and
Load phases of the data load. This solution additionally has a PreProcessing and a PostProcessing package. The
PreProcessing package truncates the TPH tables. The PostProcessing package is empty and could possibly be disabled.
The Extract package takes data from the source system and temporarily stores it in the ETL database for tables
such as client, vendor, and employee. The Transform package transforms data for vendor / client, contact / employee,
project / opportunity, and project dependents. The Load package calls stored procedures located in the ETL database,
which then load the data warehouse for these same areas.
The Vision design pattern includes an ETL database stored on the transactional server. This ETL database stores
functions that extract data from the source system and place it into stage tables in the ETL database on the Data
Warehouse server. Once the data is staged there, ETL stored procedures load the staged data into the DataVault tables
using the merge statement.
The Ultipro solution uses a different process by organizing the Extract, Transform, and Load phases of the data load into
separate containers. The Extracts package stores data in the ETL database. The Transform package transforms data for
organization, employment, and employmentHistory. The Load package loads tables using the TSQL Merge statement
from the ETL database to the data warehouse. Please review Shobhana’s WBS Migration Changes to Datawarehouse
Systems.pdf for an ETL Dataflow Diagram and other useful package information.
Since Ultipro is loaded via a backup-and-restore process, the Ultipro design pattern does not include ETL database
functions stored on the transactional server. Instead, the extract functions are stored in the ETL database on the Data
Warehouse server. These functions extract data from the source system and place it into stage tables in the same ETL
database. Once the data is staged, ETL stored procedures load the staged data into the DataVault tables using the
merge statement.
The NewForma (oblivion) load process is a non-standard process that should be updated to the design pattern
described for the Vision load above. There is a monthly load that calls a weekly load, which calls an hourly load
that is not currently being used. The weekly load package also has an archival process. Besides the monthly load,
there is a daily load that calls the hourly load; both the weekly and daily load packages call the same hourly
package.
The current state of the NewForma load process executes two packages in parallel. The first package is Execute
etlOrgChart and the second package is Execute etlNewformaProjects. Execute etlNewformaProjects has two child
packages named Execute etlProjectRFIs and Execute etlProjectMilestones. The packages etlOrgChart and
etlProjectMilestones both use the KingswaySoft SharePoint Integration toolkit to extract and load data to and from
SharePoint lists. This process needs to be updated and promoted to production using the standard design pattern.
Analysis
The areas analyzed for these four solutions include the following topics:
• Load Meta Data
• Package Sequencing (Master and Child Packages, SQL Jobs, Conformed Dimensions, Dimensions, Facts, Data Marts)
• Environments and Environment Variables
• Connection Managers (Package and Project)
• Parameters (Package and Project)
• Alerting
• Logging
• Exception Handling
• Validation
• Checkpoints
• Transaction Processing (requires MSDTC)
• Naming Conventions
Load Meta Data is important: it can track load start and end times by package, by table, and even for cube
processing; it can track row counts for inserts and updates; it can provide restartability that is more robust
than SSIS checkpoints; and it can provide rollback information during a failure. Currently there is no tracking of
load metadata, and implementing it is highly recommended.
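A minimal load-metadata structure might look like the following; the table and column names are illustrative only:

```sql
-- Illustrative load-metadata table: one row per package/table execution.
CREATE TABLE etl.LoadLog (
    LoadLogID     INT IDENTITY(1,1) PRIMARY KEY,
    PackageName   NVARCHAR(128) NOT NULL,
    TableName     NVARCHAR(128) NOT NULL,
    LoadStart     DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
    LoadEnd       DATETIME2     NULL,
    RowsInserted  INT           NULL,
    RowsUpdated   INT           NULL,
    LoadStatus    VARCHAR(20)   NOT NULL DEFAULT 'Running'  -- Running / Succeeded / Failed
);

-- Each load stored procedure writes a row at the start of its run...
INSERT INTO etl.LoadLog (PackageName, TableName)
VALUES ('Deltek Vision', 'dw.Client');

-- ...and closes it out on completion, capturing row counts from the merge.
UPDATE etl.LoadLog
SET LoadEnd      = SYSDATETIME(),
    RowsInserted = 120,
    RowsUpdated  = 35,
    LoadStatus   = 'Succeeded'
WHERE LoadLogID = SCOPE_IDENTITY();
```

A table like this gives the restart process a record of which loads completed, and the row counts feed directly into the validation checks discussed later.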
Package Sequencing controls the order of how the packages load the tables. In terms of packages, we have a master
package and then child packages. The master package may call child packages such as a conformed dimension package,
a dimension package, and a fact package. Child packages can also call data mart packages that duplicate data warehouse
dimensions and facts to be used as data marts. Finally, SQL Jobs can be used to schedule different load patterns and
times such as daily, every hour, and even weekly, or monthly. Since package sequencing is already working and not
causing issues at this time, this is not a high priority for redesign.
Environments and Environment Variables are used to provide a mechanism for changing project data connections and
variables during a change control migration from one environment to another, such as development to test, or test to
production. Currently, environments and environment variables are being used with success.
Connection Managers can be defined at either the project or the package level. In most cases, connections that need
to change from environment to environment, or that are used in many packages, should be project connections.
Connections required by only a single package that will not change between environments can be package connections.
Parameters can likewise be defined at either the project or the package level. In most cases, parameters that need
to change from environment to environment, or that are used in many packages, should be project parameters.
Parameters required by only a single package that will not change between environments can be package parameters.
Alerting in SSIS is provided by using an SMTP connection. This connection can then be used in a task flow, a data
flow, or even in an event handler such as OnError. Alerting is not enabled in the solutions evaluated and is highly
recommended.
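Alerting can also be raised at the SQL Server level with Database Mail, for example from an Execute SQL Task wired to the OnError event handler. The profile and recipient names below are placeholders:

```sql
-- Requires Database Mail to be configured with a valid mail profile.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'ETL Alerts',            -- placeholder profile name
    @recipients   = 'bi-team@example.com',   -- placeholder distribution list
    @subject      = 'ETL load failure',
    @body         = 'The Deltek Vision load failed. See the ETL logs for details.';
```

This keeps alert configuration on the server rather than in each package, at the cost of a Database Mail setup step.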
Logging can be highly customized by creating a custom logging schema and tying it to the logging built into SQL
Server 2012 and newer. Since newer versions of SQL Server already provide robust logging, including verbose logging
for troubleshooting, we do not recommend any changes to logging at this time. An example custom logging diagram that
bridges to the data logged by newer versions of SQL Server has been provided for your reference.
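For reference, the built-in logging in SQL Server 2012 and newer (the SSISDB catalog) can be queried directly; a simple status query might look like this:

```sql
-- Query the SSIS catalog's built-in logging (SQL Server 2012+).
-- Lists the most recent package executions and their outcomes.
SELECT TOP (20)
    e.execution_id,
    e.package_name,
    e.start_time,
    e.end_time,
    CASE e.status
        WHEN 1 THEN 'Created'
        WHEN 2 THEN 'Running'
        WHEN 3 THEN 'Canceled'
        WHEN 4 THEN 'Failed'
        WHEN 7 THEN 'Succeeded'
        ELSE 'Other'
    END AS execution_status
FROM SSISDB.catalog.executions AS e
ORDER BY e.start_time DESC;
```

Queries like this are the bridge point for a custom logging schema, which can join to `catalog.executions` on `execution_id`.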
Exception Handling in SSIS can be addressed in multiple ways. One method is exception data flows, which can load
exception data into flat files such as text files, or into a table that stores the exception data in an XML format.
Another example is rolling back data when an error occurs; this can be accomplished at the end of an error flow or
in an event handler such as OnError. Finally, exception handling can be combined with alerting to let the
appropriate technical and business users know of an issue or delay. Since there is no exception handling in the
packages evaluated, implementing it is highly recommended.
Validation has many solutions. The most common include row- and column-based validation combined with what are
called sanity checks. In a fact table, column validation can sum a column and compare the result to an aggregate
value in the consolidation layer of the load process. For dimension validation, we can verify that the surrogate key
for a specific user ties back to multiple source systems through the stored business keys. Sanity checks tend to
focus on a known business rule and verify that the calculated value matches across multiple systems such as a source
system, the data warehouse, data marts, the cloud, and reporting tools. This validation can use load metadata as
well as SQL Tasks to gather and validate complex scenarios (data sources) when necessary. Since little to no
validation is currently employed, it is highly recommended.
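A column-level check of this kind can be as simple as comparing aggregates between layers. The table and column names here are hypothetical:

```sql
-- Compare a summed measure between the stage layer and the warehouse.
-- A nonzero difference indicates rows were dropped or altered during the load.
SELECT
    (SELECT SUM(InvoiceAmount) FROM stage.Invoice)  AS StageTotal,
    (SELECT SUM(InvoiceAmount) FROM dw.FactInvoice) AS WarehouseTotal,
    (SELECT SUM(InvoiceAmount) FROM stage.Invoice)
  - (SELECT SUM(InvoiceAmount) FROM dw.FactInvoice) AS Difference;
```

A query like this can run as a post-load SQL Task, with a nonzero difference routed to the alerting and exception-handling mechanisms described above.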
Checkpoints are SSIS’s built-in method for providing package restartability. They are configured by supplying a
checkpoint file location in the package-level properties; the property name is CheckPointFileName. Two other
properties, CheckPointUsage and SaveCheckPoints, must also be configured. These properties are defined in the
solutions we evaluated; even so, it is recommended that a more robust restartability strategy, such as one based on
load metadata, be designed and implemented.
Transaction Processing is a feature in SSIS; however, it requires that the Microsoft Distributed Transaction
Coordinator (MSDTC) be enabled. This coordinator adds overhead and is not always well received by DBA teams. Since
SmithGroup JJR already uses SQL Tasks to call stored procedures that utilize the TSQL merge statement, we recommend
handling transactions at the SQL Server level rather than in SSIS.
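Handling transactions at the SQL Server level, as recommended, can follow the standard TRY/CATCH pattern around the merge procedures. The procedure names below are illustrative:

```sql
-- Wrap the merge procedures in an explicit transaction so a failure
-- rolls back cleanly without involving MSDTC.
BEGIN TRY
    BEGIN TRANSACTION;

    EXEC etl.LoadClient;   -- illustrative stored procedure names
    EXEC etl.LoadVendor;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;  -- re-raise so the calling SSIS task is marked as failed
END CATCH;
```

Because the transaction lives inside the stored procedure call, SSIS only needs a plain SQL Task and no MSDTC configuration.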
Naming Conventions in SSIS may seem elementary, but good naming conventions in SSIS can help with readability and
maintenance, especially when introducing new developers to the ETL environment. A sample SSIS naming convention
document has been provided.
Appendix A | Microsoft Data Warehouse On-Premises Architecture
Below is a diagram that illustrates an ideal business intelligence (BI) architecture. This example is intended to show the
many pieces available in the Microsoft BI stack. This diagram includes source systems (structured data) that flow into a
SQL Server data repository. The data repository offloads reporting workloads from the production transactional
servers. The ETL process stages only the data required for each load. There are three paths for the data once it is
staged.
The first path is a quickly cleaned path intended for daily business reporting needs called an Operational Data Store.
Moving down the diagram, the second path is the fact pipeline. The fact pipeline will begin to denormalize and prepare
the fact data for transformation. The third path is the dimension pipeline. The dimension pipeline goes through two
other tools offered in the Microsoft BI Stack.
The first tool is Data Quality Services, used to cleanse the data. The second tool is Master Data Management, which
provides what the industry calls “Golden Record Management.” Golden record management gives you access to the most
pure, validated, and complete picture of the individual records in your domain. Products like Profisee
(https://profisee.com/grm) offer functionality beyond the tools included out of the box with SQL Server; this extra
“Golden Record” functionality includes matching, de-duplication, mastering, and record harmonization. Profisee also
offers graphical user interfaces, scorecards, and reports.
Figure 7 | Microsoft Data Warehouse On-Premises Architecture
[Diagram: structured source data (flat file, sales, customer service (CRM), accounting, human resource (HR), and
supply chain data) flows from enterprise systems into a data repository, is staged, and follows three paths: an ODS
for real-time operational reporting, a fact pipeline, and a dimension pipeline through DQS (Data Quality Services)
and MDM (Master Data Management), into a 3NF enterprise data warehouse (eDW) with Sales, CRM, and Marketing schemas;
star-schema data marts and a cube farm (Sales, CRM, and Marketing cubes) feed a SharePoint portal, Excel, SSRS, SSAS,
Power Query, and Power BI. Side panels list Data Steward tasks (define business rules, manage master data, subject
matter expert (SME), liaison between business and BI team), Business tasks (identify business question, define
staffing roles, data discovery, establish data stewards, agree on business rules, determine master data lists), and
BI Team tasks (plan prototype around question, product licensing, examine existing infrastructure, determine eDW
infrastructure, plan security / Kerberos, develop eDW architecture).]
Appendix B | Design Questions to Review
• How is data from multiple sources consolidated? For example, when we model DataVault today, we see three person
tables: vision.Person, ultipro.Person, and dbo.Person. According to the DBAs, dbo.Person is a consolidated version
of Person. This raises another question: what logic is used to consolidate the two source versions of Person? Is
there a reference or lookup table?
• What were the reasons and domain knowledge behind using GUIDs rather than INTs for surrogate keys?
• The nomenclature for the DataVault, DataMart, and DataLake databases is confusing. Consider renaming these
databases so they do not conflict with the more general, industry-accepted meanings of those terms.
tzu5xla
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 

Recently uploaded (20)

Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 

BI Environment Technical Analysis

During on-site meetings regarding the topics in this document, it was informally agreed that the initial two Epics for development focus on the two areas listed below:

• SSIS Error Flows to replace TSQL Functions
• Dimensional Modeling and Star Schema

The team at SmithGroup JJR has started gathering use cases. These use cases will be used during the dimensional modeling (star schema) development and serve 4 purposes:

• Identify entities, relationships, and attributes for the star schema conceptual model
• Develop the dimensional model (star schema conceptual model)
• Verify, once the draft conceptual model is complete, that the model can source the use cases
• Develop the front-end requirements, such as reports and dashboards

Future Epics need to address validation, scalability, transaction processing, and load metadata.
Recommendations

• SSIS Error Flows to replace TSQL Functions
  o Applies to all existing source-to-stage functions in the ETL database
  o Where and how this change would be implemented still needs to be investigated
  o Alternative: T-SQL error control can surface the current failing row, but not a batch of failed rows as SSIS can
• Dimensional Modeling and Star Schema
  o Continue to gather and collect use cases
  o Identify entities (dimensions), relationships, and attributes for the star schema conceptual model
    ▪ Used to obtain stakeholder consensus
  o Create flat table definitions of entities (dimensions) using Excel
  o Model the star schema conceptual model using ERWin
  o Continued learning and training on ERWin, possibly via Pluralsight or webinars (http://erwin.com/videos/)
• Architecture | Option 1, shown below in the following section, Data Warehouse Architecture
  o Current use does not require an integrated cloud environment
    ▪ Users do not experience performance issues using a gateway to on-premises data
    ▪ Especially considering that the on-premises environment should receive the maximum effort
  o Cloud analysis and analytics with Power BI using a gateway and on-premises data
  o Tableau users have access to on-premises data for analytics
  o This approach allows a scalable future roadmap to integrate the on-premises environment and the cloud
• Hadoop | Reserve for future roadmap
  o Low volume | Carl estimated 1 TB of data
    ▪ As an unwritten rule shared by experts, Hadoop needs at least 5 TB to justify the investment and achieve performance
  o High-to-moderate investment for on-premises or cloud-based Hadoop
  o Although Volume, Velocity, Variety, and Veracity are all considerations, Volume is required for federation
• Rename business resources using industry-standard naming
  o "Data Lake" is a term used with Hadoop; rename this resource to something else
  o The "Data Vault" is actually a data warehouse (not as pressing a concern as the naming conflict above)
• Two summary tables from the analysis sections toward the end of this document

1 | Data Warehouse and Data Marts

Security | SQL Server security appears adequate and consistent with industry standards.
Partitioning | After review of development data only, we do not see a need for partitioning. No performance issues were reported.
Alerting | Combining Try-Catch, Database Mail, and SQL Server Agent is highly recommended for alerting on SQL Server issues.
Indexing | Index discovery and an enterprise index strategy are recommended for production servers.
Star Schema | There is currently no star schema; one is highly recommended.
Conformed Dimensions | Since there is no star schema, there are no conformed dimensions.
Scalability | Scalability is a concern. A future-use plan for instances, files, and file groups is suggested.
Exception Handling | Try-Catch is not being used in functions and stored procedures. Adding it is highly recommended.
Transaction Processing | The environment does not use transaction processing. It is recommended for future phases.
SQL Views (Business Views) | A star schema is suggested to reduce the complexity of creating and managing SQL views for the business.
Surrogate Keys | Integer surrogate keys are suggested for the star schema.
Delta Loads | TSQL Merge is adequate; however, the additional use of checksums should be considered. Load metadata is needed.

2 | Extract, Transform, and Load

Load Meta Data | Currently there is no tracking of load metadata; it is highly recommended.
Environments and Environment Variables | Currently being used with success.
Parameters | Currently being used with success.
Logging | Logging with the SSIS framework is working; however, load metadata logging is suggested.
Validation | Little to no validation is currently employed; it is highly recommended.
Transaction Processing | It is recommended to handle transaction processing not in SSIS but at the SQL Server level in functions and stored procedures.
Package Sequencing | There are no reported errors or issues with the current package sequencing.
Connection Managers | Currently being used with success.
Alerting | Alerting is not enabled in the solutions evaluated; it is highly recommended.
Exception Handling | There is no exception handling in the evaluated packages; it is highly recommended.
Checkpoints | It is recommended that some restartability be designed and implemented.
Naming Conventions | Naming conventions are recommended.
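The Try-Catch / Database Mail / SQL Server Agent alerting pattern recommended in the tables above can be sketched as follows. This is a minimal illustration, not the site's actual code: the procedure name, mail profile, and recipient address are all hypothetical, and Database Mail must already be configured.

```sql
-- Hedged sketch: wrap a load step in TRY...CATCH and mail the error details.
-- dbo.usp_LoadStage_UltiPro, the 'DW Alerts' profile, and the recipient
-- address are hypothetical placeholders.
BEGIN TRY
    EXEC dbo.usp_LoadStage_UltiPro;   -- hypothetical load step
END TRY
BEGIN CATCH
    DECLARE @msg nvarchar(4000) =
        CONCAT('Load failed. Error ', ERROR_NUMBER(),
               ' at line ', ERROR_LINE(), ': ', ERROR_MESSAGE());

    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DW Alerts',           -- hypothetical Database Mail profile
        @recipients   = 'bi-team@example.com', -- hypothetical recipient
        @subject      = 'DW load failure',
        @body         = @msg;

    THROW;  -- re-raise so a SQL Agent job step also registers the failure
END CATCH;
```

Re-raising with THROW keeps SQL Server Agent in the loop, so operators and alerts defined there fire in addition to the emailed detail.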
Data Warehouse Architecture

Below, we provide diagrams representing the current architecture and 2 options for the next stage of the BI / data warehouse architecture. An "all-features" Microsoft on-premises architecture diagram can be found in Appendix A. These diagrams are intended to help the decision makers compare their current architecture with possible phases. To help clarify these phases, this section also includes information about the current error-control design and about integrating on-premises and the cloud.

The current architecture stages both enterprise and project / application specific data sources from both internal and external locations. Data sources intended for the data warehouse are staged first, then loaded. The data stored in the data warehouse becomes the source for the data mart. Data source examples include UltiPro, Vision, and Active Directory. Other data sources such as NSF, IPEDS, and Revit are stored on other dedicated SQL Server storage. In the cloud, data sources from Indoor Positioning and Marquette indicate the slow adoption of integrating on-premises data with cloud data. With this identified, the options discussed include an all on-premises option and an integrated on-premises and cloud option.

Figure 1 | Current Architecture (diagram: on-premises ETL stage, 3NF Data Vault data warehouse with UltiPro and Vision schemas, and de-normalized data mart; Azure Event Hub, Streaming Analytics, Table Storage, and Azure SQL Database feeds for the Indoor Positioning (Blue Vision IPS) and Marquette sources; Tableau and Power BI end users)
Phase 1, an all on-premises data warehouse design, dictates that all of the data structures and data be stored on internal company resources (no cloud). In this option, the star schema and cubes exist and remain on internal company resources; however, these resources and their content connect to cloud apps such as Power BI. This connection is facilitated by a gateway, which is required to make multiple trips when sourcing data. Even so, this design gives end users of Power BI or Tableau a flexible and feasible option.

Figure 2 | Option 1 (diagram: all on-premises with gateway — data mart star schema, cube, and analytics remain on-premises and are reached from the cloud over HTTP with no VPN)
Phase 2, an integrated on-premises and cloud data warehouse, seeks to design a hybrid data warehouse providing the best of both the on-premises and cloud worlds. In option 2, SQL Server Integration Services loads data from the data mart star schema into Azure SQL Database, or directly into Azure Analysis Services. This option differs from option 1 in that on-premises data is copied to the cloud to be consumed by applications such as Power BI. For a data scientist or business analyst using Power BI, having the on-premises data in the cloud provides fast analysis alongside external data already in the cloud. In the first option, the gateway is required to make multiple trips when sourcing data; in this option, the data already exists in the cloud, so gateway use is minimized.

Figure 3 | Option 2 (diagram: on-premises and cloud — SSIS loads the data mart star schema into Azure SQL Database / Azure SSAS over a VPN)
In the next two diagrams, the flow of data is separated into five phases: Enterprise Source Systems, Staging, Data Warehouse, Data Mart, and Star Schema. For this analysis, however, we are focusing on the first two, Enterprise Source Systems and Staging. The first diagram displays the current use of functions to extract data from the source systems. The second diagram displays a possible use of SSIS Error Flows.

Functions. The diagram below represents the current flow of data, where functions extract data from the source systems. These functions are intended to exist on the actual source system in a database named ETL, but in the case of UltiPro (backup / restore) the functions exist on the ETL database used by the data warehouse. The functions are used to load the staging tables used in the downstream merge. The advantage of this design is that changes to the architecture can be implemented without affecting downstream objects such as SSIS. The concern with this design is that during a load failure, the specific rows that failed are not easily identifiable, so a detailed alert containing the failed rows cannot be generated.

Figure 4 | Current Architecture Using Functions (diagram: SQL functions load staging tables in the ETL database; Active Directory is extracted via an SSIS plugin; a merge statement loads inserts and updates into the data warehouse; stored procedures on the data mart execute functions on the data warehouse to load the data mart; the star schema is to be designed in future phases. Warning: using functions to pull the source data prevents using SSIS Data Flow Tasks, so there is no error flow that stores failed rows for evaluation and fixing.)
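The function-based staging pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual SmithGroup JJR code: the schema, function, table, and column names are all hypothetical.

```sql
-- Hedged sketch of the current pattern: an extract function feeds a staging
-- table. All object names (etl.fn_ExtractEmployees, stage.Employees,
-- dbo.Employees) are hypothetical.
CREATE FUNCTION etl.fn_ExtractEmployees (@LastLoadDate datetime2)
RETURNS TABLE
AS
RETURN
    SELECT EmployeeID, FirstName, LastName, ModifiedDate
    FROM   dbo.Employees              -- source-system table
    WHERE  ModifiedDate > @LastLoadDate;
GO

-- Stage load. If any single row fails here (e.g. a conversion error), the
-- whole INSERT fails and the offending rows are hard to isolate -- the
-- limitation the SSIS error-flow recommendation addresses.
INSERT INTO stage.Employees (EmployeeID, FirstName, LastName, ModifiedDate)
SELECT EmployeeID, FirstName, LastName, ModifiedDate
FROM   etl.fn_ExtractEmployees('2017-01-01');
```

The single set-based INSERT is the crux of the concern: failure is all-or-nothing at the statement level, whereas an SSIS Data Flow Task can divert individual failing rows to an error output.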
Error Flow. The diagram below represents the proposed flow of data using SSIS Error Flows. SSIS is intended to replace the existing functions that load the staging tables. As shown in the diagram, SSIS can create an error flow to capture rows that fail the load process, which allows the details and cause of the failure to be emailed to the appropriate stakeholders. Once SSIS loads the staging tables and stores any row failures, the rest of the data flow remains the same as in the current diagram.

Figure 5 | Architecture Using SSIS Error Flows (diagram: a Data Flow Task uses an SSIS data source to extract data from the source systems; the merge statement, data warehouse, and data mart loads are unchanged. Notice: using the Data Flow Task allows the use of error flows, meaning failed rows can be stored for evaluation and fixing.)
The options to integrate on-premises and cloud are diagrammed below. The full overview shows Site-to-Site and Point-to-Site VPNs as well as an HTTP connection. These three options provide different levels of security and IPsec standards. An additional option for the Site-to-Site VPN is ExpressRoute (https://azure.microsoft.com/en-us/services/expressroute/). ExpressRoute is a Microsoft Azure service that provides advanced scalability, increased reliability and speed, lower latency, and WAN integration. It is a paid, pay-per-use service.

Figure 6 | Integrate On-Premises and Cloud (diagram: Site-to-Site secure VPN, optionally ExpressRoute — secure, controlled, better connectivity quality; Point-to-Site VPN; and HTTP from workstations through a gateway to SQL Server)
SQL Server Best Practices

SQL Server best practices were discussed and explained during a meeting with the SmithGroup JJR DBA and infrastructure teams. The practices were demonstrated on development servers so as not to affect production SLAs. All decisions on whether and when to implement these best practices were left to SmithGroup JJR.

NTFS Allocation Unit (AU) | Block size = 64 KB, alignment = 1024 KB. The default is 4 KB; use /L with Format on Windows 2012 and above.
Max Degree of Parallelism (MAXDOP) | Set to the number of cores in a single CPU socket.
DB Auto Growth | Set high for performance (100 MB up to several GB).
Cost Threshold for Parallelism | For OLTP, where we seek to minimize parallelism and offer more concurrency, use 15-20 (up to 50 with modern CPUs). For DSS, OLAP, data warehouse, and test environments, consider leaving the default and managing parallelism with MAXDOP if concurrency is a problem.
TempDB | 1:2 or 1:4 ratio of TempDB data files to cores; 1:1 for large systems. Pre SQL Server 2016: use trace flags T1117 and T1118 to enable consistent auto growth. On flash arrays, enable the SORT_IN_TEMPDB index build option to prevent index rebuilds.
Separate Data / Log Volumes | Tier 1; test to determine for Tier 2 flash arrays. Multiple volumes per file group to reduce latch contention; 4-8 files per file group. Three volumes (TempDB, data / log files, and backups) for fast flash (under 1 ms response times).
Max Server Memory | 90% of available server memory.
Enable Instant File Initialization | The Windows Server right "Perform Volume Maintenance Tasks" needs to be granted under Local Policies and User Rights Assignments.

Case for Hadoop: Indoor Positioning Study (POE)

During our initial meetings regarding the data sources at SmithGroup JJR, we identified one possible use case for Hadoop: the Indoor Positioning Study (POE).
During our conversations, multiple questions were asked about Hadoop, such as what the minimum data size is and how to handle aggregates on unstructured data. Hadoop does not perform well with 5 TB of data or less. It is also worth noting that small files do not work well with Hadoop and should be combined into larger files. As for aggregates in Hadoop: if SmithGroup JJR were to use Azure Data Lake Store (ADLS), they could use HDInsight and Hive; if they use SQL Data Warehouse or SQL Server 2016, they could use PolyBase. Another option is Azure Data Lake Analytics / U-SQL to aggregate Hadoop data.

Below are some questions given in the Indoor Positioning Study (POE) documentation to describe the questions that SmithGroup JJR would like to answer with this data source. These are broad topics, each with more specific questions.

• How do people utilize space?
  o What is the average dwell time by space?
  o How does the number of people within a space vary over time?
  o What are the most frequently used paths between spaces?
• How do people interact and collaborate?
  o How much time do people spend in spaces occupied by other people?
  o What is the average number of people in a collaborative space?
  o How does job / organizational role impact collaboration?
• Person movement
  o How often do people move between spaces?
  o What is the average duration of rest (motion)?
Additional questions:

• Exact location of a user within a space
• Actual paths traveled between spaces
• Relationship between workspace and study-subject (employee / organizational) measures, such as happiness or productivity (what is an abstract term that captures these types of things?)
• Comparison of varied workspace configurations / designs / arrangements, such as office / open / free assignment
• Integration with other technologies and data sources, such as space scheduling software, communication software, galvanic skin response, implanted telemetry chips, health and dental records, etc.

After talking with Peter, he estimated the size of the Indoor Positioning Study (POE) data at SmithGroup JJR at one terabyte at most. Given that this is much less than the five-terabyte minimum for Hadoop clusters, it is not suggested to implement a Hadoop cluster for this use case.

SQL Server / Database Discovery

In order to complete a data discovery, we were provided 3 databases:

• DataVault
• DataMart
• ETL

We performed the data discovery using 2 different methods. The first was to create a web-based document of each database using Redgate's SQL Doc. The second was to use SQL Server DMVs and TSQL to create an Excel-based data dictionary. The files are included in the SharePoint folder along with this document. Also note that Shabhana provided the WBS Migration Changes to Datawarehouse Systems.pdf, where much of this type of information can be found as well. Finally, we collected information about the various data sources (both internal and external).
The list of data sources is as follows:

Internal Enterprise Data Sources
• Vision | Enterprise Resource Planning software
• UltiPro | Human Resources
• SharePoint | Document management and collaboration
• Active Directory (AD) | Network / domain information
• NewForma | Project metadata and RFIs

Project / Application Specific
• Revit Data Collector | Building Information Modeling (model statistics)
• CER
• WorkSim | Space planning
• Indoor Positioning Study (POE) | Azure SQL for people movement in workspaces
• Campus Project Data (Marquette) | Campus planning and space

External
• IPEDS | Public university data
• National Science Foundation | Public data for funded projects
• Bureau of Labor | Government labor statistics
• GIS | Topographical data, land surveys

Data Warehouse and Data Marts

In order to analyze SmithGroup JJR's data warehouse / data mart environment, we were provided 3 databases:

• DataVault
• DataMart
• ETL

Overview

The data warehouse (DataVault) and the data mart (DataMart) are the 2 databases that make up the SmithGroup JJR BI environment. The DataVault is a 3NF database. The DataMart is de-normalized and currently contains employee and project data. At this time there is no star schema; however, there are plans to build one out in the future. The DataVault stores source data under schemas named for the corresponding source systems, such as UltiPro and Vision. To complete the analysis below, a server and database discovery was completed as well.

Analysis

The areas of analysis for these 2 databases include the following topics:

• Security
• Scalability
• Partitioning
• Exception Handling
• Alerting
• Transaction Processing
• Indexing
• SQL Views (Business Views)
• Star Schema
• Surrogate Keys
• Conformed Dimensions
• Delta Loads (Merge, SCD 1 and SCD 2, Checksums)

Security should always be the first concern in planning and deploying any data warehouse / data mart environment. In reviewing the defined roles, we found the following server roles: bulkadmin, dbcreator, diskadmin, processadmin, public, securityadmin, serveradmin, setupadmin, and sysadmin. There were no user-defined SQL Server roles. The sa account was enabled but not in use. There was no implementation of Row-Level Security or Role-Based Security. SQL schemas such as ultipro, vision, ad, and admin were used to scale and organize the various SQL Server objects.
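Should Row-Level Security be adopted in a later phase, the SQL Server 2016+ pattern looks roughly like the sketch below. This is an assumption-laden illustration: the rls schema, the predicate function, the OfficeCode column, and the ProjectFinancials table are all hypothetical, and the session-context value would need to be set by the application at connection time.

```sql
-- Hedged sketch of SQL Server Row-Level Security; all object and column
-- names are hypothetical.
CREATE SCHEMA rls AUTHORIZATION dbo;
GO

-- Predicate function: a row is visible only when its OfficeCode matches the
-- caller's session context (set by the app), or the caller is db_owner.
CREATE FUNCTION rls.fn_ProjectFilter (@OfficeCode varchar(10))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    WHERE  @OfficeCode = CAST(SESSION_CONTEXT(N'OfficeCode') AS varchar(10))
       OR  IS_MEMBER('db_owner') = 1;
GO

-- Bind the predicate to a (hypothetical) reporting table.
CREATE SECURITY POLICY rls.ProjectPolicy
    ADD FILTER PREDICATE rls.fn_ProjectFilter(OfficeCode)
    ON dbo.ProjectFinancials
WITH (STATE = ON);
```

Because filtering happens inside the engine, the same policy applies whether the data is reached through SSRS, Power BI via a gateway, or ad-hoc queries.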
Scalability is a very high priority for companies that want solutions to last 5 or more years after the initial deployment. Many of the server and database DMVs listed above help us determine scalability. For instance, using instances, partitioning, files and file groups, and synonyms can help make a system more scalable. Instances allow better resource management between different processes on the same server. They also allow us to separate load layers such as stage, consolidation, transformation, 3NF, star, and analytics. Since we only had access to development servers, we did not see any examples of instances, but we highly recommend them in production. We also looked at partitioning, which is discussed below. As for files and file groups, we have provided an Excel spreadsheet identifying the files and file groups and their current sizes. We also provided size information for all of the tables in the 3 databases we were asked to analyze. File and table sizes are important indicators for scalability and for where to set table auto growth. The data and log files were on the same volume and had the following sizes:

FileName       FileSizeMB  SpaceUsedMB  AvailableSpaceMB  %FreeSpace
DataVault      3004        1636.69      1367.31           45.52
DataVault_log  36828.31    1256.45      35571.87          96.59

For development we were okay with these settings; however, the auto growth was not what we recommend in production. Finally, synonyms are an easy way to manage server-to-server (physical or instance) connections without taking the risk of using linked servers. We did notice 3 linked servers (FINANCIALDATA, SGJJR-SQL2ASCCM2012, and VISIONDEVDB). These linked servers were not part of the scope provided by SmithGroup JJR; however, we would warn against relying too heavily on linked servers.

Partitioning is a great way to manage reporting performance in a data warehouse / data mart environment. Currently, SmithGroup JJR is not using any partitioning strategy. Since we only had access to development data, it is hard to tell whether the sizes in our analysis represent real production sizes; however, with the database sizes we encountered in the scope of this analysis, we do not recommend partitioning at this time. The DataVault contained a total of 2,548,891 rows; the table with the most rows was [vision].[ProjectFinancialsByPeriod] at 730,655 rows. As shown in the section above, the data size for DataVault is 3004 MB.
At this time, partitioning is not recommended.

Exception Handling in SQL Server (TSQL) is accomplished by using Try Catch clauses. We did our due diligence in verifying that there is no exception handling at the SQL Server level, and we confirmed this with the different teams at SmithGroup JJR. Exception handling is recommended for future phases of development.

Alerting in SQL Server is a combination of Try Catch clauses, database mail, and SQL Server Agent. Database mail and SQL Server Agent together allow us to define operators and alerts. Alerts can then be defined on performance conditions, or on SQL Server events based on an error number from a try catch clause. Alerting is recommended for future phases of development.

Transaction Processing is the process of ensuring that data is written to disk before we commit a transaction and move on to the next step in the process. Transaction Processing also provides a mechanism to roll back any data that has been written if the transaction fails before a commit can take place. Transaction Processing is a critical part of any design. Currently, the environment here does not use Transaction Processing. Transaction Processing is recommended for future phases of development.

Indexing has a huge impact on server and query performance. DMV queries that identify unused indexes and indexes that need to be reorganized or rebuilt should be run on a regular basis. Index discovery and creating an enterprise index strategy are recommended.

SQL Views (Business Views) can be used to denormalize and simplify data structures in the 3NF for reporting purposes. Currently, both the DataVault and DataMart use SQL Views; however, the ETL database does not. There are 32 SQL Views grouped into 5 different schemas (admin, api, dbo, lookup, and vision) in the DataVault database. The DataMart
database has 10 SQL Views, all in the dbo schema. A star schema is suggested to reduce the complexity of creating and managing SQL Views for the business. A Star Schema is not used and is not being developed; it is highly recommended for future phases.

Surrogate Keys are used to provide referential integrity in a Data Warehouse / Data Mart that sources data from numerous data sources that each define different keys for the same entity / attribute, such as Person / Social Security Number. Surrogate keys are employed in the SmithGroup JJR DataVault and DataMart; however, GUIDs have been used. This design poses no issues for the data warehouse; however, the reporting star schema should use integers for load and processing performance.

Conformed Dimensions at the database level entail ensuring that the Data Warehouse in a star schema has only 1 dimension for a specific entity such as employee, or region. Any data mart use of the entity employee, or region, needs to be sourced from the Data Warehouse and not reloaded with different logic and processes. Since there is not a star schema for the DataVault, or DataMart, we do not have any dimensions to conform. We suggest a robust star schema for both the data warehouse and data marts.

Delta Loads are both a performance issue and a management issue. Loading only the data that has changed since the last load can be implemented and managed in many ways. We can use TSQL merge, checksums, and last load date tables to determine whether a row has changed since the last time the table was loaded. SmithGroup JJR uses TSQL merge, but not checksums or a last load date table. At the size of the data today, checksums and storing a last load date are not necessary, but they are recommended for performance and scalability. These same processes can also be used to implement slowly changing dimensions once a star schema is developed.
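As a hedged sketch of the checksum and last-load-date approach, the TSQL merge below compares a stored row hash instead of every column. The stage and warehouse table names and columns are hypothetical, not taken from the existing databases.

```sql
-- Hypothetical sketch: checksum-based delta detection inside a MERGE.
-- HASHBYTES over the concatenated business columns flags changed rows
-- without listing every column in the match predicate.
MERGE dw.Client AS tgt
USING (
    SELECT ClientKey,
           ClientName,
           Region,
           HASHBYTES('SHA2_256',
               CONCAT(ClientName, '|', Region)) AS RowHash
    FROM   stage.Client
) AS src
    ON tgt.ClientKey = src.ClientKey
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
    UPDATE SET ClientName   = src.ClientName,
               Region       = src.Region,
               RowHash      = src.RowHash,
               LastLoadDate = SYSDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ClientKey, ClientName, Region, RowHash, LastLoadDate)
    VALUES (src.ClientKey, src.ClientName, src.Region,
            src.RowHash, SYSDATETIME());
```

The same RowHash and LastLoadDate columns can later drive Type 1 and Type 2 slowly changing dimension logic once a star schema exists.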
There was an initial plan at the SmithGroup JJR to use logging, error logging, and number tables to manage load meta data, but it was not implemented.
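The Try Catch, transaction, and alerting recommendations above can be combined into one load pattern. The procedure below is a hypothetical sketch; its names are illustrative, not from the existing databases.

```sql
-- Hypothetical load procedure sketch: TRY...CATCH plus an explicit
-- transaction so a failed load rolls back cleanly.
CREATE PROCEDURE etl.LoadEmployee
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Load step(s) go here, e.g. a MERGE from stage into the warehouse:
        -- MERGE dw.Employee AS tgt USING stage.Employee AS src ...

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;

        -- Re-raise the original error so a SQL Server Agent alert and
        -- database mail operator can be notified of the failure.
        THROW;
    END CATCH;
END;
```

With this pattern, transaction processing lives at the SQL Server level, which also fits the SSIS recommendation later in this document.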
Extract, Transform, and Load

In order to analyze SmithGroup JJR’s SSIS environment, we were provided 4 SSIS Solutions for the following areas:
• Active Directory | 12 Packages
• Deltek Vision | 63 Packages
• Ultipro | 19 Packages
• NewForma | 21 Packages with 11 Disabled

Overview

The Active Directory load process uses the KingswaySoft Directory Services Integration toolkit to provide access to Active Directory data. Using this tool, the solution extracts Active Directory data for the following areas: computer, group, group member, and user. The master ActiveDirectory.dtsx package calls 3 sub-packages named Extract, Transform, and Load. As the names of these packages indicate, the Extract package takes data from the source system and temporarily stores it in the ETL database. Unlike the other, more complex ETL solutions, this solution does not have any tasks or data flow transformations in the Transform package; Transform.dtsx could possibly be disabled. The Load package calls stored procedures located in the ETL database, which then load the data warehouse.

The Active Directory design pattern uses the KingswaySoft Directory Services Integration toolkit to extract the data from the source system and place the extracted data into stage tables located in the ETL database on the Data Warehouse server. Once the data is staged in the ETL database on the Data Warehouse, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the merge statement.

The Deltek Vision solution uses a similar process, calling separate sub-packages for the Extract, Transform, and Load phases of the data load. However, this solution also has a PreProcessing and a PostProcessing package. The PreProcessing package truncates the TPH tables. The PostProcessing package is empty and could possibly be disabled.
The Extract package takes data from the source system and temporarily stores this data in the ETL database for tables like client, vendor, and employee. The Transform package transforms data for vendor / client, contact / employee, project / opportunity, and project dependents. The Load package calls stored procedures located in the ETL database, which then load the data warehouse for these same areas, such as client, vendor, and employee.

The Vision design pattern includes an ETL database that is stored on the transactional server. This ETL database stores functions that are used to extract the data from the source system and place the extracted data into stage tables located in the ETL database on the Data Warehouse server. Once the data is staged in the ETL database on the Data Warehouse, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the merge statement.

The Ultipro solution uses a different process, organizing the Extract, Transform, and Load phases of the data load into separate containers. The Extract package stores data in the ETL database. The Transform package transforms data for organization, employment, and employmentHistory. The Load package loads tables using the TSQL Merge statement from the ETL database to the data warehouse. Please review Shobhana’s WBS Migration Changes to Datawarehouse Systems.pdf for an ETL Dataflow Diagram and other useful package information.
Since Ultipro is a backup and restore process, the Ultipro design pattern does not include ETL database functions stored on the transactional server. Instead, these extract functions are stored in the ETL database on the Data Warehouse server. This ETL database stores functions that are used to extract the data from the source system and place the extracted data into stage tables located in the ETL database on the Data Warehouse server. Once the data is staged in the ETL database on the Data Warehouse, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the merge statement.

The NewForma (oblivion) load process is a non-standard load process that needs to be updated to the new design pattern described with the Vision load process above. There is a monthly load that calls a weekly load that calls an hourly load that is not currently being used. The weekly load package also has an archival process. Besides the monthly load, there is a daily load that calls the hourly load. Both the weekly and daily load packages call the same hourly package.

The current state of the NewForma load process executes two packages in parallel. The first package is Execute etlOrgChart and the second package is Execute etlNewformaProjects. Execute etlNewformaProjects has two child packages named Execute etlProjectRFIs and Execute etlProjectMilestones. The packages etlOrgChart and
etlProjectMilestones both use the KingswaySoft SharePoint Integration toolkit to extract and load data to and from SharePoint lists. This process needs to be updated and implemented in production using the standard process.

Analysis

The areas of analysis for these 4 solutions include the following topics:
• Load Meta Data
• Package Sequencing (Master and Child Packages, SQL Jobs, Conformed Dimensions, Dimensions, Facts, Data Marts)
• Environments and Environment Variables
• Connection Managers (Package and Project)
• Parameters (Package and Project)
• Alerting
• Logging
• Exception Handling
• Validation
• Checkpoints
• Transaction Processing (requires MSDTC)
• Naming Conventions

Load Meta Data is important since it can help us track load start and end times by package, table, and even cube processing. It can also track load row counts for inserts and updates, provide restart ability that is more robust than SSIS checkpoints, and provide rollback information during a failure. Currently, there is no tracking of load meta data, and it is highly recommended.

Package Sequencing controls the order in which the packages load the tables. In terms of packages, we have a master package and then child packages. The master package may call child packages such as a conformed dimension package, a dimension package, and a fact package. Child packages can also call data mart packages that duplicate data warehouse dimensions and facts to be used as data marts. Finally, SQL Jobs can be used to schedule different load patterns and times such as daily, every hour, and even weekly, or monthly. Since package sequencing is already working and not causing issues at this time, this is not a high priority for redesign.
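A minimal sketch of a load meta data table follows; the etl schema and all column names are hypothetical, intended only to show the kind of information worth capturing.

```sql
-- Hypothetical load meta data sketch: one row per package/table load,
-- capturing timings, row counts, and status for restart ability and
-- rollback analysis.
CREATE TABLE etl.LoadLog (
    LoadLogID     bigint IDENTITY(1,1) PRIMARY KEY,
    BatchID       uniqueidentifier NOT NULL,  -- one ID per end-to-end run
    PackageName   sysname          NOT NULL,
    TableName     sysname          NOT NULL,
    LoadStart     datetime2        NOT NULL DEFAULT SYSDATETIME(),
    LoadEnd       datetime2        NULL,
    RowsInserted  int              NULL,
    RowsUpdated   int              NULL,
    LoadStatus    varchar(20)      NOT NULL DEFAULT 'Running'
                  -- Running / Succeeded / Failed
);
```

Each package would insert a row at start and update it at completion; a restart can then skip any table whose latest row for the current BatchID shows Succeeded.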
Environments and Environment Variables provide a mechanism for changing project data connections and variables during a change control migration from one environment to another, such as development to test, or test to production. Currently, environments and environment variables are being used with success.

Connection Managers can be either project or package level. In most cases, connections that need to change from environment to environment, or that will be used in many packages, should be project connections. Any connections that are required by only 1 package and will not change between environments can be package connections.

Parameters can be either project or package level. In most cases, parameters that need to change from environment to environment, or that will be used in many packages, should be project parameters. Any parameters that are required by only 1 package and will not change between environments can be package parameters.

Alerting in SSIS is provided by using an SMTP connection. This connection can then be used in a task flow, a data flow, or even an event handler such as OnError. Alerting is not enabled in the solutions evaluated. Alerting is highly recommended.
Logging can be heavily customized by using a custom logging schema and then tying that custom logging to the logging built into SQL Server 2012 and newer. Since robust logging, including verbose logging for troubleshooting, is provided in newer versions of SQL Server, it is not recommended to make any changes to logging. An example custom logging diagram that can bridge to the data logged by newer versions of SQL Server has been provided for your reference.

Exception Handling in SSIS can be addressed by multiple methods. One method is exception data flows. These data flows can load exception data into flat files, such as text files, or into a table that stores exception data in an XML format. Another example of exception handling is rolling back data when an error occurs. This can be accomplished at the end of an error flow, or in an event handler such as OnError. Finally, exception handling can be combined with Alerting to let the appropriate technical and business users know of an issue, or delay. Since there is no exception handling in the currently evaluated packages, it is highly recommended.

Validation has many solutions. The most common include row- and column-based validation combined with what are called sanity checks. In a fact table, column validation can sum the column and compare that to an aggregate value in the consolidation layer of the load process. For dimension validation, we can verify that the surrogate key for a specific user ties back to multiple source systems through the stored business keys. Sanity checks tend to focus on a known business rule and verify that a calculated business rule matches in multiple systems such as a source system, data warehouse, data mart, the cloud, and reporting tools. This validation can use load meta data as well as SQL Tasks to gather and validate complex scenarios (data sources) when necessary.
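For example, a column-sum validation could be sketched as follows; the table and column names are hypothetical.

```sql
-- Hypothetical sanity check sketch: compare a fact-table total against
-- the same measure aggregated in the consolidation (stage) layer.
DECLARE @StageTotal decimal(19,4) =
    (SELECT SUM(InvoiceAmount) FROM stage.ProjectFinancials);
DECLARE @FactTotal decimal(19,4) =
    (SELECT SUM(InvoiceAmount) FROM dw.FactProjectFinancials);

IF ISNULL(@StageTotal, 0) <> ISNULL(@FactTotal, 0)
    THROW 50001,
          'Validation failed: stage and fact InvoiceAmount totals do not match.',
          1;
```

Run as a post-load SQL Task, a failed check raises an error that the alerting and exception handling mechanisms described above can surface.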
Since little to no validation is currently employed, it is highly recommended.

Checkpoints are SSIS’s built-in method for providing package restart ability. They are configured by providing a checkpoint location in the package-level properties. The property name is CheckPointFileName. Two other properties need to be configured, named CheckPointUsage and SaveCheckPoints. These properties are not defined in the solutions we evaluated. It is recommended that some restart ability be designed and implemented.

Transaction Processing is a feature in SSIS; however, it requires that the Microsoft Distributed Transaction Coordinator (MSDTC) be enabled. This coordinator comes with overhead and is not always well received by the DBA team. Since the SmithGroup JJR already uses SQL Tasks to call stored procedures that utilize the TSQL merge statement, we recommend not using transaction processing in SSIS, but rather at the SQL Server level.

Naming Conventions in SSIS may seem elementary, but good naming conventions can help with readability and maintenance, especially when introducing new developers to the ETL environment. A sample SSIS naming convention document has been provided.
Appendix A | Microsoft Data Warehouse On-Premises Architecture

Below is a diagram that illustrates an ideal business intelligence (BI) architecture. This example is intended to show the many pieces available in the Microsoft BI stack.

This diagram includes source systems (structured data) that flow into a SQL Server data repository. Data repositories offload reporting workloads from production transactional servers. The ETL process stages only the data required for that load. There are three (3) paths for the data once it is staged. The first path is a quickly cleaned path intended for daily business reporting needs, called an Operational Data Store. Moving down the diagram, the second path is the fact pipeline. The fact pipeline begins to denormalize and prepare the fact data for transformation. The third path is the dimension pipeline. The dimension pipeline goes through two other tools offered in the Microsoft BI stack. The first tool is Data Quality Services, which is used to cleanse the data. The second tool is Master Data Management. Master Data Management provides what the industry calls “Golden Record Management.” Golden record management gives you access to the most pure, validated, and complete picture of the individual records in your domain. Products like Profisee (https://profisee.com/grm) offer functionality beyond the built-in tools offered out of the box with SQL Server. This extra “Golden Record” functionality includes matching, de-duplication, mastering, and record harmonization. Profisee also offers graphical user interfaces, scorecards, and reports.
[Figure 7 | Microsoft Data Warehouse On-Premises Architecture — diagram showing structured data sources (sales, customer service (CRM), accounting, human resources, supply chain, and flat file data) flowing through a staging layer into three paths: an Operational Data Store, a fact pipeline, and a dimension pipeline (cleansed via Data Quality Services (DQS) and mastered via Master Data Management (MDM) with data stewards). Data then lands in a 3NF enterprise data warehouse (eDW) with sales, CRM, and marketing schemas, star-schema data marts, and an eDW cube farm, and is consumed through SSAS, SSRS, Excel, Power Query, Power BI, and a SharePoint portal. Side panels list business tasks, data steward tasks, and BI team tasks.]
Appendix B | Design Questions to Review

How is data from multiple sources consolidated? For example, when we model the DataVault, we see 3 person tables: a vision.Person, an ultipro.Person, and a dbo.Person. According to the DBAs, dbo.Person is a consolidated version of Person. This raises another question: what logic is used to consolidate the 2 different versions of Person? Is there a reference, or lookup, table?

What reasoning and domain knowledge led to using GUIDs, and not INTs, for Surrogate Keys?

The nomenclature for the DataVault, DataMart, and DataLake databases is confusing. You may want to rethink these names so they do not conflict with the more general, industry-accepted meanings of those terms.