Time and again, we hear about the failure of data warehouses – while things may be improving, they're moving only slowly. One explanation for data quality being overlooked is that the I.T. department is often responsible for delivering and operating the DW/BI environment. What ensues is an agenda based on "how do we build it?", not "why are we doing this?". This needs to change. In this discussion paper, I explore the issues of data quality in data warehouse, business intelligence and analytics environments, and propose an approach based on "Data Quality by Design".
Data Quality in Data Warehouse and Business Intelligence Environments - Discussion Paper
Data Quality By Design
Managing Data Quality for Data Warehousing and Business Intelligence environments

Introduction

It's probably reasonable to claim that we have finally reached a stage of pervasiveness with Data Warehouses and Business Intelligence (BI). Data warehouses are now generally accepted as part of mainstream business systems, to the point where they are almost ubiquitous. Certainly, most major businesses would have some form of BI solution these days, and business users now expect their Business Intelligence analyses to be available any time, anywhere, on any device.

Additionally, "Big Data" is getting information and analytics onto the business agenda like never before (although according to Gartner's recent Hype Cycle analysis,¹ key components of "Big Data" solutions such as cloud-based grid computing, in-memory databases and MapReduce are only moving through the "trough of disillusionment" phase). All in all, we really are now living the dream of the digital economy in the Information Age.

And yet, time and again, we hear about the failure of data warehouses – while things may be improving, they're moving only slowly.

"Data is the life blood of the business" is a phrase that gets bandied about. Yet all too often, the quality of data is not thought about – at least not up front, and not until it's too late. If data really was the "life blood of the business", you'd expect data quality to be fundamental!

One explanation for data quality being overlooked is that the I.T. department is often responsible for delivering and operating the DW/BI environment. What ensues is an agenda based on "how do we build it?", not "why are we doing this?". This needs to change.

Fast Facts

• More than 50% of DW/BI projects are failing to meet expectations, with more than 50% of data projects having limited acceptance or being outright failures as a result of lack of attention to data quality issues.²
• Common data errors plague 91% of organisations.³
• 46% of businesses cite data quality as a barrier to BI adoption.⁴

Sources:
1. "Gartner Hype Cycle", Gartner, July 2013.
2. Md. Ruhul Amin & Md. Taslim Arefin, "The Empirical Study on the Factors Affecting Data Warehousing Success", International Journal of Latest Trends in Computing, Vol. 1, Issue 2, p. 139, December 2010.
3. "The State of Data Quality", Experian, 2013.
4. "2012 BI and Information Management Trends", Information Week, November 2011.

Data Quality By Design

Data Quality By Design is an approach that aims to make the quality of data a foundational part of business systems design and implementation, both for business processes and business applications. Data warehouse (DW), Business Intelligence (BI) and Analytics initiatives are an excellent entry point to introduce the concept of data quality by design, simply because the implementation of these solutions is entirely focussed on delivering information to business users for decision-making purposes. (For the purposes of this paper, "DW/BI" will hereafter be used to refer to any and all such solution initiatives.)

There are many different approaches and methodologies for DW/BI implementation. However, all DW/BI methods will typically feature four major stages of delivery:

• Data Discovery (source data analysis)
• Data Modelling (business functional models, logical models, physical data structures)
• Data Movement (ETL/ELT/ESB)
• Develop Key Outputs (Business Intelligence, standard reports & ad hoc analytics)

Additionally, most modern DW/BI methodologies will follow an iterative, incremental (Agile) delivery approach; sequential ("waterfall") methods are typically not recommended.

Data Quality techniques can help support each stage in the DW/BI delivery process. The remainder of this paper will explore the relationship between Data Quality techniques and these key stages of DW/BI solution delivery.
Data Quality during Data Discovery

Data Discovery process steps:
• Agreeing the scope of suitable source data sets
• Source data analysis
• Logical source-to-target mapping

How Data Quality techniques can help:
• Data Profiling: identify previously unknown issues with data as part of discovery
• Data Inspection: increase Data Stewards' understanding of the data
  - Get more intimate with the data
  - Discover additional context and narrative
  - Articulate new business rules ("why is it that…?")
• Corrective Action Planning: feedback to remediate data issues before solution development and testing.

Data Quality Profiling

Data quality profiling is an excellent diagnostic method for gaining additional understanding of the data. Profiling the source data helps inform both business requirements definition and detailed solution designs for data-related projects, as well as enabling data issues to be managed ahead of project implementation.

Profiling may be required at several levels:
• Simple profiling within a single table (e.g. Primary Key constraint violations)
• Medium-complexity profiling across two or more interdependent tables (e.g. Foreign Key violations)
• Complex profiling across two or more data sets, with applied business logic (e.g. reconciliation checks)
• Field-by-field analysis, as required to truly understand the data gaps.

Any data profiling analysis must not only identify the issues and underlying root causes, but must also identify the business impact of the data quality problem (effectiveness, efficiency, risk inhibitors). This will help identify the value of remediating the data. Root cause analysis also helps identify any process outliers and drives out requirements for remedial action on managing any identified exceptions.

Be sure to profile your data and take baseline measures before applying any remedial actions – this will enable you to measure the impact of any changes.

"The data is always right" – a data quality error indicates a failure in the process, the system or the people. Use the data to inform and drive process change.
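As a concrete illustration of these profiling levels, the following is a minimal sketch in Python (using pandas), assuming two hypothetical source tables, customers and orders, with illustrative column names; it is a sketch under those assumptions, not a prescribed implementation.

# Minimal data-profiling sketch using pandas (hypothetical "customers"/"orders" tables).
import pandas as pd

customers = pd.DataFrame({
    "customer_ref": ["C001", "C002", "C002", "C004"],   # duplicate key to illustrate
    "name": ["Smith", "Jones", "Brown", None],
})
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_ref": ["C001", "C002", "C099", "C004"],   # C099 has no matching customer
    "quantity": [10, None, 5, 2],
})

# Level 1 - simple profiling within a single table: key uniqueness and null counts.
duplicate_keys = customers[customers.duplicated("customer_ref", keep=False)]
null_counts = orders.isna().sum()

# Level 2 - profiling across interdependent tables: "foreign key" orphans.
orphans = orders[~orders["customer_ref"].isin(customers["customer_ref"])]

# Level 3 - applied business logic: records violating an illustrative business rule.
rule_violations = orders[orders["quantity"].isna() | (orders["quantity"] <= 0)]

print("Duplicate customer keys:\n", duplicate_keys)
print("Null counts per column:\n", null_counts)
print("Orders with no matching customer:\n", orphans)
print("Orders breaking the 'quantity must be positive' rule:\n", rule_violations)

Each of these outputs is a candidate baseline measure: re-running the same checks after remediation shows the impact of any changes.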
RECOMMENDED ACTION POINTS:
• Data Quality Profiling and root-cause analysis to be undertaken as an initiation activity as part of all data warehouse, master data and application migration project phases.
DQ as part of Data Modelling

Data modelling steps:
• Define business functions & requirements
• Identify Logical Data Model subject areas
• Derive target physical data models

How DQ techniques can help:
• Data gap analysis: identify requirements for new data that would be useful to a business function but is not currently captured.
• Identify unnecessary data: screen for data sets that are currently captured but actually have no utility to the business.
• Develop Data Quality Rules: articulate explicit rules for how the data should be represented, and build screening capability (the "data quality firewall") to ensure that business data is fit for purpose (see the sketch after this list).
• Manage fragmentation and duplication of data: identify more explicitly where data needs to be integrated and replicated, and apply more rigorous controls to ensure that the "golden record" for a given data entity is maintained in the designated system of record only.
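To make the Data Quality Rules idea more tangible, here is a minimal Python/pandas sketch of rules expressed as named, testable predicates that a screening step could apply. The rule names, column names and structure are illustrative assumptions, not taken from the paper.

# Minimal sketch: Data Quality Rules as named, testable predicates.
import pandas as pd

def rule_customer_ref_populated(df: pd.DataFrame) -> pd.Series:
    """Every record must carry a Customer Reference Number."""
    return df["customer_ref"].notna()

def rule_quantity_positive(df: pd.DataFrame) -> pd.Series:
    """Sales Quantity must be present and greater than zero."""
    return df["quantity"].notna() & (df["quantity"] > 0)

DATA_QUALITY_RULES = {
    "customer_ref_populated": rule_customer_ref_populated,
    "quantity_positive": rule_quantity_positive,
}

def screen(df: pd.DataFrame) -> pd.DataFrame:
    """Apply every rule and report its failure rate - the raw material for the
    'data quality firewall' discussed later in this paper."""
    failure_rates = {name: (~rule(df)).mean() for name, rule in DATA_QUALITY_RULES.items()}
    return pd.DataFrame({"failure_rate": failure_rates})

if __name__ == "__main__":
    sample = pd.DataFrame({"customer_ref": ["C001", None], "quantity": [5, 0]})
    print(screen(sample))

Keeping each rule as a small named predicate makes the rule library easy to grow during data modelling and to reuse later for pre-load screening.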
The Enterprise Information Model is a crucial tool for driving consistency, integrity and utility of information across the business. A number of discrete steps are identified to derive the overall Enterprise Information Model for the enterprise:

• Establish a core top-down Business Functional Model of the business (a.k.a. Conceptual Business Model). This describes the idealised view of what should be happening in the enterprise from a functional perspective. It establishes the business context(s) that operate upon the data and within which data is then managed. Structured interviews with the business leadership team are a good entry point for capturing the core structure and expectations of the idealised business functional model. (N.B. this is not the process model, nor a model of the organisational structure. Departments are not functions!)
• Derive the core Business Glossary of common informational terms (a high-level, consistent business lexicon that supports shared interpretation and clear semantic meaning of the data). The Business Glossary will be underpinned by a much more detailed and expansive set of Technical Metadata which captures the explicit and context-specific metadata definitions, derivation, business rules, calculation logic and lineage for each atomic information term in use within the enterprise.
• Derive the Logical Data Model for the enterprise, which represents the core entities, attributes and relationships for informational items required to support the identified business functions.
• Model the Reporting Catalogue of key business questions (operational queries, historic reports, analytical and predictive models, data mining models) that the business should be asking in order to monitor and drive its performance. This will almost certainly include questions that are currently not being asked by the business community, and may include questions that currently cannot be answered using existing data. Note too that many of these questions will be cross-functional in nature. Creative thinking is required, including learning from what other industries are doing.
• Derive the CRUD Matrix (Create, Read, Update, Delete) for business data by mapping the Functional Model to the entities in the Logical Data Model. Every business function must act upon at least one entity, and every entity must be acted upon by at least one function (see the sketch after this list).
• Map to the Information Asset Register, which documents the catalogue of current data holdings: what data currently exists within the enterprise, where it exists, for what purpose(s) it is currently used, and by whom.
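As an illustration of the CRUD Matrix completeness rule above, the following minimal Python sketch checks a toy matrix for functions that touch no entity and entities that no function acts upon. The function and entity names are invented for the example only.

# Minimal sketch: checking CRUD Matrix completeness.
# Rule: every business function must act on at least one entity, and every
# entity must be acted upon by at least one function.

crud_matrix = {
    ("Take Customer Order", "Customer"): "R",
    ("Take Customer Order", "Order"):    "C",
    ("Dispatch Goods", "Order"):         "RU",
    ("Dispatch Goods", "Dispatch Note"): "C",
}

functions = {"Take Customer Order", "Dispatch Goods", "Invoice Customer"}
entities = {"Customer", "Order", "Dispatch Note", "Invoice"}

mapped_functions = {f for (f, _entity) in crud_matrix}
mapped_entities = {e for (_function, e) in crud_matrix}

idle_functions = functions - mapped_functions    # functions acting on no entity
orphan_entities = entities - mapped_entities     # entities no function acts upon

print("Functions with no entity mapping:", idle_functions)            # {'Invoice Customer'}
print("Entities with no function acting on them:", orphan_entities)   # {'Invoice'}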
RECOMMENDED ACTION POINTS:
• Continue to develop the Enterprise Information Model elements: Business Functional Model, Business Glossary, Technical Metadata, Logical Data Model, Reporting Catalogue, CRUD Matrix, Information Asset Register.
• Adopt, apply and enhance these models and tools within data-related projects as an explicit part of the Systems Development Lifecycle (SDLC).
DQ as a part of Data Movement (ETL/ELT, data migration, integration etc.)
Data Validation and Dimensions of Data Quality
In any data quality validation process, target tolerances and business rules need to be established.

A pragmatic approach to data quality measurement takes into account fitness for purpose(s). How this is measured will be up to the individual business; however, a suggested schema for profiling data is based on the following "ACE" dimensions:

• Availability
  - Currency: the reference period of the data; is the data still up to date and relevant, or is it "stale"? (e.g. the only available Customer List was last updated in 2007.)
  - Timeliness: is the data made available when it is needed? (e.g. I currently don't receive the Sales Order figures until three days after Month-End.)
• Completeness
  - Individual Record Completeness: are all required fields provided within each data record, or are there inappropriate NULL values? (e.g. in the sales transaction Ref# 340254 for Mr. Smith on 16/03/14, there is no Sales Quantity recorded.)
  - Data Set Completeness: are all expected records present, or are there gaps in the history? (e.g. in my list of Sales transactions for 2014, there are no rows of data for July.)
• Error Free
  - Integrity/Coherence: do different data sets join up as intended? (e.g. do the "Orders" and "Dispatches" data sets both include the Customer Reference Number, and are the CRNs the same?)
  - Uniqueness: are discrete values expected and preserved? (e.g. do all customers have a unique Customer Reference Number, or do we have two customers with the same CRN?)
  - Validity: is the data within an acceptable range of known parameters? (e.g. I have a dispatch note dated 31st November 2013.)
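By way of illustration, here is a minimal Python/pandas sketch of how a data set might be scored against a few of these dimensions, echoing the paper's own examples (stale data, missing months, impossible dates). The column names, measures and tolerances are assumptions for illustration; each organisation would set its own.

# Minimal sketch: measuring a data set against a few "ACE" dimensions.
import pandas as pd

sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2014-03-16", "2014-06-02", "2014-08-09"]),
    "quantity": [3, None, 7],
})
as_of = pd.Timestamp("2014-09-30")

measures = {
    # Availability / Currency: how stale is the newest record, in days?
    "currency_days": (as_of - sales["order_date"].max()).days,
    # Completeness / Individual Record: share of records missing a Sales Quantity.
    "record_incompleteness": sales["quantity"].isna().mean(),
    # Completeness / Data Set: months of 2014 to date with no rows at all.
    "missing_months": 9 - sales["order_date"].dt.to_period("M").nunique(),
    # Error Free / Validity: records outside the expected reporting year.
    "invalid_dates": int((sales["order_date"].dt.year != 2014).sum()),
}

tolerances = {"currency_days": 31, "record_incompleteness": 0.05,
              "missing_months": 0, "invalid_dates": 0}

for name, value in measures.items():
    status = "GREEN" if value <= tolerances[name] else "RED"
    print(f"{name}: {value} -> {status}")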
Data Movement process steps:
• Detailed source-to-target mapping
• Audit and integrity checks
• Integration code & test
• Reconciliation checks (see the sketch below)

How DQ techniques can help:
• Data validation: pre-load checks as a precursor to incorporating data into the data warehouse
• Data Quality Firewall profile and alert: feedback loop to source (alert, trouble ticket generation).
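A reconciliation check can be as simple as comparing record counts and control totals between the source extract and the loaded target. A minimal sketch, assuming hypothetical order tables and column names:

# Minimal sketch: post-load reconciliation between a source extract and the
# warehouse target - row counts and a control total must match.
import pandas as pd

source_orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [100.0, 250.0, 75.0]})
target_orders = pd.DataFrame({"order_id": [1, 2],    "amount": [100.0, 250.0]})

checks = {
    "row_count_match": len(source_orders) == len(target_orders),
    "control_total_match": source_orders["amount"].sum() == target_orders["amount"].sum(),
    "missing_keys": sorted(set(source_orders["order_id"]) - set(target_orders["order_id"])),
}
print(checks)  # here: counts and totals disagree, and order_id 3 was not loaded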
Data Quality Firewall

Perceptions of data quality are affected by our past experiences. If we can proactively influence the quality of data before a new solution is delivered, then there is a greater chance of project success.

A "Data Quality Firewall" establishes a visible window on data quality for both business and technical operations. With respect to project delivery, new solutions should apply the above model to identify and incorporate a number of data quality firewall elements into their design and delivery (a sketch of a simple pre-load gate follows at the end of this section):

• Up-front visibility of known data quality issues, based on proactive profiling of the data (see above).
• Strong metadata management at the Business Glossary level to ensure consistent business understanding of data, and good technical metadata management to align with the underlying detailed business rules.
• Using profiling to inform and develop the library of Data Quality Rules that apply to the data.
• Ongoing profiling of data prior to Data Warehouse loading, with rejection of any data records that fail profiling checks.
• A feedback loop from the DQ firewall to inform and influence changes to both operational systems and data warehouse designs ("improving the data improves the design, which improves the data").
• Parameter-based targets and tolerances for data quality, with automated alerts and reporting summaries on identified issues that fail DQ profiling checks.
• Automated generation of trouble tickets for any identified issues.
• A preferred approach of "load trusted": validate data before accepting it into the DW/BI environment. (Compare with "load everything" environments, which flag erroneous records for after-the-fact correction – not ideal.)

Note too that just because we have the ability to profile a feature doesn't mean we should! 100% data quality is almost never necessary (at least for analytic decision-making). Pragmatism should be applied to prioritise profiling and remedial efforts.
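As referenced above, here is a minimal Python/pandas sketch of such a "load trusted" pre-load gate: records failing any Data Quality Rule are rejected rather than loaded, and a warning (standing in for automated trouble-ticket generation) is raised when the reject rate breaches a tolerance. The rules, threshold and logging mechanism are illustrative assumptions, not a prescribed implementation.

# Minimal sketch of a "load trusted" pre-load gate for a Data Quality Firewall.
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("dq_firewall")

RULES = {
    "customer_ref_populated": lambda df: df["customer_ref"].notna(),
    "quantity_positive": lambda df: df["quantity"].notna() & (df["quantity"] > 0),
}
REJECT_ALERT_THRESHOLD = 0.02  # alert if more than 2% of the batch is rejected

def firewall(batch: pd.DataFrame):
    """Return (accepted, rejected) record sets; alert when the tolerance is breached."""
    passed = pd.Series(True, index=batch.index)
    for name, rule in RULES.items():
        failed = ~rule(batch)
        if failed.any():
            log.warning("rule %s failed for %d record(s)", name, int(failed.sum()))
        passed &= ~failed
    accepted, rejected = batch[passed], batch[~passed]
    if len(batch) and len(rejected) / len(batch) > REJECT_ALERT_THRESHOLD:
        log.warning("reject rate %.1f%% exceeds tolerance - raise a trouble ticket",
                    100 * len(rejected) / len(batch))
    return accepted, rejected

if __name__ == "__main__":
    batch = pd.DataFrame({"customer_ref": ["C001", None, "C003"], "quantity": [5, 2, 0]})
    ok, bad = firewall(batch)
    print(f"loaded {len(ok)} record(s), rejected {len(bad)}")

The same rule library can feed both this gate and the feedback loop to source systems, so that rejected records drive remediation rather than silently disappearing.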
RECOMMENDED ACTION POINTS:
• Incorporate "Data Quality Firewall" considerations as a necessary requirement for all new projects.
• Coach Data Quality by Design into Business Analyst and Solution Architecture teams.
For measuring a particular data set against each dimension, appropriate tolerances need to be established to define what is (or is not) deemed acceptable for general usage. Issuing a Data Quality Declaration statement (DQD) with each data set provides contextual and narrative guidance to data consumers as to the relative suitability of the data set within a given context. The consumer can then judge whether or not the data set is suitable for their purpose. A DQD would typically include:

• A simple GREEN/RED indicator for each DQ dimension;
• A summary statement of the source context of the data, including a description of the provenance, relevance and authority of the data source;
• Interpretability: any explanatory narrative;
• Accessibility: how the data set can be accessed, and its format.

RECOMMENDED ACTION POINTS:
• Consider defining a Data Quality Declaration for each data set issued.
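For illustration, a Data Quality Declaration could be captured as a simple, publishable structure alongside each data set. A minimal Python sketch with invented field values; the exact fields and the GREEN/RED scheme would be for each organisation to define.

# Minimal sketch: a Data Quality Declaration (DQD) published with a data set.
from dataclasses import dataclass, field

@dataclass
class DataQualityDeclaration:
    data_set: str
    source_context: str          # provenance, relevance and authority of the source
    interpretability: str        # any explanatory narrative
    accessibility: str           # how the data set can be accessed, and its format
    dimension_status: dict = field(default_factory=dict)  # GREEN/RED per DQ dimension

dqd = DataQualityDeclaration(
    data_set="Sales transactions 2014",
    source_context="Extracted nightly from the order-entry system of record.",
    interpretability="July is missing pending a re-extract; use with caution for trend analysis.",
    accessibility="Flat files on the analytics share, refreshed daily.",
    dimension_status={"Availability": "GREEN", "Completeness": "RED", "Error Free": "GREEN"},
)
print(dqd.dimension_status)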
Data Quality as part of Business Intelligence

BI process steps:
• Define the semantic layer (business glossary terms)
• Prepare key outputs (reports, analysis, ad hoc capability)

How DQ techniques can help:
• Publish a Data Quality Declaration to indicate the level of trust in the data consumed by a report
• Metadata management, lineage & communication of shared understanding
Conclusions

Data Quality techniques enhance the delivery and operation at each stage of delivering the DW/BI environment. This drives better solution design, increased adoption and enhanced business value. Ideally, data quality capabilities will be incorporated during the initial stages of DW/BI solution delivery. However, capability can also be retro-fitted to drive better trust in the DW/BI output.

Ongoing profiling, validation and correction of data ensures the contents of the DW/BI solution remain trusted.
About the author

Alan D. Duncan is an evangelist for information and analytics as enablers of better business outcomes, and a member of the Advisory Board for QFire Software. An executive-level leader in the field of Information and Data Management Strategy, Governance and Business Analytics, he has over 20 years of international business experience, working with blue-chip companies in a range of industry sectors. Alan was named by Information-Management.com in their 2012 list of "Top 12 Data Governance gurus you should be following on Twitter".

Twitter: @Alan_D_Duncan
Blog: http://informationaction.blogspot.com.au