IBM Industry Models and Data Lake

© 2016 IBM Corporation
IBM Industry Models and the IBM Data Lake
January 2017
Pat O’Sullivan – IBM Analytics
Email : posulliv@ie.ibm.com
Twitter : @PatOSullivanIBM
© 2017 IBM Corporation

Disclaimer
▪ IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without
notice at IBM’s sole discretion.
▪ Information regarding potential future products is intended to outline our general product direction and it
should not be relied on in making a purchasing decision.
▪ The information mentioned regarding potential future products is not a commitment, promise, or legal
obligation to deliver any material, code or functionality. Information about potential future products may
not be incorporated into any contract.
▪ The development, release, and timing of any future features or functionality described for our products
remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can
be given that an individual user will achieve results similar to those stated here.

The broadening scope of analytics
Add a business desire for real-time analytics, self-service data and increasing regulation of individual
privacy, and it becomes necessary to have a well-defined, managed and governed approach to information
architecture. We call this IBM’s Data Lake.
[Figure: the analytics landscape, with labels SOA, Master Data Management Hub, Applications, Data Warehouse,
Operational Data Store, Pattern Discovery for Analytics, Sandboxes, Analyze Values, Search For Data,
Reporting, Data Lake and Hadoop.]

Big Data Lakes or Swamps?
▪ As we collect data
• Can we preserve clarity?
• Do we know what we are collecting?
• Can we find the data we need?
▪ Are we creating a data swamp?
▪ How do we build trust in big data?
• Do we know what data is being used for?

The Data Lake
Data Lake = Efficient Management, Governance, Protection and Access.
[Figure: the Data Lake, made up of an Information Management and Governance Fabric, Data Lake Services and
Data Lake Repositories.]

Users supported by the Data Lake
[Figure: the Data Lake (System of Insight) and its Data Lake Services, surrounded by the users and systems it
supports: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams;
Data Lake Operations; Enterprise IT; Other Data Lakes; Systems of Engagement; Systems of Automation;
Systems of Record; New Sources.]

The Data Lake subsystems
[Figure: the Data Lake (System of Insight) subsystems: Catalogue, Self-Service Access and Enterprise IT Data
Exchange. The surrounding users and systems are Analytics Teams; Compliance Team; Information Curator;
Line of Business Teams; Data Lake Operations; Enterprise IT; Other Data Lakes; Systems of Engagement;
Systems of Automation; Systems of Record; New Sources.]

Data lake repositories
[Figure: the categories of data lake repositories: Specialist Processing; Structured and Optimized;
System-level Data (Landing Area); Accumulation of Context for Master and Reference Data; Self-managed Data;
Metadata; Refined data formatted for particular consumers.]

IBM Industry Data Models
IBM Industry Data Models provide pre-defined data structures which help accelerate data warehouse,
data lake and business intelligence projects.
• Industry-specific issues being addressed
• Integrated set of Models from business requirements to low-level design
• Predefined and pretested deployment to RDBMS and HDFS environments
[Figure: the IBM Industry Data Models. Industries: Banking, Insurance, Fin Markets, Retail, Healthcare,
Telecom, E&U. Business issues: Customer Insight, Profitability, Risk, Regulatory Compliance. Content: KPIs,
Business Vocabulary, Supportive Terms, Analysis Models, Data Classifications, Atomic DW Models, Dimensional
Models. Model levels: Business Models, Analysis Models, Design Models. Deployment targets: Data Warehouse,
Operational Data Store, Big Data, Data Marts, under Information Integration & Governance. Project
Acceleration spans the Business and Technical layers.]

IBM Industry Models and main data lake deployment paths
• Business Vocabulary is deployed to the Data Lake Catalog via tools such as InfoSphere Information
Governance Catalog (IGC)
• Atomic (Inmon) and Dimensional (Kimball) Data Models are deployed to the data lake via tools such as
InfoSphere Data Architect (IDA) and ERwin
• Supporting collateral: model-specific white papers and best-practice documents outlining the main
deployment patterns and implementation considerations

Overall set of Models
[Figure: the overall set of Models. Business Vocabulary (IGC): Business Terms/FSDM, Supportive Content,
Analytical Requirements. Data Models at analysis level and design level (IDA): Business Data Model, Atomic
Warehouse Model, Dimensional Warehouse Models.]

Big Data Landscape – main components touched by the IBM Data Models
[Figure: the IBM Data Lake (Data Reservoir) reference architecture. Consumers: Line of Business Applications,
Simple Ad Hoc Discovery and Analysis, Reporting. Roles: Compliance Team, Information Curator, Enterprise IT,
Data Reservoir Operations. Sources: System of Record Applications, Enterprise Service Bus, New Sources, Third
Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Other Systems of Insight.
Interactions: View-based Interaction, Raw Data Interaction, Curation Interaction, Enterprise IT Interaction,
Catalog Interfaces, Information Service Calls, Search Requests, Report Requests, Data Access, Data Deposit,
Data In, Data Out, Notifications, Events to Evaluate, Deploy Real-time Decision Models. Repositories, within
Information Integration & Governance: Published Data (Reporting Data Marts, Sandboxes), Harvested Data
(Information Warehouse), Historical Data (Deep Data, Operational History), Descriptive Data (Catalog),
plus Decision Model Management.]
For full information on the IBM Data Lake Reference Architecture, see the IBM Redbook: Designing and
Operating a Data Reservoir
http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open

Options regarding common models/glossaries to encourage
standardization and reuse
• Shared set of term and physical asset definitions in the Catalog that underpin all queries by all users
• Data Scientists can make use of predefined catalogs and are likely to create new catalog entries during
their daily activities
• Business Users use specific subsets of the same shared Catalog as other users, to ensure consistency of
language and meaning
• Any published structures required by the Business are based on the same standard definitions and
structures as those used elsewhere
• Standardized set of Business Term and Data Model definitions, used to enforce both the meaning and, where
appropriate, the structure of stored data
• Data Management Operations use the same shared set of models and catalog entries to build the necessary
production ETL assets
[Figure: the data lake reference architecture annotated with the points above. User groups: Data
Analysts/Data Scientists (Analytics Tools), Business Users (Line of Business Applications, Consumers of
Insight, Simple ad hoc Discovery and Analysis, Reporting, Analytical Insight Applications), Data Management
Operations. Repositories: Catalog (Descriptive Data), Sandboxes, Reporting Data Marts, Object Cache
(Published), Information Warehouse (Harvested Data), Deep Data and Operational History (Historical Data),
Asset Hub (Shared Operational Data). Other components: Execution Engines, Workflow, Monitor, Data Ingestion,
Publishing Feeds, Service Interfaces.]

Catalog Deployment - Models in the Descriptive Data Zone
Purpose
Provide a standard business language and information model that can be used when discussing business
concepts and related technical components.
Steps
1. Business Vocabulary Models are deployed to the Catalog (IGC), where they are used and maintained by
business analysts and data stewards.
2. The Logical Data Models (e.g. the Business Data Model and the Atomic and Dimensional Warehouse Models)
are imported into the Catalog. However, they are mastered in a modelling tool such as InfoSphere Data
Architect.
Considerations
▪ Evolving patterns/best practices for the overall management of enterprise and LOB glossaries
[Figure: the data lake reference architecture with the Descriptive Data Zone (Catalog) highlighted; the
numbers 1 and 2 mark where the steps above apply. Models shown: Business Vocabulary (IGC) with Business
Terms/FSDM, Supportive Content and Analytical Requirements; Analysis level Models (IDA) and Design level
Models (IDA) with the Business Data Model, Atomic Warehouse Model and Dimensional Warehouse Models.]

Warehouse and Marts – Models in Integrated Warehouse Zone
Purpose
Provide data modellers with consistent data structures for deployment across the different aspects of an
integrated Information Warehouse and Marts zone.
Steps
1. The Atomic Warehouse Model is used as the basis for the Inmon-style central relational Information
Warehouse.
2. The Dimensional Warehouse Model is used as the basis for the Kimball-style Dimensional Information
Warehouse.
3. The Dimensional Warehouse Model provides the business-issue-specific structures to enable the deployment
of Reporting Data Marts.
[Figure: the data lake reference architecture with the Integrated Warehouse & Marts Zone highlighted; the
numbers 1, 2 and 3 mark where the steps above apply. Models shown: Business Vocabulary (IGC) with Business
Terms, Supportive Content and Analytical Requirements; Analysis level Models (IDA) and Design level Models
(IDA) with the Atomic Warehouse Model and Dimensional Warehouse Models.]

Big Data Deployment – Models in the Landing Area Zone
Purpose
Provide the basis for a consistent and appropriate use of schemas in the different repositories in the
Landing Area Zone.
Steps
1. The Atomic Warehouse Model is used as the basis for deploying both schema-at-write and schema-at-read
Hadoop Deep Data structures.
2. The Atomic Warehouse Model may provide the basis for deploying schema-at-read for Operational History raw
data structures.
Considerations
▪ Further investigation is needed into the potential role for DWM deployments to Hadoop-based technology
[Figure: the data lake reference architecture with the Landing Area Zone (Deep Data, Operational History)
highlighted; the numbers 1 and 2 mark where the steps above apply. Models shown: Business Vocabulary (IGC)
with Business Terms, Supportive Content and Analytical Requirements; Analysis level Models (IDA) and Design
level Models (IDA) with the Atomic Warehouse Model and Dimensional Warehouse Models.]
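The difference between the two deployment styles named in the steps above can be illustrated with a minimal
Python sketch. It is not part of the IBM Industry Models or of any Hadoop tooling. The schema, the column
names and the sample data are hypothetical, standing in for a structure that would be derived from the Atomic
Warehouse Model.

```python
import csv
import io

# Hypothetical schema for illustration: column name -> Python type.
# In practice this would be generated from a model, not written by hand.
PATIENT_SCHEMA = {"patient_id": int, "family_name": str, "birth_year": int}

# Hypothetical landed file; the second data row does not fit the schema.
RAW_LANDED_DATA = "patient_id,family_name,birth_year\n1,Murphy,1970\n2,Byrne,unknown\n"


def conform(row, schema):
    """Cast one raw row to the schema; raises KeyError or ValueError if it does not fit."""
    return {name: cast(row[name]) for name, cast in schema.items()}


def write_with_schema(raw_text, schema):
    """Schema-at-write: rows are validated on ingest and only conforming rows are stored."""
    stored, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            stored.append(conform(row, schema))
        except (KeyError, ValueError):
            rejected.append(row)
    return stored, rejected


def read_with_schema(raw_text, schema):
    """Schema-at-read: the raw text is kept untouched and the schema is applied per query."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            yield conform(row, schema)
        except (KeyError, ValueError):
            yield None  # the reader decides how to treat non-conforming rows


stored, rejected = write_with_schema(RAW_LANDED_DATA, PATIENT_SCHEMA)
projected = list(read_with_schema(RAW_LANDED_DATA, PATIENT_SCHEMA))
```

In both cases the same schema definition is reused, which is the point of deriving it from one model. Only
the moment at which it is enforced changes.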

Summary Picture
• The Business Vocabulary Terms in IGC can be used to enforce common business meaning throughout the Data
Lake landscape.
• The output of the various Logical Models can be used to define the technical structure of assets that need
to be created in the lake where a predefined schema is required (e.g. schema-at-write).
Legend
• Mappings to inform common Business Meaning using the Business Vocabulary in IGC
• Generation of Technical Structure using the ER Data Models in an ER tool (e.g. IDA)
• Use of Business Vocabulary to understand Business Meaning by Users
[Figure: the data lake reference architecture overlaid with the model artifacts: Business Vocabulary; Logical
Model Atomic; Logical Model Dimensional; Physical Model Hadoop; Physical Model RDBMS; Physical Model
Dimensional. Numbered markers 1 to 10 show the mapping, generation and usage links between these artifacts,
the repositories (Catalog, Asset Hub, Deep Data, Operational History, Reporting Data Marts, Sandboxes) and the
Business Users and Data Scientists.]

Three different lifecycles relating to the evolution of the models with the
Data Lake
1. Maintenance of the Business Language (BT, AR, SG): the use of the Industry Models Business Vocabularies to
enable a common Business meaning of language by all Data Lake users
2. Development of the ER/UML Models (BDM, AWM, DWM): the use of the ER and UML models to enforce a common
structure of artifacts where required in the Data Lake
3. Management of the runtime production environment (BT, AWM (Physical), DWM (Physical)): the use of the
Industry Models Business Vocabularies and derived physical assets in the creation and ongoing management of
the Data Lake
Legend
BT - Business Terms
AR - Analytical Requirements
SG - Supportive Glossaries
BDM - Business Data Model
AWM - Atomic Warehouse Model
DWM - Dimensional Warehouse Model
[Figure: three linked cycles. The Business Language cycle has the stages Analysis, Refine, Deploy, Review,
Requirement. The ER/UML Models cycle has the stages Analysis, Design, Generate, Review, Requirement. The
runtime production environment connects the Data Lake Catalog, Data Lake Repositories, Data and Data Lake
Users.]

The Repositories used by the Data Lake Lifecycles
[Figure: the repositories and tools used by the lifecycles. Business Language Environment: IGC Dev
Repository, IGC Browser, IGC Workflow. Modelling Environment: IDA, IGC for Eclipse, Collaboration/Versioning
Repository (e.g. RTC), Physical Data Model. Runtime Data Lake Environment: Data Lake Catalog (IGC Production
Repository), IGC Anywhere/REST, IGC Browser, Data Repositories on RDBMS, Data Repositories on HDFS. Transfer
paths between the environments: IMAM and IMAM IDA Import.]

Lifecycle 1 - Maintaining the Business Language of the Data Lake
▪ Objective: The creation and ongoing maintenance of the common Business Language to be used by all users to
describe the various components that make up or underpin the Data Lake
▪ Roles Involved: Business user reps, Business SMEs, Business Language Stakeholders
▪ Considerations:
• Determining the needs of the different users of the Data Lake (different uses, need for different
dialects, amount of technical metadata in the Language)
• Determining the approach to building the business language and the overall flow for creation, promotion
and maintenance of terms
• Defining the specific glossary suitable for pure business users, versus Business Analysts, Data
Scientists, Data Modellers and IT staff
• Determining the role of IBM Industry Models in building out the Business Language
[Figure: the Maintenance of the Business Language cycle (Analysis, Refine, Deploy, Review, Requirement)
covering AR, BT and SG.]

Lifecycle 2 - Developing the technical Models
▪ Objective: The use of the ER and UML models to enforce a common structure of artifacts where required in
the Data Lake
▪ Roles Involved: Modellers, Business SMEs
▪ Considerations:
• Ensuring appropriate communication between the Data Modellers and the Business Users
• Determining when to use, and when not to use, Data Models for the data lake repositories
• Determining the ongoing use of a Canonical Platform-Independent Logical Model as a basis for the
deployment of the different types of platform-specific physical Models required across the Data Lake
Repositories
• Determining the specific data modelling approaches and scenarios for deploying to the different Data Lake
repositories
[Figure: the Development of the ER/UML Models cycle (Analysis, Design, Generate, Review, Requirement)
covering BDM, AWM and DWM.]
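The idea of one platform-independent logical model feeding several platform-specific physical models can be
sketched in a few lines of Python. This is an illustration only, not how IDA or ERwin generate physical
models. The entity, the attribute names, the logical type names and the type mappings are all assumptions
made for the example.

```python
# Hypothetical logical entity; names and logical types are illustrative only.
LOGICAL_ENTITY = {
    "name": "Claim",
    "attributes": [
        ("claim_id", "identifier"),
        ("claim_amount", "amount"),
        ("submitted_on", "date"),
        ("description", "text"),
    ],
}

# Assumed mapping from logical types to platform-specific column types.
TYPE_MAP = {
    "rdbms": {
        "identifier": "BIGINT",
        "amount": "DECIMAL(18,2)",
        "date": "DATE",
        "text": "VARCHAR(255)",
    },
    "hive": {
        "identifier": "BIGINT",
        "amount": "DECIMAL(18,2)",
        "date": "DATE",
        "text": "STRING",
    },
}


def physical_columns(entity, platform):
    """Translate the logical attributes into (column, type) pairs for one platform."""
    types = TYPE_MAP[platform]
    return [(name, types[logical]) for name, logical in entity["attributes"]]


def create_table(entity, platform):
    """Render a simple CREATE TABLE statement for the chosen platform."""
    cols = ",\n".join(f"  {name} {col_type}" for name, col_type in physical_columns(entity, platform))
    return f"CREATE TABLE {entity['name'].lower()} (\n{cols}\n)"


rdbms_ddl = create_table(LOGICAL_ENTITY, "rdbms")
hive_ddl = create_table(LOGICAL_ENTITY, "hive")
```

Because both statements come from the same logical definition, a change to the logical entity reaches every
physical target, which is the benefit the canonical-model consideration above is weighing.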

Lifecycle 3 - Deploying the Models into the runtime Data Lake environment
▪ Objective: The use of the Industry Models Business Vocabularies and derived physical assets in the creation
and ongoing management of the Data Lake
▪ Roles Involved: Business user reps, Modellers, Data Lake Ops staff
▪ Considerations:
• Determining how to deploy the Business Language for optimal use by the different Data Lake users
(management of access to the different terms, handling of ongoing updates)
• Determining the strategy for the ongoing association of the Business Terms with Data Assets (which users
tag new data elements with the Business Language, and when)
• Determining the approach for the Data Lake ops staff to deploy the physical Data Models, and how feedback
to the Data Modellers is handled
• Determining how to incorporate the Data Model artifacts into the ongoing Data Lake governance aspects
[Figure: management of the runtime production environment. BT, AWM (Physical) and DWM (Physical) are deployed
into the Data Lake Catalog and Data Lake Repositories, which hold the Data used by the Data Lake Users.]

Industry Models Hadoop deployment example – low level
[Figure: Sample Source Data (Claim File, Patient Information File) passes through a Data Transformation
Process (Hive, Spark, Pig, ETL, ..) into HDFS, under
/data/udmh/patient/<date>/<version>/.. Data files..
/data/udmh/claim/<date>/<version>/.. Data files..
The Hive Metastore holds a Patient party ext Table and a Claim ext Table over these files, reached through
HIVE or a vendor SQL for Hadoop interface and used by Downstream Data Transformation processes. The Logical
Data Model and the Physical Data Model show the Patient, Claim and Patient/Claim entities. The numbers 1, 2
and 3 mark three possible deployment paths.]
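The external-table part of this example can be sketched as a small Python generator for HiveQL. It is a
sketch under stated assumptions, not the deployment produced by the Industry Models. The table name, the
column list, the partition column names load_date and version, the comma-delimited text format and the sample
date and version values are all hypothetical. Only the /data/udmh/patient base path comes from the slide.
Because the <date>/<version> directories are not in Hive's key=value form, each one is registered explicitly
with ADD PARTITION.

```python
def external_table_ddl(table, columns, base_path):
    """Build a CREATE EXTERNAL TABLE statement over files that already sit in HDFS."""
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        # Assumed partition columns mirroring the <date>/<version> directories.
        "PARTITIONED BY (load_date STRING, version STRING)\n"
        # Assumed file format: comma-delimited text.
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{base_path}'"
    )


def add_partition_ddl(table, base_path, load_date, version):
    """Register one <date>/<version> directory as a partition of the external table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (load_date='{load_date}', version='{version}') "
        f"LOCATION '{base_path}/{load_date}/{version}'"
    )


# Hypothetical columns; real ones would come from the physical data model.
PATIENT_COLUMNS = [("patient_id", "BIGINT"), ("family_name", "STRING")]

table_ddl = external_table_ddl("patient_party_ext", PATIENT_COLUMNS, "/data/udmh/patient")
partition_ddl = add_partition_ddl("patient_party_ext", "/data/udmh/patient", "2017-01-15", "v1")
```

Dropping an external table removes only the metastore entry and leaves the data files in place, which suits
a landing area whose files are also read by other transformation processes.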

Mapping of incoming new structures in the Data Lake
[Figure: the lifecycle repositories (IGC Dev Repository, IDA, IGC for Eclipse, Physical Data Model, IGC
Workflow, IMAM, IMAM IDA Import) and the Runtime Data Lake Environment (Data Lake Catalog in the IGC
Production Repository, IGC Anywhere/REST, IGC Browser, Data Repositories on RDBMS and HDFS), with a New HDFS
Structure added and the mapping paths 1, 2a, 2b and 2c marked.]
An open question is what the best practices are for the “bottom-up” mapping of a new structure in the data
lake that was not originally derived from a Data Model. Two options:
1. Direct mapping from the Physical Asset to the appropriate Term in the Catalog
2. Indirect mapping via a specifically created data model (actual mapping done either via BGE or in BG Browser)
a. Reverse engineer a new model from the HDFS Structure
b. Import the Data Model into the Catalog
c. Import the mappings into the Catalog from IDA (if the mapping is done in IDA via BGE)
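Option 1 above, direct mapping, can be illustrated with a small Python sketch that proposes a Catalog term
for each column of a new structure by name similarity. This is an assumption made for illustration and is
not a feature of IGC or IDA. The term list and column names are hypothetical, and any proposal would still
need review by a data steward before it is recorded in the Catalog.

```python
import difflib

# Hypothetical Business Terms; in practice these would be read from the Catalog.
CATALOG_TERMS = ["Patient Identifier", "Claim Amount", "Claim Submission Date"]


def normalise(name):
    """Make column names and term names comparable: underscores to spaces, lower case."""
    return name.replace("_", " ").lower()


def suggest_term(column, terms, cutoff=0.6):
    """Return the closest term by name similarity, or None when nothing is close enough."""
    by_key = {normalise(term): term for term in terms}
    matches = difflib.get_close_matches(normalise(column), list(by_key), n=1, cutoff=cutoff)
    return by_key[matches[0]] if matches else None


def propose_mappings(columns, terms):
    """Propose a column -> term mapping for a new structure; None means 'needs a human'."""
    return {column: suggest_term(column, terms) for column in columns}


# Hypothetical columns discovered in a new HDFS structure.
proposals = propose_mappings(["patient_identifier", "claim_amount", "zz_tmp"], CATALOG_TERMS)
```

Columns that come back as None are the ones where the indirect route, through a reverse-engineered model,
may be the better fit.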

Model artifacts in the Data Lake Runtime environment – main usage
patterns
There are three main ways in which the data model artifacts are used in, or have an impact on, the Data Lake
runtime environment:
• Industry Model artifacts are deployed into the Data Lake runtime environment
  • Most likely as an output from the two lifecycles “Maintaining the Business Language” and “Developing the
  technical Models”
• Industry Model artifacts deployed in the Data Lake are used by and affected by Data Lake users
  • For example, Data Lake users provide feedback on changes/corrections/additions to the model artifacts
• Industry Model artifacts deployed in the Data Lake are impacted by new or changed data coming into the Data
Lake Repositories
  • The most obvious example is the need for new mappings to a new or changed Repository brought into the
  Data Lake

REFERENCE MATERIAL
New Information Architectures and Capabilities

Designing and Operating a Data Reservoir
▪ Description of the behaviour and processes that make up a data reservoir (IBM’s Data Lake)
▪ Blog
• 5 things to know about a data reservoir
https://www.ibm.com/developerworks/community/blogs/5things/entry/5_things_to_know_about_data_reservoir?lang=en
▪ Redbook
• http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open

IBM Industry Models and Data lake publications so far :
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14877USEN
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14872USEN
https://www-
bin/ssialias?htmlfid=IMW14911IEEN
&
