Exploration of large and complex data estates to gain an accurate understanding of the data structures and data quality.
Presentation given by Ontology Systems and BSkyB at SemTechBiz - The Semantic Technology & Business Conference on October 2nd 2013
Zen and the Art of Datanauting
1. Exploration of large and complex data estates to gain an accurate understanding of the data structures and data quality
Zen, and the art of Datanauting
Carl Bray
Product Manager, Ontology Systems
Matt Clark
Design Authority, BSkyB
3. 15 years of transaction data
10 million+ customers
900 engineers making changes
30 TB of data
20+ Applications
Q) How do you start to understand this data estate?
4. The company
• UK subsidiary of a global media organisation
• Provides fixed line telephone, Internet and television entertainment services to UK residents
• 10 million+ customers, trading for 15 years
Business drivers:
• Driven by marketing innovation
• Extend and upsell to customer base
• React to competitive threats
• Technical infrastructure impacting commercial agility
The motivation behind the project
Background and Business Drivers
5. Objective
• Significantly reduce the time to capture new business strategies in IT systems
Significant change in IT delivery
• Embrace Agile delivery of new functionality
• Develop new payment and sales systems
• Access and extend existing data
• Multiple SCRUM teams using test-driven development
• Phased delivery
Short-term technical drivers
• Quickly understand the structure, nature and consistency of the existing data
Longer term technical drivers
• Introduce a service-based semantic agent to access software services
Fundamentally changing the way IT functionality is delivered
A new IT Strategy
6. Subject matter experts (SMEs)
• Understanding the data means interfacing with SMEs
• Multiple SCRUM teams need access to SMEs
• Knowledge is held in silos and is not co-located with SCRUM teams
• SMEs may not know the answers
Bottleneck / Choke point
• SCRUM teams need quick answers to data / process questions
• SME bandwidth stifles SCRUM agility
• Introduces a single project bottleneck/choke point
Overwhelming the SMEs
• Free and unfettered access to the SMEs would create chaos
• Need to filter questions to the SMEs
Challenges
Many technical challenges stood in their way
[Diagram: SCRUM teams and SMEs around the CRM, Billing, Ref Data, Debt, Orders, Ticketing, Content and Product systems]
7. Many systems with complex interdependencies
• CRM
• Billing
• Reference Data
• Debt processing
• Order handling
• Trouble ticketing systems
• Subscriber card management systems
• Content access entitlements
• Product catalogue
Fragmentation
• Business entities fragmented
• “Customer” properties in many systems
The Scope and Scale of the Problem
Payments and sales system involving 20+ systems and legacy data
8. Data estate problems
• Data quality isn’t consistent
• Data fragmentation is high
• Understanding the data is complex
• How are business entities stored in different applications and data sources?
• What impact should processes have on the data (flags, statuses, etc.)?
• When data is duplicated, which data sources should take preference?
• Scale of data
• 30+ TB of historic trading data
• Of the 3 Vs, the Variety and Volume of data are very high
The Data
30TB of transactional data over 15 years of system changes
9. Non-semantic alternatives
• Train more SMEs
• Work around SME’s other priorities
• Educational workshops
• Take time to document systems
Data-profiling alternatives
• Reverse engineering schemas
• ETL Tooling
• Didn’t want to create yet another data warehouse
Chose a datanauting approach
• Supports their commitment to Agile development
• Allows SCRUM teams to explore and ask questions of the data without overloading SMEs
Alternatives
Alternative approaches to solving the problem were considered
10. What we do, and why we’re different
• Ontology leverages graph and semantic search technologies to address enterprise data issues
• We address complex data integration problems
• Data Acquisition
• Data Correlation
• Data Migration
• We produce fully fledged operational applications that use semantic search in
• Telecommunications
• Media
• Financial services
• The Ontology Difference
• Inherently agile – no schema
• Datanauting: data-first, structure later
• Just enough modelling
• Structured and unstructured data
How we approached the problem
The Ontology Approach
11. Exploration of data sources…
The Ontology Approach - Datanauting
Identify sources
Connect to sources
• Index source
Search for entities
• Refactor entities
• Create URI pattern matching
• Map entities to RDF
Search for linked entities
• Add references
Search for equivalent entities
• Create matching URIs
• Map entities to RDF
12. • DBs
• SPARQL Endpoints
• Structured files
• MS Excel, CSV, XML, RDF
• Cisco and other device configurations
• Proprietary formats
• Unstructured files
• MS Word, PDFs, etc.
The Ontology Approach - Datanauting
Identify sources
13. • Set up the connection
• Index sources
• Add search facets
• Tokenise compound values e.g.
• Service names are concatenated “Service-LON/01”
• Product names use “CamelCase”
The Ontology Approach - Datanauting
Connect to sources
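The tokenising step on this slide can be pictured with a minimal sketch. It assumes tokens are split on separators and on CamelCase boundaries; the function name `tokenise` and the product name `BroadbandUnlimited` are hypothetical, and this is not a description of how the Ontology product indexes values.

```python
import re
from typing import List


def tokenise(value: str) -> List[str]:
    """Split a compound value into lowercase search tokens (illustrative only)."""
    tokens: List[str] = []
    # First break on separators such as '-', '/', '_' and whitespace.
    for part in re.split(r"[-/_\s]+", value):
        # Then split CamelCase, keeping acronyms and digit runs together.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [token.lower() for token in tokens]


if __name__ == "__main__":
    print(tokenise("Service-LON/01"))
    print(tokenise("BroadbandUnlimited"))  # hypothetical product name
```

Indexing the individual tokens lets a search for "LON" or "broadband" find values that only contain those terms as part of a longer concatenated name.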
14. • Search for business entities
• Refactor “denormalised” data
• Choose a URI pattern to represent instances
• Set a type for the entity
• Map properties to owl:DatatypeProperty
The Ontology Approach - Datanauting
Search for entities
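As a rough illustration of the steps on this slide, the sketch below applies one URI pattern to a source row, sets a type, and emits each remaining column as a property declared as owl:DatatypeProperty, serialised as N-Triples. The `http://example.org/` namespace, the function names, and the sample row are assumptions made for the example; column names are assumed to be plain identifiers that are safe to use in an IRI.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_DATATYPE_PROPERTY = "http://www.w3.org/2002/07/owl#DatatypeProperty"


def entity_uri(entity_type, key):
    """Apply one URI pattern: <BASE><type in lowercase>/<percent-encoded key>."""
    return f"{BASE}{entity_type.lower()}/{quote(str(key), safe='')}"


def literal(value):
    """Render a value as an N-Triples string literal with basic escaping."""
    text = str(value)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return f'"{text}"'


def row_to_ntriples(entity_type, key_column, row):
    """Map one source row to N-Triples lines: a type triple plus one literal per column."""
    subject = entity_uri(entity_type, row[key_column])
    triples = [f"<{subject}> <{RDF_TYPE}> <{BASE}{entity_type}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the URI; skip empty values
        prop = f"{BASE}{column}"
        # Declare the column as a datatype property, then attach the value.
        triples.append(f"<{prop}> <{RDF_TYPE}> <{OWL_DATATYPE_PROPERTY}> .")
        triples.append(f"<{subject}> <{prop}> {literal(value)} .")
    return triples


if __name__ == "__main__":
    for line in row_to_ntriples("Customer", "customer_id",
                                {"customer_id": "C 42", "surname": "Smith"}):
        print(line)
```

Keeping the URI pattern in a single function matters later in the process: the same pattern can be reused when the entity turns up in another source.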
15. • Search for entities that should be linked
• Add references (owl:ObjectProperty) between entities that are to be linked
The Ontology Approach - Datanauting
Search for linked entities
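One way to picture a reference between linked entities is a foreign-key column turned into a triple whose object is another entity's URI. The sketch below assumes that setup; the namespace, the `placedBy` property, and the sample order rows are all hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"


def entity_uri(entity_type, key):
    """Apply one URI pattern: <BASE><type in lowercase>/<percent-encoded key>."""
    return f"{BASE}{entity_type.lower()}/{quote(str(key), safe='')}"


def link_triples(rows, source_type, source_key, target_type, foreign_key, predicate):
    """Emit N-Triples that link each source row to the entity its foreign key names."""
    prop = f"{BASE}{predicate}"
    # Declare the linking property once as an owl:ObjectProperty.
    triples = [f"<{prop}> <{RDF_TYPE}> <{OWL_OBJECT_PROPERTY}> ."]
    for row in rows:
        if row.get(foreign_key) is None:
            continue  # nothing to link to
        subject = entity_uri(source_type, row[source_key])
        target = entity_uri(target_type, row[foreign_key])
        triples.append(f"<{subject}> <{prop}> <{target}> .")
    return triples


if __name__ == "__main__":
    orders = [
        {"order_id": "O1", "customer_id": "C42"},
        {"order_id": "O2", "customer_id": None},
    ]
    for line in link_triples(orders, "Order", "order_id",
                             "Customer", "customer_id", "placedBy"):
        print(line)
```

Because the target URI is built with the same pattern used when the customer was first mapped, the link resolves without any lookup against the source system.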
16. • Search for semantically equivalent entities in other data sources
• Search based on property names
• Search based on strict value matching/weighting
• Search based on sub-string matching/weighting
• Reuse the URI pattern
• Create references
The Ontology Approach - Datanauting
Search for equivalent entities
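The strict and sub-string matching ideas on this slide can be sketched as a weighted score over shared properties. The weights, the threshold, and the sample records below are assumptions for illustration; the sketch also assumes property names have already been aligned across sources, and it does not describe the product's actual matching algorithm.

```python
def match_score(a, b, exact_weight=1.0, substring_weight=0.5):
    """Score two records over the property names they share."""
    score = 0.0
    for prop in a.keys() & b.keys():
        left = str(a[prop]).strip().lower()
        right = str(b[prop]).strip().lower()
        if not left or not right:
            continue  # empty values carry no evidence
        if left == right:
            score += exact_weight        # strict value match
        elif left in right or right in left:
            score += substring_weight    # sub-string match
    return score


def find_equivalents(entity, candidates, threshold=1.5):
    """Return candidates scoring at or above the threshold, best first."""
    scored = [(match_score(entity, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for score, candidate in scored if score >= threshold]


if __name__ == "__main__":
    crm_customer = {"surname": "Smith", "postcode": "AB1 2CD"}
    billing_accounts = [
        {"surname": "Smithson", "postcode": "ZZ9 9ZZ", "account": "A2"},
        {"surname": "smith", "postcode": "AB1 2CD", "account": "A1"},
    ]
    print(find_equivalents(crm_customer, billing_accounts))
```

A candidate that clears the threshold would then reuse the URI pattern of the original entity, so both sources describe the same resource.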
17. High-level solution to the problems the organisation faced
• Removed the SME bottleneck - a key enabler for the Agile / SCRUM approach
• Creates a searchable domain model, breaking the data into discrete “chunks”
• Ontology allows the SCRUM teams to understand the legacy data through ad-hoc queries
• Can understand how business concepts are mapped across multiple contradictory data repositories
• The quality and suitability of data can more easily be assessed
• Provides a definitive view of the commercial position for a given subscriber or set of subscribers
• Backlog and sprint priorities are based on a complete understanding of the complexity of the task
• Provide data to facilitate mock ups and test harnesses
Ontology provides SCRUM members with insight into the data
Project Results
18. Project Results
SCRUM teams gain insight into data
[Diagram: SCRUM teams and SMEs around the CRM, Billing, Ref Data, Debt, Orders, Ticketing, Content and Product systems]
19. Project Results
Product Architecture
[Architecture diagram. Labelled components: Ontology 4 Modeller; Ontology 4 Runtime; Modeller; Web UI; Ontology Intelligent 360; Ontology Integrity Manager; Query API; Semantic Graph Store; Universal Search Core; Semantic Processing Core; Authentication and Notification; LDAP Server (optional); Mail Server (optional); External Event Sources; RTIA; HTTPS; End Users (Browser Access).
Fully Modelled Data Sources: CSV, RDBMS, XML, JDBC, XLS.
Other Data Sources: DOC, PDF, XLS, MAIL, XML.]
20. Variety
• Ability to access data in a variety of formats
• Avoid integration to live systems
• Possible to work from database dumps, which avoids politics
• Embracing change – inherently agile
Volume
• Ontology techniques for managing data scale
• Partial index of data
• Partial modelling
• Semantic search with SQL query to live systems
[Diagram: the three Vs (Velocity, Variety, Volume)]
Project Results
Dealing with two large Vees
21. Why Ontology?
• Agile response through inherently agile technology
• Datanauting provides agile response to SCRUM teams
• SME time can now be used for valuable queries
Technical advantages
• No Schema, No Integration, No Big Bang, No Search Restrictions, No Upfront Risk
Benefits delivered
• Speed – Greatly accelerated the analysis phase of the project
• Risk – Project is not viable without an understanding of the data
Zen, and the art of Datanauting
Advantages of the Ontology approach to Data Integration