STI Summit 2011 - di@scale

402 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
402
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

STI Summit 2011 - di@scale

  1. 1. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Michael L. Brodie STI Fellow Michaelbrodie.com ! Copyright 2011 STI - INTERNATIONAL www.sti2.org The following is for Fortune 100 organizations 1
  2. 2. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Based on 2010 industry data verified by 4 analyst organizations 10+ Fortune 100 enterprises Did you know… Information Systems? 2
  3. 3. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Information Systems in a Fortune 100 company ~10 thousand 100s per business area (e.g., billing) 30% mission critical 30% modified significantly annually Did you know… Databases? 3
  4. 4. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Logical databases in a Fortune 100 … ~10,000 100s per business area (e.g., billing) ~90% relational 100s new / year Database size … 20% large 1-10 TB 40% medium 10 GB-1 TB 40% small <10GB 4
  5. 5. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Database growth… 2X every 18 mo OLTP / transactional 2X every 12 mo OLAP / decision support Logical database complexity … Tables/database: 100-200 Attributes/table: 50-200 Stored procedures/database: ~# tables Web services/database: < # tables but !! Views: ~3X # tables 5
  6. 6. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Did you know… Database Design? Database design level … ~100% table (SQL) ~0% Entity Relationship 6
  7. 7. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Did you know… Relational Database Integration? Data integration accounts for … ~40% software project costs 7
  8. 8. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Circa 1985 Tables: 2+ Dominant operation: join Design: schema-based Circa 2010 Tables: 10s-1000s Dominant operations !  Extract, Translate, Load (ETL) !  File transfer !  Then … join Design !  Evidence gathering (often manual) !  60-75% schema-less 8
  9. 9. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Did you know… Information Ecosystems? Information systems that interact to support domain-specific end-to-end processes, e.g., supply chain, billing, marketing, sales, engineering, operations, finance, … 9
  10. 10. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Information Ecosystems involve … 100s-1,000s Information Systems 100s-1,000s Databases Conclusion Information systems & databases live in complex, deeply inter-connected worlds 10
  11. 11. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Did you know… Database Integration @ Scale? Database integration in an Information Ecosystem … 100s-1,000s source databases 1,000s-millions function-specific views 11
  12. 12. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Every organization with a database depends critically on … Data + Data Integration 12
  13. 13. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Boring huh? Then how about … 13
  14. 14. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Deeply sophisticated analysis [data integration] across … 28 intelligence databases connected across 17 US intelligence agencies over 5 years totally missed … 14
  15. 15. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Data Integration @ Scale 15
  16. 16. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com What is Data Integration? Combine 2+ data items to form a meaningful result Source1 … Mapping Target Sourcen 1980’s Data Integration Account Charge Name Address Acct# … Acct# Charge Date … Bill 43 Main 123… 123… $2.32 2/1/10 Sally 12 Peel 489… 149 $1.17 1/7/10 TelcoSum (Charge) where Account.Acct# = Charge.Acct# DI computation Entity Mapping Monthly Bill Name Address Acct# Amount Bill 43 Main 123… $2.32 Sally 12 Peel 489… $0.00 16
  17. 17. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com What does DI mean? "  Determined by application “semantics” "  Financial Close for a Large Enterprise •  What: Combine data from 100s-1,000s of systems •  Meaning: Defined by financial accounting practices, jurisdictional regulations, industry standards, … •  Speed: 1-2 days after end of period •  Solutions: e.g., SAP Financial, PeopleSoft Financial, Hyperion •  Governance: laws and regulations •  Correctness: o  US= Generally Accepted Accounting Principles (US GAAP) o  GB = Generally Accepted Accounting Practice (UK GAAP) o  Etc. "  Data Integration = •  Entity mapping •  Data integration computation Definitions (cont.) "  Semantically homogeneous relational data descriptions •  Verified by an authoritative expert •  To describe the same entity (considering set inclusion) •  For which there is one and only one simple, complete entity mapping between the relational schemas for the data descriptions •  The entity mapping can be defined in relational expressions "  Semantically heterogeneous relational data descriptions •  Verified by an authoritative expert •  To describe the same entity (considering set inclusion) •  For which there is not a complete entity mapping between the relational schemas for the data descriptions •  Entity mappings cannot be defined in relational expressions 17
  18. 18. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Deeper Thoughts "  Schemas: sufficient but not necessary "  Simple map: 1:N, … atomicity "  Degree of semantic heterogeneity Semantic Semantic homogeneity heterogeneity Complete No mapping mapping Consequences "  Semantically homogeneous relational data descriptions •  Data independence + updatable => consistent views •  Provide basis for o  Correct o  Efficient •  Schema level design and integration "  Semantically heterogeneous relational data descriptions •  No guarantee of: Data independence + updatable views •  Provide no basis for o  Correctness o  Efficiency •  Schema level design and integration not always possible "  Data modelling and integration best practices •  Semantic homogeneity •  Radical simplification •  Standardization: ontologies, schemas, application solutions, … 18
  19. 19. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Semantic Heterogeneity … "  Is unavoidable @ scale •  Autonomous design •  Ontologically heterogeneous, e.g., Enterprise Customer o  Legal entity-based ontology o  Location-based ontology "  Dominates semantic homogeneity Information Ecosystem Integration (3D) Telco Service Provider VZW Worldcom MCI BA GTE Business Domains Ordering Marketing Finance Billing Repair Sales Enterprise Customer Wells Fargo Wells Fargo Wachovia Bank of America Merril Lynch Bank of America 19
  20. 20. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Correctness & Efficiency "  Legally Binding Requirements •  Law governing communications, financial transactions, privacy, … o  E.g.,: Customer Proprietary Network Information (CPNI) •  Regulation: rules defining operational details o  E.g.,: FCC CPNI rules –customer views require manual verification •  Customer agreements: SLAs for all services •  Telco agreements: SLAs for all production systems "  Mission Critical: Operational & Information Correctness "  Complex Views •  Customer Defined Views •  Inconsistent & Conflicting Views o  Business perspectives, manually defined, no ultimate truth or agreement o  Conflicting telephone #: Long-distance vs. local Data Modelling: Enterprise Customer "  Enterprise Customer data description •  Enterprise customer identifier (ECID) •  Enterprise customer data Level of abstraction o  Telco information, e.g., billing, must be mapped (integrated into) Enterprise Customer organizational units "  Enterprise Customer Ontologies = Conceptualizations •  Legal entity •  Locations •  Telco contracts Perspective •  ~40 others 20
  21. 21. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Enterprise A Legal entity structure (partial) •  ~400 legal entities •  14 levels deep •  20,000 accounts (~50/entity) Industry Standard Legal-entity Ontology Industry Standard Ontology Radically Simplified Schema Legal-entity based ECID instance (DUNS code) 21
  22. 22. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Location-based Ontology Industry Standard Ontology Location-based ECID instance Radically Simplified Schema Contract-based Ontology Industry Standard Ontology Contract-based ECID Instance Radically Simplified Schema 22
  23. 23. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Entity maps for semantically heterogeneous Enterprise Customer Entities are schema-less Location-based ECID schema Legal entity-based ECID schema Contract-based ECID schema Enterprise Customer: 3 Departments in 2 locations les Sa l- C ga Le l- L ga Le Ma rke tin Ma g rke tin g Le ga l- L 23
  24. 24. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Location-Based Enterprise Customer IDs: 2 locations Two Entities Legal entity-based Enterprise Customer IDs: 3 organizations with 1, 2 and 3 departments les Sa l- CMa ga rke Le tin g Le ga l- L Four Entities 24
  25. 25. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Multiple Versions of Truth 1st Amendment Civil Liberties Bigamy Religious Freedom LDS Law repealed 1990 Fundamental Against LDS Law 1838 Women’s Statutory rights rape Women’s rights 46% 54% Govt out of bedroom US Fund. State US Gay Anti- US LDS US Federal LDS ACLU Feminists Feminists of Federal Rights Citizens Church Citizens Govt Church Utah Govt Married Age: 48 Male Female Age: 13 Nov 2002 www.sti2.org 5/12/2007 - Vienna Data Integration Observations "  Each view is unique •  Entity map •  Data integration computation "  Views often conflict or mutually inconsistent "  Data integration for semantically heterogeneous relations •  Schema-less •  By evidence gathering •  At instance level "  Views are •  Dominant means of accessing data •  Lenses into Information Ecosystems 25
  26. 26. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Codd’s Rules Extended "  0: System must be relational, a "  7: High-level insert, update, and database, a management system delete "  1: Information rule (tabular) "  8: Physical data independence "  2: Guaranteed access rule (keys) "  9: Logical data independence "  3: Systematic treatment of nulls "  10: Integrity independence "  4: Active online catalog based on "  11: Distribution independence the relational model extra "  12: Nonsubversion rule "  5: Comprehensive data sublanguage rule "  13: Relational power rule: The full power of relational technology "  6: View updating rule applies to semantically homogeneous relational databases. For Semantically Heterogeneous Databases "  Cannot be guaranteed •  Relational model •  Codd’s rules •  Relational properties o  Updateable views •  Relational assumptions o  Global schema o  Single version of truth "  No basis for correctness or efficiency "  Unfortunately, like the real world 26
  27. 27. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Shadow Approach "  Database Approach: power and limits •  Data structures •  Schemas •  Data models •  Logic "  Shadow Approach •  Meta-tagging o  Equivalence (E-tags) for entity mapping after resolution o  What (W-tags) for “meaning” o  Projection (P-tags) for ETL over logical data structures (Traditional) Data Integration Billing Ordering Repair Sales Marketing Service view view view view view view Evidence 1 Evidence 3 Evidence 4 Evidence 5 Evidence N Evidence 2 Business In SQL, Java, C/C+ Ontology Rules +… Total Semantic Heterogeneity Schema Single Version of Truth ?? Enterprise portal Ontology 1 Ontology 2 Ontology 3 Ontology 4 Schema Metadata Schema Schema DB DB DB data Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth 27
  28. 28. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Shadow Theory-based Data Integration Billing Ordering Repair Sales Marketing Service view view view view view view Evidence 1 Evidence 3 Evidence 4 Evidence 5 Evidence N Evidence 2 Business Need a new language to implement rules by Ontologies Rules tags Explicitly represented w/ Multiple Version of Truth. Schema Enterprise portal Ontology 1 Ontology 2 Ontology 3 Ontology 4 Schema Metadata Schema Schema DB DB DB data Shadows Shadows Shadows Shadows Shadows Shadows Shadows Additional slides 28
  29. 29. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Data Integration Pragmatics "  Codd’s Rules violated "  Views conflict •  Un-natural fit •  For good reason! •  Primary keys o  Sales vs. marketing vs. repair o  Problems: technical, managerial •  Maintain conflicts or lose value o  Autonomous design o  Semantic heterogeneity "  Best practices rare •  Few views updatable •  Semantic homogeneity •  Radical simplification "  Schemas largely unavailable •  Standardization •  10% not RDBMSs •  30% legal "  Data Integration •  60% autonomy •  Table-level (Not ER-level) •  Instance-based "  Federation often prohibited •  Seldom schema-based •  Legal: privacy, CPNI, … •  Manual •  Evidence gathering "  Single Source of Truth •  Makes sense in specific cases •  Usually not! Current Data Integration Technologies "  Relational DBMS "  Limitations •  Schema-based "  Extract, Transform, & Load (ETL) •  Federation issues "  Federation aka Enterprise •  Limited (no!) support for information integration (EII) o  Schema-less integration o  Instance-level integration "  DI Platforms: 50+, e.g., o  Evidence gathering "  IBM InfoSphere Information Server o  Conflicting views "  Informatica Platform "  Microsoft SQL Server Integration o  Semantic heterogeneity Services •  50+ ad hoc tools "  Oracle Data Integrator, … •  Expensive to migrate "  SAP Data Integrator/Federator Information Ecosystem onto data integration platform "  Instance-level integration •  XML-based integration •  Custom tools: Verizon •  Dun & Bradstreet ECID maps 29
  30. 30. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Emerging Data Integration Technologies "  DBMS Architecture "  Data Discovery •  In-Database Analytics "  Data Quality Services s ) s •  Database Appliances tic Instance-level integration tic "  an Distributed DBMS Architecture an "  •  XML-based integration Change Data Capture (CDC) m •  em Data Access •  Custom tools se •  •  Dun & Bradstreet ECID maps s •  Data Replication o l; •  Distributed Data Caching Schema-less Tools (n "  ua "  Service Orientation •  Evidence gathering: Web, … ce an •  Data Services Platforms •  Entity / Identity Resolution an m •  Information-As-A-Service •  Social Networking rm – •  Enterprise Service Buses (ESBs) o  Collaboration g o  Web 2.0 rfo "  Master Data Management (MDM) in o  Social Search lv Pe •  Data & Process Mining So •  Linguistics / Ontologies / Taxonomies – o  Derivation & Analysis m n •  Semantic Technologies tio le •  Content Analytics ob ta "  Shadow Theory pu Pr •  Plato’s Cave m •  Monadic 2nd Order Logic Co Heterogeneous meanings of a telephone number North America Numbering Plan (NANP) telephone number (NPA) NXX-XXXX NPA - Number Plan Area NXX - Central office code XXXX – telephone line # Telephone attribute in customer Domestic long distance services, charges database. determined by NPA. Telephone entity = a physical phone. ? International long distance services, charges determined by country code. Owner identified uniquely with additional attribute NPA+NXX+XXX+CustomerID Ported to cell or internet phone Routed to / from another phone number (e.g. 1 800 toll free) 30
  31. 31. Data Integration @ Scale 7/8/11 Michael L. Brodie2011 STI Semantic Summit Michaelbrodie.com Limitations of Current Solutions "  Data Model Principles Don’t Apply "  Tools Don’t Meet Requirements •  Unique IDs (keys) are limiting •  No single model of data •  Modelling Choices: entities, •  Inconsistent / conflicting views properties, or relationships "  Meta-data and Multi-database Scale "  Schemas Are Seldom Available •  Information ecosystem (vs. database) •  10% not RDBMSs management •  Millions of views •  30% legal •  10,000s of cross database dependencies •  60% owners maintain autonomy o  Biz: impact operations "  Migrating Information Ecosystems o  Tech: impact schedule & cost to Vendor Solutions Is Prohibitive "  Schema-based Integration Challenges "  Pragmatics •  Inconsistent data models •  Bootstrapping in a legacy environment •  View Conflicts •  Cascaded Views "  Conceptual Modelling Support •  Different models required for different "  Federation Seldom Allowed (CPNI) workers •  No round-trip engineering "  No Single Source of Truth 31

×