
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes


Published at: http://2016.semantics.cc/robert-isele



  1. WWW.LEDS-PROJEKT.DE | ECCENCA CORPORATE MEMORY: SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES (September 29, 2016)
  2. MOTIVATION
  Enterprise Data Management Objective: "Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports." (Source: 2015 Data Management Industry Benchmark, EDM Council)
  In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.
  3. ARCHITECTURE
  [Architecture diagram: Inbound Data Sources feed Corporate Memory, which serves Outbound and Consumption. Components: Inbound Raw Data Store; Knowledge Graph for Metadata, KPI Definitions and Data Models; Frontend to Access Relationship and KPI Definition / Documentation; Frontend to Access (ad hoc) Reports; Outbound Data Delivery to Target Systems; Big Data DWH Infrastructure. Consuming functions: Management Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing, Accounting.]
  4. ARCHITECTURE: Data Ingestion
  • Files in the data lake (CSV, XML, Excel)
  • (Relational) databases
  5. ARCHITECTURE: Data Lake
  • Emerging approach to handling large amounts of data
  • Cost-effective storage
  • Data is held in its native formats
  Good: does not force an up-front integration of the ingested data sets.
  Bad: retaining an overview of disparate data silos in the lake without a coherent shared view is challenging.
  6. ARCHITECTURE: Data Warehouses
  • Existing infrastructure
  • Typically relational databases
  7. ARCHITECTURE: Metadata Layer
  • Dataset Metadata
  • Ontologies
  • Integration Rules
  8. ARCHITECTURE
  • Graphical User Interface
  • Customer Applications
  9. INTEGRATION PROCESS
  1) Dataset Management: Catalog Datasets, Catalog Ontologies, Manage Metadata
  2) Dataset Discovery: Data Profiling, Dataset Exploration
  3) Dataset Integration: Dataset Lifting, Dataset Linking, Data Quality Validation
  4) Data Access: Domain-Specific Consolidated Views, Execution on Hadoop
  10. DATASET MANAGEMENT
  [Integration process overview repeated; current stage: Dataset Management (Catalog Datasets, Catalog Ontologies, Manage Metadata).]
  11. DATASET CATALOG
  • Enables the user to explore and manage datasets in the data lake
  • Files in the data lake (CSV, XML, Excel)
  • Databases (Apache Hive or external databases)
  12. MANAGING METADATA
  Exploring and editing dataset metadata:
  • Semantic content information, like textual descriptions, tags and related persons
  • Technical information and parameters, like formats, data model and encoding
  • Access information, like access path or URL, source system or API call
  • Organizational provenance, like organizational units owning or maintaining the dataset
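As a hedged illustration of how such metadata can be materialized, the sketch below records a few of these facets in RDF using the standard DCAT and Dublin Core vocabularies. The namespace, dataset name, and property choices are assumptions for illustration, not Corporate Memory's actual metadata schema.

```python
# Minimal sketch: dataset metadata as RDF with DCAT / Dublin Core.
# All concrete names here are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

EX = Namespace("http://example.org/datasets/")  # hypothetical namespace

g = Graph()
ds = EX["corporate-bonds"]
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Corporate Bonds")))           # semantic content
g.add((ds, DCAT.keyword, Literal("fixed income")))               # tags
g.add((ds, DCTERMS.format, Literal("text/csv")))                 # technical information
g.add((ds, DCAT.downloadURL, URIRef("hdfs:///lake/bonds.csv")))  # access information
g.add((ds, DCTERMS.publisher, Literal("Treasury department")))   # organizational provenance

print(g.serialize(format="turtle"))
```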
  13. DATASET DISCOVERY
  [Integration process overview repeated; current stage: Dataset Discovery (Data Profiling, Dataset Exploration).]
  14. DATASET DISCOVERY
  • Goal: Augment a dataset with data from related datasets
  • Automatic discovery of datasets with overlapping information
  • Explorative interface
  • Discovery is based on two kinds of data: business metadata and the profiling summary
  15. DISCOVERY VIEW
  Datasets are matched based on their metadata (profiling + business data).
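How such a match could be scored is sketched below: a Jaccard overlap between the most frequent values of two profiled columns. This is an illustrative assumption about one possible signal; the actual matching combines profiling and business metadata in ways the slides do not detail.

```python
# Hypothetical sketch: score the overlap between two column profiles using
# Jaccard similarity of their most frequent values. A real discovery
# component would combine many such signals (types, formats, business
# metadata); this only illustrates the principle.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

profile_a = {"type": "country", "top_values": {"Germany", "France", "Canada"}}
profile_b = {"type": "country", "top_values": {"France", "Canada", "Italy"}}

score = 0.0
if profile_a["type"] == profile_b["type"]:   # same detected data type
    score = jaccard(profile_a["top_values"], profile_b["top_values"])
print(score)  # 0.5 -> candidates for a discovered relationship
```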
  16. DATASET PROFILING
  • Datasets often contain implicit and explicit schema information: column names, data formats, enumerated values, etc. Example: a column contains formatted dates.
  • Idea: Extract a dataset summary. For each column / property the summary contains: 1. the data type (e.g., number, date, industry classification), 2. the data format (e.g., the date format), 3. data statistics (e.g., range, distribution, most frequent values).
  • Materialized as RDF with a UI view.
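A minimal profiler along these lines could look like the following sketch. The input file and the type heuristics are assumptions for illustration; they are not Corporate Memory's implementation.

```python
# Illustrative per-column profiler: detects a coarse data type and format,
# and collects simple statistics. Heuristics and file are assumptions.
import csv
from collections import Counter
from datetime import datetime

def profile_column(values):
    summary = {"type": "string", "format": None}
    try:
        nums = [float(v) for v in values]
        summary.update(type="number", stats={"min": min(nums), "max": max(nums)})
        return summary
    except ValueError:
        pass
    try:
        [datetime.strptime(v, "%d-%m-%Y") for v in values]
        summary.update(type="date", format="DD-MM-YYYY")
    except ValueError:
        pass
    freq = Counter(values)
    summary["stats"] = {"most_frequent": freq.most_common(3),
                        "selectivity": len(freq) / len(values)}
    return summary

with open("bonds.csv", newline="") as f:   # hypothetical file from the lake
    rows = list(csv.DictReader(f))
columns = {name: [r[name] for r in rows] for name in rows[0]}
for name, values in columns.items():
    print(name, profile_column(values))
```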
  17. DETECTING DATA TYPES
  • Detects common datatypes as well as user-defined types
  • Common datatypes: numbers; dates / times; geographic locations (geo-coordinates, states, countries)
  • User-defined data types can be integrated by adding an ontology / taxonomy: usually a SKOS taxonomy, managed as another dataset in the dataset management
  • Example: an industry taxonomy, either standard (NACE, SIC, NAICS) or company-specific
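Checking a column against such a SKOS taxonomy might look like this rdflib sketch; the taxonomy file and the 80% match threshold are assumptions.

```python
# Illustrative sketch: detect a user-defined data type by checking what
# fraction of a column's values match prefLabels in a SKOS taxonomy.
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("industry_taxonomy.ttl", format="turtle")  # e.g. a NACE/SIC taxonomy
labels = {str(o).lower() for o in g.objects(None, SKOS.prefLabel)}

def matches_taxonomy(values, threshold=0.8):  # threshold is an assumption
    hits = sum(1 for v in values if v.strip().lower() in labels)
    return hits / len(values) >= threshold

column = ["Banking", "Utilities", "Electrical Equipment"]
print(matches_taxonomy(column))  # True if the column is an industry column
```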
  18. FORMATS AND STATISTICS
  • For some types, the data format is detected. Example: dates are formatted as DD-MM-YYYY.
  • Two functions are generated: 1. a parser that reads the detected representation, and 2. a normalizer that converts the parsed values into a configurable, organization-wide target representation.
  • Statistics summarize the values: value range and distribution, most frequent values, data selectivity.
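For the date example, the generated pair of functions could look like this sketch (the organization-wide target representation is assumed here to be ISO 8601):

```python
# Sketch of the two generated functions for a column detected as DD-MM-YYYY
# dates: a parser for the detected representation and a normalizer into a
# configurable target representation (assumed ISO 8601).
from datetime import datetime

DETECTED_FORMAT = "%d-%m-%Y"   # from profiling: DD-MM-YYYY
TARGET_FORMAT = "%Y-%m-%d"     # assumed organization-wide target

def parse(value: str) -> datetime:
    """Parser generated from the detected format."""
    return datetime.strptime(value, DETECTED_FORMAT)

def normalize(value: str) -> str:
    """Normalizer converting parsed values to the target representation."""
    return parse(value).strftime(TARGET_FORMAT)

print(normalize("29-09-2016"))  # -> 2016-09-29
```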
  19. DISCOVERY VIEW
  Datasets are matched based on their metadata (profiling + business data).
  20. INTEGRATION PROCESS
  [Integration process overview repeated; current stage: Dataset Integration (Dataset Lifting, Dataset Linking, Data Quality Validation).]
  21. DATA INTEGRATION
  • The integration process is driven by a set of rules: Lifting Rules map the source datasets to an ontology; Linking Rules connect different datasets into a knowledge graph.
  • Rules are operator trees consisting of four types of operators: Data Access Operators, Transformation Operators, Similarity Operators, Aggregation Operators.
  • Rules can be learned using genetic programming algorithms.
  • Rules are human-understandable and can be edited.
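The sketch below illustrates the operator-tree idea for a linking rule: a data access operator reads a value, a transformation operator normalizes it, and a similarity operator compares two entities; an aggregation operator combining several similarity scores is omitted for brevity. The class names are assumptions, not eccenca's actual operator model.

```python
# Illustrative operator tree for a linking rule. An aggregation operator
# (e.g. a weighted average over several similarity scores) is omitted.
from dataclasses import dataclass

@dataclass
class Path:                      # data access operator
    path: str
    def evaluate(self, entity):
        return entity[self.path]

@dataclass
class Lowercase:                 # transformation operator
    child: Path
    def evaluate(self, entity):
        return self.child.evaluate(entity).lower()

@dataclass
class Equality:                  # similarity operator (exact match)
    left: Lowercase
    right: Lowercase
    def evaluate(self, source, target):
        return 1.0 if self.left.evaluate(source) == self.right.evaluate(target) else 0.0

rule = Equality(Lowercase(Path("label")), Lowercase(Path("bondName")))
print(rule.evaluate({"label": "NEDWBK CAD 5,2%25"},
                    {"bondName": "nedwbk cad 5,2%25"}))  # -> 1.0
```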
  22. DATASET LIFTING
  • Objective: Map the datasets in the data lake to a consistent vocabulary.
  • A lifting rule consists of a number of mappings. Each mapping assigns a term in the original dataset (such as a column, for tabular data) to a term in the target ontology (such as a property provided by the ontology).
  • Multiple mappings can be managed for each dataset to allow different views on the same data.
  • Initial mappings are generated automatically from the profiling results, which the user can then build on.
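Applied to one row of the bond table on the next slide, a lifting rule could behave like this sketch; the mapping dictionary, the namespaces, and the subject URI minting are illustrative assumptions.

```python
# Illustrative lifting of one tabular row to RDF: each mapping assigns a
# source column to a target ontology property. URIs and the mapping itself
# are assumptions modeled on the FIBO-style example on the next slide.
from rdflib import Graph, Literal, Namespace

FIBO = Namespace("https://spec.edmcouncil.org/fibo/ontology/")  # assumed base
EX = Namespace("http://example.org/bonds/")

lifting_rule = {  # source column -> target ontology property
    "ISIN": FIBO.hasSecurityIdentifier,
    "Country": FIBO.legallyRecordedIn,
    "Industry": FIBO.industrySector,
}

row = {"Bond": "NEDWBK CAD 5,2%25", "ISIN": "CA639832AA25",
       "Country": "Canada", "Industry": "Banking"}

g = Graph()
subject = EX[row["ISIN"]]  # mint a subject URI from the identifier column
for column, prop in lifting_rule.items():
    g.add((subject, prop, Literal(row[column])))
print(g.serialize(format="turtle"))
```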
  23. LIFTING EXAMPLE
  Bond                                        | ISIN         | Country | Industry
  NEDWBK CAD 5,2%25                           | CA639832AA25 | Canada  | Banking
  SIEMENSF1.50%03/20                          | DE000A1G85B4 | Germany | Electrical Equipment
  Electricite de France (EDF), 6,5% 26jan2019 | USF2893TAB29 | France  | Utilities
  [Diagram: the bond NEDWBK CAD 5,2%25 is lifted to a graph node with fibo:hasSecurityIdentifier "CA639832AA25", fibo:legallyRecordedIn pointing into a country ontology (countries such as France and Germany grouped under EMEA), and fibo:industrySector pointing into an industry ontology (Banking, Utilities).]
  24. LINKING
  • Goal: Connect individual datasets into a knowledge graph
  • Identify related entities in different datasets and link them
  • Either entities describing the same real-world object, or another relation
  [Diagram: the bond NEDWBK CAD 5,2%25, with fibo:industrySector and fibo:legallyRecordedIn links into the shared industry and country ontologies, is connected via hasRating to a Rating entity ("CAD 5,2%25") from another dataset carrying a ratingScore of "AAA".]
  25. LINKAGE RULES
  • Linking is based on domain-specific rules
  • Rules specify the conditions that must hold true for two entities to be linked
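What such conditions can look like is sketched below as plain Python; the fields, the similarity measure, and the 0.9 threshold are assumptions rather than an actual Corporate Memory rule.

```python
# Hypothetical domain-specific linkage rule: two bond records are linked
# when their ISINs match exactly, or their labels are very similar and
# they share the same country. All fields and thresholds are assumptions.
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def should_link(source: dict, target: dict) -> bool:
    if source["isin"] == target["isin"]:
        return True
    return (label_similarity(source["label"], target["label"]) > 0.9
            and source["country"] == target["country"])

a = {"isin": "CA639832AA25", "label": "NEDWBK CAD 5,2%25", "country": "Canada"}
b = {"isin": "CA639832AA25", "label": "NEDWBK CAD 5.2% 25", "country": "Canada"}
print(should_link(a, b))  # -> True
```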
  26. LEARNING LINKAGE RULES
  Problem: Manually writing rules is time-consuming and requires expertise.
  Approach: An interactive machine learning algorithm generates the rules.
  • Generates a rule based on a number of user-confirmed link candidates.
  • Link candidates are actively selected by the learning algorithm to favor those that yield a high information gain.
  • The user needs no knowledge of the characteristics of the dataset or of any particular similarity computation techniques.
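The overall loop can be sketched as follows. The real algorithm evolves operator trees with genetic programming and picks candidates by information gain; both are stubbed out here, so this only shows the interaction pattern.

```python
# Highly simplified active-learning loop for learning a linkage rule.
# Candidate selection and rule fitting are illustrative stubs.
import random

def most_informative(candidates):
    # Stub: the real algorithm picks the pair whose label would yield the
    # highest information gain; here we just pick one at random.
    return random.choice(candidates)

def fit_rule(confirmed, rejected):
    # Stub: the real algorithm evolves an operator tree from the examples.
    return lambda pair: pair in confirmed

def learn(candidates, ask_user, iterations=5):
    confirmed, rejected = [], []
    for _ in range(min(iterations, len(candidates))):
        pair = most_informative(candidates)
        candidates.remove(pair)
        (confirmed if ask_user(pair) else rejected).append(pair)
    return fit_rule(confirmed, rejected)

pairs = [("CA639832AA25", "CA639832AA25"), ("CA639832AA25", "DE000A1G85B4")]
rule = learn(pairs, ask_user=lambda p: p[0] == p[1])  # oracle stands in for the user
print(rule(("CA639832AA25", "CA639832AA25")))  # -> True
```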
  27. INTEGRATION PROCESS
  [Integration process overview repeated; current stage: Data Access (Domain-Specific Consolidated Views, Execution on Hadoop).]
  28. VIEW GENERATION
  The user selects a set of lifted and linked datasets.
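Conceptually, the consolidated view over such a selection corresponds to a query against the knowledge graph. The sketch below phrases one with rdflib and SPARQL, reusing the FIBO-style properties from the example slides; the input file and the hasRating property are assumptions, not a generated Corporate Memory artifact.

```python
# Sketch: a domain-specific consolidated view as a SPARQL query over the
# lifted and linked graph. File and ex:hasRating are assumptions.
from rdflib import Graph

g = Graph()
g.parse("knowledge_graph.ttl", format="turtle")  # hypothetical lifted+linked data

VIEW = """
PREFIX fibo: <https://spec.edmcouncil.org/fibo/ontology/>
PREFIX ex:   <http://example.org/bonds/>
SELECT ?bond ?isin ?country ?rating WHERE {
  ?bond fibo:hasSecurityIdentifier ?isin ;
        fibo:legallyRecordedIn ?country .
  OPTIONAL { ?bond ex:hasRating ?rating }
}
"""
for row in g.query(VIEW):
    print(row.bond, row.isin, row.country, row.rating)
```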
  29. DATA ACCESS
  • Data flows are generated based on Apache Spark
  • The data flows utilize Resilient Distributed Datasets (RDDs)
  • RDDs derive new datasets from existing ones by applying a chain of transformations
  • A derived dataset can either be recomputed on the fly or persisted on stable storage
  • Data flows can be executed efficiently on Hadoop clusters
  [Diagram: from a Hadoop data lake, the sources Corporate Bonds, Internal Ratings, and External Ratings each pass through a Data Lifting step (Apache Spark RDD); a Data Linking step (Apache Spark RDD) combines them inside eccenca Corporate Memory, from which data consumers read via SQL, CSV, Excel, or the Spark API.]
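A data flow of this shape can be sketched with the Spark RDD API as below: two lifting steps, a linking join, and one persisted derived dataset. The paths, record layouts, and join key are placeholders, not code that Corporate Memory generates.

```python
# Illustrative Spark data flow: two lifting steps derive keyed records from
# raw CSV sources, a linking step joins them on ISIN, and one derived RDD is
# persisted on disk. All paths and layouts are placeholders.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lifting-and-linking-sketch")

# Data lifting 1: (isin, bond attributes) from the corporate bonds source.
bonds = (sc.textFile("hdfs:///lake/bonds.csv")
           .map(lambda line: line.split(","))
           .map(lambda f: (f[1], {"bond": f[0], "country": f[2]})))

# Data lifting 2: (isin, rating) from a ratings source.
ratings = (sc.textFile("hdfs:///lake/ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda f: (f[0], {"rating": f[1]})))

bonds.persist(StorageLevel.DISK_ONLY)  # persist a derived dataset on stable storage
linked = bonds.join(ratings)           # data linking: connect the two datasets

for isin, (bond, rating) in linked.take(5):
    print(isin, bond, rating)
```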
  30. DEMO
  31. CONTACT
  Dr. Robert Isele | Tel: +49 151 17238616 | email: robert.isele@eccenca.com
  eccenca: Command your Data!
