In various scientific and industrial domains, new findings and decisions are based on the analysis and interpretation of very large data sets. Often, the benefit of data analytics is increased by combining data from a variety of different sources. This requires high-quality data integration, including the expensive and tedious matching of data and metadata objects from two or more sources. Due to rapid developments in most domains, many (already linked) data sources are continuously updated, i.e. they undergo a steady evolution. Moreover, data is often gathered on a regular basis, e.g. to analyze developments over time.
To improve data quality and avoid redundant effort, data integration workflows should make use of already existing, and especially of validated, links between two or more sources, and prefer link reuse over re-determination. Outdated links should thus be migrated to the currently valid versions of the data sources. In addition, temporal linkage of data objects allows for the analysis of changes in a domain of interest.
This talk sketches my research regarding link reuse in data integration tasks. In particular, I will present approaches in the context of evolution and temporal linkage for ontologies as well as data records using examples from the medical and social domains.
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE... | Journal For Research
Analysis of social networking sites such as Facebook, Flickr and Twitter has been a trending topic for data analysts, researchers and enthusiasts in recent years, aiming to maximize the value of the knowledge acquired from processing and analyzing the data. Apache Spark is an open-source data-parallel computation engine that offers faster solutions than traditional MapReduce engines such as Apache Hadoop. This paper discusses the performance evaluation of Apache Spark for analyzing social network data. Performance varies significantly with the algorithm being implemented, which is what makes this evaluation worthwhile given the versatility and diverse nature of the dynamic field of Social Network Analysis. We evaluate the performance of Apache Spark on various algorithms (PageRank, Connected Components, Triangle Counting, K-Means and Cosine Similarity), making efficient use of the Spark cluster.
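As an illustration of the kind of workload evaluated (not code from the paper): a minimal PySpark sketch of one of the listed algorithms, PageRank, over a toy edge list. The graph, iteration count and damping factor are all invented for the example.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PageRankSketch")

# Toy follower graph as (source, target) edges; a real run would load edges from storage.
edges = sc.parallelize([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])

links = edges.groupByKey().cache()      # adjacency list per vertex
ranks = links.mapValues(lambda _: 1.0)  # uniform initial rank

for _ in range(10):  # fixed number of power iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)  # damping factor 0.85

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
sc.stop()
```

The same skeleton (build an RDD, iterate, aggregate) underlies the other evaluated algorithms; only the per-iteration logic changes.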
This introduction shows how OpenRefine can help any data project, from analytics to migration and reconciliation. OpenRefine's powerful interface helps domain experts explore, transform and enrich their data.
Knowledge graphs: they're what all businesses are now on the lookout for. But what exactly is a knowledge graph and, more importantly, how do you get one? Do you get it as an out-of-the-box solution, or do you have to build it (or have someone else build it for you)? With the help of our knowledge graph technology experts, we have created a step-by-step list of how to build a knowledge graph. Built this way, a knowledge graph properly exposes and enforces the semantics of the semantic data model via inference, consistency checking and validation, and thus offers organizations many more opportunities to transform and interlink data into coherent knowledge.
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes recently emerged that instead provide a central storage point for holding all data sets in their original form.
In this talk, we present eccenca CorporateMemory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. eccenca CorporateMemory builds an extensible knowledge graph that employs RDF vocabularies for transforming and linking multiple datasets in order to generate an integrated semantic understanding of the data.
Robert Isele | Head of Data Integration Unit at eccenca GmbH
Presentation at Semantics 2016 in Leipzig in the context of the results of the LEDS project
As technology and needs evolve and the demand for scalable, highly available solutions increases, there is a need to evaluate new databases. The lack of clarity in the market makes it difficult for IT stakeholders to understand the differences between the available solutions and which choice to make. The key areas to consider when evaluating NoSQL databases are the data model, query model, consistency model, APIs, and support and community strength.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed: there is a lot of manual "data wrangling" to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, the rate of change in the data definitions is increasing as well. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. It covers:
Creating Semantic Metadata Models of Big Data Resources (see the sketch after this list)
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
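To make the first item in this list concrete: a minimal sketch of a semantic metadata model for one data resource, assuming the rdflib library and the W3C DCAT vocabulary. The example.org namespace and all dataset details are invented.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DCTERMS, XSD

# DCAT is a W3C vocabulary for describing data catalogs; the EX namespace
# and the resource described below are purely illustrative.
DCAT = Namespace("http://www.w3.org/ns/dcat#")
EX = Namespace("http://example.org/bigdata/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

clickstream = EX["clickstream-events"]
g.add((clickstream, RDF.type, DCAT.Dataset))
g.add((clickstream, DCTERMS.title, Literal("Raw clickstream events")))
g.add((clickstream, DCTERMS.modified, Literal("2016-01-01", datatype=XSD.date)))
g.add((clickstream, DCAT.keyword, Literal("hadoop")))

print(g.serialize(format="turtle"))
```

Descriptions like this can be versioned alongside application code, which is what makes the synchronization and change-management items above tractable.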
A podium abstract presented at AMIA 2016 Joint Summits on Translational Science. This discusses Data Café — A Platform For Creating Biomedical Data Lakes.
Modeling employee relationships with Apache Spark | Wassim TRIFI
How to build a GraphX graph in order to analyze relationships based on email exchanges between the employees of an organization, and then apply different algorithms to gain more insight into the business model and the influential people among the employees.
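The talk builds the graph with Spark GraphX (which is Scala-only); purely as an illustration of the same idea, here is a sketch in Python using networkx, with an invented email log:

```python
import networkx as nx

# Invented email log: (sender, recipient) pairs extracted from message headers.
emails = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "alice"),
    ("carol", "alice"), ("dave", "alice"), ("dave", "bob"),
]

g = nx.DiGraph()
for sender, recipient in emails:
    # Weight each edge by the number of messages sent in that direction.
    w = g.get_edge_data(sender, recipient, {"weight": 0})["weight"]
    g.add_edge(sender, recipient, weight=w + 1)

# PageRank as a rough proxy for influence within the organization.
influence = nx.pagerank(g, weight="weight")
for person, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.3f}")
```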
Democratizing Data within your organization - Data Discovery | Mark Grover
In this talk, we discuss the challenges of scale in an organization like Lyft. We delve into data discovery as a challenge on the way to democratizing data within your organization, and go into detail about our solution to the data discovery challenge.
Coherent and consistent tracking of provenance data and in particular update history information is a crucial building block for any serious information system architecture.
Marvin Frommhold | AKSW, Universität Leipzig
Presentation at Semantics 2016 in Leipzig in the context of the results of the LEDS project
Slides from webinar: Provenance and social science data. Presented on 15 March 2017. Presenter was Prof George Alter, Research Professor, ICPSR, and visiting Professor, ANU
FULL webinar recording: https://youtu.be/elPcKqWoOPg
Prof George Alter (Research Professor, ICPSR & Visiting Prof, ANU)
The C2Metadata Project is producing new tools that will work with common statistical packages (e.g. R and SPSS) to automate the capture of metadata describing variable transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards: DDI and the Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. This is of special interest to the social sciences, with their strong metadata standards and heavy reliance on statistical analysis software.
Royal society of chemistry activities to develop a data repository for chemis... | Ken Karapetyan
The Royal Society of Chemistry publishes many thousands of articles per year, the majority of them containing rich chemistry data that, in general, is limited in its value when isolated in the HTML or PDF form of the articles commonly consumed by readers. The RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data and analytical spectra. The RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as the result of text-mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of eLab Notebooks (ELNs) for facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data with the intention of facilitating improved discovery.
by Yannis Stavrakas (“Athena” Research Center), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
What is a distributed data science pipeline? How, with Apache Spark and friends. | Andy Petrella
What was a data product before the world changed and got so complex?
Why distributed computing / data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, like the Spark Notebook, Spark, Scala, Mesos and so on, in an accompanying framework.
OpenAIRE Content Providers Community Call, July 1st, 2020
This call focused on data repositories, namely the OpenAIRE Research Graph and data repositories, the OpenAIRE Content Acquisition Policy, and the Guidelines for Data Archive Managers.
It was also an opportunity to share the most recent updates and novelties in the OpenAIRE Content Provider Dashboard, and to get feedback from the community.
Follow the Community activities at https://www.openaire.eu/provide-community-calls
A (vintage) presentation about a database system for the study of gene expression data, including distributed metadata annotation and some interactive analytics. Some of the ideas are still relevant today.
Tutorial given at the European Conference on Machine Learning (ECMLPKDD 2015). It covers OpenML, how to use it in your research, its interfaces in Java, R and Python, and its use through machine learning tools such as WEKA and MOA. It also covers topics in open science and reproducible research.
Towards a rebirth of data science (by Data Fellas) | Andy Petrella
Nowadays, Data Science is buzzing all over the place.
But what is a, so-called, Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we are entering an age where velocity is key. Not only is the velocity of your data high, the time to market is shortened as well. Hence, the time separating the moment you receive a set of data from the moment you are able to deliver added value is crucial.
In this talk, we'll review the legacy Data Science methodologies and what they meant in terms of delivered work and results.
Afterwards, we'll move towards the different concepts, techniques and tools that Data Scientists will have to learn and appropriate in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view on a solution to these challenges, in particular the Spark Notebook and the Shar3 product we develop.
Engaging Information Professionals in the Process of Authoritative Interlinki... | Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
DataGraft: Data-as-a-Service for Open Data | dapaasproject
Tutorial at "The Data Matters Series – Transforming Service Industry via Big Data Analytics", May 4, 2016, Cyberjaya, Malaysia
http://www.eventbrite.com/e/the-data-matters-series-transforming-service-industry-via-big-data-analytics-tickets-24617911837
Antelope: A Web service for publishing Life Cycle Assessment models and resul... | Brandon Kuczenski
We describe a data format, interface, and prototype implementation of a web service that can be used to publish Life Cycle Assessment (LCA) studies and their results. The service uses the fragment data model to describe product systems in a structured way. The API provides a provenance framework for LCA results and enables documentation and external validation of LCA studies.
Watch the presentation:
https://www.youtube.com/watch?v=2P4vYvdc1uI&t=208m15s
Graph enhancements to Artificial Intelligence and Machine Learning are changing the landscape of intelligent applications. Beyond improving accuracy and modeling speed, graph technologies make building AI solutions more accessible. Join us to hear about 4 areas at the forefront of graph-enhanced AI and ML, and find out which techniques are commonly used today and which hold the potential for disrupting industries. We'll provide examples and specifically look at how:
- Graphs provide better accuracy through connected feature extraction
- Graphs provide better performance through contextual model optimization
- Graphs provide context through knowledge graphs
- Graphs add explainability to neural networks
Speakers: Jake Graham, Alicia Frame
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... | Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. With "Spark Cluster with Elasticsearch Inside" it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on a Spark cluster with Elasticsearch inside. The motivation is that once Elasticsearch is running on Spark, it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. This in turn enables indexing of datasets that are processed as part of data pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their data lake and make it searchable.
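Not from the talk: a rough sketch of how records can be indexed from Spark through the ES-Hadoop connector's Spark SQL data source, assuming the connector jar is on the classpath. The index name, node address and catalog records are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DatasetIndexing").getOrCreate()

# Invented dataset-catalog records to be made searchable.
datasets = spark.createDataFrame(
    [("clickstream", "raw web events", "s3://lake/clickstream"),
     ("orders", "curated order facts", "s3://lake/orders")],
    ["name", "description", "location"])

# Write through the ES-Hadoop Spark SQL data source; "datasets/doc" is an
# illustrative index/type, and es.nodes would point at the embedded instance.
(datasets.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .mode("append")
    .save("datasets/doc"))
```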
An Overview of the iMicrobe Project and available tools in the iPlant Cyberinfrastructure. This talk was given at a workshop at ASLO in Granada, Spain focused on applications in Oceanography and Limnology.
Linked Data Experiences at Springer Nature | Michele Pasin
An overview of how we're using semantic technologies at Springer Nature, and an introduction to our latest product: www.scigraph.com
(Keynote given at http://2016.semantics.cc/, Leipzig, Sept 2016)
Similar to Link Reuse and Evolution for Data Integration (LSWT 2020)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx | Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first ever open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Link Reuse and Evolution for Data Integration (LSWT 2020)
1. Link Reuse and Evolution for Data Integration
Anika Groß | 8. Leipziger Semantic Web Tag, 17.06.2020
"It's all about the data"
6.–8. Matching / Linking
Aim: (semi-)automatically interconnect different data sources via explicit links
• Schema level: schema and ontology matching, schema merging
• Instance level: entity resolution / link discovery, object fusion
• Semantic annotation: linking instances with ontology concepts, entity linking
[Slide figure: a German and an English disease ontology (Krankheiten, Hämatologische Krankheit, Blutarmut, Leukopenie vs. Disease, Hematological Disease, Cytopenia, Anemia, Leukopenia, Thrombocytopenia) matched at the schema level, with clinical-trial eligibility texts ("severe anemia (hemoglobin < 8 g/dL), leukopenia (WBC < 2500/mm3), thrombocytopenia (platelet count < 80,000/mm3)"; "patients with significantly impaired bone marrow function or significant anemia, leukopenia, or thrombocytopenia") annotated with the ontology concepts]
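Not from the slides: a miniature of the instance-level idea, matching concept labels from the bilingual example with a plain string similarity. The threshold is invented; real matchers combine many more features and expert verification.

```python
from difflib import SequenceMatcher

# Toy concept labels from two sources (cf. the bilingual ontology example above).
source1 = ["Anemia", "Leukopenia", "Thrombocytopenia"]
source2 = ["Anaemia", "Leukopenia", "Cytopenia"]

def sim(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # invented; in practice tuned and verified by domain experts
links = [(x, y, round(sim(x, y), 2))
         for x in source1 for y in source2 if sim(x, y) >= THRESHOLD]
print(links)  # [('Anemia', 'Anaemia', 0.92), ('Leukopenia', 'Leukopenia', 1.0)]
```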
9.–10. Data is not static
[Slide figure: analysis pipeline from ≥ 2 input sources via integration & enrichment (linking, fusion, …) to analysis (e.g. graph-based) and result interpretation; the sources carry intra-source links and inter-source links, and, under evolution and dynamics, links between different versions and temporal links]
11. Agenda
✓ Introduction (data science workflow, matching / linking, evolution)
• Link Reuse
• Link Evolution and Temporal Linking
• Future Research Directions
12.–16. Link Reuse
Again and again: implementation of matching tools/algorithms, configuration of matching workflows, verification of links (many, many test runs yielding real, tiny or no improvement, and depending on the cooperativeness of domain experts)
Existing links between (meta)data sources: Linked Open Data Cloud; repositories/platforms such as BioPortal, local/own projects, sameas.org, …
• ✗ No solution → manual or (semi-)automatic matching
• ⊂ Partial or ✓ complete solution → link reuse instead of full (manual or automatic) re-determination
Aims: improved match result quality, less effort, link update (evolution)
17.–20. Link Reuse - Methods
• Composition: combine mappings via intermediate sources (indirect paths S1–I1–S2 or S1–I2–S2 instead of a direct S1–S2 mapping)
• Clustering: create groups of (connected) entities
• Supervised learning: train an ML model
• Evolution: connect and update over time
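Not from the slides: a minimal sketch of the composition idea, chaining a mapping from S1 to an intermediate source I with a mapping from I to S2 into an indirect S1-S2 mapping. Identifiers and confidence scores are invented.

```python
# Invented correspondences: (entity, entity, confidence score).
m_s1_i = [("s1:anemia", "i:D64.9", 0.9), ("s1:leukopenia", "i:D72.819", 0.8)]
m_i_s2 = [("i:D64.9", "s2:blutarmut", 0.9)]

def compose(m_ab, m_bc):
    """Chain an A->B mapping with a B->C mapping via the shared source B."""
    by_b = {}
    for b, c, score in m_bc:
        by_b.setdefault(b, []).append((c, score))
    # Multiplying scores is one simple way to propagate confidence.
    return [(a, c, round(s_ab * s_bc, 2))
            for a, b, s_ab in m_ab
            for c, s_bc in by_b.get(b, [])]

print(compose(m_s1_i, m_i_s2))  # [('s1:anemia', 's2:blutarmut', 0.81)]
```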
21. Link Reuse – in my research
• Composition: indirect ontology matching (schema level)
• Clustering: holistic entity clustering for linked data (instance level); semantic annotation of medical documents
• Supervised learning: combination of results from different semantic annotation tools
• Evolution / temporal linking: ontology mapping evolution and update (schema level); temporal group linkage for census data (instance level)
22.–28. Link Evolution and Temporal Linking
• Find links between different source versions or temporal datasets: from the change record diff(S1, S1′) derive a version mapping M(S1, S1′), and continue across further versions S1′′, …
• Update sets of outdated links between older versions: given an inter-source mapping M(S1, S2) and the diffs diff(S1, S1′) and diff(S2, S2′) with their version mappings M(S1, S1′) and M(S2, S2′), derive the updated mapping M(S1′, S2′)
• Reuse existing intra- or inter-source links
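Not from the slides: a toy version of the update step, migrating an outdated inter-source mapping along invented diffs that record renamed and deleted entities, so that still-valid links are reused instead of re-matched.

```python
# Invented diffs for two evolving sources: renames (old id -> new id)
# plus entities deleted in the new version.
diff_s1 = {"renamed": {"s1:42": "s1:99"}, "deleted": {"s1:7"}}
diff_s2 = {"renamed": {}, "deleted": {"s2:3"}}

# Outdated inter-source links between the old versions S1 and S2.
m_old = [("s1:42", "s2:11"), ("s1:7", "s2:11"), ("s1:13", "s2:3")]

def migrate(mapping, d1, d2):
    """Reuse still-valid links: follow renames, drop links to deleted ids."""
    migrated = []
    for a, b in mapping:
        if a in d1["deleted"] or b in d2["deleted"]:
            continue  # link cannot be migrated; a re-match may be needed
        migrated.append((d1["renamed"].get(a, a), d2["renamed"].get(b, b)))
    return migrated

print(migrate(m_old, diff_s1, diff_s2))  # [('s1:99', 's2:11')]
```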
29. Link Reuse - Methods (recap of the overview from slide 21, now turning to evolution and temporal linking: ontology mapping evolution and update at the schema level, temporal group linkage for census data at the instance level)
30.–34. Temporal Group Linkage for Census Data
• 6 censuses (1851–1901) in Rawtenstall, Lancashire, U.K.
• Household graphs (known family connections) but unknown temporal links
[Slide figure: 1871 and 1881 household graphs for the Ashworth and Smith families, with head/wife/son/daughter/father-in-law relations, whose composition changes between the two censuses]
Problems: attribute values change over time (surname, occupation); difficult disambiguation (same first name and surname); poor data quality (misspellings etc.); …
Temporal entity and group linkage: method → paper; ≈ 96% F-measure for record and group mappings (a 2–9% improvement over the compared approaches)
Christen, Groß, Fisher et al.: Temporal group linkage and evolution analysis for census data. Intl. Conf. on Extending Database Technology (EDBT), 2017.
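Not from the slides or the paper: a toy group-similarity measure in the spirit of the household example, scoring two households by the average best name match between their members. Names and the measure itself are invented simplifications.

```python
from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_sim(h1, h2):
    """Average best-match similarity of h1's members against h2."""
    return sum(max(name_sim(p, q) for q in h2) for p in h1) / len(h1)

ashworth_1871 = ["John Ashworth", "Elizabeth Ashworth", "William Ashworth"]
ashworth_1881 = ["John Ashworth", "Elizabeth Ashworth", "William Ashworth",
                 "Alice Ashworth"]
smith_1881 = ["John Smith", "Elizabeth Smith", "Steve Smith"]

print(round(group_sim(ashworth_1871, ashworth_1881), 2))  # 1.0 (same household)
print(round(group_sim(ashworth_1871, smith_1881), 2))     # clearly lower
```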
35.–36. Evolution Patterns and Evolution Graph
• Evolution patterns on the individual level (preserve, add, remove) and on the group level (split, merge, move, …)
• Evolution graph over longer time periods
Christen, Groß, Fisher et al.: Temporal group linkage and evolution analysis for census data. Intl. Conf. on Extending Database Technology (EDBT), 2017.
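Not from the slides: an invented miniature of group-level pattern detection, deriving split/merge/preserve labels from person-level matches between two censuses.

```python
def group_patterns(links):
    """Classify group evolution from (old_group, new_group) links of matched persons."""
    old_to_new, new_to_old = {}, {}
    for og, ng in links:
        old_to_new.setdefault(og, set()).add(ng)
        new_to_old.setdefault(ng, set()).add(og)
    patterns = {}
    for og, targets in old_to_new.items():
        if len(targets) > 1:
            patterns[og] = "split"    # members dispersed into several groups
        elif len(new_to_old[next(iter(targets))]) > 1:
            patterns[og] = "merge"    # several old groups feed one new group
        else:
            patterns[og] = "preserve"
    return patterns

# Invented matches: household H1 splits; H2 and H3 merge into G3.
links = [("H1", "G1"), ("H1", "G2"), ("H2", "G3"), ("H3", "G3")]
print(group_patterns(links))  # {'H1': 'split', 'H2': 'merge', 'H3': 'merge'}
```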
37. "The Reuse Application": Knowledge Graphs
• Continuous reuse and integration of instances, ontology concepts and links from various sources (& many more)
• Methods: matching / link discovery, NLP, entity linking, clustering, fusion/merging, … plus expert knowledge and verification
• Evolution and update: a knowledge graph can be highly dynamic, with direct changes in the graph, extension and update based on usage (user queries), and updates when source versions evolve
• Integrate additions, deletions, structural changes, …; the complex part: keep meanwhile-verified changes
38. Conclusion
• Data sources evolve over time … and so do the links between them
• Reuse existing verified links to create new links for new versions → improved link quality, less effort / more efficiency, up-to-date links
• Create new temporal links between objects and object groups
• Problems to overcome: poor trust, missing context, no knowledge of existing links, …; lineage, provenance, data profiling and accessibility help
39.–40. Future Research Directions
• Evolution of knowledge graphs: evolution of integrated sources; evolution-aware ontology merge and knowledge graph update; scalable iterative integration; temporal patterns on graph data; …
• Semantic interoperability: semantic annotation of heterogeneous, un-/semi-structured data; multilingual matching; semantic mappings ("beyond sameAs"); …
• End-to-end analytics workflows: close-to-seamless data integration for complex analytics workflows; management and reproducibility of scientific workflows; …
41. References
Reuse - Annotation
• Christen, Lin, Groß, Domingos Cardoso, Pruski, Da Silveira, Rahm: A Learning-Based Approach to Combine Medical Annotation Results (short paper). 13th Intl. Conference on Data Integration in the Life Sciences (DILS), 2018.
• Christen, Groß, Rahm: A Reuse-based Annotation Approach for Medical Documents. The Semantic Web - ISWC 2016: 15th Intl. Semantic Web Conference, 2016.
Reuse - Entity Links
• Nentwig, Groß, Möller, Rahm: Distributed Holistic Clustering on Linked Data. Proc. OTM 2017 Conferences - Confederated International Conferences: CoopIS, C&TC, and ODBASE, 2017.
• Nentwig, Groß, Rahm: Holistic Entity Clustering for Linked Data. IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016.
Temporal Linking / Entity Evolution
• Christen, Groß, Fisher, Wang, Christen, Rahm: Temporal group linkage and evolution analysis for census data. 19th Intl. Conference on Extending Database Technology (EDBT), 2017.
Mapping/Link Composition
• Groß, Hartung, Kirsten, Rahm: Mapping Composition for Matching Large Life Science Ontologies. 2nd Intl. Conference on Biomedical Ontology (ICBO), 2011.
• Hartung, Groß, Rahm: Composition Methods for Link Discovery. Proc. of 15. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2013.
Mapping/Link Evolution
• Groß, Pruski, Rahm: Evolution of Biomedical Ontologies and Mappings: Overview of Recent Approaches. Computational and Structural Biotechnology Journal 14, 2016.
• Groß, dos Reis, Hartung, Pruski, Rahm: Semi-automatic Adaptation of Mappings between Life Science Ontologies. 9th Intl. Conference on Data Integration in the Life Sciences (DILS), 2013.
• Groß, Hartung, Prüfer, Kelso, Rahm: Impact of ontology evolution on functional analyses. Bioinformatics 28(20), 2012.
• Groß, Hartung, Kirsten, Rahm: Estimating the Quality of Ontology-Based Annotations by Considering Evolutionary Changes. 6th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2009.
Ontology Evolution
• Christen, Groß, Hartung: REX - A Tool for Discovering Evolution Trends in Ontology Regions. 10th Intl. Conference on Data Integration in the Life Sciences (DILS), 2014.
• Hartung, Groß, Rahm: COnto-Diff: Generation of Complex Evolution Mappings for Life Science Ontologies. Journal of Biomedical Informatics 46(1), 2013.
• Hartung, Groß, Rahm: CODEX: exploration of semantic changes between ontology versions. Bioinformatics 28(6), 2012.