Presentation slides from the IC 2020 conference
https://webikeo.fr/webinar/ic-2-partie-1
Yoan Chabot, Thomas Labbé, Jixiong Liu, Raphaël Troncy
DAGOBAH: A Context-Independent Semantic Annotation System for Tabular Data
1. DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
Orange restricted
Yoan Chabot (Orange, @yoan_chabot), Thomas Labbé (Orange, @tau_labbe), Jixiong Liu (Orange, @yansera1), Raphaël Troncy (EURECOM, @rtroncy)
DAGOBAH - IC 2020
2. Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
[Figure: a user asks 'In which city was "Our Happy Lives" filmed?' and the engine answers "I don't know". The accompanying Wikidata graph shows the entities Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424) and France (Q142), with the properties P31 (instance of), P495 (country of origin), P17 (country) and P840 (narrative location).]
3. Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
[Figure: the same question, 'In which city was "Our Happy Lives" filmed?', is now answered "In Belfort!". The figure adds the DAGOBAH component and the table below to the Wikidata graph: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424) and France (Q142), with the properties P31 (instance of), P495 (country of origin) and P17 (country). The P840 (narrative location) label now appears twice.]
Movie | Location
Our Happy Lives | Belfort
The French Kissers | Rennes
4. Tabular Data to Knowledge Graph Matching
Movie | Location
Our Happy Lives | Belfort
The French Kissers | Rennes
▪ CTA: Column-Type Annotation
▪ CEA: Cell-Entity Annotation
▪ CPA: Columns-Property Annotation
[Figure: the example table aligned with the Wikidata graph: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424) and France (Q142), with the properties P31 (instance of), P495 (country of origin), P17 (country) and P840 (narrative location, shown twice and labelled CPA).]
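The three annotation tasks can be illustrated on the example table. The sketch below is only an illustration: the `TableAnnotation` class and the `to_triples` helper are hypothetical names, and the mapping of cells and columns to Wikidata identifiers is read off the labels shown on the slide, not taken from DAGOBAH's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class TableAnnotation:
    """Hypothetical container for the three SemTab-style annotation tasks."""
    cea: dict = field(default_factory=dict)  # (row, col) -> entity QID
    cta: dict = field(default_factory=dict)  # col -> class QID
    cpa: dict = field(default_factory=dict)  # (subject col, object col) -> property PID


header = ["Movie", "Location"]
rows = [
    ["Our Happy Lives", "Belfort"],
    ["The French Kissers", "Rennes"],
]

# Identifiers as they appear on the slide; the cell/column assignment is our reading.
annotation = TableAnnotation(
    cea={(0, 0): "Q3344332", (0, 1): "Q171545", (1, 0): "Q745690", (1, 1): "Q647"},
    cta={0: "Q11424", 1: "Q484170"},  # film, commune of France
    cpa={(0, 1): "P840"},             # narrative location
)


def to_triples(rows, ann):
    """Turn CEA + CPA annotations into (subject, property, object) triples."""
    triples = []
    for (s_col, o_col), prop in ann.cpa.items():
        for r in range(len(rows)):
            subject = ann.cea.get((r, s_col))
            obj = ann.cea.get((r, o_col))
            if subject and obj:
                triples.append((subject, prop, obj))
    return triples


triples = to_triples(rows, annotation)
```

Each table row yields one candidate triple, which is the kind of statement a question-answering engine could then query.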
5. State of the Art
▪ Disambiguate cell values (CEA)
▪ 2 Strategies
▪ For each cell, look up the most probable entity. [1][2]
▪ Joint disambiguation of each cell considering the entire row. [3]
▪ Matches for entities can be made using:
▪ Syntactic comparisons [1][2]
▪ Alignment of ontologies [1][3]
▪ Word embeddings [2][3]
▪ Extract column type (CTA)
▪ Majority voting based on CEA outputs [4]
▪ Extract relationships between columns (CPA)
▪ Majority voting based on previously determined types and entities [5]
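Majority voting for CTA can be sketched as follows. This is a minimal illustration under our own assumptions, not the method of any cited system: `majority_vote` is a hypothetical helper, each cell contributes at most one vote per candidate type, and ties are broken alphabetically for determinism. The identifier Q515 is included only as an illustrative competing type.

```python
from collections import Counter


def majority_vote(candidate_types_per_cell):
    """Return the type supported by the most cells, or None if no cell has a type."""
    votes = Counter()
    for types in candidate_types_per_cell:
        votes.update(set(types))  # one vote per type per cell
    if not votes:
        return None
    best = max(votes.values())
    # Deterministic tie-break: alphabetical order of the identifier.
    return sorted(t for t, count in votes.items() if count == best)[0]


# Candidate types for the cells of a "Location" column, as CEA might return them.
location_cells = [["Q484170"], ["Q484170", "Q515"], []]
column_type = majority_vote(location_cells)
```

Cells with no CEA result simply cast no vote, so the column type depends on CEA coverage.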
[1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347.
[2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics: Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000.
[3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277.
[4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1st International Workshop on Consuming Linked Data (COLD).
[5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358.
7. Challenges Requiring Pre-processing
Challenges:
• Table nature
• Table orientation
• Column header presence
• Key column identification
• Column type detection
Example table:
Lake | Area | Depth | County
Windermere | 14,73 km² | 66 m | Cumbria
Kielder Reservoir | 10,86 km² | 52 m | Northumberland
Ullswater | 8,9 km² | 63 m | Lake district
Bassenthwaite Lake | 5,1 km² | 21 m | Cumbria
Derwent Water | 5,1 km² | 22 m | Lake District
Pre-processing output:
• Relational table
• Horizontal
• Header: True, index = 0
• Key column: 0
• Primitive Typing: [Object, Unit, Unit, Object]
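Column type detection on the lake table can be approximated with a simple pattern test. The sketch below is an assumption-laden illustration, not DAGOBAH's pre-processing: the regular expression, the `primitive_type` helper and the "more than half of the cells" threshold are all hypothetical choices, and only the two primitive types named on the slide (Object, Unit) are handled.

```python
import re

# A number (decimal comma or point) followed by a unit symbol such as "km²" or "m".
UNIT_PATTERN = re.compile(r"^\d+(?:[.,]\d+)?\s*[A-Za-zµ°][A-Za-z²³/]*$")


def primitive_type(cells):
    """Label a column 'Unit' if more than half of its cells look like quantities."""
    unit_like = sum(1 for cell in cells if UNIT_PATTERN.match(cell.strip()))
    return "Unit" if unit_like > len(cells) / 2 else "Object"


table = [
    ["Lake", "Area", "Depth", "County"],
    ["Windermere", "14,73 km²", "66 m", "Cumbria"],
    ["Kielder Reservoir", "10,86 km²", "52 m", "Northumberland"],
    ["Ullswater", "8,9 km²", "63 m", "Lake district"],
    ["Bassenthwaite Lake", "5,1 km²", "21 m", "Cumbria"],
    ["Derwent Water", "5,1 km²", "22 m", "Lake District"],
]

body = table[1:]            # header assumed at index 0, as on the slide
columns = list(zip(*body))  # transpose rows into columns
types = [primitive_type(column) for column in columns]
```

On this table the sketch reproduces the typing shown on the slide, [Object, Unit, Unit, Object].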
15. Evaluation Dataset: SemTab 2019
SemTab 2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
[Table: statistics of the datasets in each SemTab round. From Jiménez-Ruiz et al. (2020), SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems.]
16. Assessment Criteria
▪ T: all the columns for annotation.
▪ P: perfect annotations, i.e. the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
▪ O: okay annotations, i.e. the super-classes of perfect classes (excluding very generic top classes like owl:Thing).
▪ W: wrong annotations, i.e. other annotations not in the ground truth.
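The slide lists only the sets T, P, O and W. The sketch below assumes the CTA scores are those of the SemTab 2019 challenge, AH = (|P| + 0.5·|O| − |W|) / |T| and AP = |P| / (|P| + |O| + |W|); the formulas are not on the slide itself, and the function name `cta_scores` is hypothetical.

```python
def cta_scores(perfect, okay, wrong, targets):
    """Compute (AH, AP) from annotation counts, assuming the SemTab 2019 definitions."""
    ah = (perfect + 0.5 * okay - wrong) / targets
    total = perfect + okay + wrong
    ap = perfect / total if total else 0.0
    return ah, ap


# Toy counts: 4 target columns, 3 perfect, 2 okay and 1 wrong annotation.
ah, ap = cta_scores(perfect=3, okay=2, wrong=1, targets=4)
```

Under these definitions a column may receive several okay annotations, so AH is not bounded by 1, which would be consistent with the AH values above 1 reported on the results slide.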
17. Results
▪ DAGOBAH results for Round 1:
Approach | CTA AH | CTA AP | CEA F1 | CEA Precision | CPA F1 | CPA Precision
Baseline | 0.479 | 0.242 | 0.883 | 0.892 | 0.415 | 0.347
Embedding | 1.212 | 0.336 | 0.841 | 0.853 | - | -
▪ Rounds 2 to 4:
Round | System | CTA AH | CTA AP | CEA F1 | CEA Precision | CPA F1 | CPA Precision
Round 2 | Baseline | 0.641 | 0.247 | 0.713 | 0.816 | 0.533 | 0.919
Round 2 | MTab | 1.414 | 0.276 | 0.911 | 0.911 | 0.881 | 0.929
Round 3 | Baseline | 0.745 | 0.161 | 0.725 | 0.745 | 0.519 | 0.826
Round 3 | MTab | 1.956 | 0.261 | 0.970 | 0.970 | 0.844 | 0.845
Round 4 | Baseline | 0.684 | 0.206 | 0.578 | 0.599 | 0.398 | 0.874
Round 4 | MTab | 2.012 | 0.300 | 0.983 | 0.983 | 0.832 | 0.832
✓ MTab is the winner of this challenge
✓ Relatively behind MTab due to missing Wikidata – DBpedia type mappings
18. Conclusions
▪ New homogeneity factor that improves the pre-processing
▪ 2 approaches:
▪ Baseline composed of lookups and majority voting
▪ Clustering of embeddings
▪ Performance bottlenecks (due to the challenge context):
✓ Light data cleaning … on purpose
✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
Pros and cons of each approach:
Baseline, pros:
▪ High coverage (multiple sources)
▪ Computational efficiency
Baseline, cons:
▪ Lookup-services dependency (reliability)
▪ Blackbox (indexing, scoring…)
▪ Queries volume
Embedding, pros:
▪ Lookup strategy independence
▪ Relevant clustering even with few data
▪ Generalization (no tailored cleaning + less heuristics in lookups and scoring)
Embedding, cons:
▪ Computational performances
▪ K optimization
▪ Embedding dependency
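The clustering-of-embeddings idea, and why K needs tuning, can be illustrated with a small K-means run. This is a sketch under stated assumptions, not DAGOBAH's pipeline: the 2-D vectors stand in for real knowledge-graph embeddings, the candidate labels are invented, K-means is our choice of algorithm, and centroids are initialised from the first K points to keep the run deterministic.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain K-means; returns the cluster index assigned to each point."""
    centroids = [list(p) for p in points[:k]]  # deterministic initialisation
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return assignment


# Hypothetical candidate entities for one column, with toy 2-D embeddings.
candidates = ["cand_A", "cand_B", "cand_C", "cand_D", "cand_E"]
vectors = [(1.0, 0.9), (5.0, 5.1), (0.9, 1.0), (5.2, 4.9), (1.1, 1.0)]

assignment = kmeans(vectors, k=2)
dominant = Counter(assignment).most_common(1)[0][0]
kept = [name for name, a in zip(candidates, assignment) if a == dominant]
```

Candidates in the dominant cluster are kept as the coherent reading of the column. With a poorly chosen K, unrelated candidates would merge or a coherent group would split, which is one way to read the "K optimization" concern.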
19. Future Work
✓ Test other Wikidata embedding methods (currently TransE)
✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
✓ Experiment with more clustering algorithms and parameters on different datasets
✓ Learn data table embeddings and find vector transformation(s) to the KG embedding space
✓ …
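For readers unfamiliar with TransE, the model treats a relation as a translation in vector space, so a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below uses made-up 2-D vectors and a hypothetical `transe_score` function with the L2 distance; it shows only the scoring idea, not how the embeddings used in DAGOBAH were trained.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between head + relation and tail (higher is more plausible)."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy vectors, invented for illustration only.
our_happy_lives = (0.2, 0.1)
narrative_location = (0.5, 0.4)
belfort = (0.7, 0.5)
rennes = (0.1, 0.9)

score_belfort = transe_score(our_happy_lives, narrative_location, belfort)
score_rennes = transe_score(our_happy_lives, narrative_location, rennes)
```

With these toy vectors the Belfort triple scores higher than the Rennes one, which is the behaviour a trained model is expected to show for true versus false statements.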