Talk given by our colleague Claus Stadler at the 11th International Conference on Semantic Systems - SEMANTiCS 2015
Paper available here: http://jens-lehmann.org/files/2015/semantics_dbtax.pdf
An approach to measure how biased a Linked Data dataset is, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
Information Extraction in the TalkOfEurope Creative Camp - Wim Peters
The CLARIN Talk of Europe Creative Camp event in March 2015 invited people to work on the EuroParliament data of the Talk of Europe data set (http://linkedpolitics.ops.few.vu.nl/home)
Our work during that event covers the conceptualization of the content of two data sets:
- English EuroParliament speeches from the Talk of Europe data set and
- UK Parliament speeches.
We performed term extraction, term organisation and the linking of terminology between these two data sets.
Information-rich programming in F# (ML Workshop 2012) - Tomas Petricek
We live in an information-rich world that provides huge opportunities for programmers to explore and create exciting applications. Traditionally, statically typed programming languages have not been aware of the data types that are implicitly available in the outer world such as the web, databases and semi-structured files. This tutorial presents the recent work on F# Type Providers - a language mechanism that enables a smooth integration of diverse data sources into an ML-style programming language. As a case study, we look at a type provider for accessing the world development indicators from the World Bank and we will discuss some intriguing research problems associated with mapping real-world data into an ML-style type system.
This document discusses different data types used in databases and provides an exercise for students to practice identifying the appropriate data type for different fields and entering data into a database table and form. The key data types covered are boolean, integer, currency, date, and string. Students are asked to enter details about 4 team members or henchmen into a database table called "characters" using both the table's datasheet view and a data entry form. An extension activity asks students to identify the appropriate data type for a telephone number field.
Information Extraction from EuroParliament and UK Parliament data - Wim Peters
These slides describe the work done at the CLARIN talk of Europe Creative Camp, in which groups from various countries worked with EuroParliament speeches.
Our work covers term extraction, term organisation and term linking between the EuroParliament and UK Parliament data sets.
Presentation of the QALD-7 challenge at ESWC 2017: Question Answering over Linked Data.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
QALD-7 @ ESWC 2017 Portoroz, Slovenia
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Structural syntactic metrics for RDF Datasets that correlate with high level quality deficiencies.
The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been focused on quality of data prior to their publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets prior to publication on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided and/or resolved prior to the publication of a dataset. We then propose a set of metrics to measure and identify these quality deficiencies in a dataset. This way, we enable the assessment and identification of undesirable quality characteristics of a dataset through our proposed metrics.
Slides for paper presentation at DEXA 2015:
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri:
Quality Metrics for Linked Open Data. DEXA (1) 2015: 144-152
The ability to take data, understand it, visualize it and extract useful information from it is becoming a hugely important skill. How can you turn all those logs, histories of purchases and trades, or open government data into useful information that helps your business make money?
In this talk, we’ll look at doing data science using F#. The F# language is perfectly suited for this task: type providers integrate external data directly into the language, so your language suddenly _understands_ CSV, XML, JSON, REST services and other sources. The interactive development style makes it easy to explore data and test your algorithms as you’re writing them. A rich set of libraries for working with data frames, time series and visualization gives you all the tools you need. And finally, F# easily integrates with statistical environments like R and Matlab, giving you access to the industry-standard libraries.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering and how does it differ from density-based clustering? And how can it be used for outlier detection?
- What is so-called soft clustering and how does it differ from hard clustering? And how can it be used for outlier detection?
A minimal scikit-learn sketch of these ideas follows below.
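None of the talk's material is reproduced here, but a minimal scikit-learn sketch (the data and every parameter value are illustrative assumptions) shows how density-based clustering, OPTICS, and soft clustering can each flag outliers:

```python
# Minimal sketch, assuming toy 2-D data: DBSCAN and OPTICS mark noise
# points with label -1; a Gaussian mixture gives soft memberships.
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),   # dense cluster
    rng.normal(5, 0.3, (100, 2)),   # second dense cluster
    rng.uniform(-2, 7, (10, 2)),    # sparse points: candidate outliers
])

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
opt_labels = OPTICS(min_samples=5).fit(X).labels_
print("DBSCAN outliers:", np.sum(db_labels == -1))
print("OPTICS outliers:", np.sum(opt_labels == -1))

# Soft clustering: each point gets membership probabilities instead of a
# hard label; a low maximum probability can also flag an outlier.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)
print("least confident point:", X[probs.max(axis=1).argmin()])
```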
Queen Mary University of London Collection Slides - martinbarge
This document describes a project to create an online collection of law PhD thesis abstracts using the FLAX tools to build language learning activities for a pre-sessional English law course. The process involved obtaining permissions to use the abstracts, selecting and uploading them to FLAX, building the collection, creating various activities including cloze tests, sentence matching, and scrambled sentences. Lessons learned included selecting appropriately lengthy texts, consulting the FLAX team on procedures, and allocating sufficient time to build and test the collection.
Learn how to manipulate data frames using the dplyr package by Hadley Wickham. This session will cover select, filter, summarize, tally, group_by, and mutate. Based on the data carpentry ecology lessons
Modeling Social Data, Lecture 3: Data manipulation in R - jakehofman
The document discusses data manipulation in R. It notes that R has some quirks with naming conventions and variable types but is well-suited for exploratory data analysis, generating visualizations, and statistical modeling. The tidyverse collection of R packages, including dplyr and ggplot2, helps make data analysis easier by providing tools for reshaping data into a tidy format with one variable per column and observation per row. Dplyr's verbs like filter, arrange, select, mutate and summarize allow for splitting, applying transformations, and combining data in a functional programming style.
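For readers outside the R ecosystem, a rough pandas analogue of those dplyr verbs may help; the `surveys` data frame and its columns below are invented for illustration, and each step is annotated with the dplyr verb it mimics:

```python
# A rough pandas analogue of dplyr's verbs (illustrative data only).
import pandas as pd

surveys = pd.DataFrame({
    "species": ["owl", "owl", "fox", "fox"],
    "weight":  [120, 150, 3400, 3100],
    "year":    [2001, 2002, 2001, 2002],
})

result = (
    surveys
    .loc[surveys["year"] > 2000]                      # dplyr: filter()
    [["species", "weight"]]                           # dplyr: select()
    .assign(weight_kg=lambda d: d["weight"] / 1000)   # dplyr: mutate()
    .groupby("species")                               # dplyr: group_by()
    .agg(mean_kg=("weight_kg", "mean"),               # dplyr: summarize()
         n=("weight_kg", "size"))                     # dplyr: tally()
)
print(result)
```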
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is certainly an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
On Mining Citations to Primary and Secondary Sources in HistoriographyGiovanni Colavizza
This document discusses the development of a pipeline to extract citations from footnotes in historical texts. It aims to create a resource like Google Scholar tailored for the study of history. The project analyzes journals and monographs related to the history of Venice. The pipeline involves three main steps: 1) detecting text blocks containing footnotes, 2) extracting citations from footnotes, and 3) parsing the elements of each citation. Machine learning methods like SVMs and CRFs are used and challenges include citation variations and data scarcity in the humanities. The goal is to build a database of citations to primary and secondary sources to enable new bibliometric analyses and research services.
Basic introduction to recommender systems + Implementing a content-based recommender system by leveraging knowledge encoded into Linked Open Data datasets
This document discusses recommender systems and linked open data. It begins with an introduction to linked open data, describing its key components like URIs, RDF, and popular vocabularies. It then provides an overview of recommender systems, explaining how they help with information overload by matching users to items. Different recommendation techniques are described like collaborative filtering, content-based, knowledge-based, and hybrid approaches. Evaluation methods for recommender systems like dataset splitting are also briefly covered. The document aims to lay the foundation for discussing how recommender systems can utilize linked open data.
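As a hedged illustration of the content-based, LOD-fed approach sketched above (the items, the DBpedia-style feature names, and the liked-item list are all invented), one can build item vectors from knowledge-graph properties and rank candidates by cosine similarity to a user profile:

```python
# Toy content-based recommender: binary item features stand in for
# properties pulled from a Linked Open Data source such as DBpedia
# (e.g. dbo:genre, dbo:director); all names here are illustrative.
import numpy as np

features = ["genre:SciFi", "genre:Drama", "director:Nolan", "subject:Space"]
items = {
    "Interstellar": [1, 1, 1, 1],
    "Inception":    [1, 0, 1, 0],
    "The_Artist":   [0, 1, 0, 0],
}

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

liked = ["Interstellar"]                       # user profile = mean of liked items
profile = np.mean([items[i] for i in liked], axis=0)

ranking = sorted(
    ((cosine(profile, v), name) for name, v in items.items() if name not in liked),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```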
The document discusses stacks and queues, which are linear data structures that maintain order. Stacks follow LIFO (last in, first out) order, where new elements are added to the top and the top element is removed first. Queues follow FIFO (first in, first out) order, where new elements are added to the rear and elements are removed from the front. The document compares stacks and queues, noting that stacks are used for calculations and function calls while queues are used for character buffers and print queues.
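To make the LIFO/FIFO contrast concrete, here is a minimal Python sketch (idiomatic choices, not taken from the document: a list for the stack, collections.deque for the queue):

```python
from collections import deque

# Stack: LIFO -- push and pop both happen at the top (the list's end).
stack = []
stack.append("a"); stack.append("b"); stack.append("c")
print(stack.pop())      # "c" -- last in, first out

# Queue: FIFO -- enqueue at the rear, dequeue from the front.
queue = deque()
queue.append("a"); queue.append("b"); queue.append("c")
print(queue.popleft())  # "a" -- first in, first out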
Pandas is an open-source Python library used for data manipulation and analysis. It allows users to extract data from files like CSVs into DataFrames and perform statistical analysis on the data. DataFrames are the primary data structure and allow storage of heterogeneous data in tabular form with labeled rows and columns. Pandas can clean data by removing missing values, filter rows/columns, and visualize data using Matplotlib. It supports Series, DataFrames, and Panels for 1D, 2D, and 3D labeled data structures.
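A minimal sketch of that workflow (the file name "sales.csv" and its columns are assumptions for illustration):

```python
# Minimal pandas workflow: load a CSV, clean it, filter, aggregate, plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                     # CSV -> DataFrame
df = df.dropna(subset=["amount"])                 # drop rows with missing values
big = df[df["amount"] > 100]                      # filter rows
summary = big.groupby("region")["amount"].sum()   # aggregate per group
print(summary.describe())                         # quick statistics

summary.plot(kind="bar")                          # visualize via Matplotlib
plt.ylabel("total amount")
plt.show()
```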
Improving Document Clustering by Eliminating Unnatural Language - Jinho Choi
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important source of confusion for existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model identifying unnatural language components in four categories. First, we create a new annotated corpus by collecting slides and papers in various formats, PPT, PDF, and HTML, where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
PhyloTastic: names-based phyloinformatic data integration - Rutger Vos
Lightning talk to the 2013 TDWG conference symposium on phyloinformatics, brief report on PhyloTastic with special attention to the taxonomic name reconciliation service TaxoSaurus.
Academic Writing and Research Data Management - CESSDA Training
This document discusses academic writing standards for research data management and documentation. It provides examples of documentation from the European Values Study conducted in 1981, 1990, 1999, and 2008. The analysis found improvements over time in documenting the sample, methodology, variables, and providing references to allow other researchers to understand and replicate the work. Standards evolved as the replication movement increased, making methodology sections more transparent and data more reusable.
These slides were presented at the "graph databases in life sciences" workshop. There is an accompanying Neo4j guide that will walk you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
A Theoretic Framework for Evaluating Similarity Digesting Tools - Liwei Ren 任力偉
Similarity digesting is a class of algorithms and technologies that generate hashes from files while preserving file similarity. They find applications in various areas across the security industry: malware variant detection, spam filtering, computer forensic analysis, data loss prevention, etc. A few schemes and tools are available, including ssdeep, sdhash and TLSH. While useful for detecting file similarity, they define similarity from different perspectives; in other words, they take different approaches to describe what file similarity is about. In order to compare these tools more rigorously, we introduce a simple mathematical model of similarity that covers all three schemes and beyond. This model enables us to establish a theoretic framework for analyzing the essential differences between various similarity digesting tools. The general use cases proposed by NIST are studied. As a result, a few tools are found to be complementary to each other, so that we can use them in a hybrid approach in practice. Data experiment results are provided to support the theoretic analysis.
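None of the real tools' internals are reproduced here, but a toy digest in Python (shingle the bytes, keep the k smallest hashes, compare digests with Jaccard similarity) illustrates the defining property that similar files yield similar digests:

```python
# Toy similarity digest (illustrative only; NOT ssdeep, sdhash, or TLSH):
# hash overlapping byte 4-grams, keep the k smallest hashes as the digest,
# and score two digests by Jaccard similarity in [0, 1].
import hashlib

def digest(data: bytes, k: int = 32) -> set:
    grams = {data[i:i + 4] for i in range(max(len(data) - 3, 1))}
    hashes = sorted(int.from_bytes(hashlib.sha1(g).digest()[:8], "big")
                    for g in grams)
    return set(hashes[:k])

def similarity(d1: set, d2: set) -> float:
    return len(d1 & d2) / len(d1 | d2) if d1 | d2 else 1.0

a = b"The quick brown fox jumps over the lazy dog" * 20
b = a.replace(b"lazy", b"sleepy")   # a small edit to the same file
c = bytes(range(256)) * 4           # unrelated content
print(similarity(digest(a), digest(b)))  # high: near-duplicate files
print(similarity(digest(a), digest(c)))  # low: unrelated files
```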
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
Workshop NWAV 47 - LVS - Tool for Quantitative Data Analysis - Olga Scrivner
This document provides an overview of the Language Variation Suite (LVS) toolkit. The LVS is a web application designed for sociolinguistic data analysis. It allows users to upload spreadsheet data, perform data cleaning and preprocessing, generate summary statistics and cross tabulations, create data visualizations, and conduct various statistical analyses including regression modeling, clustering, and random forests. The workshop will cover the structure and functionality of the LVS through practical examples and exercises using sample sociolinguistic datasets.
Data Search and Search Joins (Universität Heidelberg 2015) - Chris Bizer
The amount of structured data published on the Web has increased sharply in recent years. The deluge of available data calls for new search techniques that support users in finding and integrating data from large numbers of data sources. In his talk, Christian Bizer will give an overview of the different types of data search that have been proposed so far: entity search, table search, and constrained and unconstrained search joins. As an example of a system from the last category, he will introduce the Mannheim Search Join Engine, which executes unconstrained search joins over different types of Web data including Linked Data, Microdata, Web tables and Wikipedia tables.
Building better knowledge graphs through social computing - Elena Simperl
Elena Simperl discusses how social computing can help build better knowledge graphs. She presents research on how the editing behaviors and diversity of communities impact the quality of knowledge graphs like Wikidata and DBpedia. Her studies found that bot edits, tenure diversity, and interest diversity positively influence item and ontology quality. She also shows how crowdsourcing can enhance knowledge graphs by having experts and non-experts perform different quality assurance tasks, like detecting errors or classifying entities.
1. Machine learning was used to create a decision tree model to diagnose problems in telecommunications networks, achieving 99% accuracy with only 10,000 examples.
2. The model was simplified for comprehensibility, becoming probabilistic and covering 50% of cases with general rules and 50% with specific small disjuncts.
3. Lessons from the success include the importance of model comprehensibility, handling small datasets, addressing systematic errors, and considering future extensions when applying machine learning solutions.
The document discusses using topic modeling techniques to cluster and classify records from multiple OAI repositories to enhance metadata and subject descriptions. Key steps included preprocessing records, building a vocabulary, running topic modeling to generate 500 topics, organizing topics into broad topical categories, and developing a browser to explore topics and records. Evaluation found that the techniques worked well for English repositories but require more testing on other languages and repository types. Potential products and services are proposed, such as integrating the topics into OAIster for subject search and browse.
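A minimal scikit-learn sketch of that kind of pipeline (the records and the tiny topic count are placeholders; the project itself used 500 topics):

```python
# Minimal topic-modeling sketch: vectorize records, fit LDA, inspect topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

records = [
    "medieval manuscripts and paleography",
    "gene expression in yeast cells",
    "manuscript illumination in monasteries",
    "protein folding and cell biology",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(records)               # preprocessing + vocabulary
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):    # top words per topic
    top = weights.argsort()[-4:][::-1]
    print(f"topic {t}:", ", ".join(terms[i] for i in top))

print(lda.transform(X))  # per-record topic mixture, usable for clustering
```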
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine-learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
Search Joins with the Web - ICDT 2014 Invited Lecture - Chris Bizer
The talk will discuss the concept of Search Joins. A Search Join is a join operation which extends a local table with additional attributes based on the large corpus of structured data that is published on the Web in various formats. The challenge for Search Joins is to decide which Web tables to join with the local table in order to deliver high-quality results. Search joins are useful in various application scenarios. They allow for example a local table about cities to be extended with an attribute containing the average temperature of each city for manual inspection. They also allow tables to be extended with large sets of additional attributes as a basis for data mining, for instance to identify factors that might explain why the inhabitants of one city claim to be happier than the inhabitants of another.
In the talk, Christian Bizer will draw a theoretical framework for Search Joins and will highlight how recent developments in the context of Linked Data, RDFa and Microdata publishing, public data repositories as well as crowd-sourcing integration knowledge contribute to the feasibility of Search Joins in an increasing number of topical domains.
This document discusses key aspects of building databases to catalog global biodiversity in the 2000s, including standards, technology, data sharing challenges, and classification methods. It covers how database infrastructure requires stable standards and technology to ensure data accessibility over time. Issues around data ownership, privacy, and ensuring data can be shared and reused across disciplines are also addressed. Classification systems are evolving from paper-based to digital formats using tools like cladistics and computer programs to help organize the vast amounts of data being collected through worldwide biodiversity projects.
This document discusses machine learning challenges posed by hypertext and the web. It presents two examples of applying machine learning to hypertext documents: 1) semi-supervised learning to classify topics of hypertext documents using both text and hyperlinks, and 2) classifying interconnected entities by labeling graphs with many classes. The author proposes models that combine text and link information to better learn from hypertext documents and address issues like "topic drift".
This document outlines a course on data warehousing and data mining. It introduces key concepts like relational databases, data warehouses, dimensional modeling, and data mining techniques. It also details the course objectives, schedule, assignments, and policies. The goal is for students to gain experience applying data mining methods and understanding the relationship between data mining and other fields.
This document discusses using qualitative research software like WebCT and N6 to collect and analyze online discussion data. It outlines a three stage data collection strategy including open, axial, and selective coding. Advantages of computer assisted qualitative data analysis include organization, systematic approaches, and time savings. Disadvantages include complex software, loss of context, and potential data loss. The document demonstrates exporting discussion data, open coding to develop categories and properties, transforming free nodes to a tree structure, and using text searching to support research variables in analysis.
Using Computer as a Research Assistant in Qualitative Research - JoshuaApolonio1
This document discusses using qualitative research software to collect and analyze online discussion data. It demonstrates exporting discussion data from WebCT into N6 for coding. A three-stage data collection strategy is outlined, beginning with open coding to generate categories and properties, then axial coding to interconnect categories, and ending with selective coding to build a theoretical model. Advantages of this approach include organization of large data sets and time savings, while disadvantages include complexity of software and potential to lose sight of data contexts.
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
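The talk itself is about compressed, indexed representations that make such scales feasible on a laptop; purely to show the query shape, here is a hedged rdflib sketch on a small file (the file name and triple pattern are illustrative, and plain rdflib would of course not handle 28 billion triples):

```python
# Query shape only: rdflib over a small Turtle file. Scaling to billions
# of triples on a laptop needs a compressed index (the subject of the
# talk); "dataset.ttl" and the pattern below are assumptions.
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")

q = """
SELECT ?s ?name WHERE {
  ?s a <http://dbpedia.org/ontology/Monarchy> ;
     <http://www.w3.org/2000/01/rdf-schema#label> ?name .
} LIMIT 10
"""
for row in g.query(q):
    print(row.s, row.name)
```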
Machine Learning for Understanding Biomedical Publications - Grigorios Tsoumakas
This document discusses machine learning techniques for understanding biomedical publications. It describes multi-label classification approaches for semantic indexing of biomedical literature and modality classification of figures. It also discusses ensemble methods, multi-label learning, and applications to tasks like article screening in systematic reviews and PICO sentence identification.
Web Services: Encapsulation, Reusability, and Simplicity - hannonhill
The document discusses web services and their encapsulation, reusability, and simplicity. It covers topics like hiding usernames/passwords, using fully qualified identifiers to locate nodes, and creating reusable classes like Asset and Property. Code examples show how to retrieve assets, work with data definition blocks, and traverse an asset tree to publish pages simply using global functions. The presentation aims to highlight best practices for web services development.
What Are Links in Linked Open Data? A Characterization and Evaluation of Link... - Armin Haller
Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. In this talk I argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. In this talk I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which has not been part of the Linked Open Data Cloud, will also be presented.
Text Analysis: Latent Topics and Annotated Documents - Nelson Auner
This document describes a cluster model for combining latent topics with document attributes in text analysis. It introduces topic models and describes how metadata can be incorporated. The model restricts each document to one topic to allow collapsing observations. An algorithm is provided and applied to congressional speech and restaurant review data. Results show the model can recover topics similarly to topic models, while also capturing variation explained by metadata like political affiliation or review rating.
Similar to Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Kick-off seminar of the largest Wikimedia IEG, 2015 round 2 call.
In conjunction with Wikipedia's 15th birthday.
Project page: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
This document outlines a Google Summer of Code project to teach machines to extract facts from Wikipedia articles by using machine learning and lexical semantics. It discusses extracting lexical units through part-of-speech tagging and statistical ranking, classifying frames and frame elements in an unsupervised or supervised manner, constructing a crowdsourced training set, and serializing the extracted facts into RDF triples for inclusion in DBpedia to discover new relations and populate the knowledge base automatically. The approach is demonstrated on soccer domain articles from the Italian Wikipedia.
The document discusses using Linked Open Data from DBpedia to help with Unicode localization interoperability (ULI). DBpedia extracts structured data from Wikipedia and makes it available as Linked Data. It describes how ULI aims to standardize localization data exchange between tools. DBpedia data on abbreviations in over 100 languages was extracted and evaluated, finding it could help improve text segmentation precision and recall. The extracted data is being considered for inclusion in the Common Locale Data Repository (CLDR) to further standardization efforts.
DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism - Marco Fossati
DBpedia extracts structured data from Wikipedia to create a multilingual linked open data cloud. It has language-specific chapters that map data in different languages to a common structure. This enables multilingual queries over the data and use cases like helping with text segmentation by modeling abbreviations. Mapping sprints help create high-quality data in new languages, like the first Italian DBpedia mapping done in a high school hackathon.
This document discusses challenges and solutions related to data quality. It addresses issues with template-dependent and fully manual mapping approaches and proposes machine learning-based methods and mapping assistants as solutions. It also discusses problems with community-based ontologies like lack of coverage and proposes consistency checks and data-driven schemas using sources like Wikipedia categories to address them. Finally, it lists various multimedia data sources for photos, audio and video that could be linked.
This document discusses outsourcing FrameNet annotation to crowdsourcing. It presents a two-step and simplified one-step methodology for crowdsourcing frame and semantic role annotation. Experiments using these methods on the CrowdFlower platform showed that the simplified one-step approach had higher accuracy and was faster than the two-step approach. Lessons learned include that definitions need to be simplified for non-experts and negation and modality are difficult concepts. Further research directions include larger-scale experiments and linking entities to structured knowledge bases.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact / sustainability of software testing is discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl - ... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at the IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
1. Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Presented by Claus Stadler
Vienna, 17th September 2015
Marco Fossati, Dimitris Kontokostas, and Jens Lehmann
5. Heterogeneous granularity
Lack of coverage: 2.8 M typed resources out of 4.9 M
Wikipedia Category System
Chaotic: cycles
Too fine-grained: "Radio Stations in Traverse City, Michigan"
DBpedia ontology (DBPO)
Organisation > Band > SambaSchool > ???
13. Stage 1: Leaf Node Extraction
INPUT = cyclic graph; OUTPUT = tree
Bottom-up approach: from the leaves to the root
Extract categories linked to actual articles only
Set of categories with no sub-categories = Leaf Nodes Set
Examples: Inuit_deities, Ugandan_monarchies, Inuit_goddesses
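A hedged Python sketch of this stage; the dict-based encoding of the category graph is an assumption for illustration, not the paper's implementation:

```python
# Sketch of Stage 1 (leaf node extraction) over a toy category graph.
children = {                       # category -> sub-categories
    "Deities": {"Inuit_deities"},
    "Inuit_deities": {"Inuit_goddesses"},
    "Inuit_goddesses": set(),
    "Ugandan_monarchies": set(),
}
articles = {                       # category -> article pages it links to
    "Inuit_goddesses": {"Sedna_(mythology)"},
    "Inuit_deities": {"Nanook"},
    "Ugandan_monarchies": {"Buganda"},
    "Deities": set(),
}

# Leaf nodes: categories with no sub-categories that link to actual articles.
leaves = {c for c, subs in children.items() if not subs and articles[c]}
print(leaves)  # {'Inuit_goddesses', 'Ugandan_monarchies'}
```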
14. Stage 2: Prominent Node Discovery
(A) Leaf Graph Traversal
(B) Natural Language Processing for is-a relations
(C) Interlanguage Links Weight
15. Stage 2A: Leaf Graph Traversal
INPUT = leaf nodes set
For each leaf L:
  Get its parents;
  For each parent P:
    Are all its children leaves?
    YES: P is a prominent node
    NO: L is a prominent node
Examples: Inuit_goddesses, Inuit_deities, Ugandan_monarchies
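A hedged sketch of this traversal rule, reusing the toy encoding from the Stage 1 sketch (again illustrative, not the authors' code):

```python
# Sketch of Stage 2A: a parent is prominent if ALL of its children are
# leaves; otherwise the leaf itself is prominent.
children = {
    "Inuit_deities": {"Inuit_goddesses"},
    "Religion_in_Uganda": {"Ugandan_monarchies", "Ugandan_bishops"},
}
parents = {
    "Inuit_goddesses": {"Inuit_deities"},
    "Ugandan_monarchies": {"Religion_in_Uganda"},
}
leaves = {"Inuit_goddesses", "Ugandan_monarchies"}

prominent = set()
for leaf in leaves:
    for parent in parents.get(leaf, ()):
        if children[parent] <= leaves:   # are all its children leaves?
            prominent.add(parent)        # YES: the parent is prominent
        else:
            prominent.add(leaf)          # NO: the leaf itself is prominent
print(prominent)  # {'Inuit_deities', 'Ugandan_monarchies'}
```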
16. Stage 2B: NLP for is-a relations
Category = Noun Phrase (NP)
HEAD extraction via shallow syntactic parsing
Is the HEAD plural?
YES: class candidate; depluralize
Examples: Deity, Monarchy
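A heavily simplified sketch of this step; a real implementation would use a shallow syntactic parser, so the string heuristics below are toy assumptions:

```python
# Toy sketch of Stage 2B: extract the head of a category's noun phrase
# and keep it as a class candidate if the head is plural.
def head(category: str) -> str:
    # Naive head rule: the token before a preposition, else the last token.
    tokens = category.replace("_", " ").split()
    for i, tok in enumerate(tokens):
        if tok.lower() in {"of", "in", "from", "by"}:
            return tokens[i - 1]
    return tokens[-1]

def depluralize(noun: str) -> str:
    if noun.endswith("ies"):
        return noun[:-3] + "y"     # monarchies -> monarchy
    if noun.endswith("s"):
        return noun[:-1]           # gods -> god
    return noun

for cat in ["Inuit_deities", "Ugandan_monarchies"]:
    h = head(cat)
    if h.endswith("s"):                          # is the head plural?
        print(cat, "->", depluralize(h).capitalize())
# Inuit_deities -> Deity ; Ugandan_monarchies -> Monarchy
```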
17. Stage 2C: Interlanguage Links Weight
The more interlanguage links a category has, the more it is used across language editions
Prune categories with interlanguage links < Threshold
Threshold = 3
Examples: Inuit_deities, Ugandan_monarchies
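The pruning itself is a one-liner; a sketch with invented link counts:

```python
# Sketch of Stage 2C: keep only categories with enough interlanguage
# links. The counts below are invented for illustration.
interlang_links = {"Inuit_deities": 12, "Ugandan_monarchies": 5,
                   "Radio_stations_in_Traverse_City,_Michigan": 1}
THRESHOLD = 3
kept = {c for c, n in interlang_links.items() if n >= THRESHOLD}
print(kept)  # {'Inuit_deities', 'Ugandan_monarchies'}
```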
19. Stage 4: A-Box
INPUT = prominent node heads
For each prominent node head H:
  Extract the category set with head = H;
  Extract the page set for each category;
  For each page P:
    Is it an article page?
    YES: <P, instance-of, H>
    NO: repeat until it is
Example: <Bengal_Sultanate, instance-of, Monarchy>
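A hedged sketch of the A-Box typing loop (toy data structures; the recursion into subcategories mirrors the "repeat until it is" step):

```python
# Sketch of Stage 4 (A-Box population): for every prominent head H, type
# every article reachable from the categories with that head.
cats_by_head = {"Monarchy": ["Ugandan_monarchies", "Sultanates"]}
members = {  # category -> (article pages, sub-categories)
    "Ugandan_monarchies": ({"Buganda"}, []),
    "Sultanates": (set(), ["Sultanates_in_Asia"]),
    "Sultanates_in_Asia": ({"Bengal_Sultanate"}, []),
}

def type_pages(category, head, triples):
    pages, subcats = members[category]
    for p in pages:                      # article page: emit a triple
        triples.append((p, "instance-of", head))
    for sub in subcats:                  # not an article: recurse until it is
        type_pages(sub, head, triples)

triples = []
for head, cats in cats_by_head.items():
    for cat in cats:
        type_pages(cat, head, triples)
print(triples)  # [('Buganda', 'instance-of', 'Monarchy'),
                #  ('Bengal_Sultanate', 'instance-of', 'Monarchy')]
```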
24. T-Box Evaluation: Questions (1/2)
• “Is this a class or an instance?”
Restaurant VS Puella_Magi_Madoka_Magica (movie)
• “Can this class be broken down into more than one class?”
Mountain VS Musical_groups_from_Gothenburg
• “Is this a valid class hierarchy path?”
wikicategory_Golden_Bear_winners < yagoLegalActorGeo < owl#Thing
25. T-Box Evaluation: Questions (2/2)
• “Is this hierarchy too specific?” (too many levels)
Porter_County,_Indiana < Chicago_metropolitan_area < Metropolitan_areas_of_Illinois < Populated_places_in_Illinois < owl#Thing
• “Is this hierarchy too broad?” (very few levels)
Gonorynchiforme (fish family) < owl:Thing
27. A-Box Evaluation: Settings
Crowdsourced to the layman
Evaluation set: 500 random entities with no type in DBpedia
5 judgments per entity
Prevent a worker from answering the same question twice
28. A-Box Evaluation: Test Questions
Automatically discard untrusted judgments
Untrusted worker: < 80% correct test questions
Subjective task: missed test questions
  They affect the # of untrusted judgments
  The class label may be ambiguous
32. Advantages
Exhaustive coverage (almost 100%)
Type coverage comparison
Recall in A-Box evaluation
Intuitive
Crowdsourced (the layman) A-Box evaluation
Least # of untrusted judgments
33. Drawbacks
Short hierarchy paths
Cycle removal
Instance pruning
Relatively low precision
NLP may still yield "weird" is-a relations
"Elvis Presley is a Burial"
35. Conclusion
Significant type coverage leap
Intuitive for end users
Balance between DBPO (too generic) and YAGO (too specific)
Integrated in the latest DBpedia release
36. Future Work
Merge the T-Box into mappings.dbpedia.org for curation
Word Sense Disambiguation for homonymous classes
Multilingual deployment (currently English and Italian)
38. Thanks for your attention!
Download DBTax at:
http://downloads.dbpedia.org/current/core-i18n/en/
Browse the Italian DBTax at:
http://it.dbpedia.org/sparql
Contact the first author at:
fossati@fbk.eu