Slides for the following paper: NLP Data Cleansing Based on Linguistic Ontology Constraints
Abstract: Linked Data comprises an unprecedented volume of structured data on the Web and is being adopted by an increasing number of domains. However, the varying quality of published data forms a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology for Linked Data quality assessment that is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and describe how the method is applied in the Natural Language Processing (NLP) area. NLP is, compared to other domains such as biology, a late Linked Data adopter, but it has recently seen a steep rise of activity in the creation of data and ontologies. Quality assessment has therefore become an important need for NLP datasets. In our study, we analysed 11 datasets that use the lemon and NIF vocabularies against 277 test cases and point out common quality issues.
"SPARQL Cheat Sheet" is a short collection of slides intended to act as a guide to SPARQL developers. It includes the syntax and structure of SPARQL queries, common SPARQL prefixes and functions, and help with RDF datasets.
The "SPARQL Cheat Sheet" is intended to accompany the SPARQL By Example slides available at http://www.cambridgesemantics.com/2008/09/sparql-by-example/ .
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyTimm Heuss
Presentation hold at the SEMANTiCS 2014, in regard of this paper: http://doi.acm.org/10.1145/2660517.2660520
In this paper we compare several state-of-the-art Linked Data Knowledge Extraction tools, with regard to their ability to recognise entities of a controlled, domain-specific vocabulary. This includes tools that offer APIs as a Service, locally installed platforms as well as an UIMA-based approach as reference. We evaluate under realistic conditions, with natural language source texts from keywording experts of the Städel Museum Frankfurt. The goal is to find first hints which tool approach or strategy is more convincing in case of a domain specific tagging/annotation, towards a working solution that is demanded by GLAMs world-wide.
This presentation was part of the workshop on Materials Project Software infrastructure conducted for the Materials Virtual Lab in Nov 10 2014. It presents an introduction to the Python Materials Genomics (pymatgen) materials analysis library. Pymatgen is a robust, open-source Python library for materials analysis. It currently powers the public Materials Project (http://www.materialsproject.org), an initiative to make calculated properties of all known inorganic materials available to materials researchers. These are some of the main features:
1. Highly flexible classes for the representation of Element, Site, Molecule, Structure objects.
Extensive io capabilities to manipulate many VASP (http://cms.mpi.univie.ac.at/vasp/) and ABINIT (http://www.abinit.org/) input and output files and the crystallographic information file format. This includes generating Structure objects from vasp input and output. There is also support for Gaussian input files and XYZ file for molecules.
2. Comprehensive tool to generate and view compositional and grand canonical phase diagrams.
3. Electronic structure analyses (DOS and Bandstructure).
4. Integration with the Materials Project REST API.
"SPARQL Cheat Sheet" is a short collection of slides intended to act as a guide to SPARQL developers. It includes the syntax and structure of SPARQL queries, common SPARQL prefixes and functions, and help with RDF datasets.
The "SPARQL Cheat Sheet" is intended to accompany the SPARQL By Example slides available at http://www.cambridgesemantics.com/2008/09/sparql-by-example/ .
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyTimm Heuss
Presentation hold at the SEMANTiCS 2014, in regard of this paper: http://doi.acm.org/10.1145/2660517.2660520
In this paper we compare several state-of-the-art Linked Data Knowledge Extraction tools, with regard to their ability to recognise entities of a controlled, domain-specific vocabulary. This includes tools that offer APIs as a Service, locally installed platforms as well as an UIMA-based approach as reference. We evaluate under realistic conditions, with natural language source texts from keywording experts of the Städel Museum Frankfurt. The goal is to find first hints which tool approach or strategy is more convincing in case of a domain specific tagging/annotation, towards a working solution that is demanded by GLAMs world-wide.
This presentation was part of the workshop on Materials Project Software infrastructure conducted for the Materials Virtual Lab in Nov 10 2014. It presents an introduction to the Python Materials Genomics (pymatgen) materials analysis library. Pymatgen is a robust, open-source Python library for materials analysis. It currently powers the public Materials Project (http://www.materialsproject.org), an initiative to make calculated properties of all known inorganic materials available to materials researchers. These are some of the main features:
1. Highly flexible classes for the representation of Element, Site, Molecule, Structure objects.
Extensive io capabilities to manipulate many VASP (http://cms.mpi.univie.ac.at/vasp/) and ABINIT (http://www.abinit.org/) input and output files and the crystallographic information file format. This includes generating Structure objects from vasp input and output. There is also support for Gaussian input files and XYZ file for molecules.
2. Comprehensive tool to generate and view compositional and grand canonical phase diagrams.
3. Electronic structure analyses (DOS and Bandstructure).
4. Integration with the Materials Project REST API.
In this talk at the CECAM 2015 Workshop on Future Technologies in Automated Atomistic Simulations, I will discuss the Materials Project Ecosystem, an initiative to develop a comprehensive set of open-source software and data tools for materials informatics. The Materials Project is a US Department of Energy-funded initiative to make the computed properties of all known inorganic materials publicly available to all materials researchers to accelerate materials innovation. Today, the Materials Project database boasts more than 58,000 materials, covering a broad range of properties, including energetic properties (e.g., phase and aqueous stability, reaction energies), electronic structure (bandstructures, DOSs) and structural and mechanical properties (e.g., elastic constants).
A linchpin of the Materials Project is its robust data and software infrastructure, built on best open-source software development practices such as continuous testing and integration, and comprehensive documentation. I will provide an overview of the open-source software modules that have been developed for materials analysis (Python Materials Genomics), error handling (Custodian) and scientific workflow management (FireWorks), as well as the Materials API, a first-of-its-kind interface for accessing materials data based on REpresentational State Transfer (REST) principles. I will show a materials researcher may use and build on these software and data tools for materials informatics as well as to accelerate his own research.
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationOlaf Hartig
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talks is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
These slides are a brief update on the status of the work of the current SPARQL Working Group. "SPARQL 1.1" collectively refers to the upcoming versions of the SPARQL query language, SPARQL update language, and other deliverables of the 2nd (current) SPARQL Working Group.
Ontology-based data access: why it is so cool!Josef Hardi
A brief introduction about ontology-based data access (shortly OBDA) and its core implementation. I presented too a recent simple benchmark between -ontop- and Semantika---two most available software for OBDA framework---in term of query performance (including details in the appendix section). The slides were presented for Friday Research Meeting in Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
Cross-language information retrieval (CLIR) is a technique to locate documents written in one natural language by queries expressed in another language. This project investigates the feasibility of CLIR based on domain-specific bilingual corpus databases.
I built this presentation for Informatica World in 2006. It is all about Data Administration, Data Quality and Data Management. It is NOT about the Informatica product. This presentation was a hit, with standing room only full of about 150 people. The content is still useful and applicable today. If you want to use my material, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to big data scaling. This presents a serious impediment since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge arbitrary errors. With large datasets, error detection becomes overly expensive and complicated especially when considering user-defined functions. Furthermore, a distinctive algorithm is desired to optimize sophisticated error discovery, that requires inequality joins, rather than naïvely parallelizing them. Also, when repairing large errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing.
Applying Data Quality Best Practices at Big Data ScalePrecisely
Global organizations are investing aggressively in data lake infrastructures in the pursuit of new, breakthrough business insights. At the same time, however, 2 out of 3 business executives are not highly confident in the accuracy and reliability of their own Big Data. Regaining that confidence requires utilizing proven data quality tools at Big Data scale.
In this on-demand webinar, discover how to ensure your data lake is a trusted source for advanced business insights that lead to new revenue, cost savings and competitiveness. You will have the opportunity to:
• Compare your organization’s data lake “readiness” against initial findings from our upcoming annual Big Data Trends survey
• Gain insight into where and how to leverage data quality best practices for Big Data use cases
• Explore how a ‘Develop Once, Deploy Anywhere’ approach, including to native Big Data infrastructures such as Hadoop and Spark, facilitates consistent data quality patterns
Data Cleansing introduction (for BigClean Prague 2011)Stefan Urbanek
Presentation from the BigClean event in spring 2011 in Prague. Briefly introduces to data quality, cleansing and shows some examples from existing open data/open government projects.
Data-Ed: Best Practices with the Data Management Maturity ModelData Blueprint
The Data Management Maturity (DMM) model is a framework for the evaluation and assessment of an organization's data management capabilities. The model allows an organization to evaluate its current state data management capabilities, discover gaps to remediate, and strengths to leverage. The assessment method reveals priorities, business needs, and a clear, rapid path for process improvements. This webinar will describe the DMM, its evolution, and illustrate its use as a roadmap guiding organizational data management improvements.
In this talk at the CECAM 2015 Workshop on Future Technologies in Automated Atomistic Simulations, I will discuss the Materials Project Ecosystem, an initiative to develop a comprehensive set of open-source software and data tools for materials informatics. The Materials Project is a US Department of Energy-funded initiative to make the computed properties of all known inorganic materials publicly available to all materials researchers to accelerate materials innovation. Today, the Materials Project database boasts more than 58,000 materials, covering a broad range of properties, including energetic properties (e.g., phase and aqueous stability, reaction energies), electronic structure (bandstructures, DOSs) and structural and mechanical properties (e.g., elastic constants).
A linchpin of the Materials Project is its robust data and software infrastructure, built on best open-source software development practices such as continuous testing and integration, and comprehensive documentation. I will provide an overview of the open-source software modules that have been developed for materials analysis (Python Materials Genomics), error handling (Custodian) and scientific workflow management (FireWorks), as well as the Materials API, a first-of-its-kind interface for accessing materials data based on REpresentational State Transfer (REST) principles. I will show a materials researcher may use and build on these software and data tools for materials informatics as well as to accelerate his own research.
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationOlaf Hartig
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talks is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
These slides are a brief update on the status of the work of the current SPARQL Working Group. "SPARQL 1.1" collectively refers to the upcoming versions of the SPARQL query language, SPARQL update language, and other deliverables of the 2nd (current) SPARQL Working Group.
Ontology-based data access: why it is so cool!Josef Hardi
A brief introduction about ontology-based data access (shortly OBDA) and its core implementation. I presented too a recent simple benchmark between -ontop- and Semantika---two most available software for OBDA framework---in term of query performance (including details in the appendix section). The slides were presented for Friday Research Meeting in Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
Cross-language information retrieval (CLIR) is a technique to locate documents written in one natural language by queries expressed in another language. This project investigates the feasibility of CLIR based on domain-specific bilingual corpus databases.
I built this presentation for Informatica World in 2006. It is all about Data Administration, Data Quality and Data Management. It is NOT about the Informatica product. This presentation was a hit, with standing room only full of about 150 people. The content is still useful and applicable today. If you want to use my material, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to big data scaling. This presents a serious impediment since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge arbitrary errors. With large datasets, error detection becomes overly expensive and complicated especially when considering user-defined functions. Furthermore, a distinctive algorithm is desired to optimize sophisticated error discovery, that requires inequality joins, rather than naïvely parallelizing them. Also, when repairing large errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing.
Applying Data Quality Best Practices at Big Data ScalePrecisely
Global organizations are investing aggressively in data lake infrastructures in the pursuit of new, breakthrough business insights. At the same time, however, 2 out of 3 business executives are not highly confident in the accuracy and reliability of their own Big Data. Regaining that confidence requires utilizing proven data quality tools at Big Data scale.
In this on-demand webinar, discover how to ensure your data lake is a trusted source for advanced business insights that lead to new revenue, cost savings and competitiveness. You will have the opportunity to:
• Compare your organization’s data lake “readiness” against initial findings from our upcoming annual Big Data Trends survey
• Gain insight into where and how to leverage data quality best practices for Big Data use cases
• Explore how a ‘Develop Once, Deploy Anywhere’ approach, including to native Big Data infrastructures such as Hadoop and Spark, facilitates consistent data quality patterns
Data Cleansing introduction (for BigClean Prague 2011)Stefan Urbanek
Presentation from the BigClean event in spring 2011 in Prague. Briefly introduces to data quality, cleansing and shows some examples from existing open data/open government projects.
Data-Ed: Best Practices with the Data Management Maturity ModelData Blueprint
The Data Management Maturity (DMM) model is a framework for the evaluation and assessment of an organization's data management capabilities. The model allows an organization to evaluate its current state data management capabilities, discover gaps to remediate, and strengths to leverage. The assessment method reveals priorities, business needs, and a clear, rapid path for process improvements. This webinar will describe the DMM, its evolution, and illustrate its use as a roadmap guiding organizational data management improvements.
Profile-based Dataset Recommendation for RDF Data Linking Mohamed BEN ELLEFI
With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, the schemas and their data dynamicity (respectively schemas and metadata) over the time. To this extent, identifying suitable datasets, which meet spefic criteria, has become an increasingly important, yet challenging task to support issues such as entity retrieval or semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and ecient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to
the semantic web tradition in dealing with "fnding candidate datasets to link to", where data publishers are used to identify target datasets for interlinking.
While an understanding of the nature of the content of specic datasets is a crucial
prerequisite for the mentioned issues, we adopt in this dissertation the notion of
\dataset prole" | a set of features that describe a dataset and allow the comparison
of dierent datasets with regard to their represented characteristics. Our
rst research direction was to implement a collaborative ltering-like dataset recommendation
approach, which exploits both existing dataset topic proles, as well
as traditional dataset connectivity measures, in order to link LOD datasets into
a global dataset-topic-graph. This approach relies on the LOD graph in order to
learn the connectivity behaviour between LOD datasets. However, experiments have
shown that the current topology of the LOD cloud group is far from being complete
to be considered as a ground truth and consequently as learning data.
Facing the limits the current topology of LOD (as learning data), our research
has led to break away from the topic proles representation of \learn to rank"
approach and to adopt a new approach for candidate datasets identication where
the recommendation is based on the intensional proles overlap between dierent
datasets. By intensional prole, we understand the formal representation of a set of
schema concept labels that best describe a dataset and can be potentially enriched
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
These are the slides for the technical briefing given at ICSE 2021, given by Alessio Ferrari, Liping Zhao, and Waad Alhoshan
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/d rive/1B_5ow3rvS0Qz1y-KyJtlMNnm gmx9w3kJ?usp=sharing
https://colab.research.google.com/d rive/1Xrm0gNaa41YwlM5g2CRYYX cRvpbDnTRT?usp=sharing
The Logical Model Designer - Binding Information Models to TerminologySnow Owl
This presentation demonstrates the functionality provided by the Logical Model Designer (LMD) and Snow Owl tools, which enables terminology to be bound to the Singapore Logical Information Model.
Abstract:
A critical enabler in the journey towards semantic interoperability in Singapore is the Singapore "˜Logical Information Model' (LIM). The LIM is a model of the healthcare information shared within Singapore, and is defined as a set of reusable "˜archetypes' for each clinical concept (e.g. Problem/Diagnosis, Pharmacy Order). These archetypes are then constrained and composed into "˜templates' to support specific use cases.
The Singapore LIM harmonises the semantics of the information structures with the terminology, using multiple types of terminology bindings, including semantic, value domain and constraint bindings. Value domain bindings are defined to both national "˜reference terminology' (used for querying nationally-collated data), as well as to a variety of "˜interface terminologies' used within local clinical systems (required to enforce conformance-compliance rules over message specifications generated from the LIM). To support the diversity of pre-coordination captured in local interface terms, "˜design patterns' are included in the LIM, based on the SNOMED CT concept model. These design patterns represent a logical model of meaning for a specific concept, and allow more than one split between the information model and the terminology model to be represented in a semantically-consistent manner.
This presentation will demonstrate the "˜Logical Model Designer' (LMD) - an Eclipse-based tool that is being used to maintain Singapore's Logical Information Model. A number of features of the LMD tooling will be demonstrated, with a specific focus on how the information structure is bound to the terminology via an interface to the Snow Owl platform. Value Domains are defined as reference sets within Snow Owl and then linked to the information structures defined in the LMD.
Please see our website http://b2i.sg for further information.
Finding knowledge, data and answers on the Semantic Webebiquity
Web search engines like Google have made us all smarter by providing ready access to the world's knowledge whenever we need to look up a fact, learn about a topic or evaluate opinions. The W3C's Semantic Web effort aims to make such knowledge more accessible to computer programs by publishing it in machine understandable form.
<p>
As the volume of Semantic Web data grows software agents will need their own search engines to help them find the relevant and trustworthy knowledge they need to perform their tasks. We will discuss the general issues underlying the indexing and retrieval of RDF based information and describe Swoogle, a crawler based search engine whose index contains information on over a million RDF documents.
<p>
We will illustrate its use in several Semantic Web related research projects at UMBC including a distributed platform for constructing end-to-end use cases that demonstrate the semantic web’s utility for integrating scientific data. We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which searches the Semantic Web for data relevant to a given query ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other semantic web resources.
Resource Description Framework Approach to Data Publication and FederationPistoia Alliance
Bob Stanley, CEO, IO Informatics, explains the utility to RDF as a standard way of defining and redefining data that could have utility in managing life science information.
Semantics and optimisation of the SPARQL1.1 federation extensionOscar Corcho
Presentation done at ESWC2011 for the paper "Semantics and optimisation of the SPARQL1.1 federation extension". Buil-Aranda C, Arenas M, Corcho O. ESWC2011, May 2011, Hersonissos, Greece
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
PhD thesis defense.
This manuscript describes a methodology designed and implemented to realise the recommendation of vocabularies based on the content of a given website. The goal of the proposed approach is to generate vocabularies by reusing existing schemas. The automatic recommendation helps to leverage websites to self-described web entities in the Web of Data; understandable by both humans and machines. In this direction, the implemented approach is wrapped within a broader methodology of turning a website in a machine understandable node by using technologies that have been developed in the scope of the Semantic Web vision. Transforming a website to a machine understandable entity is the first step required by the websites side in order to narrow the gap with web agents and enable the structured content consumption without the need of implementing an Application Programming Interface (API) that would provide read-write functionality. The motivation of the thesis stems from the fact that the data provided via an API is already presented on the corresponding website in most of the cases.
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...Marko Rodriguez
The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exists, supports usage research, and whose instantiation is scalable to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real world instantiation of the presented abstract ontology is a semantic network model of the scholarly community which lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.
Getting the Most out of Transition-based Dependency ParsingJinho Choi
This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a boot- strapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the- art performance with respect to other parsing approaches evaluated on the same data set.
Text as Data: processing the Hebrew BibleDirk Roorda
The merits of stand-off markup (LAF) versus inline markup (TEI) for processing text as data. Ideas applied to work with the Hebrew Bible, resulting in tools for researchers and end-users.
A view on data quality in the real estate domain.
Presented at the LDQ workshop, colocated with SEMANTICS 2017 conference.
see https://2017.semantics.cc/satellite-events/linked-data-quality-assessment-and-improvement-academia-industry
for more details
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Understanding Nidhi Software Pricing: A Quick Guide 🌟
Choosing the right software is vital for Nidhi companies to streamline operations. Our latest presentation covers Nidhi software pricing, key factors, costs, and negotiation tips.
📊 What You’ll Learn:
Key factors influencing Nidhi software price
Understanding the true cost beyond the initial price
Tips for negotiating the best deal
Affordable and customizable pricing options with Vector Nidhi Software
🔗 Learn more at: www.vectornidhisoftware.com/software-for-nidhi-company/
#NidhiSoftwarePrice #NidhiSoftware #VectorNidhi
E-commerce Application Development Company.pdfHornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Looking for a reliable mobile app development company in Noida? Look no further than Drona Infotech. We specialize in creating customized apps for your business needs.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows seven steps methods for delivering their services to their customers. They called it the Software development life cycle process (SDLC).
Requirement — Collecting the Requirements is the first Phase in the SSLC process.
Feasibility Study — after completing the requirement process they move to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
NLP Data Cleansing Based on Linguistic Ontology Constraints
1. NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas (1,3)
Martin Brümmer (1)
Sebastian Hellmann (1,3)
Jens Lehmann (1)
Lazaros Ioannidis (2)
(1) AKSW, University of Leipzig
(2) Aristotle University of Thessaloniki
(3) DBpedia Association
2014-05-27
5. Linguistic workshops & conferences
6. Linguistic workshops & conferences
7. Linguistic LOD Cloud (LLOD Cloud)
8. Problem definition
Linguistic (related) Data
Purpose-driven definition
Increasing data, ontologies & vocabularies
Newcomers → hard to understand the ontologies / follow updates
Validation is essential
Many different pipelines (parsing, annotation, disambiguation, etc.)
Errors are propagated
Partially provided by maintainers (incomplete)
Focus on lemon & NIF (proof of concept)
9. Lemon - Lexicon Model for Ontologies
Models lexicons and machine-readable dictionaries
RDF-native form
Linguistically sound structure (LMF)
Separation of the lexicon and ontology layers
Linking to data categories → arbitrarily complex linguistic descriptions
Principle of least power: the less expressive the language, the more reusable the data
http://lemon-model.net/
10. Lemon - Example

:lexicon a lemon:Lexicon ;
  lemon:entry :Pizza , :Tortilla .

:Pizza a lemon:LexicalEntry ;
  lemon:sense [ lemon:reference
    <http://dbpedia.org/resource/Pizza> ] .

:Tortilla a lemon:LexicalEntry ;
  lemon:sense [ lemon:reference
    <http://dbpedia.org/resource/Tortilla> ] .
11. Lemon - Example (Correct)

:lexicon a lemon:Lexicon ;
  lemon:language "en" ;
  lemon:entry :Pizza , :Tortilla .

:Pizza a lemon:LexicalEntry ;
  lemon:canonicalForm [
    lemon:writtenRep "Pizza"@en ] ;
  lemon:sense [ lemon:reference
    <http://dbpedia.org/resource/Pizza> ] .

:Tortilla a lemon:LexicalEntry ;
  lemon:canonicalForm [
    lemon:writtenRep "Tortilla"@en ] ;
  lemon:sense [ lemon:reference
    <http://dbpedia.org/resource/Tortilla> ] .
12. NIF - NLP Interchange Format
RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
In a nutshell:
Logical formalisation of strings and annotations
Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
Reuse of the RDF tool stack
Decreases development cost for integration
Integrated in:
DBpedia Spotlight, Stanford Core NLP, OpenNLP, RDFaCE, Validator, CoNLL converter, ...
14. NIF - Example

<http://abc.com/doc#char=0,17>
  a nif:Context ;
  a nif:RFC147String ;
  nif:beginIndex 0 ;
  nif:endIndex 17 ;
  nif:isString "My dog likes pizza" .

<http://abc.com/doc#char=2,7>
  a nif:RFC5147String ;
  nif:anchorOf "dog" ;
  nif:referenceContext <http://abc.com/doc#char=0,17> ;
  itsrdf:taClassRef dbo:Animal .
15. NIF - Example (Correct)

<http://abc.com/doc#char=0,18>
  a nif:Context ;
  a nif:RFC5147String ;
  nif:beginIndex "0"^^xsd:nonNegativeInteger ;
  nif:endIndex "18"^^xsd:nonNegativeInteger ;
  nif:isString "My dog likes pizza"^^xsd:string .

<http://abc.com/doc#char=2,7>
  a nif:RFC5147String ;
  nif:beginIndex "2"^^xsd:nonNegativeInteger ;
  nif:endIndex "7"^^xsd:nonNegativeInteger ;
  nif:anchorOf "dog"^^xsd:string ;
  nif:referenceContext <http://abc.com/doc#char=0,18> ;
  itsrdf:taClassRef dbo:Animal .
16. Maintainer validation
Lemon:
Python script
24 tests for structural criteria
too slow on big datasets
poor reporting
NIF:
SPARQL queries
11 tests for common errors
not complete
17. Built on previous work
Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In WWW 2014.
Horizontal, multi-domain data quality assessment
Massive detection of errors for five large-scale LOD data sets
291 vocabularies, independent of their domain or purpose
New contributions:
Relation to OWL reasoners
Test-Driven Data Engineering Ontology
Domain-specific validation
Quickly improving existing validation options provided by maintainers
18. Test-Driven Data Development Methodology
Test case: a data constraint that involves one or more triples
Test suite: a set of test cases for testing a dataset
Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
Fail: error, warning or notice
RDF: basis for both data and schema
Unified model facilitates automatic test case generation
SPARQL serves as the test case definition language
19. Example test case
A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex
Test cases are written in SPARQL:

SELECT ?s WHERE {
  ?s nif:beginIndex ?v1 .
  ?s nif:endIndex ?v2 .
  FILTER ( ?v1 > ?v2 ) }

We query for errors:
Success: the query returns an empty result set
Fail: the query returns results
Every result we get is a violation instance
Timeout / Error: needs further investigation of SPARQL engine capabilities, query syntax or query complexity
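For illustration, a minimal data sketch (hypothetical URI, prefixes as on the previous slides) that this test case would flag:

<http://abc.com/doc#char=9,4>
  a nif:RFC5147String ;
  nif:beginIndex "9"^^xsd:nonNegativeInteger ;
  nif:endIndex "4"^^xsd:nonNegativeInteger .
# beginIndex (9) > endIndex (4), so this resource is bound to ?s
# and reported as a violation instance.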
20. Patterns & Bindings
Data Quality Test Patterns (DQTPs): abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings
Existing library of 20 patterns

SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 ) }

Bindings: a mapping of variables to valid pattern replacements

P1 = nif:beginIndex  |  SELECT ?s WHERE {
P2 = nif:endIndex    |    ?s nif:beginIndex ?v1 .
OP = >               |    ?s nif:endIndex ?v2 .
                     |    FILTER ( ?v1 > ?v2 ) }
21. Test Auto Generators (TAGs)
RDF(S) & OWL (partial) support
Query the schema for supported axioms:

SELECT DISTINCT ?T1 ?T2 WHERE {
  ?T1 owl:disjointWith ?T2 . }

For every result, a binding to a pattern is generated & a test case instantiated (see the sketch below)
Supported axioms at the moment:
RDFS: domain & range
OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated
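A sketch of the test case such a TAG could instantiate for a hypothetical axiom ex:A owl:disjointWith ex:B (the exact query shape RDFUnit generates may differ):

SELECT DISTINCT ?s WHERE {
  ?s a ex:A .
  ?s a ex:B . }
# every resource typed with both disjoint classes is a violation instance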
22. Test Case Elicitation Workflow
23. TD(D)D vs Reasoners
SPARQL test cases detect a subset of the validation errors detectable by an OWL reasoner, limited by:
SPARQL endpoint reasoning support
limitations of the OWL-to-SPARQL translation
SPARQL test cases detect validation errors not expressible in OWL
OWL reasoning is often not feasible on large datasets
Datasets are already deployed and accessible via SPARQL endpoints
Pattern library: a more user-friendly approach to building validation rules than modelling OWL axioms
OWL modelling requires familiarity
non-common validations require manual SPARQL test cases
24. Data Engineering Ontology
Input / output entirely in RDF
Models the methodology in OWL: test suites, test cases, patterns, auto generators
Strict, to serve as a validation layer
Four different levels of error reporting:
simple test case report (success, fail) / enriched with counts
violation instance reporting / enriched with annotations
Reuses dcterms, prov, spin, rlog
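A sketch of what a violation-instance report at the most detailed level could look like, reusing prov and rlog as listed above; the rut: terms here are illustrative placeholders, not necessarily the ontology's actual names:

[] a rut:TestCaseResult ;                        # placeholder class name
   prov:wasGeneratedBy :testExecution1 ;         # hypothetical execution resource
   rut:testCase :nifBeginEndIndexTest ;          # hypothetical test case resource
   rlog:level rlog:ERROR ;
   rlog:resource <http://abc.com/doc#char=9,4> ;
   rlog:message "nif:beginIndex is greater than nif:endIndex" .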
25. Data Engineering Ontology - Definition Generation
26. Data Engineering Ontology - Result Representation
27. Lemon & NIF test case elicitation
The RDFUnit suite implements our methodology
Run on the lemon & NIF ontologies
TAGs could not yet handle some complex owl:Restrictions: owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases
Manual test cases for constraints not captured in OWL

        Total  Domain  Range  Datatype  Card.  Disj.  Func.  I.Func.  Manual
Lemon     182      40     34         1     29     64      3        1      10
NIF        96      42     24         4      6      -     10        -      10
28. Example of a manual Lemon test case
lemon:narrower denotes that one sense of a word is narrower than another; it must never be symmetric or contain cycles.

SELECT DISTINCT ?s WHERE {
  ?s lemon:narrower+ ?narrower .
  ?narrower lemon:narrower+ ?s . }

lemon:language must not have a language tag (RDF 1.1 to the rescue)

SELECT DISTINCT ?s WHERE {
  ?s lemon:language ?v1 .
  FILTER ( lang(?v1) != "" ) }
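A minimal sketch of data (hypothetical sense URIs) that the first test case would flag:

:senseA lemon:narrower :senseB .
:senseB lemon:narrower :senseA .
# :senseA reaches itself via lemon:narrower+, so both senses
# are returned as violation instances.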
29. Example of a manual NIF test case
Ensure that nif:beginIndex & nif:endIndex are correct:

SELECT DISTINCT ?s WHERE {
  ?s nif:anchorOf ?anchorOf ;
     nif:beginIndex ?beginIndex ;
     nif:endIndex ?endIndex ;
     nif:referenceContext
       [ nif:isString ?referenceString ] .
  # SPARQL's SUBSTR is 1-based while NIF offsets are 0-based,
  # hence the +1 adjustment
  BIND (SUBSTR(?referenceString,
               ?beginIndex + 1,
               (?endIndex - ?beginIndex)) AS ?test)
  FILTER ( str(?test) != str(?anchorOf) ) }
30. Evaluation Datasets

lemon datasets (Name: description; ontology; type):
LemonUby Wiktionary EN: conversion of the English Wiktionary into the UBY-LMF model (lemon, UBY-LMF; Dictionary)
LemonUby Wiktionary DE: conversion of the German Wiktionary into the UBY-LMF model (lemon, UBY-LMF; Dictionary)
LemonUby Wordnet: conversion of the Princeton WordNet 3.0 into the UBY-LMF model (lemon, UBY-LMF; WordNet)
DBpedia Wiktionary: conversion of the English Wiktionary into lemon (lemon; Dictionary)
QHL: multilingual translation graph from more than 50 lexicons (lemon; Dictionary)

NIF datasets (Name: description; ontology; type):
Wikilinks: sample of 60,976 randomly selected phrases linked to Wikipedia articles (NIF; NER)
DBpedia Spotlight dataset: 58 manually NE-annotated natural language sentences (NIF; NER)
KORE 50 evaluation dataset: 50 NE-annotated natural language sentences from the AIDA corpus (NIF; NER)
News-100: 100 manually annotated German news articles (NIF; NER)
RSS-500: 500 manually annotated sentences from 1,457 RSS feeds (NIF; NER)
Reuters-128: 128 manually curated news articles (NIF; NER)
32. Conclusion
Extended a previously introduced methodology for test-driven quality assessment
Data engineering ontology
Devised 277 test cases for NLP datasets using the lemon and NIF vocabularies
Revealed a substantial number of errors in lemon & NIF datasets
Future directions:
extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)
automatic dependencies between test cases
wrap RDFUnit for NLP services (integrated in NIF)
33. Thank you!
Dimitris Kontokostas
With kind support of John McCrae (Lemon model)
http://rdfunit.aksw.org
http://github.com/AKSW/RDFUnit
#eswc2014kontokostas