This position paper discusses the potential of linked data best practices for providing metadata that documents domain-specific resources created through lengthy acquisition-processing pipelines. It argues that resource selection, namely the process of choosing a set of resources suitable for a given analysis or design purpose, must be supported by a deep comparison of their metadata. The semantic similarity proposed in our previous work is discussed for this purpose, and the main issues in scaling it up to the web of data are introduced. These issues matter beyond the re-engineering of our similarity, since they largely apply to any tool that exploits information made available as linked data. A research plan and an exploratory phase addressing the presented issues are described, noting the lessons we have learnt so far.
DataONE Education Module 09: Analysis and Workflows (DataONE)
Lesson 9 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
Linked Data Quality Assessment – daQ and Luzzu (jerdeb)
Presentation at the Ontology Engineering Group at UPM related to Linked Data Quality and the work done in the Enterprise Information System Group at Universität Bonn
Engaging Information Professionals in the Process of Authoritative Interlinki... (Lucy McKenna)
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
Open domain Question Answering System - Research project in NLP (GVS Chaitanya)
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards such an ambitious goal is to deal with natural language, so that the computer can understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. Within this discipline, Question Answering can be defined as the task that, given a question formulated in natural language, aims at finding one or more concise answers. Improvements in technology and the explosive demand for better information access have reignited interest in Q&A systems. The wealth of information on the web makes it an interactive resource for seeking quick answers to factual questions such as “Who is the first American to land in space?” or “What is the second tallest mountain in the world?”, yet today’s most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. A Q&A system aims to develop techniques that go beyond retrieval of relevant documents in order to return exact answers to natural language factoid questions.
With the continuously increasing number of datasets that are published in the Web of Data and form part of the Linked Open Data Cloud, it becomes more and more essential to identify resources that correspond to the same real-world object in order to interlink web resources and set the basis for large-scale data integration. This requirement becomes apparent in a multitude of domains ranging from science (marine research, biology, astronomy, pharmacology) to semantic publishing and cultural domains. In this context, instance matching is of crucial importance.
It is, however, essential at this point to develop, along with instance and entity matching systems, benchmarks that determine the weak and strong points of those systems as well as their overall quality, in order to support users in deciding which system to use for their needs. Hence, well-defined, good-quality benchmarks are important for comparing the performance of the developed instance matching systems.
In this tutorial we aim at:
- Discussing the state-of-the-art instance matching benchmarks
- Presenting the benchmark design principles
- Providing an analysis of the performance results of instance matching systems for the presented benchmarks
- Presenting the research directions that should be exploited for the creation of novel benchmarks to answer the needs of the Linked Data paradigm.
Please click here for the Tutorial web-page: http://www.ics.forth.gr/isl/BenchmarksTutorial/
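As a minimal sketch of how performance results of instance matching systems are commonly scored, the following assumes that a system outputs pairs of resource identifiers and that the benchmark supplies a gold standard of correct pairs. The function name `evaluate_matches` and the sample identifiers are hypothetical and are not taken from the tutorial.

```python
def evaluate_matches(predicted, gold):
    """Return (precision, recall, f1) for predicted match pairs against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers: one of the two predicted pairs is correct.
predicted_pairs = [("a1", "b1"), ("a2", "b3")]
gold_pairs = [("a1", "b1"), ("a2", "b2")]
print(evaluate_matches(predicted_pairs, gold_pairs))
```

Treating matches as a set of pairs keeps the scoring independent of the order in which a system reports them.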
Practical Machine Learning - Part 1 contains:
- Basic notations of ML (what tasks are there, what is a model, how to measure performance)
- A couple of examples of problems and solutions (taken from previous work)
- A brief presentation of open-source software used for ML (R, scikit-learn, Weka)
ODSC East 2017: Data Science Models For Good (Karry Lu)
Abstract: The rise of data science has been largely fueled by the promise of changing the business landscape: enhancing one's competitive advantage, increasing business optimization and efficiency, and ultimately delivering a better bottom line. This promise reaches across sectors as machine learning methods are getting better, data access continues to grow, and computation power is easily accessible. However, because the practice of doing data science can be expensive, there is a danger that this so-called promise of data science may only be available to the most well-resourced organizations with sophisticated data capabilities and staff. For the past five years, DataKind has been working to ensure social change organizations too have access to data science, teaming them up with data scientists to build machine learning and artificial intelligence solutions that aim to reduce human suffering. In doing so, DataKind has learned what it takes to apply data science in the social sector and the many applications it has for creating positive change in the world. This session presents DataKind projects showcasing the wide range of applications of ML/AI for social good: using satellite imagery and remote sensing techniques to detect wheat farm boundaries and protect livelihoods in Ethiopia; leveraging NLP to automate the time-consuming process of synthesizing findings from academic studies to inform conservation efforts; classifying text records to better understand human rights conditions across the world; and using machine learning to reduce traffic fatalities in U.S. cities. Learn about some of the latest breakthroughs and findings in the data science for social good space, and learn how you can get involved.
The Next Generation of AI-powered Search (Trey Grainger)
What does it really mean to deliver an "AI-powered Search" solution? In this talk, we’ll bring clarity to this topic, showing you how to marry the art of the possible with the real-world challenges involved in understanding your content, your users, and your domain. We'll dive into emerging trends in AI-powered Search, as well as many of the stumbling blocks found in even the most advanced AI and Search applications, showing how to proactively plan for and avoid them. We'll walk through the various uses of reflected intelligence and feedback loops for continuous learning from user behavioral signals and content updates, also covering the increasing importance of virtual assistants and personalized search use cases found within the intersection of traditional search and recommendation engines. Our goal will be to provide a baseline of mainstream AI-powered Search capabilities available today, and to paint a picture of what we can all expect just on the horizon.
Doing for Data what Pubmed did for literature: DATS, a model for dataset description, dataset indexing and data discovery.
Googleslides [https://goo.gl/cd5KKa] or Slideshare [https://goo.gl/c8DH5N]
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing. (Lifeng (Aaron) Han)
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
Chinese Character Decomposition for Neural MT with Multi-Word Expressions (Lifeng (Aaron) Han)
ADAPT seminar series. June 2021
research papers @NoDaLiDa2021:the 23rd Nordic Conference on Computational Linguistics
& COLING20:MWE-LEX WS
Bonus takeaway:
AlphaMWE multilingual corpus
with MWEs
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles (Besnik Fetahu)
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse are hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated through the configuration of techniques for resource sampling from datasets, topic extraction from reference datasets and their ranking based on graphical models. To enable a good trade-off between scalability and accuracy of generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparatively small sample sizes (10%) and outperforms established topic modelling approaches.
The paper trail: steps towards a reference model for the metadata ecology (R. John Robertson)
The paper trail: steps towards a reference model for the metadata ecology, presentation at ~CoLIS5 workshop. Presentation with Jane Barton. http://mwi.cdlr.strath.ac.uk/Colisworkshop.htm
Archiving- from June 2005.
Please note that this presentation is currently all rights reserved until I contact the other author.
This workshop was presented in Riyadh, SA on 21-22 Jan 2019, in collaboration with the Riyadh Data Geeks group.
To learn more about the workshop please see this website:
http://bit.ly/2Ucjmm5
Presentation of the main IR models
Presentation of our submission to TREC KBA 2014 (Entity oriented information retrieval), in partnership with Kware company (V. Bouvier, M. Benoit)
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it is a popular research topic. Data mining techniques are used to mine academic social networks. In this paper, we present an efficient frequent itemset mining technique for academic social networks. The proposed framework first processes the research documents, and then the enhanced frequent itemset mining is applied to find the strength of the relationships between researchers. The proposed method is expected to be faster than older algorithms and to use less main memory for computation.
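To illustrate the general idea of frequent itemset mining on co-authorship data, here is a minimal sketch that uses brute-force support counting. It is not the enhanced technique proposed in the paper, whose details are not given here. The function name `frequent_itemsets`, the `max_size` parameter and the sample author lists are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of size 1..max_size and keep those seen in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # deduplicate and fix an order for stable keys
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical data: each transaction is the author list of one paper.
papers = [
    ["ana", "bo"],
    ["ana", "bo", "cy"],
    ["bo", "cy"],
    ["ana"],
]
for itemset, support in sorted(frequent_itemsets(papers, min_support=2).items()):
    print(itemset, support)
```

In this reading, the support of an author pair (the number of papers the two share) serves as a simple proxy for the strength of their relationship.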
Data citation supports attribution, provenance, discovery, and persistence. It is not (and should not be) sufficient for all of these things, but it's an important component. In the last 2 years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices.
This session, presented as part of the Program on Information Science seminar series, examines data citation from an information lifecycle approach: what are the use cases, requirements and research opportunities? The session will also discuss emerging infrastructure and standardization efforts around data citation.
A number of principles have emerged for citation. The most central is that data citations should be treated consistently with citations to other objects: data citations should at least provide the minimal core elements expected in other modern citations, should be included in the references section along with citations to other elements, and should be indexed in the same way.
Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science -- revealing connections among evidence bases that are not now visible through citations of articles.
Talk on Deep Learning for Natural Language Processing presented by Thomas Delteil and Miguel González-Fierro at the Open Data Science Conference (ODSC) in London, 2016.
Analyzing Arguments during a Debate using Natural Language Processing in Python (Abhinav Gupta)
This presentation will guide you through the application of Python NLP Techniques to analyze arguments during a debate and define a strategy to figure out the winner of the debate on the basis of strength and relevance of the arguments.
This is made for PyCon India 2015.
For details : https://in.pycon.org/cfp/pycon-india-2015/proposals/analyzing-arguments-during-a-debate-using-natural-language-processing-in-python/
Contact me : abhinav.gpt3@gmail.com
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit... (Rizwan Habib)
A Path To Unsupervised Learning Through Adversarial Networks - (Soumith Chintala, Researcher at Facebook AI Research)
Soumith Chintala is a Researcher at Facebook AI Research, where he works on deep learning, reinforcement learning, generative image models, agents for video games and large-scale high-performance deep learning. He holds a Masters in CS from NYU, and spent time in Yann LeCun's NYU lab building deep learning models for pedestrian detection, natural image OCR, depth-images among others.
Soumith will go over generative adversarial networks, a particular way of training neural networks to build high quality generative models. The talk will take you through an easy to follow timeline of the research and improvements in adversarial networks, followed by some future directions, as well as applications.
Provides a basic introduction to Natural Language Processing (NLP), its properties, and some common techniques such as stemming, tokenization, bag-of-words, stripping, and n-grams
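The techniques named above can be sketched with the Python standard library alone. This is an illustrative sketch and not material from the presentation: `tokenize`, `toy_stem`, `bag_of_words` and `ngrams` are hypothetical helper names, and `toy_stem` applies a deliberately naive suffix rule rather than an established stemmer such as Porter's.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs, which also strips punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token):
    """Remove one common English suffix if at least three characters remain (toy rule set)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(tokens):
    """Map each token to the number of times it occurs, ignoring order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Cats are chasing mice!")
stems = [toy_stem(t) for t in tokens]
print(tokens)
print(stems)
print(bag_of_words(stems))
print(ngrams(tokens, 2))
```

The naive rule turns "chasing" into "chas" rather than "chase", which shows why practical systems rely on more careful stemming or lemmatization.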
Homoeopathy is a science, and this presentation explores the methodology behind understanding and analysing the disease and prescribing medicine to the patient. It defines the key milestones that, if followed diligently by any physician, can help arrive at a remedy that may prove to be a panacea for all patient complaints.
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen... (DuraSpace)
“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories.” Curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 2: “Metadata and Repository Services for Research Data Curation.”
Presented by Declan Fleming, Chief Technology Strategist, Arwen Hutt, Metadata Librarian, and Matt Critchlow, Manager of Development and Web Services, UC San Diego Library.
This presentation was provided by Carolyn Hansen of the University of Cincinnati during the NISO Training Thursday event, Metadata and the IR, held on Thursday, February 23, 2017.
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc... (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
Duraspace Hot Topics Series 6: Metadata and Repository Services (Matthew Critchlow)
Presented by Declan Fleming, Arwen Hutt, and Matt Critchlow. The second in a three part Webinar series on Research Data Curation at UC San Diego, as part of the larger Research Cyberinfrastructure initiative.
The explosion in growth of the Web of Linked Data has provided, for the first time, a plethora of information in disparate locations, yet bound together by machine-readable, semantically typed relations. Utilisation of the Web of Data has been, until now, restricted to the members of the community, eating their own dogfood, so to speak. To the regular web user browsing Facebook and watching YouTube, this utility is yet to be realised. The primary factor inhibiting uptake is the usability of the Web of Data, where users are required to have prior knowledge of elements from the Semantic Web technology stack. Our solution to this problem is to hide the stack, allowing end users to browse the Web of Data, explore the information it contains, discover knowledge, and use Linked Data. We propose a template-based visualisation approach where information attributed to a given resource is rendered according to the rdf:type of the instance.
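A template-based approach of this kind can be sketched as a lookup from the rdf:type of a resource to a display template. The code below is a hypothetical illustration and not the authors' implementation: the `TEMPLATES` table, the `render` function, the template strings and the example resource are all assumptions, while the rdf:type, FOAF and Dublin Core terms IRIs are the standard ones.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical templates keyed by the IRI of the resource's rdf:type.
TEMPLATES = {
    FOAF + "Person": "Person: {name}",
    FOAF + "Document": "Document: {title}",
}
DEFAULT_TEMPLATE = "Resource: {uri}"


def render(uri, properties):
    """Render a resource with the template selected by its rdf:type, if one is registered."""
    fields = {"uri": uri}
    for predicate, value in properties.items():
        # Use the local name (the part after the last '#' or '/') as the field name.
        fields[re.split(r"[#/]", predicate)[-1]] = value
    template = TEMPLATES.get(properties.get(RDF_TYPE), DEFAULT_TEMPLATE)
    try:
        return template.format(**fields)
    except KeyError:
        # A property required by the template is missing, so fall back to the generic view.
        return DEFAULT_TEMPLATE.format(uri=uri)


alice = {RDF_TYPE: FOAF + "Person", FOAF + "name": "Alice"}
print(render("http://example.org/alice", alice))
```

Keeping the templates in a table keyed by type means that support for a new class of resource can be added without touching the rendering logic, which is one way to hide the Semantic Web stack from end users.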
This presentation was provided by Clara Llebot of Oregon State University, during the NISO hot topic virtual conference "Effective Data Management," which was held on September 29, 2021.
Talk given by prof. T.K. Prasad at the workshop on Semantics in Geospatial Architectures: Applications and Implementation. The workshop was held from October 28-29, 2013 at Pyle Center (702 Langdon Street, Madison, WI), University of Wisconsin-Madison.
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with the ever-growing data. This paper reports the findings from a survey to metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized these elements into 9 categories. Findings from the data included that the highest counts of element occurred in the descriptive category and many of them overlapped with DC elements. This pattern also repeated in the elements co-occurred in different standards. A small number of semantically general elements appeared across the largest numbers of standards while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discussed implications of the findings in the context of metadata portability and infrastructure and pointed out that large, complex standards and widely varied naming practices are the major hurdles for building a metadata infrastructure.
Semantic Similarity and Selection of Resources Published According to Linked Data Best Practice
1. Semantic Similarity and Selection
of Resources Published According
to Linked Data Best Practice
Riccardo Albertoni,
Monica De Martino
CNR-IMATI-GE
Institute of Applied Mathematics and Information Technologies
(Dept of Genoa)
Consiglio Nazionale delle Ricerche, Italy
The 6th International Workshop on Ontology Content (OnToContent 2010)
Oct 28, 2010 Crete Part of the OTM (OTM'2010)
2. Outline
• Resource Selection, Semantic Similarity and Linked
data.
▫ Why does Resource Selection matter?
▫ Real example:
Complex metadata to document resources
Linked data paves the way for sharing complex metadata
▫ Semantic Similarity as base for resource selection
Nice features as Asymmetry & Context-Dependence
• Scaling Semantic similarity up to Web of Data
▫ Issues & research plan/direction
▫ Exploratory phase with real data from the web of data
Are the issues we consider relevant? In which varieties and
shapes do the issues occur in real data?
▫ Lessons learnt from the exploratory phase
3. Resource Selection:
• why does it matter?
▫ Effective sharing and reuse of data are still
desiderata of many scientific and industrial
domains where the selection of tailored and
high-quality data is a necessary condition to
provide successful and competitive services
• Resource selection
▫ in order to select the resources which fit a given
problem/task we rely on an analysis of the
metadata documenting the resources
4. Real Example
[Diagram labels: Sea Trial, Acquisition, Preprocessing, Integration, Models, Analysis, Web server]
courtesy of NATO Undersea Research Centre (NURC),
Example developed in NURC Research Assistance
granted to R. Albertoni (2008)
Short-term perspective: data are collected and
processed for well-planned purposes (i.e., sea trial experiments)
5. Potential new “customers” for sea trial data
• NATO agencies/nations asking for previously collected data
• New scientists arriving at NURC
▫ They want to access data in order to produce models with their own
approaches and to compare the results with models already
produced at NURC (benchmarking)
• Scientists/agencies investigating how phenomena
have changed over a long period
▫ They are interested in data collected in the past
• Scientists/agencies planning a new sea trial
▫ It can be useful to know what has been collected in previous sea trials
and how the data have been processed
Data reusability: unplanned use of data,
long-term perspective
6. Potential customers’ point of view
These customers were not involved in the sea
trials; thus, when searching for data, they
wonder:
• Is the data collected at NURC suitable for the
application I have in mind?
• Is the data reliable enough?
To answer these macro questions
• Users need details about how the data has
been acquired, pre-processed, integrated and
analyzed, and even about who was in charge
of which part…
7. Metadata Complexity in Real World - Linked data helps in sharing complex metadata
[Diagram] Documented entities: Models, Data, Processes, People, Sensors, Characteristics
Annotations: sensor's responsible party; sensor settings; parameters and choices made during the preprocessing; analysis applied, parameters, etc.
Vocabularies: APO, FOAF, ISO19115 Core, Test Plan, Dublin Core, SensorML
8. Problem: keep the bar balanced!
• Semantic similarity as metadata analysis, to support users
in comparing the features of candidate resources
• A huge amount of ontology-driven metadata describing
complex features as linked data
9. Semantic similarity as a metadata
analysis tool
• Instance similarity is fundamental to support detailed
comparison, ranking and selection of resources through
their ontology-driven metadata
▫ Albertoni R., De Martino M., Asymmetric and context-dependent
semantic similarity among ontology instances, Journal on Data
Semantics X, Springer Verlag, (2008).
• It explicitly addresses
▫ Context, as an explicit parameterization of the similarity assessment
Context specifies which features to consider and how
▫ Asymmetry, to highlight containment between resources
sim(A,B) ranges in [0,1] and measures how many
features A shares with B out of all of A's features
If features(A) are contained in features(B): sim(A,B)=1 and
sim(B,A)<1
• Limitation: it was designed for a locally stored,
ontology-driven repository with one well-defined schema, not for linked data
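The asymmetry described on this slide can be illustrated with a minimal sketch. It assumes that features are plain sets of labels and that similarity is simply the shared fraction of A's features; the measure in the cited JoDS paper is context-dependent and richer than this. The function name and the sample feature labels are hypothetical.

```python
from fractions import Fraction


def containment_similarity(features_a, features_b):
    """Fraction of A's features that B also has: asymmetric, in [0, 1]."""
    a, b = set(features_a), set(features_b)
    if not a:
        # Assumption: an instance without features shares nothing.
        return Fraction(0)
    return Fraction(len(a & b), len(a))


# Hypothetical feature sets: dataset_a is contained in dataset_b.
dataset_a = {"sensor", "test plan"}
dataset_b = {"sensor", "test plan", "preprocessing"}

print(containment_similarity(dataset_a, dataset_b))  # 1
print(containment_similarity(dataset_b, dataset_a))  # 2/3
```

Because the denominator is the feature count of the first argument only, swapping the arguments changes the result, which is what lets the measure signal containment.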
10. How to make semantic similarity
scale up to the web of data? 1/2
Identified issue: non-authoritative metadata, i.e., metadata
published by actors who are neither the resource
producers nor the owners
WHEN: metadata documents resources that
have been re-elaborated or reviewed by third
parties
Research plan: synergies with semantic
web indexes (e.g., SINDICE) to retrieve
non-authoritative features
Identified issue: heterogeneous metadata, i.e., metadata
provided according to different, sometimes
interlinked, more often overlapping metadata
vocabularies
WHEN: metadata for a resource is provided by
stakeholders with different fields of competency,
who may use different vocabularies that are not
always independent
Research plan: deploying schema- and
entity-level consolidation using both
explicit metadata statements and the mining of
implicit equivalences through co-occurring
resource annotations
11. How to make semantic similarity
scale up to the web of data? 2/2
Identified issue: non-consistently identified metadata,
namely metadata in which the same
resource has different identifiers in distinct
metadata sets
WHEN: two actors in the pipeline independently
document the same resource at different
stages of the pipeline
Research plan:
• reasoning techniques to be
applied to web datasets, e.g., to
smush fragments of
distributed metadata
• scripts to interlink
resources relying on a-priori
knowledge about how the
datasets originated
Identified issue: efficiency and computational issues: in
the longer term, an accurate similarity
assessment might become computationally
prohibitive
WHEN: the number of resources discovered and
of features considered increases
Research plan:
• caching of intermediate
comparisons
• techniques to prune
comparisons according to a
specified application context
• algorithms for efficient
parallelization
12. Exploratory phase
• Facing the aforementioned issues is a very
challenging research plan!
• Let’s get first-hand experience of the varieties introduced
by data providers
▫ Requirements:
Real metadata published as linked data
Provided by third parties
• Linked data has huge potential for documenting
resources produced in complex pipelines, but it is not yet
a common practice
▫ We considered a simpler domain (researchers and
their publications)
Semantic Web Dog Food-SWDF
(http://data.semanticweb.org/)
DBLP in RDF (http://dblp.l3s.de/d2r)
13. Instance similarity: redesigned
prototype
• A test bed for experimenting with and investigating the
aforementioned issues
• Extensions
▫ Extended the notion of context to include
namespaces, so that properties from different
RDF schemas can be considered
▫ Updated the ontology model, moving from the
Protege API to the RDF model
JENA Reasoner and SPARQL
14. Context:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
[foaf:Person]->{{},{(foaf:made, Count)}}
We compare researchers by their URIs.
Researcher X (URI(X)=A):
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
Researcher Y (URI(Y)=B):
<B> rdfs:label "B descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <vbn> ;
foaf:made <paperC>;
foaf:made <paperF>.
Under this context, two researchers are similar to the extent that they
have a similar number of publications (X has 3, Y has 2):
SIM(X,Y)= SIM(A,B)= 2/max(3,2)=2/3
SIM(Y,X)= SIM(B,A)= 3/max(3,2)=1
See R. Albertoni, M. De Martino,
JODS X, 2008 for more complex similarity
assessments.
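The arithmetic on this slide can be reproduced with a short sketch. It assumes triples are plain (subject, property, object) tuples and it mirrors only the figures shown here, SIM(X,Y) = count(Y)/max(count(X), count(Y)); the Count operator in the cited paper may be defined more generally. The function names are hypothetical.

```python
from fractions import Fraction

FOAF_MADE = "http://xmlns.com/foaf/0.1/made"

# The foaf:made statements from the slide; <A>, <paperC>, ... kept as short names.
SLIDE_TRIPLES = [
    ("A", FOAF_MADE, "paperC"),
    ("A", FOAF_MADE, "paperD"),
    ("A", FOAF_MADE, "paperH"),
    ("B", FOAF_MADE, "paperC"),
    ("B", FOAF_MADE, "paperF"),
]


def count_values(triples, subject, prop):
    """Number of distinct objects that `subject` has for `prop`."""
    return len({o for s, p, o in triples if s == subject and p == prop})


def count_similarity(triples, x, y, prop=FOAF_MADE):
    """SIM(x, y) as computed on the slide: count(y) / max(count(x), count(y))."""
    cx = count_values(triples, x, prop)
    cy = count_values(triples, y, prop)
    if max(cx, cy) == 0:
        # Assumption: with no values on either side, report zero similarity.
        return Fraction(0)
    return Fraction(cy, max(cx, cy))


print(count_similarity(SLIDE_TRIPLES, "A", "B"))  # 2/3
print(count_similarity(SLIDE_TRIPLES, "B", "A"))  # 1
```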
15. Non-Authoritative Metadata - Example
Let’s compare Giovanni and Renaud starting from their URIs in
DBLP
A= http://dblp.l3s.de/…../Giovanni_Tummarello
B= http://dblp.l3s.de/…./Renaud_Delbru
URI(Giovanni)=A:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
URI(Renaud)=B:
<B> rdfs:label "B descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <vbn> ;
foaf:made <paperC>;
foaf:made <paperF>.
But we know that Semantic Web Dog Food (SWDF) might provide more information about
Giovanni and Renaud.
What if SWDF provides an additional paper for Renaud,
a paper which Giovanni did not co-author?
Then SIM(Giovanni, Renaud)=1 instead of 2/3.
16. Non-Authoritative Metadata - SINDICE
IDEA: query SINDICE by the researchers’ URIs A and B to get the RDF
fragments pertaining to Giovanni and Renaud
• URIs, not names, as keywords: different people might share
the same name, so URIs are in principle more precise
Result: you get RDF fragments from DBLP only! None from
Semantic Web Dog Food, which we know provides further information.
[Diagram: the SWDF researchers’ URIs and the DBLP researchers’ URIs
do not overlap!]
First lesson: non-authoritative metadata and non-consistently
identified metadata are tightly interrelated in real practice. To
deal effectively with the former issue, we often have to take care of the
latter.
17. How to move on?
IDEA: if SWDF added statements like
<http://data.semanticweb.org/person/name-[middlename]-[familyname]>
owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/name_[middle-name]_familyname>
(DBLP URI --- owl:sameAs --- SWDF URI)
the SWDF fragments would have been retrieved by SINDICE.
[We are implicitly assuming some reasoning,
e.g.:
(X owl:sameAs X1) and (X1 rel Z) -> (X rel Z)
]
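One way to obtain the effect of the rule above is to "smush" the data, that is, to rewrite every URI to a single representative of its owl:sameAs equivalence class. The sketch below swaps forward-chaining of the rule for this canonicalization, using a small union-find over plain triple tuples. The prefixed names (swdf:giovanni, dblp:Giovanni, and the paper names) are hypothetical stand-ins, since the slides elide the real URIs.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_MADE = "http://xmlns.com/foaf/0.1/made"


def smush(triples):
    """Rewrite all terms to one representative per owl:sameAs class."""
    parent = {}

    def find(term):
        # Union-find lookup with path halving.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    # First pass: merge the classes declared by owl:sameAs statements.
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    # Second pass: rewrite the remaining statements onto representatives.
    return {(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAME_AS}


# Hypothetical URIs standing in for the SWDF and DBLP identifiers.
example = [
    ("swdf:giovanni", OWL_SAME_AS, "dblp:Giovanni"),
    ("dblp:Giovanni", FOAF_MADE, "dblp:paperC"),
    ("swdf:giovanni", FOAF_MADE, "swdf:paperE"),
]
merged = smush(example)
```

After smushing, both foaf:made statements hang off the same subject, so a similarity computed from the DBLP URI also sees the SWDF publication.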
18. Heterogeneous metadata - Example
RDF for Giovanni in DBLP:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
RDF for Giovanni in SWDF:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc>.
<paperE> foaf:maker <A>.
<paperB> foaf:maker <A>.
Here the problem does not arise from different RDF
schemas: both DBLP and SWDF use FOAF.
foaf:made is owl:inverseOf foaf:maker, but you cannot know it unless you
dereference/load the FOAF schema.
Second lesson: the ontologies, schemas and properties in the context must be
dereferenced just like the entities’ URIs to make the semantics of the
properties exploitable.
19. We must be careful when dereferencing
• Dereferencing schemata and URIs
▫ is extremely slow
▫ adds many RDF statements which might be
useless for semantic similarity assessment
Info not pertaining to the specified context
▫ ends up with a huge amount of derived RDF
statements, which might worsen efficiency and
computational problems
Third lesson: specific, context-driven policies for dereferencing URIs
and retrieving RDF fragments should be deployed in order to ease
efficiency and computational problems.
For example: dereference only the properties mentioned in the context, or
consider only the RDF fragments returned by SINDICE that explicitly
refer to schemas mentioned in the context.
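A context-driven policy of this kind could look like the following sketch. It assumes the context is reduced to a set of property URIs and that retrieved fragments are plain triple tuples; both function names and the sample data are hypothetical, and no real dereferencing or SINDICE call is made.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def namespace_of(uri):
    """Namespace part of a URI: everything up to the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


def schemas_to_dereference(context_properties):
    """Only the schemas of properties named in the context get dereferenced."""
    return {namespace_of(p) for p in context_properties}


def restrict_to_context(triples, context_properties):
    """Drop retrieved statements whose property is not in the context."""
    return [t for t in triples if t[1] in context_properties]


# The context from the earlier slide only mentions foaf:made.
context = {FOAF + "made"}
fragment = [
    ("A", RDFS_LABEL, "A descr"),
    ("A", FOAF + "primaryTopic", "xzc"),
    ("A", FOAF + "made", "paperC"),
    ("A", FOAF + "made", "paperD"),
]
kept = restrict_to_context(fragment, context)
```

Filtering before reasoning keeps the set of derived statements small, at the cost of missing properties that only become relevant through schema axioms not yet loaded.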
20. How to move on?
RDF for Giovanni in DBLP:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
RDF for Giovanni in SWDF:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc>.
<paperE> foaf:maker <A>.
<paperB> foaf:maker <A>.
Inferred statements:
<A> foaf:made <paperE>.
<A> foaf:made <paperB>.
This assumes we have dereferenced foaf:maker, or
loaded into the reasoner a rule saying (P foaf:maker X) ->
(X foaf:made P)
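The inverse-property rule can be sketched without a reasoner. The example assumes plain triple tuples and hard-codes the foaf:made / foaf:maker pair; in the prototype described in these slides this would presumably be left to the reasoner after loading the FOAF schema. The function name is hypothetical.

```python
FOAF_MADE = "http://xmlns.com/foaf/0.1/made"
FOAF_MAKER = "http://xmlns.com/foaf/0.1/maker"


def apply_inverse(triples, prop, inverse):
    """Add (o inverse s) for every (s prop o), and the other way round."""
    closed = set(triples)
    for s, p, o in triples:
        if p == prop:
            closed.add((o, inverse, s))
        elif p == inverse:
            closed.add((o, prop, s))
    return closed


# DBLP states foaf:made, SWDF states foaf:maker (as on the slide).
giovanni = [
    ("A", FOAF_MADE, "paperC"),
    ("A", FOAF_MADE, "paperD"),
    ("A", FOAF_MADE, "paperH"),
    ("paperE", FOAF_MAKER, "A"),
    ("paperB", FOAF_MAKER, "A"),
]
closed = apply_inverse(giovanni, FOAF_MAKER, FOAF_MADE)
made_by_a = {o for s, p, o in closed if s == "A" and p == FOAF_MADE}
print(len(made_by_a))  # 5
```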
21. Non-consistently identified metadata
What if the same publication is provided by both DBLP and SWDF?
E.g., DBLP:paperC and SWDF:paperB are two URIs for the same paper:
we count it twice.
RDF for Giovanni in DBLP + RDF for Giovanni in SWDF:
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <DBLP:paperC>;
foaf:made <DBLP:paperD>;
foaf:made <DBLP:paperH>.
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc>.
<A> foaf:made <SWDF:paperE>.
<A> foaf:made <SWDF:paperB>.
Fourth lesson: non-consistently identified metadata is a recursive
problem. Consolidating researchers without consolidating papers leads
to wrong similarity results. We must be sure that the entities and properties in
the similarity context have been properly consolidated before applying
instance similarity.
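The double counting can be made concrete with a small sketch. It assumes an alias table mapping each duplicate URI to the URI chosen as canonical; the single entry below encodes the slide's "e.g." (DBLP:paperC and SWDF:paperB denote the same paper) and is illustrative only. The function names are hypothetical.

```python
# Publications attributed to Giovanni after merging DBLP and SWDF.
papers = ["DBLP:paperC", "DBLP:paperD", "DBLP:paperH", "SWDF:paperE", "SWDF:paperB"]

# Assumed equivalence taken from the slide's example: alias -> canonical URI.
same_paper = {"SWDF:paperB": "DBLP:paperC"}


def canonical(uri, aliases):
    """Follow alias links until a URI with no further alias is reached."""
    seen = set()
    while uri in aliases and uri not in seen:
        seen.add(uri)
        uri = aliases[uri]
    return uri


def consolidated_count(uris, aliases):
    """Count distinct resources rather than distinct identifiers."""
    return len({canonical(u, aliases) for u in uris})


print(len(set(papers)))                        # 5
print(consolidated_count(papers, same_paper))  # 4
```

The count drops from 5 to 4 once the papers are consolidated, so any count-based similarity computed before consolidation would be inflated.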
22. Conclusion (I)
• Linked data best practice and our semantic
similarity
▫ have good potential to support data selection for
complex domain resources
• But scaling semantic similarity up to the web of data
means dealing with
▫ Non-authoritative metadata
▫ Heterogeneous metadata
▫ Non-consistently identified metadata
▫ Efficiency and computational issues
23. Conclusion (II)
• The exploratory phase shows
▫ All the mentioned issues arise even in a very simple
scenario of semantic similarity assessment
▫ It is pivotal to have first-hand experience with real
data to discover the shapes the issues might assume
• Consideration
▫ The problems we found are not exclusive to similarity
assessment
We suspect these issues arise whenever you try to
process information published as linked data in
order to mine new facts/info from the published
data
24. Do not hesitate to email me (Albertoni@ge.imati.cnr.it)
if you have offline questions
Editor's Notes
The enabling factors for establishing the web of data as the preferred selling point for complex resources are: (i) linked data best practice relies on lightweight ontologies encoded in the Resource Description Framework (RDF), which can be exploited to provide ontology-driven metadata. This kind of metadata takes advantage of the Open World Assumption, enabling the adoption of complex, domain-specialized and independently developed metadata vocabularies, which are pivotal for documenting resources produced in complex and loosely coupled pipelines; (ii) linked data best practice relies on content negotiation over the standard HTTP protocol; it does not propose a brand new platform replacing the existing technologies. Rather, it can be placed side by side with domain-specific protocols and standards (e.g., Open Geospatial Consortium specifications for the geographic domain), making metadata available in human- and machine-consumable formats; (iii) technological advances have led to mature prototypes to expose resources as linked data (e.g., D2R and Pubby), to query them with an appropriate query language (i.e., SPARQL), to retrieve the pertaining RDF fragments published around the web (e.g., Sindice), and to reason over, store and manipulate these fragments once they are retrieved (e.g., JENA API).
However, even supposing linked data were massively adopted to share the metadata of complex resources, the selection of the most suitable datasets for complex domains like environmental analysis would still be an enervating task. A huge number of resource features and their complex relations must be considered during the selection process. Semantic similarity algorithms supporting a deep comparison of resource features are pivotal for assisting in this process.
Before engaging in this challenging research plan, we have undertaken an exploratory phase analyzing real web data. The goal is to get first-hand experience of the varieties introduced by data providers publishing metadata. Although publishing metadata according to linked data best practice has a huge potential for documenting resources produced in complex pipelines, it is not yet a common practice in the specialized domains we have mentioned. For this reason, we have been forced to move to a simpler domain, considering the scientific publications exposed as linked data by Semantic Web Dog Food-SWDF (http://data.semanticweb.org/) and DBLP in RDF (http://dblp.l3s.de/d2r).
Very simple context!
We would expect the similarity, starting from the comparison of two URIs, to take advantage of non-authoritative information in order to give an assessment of the entity similarity that is as realistic as possible.