We need to start understanding documents within an electronic, machine-processable environment. Such a conception goes beyond PDF and HTML; it entails, I argue, understanding the document as a fluid aggregator.
The International Federation of Library Associations and Institutions (IFLA) is responsible for the development and maintenance of International Standard Bibliographic Description (ISBD), UNIMARC, and the "Functional Requirements" family for bibliographic records (FRBR), authority data (FRAD), and subject authority data (FRSAD). ISBD underpins the MARC family of formats used by libraries world-wide for many millions of catalog records, while FRBR is a relatively new model optimized for users and the digital environment. These metadata models, schemas, and content rules are now being expressed in the Resource Description Framework language for use in the Semantic Web.
This webinar provides a general update on the work being undertaken. It describes the development of an Application Profile for ISBD to specify the sequence, repeatability, and mandatory status of its elements. It discusses issues involved in deriving linked data from legacy catalogue records based on monolithic and multi-part schemas following ISBD and FRBR, such as the duplication which arises from copy cataloging and FRBRization. The webinar provides practical examples of deriving high-quality linked data from the vast numbers of records created by libraries, and demonstrates how a shift of focus from records to linked-data triples can provide more efficient and effective user-centered resource discovery services.
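The shift from records to triples described above can be sketched in plain Python (this is an illustration, not an IFLA tool): two copy-cataloged records describing the same manifestation collapse into a single set of statements, which is how triple-based data avoids the duplication that record-based copy cataloging produces. The identifiers and element names are hypothetical.

```python
# Minimal sketch: flatten catalog records into (subject, predicate, object)
# triples and merge them as a set, so shared statements appear only once.

def record_to_triples(record):
    """Flatten a catalog record (dict) into a set of triples."""
    subject = record["uri"]
    return {(subject, pred, value)
            for pred, value in record.items() if pred != "uri"}

# Two copy-cataloged records for the same book (values are illustrative).
rec_a = {"uri": "urn:isbn:9780000000001",
         "isbd:title": "Example Title", "isbd:publisher": "Example Press"}
rec_b = {"uri": "urn:isbn:9780000000001",
         "isbd:title": "Example Title", "isbd:placeOfPublication": "London"}

# A triple store behaves like a set: merging keeps each statement once.
graph = record_to_triples(rec_a) | record_to_triples(rec_b)
print(len(graph))  # 3 distinct statements, not 4 fields across 2 records
```

The duplicated title statement is stored once, while each record's unique elements survive; the same idea scales to deduplicating millions of copy-cataloged records.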
MR^3: Meta-Model Management based on RDFs Revision Reflection
Takeshi Morita
We propose a tool to manage several kinds of relationships between RDF and RDFS. Our tool consists of three main functions: graphical editing of RDF content, graphical editing of RDFS content, and a meta-model management facility. The meta-model management facility supports maintaining the relationship between RDF and RDFS content. These facilities are implemented on a plug-in system. We provide basic plug-in modules for consistency checking of RDFS classes and properties. The prototype tool, called MR^3 (Meta-Model Management based on RDFs Revision Reflection), is implemented in Java. Through an experiment using MR^3, we show how MR^3 contributes to the Semantic Web paradigm from the standpoint of RDFs content management.
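The kind of consistency check mentioned in the abstract can be illustrated with a toy sketch (this is not MR^3's actual plug-in API; all names and data are hypothetical): verify that every use of a property in the RDF layer respects the rdfs:domain declared in the RDFS layer.

```python
# Toy RDFS domain check: flag triples whose subject's type conflicts
# with the property's declared domain.

rdfs_domains = {"ex:teaches": "ex:Professor"}       # schema (RDFS) layer
rdf_types = {"ex:alice": "ex:Professor",
             "ex:bob": "ex:Student"}                # instance typing
rdf_triples = [("ex:alice", "ex:teaches", "ex:sw101"),
               ("ex:bob", "ex:teaches", "ex:ai101")]

def domain_violations(triples, domains, types):
    """Return triples whose subject's type conflicts with the property's domain."""
    return [(s, p, o) for s, p, o in triples
            if p in domains and types.get(s) != domains[p]]

violations = domain_violations(rdf_triples, rdfs_domains, rdf_types)
print(violations)  # the ex:bob triple violates the declared domain
```

A real checker would also handle rdfs:range, class hierarchies, and inference, but the core of the check is this comparison between instance-level use and schema-level declaration.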
Research Data Sharing: A Basic Framework
Paul Groth
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
How to use an index to highlight social networks in historical digital corpora?
Presented at Digital Humanities, 6 July 2006 (Paris).
Note: it is a little dated...
National Workshop on Research Methodology, Statistical Analysis and Stress Management
Organized by the Panjab University Campus Students Council (PUCSC) in collaboration with the Centre for Public Health, Panjab University, Chandigarh
SA2: Text Mining from User Generated Content
John Breslin
ICWSM 2011 Tutorial
Lyle Ungar and Ronen Feldman
The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora such as the Web requires new techniques drawn from data mining, machine learning, NLP, and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems. The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real-world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real-world text mining systems, including recent advances in sentiment analysis and how to handle user-generated text such as blogs and user reviews.
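The preprocessing stage described above can be sketched minimally: tokenize a tiny "document collection", drop stopwords, and extract the most frequent terms as an intermediate representation for later analysis. This is a toy using only the standard library; real systems layer categorization and information extraction on top.

```python
# Toy term extraction: tokenize, filter stopwords, count frequencies.
import re
from collections import Counter

docs = ["Text mining extracts facts from text.",
        "Web-scale mining requires machine learning."]
stopwords = {"from", "the", "a", "of"}

def extract_terms(documents, top_n=3):
    """Return the top_n most frequent non-stopword terms across documents."""
    tokens = []
    for doc in documents:
        tokens += [t for t in re.findall(r"[a-z]+", doc.lower())
                   if t not in stopwords]
    return Counter(tokens).most_common(top_n)

print(extract_terms(docs))  # 'text' and 'mining' dominate this toy corpus
```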
Lyle H. Ungar is an Associate Professor of Computer and Information Science (CIS) at the University of Pennsylvania. He also holds appointments in several other departments at Penn in the Schools of Engineering and Applied Science, Business (Wharton), and Medicine. Dr. Ungar received a B.S. from Stanford University and a Ph.D. from M.I.T. He directed Penn's Executive Masters of Technology Management (EMTM) Program for a decade, and is currently Associate Director of the Penn Center for BioInformatics (PCBI). He has published over 100 articles and holds eight patents. His current research focuses on developing scalable machine learning methods for data mining and text mining.
Ronen Feldman is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University and his Ph.D. in Computer Science from Cornell University in NY. He is the author of the book "The Text Mining Handbook" published by Cambridge University Press in 2007.
These slides were presented as part of a W3C tutorial at the CSHALS 2010 conference (http://www.iscb.org/cshals2010). The slides are adapted from a longer introduction to the Semantic Web available at http://www.slideshare.net/LeeFeigenbaum/semantic-web-landscape-2009 .
A PDF version of the slides is available at http://thefigtrees.net/lee/sw/cshals/cshals-w3c-semantic-web-tutorial.pdf .
Talk delivered at YOW! Developer Conferences in Melbourne, Brisbane and Sydney Australia on 1-9 December 2016.
Abstract: Governments collect a lot of data. Data on air quality, toxic chemicals, laws and regulations, public health, and the census are intended to be widely distributed. Some data is not for public consumption. This talk focuses on open government data — the information that is meant to be made available for benefit of policy makers, researchers, scientists, industry, community organisers, journalists and members of civil society.
We’ll cover the evolution of Linked Data, which is now being used by Google, Apple, IBM Watson, federal governments worldwide, non-profits including CSIRO and OpenPHACTS, and thousands of others worldwide.
Next we’ll delve into the evolution of the U.S. Environmental Protection Agency’s Open Data service that we implemented using Linked Data and an Open Source Data Platform. Highlights include how we connected to hundreds of billions of open data facts in the world’s largest, open chemical molecules database PubChem and DBpedia.
WHO SHOULD ATTEND
Data scientists, software engineers, data analysts, DBAs, technical leaders and anyone interested in utilising linked data and open government data.
This poster presents referencing services for linking bibliographic papers and citations with existing Linked Open Data. It aims to convert current bibliographic data in various digital library databases into semantic bibliographic data to enable research profiling and intelligent knowledge discovery.
Semantic Web Technologies: Changing Bibliographic Descriptions?
Stuart Weibel
Keynote presentation at the North Atlantic Health Science Library meeting, October 26, 2009.
An introduction to semantic web technologies and their relationship to libraries and bibliographic data.
Stuart Weibel, Senior Research Scientist, OCLC Research
Towards an Open Research Knowledge Graph
Sören Auer
The document-oriented workflows in science have reached (or already exceeded) the limits of adequacy, as highlighted, for example, by recent discussions of the increasing proliferation of scientific literature and the reproducibility crisis. It is now possible to rethink this dominant paradigm of document-centered knowledge exchange and transform it into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked knowledge graphs. At the core of establishing knowledge-based information flows is the creation and evolution of information models for a common understanding of data and information among the various stakeholders, as well as the integration of these technologies into the infrastructure and processes of search and knowledge exchange in the research library of the future. By integrating these information models into existing and new research infrastructure services, the information structures that are currently still implicit and deeply hidden in documents can be made explicit and directly usable. This has the potential to revolutionize scientific work, because information and research results can be seamlessly interlinked with each other and better mapped to complex information needs. Research results also become directly comparable and easier to reuse.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
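The regex features listed in the abstract (alternation, ranges, capture groups, Kleene star, lookarounds) can be illustrated with Python's `re` module. This only shows plain matching; Reef's contribution is proving such a match over a *committed* document in zero knowledge, which no standard regex engine does.

```python
# Illustrating the PCRE-style features Reef supports, using Python's re.
import re

# Alternation, a character range, Kleene star, and a capture group:
m = re.search(r"(cat|dog)[a-z]*", "hotdogs")
print(m.group(1))  # 'dog'

# A lookahead: a "password strength" style check (one of Reef's motivating
# applications) requiring a digit somewhere and at least 8 characters.
strong = re.compile(r"^(?=.*\d).{8,}$")
print(bool(strong.match("s3cretpass")))  # True
print(bool(strong.match("short1")))      # False
```

In Reef's password application, a server could verify that a committed password matches such a policy pattern without ever seeing the password itself.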
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
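To give a feel for what "grid simulation" means concretely, here is a back-of-the-envelope sketch of a power-flow computation using the linear "DC" approximation on a hypothetical 3-bus network. This is illustrative pure Python, not the PowSyBl or pypowsybl API that the workshop itself demonstrates.

```python
# Toy DC power flow on a 3-bus network: solve B' * theta = P for the
# non-slack bus voltage angles, then derive line flows.

# Line susceptances (per unit) between buses; bus 0 is the slack bus.
b01, b02, b12 = 10.0, 10.0, 10.0
# Net injections at buses 1 and 2 (per unit); the slack absorbs the rest.
p1, p2 = 1.0, -0.5

# Reduced susceptance matrix over the non-slack buses 1 and 2.
B = [[b01 + b12, -b12],
     [-b12, b02 + b12]]

# Solve the 2x2 linear system by Cramer's rule.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
theta1 = (B[1][1] * p1 - B[0][1] * p2) / det
theta2 = (B[0][0] * p2 - B[1][0] * p1) / det

# Flow on a line is its susceptance times the angle difference.
flow_1_0 = b01 * (theta1 - 0.0)  # power flowing from bus 1 to the slack
print(round(flow_1_0, 3))
```

Real tools like PowSyBl solve the full nonlinear AC equations over networks with thousands of buses, but the structure (build a network model, solve for a state, read off flows) is the same.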
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You Need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
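The core idea these papers explore can be sketched minimally: instead of retrieving raw text chunks, retrieve a neighborhood from a knowledge graph and serialize it as grounded context for an LLM prompt. The graph, entities, and prompt below are hypothetical, and no LLM call is made; this only illustrates the retrieval step.

```python
# Toy GraphRAG-style retrieval: expand a k-hop neighborhood from the
# query entity and serialize the facts as prompt context.

graph = {  # subject -> list of (predicate, object)
    "FalkorDB": [("is_a", "graph database"), ("founded_by", "Guy Korland")],
    "Guy Korland": [("role", "CEO")],
}

def neighborhood(entity, hops=2):
    """Collect facts reachable from `entity` within `hops` edges."""
    facts, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for pred, obj in graph.get(node, []):
                facts.append(f"{node} {pred} {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return facts

context = "\n".join(neighborhood("FalkorDB"))
prompt = f"Answer using only these facts:\n{context}\nQ: Who leads FalkorDB?"
print(prompt)
```

Because the context comes from explicit graph edges rather than fuzzy text similarity, the model's answer can be traced back to specific facts, which is the usual argument for graphs mitigating hallucination.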
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Paper as a Research Object
1. Research around and about the scientific paper in the biomedical domain. Supporting Literature Based Discovery. From the paper to the data, back and forth.
Alexander Garcia, PhD.
FSU
2. 350 Years and Counting
Scientific articles have adopted electronic dissemination channels.
Scholarly communication has been complemented by the adoption of blogs, mailing lists, social networks, and other technologies.
Information remains locked up in PDFs.
3. And so we are…
Managing the publication on a postmortem basis…
The paper as an interface to the Web of Data?
The problem remains, so…
To be born semantic… why not?
4. Heading towards
A semantic document: one where human-readable knowledge is augmented to enable its interpretation by machines.
A human-interpretable document fully processable by machines.
Human interoperability and machine interoperability.
Literature Based Discovery and the paper as an interface to the WoD.
5. We all know that
Information is locked up in discrete documents, mostly PDF.
Controlled vocabularies are not always available.
Text mining depends on availability of data.
Poor metadata.
7. Literature Based Discovery
• The key idea is: putting together explicit assertions from different papers to form new implicit assertions
– PTSD and suicide
– Magnesium and migraine
– Fish oil and Raynaud's, or calcium-channel blockers
• Sophisticated access to online information
• Supplement document retrieval with:
– Information extraction
– Automatic summarization
– Question answering
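The ABC model behind these examples can be sketched as a toy (the assertions below are illustrative, simplified versions of Swanson's fish-oil/Raynaud's reasoning, not extracted facts): if one body of papers asserts A→B and a disjoint body asserts B→C, the A→C link is a candidate implicit assertion.

```python
# Toy ABC-model literature-based discovery: join explicit assertions
# from two disjoint corpora on their shared middle term B.

corpus_1 = [("fish oil", "reduces", "blood viscosity")]
corpus_2 = [("blood viscosity", "aggravates", "Raynaud's syndrome")]

def implicit_links(assertions_1, assertions_2):
    """Return (A, B, C) candidates where A->B and B->C share the term B."""
    return [(a, b, c)
            for a, _, b in assertions_1
            for b2, _, c in assertions_2 if b == b2]

print(implicit_links(corpus_1, corpus_2))
```

The output is a hypothesis to test, not a proven fact; real systems rank thousands of such candidates and rely on semantic predications rather than raw string matching.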
8. The White Paper Challenge
Search and Retrieval: how to get relevant documents faster
Info Sources
Query Builders
Notifications
How to “scan” the document in a meaningful manner?
How to repurpose fragments of the documents?
9. Literature Discovery Process
Search: usually string-based search mechanisms; little cognitive support
Retrieval: a simple list of DB entries; little cognitive support
Interacting with the document: straight into the PDF; zero cognitive support
Data availability
15. Challenge: Language Complexity
"The average age of participants (approximately 63 years), the predominance of women, and the high prevalence of comorbid conditions (for example, hypertension and cardiovascular disease) reflect typical characteristics of patients with osteoarthritis."
Language encodes a lot of information
17. Semantic Predications
"The average age of participants (approximately 63 years), the predominance of women, and the high prevalence of comorbid conditions (for example, hypertension and cardiovascular disease) reflect typical characteristics of patients with osteoarthritis."
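A semantic predication reduces a sentence like the one above to subject-PREDICATE-object triples, in the style of tools such as SemRep. A toy sketch, with the triples hand-written for this one example rather than extracted automatically:

```python
# Hand-crafted semantic predications for the osteoarthritis sentence.
# Predicate names follow the SemRep style; the triples themselves are
# illustrative, not the output of a real extractor.
predications = [
    ("hypertension", "COEXISTS_WITH", "osteoarthritis"),
    ("cardiovascular disease", "COEXISTS_WITH", "osteoarthritis"),
    ("osteoarthritis", "PROCESS_OF", "women"),
]

def objects_of(subject, predicate, triples):
    """Look up all objects asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("hypertension", "COEXISTS_WITH", predications))
```

Once a sentence is in this form, the assertions become queryable and can feed the ABC-style discovery shown earlier.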
19. What is needed
Disambiguate text and tag/link concepts
Meta-analyse information at the concept level
Provide meta-analysed information
Support information-based knowledge discovery, especially new associations
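The first step, tagging and linking concepts, can be done at its simplest with a dictionary lookup from surface mentions to ontology identifiers. A minimal sketch; the term-to-identifier table is illustrative (a real system would resolve against terminologies such as UMLS or the NCBO ontologies):

```python
# Dictionary-based concept tagging: map mentions in text to concept
# identifiers so that information can be meta-analysed at the concept
# level. The CONCEPTS table is a hand-made example.
import re

CONCEPTS = {
    "hypertension": "UMLS:C0020538",
    "osteoarthritis": "UMLS:C0029408",
    "cardiovascular disease": "UMLS:C0007222",
}

def tag_concepts(text):
    """Return sorted (mention, concept_id) pairs found in the text."""
    hits = []
    lowered = text.lower()
    for term, cid in CONCEPTS.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, cid))
    return sorted(hits)

sentence = ("The high prevalence of comorbid conditions, for example "
            "hypertension and cardiovascular disease, reflects typical "
            "characteristics of patients with osteoarthritis.")
print(tag_concepts(sentence))
```

Real taggers additionally disambiguate (the same string can denote several concepts); a plain lookup sidesteps that problem.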
20. In order to support Literature-Based Discovery
Ontologies, communities, annotation, machine-readable documents
In a nutshell… documents as interfaces to the Web of Data…
Biotea
• Machine-readable and processable documents
• Interactive documents
• Enriched metadata
• Full content management, document centric
• Social hub
Citagora
• Aggregated search
• Single entry point
• Social hub
• Citation centric
21. Biotea in a nutshell
It is a knowledge model for biomedical literature
We are semantically annotating literature with text mining and ontologies
Delivers a network of interrelated documents
Delivers a semantic infrastructure for PMC and scientific literature in general
23. RDF4PMC, some results
Metadata + content + references make possible:
• How similar are two articles, based on authors, keywords, abstracts, and ontological terms?
• What articles use this reference in a section with the title "Results"?
Annotations make possible:
• How similar are two articles, based on semantic distance?
• Which annotation co-occurs most with this "YYY" annotation?
• Which articles include "TERM" but not this other "TERM"?
Some numbers for article PMC126253, "Computational method for reducing variance with Affymetrix microarrays":
• NCBO: 407 annotations, 633 topics
• Whatizit: 14 annotations, 203 topics
Delivering: the platform that makes it possible to build interactive environments for semantic publications
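The co-occurrence question above ("which annotation co-occurs most with annotation YYY?") can be sketched with articles modelled as plain sets of annotation terms; in Biotea the same query would run over the RDF via SPARQL. The mini-corpus here is made up for the example:

```python
# Annotation co-occurrence across a (toy) annotated corpus: count how
# often other annotations appear in the same article as a given term.
from collections import Counter

articles = {
    "PMC-A": {"catalase", "oxidative stress", "microarray"},
    "PMC-B": {"catalase", "oxidative stress"},
    "PMC-C": {"catalase", "microarray"},
}

def top_cooccurring(term, corpus):
    """Count annotations appearing in the same articles as `term`."""
    counts = Counter()
    for annotations in corpus.values():
        if term in annotations:
            counts.update(annotations - {term})
    return counts.most_common()

print(top_cooccurring("catalase", articles))
# "oxidative stress" and "microarray" each co-occur twice here
```

The same counting over millions of PMC articles is what turns annotations into a discovery signal rather than mere tags.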
24. A dashboard for semantic biopublications
[Diagram: a semantically enriched publication (metadata + content + references), automatically annotated as RDF and queried via SPARQL; example query term: "Catalase"]
25. Cloud of Bioannotations
[Screenshot: an annotation cloud (term + number of bioentities), with title and authors, links, abstract, and the paragraphs containing the annotation selected by the user]
27. Citagora
An Agora for Citations
From citations, to the Social Web, to an interactive document
Aggregating activity from social networks, reference management systems, blogs, publishers, etc.
Aggregating sources from Google Scholar, Microsoft Academic, Zotero, Mendeley, etc.
28. What is MSRC.CITAGORA?
A corpus of documents for one specific domain
• BibRef centric
• Enrichment mechanism
• Based on heterogeneous data sources; an aggregator
– Heterogeneous BibRef data sources
– Heterogeneous PDF layouts
• Value in
– Enriching semantics around the BibRef
– Aggregating social activity around the BibRef
– Social activity as part of the BibRef
– Making use of the content without exposing it
Data for, and compatible with, the Web of Data
29. MSRC.CITAGORA
Data sources
• Users uploading ENL files that have the corresponding PDF for each record
• Results from harvesting Mendeley, Zotero, the Elsevier API, the Microsoft Academic API, etc.
Extracting meaningful information by processing the data source
• The list of references this document cites_to
• A meaningful bag of words
• Authors, affiliations, emails
Outcome: RDF
• A BibRef for the original PDF
• Annotations for the whole document
• Text
• The list of cites_to
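The "meaningful bag of words" step of this pipeline can be sketched as plain frequency counting over the document text after dropping stop words. The stop-word list and sample text below are illustrative only:

```python
# Extract a bag of words describing a document: tokenize, drop stop
# words, keep the most frequent terms as descriptors.
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "for", "with", "to", "is"}

def bag_of_words(text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

sample = ("Computational method for reducing variance with Affymetrix "
          "microarrays. Variance estimates for microarrays improve "
          "downstream analysis of microarray experiments.")
print(bag_of_words(sample))
```

A production pipeline would stem or lemmatize ("microarray" vs "microarrays") and weight terms, e.g. with TF-IDF, rather than raw counts.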
31. Moving Towards OPEN.CITAGORA
Let's build the largest OPEN repository of everything around a standardized, interoperable bibliographic reference
[Diagram: a BibRef living in the Web of Data, with has_part links to Annotations, References, Content, and the PDF]
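The has_part structure sketched on this slide can be written out directly as N-Triples. The namespace and URI pattern below are made up for the example; Biotea itself reuses established vocabularies such as Dublin Core and BIBO:

```python
# Emit the has_part skeleton of a bibliographic reference as
# N-Triples. BASE and the URI layout are hypothetical.
BASE = "http://example.org/citagora/"

def bibref_triples(ref_id,
                   parts=("Annotations", "References", "Content", "PDF")):
    """One has_part triple per component of the BibRef."""
    subject = f"<{BASE}bibref/{ref_id}>"
    return [
        f"{subject} <{BASE}has_part> <{BASE}bibref/{ref_id}/{p.lower()}> ."
        for p in parts
    ]

for triple in bibref_triples("PMC126253"):
    print(triple)
```

Minting one URI per reference, as here, is what gives "all in one place, one URI" its meaning on the next slides.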
33. Semantic Enrichment
Jailbreaking the PDF: content is locked up
From the PDF we extract:
• Meaningful text: content as text; a bag of words describing the content
• Citations: this paper cites_to …
• Authors: this paper has_authors …
• Title, DOI, etc.
[Diagram: BibRef has_part Annotations, PDF, Content, and References]
34. Semantic Enrichment
Jailbreaking the PDF: content is locked up
• Heterogeneous formats
• Diversity in APIs for collecting BibRefs
• Poor in descriptors anchored in the content
• Not just about the PDF
• Lousy metadata
Standardization, all in one place, one URI, etc.
[Diagram: BibRef has_part Annotations, PDF, References, and Content]
38. Translational Research
How is MSRC contributing to translational research in clinical psychology?
Data standards
Semantic infrastructure
Bridging the gap between documents and data repositories
43. We have learned so far
Born-semantic publishing makes the semantics useful to the authors themselves, since they are present in the publication process from the start. To add value for readers and for computational consumption, these semantics must then be "preserved" throughout the publication process; so we need to address the publication process to achieve this goal.
44. Acknowledgments
Special Thanks to John Gomez, John Patterson, Dietrich
Rebholz-Schuhmann, Robert Morris, Oscar Corcho, Diane
Leiva and Greg Riccardi
Editor's Notes
From paper-based journals to purely electronic formats.
The next step consisted of emphasizing the importance of adding semantics to the data or annotations made in different kinds of experimental procedures or laboratory techniques. In the notebooks analysed, annotations of different experimental procedures were found, the most recurrent being DNA extraction, PCR (including some of its variants), and electrophoresis in agarose and polyacrylamide gels. The annotations found relate to materials and methods and to experimental design, with data from some form of analysis of results also observed. Based on this rhetorical structure of the laboratory notebooks, the construction of two ontologies was planned: one providing the metadata that self-describes the laboratory notebook and an experimental activity, and another containing terms related to laboratory processes commonly used in plant molecular biology. The purpose of these ontologies is to support competency questions such as "On what dates was DNA extracted from the rice materials used in the project titled 'identification of molecular markers associated with yield QTLs in rice'?" and "In which research projects did OXG participate between 2005 and 2009?"