Linked Data, Cultural Heritage & the Karma Mapping Software
1. Linked Data & Cultural Heritage
Pedro Szekely and Craig Knoblock
USC/Information Sciences Institute
pszekely@isi.edu, knoblock@isi.edu
http://isi.edu/integration/karma
February 2015
2. Outline
• Problem
• Linked Data
• Karma
• Reconciliation
• Next steps
USC Information Sciences Institute, CC-By 2.0
4. Humans Browsing the Web
• Crystal Bridges Museum of American Art
• Dallas Museum of Art
• Indianapolis Museum of Art
• The Metropolitan Museum of Art
• National Portrait Gallery
• Smithsonian American Art Museum
7. WEB PAGES ARE UNUSABLE FOR CREATING INNOVATIVE APPLICATIONS USING THE DATA
8. SOLUTION:
Linked Open Data
“web pages for computers”
using W3C standards for publishing data
9. Tim Berners-Lee on Linked Open Data
http://youtu.be/OM6XIICm_qo
10. Humans Browsing the Web
• Crystal Bridges Museum of American Art
• Dallas Museum of Art
• Indianapolis Museum of Art
• The Metropolitan Museum of Art
• National Portrait Gallery
• Smithsonian American Art Museum
12. Publish Your Raw Data
• Crystal Bridges Museum of American Art
• Dallas Museum of Art
• Indianapolis Museum of Art
• The Metropolitan Museum of Art
• National Portrait Gallery
• Smithsonian American Art Museum
13. Examples of Raw Data Now
https://github.com/cooperhewitt/collection
https://github.com/IMAmuseum/ima-collection
14. Convert Data to CRM (2 star)
• Crystal Bridges Museum of American Art
• Dallas Museum of Art
• Indianapolis Museum of Art
• The Metropolitan Museum of Art
• National Portrait Gallery
• Smithsonian American Art Museum
15. Linked Museum Data (3 star)
• Crystal Bridges Museum of American Art
• Dallas Museum of Art
• Indianapolis Museum of Art
• The Metropolitan Museum of Art
• National Portrait Gallery
• Smithsonian American Art Museum
17. Represent Resources Using URIs
http://szekelys.com/family#pedro
“Pedro”
http://xmlns.com/foaf/0.1/firstName
18. Represent Information as Triples
Subject: http://szekelys.com/family#pedro (the resource being described)
Predicate: http://xmlns.com/foaf/0.1/firstName (a property of the resource)
Object: “Pedro” (the value of the property)
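The subject, predicate and object above can be written down as one line of N-Triples. The sketch below is only an illustration using the Python standard library; the `Triple` type and the `to_ntriples` helper are names made up for this example and are not part of Karma.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of a property of the resource
    obj: str        # value of the property (a plain string literal here)


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose object is a plain literal as one N-Triples line."""
    # Escape backslashes and double quotes inside the literal.
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


pedro = Triple(
    subject="http://szekelys.com/family#pedro",
    predicate="http://xmlns.com/foaf/0.1/firstName",
    obj="Pedro",
)
line = to_ntriples(pedro)
print(line)
```

The helper only handles literal objects; a triple whose object is another resource would put a URI in angle brackets in the third position instead.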
22. Steps to Create Linked Open Data
• Publish the raw data
… get the data out of the proprietary database
• Select ontologies
… that define classes and properties for our data
• Define URI scheme
… identifiers of your resources
• Convert data to RDF
… from data sources to the ontologies
• Identify links to other Linked Data datasets
… aka reconciliation, entity resolution, …
23. CIDOC CRM
• Select ontologies
… that define classes and properties for our data
http://www.cidoc-crm.org/
24. Define URI scheme
… identifiers of your resources
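One possible way to think about a URI scheme is as a function from a record's local identifier to a stable URI. The sketch below assumes a made-up base URI (`http://example.org/museum/`) and a made-up `mint_uri` helper; a real institution would choose its own domain and path layout.

```python
from urllib.parse import quote

# Hypothetical base URI, chosen only for this illustration.
BASE = "http://example.org/museum/"


def mint_uri(kind: str, local_id: str) -> str:
    """Build a URI of the form BASE/kind/local_id, percent-encoding unsafe characters."""
    return f"{BASE}{quote(kind, safe='')}/{quote(str(local_id).strip(), safe='')}"


object_uri = mint_uri("object", "1985.66.295")
person_uri = mint_uri("person", "Pedro Szekely")
print(object_uri)
print(person_uri)
```

The point of the design is that the same record always yields the same URI, so other datasets can link to it reliably.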
26. Convert data to RDF
… from data sources to the ontologies
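As a rough illustration of what converting data to RDF means, the sketch below maps the columns of one raw record to ontology properties and emits triples. It is not how Karma works internally. To keep it short it uses Dublin Core element properties rather than CIDOC CRM, and the record, column names and `record_to_triples` helper are all invented for the example.

```python
DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from source column names to ontology properties.
COLUMN_TO_PROPERTY = {
    "title": DC + "title",
    "artist": DC + "creator",
    "date": DC + "date",
}


def record_to_triples(record: dict, subject_uri: str) -> list:
    """Return (subject, predicate, literal) triples for every mapped, non-empty column."""
    triples = []
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = record.get(column)
        if value:  # skip missing or empty cells
            triples.append((subject_uri, prop, str(value)))
    return triples


# Hypothetical raw record, as it might come out of a collection database export.
record = {"title": "Untitled", "artist": "Unknown", "date": "", "room": "12"}
triples = record_to_triples(record, "http://example.org/museum/object/1")
for t in triples:
    print(t)
```

Columns without a mapping (here `room`) and empty cells (here `date`) produce no triples, which is why the choice of ontology and mapping determines what ends up in the published data.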
27. RDF Mapping Tools
TOOL | SHORTCOMINGS | BENEFITS
custom code | labor intensive, error prone | flexible
R2RML | difficult to learn, only SQL databases | W3C standard, good documentation, multiple vendors
Open Refine | no guidance, only tabular data | graphical user interface, support for reconciliation, open source
Karma | university product | easy to use, flexible, multiple data formats, multiple deployment databases, scalable, R2RML compatible, open source
29. KARMA DEMO
http://youtu.be/h3_yiBhAJIc
30. Easy To Use
easy to use, flexible, multiple data formats, multiple deployment databases, scalable, R2RML compatible, open source
CLEAR DEPICTION OF MAPPING
31. LEARNS TO MAP YOUR DATA
32. SUGGEST CORRECT ADJUSTMENTS
33. EMBEDDED PYTHON SCRIPTING
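The slide does not show what such a script looks like. As a hedged illustration, the kind of cleanup one might script before mapping is normalizing free-text dates to a year. The `extract_year` function below is a made-up example in plain Python; how Karma passes column values into a script is not covered here.

```python
import re


def extract_year(raw: str) -> str:
    """Return the first four-digit year (1000-2099) found in a free-text date, or ""."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", raw or "")
    return match.group(1) if match else ""


for value in ["ca. 1885", "1885-1890", "undated"]:
    print(repr(value), "->", repr(extract_year(value)))
```

Returning an empty string for unparseable values keeps the transformed column uniform, so rows without a usable date can simply be skipped later.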
34. IMPORT POPULAR DATA FORMATS
35. OUTPUT RDF IN MULTIPLE FORMATS
ntriples
JSON
AVRO
SPARQL
ElasticSearch, GitHub, …
Hadoop, BigData
36. 40 million documents
1 billion triples
larger than all AAC museums combined
37. periodic update: every hour, every day
continuous update: as new records come in
38. Karma compatible with R2RML tools
39. Karma Is Open Source
44. URI Reconciliation In Karma
Pedro Szekely
45. Results of Automatic Linking
Pedro Szekely
99% are correct
6% are missing
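The two figures can be read as precision and recall, although the slide does not use those terms, so that reading is an assumption: "99% are correct" as precision of 0.99 and "6% are missing" as recall of 0.94. Under that assumption, a combined F1 score can be computed as follows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.99      # assumed reading of "99% are correct"
recall = 1.0 - 0.06   # assumed reading of "6% are missing"
f1 = f1_score(precision, recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.3f}")
```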
46. Steps to Create Linked Open Data
• Publish the raw data
… get the data out of the proprietary database
• Select ontologies
… that define classes and properties for our data
• Define URI scheme
… identifiers of your resources
• Convert data to RDF
… from data sources to the ontologies
• Identify links to other Linked Data datasets
… aka reconciliation, entity resolution, …
47. TMS to CRM: easy?
48. TMS to CRM: easy? NO
49. COMMUNITY EFFORT
• Publish the raw data
… get the data out of the proprietary database
• Select ontologies
… that define classes and properties for our data
• Define URI scheme
… identifiers of your resources
• Convert data to RDF
… from data sources to the ontologies
• Identify links to other Linked Data datasets
… aka reconciliation, entity resolution, …
50. Radical Ideas
• ULAN in Wikipedia or Wikidata
• ULAN in GitHub
• Collection data in GitHub
• Community created CRM mappings in GitHub
• CRM in JSON-LD in GitHub
• Tools to export from TMS to GitHub
52. Deployment Options
Technology | Shortcomings | Benefits
SPARQL endpoint | low reliability, esoteric, slow | sophisticated query language
RDF dump | no query capability, esoteric | flexibility: clients can download and use in applications, easy to publish
JSON-LD + ElasticSearch | restricted query language | very high performance, mainstream technology, easy to publish
Karma supports the three options
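To give a feel for the JSON-LD option, the sketch below builds a small JSON-LD document with the Python standard library, reusing the example URIs from the earlier slides. It only shows the shape of such a document: it is ordinary JSON, which is why a mainstream document store can index it. Loading it into ElasticSearch is not shown, and the layout is an illustration rather than Karma's actual output.

```python
import json

# A JSON-LD document: "@context" maps short keys to property URIs,
# "@id" identifies the resource being described.
doc = {
    "@context": {"firstName": "http://xmlns.com/foaf/0.1/firstName"},
    "@id": "http://szekelys.com/family#pedro",
    "firstName": "Pedro",
}

serialized = json.dumps(doc, sort_keys=True, indent=2)
print(serialized)

# Ordinary JSON tooling can read it back without any RDF library.
roundtrip = json.loads(serialized)
```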
53. federation: everyone publishes their data with their own URIs
aggregation: aggregator republishes everyone’s data with new URIs
54. thanks for your attention!
https://github.com/usc-isi-i2/Web-Karma
Open Source, Apache 2 License