The Web of Linked Open Data (LOD) is the most relevant achievement of the Semantic Web. Initially proposed by Tim Berners-Lee in a seminal paper published in Scientific American in 2001, the Semantic Web envisions a web where software agents can interact with large volumes of structured, easy-to-process data. Users now have at their disposal the first mature results of this vision. Among them, and probably the most significant, are the different LOD initiatives and projects that publish open data in standard formats such as RDF.
This presentation provides an overview and comparison of different LOD initiatives in the area of patent information, and analyses potential opportunities for building new information services based on widely available datasets of patent information. The information is based on interviews conducted with innovation agents and on the analysis of the professional literature and current implementations.
LOD opportunities are not restricted to information aggregators; they also extend to end-users and innovation agents who must face the difficulties of dealing with large amounts of data. In both cases, the opportunities offered by LOD need to be assessed, as LOD has become a standard, universal method to distribute, share and access data.
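The abstract above describes aggregating patent data published as RDF. As a purely illustrative sketch (none of these identifiers or predicates come from the presentation; they are invented), RDF-style triples pooled from different sources can be queried by simple pattern matching:

```python
# Minimal sketch of querying RDF-style triples (subject, predicate, object).
# All identifiers and prefixes below are hypothetical, for illustration only.
TRIPLES = [
    ("pat:EP1234", "dc:title", "Solar cell coating"),
    ("pat:EP1234", "ipc:class", "H01L"),
    ("pat:EP1234", "dc:creator", "org:AcmeLabs"),
    ("pat:US5678", "ipc:class", "H01L"),
    ("org:AcmeLabs", "foaf:based_near", "dbpedia:Madrid"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All patents in (hypothetical) IPC class H01L, regardless of source:
patents = [s for s, _, _ in match(p="ipc:class", o="H01L")]
print(patents)  # ['pat:EP1234', 'pat:US5678']
```

Real LOD datasets would be queried with SPARQL against published endpoints, but the pattern-matching idea is the same.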
"Big data" is a broad term that encompasses a wide range of data and contents. Big data offers new approaches to analysis and decision making. At first glance big data and IP may seem to be opposites, but have more in common than one may think. This talk will focus on how big data will impact, and be impacted, by IP. One of the biggest promises in big data is the possibility to re-use data produced via different sources, create new services or predict the future, via the analysis of correlations. In this context, how can companies protect information assets and analytical skills? What are the new skills required to search and analyze in real time a big amount of datasets ? Big data will change not only patents information, but will also generate new types of patents.
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d... (Connected Data World)
As one of the largest financial institutions worldwide, JP Morgan relies on data to drive its day-to-day operations against an ever-evolving regulatory regime. Our global data landscape poses particular challenges for effectively maintaining data governance and metadata management.
The Data strategy at JP Morgan intends to:
a) generate business value
b) adhere to regulatory & compliance requirements
c) reduce barriers to access
d) democratize access to data
In this talk, we show how JP Morgan leverages semantic technologies to drive the implementation of our data strategy. We demonstrate how we exploit knowledge graph capabilities to answer:
1) What Data do I need?
2) What Data do we have?
3) Where does my Data come from?
4) Where should my Data come from?
5) What Data should be shared most?
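Question 3 above (data lineage) is a natural fit for graph traversal. The sketch below is not JP Morgan's implementation; it is a minimal illustration, with invented dataset names, of how "Where does my data come from?" reduces to walking `derived_from` edges in a lineage graph:

```python
# Minimal sketch: answer "Where does my data come from?" by walking
# "derived_from" edges in a lineage graph. Dataset names are invented.
from collections import deque

DERIVED_FROM = {              # dataset -> upstream sources it is derived from
    "risk_report": ["trades_clean"],
    "trades_clean": ["trades_raw", "ref_data"],
    "trades_raw": [],
    "ref_data": [],
}

def provenance(dataset):
    """Return every upstream dataset reachable from `dataset`."""
    seen, queue = set(), deque(DERIVED_FROM.get(dataset, []))
    while queue:
        d = queue.popleft()
        if d not in seen:
            seen.add(d)
            queue.extend(DERIVED_FROM.get(d, []))
    return seen

print(sorted(provenance("risk_report")))
# ['ref_data', 'trades_clean', 'trades_raw']
```

In a real knowledge graph the same question would be a property-path query (e.g. over PROV-style relations) rather than hand-rolled traversal.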
"Big data" is a broad term that encompasses a wide range of data and contents. Big data offers new approaches to analysis and decision making. At first glance big data and IP may seem to be opposites, but have more in common than one may think. This talk will focus on how big data will impact, and be impacted, by IP. One of the biggest promises in big data is the possibility to re-use data produced via different sources, create new services or predict the future, via the analysis of correlations. In this context, how can companies protect information assets and analytical skills? What are the new skills required to search and analyze in real time a big amount of datasets ? Big data will change not only patents information, but will also generate new types of patents.
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Connected Data World
As one of the largest financial institutions worldwide, JP Morgan is reliant on data to drive its day-to-day operations, against an ever evolving regulatory regime. Our global data landscape possesses particular challenges of effectively maintaining data governance and metadata management.
The Data strategy at JP Morgan intends to:
a) generate business value
b) adhere to regulatory & compliance requirements
c) reduce barriers to access
d) democratize access to data
In this talk, we show how JP Morgan leverages semantic technologies to drive the implementation of our data strategy. We demonstrate how we exploit knowledge graph capabilities to answer:
1) What Data do I need?
2) What Data do we have?
3) Where does my Data come from?
4) Where should my Data come from?
5) What Data should be shared most?
Linked data for Enterprise Data Integration (Sören Auer)
The Web is evolving into a Web of Data. In parallel, the intranets of large companies will evolve into data intranets based on the Linked Data principles. Linked Data has the potential to complement the SOA paradigm with a light-weight, adaptive data integration approach.
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi... (Denodo)
Watch Alberto's session from Fast Data Strategy on-demand here: https://buff.ly/2wByS41
Gartner’s recently published report “Data Catalogs Are the New Black in Data Management Analytics” emphasizes the importance of data catalogs.
Watch this session to learn more about:
• The vision behind the Denodo Data Catalog
• How to maximize information value with the Denodo Data Catalog
• Why it is essential to combine data delivery with a data catalog
Knowledge graphs are what all businesses are now on the lookout for. But what exactly is a knowledge graph and, more importantly, how do you get one? Do you get it as an out-of-the-box solution, or do you have to build it (or have someone else build it for you)? With the help of our knowledge graph technology experts, we have created a step-by-step list of how to build a knowledge graph. It will properly expose and enforce the semantics of the semantic data model via inference, consistency checking and validation, and thus offer organizations many more opportunities to transform and interlink data into coherent knowledge.
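To make the inference step mentioned above concrete, here is a toy sketch (not from the original material; all class and instance names are invented) of deriving implied types from `rdfs:subClassOf`-style statements, the simplest kind of knowledge-graph inference:

```python
# Minimal forward-chaining sketch: derive the types implied by a
# subClassOf hierarchy. All names below are hypothetical.
SUBCLASS_OF = {
    "DesignPatent": "Patent",
    "Patent": "IPRight",
}
TYPES = {"pat:D001": "DesignPatent"}

def inferred_types(instance):
    """Return the asserted class plus every superclass it implies."""
    cls = TYPES[instance]
    result = [cls]
    while cls in SUBCLASS_OF:      # walk up the class hierarchy
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result

print(inferred_types("pat:D001"))  # ['DesignPatent', 'Patent', 'IPRight']
```

Production triplestores perform this kind of entailment (plus consistency checking and validation) natively; the point here is only that inferred facts are derived, not stored.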
This presentation, held during the SEMANTiCS conference, introduces Ontos' current achievements towards a streaming-based text mining solution using Deep Learning and Semantic Web technologies.
Triplestores and inference, applications in Finance, text-mining. Projects and solutions for financial media and publishers.
Keystone Industrial Panel, ISWC 2014, Riva del Garda, 18 Oct 2014.
Thanks to Atanas Kiryakov for this presentation, I just cut it to size.
Using the Semantic Web Stack to Make Big Data Smarter (Matheus Mota)
This presentation will discuss how just a few parts of the Semantic Web Cake can already boost your analytics by making your (big) data smarter and even more connected.
Powerful Information Discovery with Big Knowledge Graphs – The Offshore Leaks ... (Connected Data World)
Borislav Popov's slides from his lightning talk at Connected Data London. Borislav, a Director of Business Development at Ontotext, presented Ontotext's approach to tackling the Panama Papers leak, using a technology that mixes Semantic Web and graph databases.
SKOS - 2007 Open Forum on Metadata Registries - NYC (jonphipps)
A brief introduction to SKOS (Simple Knowledge Organization Systems) and its usage in the NSDL Metadata Registry, with some discussion of current challenges.
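As a toy illustration of the SKOS model mentioned above (not taken from the talk; the concept labels are invented), `skos:broader` links each concept to a more general one, so a thesaurus is just a graph you can walk upward:

```python
# Toy sketch of a SKOS concept scheme: each entry maps a concept to its
# skos:broader (more general) concept. Labels are invented for illustration.
BROADER = {
    "optical instruments": "instruments",
    "microscopes": "optical instruments",
    "electron microscopes": "microscopes",
}

def broader_chain(concept):
    """Follow skos:broader links from a concept up to its top concept."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain

print(broader_chain("electron microscopes"))
# ['microscopes', 'optical instruments', 'instruments']
```

Real SKOS adds labels, notes and mapping relations (`skos:exactMatch`, etc.), but broader/narrower chains like this are the backbone used for query expansion and browsing.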
Nelson Piedra, Janneth Chicaiza and Jorge López, Universidad Técnica Particular de Loja; Edmundo Tovar, Universidad Politécnica de Madrid; and Oscar Martínez, Universitas Miguel Hernández
Explore the advantages of using linked data with OERs.
Morning session talk at the second Keystone Training School, "Keyword Search in Big Linked Data", held in Santiago de Compostela.
https://eventos.citius.usc.es/keystone.school/
Expressing Concept Schemes & Competency Frameworks in CTDL (Credential Engine)
This presentation is focused on how the Credential Engine can access 3rd party resource data stores and recipes for mapping and publishing competency frameworks as Linked Data.
Decentralised identifiers and knowledge graphs (vty)
Building an Operating System for Open Science: data integration challenges, Dataverse data repository and knowledge graphs. Lecture by Slava Tykhonov, DANS-KNAW, for the Journées Scientifiques de Rochebrune 2023 (JSR'23).
Ontologies, controlled vocabularies and Dataverse (vty)
Presentation on Semantic Web technologies for the Dataverse Metadata Working Group, run by the Institute for Quantitative Social Science (IQSS) of Harvard University.
3. Implementation with NOSQL databases: Document Databases (MongoDB).pptx (RushikeshChikane2)
This chapter gives information about document-based databases and graph-based databases: their basic structures, features, applications, limitations and use cases.
Presentation by Dr. Getaneh Alemu (Solent University, United Kingdom) at the II Congress on Information, Communication and Research (CICI 2018), "Metadata and Information Organization". Faculty of Philosophy and Letters, Universidad Autónoma de Chihuahua, Mexico. Event organized by the academic body "Estudios de la Información" and the disciplinary group "Información, Lenguaje, Comunicación y Desarrollo Sostenible". 29 October 2018.
Web of Data as a Solution for Interoperability. Case Studies (Sabin Buraga)
The paper draws several considerations regarding the use of Web of Data (Semantic Web) technologies, such as metadata vocabularies and ontological constructs, to increase the degree of interoperability within distributed systems. A number of case studies are presented to express the knowledge in a platform- and programming-language-independent manner.
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal... (Dr. Haxel Consult)
Knowledge Graphs are an increasingly relevant approach to storing detailed knowledge in many domains. Recent advances in NLP make it possible to enrich Knowledge Graphs through automated analysis of large volumes of literature, greatly reducing the effort of traditional manual information capture. In our presentation we report on the approach taken in a project with partner Fraunhofer SCAI in the life sciences, where a knowledge graph organising detailed facts about psychiatric diseases has been computed.
Information of cause-effect relations between proteins, genes, drugs and diseases has been encoded in the BEL (Biological Expression Language) and imported into a Graph database to approach an indication-wide Knowledge Graph for the selected therapeutic area. Ultimately, updating the graph will amount to just rerunning the analysis on the newly published literature.
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t... (Dr. Haxel Consult)
In 2019 the UK was the first major economy to embrace a legal obligation to achieve net zero carbon emissions by 2050. More broadly, the 2021 UK Innovation Strategy sets out the UK government’s vision to make the UK a global hub for innovation by 2035 with a target of increasing public and private sector R&D expenditure to 2.4% of GDP to support the UK being a science superpower with a world-class research and innovation system.
IP rights create an incentive for R&D which ultimately leads to innovation. Analysis and insights from IP data can therefore help provide a better understanding of how the IP system is being used and where and what innovation is taking place. Research and analysis of IP data is a key input to the ongoing work of the UKIPO’s Green Tech Working Group which seeks to:
further the UK’s status as a global leader by making the UK’s IP environment the best for innovating green technology;
develop and deliver IP policies to support government’s ambition on climate change and green technologies; and
help innovators best protect and commercialise their green tech innovations both at home and internationally.
The UKIPO has been developing a broad portfolio of 'green' IP analytics research. A series of patent analytics reports looking at green technologies have been published, and analysis has been conducted of the UK's Green Channel scheme for accelerated processing of green patent applications. Patents have been used to identify technological comparative advantage within different green technologies at a country level, and new insights have been uncovered by mapping green technology patents to the UN Sustainable Development Goals (SDGs). Trade mark data provides a timeliness and closeness-to-market factor that patent data does not, and complementary analysis of UK 'green' trade marks, identified using a machine learning algorithm, provides a commercialisation angle to our research.
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc... (Dr. Haxel Consult)
Word embeddings, deep learning, transformer models and other pre-trained neural language models (sometimes recently referred to as "foundational models") have fundamentally changed the way state-of-the-art systems for natural language processing and information access are built today. The "Data-to-Value" process methodology (Leidner 2013; Leidner 2022a,b) has been devised to embody best practices for the construction of natural language engineering solutions; it can assist practitioners and has also been used to transfer industrial insights into the university classroom. This talk recaps how the methodology supports engineers in building systems more consistently and then outlines the changes in the methodology to adapt it to the deep learning age. The cost and energy implications will also be discussed.
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart, Linda Ander... (Dr. Haxel Consult)
In the patent domain, all types of issues, from very specific search requirements to the linguistic characteristics of the text domain, are accentuated. Consequently, to develop patent text mining tools for scientists and patent experts, we need to understand their daily work tasks, as well as the linguistic character of the text genre (i.e., patentese). Patent text is a mixture of legal and domain-specific terms. In processing technical English texts, a multi-word unit method is often deployed as a word-formation strategy to expand the working vocabulary, i.e., introducing a new concept without the invention of an entirely new word. This productive word formation is a well-known challenge for traditional natural language processing tools utilizing supervised machine learning algorithms due to limited domain-specific training data. Deep learning technologies have been introduced to overcome the reduction in performance of traditional NLP tools. In the Artificial Researcher technologies, we have integrated explicit and implicit linguistic knowledge into the deep learning algorithms, essential for domain-specific text mining tools. In this talk, we will present a step-by-step process of how we have developed the mentioned text mining tools. For the final outline, we will also demonstrate how these tools can be integrated in a cross-genre passage retrieval system, based on a technology from 2016 that still holds the state-of-the-art within the patent text mining research community in 2022.
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w... (Dr. Haxel Consult)
In 2013 we witnessed an evolutionary change in the NLP field thanks to the introduction of space embeddings which, with the use of deep learning architectures, achieved human-level performance in many NLP tasks. With the introduction of the attention mechanism in 2017 the results were further improved and, as a result, embeddings are quickly becoming the de facto standard for solving many NLP problems. In this presentation, you will learn how to generate and use space embeddings for search purposes, with comparison metrics against more traditional relevance-based search engines. Moreover, I will provide some initial results from a paper currently under review that gives insight into hyperparameter tuning during the generation of embeddings.
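The core mechanic of embedding-based search described above can be sketched in a few lines. This is a deliberately tiny illustration, not the presenter's system: the 3-dimensional vectors are invented, whereas real embeddings have hundreds of dimensions and come from a trained model.

```python
# Toy sketch of embedding-based search: rank documents by cosine
# similarity between a query vector and precomputed document vectors.
# Vectors and document ids are hypothetical.
import math

DOC_VECTORS = {
    "doc_patents": [0.9, 0.1, 0.0],
    "doc_trademarks": [0.1, 0.9, 0.0],
    "doc_sports": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec):
    """Return document ids ranked by similarity to the query vector."""
    return sorted(DOC_VECTORS,
                  key=lambda d: cosine(query_vec, DOC_VECTORS[d]),
                  reverse=True)

print(search([0.8, 0.2, 0.0]))  # 'doc_patents' ranks first
```

Relevance-based engines instead score lexical overlap (e.g. BM25); the comparison in the talk is between these two ranking signals.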
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e... (Dr. Haxel Consult)
10 years in the making: how real-world business cases have driven the development of CCC's deep search solutions, leading to capabilities for web crawling and delivery of targeted intelligence that help R&D-intensive companies gain a competitive advantage.
AI-SDV 2022: Machine learning based patent categorization: A success story in... (Dr. Haxel Consult)
Machine learning based patent categorization: A success story in monitoring a complex technology with high patenting activity
Susanne Tropf (Syngenta, Switzerland)
Kornel Marko (Averbis, Germany)
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology, ... (Dr. Haxel Consult)
It is relatively easy for a human to read a document and quickly figure out which concepts are important. However, this task is a difficult challenge for a machine. During the past few decades, there have been two main approaches to concept identification: Natural Language Processing and Machine Learning. During the early part of this century, Machine Learning made great strides as new techniques came into wider use (SVMs, topic modeling, etc.). Sensing the competition, Natural Language Processing responded with the deployment of new emerging techniques (semantic networks, finite state automata, etc.). Neither approach has completely solved the WHAT problem. Advances in Artificial Intelligence have the potential to significantly improve the situation. Where AI is making the most impact is as an enhancement that makes Machine Learning and Natural Language Processing work better and, more importantly, work together. This presentation looks at some of this history and at what might happen in the future when we blend the interpretation of language with pattern prediction.
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al... (Dr. Haxel Consult)
Trademarks serve as key leading indicators for innovation and economic growth. As the vanguards of new and expanding enterprises, trademarks can be used to study entrepreneurship and shifting market demands in response to varying economic factors. This responsiveness has been seen as recently as the COVID-19 pandemic, where trademark research revealed key insights about business reaction to the global upheaval.
At CIPO, we have been delving more deeply than ever before into trademark analysis by leveraging cutting-edge natural language processing (NLP) tools to derive actionable business intelligence from trademark data. In this presentation, we present a survey of NLP in use at CIPO and the insights we have learned applying them. These insights include COVID-19 responses, line-of-business trends based on firm characteristics, and more.
We also discuss ongoing and future trademark research projects at CIPO. These projects include emerging technology detection methods and high-resolution trademark classification systems. We conclude that artificial intelligence-enhanced tools like NLP are key components of future exploitation of trademark data for business and economic intelligence.
AI-SDV 2022: Extracting information from tables in documents, Holger Keibel (K... (Dr. Haxel Consult)
In our customer projects involving automated document processing, we often encounter document types that provide crucial data in the form of tables. While established text analytics algorithms are usually optimized to operate on running text, they tend to produce rather poor results on tables, as they do not capture the non-sequential relations inside them (e.g. interpreting the content of a table cell relative to its column title, or interpreting line breaks inside a cell differently from line breaks between cells or rows). While there are elaborate information extraction products on the market for a few highly specific types of tabular documents, there is no general approach. The main cause is the fact that table structures can be encoded by a heterogeneous range of layout means (e.g. column boundaries can be signaled by lines vs. aligned text vs. white space). In this talk, we will illustrate several solutions that we have developed for a range of challenges occurring in this context, both for scanned and digitally generated documents.
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i... (Dr. Haxel Consult)
Most scientific journals request that the complete set of research data be published simultaneously with the peer-reviewed paper. The research data is usually published as so-called "Supplementary Material" attached to the original paper, or on a research data repository. Both forms have in common that the data is usually published unstructured and not in a uniform, machine-processable format. This makes its further use in electronic tools for AI or data mining unnecessarily difficult or even impossible. A concept is presented in which the data is digitally recorded, following the FAIR data principles, as part of the publication process. This digital capture makes the data available to the scientific community for easy use in data mining and AI tools. The data in the repository contains links to the publication to document its origin. The concept is applicable to preprints, peer-reviewed papers, diploma and doctoral theses, and is particularly suitable for open access publications. Moreover, the presentation highlights corresponding activities recently reported in scientific publications.
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited, Jay Ven Eman (CE... (Dr. Haxel Consult)
How do you find video when you only have sparse data? While you can wander the stacks (if you can still find open stacks) for inspiration, video either physical or digital, is difficult to discover. Wandering the virtual stacks is, well, virtually impossible. Discovery platforms on the whole have not replicated the inspirational experience of wandering the stacks.
More companies are using archivable video for internal communication of the various research projects, product developments, test results, and more that are being considered, in progress, or completed. Showing how an experiment was conducted can convey considerably more information that is very difficult to communicate via text. How do you find a company video that might be helpful for your project?
A case study is presented of the problems and the solutions implemented by a large, multinational chemical company. A suite of content discovery technologies was used, including a video-to-text-to-tagging system connected to their document database and automatically indexed using several chemical as well as conceptual systems (rule-based, NLP, inference engine). To build the system and support manuscript and video submission, a metadata extraction program pulls the metadata and inserts it into the submission forms so the author can move quickly through that process.
Copyright Clearance Center
A pioneer in voluntary collective licensing, CCC (Copyright Clearance Center) helps organizations integrate, access, and share information through licensing, content, software, and professional services. With expertise in copyright and information management, CCC and its subsidiary RightsDirect collaborate with stakeholders to design and deliver innovative information solutions that power decision-making by helping people integrate and navigate data sources and content assets. CCC recently acquired the assets and technology of Deep SEARCH 9 (DS9), a knowledge management platform that leverages machine learning to help customers perform semantic search, tag content, and discover new insights.
Lighthouse IP is the world’s leading provider of intellectual property content. The core business of Lighthouse IP is sourcing and creating content from the world’s most challenging authorities. Specialized in IP data, Lighthouse IP provides coverage of over 160 countries for patents, over 200 authorities for trademarks and over 90 authorities for designs. Lighthouse IP data is available via several partners. The company is headquartered in Schiphol-Rijk in the Netherlands and has offices in the United States, China, Thailand, Vietnam, Egypt, Indonesia and Belarus. Globally, a team of 150 experts works on the creation of this unique data collection.
CENTREDOC was created in 1964 as the technical information center of the Swiss watchmaking industry. Building on a strong team of engineers, CENTREDOC now offers a complete range of services and solutions for the monitoring of strategic, technological and competitive information. CENTREDOC is also a leader in patent, technical and business intelligence research, and offers consulting expertise in the implementation of monitoring solutions.
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization... (Dr. Haxel Consult)
The everyday use of AI-driven algorithms for data search, analysis and synthesis brings important time savings, but also reveals the need to understand and accept the limitations of the technology. Practical deployments on concrete topics are essential to assess and manage the challenges of neural-network-based AI. A workshop report.
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight... (Dr. Haxel Consult)
What if there was a platform where literature, conference abstracts, patents, clinical trials, news, grants and other sources were fully integrated? What if the data would be harmonized, enriched with standardized concepts and ready for analysis? After building our patent analytics platform we didn’t stop dreaming and built our big data analytics platform by semantically integrating text-rich, scientific sources. In my presentation I will talk about what we built and why we built it. And, of course, I will also address the challenges and hurdles along the way. Was it worth it and what comes next? Let’s talk about it!
1. Patents, Semantics and
Open Innovation
The role of LOD in a business directory for
knowledge intensive industries
Nice, 20-OCT-2015
Ricardo Eito-Brun
reito@uc3m.es
2. Patents, Semantic and Open
Innovation. LOD
Reasons leading to this research:
◦ Semantic Web technologies and applications, in particular LOD
publishing, constitute a preliminary step towards Information
Systems interoperability.
◦ Having access to distributed data, hosted by different agents and
repositories, opens new possibilities for research in multiple areas.
◦ In the particular case of patent information:
What possibilities do we have if we can aggregate and
analyse these data together with other datasets?
Is it feasible to improve the way we access patent
information?
Can we devise innovative user interfaces to browse and search
patent collections?
3. Patents, Semantics and Open
Innovation. LOD
Schema:
◦ The LOD promises. Potential benefits, technologies and
standards for data encoding and interoperability.
◦ LOD in the world of patents. A review of major milestones.
◦ Overview of current research: what researchers are doing.
◦ Case Study: the particular case of Web-based Innovation
Platforms and digital repositories.
4. Linked Open Data:
benefits, technologies and standards
LOD has become the main application of the SMW approach.
Semantic Web (SMW)
◦ Proposed by Tim Berners-Lee in “The Semantic Web: A New Form
of Web Content That Is Meaningful to Computers Will
Unleash a Revolution of New Possibilities”, Scientific American,
2001.
◦ SMW is about having a more intelligent web, made up of
documents that could be easily processed by computers and
software agents with no human intervention.
◦ SMW data should be “exposed” or published in a machine-
readable format.
◦ Computers should be able to understand the meaning of data.
5. Linked Open Data:
benefits, technologies and standards
W3C SMW Activity:
◦ “The Semantic Web provides a common framework that
allows data to be shared and reused across application,
enterprise, and community boundaries. […]
◦ The Semantic Web is about two things.
It is about common formats for integration and combination of
data drawn from diverse sources…
It is also about language for recording how the data relates to
real-world objects.”*
*http://www.w3.org/2001/sw/
6. Linked Open Data:
benefits, technologies and standards
SMW is presented as an extension of the traditional web.
While content in the traditional web is published for humans, content
in the SMW is published for software programs that can interpret
it and derive new data, information and knowledge.
Initial SMW initiatives were focused on improving software agents’
capabilities to solve information management problems:
◦ "Mom needs to see a specialist and then has to have a series of physical
therapy sessions. Biweekly or something. Lucy instructed her Semantic
Web agent through her handheld Web browser. The agent promptly
retrieved information about Mom's prescribed treatment from the
doctor's agent, looked up several lists of providers, and checked for the
ones in-plan for Mom's insurance within a 20-mile radius of her home…”
7. Linked Open Data:
benefits, technologies and standards
SMW pillars:
◦ URIs or IRIs to provide unique identifiers to resources.
◦ XML to encode and transfer information.
◦ RDF as a vocabulary (commonly serialized in XML) to encode
metadata describing resources.
◦ RDF-S as a means to structure the metadata about resources
(What can be asserted for a resource of a specific type).
◦ OWL as an extension of RDF/RDF-S with additional capabilities
to express constraints on data.
◦ Additional languages to express the rules that govern reasoning
on SMW data.
8. Linked Open Data:
benefits, technologies and standards
RDF proposes a vocabulary (set of tags) to express metadata about
any type of resource.
RDF data can be expressed in XML or in other alternative formats.
An RDF file usually encloses metadata about a specific resource,
e.g.: person, document, institution, company, event…
Resources are identified by unique identifiers (URIs).
◦ URIs are used to ensure that metadata about the same entity are grouped
together.
◦ In case different applications use different identifiers for the same entity, it
is possible to keep the equivalences between the different identifiers.
9. Linked Open Data:
benefits, technologies and standards
◦ Unique identifier for the resource, expressed as a URI.
◦ Equivalences with other identifiers proposed in other contexts
(owl:sameAs).
◦ Metadata about the resource, with clearly defined meaning.
◦ Resources are given a type (rdf:type).
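The figure on this slide is not reproduced in the text; as a minimal sketch (all URIs and values below are invented for illustration, not taken from any real dataset), such an RDF description can be written in Turtle as:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Unique identifier for the resource, expressed as a URI
<http://example.org/person/1234>
    rdf:type   foaf:Person ;                                  # resources are given a type
    owl:sameAs <http://example.org/other-context/p-1234> ;    # equivalence with an identifier from another context
    foaf:name  "Jane Doe" .                                   # metadata with clearly defined meaning
```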
10. Linked Open Data:
benefits, technologies and standards
RDF records about resources can be linked or related.
The value of a specific metadata field or property may refer to the
identifier of another resource.
This allows having sets of structured, linked data.
For example:
◦ Subjects or Topics in a classification code may have unique Ids.
◦ The “Subject” metadata field in a document will take as a value the ID of the
referred topic.
◦ Personal or corporate authors may have unique Ids.
◦ The “Author” metadata field in a document will take as a value the ID of its
personal/corporate author.
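As a hedged illustration of this linking pattern (again with invented identifiers), in Turtle:

```turtle
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The document's metadata take resource IDs, not literal strings, as values
<http://example.org/document/42>
    dc:subject <http://example.org/topic/semantic-web> ;  # ID of the referred topic
    dc:creator <http://example.org/person/1234> .         # ID of the personal author

# The topic is itself a described, linkable resource
<http://example.org/topic/semantic-web>
    rdfs:label "Semantic Web" .
```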
11. Linked Open Data:
benefits, technologies and standards
◦ dc:language refers to the ID of the English language.
◦ dc:subject refers to the ID of a classification code taken from the
DDC system.
◦ dc:subject also refers to the ID of a topic taken from the LCSH
system.
12. Linked Open Data:
benefits, technologies and standards
SMW standards and languages are not limited to RDF.
◦ RDF-S provides a way to define “schemas” for metadata,
in other words, what properties/metadata can we use to
describe entities of a specific type.
◦ SKOS provides a way to encode “subject headings”,
“thesauri” or “classification schemas” used to indicate the
topics the documents are about.
◦ Specific vocabularies to indicate which properties are
available to provide metadata on resources: e.g. Dublin
Core.
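To make the RDF-S idea concrete, here is a minimal schema sketch (the ex: namespace, class and property names are invented for illustration):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/schema#> .

ex:Book a rdfs:Class .             # a type of resource we want to describe

ex:hasAuthor a rdf:Property ;
    rdfs:domain ex:Book ;          # this property describes Books…
    rdfs:range  rdfs:Resource .    # …and its value is another resource
```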
13. Linked Open Data:
benefits, technologies and standards
◦ Properties starting with dc: and dct: are taken from the Dublin
Core vocabulary, which provides a set of metadata elements.
◦ dc:subject points to LCSH/DDC topics that are expressed – in
some other place on the web – using SKOS.
◦ A separate RDF-S document states which properties can be
used when providing metadata about resources of type “Book”.
14. Linked Open Data:
benefits, technologies and standards
RDF statements build a “graph” of resources, properties
and values.
As the amount of metadata collected about the different entities
grows, the graph is expanded.
The RDF model for representing information enables browsing and
discovery mechanisms that go beyond traditional search/browse capabilities.
15. SKOS:
conceptual structures for the SMW
Another vocabulary closely related to the SMW and LOD.
Used to:
◦ Encode “subject headings” or “classification schemas” in XML
format.
◦ Encode relationships between these conceptual structures (e.g.
equivalences between classes of different classification schemas)
◦ Provide list of topics to which document descriptions can be
linked.
Concepts within a SKOS-encoded schema are related to each other
by relationships like <broader> , <narrower> or <related>.
Labels can be given to concepts (linguistic equivalences, authorized,
not authorized, deprecated…).
Concepts can also be annotated.
16. SKOS:
conceptual structures for the SMW
◦ Each concept has a separate skos:Concept element, identified by
a URI.
◦ skos:related points to other concepts with a related meaning.
◦ skos:prefLabel and skos:altLabel provide linguistic labels for the
concept.
◦ skm:UF points to deprecated concepts.
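Since the slide’s figure is not reproduced, the structure it describes can be sketched in standard SKOS (concept URIs and labels are invented for illustration; the slide’s skm:UF relation is omitted because it is not part of the W3C SKOS vocabulary):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/thesaurus/> .

ex:patent a skos:Concept ;                    # each concept is a separate skos:Concept with a URI
    skos:prefLabel "Patent"@en ;              # authorized label
    skos:prefLabel "Patente"@es ;             # linguistic equivalence
    skos:altLabel  "Letters patent"@en ;      # non-authorized label
    skos:broader   ex:intellectual-property ; # link to a broader concept
    skos:related   ex:trademark ;             # link to a concept with related meaning
    skos:scopeNote "Exclusive rights granted for an invention."@en .  # annotation
```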
17. SKOS:
conceptual structures for the SMW
SKOS has become one of the key points in SMW initiatives.
Organizations usually start putting their controlled vocabularies /
classification schemes online using SKOS.
Then, “bibliographic” descriptions are linked to SKOS topics as a
second stage.
This provides an initial pair of linked data sets.
But SKOS becomes truly powerful when we take advantage of its
capability to express relationships between different classification
schemas.
This opens the opportunity of cross-searching different repositories.
18. Semantic Web: standards
SPARQL
SPARQL is a W3C standard that defines a query language to search
for information within RDF graphs.
SPARQL is to the SMW what SQL is to relational databases
like Oracle, MySQL, PostgreSQL…
Collections of RDF documents within a repository can be searched
using “SPARQL end points”.
SPARQL end points are aimed at software agents and software
applications.
Queries are constructed dynamically by software agents, and results
are returned in XML for further processing.
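As a small illustration (the graph pattern and topic URI are invented for the example, not a query against any of the endpoints mentioned later), a SPARQL query over Dublin Core metadata might look like:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Retrieve up to 10 documents indexed under a given topic, with their titles
SELECT ?doc ?title
WHERE {
  ?doc dc:subject <http://example.org/topic/semantic-web> ;
       dc:title   ?title .
}
LIMIT 10
```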
20. Semantic Web:
technologies…
Technologies and tools used to deal with SMW standards and
concepts can be classified in these groups:
◦ Editors, to help define RDF-S schemas.
◦ Conversion tools, to move existing data into the RDF format.
◦ RDF repositories or “triple stores”, to support:
The storage of large data sets
Bulk downloads,
Human browsing
SPARQL searching.
◦ Specific tools to manage controlled vocabularies and generate
SKOS representations.
21. Linked Open Data (LOD)
SMW at work…
“Linked Data is simply about using the Web to create typed links
between data from different sources. These may be databases
maintained by two organisations, or heterogeneous systems within
one organisation that, historically, have not easily interoperated at the
data level.”
“… Linked Data refers to data published on the Web in such a way
that it is machine-readable, its meaning is explicitly defined, it is
linked to other external data sets, and can in turn be linked to from
external data sets.”
Tim Berners-Lee (2006)
22. Linked Open Data (LOD)
SMW at work…
Linked Data Principles:
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using
the standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more
things.
These principles are “rules” or “recommendations” on how to publish
LOD data on the web.
23. Linked Open Data (LOD)
SMW at work…
There is a graphical display created by Richard Cyganiak and Anja
Jentzsch showing published data sets, http://lod-cloud.net/
24. Linked Open Data (LOD)
SMW at work…
Conditions to be included in this catalog:
◦ Data available via URIs through http or https.
◦ Data published in RDF format (any serialization method: RDFa, RDF/XML, Turtle,
N-Triples).
◦ Dataset must have at least 1000 RDF statements.
◦ Dataset must contain links to at least one of the datasets in the diagram (at least
50 links).
◦ Dataset must be available via an RDF dump or through a SPARQL endpoint.
It is also possible to get data about the number of published LOD datasets at:
http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
http://datahub.io is another site where you can identify datasets.
It contains 31 datasets related to patents, including EPO, USPTO,
KIPO, UK Patent…
26. Linked Open Data (LOD)
SMW at work…
To check the use of specific vocabularies/metadata, the Open
Knowledge Foundation has hosted the LOV (Linked Open Vocabularies)
site since 2012:
http://lov.okfn.org/dataset/lov/
Additional LOD-related tools (search engines) include:
◦ http://watson.kmi.open.ac.uk/WatsonWUI/
◦ http://swoogle.umbc.edu/
◦ http://ws.nju.edu.cn/falcons/ontologysearch/index.jsp?query
27. LOD and Patents
Early, Academic Initiatives
NSF (SciSIP) – Science of Science and Innovation Policy
Discussions at the 2011 “Patent Data Workshop” to support quantitative
studies on innovation.
◦ Highlighted the effort of the USPTO Patent Dashboard and Data Visualization
Center to study causes of innovation and outcomes of programs to
stimulate innovation.
SciWire platform* that ingests and links metadata about patents and
grants to explore the R&D landscape.
AKSW (Agile Knowledge Engineering and Semantic Web) to
publish US Patents:
◦ SPARQL End Point: http://us.patents.aksw.org/sparql
◦ June 2014, 187 million triples.
◦ Proposed RDF schema with basic data based on dc, foaf and its own
schema.
*Haak, Laurel; Baker, David; Probus, Matt: Creating a Data Infrastructure for Tracking
Knowledge Flow. 2012
28. LOD and Patents
Academic Initiatives
Subramanian, S. (2013)* analyzed the conversion of USPTO patents into
RDF format, and their merging with DBpedia data to provide
“consolidated / merged search results” (enriched patent data).
Singhi, M., Ding, Y.** also merged USPTO patent data from the
SDB (Scholarly Database at Indiana Univ.) with DBpedia entries for
locations in a common database.
The SDB database includes 26 million records (4 million US patents
in the 1976-2010 period), including MEDLINE and NSF documents
about research grants.
SDB uses patent information as part of its R&D analysis with the
Sci2 bibliometric tool.
* Subramanian, S.; Dhilpe, S. Yalamanchi, U. Exploiting Linked Data and Big Data for Semantic Patent Discovery.
COEN 296, Aug. 2013.
**Singhi, M., Ding, Y. Linking US Patent Data with Wikipedia.
*** Bäumer et al., Linked Open Data for Scientific Data Sets. KONVENS, 2014. Hildesheim, Germany.
29. LOD and Patents
Academic Initiatives
Heinz Nixdorf Institute (Paderborn University) and KISTI (Korea
Institute of Science and Technology Information)*:
◦ Make scientific data available through RDF.
◦ Use of the D2R/D2RQ server and converter and the RelFinder
visualization tool. Pilot project with 60 researchers and 400 related
publications.
Zaveri, A. et al. (2012)** describe their conversion of the USPTO
patents into RDF.
◦ The USPTO patent full-text data is available for download in XML format from
2002 onwards.
◦ From 1976 to 2001, data are available in plain text.
◦ Each week the USPTO releases a zipped file of all patents accepted in that week.
◦ Each year ca. 52 files are published, each containing about 5,000 patents.
◦ They developed an “ontology” or schema to encode patent information.
* Bäumer et al., Linked Open Data for Scientific Data Sets. KONVENS, 2014. Hildesheim, Germany.
** Zaveri et al. (2012). Publishing and Interlinking the USPTO Patent Data. Semantic Web Journal. 24/09/2014.
30. LOD and Patents
Academic Initiatives
Dongmin Seo et al. (2011) designed InSciTe, a technology
opportunity discovery (TOD) service to support decision-making on
R&D planning.
It used RDF data (including patent information) to analyze and
visualize relations between technologies and agents.
◦ Trends and predictions,
◦ Relationship,
◦ Roadmaps,
◦ Competitors and collaborators.
Data set included 3,100,000 patents from US, Europe and Japan.
31. LOD and Patents
Academic Initiatives
Zaveri et al. (2011) described an interesting study on the use of
linked open data to assess the impact of research on the
biomedical area.
They analyzed data for 20 European countries over a 10-year
period (1999 to 2009).
The data set included data from Eurostat and World Bank LOD
datasets.
Input data included the number of Biotechnology patent
applications submitted to EPO.
Zaveri et al. (2013). Using Linked Data to evaluate the impact of Research and
Development in Europe: a Structural Equation Model. LNCS 8219, pp 244-259
32. LOD and Patents
Official Initiatives
USPTO,
◦ In 2014 it developed an 18-month roadmap for open data initiatives.
◦ In April 2015, it published the “Report of Findings from an Open Data
Roundtable with the U.S. Patent & Trademark Office”.
◦ No specific reference to RDF or “linked open data”
◦ Bulk download through http://patents.reedtech.com/, Google Patents.
◦ Plans and achievements:
PatentsView prototype for patent visualization (5 million U.S. patents).
Electronic Data Hosting. Repository of public bulk patent and
trademark data
Assignment Search. Searchable database containing all recorded
Patent Assignment information dating back to August 1980.
33. LOD and Patents
Official Initiatives
EPO (European Patent Office),
◦ OPS (Open Patent Service)** has long provided XML patent data
via REST-based web services.
◦ Queries built on the CQL query language.
◦ Data coming from the EPODOC, EPOQUE (full-text) and BNS (image)
databases (same sources and coverage for bibliographic data as
Espacenet).
◦ Bibliographic data, legal status, facsimile images, CPC classification,
character-coded full text, register and family.
◦ Well-documented query interface for developers.
◦ For large datasets, bulk download is available.
Kallas, P. (2006). Open Patent Service. World Patent Information
Volume 28, Issue 4, December 2006, Pages 296–304
34. LOD and Patents
Official Initiatives
EPO (European Patent Office),
◦ In the specific LOD context, the http://epo.publicdata.eu/ dataset
contains around 22 million triples.
◦ Based on an OWL encoded schema/ontology.
◦ Current pilot based on the conversion of 100,000 EP applications and
the CPC hierarchy (250,000 technical classification symbols) into RDF
triples.
◦ A LOD user interface is provided to view the data in a user-friendly way
without programming, plus a SPARQL endpoint and bulk downloads.
◦ Export to JSON, text, XML or Turtle.
◦ Technical terms extracted from the abstracts are linked to DBpedia.
◦ States (geographical units) from addresses are mapped to
nuts.geovocab.org, and language codes to the Library of Congress.
**Information provided by Martin Kraker (EPO)
35. LOD and Patents
Official Initiatives
Intellectual Property Government Open Data, IPGOD (Australia).
◦ Announced in October 2014.
◦ Available via the Australian government's data portal at data.gov.au.
◦ It covers more than a century of Australian patent, trade mark, and
design data.
◦ Information on the application process for each right.
◦ They have also created a unique set of identifiers to link the data to
external information on companies.
◦ IPGOD includes the PATSTAT “application identifier”, so its data can be
linked directly to PATSTAT.
◦ Harmonised names of rights holders in Australia.
◦ Available through: https://data.gov.au/organization/ip-australia
◦ Detailed, well-documented data model**.
“Linked and open patent data: Australia and Korea moving forward” Patent Information News. EPO. Issue 1/2015.
**http://www.ipaustralia.gov.au/uploaded-files/reports/IP_Government_Open_Data_Paper_-_Final.pdf
36. LOD and Patents
Official Initiatives
OEPM (Spain) Open Data.
◦ Part of the datos.gob.es initiative.
◦ http://datos.gob.es/catalogo/catalogo-opendata-de-oficina-
espanola-de-patentes-marcas-oepm
◦ It provides bulk download of data in XML (plus PDF) based on
the WIPO ST.36 standard for encoding data in XML.
◦ No SPARQL end point.
37. LOD and Patents
Official Initiatives
KIPO (Korean Intellectual Property Office), as part of the IP5 initiative,
started dissemination of patent information in XML in July 2014.
◦ KIPRIS tool, http://plus.kipris.or.kr/eng/main.do
◦ Related to Open Government Data in South Korea.
◦ Patents is one of 16 strategic areas in this program.
◦ Key questions: what information to share, how to share it, and how to support utilization.
◦ How to share: bulk, via API or LOD in XML format (WIPO ST.96 Std.).
◦ How to support utilization: Applicant Name standardization.
◦ IP-Biz Integrated service to connect patent and Business data.
◦ API, Bulk download.
“Linked and open patent data: Australia and Korea moving forward” Patent Information News. EPO. Issue
1/2015.
38. LOD and Patents
A brief summary
Different data sets currently available, but…
◦ Open data (publicly made accessible) does not mean “linked
data”.
◦ XML publishing is not the same as RDF publishing.
◦ Adding links to other entities (companies, people or topics) is a
must-have to talk about “linked” open data.
◦ RDF-S standardization is also an interesting choice.
◦ Publishing data in RDF format is just a first step… to allow target
communities figure out how to use the data.
39. LOD and Patents
Analysis of potential applications
LOD data are useful when they are linked to other data to enhance
the capability of finding additional information.
Different use cases are being considered to exploit existing patent
data sets, in particular:
◦ Enhance institutional repositories with patent data (bulk import or
access through APIs).
Prototype under development to integrate the UCA repository with OEPM data.
Integration with the DSpace repository software to gather patent details.
◦ Integrate patent data into Web-based Innovation Platforms and
Business Directories.
Checking the “innovation capabilities” of agents (persons, entities)
Dynamic building of an “innovation profile” collecting and merging data about
patents, projects and papers.
40. In innovation, linear models have been replaced by
collaborative models
These models are based on feedback and interactions between
different partners.
This trend has evolved toward the Open Innovation model or
paradigm (Chesbrough 2003).
Today, innovation management (IM) is seen as a:
◦ non-linear,
◦ evolutionary,
◦ interactive process between the company and its environment,
◦ that requires the close collaboration of different agents.
LOD and Patents
Analysis of potential applications
41. Principles of Open Innovation
Valuable ideas may come from both inside and outside the
company
It is the consequence of several factors:
◦ knowledge specialization,
◦ availability of highly skilled workers,
◦ increasing capabilities of suppliers,
◦ and the difficulty of mastering in-house all the aspects involved in a
successful innovation life cycle.
◦ Different “knowledge streams” must be managed: market, scientific and
technical, and social knowledge.
Different agents participate at different levels in generating the
knowledge that provides the inputs to create innovations.
This results in complex interfaces between them.
LOD and Patents
Analysis of potential applications
42. OI depends on the ability to cooperate with other partners who
are developing innovation.
Agents need to give visibility to their innovation capabilities and
achievements (products or services, skills, etc.) in a global
context.
In some sectors, e.g. aerospace, big companies need to set up
agreements with other companies to ensure “geographical return
on investment”.
What tools do we have to give visibility to our company?
◦ Business directories
◦ Collaborative, Innovation platforms.
LOD and Patents
Analysis of potential applications
43. Current company/business directories and databases focus on
“contact details” and “financial data”
They are mostly oriented to assessing the “health of companies
from an economic perspective”.
◦ Do these directories offer the data to support OI planning
activities?
◦ Do they provide data to identify areas of expertise and
previous experience?
◦ How easy is it to identify partners for specific projects?
LOD and Patents
Analysis of potential applications
44. Restrictions of current business directories fall in these areas:
◦ They are not exhaustive, and some of them exclude SMEs/VSEs.
◦ They classify companies by large, general areas of activity.
◦ They focus on company financial characteristics: income, sales, audits,
investors, employees….
Missing information:
◦ Product and service descriptions.
◦ Previous experience in projects.
◦ Technical achievements, patents.
◦ Experience and compliance with regulations and standards.
◦ Assessment of intellectual capital (employees with specific profiles),
areas of expertise, etc.
LOD and Patents
Analysis of potential applications
45. And the Collaborative innovation platforms?
Innovation platforms are web-based collective workspaces to
leverage the innovation process.
◦ Registered companies post “challenges”, with a technical description of
the problem to solve.
◦ Participants can propose their solutions to the problem.
◦ The company that posted the challenge selects the most suitable
solution.
Main constraints:
◦ Partners are identified in the specific context of a problem.
◦ They do not support partners’ assessments, but just the assessment of
the proposed solutions.
◦ Innovation life cycle requires a “long-term partnership”.
LOD and Patents
Analysis of potential applications
46. We can conclude that a different type of “directory” may be
useful to support and foster collaboration and innovation:
◦ With data not only about companies, but also about individual researchers,
university departments and research groups.
◦ Additional data: work experience, technical achievements (patents,
technical papers, products).
◦ With a high level of specialization to characterize content (areas of expertise
and achievements).
May Web 3.0 technologies be useful?
◦ Business directories are mainly Web 1.0 tools.
◦ Innovation platforms are mainly Web 2.0 tools.
◦ Hypothesis: this is a data integration problem.
LOD and Patents
Analysis of potential applications
47. A preliminary survey allowed us to identify these information items:
◦ Company contact details, including lines of business and activities.
◦ Areas of knowledge/expertise, going further in detail.
◦ Description of the company facilities and resources.
◦ Achievements:
Projects
Products and services.
Technologies
Patents
Papers
◦ Other entities the company has worked with in collaborative projects
◦ References and customers, linked to achievements.
LOD and Patents
Analysis of potential applications
48. These data items contribute to a metadata infrastructure that can
be used in two different ways :
◦ to identify and assess partners in a global context
◦ to assess the relevance of incoming ideas sent in response to
“innovation challenges”.
LOD and Patents
Analysis of potential applications
49. Reusable ontologies:
◦ FOAF, DC, SIOC, SKOS, OBI, VIVO, Organization ontology, Core
Business Vocabulary.
◦ Idea Ontology for Innovation Management (Riedl et al., 2009)
◦ OntoGate (Bullinger, 2008)
Modelling the idea assessment and selection process.
◦ GI2MO Ontology (Westerski et al. 2010)
Formalization of data to describe ideas and associated information.
◦ Iteams Ontology (Ning, 2006)
Covers goals, actions, teams, results and community.
◦ Innovation Management Ontology (Elbassiti, 2014)
Our Research
Metadata Infrastructure
50. Target:
◦ Having a prototype of a “Semantic-enabled” repository of agents
(companies, researchers and groups) and related achievements to
demonstrate how these tools can support OI initiatives.
◦ Two lines of work: biomedical engineering and aerospace.
◦ Geographical scope: Spain.
Phases:
◦ Identification of information needs.
◦ Data capture in relational structure (2000 main entities).
◦ Vocabulary selection for data encoding.
◦ Loading data into repository.
◦ Linking data to external sources.
◦ Building user interfaces including dynamic searching of remote sites.
Our Research
Phases
51. Patents contribution:
◦ Patents are part of the entity / person achievements.
◦ Patents provide “linguistic clues” to identify skills, competences and
areas of knowledge and build search/browse systems.
Our Research
Phases