Linked Data and
the Future of Scientific Publishing
Bradley P. Allen, Elsevier Labs
Presentation to NFAIS Webinar – “Linked Data: What It Is, What It
Does and The Future of Information Discovery”
2012-10-25
Scientific knowledge in a post-print world


 “Our new knowledge does not consist of a
   careful set of works that have passed through
   a series of gates. … Our new knowledge is not
   even a set of works. It is an infrastructure of
   connection.”
 David Weinberger. 2011. Too Big to Know: Rethinking Knowledge Now That the Facts Aren't
 the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room, Basic
 Books, New York, NY




“Infrastructure of connection” = linked data
 • What the literature is about
    – Content inputs: XML, long-form free text, short-form free text, tables, images, video, audio
    – Linked data outputs: asset metadata, citations, classifications, clusters, entities, relations, language models, probabilistic graphical models
    – Benefits: better discoverability; better visualization and understandability; better integration for use in information solutions
 • How the literature is being used
    – Content inputs: article views, search queries, user behavior, social media streams
    – Linked data outputs: article-level metrics, sentiment analysis, ranking and impact metrics, user interest profiles
    – Benefits: provides the researcher insight about her career; provides institutions data about their performance and impact; provides publishers data for optimizing our business



Linked data as standards and best practices
 “Linked data is just a term for how to publish data on
   the web while working with the web. And the web is the
   best architecture we know for publishing information in
   a hugely diverse and distributed environment, in a
   gradual and sustainable way.”
 Jeni Tennison. 2010. Why Linked Data for data.gov.uk?
 http://www.jenitennison.com/blog/node/140

 1. Use URIs as names for things
 2. Use HTTP URIs so that people can look up those names
 3. When someone looks up a URI, provide useful information, using the standards
 4. Include links to other URIs, so that they can discover more things
 Tim Berners-Lee. 2006. Linked Data
 http://www.w3.org/DesignIssues/LinkedData.html
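
 As a rough illustration of the four principles, the sketch below names an article and its author with HTTP URIs and links them in N-Triples syntax. The example.org URIs and the function names are hypothetical and chosen only for illustration; the Dublin Core property URIs are the published ones. It is a minimal sketch, not a description of any production system.

```python
# Sketch only: the example.org URIs are hypothetical placeholders.
ARTICLE = "http://example.org/article/123"   # principles 1 and 2: an HTTP URI as a name
AUTHOR = "http://example.org/person/456"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_CREATOR = "http://purl.org/dc/terms/creator"


def ntriple(subject, predicate, obj, literal=False):
    """Format one statement as a single N-Triples line."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        target = f'"{escaped}"'
    else:
        target = f"<{obj}>"
    return f"<{subject}> <{predicate}> {target} ."


def describe_article():
    """Principle 3: useful information that could be served when the URI is looked up."""
    return [
        ntriple(ARTICLE, DCT_TITLE, "An example article", literal=True),
        # Principle 4: a link to another URI, so a client can discover more.
        ntriple(ARTICLE, DCT_CREATOR, AUTHOR),
    ]


if __name__ == "__main__":
    for line in describe_article():
        print(line)
```

 A server following principle 3 could return these lines when a client requests the article URI; how that lookup is wired up is outside this sketch.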
Scientific publication as linked data
 [Diagram: a workflow running from Acquire, through Transform, Enhance, Index, Analyze, Compose, to Deliver. Within it, Document, Media object and Entity record nodes are labeled with Asset metadata and Provenance metadata and connected by Relational metadata; the whole is grouped as Linked data.]
Linked data is increasingly important in science




The challenge for publishers
 • Create greater online engagement with our content
   and platform
 • Semantically enrich our content and enhance value of
   discovery services compared to the same and similar
   content at other platforms
 • Drive additional usage (in journals and books, in
   downloads and interactivity)
 • Improve our ability to be a partner in research, and as
   a publisher that adds value
 • Improve our connection with the scientific community
   through productive collaborations that improve
   search and discovery for all researchers

Elsevier’s approach to linked data
 • Expose existing asset and subject metadata as linked
   data in Web pages to aid discovery
 • Embrace linked data principles while leveraging our
   existing content production workflow and
   infrastructure
 • Leverage partners for content enhancement and
   knowledge organization
 • Reuse Web-standard vocabularies, taxonomies,
   ontologies and entity resources where possible
 • Collaborate in building needed authoritative resources
   for identity resolution and metrics
 • Deliver benefits across the complementary use cases
   of researcher and practitioner

Creating smart content by extracting & linking

 [Diagram: a ring of five elements: Asset Metadata, Entities, Relations, Citations, Usage]
Methods for extracting and linking content & data




 [Three columns; headings not recoverable]

 • Very mature, but hard to scale
 • Crowdsourcing is a possible solution, but quality control is a challenge

 • Variable degrees of maturity, but huge strides through machine learning research and practical application on the consumer Internet
 • Data-driven, so the more data the better
 • Models can be used to build applications, can be a new type of publication

 • Language-driven, so challenging to generalize and scale
 • Crucial to realize promise of ease of integration
Packaging linked data for content production
 [Diagram: a satellite document packaged as tag:satelliteWrapper + XML Schema, containing rdf:RDF + namespaces and a sat:Satellite. The satellite holds concept schemes, tags (Statement 1: Diabetes; Statement 2: Hypertension; ...) and region tags (Para1-Statement-1: Diabetes; Para2-Statement-2: Hypertension; ...). The diagram also shows a SKOS Generator, an RDF Generator and the LDR.]

 Example RDF statements
 • Tags from a taxonomy for a given document
 • Document sections relevant to a given concept
 • Document sections providing answers to a given question
 • Learning objects compliant with a given state educational standard
 • Genes mentioned in a given document
 • Documents supporting or disputing conclusions of a given document
 • Concepts that are in the areas of expertise for a given author
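
 The first two example statement types (tags from a taxonomy for a given document, and document sections relevant to a given concept) can be sketched as lookups over a small set of triples. The satellite format itself is not specified on the slide, so everything below is an assumption: the example.org URIs, the use of dcterms:subject and dcterms:isPartOf to connect documents, sections and concepts, and the function names are all hypothetical.

```python
# Sketch only: a tiny in-memory triple set standing in for a satellite's RDF payload.
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
DCT_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
SKOS_IN_SCHEME = "http://www.w3.org/2004/02/skos/core#inScheme"

BASE = "http://example.org/"                 # hypothetical namespace
DOC = BASE + "doc/1"
PARA1 = BASE + "doc/1#para1"
PARA2 = BASE + "doc/1#para2"
SCHEME = BASE + "taxonomy"
DIABETES = BASE + "concept/diabetes"
HYPERTENSION = BASE + "concept/hypertension"

TRIPLES = {
    (DIABETES, SKOS_IN_SCHEME, SCHEME),
    (HYPERTENSION, SKOS_IN_SCHEME, SCHEME),
    (DOC, DCT_SUBJECT, DIABETES),            # document-level tags
    (DOC, DCT_SUBJECT, HYPERTENSION),
    (PARA1, DCT_IS_PART_OF, DOC),            # region tags on sections
    (PARA2, DCT_IS_PART_OF, DOC),
    (PARA1, DCT_SUBJECT, DIABETES),
    (PARA2, DCT_SUBJECT, HYPERTENSION),
}


def tags_for_document(triples, doc, scheme):
    """Concepts from the given scheme that tag the document."""
    in_scheme = {s for s, p, o in triples if p == SKOS_IN_SCHEME and o == scheme}
    return sorted(o for s, p, o in triples
                  if s == doc and p == DCT_SUBJECT and o in in_scheme)


def sections_for_concept(triples, doc, concept):
    """Sections of the document that are tagged with the concept."""
    parts = {s for s, p, o in triples if p == DCT_IS_PART_OF and o == doc}
    return sorted(s for s, p, o in triples
                  if p == DCT_SUBJECT and o == concept and s in parts)


if __name__ == "__main__":
    print(tags_for_document(TRIPLES, DOC, SCHEME))
    print(sections_for_concept(TRIPLES, DOC, DIABETES))
```

 In practice such questions would more likely be posed as queries against a triple store; the set comprehensions here only show the shape of the statements involved.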
Infrastructure for storing and publishing linked data
 [Architecture diagram, top to bottom]
 • Loader (REST)
 • Data Spaces: Annotation Satellites, Asset Satellites, Vocab Satellites, 3rd Party Data
 • Pipeline Coordination and Pipeline Services (Hadoop EMR): JSON Transform, N-Quads Extract, Reasoning, Interlinking, RDF Validation, Ontology Svcs
 • Discovery Services
    – Data stores: Amazon S3, MongoDB, SIREN/SOLR, Virtuoso Triplestore
    – Services: A&E, Discovery Service API (REST), Atom Feed, Admin & Monitoring, Analytics, Ontology Service, SPARQL Endpoint
 • Load Balance & Failover (Akamai GTM & Amazon ELB)
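
 The slide names JSON and N-Quads stages in the pipeline but does not show their formats. As a hedged sketch of what turning a JSON record into N-Quads could look like, the code below assumes a hypothetical record layout (an "id" URI plus a "properties" map) and a hypothetical named graph per data space; none of these details come from the slide.

```python
import json


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Quads string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def json_to_nquads(record_json, graph):
    """Convert one JSON record (assumed layout) into N-Quads lines in the given graph."""
    record = json.loads(record_json)
    subject = record["id"]
    quads = []
    # Sort predicates so the output order is deterministic.
    for predicate, value in sorted(record["properties"].items()):
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                obj = f"<{item['uri']}>"                 # link to another resource
            else:
                obj = f'"{escape_literal(str(item))}"'   # plain literal
            quads.append(f"<{subject}> <{predicate}> {obj} <{graph}> .")
    return quads


# Hypothetical record and graph name, for illustration only.
RECORD = json.dumps({
    "id": "http://example.org/doc/1",
    "properties": {
        "http://purl.org/dc/terms/title": "Example",
        "http://purl.org/dc/terms/subject": [
            {"uri": "http://example.org/concept/diabetes"}
        ],
    },
})
GRAPH = "http://example.org/graph/asset-satellites"

if __name__ == "__main__":
    for quad in json_to_nquads(RECORD, GRAPH):
        print(quad)
```

 The fourth term of each quad records which graph a statement belongs to, which is one way to keep statements from different data spaces apart once they are loaded into a single store.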
Integrating content & data services with linked data




Delivering linked data through multiple online services
 Each entry lists the organization, then its main driver, example, benefits and linked data.

 • S&T, Journals
    – Main driver: Making the article more engaging and informative through visualization and linking
    – Example: Article of the Future
    – Benefits: Understanding, Discovery
    – Linked data: Entities, Citations, Relations
 • S&T, Books
    – Main driver: Making the book more engaging and informative through visualization and linking
    – Example: Brain Navigator
    – Benefits: Understanding, Discovery
    – Linked data: Entities
 • S&T, A&G Research
    – Main driver: Making the discovery of relevant content easier and more engaging
    – Example: Lipids SciVerse App
    – Benefits: Discovery, Integration
    – Linked data: Entities, Asset Metadata
 • S&T, A&G Institutional
    – Main driver: Making data about the production and use of scientific content easier to understand
    – Example: SciVal Spotlight
    – Benefits: Understanding
    – Linked data: Entities, Citations, Usage
 • S&T, Corporate, Alternative Fuels
    – Main driver: Making the exploration of design alternatives easier
    – Example: Elsevier Biofuels
    – Benefits: Discovery
    – Linked data: Entities, Citations
 • S&T, Corporate, Bibliographical Databases
    – Main driver: Automating the indexing of content for traditional discovery channels
    – Example: Embase
    – Benefits: Discovery
    – Linked data: Asset Metadata, Entities
 • S&T, Corporate, Engineering & Technology
    – Main driver: Making the discovery of technology trends and sources easier
    – Example: Illumin8
    – Benefits: Discovery
    – Linked data: Entities, Citations, Relations
 • S&T, Corporate, Pharma Biotech
    – Main driver: Rich integration of content and data in support of research and design workflows
    – Example: Target Insights
    – Benefits: Discovery, Understanding
    – Linked data: Entities, Citations
 • HS, CDS
    – Main driver: Delivering actionable information in the context of medical decision making
    – Example: Order Sets
    – Benefits: Integration
    – Linked data: Entities, Relations
 • HS, GCR
    – Main driver: Making the discovery of relevant medical content easier and contextual
    – Example: Clinical Key
    – Benefits: Discovery
    – Linked data: Entities, Asset Metadata
 • HS, NHP
    – Main driver: Making the delivery and organization of medical content easier to integrate with educational workflows
    – Example: General Education Platform
    – Benefits: Discovery, Integration
    – Linked data: Entities, Asset Metadata, Relations
Challenges in implementing linked data
 • Access to content and data
    – Usage data not integrated or leveraged
    – Hard to stage content for modeling and analytics
 • Integration
    – Adoption of standards across silos and legacy systems
    – Globalization/localization of knowledge organization systems
    – Named entity registries for identity resolution for accreditation,
      provenance and trust
 • Human resources
    – Scarcity of data scientists, language engineers
 • Production
    – Manually intensive knowledge engineering
    – Balancing production validation and rapid iterative development
    – Relation extraction needed but capabilities are minimal at best
    – Tools for syntactic rather than semantic validation
 • Sharing
    – Culture and legacy
    – Business model disincentives
    – Identifier, URI and namespace governance
 • Quality control
    – Lack of clean external data
    – Gaps in linked data resources
    – Bugs in knowledge organization systems
Trends within Elsevier today
 • Increasing acquisition of data and text analytics
   capabilities
 • Shifting dependence from partners to in-house
   resources for content enhancement and
   knowledge organization
 • Innovation in new knowledge organization
   systems (some through integration of existing
   ones)
    – Two main design emphases: taxonomy for discovery,
      ontology for understanding and integration
 • Emergence of shared smart content
   infrastructure based on linked data principles

Smart content is a bridge to the future of publishing
 • Smart content allows publishers to create new
   products and services through structuring
   content for better discovery, insight and utility
    – The value is in the structure, not the content
    – Creating that structure is hard work
    – The kind of hard work that publishers have
      traditionally focused on
 • Consumer Internet businesses are using text and
   data mining to add structure to content today…
   quickly and on the cheap
 • Publishers, societies and libraries both large and
   small can use the same techniques to follow suit

Thank you

Bradley P. Allen
b.allen@elsevier.com
bradleypallen on Twitter, GitHub
