The Semantic Web & Web 3.0


A Practical Guide to the Semantic Web for Publishers


Ready for Web 3.0? A Practical Guide to the Semantic Web for Publishers

Introduction

The web as we know it has completely transformed the way we socialise, shop, communicate and generally consume information. However, the full potential of the internet as a medium has yet to be realised. The Semantic Web is the next step (or, more accurately, leap) in the evolution of the web: moving us from a position where individual documents (articles, chapters, images, raw data sets, multimedia files, encyclopaedias etc.) exist within individual data silos, to Tim Berners-Lee's vision of an integrated web of linked data, where mash-ups, apps and context-sensitive discovery routes are the norm.

"The most exciting thing about the Semantic Web is not what we can imagine doing with it, but what we can't yet imagine it will do" - Sir Tim Berners-Lee

As we enter the next phase of the web, publishers are well placed to benefit from new technologies to add value to content assets, increase discoverability, cross-promote products and ultimately remain competitive in attracting authors and customers. Does your publishing strategy consider the Web 3.0 needs of your content and your audience? This document cuts through the hype to provide a high-level practical guide to the buzz words, opportunities and challenges, plus examples of real-life semantic web applications.

What is the Semantic Web?

... an extension of the World Wide Web that provides an easier way to find, share and combine information from disparate sources. In the simplest terms, it is the relationship between things, described in a way which can be understood by both people and machines.

World Wide Web = Web of Documents with Limited Interoperability
Semantic Web = Web of Integrated, Linked Data

The Semantic Web is about:

- Common formats for the integration and combination of data drawn from diverse sources. RDF, OWL and SPARQL are a set of Semantic Web standards led by the W3C (please see the glossary below for further information).
- Content structured in a semantic way so that it is meaningful to computers and to humans. Ontologies and taxonomies provide the core structure, allowing for new ways of navigating and discovering content.
- A language for representing how the data relates to real-world objects. This allows a person or a machine to understand context, and provides a more relevant user experience.

"Data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications" - Sir Tim Berners-Lee
Metastore

In 2005 Publishing Technology began a two-year research and development exercise to determine the format of the future to be used as the backbone for all of our online products. Semantic web technologies were identified as offering benefits over traditional forms of data storage. More specifically, RDF was determined to be the optimal format to future-proof content and to support flexible content models, i.e. thinking beyond the traditional article and chapter containers to provide a data storage model that is truly content agnostic: one that can easily handle journals, books, images and multimedia, but also support the most granular concepts such as a taxonomy item, an author name or a snippet of text.

Metastore was built using the Jena Semantic Web Framework, open source technology originally created by Hewlett Packard Labs. Publishing Technology won the "Best Applications" paper at the Hewlett Packard Labs conference in 2006 for the initial prototype, groundbreaking as the largest commercial RDF triplestore. Metastore is now in live production and holds content from 270 publishers via the IngentaConnect and pub2web services. There is no requirement for publishers to change their workflow to benefit: Publishing Technology accepts a wide range of data inputs (PDF, XML, typesetting files, hardcopy) and converts content into RDF as part of the standard service, providing a solid foundation from which publishers can experiment with semantic web technologies at their own pace, without making changes to their existing workflow.
Breathing Space

Breathing Space is the first semantic web collaboration between scholarly societies and aims to explore the value to researchers of compiling and mining a critical mass of data within a discipline. The project collates content from the European Respiratory Society and the American Thoracic Society, who together account for 30% of citations within the field of respiratory medicine. Publishing Technology's Metastore and pub2web technologies are combined with open source data mining software (Whatizit, provided by the European Bioinformatics Institute) to showcase a number of semantic web features and functionality, including:

- Faceted search and browse, allowing users to explore the content by Disease, Gene Ontology Terms, Species, Authors and Images, and to refine search results by their preferred facet.
- Tag clouds.
- Disease, Gene Ontology and Author homepages, automatically derived from the publisher's content and combined with additional information from trusted, external open data sets (e.g. Bio2RDF, DBPedia).
- Enriched user experience, presenting relevant information from disparate sites in a single interface.
- Visualization of related concepts, offering a new way to explore the content and follow through lines of research.
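The faceted search and browse feature described above can be illustrated with a minimal sketch. The article records, facet names and values below are purely hypothetical, not the Breathing Space data model: counting facet values gives the browse counts, and filtering by a chosen value refines the result set.

```python
# A toy sketch of faceted search & browse over semantically enriched articles.
# All article data and facet names are hypothetical illustrations.

articles = [
    {"title": "Asthma and gene X",    "disease": "asthma", "species": "human"},
    {"title": "COPD in mice",         "disease": "COPD",   "species": "mouse"},
    {"title": "Asthma models in mice", "disease": "asthma", "species": "mouse"},
]

def facet_counts(docs, facet):
    """Count how many documents carry each value of a facet."""
    counts = {}
    for doc in docs:
        value = doc.get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts

def refine(docs, facet, value):
    """Filter the result set by a chosen facet value."""
    return [doc for doc in docs if doc.get(facet) == value]

print(facet_counts(articles, "disease"))                       # browse view
print([d["title"] for d in refine(articles, "species", "mouse")])  # refined results
```

In a production system the facet values would come from taxonomy-driven enrichment rather than hand-entered fields, but the user-facing behaviour (counts per facet, filter on click) is the same.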
"This project is helping us to explore our members' expectations and information habits, as the role of society publishers in disseminating research evolves," says Elin Reeves of the European Respiratory Society. "We're very pleased that Publishing Technology have joined the project as technology partner; their leadership in the deployment of semantic technologies within highly functional publication websites guarantees us the robust, innovative platform we need to support the project."

TBI Communications is coordinating the project and will present the results of the experiment at the Society for Scholarly Publishing Annual Meeting in June 2010. The results will focus specifically on feedback from authors, researchers and the editorial board.
What does it mean for me?

The semantic web has been overhyped but is now gaining momentum. It is increasingly being used within the publishing community, by governments for the management of data, and by news services to provide a more interactive, engaging user experience: integrating information from a variety of sources to present relevant information to the reader in a single location, thereby reducing the number of clicks needed to get to what you need.

A broad understanding of the foundation technologies, as ways for readers to manage the ever-increasing volumes of information and content, is critical for publishers. How do you ensure your content is visible, discoverable and linked? How can you structure your workflow and content to benefit from the semantic web? Once the foundation technology is in place the possibilities are endless. Now is the time to experiment and explore what your audience is expecting from your content.

The volume of information available on the web has exploded in recent years, and it is becoming increasingly difficult to quickly find relevant, trusted information amongst the "noise". Publisher, society and journal brands are key markers of quality. Publishers have a huge amount of specialist information at their fingertips in the form of authors and in-house experts. The challenge is how this expertise and knowledge can be efficiently harnessed to enrich content and online products. Workflows can be adapted to capture additional information as part of the publishing process, but this takes time to implement and encourage. Automated data mining can offer a compromise for semantically enriching your content now: it allows concepts, facts and other relevant information to be extracted from content. Once extracted, the concepts or facts can be used to drive new discovery routes to relevant content, and new products can be created on the fly, giving publishers a competitive edge.
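The simplest form of the automated data mining described above is matching a controlled vocabulary against article text. The sketch below is purely illustrative (real services such as Whatizit use far richer linguistic and ontology-based methods, and the taxonomy terms here are invented for the example):

```python
# A toy sketch of automated concept extraction for semantic enrichment:
# matching known taxonomy terms against article text.
# The taxonomy below is a hypothetical illustration.
import re

taxonomy = {
    "asthma": "Disease",
    "tuberculosis": "Disease",
    "homo sapiens": "Species",
}

def extract_concepts(text):
    """Return sorted (term, category) pairs for taxonomy terms found in the text."""
    found = []
    for term, category in taxonomy.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found.append((term, category))
    return sorted(found)

tags = extract_concepts("New findings on asthma in Homo sapiens.")
# The extracted concepts can then drive facets, tag clouds and topic homepages.
```

Even this crude dictionary approach shows the workflow compromise the text describes: enrichment happens automatically at load time, with specialist input reserved for curating the vocabulary rather than tagging every article by hand.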
A more "lively" and interactive user experience can be driven by semantic web technologies: the integration of external data sets to create a more valuable product, the display of supplementary data in many forms (a YouTube clip, raw statistics, audio files, images, podcasts etc.), and greater interaction.

Opportunities for publishers:

- Opportunities differ depending on the subject area. Start simple and add functionality incrementally - experiment!
- Utilize in-house expert and author knowledge to enrich content.
- Enhance discoverability and visibility of your content.
- Cross-marketing benefit of offering new routes to your content.
- Relevancy and precision of search results encourage return visits to your site.
- Gain competitive edge.

Challenges to consider:

- Adapting workflow: how can additional information be captured as part of the standard workflow?
- Balancing automated semantic enrichment versus manual specialist input.
- Authority, trust and provenance of open data sets.
- Setting clear terms of reuse and attribution (Creative Commons suggested).
Buzz Words: A Glossary

This glossary is not intended to be comprehensive, but rather to provide a snapshot of the key terms used when discussing the Semantic Web. The World Wide Web Consortium (W3C) is an invaluable resource for further information on the Semantic Web and associated standards.

Linked Data

"...a term used to describe a recommended best practice for exposing, sharing and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF" - Wikipedia

For example, DBPedia is the semantic web (or RDF) version of Wikipedia. By creating this alternative view of Wikipedia using semantic web technologies, other content creators and sites can automatically harvest data from DBPedia for inclusion within their own sites. This may be useful in the context of providing further background on a topic, integrating definitions etc. alongside published content.

The LinkingOpenData Project's goal is to: "extend the Web with a data commons by publishing various open data sets as RDF on the Web and by setting RDF links between data items from different data sources. RDF links enable you to navigate from a data item within one data source to related data items within other sources. RDF links can also be followed by the crawlers of Semantic Web search engines, which may provide sophisticated search and query capabilities over crawled data." Collectively, the published and interlinked data sets consist of over 13.1 billion RDF triples, interlinked by around 142 million RDF links (2009). The Linking Open Data (LOD) Project Cloud Diagram gives a flavour of the data sets currently available.
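The core mechanic of linked data, that independent data sets using shared URIs can be combined without bespoke integration work, can be sketched in a few lines. The URIs and facts below are illustrative, not real DBpedia data:

```python
# A minimal sketch of linked data: two triple sets that share URIs can be
# merged mechanically, so facts about the same resource combine.
# All URIs and facts here are hypothetical illustrations.

publisher_data = {
    ("http://example.org/article/1", "mentions", "http://dbpedia.org/resource/Asthma"),
}
external_data = {
    ("http://dbpedia.org/resource/Asthma", "label", "Asthma"),
    ("http://dbpedia.org/resource/Asthma", "category", "Respiratory diseases"),
}

# Merging is just a set union, because both sides speak the same triple model.
merged = publisher_data | external_data

def facts_about(triples, uri):
    """All (predicate, object) pairs asserted about a given resource URI."""
    return {(p, o) for s, p, o in triples if s == uri}

# After merging, background facts from the open data set sit alongside the
# publisher's own statements about the linked resource.
print(facts_about(merged, "http://dbpedia.org/resource/Asthma"))
```

This is why the shared-URI convention matters: the "link" in linked data is nothing more than two data sets agreeing on the same identifier for the same thing.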
Mash-up

"...a web application that combines data from more than one source into a single integrated tool, thereby creating a new and distinct Web service that was not originally provided by either source" - Wikipedia

Examples include WikiFM, a mash-up between Last.FM and Wikipedia allowing you to view artist and song information from Wikipedia while listening to Last.FM radio, or integrating field study data with Google Maps to allow readers to focus in on geographical areas of interest and date ranges, with the ease of integrating additional data sets potentially leading to new discoveries and lines of research.

RDF (Resource Description Framework)

A framework for expressing data as subject-predicate-object triples. For example:

  Charles Dickens    wrote        Pickwick Papers
  (Subject)          (Predicate)  (Object)

Relationships with other RDF statements can therefore be inferred: what else did Charles Dickens write? Which critics have discussed his work? What topics are covered within the Pickwick Papers? And so on: a web of relationships can be automatically inferred. RDF is designed to describe data in a distributed world (and is therefore the standard format for linked data). Everything expressed in RDF means something, whether a reference to a physical object, an abstract concept, or a fact. Standards built on RDF describe logical inferences between facts. By storing data in RDF it can be more easily integrated with other sets of data to create entirely new datasets and visualizations.

SPARQL

The query language for the Semantic Web.

Web 3.0

Increasingly used as an alternative name for the Semantic Web, though not an official term. The Semantic Web is often described as being a component of Web 3.0.

Tim Berners-Lee: "People keep asking what Web 3.0 is. I think maybe when you've got an overlay of scalable vector graphics - everything rippling and folding and looking misty - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource."

Conrad Wolfram: "Web 3.0 is where the computer is generating new information, rather than humans."
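The Charles Dickens triple from the glossary, and the kind of question a SPARQL query answers over such triples, can be sketched as follows. This is a toy in-memory model, not a real triplestore (a production system such as Apache Jena indexes billions of triples); the "Bleak House" and critic triples are added purely for illustration:

```python
# A minimal sketch of the RDF triple model and pattern-matching queries.
# Triples are (subject, predicate, object); extra facts are illustrative.

triples = [
    ("Charles Dickens", "wrote", "Pickwick Papers"),
    ("Charles Dickens", "wrote", "Bleak House"),
    ("G. K. Chesterton", "discussed", "Charles Dickens"),
]

def query(store, subject=None, predicate=None, obj=None):
    """Match triples against a pattern; None acts as a wildcard,
    much like a variable in SPARQL."""
    return [
        (s, p, o) for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "What else did Charles Dickens write?" - in SPARQL, roughly:
#   SELECT ?work WHERE { :Charles_Dickens :wrote ?work }
works = [o for _, _, o in query(triples, "Charles Dickens", "wrote")]
print(works)  # ['Pickwick Papers', 'Bleak House']

# "Which critics have discussed his work?"
critics = [s for s, _, _ in query(triples, None, "discussed", "Charles Dickens")]
```

The wildcard-matching function is the essence of a SPARQL basic graph pattern: fix some positions of the triple, leave others as variables, and collect the bindings.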
Events: Want to hear more about the Semantic Web?

Event: London Book Fair Supply Chain Seminar
Date: 21st April 2010, London
Title: Semantic Web Technologies and Book Publishing
Speaker: Priya Parvatikar (Technical Architect, Publishing Technology)

Event: Society for Scholarly Publishing Annual Meeting
Date: 3rd-4th June 2010, San Francisco
Presentation: Semantic societies: uncovering new research paths, engaging members better
Speaker: Charlie Rapple (Head of Marketing Development, TBI Communications)

Event: "Ready for Web 3.0?" ALPSP seminar, chaired by Louise Tutton (COO, Publishing Technology)
Date: 1st July 2010, London
Presenters include: Priya Parvatikar (Publishing Technology), Leigh Dodds (Talis), Outsell, Biochemical Society, Southampton University and Charlesworth.

Contact

For more information on Publishing Technology, its products and services, please contact:

Emily Taylor
Publishing Technology plc
Tel: +44 1865 397873

About Publishing Technology plc

Publishing Technology's brands include advance, IngentaConnect, VISTA, author2reader, pub2web, ICS and PCG. The Publishing Technology group enables publishers to focus on their core competences by providing a single, trusted partner for both technology requirements and business development services. It is one of the largest providers of software and services for the publishing industry, servicing eight out of ten of the world's largest publishers. The group's proposition uniquely spans front and back office systems, complemented by a range of business development services, to provide the industry's only end-to-end suite of software specifically designed to support the publishing process. Capabilities cover editorial & production, product information, billing & fulfilment, content conversion & hosting, website development, marketing programs, information commerce, customer relationship management, rights & royalties and business intelligence.
All application modules can be configured independently to meet specific publishers' needs and to allow flexible integration with existing systems. Associated sales and marketing services include consultancy and research, sales representation and telemarketing. The company is listed on the AIM market of the London Stock Exchange and has offices in Europe, North America and Australia.

Related Links: Follow on Twitter @publishingtech, or connect with us on LinkedIn.