
NoSQL Technologies from an STM Publishing Perspective (NoSQL Now 2011)



Published in: Technology, Business

  1. NoSQL technologies from an STM publishing perspective. Bradley P. Allen, Elsevier Labs. Presentation at NoSQL Now 2011, San Jose, CA, USA, 2011-08-25
  2. Peak physical media: is it here?
     • “Music Sales”, New York Times, 1 August 2009.
     • “Initial Circs per student”, William Denton, 31 January 2011.
     • “Rise of e-book Readers to Result in Decline of Book Publishing Business”, Steven Mather, iSuppli, 28 April 2011. Electronics/News/Pages/Rise-of-e-book-Readers-to-Result-in-Decline-of-Book-Publishing-Business.aspx
  3. In any case, the challenge to STM publishers is clear
     • Print revenue is softening
     • Online channels are exploding
       – Changing the way customers create and consume our content
       – Leading to new requirements and market opportunities for online products
  4. Additional challenges in STM publishing
     • Academic context and tradition inhibit business model innovation
     • Technology and business are traditionally separate concerns
     • Acquisitions create content and data silos
     • Global market drives lowest-common-denominator technology choices
  5. A simple model of the evolution of STM publishing
     • Print era: 1600s – 1980
       – Packaged as books and journals
       – Physically distributed
       – Access and discovery through libraries
     • Digital Library era: 1980 – 2010s
       – Packaged as books and journals
       – Digitally distributed
       – Access and discovery through search engines
     • Platform-as-a-service era: 2010s
       – Packaged as apps
       – Digitally distributed
       – Access and discovery through social networks
  6. STM publishing use cases in transition
     • Use case: A new medical term relevant to an emerging healthcare issue (e.g. a new type of avian flu virus) needs to be incorporated into a search index immediately
       – Digital Library era: Organizational governance issues about how taxonomies are updated, coupled with manually intensive workflows and ad-hoc approaches to content tagging, inhibit rapid response
       – Platform-as-a-service era: A single, automated and standardized taxonomy management and content enhancement workflow allows rapid and timely update of search applications
     • Use case: Application developers want to mash up epidemiological data with medical journal articles to create a topic-specific Web resource
       – Digital Library era: Data silos without easy means of programmatic access by developers, coupled with governance and business model questions, inhibit data reuse
       – Platform-as-a-service era: Content API and single-point-of-access repository allow data and content to be accessed, discovered and reused across multiple applications
     • Use case: Digital library developers want to stage content into a single repository for unified search index generation
       – Digital Library era: Duplication of core content leads to synchronization and quality control issues
       – Platform-as-a-service era: Consolidation of duplicate repositories into a single point of truth across all content, accessible and discoverable through a Content API, eliminates the need for duplication and synchronization
     • Use case: Third party solutions providers want to integrate content (e.g. tagged medical journal articles, medical taxonomies) into point-of-care solutions
       – Digital Library era: No standards, no APIs for point-of-care content integration across all content and data formats
       – Platform-as-a-service era: Standards and APIs that scale across multiple partners, for all content types, for all delivery
     • Use case: Publishers want to deliver their content to tablets and e-readers in delivery formats that take advantage of the displays and interaction modalities on those devices
       – Digital Library era: No clear standard or approach for targeting emerging eReader and tablet devices; multiple and divergent approaches lead to siloed solutions and duplication of effort
       – Platform-as-a-service era: Web and industry standards for eReader and tablet devices supported as part of standard automated processing into delivery channel-specific formats, regularly updated and exposed through a Content API
     • Use case: Journal publisher wants to integrate content enhancements across multiple subject matter areas to add value to products leveraging Article of the Future technology
       – Digital Library era: No single point of access to content enhancements, no standards for content enhancement suppliers and partners to deliver enhancements for integration
       – Platform-as-a-service era: Easy access to multiple opportunities for content enhancements embedded in standard next-generation article formats and provided using standard content enhancement formats
  7. Facets of STM publishing processes
     • Process Type: acquisition, transformation, enhancement, composition, delivery, access and discovery
     • Entity: author, supplier, Web site, typesetter, automated process, subject matter expert, search engine, content repository, entity registry, API, product catalog, editor, reviewer, user, designer, developer, e-book, mobile app, mobile-enhanced Web site
     • Activity: submitting, crawling, syndicating, formatting, mapping, cleansing, indexing, querying, updating, storing, annotating, subject tagging, classification, entity recognition, entity extraction, fact extraction, clustering, aggregating, ordering, summarizing, filtering, analysis, rendering, design, publishing, accessing, retrieving, deleting
     • Content Type: article, book, media object, entity record, taxonomy, ontology, user-generated content
  8. Emerging content requirements
     • Broad range of content types
       – Must treat as first-class objects video, audio, images, datasets, metadata and knowledge organization systems in addition to articles and books
     • Standards-based
       – Web-standard formats to support ease of integration and interoperability
     • Fine-grained
       – Must be decomposable into and addressable in fragments smaller than the unit of publication; e.g., down to the level of specific words, phrases, images, table cells in articles or book chapters, key frames and segments in videos
     • Discoverable
       – Must be easily located across all levels of granularity
     • Accessible
       – Must be easily accessed through content creation, retrieval, update and deletion (CRUD) services
     • Flexible
       – New content types and associated schemas must be easily added through configuration
     • Reusable
       – It must be efficient for product developers to aggregate and compose content fragments into new products
     • Modifiable
       – Support the enhancement and correction of content at any time following creation
     • Broad range of delivery formats
       – Content standards and services must support fulfillment, delivery and presentation across desktop, notebook, tablet and mobile computing devices
  9. Emerging content architecture
     [Diagram; labels: Linked data, Entity record, Document, Media object, Relational metadata, Acquire, Transform, Enhance, Compose, Deliver]
  10. Content acquisition and transformation
  11. Content enhancement and analytics
  12. Content composition and delivery
  13. Why NoSQL is important to STM publishing
     • NoSQL emphasizes design choices that focus on delivering robust, scalable Web applications
       – Document-centric
       – Schemaless
       – Support for analytics
       – Read/write at Web scale
       – Move scale-out from development to operations
     • As we shift to the platform-as-a-service era, these features become an important part of the STM publishing technology stack
  14. How NoSQL addresses STM publishing’s needs
     • Schemaless, document-centric stores
       – Ease repository extension to accommodate expanding range of new, finer-grained content types
       – Fit HTML5/JS/CSS content stack providing web-based alternatives to native apps
       – Expedite application stack refresh in support of authoring and editorial workflow portals and tools
     • Support for analytics eases innovation in scientometrics
     • Read/write at Web scale accommodates solutions incorporating content at more dynamic, fine-grained scale
       – Entity records
       – Annotations
       – Other forms of community-contributed content
       – Linked data integration of heterogeneous information resources across the Web for mashups/solutions
     • Moving scale-out from development to operations reduces time-to-market and cost of failure for emerging, niche publishing opportunities
  15. Where STM publishing can drive NoSQL requirements
     • Integrated support for search
       – Free text retrieval
       – Faceted navigation
     • Query language functionality
       – Nearest-neighbor matching
       – Joins vs. join-free
     • Primitives/support for analytics design patterns
       – Clustering
       – Classification
       – Entity resolution
     • Primitives/support for semantic enhancement
       – Linked data
       – Language processing
     • Versioning for document stores
  16. Elsevier applications of NoSQL technologies
     • Entity registries
     • Metadata repositories
     • Big data analytics
     • User-built apps
  17. Linked Data Repository
  18. SciVal
  19. SciVerse
  20. Conclusions
     • STM publishing is in transition
     • This is driving new requirements for content
     • Many of these requirements are well met by NoSQL solutions
     • Some requirements point to areas of future work for NoSQL technologists and vendors
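
Illustrative sketch (not part of the original slides): the content requirements on slide 8 (CRUD services, fragments addressable below the unit of publication, new content types added without schema changes) and the faceted navigation requirement on slide 15 can be made concrete with a small in-memory model. Everything here is an assumption made for illustration: the `ContentStore` class, the `/`-separated fragment paths, the `facet_counts` helper and the sample documents are hypothetical and do not describe any Elsevier system or any particular NoSQL product.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


class ContentStore:
    """Hypothetical in-memory stand-in for a schemaless document repository."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def create(self, doc_id: str, doc: Dict[str, Any]) -> None:
        if doc_id in self._docs:
            raise KeyError(f"{doc_id} already exists")
        self._docs[doc_id] = doc

    def retrieve(self, doc_id: str, fragment: Optional[str] = None) -> Any:
        """Return a whole document, or the fragment at a '/'-separated path."""
        node: Any = self._docs[doc_id]
        for key in (fragment.split("/") if fragment else []):
            # List positions are addressed by integer index, dict fields by name.
            node = node[int(key)] if isinstance(node, list) else node[key]
        return node

    def update(self, doc_id: str, fragment: str, value: Any) -> None:
        """Replace one fragment in place, e.g. to correct a table cell."""
        *parents, last = fragment.split("/")
        node = self.retrieve(doc_id, "/".join(parents) or None)
        if isinstance(node, list):
            node[int(last)] = value
        else:
            node[last] = value

    def delete(self, doc_id: str) -> None:
        del self._docs[doc_id]

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return documents whose top-level fields equal every criterion."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]


def facet_counts(docs: List[Dict[str, Any]], facet: str) -> Counter:
    """Count documents per value of one facet, skipping documents that lack it."""
    return Counter(d[facet] for d in docs if facet in d)


store = ContentStore()
store.create("a1", {"type": "article", "subject": "virology",
                    "tables": [{"cells": [["strain", "cases"], ["H5N1", 12]]}]})
store.create("e1", {"type": "entity_record", "subject": "virology", "label": "H5N1"})
# A new content type needs no schema change, only a new document shape.
store.create("n1", {"type": "annotation", "target": "a1", "body": "See e1"})

cell = store.retrieve("a1", "tables/0/cells/1/1")   # address a single table cell
store.update("a1", "tables/0/cells/1/1", 13)        # correct it after creation
by_subject = facet_counts(store.find(), "subject")  # facet over all documents
```

The sketch keeps documents as plain dictionaries so that an article, an entity record and an annotation can coexist without a shared schema. A production store would add persistence, indexing and versioning, none of which is modeled here.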