Digital Library Collection Management using HBase


Speaker: Ron Buckley (OCLC)

OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.

Published in: Software, Education

    1. The world’s libraries. Connected. Digital Library Collection Management using HBase, “AKA: A Success Story”. Case Studies. Ron Buckley. HBaseCon, May 5, 2014.
    2. About OCLC
        Worldwide, member-owned library cooperative
        • Based in Dublin, Ohio
        • Founded in 1967
        • Not-for-profit
        WorldCat
        • Union catalog of library items from 72,000 libraries in 170 countries
        • Over 2 billion records, 2.5 billion location listings
        Hosting
        • Melvyl, University of California Digital Library (and many others) are hosted directly out of WorldCat
    3. HBase @ OCLC
        Center of our world
        • 15-month project to rebuild data infrastructure with Hadoop at the center.
        • Leveraged HBase to build multiple new products.
        • Replaced and decommissioned multiple Oracle RAC environments.
        Old Meets New
        • Dewey Decimal System: OCLC owns and maintains the Dewey Decimal System, which is stored and maintained in HBase.
    4. Moving from Relational to HBase
        Why
        • Data set was too big a long time ago: not long after we built our Oracle database, we removed almost all joins and views.
        • Too expensive: making a dataset available for free open access was going to cost us almost $1 million, just for storage.
        • Slow: we couldn’t analyze the data set because it took a week just to walk it.
        How
        • Text index and our own secondary indexing for HBase.
        • Transition period of about 12 months with both systems. Multiple tools were built and run to find and fix discrepancies.
    5. HBase Book – from HBase
    6. HBase - Hub of Linked Data
        It is imperative that library data be available in new data formats that are native to the web.
        • Databases are walked and analyzed frequently.
        • Many hundreds of millions, soon billions, of interrelated endpoints are stored back to HBase.
        • Endpoints are made available through multiple standard protocols (RDF, JSON, Turtle, N-Triples) for machine use.
        - Tim Berners-Lee
    7. HBase - Hub of Linked Data
    8. HBase as Content Store
        “Libraries aren’t just about books”
        • OCLC CONTENTdm is used by thousands of libraries to manage local digital content preservation.
        • We’re moving over 40 million digital objects (many TBs) into a centrally hosted HBase repository.
    9. Digital storage in HBase
        • Key: the internal key is MD5-hashed into the HBase key.
        • PDFs: compression (Snappy) doesn’t reduce the size of PDF documents.
        • 10 MB cell size: objects over 10 MB are not stored in HBase. We store them in HDFS. (We do store metadata rows for these objects in HBase.)
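
The key design and size cutoff on slide 9 can be sketched as follows. This is a minimal illustration, not OCLC's code: the class name, method names, sample key, and the reading of "10 MB" as 10 × 1024 × 1024 bytes are all assumptions. The slides do not say why keys are hashed; a common reason in HBase is to spread sequential identifiers evenly across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentKeySketch {
    // Cutoff from the slide; interpreting "10 MB" as 10 * 1024 * 1024 bytes is an assumption.
    public static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    /** Hashes an internal key with MD5 and returns it as 32 lowercase hex characters. */
    public static String md5RowKey(String internalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(internalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical routing rule: keep the payload in HBase unless it exceeds the cutoff. */
    public static boolean storePayloadInHBase(long objectSizeBytes) {
        return objectSizeBytes <= MAX_CELL_BYTES;
    }

    public static void main(String[] args) {
        String rowKey = md5RowKey("collection42/object0001"); // made-up internal key
        String target = storePayloadInHBase(3_000_000L) ? "HBase" : "HDFS";
        System.out.println(rowKey + " -> " + target);
    }
}
```

In this sketch an oversized object would still get a metadata row under the same hashed key, with the payload itself written to HDFS, matching the note on the slide.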
    10. University of the Pacific
    11. Academy of Motion Picture Arts and Sciences. Margaret Herrick Library.
    12. Illinois Digital Archives (via Illinois State Library)
    13. U.S. Department of State
    14. Stability - Almost 7 months uptime
        • CDH 4.3 – April 26, 2014 - 37 Region Servers up for 7 months
    15. Performance – Fast
    16. Performance – Cache Hits Help
    17. M/R and HBase?
        • We run hundreds of M/R jobs a day on our user-facing cluster.
        • Our cluster is oversized for HBase.
        • M/R jobs run with limited tasks, niced, …
        • Still faster than “the old way”.
        • Looking forward to multi-tenant features in upcoming releases.
    18. Upgrading HBase
        • We needed a way to upgrade HBase without downtime.
        • Rolling installs on a 50-node cluster sounded cumbersome.
    19. Replication for 0 downtime install
        • HBase Master-Master replication is used to maintain an always-available disaster site.
        • We have a middle-tier service layer (like the Thrift server) that knows about both our main cluster and our DR cluster.
        • When we shut down the main cluster, the middle tier automatically switches to the disaster site.
        • Each cluster runs a web server that exposes its Hadoop config.
        • Example:
    20. Replication for 0 downtime install
        • Instead of relying on hbase-site.xml in the classpath, we load hbase-site.xml via addResource.

        import java.net.MalformedURLException;
        import java.net.URL;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public HBaseManagedConnection(String hbaseSiteUrl, int maxPoolSize) {
            tableCounter = new BlockingCounter(maxPoolSize);
            Configuration config = HBaseConfiguration.create();
            try {
                // Load hbase-site.xml from the URL served by the target cluster's web server
                config.addResource(new URL(hbaseSiteUrl));
            } catch (MalformedURLException mue) {
                LOG.error("**** URL to HBase Site is invalid, Unable to connect to HBase: {} *****", hbaseSiteUrl);
            }
            // ... the slide shows only this part of the constructor
        }
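
The switch-over described on slides 19 and 20 could look roughly like the sketch below. It is an assumption-laden illustration, not OCLC's middle tier: the class name, the `pickSiteUrl` method, the example URLs, and the idea of a pluggable health probe are all hypothetical. It only shows the selection step, choosing which cluster's hbase-site.xml URL to hand to a constructor like the one on slide 20.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class ClusterFailoverSketch {
    /**
     * Returns the first config URL whose cluster passes the health probe.
     * The list is expected in priority order: main cluster first, DR cluster second.
     */
    public static Optional<String> pickSiteUrl(List<String> siteUrlsInPriorityOrder,
                                               Predicate<String> isReachable) {
        for (String url : siteUrlsInPriorityOrder) {
            if (isReachable.test(url)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical URLs; each cluster is assumed to serve its own hbase-site.xml.
        String mainUrl = "http://main-cluster.example.org/hbase-site.xml";
        String drUrl = "http://dr-cluster.example.org/hbase-site.xml";

        // Simulate the main cluster being shut down for an upgrade.
        Predicate<String> probe = url -> !url.equals(mainUrl);

        String chosen = pickSiteUrl(Arrays.asList(mainUrl, drUrl), probe)
                .orElse("no cluster available");
        System.out.println(chosen);
    }
}
```

A real probe would need to check more than whether the web server answers, for example whether the cluster accepts reads and writes. The slides do not describe how the middle tier detects a shutdown.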
    21. Summary
        • HBase is the center of our world and, by association, of a lot of libraries.
        • You can move from relational to HBase.
        • We’ve been successful running user-facing traffic alongside Map/Reduce.
        • EASY to support. We have two converted Oracle DBAs as our front-line admins. Mostly, they’re lent to MySQL support for other internal systems.
    22. Questions?
    23. Come to Ohio - Our snowballs roll themselves!