HBaseCon 2012 | HBase for the Worlds Libraries - OCLC


WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently transitioning WorldCat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for managing HBase in an environment that is equal parts end-user facing and batch.

Published in: Technology, Education

  1. Apache HBase at OCLC. Ron Buckley, May 22, 2012, ron_buckley@oclc.org
  2. About OCLC: OCLC delivers single-search-box access to more than 943 million items from your library and the world's library collections. You'll find: • 1.8 billion indications of ownership information • 214+ million books in libraries worldwide • 663+ million articles with one-click access to full text • 28+ million digital items from trusted sources like Google Books, OAIster and HathiTrust • 13+ million eBooks from leading aggregators and publishers • 44+ million pieces of evaluative content (tables of contents, cover art, summaries, etc.) included at no additional charge • And a LOT more (facilitating interlibrary loan, API access, library-centric research)
  3. Main Case for OCLC • A library gets a new book. • A librarian needs to enter all the data about that item into the local system. • It takes quite some time to enter cataloging data correctly. • Thousands of libraries are all going to get the same book and do the same thing, replicating each other's work. • There should be a system whereby libraries can share and build on each other's work. • SaaS before buzzwords were cool: the system was proposed in July 1966 and first used in 1971. *A member of the HBase implementation team also worked on the initial OCLC system.
  4. Current Data State at OCLC • Oracle (WorldCat – Oracle RAC) • SAN storage (approximately 20 TB) • Several other smaller instances of Oracle • A LOT of stored procedures for read and update. The most commonly used are 10 years old and difficult to follow (being polite). • Two copies of the primary database in other formats, with various processes to keep them in sync (or not)
  5. Schema Design – Oracle Version: 4 main tables. The primary key (xwcmd_id) is an ever-increasing, OCLC-assigned number for every library resource.
  6. Schema Design – HBase Version: 4 tables become 1. Use columns as data.
  7. Using column qualifiers to represent library ownership
       hbase(main):001:0> get 'Worldcat', '1'
       data:createDate  value=19690526 00:00:00.000
       data:hold:10810  value={"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
       data:hold:1100   value={"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
       Qualifier        Value
       data:hold:10810  {"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
       data:hold:1100   {"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
       data:hold:727    {"md":[{"CDATE":"20120522:08:57.000"},{"CPID":"NA"},{"UDATE":"20120522:08:57.000"},{"UPID":"NA"}]}
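As a rough illustration of how a client might consume a row laid out this way, the sketch below models the row as a plain Python dict instead of talking to a real cluster. The sample cells are copied from the slide; the helper names (`parse_holding`, `holdings`, `held_since`) are hypothetical, and reading the number after `hold:` as an identifier for the holding library is our assumption, not something the slide states.

```python
import json
from datetime import datetime

# Sample cells copied from the slide. The row is modelled as a plain dict
# (qualifier -> value) rather than fetched from a real HBase cluster.
ROW = {
    "createDate": "19690526 00:00:00.000",
    "hold:10810": '{"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},'
                  '{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}',
    "hold:1100": '{"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},'
                 '{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}',
}

HOLD_PREFIX = "hold:"                 # qualifier prefix seen on the slide
TS_FORMAT = "%Y%m%d %H:%M:%S.%f"      # timestamp layout seen on the slide


def parse_holding(value):
    """Flatten the "md" list of single-key objects into one dict."""
    fields = {}
    for entry in json.loads(value)["md"]:
        fields.update(entry)
    return fields


def holdings(row):
    """Map each identifier after 'hold:' to its parsed metadata."""
    return {
        qualifier[len(HOLD_PREFIX):]: parse_holding(value)
        for qualifier, value in row.items()
        if qualifier.startswith(HOLD_PREFIX)
    }


def held_since(fields):
    """Parse the CDATE field (assumed to be the creation date) as a datetime."""
    return datetime.strptime(fields["CDATE"], TS_FORMAT)


if __name__ == "__main__":
    for holder, fields in sorted(holdings(ROW).items()):
        print(holder, held_since(fields).date(), fields["UDATE"])
```

Because the set of holders is simply the set of qualifiers present in the row, recording a new holder amounts to writing one more cell, with no table-level schema change.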
  8. Advantages • Everything in one I/O – We get the record, all of its metadata and a complete set of 'who owns it and for how long' in one call to HBase. HBase can generally read it in 1 physical I/O. • New requirements – The existing Oracle table is a binary indicator of 'I own this'. Adding new columns to the table was going to be very difficult. • With HBase, we're now storing complete ownership information just by making up new column qualifiers.
  9. Problems • Nagle – We've disabled Nagle's algorithm across the board. • HBase Balancer – We've written a script that balances at the table level (outside of the default balancer). We are hoping that "Allow regions to be load-balanced by table" (HBASE-3373) is included in 0.94. • IOPS – For us, HBase is used for online, user-facing traffic. Our cluster is designed so that we have plenty of capacity for this use. It's easy for MapReduce activity to fully utilize the available I/O and leave HBase nothing to work with.
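To make the balancer point concrete, here is a minimal sketch of what balancing "at the table level" could look like. It is not OCLC's script, which the slides do not show, and it does not call any HBase API: region assignments are passed in as a plain dict, and `plan_moves`, `EXAMPLE` and the server names are hypothetical. The idea is to cap each server at an even share of every table's regions and move only the overflow.

```python
from collections import defaultdict
from math import ceil


def plan_moves(assignments, servers):
    """Plan region moves so each table is spread evenly over the servers.

    assignments: {region_name: (table_name, current_server)}
    Assumes every current_server appears in `servers`.
    Returns a list of (region, source_server, destination_server).
    """
    # Group regions by table, then by the server currently hosting them.
    by_table = defaultdict(lambda: defaultdict(list))
    for region, (table, server) in sorted(assignments.items()):
        by_table[table][server].append(region)

    moves = []
    for table, per_server in by_table.items():
        total = sum(len(regions) for regions in per_server.values())
        quota = ceil(total / len(servers))   # max regions of this table per server
        load = {s: len(per_server.get(s, [])) for s in servers}
        for src in servers:
            # Shed regions from any server holding more than its share.
            while load[src] > quota:
                region = per_server[src].pop()
                dst = min(servers, key=lambda s: load[s])
                moves.append((region, src, dst))
                load[src] -= 1
                load[dst] += 1
    return moves


# Two servers with two regions each: even by total count, yet each table
# lives entirely on one server.
EXAMPLE = {
    "t1,a": ("t1", "rs1"), "t1,b": ("t1", "rs1"),
    "t2,a": ("t2", "rs2"), "t2,b": ("t2", "rs2"),
}

if __name__ == "__main__":
    for move in plan_moves(EXAMPLE, ["rs1", "rs2"]):
        print(move)
```

In the example, a balancer that looks only at total region counts per server would see nothing to do, while the per-table plan moves one region of each table so that reads against either table are served by both machines.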
  10. Status – Hardware/Software. Systems: Production Cluster • 50 nodes – 3 'control' nodes, 3 'edge' nodes, 44 data nodes • 8 CPU/32 GB RAM/8 TB disk • 3-rack configuration – 10 GB interconnects. 6-Node Clusters – used for testing and disaster recovery • 2 development clusters – IntegrationTest, ProofOfConcept • 2 clusters in a separate datacenter – Business Continuity, Pre-production Testing. Versions • Cloudera Distribution 3 Update 3 – CDH3U3 • Apache HBase 0.92.1
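For a sense of scale, the production figures can be turned into a back-of-the-envelope capacity estimate. The node count and disk size come from the slide; reading "8 TB disk" as per data node and using a replication factor of 3 (the HDFS default) are our assumptions, and the estimate ignores space for the operating system, temporary files and compression.

```python
DATA_NODES = 44        # from the slide
DISK_TB_PER_NODE = 8   # from the slide, assumed to be per data node
REPLICATION = 3        # assumption: HDFS default replication factor

raw_tb = DATA_NODES * DISK_TB_PER_NODE   # total raw disk across data nodes
usable_tb = raw_tb / REPLICATION         # rough capacity after replication

if __name__ == "__main__":
    print(f"raw: {raw_tb} TB, usable: {usable_tb:.1f} TB")
```

Under these assumptions the cluster has roughly 117 TB of replicated capacity, several times the approximately 20 TB of SAN storage behind the Oracle system.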
  11. Backup/Restore: We've built our own backup/restore capability, like that described in https://issues.apache.org/jira/browse/HBASE-4618. It allows for both inter- and intra-datacenter backups and restores. On GitHub at https://github.com/oclc/HBase-Backup. The backup runs weekly and on demand.
  12. Other Interesting Data Sets OCLC Is Moving to HBase • The Dewey Decimal Editorial System – The system where the editors of the Dewey Decimal System do their work. • VIAF – "Virtual International Authority File" – A joint project of several national libraries plus selected regional and trans-national library agencies. The project's goal is to lower the cost and increase the utility of library authority files by matching and linking widely used authority files and making that information available on the Web.