Scraping, Transforming, and Enriching Bibliographic Data with Google Sheets

Michael Williams, Coordinator of Area Studies Technical Services at the Penn Libraries, presents about his experiments with batch querying, web scraping, and data processing using Google Sheets. Mike focuses on his case of harvesting bibliographic data for use in library acquisitions work, but the techniques are applicable to a variety of tasks and disciplines.

  1. 1. Scraping, Transforming, and Enriching Bibliographic Data with Google Sheets. Presentation for CAMIG & Word Lab, January 31, 2020. Michael P. Williams, Area Studies Technical Services Coordinator, Penn Libraries, mpw2@upenn.edu
  2. 2. What Our Department Does
     • Area Studies Technical Services acquires, catalogs, and processes materials in non-Western languages and in non-Latin scripts from suppliers across the world.
     • Firm ordering is often done for long lists of materials advertised by a small set of vendors as spreadsheets, and selected by Area Studies bibliographers.
     • We are transitioning from single, copy-paste-click-heavy transactional orders to batch ordering mediated by spreadsheets (Excel) and MarcEdit software, which transforms tabular data into brief MARC records.
  3. 3. What’s in This Slide Deck?
     Through seven use cases drawn from real-world applications, you’ll see how I have used:
     • Google Sheets' IMPORTHTML, IMPORTXML, and IMPORTDATA functions to fetch a variety of bibliographic info on the web, with the help of HTML structures and XPath references.
     • Text/number formulas (applicable in Excel and Google Sheets) such as CONCATENATE, SPLIT, SUBSTITUTE, ROUNDUP, LEFT, RIGHT, MID, LEN, CHAR, VALUE, TEXT, MOD, and PROPER to manipulate text strings.
     • Conditional formulas like IF and IFNA (or IFERROR) to make statements so the spreadsheet can “make choices.”
     • Third-party applications, such as add-ons like MatchMarc, which queries OCLC through an API; a Google Apps Script function like "importRegex", which applies regular expressions to web scraping; and my own (really clunky) home-grown "ISBN Toolkit" to clean, validate, and reconstruct ISBNs for additional bibliographic information.
  4. 4. I. Building Useful Brief Records in Japanese Acquisitions
  5. 5. Use Case 1: Scraping Bibliographic Data from a Bookseller’s Site
     • Context: A certain vendor, Japan Publication Trading (JPT), lists their titles with a unique reference number (e.g. JPTB1907-0001, the first title in their July 2019 catalog) and can send us readymade MARC records from this data for ordering purposes. Their catalog also lists their price.
     • Problem: From a “wish list” of ISBNs compiled by a bibliographer, how can we determine which titles JPT readily stocks (and which we can fast-track order) and which titles they will need to source for us?
     • Solution: Use the Google Sheets IMPORTHTML function to query an ISBN on the vendor site and return information stored as an unordered list (<ul>).
     • Assumptions: We will retrieve one and ONLY one result.
  6. 6. Using one formula in Google Sheets, we can turn this website’s <ul> element into spreadsheet cells (ISBN is in A2):
     =SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2),CHAR(10),"|")),"|")
  7. 7. How Does This Work?
     =SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2),CHAR(10),"|")),"|")
     1. It CONCATENATEs the base URL with the search query in A2 (the ISBN): CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2) → https://jptbooknews.jptco.co.jp/product?q=9784065137338
     2. It IMPORTs the HTML from the URL it just built and returns the 2nd list on the page ("list",2).
     3. It takes the line breaks in the list (defined by the formula CHAR(10)) and SUBSTITUTEs a pipe (“|”) for each.
     4. It then SPLITs the data in that list at each pipe, sending it across spreadsheet cells (essentially a “text to columns” function).
     5. Afterward, additional formulas fetch and clean that data with other SPLIT/SUBSTITUTE functions for easier readability.
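For readers who want to trace steps 1, 3, and 4 outside a spreadsheet, here is a minimal Python sketch. It does not fetch the vendor page; the sample list text is invented for illustration, and the function names are hypothetical, not part of the original workflow.

```python
BASE_URL = "https://jptbooknews.jptco.co.jp/product?q="  # base URL from the slide


def build_query_url(isbn: str) -> str:
    """Step 1: join the base URL and the ISBN, as CONCATENATE does."""
    return BASE_URL + isbn


def list_text_to_cells(list_text: str) -> list[str]:
    """Steps 3-4: swap line breaks for pipes, then split on the pipe."""
    piped = list_text.replace("\n", "|")
    # Drop empty pieces so blank lines do not become empty cells.
    return [cell for cell in piped.split("|") if cell]


# Invented sample standing in for the text of the page's second list.
SAMPLE_LIST_TEXT = "JPTB1907-0001\nAn Invented Title\n\n1,800 yen"

print(build_query_url("9784065137338"))
print(list_text_to_cells(SAMPLE_LIST_TEXT))
```

The pipe is only a temporary delimiter; any character that never occurs in the scraped text would serve the same purpose.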
  8. 8. More Behind the Scenes Work
     =ROUNDUP(E2*GOOGLEFINANCE("CURRENCY:JPYUSD"),2)
     1. Fetches the yen price written to cell E2 and converts it to USD with the GOOGLEFINANCE function
     2. Rounds the value up to 2 decimal places with ROUNDUP
     3. Gets the estimated US price
     =VALUE(SUBSTITUTE(LEFT(B3,6),"JPTB","20"))
     1. Takes the LEFT-most 6 characters in cell B3 (where the vendor reference number is written), e.g. “JPTB19”
     2. SUBSTITUTEs the string “JPTB” with the digits 20
     3. Gets the numerical VALUE of this text string
     4. Gets the estimated year of publication based on the vendor reference number prefix
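The two helper formulas above can be sketched in Python as follows. This is an illustration only: the exchange rate is a made-up placeholder (the spreadsheet gets a live rate from GOOGLEFINANCE), and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_UP


def estimated_usd(yen_price: str, jpy_to_usd: str) -> Decimal:
    """Mirror ROUNDUP(price * rate, 2): round up to the next cent."""
    amount = Decimal(yen_price) * Decimal(jpy_to_usd)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_UP)


def estimated_year(reference_number: str) -> int:
    """Mirror VALUE(SUBSTITUTE(LEFT(B3,6),"JPTB","20"))."""
    prefix = reference_number[:6]              # e.g. "JPTB19"
    return int(prefix.replace("JPTB", "20"))   # "2019" becomes 2019


# 0.00913 is an invented rate used only to show the rounding behaviour.
print(estimated_usd("1800", "0.00913"))
print(estimated_year("JPTB1907-0001"))
```

Decimal arithmetic is used here so that rounding up is not thrown off by binary floating-point error.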
  9. 9. Use Case 2: Using Known ISBNs to Fetch Bibliographic Data from a Union Catalog
     • Context: Even if JPT doesn’t stock a title, they can source it for us. But what they can’t easily source is sufficient bibliographic data, especially the accurate Romanized Japanese we need to make useful MARC records for both ordering and pre-acquisition patron discovery.
     • Problem: From that same wish list of ISBNs, how can we get critical bibliographic data and accurate romanization?
     • Solution: Use the Google Sheets IMPORTXML function to query an ISBN on the union catalog, retrieve the catalog ID, and then use the catalog ID for further IMPORTXML and IMPORTDATA functions. Finally, use Google Translate to get romanization from the retrieved data.
     • Assumptions: We will retrieve one and ONLY one result.
  10. 10. First, we get the NCID (the record identifier) from search results:
     =SUBSTITUTE(IMPORTXML(CONCATENATE("https://ci.nii.ac.jp/books/search?advanced=false&count=20&sortorder=3&q=",A2),"/html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href"),"/ncid/","")
     1. It CONCATENATEs the base URL with the search query in A2 (the ISBN).
     2. It IMPORTs the XML from the URL it just built, using an absolute XPath reference to drill down the page to the NCID value, which occurs as a hyperlink (a link to the actual bib record): /html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href
     3. It takes the URL portion it fetched and SUBSTITUTEs the string “/ncid/” with nothing (“”).
  11. 11. Then, we get the bib record data using the NCID we fetched:
     1. =CONCATENATE("https://ci.nii.ac.jp/ncid/",D2) [NCID is in D2] → builds the URL for the bib record
     2. =IMPORTXML(M2,"/html/body/div/div[3]/div[2]/div/div[4]/ul/li[7]/dl/dd") [bib record URL is in M2] → fetches pagination, which is in the bib record only
     3. =CONCATENATE(M2,".tsv") [bib record URL is in M2] → builds a URL for a .tsv file provided by CiNii with each bib (contains limited info)
     4. =INDEX(IMPORTDATA(N2),2) [.tsv URL is in N2] → uses the Google Sheets IMPORTDATA function to fetch the .tsv file; the INDEX function specifies row 2 only (removes the header row), to get the bib data in one row
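As a rough illustration of the URL building and the "row 2 only" step, here is a Python sketch that works on an invented TSV string rather than a live CiNii download. The NCID value, the column names, and the function names are all hypothetical placeholders.

```python
import csv
import io


def ncid_from_href(href: str) -> str:
    """Strip the "/ncid/" prefix, as the SUBSTITUTE step does."""
    return href.replace("/ncid/", "")


def record_url(ncid: str) -> str:
    """Build the bib record URL from the NCID."""
    return "https://ci.nii.ac.jp/ncid/" + ncid


def tsv_url(bib_url: str) -> str:
    """Build the URL of the .tsv file offered with each bib record."""
    return bib_url + ".tsv"


def second_row(tsv_text: str) -> list[str]:
    """Parse tab-separated text and keep row 2 only, like INDEX(...,2)."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return rows[1]


# Invented identifier and invented TSV content, for illustration only.
SAMPLE_NCID = ncid_from_href("/ncid/XX00000000")
SAMPLE_TSV = "title\tauthor\tyear\nAn Invented Title\tAn Invented Author\t2019\n"

print(tsv_url(record_url(SAMPLE_NCID)))
print(second_row(SAMPLE_TSV))
```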
  12. 12. Finally, use Google Translate to take the Japanese title reading (katakana) and transliterate it into roman characters.
  13. 13. …to make something like this: *(note: actual data here fetched from MatchMarc)
  14. 14. II. Discovering Titles & Scraping Bibliographic Details in South Asia Acquisitions
  15. 15. Use Case 3: Scraping Bibliographic Data from Vendor Catalog Searches
     • Context: There are many South Asian languages for which we do not have sufficient time or expertise to collect, but which are important for a representative collection. The catalog of Hindi Book Centre (https://www.hindibook.com/) provides long catalog lists for both Urdu and Punjabi.
     • Problem: These lists are long, and it would be time-consuming to search through each without good filtering mechanisms, which aren’t available at the initial list level.
     • Solution: Scrape the URLs from a list of titles in a search, then scrape the bib data from each URL fetched so the records can be sorted and filtered.
     • Assumptions: Every catalog record has the same data in the same place, with no data fields missing. (…this proves to be mostly true)
  16. 16. https://www.hindibook.com/index.php?p=sr&String=urdu-books&Field=keywords
     https://www.hindibook.com/index.php?p=sr&format=fullpage&Field=bookcode&String=9788178018539
  17. 17. • First, we want to get all the URLs of results for Urdu titles. There are 1514 results, and we can get up to 72 results per page. Since 1514 / 72 ≈ 21.03, our results cover 22 pages.
     • When we click Page 2, we see the URL syntax: https://www.hindibook.com/index.php?&p=sr&String=urdu-books&Field=keywords&perpage=72&startrow=72
     • We can guess that the start row is “0” for Page 1, and can use Google Sheets to CONCATENATE URLs using multiples of 72 (that is, a column of =A1+72, in succession) until we reach the page that starts at row 1512 (that’s page 22).
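The page arithmetic above can be checked with a short Python sketch. The constants come from the slide; the variable names are illustrative, and the guess that page 1 starts at row 0 is the same assumption the slide makes.

```python
import math

TOTAL_RESULTS = 1514   # Urdu results reported on the slide
PER_PAGE = 72          # maximum results per page

BASE_URL = (
    "https://www.hindibook.com/index.php?&p=sr&String=urdu-books"
    f"&Field=keywords&perpage={PER_PAGE}&startrow="
)

# Round up: a final partial page still counts as a page.
page_count = math.ceil(TOTAL_RESULTS / PER_PAGE)

# Start rows are multiples of 72, beginning at 0 for page 1.
start_rows = [page * PER_PAGE for page in range(page_count)]
page_urls = [BASE_URL + str(row) for row in start_rows]

print(page_count)
print(start_rows[-1])
print(page_urls[1])
```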
  18. 18. • For each page URL, we expect up to 72 titles returned. But we don’t need the titles; we need the URLs of the titles’ records.
     • If we right-click on any linked title, we can inspect the element, e.g.: <a href="index.php?p=sr&amp;format=fullpage&amp;Field=bookcode&amp;String=9788178018539 " class="h7 steelblue">1857 KI JUNG-E-AZADI KA GUMNAM SHAHEED RAJA NAHAR SINGH</a>
     • Using XPath and IMPORTXML, we can say “get me the href of any <a> with class="h7 steelblue"”:
       =IMPORTXML(A2,"//a[@class='h7 steelblue']/@href")
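To show what that XPath selects, here is a Python sketch using only the standard library's HTML parser on a small invented page fragment. It is a stand-in for the idea, not how IMPORTXML works internally; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class is exactly 'h7 steelblue'."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "h7 steelblue":
            # The sample href carries a trailing space, so strip it.
            self.hrefs.append((attributes.get("href") or "").strip())


def title_hrefs(html: str) -> list[str]:
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.hrefs


# Invented fragment: one title link in the slide's style plus an unrelated link.
SAMPLE_HTML = (
    '<a href="index.php?p=sr&amp;format=fullpage&amp;Field=bookcode'
    '&amp;String=9788178018539 " class="h7 steelblue">A TITLE</a>'
    '<a href="index.php?p=home" class="nav">Home</a>'
)

print(title_hrefs(SAMPLE_HTML))
```

The parser decodes `&amp;` to `&` in attribute values, so the collected hrefs are ready to be joined to the site's base URL.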
  19. 19.  All 72 (almost complete) URLs are returned in an array, which means we’d need to be cautious about how we sort the list page URLs in column A.
  20. 20. • For each page URL element in B2, we build a full URL:
       =CONCATENATE("https://www.hindibook.com/",B2)
     • Then we IMPORT the XML from each URL in C2:
       =IMPORTXML(C2,"//div[@id='panel1d']") [all book data is contained in this <div>]
     • Optionally, we can get additional data (like the Hindi Book Centre vendor number) with additional XPaths:
       =IMPORTXML(C2,"//div[@id='panel2d']")
       This would be helpful if we decided to use them as a vendor.
  21. 21. III. Checking Holdings & Fetching OCLC Data Using ISBNs
  22. 22. Use Case 4: Matching OCLC Data to Known ISBNs
     • Context: We’ve scraped a lot of Urdu ISBNs from Hindi Book Centre, but we need to make informed decisions about whether we should, and how we could, acquire these. We’d want to gauge whether these have OCLC records (for easy cataloging), whether our reliable South Asian vendor DK Agencies can provide them, and then get accurate information for ordering purposes (titles romanized by Hindi Book Centre do not match ALA/LC romanization standards).
     • Problem: There are 1500+ items, and no staff member has time to search for them one by one to check duplication and get additional info and correct romanization.
     • Solution: Use MatchMarc, a Google Sheets add-on, to query the ISBNs we fetched and return information from OCLC.
  23. 23. MatchMarc: A Google Sheets Add-on that uses the WorldCat Search API, by Michelle Suranofsky and Lisa McColl. Lehigh University Libraries has developed a new tool for querying WorldCat using the WorldCat Search API. […] The tool will return a single “best” OCLC record number, and its bibliographic information for a given ISBN or LCCN, allowing the user to set up and define “best.” Code4Lib Journal, Issue 46, 2019-11-05. https://journal.code4lib.org/articles/14813
  24. 24. These Hindi Books ISBNs…. …searched against these criteria… …match this OCLC data.
  25. 25. Making decisions with this data: Assuming the bibliographer wanted all of these titles…
     • If a local record is found, our holdings are in OCLC. The title is a duplicate, so we don’t need to order.
     • If the existing OCLC record does not have vernacular scripts (indicated by 066$a), we’d prefer to get DK Agencies to sell us the book and provide a MARC record with those scripts. The DK number was in a 938$n subfield.
     • If the existing OCLC record already has scripts and good cataloging, we can order from any vendor. (DK is good but not cheap.)
     • If the existing OCLC record is missing call numbers or subjects, we may want to weigh options in purchasing.
     • If there is no OCLC record at all, this will require original cataloging we cannot handle. Purchase from DK if wanted.
  26. 26. Use Case 5: From Known ISBNs, Check Franklin to Confirm Holdings
     • Context: (Ideally) OCLC will display our holdings for all items we have cataloged; our holdings for these are sent to OCLC. But for those items already on order but not yet cataloged, we should confirm whether they are in Alma/Franklin.
     • Problem: Once again, staff time is valuable, and copying and pasting ISBNs/titles into Alma/Franklin is time-consuming with possibly little payoff.
     • Solution: Use IMPORTXML, IF, and IFNA functions to query Franklin and retrieve an MMS ID, a title, and a link, or otherwise tell us we don’t have the title.
     • Assumptions: We will retrieve one and ONLY one result.
  27. 27. First query Franklin with an ISBN to check for an MMS ID…
     =IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
     …then fetch the title from the first search result…
     =INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
     …then generate a link to that bib.
     =IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
  28. 28. How It Works 1: Perform a Query to Find an MMS ID
     =IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
     1. It CONCATENATEs the base search URL with the search query in A2 (the ISBN).
     2. It IMPORTs the XML from the URL it just built, retrieving the MMS ID from the “data-availability-ids” attribute of the <div> element whose class is “availability-ajax-load”.
     3. If no such element is found and the import returns #N/A, IFNA displays the text “no Franklin result found” instead.
  29. 29. How It Works 2: Perform a Query to Find Title
     =INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
     1. As above, it CONCATENATEs the same search URL, then retrieves the title from the <h3> tag that holds the title link.
     2. The <h3> tag contains a text break, so the INDEX function says “get row 1, column 2” (where the title will appear).
  30. 30. How It Works 3: Generate a Link to the Bib Record
     =IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
     1. IF the result in B2 is the text “no Franklin result found”, it displays “no Franklin link”.
     2. Otherwise, it CONCATENATEs the Franklin URL with the MMS ID retrieved in B2 to generate a link.
  31. 31. IV. Putting REGEX (Regular Expressions) to Work in Google Sheets
  32. 32. Use Case 6: Google’s IMPORT[X] Functions Are Slow, and Frequently Time Out
     • Context: Google Sheets is doing a lot of work importing HTML, XML, and DATA. Results take time to load, and requests time out.
     • Problem: Sometimes the functions are working so hard that no results load at all. Google also throttles the number of queries you can run per day and at one time across all Google Sheets in your Google Drive.
     • Solution: Make your own custom function with Google Apps Script (or borrow one!) to bypass those speed issues.
     • Assumptions: You can program, you know a programmer, or you are willing to search for a solution online and just see what happens.
  33. 33. You don’t have to be a programmer, but you can fake it. Google Apps Script (based on JavaScript) can plug into Google Drive applications, like Google Sheets. Many scripts are available in forums like Stack Overflow. Example: the custom importRegex function developed by Josh Bradley (@josh_b_rad): https://stackoverflow.com/questions/39014766/to-exceed-the-importxml-limit-on-google-spreadsheet
  34. 34. For example… we know Leila Books titles have catalog numbers, so we can guess likely URLs by assuming the numbers in column A match something in their catalog:
     =CONCATENATE("https://www.leilabooks.com/en/details.php?bookno=",A1)
     We use the custom importRegex function to return the desired data from the URL in column B ($B1) with regular expressions, e.g.:
     =importRegex($B1,"Book Title</td><td width='75%' class='colmn2'>(.*)</td>")
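The regular expression itself can be tried out in Python against an invented HTML fragment. This sketch only demonstrates what the pattern captures; it is not the importRegex script, and the helper name and sample markup are hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the slide; the capture group holds the title text.
TITLE_PATTERN = r"Book Title</td><td width='75%' class='colmn2'>(.*)</td>"


def first_capture(html: str, pattern: str) -> Optional[str]:
    """Return the first captured group, or None when nothing matches."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Invented fragment shaped like the markup the pattern expects.
SAMPLE_HTML = (
    "<tr><td>Book Title</td>"
    "<td width='75%' class='colmn2'>An Invented Title</td></tr>"
)

print(first_capture(SAMPLE_HTML, TITLE_PATTERN))
```

Because `(.*)` is greedy, a line with several `</td>` tags after the title would capture more than intended; `(.*?)` is the usual adjustment in that case.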
  35. 35. Addendum: ISBN Toolkit
  36. 36. Use Case 7: ISBNs Should be Unique and Valid… But Sometimes Aren’t
     • Context: In a perfect world, every resource should have a valid ISBN to differentiate titles, editions of those titles, and formats of those editions (i.e. the 1st edition of a print title has a different ISBN than the eBook edition, and than its 2nd edition, etc.). ISBNs have come in different, equivalent “flavors” too, which look similar but are distinct.
     • Problem: The world isn’t perfect, and ISBNs aren’t free. Publishers recycle them across titles/editions, fail to use them as expected, or format them improperly. Or they provide one flavor (ISBN-10) when we really expected another (ISBN-13) for a particular application.
     • Solution: Use Excel/Google Sheets to attempt to fix ISBNs using known rules for calculating ISBNs.
     • Assumptions: You have a lot of ISBNs, suspect some are broken/invalid, and/or you also want to convert between ISBN-10s and -13s.
  37. 37. Using some functions like SUBSTITUTE and IF, we can clean ISBNs of extra characters like hyphens, and then use the LEN (length) function to determine if they are valid lengths (10 or 13). If they are, we can then calculate the valid check digit, determine if the ISBN we entered is valid, and if it isn’t, we can “reconstruct” the supposedly valid ISBN.
  38. 38. Determining the ISBN type
     =IF(LEN(SUBSTITUTE(A2,"-",""))=13,"ISBN-13",IF(LEN(SUBSTITUTE(A2,"-",""))=10,"ISBN-10","N/A"))
     1. We’ve nested some IF statements. The first one, IF(LEN(SUBSTITUTE(A2,"-",""))=13, will SUBSTITUTE the hyphens (“-”) with nothing (“”), then it will calculate the LENgth. IF that LEN is 13, it will display “ISBN-13”.
     2. If that LEN is not 13, it will try again, this time looking for a length of 10. If it’s 10, it will display “ISBN-10”.
     3. Otherwise, it displays “N/A”: the ISBN type cannot be determined, so we cannot presume where the missing or extra digits are.
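The nested IF reads more easily as a small function. Here is a Python sketch of the same length test; the function name is illustrative.

```python
def isbn_type(raw: str) -> str:
    """Classify an ISBN by its length after hyphens are removed."""
    cleaned = raw.replace("-", "")   # SUBSTITUTE(A2,"-","")
    if len(cleaned) == 13:
        return "ISBN-13"
    if len(cleaned) == 10:
        return "ISBN-10"
    return "N/A"                     # length gives no clue to the type


for candidate in ["978-4-06-513733-8", "8185360863", "81853608"]:
    print(candidate, isbn_type(candidate))
```

Like the formula, this checks length only; it says nothing yet about whether the check digit is correct.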
  39. 39. Readymade formulas help us first calculate the ISBN-10 check digit, using the 9-digit “root” (the ISBN minus the 978- prefix, and minus the check digit).
     =IF(LEN(SUBSTITUTE(A2,"-",""))=10,MOD(MID((SUBSTITUTE(A2,"-","")),1,1)+MID((SUBSTITUTE(A2,"-","")),2,1)*2+MID((SUBSTITUTE(A2,"-","")),3,1)*3+MID((SUBSTITUTE(A2,"-","")),4,1)*4+MID((SUBSTITUTE(A2,"-","")),5,1)*5+MID((SUBSTITUTE(A2,"-","")),6,1)*6+MID((SUBSTITUTE(A2,"-","")),7,1)*7+MID((SUBSTITUTE(A2,"-","")),8,1)*8+MID((SUBSTITUTE(A2,"-","")),9,1)*9,11),IF(LEN(SUBSTITUTE(A2,"-",""))=13,MOD(MID((SUBSTITUTE(A2,"-","")),4,1)+MID((SUBSTITUTE(A2,"-","")),5,1)*2+MID((SUBSTITUTE(A2,"-","")),6,1)*3+MID((SUBSTITUTE(A2,"-","")),7,1)*4+MID((SUBSTITUTE(A2,"-","")),8,1)*5+MID((SUBSTITUTE(A2,"-","")),9,1)*6+MID((SUBSTITUTE(A2,"-","")),10,1)*7+MID((SUBSTITUTE(A2,"-","")),11,1)*8+MID((SUBSTITUTE(A2,"-","")),12,1)*9,11),"BAD ISBN"))
     And one more function helps us get a value of “X” for ISBN-10s that end in X:
     =IF(C2=10,"X",C2)
     Helpful sources:
     • http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
     • http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
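The long formula above is a weighted sum modulo 11. As a compact illustration, here is a Python sketch of the same calculation, including the “X” substitution; the function name is hypothetical, and the examples reuse ISBNs that appear elsewhere in this deck.

```python
def isbn10_check_digit(raw: str) -> str:
    """Compute the ISBN-10 check digit from the 9-digit root."""
    cleaned = raw.replace("-", "")
    if len(cleaned) == 10:
        root = cleaned[:9]       # drop the existing check digit
    elif len(cleaned) == 13:
        root = cleaned[3:12]     # drop the 978 prefix and the check digit
    else:
        return "BAD ISBN"
    # Weight the nine root digits 1 through 9, then take the sum modulo 11.
    total = sum(int(digit) * weight for weight, digit in enumerate(root, start=1))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


print(isbn10_check_digit("8185360866"))      # the "invalid" ISBN-10 from the deck
print(isbn10_check_digit("9784065137338"))   # the ISBN-13 used in Use Case 1
```

For the root 818536086 the weighted sum is 223, and 223 mod 11 is 3, which is why 8185360863 is the valid form of 8185360866.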
  40. 40. With the ISBN-10 check digit calculated (value of 0-9 or else X), we can reconstruct a valid ISBN-10 (and write it as a 10-character TEXT value, since it may end with an “X”)…
     =TEXT(IF(LEN(SUBSTITUTE(A2,"-",""))=13,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),4,9),D2),IF(LEN(SUBSTITUTE(A2,"-",""))=10,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),1,9),D2),"Cannot validate")),"0000000000")
     …and with that ISBN-10, we calculate the valid ISBN-13:
     =TEXT(IF(LEN(E3)=10,CONCATENATE("978",MID(E3,1,9),MOD((10-MOD(SUM(9,21,8,PRODUCT(MID(E3,1,1),3),MID(E3,2,1),PRODUCT(MID(E3,3,1),3),MID(E3,4,1),PRODUCT(MID(E3,5,1),3),MID(E3,6,1),PRODUCT(MID(E3,7,1),3),MID(E3,8,1),PRODUCT(MID(E3,9,1),3)),10)),10)),"Cannot validate"),0)
     Helpful sources:
     • http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
     • http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
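The ISBN-13 step prefixes 978 to the 9-digit root and weights the twelve digits 1, 3, 1, 3, and so on; the constants 9, 21, and 8 in the formula are the weighted 978 prefix. A Python sketch of that conversion follows, with an illustrative function name.

```python
def isbn13_from_isbn10(isbn10: str) -> str:
    """Build the ISBN-13 that corresponds to a 10-character ISBN-10."""
    if len(isbn10) != 10:
        return "Cannot validate"
    digits = "978" + isbn10[:9]   # prefix plus the 9-digit root
    # Positions alternate between weight 1 and weight 3, starting with 1.
    total = sum(
        int(digit) * (3 if index % 2 else 1)
        for index, digit in enumerate(digits)
    )
    check = (10 - total % 10) % 10   # outer modulo turns a 10 into 0
    return digits + str(check)


print(isbn13_from_isbn10("4065137330"))
print(isbn13_from_isbn10("8178018535"))
```

Only the first nine characters of the ISBN-10 are used, so an ISBN-10 ending in “X” converts without special handling.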
  41. 41. And sometimes the “invalid” ISBN is “valid”, in context of the title in hand (or on a vendor spreadsheet) “Invalid” ISBN-10 8185360866 Valid ISBN-10 8185360863: Two editions matched (different date)
  42. 42. Questions/Demo Time?
  43. 43. Use Case Addendum: Fetching Book Data from Known Bookseller URLs (my first experiment, rebuilt!)
     • Context: We want to expand language coverage of titles in our South Asia collections, but no staffing model can accommodate the dozens of languages we need for a representative collection. We found a vendor, DC Books, who has a great website for Malayalam books with info about them in English.
     • Problem: A staff member who knows Malayalam can make book recommendations, but asking him to copy and paste bibliographic data one at a time into a spreadsheet for a bibliographer to review is laborious and time-consuming.
     • Solution: Use structured data on the DC Books website, and have the staff member just record the URLs of books he recommends to us.
  44. 44. =IMPORTXML(A2,"//span[@style='font-size:14px; line-height:26px; color:#333;']")
     • Takes the URL identified, and imports a <span> element where the bib data lives.
     • =PROPER() functions normalize ALL CAPS to Proper Case for title, author, and publisher.
     • =RIGHT([cell],4) takes the YYYY from the DD-MM-YYYY date.
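The two clean-up steps have close Python counterparts, sketched here on invented values. Python's `str.title()` is used as an approximation of PROPER; the two may differ on edge cases such as apostrophes, and the function names are hypothetical.

```python
def proper_case(text: str) -> str:
    """Approximate PROPER(): capitalize the first letter of each word."""
    return text.title()


def year_from_date(date_text: str) -> str:
    """Mirror RIGHT(cell, 4): keep the YYYY of a DD-MM-YYYY date."""
    return date_text[-4:]


# Invented sample values, for illustration only.
print(proper_case("AN INVENTED TITLE"))
print(year_from_date("15-08-2019"))
```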
