HATHI TRUST RESEARCH
CENTER

Opportunities and Challenges of Text
Mining HathiTrust Digital Library
Koninklijke Bibiotheek...
Thanks to sponsors
HathiTrust Digital Library
• HathiTrust is a partnership of academic &
research institutions, offering a collection of
mil...
 HathiTrust repository is a latent
goldmine for text mining analysis,
analysis of large-scale corpi
through computational...
Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers world-wide to carry out
computationa...
HTRC
Request

Spatial plots
All the complexity

Statistical plots

Complexity hiding interface

Tabular info
Complexity hiding interface
Workset builder
HTRC Timeline
• Phase I: development 01 Jul 2011 – 31 Mar 2013
– HTRC software and services release v1.0
http://sourceforg...
•
•
•
•
•
•
•
•

Philosophy: computation moves to data
Web services (REST) architecture and protocols
WS02 Registry for wo...
Portal

Secure
Capsule
Instance
Manager

Secure
Capsule
Cluster

Sigiri
job
deployment

Hadoop
Cluster
(MapReduce/
HDFS)

...
HTRC’s guiding principle to
computational access
• No computational action or set of actions on part of
users, either acti...
HTRC Secure
Capsule
Architecture
Researcher
requests
new VM

VM
Image
Store
VM
Image
Manager
VM
Manager

VM
Image
Builder
...
Extensions to HT Use
1. Enhanced metadata
Miao Chen, i-school, Indiana Univ

2. Large scale text mining using HPC compute
...
Miao Chen, PhD, Indiana University

METADATA ENHANCEMENT:
PRELIMINARY STUDY
Metadata Enhancement
• Preliminary study of limits of HT Repository
metadata
• Current metadata fields are MARC-based
– E....
Top Metadata Enhancement Items
• 1st round user survey, top requested items
– Word frequency count and document length. At...
Other Metadata Enhancement Items
•
•
•
•
•
•
•
•

Stats analysis: TF-IDF
Readability score
Language identification
Topic m...
Guangchen Ruan, PhD candidate, IU

LARGE SCALE DATA ANALYSIS ON XSEDE
Study Objective
• Study how large scale text analysis of the HT
repository can be carried out using public
compute resourc...
Experimental Environment and Results
• Dataset
2,592,210 volumes, in total 2.1 TB, divided into 1024
partitions of 2GB eac...
Computation Time Distribution

#HTRC @HathiTrust
Word Frequency Distribution

#HTRC @HathiTrust
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University

GENDER IDENTIFICATION OF HTRC
...
Gender Identification of Text
• Can we use author names in bibliographic records to
identify gender?
• 2.6 million bibliog...
Authors vs Names
• Methuen, Algernon Methuen Marshall, Sir bart., 18561924
• Methuem, Algernon
• Methuen Algernon
• Methue...
Sources of Data
• The Virtual International Authority File
– Hosted by OCLC

• Harvested names from multiple data sources
...
Initial Gender Results
• Approximately 80% of name strings have initial
gender identification
– Female
• 59,365
• 10%

– M...
Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate

• Web Names
– 54% hit rate

• Patents ...
• For details http://www.hathitrust.org/htrc/faq
• General contact info
– Beth Plale, Director HTRC, plale@indiana.edu
• R...
Upcoming SlideShare
Loading in …5
×

HathiTrust Reserach Center Nov2013

524 views

Published on

Current state of HathiTrust Research Center, and recent extensions

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
524
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Need to add slide that combines images of the blacklight view and the portal showing a workset. Building a query needs to go after architecture diagram.
  • HTRC hides complexity of analytics. In this sense, it is like Google search, which is a simple interface that hides complexity to search billions of pages. The kinds of things returned from HTRC interaction are spatial relationship of words (and their frequency obviously), statistical plots of information or tabular information.
  • Shifting the complexity hiding interface to the right, we open up the cloud to see what’s inside. HTRC at it simplest has 1) algorithms – these are drawn from SEASR and from other analysis tool suites including Mahout and mapreduce, the 2) HT corpus (and subsets of the corpus that users either have personally as part of a workset, or are publically available, and 3) other data sets that are used. HTRC brokers the bringing together of these pieces so that computation can take place on a resource like Big Red II (or XSEDE). Note that there is an arrow from the compute engine to the complexity hiding interface. This is because researcher interaction with the texts isn’t an automated workflow; it is one requiring levels of interaction with the computation as it is running.
  • Workset is created using this page then gets connected with algorithms, other resources
  • Add WS02 LogAdd logo brii and quarry
  • tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.[1] One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.[1]
  • In the experiment we use 1024 cores and each core processes the same amount of data, i.e. 2T/1024, the histogram distribution (using 100-bin) shows the time consumed by processes. So the y-axis indicate the # of processes that finish the processing within the time indicated by the x-axis. From the distribution we can see that the variance is small, most processes are able to finish within 15-16 mins. 
  • HathiTrust Reserach Center Nov2013

    1. 1. HATHI TRUST RESEARCH CENTER Opportunities and Challenges of Text Mining HathiTrust Digital Library Koninklijke Bibiotheek | 15.Nov.13 Beth Plale – @bplale Professor, School of Informatics and Computing Director, Data To Insight Center Indiana University Tweet us - @HathiTrust #HTRC
    2. 2. Thanks to sponsors
    3. 3. HathiTrust Digital Library • HathiTrust is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. – Founding members of HathiTrust along with University of Michigan are Indiana University, University of California, and University of Virginia http://www.hathitrust.org http://www.hathitrust.org/htrc #HTRC @HathiTrust
    4. 4.  HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpi through computational tools, and time-based analysis Restricted nature of HT content suggests need for new forms of access that preserve intimate nature of research investigation while honoring restrictions  Paradigm: computation takes place close to the data #HTRC @HathiTrust
    5. 5. Mission of HT Research Center • Research arm of HathiTrust • Goal: enable researchers world-wide to carry out computational investigation of HT repository through – Develop model for access: the ‘workset’ – Develop tools that facilitate research by digital humanities and informatics communities – Develop secure cyberinfrastructure that allows computational investigation of entire copyrighted and public domain HathiTrust repository • Established: July, 2011 • Collaborative effort of Indiana University and University of Illinois #HTRC @HathiTrust
    6. 6. HTRC Request Spatial plots All the complexity Statistical plots Complexity hiding interface Tabular info
    7. 7. Complexity hiding interface
    8. 8. Workset builder
    9. 9. HTRC Timeline • Phase I: development 01 Jul 2011 – 31 Mar 2013 – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/ • Phase II: outreach, 01 Apr 2013 - present – 2nd HTRC UnCamp Sep ‘13 Attendees of UnCamp’13 #HTRC @HathiTrust
    10. 10. • • • • • • • • Philosophy: computation moves to data Web services (REST) architecture and protocols WS02 Registry for worksets and results Solr Indexes: full text, MARC, and new metadata noSQL (Cassandra) store as volume store Authentication using WS02 Identity Server Portal front-end, programmatic access Mining tools: currently SEASR #HTRC @HathiTrust
    11. 11. Portal Secure Capsule Instance Manager Secure Capsule Cluster Sigiri job deployment Hadoop Cluster (MapReduce/ HDFS) IU compute resources Agent instance Agent Registry Services, worksets Agent instance instance SEASR analytics service Solr index Meandre Orchestration SEASR service execution HTRC Data API v0.1 Secure Capsule Service User session management Blacklight Identity Server ssh client Volume store Volume store (Cassandra) Volume store (Cassandra) (Cassandra) rsync HathiTrust corpus Page/volume tree (file system) University of Michigan
    12. 12. HTRC’s guiding principle to computational access • No computational action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from the HT repository to reassemble pages from collection for reading • Definition disallows collusion between users, or accumulation of material over time. • Defining “sufficient information”: research has shown need to interact directly with select texts. How much of a text to show? Google withholds from showing to reader every 10th page of a book (Int’l NYTimes Nov 16-17, 2013) #HTRC @HathiTrust
    13. 13. HTRC Secure Capsule Architecture Researcher requests new VM VM Image Store VM Image Manager VM Manager VM Image Builder Secure Capsule: VM controls I/O instance behind scenes Researcher SSH Research install tools onto VM through window on her desktop Research results Registry Services, worksets Final location of results is registry
    14. 14. Extensions to HT Use 1. Enhanced metadata Miao Chen, i-school, Indiana Univ 2. Large scale text mining using HPC compute resources Guangchen Ruan, computer science, Indiana Univ 3. Enhanced metadata to include gender author information Stacy Kowalczyk, library science, Dominican Univ #HTRC @HathiTrust
    15. 15. Miao Chen, PhD, Indiana University METADATA ENHANCEMENT: PRELIMINARY STUDY
    16. 16. Metadata Enhancement • Preliminary study of limits of HT Repository metadata • Current metadata fields are MARC-based – E.g. publication date, authors, title, subject • MARC fields are fundamental • Needed: more fields of users’ interest for granular analytics (Metadata Enhancement) • Solicit user requirements and prioritize for implementation Thanks to Miao Chen #HTRC @HathiTrust
    17. 17. Top Metadata Enhancement Items • 1st round user survey, top requested items – Word frequency count and document length. At volume level ✔ – Author gender identification ✔ – Metadata de-duplication – Word frequency count at page level – Word frequency count for full 10.8 M volume repository #HTRC @HathiTrust
    18. 18. Other Metadata Enhancement Items • • • • • • • • Stats analysis: TF-IDF Readability score Language identification Topic modeling (e.g. LDA probability) Genre Era of compilation Book length (e.g. short or long) Concordance index (indexing with context) #HTRC @HathiTrust
    19. 19. Guangchen Ruan, PhD candidate, IU LARGE SCALE DATA ANALYSIS ON XSEDE
    20. 20. Study Objective • Study how large scale text analysis of the HT repository can be carried out using public compute resources • Task: compute word frequency over entire public domain portion of HT corpus • Uses Blacklight, an HPC machine at Pittsburgh supercomputer center #HTRC @HathiTrust
    21. 21. Experimental Environment and Results • Dataset 2,592,210 volumes, in total 2.1 TB, divided into 1024 partitions of 2GB each • Computation platform XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2 TB memory. Each core processes one partition • Results Whole corpus word count completed in 1,454 seconds or 24.23 minutes #HTRC @HathiTrust
    22. 22. Computation Time Distribution #HTRC @HathiTrust
    23. 23. Word Frequency Distribution #HTRC @HathiTrust
    24. 24. Stacy Kowalczyk, Asst. Professor, Dominican University Zong Peng, HTRC, Indiana University GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
    25. 25. Gender Identification of Text • Can we use author names in bibliographic records to identify gender? • 2.6 million bibliographic records – Extracted personal author data – Marc 100 abcd and 700 abcd • 606,437 unique personal author strings • Bibliographic data is not fielded like patent names • Relying on Standard cataloging practice – Last name, first name middle name, titles/honorifics, dates #HTRC @HathiTrust
    26. 26. Authors vs Names • Methuen, Algernon Methuen Marshall, Sir bart., 18561924 • Methuem, Algernon • Methuen Algernon • Methuen Marshall, Sir, bart., 1856• Methuen, A. Sir, 1856-1924 • Methuen, A. Sir, bart., 1856-1924 • Methuen Marshall, Sir bart 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, bart., 18561924 • Methuen, Algernon, 1856-1924 #HTRC @HathiTrust
    27. 27. Sources of Data • The Virtual International Authority File – Hosted by OCLC • Harvested names from multiple data sources – Census bureau – Baby name sites • EU Patent Research names list (Frietsch et al, 2009; Naldi et al. 2005) – Developed an extensive list of European names • Titles and honorifics – Multiple web resources – Sir, Baron, Count, Duke, Father, Cardinal, etc – Lady, Mrs. Miss, Countess, Duchess, Sister, etc #HTRC @HathiTrust
    28. 28. Initial Gender Results • Approximately 80% of name strings have initial gender identification – Female • 59,365 • 10% – Male • 425,994 • 70% – Unknown • 114,204 • 19% – Ambiguous • 5,965 • Less than 1% #HTRC @HathiTrust
    29. 29. Results by Data Source Against the whole set of name strings • VIAF – 19% hit rate • Web Names – 54% hit rate • Patents Names – 8% #HTRC @HathiTrust
    30. 30. • For details http://www.hathitrust.org/htrc/faq • General contact info – Beth Plale, Director HTRC, plale@indiana.edu • Requests for capability, interest – Miao Chen, Asst. Director HTRC miaochen@indiana.edu #HTRC @HathiTrust

    ×