SlideShare a Scribd company logo
HATHI TRUST RESEARCH
CENTER

Opportunities and Challenges of Text
Mining HathiTrust Digital Library
Koninklijke Bibiotheek | 15.Nov.13

Beth Plale – @bplale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University

Tweet us - @HathiTrust #HTRC
Thanks to sponsors
HathiTrust Digital Library
• HathiTrust is a partnership of academic &
research institutions, offering a collection of
millions of titles digitized from libraries around
the world.
– Founding members of HathiTrust along with
University of Michigan are Indiana University,
University of California, and University of Virginia
http://www.hathitrust.org

http://www.hathitrust.org/htrc
#HTRC @HathiTrust
 HathiTrust repository is a latent
goldmine for text mining analysis,
analysis of large-scale corpi
through computational tools, and
time-based analysis
Restricted nature of HT content
suggests need for new forms of
access that preserve intimate
nature of research investigation
while honoring restrictions
 Paradigm: computation takes
place close to the data
#HTRC @HathiTrust
Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers world-wide to carry out
computational investigation of HT repository through
– Develop model for access: the ‘workset’
– Develop tools that facilitate research by digital humanities
and informatics communities
– Develop secure cyberinfrastructure that allows
computational investigation of entire copyrighted and
public domain HathiTrust repository

• Established: July, 2011
• Collaborative effort of Indiana University and
University of Illinois
#HTRC @HathiTrust
HTRC
Request

Spatial plots
All the complexity

Statistical plots

Complexity hiding interface

Tabular info
Complexity hiding interface
Workset builder
HTRC Timeline
• Phase I: development 01 Jul 2011 – 31 Mar 2013
– HTRC software and services release v1.0
http://sourceforge.net/p/htrc/code/
• Phase II: outreach, 01 Apr 2013 - present
– 2nd HTRC UnCamp Sep ‘13
Attendees of UnCamp’13

#HTRC @HathiTrust
•
•
•
•
•
•
•
•

Philosophy: computation moves to data
Web services (REST) architecture and protocols
WS02 Registry for worksets and results
Solr Indexes: full text, MARC, and new metadata
noSQL (Cassandra) store as volume store
Authentication using WS02 Identity Server
Portal front-end, programmatic access
Mining tools: currently SEASR
#HTRC @HathiTrust
Portal

Secure
Capsule
Instance
Manager

Secure
Capsule
Cluster

Sigiri
job
deployment

Hadoop
Cluster
(MapReduce/
HDFS)

IU compute resources

Agent
instance
Agent

Registry
Services, worksets

Agent
instance

instance

SEASR analytics
service
Solr index
Meandre
Orchestration

SEASR service
execution

HTRC Data API v0.1

Secure
Capsule
Service

User session
management

Blacklight

Identity Server

ssh client

Volume store
Volume store
(Cassandra)
Volume store
(Cassandra)
(Cassandra)

rsync

HathiTrust
corpus

Page/volume
tree (file system)
University of
Michigan
HTRC’s guiding principle to
computational access
• No computational action or set of actions on part of
users, either acting alone or in cooperation with other
users over duration of one or multiple sessions can
result in sufficient information gathered from the HT
repository to reassemble pages from collection for
reading
• Definition disallows collusion between users, or
accumulation of material over time.
• Defining “sufficient information”: research has shown
need to interact directly with select texts. How much
of a text to show? Google withholds from showing to
reader every 10th page of a book (Int’l NYTimes Nov 16-17, 2013)
#HTRC @HathiTrust
HTRC Secure
Capsule
Architecture
Researcher
requests
new VM

VM
Image
Store
VM
Image
Manager
VM
Manager

VM
Image
Builder

Secure
Capsule:
VM
controls I/O
instance
behind
scenes

Researcher

SSH

Research install tools onto
VM through window on her
desktop

Research
results

Registry
Services,
worksets

Final location
of results is
registry
Extensions to HT Use
1. Enhanced metadata
Miao Chen, i-school, Indiana Univ

2. Large scale text mining using HPC compute
resources
Guangchen Ruan, computer science, Indiana Univ

3. Enhanced metadata to include gender author
information
Stacy Kowalczyk, library science, Dominican Univ
#HTRC @HathiTrust
Miao Chen, PhD, Indiana University

METADATA ENHANCEMENT:
PRELIMINARY STUDY
Metadata Enhancement
• Preliminary study of limits of HT Repository
metadata
• Current metadata fields are MARC-based
– E.g. publication date, authors, title, subject

• MARC fields are fundamental
• Needed: more fields of users’ interest for
granular analytics (Metadata Enhancement)
• Solicit user requirements and prioritize for
implementation
Thanks to Miao Chen
#HTRC @HathiTrust
Top Metadata Enhancement Items
• 1st round user survey, top requested items
– Word frequency count and document length. At
volume level ✔
– Author gender identification ✔
– Metadata de-duplication
– Word frequency count at page level
– Word frequency count for full 10.8 M volume
repository

#HTRC @HathiTrust
Other Metadata Enhancement Items
•
•
•
•
•
•
•
•

Stats analysis: TF-IDF
Readability score
Language identification
Topic modeling (e.g. LDA probability)
Genre
Era of compilation
Book length (e.g. short or long)
Concordance index (indexing with context)
#HTRC @HathiTrust
Guangchen Ruan, PhD candidate, IU

LARGE SCALE DATA ANALYSIS ON XSEDE
Study Objective
• Study how large scale text analysis of the HT
repository can be carried out using public
compute resources
• Task: compute word frequency over entire
public domain portion of HT corpus
• Uses Blacklight, an HPC machine at Pittsburgh
supercomputer center

#HTRC @HathiTrust
Experimental Environment and Results
• Dataset
2,592,210 volumes, in total 2.1 TB, divided into 1024
partitions of 2GB each

• Computation platform
XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2
TB memory. Each core processes one partition

• Results
Whole corpus word count completed in 1,454
seconds or 24.23 minutes
#HTRC @HathiTrust
Computation Time Distribution

#HTRC @HathiTrust
Word Frequency Distribution

#HTRC @HathiTrust
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University

GENDER IDENTIFICATION OF HTRC
AUTHORS BY NAMES
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
Gender Identification of Text
• Can we use author names in bibliographic records to
identify gender?
• 2.6 million bibliographic records
– Extracted personal author data
– Marc 100 abcd and 700 abcd

• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on Standard cataloging practice
– Last name, first name middle name, titles/honorifics,
dates
#HTRC @HathiTrust
Authors vs Names
• Methuen, Algernon Methuen Marshall, Sir bart., 18561924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 18561924
• Methuen, Algernon, 1856-1924
#HTRC @HathiTrust
Sources of Data
• The Virtual International Authority File
– Hosted by OCLC

• Harvested names from multiple data sources
– Census bureau
– Baby name sites

• EU Patent Research names list (Frietsch et al, 2009; Naldi
et al. 2005)

– Developed an extensive list of European names

• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc
– Lady, Mrs. Miss, Countess, Duchess, Sister, etc
#HTRC @HathiTrust
Initial Gender Results
• Approximately 80% of name strings have initial
gender identification
– Female
• 59,365
• 10%

– Male
• 425,994
• 70%

– Unknown
• 114,204
• 19%

– Ambiguous
• 5,965
• Less than 1%
#HTRC @HathiTrust
Results by Data Source
Against the whole set of name strings
• VIAF
– 19% hit rate

• Web Names
– 54% hit rate

• Patents Names
– 8%

#HTRC @HathiTrust
• For details http://www.hathitrust.org/htrc/faq
• General contact info
– Beth Plale, Director HTRC, plale@indiana.edu
• Requests for capability, interest
– Miao Chen, Asst. Director HTRC
miaochen@indiana.edu
#HTRC @HathiTrust

More Related Content

What's hot

New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
Stefan Schmunk
 
Text mining
Text miningText mining
Text mining
Ali A Jalil
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
Bhawi247
 
co:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatosco:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatos
ICARUS - International Centre for Archival Research
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and Semantics
Seth Grimes
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and Techniques
Bernhard Haslhofer
 
Washington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of HoustonWashington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of Houston
National Information Standards Organization (NISO)
 
Data Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural ProgramData Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural Program
Globus
 
Open hpi semweb-06-part4
Open hpi semweb-06-part4Open hpi semweb-06-part4
Open hpi semweb-06-part4
Nadine Ludwig
 
Webmining (1)
Webmining (1)Webmining (1)
Webmining (1)
Santosh Gulivindala
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked Data
Filip Ilievski
 
Health Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by GlobusHealth Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by Globus
Globus
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Filip Ilievski
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
Laura Po
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
National Information Standards Organization (NISO)
 
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
Micah Altman
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Mauro Dragoni
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collections
lljohnston
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Joel Azzopardi
 

What's hot (20)

New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
New tasks, new roles: Libraries in the tension between Digital Humanities, Re...
 
Text mining
Text miningText mining
Text mining
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
 
co:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatosco:op-READ-Convention Marburg - Basilis Gatos
co:op-READ-Convention Marburg - Basilis Gatos
 
Introduction to Text Mining and Semantics
Introduction to Text Mining and SemanticsIntroduction to Text Mining and Semantics
Introduction to Text Mining and Semantics
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and Techniques
 
Washington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of HoustonWashington Linked Data Authority Service at University of Houston
Washington Linked Data Authority Service at University of Houston
 
Data Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural ProgramData Sharing via Globus in the NIH Intramural Program
Data Sharing via Globus in the NIH Intramural Program
 
Open hpi semweb-06-part4
Open hpi semweb-06-part4Open hpi semweb-06-part4
Open hpi semweb-06-part4
 
Webmining (1)
Webmining (1)Webmining (1)
Webmining (1)
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked Data
 
Health Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by GlobusHealth Sciences Research Informatics, Powered by Globus
Health Sciences Research Informatics, Powered by Globus
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
WORLDMAP: A SPATIAL INFRASTRUCTURE TO SUPPORT TEACHING AND RESEARCH (BROWN BA...
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUD
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collections
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
 

Similar to HathiTrust Reserach Center Nov2013

Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014
Beth Plale
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational Services
Robert H. McDonald
 
Building a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryBuilding a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital Library
Robert H. McDonald
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?
Brian Vetruba
 
HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14
Robert H. McDonald
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Beth Plale
 
Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"
National Information Standards Organization (NISO)
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for Libraries
Thomas King
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
National Information Standards Organization (NISO)
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
Globus
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
Beth Plale
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
Robert H. McDonald
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...
Hazel Hall
 
The Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South AfricaThe Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South Africa
Crossref
 
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Beth Plale
 
C N I20080404
C N I20080404C N I20080404
C N I20080404
Anita de Waard
 
Torsten Reimer
Torsten ReimerTorsten Reimer
Torsten Reimer
Anita de Waard
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
petrknoth
 
23 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 201723 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 2017
ARDC
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and Demo
Robert H. McDonald
 

Similar to HathiTrust Reserach Center Nov2013 (20)

Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational Services
 
Building a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryBuilding a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital Library
 
HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?HathiTrust--a GovDocs Repository?
HathiTrust--a GovDocs Repository?
 
HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
 
Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"Walsh "Text Data Mining with HTRC"
Walsh "Text Data Mining with HTRC"
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for Libraries
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...
 
The Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South AfricaThe Reach of Crossref metadata - Crossref LIVE South Africa
The Reach of Crossref metadata - Crossref LIVE South Africa
 
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
 
C N I20080404
C N I20080404C N I20080404
C N I20080404
 
Torsten Reimer
Torsten ReimerTorsten Reimer
Torsten Reimer
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
 
23 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 201723 things for Research Data - LIBER webinar 23 Feb 2017
23 things for Research Data - LIBER webinar 23 Feb 2017
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and Demo
 

More from Beth Plale

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open Science
Beth Plale
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science research
Beth Plale
 
Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science
Beth Plale
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Beth Plale
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
Beth Plale
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail Science
Beth Plale
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for science
Beth Plale
 

More from Beth Plale (7)

Trustworthy AI and Open Science
Trustworthy AI and Open ScienceTrustworthy AI and Open Science
Trustworthy AI and Open Science
 
Open science as roadmap to better data science research
Open science as roadmap to better data science researchOpen science as roadmap to better data science research
Open science as roadmap to better data science research
 
Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science Capsule Computing: Safe Open Science
Capsule Computing: Safe Open Science
 
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID TestbedTowards FAIR Open Science with PID Kernel Information: RPID Testbed
Towards FAIR Open Science with PID Kernel Information: RPID Testbed
 
Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
 
Trust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail ScienceTrust threads: Provenance for Data Reuse in Long Tail Science
Trust threads: Provenance for Data Reuse in Long Tail Science
 
Big data and open access: a collision course for science
Big data and open access: a collision course for scienceBig data and open access: a collision course for science
Big data and open access: a collision course for science
 

Recently uploaded

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 

Recently uploaded (20)

Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 

HathiTrust Reserach Center Nov2013

  • 1. HATHI TRUST RESEARCH CENTER Opportunities and Challenges of Text Mining HathiTrust Digital Library Koninklijke Bibiotheek | 15.Nov.13 Beth Plale – @bplale Professor, School of Informatics and Computing Director, Data To Insight Center Indiana University Tweet us - @HathiTrust #HTRC
  • 3. HathiTrust Digital Library • HathiTrust is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world. – Founding members of HathiTrust along with University of Michigan are Indiana University, University of California, and University of Virginia http://www.hathitrust.org http://www.hathitrust.org/htrc #HTRC @HathiTrust
  • 4.  HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpi through computational tools, and time-based analysis Restricted nature of HT content suggests need for new forms of access that preserve intimate nature of research investigation while honoring restrictions  Paradigm: computation takes place close to the data #HTRC @HathiTrust
  • 5. Mission of HT Research Center • Research arm of HathiTrust • Goal: enable researchers world-wide to carry out computational investigation of HT repository through – Develop model for access: the ‘workset’ – Develop tools that facilitate research by digital humanities and informatics communities – Develop secure cyberinfrastructure that allows computational investigation of entire copyrighted and public domain HathiTrust repository • Established: July, 2011 • Collaborative effort of Indiana University and University of Illinois #HTRC @HathiTrust
  • 6. HTRC Request Spatial plots All the complexity Statistical plots Complexity hiding interface Tabular info
  • 9. HTRC Timeline • Phase I: development 01 Jul 2011 – 31 Mar 2013 – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/ • Phase II: outreach, 01 Apr 2013 - present – 2nd HTRC UnCamp Sep ‘13 Attendees of UnCamp’13 #HTRC @HathiTrust
  • 10. • • • • • • • • Philosophy: computation moves to data Web services (REST) architecture and protocols WS02 Registry for worksets and results Solr Indexes: full text, MARC, and new metadata noSQL (Cassandra) store as volume store Authentication using WS02 Identity Server Portal front-end, programmatic access Mining tools: currently SEASR #HTRC @HathiTrust
  • 11. Portal Secure Capsule Instance Manager Secure Capsule Cluster Sigiri job deployment Hadoop Cluster (MapReduce/ HDFS) IU compute resources Agent instance Agent Registry Services, worksets Agent instance instance SEASR analytics service Solr index Meandre Orchestration SEASR service execution HTRC Data API v0.1 Secure Capsule Service User session management Blacklight Identity Server ssh client Volume store Volume store (Cassandra) Volume store (Cassandra) (Cassandra) rsync HathiTrust corpus Page/volume tree (file system) University of Michigan
  • 12. HTRC’s guiding principle to computational access • No computational action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from the HT repository to reassemble pages from collection for reading • Definition disallows collusion between users, or accumulation of material over time. • Defining “sufficient information”: research has shown need to interact directly with select texts. How much of a text to show? Google withholds from showing to reader every 10th page of a book (Int’l NYTimes Nov 16-17, 2013) #HTRC @HathiTrust
  • 13. HTRC Secure Capsule Architecture Researcher requests new VM VM Image Store VM Image Manager VM Manager VM Image Builder Secure Capsule: VM controls I/O instance behind scenes Researcher SSH Research install tools onto VM through window on her desktop Research results Registry Services, worksets Final location of results is registry
  • 14. Extensions to HT Use 1. Enhanced metadata Miao Chen, i-school, Indiana Univ 2. Large scale text mining using HPC compute resources Guangchen Ruan, computer science, Indiana Univ 3. Enhanced metadata to include gender author information Stacy Kowalczyk, library science, Dominican Univ #HTRC @HathiTrust
  • 15. Miao Chen, PhD, Indiana University METADATA ENHANCEMENT: PRELIMINARY STUDY
  • 16. Metadata Enhancement • Preliminary study of limits of HT Repository metadata • Current metadata fields are MARC-based – E.g. publication date, authors, title, subject • MARC fields are fundamental • Needed: more fields of users’ interest for granular analytics (Metadata Enhancement) • Solicit user requirements and prioritize for implementation Thanks to Miao Chen #HTRC @HathiTrust
  • 17. Top Metadata Enhancement Items • 1st round user survey, top requested items – Word frequency count and document length. At volume level ✔ – Author gender identification ✔ – Metadata de-duplication – Word frequency count at page level – Word frequency count for full 10.8 M volume repository #HTRC @HathiTrust
  • 18. Other Metadata Enhancement Items • • • • • • • • Stats analysis: TF-IDF Readability score Language identification Topic modeling (e.g. LDA probability) Genre Era of compilation Book length (e.g. short or long) Concordance index (indexing with context) #HTRC @HathiTrust
  • 19. Guangchen Ruan, PhD candidate, IU LARGE SCALE DATA ANALYSIS ON XSEDE
  • 20. Study Objective • Study how large scale text analysis of the HT repository can be carried out using public compute resources • Task: compute word frequency over entire public domain portion of HT corpus • Uses Blacklight, an HPC machine at Pittsburgh supercomputer center #HTRC @HathiTrust
  • 21. Experimental Environment and Results • Dataset 2,592,210 volumes, in total 2.1 TB, divided into 1024 partitions of 2GB each • Computation platform XSEDE Blacklight, 1024-core, each 2.27 GHz, 8.2 TB memory. Each core processes one partition • Results Whole corpus word count completed in 1,454 seconds or 24.23 minutes #HTRC @HathiTrust
  • 24. Stacy Kowalczyk, Asst. Professor, Dominican University Zong Peng, HTRC, Indiana University GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
  • 25. Gender Identification of Text • Can we use author names in bibliographic records to identify gender? • 2.6 million bibliographic records – Extracted personal author data – Marc 100 abcd and 700 abcd • 606,437 unique personal author strings • Bibliographic data is not fielded like patent names • Relying on Standard cataloging practice – Last name, first name middle name, titles/honorifics, dates #HTRC @HathiTrust
  • 26. Authors vs Names • Methuen, Algernon Methuen Marshall, Sir bart., 18561924 • Methuem, Algernon • Methuen Algernon • Methuen Marshall, Sir, bart., 1856• Methuen, A. Sir, 1856-1924 • Methuen, A. Sir, bart., 1856-1924 • Methuen Marshall, Sir bart 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, 1856-1924 • Methuen, Algernon Methuen Marshall, Sir, bart., 18561924 • Methuen, Algernon, 1856-1924 #HTRC @HathiTrust
  • 27. Sources of Data • The Virtual International Authority File – Hosted by OCLC • Harvested names from multiple data sources – Census bureau – Baby name sites • EU Patent Research names list (Frietsch et al, 2009; Naldi et al. 2005) – Developed an extensive list of European names • Titles and honorifics – Multiple web resources – Sir, Baron, Count, Duke, Father, Cardinal, etc – Lady, Mrs. Miss, Countess, Duchess, Sister, etc #HTRC @HathiTrust
  • 28. Initial Gender Results • Approximately 80% of name strings have initial gender identification – Female • 59,365 • 10% – Male • 425,994 • 70% – Unknown • 114,204 • 19% – Ambiguous • 5,965 • Less than 1% #HTRC @HathiTrust
  • 29. Results by Data Source Against the whole set of name strings • VIAF – 19% hit rate • Web Names – 54% hit rate • Patents Names – 8% #HTRC @HathiTrust
  • 30. • For details http://www.hathitrust.org/htrc/faq • General contact info – Beth Plale, Director HTRC, plale@indiana.edu • Requests for capability, interest – Miao Chen, Asst. Director HTRC miaochen@indiana.edu #HTRC @HathiTrust

Editor's Notes

  1. Need to add slide that combines images of the blacklight view and the portal showing a workset. Building a query needs to go after architecture diagram.
  2. HTRC hides complexity of analytics. In this sense, it is like Google search, which is a simple interface that hides complexity to search billions of pages. The kinds of things returned from HTRC interaction are spatial relationship of words (and their frequency obviously), statistical plots of information or tabular information.
  3. Shifting the complexity hiding interface to the right, we open up the cloud to see what’s inside. HTRC at it simplest has 1) algorithms – these are drawn from SEASR and from other analysis tool suites including Mahout and mapreduce, the 2) HT corpus (and subsets of the corpus that users either have personally as part of a workset, or are publically available, and 3) other data sets that are used. HTRC brokers the bringing together of these pieces so that computation can take place on a resource like Big Red II (or XSEDE). Note that there is an arrow from the compute engine to the complexity hiding interface. This is because researcher interaction with the texts isn’t an automated workflow; it is one requiring levels of interaction with the computation as it is running.
  4. Workset is created using this page then gets connected with algorithms, other resources
  5. Add WS02 LogAdd logo brii and quarry
  6. tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.[1] One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.[1]
  7. In the experiment we use 1024 cores and each core processes the same amount of data, i.e. 2T/1024, the histogram distribution (using 100-bin) shows the time consumed by processes. So the y-axis indicate the # of processes that finish the processing within the time indicated by the x-axis. From the distribution we can see that the variance is small, most processes are able to finish within 15-16 mins.