ChemXSeer:
Digital library tools, features,
and crawling characteristics
Edward A. Fox
Professor, Computer Science, Virginia Tech
Blacksburg, VA 24061 USA
fox@vt.edu http://fox.cs.vt.edu
and

Sagnik Ray Choudhury
Ph.D. Student, College of Information Science and
Technology, Penn State, USA
szr163@ist.psu.edu
13 Jan. 2014 -- QU Library, Doha, Qatar

Outline
• Acknowledgments
• Introduction
• ELISQ
• Technology

Sponsored by Qatar University Library

HTTP://qnl.qa

HTTP://WWW.QU.EDU.QA/

Funding provided through the ELISQ project:
Electronic Library Institute - SeerQ

HTTP://WWW.VT.EDU/

HTTP://WWW.PSU.EDU/

HTTP://WWW.TAMU.EDU/

Acknowledgments
• Dr. Mazen Hasna, VP and Chief Academic Officer,
Qatar University
• Dr. Rashid Alammari, Dean, College of Engineering,
Qatar University
• Dr. Moumen Hasnah, Director of Academic Research,
Qatar University
• Dr. Imad Bachir, Qatar University Library Director
• Prof. Sebti Foufou, Head of Department of Computer
Science and Engineering, Qatar University
• Prof. Ramazan Kahraman, Head of the Department of
Chemical Engineering, Qatar University
Additional Thanks
QScience – providing collection:
Christopher J. Leonard, Editorial Director
Paul Coyne, CTO
US National Science Foundation
(recent and current grants to Fox):
• IIS-1319578
• IIS-0916733
• DUE-0840719
• OCI-1032677
• plus those to PSU, TAMU
Outline
• Acknowledgments
• Introduction
• ELISQ
• Technology

Introduction
• Digital libraries have emerged since 1991.
• Now each major publisher has its own
digital library; many others exist too.
• Related systems include:
• Institutional repositories, e.g., at QU
• Content & courseware management systems

• Research and development funding of
hundreds of millions of dollars has led to
powerful tailored systems, such as for
chemical information.
Information Life Cycle
(Cycle diagram; stage labels: Authoring, Modifying, Using, Creating, Retention / Mining, Accessing, Filtering, Organizing, Indexing, Storing, Retrieving, Distributing, Networking)
Infrastructure Services
• Repository-Building
– Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
– Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
• Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
• Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing

Outline
• Acknowledgments
• Introduction
• ELISQ
• Technology

ELISQ – Electronic Library Institute –
SeerQ – Project Team
Qatar University, Qatar:
Mohammed Samaka (Ph.D., Co-Lead PI)
Sumaya Ali S A Al-Maadeed (Ph.D., PI)
Myrna Tabet
Asad Nafees
Tahseena Moideen
Qatar National Library, Qatar:
Claudia Lux (PI)
Krishna RoyChowdhury
Postdoc - TBA

Virginia Tech, USA:
Edward Fox (Ph.D., Lead-PI)
Tarek Kanan

Penn State University, USA:
C. Lee Giles (Ph.D., PI)
Sagnik Ray Choudhury

Texas A&M, USA:
Richard Furuta (Ph.D., PI)
Hamed Alhoori

Consultants:
John Impagliazzo (Ph.D., Key Investigator)
Susan Lukesh (Ph.D.)
Carole Thompson
This project was made possible by NPRP Grant # 4-029-1-007 from
the Qatar National Research Fund (a member of Qatar Foundation).
ELISQ Project (1 of 2)
Project Objectives/Aims
A. Research and prototype digital library systems and
infrastructure for Qatar, focusing initially on Qatari
information related to government and scholarly
activities.
Leverage the crawling engine from Penn State's SeerSuite software
infrastructure, and extend it beyond its current focus on English to
support Arabic-English collections, and to cover a broad range of
scholarly disciplines, and all types of government information.

ELISQ Project (2 of 2)
Project Objectives/Aims (continued)
B. Research and build the digital library community in
Qatar, supporting digital library use, services,
collection development, tailored systems, and
advancing toward a Knowledge Society.
Study scholarly activities, and engage in community building in
Qatar, so DLs can be tailored to specific domains and to the unique
needs of Qatar. Through workshops, a consulting center at the
proposed Institute, and collaborative efforts with libraries and
museums in Qatar, we will identify particular needs and uses, and
tailor collections, systems, and services, to lead toward the Qatari
Knowledge Society.
Outline
• Acknowledgments
• Introduction
• ELISQ
• Technology

Crawler (Heritrix)
(for search engines & Web archives)
• A Web crawler starts with a list of URLs to visit,
called the seeds.
• On those pages, it identifies all the hyperlinks,
• adds them to the list of URLs to visit,
• recursively visits the pages pointed to,
• according to a set of policies.
• It prioritizes its downloads, since some pages change often.
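The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Heritrix is implemented: the `crawl` function, its `fetch` and `priority` parameters, the host-based scope policy, and the toy `PAGES` dictionary are all hypothetical names introduced here so the example runs without network access.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed_hosts, max_pages=100, priority=lambda url: 0):
    """Visit pages starting from the seeds; return URLs in visit order.

    fetch(url) -> HTML string or None (hypothetical; a real crawler would
    issue HTTP requests and honour robots.txt and politeness delays).
    priority(url) -> number; lower values are downloaded first.
    """
    frontier = []  # heap of (priority, insertion order, url)
    seen = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (priority(url), counter, url))
            counter += 1
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Scope policy: stay on the allowed hosts, skip known URLs.
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(link), counter, link))
                counter += 1
    return visited


# Toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="http://other.test/">X</a>',
    "http://example.org/b": '<a href="/">home</a>',
}
order = crawl(["http://example.org/"], PAGES.get, {"example.org"})
```

The priority queue is where a download policy would plug in: passing a `priority` function that returns smaller numbers for pages known to change often moves them to the front of the frontier.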
Selected SeerSuite Instantiations
• CiteSeerX
• http://citeseerx.ist.psu.edu
• A scientific literature digital library and search engine

• ChemXSeer

• http://chemxseer.ist.psu.edu
• Portal for researchers in environmental chemistry
integrating the scientific literature with experimental,
analytical, and simulation results and tools

• ArchSeer
• http://archseer.ist.psu.edu/
• Archeology literature

• TableSeer
CiteSeerX

http://citeseerx.ist.psu.edu

• CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in
computer science

• Converts PDF to text
• Automatically extracts OAI metadata and other data
• Automatic citation indexing, links to cited documents, creation of
document page, author disambiguation
• Software open source – can be used to build other such tools
• 3 M documents
• Millions of files
• 60 M citations
• 3 to 6 M authors
• 2 to 4 M hits per day
• 100K documents added monthly
• 800K individual users
• Several Tbytes
SeerSuite
• Tool kit used to build search engines and digital libraries
• CiteSeerX , MyCiteSeerX , ChemXSeer, ArchSeer, AlgoSeer,
AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer,
SeerSeer, YouSeer, etc.
• Built on commercial grade open source tools (Solr/Lucene)
• Penn State expertise – automated specialized metadata
extraction
• Supports research in
• Indexing and search
• Data mining & structures
• Information and knowledge extraction
• Social networks: Name/entity disambiguation
• Scientometrics/informetrics
• Systems engineering
• User interface design (HCI = human-computer interaction)
• Software engineering and management
SeerSuite is not Google
• Metadata (as in library catalogs) as well as content
• Sets of collections, rather than the Web as a whole
• Provided by a curator (e.g., publisher, museum)
• Provided by user submissions
• Or collected by focused ‘crawling’

• Tailored services, rather than the same for everyone
• Browsing using categories, preserving, adding value
• Based on studying user requirements, e.g., chemists

• Working with entities, rather than just words
• Citations, tables, figures, names, chemical formulae
• Using knowledge bases, machine learning, artificial intelligence
Questions for Us?
• http://elisq.qu.edu.qa/

• fox@vt.edu
• http://fox.cs.vt.edu

Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh Kataria, Ying
Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang, James Kubicki,
Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan, Juan Pablo Ramirez
Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences and
Technology
Pennsylvania State University, University Park, PA, USA

Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current Support: Dow Chemical

http://chemxseer.ist.psu.edu
Talk Overview
● Challenges and Motivation
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Figure Search
– Expertise Search
– Chemical Name and Formula Tagging
– Chemical Name and Formula Search
● Summary
Based on cyberinfrastructure
for CiteSeerX
Built on Solr/Lucene,
SeerSuite, other OSS
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or statistical
data in scientific documents.
• Existing search engines treat tabular data as regular text
– Structural information and semantics not preserved.
– We automatically identify tables and extract table metadata in XML.
Table Metadata Representation:
• Environment metadata: (document specifics: type, title,…)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (Caption, footnote, …)
• Layout metadata: (number of rows, columns, headers,…)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu et al., AAAI 2007, JCDL 2007.
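To make the six metadata categories concrete, here is a minimal Python sketch of how one extracted table might be held in memory and written out as XML. The class name `TableMetadata`, its field names, the XML element names, and the sample values are all assumptions made for illustration; they are not the actual TableSeer schema or data.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class TableMetadata:
    # Field names below are illustrative, not the TableSeer schema.
    environment: dict = field(default_factory=dict)  # document type, title, ...
    frame: dict = field(default_factory=dict)        # borders: left, right, top, bottom
    affiliated: dict = field(default_factory=dict)   # caption, footnote, ...
    layout: dict = field(default_factory=dict)       # rows, columns, headers, ...
    cells: list = field(default_factory=list)        # cell values, row by row
    cell_type: str = "hybrid"                        # numeric, symbolic, hybrid, ...

    def to_xml(self) -> str:
        """Serialize the metadata to a single XML string."""
        root = ET.Element("table")
        for section in ("environment", "frame", "affiliated", "layout"):
            node = ET.SubElement(root, section)
            for key, value in getattr(self, section).items():
                ET.SubElement(node, key).text = str(value)
        content = ET.SubElement(root, "cells", {"type": self.cell_type})
        for row in self.cells:
            row_node = ET.SubElement(content, "row")
            for value in row:
                ET.SubElement(row_node, "cell").text = str(value)
        return ET.tostring(root, encoding="unicode")


# Made-up sample values, for illustration only.
meta = TableMetadata(
    environment={"title": "Example paper"},
    frame={"top": "true", "bottom": "true"},
    affiliated={"caption": "Table 1. Measured pH values"},
    layout={"rows": 2, "columns": 2},
    cells=[["sample", "pH"], ["A", 6.8]],
    cell_type="hybrid",
)
xml_text = meta.to_xml()
```

Keeping the caption, layout, and cell values in separate fields is what lets a search engine match a query against, say, captions only, which is not possible once a table has been flattened into running text.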
Sample Table Metadata Extracted File
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction and Search
• Numerical data in scientific publications are often found in figures.
• No search engine allows searching on figures and their data in chemical documents.
Tools that automate data extraction from figures and allow search on them offer the following benefits:
• Increase our understanding of the key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific experimental results.
X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
Our Contribution
ChemXSeer Name and Formula Extraction and Search
• Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
– Many chemical molecules are created every day, so any dictionary-based name recognizer will eventually fail.
– Names need to be segmented to get semantically meaningful sub-terms such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.
• Identifying formulae is hard:
– “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
– “… such as hydroxyl radical OH, superoxide O2- …” (formula)
• For searching, formulae cannot be treated as text.
– Domain knowledge (formula identification)
– Structural knowledge (substructure finding and search)

B. Sun et al., WWW 2007, WWW 2008, TOIS
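One way to see why formulae cannot be treated as text is to parse them into element counts and match on those counts instead of on characters. The sketch below is an assumption-laden illustration, not the ChemXSeer method: `parse_formula`, `contains`, and the small `ELEMENTS` set are hypothetical, and the parser handles only flat formulae without parentheses or charges.

```python
import re
from collections import Counter

# A small subset of element symbols, enough for the examples here.
ELEMENTS = {"H", "C", "N", "O", "S", "P", "Cl", "Na", "Fe", "Cu"}
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(text):
    """Return element counts for a flat formula such as 'CH3COOH', or None."""
    counts = Counter()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match or match.group(1) not in ELEMENTS:
            return None  # not a formula over the known symbols
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    return counts if counts else None


def contains(formula, query):
    """True if `formula` has at least the atoms of the partial formula `query`."""
    f, q = parse_formula(formula), parse_formula(query)
    if f is None or q is None:
        return False
    return all(f[element] >= n for element, n in q.items())
```

With this representation, `contains("CH3COOH", "COOH")` matches on atom counts even where a plain substring test would be sensitive to how the formula happens to be written. Note that `parse_formula("OH")` succeeds for both the Ohio abbreviation and the hydroxyl radical, so syntax alone cannot separate the two slide examples; the surrounding context is still needed for identification.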
Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Examples
• "... Determination of lactic acid and ..."
• "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
● Formula tagging
– Each formula is a single term
– Example
• "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
• "... YSI 5301, Yellow Springs, OH, USA ..."
● Tagging examples
– Name tagging: "... of <name-type>lactic acid</name-type> and ..."
– Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
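The tagged output format shown in the examples can be produced by a small helper once the entity spans are known. The following Python sketch assumes the spans come from some upstream recognizer that is not shown; the function `tag_entities` and the sample sentence (stitched together from the slide fragments) are hypothetical.

```python
def tag_entities(text, spans):
    """Wrap each (start, end, kind) span in <kind-type>...</kind-type> tags.

    Spans are assumed to be non-overlapping character offsets produced by
    some upstream recognizer (not shown here).
    """
    out = []
    last = 0
    for start, end, kind in sorted(spans):
        out.append(text[last:start])
        out.append(f"<{kind}-type>{text[start:end]}</{kind}-type>")
        last = end
    out.append(text[last:])
    return "".join(out)


# Spans are supplied by hand here; a real tagger would predict them.
sentence = "of lactic acid and radical OH, superoxide"
name_start = sentence.index("lactic acid")
formula_start = sentence.index("OH")
spans = [
    (name_start, name_start + len("lactic acid"), "name"),
    (formula_start, formula_start + len("OH"), "formula"),
]
tagged = tag_entities(sentence, spans)
```

A name span may cover several words while a formula span covers a single term, which matches the distinction drawn above between name tagging and formula tagging.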
Online Chemical Entity Tagger
● We have an open source chemical name and formula tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.
Online Chemical Entity Tagger: Chemical Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical names (loss of precision).
● High recall (most chemical names are identified).
Online Chemical Entity Tagger: Chemical Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (words identified as formulas are actual formulas).
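Precision and recall, as used in the two examples above, can be computed from the set of predicted entities and the set of true entities. The Python sketch below uses made-up entity sets chosen only to mirror the two situations described; they are not results from the actual tagger.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted entities against the true ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only. Name tagger: finds every name but also tags a formula.
name_gold = {"lactic acid", "promecarb", "ethanol"}
name_pred = {"lactic acid", "promecarb", "ethanol", "OH"}
name_scores = precision_recall(name_pred, name_gold)

# Formula tagger: everything it tags is right, but it misses half.
formula_gold = {"OH", "O2-", "H2O", "CO2"}
formula_pred = {"OH", "H2O"}
formula_scores = precision_recall(formula_pred, formula_gold)
```

In this toy data the name tagger scores precision 0.75 and recall 1.0, while the formula tagger scores precision 1.0 and recall 0.5, which is the trade-off the two slides describe.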
Chemical Name Indexing and Search
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a large index.
– “but” in “butane” is a morpheme, but not in “nembutal”.
● Segmentation-based index scheme
– Used for indexing chemical names.
– First segment a chemical name hierarchically, and then index the substring at each node if it is frequent.
– Example: acetaldoxime->aldoxime->oxime.
– A search for oxime returns all of them, depending on the ranking function.
– This cannot be done in usual text search.
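The segmentation-based scheme can be illustrated with a small inverted index. In this Python sketch the segmenter itself is not implemented: `SEGMENTS` is a hand-written, hypothetical stand-in for its output, and `build_index` with its `min_freq` threshold is an assumed simplification of "index substrings at each node if frequent". Ranking is left out.

```python
from collections import Counter, defaultdict

# Hypothetical output of a hierarchical segmenter: for each name, the
# substrings found at the nodes of its segmentation tree.
SEGMENTS = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime", "acet"],
    "aldoxime": ["aldoxime", "oxime", "ald"],
    "oxime": ["oxime"],
    "butane": ["butane", "but", "ane"],
    "nembutal": ["nembutal"],
}


def build_index(segments, min_freq=2):
    """Index each name under itself and under its frequent sub-terms."""
    # Count in how many names each sub-term occurs.
    freq = Counter(s for subs in segments.values() for s in set(subs))
    index = defaultdict(set)
    for name, subs in segments.items():
        for s in subs:
            if s == name or freq[s] >= min_freq:
                index[s].add(name)
    return index


index = build_index(SEGMENTS)
```

Looking up `index["oxime"]` returns acetaldoxime, aldoxime, and oxime, while "nembutal" is never indexed under "but" because the segmenter does not produce that sub-term for it. Indexing only frequent node substrings keeps the index far smaller than indexing every subsequence.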
Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeer)
Built on top of millions of papers in CiteSeerX.
A similar system was developed for Dow Chemical.
Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
Finds an expert based on their publications.
Many approaches: keyphrases, citations, download count, affiliation.
Treeratpituk, Chen, JCDL’13
Future Work
Lots of interesting work to do! Few computer/machine learning scientists involved.
• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
– spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features

Qatar Digital Library Project Workshop
 
Sgci ecss symposium-12-20-16
Sgci ecss symposium-12-20-16Sgci ecss symposium-12-20-16
Sgci ecss symposium-12-20-16
 

20140113 q uchemxseerseminar

  • 1. ChemXSeer: Digital library tools, features, and crawling characteristics Edward A. Fox Professor, Computer Science, Virginia Tech Blacksburg, VA 24061 USA fox@vt.edu http://fox.cs.vt.edu and Sagnik Ray Choudhury Ph.D. Student, College of Information Science and Technology, Penn State, USA szr163@ist.psu.edu 13 Jan. 2014 -- QU Library, Doha, Qatar 1
  • 2. Outline • Acknowledgments • Introduction • ELISQ • Technology
  • 3. Sponsored by Qatar University Library HTTP://qnl.qa HTTP://WWW.QU.EDU.QA/ Funding provided thru the ELISQ project: Electronic Library Institute - SeerQ HTTP://WWW.VT.EDU/ HTTP://WWW.PSU.EDU/ HTTP://WWW.TAMU.EDU/
  • 4. Acknowledgments • Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University • Dr. Rashid Alammari, Dean, College of Engineering, Qatar University • Dr. Moumen Hasnah, Director of Academic Research, Qatar University • Dr. Imad Bachir, Qatar University Library Director • Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University • Prof. Ramazan Kahraman, Head of the Department of Chemical Engineering, Qatar University
  • 5. Additional Thanks QScience – providing collection: Christopher J. Leonard, Editorial Director Paul Coyne, CTO US National Science Foundation (recent and current grants to Fox): • IIS-1319578 • IIS-0916733 • DUE-0840719 • OCI-1032677 • plus those to PSU, TAMU
  • 6. Outline • Acknowledgments • Introduction • ELISQ • Technology
  • 7. Introduction • Digital libraries have emerged since 1991. • Now each major publisher has its own digital library; many others exist too. • Related systems include: • Institutional repositories, e.g., at QU • Content & courseware management systems • Research and development funding of hundreds of millions of dollars has led to powerful tailored systems, such as for chemical information.
  • 9. Information Life Cycle Authoring Modifying Using Creating Retention / Mining Accessing Filtering Organizing Indexing Storing Retrieving Distributing Networking
  • 10. Infrastructure Services. Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting. Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format). Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language). Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing.
  • 11. Outline • Acknowledgments • Introduction • ELISQ • Technology
  • 12. ELISQ – Electronic Library Institute – SeerQ – Project Team. Qatar University, Qatar: Mohammed Samaka (Ph.D., Co-Lead PI), Sumaya Ali S A Al-Maadeed (Ph.D., PI), Myrna Tabet, Asad Nafees, Tahseena Moideen. Qatar National Library, Qatar: Claudia Lux (PI), Krishna RoyChowdhury, Postdoc - TBA. Virginia Tech, USA: Edward Fox (Ph.D., Lead-PI), Tarek Kanan. Penn. State University, USA: C. Lee Giles (Ph.D., PI), Sagnik Ray Choudhury. Texas A&M, USA: Richard Furuta (Ph.D., PI), Hamed Alhoori. Consultants: John Impagliazzo (Ph.D., Key Investigator), Susan Lukesh (Ph.D.). Carole Thompson. This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).
  • 13. ELISQ Project (1 of 2) Project Objectives/Aims A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities. Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines, and all types of government information.
  • 14. ELISQ Project (2 of 2) Project Objectives/Aims (continued) B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society. Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.
  • 15. Outline • Acknowledgments • Introduction • ELISQ • Technology
  • 16. Crawler (Heritrix) (for search engines & Web archives) • A Web crawler starts with a list of URLs to visit, called the seeds. • On those pages, it identifies all the hyperlinks, • adds them to the list of URLs to visit, • recursively visits the pages pointed to, • according to a set of policies. • It prioritizes its downloads, since some pages change often.
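The crawl loop described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Heritrix code (Heritrix is a separate Java system): the function names `crawl`, `fetch`, and `allowed`, the in-memory `PAGES` dictionary, and the example.org URLs are all hypothetical, and the frontier is a plain first-in-first-out queue, so the prioritization mentioned on the slide is not modeled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch(url) returns the page HTML or None; allowed(url) is the crawl
    policy (here just a yes/no filter). Both are supplied by the caller.
    """
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical three-page site held in memory, so no network access is needed.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="http://other.example/x">X</a>',
    "http://example.org/b": '<a href="/">home</a>',
}

visited = crawl(
    ["http://example.org/"],
    PAGES.get,
    lambda url: url.startswith("http://example.org/"),
)
```

A focused crawler would replace the `allowed` filter with a topical relevance test, and a production crawler would also replace the FIFO queue with a priority queue and add politeness delays.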
  • 17. Selected SeerSuite Instantiations • CiteSeerX • http://citeseerx.ist.psu.edu • A scientific literature digital library and search engine • ChemXSeer • http://chemxseer.ist.psu.edu • Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools • ArchSeer • http://archseer.ist.psu.edu/ • Archeology literature • TableSeer
  • 18. CiteSeerX http://citeseerx.ist.psu.edu • CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science • Converts PDF to text • Automatically extracts OAI metadata and other data • Automatic citation indexing, links to cited documents, creation of document page, author disambiguation • Software is open source and can be used to build other such tools • 3 M documents • Millions of files • 60 M citations • 3 to 6 M authors • 2 to 4 M hits per day • 100K documents added monthly • 800K individual users • several Tbytes
  • 21. SeerSuite • Tool kit used to build search engines and digital libraries • CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc. • Built on commercial grade open source tools (Solr/Lucene) • Penn State expertise – automated specialized metadata extraction • Supports research in • Indexing and search • Data mining & structures • Information and knowledge extraction • Social networks: Name/entity disambiguation • Scientometrics/informetrics • Systems engineering • User interface design (HCI = human-computer interaction) • Software engineering and management
  • 22. SeerSuite is not Google • Metadata (as in library catalogs) as well as content • Sets of collections, rather than the Web as a whole • Provided by a curator (e.g., publisher, museum) • Provided by user submissions • Or collected by focused ‘crawling’ • Tailored services, rather than the same for everyone • Browsing using categories, preserving, adding value • Based on studying user requirements, e.g., chemists • Working with entities, rather than just words • Citations, tables, figures, names, chemical formulae • Using knowledge bases, machine learning, artificial intelligence
  • 23. Questions for Us? • http://elisq.qu.edu.qa/ • fox@vt.edu • http://fox.cs.vt.edu
  • 24. Search Engine and Repository for eChemistry C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang, James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan, Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray Choudhury Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences and Technology Pennsylvania State University, University Park, PA, USA Past funding: NSF Cyberinfrastructure Chemistry, Microsoft Current Support: Dow Chemical http://chemxseer.ist.psu.edu
  • 25. Talk Overview ● Challenges and Motivation. ● Functionalities – Fulltext Search – Author Search – Table Search – Figure Search – Expertise Search – Chemical Name and Formula Tagging – Chemical Name and Formula Search ● Summary.
  • 26. Based on cyberinfrastructure for CiteSeerX Built on Solr/Lucene, SeerSuite, other OSS
  • 30. ChemXSeer Table Search • Tables are widely used to present experimental results or statistical data in scientific documents. • Existing search engines treat tabular data as regular text – Structural information and semantics are not preserved. – We automatically identify tables and extract table metadata in XML. Table Metadata Representation: • Environment metadata: (document specifics: type, title, …) • Frame metadata: (border left, right, top, bottom, …) • Affiliated metadata: (caption, footnote, …) • Layout metadata: (number of rows, columns, headers, …) • Cell content metadata: (values in cells) • Type metadata: (numeric, symbolic, hybrid, …) Y. Liu et al., AAAI 2007, JCDL 2007.
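To make the six metadata categories concrete, here is a minimal Python sketch that serializes one small table description to XML. The element names (`environment`, `frame`, `affiliated`, `layout`, `cells`, `type`), the field names inside them, and the sample values are all assumptions chosen for illustration; the actual schema used by ChemXSeer's table extractor is not shown on the slide.

```python
import xml.etree.ElementTree as ET


def table_to_xml(meta):
    """Serialize a {category: {field: value}} mapping to an XML string."""
    root = ET.Element("table")
    for category, fields in meta.items():
        node = ET.SubElement(root, category)
        for key, value in fields.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical metadata for a 2x2 table; one entry per category on the slide.
sample = {
    "environment": {"documentType": "PDF", "title": "Example paper"},
    "frame": {"left": "none", "right": "none", "top": "line", "bottom": "line"},
    "affiliated": {"caption": "Table 1. Measured pH values", "footnote": ""},
    "layout": {"rows": 2, "columns": 2, "headers": "sample,pH"},
    "cells": {"r1c1": "A", "r1c2": "6.8", "r2c1": "B", "r2c2": "7.4"},
    "type": {"kind": "hybrid"},
}

xml_text = table_to_xml(sample)
print(xml_text)
```

Keeping layout and cell content as separate fields is what lets a search engine match on a column header or a cell value instead of on a flat run of text.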
  • 31. Sample Table Metadata Extracted File
  • 32. Sample Table Metadata Extracted File
  • 34. ChemXSeer Figure/Plot Data Extraction and Search Numerical data in scientific publications are often found in figures. No search engine allows searching on figures and their data in chemical documents. Tools that automate the data extraction from figures and allow search on them can provide the following: • Increases our understanding of key concepts of papers. • Provides data for automatic comparative analyses. • Enables regeneration of figures in different contexts. • Enables search for documents with figures containing specific experiment results. X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
  • 36. ChemXSeer Name and Formula Extraction and Search • Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful. • Extraction and search on chemical names is hard: – Many chemical molecules are created every day, so any dictionary-based name recognizer will fail eventually. – Names need to be segmented to get semantically meaningful sub-terms such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”. • Identifying formulae is hard: – “… YSI 5301, Yellow Springs, OH, USA …” (non-formula) – “… such as hydroxyl radical OH, superoxide O2- …” (formula) • For searching, formulae cannot be treated as text. – Domain knowledge (formula identification) – Structural knowledge (substructure finding and search) B. Sun et al., WWW 2007, WWW 2008, TOIS
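The "OH" ambiguity on this slide is easy to reproduce with a naive pattern matcher. The Python sketch below is not the ChemXSeer method; it is a deliberately simple baseline, with a hypothetical and incomplete `ELEMENTS` set, that accepts any token built from known element symbols. It flags "OH" in both example sentences, which shows why the surrounding context has to be taken into account.

```python
import re

# Small illustrative subset of element symbols (an assumption, not a full table).
ELEMENTS = {"H", "C", "N", "O", "S", "Na", "Cl", "Fe"}

# A token shaped like a formula: element-like symbols with optional counts,
# then an optional charge sign.
TOKEN = re.compile(r"^(?:[A-Z][a-z]?\d*)+[+-]?$")
SYMBOL = re.compile(r"([A-Z][a-z]?)\d*")


def looks_like_formula(token):
    """True if the token is made only of symbols from ELEMENTS."""
    if not TOKEN.match(token):
        return False
    return all(symbol in ELEMENTS for symbol in SYMBOL.findall(token))


def formula_candidates(sentence):
    """Return tokens that the naive matcher would call formulae."""
    tokens = (raw.strip(",.;:()") for raw in sentence.split())
    return [token for token in tokens if looks_like_formula(token)]


address = "... YSI 5301, Yellow Springs, OH, USA ..."
chemistry = "... such as hydroxyl radical OH, superoxide O2- ..."

print(formula_candidates(address))    # the state abbreviation is a false positive
print(formula_candidates(chemistry))
```

With a complete periodic table the baseline gets worse, not better: "YSI" would also pass, since Y, S, and I are all element symbols.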
  • 37. Chemical Entity Extraction and Tagging ● Name tagging – Each chemical name can be a phrase – Examples: "... Determination of lactic acid and ..."; "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..." ● Formula tagging – Each formula is a single term – Example: "... such as hydroxyl radical OH, superoxide ..." – Non-formula example: "... YSI 5301, Yellow Springs, OH, USA ..." ● Tagging examples – Name tagging: "... of <name-type>lactic acid</name-type> and ..." – Formula tagging: "... radical <formula-type>OH</formula-type> , superoxide ..."
  • 38. Online Chemical Entity Tagger ● We have an open source chemical name and formula tagger and a web based interface for evaluation. ● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.
  • 39. Online Chemical Entity Tagger: Chemical Name Tagging Example ● Results on a sample PDF. ● Some chemical formulas are erroneously identified as chemical names (loss of precision). ● High recall (most chemical names are identified).
  • 40. Online Chemical Entity Tagger: Chemical Formula Tagging Example ● Results on a sample PDF. ● Some chemical formulas are not identified (loss of recall). ● High precision (words identified as formulas are actual formulas).
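The precision and recall trade-off mentioned on the two tagger slides can be computed as follows. This is a generic sketch: the function name `precision_recall` and the tiny gold and tagged sets are hypothetical and do not come from the tagger's actual evaluation.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical name-tagging outcome: every true name is found (high recall),
# but the formula "OH" is wrongly tagged as a name (lower precision).
gold_names = {"lactic acid", "promecarb"}
tagged_names = {"lactic acid", "promecarb", "OH"}

precision, recall = precision_recall(tagged_names, gold_names)
print(f"precision={precision:.2f} recall={recall:.2f}")
```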
  • 41. Chemical Name Indexing and Search ● Index schemes: – Which tokens to index? – Indexing all subsequences generates a very large index. – “but” is a morpheme in “butane” but not in “nembutal”. ● Segmentation-based index scheme – Used for indexing chemical names. – First segment a chemical name hierarchically, then index the substring at each node if it is frequent. – acetaldoxime -> aldoxime -> oxime. – A search for oxime returns all of them, depending on the ranking function. – This cannot be done in usual text search.
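The segmentation-based index can be illustrated with a toy Python sketch. It makes one large simplification, stated here and in the code: the slide's scheme indexes sub-terms that are frequent in a corpus, whereas this sketch takes a small hand-written vocabulary (`SUBTERMS`, hypothetical) as given. The function names `split_known`, `build_index`, and `search` are also made up for the example.

```python
from collections import defaultdict

# Hypothetical sub-term vocabulary. The slide's method would derive this from
# substring frequencies in a corpus; that step is not implemented here.
SUBTERMS = {"aldoxime", "oxime", "but", "ane"}


def split_known(name, subterms):
    """Recursively split a name at its longest known prefix or suffix and
    return every known sub-term found along the way."""
    for size in range(len(name) - 1, 0, -1):   # longest piece first
        for piece, rest in ((name[:size], name[size:]),
                            (name[-size:], name[:-size])):
            if piece in subterms:
                return ({piece}
                        | split_known(piece, subterms)
                        | ({rest} & subterms)
                        | split_known(rest, subterms))
    return set()


def build_index(names, subterms):
    """Map each index term (full name or known sub-term) to matching names."""
    index = defaultdict(set)
    for name in names:
        for term in {name} | split_known(name, subterms):
            index[term].add(name)
    return index


def search(index, query):
    return set(index.get(query, set()))


names = ["acetaldoxime", "aldoxime", "butane", "nembutal"]
index = build_index(names, SUBTERMS)

print(search(index, "oxime"))  # names reached via acetaldoxime -> aldoxime -> oxime
print(search(index, "but"))    # butane only; nembutal is not split at "but"
```

Because only segments are indexed, a query for "but" does not return "nembutal", which a plain substring index would do.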
  • 43. Expert Recommendation - CiteSeerX http://seerseer.ist.psu.edu (new version CSSeer) Built on top of millions of papers in CiteSeerX. A similar system was developed for Dow Chemical. Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”. Finds an expert based on their publications. Many approaches: keyphrases, citations, download count, affiliation. Treeratpituk, Chen, JCDL’13
  • 44. Future Work Lots of interesting work to do! Few computer/machine learning scientists involved. • Acquisitions - more documents, data, knowledge • Chemical 3D graph search • Fundamental chemical graph representation analysis • Table data storage and access • Figure search and data extraction and access • New data and feature search (spectra, experimental methods, instrumentation) • New documents: 400K PubMed • Semantic chemical graphs • Expert/collaborator search • Search integration of all features