Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saura...
Talk Overview
● Challenges and Motivation.
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Fig...
Based on cyberinfrastructure
for CiteSeerX
Built on Solr/Lucene,
SeerSuite, other OSS
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or
statistical data in scientific document...
Sample Table Metadata Extracted File
Sample Table Metadata Extracted File
<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
...
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction
and Search
Numerical data in
scientific publications
are often found in figures.
No ...
Our Contribution
ChemXSeer Name and Formula
Extraction and Search
• Extraction and search of chemical names and formulae in
scientific docu...
Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Example
● "... Determination ...
Online Chemical Entity Tagger
● We have an open source chemical name and formula
tagger and a web based interface for eval...
Online Chemical Entity Tagger: Chemical
Name Tagging Example
● Results on a sample PDF.
● Some chemical formula erroneousl...
Online Chemical Entity Tagger: Chemical
Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas not id...
Chemical Name Indexing and Search
• Index Schemes:
– Which tokens to index?
– Indexing all subsequences generates a large ...
Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeers)
Built on top of millions of
papers in ...
Future Work
Lots of interesting work to do! Few computer/machine
learning scientists involved.
Acquis...
DEMO
  • The first data mining task is to detect chemical names and formulas from the literature.
    So the task of entity tagging is to find the hidden labels of each term in the text
  • Most of the substrings on the segmentation tree are semantically meaningful.
  • Chemxseer qr-sagnik

    1. Search Engine and Repository for eChemistry
       C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang, James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan, Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray Choudhury
       Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences and Technology
       Pennsylvania State University, University Park, PA, USA
       Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
       Current support: Dow Chemical
       http://chemxseer.ist.psu.edu
    2. Talk Overview
       ● Challenges and Motivation
       ● Functionalities
         – Fulltext Search
         – Author Search
         – Table Search
         – Figure Search
         – Expertise Search
         – Chemical Name and Formula Tagging
         – Chemical Name and Formula Search
       ● Summary
    3. Based on cyberinfrastructure for CiteSeerX. Built on Solr/Lucene, SeerSuite, other OSS
    4. ChemXSeer RSC
    5. ChemXSeer Fulltext Search
    6. ChemXSeer Author Search
    7. ChemXSeer Table Search
       • Tables are widely used to present experimental results or statistical data in scientific documents.
       • Existing search engines treat tabular data as regular text
         – Structural information and semantics are not preserved.
         – We automatically identify tables and extract table metadata in XML.
       Table Metadata Representation:
       • Environment metadata: (document specifics: type, title, …)
       • Frame metadata: (border left, right, top, bottom, …)
       • Affiliated metadata: (caption, footnote, …)
       • Layout metadata: (number of rows, columns, headers, …)
       • Cell content metadata: (values in cells)
       • Type metadata: (numeric, symbolic, hybrid, …)
       Y. Liu et al., AAAI 2007, JCDL 2007.
    8. Sample Table Metadata Extracted File
    9. Sample Table Metadata Extracted File
       <Table>
       <DocumentOrigin>Analyst</DocumentOrigin>
       <DocumentName>b006011i.pdf</DocumentName>
       <Year>2001</Year>
       <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors</DocumentTitle>
       <Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R. Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, CT 06269-3060</Author>
       <TheNumOfCiters></TheNumOfCiters>
       <Citers></Citers>
       <TableCaption>Table 1 Temperature effect on resistance change (D R) and response time of tin oxide thin film with 1% CCl4</TableCaption>
       <TableColumnHeading>D R Temperature/°C D R a / W (R, O2) (%) Response time Reproducibility</TableColumnHeading>
       <TableContent>100 223 5 ~22 min Yes 200 270 9 ~7-8 min Yes 300 1027 21 <20 s Yes 400 993 31 ~10 s No</TableContent>
       <TableFootnote>a D R = (R, CCl4) - (R, O2).</TableFootnote>
       <ColumnNum>5</ColumnNum>
       <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
       <PageNumOfTable>3</PageNumOfTable>
       <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
       </Table>
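
A minimal sketch of how such an extracted metadata file might be loaded for indexing, assuming the file is well-formed XML with the element names shown in the sample. The abbreviated `SAMPLE` string and the helper `load_table_metadata` are illustrative assumptions, not part of ChemXSeer.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-cleaned excerpt of the sample file (an assumption for illustration).
SAMPLE = """<Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <TableCaption>Table 1 Temperature effect on resistance change</TableCaption>
  <ColumnNum>5</ColumnNum>
  <PageNumOfTable>3</PageNumOfTable>
</Table>"""


def load_table_metadata(xml_text):
    """Hypothetical helper: flatten one <Table> record into a dict of fields."""
    root = ET.fromstring(xml_text)
    record = {child.tag: (child.text or "").strip() for child in root}
    # Convert the fields that are numeric in the sample into integers.
    for key in ("Year", "ColumnNum", "PageNumOfTable"):
        if record.get(key, "").isdigit():
            record[key] = int(record[key])
    return record


metadata = load_table_metadata(SAMPLE)
print(metadata["DocumentName"], metadata["ColumnNum"])
```

A flat dictionary like this could then be passed to a search index as one document per table, so that captions and column headings become separately searchable fields.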
    10. ChemXSeer Table Search
    11. ChemXSeer Figure/Plot Data Extraction and Search
        Numerical data in scientific publications are often found in figures. No search engine allows searching on figures and their data in chemical documents. Tools that automate data extraction from figures and allow search on them can provide the following:
        • Increases our understanding of key concepts of papers.
        • Provides data for automatic comparative analyses.
        • Enables regeneration of figures in different contexts.
        • Enables search for documents with figures containing specific experiment results.
        X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
    12. Our Contribution
    13. ChemXSeer Name and Formula Extraction and Search
        • Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful.
        • Extraction and search on chemical names is hard:
          – Many chemical molecules are created every day, so any dictionary-based name recognizer will eventually fail.
          – Names need to be segmented to get semantically meaningful sub-terms such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.
        • Identifying formulae is hard:
          – “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
          – “… such as hydroxyl radical OH, superoxide O2- …” (formula)
        • For searching, formulae cannot be treated as text:
          – Domain knowledge (formula identification)
          – Structural knowledge (substructure finding and search)
        B. Sun et al., WWW 2007, WWW 2008, TOIS
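
To illustrate why formulae cannot be treated as plain text, here is a small sketch that compares formulae by element counts instead of by string equality. It is only an assumption-laden toy: it handles simple formulae without parentheses or charges, it does not check that symbols are real elements, and `element_counts` and `contains` are hypothetical names, not ChemXSeer functions.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C", "H3", "Cl4".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
SIMPLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def element_counts(formula):
    """Return element -> atom count for a simple formula such as 'CH3COOH'."""
    if not SIMPLE_FORMULA.fullmatch(formula):
        raise ValueError(f"not a simple formula: {formula!r}")
    counts = Counter()
    for symbol, digits in TOKEN.findall(formula):
        counts[symbol] += int(digits) if digits else 1
    return counts


def contains(formula, query):
    """True if `formula` has at least the atoms listed in `query` (composition only)."""
    have, want = element_counts(formula), element_counts(query)
    return all(have[element] >= n for element, n in want.items())


# The two strings differ as text but describe the same composition.
print("CH3COOH" == "C2H4O2")
print(element_counts("CH3COOH") == element_counts("C2H4O2"))
```

Composition matching of this kind says nothing about structure; substructure search needs the structural knowledge mentioned on the slide.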
    14. Chemical Entity Extraction and Tagging
        ● Name tagging
          – Each chemical name can be a phrase
          – Example
            ● "... Determination of lactic acid and ..."
            ● "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
        ● Formula tagging
          – Each formula is a single term
          – Example
            ● "... such as hydroxyl radical OH, superoxide ..."
          – Non-formula example
            ● "... YSI 5301, Yellow Springs, OH, USA ..."
        ● Tagging examples
          – Name tagging: "... of <name-type>lactic acid</name-type> and ..."
          – Formula tagging: "... radical <formula-type>OH</formula-type> , superoxide ..."
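
The tagging output format above can be sketched with a toy tagger. This is not the tagger described in the talk: it uses a hand-written context rule (a token sitting between two commas is treated as part of an address) and a tiny hypothetical vocabulary of known formulae, purely to show how the same token "OH" can be tagged in one sentence and left alone in another.

```python
import re

# Candidate tokens made of element-like symbols with optional counts, e.g. "OH", "O2".
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*)+\b")


def between_commas(text, start, end):
    """Toy context rule: ', OH,' looks like an address field, not a formula."""
    return text[:start].rstrip().endswith(",") and text[end:].lstrip().startswith(",")


def tag_formulas(text, vocabulary):
    """Wrap accepted candidates in <formula-type> tags; leave everything else as is."""
    pieces, last = [], 0
    for match in CANDIDATE.finditer(text):
        token = match.group()
        if token not in vocabulary or between_commas(text, match.start(), match.end()):
            continue
        pieces.append(text[last:match.start()])
        pieces.append(f"<formula-type>{token}</formula-type>")
        last = match.end()
    pieces.append(text[last:])
    return "".join(pieces)


KNOWN = {"OH", "O2"}  # hypothetical vocabulary, for illustration only
print(tag_formulas("such as hydroxyl radical OH, superoxide O2-", KNOWN))
print(tag_formulas("YSI 5301, Yellow Springs, OH, USA", KNOWN))
```

A fixed rule like this breaks easily, which is one reason to treat tagging as labeling each term in context.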
    15. Online Chemical Entity Tagger
        ● We have an open source chemical name and formula tagger and a web-based interface for evaluation.
        ● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.
    16. Online Chemical Entity Tagger: Chemical Name Tagging Example
        ● Results on a sample PDF.
        ● Some chemical formulas are erroneously identified as chemical names (loss of precision).
        ● High recall (most chemical names are identified).
    17. Online Chemical Entity Tagger: Chemical Formula Tagging Example
        ● Results on a sample PDF.
        ● Some chemical formulas are not identified (loss of recall).
        ● High precision (words identified as formulas are actual formulas).
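
For readers less familiar with the two measures, a short sketch of how precision and recall are computed from a set of tagged terms. The gold and predicted sets below are invented for illustration and are not results from the talk.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: every true name is found (high recall), but one formula
# is wrongly tagged as a name (loss of precision).
gold_names = {"lactic acid", "promecarb", "tin oxide"}
predicted_names = {"lactic acid", "promecarb", "tin oxide", "CCl4"}

precision, recall = precision_recall(predicted_names, gold_names)
print(precision, recall)
```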
    18. Chemical Name Indexing and Search
        • Index schemes:
          – Which tokens to index?
          – Indexing all subsequences generates a large index.
          – “but” is a morpheme in “butane”, but not in “nembutal”.
        ● Segmentation-based index scheme
          – Used for indexing chemical names.
          – First segment a chemical name hierarchically, then index the substring at each node if it is frequent.
          – acetaldoxime -> aldoxime -> oxime.
          – A search for “oxime” returns all of them, depending on the ranking function.
          – This cannot be done in usual text search.
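
A rough sketch of the segmentation-based index idea, assuming the hierarchical segmentation has already been computed. The `SEGMENTATION` table is hand-written for illustration (the real segmentation step is not shown in the slides), and `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical segmentation output: each name maps to the substrings found at
# the nodes of its segmentation tree. "nembutal" has no "but" node.
SEGMENTATION = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime"],
    "aldoxime": ["aldoxime", "oxime"],
    "butane": ["butane", "but"],
    "nembutal": ["nembutal"],
}


def build_index(segmentation, min_frequency=1):
    """Map each node substring to the names containing it, keeping frequent ones."""
    postings = defaultdict(set)
    for name, subterms in segmentation.items():
        for subterm in subterms:
            postings[subterm].add(name)
    return {term: names for term, names in postings.items() if len(names) >= min_frequency}


def search(index, query):
    """Return the names indexed under the query term, in sorted order."""
    return sorted(index.get(query, ()))


index = build_index(SEGMENTATION)
print(search(index, "oxime"))
print(search(index, "but"))
```

Because only substrings at tree nodes are indexed, a query for "but" reaches "butane" without matching "nembutal", and the index stays far smaller than one over all subsequences.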
    19. Example Formula Search
        http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
    20. Expert Recommendation - CiteSeerX
        http://seerseer.ist.psu.edu (new version CSSeers)
        Built on top of millions of papers in CiteSeerX. A similar system was developed for Dow Chemicals.
        Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
        Finds an expert based on their publications.
        Many approaches: keyphrases, citations, download count, affiliation.
        Treeratpituk, Chen, JCDL’13
    21. Future Work
        Lots of interesting work to do! Few computer/machine learning scientists involved.
        • Acquisitions - more documents, data, knowledge
        • Chemical 3D graph search
        • Fundamental chemical graph representation analysis
        • Table data storage and access
        • Figure search and data extraction and access
        • New data and feature search
          • spectra, experimental methods, instrumentation
        • New documents: 400K PubMed
        • Semantic chemical graphs
        • Expert/collaborator search
        • Search integration of all features
    22. DEMO
