Using Lucene/Solr to Build CiteSeerX and Friends
Presented by C. Lee Giles, Pennsylvania State University - See complete conference videos - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation and support of cyberinfrastructure. However, there exists no open source integrated system for building a search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. We propose the open source SeerSuite architecture, a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built for computer science (CiteSeerX), chemistry (ChemXSeer), archaeology (ArchSeer), acknowledgements (AckSeer), reference recommendation (RefSeer), collaboration recommendation (CollabSeer), and others, all using Solr/Lucene. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, sequence mining, etc., are critical for performance.

Presentation Transcript

  • Using Lucene/Solr to Build CiteSeerX and Friends
    Dr. C. Lee Giles, Information Sciences and Technology, Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
    giles@ist.psu.edu
    http://clgiles.ist.psu.edu
  • Prof. C. Lee Giles (http://clgiles.ist.psu.edu)
    • Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
      – Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
      – Large heterogeneous data and information systems
      – Specialty search engines and portals for knowledge integration
        • CiteSeerX (computer and information science)
        • ChemXSeer (e-chemistry portal)
        • GrantSeer (grant search)
        • RefSeer (recommendation of paper references)
    • Scalable intelligent tools/agents/methods/algorithms
      – Information, knowledge and data integration
      – Information and metadata extraction; entity disambiguation
      – Unique search, knowledge discovery, information integration, data mining algorithms
      – Web 2.0 methods
        • Automated tagging for search and information retrieval
        • Social network analysis
  • SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
    Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
    • P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.
    • Current funding: NSF, Dow Chemical
  • Outline
    • Motivation
      – Data science; cyberinfrastructure
      – Vast growth in domain science data and documents
    • SeerSuite
      – Tool for creating Seers
      – Specialized data and document search and recommendations
        • Tables, formulae, figures, references …
      – Use of Solr/Lucene
    • Disciplinary sciences, indexes & information extraction (the Seers)
      – Computer science
      – Chemistry
      – Briefly other Seers
    • Opportunities for research
    • Conclusions and directions
  • The Evolution of Science - the 4th Paradigm (Jim Gray's paradigm)
    • Observational Science
      – Scientist gathers data by direct observation
      – Scientist analyzes data
    • Analytical Science
      – Scientist builds analytical model
      – Makes predictions
    • Computational Science
      – Simulate analytical model
      – Validate model and make predictions
    • Data Driven Science
      – Data captured from the web, by instruments, or from documents
      – Data generated by simulation
      – Placed in data structures / files
      – Scientist(s) analyze(s) data
      – Access & search crucial
  • Data Access Varies with Discipline or Small vs Big Science
    • Small vs Big science
      – "Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science."
        • 'Lost in a Sea of Science Data', S. Carlson, The Chronicle of Higher Education (23/06/2006)
      – Data is local
      – Data will not be shared
    • At some point there will be a need for
      – indices to control search
      – parallel data search and analysis
    • Cyberinfrastructure can help
      – If you can't move the data around, take the analysis to the data!
      – Bandwidth of a van loaded with disks
      – Do all data manipulations locally
        • Build custom procedures and functions locally
  • SeerSuite
    • Open source search engine and digital library tool kit used to build search engines and digital libraries
      – CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
    • Supports research in
      – Indexing and search
      – Digital libraries
      – Data mining & structures
      – Information and knowledge extraction
      – Social networks
      – Scientometrics/infometrics
      – Systems engineering, user design
      – Software engineering and management
      – Web crawling
    • Trains students in search and software systems
      – Educational tool for search engine creation
      – Students highly sought in industry and government
  • SeerSuite - properties
    • Modular, scalable, extensible, robust design
      – Extensible to many problems and disciplines
    • Integrated features
      – Focused crawler - Heritrix
      – Indexer - Solr/Lucene
      – Metadata extraction - modular
      – Ranked results
    • Builds on experience with other domain engines and OS tools
      – Lucene and Solr
      – The MySQL Database and InnoDB Storage Engine
      – Apache Tomcat
      – Spring Framework
      – Acegi Security
      – ActiveMQ
      – ActiveBPEL Open Source Engine
      – Apache Commons Libraries
      – SVMlight support vector machine package
      – CRF++ conditional random field package
    • Hardware independent; Linux
    • Reuse not reinvent
  • Data Mining & Information Extraction in Seers
    • Data acquisition
      • SeerSuite systems often crawl the public web for new data
      • Many data types available
    • Richness of data offers unique data mining features
      • CiteSeerX as testbed/sandbox
      • Large scale data resources
        • Millions of documents, authors, etc.
      • Some common features/metadata
    • Commercial grade indexer (Solr/Lucene)
      • Scalable to G's of documents and M's of users
      • "Watson"
      • Modular design
      • Cloudable
    • State of the art algorithms (machine learning) for large scale unique metadata (information) extraction & mining
      • Unique parsers and indexing
      • Quality of extraction
        • Precision/recall
      • Ranking
      • Architecture/integration
  • Seer Friends
    • In various stages of the system lifecycle with various data resources and indexes:
      – Mature and developing, code released
        • CiteSeer, now CiteSeerX
        • ChemXSeer
        • TableSeer
        • YouSeer
      – New, future TBD, not all aspects public
        • ArchSeer
        • AlgoSeer
        • CollabSeer
        • RefSeer
        • SeerSeer
        • GrantSeer
      – Dead or limping by (could be revived)
        • AckSeer (acknowledgement indexing) (revived!)
        • BizSeer
        • BotSeer
      – Proposed, but do not exist
        • BrainSeer
        • CensorSeer
        • ArXivSeer
  • Why Solr/Lucene?
    • Only open source considered – cost
    • Competitors:
      – Indri
      – Wumpus
      – Terrier
      – Others?
    • Must scale for both number of documents and users
    • Easily integrable and customizable
      – Other indexes, crawlers, ingestion, metadata extractors
    • Well used (Watson)
    • Active community of support
      – Enterprise platform a plus
    • Easy to transition to government/industry/academia
      – Apache license
  • Next Generation CiteSeer, CiteSeerX (http://citeseerx.ist.psu.edu)
    • 2 M documents
    • 40 M citations
    • 2 to 5 M authors
    • 2 to 4 M hits per day
    • 800K individual users
    • Entire data shared
    • Index - 50 G
  • History: CiteSeer (aka ResearchIndex)
    • Project at NEC Research Institute, Princeton
    • 1st academic document search engine
    • Very popular with computer science
    • Hosted at NEC from 1997 – 2004
    • Moved to Penn State as collaborators left
    • Provided a broad range of unique services, including automatic citation indexing, reference linking, full text indexing, similar documents listing, automated metadata extraction and several other pioneering features
    • Refactored and redesigned as CiteSeerX
      – Released 2008
      – Lucene based indexing
    • CiteSeer continuously running for 15 years!
    (Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence)
  • SeerSuite/CiteSeerX Architecture (Teregowda, USENIX '10)
    • Web Application
    • Focused Crawler
    • Document Conversion and Extraction
    • Document Ingestion
    • Data Storage
    • Maintenance Services
    • Federated Services
  • 4 systems (Teregowda, USENIX 2010):
    • Production
    • Crawling
    • Staging
    • Research
    All or some can be cloudized.
  • CiteSeerX Services
    CiteSeerX is a very automated system:
    • Full OAI metadata if available
    • Full text indexing (many different indexes)
      – Documents
      – Citations
      – Tables
      – More forthcoming (Algorithms, Figures, Acknowledgements)
    • Citation graph
      – Ranking based on citations
      – Linking documents: co-citations, citing documents
    • Author disambiguation
      – Distinguish between authors with similar names
      – Profiles and publication information for author
    • Automatic crawling from list and submissions
    • Personalization
      – Login based access to features on CiteSeerX
      – Corrections to metadata
      – Storage of queries
      – Collection of papers
      – Follows document metadata changes
  • Focused Crawling (Zheng, CIKM'09)
    • Maintain a list of parent URLs where documents were previously found
      – Parent URLs are usually academic homepages
        • 300,000 unique parent URLs, as of summer 2011
      – Parent URLs are stored in a database table with two additional fields for scheduling:
        • Last time changed, i.e. when new documents were last obtained from the page
        • Estimated change rate according to previous crawls of this page
    • The crawling process starts with the scheduler selecting the 1000 parent URLs which have the highest probability of having new documents available
      – Assume a Poisson process for the change behavior of a parent page
        • Suppose a parent page P's last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 – exp(-R*Δ)
        • Larger R or larger Δ will give larger probability
        • After each crawl, the change rate of the scheduled parent URL should be recalculated
    • Crawling runs incrementally every day (invoked by a Linux cron job at 12 am)
      – Most discovered documents have been crawled before
        • Use hash table comparison for detection of new documents
        • Normally retrieve a few thousand NEW documents per day, sometimes fewer than 1k
    • Moved to a whitelist rather than a blacklist
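The scheduling rule on the crawling slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout `(url, last_change, rate)`, the function names and the example URLs are hypothetical and are not taken from the SeerSuite code; only the formula 1 – exp(-R*Δ) and the batch size of 1000 come from the slide.

```python
import math

def change_probability(rate, last_change, now):
    """Probability that a page changed since last_change, assuming a
    Poisson change process: 1 - exp(-R * delta)."""
    delta = max(0.0, now - last_change)
    return 1.0 - math.exp(-rate * delta)

def select_parents(parents, now, batch_size=1000):
    """Return the batch_size parent URLs most likely to hold new documents.

    parents: iterable of (url, last_change, rate) tuples (hypothetical layout).
    """
    scored = [(change_probability(rate, last, now), url)
              for url, last, rate in parents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:batch_size]]

# Toy data: times are in days, rates in expected changes per day.
parents = [
    ("http://example.edu/~a", 0.0, 0.10),
    ("http://example.edu/~b", 5.0, 0.10),
    ("http://example.edu/~c", 0.0, 0.50),
]
top_two = select_parents(parents, now=10.0, batch_size=2)
```

With these toy values the page with the highest rate ranks first, and of the two pages with equal rates the one that has gone longer without an observed change ranks higher, which matches the remark that larger R or larger Δ gives a larger probability.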
  • Documents from crawled URLs (plot): 90% of all citations come from the first 550 sites; 90% of all documents come from the first 1250 sites
  • How will we get metadata for fields? "Now... that should clear up a few things around here"
  • Metadata Extraction
    • Documents are converted from PDF/PS to text using converters
      – Converters include TET, PDFBox, pdftotext, gs
    • Documents are filtered by checking for the existence of references and for duplication (checksum)
    • Use tools or build your own
      – Metadata extraction system uses machine learning methods like SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document
    • Rule based templates are applied before extraction
  • Automatically Created DB of paper in CSX
    Example: "Tensor Decompositions and Applications", SIAM REVIEW, 2009, pp 455-500. Abstract: This …. Cited 34 times, 6 times by Author.
    Field          Value
    id             10.1.1.130.782
    title          Tensor Decompositions and Applications
    abstract       This ..
    year           2009
    pages          455-500
    publisher      SIAM
    venue          SIAM REVIEW
    venueType      JOURNAL
    version        2
    cluster        9248987
    repositoryID   10
    crawldate      12/30/2008
    public         True
    n-cites        34
    selfCites      6
    Values are assigned by the system, the extractor, the user, or inference.
  • 3 Tier Architecture (diagram). Components shown: User Request, Load Balancer, Web Application (Web 1, Web 2), Load Balancer, Index and Index - Tables (queries), Database (requests), Repository, Storage, Crawler, Extraction, Ingestion
  • CiteSeerX Software Overview
    • Ingestion process: responsible for obtaining and preparing a document and the related metadata
      – Process the document
        • Submitted by the user or crawler
      – Extract metadata
        • Header
        • Citations
        • Acknowledgements
      – Store the metadata and documents
    • Citation matching
      – Identifying the underlying graph structure: documents citing this document and the relationship between documents and citations
        • Inference matching and graph generation
    • User corrections (version maintenance)
      – Determine and accept valid user corrections
    • Regular notification mechanisms
      – Ensure that the user is notified when new documents are added to the collection
        • Linked to MyCiteSeer
    • Update and maintenance
      – Update and make valid the full text index and various statistics
      – Statistics
      – Index updates
  • CiteSeerX Search
    • Enabling search
    • Fulltext
    • Fields created
      – Title
      – Authors
      – Citations
      – Venue
      – Keywords
      – Abstract
      – Range (Publication)
      – Citations
  • Field Schema
    Field                  Type      Indexed/Stored
    DOI                    String    Y/Y - Unique
    Citation/Document      String    Y/Y
    Title                  Text      Y/Y
    Author                 A Text    Y/Y
    Authors Normalized     A Text    Y/N
    ncites (# cited by)    Integer   Y/Y
    URL                    String    Y/Y
    cites                  Tokens    Y/N
    citedby                Tokens    Y/N
    Timestamp              Date      Y/Y
    * A Text is a Text field which does not have a stopword filter or stemming
    ^ Tokens are a Text field with only duplicate removal and whitespace tokenizer
  • CiteSeerX Search Results
    • Results sorting
      – Relevance (default): based on dismax query handling with boosting
      – Citations: citations received by the document in collection, plus default
      – Year: publication date
      – Recency: date of acquisition
  • CiteSeerX Citation Graph
    • Relationships: Cited by, Cites
    • Citation graph
      – Store Cited by and Cites in index
    • Build
      – Build document graph by querying index for relationship
    (Diagram: example graph over documents A, B, C, D and E with Cites and Cited by edges)
  • Adding documents
    • Ingest documents for new crawls
      – Add metadata to collection
      – Add full text to system
      – Link metadata in collection
    • Run maintenance scripts
      – Poll updates and post to Solr
        • Fulltext
        • Metadata
        • Relationships
    • Challenge: maintain data freshness
  • Query Response
    • Query forwarded to Solr from the presentation layer (JSP)
    • Solr generates ranked response in JSON
    • Build each record in XML with the database (add fields: Abstract)
    • Presentation layer (JSP) formats records based on ranking
    (Diagram: Web, Web Interface, Database, Index)
  • Ranking with Boosting (Relevance)
    • Use of Boost Function, Minimum Match, Query Fields
    • Boost Function – the effect of citations
      – Map number of citations > 1 to 500
    • Minimum Match – 2
    • Query Fields
      – Text (1)
      – Title (4)
      – Abstract (2)
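As a rough illustration of how the query-field boosts and minimum match above could be expressed as Solr DisMax request parameters, here is a minimal Python sketch. It is an assumption-laden example rather than the CiteSeerX configuration: the lowercase field names `text`, `title` and `abstract` and the helper `build_dismax_params` are hypothetical, and the citation boost function is left out because the slide does not give its exact form. `defType`, `qf`, `mm` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs

def build_dismax_params(user_query):
    """Build request parameters for a DisMax query (hypothetical helper)."""
    return {
        "q": user_query,
        "defType": "dismax",                  # use the DisMax query parser
        "qf": "text^1 title^4 abstract^2",    # query fields with boosts from the slide
        "mm": "2",                            # minimum match from the slide
        "wt": "json",                         # ranked response returned as JSON
    }

query_string = urlencode(build_dismax_params("tensor decomposition"))
```

The resulting `query_string` would be appended to the select handler URL of whichever Solr core holds the document index.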
  • Query Response
    • Query at interface (JSP)
    • Hand over to web application (Java/Spring)
    • Hand over to Solr
    • Ranked response from Solr (JSON)
    • Response unwrapped and more details included with information from DB
    • Present response at interface (JSP)
    (Diagram: Web Interface, Web Application, Index and DB, exchanging query Q and response R as Text, HashMap and JSON)
  • Name Disambiguation
    • Name disambiguation (NER)
      – A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
    • Three types of name ambiguities:
      – Aliases - one person with multiple aliases, name variations, or name changed, e.g. CL Giles & Lee Giles, Superman & Clark Kent
      – Common names - more than one person shares a common name, e.g. Jian Huang – 103 papers in DBLP
      – Typography errors - resulting from human input or automatic extraction
    • Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
  • Efficient Large Scale Entity Disambiguation
    Testbed: CiteSeerX and PubMedSeer (Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009)
    • Entity disambiguation problem
      – Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc.
      – Entity normalization
    • Motivation
      – Enhance search functionalities for digital repositories
        • Fielded search by author name
      – Improve metadata quality
      – Improved social network analysis
      – Government and business intelligence
        • E.g. census data and credit records
    • Challenges
      – Accuracy
      – Scalability
      – Expandability
    • Key features
      – LASVM distance function
        • Active learning
          – Simpler and more accurate model
          – Better generalization power
        • Online learning
          – Expandable to new training data
      – DBSCAN clustering
        • Ameliorate labeling inconsistency (transitivity problem)
        • Efficient solution to find name clusters
        • N log N scaling
    (Diagram: actors, entities and documents pass through a Metadata Extraction Module and a Blocking Module that yields candidate classes; a Similarity Module with Soft-TFIDF and Jaccard similarity functions feeds an SVM distance function, trained by an online SVM distance learner with active learning and an annotator, followed by DBSCAN clustering)
  • Author Disambiguation Field
    • Currently uses author fields
      – For author search (both for author mentions and for disambiguated authors)
    • Future direction
      – Use Lucene index for blocking in author disambiguation: creating a candidate set of author mentions that could belong to the same cluster
  • Author Disambiguation (Treeratpituk, Giles, JCDL09)
    • Random Forest (RF)
      – Use random feature selection + bootstrap sampling to construct multiple decision trees from one training data set
      – Aggregate votes of a collection of decision trees as the final decision
      – The more independent each tree is, the better the improvement over a single decision tree
    • Author disambiguation with Random Forest
      – Various metadata are used as features in Random Forest to determine whether two author names from two papers refer to the same person
        • E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
      – Multiple distance functions are used for each type of metadata
        • E.g. TFIDF, Jaccard distance, for comparing affiliations
    • Compared with previous SVM-based approach
      – Shown to provide higher accuracy than SVM in the pair-wise author disambiguation task
      – Easy parameterization in the training phase (only number of trees and randomness at each node, no decision on kernel function needed), and performance is not sensitive to the parameters chosen
      – Provides a measurement of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for SVM with a non-linear kernel
      – Training time and classification time are linear in the number of trees and the data size
    • Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
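To make the idea of pairwise metadata features concrete, the following Python sketch computes a few similarity features for two author mentions; such a feature vector is what a pairwise classifier like the Random Forest described above would consume. The record layout, the feature names and the sample mentions are all hypothetical, and only the Jaccard measure is shown (the TFIDF-based distances mentioned on the slide are omitted).

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def pair_features(mention1, mention2):
    """Feature dict for one pair of author mentions (hypothetical layout)."""
    return {
        "coauthor_jaccard": jaccard_similarity(
            mention1["coauthors"], mention2["coauthors"]),
        "affiliation_jaccard": jaccard_similarity(
            mention1["affiliation"].lower().split(),
            mention2["affiliation"].lower().split()),
        "year_gap": abs(mention1["year"] - mention2["year"]),
    }

# Made-up mentions, for illustration only.
m1 = {"coauthors": ["A. Author", "B. Writer"],
      "affiliation": "Example State University", "year": 2006}
m2 = {"coauthors": ["B. Writer", "C. Scholar"],
      "affiliation": "The Example State University", "year": 2009}
features = pair_features(m1, m2)
```

Each labeled pair (same person or not) would contribute one such vector to the training set of the pairwise classifier.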
  • Data and Publications in the Field of Chemistry
    • Chemistry
      – not physics - no arXiv – or computer science - no CiteSeer
      – Legacy of early information access - Chem Abstracts
      – Cheminformatics is not bioinformatics
    • Chemistry has been up to recently a data poor field
      – Data sharing tradition just being established
      – Data creation is exploding - local (small science)
    • Journals and societies sensitive to their IP issues dominate the field
      – Unsubstantiated IP claims such as data in the paper belongs to the publisher
      – Discourage online versions of publications - ACS
    • Large powerful international companies have a vested interest in research
      – Chemical information extraction tools are easily monetized
    • Standards exist - CML, InChI
    • "Fixing the past so we can fix the future." Jeremy Frey
      – Chemistry is an old discipline with publications going back 100 years
    • Chemistry is compound centric, not algorithmic centric
      – Search is about the compound!
      – Compounds have a rich data environment: 3D graph structure, energies, etc.
  • ChemXSeer Architecture
    • Integrate and implement well-used open source tools
      – Use CiteSeerX tools when possible
      – Integrate into SeerSuite
    • Search
      – Chemical formulae unique search
      – Table search
      – Figure search
    • More data (grey literature) than documents
      – Automated information extraction modules based on machine learning methods
      – Lucene/Solr indices for extracted fields
      – Relational databases for datasets
    • Work closely with chemists to understand their needs
      – Tools for data conversion
    • Provide a public portal and repository for easy use
      – User access controls
    • Integrated visualization tools like JMOL for Gaussian data residing in our repository
    • APIs for users for extracted data
    • Data and document standards de facto: xml, pdf, etc.
  • chemxseer.ist.psu.edu
  • ChemXSeer Formula Search (B. Sun, WWW'07, WWW'08, TOIS'11; D. Yuan, ICDE'12)
    • Extraction and search of chemical formulae in scientific documents has been shown to be very useful
    • Intersection of two research areas:
      – Information retrieval
      – Chemoinformatics
    • Formulae cannot be treated as text
      – Domain knowledge (formula identification)
      – Structural knowledge (substructure finding and search)
  • Challenges in Formula Search
    How to identify a formula in scientific documents?
    • Non-formula
      – "… This work was funded under NIH grants …"
      – "… YSI 5301, Yellow Springs, OH, USA …"
      – "… action and disease. He has published over …"
    • Formula
      – "… such as hydroxyl radical OH, superoxide O2- …"
      – "… and the other He emissions scarcely changed …"
    • Machine learning algorithms (SVM + CRF) yield high accuracies for correct formula identification
  • Segmenting chemical names
    • Goal: to discover semantically meaningful sub-terms in chemical names
      – Methylethyl alcohol
      – methionylglutaminylarginyltyrosylglutamylserylleucyl phenylalanylalanylglutaminylleucyllysylglutamylarginyl lysylglutamylglycylalanylphenylalanylvalylprolylphenyl alanylvalylthreonylleucylglycylaspartylprolylglycylisol eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl threonylleucylisoleucylglutamylalanylglycylalanylaspartyl alanylleucylglutamylleucylglycylisoleucylprolylphenyl alanylserylaspartylprolylleucylalanylaspartylglycylprolyl threonylisoleucylglutaminylasparaginylalanylthreonylleucyl arginylalanylphenylalanylalanylalanylglycylvalylthreonyl prolylalanylglutaminylcysteinylphenylalanylglutamyl methionylleucylalanylleucylisoleucylarginylglutaminyllysyl histidylprolylthreonylisoleucylprolylisoleucylglycylleucyl leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl arginylglutaminylalanylalanylleucylarginylhistidylasparaginyl valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl prolylprolylaspartylalanylaspartylaspartylaspartylleucyl leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl
  • Chemical Search Aspects
    • Parsing
    • Extraction and tagging
    • Indexing
    • Ranking
  • Chemical Entity Extraction and Tagging
    • Name tagging
      – Each chemical name can be a phrase
      – Example
        • "... Determination of lactic acid and ..."
        • "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
    • Formula tagging
      – Each formula is a single term
      – Example
        • "... such as hydroxyl radical OH, superoxide ..."
      – Non-formula example
        • "... YSI 5301, Yellow Springs, OH, USA ..."
    • Tagging examples
      – Name tagging: "... of <name-type>lactic acid</name-type> and ..."
      – Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
  • Textual Chemical Molecule Information Indexing and Search
    • Index schemes:
      – Which tokens to index?
      – Indexing all subsequences generates a large size index
    • Segmentation-based index scheme
      – Used for indexing chemical names
      – First segment a chemical name hierarchically and then index substrings at each node
      – (Segmentation tree: methylethyl → methyl, ethyl; methyl → meth, yl; ethyl → eth, yl; meth → me, th)
    • Frequency-and-discrimination-based index scheme
      – Used for indexing chemical formulas
      – Sequentially select frequent and discriminative subsequences of a formula from the shortest to the longest
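The segmentation-based scheme can be illustrated with a small Python sketch that walks a segmentation tree and collects the substring at every node as an index term. The tree below is the one for "methylethyl" shown on the slide; how a name is segmented in the first place is not covered here, and the nested-list representation and the function name `collect_terms` are assumptions made for the example.

```python
def collect_terms(node, terms):
    """Add the substring at this node and at every descendant to terms.

    A leaf is a str; an internal node is a list of child nodes whose
    concatenated text forms the substring at that node.
    Returns the text spanned by node.
    """
    if isinstance(node, str):
        terms.add(node)
        return node
    text = "".join(collect_terms(child, terms) for child in node)
    terms.add(text)
    return text

# methylethyl -> (methyl -> (meth -> (me, th), yl), ethyl -> (eth, yl))
tree = [[["me", "th"], "yl"], ["eth", "yl"]]
index_terms = set()
full_name = collect_terms(tree, index_terms)
```

Only the substrings that appear as nodes of the tree are indexed, which is far fewer than all subsequences of the name.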
  • Features for Formula Indexing
    • Formula
      – A sequence of chemical elements or partial formulas with corresponding frequencies
      – E.g. CH3(CH2)2OH
    • Partial formula
      – Partial formula: a subsequence of a formula
      – E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc.
    • Index construction
      – Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
      – Too many partial formulas, need feature selection
  • Criteria of Feature Selection
    • Criteria of feature selection
      – Frequent features ($Freq_s \ge Freq_{min}$)
      – Discriminative features ($\alpha_s \ge \alpha_{min}$)
        • If a sequence's selected subsequences are enough to distinguish formulas containing them from other formulas, this sequence is redundant.
        • Discrimination score
          $$\alpha_s = \Big|\bigcap_{s' \in F \,\wedge\, s' \preceq s} D_{s'}\Big| \,/\, |D_s|$$
          where $F$ is the selected feature set, and $D_s$ is the set of formulas containing $s$.
  • An Example for Formula Indexing
    • Data set:
      – 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
    • Parameter:
      – $Freq_{min} = 2$, $\alpha_{min} = 1.1$
    • Steps:
      – Length=1, Candidates={C,H,O}, F={C,H,O}
      – Length=2, Candidates={CH3,H3C,CO,OO,OH,CH2}, Frequent Candidates={CH3,CO,OO,OH,CH2}
        $\alpha_{CH3} = |\{1,2,3\}_{C} \cap \{1,2,3\}_{H}| \,/\, |\{1,2,3\}_{CH3}| = 1$
        $\alpha_{CO} = |\{1,2,3\}_{C} \cap \{1,2,3\}_{O}| \,/\, |\{1,3\}_{CO}| = 1.5$
        Frequent & Discriminative Candidates={CO,OO,CH2}
        F={C,H,O,CO,OO,CH2}
      – Length=3, …
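The length-2 step of this worked example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: containment of a partial formula is tested as a plain substring match on the formula string, which happens to work for this data set but would confuse, say, C with Co in general; the function names are hypothetical; and the candidate list is copied from the slide rather than generated.

```python
formulas = {1: "CH3COOH", 2: "CH3(CH2)2OH", 3: "CH3(CH2)3COOH"}

def containing(s):
    """D_s: ids of the formulas containing s (naive substring test)."""
    return {i for i, formula in formulas.items() if s in formula}

def discrimination(s, selected):
    """alpha_s = |intersection of D_t over selected t contained in s| / |D_s|."""
    common = set(formulas)
    for t in selected:
        if t in s:
            common &= containing(t)
    return len(common) / len(containing(s))

freq_min, alpha_min = 2, 1.1
selected = {"C", "H", "O"}                           # F after the length-1 step
candidates = ["CH3", "H3C", "CO", "OO", "OH", "CH2"]  # length-2 candidates

frequent = [s for s in candidates if len(containing(s)) >= freq_min]
chosen = [s for s in frequent if discrimination(s, selected) >= alpha_min]
```

Here H3C is dropped as infrequent, CH3 and OH are dropped because their score is 1 (the elements already selected identify the same formulas), and CO, OO and CH2 are kept, matching the sets listed on the slide.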
  • Formula Search
    • SF.IEF: Subsequence Frequency & Inverse Entity Frequency
      $$SF(s,e) = \frac{Freq(s,e)}{|e|}, \qquad IEF(s) = \log\frac{|C|}{|\{e \mid s \preceq e\}|}$$
    • Exact formula search
      – Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
    • Frequency formula search
      – Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
      – Partial frequency search: similar but allow unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
      – Ranking function
        $$score(q,e) = \sum_{s \in q} SF(s,e)\, IFF(s)^2 \,/\, \Big(|f| \times \sum_{s \in q} IFF(s)^2\Big)$$
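The full and partial frequency searches can be sketched in Python as a filter over element counts. This is an illustrative sketch, not the ChemXSeer implementation: it handles only flat formulas without parentheses, the regular expressions and function names are assumptions, and ranking is not included.

```python
import re
from collections import Counter

ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
RANGE = re.compile(r"([A-Z][a-z]?)(\d+)(?:-(\d+))?")

def element_counts(formula):
    """Total count of each element in a flat formula, e.g. CH3CH3 -> C2 H6."""
    counts = Counter()
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts

def parse_query(query):
    """Parse 'C1-2H4-6' or '*C1-2H4-6' into ({element: (low, high)}, partial)."""
    partial = query.startswith("*")
    ranges = {}
    for symbol, low, high in RANGE.findall(query.lstrip("*")):
        ranges[symbol] = (int(low), int(high) if high else int(low))
    return ranges, partial

def frequency_match(query, formula):
    """True if formula satisfies the element frequency ranges of query."""
    ranges, partial = parse_query(query)
    counts = element_counts(formula)
    for symbol, (low, high) in ranges.items():
        if not low <= counts.get(symbol, 0) <= high:
            return False
    # A full frequency search rejects elements the query does not mention.
    if not partial and set(counts) - set(ranges):
        return False
    return True

full_hits = [f for f in ["CH4", "C2H6", "H6C2", "CH3CH3", "CH4O", "C2H6O2"]
             if frequency_match("C1-2H4-6", f)]
```

Because only element totals are compared, the order of elements in the formula is ignored, as the slide requires for frequency searches.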
  • Formula Search: substructure
    • Substructure formula search
      – Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
      – Ranking function
        $$score(s,e) = W_{match(s,f)}\, SF(s,e)\, IFF(s) \,/\, |e|$$
        where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match
    • Similarity formula search
      – Search for formulas with a similar structure of the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
      – Ranking function
        $$score(q,e) = \sum_{s \preceq q} W_{match(q,e)}\, W(s)\, SF(s,q)\, SF(s,e)\, IFF(s) \,/\, |e|$$
    • Conjunctive search of the basic types of formula searches
      – E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
    • Document query rewriting
      – E.g. document query atom formula:=CH4 is rewritten to atom (CH4 OR CD4), if formula search of =CH4 matches CH4 and CD4.
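The three match types named for substructure search (exact, reverse, parsed) can be told apart with a small Python sketch. The interpretation used here is an assumption based on the slide's -COOH example: an exact match is a contiguous run of the query's element tokens, a reverse match is the same run in reversed order, and a parsed match only requires that the formula contain at least the query's element counts. The function names are hypothetical, parentheses are not handled, and the numeric weights are not modeled because the slide gives none.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Z][a-z]?\d*")

def tokens(formula):
    """Split a flat formula into element tokens, e.g. CH3COOH -> C,H3,C,O,O,H."""
    return TOKEN.findall(formula)

def element_counts(formula):
    counts = Counter()
    for token in tokens(formula):
        symbol, number = re.match(r"([A-Z][a-z]?)(\d*)", token).groups()
        counts[symbol] += int(number) if number else 1
    return counts

def contains_run(haystack, needle):
    """True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))

def match_type(substructure, formula):
    """Return 'exact', 'reverse', 'parsed' or None for a query like '-COOH'."""
    sub = substructure.lstrip("-")
    query_tokens, formula_tokens = tokens(sub), tokens(formula)
    if contains_run(formula_tokens, query_tokens):
        return "exact"
    if contains_run(formula_tokens, query_tokens[::-1]):
        return "reverse"
    needed, available = element_counts(sub), element_counts(formula)
    if all(available[symbol] >= n for symbol, n in needed.items()):
        return "parsed"
    return None

results = {f: match_type("-COOH", f)
           for f in ["CH3COOH", "HOOCCH3", "CH3CHO2", "CH4"]}
```

A ranking step would then map each label to its weight before combining it with the frequency terms of the score.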
  • Formula Search: Query Models
    Many models are possible, from exact to semantic. Models are discriminated by matching algorithms.
    •  Exact search
       –  Search for exact representations
       –  E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
    •  Frequency searches
       –  Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements
       –  E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
       –  Partial frequency search: similar but allows unspecified elements
       –  E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
    •  Substructure search
       –  Search for formulae that may have a substructure
       –  E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
    •  Similarity search
       –  Search for formulae with a structure similar to the query formula. Feature-based approach using partial formulae matching.
       –  E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
  • Ranking formulae
    •  Ranking formulae has to depend on need and importance
    •  Focus on structural methods and frequency
    •  Importance can be introduced by citation rank or PageRank or others
    •  SF.IFF
       –  Substructure frequency and inverse formula frequency
    •  Frequency searches
       –  $score(q,f) = \sum_{e \in q} SF(e,f)\, IFF(e)^2 \,/\, \big(|f| \times \sum_{e \in q} IFF(e)^2\big)$
       –  where |f| is the total frequency of elements
    •  Substructure search
       –  $score(q,f) = W_{match(q,f)}\, SF(q,f)\, IFF(q) \,/\, |f|$
       –  where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match
    •  Similarity search
       –  $score(q,f) = \sum_{s \preceq q} W_{match(q,f)}\, W(s)\, SF(s,q)\, SF(s,f)\, IFF(s) \,/\, |f|$
  • Chemical compounds as graphs
    •  Chemical compound modeled as a semantic graph with properties
       –  Atom: vertex/node in the graph
       –  Bond: edge in the graph
       –  Dimensions: 3 or 4
    Above figures are copied from eMolecules.com
  • What Is Chemical Structure Search?
    •  Substructure search
       –  Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
    •  Superstructure search
       –  Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input.
    •  Similarity search
       –  Given an input chemical structure sketch, find all the chemical compounds “similar” to the input.
  • Table Search
    Tables are widely used to present experimental results or statistical data in scientific documents; some data only exists in these tables.
    Current search engines treat tabular data as regular text
    •  Structural information and semantics not preserved.
    Goal: automatically identify tables, extract table metadata from PDF documents into XML, and rank data
    Table Metadata Representation:
    •  Environment metadata: (document specifics: type, title, …)
    •  Frame metadata: (border left, right, top, bottom, …)
    •  Affiliated metadata: (caption, footnote, …)
    •  Layout metadata: (number of rows, columns, headers, …)
    •  Cell content metadata: (values in cells)
    •  Type metadata: (numeric, symbolic, hybrid, …)
    Y. Liu AAA’07, JCDL’07.
  • Tables
    •  A history that pre-dates that of sentential text
       –  Cuneiform clay tablets
    •  Have not received the same level of formal characterization enjoyed by sentential text
    •  Varying and irregular formats
    •  Different intuitive understanding of what a “table” is.
       –  Is the Periodic Table of the Elements a table?
       –  Tables vs. lists?
       –  Tables vs. forms?
       –  Tables vs. figures?
       –  Genuine table vs. non-genuine table? [12]
    •  Our definition: scientific genuine table
       –  Caption + tabular structure
       –  Ruling lines are not required
  • TableSeer: beta design of a table search engine
  • TableSeer System Architecture
  • Page Box-Cutting Algorithm
    •  Improves table detection performance by excluding more than 93.6% of the document content at the outset
  • Sample Table Metadata Extracted File
    <Table>
      <DocumentOrigin>Analyst</DocumentOrigin>
      <DocumentName>b006011i.pdf</DocumentName>
      <Year>2001</Year>
      <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
      <Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269-3060</Author>
      <TheNumOfCiters></TheNumOfCiters>
      <Citers></Citers>
      <TableCaption>Table 1 Temperature effect o n r esistance change ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
      <TableColumnHeading>D R Temperature/ ¡ã C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
      <TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
      <TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
      <ColumnNum>5</ColumnNum>
      <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
      <PageNumOfTable>3</PageNumOfTable>
      <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
    </Table>
  • TableRank
    •  Rank tables by rating the <query, table> pairs instead of the <query, document> pairs: this prevents many of the false positive hits for table search that frequently occur in current web search engines
    •  The similarity between a <table, query> pair: the cosine of the angle between vectors
    •  Tailored term vector space => table vectors:
       –  Query vectors and table vectors, instead of document vectors
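The cosine similarity between a query vector and a table vector can be sketched as follows. For brevity the sketch uses raw term counts over whitespace-split text, which is an assumption; TableRank itself weights terms with the TTF-ITTF scheme and boost factors described on the following slides. The two sample "tables" are invented strings for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector with raw term counts (a stand-in for TTF-ITTF weights)."""
    return Counter(text.lower().split())


def cosine(query_vector, table_vector):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * table_vector.get(term, 0) for term, w in query_vector.items())
    norm_q = math.sqrt(sum(w * w for w in query_vector.values()))
    norm_t = math.sqrt(sum(w * w for w in table_vector.values()))
    if norm_q == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_q * norm_t)


# Hypothetical table metadata (caption text only).
tables = {
    "t1": term_vector("temperature effect on resistance change of tin oxide thin film"),
    "t2": term_vector("funding agencies ranked by number of acknowledgements"),
}
query = term_vector("tin oxide resistance")

# Rank <query, table> pairs by similarity, highest first.
ranked = sorted(tables, key=lambda name: cosine(query, tables[name]), reverse=True)
```

Because each table gets its own vector, a document that mentions the query terms only in its body text, and not in any table metadata, contributes nothing to the ranking.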
  • Table Index
    •  Index:
       –  Captions
       –  Footnotes
       –  Reference text
    •  Boosting:
       –  Captions (2)
    •  Function:
       –  Inversely (reciprocally) proportional to #cites.
  • Term Weighting for Tables
    –  TTF-ITTF: Table Term Frequency-Inverse Table Term Frequency
    –  TLB: Table Level Boost factors (e.g., table frequency)
    –  DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)
  • Table term ranking
    •  A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
    •  Similar to a document abstract, table metadata and table queries should be treated as semi-structured text
       –  Not complete sentences; they express a summary
    •  P = 0.5 (G. Salton 1988)
    •  b is the total number of tables
    •  IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
  • Table Level Boost and Document Level Boost
    •  Btbf is the boost value of the table frequency
    •  Btrt is the boost value of the table reference text (e.g., the normalized length)
    •  Btp is the boost value of the table position. r is a parameter, which is 1 if users specify the table position in the query. Otherwise, r = 0.
    •  IVj: document Importance Value (IV). If a table comes from a document with a high IV, all the table terms of this document should get a high document level boost
    •  ICj: the inherited citation value
    •  DOj: source value (the rank of the journal/conference proceeding)
    •  DFj: document freshness
  • Table citation network
    •  Similar to the PageRank network
       –  Documents construct a network from the citations
       –  The “incoming links”: the documents that cite the document in which the table is located
       –  Exponential decay used to deal with the impact of the propagated importance
    •  Unlike the PageRank network
       –  Directed Acyclic Graph
       –  Importance Value (IV) of a document not decreased as the number of citations increases
       –  IV not divided by the number of outbound links
    •  A document may have multiple, one, or no tables
    •  Each table consists of a set of metadata
    •  The same keywords may appear in different metadata in different tables
  • Table Search Summary
    •  A novel, first table ranking algorithm: TableRank
    •  A tailored table term vector space
    •  A table term weighting scheme: TTF-ITTF
       –  Aggregating impact factors from three levels: the term, the table, and the document
    •  Index table referenced texts, term locations, and document backgrounds
    •  Design and implement the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
    •  Code released
    •  Currently implemented in CiteSeerX: millions of tables
    •  Improving extraction: Dow Chemical support
  • Automated Figure Data Extraction and Search
    •  A large amount of results in digital documents is recorded in figures, time series, and experimental results (e.g., NMR spectra, income growth), and this is the only record of the data
    •  Extraction for purposes of:
       –  Further modeling using presented data
       –  Indexing, metadata creation for storage & search on figures for data reuse
    •  Current extraction done manually!
    Diagram labels: Documents, Extracted Plot, Extracted Info., Document Index, Plot Index, Merged Index, Digital Library, User
  • Seer Figure/Plot Data Extraction and Search
    Numerical data in scientific publications are often found in figures. Tools that automate the data extraction from figures provide the following:
    •  Increase our understanding of key concepts of papers
    •  Provide data for automatic comparative analyses.
    •  Enable regeneration of figures in different contexts.
    •  Enable search for documents with figures containing specific experiment results.
    X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
  • Metadata & data to extract: 2-dimensional plot
    •  Y-axis labels, legend, data points, ticks, axis units, X-axis label
    Figure: snapshot of a document and the extracted 2D plot
  • Our Approach to Plot Data Extraction
    •  Identify and extract figures from digital documents
       –  ASCII and image extraction (xpdf)
       –  OCR: bitmap, raster PDFs
    •  Identify figures as images of 2D plots using SVM (only for bitmap images)
       –  Hough transform
       –  Wavelet coefficients of image
       –  Surrounding text features
    •  Binarization of the identified 2D plots for preprocessing (not needed for vectorized images)
       –  Adaptive thresholding
    •  Image segmentation to identify regions
       –  Profiling or image signature
    •  Text block detection
       –  Nearest neighbor
    •  Data point detection
       –  K-means filtering
    •  Data point disambiguation for overlapping points
       –  Simulated annealing
  • Future Directions
    •  System integration within ChemXSeer or CiteSeerX
       –  XML data generation
       –  Open source tool in Lucene/Solr
    •  Extension to other figures (3D, …)
    Figure: sample 3D plot
  • ChemXSeer Highlights
    •  Portal for academic researchers in environmental chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
    •  Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
       –  Chemical formulae and names
       –  Tables
       –  Figures
       –  Publication functions as in CiteSeerX
       –  Interoperability: ORE-Chem development
       –  Novel ranking required
    •  After extraction, data is stored as API-accessible XML for users
    •  Hybrid repository (not fully open): serves as a federated information interoperational system
       –  Scientific papers crawled and indexed from the web
       –  User-submitted papers and datasets (e.g. Excel worksheets, Gaussian and CHARMM toolkit outputs)
       –  Scientific documents and metadata from publishers (e.g. Royal Society of Chemistry)
    •  Access control for publisher-provided content and user-submitted experiment data
    •  Takes advantage of developments in other funded cyberinfrastructure and open source projects
       –  CiteSeerX, PlanetLab, Lucene/Solr, ORE, others
       –  Some released open source
  • Experimental Collaborator Recommendation System
    •  CollabSeer currently supports 400k authors
    •  http://collabseer.ist.psu.edu
  • Collaboration recommendation
    •  Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
    •  Use social network and topics to recommend collaborators of collaborators (FOF)
    •  Devise SN index and ranking scheme
    •  Explore models of vertex similarity
    •  Built on SeerSuite
    •  Other recommendations?
       –  Experimental methods
       –  Chemicals?
    Gou JCDL’10, Gou MIR’10, Chen JCDL’11, SAC’12
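The "collaborators of collaborators" idea can be sketched on a toy coauthor graph. The slides mention exploring several vertex similarity models without naming the one used in this step, so the sketch below picks the Jaccard coefficient over coauthor sets as an assumed example. The graph, author labels, and function names are invented.

```python
def jaccard(a, b):
    """Jaccard vertex similarity: shared neighbours over all neighbours."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(coauthors, author):
    """Rank friends-of-friends who are not yet coauthors of `author`."""
    own = coauthors[author]
    candidates = set()
    for friend in own:
        candidates |= coauthors[friend]
    candidates -= own | {author}  # drop existing coauthors and the author
    scored = [(jaccard(own, coauthors[c]), c) for c in candidates]
    # Highest similarity first; ties broken alphabetically for a stable order.
    return [c for _, c in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy undirected coauthor graph stored as adjacency sets.
coauthors = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D", "E"},
    "D": {"B", "C"},
    "E": {"C"},
}

suggestions = recommend(coauthors, "A")
```

Here D ranks ahead of E for author A because D shares both of A's coauthors while E shares only one. A topic-aware ranking would combine this structural score with a textual one.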
  • Recommendation list and user’s topic of interest
  • Users refine the recommendation list by clicking on their topic of interest. (Left: refined by “query processing”; right: default recommendation list)
  • How two potential collaborators are linked by common collaborators
  • CollabSeer Framework
  • Integration of Vertex Similarity and Textual Similarity
    •  S: vertex similarity
    •  SC.O.T.: collaborator’s contribution to a specified topic
    •  Use the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero
    •  Other measures?
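The reason for using exponentials can be seen in a two-line sketch. The exact formula did not survive in the transcript, so the form below, exp(S) times exp(SC.O.T.) with no extra scaling, is only an assumed illustration of the stated idea.

```python
import math


def combined_score(vertex_similarity, topic_contribution):
    """Product of exponentials: stays positive even when one input is zero."""
    return math.exp(vertex_similarity) * math.exp(topic_contribution)


def plain_product(vertex_similarity, topic_contribution):
    """Naive product, shown for contrast: a single zero wipes out the score."""
    return vertex_similarity * topic_contribution
```

With a vertex similarity of 0 and a topic contribution of 0.5, the plain product is 0, while the exponential form still separates that candidate from one whose topic contribution is also 0.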
  • RefSeerX: recommend citations for papers
    •  Use these paper citations
    •  The authors are unaware of related work they do not know they are looking for
    •  Recommends related citations
    •  Based on
       –  Existing citations
       –  Citation context
       –  Venue and importance
       –  Contemporary vs. seminal
    He, WWW ’10, WSDM ’11; Kataria, CIKM ’10, IJCAI ’11
  • Expert Search
    •  Expert search for authors, currently in alpha
  • Keyphrase Extraction for experts
    •  Text Document
    •  Section Parser: parse document into sections with regular expressions
    •  Candidate Extractor (input: DBLP data): use DBLP statistics to extract keyphrase candidates
    •  Random Forest (input: Training Data): train a random forest to classify & rank whether a phrase is a keyphrase
    •  Top Keyphrases
    Treeratpituk, P., Teregowda, P., Huang, J. and Giles, CL. SEERLAB: A System for Extracting Keyphrases from Scholarly Documents, Semeval-2010 task 5: Automatic keyphrase extraction from scientific article. ACL workshop on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
  • GrantSeer
    •  Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, foundations
    •  Link PIs with their
       –  Grants
       –  Publications
       –  Citations
       –  Organization
       –  Expertise
       –  Others?
    •  Data that can be shared
       –  CiteSeerX or Google Scholar data
       –  Database of funded research
    Funded by NSF (Julia Lane)
  • Cover page NSF XML extraction
  • GrantSeer: PI profile
    Figure labels: grants awarded, PI’s expertise, publications + citations
  • Algorithm Search
    •  Homepage search for authors, currently in alpha
  • AlgorithmSeer: Algorithm Search
    –  Extraction
    –  Indexing
    –  Ranking
    Suite Workshop ICSE ’11
  • Algorithm Search
  • Metadata extraction
    •  Extract
       –  Pseudo-codes and their metadata
       –  Captions
       –  Reference sentences
       –  Synopsis
       –  Etc.
    •  Index metadata using Solr to make the pseudo-codes searchable
    •  Each search result has a pointer to the page in the document where the pseudo-code appears
  • Index Fields
    id        <string>
    caption   <text>
    reftext   <text>    (Reference Sentences)
    synopsis  <text>    (Summarizing Text)
    page      <sint>    (Page Number)
    paperid   <string>  (Document ID)
    year      <sint>    (Year of Publication)
    ncites    <sint>    (Number of Citations)
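A query against an index with the fields listed above could be assembled as in the sketch below. It only builds the request URL for Solr's standard `/select` handler using the common `q`, `fl`, `sort`, `rows`, and `wt` parameters. The core URL, the choice to search all three text fields, and the sort by `ncites` are assumptions for illustration; the slides do not describe how AlgorithmSeer ranks results.

```python
from urllib.parse import urlencode


def algorithm_search_url(base_url, keywords, rows=10):
    """Build a Solr /select URL that searches the pseudo-code metadata fields."""
    params = {
        # Search caption, reference sentences, and synopsis text.
        "q": f"caption:({keywords}) OR reftext:({keywords}) OR synopsis:({keywords})",
        # Return the fields needed to point the user at the right page.
        "fl": "id,caption,page,paperid,year,ncites",
        # Assumed ordering: most-cited documents first.
        "sort": "ncites desc",
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local core name; adjust to the actual deployment.
url = algorithm_search_url("http://localhost:8983/solr/algorithms", "shortest path", rows=5)
```

Each returned document carries `paperid` and `page`, which is enough to link a hit back to the page where the pseudo-code appears.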
  • AckSeer
  • AckSeer
  • Funding Agency Impact
    •  Funding agency impact is based on acknowledgement indexing
       –  # of acknowledgements
       –  Total citations
       –  #Citation / #ack metric (C/A)
    •  Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer
    •  New system available this spring: AckSeer
    Giles, PNAS, 2004

    Name                                                          Acknowledgements  Total Citations  C/A Metric
    Funding Agencies
    National Science Foundation                                   12287             144643           11.77
    Defense Advanced Research Projects Agency                     4712              80659            17.12
    Office of Naval Research                                      3080              48873            15.87
    Deutsche Forschungsgemeinschaft                               2780              9782             3.52
    National Aeronautics and Space Administration                 2408              21242            8.82
    Engineering and Physical Science Research Council             2007              16582            8.26
    Air Force Office of Scientific Research                       1657              16850            10.17
    National Sciences and Engineering Research Council of Canada  1422              12050            8.47
    Department of Energy                                          1054              5562             5.28
    Australian Research Council                                   1010              5464             5.41
    European Union Information Technologies Program               825               9594             11.63
    National Institutes of Health                                 709               7279             10.27
    Army Research Office                                          666               7709             11.58
    Netherlands Organization for Scientific Research              646               2843             4.4
    Science and Engineering Research Council                      489               6976             14.27
    Companies
    International Business Machines                               1380              23948            17.35
    Intel Corporation                                             962               14441            15.01

    Second table on the slide (names truncated, values not captured):
    Educational Institutions: Carnegie Mello… University, Massachusetts … of Technology, California Inst… Technology, Santa Fe Institu…, French Nationa… Institute for Re… Computer Scie…, Stanford Unive…, University of C… at Berkeley, National Cente… Supercomputin… Applications, International C… Science Institu…, Cornell Univer…, University of I… Urbana-Champ…, USC Informati… Sciences Instit…, University of C… Los Angeles, McGill Univer…, Australian Nat… University
    Individuals: Olivier Danvy, Oded Goldreic…
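The C/A metric in the funding agency table is total citations divided by number of acknowledgements, which can be confirmed against a few rows. The helper name `ca_metric` and the choice of rows are illustrative only.

```python
def ca_metric(citations, acknowledgements):
    """Citations per acknowledgement, rounded to two decimals as on the slide."""
    return round(citations / acknowledgements, 2)


# (name, acknowledgements, total citations) copied from the table.
agencies = [
    ("National Science Foundation", 12287, 144643),
    ("Deutsche Forschungsgemeinschaft", 2780, 9782),
    ("Defense Advanced Research Projects Agency", 4712, 80659),
]

# Order agencies by impact per acknowledgement rather than by raw volume.
ranked = sorted(agencies, key=lambda row: ca_metric(row[2], row[1]), reverse=True)
```

Sorting by C/A rather than by acknowledgement count changes the order: among these three rows the agency with the most acknowledgements is not the one with the highest C/A.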
  • Most Acknowledged Authors and Impact Factor
    •  Olivier Danvy: interviewed by Nature as to why he was the most acknowledged computer scientist
    •  Who is most acknowledged?
       –  Mom or dad
       –  Theorists or experimentalists
    •  Who has a better metric?

    Author              Citations  Acknowledgements  C/A Metric
    Olivier Danvy       847        268               29.85
    Oded Goldreich      3277       259               17.82
    Luca Cardelli       3847       247               43.91
    Tom Mitchell        3336       226               24.31
    Martin Abadi        3507       222               43.46
    Phil Wadler         3780       181               40.07
    Moshe Vardi         3786       180               33.86
    Peter Lee           1790       167               53.54
    Avi Wigderson       2566       160               18.13
    Matthias Felleisen  1622       154               30.55
    Benjamin Pierce     1484       152               30.53
    Noga Alon           2640       152               15.71
    John Ousterhout     3693       152               41.9
    Frank Pfenning      1639       148               13.84
    Andrew Appel        2064       144               52.99
  • Clouding CiteSeerX
    •  Hosting cloud CiteSeerX instances
       –  Economic issues
          •  Cost of hosting
          •  Cost of refactoring the source to be hosted in the cloud.
       –  Computational/technical issues
          •  What workflow to cloudize
          •  Component modification for efficient operation
          •  VM size: storage, memory and CPU sizing as a function of needs
          •  Establishing computational needs and availability clusters
          •  Appropriate load balancing across multiple sites.
          •  Security of data stored including metadata and user data.
       –  Policy issues
          •  Privacy of user data
          •  Copyright issues.
    Teregowda Cloud’10 USENIX’10
  • SeerSuite Research/Development Opportunities
    •  Old Seers
       –  Improve or revive old systems and port them into competitive SeerX space
          •  eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
    •  New Seers
       –  New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
       –  MyCiteSeerX
    •  Better features
       –  Parsing
       –  Entity disambiguation
       –  Citation analysis
       –  Ranking; ranking, ranking
    •  New features
       –  New parsing, indexing, ranking
          •  Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
       –  Homepage linking
       –  ORE search and data integration
       –  Collaborative spaces
       –  API/web services
       –  Integration with DL such as Fedora
       –  New clusters
          •  Topics, venues, affiliations
       –  Recommender systems
       –  SNA analysis
       –  Others
    Collaborations welcomed! Data and software available
  • Research SeerSuite supports
    •  Many uses as a research testbed and support structure
       –  Scaling of algorithms for IR, IE, data mining, social networks, ...
       –  NLP methods on large text collections
       –  ML methods to automatically extract data
       –  Novel indexing and ranking
       –  Federated search
       –  Collaborative and social networks
       –  Focused crawling: new data resources
       –  Interface design and integration
       –  Systems analysis
    •  Many development and applied research issues
       –  Integration with other DLs
       –  Automated feature development
       –  Transfer to nontechnical use
       –  Cloud based delivery
  • Summary
    •  Propose an infrastructure for academic and scientific search engine/digital library creation: SeerSuite
       –  Modular, scalable, extensible, robust
       –  Based on commercial grade open source (Solr/Lucene); easy to use
       –  Easy to apply to other domains (separable indexes and projects, integration)
    •  Allows scalable data mining and information extraction for actual systems
       –  Unique information extraction plugins
       –  Focus on unique scalable extraction/data mining methods
          •  Most methods less than N² complexity
       –  Automatically populates databases or data structures
    •  Demonstrate with beta systems in
       –  Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation
       –  Personal features
    •  Systems are reasonably easy to build; issues are
       –  Data collection or data access
       –  Information extraction, indexing, ranking
    •  Many uses as a research testbed
       –  Data sharing models
    •  To find a Seer, search Google or use my homepage.
  • Opportunities
    •  Science is being flooded with data
       –  Simulations, sensors, web
    •  Digital humanities is right behind
    •  Needs in
       –  Large scale data management (tera to peta)
          •  NoSQL databases: graphs, documents, floating point
       –  Large scale
          •  data mining
          •  information extraction
          •  search
    •  Domain expertise crucial
    •  Reuse not reinvent (much is out there)
    •  Solr/Lucene is great for demos, production, and research.
  • “Human attention is the scarce resource, not information.” Herbert A. Simon, Nobel Laureate, 1997.
    For more information
    •  clgiles.ist.psu.edu
    •  giles@ist.psu.edu
    •  SourceForge.com