EB-eye Back End
The new EBI search engine: EB-eye
Backend: An overview of what is under the hood.

Usage Rights

© All Rights Reserved

EB-eye Back End: Presentation Transcript

  • The new EBI search engine: EB-eye
    • Backend: An overview of what is under the hood
    • Franck Valentin, External Services group
    • Bioinformatics masters' students Open Day
  • Summary
    • What is available
    • Parsing
    • Indexing challenge
    • Software behind EB-eye
    • Acknowledgements
  • What data is available?
    • Domains include Array Express, Ligand, Interpro
    • > 20 domains
    • > 130M entries
    • > 550 GB of data
  • What data is available: formats
    • Domains shown: Array Express, Ligand, Interpro
    • XML files: <XML> . . . </XML>
    • Flat files with labelled lines: ID : .. / PARENT ID : .. / RANK : .. / ...
    • Flat files with two-letter line codes: ID ... / AC ... / DT ...
  • What data is available: sizes
    • Sizes shown per domain: 43M, 4.2G, 1G, 8.4G, 81M, 57G (>500 files), 374G (>600 files)
    • Domains labelled on the slide include Interpro
  • Points to take into consideration
    • Our World
      • A variety of file formats
      • A large amount of data
      • A variety of file sizes
      • Data formats are changing
    • Our Quest
      • Index the data as fast as possible
      • Add and configure a new domain easily
      • Detect errors in the data
  • Parsing and indexing different formats
    • Pipeline (diagram): source data (flat files, XML files, XML dump files, Db) is read by grammar-driven parsers, which feed the Indexer (Lucene API), which writes one index per domain (Uniprot Index, Embl Index, Taxonomy Index, ...)
    • Grammars: EMBL grammar, Taxonomy grammar, Uniprot grammar, ...; Medline grammar, Interpro grammar, Dump file grammar, ...
    • Parsers: ANTLR and ANTXR
    • XML files, e.g. Medline:
        <MedlineCitationSet>
          <MedlineCitation Owner="NLM" Status="MEDLINE">
            <PMID>10997935</PMID>
            <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated>
            …
    • Fields extracted from a Medline citation: ID, Creation Date, Modification Date, issn, volume, name
        <MedlineCitationSet>
          <MedlineCitation Owner="NLM" Status="MEDLINE">
            <PMID>14216186</PMID>
            <DateCreated> <Year>1965</Year> <Month>02</Month> <Day>01</Day> </DateCreated>
            <DateCompleted> <Year>1996</Year> <Month>12</Month> <Day>01</Day> </DateCompleted>
            <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised>
            <Article PubModel="Print">
              <Journal>
                <ISSN IssnType="Print">0009-8981</ISSN>
                <JournalIssue CitedMedium="Print">
                  <Volume>10</Volume>
                  <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate>
                </JournalIssue>
                <Title>Clinica chimica acta; international journal of clinical chemistry</Title>
                <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation>
              </Journal>
              . . .
    • Flat files, e.g. an EMBL entry. Fields extracted: ID, AC, Creation date / Modification date, Description, Organism species, Organism classes, References
        ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
        XX
        AC   AF030562;
        XX
        DT   04-DEC-1997 (Rel. 53, Created)
        DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
        XX
        DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
        DE   OPW-03, sequence tagged site.
        XX
        KW   STS.
        XX
        OS   Fusarium venenatum
        OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
        OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
        XX
        RN   [1]
        RP   1-852
        RA   Yoder W.T., Christianson L.M.;
        RT   "Species-specific primers resolve members of the section Fusarium.
        RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
        RL   Fungal Genet. Biol. 0:0-0(1997).
        XX
        RN   [2]
        RP   1-852
        RA   Yoder W.T., Christianson L.M.;
        RT   ;
        RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
        RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
        RL   USA
        XX
        FH   Key             Location/Qualifiers
        FH
        FT   source          1..852
        FT                   /organism="Fusarium venenatum"
        FT                   /strain="ATCC20334"
        . . .
    • Dump file (XML):
        <database>
          <name>IntAct.Experiment</name>
          <description>Experimental procedures that allowed to…</description>
          <release>1.0</release>
          <release_date>2007-Feb-16</release_date>
          <entry_count>5697</entry_count>
          <entries>
            <entry id="EBI-77680">
            …
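The idea on this slide, turning differently formatted records into the same small set of named fields before indexing, can be illustrated with a minimal sketch. It is not the EB-eye implementation: the slides use ANTLR and ANTXR grammars feeding Lucene, whereas this example uses plain Python string handling and the standard-library XML parser. The function names `parse_embl_entry` and `parse_medline`, the output field names, and the shortened sample records are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened copies of the records shown on the slide.
EMBL_SAMPLE = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DT   04-DEC-1997 (Rel. 53, Created)
DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
"""

MEDLINE_SAMPLE = """\
<MedlineCitationSet>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10997935</PMID>
    <DateCreated><Year>2000</Year><Month>10</Month><Day>04</Day></DateCreated>
  </MedlineCitation>
</MedlineCitationSet>
"""


def parse_embl_entry(text):
    """Group the lines of one flat-file entry by line code, then pick fields."""
    by_code = defaultdict(list)
    for line in text.splitlines():
        code, _, rest = line.partition(" ")
        if code != "XX" and rest.strip():  # XX lines are empty separators
            by_code[code].append(rest.strip())
    dates = by_code["DT"]
    return {
        "id": by_code["ID"][0].split(";")[0],
        "accession": by_code["AC"][0].rstrip(";"),
        "creation_date": dates[0].split()[0],
        "modification_date": dates[-1].split()[0],
        "description": " ".join(by_code["DE"]),
        "organism_species": " ".join(by_code["OS"]),
    }


def parse_medline(xml_text):
    """Yield one field dictionary per MedlineCitation element."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        created = citation.find("DateCreated")
        yield {
            "id": citation.findtext("PMID").strip(),
            "creation_date": "-".join(
                created.findtext(part).strip() for part in ("Year", "Month", "Day")
            ),
        }


if __name__ == "__main__":
    print(parse_embl_entry(EMBL_SAMPLE))
    print(list(parse_medline(MEDLINE_SAMPLE)))
```

Both functions return plain dictionaries, so a single indexing step could consume either format without knowing where the record came from. A grammar-based parser, as used in the slides, would additionally report records that do not match the expected structure.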
  • Divide and Conquer the Indexing
    • Uniprot (>4M entries): 2 files, ~9.4G
    • Embl (>83M entries): >600 files, ~375G
    • Medline (>16M entries): >500 files, ~57G
    • Taxonomy (>0.37M entries): 1 file, ~81M
    • GO (>0.23M entries): 1 file, ~27M
    • Others (ArrayExpress, Ensembl, Intact, …)
    • Input types shown: XML, XML dump, Db
    • Indexing hardware shown: 4 × 8 cpu
    • One index per domain: Uniprot Index, Embl Index, Taxonomy Index, Medline Index, GO Index, ArrayExpress Index, Ensembl Index, Intact Index
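The slide does not say how the files of a large domain are shared out among the cpus. As one possible illustration only, the sketch below assigns files to workers with a greedy largest-first rule, so that each worker ends up with a similar amount of data to index. The function name `partition_files`, the file names, and the sizes are hypothetical.

```python
import heapq


def partition_files(file_sizes, workers):
    """Assign each file, largest first, to the currently least-loaded worker.

    Returns {worker id: (total size, [file names])}.
    """
    # Heap entries are (load, worker id, files); worker ids are unique, so
    # ties on load are broken by id and the lists are never compared.
    heap = [(0, worker, []) for worker in range(workers)]
    heapq.heapify(heap)
    by_size = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for name, size in by_size:
        load, worker, files = heapq.heappop(heap)
        files.append(name)
        heapq.heappush(heap, (load + size, worker, files))
    return {worker: (load, files) for load, worker, files in heap}


if __name__ == "__main__":
    # Made-up file names and sizes (in MB), for illustration only.
    sizes = {
        "embl_001.dat": 700,
        "embl_002.dat": 650,
        "embl_003.dat": 600,
        "embl_004.dat": 300,
        "embl_005.dat": 250,
        "embl_006.dat": 100,
    }
    for worker, (load, files) in sorted(partition_files(sizes, 2).items()):
        print(worker, load, files)
```

Each worker could then parse and index its own share independently. Whether the partial results are merged afterwards or kept as separate indexes is not stated in the slides.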
  • Libraries
    • Indexing
      • Lucene ( http://lucene.apache.org )
      • ANTLR ( http://www.antlr.org/ )
      • ANTXR ( http://javadude.com/tools/antxr/index.html )
      • JGroups ( http://www.jgroups.org )
    • Web
      • Tomcat ( http://tomcat.apache.org/ )
      • Spring Framework ( http://www.springframework.org )
  • Acknowledgements
    • Rodrigo Lopez
    • Janet Thornton and Graham Cameron
    • EMBL, EBI Industry Support Programme, European Patent Office and the EU.
    • All the data providers
    • Our colleagues in the External Services group
    • The System Group