The new EBI search engine: EB-eye Backend. An overview of what is under the hood
Industry Workshop, 21-22 May 2007
Franck Valentin, External Services group
Summary
What data is available?
> 20 domains, > 137M entries, > 550 GB of data
What data is available: formats
The domains come either as XML documents (<XML> . . . </XML>) or as flat files made of tagged lines, for example "ID : .. PARENT ID : .. RANK : .. ..." or "ID ... AC ... DT ...".
What data is available: sizes
[Figure: size of each domain. Values shown: 43M; 4.2G; 57 GB in >500 files; 1G; 8.4G; 374 GB in >600 files; 6.3G; 25K; 81M]
Points to take into consideration
Parsing and indexing different formats

Each source format has its own grammar. The parser generated from that grammar extracts the fields of every entry and hands them to the Indexer, which uses the Lucene API to build one index per domain (UniProt Index, EMBL Index, Taxonomy Index, ...).

Flat files, Parser (ANTLR): EMBL grammar, Taxonomy grammar, UniProt grammar, ...
XML files and database (Db) dump files, Parser (ANTXR): Medline grammar, InterPro grammar, Dump file grammar, ...

Flat file example (EMBL).
Fields extracted: ID, AC, Creation date / Modification date, Description, Organism species, Organism classes, References.

    ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
    XX
    AC   AF030562;
    XX
    DT   04-DEC-1997 (Rel. 53, Created)
    DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
    XX
    DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
    DE   OPW-03, sequence tagged site.
    XX
    KW   STS.
    XX
    OS   Fusarium venenatum
    OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
    OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
    XX
    RN   [1]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   "Species-specific primers resolve members of the section Fusarium.
    RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
    RL   Fungal Genet. Biol. 0:0-0(1997).
    XX
    RN   [2]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   ;
    RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
    RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
    RL   USA
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..852
    FT                   /organism="Fusarium venenatum"
    FT                   /strain="ATCC20334"
    FT                   /db_xref="taxon:56646"
    . . .

XML file example (Medline).

    <MedlineCitationSet>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID>10997935</PMID>
        <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated>
        …

XML file example (Medline), annotated.
Fields extracted: ID, Creation Date, Modification Date, issn, volume, name.

    <MedlineCitationSet>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID>14216186</PMID>
        <DateCreated> <Year>1965</Year> <Month>02</Month> <Day>01</Day> </DateCreated>
        <DateCompleted> <Year>1996</Year> <Month>12</Month> <Day>01</Day> </DateCompleted>
        <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised>
        <Article PubModel="Print">
          <Journal>
            <ISSN IssnType="Print">0009-8981</ISSN>
            <JournalIssue CitedMedium="Print">
              <Volume>10</Volume>
              <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate>
            </JournalIssue>
            <Title>Clinica chimica acta; international journal of clinical chemistry</Title>
            <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation>
          </Journal>
          . . .

Dump file (XML) example (IntAct).

    <database>
      <name>IntAct.Experiment</name>
      <description>Experimental procedures that allowed to…</description>
      <release>1.0</release>
      <release_date>2007-Feb-16</release_date>
      <entry_count>5697</entry_count>
      <entries>
        <entry id="EBI-77680">
        …
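The field extraction shown for the EMBL flat file can be sketched in a few lines of Python. This is only an illustration: the talk describes grammar-generated ANTLR parsers, whereas this sketch swaps in a hand-written line splitter, and the function name `parse_embl` and the output keys are assumptions made here, not names from the EB-eye code.

```python
import re

# The first lines of the EMBL entry shown on the slide.
RECORD = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DT   04-DEC-1997 (Rel. 53, Created)
DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
"""


def parse_embl(record):
    """Group the lines of one entry by line code, then pick out a few fields.

    Hypothetical helper: a hand-written stand-in for a generated parser.
    """
    lines_by_code = {}
    for line in record.splitlines():
        code, _, value = line.partition(" ")
        if code == "XX" or not value.strip():
            continue  # XX lines only separate blocks
        lines_by_code.setdefault(code, []).append(value.strip())

    # Each DT line starts with a DD-MON-YYYY date; the first is the creation
    # date and the last is the latest update, as the line comments state.
    dates = [re.match(r"\d{2}-[A-Z]{3}-\d{4}", d).group() for d in lines_by_code["DT"]]
    return {
        "id": lines_by_code["ID"][0].split(";")[0],
        "accession": lines_by_code["AC"][0].rstrip(";"),
        "creation_date": dates[0],
        "modification_date": dates[-1],
        "description": " ".join(lines_by_code["DE"]),
        "organism": lines_by_code["OS"][0],
    }


fields = parse_embl(RECORD)
print(fields["id"], fields["creation_date"], fields["modification_date"])
```

A dictionary of named fields like this is the kind of structure an indexer can turn into one searchable document per entry.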
Divide and Conquer the Indexing

    Domain      Entries    Source files
    UniProt     > 4M       2 files, ~ 9.4G
    EMBL        > 83M      > 600 files, ~ 375G
    Medline     > 16M      > 500 files, ~ 57G
    Taxonomy    > 0.37M    1 file, ~ 81M
    GO          > 0.23M    1 file, ~ 27M
    Others (ArrayExpress, Ensembl, IntAct, …): XML files and XML dumps from databases (Db)

The source files are spread over four machines with 8 CPUs each. The output is one index per domain: EMBL Index, UniProt Index, Taxonomy Index, Medline Index, ArrayExpress Index, Ensembl Index, IntAct Index.
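The divide-and-conquer idea (index each source file independently, then combine the results into one index per domain) could look roughly like the sketch below. It is not the EB-eye implementation: a plain term-to-ids dictionary stands in for a Lucene index, in-memory lists stand in for files, and the names `index_chunk`, `merge_indexes` and `index_domain` are made up for illustration. A thread pool keeps the sketch portable; for CPU-bound parsing, `ProcessPoolExecutor` offers the same `map` interface.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(entries):
    """Build a partial inverted index (term -> set of entry ids) for one chunk."""
    partial = defaultdict(set)
    for entry_id, text in entries:
        for term in text.lower().split():
            partial[term].add(entry_id)
    return partial


def merge_indexes(partials):
    """Combine the partial indexes into a single index for the domain."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged


def index_domain(chunks, workers=8):
    """Index every chunk independently (the divide step), then merge (the conquer step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return merge_indexes(partials)


# Toy data standing in for the >600 EMBL files. "ENTRY-2" is a made-up id.
embl_chunks = [
    [("AF030562", "Fusarium venenatum clone VEN-A")],
    [("ENTRY-2", "Fusarium genomic DNA")],
]
embl_index = index_domain(embl_chunks)
print(sorted(embl_index["fusarium"]))
```

Because no chunk depends on another, the number of workers can follow the number of available CPUs without changing the result.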
Let’s put some figures on it
Less than 18 hours to index all the EBI data
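As a rough back-of-the-envelope check, the totals quoted earlier in the talk (more than 137M entries, more than 550 gigabytes) can be combined with the 18-hour figure. Since the sizes are lower bounds and the duration is an upper bound, the results are lower bounds on average throughput, not measured rates.

```python
# Figures quoted in the talk, treated as bounds.
total_gb = 550               # more than 550 GB of data
total_entries = 137_000_000  # more than 137M entries
hours = 18                   # less than 18 hours

gb_per_hour = total_gb / hours
entries_per_second = total_entries / (hours * 3600)

print(f"at least {gb_per_hour:.1f} GB/hour")               # at least 30.6 GB/hour
print(f"at least {entries_per_second:.0f} entries/second")  # at least 2114 entries/second
```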
Web side story
A load balancer spreads incoming requests over four Tomcat servers (Tomcat 1 to Tomcat 4), which serve searches from the domain indexes: UniProt Index, EMBL Index, Taxonomy Index, Medline Index, ArrayExpress Index, Ensembl Index, IntAct Index.
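The serving layout can be illustrated with a small sketch. Two things are assumed here rather than taken from the talk: the balancing policy (round-robin) and the idea that every backend looks a term up in each domain index and reports hits per domain. The index contents are toy data built from identifiers that appear in the sample records.

```python
from itertools import cycle

# Toy per-domain indexes (term -> entry ids), standing in for the Lucene indexes.
INDEXES = {
    "EMBL": {"fusarium": ["AF030562"]},
    "Taxonomy": {"fusarium": ["56646"]},
    "IntAct": {"experiment": ["EBI-77680"]},
}


def search_all(term):
    """Return matching entry ids per domain, skipping domains with no hit."""
    return {domain: index[term] for domain, index in INDEXES.items() if term in index}


class RoundRobinBalancer:
    """Hands each request to the next backend in turn (assumed policy)."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)

    def handle(self, term):
        backend = next(self._next_backend)
        return backend, search_all(term)


balancer = RoundRobinBalancer(["Tomcat 1", "Tomcat 2", "Tomcat 3", "Tomcat 4"])
served_by = [balancer.handle("fusarium")[0] for _ in range(5)]
print(served_by)
print(search_all("fusarium"))
```

With four backends, the fifth request wraps around to the first server again.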
Being up to date
Libraries
Acknowledgements
