Bots & spiders

Bioinformatics II - Bots & spiders

Published in: Education, Technology
Transcript of "Bots & spiders"

  1. Bots & spiders. Bio-informatica II, 19/04/2012. Maté Ongenaert, Center for Medical Genetics, Ghent University Hospital, Belgium
  2. Part 1: Bots & spiders (background). Part 2: Real-life case studies (the use of bots and spiders in bio-informatics)
  3. About the presenter
     - Bio-engineer, cell and gene biotechnology (2005)
       - Master thesis: identification of cancer-specifically methylated genes
     - PhD in applied biological sciences: cell and gene biotechnology (2009)
       - PhD thesis: cellular reprogramming
     - Industrial experience
       - Research scientist (methylation biomarkers)
     - Currently: postdoc at CMGG
       - Prognostic methylation biomarkers in neuroblastoma
  4. Part 1: Bots & spiders: background
  5. Overview
     - Bots and spiders
       - Introduction
       - Bots
       - Spiders
       - The Google case
     - Bots/spiders and bio-informatics
       - Automated querying
       - APIs
       - NCBI E-Utils (PubMed/GenBank)
       - Ensembl
  6. Bots and spiders
     - The web history
       - In 1989, while working at CERN, Tim Berners-Lee invented a network-based implementation of the hypertext concept
       - Since then, information can be retrieved by following links instead of having to know its exact location beforehand
       - Information is not at a single location; it is dynamic and spread across machines
  7. Bots and spiders: Bots
     - Webbots
       - Also called web robots, WWW robots or bots: software applications that run automated tasks over the Internet
     - Bots perform tasks that are:
       - Simple
       - Structurally repetitive
       - Carried out at a much higher rate than would be possible for a human
       - Example: an automated script fetches, analyses and files information from web servers at many times the speed of a human
     - Other uses:
       - Chatbots / IM / Skype / Wiki bots
       - Malicious bots and bot networks (zombies)
  8. Bots and spiders: Bots
     - A spam bot, called the 'Zunker Bot'
       - Is installed on unpatched Windows machines
       - Controls the clients through a neat application
       - Can install additional software and execute commands
  9. Bots and spiders: Spiders
     - Webspiders
       - Webspiders (crawlers) are programs or automated scripts that browse the World Wide Web in a methodical, automated manner. A spider is one type of bot
     - The spider starts with a list of URLs to visit, called the seeds
       - As the crawler visits these URLs, it identifies all the hyperlinks in the page
       - It adds them to the list of URLs to visit, called the crawl frontier
       - URLs from the frontier are recursively visited according to a set of policies
       - This process is called web crawling: in most cases a means of collecting up-to-date data
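The seed/frontier loop described on slide 9 can be sketched as follows. This is a minimal illustration in Python rather than the Perl used in the course demos, and it makes several assumptions: the visiting policy is plain breadth-first with a page limit, and the names `crawl`, `LinkExtractor` and the three-page `PAGES` site are hypothetical. Pages are served from memory so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is a caller-supplied function mapping a URL to HTML text,
    or to None when the page is unavailable.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # never queue the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical three-page site held in memory, so the sketch runs offline.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

order = crawl(["http://example.org/"], PAGES.get)
```

A real crawler would replace `PAGES.get` with an HTTP fetch and would add policies the slide only hints at, such as politeness delays and revisit rules.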
  10. Bots and spiders: Spiders
  11. Bots and spiders: Spiders
      - Use of webcrawlers:
        - Mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches
        - Automating maintenance tasks on a website, such as checking links or validating HTML code
        - Can be used to gather specific types of information from web pages, such as harvesting e-mail addresses
      - The most commonly used crawler is probably the GoogleBot crawler
        - Crawls
        - Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)
        - Serves results: PageRank technology
  12. Bots and spiders: PageRank
  13. Bots and spiders: PageRank
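The PageRank slides were images and their content is not in the transcript. As a rough illustration only, the sketch below uses the commonly taught power-iteration formulation with a damping factor of 0.85. The function name `pagerank`, the damping value and the three-page `GRAPH` are assumptions, not something taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page: spread its rank evenly over all pages
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
```

In this toy graph page C collects links from both A and B, so it ends up with the highest score, while B, which is linked only from A, ends up lowest.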
  14. Bots and spiders: Google
      - Hardware
        - Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
        - 2009 estimate: 450 000 servers, with an electricity cost of 2 million $/month
      - Software
        - Web server (not Apache-based)
        - Storage (Google File System / BigTable): distributed storage, mostly in memory
        - Borg: job scheduling and monitoring
        - Indexing services: Caffeine / Percolator
        - MapReduce: a cluster system that splits complex problems and sends 'jobs' to worker nodes (Map); the answers are gathered and combined to solve the original question (Reduce)
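The Map and Reduce steps described on slide 14 can be illustrated with a single-process word count. This is only a sketch of the idea, not of Google's implementation: the function names are hypothetical, and a real cluster would run each map call on a separate worker node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one document into (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the partial answers for one key."""
    return key, sum(values)


def map_reduce(documents):
    pairs = []
    for doc in documents:   # on a cluster, each document would go to a worker
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["bots and spiders", "spiders crawl", "bots query"])
```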
  15. Overview
      - Bots and spiders
        - Introduction
        - Bots
        - Spiders
        - The Google case
      - Bots/spiders and bio-informatics
        - Automated querying
        - APIs
        - NCBI E-Utils (PubMed/GenBank)
        - Ensembl
  16. Bots and spiders: Bots/spiders and bio-informatics
      - Automated querying
        - Collecting information nowadays requires the ability to query data sources automatically (databases, websites, Google, Ensembl or NCBI databases)
        - A query in web terms: GET / POST
        - Web queries using Perl: the LWP library
      - LWP: a set of Perl modules that provides a simple and consistent application programming interface (API) to the World-Wide Web
        - Free LWP e-book: http://lwp.interglacial.com/
      - LWP for newbies
        - LWP::Simple (demo1)
        - Go to a URL, fetch the data, ready to parse
        - Watch out: HTML tags and regular expressions
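The demos themselves are not part of the transcript. The sketch below illustrates the GET / POST distinction and the "fetch, then parse" step in Python's standard library instead of Perl's LWP. The helper names and the `example.org` URL are hypothetical, and nothing is sent over the network: the requests are only constructed.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request


def build_get(base_url, params):
    """GET: the query parameters travel in the URL itself."""
    return Request(base_url + "?" + urlencode(params))


def build_post(base_url, params):
    """POST: the same parameters travel in the request body."""
    return Request(base_url, data=urlencode(params).encode("ascii"))


def extract_title(html):
    """Pull the <title> text out of a page with a regular expression.

    Regular expressions are fragile on real-world HTML; a proper HTML
    parser is safer for anything beyond a quick extraction like this.
    """
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


get_req = build_get("http://example.org/search", {"q": "TP53 methylation"})
post_req = build_post("http://example.org/search", {"q": "TP53 methylation"})
# To actually send a request: urllib.request.urlopen(get_req).read()
```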
  17. Bots and spiders: Bots/spiders and bio-informatics
      - Some more advanced features
        - LWP::UserAgent (demo2: show server access logs)
        - Fill in forms and parse the results
        - Depending on the content: follow hyperlinks to other pages and parse these again, and so on
        - Mechanize package: follow links, fill in forms, and so on
      - Bioinformatics examples
        - Use genome browser data (demo3) and sequences
        - Get gene aliases and symbols from GeneCards (demo4)
  18. Bots and spiders: Bots/spiders and bio-informatics
      - Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)?
        - Google allows automated queries: 1000 queries per day per account
        - Google uses snippets: the short pieces of text you get in the main search results
        - These are the result of its indexing and parsing algorithms
        - Demo5: LWP and Google APIs combined, and parsing the results
      - API: Application Programming Interface
        - Hides complexity by sharing 'libraries' with functions that can be applied within another programming language
        - Bridges programming languages and crosses abstraction layers
        - Examples: displaying on a screen; printing; querying Google or NCBI from within a programming language
  19. Bots and spiders: Bots/spiders and bio-informatics: APIs
      - The Google example used the Google API
      - NCBI API
        - The NCBI Web service is a web program that enables developers to access the Entrez Utilities via the Simple Object Access Protocol (SOAP)
        - Programmers may write software applications that access the E-Utilities using any SOAP development tool
        - Main tools (demo6):
          - E-Search: searches and retrieves primary IDs and term translations, and optionally retains the results for future use in the user's environment
          - E-Fetch: retrieves records in the requested format from a list of one or more primary IDs
      - Ensembl API (demo7)
        - Uses 'Slices' and adaptors
        - You have to know the 'application' or database (Compare/Core/…)
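The E-Search followed by E-Fetch pattern from slide 19 can be sketched as below. Note the assumptions: the slide describes SOAP access, whereas this sketch uses the URL-based E-utilities endpoints `esearch.fcgi` and `efetch.fcgi`; the helper names are hypothetical; and `SAMPLE_ESEARCH` is a trimmed, made-up response with placeholder IDs so the parsing step can run offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an E-Search URL that returns the primary IDs matching `term`."""
    return EUTILS + "esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})


def efetch_url(ids, db="pubmed", retmode="xml"):
    """Build an E-Fetch URL that retrieves the records for a list of IDs."""
    return EUTILS + "efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "retmode": retmode}
    )


def parse_esearch(xml_text):
    """Extract the list of primary IDs from an E-Search XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./IdList/Id")]


# Made-up, trimmed response with placeholder IDs (not real PubMed records).
SAMPLE_ESEARCH = """<eSearchResult><Count>2</Count><RetMax>2</RetMax>
<IdList><Id>11111111</Id><Id>22222222</Id></IdList></eSearchResult>"""

ids = parse_esearch(SAMPLE_ESEARCH)
search_url = esearch_url("neuroblastoma methylation")
fetch_url = efetch_url(ids)
```

In a live script the text passed to `parse_esearch` would come from fetching `search_url`, and the records behind `fetch_url` would then be downloaded and parsed in a second step.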
  20. Bots and spiders: Bots/spiders and bio-informatics: APIs
      - NCBI API
      - A frequently used NCBI database is PubMed
        - PubMed can be queried using E-Utils
        - Uses the same syntax as the regular PubMed website
        - Returns the data in the same formats as the website (XML, plain text)
        - Parse the XML results and apply more advanced text-mining techniques
        - Demo8
        - Parse the results and present them in an interface
          - Methylated genes in cancer: http://matrix.ugent.be/mate/methylome/result1.html
          - miRNAs in cancer: http://matrix.ugent.be/mate/textmining/preprocess/
  21. Part 2: Real-life case studies: the use of bots and spiders in bio-informatics
  22. Bots and spiders: TextMining
      - Create and translate the query
        - User query -> query suited for PubMed
      - The query is executed and results are returned
        - Result formats: XML, TXT, MedLine, ASN, …
        - Human-readable vs. parsable (XML parsers)
      - Parse the results
        - Extract information: authors, title, abstract
        - Store the results
      - Analyse the results
        - Identify gene names, keywords, GO terms, … -> score
        - Semantic analysis / NLP processing / …
      - Visualise the results
        - Highlighting, hierarchy, filters, searches, graphics
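The "parse results" and "analyse results" steps of slide 22 can be sketched as follows, assuming the results arrive as PubMed XML. The record in `SAMPLE_XML` is made up (placeholder PMID, gene names and author), the function names are hypothetical, and the score is simply a count of whole-word matches, which is far cruder than the semantic analysis the slide mentions.

```python
import re
import xml.etree.ElementTree as ET

# Made-up record following the PubMed XML element names; not a real article.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <ArticleTitle>Methylation of GENE1 in a made-up tumour</ArticleTitle>
        <Abstract>
          <AbstractText>GENE1 and GENE2 were studied. GENE1 was methylated.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Extract PMID, title, abstract and author names from PubMed XML."""
    records = []
    for art in ET.fromstring(xml_text).iter("PubmedArticle"):
        records.append({
            "pmid": art.findtext(".//PMID"),
            "title": art.findtext(".//ArticleTitle") or "",
            "abstract": " ".join(t.text or "" for t in art.iter("AbstractText")),
            "authors": [a.findtext("LastName") for a in art.iter("Author")],
        })
    return records


def score(record, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    text = record["title"] + " " + record["abstract"]
    return {
        t: len(re.findall(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE))
        for t in terms
    }


records = parse_articles(SAMPLE_XML)
scores = score(records[0], ["GENE1", "GENE2"])
```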
  23. Bots and spiders: TextMining
  24. Bots and spiders: TextMining
  25. Bots and spiders: TextMining
  26. Bots and spiders: TextMining
      - Demonstration: GoldMine
      - Web application
      - Translate the query: find aliases for genes or miRNAs and incorporate them in the search
      - Query NCBI PubMed using E-fetch
      - Get the results and process them
      - Count
      - Highlight
      - Rank
      - Visualization
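How GoldMine implements these steps is not shown in the transcript. The sketch below only illustrates the general idea of alias expansion, counting, highlighting and ranking. All function names, the `**` highlight markers and the example gene names are hypothetical; `[Title/Abstract]` is the PubMed field tag assumed for the search.

```python
import re


def expand_query(symbol, aliases, context=None):
    """Combine a gene symbol and its aliases into one PubMed-style OR query."""
    names = [symbol] + list(aliases)
    query = "(" + " OR ".join('"%s"[Title/Abstract]' % n for n in names) + ")"
    if context:
        query += " AND " + context
    return query


def _term_pattern(terms):
    """Whole-word, case-insensitive pattern matching any of the terms."""
    return re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )


def count_hits(text, terms):
    return len(_term_pattern(terms).findall(text))


def highlight(text, terms):
    """Wrap every matched term in ** markers."""
    return _term_pattern(terms).sub(lambda m: "**" + m.group(0) + "**", text)


def rank_abstracts(abstracts, terms):
    """Order abstracts by number of term hits, most hits first."""
    return sorted(abstracts, key=lambda text: count_hits(text, terms), reverse=True)


query = expand_query("GENE1", ["ALIAS1"], context="cancer")
marked = highlight("GENE1 binds alias1.", ["GENE1", "ALIAS1"])
ranked = rank_abstracts(
    ["No match here.", "GENE1 and ALIAS1 together.", "Only GENE1."],
    ["GENE1", "ALIAS1"],
)
```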
  27. Bots and spiders: Data analysis
      - NCBI GEO: Gene Expression Omnibus
      - Raw expression data on the FTP server
      - Annotation: can be queried using NCBI E-Utils
      - Annotation: in Excel files on the FTP server
      - For specific experimental conditions, get all raw data and annotations and perform an automated analysis
      - Create a scheme of how you would proceed. Biological question: superficial vs. infiltrating bladder cancer
  28. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
      - Find experiments on GEO
      - Annotation of samples: up to the submitters
      - 'Uniform' sample sheet available (Matrix file)
      - Current update of GEO: view 'factors' in a graphical overview
  29. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
  30. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
      - Use this to couple sample annotation features (stage, age, risk, sex) to the unique sample ID (GSMxxxxxxx)
      - Get the raw data for each sample in the dataset
      - Either txt files (uniform) or raw data files (such as Affy CEL files)
      - Depends on the platform used: GPLxxxx
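Coupling annotation features to sample IDs, as described on slide 30, can be sketched by reading the header of a GEO series matrix file. The sketch assumes the tab-separated `!Sample_geo_accession` and `!Sample_characteristics_ch1` lines of that format, with characteristics written as `key: value`. The function name and the `SAMPLE_MATRIX` content (placeholder GSM IDs and annotations) are made up for illustration.

```python
import csv
import io


def parse_sample_annotation(matrix_text):
    """Map each GSM accession to a dict of its 'key: value' characteristics."""
    rows = [r for r in csv.reader(io.StringIO(matrix_text), delimiter="\t") if r]
    accessions = next(r[1:] for r in rows if r[0] == "!Sample_geo_accession")
    samples = {gsm: {} for gsm in accessions}
    for row in rows:
        if row[0] != "!Sample_characteristics_ch1":
            continue
        # one column per sample, in the same order as the accession line
        for gsm, cell in zip(accessions, row[1:]):
            key, sep, value = cell.partition(":")
            if sep:   # skip free-text cells without a 'key: value' shape
                samples[gsm][key.strip()] = value.strip()
    return samples


# Made-up header lines with placeholder sample IDs.
SAMPLE_MATRIX = (
    '!Series_title\t"made-up bladder cancer series"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"stage: Ta"\t"stage: T2"\n'
    '!Sample_characteristics_ch1\t"sex: male"\t"sex: female"\n'
)

annotation = parse_sample_annotation(SAMPLE_MATRIX)
```

Because annotation is up to the submitters, real files will not always follow the `key: value` shape, so a production script would need per-series handling.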
  31. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
      - Platform / data files / samples / sample annotation relationship
      - Set up a standardised analysis strategy
      - Make use of sample annotations
      - Combine studies or keep them separate?
      - Normalisation
      - RankProd analysis
  32. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
      - data.justrma <- just.rma("GSM90305.CEL", "…   # SAMPLES
      - expression <- exprs(data.justrma)   # NORMALISATION
      - results[,2:103] <- expression
      - library(hgu95av2.db)   # PLATFORM
      - cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   # ANNOTATION
      - RP.out.stage <- RP(results[,3:104], cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = TRUE, rand = 123)   # ANALYSIS STRATEGY
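The `RP()` call above comes from the RankProd R package. As a rough illustration of the statistic behind it, the sketch below computes a simplified rank product in Python: each gene is ranked within every comparison by its log fold change, and the ranks are combined with a geometric mean. This is an assumption-laden simplification: the function names are hypothetical, ties are not handled, and the permutation step that `num.perm` controls in `RP()` is left out entirely.

```python
import math


def ranks_descending(values):
    """Rank 1 goes to the largest value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_product(fold_changes):
    """Geometric mean of each gene's rank across comparisons.

    fold_changes[k][g] is the log fold change of gene g in comparison k.
    A small rank product means the gene is consistently near the top.
    """
    per_comparison = [ranks_descending(fc) for fc in fold_changes]
    k = len(per_comparison)
    n_genes = len(fold_changes[0])
    return [
        math.exp(sum(math.log(r[g]) for r in per_comparison) / k)
        for g in range(n_genes)
    ]


# Made-up log fold changes for three genes in two comparisons.
rp = rank_product([[2.0, 0.5, -1.0], [0.1, 1.0, -0.5]])
```

Here the first two genes swap ranks 1 and 2 between the comparisons, so both get a rank product of about 1.41, while the third gene is ranked last in both and gets 3.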
  33. Bots and spiders: Case study: superficial vs. infiltrating bladder cancer
      - Combine results across studies
      - Biological question vs. data analysis
      - Scoring scheme, prioritisation
      - Superficial vs. infiltrating
      - Metastasis vs. primary cancer
      - High stage vs. low stage
      - Normal vs. cancer
  34. Bots and spiders: OncoMine
  35. Bots and spiders: Integrated analysis
      - Column labels, in original order: Rank, Meth Pca, Lit, Meth other, Expression Pca, Progression, Rank 1 2 3 4 5 6 7 8, EXPRESSION, RE-EXP, CpG, Pc
      - Rows (rank, then the values in original order; empty cells were lost in extraction, so the column positions cannot be recovered):
        - 1: 1 x 0,95 1 0,993 0,997 0,84 1
        - 2: 0,998 0,995 1 0,958 0,091 0,994
        - 3: 1 x x x 1 0,993 1 0,996 0,312
        - 4: 1 x x x 0,995 0,767 0,96 1 0,931 0,998 0,635
        - 5: 1 x 0,997 0,968 1 1 0,364 0,746 0,199
        - 6: x 0,711 0,948 0,994 0,559 0,991 0,993
        - 7: 0,998 0,993 0,83 0,936 0,996
        - 8: 0,997 0,99 0,998 0,759 0,726 0,575
        - 9: 1 x x 0,886 0,995 0,997 1 0,7
        - 10: 1 0,998 0,409 0,99 0,88 0,998 0,779
        - 11: 1 x x 0,995 0,999 0,995 0,687
        - 12: 1 x x 0,997 0,999 0,999 0,257
        - 13: 1 x x x 0,799 0,996 0,969 0,994 0,848 0,981 0,887
        - 14: 1 x x 0,916 0,568 0,99 0,993 0,994 0,988 0,558
        - 15: 0,986 0,995 0,956 0,983 0,998
        - 16: 1 x 0,157 1 0,925 0,989 0,984 0,993
  36. Acknowledgments
      - CMGG
        - Anneleen Decock
        - Frank Speleman
        - Jo Vandesompele
      - BioBix
        - Leander Van Neste
        - Tim De Meyer
        - Gerben Mensschaert
        - Geert Trooskens
        - Wim Van Criekinge