• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Sequence Matrix: Gene concatenation made easy
 

Sequence Matrix: Gene concatenation made easy

on

  • 4,754 views

Creating large datasets by concatenating genes can be challenging. This tool hopes to make that process much, much easier. ...

Creating large datasets by concatenating genes can be challenging. This tool hopes to make that process much, much easier.

For more information, see http://code.google.com/p/sequencematrix/ or http://www3.interscience.wiley.com/journal/123577052/abstract

Statistics

Views

Total Views
4,754
Views on SlideShare
4,754
Embed Views
0

Actions

Likes
0
Downloads
29
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Sequence Matrix: Gene concatenation made easy Sequence Matrix: Gene concatenation made easy Presentation Transcript

    • Sequence Matrix Gene concatenation made easy Gaurav Vaidya1, David Lohman2, Rudolf Meier2 1: NeatCo Asia, Singapore. 2: Department of Biological Sciences, National University of Singapore, Singapore.
    • Our goals ✤ Many powerful tools exist for concatenating sequences. ✤ Adding new sequences to an existing dataset is tedious and time consuming. ✤ Our initial goal: simple, user-friendly program for concatenating sequences. ✤ We also added a few tools to help you look for lab contamination in your dataset.
    • Sequence Matrix ✤ Written in Java. ✤ Graphical user interface libraries. ✤ Works on different operating systems. ✤ Easy to install: download and run the batch file.
    • Importing sequences ✤ You can use the sequence names as entered in the input file. ✤ Or you can ask Sequence Matrix to try to identify the species names.
    • Importing sequences ✤ Sequences mode: ✤ Species name ✤ gi|237510679|gb|AY556753.2|Daubentonia ✤ Daubentonia madagascariensis madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence ✤ gi|237510678|gb|AY556735.2|Macaca ✤ Macaca sylvanus sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
    • Importing sequences ✤ A common source of error is forgetting to recode leading and trailing gaps as missing information. ✤ Sequence Matrix can automatically replace such gaps with question marks.
    • Importing sequences: Naming ✤ Sequences from one dataset are matched up to another dataset by sequence name. ✤ Errors in sequence naming need to be fixed. ✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
    • Export: Taxonsets ✤ By default, we generate taxonsets on the basis of: ✤ Combined length. ✤ Number of character sets ✤ Information for a particular gene.
    • Gene trees ✤ Two ways to do them: ✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa. ✤ Export the entire dataset with one file per column.
    • Export features ✤ You can also export the Sequence Matrix table as an Excel-readable text file. ✤ Supervisory mode. ✤ Keep track of a project as it grows.
    • Character sets ✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands. ✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
    • Excision ✤ Individual sequences can be excised from the dataset. ✤ Excised sequences will not be exported. ✤ Sequence Matrix will warn you about that.
    • Contamination ✤ You thought you were sequencing Gorilla gorilla ✤ but you were really sequencing Homo sapiens. ✤ We have two tools you can use: ✤ If Homo sapiens is in your dataset. ✤ If Homo sapiens is not in your dataset (experimental!).
    • H. sapiens in dataset ✤ Looks for pairs of sequences whose pairwise distance is very low. ✤ Expected difference depends on gene: ✤ 28S doesn’t change very much, but ✤ COI changes very quickly. ✤ Some interpretation is required.
    • H. sapiens not present ✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances. ✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”. ✤ Colour sequences by their individual pairwise distances to the reference taxon.
    • H. sapiens not present ✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon. ✤ Look for colour variation which is unusual or out of place. ✤ We would expect sequences from different species to be correlated together.
    • Pairwise distance mode ✤ You need to vary: ✤ The gene you are studying. ✤ The reference taxon being compared against. ✤ Possibly helpful as an alert mechanism.
    • Summary ✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets. ✤ Taxonsets allow you to analyse subsets of your data in downstream programs. ✤ Excising sequences gives you greater control over which sequences to analyse. ✤ You can look for contamination in two ways: ✤ Looking for very low pairwise distances across your entire dataset. ✤ Looking for unusual pairwise distances in Pairwise Distance Mode.
    • Acknowledgements ✤ Rudolf Meier ✤ Zhang Guanyang ✤ Farhan Ali ✤ David Lohman ✤ Everybody at the NUS DBS Evolutionary Biology lab.
    • Question time!