Introduction to Gene Mining
Part A: BLASTn-off!
After Part A you will demonstrate your ability to:
• Use the NCBI Gene and BLASTn bioinformatics tools to search for a human gene of interest in a plant model.
• Evaluate the significance of your search results to see how similar human and plant genes might be.
1
The Arabidopsis Information Portal is funded by a grant from
the National Science Foundation (#DBI-1262414)
and co-funded by a grant from the Biotechnology and
Biological Sciences Research Council (BB/L027151/1).
These lessons were developed during the summer of 2015 as
education outreach for the www.Araport.org portal in
conjunction with the J. Craig Venter Institute, Rockville, MD,
20850, USA.
Contact information
General information: araport@jcvi.org
Jason Miller, Grant Co-Principal Investigator, JCVI
jmiller@jcvi.org
This lesson was prepared by Andrea Cobb, Ph.D.
(adcobb@fcps.edu)
with the help of Margot Goldberg
(mgoldberg1@pghboe.net)
The images below are all examples of….?
3
What science models do you recall?
Lipid bilayer model
Lock and key model of enzymes
Stickleback model of evolution
Computer models
Experimental model of osmosis
4
Why use models instead of the “real thing”?
To simplify a complex system
Example: Study an enzyme reaction in a test tube rather than in the whole organism, which contains many enzymes.
To better manipulate and measure an effect
Example: Treat Drosophila with drug X and measure the drug’s effect on Drosophila life span.
To predict (test the model)
Example: Use a computer model to find protein-coding regions in the DNA of a newly sequenced genome.
Other ideas?
5
Thanks for volunteering for our study. Your chart
says you have problems eating, facial weakness
and overall poor muscle tone. Looks like your
mother had the same symptoms.
Your diagnosis is nemaline myopathy. I am sad to
tell you that no known treatment exists, but my
researchers and I are working hard to find a
treatment.
You can find information on this genetic disorder
on a website called Online Mendelian Inheritance
in Man (OMIM): http://www.OMIM.org
The OMIM database shows that you might have
a mutation in your Actin alpha 1 gene.
We won’t experiment on you! It is much faster,
kinder and less expensive to use a plant model.
Thanks for your help, Doctor!
https://www.youtube.com/watch?v=foHiKrlY9Qc explains why
scientists use a certain plant for a model
7
Which plant will you use to
study a version of my actin alpha 1
(ACTA1) gene?
https://www.arabidopsis.org/portals/education/aboutarabidopsis.jsp
8
Can plants really be used as models
for studying human diseases?
9
Xiang Ming Xu and Simon Geir Møller, Current Opinion in
Biotechnology, 2011, 22, 300-307.
10
• http://www.bbc.co.uk/programmes/p00lx6cl
• https://www.youtube.com/watch?v=eDA8rmUP5ZM
http://aboutlifting.com/music-helps-plants-grow-and-will-help-muscles-grow/
11
Before we find out whether plants have
human muscle genes, it would be
important to know if plants move!
12
Why don’t you rest? I am going to search the OMIM
database to find out more about your possible gene
mutation.
Use your computer and go to:
http://www.OMIM.org
and find out more about nemaline
myopathy and the ACTA1 gene that
may be involved.
After you answer questions on
your handout, type in any human
disease that interests you and
examine the results.
• Use your computer to find:
http://www.OMIM.org and learn more about
nemaline myopathy and the ACTA1 gene that
may be involved.
• After you answer questions on your handout,
search for any human disease that interests
you and examine the information.
13
Use your textbook, open access textbooks, videos
and databases to begin to find information about
muscle genes and proteins.
https://www.boundless.com/biology/
14
Usually, a general search engine will give
you too many hits for the question below!
15
108 results
Even a broad scientific database may provide too many unrelated hits!
Why are there SO MANY results?
16
“BIG DATA”
Biologists are increasingly able to quickly generate enormous amounts of data, but
their data analysis may take weeks or even years. Data transfer protocols are not
interchangeable, data storage is expensive, and queries can crash!
https://en.wikipedia.org/wiki/List_of_RNAs
17
What scientific approach
finds better information?
• Bioinformatics is an interdisciplinary approach which uses computational, mathematical, and engineering methods to analyze and make discoveries from enormous data sets.
18
To address the problem of BIG DATA, scientists can share data and analysis with other scientists.
This speeds analysis and adds expertise.
Scientists can share their data in research-specific portals.
These research-specific portals usually have customized bioinformatics tools.
19
A few examples of how bioinformatics is used….
Use: Questions addressed
Basic research: How is DNA organized in chromosomes? Are genes related to other genes? Given sequence data, how do we find a gene? How are genes expressed in response to the environment?
Biomedicine: Will this drug work on this patient? Can we cure genetic diseases? Which genetic variations are associated with heart disease? Which pathogen proteins are best for vaccine development?
Microbiology: Can microbes remove pollution? Can microbes decrease the impact of climate change? Where did a disease originate?
Agriculture: Can drought-resistant plants be identified, bred or engineered? Can insect-resistant plants improve food supplies? Can more healthful food sources be developed?
20
Scientists are more likely to find useful information in
bioinformatics portals that support their particular research.
21
An example of increasingly more specific research-centered portals:
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/gene
Araport: https://www.araport.org/
FLOR-ID: http://www.phytosystems.ulg.ac.be/florid/
22
23
For our plant model to be useful for my
research, I must find a similar plant
version of the ACTA1 gene involved in
nemaline myopathy.
Since plants and animals both move, do
they use the same types of proteins to
move?
Do they have the same genes coding for
these proteins?
Begin your search on the NCBI portal to find
names of human muscle genes.
Go to http://www.ncbi.nlm.nih.gov/, enter the information shown, and use the pull-down menu to
select Gene. (Note: Araport.org and similar genome browsers will also allow you to search
for genes and proteins of interest.)
24
Could plant and animal versions of this gene
have a function in common?
25
Actin subunits self-assemble to form filaments which
have a role in cell structure.
Check the “Inner Life of the Cell” video.
https://www.youtube.com/watch?v=FzcTgrxMzZk
(2:20 until 3:15)
https://www.youtube.com/watch?v=VVgXDW_8O4U is a video showing polymerization of G-actin, a protein similar to Alpha Actin.
This is how your actin should work.
26
If it is reasonable that plants might have a gene similar to human ACTA1, you will need to find the ACTA1 gene sequence.
Click on FASTA to obtain the human ACTA1 gene sequence.
27
Copy, then paste the ACTA1 gene sequence to a new Word
document or clipboard—we will use this to look for an
Arabidopsis thaliana version of this gene. Save the Word
document as “human ACTA1 DNA sequence”.
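A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below shows one way the copied text could be split into a name and a single continuous sequence. The function name parse_fasta and the toy record are made up for illustration; the toy record is NOT the real ACTA1 sequence.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # sequence lines are joined together
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy record for illustration only (not the real ACTA1 sequence).
example = """>toy_sequence made-up example
ATGTGCGACG
aagacgag
"""

records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

Joining the wrapped lines first makes it easy to check the sequence length before pasting it into a search form.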
28
I want to search for a version of the human ACTA1
gene in Arabidopsis thaliana.
What bioinformatics tool
could I use?
29
30
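BLAST Types
BLASTn compares a DNA (nucleotide) query sequence to one or more DNA sequences
BLASTp compares a protein query sequence to one or more protein sequences
BLASTx translates a DNA query sequence in all 6 possible reading frames, then compares the translations to a protein sequence database
tBLASTx compares DNA sequences after translating both the query and the database sequences in all 6 reading frames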
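To make "6 possible reading frames" concrete, here is a minimal sketch that translates a DNA sequence in three frames on the forward strand and three on the reverse complement, using the standard genetic code. The function names and the short test sequence are illustrative assumptions; this is not how BLASTx itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons listed in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops are '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Made-up 9-nucleotide example sequence.
frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Only one of the six frames usually corresponds to the real protein, which is why a translated search has to try all of them.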
31
32
There are several ways to access NCBI BLAST. Start at the NCBI home page, then select BLAST:
http://www.ncbi.nlm.nih.gov/
Or just go directly to the BLAST page:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Select “nucleotide blast”.
If I have a known DNA sequence, how can I use BLASTn to look for an unknown similar sequence?
33
You found a human gene to compare…
Click on FASTA to obtain the human ACTA1 gene sequence.
34
And you’ve already copied and pasted the ACTA1
gene sequence to a Word document or clipboard—we
will use this to look for an Arabidopsis thaliana
version of this gene.
35
Steps to use BLASTn
1. Paste in your copied ACTA1 sequence.
2. Enter the name of the organism in which we are looking for the same gene (Arabidopsis thaliana).
3. Select the program: use “Somewhat similar sequences” for the broadest search.
4. Check “Show results in a new window”, then click the BLAST button.
36
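The same form fields can also be written down as a request to the NCBI BLAST URL API, which may help show what the web page sends. The sketch below only builds the request text and does not contact NCBI. The database name "nt", the function name, and the toy query are assumptions, and the “Somewhat similar sequences” option is deliberately not represented here.

```python
from urllib.parse import urlencode, parse_qs

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastn_request(sequence, organism, database="nt"):
    """Build the form body for submitting a BLASTn search (nothing is sent)."""
    params = {
        "CMD": "Put",                             # submit a new search
        "PROGRAM": "blastn",                      # nucleotide vs nucleotide
        "DATABASE": database,                     # assumed database name
        "QUERY": sequence,                        # the pasted query sequence
        "ENTREZ_QUERY": f"{organism}[Organism]",  # limit hits to one organism
    }
    return urlencode(params)


# Made-up toy query, not the real ACTA1 sequence.
body = build_blastn_request("ATGTGCGACGAAGACGAG", "Arabidopsis thaliana")
decoded = parse_qs(body)
print(BLAST_URL)
print(decoded["ENTREZ_QUERY"][0])
```

For classroom use the web form is simpler; a scripted request becomes useful only when many genes need to be searched the same way.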
What information is provided in an NCBI BLASTn report?
The Graphics Section shows the query sequence in the red bar (green arrow) and aligned sequences are shown in colored tracks below.
Each “track” represents a sequence that the BLASTn tool found in the database that is similar to your query sequence. The colored sections in each track are blocks of DNA which align with the query; the color of each block indicates its alignment score, as shown by the color key above the tracks. The thin black lines connecting the colored blocks mark stretches between aligned blocks of the same database sequence that did not align well with the query.
Move the mouse over a block to see the definition and score for that sequence result (also called a “hit”).
Click on a colored block to jump to the actual DNA alignment farther down the page.
37
38
What information is provided in an NCBI BLASTn report?
The Descriptions Section lists the aligned sequence names and provides
information about the alignment. In this search, we are using one gene
sequence to find a similar gene sequence. Look at the results that end in
“gene”.
What is gene alignment?
What BLASTn values tell us whether the
alignment is meaningful?
39
40
https://www.youtube.com/watch?v=6Udqou3vmng
Go to 31:13-40:15 for a more detailed explanation of alignment.
Query
Subject (database used for search)
Starting and ending nucleotides of your query
Starting and ending nucleotide coordinates for this sequence in its database
41
BLASTn performs local alignment: it seeks the highest-scoring alignments between shorter stretches of the query and sequences in the database. Local alignment does not require the entire query to align.
Matching nucleotides are given a positive score (for example +1) and mismatches a negative score. Gaps are also penalized. Different algorithms use different scoring details, but this is the general idea.
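The scoring idea can be sketched with the classic Smith-Waterman local alignment method. Note that this is a swapped-in teaching example: BLASTn uses a faster heuristic search, not this exhaustive calculation. The mismatch and gap values below (-1 and -2) are assumptions chosen for illustration.

```python
def local_alignment_score(query, subject, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score (Smith-Waterman, linear gaps)."""
    cols = len(subject) + 1
    previous_row = [0] * cols
    best = 0
    for i in range(1, len(query) + 1):
        current_row = [0] * cols
        for j in range(1, cols):
            same = query[i - 1] == subject[j - 1]
            diagonal = previous_row[j - 1] + (match if same else mismatch)
            # A score never drops below 0, so an alignment may start anywhere.
            current_row[j] = max(
                0,
                diagonal,                  # align the two nucleotides
                previous_row[j] + gap,     # gap in the subject
                current_row[j - 1] + gap,  # gap in the query
            )
            best = max(best, current_row[j])
        previous_row = current_row
    return best


# Made-up sequences: they share only the middle stretch "ACGTACG".
print(local_alignment_score("TTTACGTACGTTT", "GGGACGTACGGGG"))
```

Because the score resets to zero instead of going negative, the unmatched ends of the two sequences do not count against the well-matched middle, which is what "local" means here.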
42
“Query cover” tells what percentage of your input sequence (query) is included in the alignment to the database sequence.
Note that the query is more than 2750 nucleotides long.
43
The query coverage is low here (20%) because you are comparing two
DNA sequences which contain exons (conserved, thus aligned) and
introns (not highly conserved, thus non-aligned or poorly aligned).
44
Although only 20% of the query aligns to a sequence in the Arabidopsis
database, 80% of the aligned part is identical to the query (see the
“Ident” value of 80% and the color-coded portions of the result track).
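The arithmetic behind these two percentages can be sketched as follows. Query cover is taken here as the share of query positions that fall inside any aligned block, and percent identity as the share of identical positions within an aligned stretch. The coordinates, the 2750-nucleotide length, and the short aligned strings are made-up numbers chosen to reproduce 20% and 80%; they are not taken from a real BLAST report.

```python
def query_cover(aligned_ranges, query_length):
    """Percent of query positions covered by aligned blocks.

    aligned_ranges holds (start, end) query coordinates, 1-based and inclusive.
    Overlapping blocks are counted only once.
    """
    covered = set()
    for start, end in aligned_ranges:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / query_length


def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences have the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical numbers: two aligned blocks totalling 550 of 2750 nucleotides.
cover = query_cover([(101, 400), (1201, 1450)], 2750)
# Hypothetical 10-column alignment with 8 identical positions.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAA")
print(cover, identity)
```

The two values answer different questions: cover says how much of the query took part in the alignment, and identity says how well that part matched.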
45
“Alignments” provides details about nucleotide locations, matches, gaps or mismatches.
Access more info about the sequence by clicking on the sequence ID.
46
The E-value indicates the number of alignments with an equivalent or better score that would be expected from this database just by chance. For example, a one-in-a-million (1/1,000,000) chance is very small and would be written 1e-6.
The lower the E-value, the more significant the score (the less likely it is due just to chance).
E-values are written in scientific notation, e.g. 3e-80 = 3 × 10^-80.
47
In general, an E-value of 1 × 10^-5 (1e-5) or smaller is considered
significant (not just aligned by chance).
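As a small worked example of reading this notation, the sketch below converts E-values written as text (such as 3e-80) into numbers and compares them with the 1e-5 rule of thumb stated above. The function name is hypothetical.

```python
import math

SIGNIFICANCE_CUTOFF = 1e-5  # rule-of-thumb threshold from the lesson


def is_significant(e_value_text, cutoff=SIGNIFICANCE_CUTOFF):
    """Return True when the E-value is at or below the cutoff."""
    return float(e_value_text) <= cutoff


# "3e-80" means 3 times 10 to the power -80.
print(math.isclose(float("3e-80"), 3 * 10 ** -80))

for text in ("3e-80", "1e-6", "0.5"):
    print(text, is_significant(text))
```

An E-value of 0.5 means about half an alignment this good would be expected by chance alone, so it fails the check, while 3e-80 passes easily.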
48
This is from the Alignments Section and shows the details.
Results are arranged by default from lowest E-value to highest. Compare the E-value, Query cover and % identity for the checked “hits”.
Which GENE is most similar to the human ACTA1 sequence query?
Click on the accession number for more information about the gene that had the most significant alignment.
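The comparison described here can be sketched as a small ranking step: keep only the results whose description ends in "gene", then take the one with the lowest E-value. The hit names and all numbers below are invented placeholders, not real BLAST output.

```python
# Invented placeholder hits for illustration only (not real BLAST results).
hits = [
    {"name": "hypothetical_hit_A gene", "e_value": 3e-80, "query_cover": 20, "identity": 80},
    {"name": "hypothetical_hit_B mRNA", "e_value": 1e-90, "query_cover": 18, "identity": 82},
    {"name": "hypothetical_hit_C gene", "e_value": 2e-10, "query_cover": 5, "identity": 75},
]


def best_gene_hit(hit_list):
    """Return the 'gene' hit with the lowest E-value, or None if there is none."""
    gene_hits = [hit for hit in hit_list if hit["name"].endswith("gene")]
    if not gene_hits:
        return None
    return min(gene_hits, key=lambda hit: hit["e_value"])


best = best_gene_hit(hits)
print(best["name"], best["e_value"])
```

Filtering before ranking matters: in this made-up list the lowest E-value overall belongs to an mRNA record, which is not the kind of result the question asks about.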
49
50
Amino acid sequence
Link for more info!
51
the process you used to find a version of the
human ACTA1 gene in Arabidopsis thaliana.
What information did you use to indicate that the plant
version was a meaningful find?
52
1. Pick a human gene which you
think is highly conserved
between plants and animals.
2. Follow the procedure you just
learned to see if a similar
Arabidopsis version exists.
3. Record your info on the
scorecard.
4. Repeat for a gene that you
predict is unique to humans.
53
Gene Discovery Scorecard (example row)
Human Gene Name: Actin alpha 1
Human Gene ID: ACTA1
Human Gene Function: Cytoskeletal structure
Arabidopsis Gene Name: ACT7
Arabidopsis Gene ID: Actin 7
Arabidopsis Gene Function: Cytoskeletal structure
Outcome evidence (Score, E-value, Similar Function): E-value was 1e-80, not random; both have similar functions….
Prediction?: Yes
54
• What information so far
indicates whether or not
plants have animal muscle
genes?
• What additional
information might you
need to be more certain
whether ACT7 is a plant
version of human ACTA1?
55
