This document provides instructions for Assignment 2 of the BMI 214 course on machine learning for expression data and genotype-phenotype associations. It includes instructions on using the Weka machine learning tool to perform supervised and unsupervised learning on gene expression datasets. For supervised learning, it has students classify leukemia samples and evaluate different classifiers. For unsupervised learning, it has students perform k-means clustering on a yeast gene expression dataset. It also includes exercises on feature selection to identify informative genes for classification.
This document outlines an assignment involving machine learning techniques for analyzing gene expression and genotype-phenotype data. It includes instructions for using the Weka machine learning tool to perform supervised and unsupervised learning on gene expression datasets, as well as questions analyzing the results. It also provides background information on concepts like SNPs, haplotypes, and penetrance, and includes a link to a dataset for applying feature selection to genotype-phenotype association.
The document discusses mutation testing for ATL model transformations. It introduces mutation testing and how it can be applied to ATL. The key contributions are new mutation operators for ATL that mimic common developer errors, evaluation of test generation techniques and operators, and an open-source tool for mutation testing of ATL.
This Home has everything a family needs without spending millions more for a brand new build ... perfect for entertaining with spacious, formal Living & Dining Rooms complete with pantry & high ceilings.
Presented by:
Magda Mo, Salesperson Sutton Group-Bayview Realty Inc., Brokerage 416 483-8000
And
James Metcalfe, Broker Royal LePage RES Ltd., JOHNSTON & DANIEL Division Brokerage 416-489-2121
October 2009 Edition: Toronto Real Estate Market Views (Magda Mo)
The document provides information on fire safety in the home. It notes that home fires kill more people in Canada than in any other industrialized country because many Canadian homes are built of wood. It recommends having working smoke alarms on each floor and practicing home fire evacuation plans. Specific fire safety tips include keeping candles away from flammable items, avoiding storing flammable materials such as cleaning solvents in basements, and ensuring wood paneling in basements is backed by non-flammable materials. Maintaining a tidy home reduces fire risks.
This document provides instructions for a machine learning lab assignment. Students are asked to use the Weka machine learning tool to classify RNA-binding proteins using various algorithms, including Naive Bayes, the J48 decision tree, and SVMs with linear and RBF kernels. Performance is measured using 5-fold cross-validation on the training set and classification of a separate test protein. Results for accuracy and other metrics are recorded in tables.
The document provides instructions for a machine learning lab experiment using the Weka machine learning software. Students are asked to run several classifiers on a dataset containing RNA-binding protein sequences to predict whether amino acids bind to RNA or not. Classifiers include Naive Bayes, the J48 decision tree, and support vector machines (SVMs) with linear and RBF kernels. Students record performance metrics from 5-fold cross-validation and from testing on a separate protein sequence, and analyze which classifier worked best.
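The 5-fold cross-validation mentioned in the two summaries above can be illustrated with a minimal sketch. It only shows how sample indices are divided into folds; the function name `k_fold_indices` is hypothetical, and Weka's own procedure may differ (for example by shuffling or stratifying the samples first).

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k (train, test) pairs.

    Illustrative only: no shuffling or stratification is performed.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    for train, test in k_fold_indices(12, k=5):
        print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every sample is used for evaluation once and for training in the other folds.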
DNA and Genes Lab Activity: Complete your answers in the spaces ... .docx (jacksnathalie)
DNA and Genes Lab Activity
Complete your answers in the spaces provided. USE YOUR OWN WORDS – Yes even for definitions! Remember to add your last name and first initial to the file name prior to saving and submitting your completed assignment through Canvas.
Use your textbook, notes and these websites to answer the pre lab questions. http://learn.genetics.utah.edu/units/basics/transcribe/ http://www.vcbio.science.ru.nl/en/virtuallessons/cellcycle/trans/
Pre Lab Questions:
1. What is the product of transcription?
2. What is the region of DNA called where transcription begins?
3. What is the product of translation?
4. In your own words define each of the following: silent mutation, missense mutation, nonsense mutation, frame shift mutation.
5. Where in the cell does translation take place?
Click on the link below to access the online lab.
http://www.mhhe.com/biosci/genbio/virtual_labs_2K8/pages/DNA_And_Genes.html
Download and print the instructions for reference as you work through the lab. As you work through the lab fill in the table below. Use this information to answer the questions that follow contained in this document.
First read through the mutation guide. Once you close the guide you will see the buttons to begin the simulation. Note, you will be translating the mRNA strand into a protein.
As you work through each of the mutations, fill in the charts below. You must complete 4 mutations for this lab activity. It’s good practice working with the codon table (Aris labs calls the codon table the ‘Genetic Code Chart’). Use the amino acid abbreviation for the protein sequence. For example, the amino acid proline is abbreviated as pro.
You have to fill in all the letters AND the resulting amino acid sequence by dragging and dropping before you click the [check] button. Abbreviate STOP as either STP or END.
For each of the four mutations you will complete, fill in the table in this lab document with the original mRNA and amino acid sequence, and with the mutated mRNA sequence and the amino acid sequence RESULTING FROM the mutation as outlined in the mutation rule.
The various mutations represent missense, nonsense, silent and frame shift mutations. You must complete one of each. The lab will not necessarily present the mutations in this order. You must do the mutation and identify which type it is and make sure you do one of each.
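To make the four mutation types concrete, here is a minimal sketch that translates an mRNA string with the standard genetic code and labels the difference between an original and a mutated sequence. The function names and the example sequences are hypothetical and are not taken from the virtual lab. The sketch uses single-letter amino acid codes with `*` for STOP, whereas the lab asks for three-letter abbreviations.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ("*" = STOP)
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate codon by codon, stopping after the first STOP codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def classify_mutation(original, mutated):
    """Label a mutation as frameshift, silent, nonsense, or missense."""
    if len(original) != len(mutated):
        # Insertions/deletions that are not a multiple of 3 shift the reading frame
        return "frameshift" if (len(original) - len(mutated)) % 3 else "in-frame indel"
    before, after = translate(original), translate(mutated)
    if before == after:
        return "silent"
    stop_before, stop_after = before.find("*"), after.find("*")
    if stop_after != -1 and (stop_before == -1 or stop_after < stop_before):
        return "nonsense"  # a new, earlier STOP truncates the protein
    return "missense"


if __name__ == "__main__":
    original = "AUGGCUUAUGGA"  # hypothetical mRNA: Met-Ala-Tyr-Gly
    for mutated in ("AUGGCCUAUGGA", "AUGGAUUAUGGA", "AUGGCUUAAGGA", "AUGCUUAUGGA"):
        print(mutated, classify_mutation(original, mutated))
```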
6. Frame Shift Mutation example:
Provide the mutation rule you are following.
Original A. Acids | Original mRNA | Mutated mRNA | Mutated A. Acids
7. Missense Mutation example:
Provide the mutation rule you are following.
Original A. Acids | Original mRNA | Mutated mRNA | Mutated A. Acids
8. Nonsense Mutation example:
Provide the mutation rule you are following.
Original A. Acids | Original mRNA | Mutated mRNA | Mutated A. Acids
9. Silent Mutation example:
Pr ...
This document describes the development of a database for a laboratory entity within a hospital's electronic medical record system. It includes an introduction outlining the need for efficient medical databases. It then provides background on the business rules and attributes of the laboratory entity. An entity relationship diagram shows the relationships between entities. The document discusses using the system development life cycle methodology and describes the planning, analysis, design, implementation, and maintenance phases of developing the backend and frontend databases. It concludes by recommending future improvements.
This document provides an overview of quantitative data analysis techniques including descriptive statistics, reliability analysis, factor analysis, and various statistical tests. Descriptive statistics involve calculating frequencies, percentages, means, and cross-tabulations to summarize demographic and other variables. Reliability analysis using Cronbach's alpha is described to measure the internal consistency of scales. The steps for conducting an exploratory factor analysis are outlined. Finally, guidance is provided on selecting appropriate statistical tests such as t-tests, ANOVA, regression, chi-square, and Mann-Whitney U based on the variables' levels of measurement and number of groups being compared.
This document provides instructions for conducting a factor analysis in SPSS. It describes screening the data by examining correlations between variables to identify any that do not correlate well. It recommends having a sample size over 300 and communalities above 0.5. The analysis is run using principal component analysis. Factors are extracted based on eigenvalues over 1 or a fixed number. An orthogonal rotation like varimax is typically used to improve interpretability of the factors. Factor scores can optionally be saved.
One-way analysis of variance (ANOVA) tests allow you to determine if one given factor, such as drug treatment,
has a significant effect on gene expression behavior across any of the groups under study. A significant p-value
resulting from a 1-way ANOVA test would indicate that a gene is differentially expressed in at least one of the
groups analyzed. If there are more than two groups being analyzed, however, the 1-way ANOVA does not
specifically indicate which pair of groups exhibits statistical differences. Post Hoc tests can be applied in this
specific situation to determine which specific pair/pairs are differentially expressed. This document will provide
the necessary information for you to perform these analyses within GeneSpring.
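As a rough illustration of what a one-way ANOVA computes, the sketch below derives the F statistic from between-group and within-group variation. It is not GeneSpring's implementation; the function name and the expression values are hypothetical, and converting F to a p-value (which requires the F distribution) is left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Hypothetical expression values for one gene under three treatments
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F only says that at least one group mean differs; identifying which pairs differ is the job of the post hoc tests described above.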
Phylogenetic Analysis Homework assignment
This assignment will be completed on your own and turned in the week of 11/8-11/10.
Introduction
Molecular evolution is the study of how proteins and nucleic acids evolve. Included in this
field are studies of mutations and chromosomal rearrangements, the evolutionary process,
the identification of sequence patterns conferring function in proteins and nucleic acids,
and the reconstruction of the evolutionary history of organisms and the molecules that
they make. All of these studies rely on comparisons of nucleotide or amino acid sequences.
In this tutorial, you will be introduced to some of the fundamental principles of molecular
evolution and the types of bioinformatics tools that are used in evolutionary studies. We
will begin by carrying out a manual sequence comparison, so that the basic concepts can
be introduced, and the remainder of the project will be carried out at The Biology
Workbench, a set of bioinformatics analysis programs managed by The San Diego
Supercomputing Center at the University of California, San Diego.
Objectives
• To introduce the principles of molecular evolution
• To acquaint you with the tools that are available to compare nucleotide and
amino acid sequences
• To learn about the use of protein sequences in reconstructions of evolutionary history
Project
Branching evolution occurs when one ancestral species gives rise to two or more progeny
species. However, speciation events don't involve the vast majority of the genes in a
genome. That is, for most genes, both of the progeny species inherit identical genes from
the ancestor. Following speciation, these genes evolve independently in the separate
lineages. Studies of molecular evolution therefore rely heavily on comparisons of related
sequences from different organisms.
Shown below is an alignment of two homologous sequences that we will use as a starting
place. Homologous sequences are sequences that have descended from a common
ancestral sequence. You can't meaningfully compare sequences unless they are
homologous. This alignment uses the single letter amino acid code, in which G represents
glycine, Q represents glutamine, etc. The aligned proteins have been shown to be involved
in the metabolism of similar, but different, toxic compounds. As you can see, these amino
acid sequences are very similar and it is easy to recognize that they are related by common
descent.
dntAc: KMGVDDEVIVSRQNDGSVR
nahAc: KMGIDDEVIVSRQSDGSIR
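The manual comparison of these two aligned sequences can be sketched in a few lines. The function name `compare_aligned` is hypothetical; the sequences are the ones shown above, and the sketch assumes they are already aligned with no gaps.

```python
def compare_aligned(seq_a, seq_b):
    """List mismatched positions (1-based) and the fraction of identical positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    diffs = [(i + 1, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    identity = 1 - len(diffs) / len(seq_a)
    return diffs, identity


if __name__ == "__main__":
    dntAc = "KMGVDDEVIVSRQNDGSVR"
    nahAc = "KMGIDDEVIVSRQSDGSIR"
    diffs, identity = compare_aligned(dntAc, nahAc)
    print(diffs)                # positions where the amino acids differ
    print(round(identity, 3))   # fraction of identical positions
```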
An expanded version of this alignment is shown below. In this expanded alignment, both
the amino acids and the corresponding DNA nucleotides are shown. For ease of analysis,
the codons have been broken into separate entries in a table.
Alignment of nahAc and dntAc sequences.
K M G V D E V I V
dntAc AAA ATG GGC GTC GAT GAA GTC ATC GTC
nahAc ...
This document provides instructions for experiments involving bioinformatics tools and software. It begins with introductory information and a table of contents. The experiments cover topics like downloading sequences from NCBI, performing BLAST searches, converting between protein and nucleotide sequences, downloading and using MEGA and other software for phylogenetic analysis, primer design, sequence cleaning and formatting, and more. Step-by-step instructions are provided for completing each analysis using various online and offline bioinformatics resources.
This document provides instructions for a lab on protein modeling and comparative genomics. It covers using BLASTP and PSI-BLAST to search protein sequences against databases and compare results. It also demonstrates using genome browsers to compare genomic regions across species, noting that orthologous genes often retain similar functions. Key steps include running BLASTP and PSI-BLAST iterations with different parameters, analyzing output, and using the Ensembl genome browser to view orthologs between the mouse Pax6 gene region and corresponding human region.
The document describes a lab experiment analyzing gene expression data from human fibroblasts in response to serum using microarray analysis. The aims are to analyze the gene expression data using Excel and the ArrayTrack workbench. Key steps include importing microarray data into Excel and pre-treating the data by centering and scaling. ArrayTrack is then used to analyze the data through descriptive statistics, exploring gene expression profiles of gene lists, and using the significance analysis of microarrays (SAM) tool. Additional online databases like Gene Atlas and ArrayExpress are queried to find expression profiles and experimental data for a specific gene, ATP13A2, under different conditions.
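The centering and scaling pre-treatment mentioned above can be sketched as follows. The summary does not say which scaling the lab uses, so this assumes the common z-score form (subtract the mean, divide by the sample standard deviation); the function name and the values are hypothetical.

```python
from statistics import mean, stdev


def center_and_scale(values):
    """Center to mean 0 and scale to unit sample standard deviation (z-scores)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    # Hypothetical expression values for one gene across three arrays
    print(center_and_scale([2.0, 4.0, 6.0]))
```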
This tutorial summarizes the steps to analyze gene modules from a dataset using the Gitools analysis platform. The key steps are:
1. Import a continuous data matrix file from IntOGen containing corrected p-values for gene upregulation in different cancer types.
2. Load the continuous data matrix and transform it to a binary matrix using a p-value threshold of 0.05.
3. Load a file containing gene modules to analyze for enrichment.
4. Run a binomial test enrichment analysis and view the results heatmap showing enriched modules for each cancer type.
5. Use the properties window to change column labels from IDs to actual cancer type names for easier interpretation.
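Steps 2 and 4 above can be sketched outside Gitools to show the underlying arithmetic. This is an illustration under stated assumptions, not Gitools' implementation: the function names, the gene names, the strict `<` comparison at the threshold, and the one-sided form of the test are all assumptions.

```python
from math import comb


def binarize(p_values, threshold=0.05):
    """Map each gene's corrected p-value to 1 (significant) or 0."""
    return {gene: int(p < threshold) for gene, p in p_values.items()}


def binomial_enrichment_p(hits_in_module, module_size, background_rate):
    """One-sided binomial test: P(X >= hits) for X ~ Binomial(module_size, background_rate)."""
    return sum(
        comb(module_size, k)
        * background_rate ** k
        * (1 - background_rate) ** (module_size - k)
        for k in range(hits_in_module, module_size + 1)
    )


if __name__ == "__main__":
    # Hypothetical corrected p-values for upregulation in one cancer type
    p_values = {"GENE_A": 0.01, "GENE_B": 0.20, "GENE_C": 0.03, "GENE_D": 0.60}
    binary = binarize(p_values)
    background = sum(binary.values()) / len(binary)
    module = ["GENE_A", "GENE_C"]  # hypothetical gene module
    hits = sum(binary[g] for g in module)
    print(binomial_enrichment_p(hits, len(module), background))
```

A small result suggests the module contains more significant genes than the background rate would predict.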
The document provides instructions for conducting a one-way ANOVA in SPSS. It describes entering data with groups as the independent variable and errors as the dependent variable. It outlines defining the variables, running the one-way ANOVA test, and interpreting the output, including tests for homogeneity of variance and post-hoc tests to determine differences between group means.
This document discusses classification and clustering techniques using the Weka data mining tool. It begins with an introduction to Weka and its capabilities for classification, clustering, and other data mining functions. It then provides an example of using Weka's J48 decision tree algorithm to classify iris flower samples based on sepal and petal attributes. Finally, it demonstrates k-means clustering on customer purchase data from a BMW dealership to group customers into five clusters based on their buying behaviors.
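The k-means procedure that the summary above mentions can be sketched in plain Python. This is not Weka's SimpleKMeans; the function name, the starting centroids, and the data points are hypothetical, and the starting centroids are passed in explicitly so the result is deterministic.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared Euclidean distance
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


if __name__ == "__main__":
    points = [(1, 1), (1, 2), (9, 9), (10, 9)]
    centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
    print(centroids)
    print(clusters)
```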
Part 5 of RNA-seq for DE analysis: Detecting differential expression (Joachim Jacob)
Fifth part of the training session 'RNA-seq for Differential expression analysis'. We explain the most important concepts of detecting DE expression based on a count table, explaining DESeq2 algorithm. Interested in following this session? Please contact http://www.jakonix.be/contact.html
This document provides an overview of RELMA (Regenstrief LOINC Mapping Assistant), a software tool for mapping local laboratory test names and codes to standardized LOINC codes. It discusses installing and using RELMA to facilitate mapping of a local observation file to LOINC codes. The goals are to improve data quality, interoperability and comparability by implementing standardized terminology.
one complete report from all the 4 labs.pdfstudy help
The document provides instructions for compiling a complete lab report from four biology labs on genomic databases, primer design, PCR, and molecular cloning. It outlines the necessary sections for the report, including an introduction describing the overall question and background, materials and methods, results with data and figures, and a discussion/conclusion section. It also provides additional details on designing a transgene reporter gene based on knowledge gained from the lab exercises, including defining a transgene, necessary gene elements, and ideas for using the transgene.
one complete report from all the 4 labs.pdfstudy help
The document provides instructions for compiling a complete lab report from four biology labs on genomic databases, primer design, PCR, and molecular cloning. It outlines the required sections of the report, including an introduction, materials and methods, results with data, and a discussion/conclusion section. It also provides discussion questions on building a reporter gene or transgene, defining key terms and outlining the necessary gene elements and ideas for using the transgene. The report should integrate results and instructions from all four labs.
UNIT 5 EXPERIMENT ANSWER SHEET Please submit to the UNIT 5 Exper.docxouldparis
UNIT 5 EXPERIMENT ANSWER SHEET
Please submit to the UNIT 5 Experiment SUBMISSION LINK no later than Sunday midnight.
SUMMARY OF ACTIVITIES FOR UNIT 1 EXPERIMENT ASSIGNMENT
· Experiment 5 Exercise 1 – Transcription and Translation
· Experiment 5 Exercise 2 – Translation and Mutations
· Experiment 5 Exercise 3 – Mutation Rates
Experiment 5 Exercise 1: Transcription and Translation
This exercise will ensure that you have a good understanding of the processes of transcription and translation. To get started, go to the following website:
University of Utah. No date. Transcription and Translation
http://learn.genetics.utah.edu/content/molecules/transcribe/
Procedure
A. Read over the information on the first screen and click on the click here to begin to proceed.
B. On the next screen, transcribe the given DNA strand.
Table 1. Transcription of the DNA sequence (1.5 pts).
RNA
C. Once you have finished transcribing the DNA, you will then translate the RNA sequence. Follow the instructions on the screen.
Table 2. Translation (1.5 pts)
Codon
Amino Acid
Codon 1
Codon 2
Codon 3
Codon 4
Codon 5
Codon 6
Experiment 5 Exercise 2: Translation and Mutations
Now that you know how to transcribe DNA and translate the mRNA message, let’s take a look at the different types of mutations that might disrupt this process. Review pp 186-187 in your book before beginning. In this exercise you will need to use the following website:
McGraw Hill. No date. Virtual Lab: DNA and Genes
http://www.glencoe.com/sites/common_assets/advanced_placement/mader10e/virtual_labs_2K8/labs/BL_04/index.html
Read over the information in the Mutation Guide and close it when you are done. Note that there are several pages; you will need to click on Next to proceed through the Guide. If you want to review this material, you can click on the Mutation Guide button. You are going to run a series of simulations in which an mRNA sequence and its corresponding amino acid sequence are provided. You will be told what type of mutation to apply (= Mutation Rule) and you will have to determine the new, mutated mRNA and the resulting protein sequence.
Procedure
A. Click on the Mutate button to get started.
B. Find the Mutation Rule (lower left corner) and enter it into Table 3 below (see the Example provided).
C. Drag the appropriate nucleotides to build the new, Mutated mRNA sequence. If you make a mistake building the new mRNA sequence, drag the correct nucleotide and place it on top of the incorrect one (you cannot actually remove a nucleotide).
D. Once you have generated your Mutated mRNA sequence, you now need to build your Mutated amino acid sequence by matching the appropriate amino acid with each codon. Click on Genetic Code Chart to see the code or you can use Figure 10.11 on p 160 in your book.
NOTE: If you add a STOP codon, do NOT add any more amino acids after it!
E. Once you have finished, click on the Check button. If you are correct, then c ...
Assignment 2 Tests of Significance: Throughout this assignmen... .docx (karenahmanny4c)
Assignment 2: Tests of Significance
Throughout this assignment you will review mock studies. You will need to follow the directions outlined in the section using SPSS and decide whether there is significance between the variables. You will need to list the five steps of hypothesis testing (as covered in the lesson for Week 6) to see how every question should be formatted. You will complete all of the problems. Be sure to cut and paste the appropriate test result boxes from SPSS under each problem and explain what you will do with your research hypotheses.
All calculations should be coming from your SPSS. You will need to submit the SPSS output file to get credit for this assignment. This file will save as a .spv file and will need to be in a single file. In other words, you are not allowed to submit more than one output file for this assignment.
The five steps of hypothesis testing when using SPSS are as follows:
State your research hypothesis (H1) and null hypothesis (H0).
Identify your significance level (.05 or .01)
Conduct your analysis using SPSS.
Look for the valid score for comparison. This score is usually under ‘Sig 2-tail’ or ‘Sig. 2’. We will call this “p”.
Compare the two and apply the following rule:
If “p” is < or = significance level, then you reject the null.
Be sure to explain to the reader what this means in regards to your study. (Ex: will you recommend counseling services?)
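The decision rule in step 5, and the statistic behind a one-sample t test, can be sketched as follows. SPSS reports the p-value itself; this sketch only computes the t statistic and applies the rule to a p-value you supply. The function names and the sample values are hypothetical and are not the assignment's data.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, population_mean):
    """t = (sample mean - population mean) / (sample SD / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - population_mean) / (stdev(sample) / sqrt(n))


def decide(p_value, alpha=0.05):
    """Apply the rule: reject the null when p is less than or equal to alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


if __name__ == "__main__":
    adl = [18, 20, 15, 19, 21, 17]  # hypothetical ADL counts, not the assignment's data
    print(one_sample_t(adl, population_mean=17))
    print(decide(0.03, alpha=0.05))
    print(decide(0.03, alpha=0.01))
```

The same p-value can lead to different decisions at the .05 and .01 levels, which is why the assignment asks for both.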
* Be sure that your answers are clearly distinguishable. Perhaps you bold your font or use a different color.
This assignment is due no later than Sunday of Week 6 by 11:55 pm ET. Save the file in the following format: [your last name_SOCI332_A2]. The file must be a word file.
t Tests
t Test for a Single Sample (20 points)
Open SPSS
Enter the number of activities of daily living performed by the depressed clients studied in #1 in the Data View window.
In the Variable View window, change the variable name to “ADL” and set the decimals to zero.
Click Analyze → Compare Means → One-Sample T test → the arrow to move “ADL” to the Variable(s) window.
Enter the population mean (17) in the “Test Value” box.
Click OK.
1.
Researchers are interested in whether depressed people undergoing group therapy will perform a different number of activities of daily living after group therapy. The researchers have randomly selected 12 depressed clients to undergo a 6-week group therapy program.
Use the five steps of hypothesis testing to determine whether the average number of activities of daily living (shown below) obtained after therapy is significantly different from a mean number of activities of 17 that is typical for depressed people. (Clearly indicate each step).
Test the difference at the .05 level of significance and at the .01 level (in SPSS this means you change the “confidence level” from 95% to 99%).
As part of Step 5, indicate whether the behavioral scientists should recommend group therapy for all depressed people based.
This document summarizes a keynote presentation about challenges in bioinformatics software development and proposed solutions. Some of the key points made include: 1) bioinformatics software development involves multiple disciplines including computer science, software engineering, statistics, and biology, each with different priorities; 2) there is a massive proliferation of bioinformatics software packages that leads to many difficult choices for researchers; 3) proposed solutions include developing software in a more modular and automated way, using common benchmarks and protocols to evaluate tools, and focusing on reproducibility and usability.
The document describes conducting a factor analysis on SPSS to measure different aspects of student anxiety towards learning SPSS. A 23-item questionnaire was administered to over 2,500 students. Initial analysis of the correlation matrix found no issues with multicollinearity. The document then provides instructions for running the factor analysis in SPSS, including extracting factors, rotating the factors, and interpreting the output.
This document analyzes YouTube's business model. It explains that YouTube and other online video sites represent a new business model for audiovisual content, driven by the change in consumption habits caused by new technologies. It describes how YouTube leverages user participation to improve continuously and to attract an audience different from that of traditional media.
DNA and Genes Lab ActivityComplete your answers in the spaces .docxjacksnathalie
DNA and Genes Lab Activity
Complete your answers in the spaces provided. USE YOUR OWN WORDS – Yes even for definitions! Remember to add your last name and first initial to the file name prior to saving and submitting your completed assignment through Canvas.
Use your textbook, notes and these websites to answer the pre lab questions. http://learn.genetics.utah.edu/units/basics/transcribe/http://www.vcbio.science.ru.nl/en/virtuallessons/cellcycle/trans/
Pre Lab Questions:
1. What is the product of transcription?
2. What is the region of DNA called where transcription begins?
3. What is the product of translation?
4. In your own words define each of the following: Silent mutation
Missense mutation Nonsense mutation Frame shift mutation
5. Where in the cell does translation take place?
Click on the link below to access the online lab.
http://www.mhhe.com/biosci/genbio/virtual_labs_2K8/pages/DNA_And_Genes.html
Download and print the instructions for reference as you work through the lab. As you work through the lab fill in the table below. Use this information to answer the questions that follow contained in this document.
First read through the mutation guide. Once you close the guide you will see the buttons to begin the simulation. Note, you will be translating the mRNA strand into a protein.
As you work through each of the mutations fill in the charts below. You must complete 4 mutations for this lab activity. It’s good practice working with the codon table .
– Aris labs calls the codon table the ‘Genetic Code Chart’. Use the amino acid abbreviation for the protein sequence. For example the amino acid proline is abbreviated as pro.
You have to fill in all the letters AND the resulting amino acid sequence by dragging and dropping before you click the [check] button. Abrieviate STOP as either STP or END.
For each of the three mutations you will complete, fill in the table in this lab document with the original mRNA and amino acid sequence and the mRNA sequence and the resulting amino acid sequence RESULTING FROM the mutation as outlined in the mutation rule.
BMI 214
Spring 2006
Assignment 2:
Machine learning for expression data and
Genotype-phenotype associations
Contents:
Part 1: Weka for machine learning in expression data
a. Introduction
b. Supervised learning
c. Feature selection
d. Unsupervised learning
Part 2: Genotype-phenotype associations
a. Introduction
b. Feature selection
Submission: Send your answers as a PDF document to
biomedin214-spr0506-submit@lists.stanford.edu
Part 1: Weka for machine learning in expression data
Part 1a: Introduction
Download / familiarize yourself with Weka.
As discussed in class, Weka is a useful tool that has implemented most of the major
machine learning algorithms. In this part you will re-live Project 2, but now let Weka do
the work. It will be both a tutorial and problem set.
We will be using the Weka GUI in this assignment. There are also an excellent
command line interface and Java API, which will not be covered. Read more in the
documentation if interested.
Download and start the Weka GUI. Follow the instructions on the Weka site:
http://www.cs.waikato.ac.nz/ml/weka/
You will see four buttons; in this class we will only use the Explorer functionality.
There is much Weka documentation available. You may familiarize yourself with
Explorer as much as you like by reading the user guide and using their provided sample
datasets. Some good (optional) starting points:
Explorer guide:
http://easynews.dl.sourceforge.net/sourceforge/weka/ExplorerGuide-3.4.pdf
Weka Wiki:
http://weka.sourceforge.net/wekadoc/index.php/Main_Page
Part 1b: Supervised learning
We will be using the same leukemia data as in Project 2. Recall that the dataset
comprised 72 leukemia patients, 28 of whom were AML and 44 of whom were ALL.
Each patient had expression measurements for 7129 genes.
We have compiled these data into a Weka-compatible CSV file.
http://helix-web.stanford.edu/bmi214/assignment2/data/leukemia.csv
This file has the 72 leukemia patients (rows) and expression values for 150 genes
(columns). This data matrix is slightly altered from the one in Project 2: it is transposed,
and we chose a subset of 150 genes to make your life easier. The gene and experiment
names, as well as a link to the reference paper, are available in Project 2.
Open the leukemia.csv file in Weka.
Question 1. What is the mean value of expression of the gene labeled
“CD33 CD33 antigen (differentiation antigen)”?
Now we will run KNN. Go to the “classify” tab. Under “Classifier” click the “Choose”
button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK.
This is KNN: IBk stands for Instance-Based k. Now next to the button it should say
“IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter
5. Recall that this means the algorithm will use the five nearest neighbors to classify
each data point. Leave the rest of the values as default.
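To make the idea of "five nearest neighbors" concrete, here is a minimal Python sketch of KNN classification. It is an illustration only, not Weka's IBk implementation: the function name `knn_predict`, the use of plain Euclidean distance, and the toy two-gene data are all assumptions made for this example.

```python
from collections import Counter
import math


def knn_predict(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    Distance is plain Euclidean distance (an assumption for this sketch).
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the class labels among those k neighbors and return the winner.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-gene expression profiles (made-up numbers, not the leukemia data).
toy_train = [
    ((1.0, 0.9), "ALL"), ((1.2, 1.1), "ALL"), ((0.8, 1.0), "ALL"),
    ((1.1, 0.8), "ALL"), ((3.0, 3.2), "AML"), ((3.1, 2.9), "AML"),
    ((2.9, 3.0), "AML"),
]

print(knn_predict(toy_train, (1.0, 1.0), k=5))
```

With an odd k and two classes the vote can never tie, which is one practical reason to pick a value such as 5.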
Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter
5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This
means that the algorithm will classify “leukemia_type” (AML or ALL), using the
gene expression values as attributes.
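As a rough picture of what 5-fold cross-validation does with 72 patients, the Python sketch below splits instance indices into five train/test pairs. It is a simplification under stated assumptions: the helper `k_fold_indices` is hypothetical, and it does no shuffling or class stratification, so it will not reproduce the exact folds Weka uses.

```python
def k_fold_indices(n_instances, n_folds=5):
    """Split indices 0..n_instances-1 into n_folds (train, test) pairs.

    Each instance lands in exactly one test fold; the remaining
    instances form the training set for that fold.
    """
    indices = list(range(n_instances))
    # Deal the indices out round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(72, n_folds=5)
for train, test in splits:
    # Each classifier is trained on `train` and evaluated on `test`.
    print(len(train), len(test))
```

The reported accuracy is then computed over all five test folds together, so every patient is used for testing exactly once.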
Click the “Start” button. The main window will show a variety of result summary
statistics, such as accuracy, true positives, false positives, and a confusion matrix.
Question 2. What is the % of correctly classified instances?
ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and
specificity. ROC curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs.
FP (false positive) rate. Recall that
TP rate = Sensitivity = TP / (TP + FN)
FP rate = 1 – Specificity = FP / (FP + TN)
Specificity = 1 – FP rate = TN / (FP + TN)
where TP, FP, TN, and FN are the numbers of true positives, false positives,
true negatives, and false negatives, respectively.
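If it helps to check the numbers Weka reports, the rates can be computed by hand from the four confusion-matrix counts. The Python sketch below uses the standard definitions, with FP rate = FP / (FP + TN); the function name `rates` and the example counts are made up for illustration and are not results from the leukemia data.

```python
def rates(tp, fp, tn, fn):
    """Compute TP rate, FP rate, and specificity from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)        # sensitivity: fraction of positives found
    fp_rate = fp / (fp + tn)        # fraction of negatives wrongly called positive
    specificity = tn / (fp + tn)    # equals 1 - fp_rate
    return {"tp_rate": tp_rate, "fp_rate": fp_rate, "specificity": specificity}


# Hypothetical counts, treating one class as "positive" (not real results).
r = rates(tp=40, fp=3, tn=25, fn=4)
print(r)
```

Note that the rates depend on which class is treated as positive, which is why Weka lists a separate TP rate and FP rate for ALL and for AML.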
Question X. What are the TP and FP rates for ALL and AML?
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Optional: Right click on your result in the “Result list” on the left side of the screen.
Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP)
rate vs. false positive (FP) rate, which are the defaults. You can also view other types of
curves by clicking the dropdown menus. For example, precision-recall curves are an
alternative to ROC curves; precision and recall are options in the dropdown menu.
Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the
patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none
of the patients could then be classified ALL.
Question 3. Now we’ll try a different classification algorithm, ZeroR. Click
the Choose button under Classifier, and expand the “rules” folder. Choose
“ZeroR”. Again use cross-validation with Folds=5. Run it.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Explanation: ZeroR is a baseline classifier that simply identifies the class that is most
abundant (in this case, ALL with 44 patients), and predicts all instances (patients) to be
in that class. This illustrates that when the instances are unevenly distributed across
classes, looking only at accuracy can be misleading. For example, if you have a dataset of
100 instances (patients), where 90 are ALL and 10 are AML, then ZeroR will predict them
all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might
have 85% accuracy, which looks pretty good until you compare it with ZeroR. You can
get around the problem of uneven class sizes by weighting, a topic which we won’t cover
here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if
you’re interested.)
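The 90/10 example can be reproduced with a short sketch of a majority-class baseline. For simplicity it is evaluated on the same labels it was fit on, with no cross-validation; the function name zero_r is hypothetical and the labels are the made-up ones from the example, not the leukemia data.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent class label; this is the whole 'model'."""
    return Counter(train_labels).most_common(1)[0][0]

# The hypothetical dataset from the explanation: 90 ALL and 10 AML patients.
labels = ["ALL"] * 90 + ["AML"] * 10
majority = zero_r(labels)
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
aml_tp_rate = (
    sum(p == "AML" and y == "AML" for p, y in zip(predictions, labels))
    / labels.count("AML")
)
```

Accuracy comes out at 0.9 even though not a single AML patient is identified (AML TP rate of 0), which is why the per-class rates matter.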
We’ll try a few more algorithms on this dataset.
Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and
run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Question 5. Under the “functions” Classifier folder, choose “SMO” (a
popular implementation of SVM) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Question 6. Under the “trees” Classifier folder, choose “ADTree” (a
popular implementation of a decision tree) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
You can read about the various other algorithms in the documentation if interested.
Part 1c: Feature selection (aka attribute selection)
We will now use feature selection algorithms to extract the most “informative” genes for
classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise
because many times we are interested not in measuring all the gene expressions, but in
finding one or a few whose expression level can be used as a marker for the disease.
The general area of “Biomarkers” is economically important because it may lead to
diagnostic tests that can be marketed to clinical laboratories.
If interested, you can learn more about the theory behind feature selection in machine
learning textbooks and online resources. Wikipedia is always a good place to start.
http://en.wikipedia.org/wiki/Feature_selection
First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka,
go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin
through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”,
which is the class label (AML or ALL). Check this box, too, so that there are 6 total
boxes checked. Click the “Invert” button above the list, so that the other 145 boxes
become checked. Click the “Remove” button below the list, so that you are left with 5
genes and leukemia_type.
Go to the “Classify” tab again. Classify using IBk and K = 5.
Question 7. What is the % correctly classified instances using just the
first five genes?
Re-load the dataset with “Open file…” to get all 150 genes back.
We’ll now find five genes that are “informative” according to some metric. Choose the
“Select attributes” tab at the top of the screen. Under “Attribute Evaluator”, choose the
algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start. The
output shows the genes ranked by information gain, which is one possible metric for
measuring the “goodness” of a partition. The genes at the top of the list are the most
“informative” as attributes for classification.
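As a rough illustration of the metric, information gain is the reduction in class entropy obtained by splitting the samples on an attribute. The sketch below assumes the attribute has already been reduced to discrete values (e.g. “high”/“low”); how a tool discretizes numeric expression values is not modeled. The function names and the toy data are made up.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example: one gene separates the classes perfectly, the other not at all.
toy_labels = ["ALL", "ALL", "AML", "AML"]
gain_perfect = information_gain(["high", "high", "low", "low"], toy_labels)
gain_useless = information_gain(["high", "low", "high", "low"], toy_labels)
```

The perfectly separating gene gets the maximum gain for two balanced classes (1 bit), while the uninformative gene gets a gain of 0.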
Question 8. What are the top five genes in this output?
Back under “Preprocess”, remove all attributes except the top five genes from the
previous question, in the same manner as described above, so that you are left with the
five genes from the previous question and the class attribute leukemia_type. Under
“Classify”, run the exact same algorithm as before (IBk, K=5).
Question 9. What is the % of correctly classified instances?
Part 1d: Unsupervised learning
K-means clustering
This part will use the same dataset that you used for Project 2 k-means clustering.
Recall that this dataset comprised expression profiles for 2467 genes across 79
experiments.
Download a Weka-compatible CSV file at
http://helix-web.stanford.edu/bmi214/assignment2/data/yeast.dat.csv
The rows are genes and the columns are experiments. The gene and experiment
names, as well as a link to the reference paper, are available in the Project 2 instructions.
In this particular file, we have added one additional column called “ribosomal” with value
Y or N, to indicate whether a gene is ribosomal.
Question 10. What is the mean value of expression of the experiment
labeled “alpha 21”?
Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose
SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster
mode” should be “Use training set”. Run it.
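For reference, the basic k-means loop can be sketched as below. This is a bare-bones version, assuming Euclidean distance and initial centroids drawn at random from the data points; it is not Weka's SimpleKMeans implementation. The function name k_means, its default arguments, and the toy points are made up.

```python
import math
import random

def k_means(points, k=2, seed=10, max_iter=100):
    """Return a list giving the cluster index assigned to each point."""
    rng = random.Random(seed)            # the seed fixes the random starting centroids
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

# Two well-separated toy groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_assignment = k_means(toy_points, k=2, seed=10)
cluster_sizes = sorted(toy_assignment.count(c) for c in range(2))
```

On clearly separated data like this toy set the result does not depend on the starting centroids; on real expression data, different starting points can converge to different partitions.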
Question 11. How many genes in each of the two clusters?
An optional exercise for those interested -- If you want to see which genes were
assigned to which clusters, right-click on the result in the Result list and choose
“Visualize cluster assignments”. Click the Save button and choose a location. This will
save a file in ARFF format (the Weka format), where the last value on each line will be a
cluster assignment. (See the Weka documentation for more on ARFF format).
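Once saved, such a file can be summarized with a short script. The sketch below only assumes what is stated above (data rows follow the @data line, and the last comma-separated value on each row is the cluster assignment); the sample text, attribute names, and cluster labels are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter

def count_cluster_assignments(arff_text):
    """Count how many data rows end in each cluster label."""
    counts = Counter()
    in_data = False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):     # skip blanks and ARFF comments
            continue
        if line.lower().startswith("@data"):
            in_data = True
            continue
        if in_data:
            counts[line.rsplit(",", 1)[-1]] += 1  # last value on the row
    return counts

# Made-up miniature file; a real saved file will have many more columns and rows.
sample_arff = """@relation toy_clustered
@attribute Instance_number numeric
@attribute alpha_0 numeric
@attribute Cluster {cluster0,cluster1}
@data
0,0.12,cluster0
1,-0.40,cluster1
2,0.33,cluster0
"""
cluster_counts = count_cluster_assignments(sample_arff)
```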
Question 12. Recall that K-means depends on a random starting seed.
Change this seed by clicking on the text “SimpleKMeans” and changing the
“seed” value to 15. How many genes are now in the two clusters?
Question 13. Try adjusting the seed value to various values (e.g. 1, 2, 5,
20, 100). How much do the numbers of genes in each cluster change?
What might this say about the genes and their assignments to clusters?
(I.e. are the assignments robust or weak?)
Question 14. Now under “Cluster mode”, choose “Classes to clusters
evaluation” and run. How many of the 121 ribosomal genes are assigned
to the same cluster?
Question 15. How many other genes are assigned to the same cluster as
the one that is predominantly ribosomal?
(If you were interested in what those genes were, you might save the ARFF output file
as described above, extract the genes in that cluster, and enter them in the GO Term
Finder, as discussed in Project 2, to get a feel for their functions. That’s left as an
optional exercise.)
Question 16. For completeness, we’ll now classify the yeast data. Under
the Classify tab, again choose the IBk algorithm, and use KNN=5. Use
cross-validation with Folds=5 and run.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =
Question 17. Run as in the previous question but using trivial classifier
ZeroR, which was discussed above.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =
Part 2:
Genotype/Phenotype
Part 2a: Introduction
Researching and answering these introductory questions will help you understand the
genotype-phenotype section below.
Question 18. What does diploid mean? How many copies of every
chromosome are there in one cell of a diploid organism? How many
copies of every gene?
Question 19. What is an “allele”, with respect to a single SNP? How
many alleles are theoretically possible for one SNP locus? Usually, how
many alternate forms actually arise in life for one SNP locus?
Question 20. Approximately how many SNPs are there in the human
genome, assuming a definition of “SNP” as 1% minor allele frequency?
(1% minor allele frequency means that 1% of the population has the
uncommon allele.)
Hint: It should be in millions.
Question 21. SNPs are one type of genetic variation – what other types
exist?
Question 22. What is penetrance, in terms of phenotype? What does 50%
penetrance mean? 100%? (three sentences or less)
Question 23. What is a haplotype? How is it related to linkage
disequilibrium? (three sentences or less)
Part 2b: Feature selection (aka attribute selection)
Background:
With genome sequencing cost drastically falling, we will soon be able to sequence
millions of people cheaply. With this wealth of new data, we will be able to associate
genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the
genes (or more specifically, the SNPs) that are found in conjunction with certain
phenotypes. We will be using a semi-synthetic dataset to explore this type of study.
The dataset can be found at
http://helix-web.stanford.edu/bmi214/assignment2/data/genotenureitus1.arff
Dataset description:
Many past genotype-phenotype association studies only examined a single gene
hypothesized to cause the phenotype (disease). In reality, there are complex
interactions of genes that cause disease. We will use machine learning to find such
complex associations, which is a very cutting-edge research area.
We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV,
this file is in Weka’s ARFF format, which is described at
http://www.cs.waikato.ac.nz/~ml/weka/arff.html.
This file contains genotype/phenotype data for 557 subjects (people). Each row in the
data section represents one subject (person).
The (comma-delimited) columns are the attributes of the subjects. The first column is
the person’s identifier.
The next 188 columns represent genotypes, which in this case are SNPs (single
nucleotide polymorphisms). The SNPs were identified from three regions in the
genome. The SNPs are named
RjSNPi
where j is the region (1,2, or 3), and i is the SNP number within that region. The values
each SNP can take are {11, 12, 22}.
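If you want to work with the SNP names programmatically, the region and SNP number can be pulled out with a regular expression. This is a small sketch based only on the RjSNPi pattern described above; the function name parse_snp_name and the example name “R2SNP17” are made up.

```python
import re

# R<region>SNP<number>, following the naming pattern described above.
SNP_NAME = re.compile(r"^R(\d+)SNP(\d+)$")

def parse_snp_name(name):
    """Return (region, snp_number) for a name of the form RjSNPi."""
    match = SNP_NAME.match(name)
    if match is None:
        raise ValueError(f"not an RjSNPi name: {name!r}")
    return int(match.group(1)), int(match.group(2))

region, snp_number = parse_snp_name("R2SNP17")  # hypothetical SNP name
```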
Recall that humans are diploid. In this file, a value of “11” for a subject’s SNP means
that both copies are allele 1 (where allele 1’s possible values are the nucleotides
{A,T,G,C}; the actual allele isn’t specified). “12” means that one was allele 1, and the
other was allele 2. “22” means that both were allele 2. (Allele 2 would be a nucleotide
other than allele 1.)
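The encoding is easy to work with in code: since each subject carries two copies, the number of “2” characters in a genotype is the number of copies of allele 2. The sketch below uses made-up genotypes, and the function names are hypothetical.

```python
def allele2_count(genotype):
    """Number of copies of allele 2 in a genotype coded as '11', '12', or '22'."""
    return genotype.count("2")

def allele2_frequency(genotypes):
    """Fraction of all chromosome copies (two per subject) that carry allele 2."""
    return sum(allele2_count(g) for g in genotypes) / (2 * len(genotypes))

# Four made-up subjects at a single SNP.
toy_genotypes = ["11", "12", "22", "12"]
toy_frequency = allele2_frequency(toy_genotypes)
```

Here the four subjects carry 0 + 1 + 2 + 1 = 4 copies of allele 2 out of 8 copies in total, so the frequency is 0.5.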
The last 20 columns in the file represent (artificially created) phenotypes. Some of the
phenotypes, for example, are
- “degree”, with possible values MD, MD/PhD, or PhD
- “npubs” (number of publications), with possible values 4 through 12.
- “gotgrants” (total grant dollars earned, in millions), with possible values 0 through
19.
Note that the genotype data are real, but the phenotype data are artificial. A more
lengthy (and far more humorous) description of the phenotypes can be found here:
http://helix-web.stanford.edu/bmi214/assignment2/data/pgrn_2005.pdf
This is an optional read; it is not necessary for understanding this assignment. This
document and the genotenureitis data file were created for a recent pharmacogenetics
conference.
Data filtering
Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names.
Examine the distributions for the attribute “irep” by clicking on it.
Question 24. How many class labels exist for the attribute irep?
The attribute irep is not useful to us; we will remove it and other useless fields with a
filter. Under “Filter”, choose unsupervised->attribute->RemoveUseless, and click Apply.
This will remove the “irep” attribute and the “ID” attribute (which actually appeared twice –
attribute number 1 and 192). You can undo operations like this one by clicking the Undo
button at the top.
If you don’t apply this filter, you may get erroneous results in the following sections.
Genotype-genotype associations
First we’ll examine whether SNPs are associated with each other. Recall the discussion
in class about haplotypes and linkage disequilibrium.
Question 25. Under the “Associate” tab, run the default associator
(Apriori). Copy here the output under “Best rules found:” (This should be
nine SNP association pairs and one phenotype-SNP pair.)
Question 26. Within all but one SNP-SNP pair, the two SNPs have
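To help read association rules of the form “A=x ==> B=y”, the two standard quantities behind them are support (how often both sides occur together) and confidence (how often the right side holds when the left side does). The sketch below computes both for a single rule on made-up subjects; it does not reproduce the Apriori search itself, and the function name and SNP values are hypothetical.

```python
def rule_support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent ==> consequent.

    rows is a list of dicts mapping attribute name to value;
    antecedent and consequent are (attribute, value) pairs.
    """
    a_attr, a_val = antecedent
    c_attr, c_val = consequent
    n_antecedent = sum(row[a_attr] == a_val for row in rows)
    n_both = sum(row[a_attr] == a_val and row[c_attr] == c_val for row in rows)
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Four made-up subjects typed at two SNPs.
toy_rows = [
    {"R1SNP1": "11", "R1SNP2": "11"},
    {"R1SNP1": "11", "R1SNP2": "11"},
    {"R1SNP1": "11", "R1SNP2": "12"},
    {"R1SNP1": "22", "R1SNP2": "22"},
]
support, confidence = rule_support_confidence(
    toy_rows, ("R1SNP1", "11"), ("R1SNP2", "11")
)
```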
something in common. What is it, and what does it mean biologically?
Hint: look at the SNP names and the explanation of SNP names above.
(3 sentences or less.)
Attribute selection
We now perform attribute selection to find which SNPs best predict some of the
phenotypes.
Phenotype 1: gotgrants
The first phenotype we’ll investigate is “gotgrants”. We want to know whether there are
any SNPs related to the amount of grant money earned.
We disclose that the first phenotype was created artificially as a simple linear function of
the penetrance of a single SNP. You will try to discover what this SNP was.
Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes”
tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use
cross-validation with Folds=5. Run.
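As background on the evaluator, gain ratio is commonly defined as information gain divided by the entropy of the attribute's own value distribution (the “split information”), which penalizes attributes that split the data into many small groups. The sketch below follows that textbook definition and treats the phenotype as a nominal label; it is not taken from Weka's source, and the function names and toy data are made up.

```python
from collections import Counter
import math

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of values about labels, divided by the entropy of values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Made-up genotypes and phenotype labels for six subjects.
ratio_informative = gain_ratio(
    ["11", "11", "12", "12", "22", "22"],
    ["low", "low", "mid", "mid", "high", "high"],
)
ratio_uninformative = gain_ratio(["11", "12", "11", "12"], ["a", "a", "b", "b"])
```

A SNP whose genotype fully determines the phenotype scores 1, and a SNP that carries no information about it scores 0.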
Question 27. What is the top SNP (genotype) in this feature selection
result?
Question 28. Run multiple Attribute Evaluators with multiple Search
Methods; is the top SNP the same in general?
(Note that some Evaluators require specific Search Methods; see the error log and/or
documentation if you get funny results.)
Question 29. Look at the top 80 or so SNPs. Do you see anything
interesting about them? What does this mean biologically? (3 sentences
or less.)
Phenotype 2: pctdrivel
The next phenotype is “pctdrivel”; again, we’ll investigate whether there are any SNPs
that are good predictors of this phenotype.
This phenotype was created artificially as a function of two other phenotypes. The
inclusion of other phenotypes represents the fact that environment can impact
phenotype, not just genetic makeup (cf. the nature vs. nurture debate).
Run multiple Attribute Evaluators with multiple Search Methods.
Question 30. What do you think are the two phenotypes affecting pctdrivel
and why? A well-justified answer will be accepted. (3 sentences or less.)
EXTRA CREDIT:
Phenotype 3: rivalside
This phenotype is very difficult to decipher; doing so will earn 5% extra credit.
This phenotype was created artificially as a more complex function of one other
phenotype and six SNPs.
What attributes do you think are affecting rivalside and why?