This document provides an introduction to machine learning and its applications in genomics and biology. It discusses how biology and genomics data have become "big data" due to technological advances in sequencing and data generation. Machine learning is well-suited for analyzing these large, multidimensional datasets and addressing complex biological questions. The document outlines different machine learning approaches like supervised and unsupervised learning, and provides examples of real-world applications. R and Python are introduced as popular programming languages for machine learning.
Deep Learning Explained: The future of Artificial Intelligence and Smart Netw...Melanie Swan
This talk provides an overview of an important emerging artificial intelligence technology, deep learning neural networks. Deep learning is a branch of computer science focused on machine learning algorithms that model and make predictions about data. A key distinction is that deep learning is not merely a software program, but a new class of information technology that is changing the concept of the modern technology project by replacing hard-coded software with a capacity to learn and execute tasks. In the future, deep learning smart networks might comprise a global computational infrastructure tackling real-time data science problems such as global health monitoring, energy storage and transmission, and financial risk assessment.
Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning with real-world examples including applications in the field of genomics with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
BIG DATA AND MACHINE LEARNING
Big Data refers to collections of data so large in volume, and growing so rapidly over time, that traditional data management tools cannot store or process them efficiently.
This is a very simple introduction to clustering with some real-world examples. At the end of the lecture I use the Stack Overflow API to test some clustering. I also wanted to try Facebook, but there were some problems with its API.
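The clustering idea itself fits in a few lines; below is a minimal, hedged 1-D k-means sketch in Python (a toy illustration with made-up numbers, not the Stack Overflow demo from the lecture):

```python
# Minimal 1-D k-means: assign each point to its nearest center, then
# move each center to the mean of its assigned points, and repeat.
def kmeans_1d(points, iters=10):
    # Deterministic toy initialization for k=2: start at the extremes.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.7])
```

On this toy data the two centers settle near the two obvious groups around 1 and 9.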
Random Forest Classifier in Machine Learning | Palin AnalyticsPalin analytics
Random Forest is a supervised ensemble learning algorithm. Ensemble algorithms combine multiple algorithms of the same or different kinds to classify objects.
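The ensemble idea can be sketched in a few lines of Python. The following hedged toy trains simple threshold "stumps" on bootstrap resamples of 1-D data and classifies by majority vote; it illustrates bagging plus voting, not a real random forest implementation, and the data is made up:

```python
import random

# Toy random-forest-style ensemble: train threshold "stumps" on
# bootstrap resamples of the data, then classify by majority vote.
def train_stump(sample):
    best_t, best_acc = None, -1.0
    for t, _ in sample:  # candidate thresholds taken from sampled points
        acc = sum((x > t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def forest_predict(stumps, x):
    votes = sum(x > t for t in stumps)
    return votes * 2 > len(stumps)  # majority vote of the ensemble

random.seed(0)
data = [(0.1, False), (0.3, False), (0.4, False),
        (1.1, True), (1.3, True), (1.5, True)]
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
preds = [forest_predict(stumps, x) for x, _ in data]
```

Individual stumps can be wrong on borderline points, but the majority vote is usually correct, which is the point of the ensemble.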
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
This is the documentation of a study meeting in our lab.
The book is "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and this covers Chapter 8.
And then there were ... Large Language ModelsLeon Dohmen
It is not often, even in the ICT world, that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity"?
During a keynote on April 3 and 13, 2023, Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. Piek Th.J.M. Vossen, PhD, is Full Professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC) at VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek will elaborate on the basic principles behind Large Language Models and how they are used as a basis for Deep Learning in which they are fine-tuned for specific tasks. He will also discuss a specific variant GPT that underlies ChatGPT. It covers what ChatGPT can and cannot do, what it is good for and what the risks are.
An Introduction to Generative AI - May 18, 2023CoriFaklaris1
For this plenary talk at the Charlotte AI Institute for Smarter Learning, Dr. Cori Faklaris introduces her fellow college educators to the exciting world of generative AI tools. She gives a high-level overview of the generative AI landscape and how these tools use machine learning algorithms to generate creative content such as music, art, and text. She then shares examples of generative AI tools and demonstrates how she has used some of them to enhance teaching and learning in the classroom and to boost her productivity in other areas of academic life.
Many powerful Machine Learning algorithms are based on graphs, e.g., Page Rank (Pregel), Recommendation Engines (collaborative filtering), text summarization, and other NLP tasks. Also, the recent developments with Graph Neural Networks connect the worlds of Graphs and Machine Learning even further.
Considering data pre-processing and feature engineering which are both vital tasks in Machine Learning Pipelines extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
These slides were presented at a meetup in Kansas City by Bahador Khaleghi of H2O.ai.
More details can be viewed here: https://www.meetup.com/Kansas-City-Artificial-Intelligence-Deep-Learning/events/265662978/
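Since the session names PageRank as a graph-based algorithm, here is a hedged, minimal power-iteration sketch in Python on a toy three-node graph (assuming no dangling nodes; the graph and node names are illustrative):

```python
# Power-iteration PageRank on a tiny directed graph.
# `graph` maps each node to the list of nodes it links to.
def pagerank(graph, damping=0.85, iters=50):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        # Each node keeps a base share and receives a damped share
        # of rank from every node that links to it.
        new = {node: (1.0 - damping) / n for node in graph}
        for node, outlinks in graph.items():
            share = damping * rank[node] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(graph)
```

Node C collects links from both A and B, so it ends up with the highest score; the ranks always sum to one.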
Can we use data to train machine learning models and perform statistical analysis without putting private data at risk? Tools and techniques such as Federated Learning, Differential Privacy, and Homomorphic Encryption enable safer work on the data.
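As a hedged illustration of one of these techniques, differential privacy's Laplace mechanism can be sketched in a few lines of Python (toy code, not a production DP library; the dataset, query, and epsilon value are made up):

```python
import math
import random

# Laplace mechanism sketch: answer a counting query with noise drawn
# from Laplace(0, sensitivity/epsilon). A count query has sensitivity 1,
# since adding or removing one record changes the count by at most 1.
def laplace_noise(scale, rng):
    u = rng.random() - 0.5          # uniform in [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sample

def private_count(records, predicate, epsilon, rng):
    true_count = sum(predicate(r) for r in records)
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 47, 31]
noisy = private_count(ages, lambda a: a > 30, epsilon=0.5, rng=rng)
```

Any single release is noisy, but the noise is unbiased: averaged over many releases, the answers center on the true count, while smaller epsilon means more noise and stronger privacy.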
Our fall 12-Week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meetup and learn how easily you can use R for advanced machine learning. In this meetup, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will join us remotely through Google Hangouts.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com.
Pre-requisite (if any): R / Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XGBoost Demo
Reference:
https://github.com/dmlc/xgboost
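XGBoost's core idea, gradient boosting, can be sketched independently of the package. The following hedged Python toy (the talk itself uses the R package; the data here is made up) fits depth-1 regression stumps to the residuals of the current ensemble under squared loss:

```python
# Toy gradient boosting for squared loss on 1-D data: each round fits a
# depth-1 regression "stump" to the residuals of the current ensemble.
def fit_stump(xs, residuals):
    best = None
    for t in xs:  # candidate split points taken from the data itself
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def boost(xs, ys, rounds=50, lr=0.1):
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + lr * (lmean if x <= t else rmean)
                 for x, p in zip(xs, preds)]
    return preds

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
preds = boost(xs, ys)
```

Each round reduces the remaining residual, so after enough rounds the training error is small; real XGBoost adds regularization, deeper trees, and a second-order loss approximation on top of this scheme.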
Week 1 lecture for High School Bioinformatics course; covers why we need to use computers in biology, what bioinformatics/computational biology is, an introduction to machine learning, and examples from current research
Frankie Rybicki slide set for Deep Learning in Radiology / MedicineFrank Rybicki
These are my #AI slides for medical deep learning using #radiology and medical imaging examples. Please use them & modify to teach your own group about medical AI.
Keynote presented at the Phenotype Foundation first annual meeting.
Describes data sharing, data annotation, and the need for further development of tools, ontologies, and ontology mappings.
Amsterdam, January 18, 2016
The Uneven Future of Evidence-Based MedicineIda Sim
An Apple ResearchKit study enrolled 22,000 people in five days. A study claims that Twitter can be used to identify depressed patients. A computer program crunches genomic data, the published literature, and electronic health record data to guide cancer treatment. The pace, the data sources, and the methods for generating medical evidence are changing radically. What will — what should — evidence-based medicine look like in a faster, personalized, data-dense tomorrow?
- Presented as the 3rd Annual Cochrane Lecture, October 2015 in Vienna, Austria.
Towards automated phenotypic cell profiling with high-content imagingOla Spjuth
Presentation by Ola Spjuth (Uppsala University and Scaleout) at the Chemical Biology Seminar Series, February 6th, at Karolinska Institutet and Science for Life Laboratory, Stockholm, Sweden.
ABSTRACT
Phenotypic profiling of cells with high-content imaging is emerging as an important methodology with high predictive power. The true power of these methods comes when they are integrated into automated, robotized systems that can run continuously rather than being restricted to batch analysis. One of the main challenges then becomes how to manage and continuously analyze the large amounts of data produced. In this talk I will present our efforts to establish an automated lab for cell profiling of drugs using multiplexed fluorescence imaging (Cell Painting). I will describe our computational and lab infrastructure as well as the systems, tools, and methods we are developing to sustain continuous profiling of cells and continuous AI modeling. A key objective in the group is improving screening and toxicity assessment, but also exploring predictions of mechanisms and pathways. The long-term goal is to build a closed-loop system where results from analyses are used by an AI system to design the next round of experiments and iteratively improve the confidence in predictions. Research website: https://pharmb.io
Genome sharing projects around the world nijmegen oct 29 - 2015Fiona Nielsen
Genome sharing projects across the world
Did you ever wonder what happened to the exponential increase in genome sequencing data? It is out there around the world and a lot of it is consented for research use. This means that if you just know where to find the data, you can potentially analyse gigabytes of data to power your research.
In this talk Fiona will present community genome initiatives, the genome sharing projects across the world, how you can benefit from this wealth of data in your work, and how you can boost your academic career by sharing and collaboration.
by Fiona Nielsen, Founder and CEO of DNAdigest and Repositive
With a background in software development, Fiona pursued her career in bioinformatics research at Radboud University Nijmegen. Now a scientist turned entrepreneur, Fiona founded DNAdigest and its social enterprise spin-out, Repositive Ltd. Both the charity and the company focus on efficient and ethical sharing of genetic data for research to accelerate diagnostics and cures for genetic diseases.
Using Bioinformatics Data to inform Therapeutics discovery and developmentEleanor Howe
Diamond Age Data Science and Zafgen, Inc, co-present on their work in using bioinformatics data effectively in the context of a small therapeutics company.
Eleanor Howe, PhD, CEO of Diamond Age, presents on the different types of computational biologist, the characteristics of a good bioinformatics team, and the pluses and minuses of using deep learning/AI in a discovery biology context.
Huseyin Mehmet, VP of Discovery Research at Zafgen, describes his team's work with Diamond Age and how it uses their capabilities to inform Zafgen's drug development. He discusses the need of biotech companies for a diverse, experienced bioinformatics team.
The MD Anderson / IBM Watson Announcement: What does it mean for machine lear...Health Catalyst
It’s been over six years since IBM’s Watson amazed all of us on Jeopardy, but it has yet to deliver similar breakthroughs in healthcare. The headline in last week’s Forbes article read, “MD Anderson Benches IBM Watson In Setback For Artificial Intelligence In Medicine.” Is it really a setback for the entire industry or not? Health Catalyst’s EVP for Product Development, Dale Sanders, believes that the challenges are unique to IBM’s machine learning strategy in healthcare. If they adjust that strategy and better manage expectations about what’s possible for machine learning in medicine, the future will be brighter for Watson, their clients, and AI in healthcare in general. Watson’s success is good for all of us, but its failure is bad for all of us, too.
Join Dale as he discusses:
The good news: Machine learning technology is accelerating at a rate beyond Moore’s Law. Dale believes that machine learning algorithms and models are doubling in capability every six months.
The bad news: The healthcare data ecosystem is not nearly as rich as many would believe, and certainly not as rich as that used to train Watson for Jeopardy. Without high-volume, high-quality data, Watson’s potential and the constant advances in machine learning algorithms will hit a glass ceiling in healthcare.
The best news: By adjusting strategy and expectations, there are still plenty of opportunities to do great things with machine learning by using the current data content in healthcare, while we build out the volume and breadth of data we need to truly understand the patient at the center of the healthcare picture… and you don’t need an army of PhD data scientists to do it.
Multi-source connectivity as the driver of solar wind variability in the heliosphereSérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is the branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
It is an analytical method that measures the amount of light absorbed by the analyte.
Richard's adventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Introduction:
RNA interference (RNAi), or Post-Transcriptional Gene Silencing (PTGS), is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals, and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the worm's development.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that these transcripts must cause the silencing through RNA-RNA interactions.
Types of RNAi (non-coding RNAs):
miRNA: 23-25 nt long; trans-acting; binds its target mRNA with mismatches; inhibits translation.
siRNA: 21 nt long; cis-acting; binds its target mRNA through a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): 25-36 nt long; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers degradation of the target mRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute protein (an endonuclease), which cleaves the target mRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1. PAZ (PIWI/Argonaute/Zwille): recognizes the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity).
MiRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they play a key role in regulating gene expression.
18.-21. Cells, Tissues, & Diseases; Functional Annotations
Image from encodeproject.org
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
Big Data
23. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2,000+ metadata attributes
• >2.5 petabytes of data
1 Petabyte of Data =
20 million four-drawer filing cabinets filled with text, or
13.3 years of HD-TV video, or
~7 billion Facebook photos, or
~2,000 years of MP3 songs played back-to-back
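A quick sanity check on that last figure (an illustration, not from the slides), assuming a ~128 kbps MP3, i.e., roughly 1 MB of audio per minute:

```python
# Back-of-the-envelope check: how long would 1 PB of MP3 audio take to play?
petabyte_bytes = 1e15
bytes_per_minute = 1e6              # assumption: ~1 MB of audio per minute at ~128 kbps
minutes = petabyte_bytes / bytes_per_minute
years = minutes / (60 * 24 * 365)   # minutes -> years
# years comes out to roughly 1,900, consistent with the "~2,000 years" figure
```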
29. Cells, Tissues, & Diseases Functional Annotations
Images from encodeproject.org and xorlogics.com
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building
Complex problems + Big Data -> Machine Learning!
• Allows us to better utilize these increasingly large data sets to capture their inherent structure
• Learning algorithms by training with data
33. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data ('big data')
Traditional Programming: Computer + Data + Program -> Output
(e.g., Data = [2,3], Program = '+', Output = 5)
Machine Learning: Computer + Data + Output -> Program
(e.g., Data = [2,3], Output = 5, learned Program = '+')
• Our goal isn't to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
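The [2,3] -> 5 contrast above can be sketched in a few lines of Python (an illustration, not from the talk): instead of hard-coding the '+' program, we learn it from (data, output) pairs as a linear model whose weights converge to [1, 1], i.e., addition.

```python
# Traditional programming: we supply the program ("+") and get the output.
def traditional(data):
    return data[0] + data[1]

# Machine learning: we supply (data, output) pairs and learn the "program",
# here a linear model y = w1*x1 + w2*x2, fit by stochastic gradient descent.
examples = [([2, 3], 5), ([1, 4], 5), ([7, 2], 9), ([0, 6], 6), ([3, 3], 6)]

w = [0.0, 0.0]
lr = 0.01
for _ in range(2000):
    for (x1, x2), y in examples:
        pred = w[0] * x1 + w[1] * x2
        err = pred - y
        w[0] -= lr * err * x1   # gradient step on each weight
        w[1] -= lr * err * x2

# The learned weights approximate [1, 1]: the model has "learned" addition.
```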
46. The Rise of Machine Learning
• Hardware Advances
  • Extreme-performance hardware (e.g., application-specific integrated circuits)
  • Smaller, cheaper hardware (Moore's law)
  • Cloud computing (e.g., AWS)
• Software Advances
  • New machine learning algorithms, including deep learning and reinforcement learning
• Data Advances
  • High-performance, less expensive sensors & data generation
  • e.g., wearables, next-gen sequencing, social media
We often use R, but Python is also a great choice!
• R tends to be favored by statisticians and academics (for research)
• Python tends to be favored by engineers (with production workflows)
47. The R Programming Language
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance computing packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, and memory allocation can be an issue
53. Iris Dataset in R
Fisher's/Anderson's iris data set: measurements (cm) of sepal length and width, petal length and width, and species (Iris setosa, versicolor, and virginica), i.e., 5 features (variables) for 150 flowers (observations)
92. Iris Data: Adding Regularization
• Model building with a large # of features/variables for a moderate number of observations can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables that are linearly dependent (redundant)
• This results in FEATURE SELECTION
• Example methods of regression with regularization: ridge, elastic net, LASSO
111. Iris Data: Adding Regularization (LASSO)
• Model building with a large # of features for a moderate number of samples can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection
Full model:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b
LASSO shrinks C and D to 0, leaving:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
Fitted model: Species(setosa) ~ -2.36*Petal.Length + 1.58*Sepal.Width + 5.96
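The shrink-to-zero behavior above can be reproduced with a minimal coordinate-descent LASSO (a sketch on toy data, not the talk's R code; the data, lambda value, and function names are illustrative). The second feature is an exact copy of the first, i.e., linearly dependent, and the L1 penalty drives its coefficient to zero.

```python
# Toy data: feature 2 duplicates feature 1 (redundant); y depends on it only once.
x1 = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
X = [[v, v] for v in x1]
y = [2.0 * v for v in x1]

def soft_threshold(rho, lam):
    """The LASSO shrinkage operator."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=100):
    """Minimize (1/2n)*sum((y - X@b)^2) + lam*sum(|b_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # correlation of feature j with the partial residual (feature j held out)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z
    return b

b = lasso_cd(X, y, lam=0.5)
# b[0] stays close to the true coefficient 2; the redundant b[1] is shrunk to zero
```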
116. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models
Tree learned on the iris data (class counts setosa/versicolor/virginica at each leaf):
Petal.Length < 2.35 cm -> Setosa (40/0/0)
otherwise: Petal.Width < 1.65 cm -> Versicolor (0/40/12), else Virginica (0/0/28)
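The tree above translates directly into if/else rules (a sketch using the split thresholds shown on the slides; the function name is illustrative):

```python
# The iris decision tree from the slides as plain if/else rules.
# Thresholds (2.35 cm, 1.65 cm) are the splits shown on the slide.
def classify_iris(petal_length, petal_width):
    if petal_length < 2.35:
        return "setosa"        # leaf purity on slide: (40/0/0)
    if petal_width < 1.65:
        return "versicolor"    # (0/40/12)
    return "virginica"         # (0/0/28)

# e.g., a typical setosa flower has a short petal:
label = classify_iris(petal_length=1.4, petal_width=0.2)
```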
122. Deep Learning (i.e., neural nets)
• Subfield of machine learning describing 'human-like AI'
• Algorithms are structured in layers to create artificial neural networks that learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of concepts, each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
  • need a lot more data to perform well
  • need more/better hardware
  • typically identify and extract features without human intervention
  • usually solve problems end-to-end instead of in parts
  • take a lot longer to train
  • are typically less interpretable
• Ex: deep learning to automate resume scoring
  • Scoring performance may be excellent (i.e., near human performance)
  • but it does not reveal why a particular applicant was given a score
  • Mathematically you can find out which nodes of the network were activated, but we don't know what those neurons were supposed to model or what the layers of neurons were doing collectively
  • Interpretation is difficult
'Neuron': inputs X1 and X2 -> Output (summation of inputs and activation with a sigmoid fxn)
124. Other Machine Learning Methods
• Neural nets
• Ensemble methods (e.g., bagging, boosting)
• Naive Bayes (based on prior probabilities)
• Hidden Markov models (a Bayesian network with hidden states)
• K nearest neighbors (instance-based learning: clustering!)
• Support vector machines (a discriminator defined by a separating hyperplane)
• Additional ensemble-method approaches (combining multiple models)
• And new methods coming out all the time…
Typical workflow: Raw Data -> Clean/Normalize Data -> split into Training Set and Test Set -> Build Model -> Test -> Tune Model -> Apply to New Data (validation cohort or model application)
Algorithm selection is an important step!
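The train/test split step of the workflow above can be sketched as follows (an illustration; the 30% fraction and function name are assumptions, not from the talk): hold out part of the observations so the model is evaluated on data it never saw during training.

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Shuffle the observations (to avoid ordering bias), then hold out a test set."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]   # (training set, test set)

data = list(range(100))                   # stand-in for 100 cleaned observations
train, test = train_test_split(data)
```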
144. Iris Data: Neural Nets
• Neural networks (NNs) emulate how the human brain works with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism in which training-set error is used to adjust the corresponding input weights, and this adjustment is propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but have a long training time, and models are susceptible to local-minimum traps (this can be mitigated by doing multiple rounds, which takes more time!)
'Neuron': inputs X1 and X2 -> Output (summation of inputs and activation with a sigmoid fxn)
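The single 'neuron' in the diagram (weighted summation of X1 and X2 followed by a sigmoid activation) can be written out directly; the weights and bias here are illustrative, not from the talk:

```python
import math

def sigmoid(z):
    """Squash the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w1=1.0, w2=1.0, bias=0.0):
    """One 'neuron': weighted summation of inputs X1, X2 plus a sigmoid activation."""
    return sigmoid(w1 * x1 + w2 * x2 + bias)

# With zero inputs and bias, the sigmoid sits at its midpoint, 0.5;
# strongly positive or negative inputs saturate toward 1 or 0.
```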