The document discusses data preprocessing techniques for unsupervised learning. It covers topics like handling missing values using k-nearest neighbor imputation, normalization to remove biases among samples, detecting and handling outliers, and exploring clusters in the data through hierarchical and k-means clustering. The goal of these techniques is to clean and massage raw data into a format suitable for machine learning analysis to discover hidden patterns.
Data preprocessing techniques
See my Paris applied psychology conference paper here
https://www.slideshare.net/jasonrodrigues/paris-conference-on-applied-psychology
or
https://prezi.com/view/KBP8JnekVH9LkLOiKY3w/
Data preprocessing techniques are applied before mining. These can improve the overall quality of the patterns mined and the time required for the actual mining.
Some of the most important data preprocessing steps needed before applying a data mining algorithm to any data set are described in these slides.
This presentation gives an idea of Data Preprocessing in the field of Data Mining. Images, examples and other materials are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
Basics of Data Analysis in Bioinformatics, by Elena Sügis
Presentation gives introduction to the Basics of Data Analysis in Bioinformatics.
The following topics are covered:
Data acquisition
Data summary (selecting the needed columns/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
Cluster annotations
Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. Due to the rapid expansion of the amount of data and data sources available today, storing and organizing large quantities of data for analysis is becoming increasingly necessary.
Applications of Artificial Intelligence in Higher Education: Opportunities and Challenges, by Elena Sügis
Artificial intelligence has affected almost every aspect of modern human life. To be successful and competitive in your field, and to contribute to growing the added value of your organization, you need to understand modern technologies and the opportunities for applying them in your line of work. Higher education is one field that offers great potential for applying artificial intelligence technologies. The talk gives an overview of the opportunities and challenges that adopting artificial intelligence could bring to higher education.
Conference „Õppejõult õppejõule 2021: õppimise ja õpetamise ruumid" (From Lecturer to Lecturer 2021: Spaces of Learning and Teaching)
I talked about why it is great to be a scientist, from the perspective of my own field, bioinformatics. In the talk I shared a scheme people can use to choose the ideal job for themselves, one that meets their expectations and stays interesting in the long run.
The scheme is very simple and consists of three parts, "what I want to do", "what I can do" and "what needs to be done", each described by a set of questions. You answer the questions and look for the largest overlap of the three parts. That overlap is the description of the ideal job.
Practice discovering biological knowledge using a network approach, by Elena Sügis
This practice session gives an overview of how to analyze biological data using a network approach. It covers network topology, data integration, differential expression, network visualization, functional enrichment analysis and retrieving data from external sources. Primarily the Cytoscape software is used for this practice session.
The presentation was meant to explain to first-year bachelor students who bioinformaticians are, what they do and why it is cool.
The presentation was made in the frame of the course Introduction to Informatics (Sissejuhatus informaatikasse 2016/17 sügis) at the Institute of Computer Science, University of Tartu.
Slides contain information about why bioinformatics appeared, who bioinformaticians are, what they do, and what kind of cool applications and challenges there are in bioinformatics.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23..., by John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
18. Missing Values
Origins:
• Malfunctioning measurement equipment
• Very low intensity signal
• Deleted due to inconsistency with other recorded data
• Data removed or not entered by mistake
19. Missing Values
How to deal with them:
• Filter out
• Replace missing values by 0
• Replace by the mean or median value
• K nearest neighbor imputation (KNN imputation)
• Expectation-Maximization (EM) based imputation
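The simpler replacement strategies above can be sketched in a few lines of pure Python; the helper name `impute` and the use of `None` as the missing-value marker are my own choices, not from the slides:

```python
import statistics

def impute(rows, strategy="mean"):
    """Column-wise imputation; None marks a missing value.
    Strategies: 'zero', 'mean' or 'median', as on the slide."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        if strategy == "zero":
            fills.append(0.0)
        elif strategy == "median":
            fills.append(statistics.median(observed))
        else:
            fills.append(statistics.mean(observed))
    # replace each missing entry with its column's fill value
    return [[v if v is not None else fills[j] for j, v in enumerate(row)]
            for row in rows]

data = [[1.0, None, 3.0],
        [2.0, 4.0, None],
        [3.0, 6.0, 9.0]]
print(impute(data, "mean"))
# → [[1.0, 5.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
```

Filtering out is simply dropping any row (or column) that contains a `None`.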
21. KNN
• We are given a gene expression matrix M
• Let X = (x1, x2, …, xi, …, xn) be a vector in the matrix M with a missing value xi at dimension i
• Find in the gene expression data matrix the vectors X1, X2, …, Xk that are the k closest vectors to X in M (with a chosen distance measure) among the vectors that do not have a missing value at dimension i
• Replace the missing value xi with the mean (or median) of X1i, X2i, …, Xki, i.e., the mean (or median) of the values at dimension i of the vectors X1, X2, …, Xk
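A rough Python sketch of the KNN imputation recipe above; the function name and the toy matrix are illustrative, and computing distances only over dimensions observed in both rows is one common convention, not something the slide specifies:

```python
import math

def knn_impute(M, row, dim, k=2):
    """Fill the missing value M[row][dim] with the mean of that dimension
    over the k rows closest to M[row] (Euclidean distance over dimensions
    observed in both rows). None marks a missing value."""
    target = M[row]
    candidates = []
    for i, other in enumerate(M):
        if i == row or other[dim] is None:
            continue  # a neighbour must have a value at the missing dimension
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(target, other)
                          if a is not None and b is not None))
        candidates.append((d, other[dim]))
    candidates.sort()                      # closest vectors first
    nearest = [v for _, v in candidates[:k]]
    M[row][dim] = sum(nearest) / len(nearest)
    return M[row][dim]

M = [[1.0, None, 1.0],   # X, with a missing value at dimension 1
     [1.0, 2.0, 1.0],
     [1.1, 2.2, 1.0],
     [9.0, 8.0, 9.0]]
print(knn_impute(M, 0, 1, k=2))  # → 2.1 (mean of 2.0 and 2.2)
```

Production code would use an optimized implementation such as scikit-learn's `KNNImputer` rather than this quadratic scan.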
26. Normalization & Standardization
Objective: adjust measurements so that they can be appropriately compared among samples.
Key ideas:
• Remove technological biases
• Make samples comparable
Methods:
• Z-scores (centering and scaling)
• Logarithmization
• Quantile normalization
• Linear model based normalization
27. Z-scores
Centering a variable is subtracting the mean of the variable from each data point, so that the new variable's mean is 0. Scaling a variable is multiplying each data point by a constant in order to alter the range of the data.
z = (x − µ) / σ
where:
µ is the mean of the population,
σ is the standard deviation of the population.
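The centering-and-scaling step can be checked in a few lines (standard library only; the helper name `z_scores` is mine):

```python
import statistics

def z_scores(xs):
    """Centre by the mean and scale by the standard deviation:
    z = (x - mu) / sigma."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)   # population standard deviation, as in the formula
    return [(x - mu) / sigma for x in xs]

z = z_scores([2.0, 4.0, 6.0])
# after the transform the mean is 0 and the (population) std is 1
```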
28. Principal Component Analysis
Transforms the data by a linear projection onto a lower-dimensional space that preserves as much data variation as possible.
35. Batch Effects
Batch effects are technical sources of variation that have been added to the samples during handling. They are unrelated to the biological or scientific variables in a study.
Measurements are affected by:
• Laboratory conditions
• Reagent lots
• Personnel differences
Major problem: batch effects might be correlated with an outcome of interest and lead to incorrect conclusions.
36. Fighting The Batch Effects
Experimental design solutions:
• Shorter experiment time
• Samples distributed equally between multiple laboratories and across different processing times, etc.
• Providing info about changes in personnel, reagents, storage and laboratories
Statistical solutions:
• ComBat
• SVA (surrogate variable analysis; SVD + linear models)
• PAMR (mean-centering)
• DWD (distance-weighted discrimination, based on SVM)
• Ratio_G (geometric ratio-based)
J.T. Leek, Nature Reviews Genetics 11, 733-739 (October 2010)
Chao Chen, PLoS ONE, 2011
43. "If you torture the data long enough, it will confess to anything."
Ronald Coase, economist, Nobel Prize winner
44. What is cluster analysis?
Clustering is finding groups of objects such that objects in the same group are similar (or related) to each other and different from (or unrelated to) the objects in other groups.
45. Properties
• Classes/labels for each instance are derived only from the data
• For that reason, cluster analysis is referred to as unsupervised classification
46. Why cluster biological data?
• Intuition building: finding hidden internal structure of the high-dimensional data
• Hypothesis generation: finding and characterizing similar groups of objects in the data
• Knowledge discovery in data: e.g. underlying rules, recurring patterns, topics, etc.
• Summarizing / compressing large data
• Data visualization
55. Fuzzy vs Non-Fuzzy
• Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)
• Non-fuzzy: each object belongs to exactly one cluster
57. Hierarchical clustering
• Each subtree corresponds to a cluster
• Height of branching shows distance
58. Hierarchical clustering (0)
Algorithm for agglomerative hierarchical clustering: join the two closest objects.
59. Hierarchical clustering (1)
Join the two closest objects.
60. Hierarchical clustering (2)
Keep joining the closest pairs.
65. Hierarchical clustering (10)
After 10 steps we have 4 clusters left.
Q: Which clusters do we merge next?
66. Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
67. Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
• Complete linkage (MAX)
68. Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
• Complete linkage (MAX)
• Average linkage (weighted or unweighted)
• Ward's method
• ...
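The agglomerative procedure sketched across these slides can be written down directly. A naive single-linkage (MIN) version on toy 1-D values (the function name and data are illustrative; real tools such as R's `hclust` or SciPy's `scipy.cluster.hierarchy` are far more efficient):

```python
def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single linkage (MIN): the
    distance between two clusters is the distance of their closest members."""
    clusters = [[p] for p in points]        # start: every object is its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                      # join the two closest clusters
        clusters[i] += clusters.pop(j)
    return clusters

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# → [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Swapping `min` for `max` in the inner loop gives complete linkage (MAX); an average over all pairs gives average linkage.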
69. Hierarchical clustering (11)
In this example and at this stage we have the same result as in partitional clustering.
70. Hierarchical clustering (12)
In the final step the two remaining clusters are joined into a single cluster.
71. Hierarchical clustering (13)
In the final step the two remaining clusters are joined into a single cluster.
72. Examples of Hierarchical Clustering in Bioinformatics
• Gene expression clustering
• Phylogeny
73. K-means clustering
• Partitional, non-fuzzy
• Partitions the data into K clusters
• K is given by the user
Algorithm:
• Choose K initial centers for the clusters
• Assign each object to its closest center
• Recalculate cluster centers
• Repeat until convergence
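The four algorithm steps above can be sketched on toy 1-D data (this is Lloyd's algorithm; the function name and the exact-equality convergence test are my own simplifications):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means on 1-D data: assign each point to its closest centre,
    recompute each centre as the mean of its cluster, repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # step 1: choose K initial centres
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                     # step 2: assign to closest centre
            groups[min(range(k), key=lambda c: abs(p - centres[c]))].append(p)
        # step 3: recalculate centres (keep the old one if a cluster empties)
        new = [sum(g) / len(g) if g else centres[c] for c, g in enumerate(groups)]
        if new == centres:                   # step 4: repeat until convergence
            break
        centres = new
    return sorted(centres)

print(kmeans([1.0, 1.2, 5.0, 5.1], 2))  # → [1.1, 5.05]
```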
80. K-means clustering summary
• One of the fastest clustering algorithms, and therefore very widely used
• Sensitive to the choice of initial centres (many algorithms exist to choose initial centres cleverly)
• Assumes that the mean can be calculated: can be used on vector data, but cannot be used on sequences (what is the mean of A and T?)
81. K-medoids clustering
• The same as K-means, except that the center is required to be an object
• Medoid: an object which has minimal total distance to all other objects in its cluster
• Can be used on more complex data, with any distance measure
• Slower than K-means
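Because K-medoids only needs a distance function, it can cluster sequences directly, e.g. with Hamming distance. A rough sketch (the naive first-K initialization, function names and toy sequences are all illustrative simplifications):

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def kmedoids(points, k, dist, iters=100):
    """K-medoids sketch: like K-means, but each centre (medoid) must be an
    actual object, so any distance measure works."""
    medoids = points[:k]                      # naive initialization
    clusters = {}
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for p in points:                      # assign to the closest medoid
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        # new medoid = the member with minimal total distance to its cluster
        new = [min(members, key=lambda c: sum(dist(c, o) for o in members))
               for members in clusters.values()]
        if set(new) == set(medoids):          # converged
            break
        medoids = new
    return clusters

print(kmedoids(["AAAA", "TTTT", "AAAT", "TTTA"], 2, hamming))
# → {'AAAA': ['AAAA', 'AAAT'], 'TTTT': ['TTTT', 'TTTA']}
```

The nested distance scans are what make K-medoids slower than K-means.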
91. Examples of K-means and K-medoids in Bioinformatics
• Gene expression clustering
• Sequence clustering
92. Distance measures
Distance of vectors x = (x1, …, xn) and y = (y1, …, yn):
• Euclidean distance: d(x, y) = √( Σi=1..n (xi − yi)² )
• Manhattan distance: d(x, y) = Σi=1..n |xi − yi|
• Correlation distance: d(x, y) = 1 − r(x, y), where r(x, y) is the Pearson correlation coefficient
Distance of sequences ACCTTG and TACCTG:
• Hamming distance (number of mismatching positions):
ACCTTG
TACCTG => 3
• Levenshtein distance (minimum number of edits), e.g. via the alignment:
.ACCTTG
TACC.TG => 2
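The distance measures on this slide, sketched in plain Python (the function names are mine; the two sequence examples match the slide):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - r(x, y), where r is the Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

def hamming(x, y):
    """Number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y):
    """Minimum number of insertions, deletions and substitutions,
    computed row by row with dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution / match
        prev = cur
    return prev[-1]

print(hamming("ACCTTG", "TACCTG"))      # → 3
print(levenshtein("ACCTTG", "TACCTG"))  # → 2
```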
95. Put it into words & Discover
96. Gene ontology
What the found genes are doing:
• Molecular Function: elemental activity or task
• Biological Process: broad objective or goal
• Cellular Component: location or complex
97. Functional annotations & Significance
The statistical significance of having drawn a sample containing a specific number k of successes out of n total draws from a population of size N containing K successes, i.e., the hypergeometric test.
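This is the upper tail of the hypergeometric distribution, the usual enrichment p-value for a functional annotation. A sketch using only the standard library (the function name and the toy numbers are illustrative):

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): the chance of drawing at least k successes in n draws
    without replacement from a population of N items containing K successes."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

# Toy example: 20 genes overall, 5 carry an annotation,
# we drew 5 genes and saw the annotation on 3 of them.
p = hypergeom_pvalue(3, 5, 5, 20)
```

In practice one would reach for `scipy.stats.hypergeom` or a GO enrichment tool, which also handle multiple-testing correction.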