This document describes using statistical and machine-learning methods to analyze big data from Intel's data centers and classify computing jobs by expected runtime. It covers defining the problem, the available data on past jobs, exploring the runtime distribution, constructing classes with a Gaussian mixture model, and estimating the model parameters with the EM algorithm. The goal is to improve job scheduling by routing short and long jobs to different queues.
1. The big-data analytics challenge – combining statistical and algorithmic perspectives
Anat Reiner-Benaim
Department of Statistics, University of Haifa
IDC, May 14, 2015
2. Outline
Data science:
◦ Definition?
◦ Who needs it?
◦ The elements of data science
Analysis:
◦ Modeling
◦ Software
Examples:
◦ Scheduling – prediction of runtime
◦ Genetics – detection of rare events
3. What is Data Science?
From Wikipedia: "Data science is the study of the generalizable extraction of knowledge from data…"
4. More from Wikipedia:
"…builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing and high performance computing…
…goal: extracting meaning from data and creating data products…
…not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science."
5. Data Science – who needs it?
Anyone who has (big) data, e.g.:
Cellular industry – phones, apps, advertisers
Internet – search engines, social media, marketing, advertisers
Computer networks and server systems
Cyber security
Credit cards
Banks
Health care providers
Life science – genome, proteome…
TV and related
Weather forecast
6. The elements of data science
Store & preprocess ("big data technologies"):
• NoSQL database (e.g. Cassandra)
• DFS (Distributed File System) (e.g. Hadoop, Spark, GraphLab)
Dump to SQL:
• SQL database (e.g. MySQL, SAS-SQL)
Analyze ("big data analytics"):
• Apply sophisticated methods: statistical modeling, machine learning algorithms
7. Data Analysis – First, define the problem
◦ How can I decide that an item in a manufacturing process is faulty?
◦ What is the difference between the new machine and the old one?
◦ What are the factors that affect system load?
◦ How can I predict the memory/runtime of a program?
◦ How can I predict that a customer will churn?
◦ What is the chance that the phone/web user will click my advertisement?
◦ What is the chance that the current ATM user is committing fraud?
◦ What is the chance of snow this week?
9. Choosing models
Type of variables:
◦ Continuous, ordinal, categorical.
Statistical assumptions:
◦ Normality, equal variance, independence.
Missing data
Stability
10. Learning tools
Bootstrap
◦ Repeatedly fit the model on resampled data.
Bagging ("bootstrap aggregation")
◦ Combine bootstrap samples to prevent instability.
Boosting
◦ Combine a set of weak learners into a single strong learner.
Regularization
◦ Solve over-fitting by restriction (e.g. limit regression to linear or low-degree polynomial).
Utility/cost function
◦ Evaluate performance, compare models.
These are typically iterative procedures, combined with the modeling procedures; they help optimize the model and evaluate its performance.
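To make the bootstrap idea above concrete, here is a minimal R sketch (not from the deck; the runtimes are simulated) that resamples the data repeatedly to gauge the variability of a statistic:

```r
# Minimal bootstrap sketch: repeatedly recompute a statistic on
# resampled data (simulated runtimes; illustrative only).
set.seed(42)
x <- rexp(200, rate = 0.1)                  # hypothetical job runtimes

boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(boot_means)                              # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))       # percentile confidence interval
```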
11. More to consider – control statistical error due to large-scale analysis
Multiple statistical tests lead to an inflated statistical error. Control the FDR?
FDR = expected proportion of false findings (e.g. "features").
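Base R's p.adjust implements the Benjamini-Hochberg FDR adjustment; a small sketch with simulated p-values (an assumption, purely illustrative):

```r
# FDR control over many simultaneous tests via Benjamini-Hochberg.
set.seed(7)
p_null   <- runif(950)                  # tests with no real effect
p_signal <- rbeta(50, 0.5, 10)          # tests with a real effect
pvals    <- c(p_null, p_signal)

padj <- p.adjust(pvals, method = "BH")  # BH-adjusted p-values
sum(pvals < 0.05)                       # naive discoveries (inflated)
sum(padj  < 0.05)                       # discoveries at FDR 5%
```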
12. The R software
Open-source programming language and environment for statistical computing.
Widely used among statisticians for developing statistical software ("packages") and for data analysis.
Increasingly popular among all data professionals.
Advantages:
• Contains the most up-to-date statistical models and machine learning algorithms.
• Methods are based on research, compiled and documented.
• Contains Hadoop functions (package "rhdfs").
• Very convenient for plain programming, scripting, simulations, visualization.
• Friendly interface (e.g. RStudio).
The R project site
14. Example 1: Classification of Job Runtime at Intel
Joint work with: Anna Grabarnick, University of Haifa; Edi Shmueli, Intel
15. Job processing
[Diagram: users submit jobs to a job scheduler, which decides which server and which queue each job is assigned to.]
16. Job schedulers
Algorithms aimed at efficiently queuing and distributing jobs among servers, thereby improving system utilization.
Popular scheduling algorithms (e.g. backfilling) use information on how long the jobs are expected to run.
In serial job systems, scheduling performance can be improved by merely separating the short jobs from the long ones and assigning them to different queues in the system.
This helps reduce the likelihood that short jobs will be delayed behind long ones, and thus improves overall performance.
18. The problem
Main purpose: classify jobs into "short" and "long" durations.
Questions:
◦ How can the classes be defined?
◦ How can the jobs be classified?
19. Available data
Two traces obtained from one of Intel's data centers:
1. ~1 million jobs executed during a period of 10 consecutive days. Used for training.
2. ~755,000 jobs executed during a period of 7 consecutive days. Used for model validation.
Aside from runtime information, 9 categorical variables were available:
20. TABLE I. ROUGH GROUPING OF THE 9 CATEGORICAL VARIABLES

Group   # of variables   Relates to                       Example
A       3                Scheduling information           Resources requested by the job
B       2                Execution-specific information   Command line and arguments
C       4                Association information          Project and component

TABLE II. STATISTICS REGARDING THE CATEGORICAL VARIABLES

Variable   # of categories   # of missing (in training data)
A1         9                 0
A2         7                 0
A3         5                 0
B1         44                173
B2         22                184
C1         2                 0
C2         5                 239
C3         6                 184
C4         32                0
21. Analysis steps
Exploratory visualization of the data.
Class construction and characterization.
Classification:
◦ Choice of a classification model.
◦ Optimize the model.
◦ Validate the model.
26. Constructing classes by the mixture model
• The Gaussian (normal) mixture model has the form
  f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m),
  with mixing proportions \alpha_m, \sum_{m=1}^{M} \alpha_m = 1.
• Each Gaussian density has a mean \mu_m and covariance matrix \Sigma_m.
• The parameters are usually estimated by maximum likelihood using the EM algorithm.
27. Mixture distribution – parameter estimation
• The parameters are usually estimated by maximum likelihood using the EM algorithm.
• We model the runtime Y as a mixture of the two normal variables Y_1 \sim N(\mu_1, \sigma_1^2) and Y_2 \sim N(\mu_2, \sigma_2^2) (the "short" and "long" components). Y can be defined by
  Y = (1 - \Delta) \cdot Y_1 + \Delta \cdot Y_2,
  where \Delta \in \{0, 1\} with P(\Delta = 1) = \pi.
• Let \phi_\theta(x) denote the normal density with parameters \theta = (\mu, \sigma^2). Then the density of Y is
  g_Y(y) = (1 - \pi)\, \phi_{\theta_1}(y) + \pi\, \phi_{\theta_2}(y).
• We fit this model to our data by maximum likelihood. The parameters are
  \theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2).
• The log-likelihood based on N training cases is
  l(\theta; Z) = \sum_{i=1}^{N} \log\left[ (1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i) \right].
28. Parameter estimation – cont'd
• Direct maximization of l(\theta; Z) is quite difficult numerically. Instead, we consider unobserved latent variables \Delta_i taking values 0 or 1 as earlier: if \Delta_i = 1 then Y_i comes from distribution 2, otherwise it comes from distribution 1.
• Suppose we knew the values of the \Delta_i's. Then the log-likelihood would be
  l(\theta; Z, \Delta) = \sum_{i=1}^{N} \left[ (1 - \Delta_i) \log \phi_{\theta_1}(y_i) + \Delta_i \log \phi_{\theta_2}(y_i) \right] + \sum_{i=1}^{N} \left[ (1 - \Delta_i) \log(1 - \pi) + \Delta_i \log \pi \right],
  and the maximum likelihood estimates of \mu_1 and \sigma_1^2 would be the sample mean and the sample variance of the observations with \Delta_i = 0. Similarly, the estimates for \mu_2 and \sigma_2^2 would be the sample mean and the sample variance of the observations with \Delta_i = 1.
29. Parameter estimation – cont'd
• Since the \Delta_i values are actually unknown, we proceed in an iterative fashion, substituting for each \Delta_i in the previous equation its expected value
  \gamma_i(\theta) = E[\Delta_i \mid \theta, Z] = P(\Delta_i = 1 \mid \theta, Z),
  which is also called the responsibility of model 2 for observation i.
• We use the following procedure, known as the EM algorithm, for the two-component Gaussian mixture:
  1. Take initial guesses for the parameters \pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2 (see below).
  2. Expectation step: compute the responsibilities
     \gamma_i = \frac{\pi\, \phi_{\theta_2}(y_i)}{(1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i)}, \quad i = 1, 2, \dots, N.
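The extracted slides stop at the E-step (the slide with the maximization step is missing), so the sketch below fills in the standard M-step for the two-component Gaussian EM in plain R. Treat it as an illustration of the formulas above, not the authors' code:

```r
# Two-component Gaussian mixture fitted by EM, following the slides.
em_2gauss <- function(y, n_iter = 200) {
  n <- length(y)
  # 1. Initial guesses as suggested on the next slide: two random
  #    observations for the means, the overall variance for both sigmas.
  mu  <- sample(y, 2)
  s2  <- rep(sum((y - mean(y))^2) / n, 2)
  pi2 <- 0.5                                 # P(Delta = 1), component 2
  for (it in seq_len(n_iter)) {
    # 2. E-step: responsibilities gamma_i of component 2
    d1 <- dnorm(y, mu[1], sqrt(s2[1]))
    d2 <- dnorm(y, mu[2], sqrt(s2[2]))
    g  <- pi2 * d2 / ((1 - pi2) * d1 + pi2 * d2)
    # 3. M-step (standard; not on the extracted slides): responsibility-
    #    weighted means, variances, and mixing proportion.
    mu[1] <- sum((1 - g) * y) / sum(1 - g)
    mu[2] <- sum(g * y) / sum(g)
    s2[1] <- sum((1 - g) * (y - mu[1])^2) / sum(1 - g)
    s2[2] <- sum(g * (y - mu[2])^2) / sum(g)
    pi2   <- mean(g)
  }
  list(pi = pi2, mu = mu, sigma2 = s2, gamma = g)
}

# Hypothetical mix of "short" and "long" log-runtimes:
set.seed(1)
y   <- c(rnorm(600, 2, 0.8), rnorm(400, 6, 1.2))
fit <- em_2gauss(y)
fit$pi; fit$mu; fit$sigma2
```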
31. Parameter estimation – additional notes

• A simple choice for initial guesses for $\mu_1$ and $\mu_2$ is two randomly selected observations $y_i$. The overall sample variance $\sum_{i=1}^{N} (y_i - \bar{y})^2 / N$ can be used as an initial guess for both $\sigma_1^2$ and $\sigma_2^2$. The initial mixing proportion $\pi$ can be set to 0.5.
• Software:
The "mixtools" R package was used for the mixture analysis, with the function "normalmixEM" for parameter and posterior probability (responsibility) estimation.
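In practice the fit is a single call. A usage sketch (y is again an assumed runtime vector; the slides do not show the actual call):

library(mixtools)
# Fit a two-component normal mixture to the runtimes
set.seed(1)
fit <- normalmixEM(y, k = 2, lambda = c(0.5, 0.5))
fit$lambda           # estimated mixing proportions
fit$mu               # estimated component means
fit$sigma            # estimated component standard deviations
head(fit$posterior)  # responsibilities: one column per component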
32. • We obtain the following estimates:
[Table: estimated mixture parameters]
33. • Each observation $i$ is assigned a posterior probability of belonging to each class:
$$\frac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}, \qquad i = 1, 2, \ldots, N.$$
• For instance, using a probability threshold of 0.5:
[Pie chart: partition of the runtimes into short (1, 60.56%) and long (2, 39.44%) for threshold 0.5]
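The class assignment behind the pie chart can be reproduced from the posterior matrix of the normalmixEM fit above; a short sketch (the component order is arbitrary, so the "long" component is taken as the one with the larger mean):

# Identify the "long" component as the one with the larger mean
long_col  <- which.max(fit$mu)
class_hat <- ifelse(fit$posterior[, long_col] > 0.5, "long", "short")
# Class proportions, e.g. ~60.56% short / ~39.44% long at threshold 0.5
round(100 * prop.table(table(class_hat)), 2)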
34. Building a Classifier – The Learning Algorithm

1. Fit a model on training data (model/feature selection).
2. Evaluate the model on testing data.
3. Summarize model performance (ROC, misclassification rates, fit: F-test, SSE).
4. Compare models.
5. Validate on the validation set.
6. Optimize on the full data (ROC, pseudo-ROC).
35. The training and testing process

• We use observations that are close to the means (within ±0.5 sd). They include ~450,000 observations (~43%).
• 80% are used for training – finding a classifier (model/feature selection), with sequential procedures for model reduction.
• 20% are used for testing – checking performance.
• After obtaining a classifier – optimize: choose the mixture threshold that maximizes performance on the full dataset. (A sketch of the selection and split follows.)
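A sketch of this selection and split, reusing the mixture fit from above (names are assumed; the paper's exact filtering code is not shown in the slides):

# Keep observations within 0.5 sd of either estimated component mean
near1 <- abs(y - fit$mu[1]) <= 0.5 * fit$sigma[1]
near2 <- abs(y - fit$mu[2]) <= 0.5 * fit$sigma[2]
core  <- which(near1 | near2)
# 80/20 split of the retained observations into training and testing
set.seed(2)
train_idx <- sample(core, size = round(0.8 * length(core)))
test_idx  <- setdiff(core, train_idx)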
36. Classifiers
• Here we choose two classification models:
• logistic regression
• decision trees
• They can both handle:
• Missing data
• Candidate classifying variables that are either continuous or
categorical.
• Categorical variables with many categories
37. Decision trees
• Classification rules are formed by the paths from the root to the leaves.
• No assumptions are made regarding the distribution of predictors.
• Relatively unstable.
• Steps:
• A tree is built by recursive splitting of nodes, until a “maximal” tree is generated.
• “Pruning” – simplifying the tree by cutting off nodes to prevent overfitting.
• Selection of the “optimal” pruned tree – one that fits without overfitting.
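The slides do not show the tree-fitting code; a plausible sketch with R's rpart package, assuming a data frame jobs containing a factor label class and the categorical variables A1...C4 from Table I:

library(rpart)
# Grow a large tree, then prune at the cost-complexity value that
# minimizes the cross-validated error (the "optimal" pruned tree)
tree_full <- rpart(class ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
                   data = jobs[train_idx, ], method = "class",
                   control = rpart.control(cp = 0.0001))
best_cp <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_full, cp = best_cp)

Note that rpart handles missing predictor values via surrogate splits, matching the requirement on the previous slide.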
38. Logistic regression
• Regression used to predict the outcome of a binary variable (like “short” or “long”).
• The conditional mean E(Y|X) follows a Bernoulli distribution.
• The connection between E(Y|X) and X can be described by the logistic function, which has an “s” shape:
$$E(Y_i \mid X_i) = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}.$$
In general, the logistic function is
$$f(z) = \frac{1}{1 + e^{-z}}.$$
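A corresponding logistic-regression sketch with base R's glm, again under the assumed jobs data frame and train/test indices:

# Logistic regression; class is a factor with levels c("short", "long"),
# so "long" (the second level) is modeled as the event
logit_fit <- glm(class ~ A1 + A2 + A3 + B1 + B2 + C1 + C2 + C3 + C4,
                 family = binomial, data = jobs[train_idx, ])
# step(logit_fit) could perform the sequential model reduction noted earlier
p_long <- predict(logit_fit, newdata = jobs[test_idx, ], type = "response")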
39. Performance measures
• We use the ROC curve.
• It combines both types of error:
• Sensitivity (“true positive rate”)
- the probability of a “short” classification when the runtime is “short”.
• Specificity (“true negative rate”)
- the probability of a “long” classification when the runtime is “long”.
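Both rates follow directly from a vector of predicted classes; a sketch with assumed vectors pred and truth taking values "short"/"long", and "short" treated as the positive class per the definitions above:

# Sensitivity: P(classified "short" | truly "short")
sens <- mean(pred[truth == "short"] == "short")
# Specificity: P(classified "long" | truly "long")
spec <- mean(pred[truth == "long"] == "long")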
40. Performance optimization
• For the CART procedure, variables A1, A2, A3 and B4 were selected for the classifier.
• For performance optimization, we use a pseudo-ROC curve:
• the blue circle marks the optimal tradeoff between sensitivity and specificity,
• obtained for a mixture probability threshold of 0.45.
41. • For the logistic regression, most variables were selected for the classifier.
• For performance optimization, we compare ROC curves obtained for different thresholds, and choose threshold 0.4:
42. Validation results
• Total misclassification rates:
• CART: 9%.
• Logistic regression: 17%.
• Summary:
• Runtime can be effectively classified using the available information.
• Further evaluation of our method is required using different data sets from different installations and times.
43. Example 2: Detection of 2nd-order epistasis on multi-trait complexes

Joint work with Pavel Goldstein and Prof. Avraham Korol, University of Haifa.
44. Searching for Epistasis

Goal: search for epistatic effects (interactions between genomic loci) on expression traits.
46. Despite the growing interest in searching for epistatic interactions, there is no consensus as to the best strategy for their detection.

Suggested approach:
◦ QTL analysis – combine gene expression and mapping data.
◦ Use multi-trait complexes rather than single traits (trait = gene expression of a particular gene).
◦ Screen for potential epistatic regions in a hierarchical manner.
◦ Control the overall FDR (False Discovery Rate).
47. Multi-trait complexes

Number of tests for interactions on single traits:
number of genes (~7,200) × number of loci pairs (~120,000) = a lot!
A dimension-reduction stage can be of help!
Suggestion:
Considering correlated traits as multi-trait complexes has been shown to increase QTL detection power, mapping resolution and estimation accuracy (Korol et al., 2001).
48. Clustering traits (genes)

Use WGCNA – a weighted correlation network:
◦ Top-down hierarchical clustering.
◦ Dynamic Tree Cut algorithm: a branch-cutting method for detecting gene modules, depending on their shape.
◦ Build meta-genes by taking the first principal component of the genes from every cluster. (A workflow sketch follows.)
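A sketch of one common WGCNA workflow matching these steps (datExpr is an assumed samples × genes expression matrix; the soft-threshold power is illustrative, not the value used in the study):

library(WGCNA)
library(dynamicTreeCut)
# Weighted correlation network and topological-overlap dissimilarity
adj     <- adjacency(datExpr, power = 6)
dissTOM <- 1 - TOMsimilarity(adj)
# Hierarchical clustering of genes and Dynamic Tree Cut module detection
geneTree <- hclust(as.dist(dissTOM), method = "average")
modules  <- cutreeDynamic(dendro = geneTree, distM = dissTOM)
# Meta-genes: module eigengenes (first principal component per cluster)
MEs <- moduleEigengenes(datExpr, colors = modules)$eigengenes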
49. Testing for epistasis:
the Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007)

For trait t, loci-pair l (loci A and B) and replicate i:
[Equation: the vector of gene expressions is modeled as an indicator of genotype combinations for the two loci times the genotype values, plus error; the genotype values are in turn a design matrix times the vector of genetic effects, with the design matrix guaranteeing orthogonality of the effects]
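As a simplified stand-in for the full NOIA parameterization (whose equation image is not reproduced here), the speaker notes point out that the setting is conceptually a two-way analysis of variance; the epistasis null can then be illustrated as an interaction test. All names are assumed:

# mg: meta-gene values; gA, gB: genotype factors (levels A/H) at two loci
fit_add <- lm(mg ~ gA + gB)   # additive effects only
fit_epi <- lm(mg ~ gA * gB)   # adds the interaction (epistasis) term
anova(fit_add, fit_epi)       # F-test for the epistatic effect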
50. The test for epistasis is done hierarchically

[Figure: framework markers and their related secondary markers]
51. False Discovery Rate (FDR) in hierarchical testing

Yekutieli (2008) offers a procedure to control the FDR for the full tree of tests.
52. Hierarchical FDR control

A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008).
An upper bound for $\delta^*$ may be estimated using the quantities $R_t^{P_i=0}$ and $R_t^{P_i=1}$, the number of discoveries in $\tau_t$ given that $H_i$ is a true null hypothesis in $\tau_t$, and a false null hypothesis, respectively.
53. Searching algorithm

STAGE 1:
Construct multi-trait complexes (using WGCNA clustering).
STAGE 2: hierarchical search
◦ Step 1:
Screen for combinations of loci-pair and multi-trait complex with potential for epistasis (NOIA model).
◦ Step 2:
Test using higher-resolution loci only for the selected regions (NOIA model).
54. Data

A sample of 210 individuals from an Arabidopsis thaliana population.
The genotypic map consists of 579 markers.
Transcript levels were quantified using Affymetrix whole-genome microarrays.
A total of 22,810 gene expressions from all five chromosomes (non-expressed genes filtered out).
55. Two-stage hierarchical testing for epistasis

STAGE 1: identified 314 gene clusters (WGCNA).
STAGE 2:
47 sparse "framework" markers that are within 10 cM of each other.
10-12 "secondary" markers related to each "framework" marker.
First step:
1,081 marker pairs (all pairs of the 47 framework markers) × 314 meta-genes = 339,434 tests
- 11 regions are identified.
Second step:
- 1,141 epistatic effects are identified.
59. Preprocessing

Variance Stabilization Normalization (VSN).
Gene-expression filtering: 7,244 genes out of 22,810.
Marker preprocessing.
60. Computational advantage

Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
Naive analysis: 121,278 loci pairs for each of 7,244 traits, namely 878,537,832 tests, would have been performed.
This is a ~2,575-fold reduction in the number of tests (878,537,832 / 341,107 ≈ 2,575.6).
62. Define a scan statistic

For gene $g$, $g = 1, \ldots, m$, let
$$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}.$$
Then the scan statistic for gene $g$ is
$$S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t).$$
For gene $g$, we test the null hypothesis that there is no $k$ such that
$$E(D_{g,k}), \ldots, E(D_{g,k+w-1}) > \delta_0,$$
where $\delta_0$ is the baseline level for the gene.
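A direct R transcription of these two formulas (D is an assumed numeric vector of point-wise statistics D_{g,p} for one gene):

# Moving sums Y_g^w(t) over windows of width w, and their maximum S_g^w
scan_stat <- function(D, w) {
  n  <- length(D)
  Yw <- vapply(seq_len(n - w + 1),
               function(t) sum(D[t:(t + w - 1)]), numeric(1))
  max(Yw)
}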
63. Peak detection

Point-wise statistics: $D_{g,p}$.
Moving-sum statistics:
$$Y_g^w(t) = \sum_{p=t}^{t+w-1} D_{g,p}, \qquad S_g^w = \max_{1 \le t \le n_g - w + 1} Y_g^w(t).$$
64. Summary – data science

• Data science is an emerging field/profession that incorporates knowledge and expertise from several disciplines.
• It combines both big-data technologies and sophisticated methods for complicated data analysis.
• Data analysis aims to answer various questions with case-specific challenges, and should therefore be carefully tailored to the type of problem and data.
65. References
Reiner-Benaim, A., Shmueli, E. and Grabarnick, A. (submitted) A statistical learning approach for runtime prediction in Intel’s data center.

Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014) Two-stage genome-wide search for epistasis with implementation to Recombinant Inbred Lines (RIL) populations. PLOS ONE, 9(12).

Reiner-Benaim, A. (2015) Scan statistic tail probability assessment based on process covariance and window size. Methodology and Computing in Applied Probability, in press.

Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014) Scan statistics analysis for detection of introns in time-course tiling array data. Statistical Applications in Genetics and Molecular Biology, 13(2), 173-90.
However, more than one gene may affect the trait, and then an epistatic effect is of potential interest. The Y-axis here shows gene expression and the X-axis shows genotypes of QTL1. The markers have only two levels, A or H, which is the case for a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually, it is similar to a two-way analysis of variance. In the second plot an epistatic effect is involved.
WGCNA, proposed by Zhang and Horvath, is used for gene-expression clustering.
First, top-down hierarchical clustering is applied, using weighted inter-gene distances.
Then, a branch-cutting method sensitive to branch shape is applied to detect gene modules.
Meta-genes are then defined as the first principal component of the genes from every cluster.
We propose to test the epistasis hypothesis by fitting the NOIA model of Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of the genetic effects.
For loci A and B, the gene-expression level for trait t, loci-pair l and replicate i can be represented as a product of the phenotypes with the corresponding genotype-combination indicators, plus an error term. In turn, the phenotypes may be represented as a multiplication of the genetic effects by a design matrix that guarantees orthogonality of the effects.
As mentioned, neighboring markers on the genotype map contain very similar information. Based on this attribute, we separated all markers into "framework" markers (marked as bold dots), which are relatively distant loci, and "secondary" markers (small vertical lines) related to the corresponding framework markers. Long vertical lines denote the borders of the "framework" marker areas. Thus our markers have a hierarchical structure.
We propose a two-stage approach for identifying QTL epistasis. The algorithm starts with an initial construction of multi-trait complexes (or meta-genes) by WGCNA clustering of the microarray gene-expression data. Then, epistasis is tested for among all combinations of such complexes and loci-pairs: starting with an initial "rough" search for pairs among framework markers, followed by a higher-resolution search only within the identified regions.
If an epistatic effect is found between markers m1 and m2, we continue the search between all pairs of markers along with their "secondary" markers (colored in yellow).
Since the number of tests involved is enormous, we should control false positives. For this purpose we use the False Discovery Rate criterion proposed by Benjamini and Hochberg; in our case it is defined as the expected proportion of erroneously identified epistasis effects among all identified ones.
Yekutieli (2008) suggested a hierarchical procedure to control the FDR across the tree of hypotheses.
In our case all hypotheses can be arranged in a 2-level structure. On the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse "framework" markers. On the second level are the hypotheses for all combinations selected in the first level, this time using "secondary" markers related to the corresponding framework markers. We are interested in full-tree FDR control, covering all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at level 0.1.
We implemented the algorithm on Arabidopsis data of 210 RILs.
Around 23,000 gene expressions were produced from all five chromosomes.
Then we applied our algorithm:
314 gene clusters were identified (WGCNA).
For the first stage of hierarchical testing, 47 sparse "framework" markers that are within 10 cM of each other were used.
10-12 "secondary" markers were placed for each framework area.
In total we tested around 440,000 epistatic hypotheses.
The Variance Stabilization Normalization (VSN) uses a generalized log transformation.
After filtering non-expressed genes, 7,244 genes out of 22,810 remained.
We also filtered out bad or non-informative markers.
Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
If instead all possible combinations of markers and raw traits were tested in one stage, about 900,000,000 tests would have been performed.