Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
In these slides I answer basic questions about machine learning, such as:
What is Machine Learning?
What are the types of machine learning?
How to deal with data?
How to test model performance?
Valencian Summer School 2015
Day 1
Lecture 3
Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Generative Adversarial Networks: Basic Architecture and Variants, by ananth
In this presentation we review the fundamentals behind GANs and look at different variants. We quickly review the theory, including the cost functions, training procedure, and common challenges, and go on to look at variants such as CycleGAN and SAGAN.
Deep Learning for Practitioners, Lecture 2: Selecting the right applications..., by ananth
In this presentation we articulate when deep learning techniques yield the best results from a practitioner's point of view. Do we apply deep learning techniques to every machine learning problem? What characteristics make an application suitable for deep learning? Does more data automatically imply better results regardless of the algorithm or model? Does "automated feature learning" obviate the need for data preprocessing and feature design?
A presentation about supervised learning, mainly discussing regression and classification, with further discussion of how to apply supervised learning in practice.
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 6: Time Series and Deepnets. By Charles Parker (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject, such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Hidden Markov Models (HMM). It deals with several application areas, such as Natural Language Processing and Image Understanding. This presentation provides the landscape.
In this presentation we discuss the hypothesis of MaxEnt models, describe the role of feature functions and their applications to Natural Language Processing (NLP). The training of the classifier is discussed in a later presentation.
Generating Natural-Language Text with Neural Networks, by Jonathan Mugan
Automatic text generation enables computers to summarize text, to have conversations in customer-service and other settings, and to customize content based on the characteristics and goals of the human interlocutor. Using neural networks to automatically generate text is appealing because they can be trained through examples with no need to manually specify what should be said when. In this talk, we will provide an overview of the existing algorithms used in neural text generation, such as sequence2sequence models, reinforcement learning, variational methods, and generative adversarial networks. We will also discuss existing work that specifies how the content of generated text can be determined by manipulating a latent code. The talk will conclude with a discussion of current challenges and shortcomings of neural text generation.
Artificial Intelligence Course: Linear Models, by ananth
In this presentation we cover linear models for regression and classification, illustrated with several examples. Concepts such as underfitting (bias) and overfitting (variance) are presented. Linear models can be used as standalone classifiers in simple cases, and they are essential building blocks of larger deep learning networks.
This presentation discusses decision trees as a machine learning technique. It introduces the problem with several examples: cricket player selection, medical C-section diagnosis, and mobile phone price prediction. It presents the ID3 algorithm, shows how the decision tree is induced, and covers the definition and use of concepts such as entropy and information gain.
Feature Importance Analysis with XGBoost in Tax Audit, by Michael Benesty
A presentation of a real use case at the Taj law firm (Deloitte Paris): applying machine learning to accounting data to help clients prepare for their tax audit.
An introductory course on building ML applications, with a primary focus on supervised learning. It covers the typical ML application cycle (problem formulation, data definitions, offline modeling, platform design) and includes key tenets for building applications.
Note: This is an old slide deck. The content on building internal ML platforms is somewhat outdated, and the slides on model choices do not include deep learning models.
Valencian Summer School 2015
Day 1
Lecture 1
State of the Art in Machine Learning
Poul Petersen (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Feature engineering: the underdog of machine learning. This deck provides an overview of feature generation methods for text, image, and audio data, feature cleaning and transformation methods, how well they work, and why.
Top contenders in the 2015 KDD Cup include the team from DataRobot comprising Owen Zhang, the #1-ranked Kaggler, and top Kagglers Xavier Conort and Sergey Yurgenson. Get an in-depth look as Xavier describes their approach. DataRobot allowed the team to focus on feature engineering by automating model training, hyperparameter tuning, and model blending, giving the team a firm advantage.
The slides from my talk at the FOSDEM HPC, Big Data and Data Science devroom, with general tips from various sources about putting your first machine learning model into production.
The video is available from the FOSDEM website: https://fosdem.org/2017/schedule/event/machine_learning_zoo/
Can automated feature engineering prevent target leaks? By Meir Maor
In this talk we will review common and subtle ways of how problem definitions can go wrong. Exemplified by cases we encounter in the field, we will discuss target leaks (the use of information which cannot be available at prediction time), address sampling bias and consider ways to identify & tackle them.
You'll hear many real-life examples of how these issues manifested and see how introducing automated feature engineering can change the way data scientists discover and treat them.
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b... (Data Con LA)
Feature engineering, writing code to map raw input data into a set of signals that will be fed into a machine learning algorithm, is the dark art of data science. Although the process of crafting new features is tedious and failure-prone, the key to a successful model is a diverse set of high-quality features that are informed by domain experts. Recently, academic researchers have begun to focus on the problem of feature engineering, and have started to publish research that addresses the relative lack of tools designed to support the feature engineering process. In this talk, I will review some of my favorite papers and present some efforts to convert these ideas into tools that leverage the principles of reactive application design in order to make feature engineering (dare I say it) fun.
Valencian Summer School 2015
Day 1
Lecture 3
Ensembles of Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota (Data Con LA)
Producing highly accurate predictive models in social data mining can be a challenge. Feature engineering using traditional methodologies can only take you so far; trying to find that needle in a haystack when the subject matter is too domain-specific or prone to ambiguity can require large investments to achieve accurate results. In this presentation we will discuss methodologies used by Toyota's Research and Development Data Science Team and share secrets of building highly accurate predictive models for social data using innovative feature engineering techniques applied on the Apache Spark and MLlib platform.
Open Source Tools & Data Science Competitions, by ODSC
This talk shares the presenter's experience with open source tools in data science competitions. In the past several years, Kaggle and other competitions have created a large online community of data scientists. In addition to competing with each other for fame and glory, members of this community also generously share knowledge and insights through forums and open source code. The open competition and sharing have resulted in rapid progress in the sophistication of the entire community. This presentation briefly covers this journey from a competitor's perspective and shares hands-on tips on open source tools that have proven popular and useful in recent competitions.
04 Accelerating DL Inference with (Open)CAPI and Posit Numbers, by Yutaka Kawai
This was presented by Louis Ledoux and Marc Casas at the OpenPOWER Summit EU 2019. The original is available at:
https://static.sched.com/hosted_files/opeu19/1a/presentation_louis_ledoux_posit.pdf
Machine Learning for IoT - Unpacking the Black Box, by Ivo Andreev
Have you ever considered machine learning a black box? It can seem like a kind of magic. Although it is only one among many available solutions, Azure ML has proved to be a great balance between flexibility, usability, and affordable price. But how does Azure ML compare with the other ML providers? How do you choose the appropriate algorithm? Do you understand the key performance indicators and how to improve the quality of your models? The session is about understanding the black box and using it for IoT workloads, and not only those.
Invited talk at Tsinghua University on "Applications of Deep Neural Networks". As the technical lead of the deep learning task force at NIO USA Inc., I was invited to give this colloquium talk on general applications of deep neural networks.
Art of Feature Engineering for Data Science, with Nabeel Sarwar (Spark Summit)
We will discuss what feature engineering is all about, various techniques to use, and how to scale to 20,000-column datasets using random forests, SVD, and PCA. Also demonstrated is how we can build a service around these to save time and effort when building hundreds of models. We will share how we did all this using Spark ML to build logistic regression, neural networks, Bayesian networks, etc.
Smaller and Easier: Machine Learning on Embedded Things, by NUS-ISS
Machine learning, meet things. Embedded machine learning is the blend of Machine Learning with Internet of Things and Edge Computing. This talk will cover recent topics in the Embedded Machine Learning field that have made it easier for anyone to deploy ML on small devices. We'll look at: TinyML, Edge Impulse, and Eloquent Arduino.
A presentation on a special category of databases called Deductive Databases. It is an attempt to merge logic programming with relational database. Other types include Object-oriented databases, Graph databases, XML databases, Multi-model databases, etc.
Hadoop Summit 2012 | Bayesian Counters AKA In-Memory Data Mining for Large Da... (Cloudera, Inc.)
Processing large data requires new approaches to data mining: low, close-to-linear complexity, and stream processing. While in traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, from which to infer a model for predicting future or held-out observations, in stream processing the problem is often posed as extracting as much information as possible from the current data to convert it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian Learning on top of Bayesian Counters.
Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.
PEARC17: A Real-Time Machine Learning and Visualization Framework for Scientif..., by Feng Li
High-performance computing resources are currently widely used in science and engineering. Typical post-hoc approaches use persistent storage to save data produced by a simulation, so reading from storage into memory is required for data analysis tasks. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework to connect scientific simulations with multi-step machine learning processes and in-situ visualization tools, providing plug-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulent flows.
A fascinating view of the Artificial Intelligence journey.
Ramón López de Mántaras, Ph.D.
Technical and Business Perspectives on the Current and Future Impact of Machine Learning - MLVLC
October 20, 2015
Real-world Stories and Long-term Risks and Opportunities.
Tom Dietterich, Ph.D.
Technical and Business Perspectives on the Current and Future Impact of Machine Learning - MLVLC
October 20, 2015
Valencian Summer School 2015
Day 2
Lecture 15
Machine Learning - Black Art
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 1
Lecture 9
Real World Machine Learning - Cooking Predictions
Andrés González (CleverTask)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 2
Lecture 11
The Future of Machine Learning
José David Martín-Guerrero (IDAL, UV)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Valencian Summer School 2015
Day 1
Lecture 7
A developers’ overview of the world of predictive APIs
Louis Dorard (PAPIs.io)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to evolve alongside supply, driven by institutional investment rotating out of offices and into work from home ("WFH"), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting Primitives for Graphs: SHORT REPORT / NOTES, by Subhajit Sahu
Graph algorithms, like PageRank ... Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is ...
Multiply with different modes (map):
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparison of various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce):
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce):
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparison of various launch configs for CUDA-based vector element sum (memcpy).
4. Comparison of various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce):
1. Comparison of various launch configs for CUDA-based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Full Disclosure
• Oregon State University (structured output spaces)
  • Music recognition
  • Real-time strategy game-playing
• Kodak Research Labs
  • Media classification (audio, video)
  • Document classification
  • Performance evaluation
• BigML
• Allston Trading (applying machine learning to market data)
3. Data Transformation
• But it's "machine learning"!
• Your data sucks (or at least I hope it does) . . .
  • Data is broken
  • Data is incomplete
• . . . but you know about it!
  • Make the problem easier
  • Make the answer more obvious
• Don't waste time modeling the obvious
  • Until you find the right algorithm for it
4. Your Data Sucks I: Broken Features
• Suppose you have a market data feature called trade imbalance = (buy - sell) / total volume that you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
  • Point or feature removal
  • Easy default
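A minimal sketch of this guard in Python (not from the deck; the function and its inputs are hypothetical). Returning a missing marker corresponds to point removal; returning 0.0 would be the "easy default":

```python
def trade_imbalance(buy_volume, sell_volume):
    """(buy - sell) / total volume, or None when there were no trades."""
    total = buy_volume + sell_volume
    if total == 0:
        # No trades this window: mark the point as missing (point removal),
        # or return 0.0 here to take the "easy default" instead.
        return None
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None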
5. Your Data Sucks II: Missing Values
• Suppose you're building a model to predict the presence or absence of cancer
• Each feature is a medical test
  • Some are simple (height, weight, temperature)
  • Some are complex (blood counts, CAT scan)
• Some patients have had all of these done, some have not
• Does the presence or absence of a CAT scan tell you something? Should it be a feature?

Height | Weight | Blood Test | Cancer?
 179   |  80    |            | No
 160   |  60    | 2,4        | No
 150   |  65    | 4,5        | Yes
 155   |  70    |            | No
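One common way to act on this idea, sketched with pandas (my choice of tool, not the deck's): record the presence of the test as its own feature before imputing the value. I read the table's decimal-comma values ("2,4") as 2.4, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical recreation of the slide's table; NaN = test never performed.
df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [np.nan, 2.4, 4.5, np.nan],  # assumes "2,4" means 2.4
})

# The absence of a test may itself be informative: keep it as a feature.
df["had_blood_test"] = df["blood_test"].notna().astype(int)
df["blood_test"] = df["blood_test"].fillna(df["blood_test"].median())
print(df)
```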
6. Simplifying Your Problem
• What about the class variable?
• It's just another feature, so it can be engineered
• Change the problem
  • Do you need so many classes?
  • Do you need to do a regression?
7. Feature Engineering: What?
• Your data may be too "raw" for learning
  • Multimedia data
  • Raw text data
• Something must be done to make the data "learnable"
  • Compute edge histograms, SIFT features
  • Do word counts, latent topic modeling
8. An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another

Lat. 1    | Long. 1     | Lat. 2    | Long. 2     | Can Walk?
48.871507 | 2.354350    | 48.872111 | 2.354933    | Yes
48.872111 | 2.354933    | 44.597422 | -123.248367 | No
48.872232 | 2.354211    | 48.872111 | 2.354933    | Yes
44.597422 | -123.248367 | 48.872232 | 2.354211    | No
9. • Whether two points are walking distance from each other is not an obvious function of the latitude and longitude
• But it is an obvious function of the distance between the two points
• Unfortunately, that function is quite complicated
• Fortunately, you know it already!
10. An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another

Lat. 1    | Long. 1     | Lat. 2    | Long. 2     | Distance (km) | Can Walk?
48.871507 | 2.354350    | 48.872111 | 2.354933    | 2             | Yes
48.872111 | 2.354933    | 44.597422 | -123.248367 | 9059          | No
48.872232 | 2.354211    | 48.872111 | 2.354933    | 5             | Yes
44.597422 | -123.248367 | 48.872232 | 2.354211    | 9056          | No
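The deck doesn't say how the distance column was computed; the haversine great-circle distance is one standard choice for this feature, sketched here:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# The second row of the table: Paris to Corvallis, roughly 9000 km.
print(round(haversine_km(48.872111, 2.354933, 44.597422, -123.248367)))
```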
11. Feature Engineering
• One of the core (maybe the core) competencies of a machine learning engineer
  • Requires domain understanding
  • Requires algorithm understanding
• If you do it really well, you eliminate the need for machine learning entirely
• Gives you another path to success; you can often substitute domain knowledge for modeling expertise
• But what if you don't have specific domain knowledge?
12. Techniques I: Discretization
• Construct meaningful bins for a continuous feature (two or more)
  • Body temperature
  • Credit score
• The new features are categorical features, each category of which has nice semantics
• Don't make the algorithm waste effort modeling things that you already know about
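A minimal binning sketch with pandas; the bin edges here are illustrative assumptions, not values from the deck:

```python
import pandas as pd

# Body temperature in Celsius, discretized into bins with clear semantics.
temps = pd.Series([35.8, 36.6, 37.9, 39.4])
labels = pd.cut(temps,
                bins=[0.0, 36.0, 37.5, 38.5, 45.0],
                labels=["low", "normal", "elevated", "fever"])
print(labels.tolist())  # ['low', 'normal', 'elevated', 'fever']
```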
13. Techniques II: Delta
• Sometimes, the difference between two features is the important bit
  • As it was in the distance example
• Also holds a lot in the time domain
  • Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness, this sometimes works)
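In the time domain, "just differentiate" can be as simple as taking first differences; a sketch with made-up values:

```python
import pandas as pd

# First differences of a time-ordered signal as a new feature.
prices = pd.Series([100.0, 101.5, 101.0, 104.0])
delta = prices.diff()   # change relative to the previous observation
print(delta.tolist())   # [nan, 1.5, -0.5, 3.0]
```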
14. Techniques III: Windowing
• If points are distributed in time, previous points in the same window are often very informative
  • Weather
  • Stock prices
• Add this to a 1-d sequence of points to get an instant machine learning problem!
  • Sensor data
  • User behavior
• Maybe add some delta features?
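A sketch of the windowing idea: lagged copies of a 1-d series become the features and the current value becomes the target. The window size and column names are my own:

```python
import pandas as pd

# Turn a 1-d sequence into a supervised problem with a window of w lags.
series = pd.Series([3.0, 4.0, 5.0, 4.5, 6.0, 7.0])
w = 2
frame = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, w + 1)})
frame["target"] = series
print(frame.dropna())  # rows without a full window are dropped
```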
15. Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one (subtract the mean and divide by the standard deviation)
• Good for domains with heterogeneous but Gaussian-distributed data sources
  • Demographic data
  • Medical testing
• Note that this isn't in general effective for decision trees!
  • The transformation is order preserving
  • Decision tree splits rely only on ordering!
• Good for things like k-NN
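The transformation in one line of NumPy, on made-up data (scikit-learn's StandardScaler does the same thing):

```python
import numpy as np

# Z-score each column: subtract the mean, divide by the standard deviation.
X = np.array([[170.0, 60.0],
              [180.0, 90.0],
              [160.0, 75.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0), Z.std(axis=0))  # columns now have mean ~0, std 1
```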
16. Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0] and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
  • Helps us tell the difference between documents and dictionaries
  • We'll come back to the idea of sparsity
• Note that this will affect decision trees
  • Does not necessarily preserve order (co-dependency between features)
  • A lesson against over-generalization of technique!
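A sketch reproducing the slide's example (the 0.57 on the slide is 1/sqrt(3) ≈ 0.577):

```python
import numpy as np

# Scale each row (feature vector) to unit L2 norm.
X = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
print(X_unit)  # [[0, 1, 0], [0.577, 0.577, 0.577]]
```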
17. What Do We Really Want?
• This is nice, but whatever happened to "machine learning"?
• Construct a feature space in which "learning is easy", whatever that means
• The space must preserve "important aspects of the data", whatever that means
• Are there general ways of posing this problem? (Spoiler alert: Yes)
18. Aside I: Projection
• A projection is a one-to-one mapping from one feature space to another
• We want a function f(x) that projects a point x into a space where a good classifier is obvious
• The axes (features) in your new space are called your new basis
19. A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the distance to each cluster centroid
• These distances are your new features
• The new space can be either higher or lower dimensional than your original space
• For highly clustered data, this can be a fairly powerful feature space
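A sketch of this hack with scikit-learn's k-means, whose transform method returns exactly these centroid distances (the data and cluster count are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Fit k clusters, then use the distance to each centroid as k new features.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
X_new = kmeans.transform(X)  # shape (200, 8): distance to each centroid
print(X_new.shape)
```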
20. Principal Component Analysis
• Find the axis through the data with the highest variance
• Repeat for the next orthogonal axis and so on, until you run out of data or dimensions
• Each axis is a feature
21. PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of the originals (which means you can project test data into the space)
• Features are linearly independent (great for some algorithms)
• Data can often be "explained" with just the first few components (so this can be "dimensionality reduction")
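A sketch of both nice properties, fitting on training data and projecting held-out data; the scikit-learn implementation and random data are my assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

# Fit on training data; the learned linear projection applies to test data.
pca = PCA(n_components=3).fit(X_train)
print(pca.transform(X_test).shape)          # (20, 3)
print(pca.explained_variance_ratio_.sum())  # variance kept by 3 components
```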
22. Spectral Embeddings
• Two of the seminal ones are Isomap and LLE
• Generally, compute the nearest neighbor matrix and use this to create the embedding
• Pro: Pretty spectacular results
• Con: No projection matrix
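A sketch using scikit-learn's Isomap, one of the two seminal methods named above; note that, per the "con", the original formulations give no projection matrix for embedding new points:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Build the nearest-neighbor graph and embed into 2 dimensions.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (100, 2)
```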
23. Combination Methods
• Large Margin Nearest Neighbor, Xing's Method
• Create an objective function that preserves neighbor relationships
  • Neighbor distances (unsupervised)
  • Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this objective (usually an elaborate sort of gradient descent)
• I've had some success with these
24. Aside II: Sparsity
• Machine learning is essentially compression, and constantly plays at the edges of this idea
  • Minimum description length
  • Bayesian information criteria
  • L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they're more powerful?
25. Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to use gives a document its semantics
• Text features are incredibly powerful in the grand scheme of feature spaces
  • One or two words allow us to do accurate classification
  • But those one or two words must be sparse
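The sparsity is easy to see in a bag-of-words representation; a sketch with scikit-learn and two example sentences of my own:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document uses only a handful of the vocabulary's words, so the
# document-term matrix is stored sparsely.
docs = ["the cat sat on the mat",
        "stock prices fell sharply today"]
X = CountVectorizer().fit_transform(docs)
print(X.shape, "-", X.nnz, "nonzero entries")
```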
26. Sparsity II: EigenFaces
• Here are the first few components of PCA applied to a collection of face images
• A small number of these explain a huge part of a huge number of faces
• First components are like stop words; the last few (sparse) components make recognition easy
27. Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily expressible as a combination of a few (i.e., sparse) constant frequency signals
• Such representations make accurate speech recognition possible
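A sketch with NumPy's FFT: a waveform built from two sinusoids looks complex in the time domain but has just two peaks in the frequency domain:

```python
import numpy as np

# 1000 samples over one second; with this setup, rfft bin k is k Hz.
t = np.linspace(0, 1, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

spectrum = np.abs(np.fft.rfft(signal))
print(np.argsort(spectrum)[-2:])  # the two dominant bins: 40 and 5 Hz
```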
28. Sparse Coding
• Iterate:
  • Choose a basis
  • Evaluate that basis based on how well you can use it to reconstruct the input, and how sparse it is
  • Take some sort of gradient step to improve that evaluation
• Andrew Ng's efficient sparse coding algorithms and Hinton's deep autoencoders are both flavors of this
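Scikit-learn's dictionary learning is one concrete flavor of this iterate-and-evaluate loop (my choice of implementation, not the deck's exact algorithms):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Learn a basis (dictionary) whose codes reconstruct X while staying sparse;
# alpha trades off reconstruction error against sparsity.
dico = MiniBatchDictionaryLearning(n_components=30, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)
print(codes.shape, "fraction nonzero:", (codes != 0).mean())
```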
29. The New Basis
• Text: Topics
• Audio: Frequency transform
• Visual: Pen strokes
30. Another Hack: Totally Random Trees
• Train a bunch of decision trees
  • With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
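This hack exists off the shelf as scikit-learn's RandomTreesEmbedding; a sketch on random data:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Trees are grown with no objective; each leaf becomes a binary feature.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X)  # sparse one-hot leaf indicators
print(X_sparse.shape)
```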
31. And More and More
• There are a ton of variations on these themes
  • Dimensionality Reduction
  • Metric Learning
  • "Coding" or "Encoding"
• Nice canonical implementations can be found at: http://lvdmaaten.github.io/drtoolbox/