VSSML16 LR1. Summary Day 1
Valencian Summer School in Machine Learning 2016
Day 1
Summary Day 1
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
3. BigML, Inc.
A Gentle Introduction to Machine Learning
Charles Parker
What is your company's strategy based on?
Expert-driven decisions:
● Experts who extract some rules to predict new results
● Programmers who tailor a computer program that predicts following the expert's rules
● Not easily scalable to the entire organization
Data-driven decisions:
● Data (often easier to find and more accurate than the expert)
● ML algorithms (faster, more modular, with measurable performance)
● Scalable to the entire organization
4. BigML, Inc.
When data-driven decisions are a good idea
● Experts are hard to find or expensive
● Expert knowledge is difficult to program into production environments accurately or quickly enough
● Experts cannot explain how they do it: character or speech recognition
● There's a performance-critical hand-made system
When data-driven decisions are a bad idea
● Experts are easily found and cheap
● Expert knowledge is easily programmed into production environments
● The data is difficult or expensive to acquire
5. BigML, Inc.
Steps to create an ML program from data
● Acquiring data: in tabular format, each row stores the information about the thing that has a property you want to predict. Each column is a different attribute (field or feature).
● Defining the objective: the property that you are trying to predict.
● Using an ML algorithm: the algorithm builds a program (the model or classifier) whose inputs are the attributes of a new instance and whose output is the predicted value for the target field (the objective).
6. BigML, Inc.
Modeling: creating a program with an ML algorithm
● The algorithm searches a hypothesis space for the set of variables that best fits your data.
Examples of hypothesis spaces:
● Logistic regression: feature coefficients + bias
● Neural network: weights for the nodes in the network
● Support vector machines: coefficients on each training point
● Decision trees: combinations of feature ranges
7. BigML, Inc.
Decision tree construction
The recursive algorithm analyzes the data to find the best split at each node:
● What question splits your data better? Try all possible splits and choose the one that achieves more purity.
● When should we stop?
  When the subset is totally pure
  When its size reaches a predetermined minimum
  When the number of nodes or the tree depth is too large
  When you can't get any statistically significant improvement
● Nodes that don't meet the latter criteria can be removed after tree construction via pruning.
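The split search above can be sketched in a few lines. This is a minimal illustration, not BigML's implementation: it tries every threshold on one numeric feature and keeps the one that achieves the most purity, measured with Gini impurity.

```python
# Minimal sketch: greedy best-split search for one numeric feature,
# using Gini impurity as the purity measure.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_c^2). 0.0 means a totally pure subset."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try every candidate threshold; return the one with the lowest
    weighted impurity (i.e. the split that achieves the most purity)."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data: petal lengths for two cleanly separable classes.
values = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
print(best_split(values, labels))  # (4.5, 0.0): both sides totally pure
```

A real tree builder applies this search recursively to each resulting subset until one of the stopping criteria above fires.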
8. BigML, Inc.
Visualizing a decision tree
● Root node (split at petal length = 2.45)
● Branches
● Leaf (splitting stops)
9. BigML, Inc.
Decision tree outputs
Inputs: values of the features for a new instance
● Prediction: start from the root node. Use the inputs to answer the question associated with each node you reach; the answer decides which branch to descend. When you reach a leaf node, the majority class in the leaf is the prediction.
● Confidence: degree of reliability of the prediction. It depends on the purity of the final node and on the number of instances that it classifies.
● Field importance: which field is more decisive in the model's classification. It depends on the number of times the field is used as the best split and on the error reduction it achieves.
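The prediction walk described above is easy to sketch. The tree structure here is hypothetical (a nested tuple, not BigML's model format), with thresholds made up for illustration:

```python
# Minimal sketch: descending a decision tree with a new instance.
# Internal node: (field, threshold, left_subtree, right_subtree);
# leaf: the majority class stored for that leaf.
tree = ("petal length", 2.45,
        "setosa",                                   # leaf: petal length < 2.45
        ("petal width", 1.75, "versicolor", "virginica"))

def predict(node, instance):
    """Answer each node's question and follow the matching branch."""
    while isinstance(node, tuple):
        field, threshold, left, right = node
        node = left if instance[field] < threshold else right
    return node  # majority class at the leaf

print(predict(tree, {"petal length": 1.4, "petal width": 0.2}))  # setosa
print(predict(tree, {"petal length": 5.0, "petal width": 2.1}))  # virginica
```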
10. BigML, Inc.
Evaluating your models
● Testing your model with new data is the key to measuring its performance. Never evaluate with training data!
● Simplest approach: split your data into a training dataset and a test dataset (80/20% is customary).
● Advanced approach: to avoid biased splits, split repeatedly and average the evaluations, or k-fold cross-validate.
Which evaluation metric to choose?
● Accuracy is not a good metric when classes are unbalanced. Use the confusion matrix instead, or phi, F1 score, or balanced accuracy.
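The customary 80/20 split can be sketched as follows; the shuffling guards against ordering bias in the data:

```python
# Minimal sketch: an 80/20 train/test split. Evaluation must use only
# the held-out rows, never the training rows.
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # avoid ordering bias in the split
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

rows = list(range(100))                 # stand-ins for labeled instances
train, test = train_test_split(rows)
print(len(train), len(test))            # 80 20
assert not set(train) & set(test)       # no instance leaks into both sets
```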
11. BigML, Inc.
Domain-specific evaluation
● The confusion matrix tells you the number of correctly classified (TP, TN) and misclassified (FP, FN) instances, but it does not tell you how misclassifications will impact your business.
● As a domain expert, you can assign a cost to each FP or FN (cost matrix). This cost/gain ratio is the significant performance measure for your models.
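A tiny sketch of that idea, with made-up counts and made-up costs: a model can look excellent on accuracy while its few false negatives dominate the business cost.

```python
# Minimal sketch: weighting a confusion matrix with domain-assigned costs.
# All counts and costs below are invented numbers for illustration.
confusion = {"TP": 90, "TN": 850, "FP": 40, "FN": 20}
cost = {"TP": 0, "TN": 0, "FP": 5, "FN": 50}  # an FN hurts 10x more than an FP

total_cost = sum(confusion[k] * cost[k] for k in confusion)
accuracy = (confusion["TP"] + confusion["TN"]) / sum(confusion.values())
print(accuracy, total_cost)  # 0.94 accuracy, yet FNs contribute 1000 of 1200
```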
12. BigML, Inc.
Ensembles
Poul Petersen
Can a group of weaker models outperform a stronger single model?
● Ensembles are groups of different models built on samples of the data.
● Randomness is introduced in the models: each model is a good approximation for a different random sample of the data.
● A single ML algorithm may not adapt nicely to some datasets; combining different models can.
● Combining models can reduce the over-fitting caused by anomalies, errors, or outliers.
● The combination of several accurate models gets us closer to the real model.
13. BigML, Inc.
Ensembles
Types of ensembles
● Bootstrap aggregating (bagging): models are built on random samples (with replacement) of n instances.
● Random decision forest: in addition to the random samples of bagging, the models are built by randomly choosing the candidate features at each split (random candidates).
Types of combinations
● Plurality: majority wins.
● Confidence weighted: each vote is weighted by confidence and majority wins.
● Probability weighted: each tree votes according to the distribution at its prediction node.
● K-threshold: a class is predicted only if enough models vote for it.
● Confidence threshold: votes for a class are only counted if their confidence is over the threshold.
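The two simplest pieces above, bagging and the plurality combination, can be sketched directly (the "models" are left abstract; only the sampling and voting are shown):

```python
# Minimal sketch: bootstrap sampling plus a plurality vote.
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Random sample of n instances, with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]

def plurality(votes):
    """Plurality combination: the majority class wins."""
    return Counter(votes).most_common(1)[0][0]

votes = ["spam", "ham", "spam", "spam", "ham"]  # one vote per ensemble model
print(plurality(votes))  # spam

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(len(sample))  # 10: same size as the original, duplicates allowed
```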
14. BigML, Inc.
Ensembles
Configuration parameters
● How many trees?
● How many nodes?
● Missing splits?
● Random candidates?
Too many parameters? Automate!
● SMACdown: automatic optimization of ensembles by exploring the configuration space.
15. BigML, Inc.
Logistic Regression
Poul Petersen
How come we use a regression to classify? Logistic regression is a classification ML algorithm:
● Regressions are typically used to relate two numeric variables.
● But using the proper function, we can relate discrete variables too.
16. BigML, Inc.
Logistic Regression
Assumption: the output is linearly related to the predictors.
● We should use feature engineering to transform raw features into linearly related predictors, if needed.
● The ML algorithm searches for the coefficients that solve the problem by transforming it into a linear regression problem. In general, the algorithm will find one coefficient per feature, plus a bias coefficient and a coefficient for missing values.
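The "proper function" that turns a regression into a classifier is the logistic (sigmoid) function, which squashes the linear combination of predictors into a probability. A minimal sketch, with made-up coefficients:

```python
# Minimal sketch: the logistic function maps w.x + b to a probability.
import math

def logistic(x, coefficients, bias):
    """P(class=1 | x) = 1 / (1 + e^-(w.x + b))"""
    z = sum(c * v for c, v in zip(coefficients, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Invented coefficients for two features:
p = logistic([2.0, 1.0], coefficients=[1.5, -0.5], bias=-1.0)
print(round(p, 3))  # 0.818, the probability of the positive class
print(p > 0.5)      # classify with a 0.5 threshold: True
```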
17. BigML, Inc.
Logistic Regression
Configuration parameters
● Bias: allows an intercept term. Important if P(x=0) != 0.
● Regularization: L1 prefers zeroing individual coefficients; L2 prefers pushing all coefficients towards zero.
● EPS: the minimum error between steps to stop.
● Auto-scaling: ensures that all features contribute equally. Recommended unless there is a specific need not to auto-scale.
18. BigML, Inc.
Logistic Regression
Extending the domain for the algorithm
● Multi-class LR: each class has its own LR, computed as a binary problem (one-vs-the-rest). A set of coefficients is computed for each class.
● Non-numeric predictors: as LR works on numeric predictors, the algorithm needs to encode the non-numeric features to be able to use them. These are the field encodings:
  – Categorical: one-hot, dummy, or contrast encoding
  – Text and items: frequencies of terms
● Curvilinear LR: adding quadratic features as new features.
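One-hot and dummy encoding, the first two categorical field encodings mentioned above, can be sketched in a couple of lines:

```python
# Minimal sketch: turning a categorical field into numeric predictors.
def one_hot(value, categories):
    """One 0/1 column per category; exactly one of them is 1."""
    return [1 if value == c else 0 for c in categories]

categories = ["red", "green", "blue"]
print(one_hot("green", categories))      # [0, 1, 0]

# Dummy coding drops one reference category to avoid a redundant column:
print(one_hot("green", categories)[1:])  # [1, 0]
```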
19. BigML, Inc.
Logistic Regression
Logistic Regressions versus Decision Trees
Logistic Regression:
● Expects a "smooth" linear relationship with the predictors
● Is concerned with the probability of a discrete outcome
● Lots of parameters to get wrong: regularization, scaling, codings
● Slightly less prone to over-fitting
● Because it fits a shape, it might work better when less data is available
Decision Trees:
● Adapt well to ragged non-linear relationships
● No concern: classification, regression, and multi-class are all fine
● Virtually parameter-free
● Slightly more prone to over-fitting
● Prefer surfaces parallel to the parameter axes, but given enough data will discover any shape
21. BigML, Inc.
Clusters
Clusters: looking for similarity
Poul Petersen
● Clustering is an ML technique designed to find and group similar instances in your data (group by).
● It's unsupervised learning, as opposed to supervised learning algorithms like decision trees, where the training data has been labeled and the model learns to predict that label. Clusters are built on raw data.
● Goal: finding k clusters in which similar data can be grouped together. Data in each cluster is self-similar and dissimilar to the rest.
22. BigML, Inc.
Clusters
Use cases
● Customer segmentation: grouping users to act on each group differently
● Item discovery: grouping items to find similar alternatives
● Similarity: grouping products or cases to act on each group differently
● Recommender: grouping products to recommend similar ones
● Active learning: grouping partially labeled data as an alternative to labeling each instance
Clustering can also help us identify new features shared by the data in each group.
23. BigML, Inc.
Clusters
Types of clustering algorithms
● K-means: the number of expected groups is given by the user. The algorithm starts using random data points as centers, then computes distances based on each instance's features. Each instance is assigned to the nearest center or centroid; centroids are recalculated as the center of all the data points in each cluster, and the process is repeated until the groups converge.
  – K++: the first center is chosen randomly from the instances, and each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center.
● G-means: the number of groups is also determined by the algorithm. Starting from k=2, each group is split if the data distribution in it is not Gaussian-like.
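The k-means loop just described (assign to nearest centroid, recompute centroids, repeat) can be sketched in 1-D. For simplicity this seeds with the first k points rather than the K++ scheme:

```python
# Minimal sketch of the k-means loop: assign each point to the nearest
# centroid, recompute centroids as cluster means, repeat until stable.
def kmeans(points, k, iterations=20):
    centroids = points[:k]  # sketch seeding: first k points (not K++)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Recalculate each centroid as the center of its cluster:
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious 1-D groups, around 1 and around 10:
print(kmeans([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2))  # ~[1.0, 10.0]
```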
24. BigML, Inc.
Clusters
Extending clustering to different data types
How is the distance between two instances defined? For clustering to work, we need a distance function that is computable for all the features in your data. Scaled Euclidean distance is used for numeric features. What about the rest of the field types?
● Categorical: features contribute to the distance if the categories for both points are not the same.
● Text and items: words are parsed and their frequencies are stored in a vector format. Cosine distance (1 - cosine similarity) is computed.
● Missing values: the distance to a missing value cannot be defined. Either you ignore the instances with missing values, or you previously assign them a common value (mean, median, zero, etc.).
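The text/items case above can be sketched directly: build term-frequency vectors and compute 1 - cosine similarity.

```python
# Minimal sketch: cosine distance between two term-frequency vectors.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Term frequencies over the vocabulary ["machine", "learning", "cluster"]:
doc1 = [2, 2, 0]
doc2 = [1, 1, 0]   # same direction as doc1 -> distance ~0
doc3 = [0, 0, 3]   # no shared terms -> distance 1
print(cosine_distance(doc1, doc2))  # 0.0 (within float precision)
print(cosine_distance(doc1, doc3))  # 1.0
```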
25. BigML, Inc.
Anomaly Detection
Anomaly detection: looking for the unusual
Poul Petersen
● Anomaly detectors use ML algorithms designed to single out instances in your data which do not follow the general pattern (rank by).
● Like clustering, they fall into the unsupervised learning category, so no labeling is required. Anomaly detectors are built on raw data.
● Goal: assigning each data instance an anomaly score, ranging from 0 to 1, where 0 means very similar to the rest of the instances and 1 means very dissimilar (anomalous).
26. BigML, Inc.
Anomaly Detection
Use cases
● Unusual instance discovery
● Intrusion detection: users whose behaviour does not comply with the general pattern may indicate an intrusion
● Fraud: cluster per profile and look for anomalous transactions at different levels (card, user, user groups)
● Identifying incorrect data
● Removing outliers
● Model competence / input data drift: model performance can degrade because new data has evolved to be statistically different. Check the predictions' anomaly scores.
27. BigML, Inc.
Anomaly Detection
Statistical anomaly indicators
● Univariate approach: given a single variable, and assuming a normal (Gaussian) distribution, compute the standard deviation and choose a multiple of it as the threshold that defines what's anomalous.
● Benford's law: in real-life numeric sets, small digits occur disproportionately often as leading significant digits.
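The univariate indicator is a one-liner in practice. Note one practical wrinkle, visible in this sketch: with few points, a single huge outlier inflates the standard deviation itself, so the attainable z-score is bounded by (n-1)/sqrt(n) and a lower multiple has to be chosen.

```python
# Minimal sketch of the univariate indicator: flag values more than
# `multiple` standard deviations away from the mean.
import statistics

def univariate_anomalies(values, multiple=2.0):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > multiple * std]

# With only 7 points, the outlier itself inflates std, so a 3-sigma
# threshold could never fire; 2 sigma works here.
data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 42.0]
print(univariate_anomalies(data))  # [42.0]
```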
28. BigML, Inc.
Anomaly Detection
Isolation forests
● Train several random decision trees that overfit the data until each instance is completely isolated.
● Use the mean depth of these trees as a threshold to compute the anomaly score, a number from 0 to 1 where 0 is similar and 1 is dissimilar.
● New instances are run through the trees and assigned an anomaly score according to the average depth they reach.
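The core intuition (anomalous points are isolated by random splits in fewer steps) can be shown with a toy 1-D version. This is only an illustration of the idea, not the full isolation-forest algorithm or its score normalization:

```python
# Minimal sketch: count random splits until a point stands alone.
# Outliers sit far from the bulk, so they isolate at a shallower depth.
import random

def isolation_depth(point, others, rng, max_depth=50):
    """Number of random splits needed to separate `point` from `others`."""
    depth = 0
    while others and depth < max_depth:
        lo = min(others + [point])
        hi = max(others + [point])
        split = rng.uniform(lo, hi)
        # Keep only the points on the same side of the split as `point`:
        others = [o for o in others if (o < split) == (point < split)]
        depth += 1
    return depth

rng = random.Random(7)
crowd = [9.8, 9.9, 10.0, 10.1, 10.2]
avg = lambda p: sum(isolation_depth(p, crowd, rng) for _ in range(100)) / 100

out_depth = avg(42.0)   # far from the crowd: isolated almost immediately
in_depth = avg(10.05)   # inside the crowd: needs more splits
print(out_depth < in_depth)  # True: shallow average depth = more anomalous
```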
29. BigML, Inc.
Association Discovery
Geoff Webb and Poul Petersen
● Association Discovery is an unsupervised technique, like clustering and anomaly detection.
● It looks for "interesting" relations between variables.
● Uses the "Magnum Opus" algorithm by Geoff Webb.
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Tue   Sally     6788     sign  food     26339  51
Example rules (antecedent → consequent):
{class = gas} → amount < 100
{customer = Bob, account = 3421} → zip = 46140
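Both example rules can be checked against the toy table above. This sketch uses plain rule confidence (the fraction of antecedent-matching rows where the consequent also holds) as the interest measure; Magnum Opus itself uses stronger statistical filters.

```python
# Minimal sketch: measuring rule confidence on the toy transaction table.
transactions = [
    {"customer": "Bob",   "account": 3421, "class": "clothes", "zip": 46140, "amount": 135},
    {"customer": "Bob",   "account": 3421, "class": "food",    "zip": 46140, "amount": 401},
    {"customer": "Alice", "account": 2456, "class": "food",    "zip": 12222, "amount": 234},
    {"customer": "Sally", "account": 6788, "class": "gas",     "zip": 26339, "amount": 94},
    {"customer": "Bob",   "account": 3421, "class": "tech",    "zip": 21350, "amount": 2459},
    {"customer": "Bob",   "account": 3421, "class": "gas",     "zip": 46140, "amount": 83},
    {"customer": "Sally", "account": 6788, "class": "food",    "zip": 26339, "amount": 51},
]

def confidence(antecedent, consequent):
    """Fraction of antecedent-matching rows where the consequent also holds."""
    matches = [t for t in transactions if antecedent(t)]
    return sum(consequent(t) for t in matches) / len(matches)

# {class = gas} -> amount < 100
print(confidence(lambda t: t["class"] == "gas",
                 lambda t: t["amount"] < 100))  # 1.0
# {customer = Bob, account = 3421} -> zip = 46140
print(confidence(lambda t: t["customer"] == "Bob" and t["account"] == 3421,
                 lambda t: t["zip"] == 46140))  # 0.75
```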
31. BigML, Inc.
Association Discovery
Problems with frequent pattern mining
● Often results in too few or too many patterns
● Some high-value patterns are infrequent
● Cannot handle dense data
● Cannot prune the search space using constraints on the relationship between antecedent and consequent, e.g. confidence
● Minimum support may not be relevant: it cannot be low enough to capture all valid rules, and it cannot be high enough to exclude all spurious rules
32. BigML, Inc.
Association Discovery
It turns out that:
● Very high support patterns can be spurious
● Very infrequent patterns can be significant
So the user selects the measure of interest, and the system finds the top-k associations on that measure within constraints:
– There must be a statistically significant interaction between antecedent and consequent
– Every item in the antecedent must increase the strength of the association
35. BigML, Inc.
Latent Dirichlet Allocation
Charles Parker
Generative vs discriminative models: pros and cons
● Generative models try to fit the coefficients of a generic function to use it as the data-generating function. This conveys information about the structure of the model (looking for causality).
● Discriminative models do not care about how the labeling is generated; they only find how to split the data into categories.
● Generative models are more probabilistically sound and able to do more than just classify.
● Discriminative models are faster to fit and quicker to predict.
36. BigML, Inc.
Latent Dirichlet Allocation
Thinking of documents in terms of topics
Generative models for documents: a document can be analyzed at different levels
● According to its terms (one or more words)
● According to its topics (distributions of terms ~ semantics)
● Documents are generated by repeatedly drawing a topic and then a term from that topic at random
● Goal: to infer the topic distribution
How? A Dirichlet process is used to model the term|topic and topic|document distributions.
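The generative story above (draw a topic, then a term from that topic, repeatedly) can be sketched with invented distributions; inference in LDA runs this story in reverse, recovering the distributions from observed documents.

```python
# Minimal sketch of LDA's generative story. The topic and term
# distributions below are made up for illustration.
import random

topic_given_doc = {"sports": 0.7, "politics": 0.3}
term_given_topic = {
    "sports":   {"goal": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "law": 0.3, "team": 0.1},
}

def generate_document(length, rng):
    """Repeatedly draw a topic, then a term from that topic, at random."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_given_doc),
                            weights=topic_given_doc.values())[0]
        terms = term_given_topic[topic]
        words.append(rng.choices(list(terms), weights=terms.values())[0])
    return words

print(generate_document(8, random.Random(0)))
```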
37. BigML, Inc.
Latent Dirichlet Allocation
Dirichlet process intuitions
● We're more likely to think a word came from a topic if we've already seen a bunch of other words from that topic.
● We're more likely to think the topic was responsible for generating the document if we've already seen a bunch of words in the document from that topic.
Applications
● Visualizing topic changes in documents over time (especially for dated historical collections)
● Searching by topics (without keywords)
● Using topics as a new feature instead of the bag-of-words approach in modeling
38. BigML, Inc.
Latent Dirichlet Allocation
Nice properties about topics
● Topics can reduce the feature space
● They are nicely interpretable
● Automatically tailored to the document
Caveats
● Need to choose the number of topics
● Takes a lot of time to fit or do inference
● Takes a lot of text to make it meaningful
● Tends to focus on "meaningless minutiae"
● While it sometimes makes nice classifications, it's usually not a dramatic improvement over bag-of-words
● Still, nice for exploration