VSSML17 L3. Clusters and Anomaly Detection
BigML, Inc
Valencian Summer School in Machine Learning 2017 - Day 1
Lecture 3: Clusters and Anomaly Detection. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
3. Clusters
Trees vs Clusters
Trees/LR (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict the label
Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
4. Clusters
Trees vs Clusters
Trees/LR: inputs “X” and label “Y”
sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …
Learning Task: Find function “f” such that f(X) ≈ Y
Clusters: inputs only, no label
sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
5. Clusters
Use Cases
• Customer segmentation
• Item discovery
• Similarity
• Recommender
• Active learning
6. Clusters
Customer Segmentation
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.
• Assumption: Usage correlates with LTV.
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell.
[Figure: clusters labeled 0%, 3%, and 1%]
7. Clusters
Item Discovery
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have a similar taste.
[Figure: clusters labeled by flavor, e.g. Smoky and Fruity]
9. Clusters
Similarity
• Dataset of Lending Club loans
• Mark any loan that is currently late or has ever been late as “trouble”
GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster's population.
[Figure: clusters labeled 0%, 3%, 7%, and 1%]
10. Clusters
Active Learning
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*.
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
11. Clusters
Active Learning
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
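The sampling step can be sketched as follows. This is a minimal illustration, not BigML's implementation: it assumes the cluster assignments already exist, and the function name `sample_per_cluster`, the `per_cluster` parameter, and the example assignments are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_ids, per_cluster, seed=0):
    """Pick up to `per_cluster` instance indices from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    chosen = []
    for cluster in sorted(members):
        pool = members[cluster]
        # Small clusters contribute everything they have.
        chosen.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sorted(chosen)


# Hypothetical cluster assignment for ten patients (three clusters).
assignments = [0, 0, 1, 2, 1, 0, 2, 2, 1, 0]
to_test = sample_per_cluster(assignments, per_cluster=2)
print(to_test)  # indices of the patients that would get the expensive test
```

Only the selected patients are tested; every cluster is represented, which a purely random sample of the same size does not guarantee.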
14. Clusters
Human Expert
• Jesa used prior knowledge to select possible features that separated the objects.
  • “round”, “skinny”, “edges”, “hard”, etc.
• Items were then clustered based on the chosen features.
• Separation quality was then tested to ensure that:
  • the criterion of K=3 was met
  • groups were sufficiently “distant”
  • there was no crossover
15. Clusters
Human Expert
Create features that capture these object differences:
• Length/Width
  • greater than 1 => “skinny”
  • equal to 1 => “round”
  • less than 1 => invert
• Number of Surfaces
  • distinct surfaces require “edges” which have corners
  • easier to count
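As a rough sketch of the Length/Width feature, the rule "less than 1 => invert" can be written as below. The function names and the `tolerance` used to decide when a ratio counts as "equal to 1" are assumptions for illustration; the slides do not specify them.

```python
def length_width_ratio(length, width):
    """Ratio of the two dimensions, inverted when below 1 so it is always >= 1."""
    ratio = length / width
    return 1 / ratio if ratio < 1 else ratio


def shape_label(length, width, tolerance=0.05):
    """Label an object "round" when the ratio is (nearly) 1, otherwise "skinny".

    `tolerance` is a hypothetical allowance for measurement noise.
    """
    ratio = length_width_ratio(length, width)
    return "round" if abs(ratio - 1) <= tolerance else "skinny"


print(length_width_ratio(2, 8))  # 4.0, same as an 8-by-2 object
print(shape_label(19, 19))       # round
print(shape_label(50, 5))        # skinny
```

Inverting ratios below 1 makes the feature independent of which side was measured as the "length".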
17. Clusters
Plot by Features
[Figure: objects plotted with axes Num Surfaces and Length / Width: box, block, eraser, knob, penny, dime, bead, key, battery, screw; K=3]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
18. Clusters
Plot by Features
[Figure: objects plotted with axes Num Surfaces and Length / Width: box, block, eraser, knob, penny, dime, bead, key, battery, screw]
K-Means: Find “best” (minimum distance) circles that include all points
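The idea of finding minimum-distance groups can be sketched with the standard k-means loop (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its members, and repeat until nothing changes. This is a minimal sketch under assumptions, not the implementation used in the talk; the feature values below are made up for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on tuples of numbers; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random instances as starting points
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical (length/width, number of surfaces) values for ten objects.
points = [(1.0, 3.0), (1.0, 3.0), (1.1, 1.0), (1.0, 2.0),
          (2.0, 6.0), (2.5, 6.0), (3.0, 6.0),
          (6.0, 4.0), (7.0, 3.0), (8.0, 5.0)]
centers, assignment = kmeans(points, k=3)
print(assignment)
```

The result depends on the starting points, which is why the choice of initial centers gets its own treatment.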
23. Clusters
Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
  • but this is sensitive to outliers
• k++
  • the first center is chosen randomly from the instances
  • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
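The k++ selection rule described above matches the seeding step commonly known as k-means++. A minimal sketch, assuming numeric instances stored as tuples and at least `k` distinct instances; the function name `kpp_centers` and the sample data are hypothetical.

```python
import math
import random


def kpp_centers(instances, k, seed=0):
    """Choose k starting centers with squared-distance-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(instances)]  # first center: uniformly at random
    while len(centers) < k:
        # Weight of each instance: squared distance to its closest center.
        # Instances already chosen get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in instances]
        centers.append(rng.choices(instances, weights=weights, k=1)[0])
    return centers


instances = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.5, 9.5), (0.0, 20.0)]
print(kpp_centers(instances, k=3))
```

Because the weights are squared distances, far-away instances are favored without being chosen deterministically, which softens the sensitivity to outliers of the pure "farthest point" rule.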
25. Clusters
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
26. Clusters
Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
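The replacement options listed above can be sketched for a single numeric column as follows. This is an illustrative sketch only: missing values are assumed to be represented as `None`, and the function and strategy names are hypothetical.

```python
import statistics


def fill_missing(column, strategy="mean"):
    """Replace None entries in a numeric column using the chosen strategy."""
    present = [v for v in column if v is not None]
    strategies = {
        "maximum": max,
        "mean": statistics.mean,
        "median": statistics.median,
        "minimum": min,
        "zero": lambda values: 0,
    }
    fill = strategies[strategy](present)
    return [fill if v is None else v for v in column]


def drop_incomplete(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]


print(fill_missing([1, None, 3], "mean"))       # [1, 2, 3]
print(fill_missing([1, None, 5, 9], "median"))  # [1, 5, 5, 9]
print(drop_incomplete([(1, 2), (None, 3)]))     # [(1, 2)]
```

Which strategy is appropriate depends on the field; replacing with zero, for instance, only makes sense when zero is a meaningful value for that feature.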
27. Clusters
Distance to Categorical?
• Special distance function:
  if xA == xB then
    x distance = 0 (or scaling value)
  else
    x distance = 1
• Assign the centroid the most common category of the member instances
Approach: similar to “k-prototypes”
28. Clusters
Distance to Categorical?
             feature_1   feature_2   feature_3
instance_1   red         cat         ball
instance_2   red         cat         ball
instance_3   red         cat         box
instance_4   blue        dog         fridge
Distances between consecutive instances:
D(instance_1, instance_2) = 0
D(instance_2, instance_3) = 1
D(instance_3, instance_4) = sqrt(3)
Compute Euclidean distance between discrete vectors
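The distances in this example can be reproduced with a short sketch: each feature contributes 0 when the categories match and 1 when they differ, and the per-feature values are combined as a Euclidean distance. The function names are hypothetical, and the centroid helper simply takes the most common category per feature, as described for the k-prototypes-like approach.

```python
import math
import statistics


def categorical_distance(a, b):
    """Euclidean distance over 0/1 mismatch indicators."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))


def categorical_centroid(members):
    """Centroid made of the most common category of each feature."""
    return tuple(statistics.mode(column) for column in zip(*members))


instance_1 = ("red", "cat", "ball")
instance_2 = ("red", "cat", "ball")
instance_3 = ("red", "cat", "box")
instance_4 = ("blue", "dog", "fridge")

print(categorical_distance(instance_1, instance_2))  # 0.0
print(categorical_distance(instance_2, instance_3))  # 1.0
print(categorical_distance(instance_3, instance_4))  # sqrt(3), about 1.732
print(categorical_centroid([instance_1, instance_2, instance_3, instance_4]))
```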
29. Clusters
Text Vectors
Features (thousands):
                "hippo"   "safari"   "zebra"   …
Text Field #1   3         0          1         …
Text Field #2   2         4          0         …
…               0         5          7         …
Cosine Similarity ranges from 1 through 0 to -1.
Cosine Distance = 1 - Cosine Similarity
CD(TF1, TF2) = 0.575736
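The cosine distance quoted above can be checked with a few lines. This sketch assumes the two text fields are represented only by the three term counts visible on the slide ("hippo", "safari", "zebra"); with those values it reproduces the quoted figure.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


text_field_1 = (3, 0, 1)  # counts for "hippo", "safari", "zebra"
text_field_2 = (2, 4, 0)

print(round(cosine_distance(text_field_1, text_field_2), 6))  # 0.575736
```

Cosine distance depends on the direction of the vectors rather than their length, so a long and a short text with the same proportions of terms come out as close.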
36. Anomaly Detection: Finding the Unusual
Poul Petersen
CIO, BigML, Inc.
37. Anomaly Detection
Clusters vs Anomalies
Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Anomalies (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: rank data by dissimilarity
38. Anomaly Detection
Clusters vs Anomalies
Both tasks start from the same unlabeled data:
sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …
Learning Task (Clusters): Find “k” clusters such that the data in each cluster is self-similar
Learning Task (Anomalies): Assign a value from 0 (similar) to 1 (dissimilar) to each instance.
39. Anomaly Detection
Use Cases
• Unusual instance discovery
• Intrusion Detection
• Fraud
• Identify Incorrect Data
• Remove Outliers
• Model Competence / Input Data Drift
40. Anomaly Detection
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use an anomaly detector to identify the most anomalous points and then remove them before modeling.
[Diagram nodes: DATASET, ANOMALY DETECTOR, FILTERED DATASET, CLEAN MODEL]
41. Anomaly Detection
Diabetes Anomalies
[Workflow diagram nodes: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
43. Anomaly Detection
Intrusion Detection
• Dataset of command line history for users
• Data for each user consists of commands, flags, working directories, etc.
• Assumption: Users typically issue the same flag patterns and work in certain directories
GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion.
[Figure labels: Per User Per Dir, All User All Dir]
44. Anomaly Detection
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels.
[Figure labels: Card Level, User Level, Similar User Level]
45. Anomaly Detection
Model Competence
• After putting a model into production, the data being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted.
At Training Time: [Diagram nodes: DATASET, MODEL, ANOMALY DETECTOR]
At Prediction Time:
Prediction      T        T
Confidence      86 %     84 %
Anomaly Score   0.5367   0.7124
Competent?      Y        N
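The competence check amounts to comparing each prediction's anomaly score against a cutoff. In the sketch below the threshold of 0.6 is purely an assumption, picked so that the two example scores fall on opposite sides; the slides do not state which cutoff was used.

```python
def is_competent(anomaly_score, threshold=0.6):
    """Trust the prediction only when the input does not look anomalous.

    `threshold` is a hypothetical cutoff, not a value taken from the talk.
    """
    return anomaly_score < threshold


for confidence, score in [(0.86, 0.5367), (0.84, 0.7124)]:
    verdict = "Y" if is_competent(score) else "N"
    print(f"confidence={confidence:.0%} anomaly_score={score} competent={verdict}")
```

Note that the two predictions have almost the same confidence; it is the anomaly score, not the confidence, that separates them.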
46. Anomaly Detection
Univariate Approach
• Single variable: heights, test scores, etc.
• Assume the value is distributed “normally”
• Compute the standard deviation
  • a measure of how “spread out” the numbers are
  • the square root of the variance (the average of the squared differences from the mean)
• Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~ 3 outliers.
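The univariate rule can be sketched with the standard library. The function name and the sample data are made up for illustration; the standard deviation is the population form, matching the definition above (square root of the average squared difference from the mean).

```python
import statistics


def std_outliers(values, multiple=3.0):
    """Values lying more than `multiple` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > multiple * std]


# Hypothetical measurements: twenty typical values and one extreme one.
measurements = [50.0] * 20 + [150.0]
print(std_outliers(measurements))  # [150.0]

# Expected number of instances beyond 3 standard deviations among 1000
# normally distributed values (both tails).
expected = 1000 * 2 * (1 - statistics.NormalDist().cdf(3))
print(round(expected, 1))  # 2.7
```

The second calculation is where the "~ 3 outliers" figure for 1000 instances comes from, under the normality assumption.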
47. Anomaly Detection
Univariate Approach
[Figure: distribution of frequency vs. measurement, with outliers marked on both sides]
• Available in BigML API
48. Anomaly Detection
Benford’s Law
• In real-life numeric sets, the small digits occur disproportionately often as leading significant digits.
• Applications include:
  • accounting records
  • electricity bills
  • street addresses
  • stock prices
  • population numbers
  • death rates
  • lengths of rivers
• Available in BigML API
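Benford's Law gives the expected share of each leading digit d as log10(1 + 1/d), so 1 leads about 30% of the time and 9 under 5%. The sketch below compares that expectation with observed leading digits; the function names are hypothetical, the powers of 2 are only a convenient test set known to follow the law closely, and this is not the BigML API call mentioned above.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected frequency of `digit` (1-9) as the leading significant digit."""
    return math.log10(1 + 1 / digit)


def leading_digit(value):
    """First non-zero digit of a number."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading significant digit")


def leading_digit_frequencies(values):
    """Observed share of each leading digit, ignoring zeros."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


observed = leading_digit_frequencies([2 ** n for n in range(1, 101)])
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

A dataset whose observed frequencies deviate strongly from the expected ones is a candidate for closer inspection, for example in accounting records.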
52. Anomaly Detection
Human Expert
[Figure: objects partitioned by the features “Round”, “Skinny”, and “Corners”; the object marked “Most unusual” is annotated: “Skinny” but not “smooth”, No “Corners”, Not “Round”]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
53. Anomaly Detection
Human Expert
• The human used prior knowledge to select possible features that separated the objects.
  • “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features.
• Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster.
54. Anomaly Detection
Human Expert
Create features that capture these object differences:
• Length/Width
  • greater than 1 => “skinny”
  • equal to 1 => “round”
  • less than 1 => invert
• Number of Surfaces
  • distinct surfaces require “edges” which have corners
  • easier to count
• Smooth - true or false
56. Anomaly Detection
Random Splits
[Figure: the objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) separated by the splits smooth = True, length/width > 5, num surfaces = 6, length/width = 1, and length/width < 2]
We know that the “splits” matter, but we don’t know the order.
57. Anomaly Detection
Isolation Forest
Grow a random decision tree until each instance is in its own leaf.
[Figure: tree with a Depth axis; instances in shallow leaves are “easy” to isolate, instances in deep leaves are “hard” to isolate]
Now repeat the process several times and use the average Depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
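The procedure can be sketched for numeric features as below. Several things here are assumptions rather than the method from the talk: the code follows only the path of one instance instead of growing the full tree (the depth has the same distribution), it uses no subsampling, and the mapping from average depth to a score is the normalization from the original Isolation Forest paper by Liu et al., which may differ from the formula BigML uses. All names and data are hypothetical.

```python
import math
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary in this partition can be split on.
    splittable = [f for f in range(len(point))
                  if min(r[f] for r in data) != max(r[f] for r in data)]
    if not splittable:
        return depth
    feature = rng.choice(splittable)
    low = min(r[feature] for r in data)
    high = max(r[feature] for r in data)
    threshold = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the split as `point`.
    side = point[feature] < threshold
    same_side = [r for r in data if (r[feature] < threshold) == side]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def anomaly_score(point, data, trees=200, seed=0):
    """Average isolation depth mapped to (0, 1); higher means more anomalous."""
    rng = random.Random(seed)
    avg_depth = sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees
    n = len(data)
    # Normalizing constant from the Isolation Forest paper (assumption).
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-avg_depth / c)


# Hypothetical data: a tight grid of points plus one far-away instance.
data = [(float(x), float(y)) for x in range(5) for y in range(4)] + [(100.0, 100.0)]
print(anomaly_score((100.0, 100.0), data))  # close to 1: easy to isolate
print(anomaly_score((2.0, 2.0), data))      # lower: hard to isolate
```

The far-away instance is usually separated by the very first split, so its average depth is small and its score is high; instances inside the grid need many more splits.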
58. Anomaly Detection
Isolation Forest Scoring
      f_1    f_2   f_3
i_1   red    cat   ball
i_2   red    cat   ball
i_3   red    cat   box
i_4   blue   dog   pen
[Figure: tree depths D = 3, D = 6, and D = 2 combined into a Score]
59. Anomaly Detection
Model Competence
• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means you cannot trust the model.
Prediction      T        T
Confidence      86 %     84 %
Anomaly Score   0.5367   0.7124
Competent?      Y        N
[Diagram nodes: CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR, OPEN LOANS, PREDICTION, ANOMALY SCORE]