Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
However, the the graph theory jargon can make graph analytics seem more intimidating for self-study than is necessary. In this talk, the audience will be exposed to some of the basic concepts of graph theory (no prerequisite math knowledge needed!) and a few of the Python tools available for graph analysis.
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
UMAP is a technique for dimensionality reduction that was proposed 2 years ago that quickly gained widespread usage for dimensionality reduction.
In this presentation I will try to demistyfy UMAP by comparing it to tSNE. I also sketch its theoretical background in topology and fuzzy sets.
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
However, the the graph theory jargon can make graph analytics seem more intimidating for self-study than is necessary. In this talk, the audience will be exposed to some of the basic concepts of graph theory (no prerequisite math knowledge needed!) and a few of the Python tools available for graph analysis.
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
UMAP is a technique for dimensionality reduction that was proposed 2 years ago that quickly gained widespread usage for dimensionality reduction.
In this presentation I will try to demistyfy UMAP by comparing it to tSNE. I also sketch its theoretical background in topology and fuzzy sets.
Chapter summary and solutions to end-of-chapter exercises for "Data Visualization: Principles and Practice" book by Alexandru C. Telea
In this chapter author discusses a number of popular visualization methods for vector datasets: vector glyphs, vector color-coding, displacement plots, stream objects, texture-based vector visualization, and the simplified representation of vector fields.
Section 6.5 presents stream objects, which use integral techniques to construct paths in vector fields. Section 6.7 discusses a number of strategies for simplified representation of vector datasets. Section 6.8 presents a number of illustrative visualization techniques for vector fields, which offer an alternative mechanism for simplified representation to the techniques discussed in Section 6.7 Chapter presents also feature detection methods, algorithm for computing separatrices on field’s topology, and top-down and bottom-up field decomposition methods.
Chapter summary and solutions to end-of-chapter exercises for "Data Visualization: Principles and Practice" book by Alexandru C. Telea
We presented a number of fundamental methods for visualizing scalar data: color mapping, contouring, slicing, and height plots. Color mapping assigns a color as a function of the scalar value at each point of a given domain. Contouring displays all points within a given two- or three-dimensional domain that have a given scalar value. Height plots deform the scalar dataset domain in a given direction as a function of the scalar data. The main advantages of these techniques are that they produce intuitive results, easily understood by users, and they are simple to implement. However, such techniques also have s number of restrictions.
This presentation briefly defines machine learning and its types of algorithms. After that two algorithms are presented. The first is naive bayes classifier for text classification and later k-means for clustering including some strategies to improve results.
Interaction Networks for Learning about Objects, Relations and PhysicsKen Kuroki
For my presentation for a reading group. I have not in any way contributed this study, which is done by the researchers named on the first slide.
https://papers.nips.cc/paper/6418-interaction-networks-for-learning-about-objects-relations-and-physics
Introductory session for basic matlab commands and a brief overview of K-mean clustering algorithm with image processing example.
NOTE: you can find code of k-mean clustering algorithm for image processing in notes.
Slides for Introductory session on K Means Clustering.
simple and good. ppt
Could be used for taking classes for MCA students on Clustering Algorithms for Data mining.
Prepared By K.T.Thomas HOD of Computer Science, Santhigiri College Vazhithala
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
Tutorial at the "Reality of the Semantic Gap in Image Retrieval" tutorial at the first international conference on Semantics And digital Media Technology (SAMT 2006). 6th December 2006.
Chapter summary and solutions to end-of-chapter exercises for "Data Visualization: Principles and Practice" book by Alexandru C. Telea
In this chapter author discusses a number of popular visualization methods for vector datasets: vector glyphs, vector color-coding, displacement plots, stream objects, texture-based vector visualization, and the simplified representation of vector fields.
Section 6.5 presents stream objects, which use integral techniques to construct paths in vector fields. Section 6.7 discusses a number of strategies for simplified representation of vector datasets. Section 6.8 presents a number of illustrative visualization techniques for vector fields, which offer an alternative mechanism for simplified representation to the techniques discussed in Section 6.7 Chapter presents also feature detection methods, algorithm for computing separatrices on field’s topology, and top-down and bottom-up field decomposition methods.
Chapter summary and solutions to end-of-chapter exercises for "Data Visualization: Principles and Practice" book by Alexandru C. Telea
We presented a number of fundamental methods for visualizing scalar data: color mapping, contouring, slicing, and height plots. Color mapping assigns a color as a function of the scalar value at each point of a given domain. Contouring displays all points within a given two- or three-dimensional domain that have a given scalar value. Height plots deform the scalar dataset domain in a given direction as a function of the scalar data. The main advantages of these techniques are that they produce intuitive results, easily understood by users, and they are simple to implement. However, such techniques also have s number of restrictions.
This presentation briefly defines machine learning and its types of algorithms. After that two algorithms are presented. The first is naive bayes classifier for text classification and later k-means for clustering including some strategies to improve results.
Interaction Networks for Learning about Objects, Relations and PhysicsKen Kuroki
For my presentation for a reading group. I have not in any way contributed this study, which is done by the researchers named on the first slide.
https://papers.nips.cc/paper/6418-interaction-networks-for-learning-about-objects-relations-and-physics
Introductory session for basic matlab commands and a brief overview of K-mean clustering algorithm with image processing example.
NOTE: you can find code of k-mean clustering algorithm for image processing in notes.
Slides for Introductory session on K Means Clustering.
simple and good. ppt
Could be used for taking classes for MCA students on Clustering Algorithms for Data mining.
Prepared By K.T.Thomas HOD of Computer Science, Santhigiri College Vazhithala
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
Tutorial at the "Reality of the Semantic Gap in Image Retrieval" tutorial at the first international conference on Semantics And digital Media Technology (SAMT 2006). 6th December 2006.
Intro to SVM with its maths and examples. Types of SVM and its parameters. Concept of vector algebra. Concepts of text analytics and Natural Language Processing along with its applications.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
The term Machine Learning was coined by Arthur Samuel in 1959, an American pioneer in the field of computer gaming and artificial intelligence, and stated that “it gives computers the ability to learn without being explicitly programmed”. Machine Learning is the latest buzzword floating around. It deserves to, as it is one of the most interesting subfields of Computer Science. So what does Machine Learning really mean? Let’s try to understand Machine Learning
Could a Data Science Program use Data Science Insights?Zachary Thomas
I'm Zachary Thomas, a recent graduate of Galvanize's data science immersive. Attached the slide deck for my capstone presentation! Email me with questions at zthomas.nc@gmail.com
SVM Based Identification of Psychological Personality Using Handwritten Text IJERA Editor
Identification of Personality is a complex process. To ease this process, a model is developed using cursive
handwriting. Area based, width based and height based thresholds are set for only character selection, word
selection and line selection. The rest is considered as noise. Followed by feature vector construction. Slope
feature using slope calculation, shape features and edge detection done using Sobel filter and direction
histogram is considered. Based on the direction of handwriting the analysis was done. Writing which rises to
the right shows optimism and cheerfulness. Sagging to the right shows physical or mental weariness. The lines
which are straight, reveals over-control to compensate for an inner fear of loss of control.The analysis was done
using single line and multiple lines. Simple techniques have provided good results. The results using single line
were 95% and multiple lines were 91%.The classification is done using SVM classifier.
Awarded presentation of my research activity, PhD Day 2011, February 23th 2011, Cagliari, Italy.
This presentation has been awarded as the best one of the track on information engineering.
Want to know more?
see my publications at
http://prag.diee.unica.it/pra/ita/people/satta
A COMPARATIVE STUDY ON DISTANCE MEASURING APPROACHES FOR CLUSTERINGIJORCS
Clustering plays a vital role in the various areas of research like Data Mining, Image Retrieval, Bio-computing and many a lot. Distance measure plays an important role in clustering data points. Choosing the right distance measure for a given dataset is a biggest challenge. In this paper, we study various distance measures and their effect on different clustering. This paper surveys existing distance measures for clustering and present a comparison between them based on application domain, efficiency, benefits and drawbacks. This comparison helps the researchers to take quick decision about which distance measure to use for clustering. We conclude this work by identifying trends and challenges of research and development towards clustering.
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
At this workshop, you will build your own messaging insights system - data ingestion from a live data source (Reddit), queueing, deploying a machine learning model, and serving messages with insights to your mobile phone!
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
In the same way that we need to make assertions about how code functions, we need to make assertions about data, and unit testing is a promising framework. In this talk, we'll explore what is unique about unit testing data, and see how Two Sigma's open source library Marbles addresses these unique challenges in several real-world scenarios.
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
How many newspapers should be distributed to each store for sale every day? The data science group at The New York Times addresses this optimization problem using custom time series modeling and analytical solutions, while also incorporating qualitative business concerns. I'll describe our modeling and data engineering approaches, written in Python and hosted on Google Cloud Platform.
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
In September 2017, dockless bikeshare joined the transportation options in the District of Columbia. In March 2018, scooter share followed. During the pilot of these technologies, Python has helped District Department of Transportation answer some critical questions. This talk will discuss how Python was used to answer research questions and how it supported the evaluation of this demonstration.
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
There are many stories of developers creating databases that don't operate at scale. The application is good, but the database won't work the realistic volumes of data. It's like a horror movie where they never looked behind the door, ran into the dark forest and night, and discovered the database was the monster killing their application. How can we leverage Python to avoid scaling problems?
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.
We will be using Beautiful Soup to Webscrape the IMDB website and create a function that will allow you to create a dictionary object on specific metadata of the IMDB profile for any IMDB ID you pass through as an argument.
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
This talk describes an experimental approach to time series modeling using 1D convolution filter layers in a neural network architecture. This approach was developed at System1 for forecasting marketplace value of online advertising categories.
Extending Pandas with Custom Types - Will AydPyData
Pandas v.0.23 brought to life a new extension interface through which you can extend NumPy's type system. This talk will explain what that means in more detail and provide practical examples of how the new interface can be leveraged to drastically improve your reporting.
Machine learning models are increasingly used to make decisions that affect people’s lives. With this power comes a responsibility to ensure that model predictions are fair. In this talk I’ll introduce several common model fairness metrics, discuss their tradeoffs, and finally demonstrate their use with a case study analyzing anonymized data from one of Civis Analytics’s client engagements.
What's the Science in Data Science? - Skipper SeaboldPyData
The gold standard for validating any scientific assumption is to run an experiment. Data science isn’t any different. Unfortunately, it’s not always possible to design the perfect experiment. In this talk, we’ll take a realistic look at measurement using tools from the social sciences to conduct quasi-experiments with observational data.
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
Forecasting time-series data has applications in many fields, including finance, health, etc. There are potential pitfalls when applying classic statistical and machine learning methods to time-series problems. This talk will give folks the basic toolbox to analyze time-series data and perform forecasting using statistical and machine learning models, as well as interpret and convey the outputs.
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
A historical text may now be unreadable, because its language is unknown, or its script forgotten (or both), or because it was deliberately enciphered. Deciphering needs two steps: Identify the language, then map the unknown script to a familiar one. I’ll present an algorithm to solve a cartoon version of this problem, where the language is known, and the cipher is alphabet rearrangement.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Deprecating the state machine: building conversational AI with the Rasa stack...PyData
Rasa NLU & Rasa Core are the leading open source libraries for building machine learning-based chatbots and voice assistants. In this live-coding workshop you will learn the fundamentals of conversational AI and how to build your own using the Rasa Stack.
Towards automating machine learning: benchmarking tools for hyperparameter tu...PyData
Fine-tuning machine learning models (hyperparameter tuning) is crucial but tedious. Fortunately, optimization promises to automate this task. We give an overview on algorithms for this task and explain their inner workings. To help you selecting one for your project, we benchmark implementations in Python against human experts. We link to the discussion on automating machine learning in general.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
4. The Machine Learning Problem:
Given a set of n samples of data such that each sample is
represented by more than a single number (e.g. multivariate
data that has several attributes or features), create a model
that is able to predict unknown properties of each sample.
7. Feature space is the n-dimensions where our variables live (not
including target).
Feature extraction is the art of creating a space with decision
boundaries.
8. Example
Target
Y ≡ Thickness of car tires after some testing period
Variables
X1
≡ distance travelled in test
X2
≡ time duration of test
X3
≡ amount of chemical C in tires
The feature space is R3
, or more accurately, the positive quadrant in R3
as all the X
variables can only be positive quantities.
9. Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, X4
(this is the feature
extraction part):
X4
= X1
/ X2
≡ the speed of the vehicle during testing.
This extends our old feature space into a new one, the positive part of R4
.
A mapping is a function, ϕ, from R3
to R4
:
ϕ(x1
,x2
,x3
) = (x1
,x2
,x3
,x1
x2
)
11. Real-world data is often not
represented numerically
out of the box (e.g. text,
images), therefore some
transformation must be
applied in order to do
machine learning.
12. Tricky Part
Machine learning relies on our ability to imagine data as
points in space, where the relative closeness of any two
is a measure of their similarity.
So...when we transform those non-numeric features into
numeric ones, how should we quantify the distance
between instances?
13. Many ways of quantifying “distance” (or similarity)
often the
default for
numeric data
common rule
of thumb for
text data
14. With text, our choice of distance metric is very
important! Why?
15. Challenges of Modeling Text Data
● Very high dimensional
○ One dimension for every word (token) in the corpus!
● Sparsely distributed
○ Documents vary in length!
○ Most instances (documents) may be mostly zeros!
● Has some features that are more important than others
○ E.g. the “of” dimension vs. the “basketball” dimension when clustering sports articles.
● Has some feature variations that matter more than others
○ E.g. freq(tree) vs. freq(horticulture) in classifying gardening books.
17. scikit-learn
from sklearn.metrics import pairwise_distances(X, Y=None,
metric=’euclidean’, n_jobs=None, **kwds)
Compute the distance matrix from a vector array X and optional Y.
Valid values for metric are:
● From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
● From scipy.spatial.distance...
19. ● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual
diagnostics, and visual steering.
● Not a replacement for other visualization
libraries.
Yellowbrick
Feature
Analysis
Algorithm
Selection
Hyperparameter
Tuning
model selection isiterative, but can besteered!
20. TSNE (t-distributed Stochastic Neighbor
Embedding)
1. Apply SVD (or PCA) to reduce
dimensionality (for efficiency).
2. Embed vectors using probability
distributions from both the original
dimensionality and the decomposed
dimensionality.
3. Cluster and visualize similar
documents in a scatterplot.
21. Three Example Datasets
Hobbies corpus
● From the Baleen project
● 448 newspaper/blog articles
● 5 classes: gaming, cooking, cinema, books, sports
● Doc length (in words): 532 avg, 14564 max, 1 min
Farm Ads corpus
● From the UCI Repository
● 4144 ads represented as a list of metadata tags
● 2 classes: accepted, not accepted
● Doc length (in words): 270 avg, 5316 max, 1 min
Dresses Attributes Sales corpus
● From the UCI Repository
● 500 dresses represented as features: neckline, waistline, fabric, size, season
● Doc length (in words): 11 avg, 11 max, 11 min
22. Euclidean Distance
Euclidean distance is the straight-line distance between 2 points in Euclidean
(metric) space.
tsne = TSNEVisualizer(metric="euclidean")
tsne.fit(docs, labels)
tsne.poof()
5 10 15 20 25
252015105
Doc 2
(20, 19)
Doc 1
(7, 14)
24. Cityblock (Manhattan) Distance
Manhattan distance between two points is computed as the sum of the absolute
differences of their Cartesian coordinates.
tsne = TSNEVisualizer(metric="cityblock")
tsne.fit(docs, labels)
tsne.poof()
26. Chebyshev Distance
Chebyshev distance is the L∞-norm of the difference between two points (a special
case of the Minkowski distance where p goes to infinity). It is also known as
chessboard distance.
tsne = TSNEVisualizer(metric="chebyshev")
tsne.fit(docs, labels)
tsne.poof()
28. Minkowski Distance
Minkowski distance is a generalization of Euclidean, Manhattan, and Chebyshev
distance, and defines distance between points in a normalized vector space as the
generalized Lp-norm of their difference.
tsne = TSNEVisualizer(metric="minkowski")
tsne.fit(docs, labels)
tsne.poof()
30. Mahalanobis Distance
A multidimensional generalization
of the distance between a point
and a distribution of points.
tsne = TSNEVisualizer(metric="mahalanobis", method='exact')
tsne.fit(docs, labels)
tsne.poof()
Think: shifting and rescaling coordinates with respect to distribution. Can help find
similarities between different-length docs.
32. Cosine “Distance”
Cosine “distance” is the cosine of the angle between two doc vectors. The more
parallel, the more similar. Corrects for length variations (angles rather than
magnitudes). Considers only non-zero elements (efficient for sparse vectors!).
Note: Cosine distance is not technically a distance measure because it doesn’t
satisfy the triangle inequality.
tsne = TSNEVisualizer(metric="cosine")
tsne.fit(docs, labels)
tsne.poof()
34. Canberra Distance
Canberra distance is a weighted version of Manhattan distance. It is often used for
data scattered around an origin, as it is biased for measures around the origin and
very sensitive for values close to zero.
tsne = TSNEVisualizer(metric="canberra")
tsne.fit(docs, labels)
tsne.poof()
36. Jaccard Distance
Jaccard distance defines similarity between finite sets as the
quotient of their intersection and their union. More effective for
detecting things like document duplication.
tsne = TSNEVisualizer(metric="jaccard")
tsne.fit(docs, labels)
tsne.poof()
38. Hamming Distance
Hamming distance between two strings is the number of positions at which the
corresponding symbols are different. Measures minimum substitutions required to
change one string into the other.
tsne = TSNEVisualizer(metric="hamming")
tsne.fit(docs, labels)
tsne.poof()