Practical Machine Learning - Part 1 contains:
- Basic notions of ML (what tasks are there, what is a model, how to measure performance)
- A couple of examples of problems and solutions (taken from previous work)
- A brief presentation of open-source software used for ML (R, scikit-learn, Weka)
2. Outline
• About me
• Why machine learning?
• Basic notions of machine learning
• Sample problems and solutions
– Finding vandalism on Wikipedia
– Sexual predator detection
– Identification of buildings on a map
– Change in the meaning of concepts over time
• Practical ML
– R
– Python (sklearn)
– Java (Weka)
– Others
29 May 2014 Practical Machine Learning (Part 1) Softbinator Talks #49
3. About me
• Started working in 2003 (year 2 of studies)
– J2EE developer
• Then created my own company with a colleague (in 2005)
– Web projects (PHP, Javascript, etc.), involved in some startups for the
technological aspects (e.g. Happyfish TV)
• Started teaching at A&C, UPB (in 2006)
– TA for Algorithms, Natural Language Processing
• Started my PhD (in 2007)
– Natural Language Processing, Discourse Analysis, Technology-Enhanced
Learning
• Now I am teaching several lectures: Algorithm Design, Algorithm Design
and Complexity, Symbolic and Statistical Learning, Information Retrieval
• Working on several topics in NLP: opinion mining, conversational agents,
question-answering, culturonomics
• Interested in all new projects that use NLP, ML and IR as well as any other
“smart” applications
4. Why machine learning?
• “Simply put, machine learning is the part of artificial
intelligence that actually works” (Forbes, link)
• Most popular course on Coursera
• Popular?
– Meaning practical: ML results are nice and easy to show to
others
– Meaning well-paid/in demand: companies are increasing
demand for ML specialists
• Large volumes of data, corpora, etc.
• Computer science, data science, almost any other
domain (especially in research)
5. Why ML?
• Mix of Computer science and Math (Statistics)
skills
• Why is math essential to ML?
– http://courses.washington.edu/css490/2012.Winter/lecture_slides/02_math_essentials.pdf
• “To get really useful results, you need good
mathematical intuitions about certain general
machine learning principles, as well as the
inner workings of the individual algorithms.”
6. Basic Notions of ML
• Different tasks in ML
• Supervised
– Regression
– Classification
• Unsupervised
– Clustering
– Dimensionality reduction / visualization
• Semi-supervised
– Label propagation
• Reinforcement learning
• Deep learning
7. Supervised learning
• Dataset consisting of labeled examples
• Each example has one or several attributes
and a dependent variable (label, output,
response, predicted variable)
• After building, selection and assessing the
model on the labeled data, it is used to find
labels for new data
8. Supervised learning
• Dataset is usually split into three parts
– A common rule of thumb is a 50-25-25 split
• Training set
– Used to build models
– Compute the basic parameters of the model
• Holdout/validation set
– Used to find out the best model and adjust some parameters
of the model (usually, in order to minimize some error)
– Also called model selection
• Test set
– Used to assess the final model on new data, after model
building and selection
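The three-way split above can be sketched in a few lines. This is an illustrative example, not code from the slides; it assumes scikit-learn's current `model_selection` API and uses toy data.

```python
# A minimal sketch of the 50-25-25 train/validation/test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # toy features
y = (X.ravel() >= 50).astype(int)    # toy labels

# First set aside 50% of the data for training...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
# ...then split the remainder evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 50 25 25
```

The validation set is used for model selection and the test set only once, at the very end, to estimate performance on new data.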
10. Cross-validation
• If the number of partitions (S) equals the number of training
instances, we get leave-one-out (only for small datasets)
• Main disadvantages:
– The number of runs is increased by a factor of S
– Different runs may yield different parameters for the same model
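The relation between S-fold cross-validation and leave-one-out can be shown directly. A small sketch (an illustration added here, assuming scikit-learn's current API):

```python
# S-fold cross-validation runs S training rounds; with S equal to
# the number of training instances it degenerates to leave-one-out.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(10, 1)     # 10 toy training instances

n_runs = sum(1 for _ in KFold(n_splits=5).split(X))   # one fit per fold
n_loo = sum(1 for _ in LeaveOneOut().split(X))        # one fit per instance

print(n_runs, n_loo)  # 5 10
```

This makes the first disadvantage concrete: the number of model fits grows linearly with S.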
11. Regression
• Supervised learning when the output variable
is continuous
– This is the usual textbook definition
– Also may be used for binary output variables (e.g.
Not-spam=0, Spam=1) or output variables that
have an ordering (e.g. integer numbers, sentiment
levels, etc.)
13. Classification
• Supervised learning when the output variable
is discrete (classes)
– Can also be used for binary or integer outputs
– Even if the classes are ordered, classification usually
does not account for the order
15. Classification
• Several important types
– Binary vs. multiple classes
– Hard vs. soft
– Single-label vs. multi-label (per example)
– Balanced vs. unbalanced classes
• Extensions that are semi-supervised
– One-class classification
– PU (Positive and Unlabeled) learning
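The hard vs. soft distinction above can be seen in a toy binary classifier (an illustrative sketch, assuming scikit-learn's API): `predict()` returns one hard class label per example, while `predict_proba()` returns soft per-class scores.

```python
# Toy binary classification: hard labels vs. soft (per-class) output.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])     # two discrete classes

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
hard = clf.predict(np.array([[1.5], [11.5]]))   # one class per example
soft = clf.predict_proba(np.array([[1.5]]))     # one score per class
print(hard)
print(soft)
```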
16. Unsupervised learning
• For an unlabeled dataset, infer properties (find
“hidden” structure) about the distribution of the
objects in the dataset
– without any correct answers (labels) to guide the process
• Generally, much more data than for supervised
learning
• There are several techniques that can be applied
– Clustering (cluster analysis)
– Dimensionality reduction (visualization of data)
– Association rules, frequent itemsets
17. Clustering
• Partition dataset into clusters (groups) of
objects such that the ones in the same cluster
are more similar to each other than to objects
in different clusters
• Similarity
– The most important notion when clustering
– “Inverse of a distance”
• Possible objective: approximate the modes of
the input dataset distribution
18. Clustering
• Lots of different cluster models:
– Connectivity (distance models) (e.g Hierarchical)
– Centroid (e.g. K-means)
– Distribution (e.g GMM)
– Density of data (e.g. DBSCAN)
– Graph (e.g. cliques, community detection)
– Subspace (e.g. biclustering)
– …
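A centroid-based model from the list above, k-means, illustrates the clustering definition (a sketch added here, using scikit-learn): objects in the same cluster end up more similar to each other than to objects in the other cluster.

```python
# K-means on two obvious groups of 2-D points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # same label within a group, different across groups
```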
21. Dimensionality reduction
• Usually, unsupervised datasets are large both
in number of examples and in number of
attributes (features)
• Reduce the number of features in order to
improve (human) readability and
interpretation of data
• Mainly linked to visualization
– Also used for feature selection, data compression
or denoising
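A minimal PCA sketch of the idea: project high-dimensional data onto a few components for visualization, with the explained-variance ratio telling how much information survives (synthetic data, only for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 examples with 10 features that really live in 2 dimensions.
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))       # the true 2-D structure
X = base @ rng.normal(size=(2, 10))    # embedded in 10 dimensions
X += 0.01 * rng.normal(size=X.shape)   # small noise

# Reduce to 2 components for plotting / inspection.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)                             # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 on this data
```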
23. Semi-supervised learning
• Mix of labeled and unlabeled data in the dataset
• Enlarge the labeled dataset with unlabeled instances for which we
might determine the correct label with a high probability
• Unlabeled data is much cheaper or simpler to get than labeled data
– Human annotation either requires experts or is not challenging (it is
boring)
– Annotation may also require specific software or other types of
devices
– Students (or Amazon Mturk users) are not always good annotators
• More information:
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
24. Label propagation
• Start with the labeled dataset
• Want to assign labels to some (or all) of the unlabeled
instances
• Assumption: “data points that are close have similar labels”
• Build a fully connected graph with both labeled and unlabeled
instances as vertices
• The weight of an edge represents the similarity between the
points (or inverse of a distance)
• Repeat until convergence
– Propagate labels through edges (larger weights
allow faster propagation)
– Nodes will have soft labels
• More info: http://lvk.cs.msu.su/~bruzz/articles/classification/zhu02learning.pdf
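This algorithm is available in scikit-learn as `sklearn.semi_supervised.LabelPropagation` (unlabeled instances are marked with -1). A sketch on synthetic blobs, where a single labeled point per cluster is enough for labels to spread:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two blobs; only one labeled point per blob, the rest marked -1.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)),
               rng.normal(4, 0.3, (15, 2))])
y = np.full(30, -1)
y[0], y[15] = 0, 1  # the only two labeled instances

# RBF kernel weights play the role of edge similarities in the graph.
lp = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, y)

# After convergence every node has a label (soft labels are
# available in lp.label_distributions_).
print(lp.transduction_[:15])   # first blob: label 0
print(lp.transduction_[15:])   # second blob: label 1
```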
26. Basic Notions of ML
• How to assess the performance of a model?
• Used to choose the best model
• Supervised learning
– Assessing errors of the model on the validation
and test sets
30. Basic Notions of ML
• Overfitting
– Low error on training data
– High error on (unseen/new) test data
– The model does not generalize from the training data
to unseen data
– Possible causes:
• Too few training examples
• Model too complex (too many parameters)
• Underfitting
– High error both on training and test data
– Model is too simple to capture the information in the
training data
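A sketch of overfitting with an overly complex model: an unpruned decision tree memorizes a noisy training set (perfect training score) but does clearly worse on held-out data. Data and model choice are only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Noisy binary problem: the class depends on x0, plus 20% label noise.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, (400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unpruned (maximally deep) tree: too many parameters for this data.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr))  # 1.0 on training data (memorized)
print(deep.score(X_te, y_te))  # clearly lower on unseen test data
```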
32. Sample problems and solutions
• Chosen from prior experience
• May not be the most interesting problems for all
(any) of you
• I have tried to present 4 problems that are
somehow diverse in solution and simple to
explain
• All solutions are state-of-the-art (not necessarily
the very best, but among the top)
– Very best solution may be found in research papers
33. Finding vandalism on Wikipedia
• Work with Dan Cioiu
• Vandalism = editing in a malicious manner that is intentionally
disruptive
• Dataset used at PAN 2011
– Real articles extracted from Wikipedia in 2009 and 2010
– Training data: 32,000 edits
– Test data: 130,000 edits
• An edit consists of:
– the old and new content of the article
– author information
– revision comment and timestamp
• We chose to use only the actual text in the revision (no author and
timestamp information) to predict vandalism
37. Combining results using a meta-classifier
38. Final results on test set
• Source: http://link.springer.com/chapter/10.1007%2F978-3-642-40495-5_32
39. Main conclusions
• Feature extraction is important in ML
– Which are the best features?
• Syntactic and context analysis are the most important
• Semantic analysis is slow but increases detection by
10%
• Using features related to users would further improve
our results
• A meta-classifier sometimes improves the results of
several individual classifiers
– Not always! Maybe another discussion on this…
40. Sexual predator detection
• Work with Claudia Cardei
• Automatically detect sexual predators in a chat
discussion in order to alert the parents or the police
• Corpus from PAN 2012
– 60000+ chat conversations (around 10000 after removing
outliers)
– 90000+ users
– 142 sexual predators
• Sexual predator is used “to describe a person seen as
obtaining or trying to obtain sexual contact with
another person in a metaphorically predatory manner”
(Wikipedia)
41. Problem description
• Decide whether a user is a predator or not
• Use only the chats that have a predator
(balanced corpus: 50% predators - 50%
victims)
• One document for each user
• How to determine the features?
42. Simple solution
• “Bag of words” (BoW) features with tf-idf
weighting
• Feature extraction using mutual information
• MLP (neural network) with 500 features: 89%
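The same pipeline (tf-idf BoW, mutual-information feature selection, MLP) can be sketched in scikit-learn. The documents below are made-up stand-ins (the real corpus is the PAN 2012 chat data), and the feature count is reduced to k=10 for the toy vocabulary, not the 500 used in the actual experiment:

```python
from functools import partial

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Illustrative stand-in documents (one per user) and labels.
docs = ["hi how are you today", "want to meet somewhere alone",
        "homework is boring lol", "send me your picture now",
        "what game do you play", "do not tell your parents"] * 5
labels = [0, 1, 0, 1, 0, 1] * 5

model = make_pipeline(
    TfidfVectorizer(),                        # BoW with tf-idf weighting
    SelectKBest(partial(mutual_info_classif,  # MI feature selection
                        random_state=0), k=10),
    MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs",
                  max_iter=500, random_state=0),
)
model.fit(docs, labels)
print(model.score(docs, labels))
```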
43. Going beyond
• There should be other features that are discriminative for predators
• Investigated behavioral features:
– Percentage of questions and inquiries initiated by a user
– Percentage of negations
– Expressions that might denote an underage user (“don’t know”, “never
did”,…)
– The age of the user if found in a chat reply (“asl” like answers)
– Percentage of slang words
– Percentage of words with a sexual connotation
– Readability scores (Flesch)
– Who started the conversation
• AdaBoost (with DecisionStump): 90%
• Random Forest: 93%
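The two classifiers above can be sketched in scikit-learn on synthetic stand-ins for the behavioral features (percent questions, percent slang, readability, ...); the data and decision rule below are made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Synthetic stand-in feature matrix: one row per user, 8 behavioral
# features, with a hypothetical rule generating the labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# AdaBoost's default base estimator is a depth-1 decision tree,
# i.e. a decision stump (as in Weka's DecisionStump).
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(ada.score(X, y), rf.score(X, y))
```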
44. Main conclusions
• Feature selection is useful when the number of
features is large
– Reduced from 10k+ features to hundreds/thousands
– We used MI; there are other techniques
• Fewer features can be as descriptive as (or even
more descriptive than) the BoW model
– Difficult to choose
– Need additional computations and information
– Need to understand the dataset
45. Identification of buildings on a map
• Work with Alexandra Vieru, Paul Vlase
• Given an aerial image, identify the buildings
• Non-military purposes:
– Estimate population density
– Identify main landmarks (shopping malls, etc.)
– Identify new buildings over time
• Features
– HOG descriptors – 3x3 cells
– Histogram of RGB – 5 bins, normalized
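The histogram-of-RGB feature can be sketched with NumPy alone: a 5-bin, per-channel, normalized histogram concatenated into one feature vector. (The HOG part would come from an image-processing library such as `skimage.feature.hog`; the "aerial tile" below is a random stand-in.)

```python
import numpy as np

def rgb_histogram(image, bins=5):
    """Normalized per-channel color histogram as one feature vector;
    `image` is an (H, W, 3) uint8 array."""
    feats = []
    for channel in range(3):
        hist, _ = np.histogram(image[:, :, channel],
                               bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())  # normalize each channel
    return np.concatenate(feats)         # length 3 * bins

# A random stand-in for an aerial image tile, for illustration only.
tile = np.random.RandomState(0).randint(0, 256, (32, 32, 3), dtype=np.uint8)
vec = rgb_histogram(tile)
print(vec.shape)  # (15,)
```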
48. Main conclusions
• Verify which are the best features
– If this is possible
• Must check for overfitting using separate train and
test corpora
• Each task should have a baseline (a simple
classification method)
– Here it might be SVM with RGB histogram features
• Actually this could be treated as a one-class
classification (or PU learning) problem
– There are many more negative instances than positive
ones
49. Change in the meaning of concepts
over time
• Work with Costin Chiru, Bianca Dinu, Diana Mincu
• Assumption:
– Different meanings of words should be identifiable
based on the other words that are associated with
them
– These other words change over time, thus
denoting a change in the meaning (or in the usage
of the meanings)
• Using the Google Books n-grams corpus
– Dependency information
– 5-grams information centered on the analyzed word
51. • Clustering using tf-idf over dependencies of the
analyzed word
• Each distinct year is considered a different document
and each dependency is a feature/term
• Example for the word “gay”:
– Cluster 1 (1600-1925): signature {more, lively, happy,
brilliant, cheerful}
– Cluster 2 (1923-1969): signature {more, lively, happy,
cheerful, be}
– Cluster 3 (1981-today): signature {lesbian, bisexual, anti,
enolum, straight}
52. Practical ML
• Brief discussion of alternatives that might be
used by a programmer
• The main idea is to show that it is simple to
start working on a ML task
• However, it is more difficult to achieve very
good results
– Experience
– Theoretical underpinnings of models
– Statistical background
53. R language
• Designed for “statistical computing”
• Lots of packages and functions for ML & statistics
• Simple to output visualizations
• Code is easy to write and short
• It is kind of slow on some tasks
• Need to understand some new concepts: data frame,
factor, etc.
• You may find the same functionality in several
packages (e.g. NaïveBayes, Cohen’s Kappa, etc.)
54. Python (sklearn)
• sklearn (scikit-learn) package developed for
ML
• Uses existing packages numpy, scipy
• Started as a GSOC project in 2007
• It is newer, efficiently implemented (time and
memory), with lots of functionality
• Easy to write code
– Naming conventions, parameters, etc.
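The consistent naming conventions are what make sklearn code easy to write: every estimator exposes the same `fit`/`predict` interface, so swapping models is a one-line change. A sketch on a toy 1-D dataset:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D dataset, just to show the shared API.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

# Both estimators are used identically: construct, fit, predict.
preds = {}
for Model in (GaussianNB, KNeighborsClassifier):
    clf = Model().fit(X, y)
    preds[Model.__name__] = list(clf.predict([[0.2], [4.8]]))

print(preds)  # both models agree on this easy data
```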
56. Java (Weka)
• Similar to scikit-learn for Python
• Older (since 1997) and more stable than scikit-learn
• Some implementations are (were) a bit less efficient,
especially in memory consumption
• However, usually it is fast, accurate, and has lots of
other useful utilities
• It also has a GUI for getting some results fast without
writing any code
• Lots of people are debating whether Weka or scikit-learn is
better. I don’t want to get into this discussion as it depends on
your task and resources.
57. Not discussed
• RapidMiner
– Easy to use
– GUI interface for building the processing pipeline
– Can write some code/plugins
• Matlab/Octave
– Numeric computing
– Paid vs. free
• SPSS/SAS/Stata
– Somewhat similar to R; faster, more robust, but expensive
• Many others:
– Apache Mahout / MALLET / NLTK / Orange / MLPACK
– Specific solutions for various ML models
58. Conclusions
• Lots of interesting ML tasks and data available
– Do you have access to new and interesting data in your
business?
• Several easy to use programming alternatives
• Understand the basics
• Enhance your maths (statistics) skills for a better ML
experience
• Hands-on experience and communities of practice
– www.kaggle.com (and other similar initiatives)