Slides from my 2.5-day workshop at the 2018 Learning Analytics Summer Institute (LASI'18), held at Teachers College, Columbia University, June 11-13, 2018.
Machine Learning, AI, Deep Learning, Statistics, Data Mining… all of these are today's buzzwords, but what lies behind them?
Through concrete examples, we will walk through the different approaches to Machine Learning and the main families of algorithms (don't worry: without going into the core of their implementations), then the tools and frameworks available to Data Scientists… and finally, we will try to predict the future!
Salon Data - Nantes - 19 September 2017
https://salondata.fr/2017/07/12/0930-1030-ml/
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... (Rodney Joyce)
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before, check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
London TensorFlow Meetup - 18th July 2017 (Daniel Ecer)
Slides from London TensorFlow Meetup on 18th July 2017
Corresponding repositories:
https://github.com/elifesciences/sciencebeam
https://github.com/elifesciences/sciencebeam-gym
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen... (Hima Patel)
It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important, as the quality of the data directly influences the quality of a model. In this session, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques for finding data quality issues and preparing data when building ML pipelines. We will also discuss the latest advances in these fields and highlight areas that need innovation. Finally, we will discuss the challenges posed by industry workloads and the gaps that must be addressed to make data-centric AI real in industry settings.
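The kind of first-pass EDA checks the session describes can be as simple as a few lines of pandas; a minimal sketch on a hypothetical toy table (the column names and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical toy training table standing in for a real ML dataset.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "fare": [7.25, 71.28, 8.05, None],
    "survived": [0, 1, 1, 0],
})

# First-pass data quality checks: missing values and basic distributions.
missing_per_column = df.isna().sum()
summary = df.describe()

print(missing_per_column.to_dict())
```

Counting missing values per column and scanning the `describe()` summary for impossible ranges are usually the cheapest quality checks to run before any modelling step.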
Combining machine learning and search through learning to rank (Jettro Coenradie)
In this presentation, we will go through all the steps to use machine learning to improve your search results. We'll discuss the search basics you need to know as well as some machine learning basics. After that, we use a sample application available at the URL https://rolling500.luminis.amsterdam to show improvements using a trained model and the learning to rank plugin in Elasticsearch.
Applied Machine Learning for Ranking Products in an Ecommerce Setting (Databricks)
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should the products be shown? Ranking products is also important if a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production using Elasticsearch, and creating Tableau dashboards. In this talk, we are going to demonstrate how we use Spark to build the whole pipeline for ranking products, and the challenges we faced along the way.
Combining machine learning and search through learning to rank (Jettro Coenradie)
With advanced search tools like Solr and Elasticsearch available, companies are embedding search in almost all their products and websites. Search is becoming mainstream, so we can focus on teaching the search engine tricks to return more relevant results. One new trick is called "learning to rank". During the presentation, you'll learn what learning to rank is and when to apply it, and of course you'll get an example showing how it works, using Elasticsearch and a learning to rank plugin. After this presentation, you will have learned to combine machine learning models and search.
Presented at the OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France.
Machine learning for sensor data analytics (MATLABISRAEL)
In this presentation we will see how machine learning is done in the MATLAB environment. We will present several built-in capabilities and apps that make the machine learning process faster and more efficient, such as the Classification Learner, the Regression Learner, and Bayesian Optimization. Based on data collected from smartphone sensors, we will build a classification system that recognises the activity the user is performing: walking, climbing stairs, lying down, and so on.
Recommender systems support the decision-making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone, across domains like e-commerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143% for MAP@5. It also scales well to millions of items and users, returning recommendations in tens of milliseconds.
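MAP@5, the metric cited above, averages per-user precision over the top five ranked recommendations. A minimal sketch of how it is computed (the recommendation lists and relevance sets below are invented for illustration):

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: mean of precision@i over the ranks i that hit a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=5):
    """MAP@k: AP@k averaged across all users."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical example: two users, each with a top-5 recommendation list.
recs = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
rels = [{"a", "c"}, {"v"}]
print(map_at_k(recs, rels))
```

Because each hit is weighted by the precision at its rank, placing relevant items earlier in the list raises the score, which is why MAP@k is a common offline metric for ranking-quality comparisons like the one above.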
Event: O'Reilly Artificial Intelligence Conference, New York, 18.04.2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimizing the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms or mathematics, or programming skills. The mechanism works by allowing end users to simply provide data; the system automatically does the rest by determining the approach for performing the particular ML task. At first this may sound discouraging to those aiming at the “sexiest job of the 21st century” - data science. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Introduction to Learning Analytics for High School Teachers and Managers (Vitomir Kovanovic)
Presentation at the first Learning Analytics Learning Network (LALN) Event in Adelaide, Australia on Oct 22, 2019.
Abstract:
With the increased adoption of technology, institutions have unprecedented opportunities to continuously improve the quality of their services through data collection and analysis. Schools and universities now have data about learners and their contexts that can provide valuable insight into how they learn. Early attempts were directed towards mining educational data to identify students at risk and develop interventions. More recently, researchers and practitioners have been deploying more sophisticated approaches. These include analyses of the learner behaviour that leads to various learning outcomes, of social networks and teams, and of employability, creativity, and critical thinking. Analysing the digital traces generated through learning processes requires a broad suite of methods from data science, statistics, psychometrics, and the social and learning sciences.
This workshop aims to introduce teachers and educators to the fast-growing and promising field of learning analytics and to explore how digital data can be used for the analysis and improvement of student learning. First, we will provide an overview of learning analytics, its key methods and approaches, and the problems for which it can be used. Second, attendees will engage in group learning activities to explore ways in which learning analytics could be used within their institutions. The focus will be on identifying learning-related challenges relevant to their particular context and exploring how learning analytics can be used practically and effectively.
Extending video interactions to support self-regulated learning in an online ... (Vitomir Kovanovic)
Slides from our presentation at the ASCILITE'18 conference in Geelong, Victoria. The full paper is available in the ASCILITE conference proceedings at http://ascilite.org/wp-content/uploads/2018/12/ASCILITE-2018-Proceedings.pdf
More Related Content
Similar to Unsupervised Learning for Learning Analytics Researchers
Analysing social presence in online discussions through network and text anal... (Vitomir Kovanovic)
The slides from our presentation at the IEEE ICALT'19 conference.
Abstract:
This paper presents an approach to studying relationships between students' social presence and course topics from transcripts of asynchronous discussions in online learning environments. Specifically, the paper uses topic modelling and epistemic network analysis to investigate how students' social presence is expressed across different course topics. Finally, we show how this method can be adopted to examine how students' social presence changed due to an instructional intervention. The results of this study and its implications are further discussed.
Automated Analysis of Cognitive Presence in Online Discussions Written in Por... (Vitomir Kovanovic)
Slides from our EC-TEL'18 paper presentation. The full paper is available at https://dx.doi.org/10.1007/978-3-319-98572-5_19
Abstract:
This paper presents a method for the automated content analysis of students’ messages in asynchronous discussions written in Portuguese. In particular, the paper looks at the problem of coding discussion transcripts for the levels of cognitive presence, a key construct in the widely used Community of Inquiry model of online learning. Although there are techniques for coding for cognitive presence in English, the literature is still poor in methods for other languages, such as Portuguese. The proposed method uses a set of 87 different features to create a random forest classifier that automatically extracts the cognitive phases. The developed model reached a Cohen’s κ of .72, which represents “substantial” agreement and is above the κ threshold of .70 commonly used in the literature for determining a reliable quantitative content analysis. This paper also provides some theoretical insights into the nature of cognitive presence by looking at the classification features that were most relevant for distinguishing between its different phases.
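Cohen's κ, the agreement measure the paper reports, corrects the raw agreement between two codings for the agreement expected by chance. A minimal sketch of the computation (the codings below are invented for illustration, not the paper's data):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two codings of the same items."""
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labelled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: chance agreement from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings of ten messages into cognitive-presence phases 0-3
# (e.g. a human coder vs. an automated classifier).
human      = [0, 1, 1, 2, 2, 3, 3, 1, 2, 0]
classifier = [0, 1, 1, 2, 3, 3, 3, 1, 2, 1]
print(round(cohens_kappa(human, classifier), 2))
```

A κ of 1 means perfect agreement and 0 means no better than chance, which is why content-analysis studies like this one quote κ rather than raw percent agreement.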
Validating a theorized model of engagement in learning analytics (Vitomir Kovanovic)
Slides from our paper presentation at LAK'19 conference in Tempe, AZ. The full paper is available at https://dx.doi.org/10.1145/3303772.3303775
Abstract:
Student engagement is often considered an overarching construct in educational research and practice. Though frequently employed in the learning analytics literature, engagement has been subjected to a variety of interpretations, and there is little consensus regarding the very definition of the construct. This raises grave concerns with regard to construct validity: namely, do these varied metrics measure the same thing? To address such concerns, this paper proposes, quantifies, and validates a model of engagement which is both grounded in the theoretical literature and described by common metrics drawn from the field of learning analytics. To identify a latent variable structure in our data, we used exploratory factor analysis and validated the derived model on a separate sub-sample of our data using confirmatory factor analysis. To analyze the associations between our latent variables and student outcomes, a structural equation model was fitted, and the validity of this model across different course settings was assessed using MIMIC modeling. The broad consistency of our model with the theoretical literature across different domains suggests a mechanism that may be used to inform both interventions and course design.
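The exploratory-factor-analysis step described above can be sketched with scikit-learn's `FactorAnalysis`. This is an illustrative stand-in rather than the authors' analysis: the six "engagement indicators" and two latent factors below are simulated for the sake of the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated data: six observed engagement indicators driven by two
# latent factors (e.g. behavioural and cognitive engagement) plus noise.
latent = rng.normal(size=(300, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
observed = latent @ loadings.T + 0.3 * rng.normal(size=(300, 6))

# Exploratory factor analysis: recover a two-factor latent structure
# from the observed indicators alone.
fa = FactorAnalysis(n_components=2, random_state=0).fit(observed)
print(fa.components_.shape)  # loadings matrix: (factors, indicators)
```

Inspecting which indicators load heavily on which factor is the EFA step; confirming that structure on held-out data (CFA) and relating the factors to outcomes (SEM) would typically be done in dedicated SEM software.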
Examining the Value of Learning Analytics for Supporting Work-integrated Lear... (Vitomir Kovanovic)
Slides from our presentation at the Seventh National Conference on Work-Integrated Learning (ACEN’18).
The full paper is available at https://www.researchgate.net/publication/328578409_Examining_the_value_of_learning_analytics_for_supporting_work-integrated_learning
Introduction to R for Learning Analytics Researchers (Vitomir Kovanovic)
The slides from my 2-hour tutorial organised at the 2018 Learning Analytics Summer Institute (LASI) at Teachers College, Columbia University on June 11, 2018.
Kovanović et al. 2017 - Developing a MOOC experimentation platform: insight... (Vitomir Kovanovic)
LAK'17 Conference paper presentation:
Abstract:
In 2011, the phenomenon of MOOCs swept the world of education and put online education at the centre of public discourse around the world. Although researchers were excited by the vast amounts of MOOC data being collected, the benefits of this data did not live up to expectations, due to several challenges. Analyses of MOOC data are very time-consuming and labor-intensive, and require a highly advanced set of technical skills often not available to education researchers. Because of this, MOOC data analyses are rarely completed before the courses end, limiting the potential of the data to impact student learning outcomes and experience.
In this paper we introduce MOOCito (MOOC intervention tool), a user-friendly software platform for the analysis of MOOC data that focuses on conducting data-informed instructional interventions and course experimentation. We cover the important design principles behind MOOCito and provide an overview of the trends in MOOC research that led to its development. Although MOOCito is a work in progress, we outline its prototype and the results of a user evaluation study that focused on the system’s perceived usability and ease of use. The results of the study are discussed, as well as their practical implications.
Towards Automated Classification of Discussion Transcripts: A Cognitive Prese... (Vitomir Kovanovic)
LAK'16 Conference paper presentation:
Abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating the content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and a Cohen’s kappa of 0.63, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features, around 100 times fewer than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
Executive Directors Chat: Leveraging AI for Diversity, Equity, and Inclusion (TechSoup)
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
These slides are intended for master's students (MIBS & MIFB) at UUM, and are also useful for readers interested in the topic of contemporary Islamic banking.
How to Add Chatter in the Odoo 17 ERP Module (Celine George)
In Odoo, the chatter is like a chat tool that helps you collaborate on records. You can leave notes and track activities, making it easier to communicate with your team and partners. Inside the chatter, all communication history, activities, and changes are displayed.
Biological screening of herbal drugs: introduction and the need for phyto-pharmacological screening; new strategies for evaluating natural products; in vitro evaluation techniques for antioxidant, antimicrobial and anticancer drugs; in vivo evaluation techniques for anti-inflammatory, antiulcer, anticancer, wound-healing, antidiabetic, hepatoprotective, cardioprotective, diuretic and antifertility activity; toxicity studies as per OECD guidelines.
Unit 8 - Information and Communication Technology (Paper I).pdf (Thiyagu K)
These slides describe the basic concepts of ICT, the basics of email, emerging technologies, and digital initiatives in education. This presentation aligns with the UGC Paper I syllabus.
Safalta Digital Marketing Institute in Noida provides comprehensive courses that cover a wide range of digital marketing components, including search engine optimisation, digital communication marketing, pay-per-click marketing, content marketing, web analytics, and more. These courses are designed for students who want a thorough understanding of digital marketing strategies. The institute is a first choice for young people and students looking to start a career in digital marketing. It offers specialised courses and certification designed for beginners, providing in-depth training in areas such as SEO, digital communication marketing, and PPC in Noida. After finishing the program, students receive certifications recognised by top universities, setting a strong foundation for a successful career in digital marketing.
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...NelTorrente
In this research, it concludes that while the readiness of teachers in Caloocan City to implement the MATATAG Curriculum is generally positive, targeted efforts in professional development, resource distribution, support networks, and comprehensive preparation can address the existing gaps and ensure successful curriculum implementation.
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
Unsupervised Learning for Learning Analytics Researchers
1. UNSUPERVISED MACHINE
LEARNING
VITOMIR KOVANOVIĆ
UNIVERSITY OF SOUTH AUSTRALIA
#vkovanovic
vitomir.kovanovic.info
Vitomir.Kovanovic@unisa.edu.au
LEARNING ANALYTICS SUMMER INSTITUTE
TEACHERS COLLEGE, COLUMBIA UNIVERSITY
JUNE 11-13, 2018
SREĆKO JOKSIMOVIĆ
UNIVERSITY OF SOUTH AUSTRALIA
#s_joksimovic
www.sjoksimovic.info
Srecko.Joksimovic@unisa.edu.au
1
2. About me
• Learning analytics researcher
• Research Fellow, School of Education, UniSA
Data Scientist, Teaching Innovation Unit, UniSA
Member of the Centre for Change and Complexity in Learning (C3L)
• Member of the SoLAR executive board
• Computer science and information systems background
• Used cluster analysis in several research projects
2
3. About you
• Introduce yourself
• Name, affiliation, position
• Experience with machine learning and clustering
• Experience with Weka or some other ML/DM toolkit
• Ideas for clustering in your own research/work
3
5. Workshop outline
1. Three days, four sessions
2. Equally theoretical and practical
3. Use of Weka Machine Learning toolkit
4. Focus on practical use
5. Examples of clustering use in learning analytics
5
6. Workshop topics
• Introduction to machine learning & unsupervised methods
• Introduction to cluster analysis
• Overview of cluster analysis use in Learning Analytics
• Introduction to WEKA toolkit
• Overview of the tutorial dataset
• K-means algorithm
• K-means demo
• Hierarchical clustering algorithms
• Hierarchical clustering demo
6
7. Tutorial topics
• How to choose the number of clusters
• How to interpret clustering results
• Practical challenges
• More advanced cluster analysis approaches
• Statistical methods for comparing clusters
• Clustering real-world data from OU UK
• Discussing different cluster analysis methods
7
10. What is machine learning?
A computational method for making sense of data
10
11. Data is everywhere
Each minute:
● 3,600,000 Google searches
● 456,000 Twitter posts
● 46,740 Instagram photos
● 45,787 Uber trips
● 600 new Wikipedia edits
● 13 new Spotify songs
Domo (2017). “Data Never Sleeps 5.0”
https://www.domo.com
11
16. Fields that influenced machine learning
• Statistics
• Operations research
• Artificial intelligence
• Data visualisation
• Software engineering
• Information systems management
16
18. Two key ideas in machine learning
1.Features
2.Models
18
19. What is a feature?
1. A feature is a characteristic of a data point
2. Each data point is represented as a vector of features [f1, f2, f3 ... fm]
3. A whole dataset of N data points is represented as an N × M matrix:

   Data point   | Feature 1 | Feature 2 | ... | Feature M
   Data point 1 |           |           |     |
   Data point 2 |           |           |     |
   ...          |           |           |     |
   Data point N |           |           |     |
19
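The N × M representation can be written down directly. A minimal Python sketch (the values are borrowed from the vehicle example used later in the workshop; the variable names are mine):

```python
# A dataset of N data points, each described by M features,
# stored as an N x M matrix (here a plain list of rows).
feature_matrix = [
    # wheels, top speed (km/h), weight (kg)
    [4, 220, 1200],   # data point 1
    [2, 260, 230],    # data point 2
    [2, 210, 320],    # data point 3
    [4, 160, 870],    # data point 4
]

N = len(feature_matrix)      # number of data points
M = len(feature_matrix[0])   # number of features per data point
print(N, M)                  # -> 4 3
```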
20. What is a feature?
• The performance of machine learning algorithms depends in large part on the quality of the extracted features (how useful they are for a given ML task)
• Expertise and prior knowledge come into play when deciding which features to extract
20
21. What is a model?
• Something that captures important patterns in the data
• A model can be used to
• Draw inferences
• Understand the data
• Learn hidden rules
• Support decision making
21
22. An example model: BMI calculator
• Goal: Predicting a person’s body fat category (overweight, normal, or underweight)
from his height (in m) and weight (in kg).
• Model:
• BMI = weight / height²
• If BMI > 25: overweight
• If BMI < 18.5: underweight
• Otherwise: normal
• An example: 1.75 m and 70 kg:
  BMI = 70 / (1.75 × 1.75) = 22.86 → normal category
22
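The hand-crafted BMI model above fits in a few lines of Python (an illustrative sketch; the function name is mine):

```python
def bmi_category(height_m, weight_kg):
    """A hand-crafted 'model': maps height and weight to a body-fat category."""
    bmi = weight_kg / height_m ** 2
    if bmi > 25:
        return "overweight"
    if bmi < 18.5:
        return "underweight"
    return "normal"

# The example from the slide: 1.75 m, 70 kg -> BMI ~ 22.86 -> normal
print(bmi_category(1.75, 70))  # -> normal
```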
23. How does machine learning work?
• Model development (slow and hard):
  N data points → feature extraction → N × M feature matrix → model building → ML model
• Model use (fast and easy):
  a new data point → feature extraction → feature vector of length M → ML model → response (prediction)
23
24. Two types of errors
• Bias: The error from erroneous
assumptions of the model.
• High bias: miss the relevant
relationships between variables
(underfitting).
• Variance: The error from sensitivity to
small fluctuations in the data.
• High variance: modelling the
random noise in the data, rather
than real relationships (overfitting).
24
25. Two types of errors
• We always work with samples
• Samples always have noise
• The trick is to develop models that fit not just the training data, but new, future data
25
29. Many more approaches
• Models that blur the division between supervised and unsupervised
• Reinforcement learning: learning from the feedback received after making a prediction
• Neural networks (can be supervised and unsupervised)
• Online learning models: learning as data arrives
• Feature processing methods: association rule mining
29
31. 10 data points
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
Data point 6
Data point 7
Data point 8
Data point 9
Data point 10
31
32. How does machine learning work?
• Model development (slow and hard):
  N data points → feature extraction → N × M feature matrix → model building → ML model
• Model use (fast and easy):
  a new data point → feature extraction → feature vector of length M → ML model → response (prediction)
32
33. First step: feature extraction
• From each data point we extracted four features:
  • Number of wheels
  • Colour
  • Top speed (in km/h)
  • Weight (in kg)
• Our feature matrix is 10 × 4

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg)
   1 |   4    | Yellow |       220        |    1,200
   2 |   4    | Red    |       180        |      950
   3 |   2    | Blue   |       260        |      230
   4 |   2    | Red    |       210        |      320
   5 |   4    | Yellow |       160        |      870
   6 |   4    | Blue   |       170        |      750
   7 |   4    | Red    |       190        |      850
   8 |   2    | Yellow |       140        |      140
   9 |   2    | Yellow |       210        |      310
  10 |   2    | Red    |       240        |      280
33
34. Supervised learning: classification
• Each data point is provided with a categorical class label (outcome variable)
• The goal is to predict the class label for a new data point

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Label
   1 |   4    | Yellow |       220        |    1,200    | Car
   2 |   4    | Red    |       180        |      950    | Car
   3 |   2    | Blue   |       260        |      230    | Bike
   4 |   2    | Red    |       210        |      320    | Bike
   5 |   4    | Yellow |       160        |      870    | Car
   6 |   4    | Blue   |       170        |      750    | Car
   7 |   4    | Red    |       190        |      850    | Car
   8 |   2    | Yellow |       140        |      140    | Bike
   9 |   2    | Yellow |       210        |      310    | Bike
  10 |   2    | Red    |       240        |      280    | Bike

• A new data point [4, Yellow, 260, 1100] → Car?
• We learned a model to classify a new (unseen) vehicle as either a car or a bike
34
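The workshop's hands-on tool is Weka, but the idea can be sketched in a few lines of Python. This toy 1-nearest-neighbour classifier (my own illustrative choice of model, not one used in the workshop) learns from the labelled vehicle table; the categorical colour feature is omitted for simplicity:

```python
def nearest_neighbour(train, labels, query):
    """Classify `query` with the label of its closest training point (1-NN)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(range(len(train)), key=lambda i: dist(train[i], query))
    return labels[best]

# Numeric features only: wheels, top speed (km/h), weight (kg)
train = [
    [4, 220, 1200], [4, 180, 950], [2, 260, 230], [2, 210, 320], [4, 160, 870],
    [4, 170, 750],  [4, 190, 850], [2, 140, 140], [2, 210, 310], [2, 240, 280],
]
labels = ["Car", "Car", "Bike", "Bike", "Car", "Car", "Car", "Bike", "Bike", "Bike"]

print(nearest_neighbour(train, labels, [4, 260, 1100]))  # -> Car
```

With such different feature scales, the weight dominates the distance; feature scaling (standardisation) would usually be applied first.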
35. Supervised learning: regression
• Each data point is provided with a continuous numerical label (outcome variable)
• The goal is to predict the outcome value for a new data point

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Label
   1 |   4    | Yellow |       220        |    1,200    | 120,000
   2 |   4    | Red    |       180        |      950    |  40,000
   3 |   2    | Blue   |       260        |      230    |  63,000
   4 |   2    | Red    |       210        |      320    |  53,000
   5 |   4    | Yellow |       160        |      870    |  21,000
   6 |   4    | Blue   |       170        |      750    |  37,000
   7 |   4    | Red    |       190        |      850    |  21,000
   8 |   2    | Yellow |       140        |      140    |  26,000
   9 |   2    | Yellow |       210        |      310    |  68,000
  10 |   2    | Red    |       240        |      280    |  75,000

• A new data point [4, Yellow, 260, 1100] → 140,000?
• We learned a model to predict the price of a new (unseen) vehicle
35
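A regression sketch on the same toy data, hedged accordingly: this fits ordinary least squares on a single feature (top speed only, my simplification), not the full model the slide implies:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

top_speeds = [220, 180, 260, 210, 160, 170, 190, 140, 210, 240]
prices = [120000, 40000, 63000, 53000, 21000, 37000, 21000, 26000, 68000, 75000]

intercept, slope = fit_line(top_speeds, prices)
# Predicted price for a 260 km/h vehicle
print(round(intercept + slope * 260))
```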
37. Unsupervised learning: clustering
• We want the algorithm to group data points into several groups based on their similarity
• No labels are given: each data point’s group is initially unknown (?)

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Group
   1 |   4    | Yellow |       220        |    1,200    |   1
   2 |   4    | Red    |       180        |      950    |   1
   3 |   2    | Blue   |       260        |      230    |   2
   4 |   2    | Red    |       210        |      320    |   2
   5 |   4    | Yellow |       160        |      870    |   1
   6 |   4    | Blue   |       170        |      750    |   1
   7 |   4    | Red    |       190        |      850    |   1
   8 |   2    | Yellow |       140        |      140    |   2
   9 |   2    | Yellow |       210        |      310    |   2
  10 |   2    | Red    |       240        |      280    |   2

• A new data point [4, Yellow, 260, 1100] → group 1?
• Interpretation of group meaning is up to the researcher: 1 = ?, 2 = ?
37
38. Unsupervised learning: clustering
• We want the algorithm to group data points into several groups based on their similarity
• A different, equally valid grouping of the same data:

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Group
   1 |   4    | Yellow |       220        |    1,200    |   2
   2 |   4    | Red    |       180        |      950    |   1
   3 |   2    | Blue   |       260        |      230    |   2
   4 |   2    | Red    |       210        |      320    |   2
   5 |   4    | Yellow |       160        |      870    |   1
   6 |   4    | Blue   |       170        |      750    |   1
   7 |   4    | Red    |       190        |      850    |   1
   8 |   2    | Yellow |       140        |      140    |   1
   9 |   2    | Yellow |       210        |      310    |   2
  10 |   2    | Red    |       240        |      280    |   2

• A new data point [4, Yellow, 260, 1100] → group 2?
• Pick the grouping of the data that is most useful for your own purpose
38
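Both alternative groupings of the ten vehicles can be reproduced with a tiny 1-D 2-means sketch (illustrative Python, mine; seeded deterministically at the min and max values so the result is reproducible). Clustering on weight yields a heavy/light grouping; clustering on top speed yields a different fast/slow grouping of the very same data:

```python
def two_means_1d(values, iters=20):
    """Minimal 1-D 2-means. Seeds at min and max make it deterministic;
    assumes at least two distinct values."""
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return [1 if abs(v - c1) <= abs(v - c2) else 2 for v in values]

weights    = [1200, 950, 230, 320, 870, 750, 850, 140, 310, 280]
top_speeds = [220, 180, 260, 210, 160, 170, 190, 140, 210, 240]

print(two_means_1d(weights))     # heavy vs. light vehicles
print(two_means_1d(top_speeds))  # a different, equally valid grouping: fast vs. slow
```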
43. Social Sciences
• Improve understanding of a domain
• Compress and summarize large datasets
• Within Learning Analytics:
• Profile learners based on their course engagement,
• Discover emerging topics in a corpus (student discussions, course materials)
• Group courses based on their characteristics
43
50. What is not clustering?
• Simple data partitioning:
  • Single property
  • Predefined groups
• Data clustering:
  • Multiple properties
  • Unforeseen groups
  • Combinations of properties describe groups
50
58. Representing a cluster
• Centroid – the geometrical centre of a cluster
• Medoid – the data point closest to the centroid
58
59. What is meant by similar?
• What is meant by “similar data points”?
• Geometry – more similar data points are closer to each other in N-dimensional feature space
• Yes, but:
  • Closeness to the cluster “centre”?
  • Closeness to any other data point in the cluster?
  • Is it about the distance between data points, or their spatial density?
59
62. Different types of clustering methods
• Membership strictness
• Hard clustering
• Each object either belongs to a cluster
or not
• Soft (fuzzy) clustering
• Each object belongs to each cluster to
some degree
62
63. Different types of clustering methods
• Membership exclusivity
• Strict partitioning clustering (e.g. K-means)
• Each object belongs to one and only one
cluster
• Strict partitioning clustering with outliers
• Each object belongs to zero or one cluster
63
64. Different types of clustering methods
• Overlapping clustering
• Each object can belong to one or more
“hard” clusters
• Hierarchical clustering
• Objects belonging to a child cluster also
belong to the parent cluster
64
65. Different types of clustering methods
• Distance-based clustering
• Group objects based on distance
among them
• Density-based clustering
• Group objects based on area they
occupy
65
66. Special clustering approaches
MAAANY more approaches
• Model-based clustering:
• EM clustering
• Neural network approaches – Self-organising maps
• Grid-based approaches (e.g., STING)
• Clustering algorithms for large datasets
• Clustering of stream data in real time
• Clustering (partitioning) approaches for different types of data (e.g., graphs)
• Clustering approaches for categorical data
• Clustering approaches for freeform clusters (e.g., CURE)
• Clustering approaches for high-dimensional data (e.g., CLIQUE, PROCLUS)
• Constraint-based clustering
• Semi-supervised clustering
66
67. Multivariate methods
• N data points have M features
• Find K clusters so that
• Each data point is associated to
each of the K clusters to a certain
degree (0 – none, 1.0 – fully)
• Each of the K clusters is
associated with all M features to
a certain degree
• Find K which maximizes the
likelihood of the observed data
67
68. Neural network approaches
• Network of connected nodes that propagate signals
• Edges have coefficients that alter signal propagation
• Traditionally a supervised learning method
• Backpropagation method of learning coefficients
• Learning method and network structure altered to
support unsupervised learning
• Nodes can move!
• Eventually, the positions of the nodes indicate the locations of the clusters
68
70. Popular distance metrics
• A way of calculating similarity between different data points
• Important for methods based on distances (e.g., K-means, hierarchical clustering)
• Has a significant effect on the final clustering results

  Distance metric              | Formula
  Euclidean distance           | $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
  Squared Euclidean distance   | $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
  Manhattan distance           | $\|a - b\|_1 = \sum_i |a_i - b_i|$ (equals the Hamming distance for binary vectors)
  Maximum distance             | $\|a - b\|_\infty = \max_i |a_i - b_i|$
70
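The four metrics are one-liners in Python (an illustrative sketch, function names mine):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):  # also known as the Chebyshev distance
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [0, 0], [3, 4]
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b), maximum(a, b))
# -> 5.0 25 7 4
```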
73. Some examples
Kovanović, V., Joksimović, S., Gašević, D., Owers, J., Scott, A.-M., & Woodgate, A.
(2016). Profiling MOOC course returners: How does student behaviour change
between two course enrolments? In Proceedings of the Third ACM Conference
on Learning @ Scale (L@S’16) (pp. 269–272). New York, NY, USA: ACM.
https://doi.org/10.1145/2876034.2893431
73
74. Dataset
• 28 offerings of 11 different
Coursera MOOCs at the University
of Edinburgh
• 26,025 double course enrolment
records
• 52,050 course enrolment records
• K-means clustering
• Too large for clustering methods
that use pairwise distances (e.g.,
hierarchical clustering)
  #  | Course                                                               | Offerings
  1  | Artificial Intelligence Planning                                     | 1, 2, 3
  2  | Animal Behavior and Welfare                                          | 1, 2
  3  | AstroTech: The Science and Technology behind Astronomical Discovery  | 1, 2
  4  | Astrobiology and the Search for Extraterrestrial Life                | 1, 2
  5  | The Clinical Psychology of Children and Young People                 | 1, 2
  6  | Critical Thinking in Global Challenges                               | 1, 2, 3
  7  | E-learning and Digital Cultures                                      | 1, 2, 3
  8  | EDIVET: Do you have what it takes to be a veterinarian?              | 1, 2
  9  | Equine Nutrition                                                     | 1, 2, 3
  10 | Introduction to Philosophy                                           | 1, 2, 3, 4
  11 | Warhol                                                               | 1, 2
74
75. Extracted features
• 9 different features extracted:

  Feature   | Description
  Days      | No. of days active
  Sub.      | No. of submitted assignments
  Wiki      | No. of wiki page views
  Disc.     | No. of discussion views
  Posts     | No. of discussion messages written
  Quiz.     | No. of quizzes attempted
  Quiz.Uni. | No. of different quizzes attempted
  Vid.Uni.  | No. of different videos watched
  Vid.      | No. of videos watched
75
77. Results
  Cluster label         | Students | %
  Enrol only (E)        |  22,932  | 44.1
  Low engagement (LE)   |  21,776  | 41.8
  Videos & Quizzes (VQ) |   2,120  |  4.1
  Videos (V)            |   5,128  |  9.9
  Social (S)            |      94  |  0.2
77
78. Some examples
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015).
Analytics of communities of inquiry: Effects of learning technology use on
cognitive presence in asynchronous online discussions. The Internet and Higher
Education, 27, 74–89. https://doi.org/10.1016/j.iheduc.2015.06.002
78
79. Clustering features

  #  | Type       | Code | Name                | Description
  1  | Content    | ULC  | UserLoginCount      | Total number of times the student logged into the system
  2  | Content    | CVC  | CourseViewCount     | Total number of times the student viewed general course information
  3  | Content    | AVT  | AssignmentViewTime  | Total time spent on all course assignments
  4  | Content    | AVC  | AssignmentViewCount | Total number of times the student opened one of the course assignments
  5  | Content    | RVT  | ResourceViewTime    | Total time spent reading the course resources
  6  | Content    | RVC  | ResourceViewCount   | Total number of times the student opened one of the course resource materials
  7  | Discussion | FSC  | ForumSearchCount    | Total number of times the student used the search function on the discussion boards
  8  | Discussion | DVT  | DiscussionViewTime  | Total time spent viewing the course’s online discussions
  9  | Discussion | DVC  | DiscussionViewCount | Total number of times the student opened one of the course’s online discussions
  10 | Discussion | APT  | AddPostTime         | Total time spent posting discussion board messages
  11 | Discussion | APC  | AddPostCount        | Total number of discussion board messages posted by the student
  12 | Discussion | UPT  | UpdatePostTime      | Total time spent updating the student’s own discussion board messages
  13 | Discussion | UPC  | UpdatePostCount     | Total number of times the student updated one of their own discussion board messages
79
81. Cluster interpretations

  # | Size | Label                            | Description
  1 |  21  | Task-focused users               | Overall below-average activity; above-average message-posting activity
  2 |  15  | Content-focused users            | Below-average discussion-related activity; average content-related activity, with an emphasis on assignments
  3 |  22  | No-users                         | Overall below-average activity, slightly higher in discussion-related activities
  4 |   3  | Highly intensive users           | By far the most active students, especially in content-related activities
  5 |   6  | Content-focused intensive users  | Above-average content-related activity; average discussion-related activity
  6 |  14  | Socially-focused intensive users | Above-average discussion-related activity; average content-related activity
81
82. Some examples
Almeda, M. V., Scupelli, P., Baker, R. S., Weber, M., & Fisher, A. (2014). Clustering
of Design Decisions in Classroom Visual Displays. In Proceedings of the Fourth
International Conference on Learning Analytics and Knowledge (pp. 44–48). New
York, NY, USA: ACM. https://doi.org/10.1145/2567574.2567605
82
83. Clustering visual designs of classrooms
• 30 schools in northwestern USA
• Classroom Wall Coding System (CWaCS 1.0)
• Each classroom wall was photographed
• Units of analysis were marked with a box
• Coding scheme:
  1. Academic
     1. Academic topics (F1)
     2. Academic organizational (F2)
  2. Non-academic (F3)
  3. Behavioural (F4)
• Adopted K-means to cluster classrooms based on the frequency of the four features (F1–F4)
• Academic topics: behavior, content specific, procedures, resources, calendars/clocks, other
• Academic organizational: goals for the day, group assignments, job charts, labels, schedule (day/week), yearly schedule, skills, homework
• Non-academic: motivational slogans, decorations, decorative frames, student art, other non-academic
• Behavior materials: behavior management, progress charts, rules, other behaviour
83
85. Some examples
Ferguson, R., Clow, D., Beale, R., Cooper, A. J., Morris, N., Bayne, S., & Woodgate,
A. (2015). Moving Through MOOCS: Pedagogy, Learning Design and Patterns of
Engagement. In Design for Teaching and Learning in a Networked World (pp. 70–
84). Springer International Publishing. https://doi.org/10.1007/978-3-319-
24258-3_6
85
86. Features
• For each course week, we assigned learners an activity score (a sum of):
  • 1 if they viewed content
  • 2 if they posted a comment
  • 4 if they submitted their assessment in a subsequent week (late)
  • 8 if they submitted it early or on time
• Adopted K-means
• Possible combinations:
  • 1 = Visited content only
  • 2 = Posted comment but visited no new content
  • 3 = Visited content and posted comment
  • 4 = Submitted the assessment late
  • 5 = Visited content and submitted assessment late
  • 6 = Posted comment and late assessment, saw no new content
  • 7 = Visited content, posted, late assessment
  • 8 = Submitted assessment early / on time
  • 9 = Visited content, assessment early / on time
  • 10 = Posted, assessment early / on time, no new content
  • 11 = Visited, posted, assessment early / on time
86
88. Further examples
Lust, G., Elen, J., & Clarebout, G. (2013). Regulation of tool-use within a blended course: Student differences and performance
effects. Computers & Education, 60(1), 385–395. https://doi.org/10.1016/j.compedu.2012.09.001
Wise, A. F., Speer, J., Marbouti, F., & Hsiao, Y.-T. (2013). Broadening the notion of participation in online discussions: examining
patterns in learners’ online listening behaviors. Instructional Science, 41(2), 323–343. https://doi.org/10.1007/s11251-012-9230-9
Niemann, K., Schmitz, H.-C., Kirschenmann, U., Wolpers, M., Schmidt, A., & Krones, T. (2012). Clustering by Usage: Higher Order Co-
occurrences of Learning Objects. In Proceedings of the 2Nd International Conference on Learning Analytics and Knowledge (pp.
238–247). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330659
Cobo, G., García-Solórzano, D., Morán, J. A., Santamaría, E., Monzo, C., & Melenchón, J. (2012). Using Agglomerative Hierarchical
Clustering to Model Learner Participation Profiles in Online Discussion Forums. In Proceedings of the 2Nd International Conference
on Learning Analytics and Knowledge (pp. 248–251). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330660
Crossley, S., Roscoe, R., & McNamara, D. S. (2014). What Is Successful Writing? An Investigation into the Multiple Ways Writers Can
Write Successful Essays. Written Communication, 31(2), 184–214. https://doi.org/10.1177/0741088314526354
Hecking, T., Ziebarth, S., & Hoppe, H. U. (2014). Analysis of Dynamic Resource Access Patterns in Online Courses. Journal of Learning
Analytics, 1(3), 34–60.
Li, N., Kidziński, Ł., Jermann, P., & Dillenbourg, P. (2015). MOOC Video Interaction Patterns: What Do They Tell Us? In Proceedings of
the 10th European Conference on Technology Enhanced Learning (pp. 197–210). Springer International Publishing.
https://doi.org/10.1007/978-3-319-24258-3_15
88
89. K-Means clustering
• The most widely used clustering algorithm
• Very simple, decent results
• Produces “circular” clusters
• Iterative algorithm
• The initial positions of the cluster centroids are random
• Often run multiple times and the results aggregated (e.g., 1,000 runs)
89
90. K-Means algorithm
1. Pick the number of clusters K
2. Pick K centroids in the N-dimensional feature space: $c_i \in \mathbb{R}^N,\ i = 1 \ldots K$
3. For each of the P data points $p_i \in \mathbb{R}^N$:
   1. Calculate the distance to each of the K centroids
   2. Assign it to its closest centroid
4. Recalculate the centroid positions based on the assigned data points
5. Repeat steps 3–4 until the centroid positions stabilise (i.e., there is no change in step 4)
90
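The steps above can be sketched in pure Python (a toy illustration, not the workshop's Weka tooling; the function name and demo points are mine):

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Minimal K-means: random initial centroids (step 2), then alternate
    assignment (step 3) and centroid update (step 4) until stable (step 5)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 3: assign each point to its closest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 4: recompute each centroid as the mean of its assigned points
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # step 5: stop when nothing changed
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = k_means(pts, 2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]: the two obvious blobs
```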
92. K-Means characteristics
• The final solution depends a lot on the original random centroid positions
• The algorithm is often repeated (restarted) many times.
• Restart K-means R (e.g., 1,000) times.
• For each of the data points there will be R cluster assignments.
• For each data point, pick the cluster assignment which was most common among
R assignments
92
93. K-Means characteristics
• The algorithm is easy to implement
• Pretty fast, converges very quickly
• For N data points, each iteration requires calculating N × K distances (linear in N)
• Produces circular clusters – can be a problem in some domains
• Susceptible to outliers: each data point is assigned to one of the centroids and can shift its centroid significantly “off side”
• The number of clusters must be provided
• Can get stuck in a local optimum (often mitigated by multiple runs)
93
94. K-Means variants
• K-Means++
• “Smart” picking of the initial centroids (a.k.a. seeds)
• Seed selection algorithm:
  • Pick the first seed uniformly at random among the data points
  • Pick each next seed with probability proportional to its squared distance from the closest already-chosen seed
• Effectively “spreads” the seed centroids across the feature space
• K-Medoids & its flavours (Partitioning Around Medoids – PAM)
  • A solution to the outlier problem: instead of a centroid, use a medoid
  • Instead of representing clusters with centres, use existing data points to represent clusters
94
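The K-means++ seeding rule can be sketched directly (illustrative Python, mine; the demo counts how often the two seeds land in different corners of a two-blob dataset):

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """K-means++ seeding sketch: first seed uniform among the data points;
    each next seed drawn with probability proportional to its squared
    distance from the closest already-chosen seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance of every point to its closest seed (0 for seeds)
        d2 = [min(sum((x - s) ** 2 for x, s in zip(p, seed)) for seed in seeds)
              for p in points]
        r = rng.random() * sum(d2)   # threshold in [0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:              # first point whose weight crosses the threshold
                seeds.append(p)
                break
    return seeds

pts = [(0.0, 0.0), (0.2, 0.1), (10.0, 10.0), (10.1, 9.9)]
far = 0
for trial in range(100):
    s = kmeans_pp_seeds(pts, 2, random.Random(trial))
    far += sum((a - b) ** 2 for a, b in zip(s[0], s[1])) > 50
print(far)  # almost always 100: the seeds spread across the space
```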
95. PAM algorithm (Partitioning Around Medoids)
• One variant of K-Medoids
1. Pick the number of clusters K
2. Pick K data points in the N-dimensional feature space, $m_i \in \mathbb{R}^N,\ i = 1 \ldots K$, as the initial cluster representatives (medoids)
3. Assign each of the remaining data points $p_i$ to its closest representative
4. For each representative point $o_j$:
   1. Pick a random non-representative data point $o_{random}$ from its cluster
   2. Check whether swapping $o_j$ with $o_{random}$ produces clusters with a smaller “error” (the sum, over all clusters, of the absolute differences between their data points and representatives)
   3. If the new cost is smaller than the original cost, keep $o_{random}$ as the representative point
5. Repeat steps 3–4 until there are no changes in the representative objects
95
96. K-Means variants
• X-Means
  • Does not require the number of clusters K to be specified
  • Refines the clustering solution by splitting existing clusters
  • Keeps the clustering configuration that maximises the AIC (Akaike information criterion) or BIC (Bayesian information criterion)
  • Implemented in WEKA
• Cascading K-Means
  • Restarts K-means with different K and picks the K that maximises the Calinski–Harabasz criterion (the F value in ANOVA)
  • Implemented in WEKA
96
97. K-Means variants
• Large datasets variants: CLARA (Clustering LARge Applications) and CLARANS (Clustering Large
Applications upon RANdomized Search)
• CLARA: Use a sample of data points as potential candidate medoids and run PAM.
  • CLARANS: adds randomisation so the sample is not fixed at the start
• Fuzzy C-means
  • Each data point belongs to every cluster with a different degree of membership (the memberships sum to 100% across clusters)
• Also assesses the compactness of each cluster
• Compact clusters will have members with high probabilities
97
99. Hierarchical clustering
• Next to k-means, very popular method for cluster analysis
• Two key flavours
• agglomerative
• divisive
• Especially usable for small datasets
• Evaluate and pick the number of clusters visually
• The height of the merge/split indicates the distance
• Used extensively in Learning Analytics
• Many variants, using Linkage Functions
99
100. Agglomerative hierarchical clustering
• Build the clusters from bottom-up
• Algorithm:
• Build a singleton cluster for each data point
• Repeat until all data in a single cluster:
• find two closest clusters (based on linkage function)
• merge these two together
• Run Interactive DEMO
100
101. Agglomerative hierarchical clustering
• Requires calculation of the distances between all cluster pairs
• At step 1 – this means calculating all pairwise distances among data points
• N data points – about N²/2 pairwise distances
• Not feasible for large datasets
101
102. Divisive hierarchical clustering
• All data start in a single cluster, then we split one cluster at each step.
• More complex than agglomerative (how to split a cluster?)
• Less popular than agglomerative algorithms
• Can be faster as we do not need to go all the way to the bottom of the dendrogram
• Many approaches, often use “flat” algorithm as a partitioning method (e.g., K-means)
102
103. Example divisive clustering with K-means
• Start with all data in a single cluster
• Use K-means to create two initial clusters A2 and B2
• Use K-means to divide A2 into A2-1 and A2-2
• Use K-means to divide B2 into B2-1 and B2-2
• Pick between:
• A2-1, A2-2, B2
• A2, B2-1, B2-2
• Call the best combination A3, B3, and C3
• Repeat the division of each cluster into two clusters. Pick between:
• A3-1, A3-2, B3, C3
• A3, B3-1, B3-2, C3
• A3, B3, C3-1, C3-2
103
104. Linkage functions
• Key question for agglomerative clustering: how to pick the two clusters to merge
• What is meant by “closest”?
• Several different criteria. Most popular:
  • Single-linkage: minimal distance between any two data points (one from each cluster)
  • Complete-linkage: maximal distance between any two data points (one from each cluster)
  • Average-linkage: average distance between all pairs of data points (one from each cluster)
  • Ward’s method: pick the pair of clusters so that the merged cluster has the minimal possible sum of squared distances; minimises the variation within the clusters
104
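The agglomerative algorithm with a pluggable linkage function can be sketched in pure Python (an illustrative toy, mine; Ward's method is omitted for brevity):

```python
def agglomerative(points, linkage=min, stop_at=2):
    """Bottom-up clustering sketch: start with singleton clusters, repeatedly
    merge the two closest clusters until `stop_at` remain.
    linkage=min gives single-linkage, linkage=max gives complete-linkage."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Linkage: reduce all cross-cluster pairwise distances to one number
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(sorted(len(c) for c in agglomerative(pts)))  # -> [2, 3]
```

Passing `linkage=max` instead switches the merge criterion to complete-linkage without touching the rest of the algorithm.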
109. What is Weka?
Software “workbench”
Waikato Environment for Knowledge Analysis
(WEKA)
109
110. Installing Weka
• https://www.cs.waikato.ac.nz/ml/weka/index.html
• Very powerful, lots of resources available
• Good for fast prototyping, much faster than R/Python
• Can be used
• Through GUI, which is very quirky and has hidden “gems”
• From command line (useful for integrating with other tools/scripts)
• As a Java library
• Not the best designed UI, clearly done by the developers
• Great book about ML/DM/Weka
https://www.cs.waikato.ac.nz/ml/weka/book.html
• Many demo datasets included in Weka
https://www.cs.waikato.ac.nz/ml/weka/datasets.html
110
111. Weka Interfaces
• Explorer – will be used throughout the course
• Experimenter – performance comparisons
• KnowledgeFlow – graphical front end, an alternative to the Explorer
• Workbench – unified interface
• Simple CLI – command-line interface (useful for integrating with other tools/scripts)
111
117. Selecting the number of clusters
• Clustering is user-centric and subjective
• How to pick the number of clusters?
  • Based on background knowledge (e.g., educational theory)
  • Use an algorithm that calculates the optimal number of clusters automatically (e.g., EM)
  • Use an algorithm that provides a visual overview of all clustering configurations (e.g., hierarchical clustering)
  • Use a supervised clustering algorithm, where the clustering process is guided by the user
  • Evaluate multiple values of K manually
    • “Elbow” method: find the trade-off point between the number of clusters and the within-cluster variance
    • Silhouette method: test the robustness of cluster membership
117
118. Elbow method
• As K increases, the average diameter (within-cluster variance) of the clusters also gets smaller
• Find the “sweet spot” at which the decrease in variance sharply changes
• Sometimes not so clear
118
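The elbow can be made concrete with a small sketch (illustrative Python, mine): within-cluster sum of squares (WSS) for K = 1..4 on data with two obvious blobs drops sharply at K = 2 and only marginally afterwards. A deterministic farthest-point seeding replaces random initialisation so the numbers are reproducible:

```python
def wss(points, k, iters=50):
    """Within-cluster sum of squares after a small deterministic K-means
    (farthest-point initial centroids instead of random ones)."""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    cents = [points[0]]
    while len(cents) < k:                      # farthest-point seeding
        cents.append(max(points, key=lambda p: min(d2(p, c) for c in cents)))
    for _ in range(iters):
        groups = [[] for _ in cents]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, cents[i]))].append(p)
        cents = [tuple(sum(x) / len(g) for x in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return sum(d2(p, cents[min(range(k), key=lambda i: d2(p, cents[i]))])
               for p in points)

# Two well-separated blobs: the "elbow" should appear at K = 2
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
for k in (1, 2, 3, 4):
    print(k, round(wss(pts, k), 2))
```

The WSS always decreases as K grows; the elbow is the point where the additional clusters stop paying for themselves.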
119. Silhouette method
• Visual method for determining the number of clusters
• $a(i)$ – the average distance of point $i$ to all other points in its cluster
• $b(i)$ – the smallest average distance of point $i$ to the points of another cluster (distance to the closest neighbouring cluster)
• $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
• Equivalently:
  $s(i) = \begin{cases} \dfrac{b(i) - a(i)}{b(i)}, & \text{if } b(i) > a(i) \\[4pt] 0, & \text{if } b(i) = a(i) \\[4pt] \dfrac{b(i) - a(i)}{a(i)}, & \text{if } b(i) < a(i) \end{cases}$
119
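The formula translates directly into code (illustrative Python, mine; every cluster is assumed to have at least two points so that $a(i)$ is defined). A good split yields silhouette values near 1, a scrambled labelling yields values near or below 0:

```python
def silhouette(points, labels):
    """Silhouette s(i) for every point: s(i) = (b - a) / max(a, b), where
    a = mean distance to own cluster, b = mean distance to closest other cluster.
    Assumes every cluster has at least two points."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    out = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        out.append((b - a) / max(a, b))
    return out

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette(pts, labels)
print(round(sum(scores) / len(scores), 3))  # high average: a good 2-cluster split
```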
128. Challenges and solutions
• High dimensionality & feature (attribute) selection
• Categorical attributes
• “Weirdly-shaped” clusters
• Outliers
128
129. Curse of dimensionality
• Euclidean distance metric: $\sqrt{\sum_i (a_i - b_i)^2}$
• In a high-dimensional space with $d$ dimensions (each feature in 0.0–1.0):
  • It is highly likely that, for at least one feature $i$, the value $|a_i - b_i|$ will be close to 1.0
  • This puts a lower bound on the distance at around 1.0
  • The upper bound, however, is $\sqrt{d}$, with most pairs of data points being far from it
  • Most pairs of data points have a distance close to the average distance
• Many dimensions are irrelevant for most clusters
• The noise in the irrelevant dimensions masks the real differences between the clusters
129
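The concentration of distances is easy to observe empirically (illustrative Python, mine): for random points in the unit hypercube, the spread of pairwise distances relative to their mean collapses as the dimensionality grows:

```python
import random

def relative_spread(dim, n=50, seed=1):
    """Pairwise Euclidean distances between n random points in [0,1]^dim;
    returns (max - min) / mean. As dim grows this ratio shrinks:
    the distances concentrate around their average."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])) ** 0.5
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# Low-dimensional distances are spread out; high-dimensional ones are not
print(round(relative_spread(2), 2), round(relative_spread(500), 2))
```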
130. Curse of dimensionality: some solutions
• Feature transformation methods (essentially compression)
  • Create a smaller number of new, synthetic features from the larger number of input features, which are then used for clustering
  • Principal component analysis (PCA)
  • Singular value decomposition (SVD)
• Feature (attribute) selection methods
  • Search for a subset of features that are relevant for a given domain
  • Minimise entropy: the idea is that feature spaces containing tight clusters have low entropy
• Subspace clustering – extension of attribute selection
130
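For the 2-D case, PCA has a closed form: the first principal component is the eigenvector of the 2×2 covariance matrix with the largest eigenvalue. A self-contained sketch (illustrative Python, mine) projects strongly correlated 2-D data onto one synthetic feature that captures almost all of the variance:

```python
import math

def pca_1d(points):
    """Project 2-D points onto their first principal component: the eigenvector
    of the 2x2 covariance matrix with the largest eigenvalue (closed form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # A corresponding eigenvector, normalised to unit length
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:  # diagonal covariance: pick the axis with the larger variance
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [(p[0] - mx) * vx + (p[1] - my) * vy for p in points]

# Strongly correlated 2-D data: one synthetic feature captures nearly everything
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 4.0)]
proj = pca_1d(pts)
var_pc1 = sum(z * z for z in proj) / len(proj)       # projections are centred
mx = sum(p[0] for p in pts) / len(pts)
my = sum(p[1] for p in pts) / len(pts)
total = sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in pts) / len(pts)
print(round(var_pc1 / total, 3))  # share of variance kept: close to 1.0
```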
131. Curse of dimensionality: some solutions
• Popular algorithms:
• CLIQUE: A Dimension-Growth Subspace Clustering Method
• Start with a single dimension and grow space by adding new dimensions
• PROCLUS: A Dimension-Reduction Subspace Clustering Method
  • Starts with the complete high-dimensional space and assigns a weight to each dimension for every cluster; the weights are then used to regenerate the clusters
  • Explores dense subspace regions
131
132. Categorical data
• Most clustering algorithms focus on clustering with continuous numerical attributes (ratio
variables)
• How to cluster categorical data? E.g., clustering students based on their demographic
characteristics:
• Gender
• Program
• Study level (postgraduate vs. undergraduate)
• Domestic/international
132
133. Categorical data: simple solution
• Ignore the problem, threat categorical data as numerical:
• Male: 1, Female: 2
• Domestic: 1, International: 2
• Often does not produce good results.
• Distance metric is not meaningful.
• Point A: (Male, Domestic)
• Point B: (Female, Domestic)
• Point C: (Female, International)
• Is point B closer to point A or point C?
• Depends on the information value of these two features
• “Localized method”
• If two distinct clusters have a few points that are close, they might be merged together
incorrectly.
133
[Figure: points A, B, and C plotted on a Gender (1–2) vs. Domestic/International (1–2) grid]
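The slide's A/B/C example can be checked directly. With the naive coding Male = 1 / Female = 2 and Domestic = 1 / International = 2, Euclidean distance makes B exactly equidistant from A and C, regardless of how informative each feature actually is:

```python
import numpy as np

A = np.array([1, 1])   # (Male, Domestic)
B = np.array([2, 1])   # (Female, Domestic)
C = np.array([2, 2])   # (Female, International)

d_AB = np.linalg.norm(A - B)
d_BC = np.linalg.norm(B - C)
print(d_AB, d_BC)      # both 1.0: the metric cannot tell whether
                       # gender or residency carries more information
```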
134. Categorical data: custom algorithms
• ROCK (RObust Clustering using linKs)
• A Hierarchical Clustering Algorithm for Categorical Attributes
• Two data points are similar if they have similar neighbours
• Typical example: market basket data
134
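A rough sketch of ROCK's core idea (toy data, not the full algorithm): compute Jaccard similarity on market-basket rows, declare points neighbours above a threshold θ (value assumed here), and count "links" as shared neighbours. Note that two points can be linked even when their direct similarity is below θ.

```python
import numpy as np

# Toy market-basket data: rows = customers, columns = items bought.
baskets = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=bool)

def jaccard(a, b):
    return (a & b).sum() / (a | b).sum()

n = len(baskets)
sim = np.array([[jaccard(baskets[i], baskets[j]) for j in range(n)]
                for i in range(n)])

theta = 0.4                      # neighbour threshold (assumed value)
neighbours = sim >= theta        # includes each point itself

# links[i, j] = number of neighbours that points i and j share
links = neighbours.astype(int) @ neighbours.astype(int).T
print(links)
```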
135. “Weirdly-shaped” clusters
• Most algorithms focus on distance
between data points
• However, often the connectedness of
data points is also important
• Different algorithms developed for
these situations
135
136. Different types of clustering methods
• Distance-based clustering
• Group objects based on distance
among them
• Density-based clustering
• Group objects based on area they
occupy
136
137. CURE
• Pick a subsample of the data and cluster it using a
method such as hierarchical clustering
• Pick the N characteristic points per cluster that
are most distant from each other.
• Move the representative points a fraction of the
way towards the cluster centroid.
• Merge two clusters whose representative
points are sufficiently close.
137
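The representative-point step of CURE can be sketched as follows (toy data; the shrink fraction and the farthest-first selection are illustrative choices, not CURE's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
cluster = rng.normal(size=(50, 2))          # points of one cluster
centroid = cluster.mean(axis=0)

def scattered_points(points, n_rep):
    """Greedily pick n_rep well-scattered points (farthest-first)."""
    centre = points.mean(axis=0)
    # Start from the point farthest from the centroid.
    reps = [points[np.argmax(np.linalg.norm(points - centre, axis=1))]]
    while len(reps) < n_rep:
        # Next rep: the point maximising distance to its nearest chosen rep.
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    return np.array(reps)

alpha = 0.2                                  # shrink fraction (assumed)
reps = scattered_points(cluster, n_rep=4)
shrunk = reps + alpha * (centroid - reps)    # move towards the centroid
print(shrunk.shape)
```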
138. DBSCAN
• DBSCAN (Density-based spatial clustering of
applications with noise)
• Density-based algorithm
• Searches for areas with a large number of points
• Implemented in WEKA
• General idea:
• Each data point is either a core point, a reachable
point, or an outlier
• Core points have at least minP (parameter) points
around them within radius r (parameter)
• Reachable points are within radius r of a core point
• Every other data point is an outlier
138
[Figure: DBSCAN example with minP = 4; red: core points, yellow: reachable points, blue: outliers]
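A minimal DBSCAN run with scikit-learn (data and parameter values are illustrative): two tight groups plus one isolated point should come out as two clusters and one outlier, labelled -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.0, 0.2],            # group 1
    [10.0, 10.0], [10.0, 10.1], [10.1, 10.0], [10.1, 10.1], [10.0, 10.2],  # group 2
    [5.0, 5.0],                                                            # isolated point
])

# eps plays the role of radius r and min_samples of minP from the slide
# (scikit-learn counts a point inside its own neighbourhood).
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
print(labels)   # two clusters (0 and 1) plus one noise point (-1)
```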
139. Self-organising maps (SOM)
• Special type of neural network
• Used to learn the contour of the underlying data
• Neurons laid out in a grid structure, each neuron connected
to its neighbours and to all input nodes
• For each data point, a neuron which is closest to it gets
adjusted, with adjustments being propagated to
neighbouring neurons
• Over time, neurons will position themselves in the shape of
the data
• Dense areas with many neurons indicate clusters
139
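The update rule above can be sketched as a bare-bones SOM (toy code, not a full implementation): a 1-D chain of neurons adapts to a 2-D data cloud, with the winning neuron pulled hardest and its grid neighbours pulled more weakly.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(-1, 1, size=(500, 2))            # underlying data cloud
n_neurons = 10
weights = rng.uniform(-1, 1, size=(n_neurons, 2))   # neuron positions

lr, sigma = 0.3, 2.0            # learning rate, neighbourhood width (assumed)
for epoch in range(20):
    for x in data:
        # Winner: the neuron closest to the data point.
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighbourhood on the 1-D grid: neurons near the
        # winner on the chain are adjusted too, just less strongly.
        grid_dist = np.abs(np.arange(n_neurons) - winner)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    lr *= 0.9                   # decay over time so the map settles
    sigma *= 0.9
print(weights.round(2))         # neurons spread over the data's shape
```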
141. Expectation-maximization (EM) clustering
• Much more general than clustering
• Used to estimate hidden (latent) parameters
• Does not require hard cluster assignments: each point gets a likelihood of belonging to each cluster
• General idea:
• Pick number of clusters K
• Fit K distributions over clustering variables with their parameters P
• Estimate the likelihood of each data point being generated by each of the K distributions (expectation)
• For every data point, sum the likelihoods across the K distributions to normalise them into weights
• Combine the weights with the data to produce new estimates for the parameters P (maximization)
• Repeat until convergence is reached (no parameter change)
141
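The E/M loop above corresponds to fitting a Gaussian mixture model. A minimal run with scikit-learn (data and K = 2 are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.concatenate([
    rng.normal(0.0, 0.5, size=200),    # cluster around 0
    rng.normal(10.0, 0.5, size=200),   # cluster around 10
]).reshape(-1, 1)

# EM runs internally until convergence of the mixture parameters.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
means = np.sort(gmm.means_.ravel())
print(means.round(2))                  # roughly 0 and 10
```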
145. Analysis of cluster differences
• We can check the differences between clusters with regards to
• Clustering variables (e.g., number of logins, number of discussion posts)
• Some additional variables (e.g., student grades, age, gender)
• We can examine differences
• One variable at a time (univariate differences)
• Across multiple variables simultaneously (multivariate differences)
• Takes into consideration the interaction among multiple variables
145
146. Univariate analysis of cluster differences
• For every variable we can use parametric and non-parametric univariate tests:
• Two clusters: t-test and Mann-Whitney
• Three or more clusters: One-way ANOVA and Kruskal-Wallis
• Requires p-value adjustment (e.g., Bonferroni, Holm-Bonferroni correction)
• Whether to use parametric or non-parametric primarily depends on the homogeneity
(equality) of variance assumption
• Can be tested with Levene’s test
• If Levene’s test shows p<.05, use Mann-Whitney and Kruskal-Wallis
• Significant ANOVA tests can be followed by pairwise tests (e.g., TukeyHSD)
• Significant Kruskal-Wallis tests can be followed by pairwise KW tests (also with p-
value correction)
146
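The decision procedure above can be sketched with SciPy (toy data; the group sizes, the variable, and the number of tests for the Bonferroni step are all assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# One clustering variable (e.g., number of logins) across three clusters;
# the third cluster has a deliberately larger variance.
logins = [rng.normal(10, 2, 40), rng.normal(12, 2, 40), rng.normal(15, 6, 40)]

# Levene's test for homogeneity of variance decides which family to use.
lev_stat, lev_p = stats.levene(*logins)
if lev_p < 0.05:
    stat, p = stats.kruskal(*logins)     # non-parametric fallback
    test = "Kruskal-Wallis"
else:
    stat, p = stats.f_oneway(*logins)    # one-way ANOVA
    test = "one-way ANOVA"

# With several variables tested, adjust p-values, e.g. Bonferroni:
n_tests = 3                      # assumed number of clustering variables
p_adjusted = min(p * n_tests, 1.0)
print(test, round(p, 4), round(p_adjusted, 4))
```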
147. Multivariate analysis of cluster differences
• We can test differences across all variables at the same time
• More holistic than ANOVA/KW
• Instead of one dependent variable, we can have multiple variables
• Use meaningful groups of variables (e.g., behavioural variables)
• MANOVA: Multivariate analysis of variance
• Step “before” ANOVAs/KWs
• Has several test statistics: Wilks’ Λ, Pillai’s trace
• Assumption: Homogeneity of covariance
• Much trickier to test: Box’s M is one method, but it is very sensitive (use p<.001)
• Use Levene’s tests on each of the variables (this doesn’t guarantee homogeneity of
covariance but might help)
• If the assumption is violated, MANOVA can still be used, but with a more robust statistic (Pillai’s trace)
147
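Pillai's trace itself is just trace(H(H + E)⁻¹), where H is the between-group and E the within-group sums-of-squares-and-cross-products (SSCP) matrix. A worked sketch with toy data (two clusters, three dependent variables):

```python
import numpy as np

rng = np.random.default_rng(6)
groups = [rng.normal(0, 1, size=(30, 3)),
          rng.normal(1, 1, size=(30, 3))]   # cluster means differ

grand_mean = np.vstack(groups).mean(axis=0)
H = np.zeros((3, 3))   # between-group SSCP
E = np.zeros((3, 3))   # within-group SSCP
for g in groups:
    m = g.mean(axis=0)
    d = (m - grand_mean)[:, None]
    H += len(g) * (d @ d.T)
    centred = g - m
    E += centred.T @ centred

pillai = np.trace(H @ np.linalg.inv(H + E))
print(round(pillai, 3))   # with 2 groups this lies between 0 and 1
```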
148. Example MANOVA
“For assessing the difference between student clusters a multivariate analysis of
variance (MANOVA) was used. To validate the difference between the discovered
clusters a MANOVA model with cluster assignment as a single independent variable and
thirteen clustering variables as the dependent measures was constructed…”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
148
149. Example MANOVA
“Before running MANOVAs, … the homogeneity of covariances assumption was checked using
Box’s M test and homogeneity of variances using Levene’s test. To protect from the assumption
violations, we log-transformed the data and used the Pillai’s trace statistic which is considered to
be robust against assumption violations. As a final protection measure, obtained MANOVA
results were compared with the results of the robust rank-based variation of the MANOVA
analysis”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
149
150. Example MANOVA
“The assumption of homogeneity of covariances was tested using Box’s M test which was not
accepted. Thus, Pillai’s trace statistic was used, as it is more robust to the assumption violations
together with the Bonferroni correction method. A statistically significant MANOVA effect was
obtained, Pillai’s Trace = 1.62, F(39, 174) = 5.28, p < 10⁻¹⁴”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
150
151. MANOVA Follow-up analyses
• A significant MANOVA can be followed up with two types of analyses:
• Individual ANOVAs/KWs (with p-value correction)
• Which in turn can be followed with pairwise analyses: TukeyHSD/Pairwise KWs
• Discriminant Function Analysis (DFA)
• What combinations of variables differentiate between clusters
• DFA can be run alone (without MANOVA) but its significance then can’t be tested
151
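The DFA follow-up can be sketched with scikit-learn's LinearDiscriminantAnalysis (toy data; cluster assignments and variables are assumptions). The fitted scalings are the discriminant functions: linear combinations of the variables that best separate the clusters.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
# Three well-separated clusters on four (clustering) variables.
X = np.vstack([rng.normal(0, 1, size=(40, 4)),
               rng.normal(2, 1, size=(40, 4)),
               rng.normal(4, 1, size=(40, 4))])
clusters = np.repeat([0, 1, 2], 40)          # cluster assignments

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, clusters)
# Each column of scalings_ is one discriminant function; the
# explained-variance ratios show how much separation each captures.
print(lda.scalings_.shape, lda.explained_variance_ratio_.round(3))
```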
152. 152
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities
of inquiry: Effects of learning technology use on cognitive presence in asynchronous online
discussions. The Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
153. Example DFA analysis
153
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities
of inquiry: Effects of learning technology use on cognitive presence in asynchronous online
discussions. The Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002