Slides from the introductory tutorial on topic modeling with R using the LSA, pLSA, and LDA algorithms, organized at the LAK15 conference in Poughkeepsie, NY, March 17, 2015.
Continuous representations of words and documents, recently referred to as word embeddings, have demonstrated large advances in many natural language processing tasks.
In this presentation we provide an introduction to the most common methods of learning these representations, as well as to earlier methods of building them before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we present the continuous bag-of-words model (CBOW), one of the most successful models for word embeddings and one of the core models in word2vec, and briefly glance at models that build representations for other tasks, such as knowledge-base embeddings.
Finally, we motivate the potential of using such embeddings for many tasks that could be of importance to the group, such as semantic similarity, document clustering, and retrieval.
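The CBOW idea named above can be sketched in a few lines: predict a centre word from the average of its context-word vectors. Everything below (the corpus, window size, dimensions, learning rate) is a toy assumption for illustration, not material from the talk.

```python
import numpy as np

# Minimal CBOW sketch with full-softmax gradient descent (word2vec uses
# faster approximations; this is only the core idea).
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output (prediction) weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for pos, word in enumerate(corpus):
        ctx = [idx[corpus[j]]
               for j in range(max(0, pos - window),
                              min(len(corpus), pos + window + 1))
               if j != pos]
        h = W_in[ctx].mean(axis=0)          # average context vectors
        p = softmax(h @ W_out)              # predicted centre-word distribution
        err = p.copy()
        err[idx[word]] -= 1.0               # gradient of cross-entropy loss
        W_out -= lr * np.outer(h, err)
        W_in[ctx] -= lr * (W_out @ err) / len(ctx)

# After training, each row of W_in is a dense word embedding.
embedding = {w: W_in[idx[w]] for w in vocab}
```

After training, words that appear in similar contexts end up with nearby rows of `W_in`, which is exactly the property the downstream similarity and clustering tasks exploit.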
Talk from QCon SF on 2018-11-05
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of each of our members at the right time. With a catalog spanning thousands of titles and a diverse member base spanning over a hundred million accounts, recommending the titles that are just right for each member is crucial. But the job of recommendation does not end there. Why should you care about any particular title we recommend? What can we say about a new and unfamiliar title that will pique your interest? How do we convince you that a title is worth watching? Answering these questions is critical in helping our members discover great content, especially for unfamiliar titles. One way to do this is to consider the artwork or imagery we use to visually portray each title. If the artwork representing a title captures something compelling to you, then it acts as a gateway into that title and gives you some visual “evidence” for why the title might be good for you. Selecting good artwork is important because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we show for each title on the Netflix homepage. We will look at how to frame this as a machine learning problem using contextual multi-armed bandits in a recommendation system setting. We will also describe the algorithmic and system challenges involved in getting this type of approach for artwork personalization to succeed at Netflix scale. Finally, we will discuss some of the future opportunities that we see to expand and improve upon this approach.
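The bandit framing the talk mentions can be illustrated with a tiny epsilon-greedy simulation: pick an artwork arm given a member context, observe whether the title is played, and update a running value estimate. The contexts, arms, and reward probabilities below are made-up stand-ins, not Netflix's actual model.

```python
import random

random.seed(42)
contexts = ["comedy_fan", "drama_fan"]
arms = ["artwork_a", "artwork_b", "artwork_c"]
true_play_prob = {  # hypothetical chance a member plays the title
    ("comedy_fan", "artwork_a"): 0.30, ("comedy_fan", "artwork_b"): 0.10,
    ("comedy_fan", "artwork_c"): 0.05, ("drama_fan", "artwork_a"): 0.05,
    ("drama_fan", "artwork_b"): 0.25, ("drama_fan", "artwork_c"): 0.10,
}

counts = {(c, a): 0 for c in contexts for a in arms}
values = {(c, a): 0.0 for c in contexts for a in arms}
epsilon = 0.1  # fraction of traffic spent exploring

def choose(context):
    if random.random() < epsilon:                         # explore
        return random.choice(arms)
    return max(arms, key=lambda a: values[(context, a)])  # exploit

for step in range(20000):
    c = random.choice(contexts)
    a = choose(c)
    reward = 1.0 if random.random() < true_play_prob[(c, a)] else 0.0
    counts[(c, a)] += 1
    # incremental running mean of observed reward
    values[(c, a)] += (reward - values[(c, a)]) / counts[(c, a)]

best = {c: max(arms, key=lambda a: values[(c, a)]) for c in contexts}
```

The point of the sketch is the context-dependence: the same title ends up with different winning artwork for different member profiles, which is what "personalizing the artwork" means in the bandit framing.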
Interactive Recommender Systems with Netflix and Spotify (Chris Johnson)
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, they are positioned between a lean-back recommendation experience and an active search for a specific piece of content. Beyond this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: design of the user interface and its tight integration with the algorithm in the back-end; computational efficiency of the recommender algorithm; and choosing the right balance between exploiting the feedback from the user so as to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
The Netflix experience is driven by a number of Machine Learning algorithms: personalized ranking, page generation, search, similarity, ratings, etc. On the 6th of January, we simultaneously launched Netflix in 130 new countries around the world, which brings the total to over 190 countries. Preparing for such a rapid expansion while ensuring each algorithm was ready to work seamlessly created new challenges for our recommendation and search teams. In this post, we highlight the four most interesting challenges we’ve encountered in making our algorithms operate globally and, most importantly, how this improved our ability to connect members worldwide with stories they'll love.
A Simple Explanation of the paper XLNet (https://arxiv.org/abs/1906.08237).
It would be helpful to get to grips with the concepts behind XLNet before you dive into the paper.
LinkML Intro July 2022.pptx PLEASE VIEW THIS ON ZENODO (Chris Mungall)
NOTE THAT I HAVE MOVED AWAY FROM SLIDESHARE TO ZENODO
The identical presentation is now here:
https://doi.org/10.5281/zenodo.7778641
General introduction to LinkML, The Linked Data Modeling Language.
Adapted from a presentation given to NIH in May 2022
https://linkml.io/linkml
Software and other technologies for digital humanities projects (Rubén Alcaraz Martínez)
This is a compilation of software applications and technologies useful across many digital humanities projects, together with some brief related case studies. The selection includes multipurpose digital collection management systems, libraries and software for data visualization, tools for building virtual exhibitions, narrative environments of various kinds, geographic information systems, and software for creating and managing controlled vocabularies and for text analysis and data mining.
Fact Store at Scale for Netflix Recommendations with Nitin Sharma and Kedar S... (Databricks)
As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we first try an idea offline using historical data. Ideas that improve our offline metrics are then pushed as A/B tests, which are measured through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. The heart of such offline analyses is the historical fact data used to generate the features required by the machine learning model, for example a member's viewing history or the videos in their My List.
Building a fact store at an ever-evolving Netflix scale is non-trivial. Capturing enough fact data to cover all the stratification needs of various experiments, and guaranteeing that the data we serve is temporally accurate, are important requirements. In this talk, we will present the key requirements, the evolution of our fact store design, its implementation, its scale, and our learnings.
We will also take a deep dive into fact versus feature logging, design trade-offs, infrastructure performance, reliability, and the store's query API. We use Spark and Scala extensively, along with a variety of compression techniques, to store and retrieve data efficiently.
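The temporal-accuracy requirement above boils down to point-in-time ("as of") lookups: offline training must see a fact exactly as it looked when the recommendation was made, not its latest value. This is a generic illustration of that idea, not Netflix's implementation; the field names are invented.

```python
import bisect
from collections import defaultdict

class FactStore:
    """Toy versioned fact store supporting point-in-time reads."""

    def __init__(self):
        # member_id -> list of (timestamp, snapshot), appended in time order
        self._facts = defaultdict(list)

    def log_fact(self, member_id, timestamp, snapshot):
        self._facts[member_id].append((timestamp, snapshot))

    def as_of(self, member_id, timestamp):
        """Return the latest snapshot logged at or before `timestamp`."""
        versions = self._facts[member_id]
        ts = [t for t, _ in versions]
        i = bisect.bisect_right(ts, timestamp)
        return versions[i - 1][1] if i else None

store = FactStore()
store.log_fact("member1", 100, {"viewing_history": ["A"]})
store.log_fact("member1", 200, {"viewing_history": ["A", "B"]})
```

A query for time 150 returns only `["A"]`: features generated for a recommendation made at t=150 must not leak the later play of "B", which is exactly the temporal leakage the fact store design has to prevent.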
Open Government Data: What it is, Where it is Going, and the Opportunities fo... (OECD Governance)
Keynote presentation given by Ryan Androsoff (Digital Government Policy Analyst, OECD) at the 2015 EUROSAI-OLACEFS conference in Quito, Ecuador on 25 June 2015. The presentation focuses on Open Government Data and the opportunities open data presents for Supreme Audit Institutions. Video of the presentation is available at: https://youtu.be/SlBfxmecJhI?t=1h50m19s
For more information on OECD's work relating to Open Government Data please see: http://www.oecd.org/gov/public-innovation/open-government-data.htm
Building Robust ETL Pipelines with Apache Spark (Databricks)
Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources, must handle incorrect, incomplete, or inconsistent records, and must produce curated, consistent data for consumption by downstream applications. In this talk, we’ll take a deep dive into the technical details of how Apache Spark “reads” data, and discuss how Spark 2.2’s flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.
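Spark's datasource readers handle malformed records natively; the plain-Python sketch below is only an illustration of the curate-or-quarantine pattern the abstract describes: parse each raw record, validate required fields, and route bad rows to a side channel instead of failing the whole job. The field names are invented for the example.

```python
import json

REQUIRED = {"user_id", "title_id", "timestamp"}

def curate(raw_lines):
    """Split raw JSON lines into curated records and quarantined failures."""
    clean, quarantined = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
            missing = REQUIRED - rec.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            rec["timestamp"] = int(rec["timestamp"])  # normalise types
            clean.append(rec)
        except ValueError as exc:  # JSONDecodeError is a ValueError subclass
            quarantined.append({"raw": line, "error": str(exc)})
    return clean, quarantined

raw = [
    '{"user_id": 1, "title_id": 7, "timestamp": "1700000000"}',
    '{"user_id": 2}',            # incomplete record
    'not json at all',           # corrupt record
]
good, bad = curate(raw)
```

Keeping the quarantined rows (with the reason they failed) rather than dropping them silently is what makes the pipeline debuggable, mirroring the diagnostic-feedback point in the abstract.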
Fine-tune and deploy Hugging Face NLP models (OVHcloud)
Are you currently managing AI projects that require a lot of GPU power?
Are you tired of managing the complexity of your infrastructures, GPU instances and your Kubeflow yourself?
Need flexibility for your AI platform or SaaS solution?
OVHcloud innovates in AI by offering simple and turnkey solutions to train your models and put them into production.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast, will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a FAISS baseline.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
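Recall figures like the 12% improvement mentioned above are typically measured by comparing an approximate method's top-k results against exact brute-force neighbours. The sketch below shows that measurement; the data and the "approximate" search (which just scans a random half of the base set) are synthetic stand-ins, not the challenge's actual index.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 16)).astype(np.float32)    # indexed vectors
queries = rng.normal(size=(10, 16)).astype(np.float32)
k = 10

def exact_topk(q):
    """Ground-truth neighbours by brute-force L2 distance."""
    d = np.linalg.norm(base - q, axis=1)
    return set(np.argsort(d)[:k])

def approx_topk(q):
    """Stand-in for an ANN index: search only a random half of the base."""
    subset = rng.choice(len(base), size=len(base) // 2, replace=False)
    d = np.linalg.norm(base[subset] - q, axis=1)
    return set(subset[np.argsort(d)[:k]])

# recall@k = fraction of true neighbours the approximate search found
recall = np.mean([len(exact_topk(q) & approx_topk(q)) / k for q in queries])
```

Real ANN indexes trade this recall against query latency; "improved recall by 12% over a baseline" means more of the true neighbours survive the approximation at a comparable speed.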
A Simple Introduction to Word Embeddings (Bhaskar Mitra)
In information retrieval there is a long history of learning vector representations for words. In recent times, neural word embeddings have gained significant popularity for many natural language processing tasks, such as word analogy and machine translation. The goal of this talk is to introduce the basic intuitions behind these simple but elegant models of text representation. We will start our discussion with classic vector space models and then make our way to recently proposed neural word embeddings. We will see how these models can be useful for analogical reasoning as well as for many information retrieval tasks.
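The analogical reasoning the talk refers to is simple vector arithmetic: solve "a is to b as c is to ?" by finding the nearest neighbour of b - a + c under cosine similarity. The 2-d vectors below are hand-made so the classic king/queen relation holds by construction; real embeddings are learned from corpora.

```python
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),  # distractor word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve a : b :: c : ? by nearest cosine neighbour of b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))
```

Here `analogy("man", "king", "woman")` recovers "queen" because the man-to-king offset, added to "woman", lands on the queen vector; in learned embeddings the same arithmetic works only approximately.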
Algorithmic Music Recommendations at Spotify (Chris Johnson)
In this presentation I introduce various machine learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on implicit matrix factorization for collaborative filtering: how to implement a small-scale version using Python, numpy, and scipy, and how to scale up to 20 million users and 24 million songs using Hadoop and Spark.
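A small-scale version of the kind the abstract mentions can be sketched with alternating least squares over an implicit-feedback (play-count) matrix, weighting observed plays by a confidence term. The data, dimensions, and hyperparameters below are toy assumptions, not Spotify's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
plays = np.array([            # users x songs, implicit play counts
    [5, 0, 0, 3],
    [0, 4, 1, 0],
    [4, 0, 0, 2],
])
P = (plays > 0).astype(float)         # binary preference signal
C = 1.0 + 10.0 * plays                # confidence grows with play count
n_users, n_items, d, reg = *plays.shape, 2, 0.1

X = rng.normal(scale=0.1, size=(n_users, d))  # user factors
Y = rng.normal(scale=0.1, size=(n_items, d))  # item factors

for sweep in range(15):               # alternating least squares
    for u in range(n_users):          # solve each user holding items fixed
        Cu = np.diag(C[u])
        A = Y.T @ Cu @ Y + reg * np.eye(d)
        X[u] = np.linalg.solve(A, Y.T @ Cu @ P[u])
    for i in range(n_items):          # solve each item holding users fixed
        Ci = np.diag(C[:, i])
        A = X.T @ Ci @ X + reg * np.eye(d)
        Y[i] = np.linalg.solve(A, X.T @ Ci @ P[:, i])

scores = X @ Y.T                      # predicted preference for every pair
```

Each subproblem is a small ridge regression, which is why the method parallelizes naturally across users and items when scaled out on Hadoop or Spark.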
Dr. Steve Liu, Chief Scientist, Tinder at MLconf SF 2017 (MLconf)
Personalized User Recommendations at Tinder: The TinVec Approach:
With 26 million matches per day and more than 20 billion matches made to date, Tinder is the world’s most popular app for meeting new people. Our users swipe for a variety of purposes, like dating to find love, expanding social networks and meeting locals when traveling.
Recommendation is an important behind-the-scenes service at Tinder, and a good recommendation system needs to be personalized to meet an individual user’s preferences. In this talk, we will discuss a new personalized recommendation approach being developed at Tinder, called TinVec. TinVec embeds users’ preferences into vectors, leveraging the large number of swipes by Tinder users. We will discuss the design, implementation, and evaluation of TinVec as well as its application to personalized recommendations.
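The abstract doesn't detail how TinVec works, so as a purely generic illustration of "embedding preferences from swipes": one common approach is to represent a user as the average of the embeddings of profiles they swiped right on, then recommend by similarity. The profile vectors and names below are invented.

```python
import numpy as np

# Hand-made 2-d profile embeddings (hypothetical; real ones are learned)
profile_vecs = {
    "p1": np.array([1.0, 0.0]),
    "p2": np.array([0.9, 0.1]),
    "p3": np.array([0.0, 1.0]),
    "p4": np.array([0.1, 0.9]),
}
right_swipes = {"user_a": ["p1", "p2"], "user_b": ["p3", "p4"]}

# A user's preference vector is the mean of their right-swiped profiles
user_vecs = {
    u: np.mean([profile_vecs[p] for p in swiped], axis=0)
    for u, swiped in right_swipes.items()
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user, candidates):
    """Rank candidate profiles by similarity to the user's preference vector."""
    v = user_vecs[user]
    return max(candidates, key=lambda p: cosine(v, profile_vecs[p]))
```

The appeal of this family of methods is that swipes alone, with no explicit profile features, can position users and profiles in a shared vector space.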
Bio: Dr. Steve Liu is chief scientist at Tinder. In his role, he leads research innovation and applies novel technologies to new product developments.
He is currently a professor and William Dawson Scholar at the McGill University School of Computer Science. He has also served as a visiting research scientist at HP Labs. Dr. Liu has published more than 280 research papers in peer-reviewed international journals and conference proceedings, and has authored and co-authored several books. Over the course of his career, his research has focused on big data, machine learning/AI, computing systems and networking, the Internet of Things, and more. His research has been referenced in articles published in The New York Times, IDG/Computer World, The Register, Business Insider, Huffington Post, CBC, NewScientist, MIT Technology Review, McGill Daily, and others. He is a recipient of the Outstanding Young Canadian Computer Science Researcher Prize from the Canadian Association of Computer Science and of the Tomlinson Scientist Award from McGill University.
He is serving or has served on the editorial boards of ACM Transactions on Cyber-Physical Systems (TCPS), IEEE/ACM Transactions on Networking (ToN), IEEE Transactions on Parallel and Distributed Systems (TPDS), IEEE Transactions on Vehicular Technology (TVT), and IEEE Communications Surveys and Tutorials (COMST). He has also served on the organizing committees of more than 38 major international conferences and workshops.
Dr. Liu received his Ph.D. in Computer Science with multiple honors from the University of Illinois at Urbana-Champaign. He received his Master’s degree in Automation and BSc degree in Mathematics from Tsinghua University.
Introductory talk given to PhD students starting research at NUS PhD open day 2020. Covers research in Computer Science, and some experience in research on trustworthy software systems.
Rubric-based Assessment of Programming Thinking Skills and Comparative Evaluation of Introductory Programming Environments (Hironori Washizaki)
Hironori Washizaki, "Rubric-based Assessment of Programming Thinking Skills and Comparative Evaluation of Introductory Programming Environments," 4th International Annual Meeting on STEM Education (IAMSTEM 2021), Keynote, August 12-14, 2021, Keelung, Taiwan and Online
Critiquing CS Assessment from a CS for All lens: Dagstuhl Seminar Poster (Mark Guzdial)
Poster presented at the Dagstuhl Seminar "Assessing Learning in Introductory Computer Science" (http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16072). I argue that we have to consider what the learner wants to do and wants to be (i.e., their desired Community of Practice) when assessing learning. Different CoP, different outcomes, different assessments.
Open Syllabus is now a fully functional Sakai 2.6 contrib tool that helps instructors quickly organize their course's material. In this presentation, we sketch its most important features including the seamless integration with resources, citations and assignments.
Presented as two guest lectures for the N8 Centre of Excellence in Computationally Intensive Research (https://n8cir.org.uk) on 8 and 15 March 2022. Reviews basic R functionality, as well as how to prepare a text corpus for statistical analysis, conduct basic analyses of this corpus, and generate and modify bar plots and word clouds using R.
A presentation to the academic staff of SISTC (Sydney International School of Technology and Commerce) on techniques for working with generative AI such as ChatGPT, and on considering different forms of assessment.
Open Syllabus (OSyl) is now a fully functional Sakai 2.7 contrib tool that helps instructors quickly organise their course's material to provide a coherent learning environment for their students.
Educational App Development
course proposal (its offering in 2016/2017 might be discussed in spring/summer 2016)
Department of Information Technology and Technical Education
Faculty of Education, Charles University in Prague
The course gives a professional and academic introduction to cross-platform (web) educational application development founded on the understanding of the role of ICT in contemporary education.
(a slightly updated version of this talk is at https://doi.org/10.6084/m9.figshare.10301741.v1)
A talk on the role of software in research and how NCSA is responding in terms of people and roles - given at the 2019 Data Science Leadership Summit (https://sites.google.com/msdse.org/datascienceleadership2019/).
This is partially based on a previous paper: Daniel S. Katz, Kenton McHenry, Caleb Reinking, Robert Haines, "Research Software Development & Management in Universities: Case Studies from Manchester's RSDS Group, Illinois' NCSA, and Notre Dame's CRC", 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science)
doi: https://doi.org/10.1109/SE4Science.2019.00009
preprint: https://arxiv.org/abs/1903.00732
Learning Engineering Initiatives at Harvard DCE (Jay Luker)
The online nature of Harvard DCE's classroom content puts us in an ideal position to record, analyze and act on data related to the behavior and outcomes of our learners. Join us for an overview of how we are combining information from Canvas, Banner, and Opencast Matterhorn to gain insight into the best ways to shape DCE's learning environments and advance the quality and effectiveness of our course materials through the application of Learning Engineering principles.
This slide deck was used at the ISO/IEC JTC1 SC36 Plenary Meeting on June 22, 2015. Its title is 'Proof of Concept for Learning Analytics Interoperability' and its subtitle is 'Reference Model based on open source SW'.
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docxwellesleyterresa
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf
Nicholas J. Horton, Randall Pruim, Daniel T. Kaplan
A Student's Guide to R
Project MOSAIC
Copyright (c) 2015 by Nicholas J. Horton, Randall Pruim, & Daniel Kaplan.
Edition 1.2, November 2015
This material is copyrighted by the authors under a Creative Commons Attribution 3.0 Unported License. You are free to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work) if you attribute our work. More detailed information about the licensing is available at this web page: http://www.mosaic-web.org/go/teachingRlicense.html.
Cover Photo: Maya Hanna.
Contents
1 Introduction
2 Getting Started with RStudio
3 One Quantitative Variable
4 One Categorical Variable
5 Two Quantitative Variables
6 Two Categorical Variables
7 Quantitative Response, Categorical Predictor
8 Categorical Response, Quantitative Predictor
9 Survival Time Outcomes
10 More than Two Variables
11 Probability Distributions & Random Variables
12 Power Calculations
13 Data Management
14 Health Evaluation (HELP) Study
15 Exercises and Problems
16 Bibliography
17 Index
About These Notes
We present an approach to teaching introductory and intermediate statistics courses that is tightly coupled with computing generally and with R and RStudio in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling based inference, and multivariate graphical techniques. A secondary goal is to facilitate computing with data through use of small simulation studies and appropriate statistical analysis workflow. This follows the philosophy outlined by Nolan and Temple Lang (D. Nolan and D. Temple Lang. Computing in the statistics curriculum. The American Statistician, 64(2):97–107, 2010). The importance of modern computation in statistics education is a principal component of the recently adopted American Statistical Association's curriculum guidelines (ASA Undergraduate Guidelines Workgroup. 2014 curriculum guidelines for undergraduate programs in statistical science. Technical report, American Statistical Association, November 2014. http://www.amstat.org/education/curriculumguidelines.cfm).
Throughout this book (and its companion volumes), we introduce multiple activities, some appropriate for an introductory course, others suitable for higher levels, that demonstrate key concepts in statistics and modeling while also supporting the core material of more traditional courses.
A Work in Progress
Caution! Despite our best efforts, you WILL find bugs both in this document and in our code. Please let us know when y ...
Similar to Topic Modeling for Learning Analytics Researchers LAK15 Tutorial (20)
Introduction to Learning Analytics for High School Teachers and Managers (Vitomir Kovanovic)
Presentation at the first Learning Analytics Learning Network (LALN) Event in Adelaide, Australia on Oct 22, 2019.
Abstract:
With the increased adoption of technology, institutions have unprecedented opportunities to continuously improve the quality of their services through data collection and analysis. Schools and universities now have data about learners and their contexts that can provide valuable insight into how they learn. Early attempts were directed towards mining educational data to identify students-at-risk and develop interventions. Recently, more sophisticated approaches are being deployed by researchers and practitioners. These include analysis of learner behaviour that leads to various learning outcomes, social networks and teams, employability, creativity, and critical thinking. Analysing digital traces generated through learning processes requires a broad suite of methods from data science, statistics, psychometrics, social and learning sciences.
This workshop aims to introduce teachers and educators to the fast-growing and promising field of learning analytics. How digital data can be used for the analysis and improvement of student learning will be explored. First, we will provide an overview of learning analytics, its key methods and approaches, as well as problems for which it can be used. Secondly, attendees will engage in group learning activities to explore ways in which learning analytics could be used within their institutions. The focus will be on identifying learning-related challenges that are relevant to their particular context and exploring how learning analytics can be used practically and effectively.
Extending video interactions to support self-regulated learning in an online ... (Vitomir Kovanovic)
Slides from our presentation at ASCILITE'18 conference in Geelong, Victoria. Full paper is available in ASCILITE conference proceedings at http://ascilite.org/wp-content/uploads/2018/12/ASCILITE-2018-Proceedings.pdf
Analysing social presence in online discussions through network and text anal... (Vitomir Kovanovic)
The slides from our presentation at IEEE ICALT'19 conference.
Abstract:
This paper presents an approach to studying relationships between students' social presence and course topics from transcripts of asynchronous discussions in online learning environments. Specifically, the paper uses topic modelling and epistemic network analysis to investigate how students' social presence is expressed across different course topics. Finally, we show how this method can be adopted to examine how students' social presence changed due to an instructional intervention. The results of this study and its implications are further discussed.
Automated Analysis of Cognitive Presence in Online Discussions Written in Por... (Vitomir Kovanovic)
Slides from our EC-TEL'18 Paper presentation. Full paper is available at https://dx.doi.org/10.1007/978-3-319-98572-5_19
Abstract:
This paper presents a method for automated content analysis of students’ messages in asynchronous discussions written in Portuguese. In particular, the paper looks at the problem of coding discussion transcripts for the levels of cognitive presence, a key construct in the widely used Community of Inquiry model of online learning. Although there are techniques for coding for cognitive presence in the English language, the literature is still poor in methods for other languages, such as Portuguese. The proposed method uses a set of 87 different features to create a random forest classifier to automatically extract the cognitive phases. The model developed reached Cohen’s κ of .72, which represents a “substantial” agreement and is above the Cohen’s κ threshold of .70 commonly used in the literature for determining a reliable quantitative content analysis. This paper also provides some theoretical insights into the nature of cognitive presence by looking at the classification features that were most relevant for distinguishing between the different phases of cognitive presence.
Validating a theorized model of engagement in learning analytics (Vitomir Kovanovic)
Slides from our paper presentation at LAK'19 conference in Tempe, AZ. The full paper is available at https://dx.doi.org/10.1145/3303772.3303775
Abstract:
Student engagement is often considered an overarching construct in educational research and practice. Though frequently employed in the learning analytics literature, engagement has been subjected to a variety of interpretations and there is little consensus regarding the very definition of the construct. This raises grave concerns with regards to construct validity: namely, do these varied metrics measure the same thing? To address such concerns, this paper proposes, quantifies, and validates a model of engagement which is both grounded in the theoretical literature and described by common metrics drawn from the field of learning analytics. To identify a latent variable structure in our data we used exploratory factor analysis and validated the derived model on a separate sub-sample of our data using confirmatory factor analysis. To analyze the associations between our latent variables and student outcomes, a structural equation model was fitted, and the validity of this model across different course settings was assessed using MIMIC modeling. Across different domains, the broad consistency of our model with the theoretical literature suggests a mechanism that may be used to inform both interventions and course design.
Examining the Value of Learning Analytics for Supporting Work-integrated Lear... (Vitomir Kovanovic)
Slides from our presentation at the Seventh National Conference on Work-Integrated Learning (ACEN’18).
The full paper is available at https://www.researchgate.net/publication/328578409_Examining_the_value_of_learning_analytics_for_supporting_work-integrated_learning
Unsupervised Learning for Learning Analytics Researchers (Vitomir Kovanovic)
Slides from my 2.5-day workshop at the 2018 Learning Analytics Summer Institute (LASI'18), held at Teachers College, Columbia University, on July 11, 2018.
Introduction to R for Learning Analytics Researchers (Vitomir Kovanovic)
The slides from my 2hr tutorial organised at 2018 Learning Analytics Summer Institute (LASI) at Teachers College, Columbia University on June 11, 2018.
Kovanović et al. 2017 - Developing a MOOC Experimentation Platform: Insight... (Vitomir Kovanovic)
LAK'17 Conference paper presentation:
Abstract:
In 2011, the phenomenon of MOOCs swept the world of education and put online education in the focus of public discourse around the world. Although researchers were excited by the vast amounts of MOOC data being collected, the benefits of this data did not live up to expectations due to several challenges. The analyses of MOOC data are very time-consuming and labor-intensive, and require a highly advanced set of technical skills, often not available to education researchers. Because of this, MOOC data analyses are rarely done before the courses end, limiting the potential of the data to impact student learning outcomes and experience.
In this paper we introduce MOOCito (MOOC intervention tool), a user-friendly software platform for the analysis of MOOC data that focuses on conducting data-informed instructional interventions and course experimentation. We cover the important design principles behind MOOCito and provide an overview of the trends in MOOC research leading to its development. Although a work in progress, in this paper we outline the prototype of MOOCito and the results of a user evaluation study that focused on the system’s perceived usability and ease of use. The results of the study are discussed, as well as their practical implications.
Towards Automated Classification of Discussion Transcripts: A Cognitive Prese... (Vitomir Kovanovic)
LAK'16 Conference paper presentation:
Abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen’s kappa, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features, which is around 100 times fewer features than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with an added benefit of providing further characterization of the cognitive presence coding scheme.
Embracing GenAI - A Strategic Imperative (Peter Windle)
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Model Attribute Check Company Auto Property (Celine George)
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
A Strategic Approach: GenAI in Education (Peter Windle)
Macroeconomics - Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Introduction to AI for Nonprofits with Tapp Network (TechSoup)
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Francesca Gottschalk - How can education support child empowerment.pptx (EduSkills OECD)
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit www.vavaclasses.com
Unit 8 - Information and Communication Technology (Paper I).pdf (Thiyagu K)
These slides describe the basic concepts of ICT, the basics of Email, Emerging Technology, and Digital Initiatives in Education. This presentation aligns with the UGC Paper I syllabus.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf (TechSoup)
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
1. Topic Modeling for Learning
Analytics Researchers
Marist College,
Poughkeepsie, NY, USA
Vitomir Kovanović
School of Informatics
University of Edinburgh
Edinburgh, United Kingdom
v.kovanovic@ed.ac.uk
Srećko Joksimović
School of Interactive Arts and Technology
Simon Fraser University
Vancouver, Canada
sjoksimo@sfu.ca
Dragan Gašević
Schools of Education and Informatics
University of Edinburgh
Edinburgh, United Kingdom
dgasevic@acm.org
March 17, 2015
3. Workshop outline
08:30 Tutorial Opening (15 min)
● Opening of the tutorial (15 min): Introduction of the presenters, workshop objectives, obtaining datasets for the workshop and resolving any technical issues.
08:45 Introduction (60 min)
● Introduction to topic modeling (30 min): Overview of the topic modeling and learning analytics problems it can address.
● Brief introduction to R and RStudio (30 min): Gentle introduction to R & RStudio for the purposes of the tutorial.
09:45 Coffee break (15 min)
10:00 Topic Modeling Pt I (60 min)
● LDA topic modeling (50 min): Introduction to the topic modeling algorithms (LSA, pLSA, LDA) and the topicmodels R programming library.
● Real-world example of LDA topic modeling (10 min): LDA analysis of the large MOOC news corpora.
11:00 Coffee break (15 min)
11:15 Topic Modeling Pt II (60 min)
● Collaborative topic modeling (40 min): Step-by-step LDA topic modeling on the real-world dataset.
● Advanced topic modeling techniques (20 min): A brief overview of more advanced topic modeling techniques such as hierarchical,
dynamic and correlated topic modeling.
12:15 Tutorial Closing & Discussion (15 min)
● Closing discussion and final remarks (15 min): Overview and discussion of what was learned in the tutorial and how to proceed further with the adoption of topic modeling in one's own research.
4. Download software
Download R and R studio:
MS Windows: bit.do/tm_win
OS X: bit.do/tm_mac
Download example code:
Both platforms: bit.do/tm_code
5. Presenters
Vitomir Kovanović
● Background in software engineering.
○ BSc & MSc in Informatics & Computer Science.
○ From 2008 to 2011 worked as a software developer on a large
distributed system for sports betting.
● Currently a third-year PhD student at the School of Informatics, University of Edinburgh.
○ PhD research focuses on development of learning analytics models
based on text mining, social network analysis and trace-data clustering
in the context of communities of inquiry.
○ From Sep. 2011 to Dec. 2014 in a PhD program at Simon Fraser
University.
● Member of SoLAR, active in learning analytics community.
● In 2014, with Christopher Brooks, Zachary Pardos, and Srećko Joksimović, ran the tutorial “Introduction to Data Mining for Educational Researchers” at LASI at Harvard University.
6. Presenters
Srećko Joksimović
● Background in software engineering.
○ BSc & MSc in Informatics & Computer Science.
● Currently completing the second year of the PhD program at the School of Interactive Arts and Technology, Simon Fraser University.
○ Research interests center around the analysis of teaching and learning in networked learning
environments.
● Member of SoLAR, active in learning analytics community.
○ SoLAR Executive Student Member
● In 2014, with Christopher Brooks, Zachary Pardos, and Vitomir Kovanović, ran the tutorial “Introduction to Data Mining for Educational Researchers” at LASI at Harvard University.
7. This is the first time we have organized this tutorial, so we are both excited and frightened!
DISCLAIMER
8. Audience introduction
Very quick (and casual) run around the room:
● Introduce yourself to other participants
● What is the most important thing you want to get out of this tutorial?
9. The goals of the tutorial
● What is topic modeling?
● Why might you want to use it?
● What types of topic modeling are there? (e.g., LSA, pLSA, LDA)
● Advanced topic modeling techniques
● Latent Dirichlet Allocation (LDA)
○ How to conduct LDA analysis using R
○ How to interpret topic modeling results
○ Several real-world examples
○ Collaborative hands-on LDA analysis
● Very hands-on and practical (i.e., not too much math)
Ask questions anytime!
10. Technical stuff
Download R and R studio:
MS Windows: http://bit.do/lda_win
OS X: http://bit.do/lda_mac
Download example code:
Both platforms: http://bit.do/lda_code
13. Goal of Topic Modeling
A fast and easy bird's-eye view of large datasets.
What is the data about?
What are the key themes?
Very powerful when coupled with different covariates (e.g., year of publication or author):
Longitudinal analysis: How do the key themes change over time?
Focus of discussion: Who is focusing on one topic, and who talks about a variety of topics?
14. Goal of Topic Modeling
Meeks and Weingart (2012):
“It [Topic modeling] is distant reading in the most pure sense: focused on
corpora and not individual texts, treating the works themselves as
unceremonious “buckets of words,” and providing seductive but obscure
results in the forms of easily interpreted (and manipulated) “topics.”” (p. 2)
15. Goal of Topic Modeling
Topics (latent) are the link between Documents (observed) and Words (observed).
Documents are about several topics at the same time.
Topics are associated with different words.
Topics in the documents are expressed through the words
that are used.
23. Social Sciences: Stanford Topic
Modeling Toolbox
http://vis.stanford.edu/files/2009-TopicModels-NIPS-Workshop.pdf
Also extensively used in social sciences
26. Topic Modeling + Learning Analytics
We have a vast amount of data in learning analytics that we want to analyze:
● MOOC/LMS online discussions,
● Student (micro)blog postings,
● Student essays/written exams,
● Research reports that can inform the field (semi-automated meta-analyses),
● News reports about LA/MOOCs/online learning that inform the public,
● .....
27. MOOC News Analysis
Kovanovic, V., Joksimovic, S., Gasevic, D., Siemens, G., & Hatala, M. (2014). What public media reveals about MOOCs? British Journal of Educational Technology (to appear)
● 4,000 News reports from all
over the world
● Ran LDA topic modeling to discover important topics
● Looked at how topics
changed over time
28. Analysis of MOOC discussions
http://scholar.harvard.edu/files/dtingley/files/educationwriteup.pdf
29. Types of topic modeling
Vector-based techniques:
● Latent Semantic Analysis (LSA) (a.k.a Latent Semantic Indexing - LSI)
Probabilistic techniques
● All the cool kids use Bayesian statistics
● Handles synonymy and polysemy better
● Probabilistic Latent Semantic Analysis (pLSA)
● Latent Dirichlet Allocation (LDA)
○ Whole bunch of LDA extensions
Highly recommended introduction to all three algorithms:
Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent
Dirichlet Allocation and Beyond. In C. C. Aggarwal & C. Zhai (Eds.), Mining Text Data (pp. 129–161). Springer.
30. Software for topic modeling
Most require solid technical background:
● Python libraries
a. Gensim - Topic Modelling for Humans
b. LDA Python library
● C/C++ libraries
a. lda-c
b. hlda
c. ctm-c
d. hdp
● R packages
a. lsa package
b. lda package
c. topicmodels package
d. stm package
● Java libraries
a. S-Space Package
b. MALLET
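As a sketch of how the R route from the list above gets set up (package names as listed; assumes CRAN access from your machine):

```r
# One-time install of the topic modeling packages from CRAN.
install.packages(c("lsa", "lda", "topicmodels", "stm"))

# Load what we need for the tutorial in each new session.
library(topicmodels)  # LDA (and CTM) models
library(lsa)          # latent semantic analysis
```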
33. Features of R
- The core software is quite lean; functionality is divided into modular packages,
- Graphics capabilities are very sophisticated,
- Useful for interactive work,
- Contains a powerful programming language for developing new tools,
- Very active and vibrant user community:
- R-help and R-devel mailing lists, Stack Overflow, CrossValidated.
34. RStudio
- RStudio IDE is a powerful and productive user interface for R
- Free, open source, and works on Windows, Mac, and Linux.
- http://www.rstudio.com/
37. Working directory and libraries
- Working directory
- getwd()
- setwd("directoryname")
- Using Files window
- Libraries
- Packages are collections of R functions
- library() # see all packages installed
- search() # see packages currently loaded
- install.packages("package") # install package
- library("package" ) # load package
38. Scripts
- R script is just a plain text file with R commands.
- You can prepare a script in any text editor (e.g., vim,
Notepad, TextEdit, WordPad)
- Creating scripts with R studio
- Running a simple script
> source("foo.R")
39. Graphics
- Basic graph types:
- density plots, dot plots, bar charts, line charts, pie
charts, boxplots, scatter plots
- Advanced graphics
- customization - customize a graph's axes, add reference lines, text
annotations and a legend,
- statistically complex graphs - e.g., plots conditioned on one or more
variables, probability plots, mosaic plots, and correlograms
- Packages
- lattice, ggplot2, graphics
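A minimal base-R sketch of a few of the basic graph types listed above (simulated data; lattice and ggplot2 provide more polished equivalents):

```r
# Simulate some data to plot.
set.seed(42)
x <- rnorm(100)                             # quantitative variable
y <- rnorm(100)                             # second quantitative variable
g <- sample(c("a", "b"), 100, replace = TRUE)  # categorical variable

pdf("plots.pdf")        # send all plots to one PDF file
hist(x)                 # histogram (density-style view of one variable)
boxplot(x ~ g)          # boxplots of x by group
plot(x, y)              # scatter plot
barplot(table(g))       # bar chart of category counts
dev.off()
```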
41. Basic commands
comments:
# everything after # is ignored
variables:
Variable names are case-sensitive. ‘.’ is a valid character!
variable.name <- value # this is some comment
input.directory <- "/home/srecko/cck12_blogs"
number.of.topics <- 5
TF.threshold <- 0.95
stem.words <- FALSE
stem.words <- F
43. Basic commands
vectors (arrays): all elements of the same type
created with the c() function
accessing elements with [ and ]
indexing starts at 1
> names <- c("Huey", "Dewey", "Louie")
> names[1]
[1] "Huey"
> names[2]
[1] "Dewey"
> names[4] <- "Donald Duck"
> names
[1] "Huey" "Dewey" "Louie" "Donald Duck"
44. Basic commands
lists: elements of different types
created with list() function
accessing elements with $name
savings <- list(amount=10000, currency="USD")
> savings
$amount
[1] 10000
$currency
[1] "USD"
> savings$amount
[1] 10000
45. Basic commands
strings:
single or double quotes are valid.
paste and paste0 for string concatenation.
> paste("Three nephews", names[1], names[2], names[3])
[1] "Three nephews Huey Dewey Louie"
> paste0("Three nephews", names[1], names[2], names[3])
[1] "Three nephewsHueyDeweyLouie"
R can operate on all elements in a vector. Arguments are paired
> paste(names, names)
[1] "Huey Huey" "Dewey Dewey" "Louie Louie"
Everything is a vector. Shorter vectors are recycled
paste("Hello", names)
[1] "Hello Huey" "Hello Dewey" "Hello Louie"
same as
> paste(c("Hello", "Hello", "Hello"), names)
[1] "Hello Huey" "Hello Dewey" "Hello Louie"
46. Data frames
table-like structure
> nephew.data <- data.frame(age=c(3, 5, 7),
name=c("Huey", "Dewey", "Louie"),
favorite.drink=c("coke", "orange juice", "bourbon"))
> nephew.data
age name favorite.drink
1 3 Huey coke
2 5 Dewey orange juice
3 7 Louie bourbon
use [row,col] to access individual elements
> nephew.data[2,2]
[1] Dewey
Levels: Dewey Huey Louie
47. Data frames
Access the whole row
> nephew.data[2, ]
age name favorite.drink
2 5 Dewey orange juice
Access the whole column
> nephew.data[, 2]
[1] Huey Dewey Louie
Levels: Dewey Huey Louie
You can also use $column.name syntax
> nephew.data$name
[1] Huey Dewey Louie
Levels: Dewey Huey Louie
48. Exporting the data
The most common ways to export the data:
> write.csv(nephew.data, file="nephews.csv")
> nephew.data <- read.csv(file="nephews.csv")
52. Introduction to topic modeling
Original problem: How to synthesize the information in a large collection of documents?
Example:
D1: modem the steering linux. modem, linux the modem. steering the modem. linux!
D2: linux; the linux. the linux modem linux. the modem, clutch the modem. petrol.
D3: petrol! clutch the steering, steering, linux. the steering clutch petrol. clutch the petrol; the clutch.
D4: the the the. clutch clutch clutch! steering petrol; steering petrol petrol; steering petrol!!!!
Typically done using some clustering algorithm to find groups of similar documents.
53. Alternative approach: Topic modeling
● Find several latent topics or themes that “link” the documents and words.
● Topics explain why certain words appear in a given document.
● Preprocessing is very much the same for all topic modeling methods (LSA, pLSA, LDA)
54. Pre-processing steps
Example:
D1: modem the steering linux. modem, linux the modem. steering the modem. linux!
D2: linux; the linux. the linux modem linux. the modem, clutch the modem. petrol.
D3: petrol! clutch the steering, steering, linux. the steering clutch petrol. clutch the petrol; the clutch.
D4: the the the. clutch clutch clutch! steering petrol; steering petrol petrol; steering petrol!!!!
55. Pre-processing steps
Step 1: Remove all non-alphanumeric characters
D1: modem the steering linux modem linux the modem steering the modem linux
D2: linux the linux the linux modem linux the modem clutch the modem petrol
D3: petrol clutch the steering steering linux the steering clutch petrol clutch the petrol the clutch
D4: the the the clutch clutch clutch steering petrol steering petrol petrol steering petrol
56. Pre-processing steps
Step 2: Find all words that exist in the corpus. That defines our Vocabulary
D1: modem the steering linux modem linux the modem steering the modem linux
D2: linux the linux the linux modem linux the modem clutch the modem petrol
D3: petrol clutch the steering steering linux the steering clutch petrol clutch the petrol the clutch
D4: the the the clutch clutch clutch steering petrol steering petrol petrol steering petrol
Vocabulary: modem, the, steering, linux, clutch, petrol
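The two preprocessing steps, and the document-term matrix used on the next slides, can be sketched in base R (no packages needed; the four example documents as above):

```r
docs <- c(
  D1 = "modem the steering linux. modem, linux the modem. steering the modem. linux!",
  D2 = "linux; the linux. the linux modem linux. the modem, clutch the modem. petrol.",
  D3 = "petrol! clutch the steering, steering, linux. the steering clutch petrol. clutch the petrol; the clutch.",
  D4 = "the the the. clutch clutch clutch! steering petrol; steering petrol petrol; steering petrol!!!!"
)

# Step 1: remove all non-alphanumeric characters (keep spaces).
clean <- gsub("[^[:alnum:][:space:]]", "", docs)

# Step 2: tokenize and collect the vocabulary.
tokens     <- strsplit(tolower(clean), "[[:space:]]+")
vocabulary <- unique(unlist(tokens))

# Document-term matrix: one row per word, one column per document.
dtm <- sapply(tokens, function(words) table(factor(words, levels = vocabulary)))
dtm
```

The resulting counts match the matrix shown on the later slides (e.g., "linux" appears 3 times in D1 and "petrol" 4 times in D4).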
59. Back to starting problem
Can we somehow make matrices smaller?
Two topics => two rows in matrix
D1 D2 D3 D4
linux 3 4 1 0
modem 4 3 0 1
the 3 4 4 3
clutch 0 1 4 3
steering 2 0 3 3
petrol 0 1 3 4
60. Enter Latent Semantic Analysis (LSA)
Topics (latent) are the link between Documents (observed) and Words (observed).
Example:
George Siemens’s new blog post (i.e., a document) will likely be about:
MOOCs. (say 80%)
Connectivism (say 30%)
And given that it is 80% about MOOCs,
it will likely use words:
Massive (90%),
Coursera (50%),
Connectivism (85%),
Analytics (60%)
etc...
And given that it is 30% about Connectivism
it will likely use the words:
Network (80%)
Knowledge (40%)
Share (30%)
etc...
61. LSA
Nothing more than a singular value decomposition (SVD) of the document-term matrix:
Find three matrices U, Σ and V so that: X = UΣV^T
[Figure: the terms × documents matrix X (6×4) factors as U (terms × topics, 6×4) times a diagonal matrix of topic importances (4×4) times V^T (topics × documents, 4×4).]
For example, with 5 topics, 1000 documents and a 1000-word vocabulary:
Original matrix: 1000 x 1000 = 10^6 values
LSA representation: 5x1000 + 5 + 5x1000 ≈ 10^4 values
-> about 100 times less space!
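In R the decomposition is a single call to svd(); a sketch on the 6x4 example matrix, keeping the two largest singular values as our two "topics":

```r
# The 6x4 term-document matrix from the example corpus.
X <- matrix(c(3, 4, 1, 0,
              4, 3, 0, 1,
              3, 4, 4, 3,
              0, 1, 4, 3,
              2, 0, 3, 3,
              0, 1, 3, 4),
            nrow = 6, byrow = TRUE,
            dimnames = list(c("linux", "modem", "the", "clutch", "steering", "petrol"),
                            c("D1", "D2", "D3", "D4")))

s <- svd(X)   # X = U %*% diag(d) %*% t(V)

# Keep only the k = 2 largest singular values ("topics") and reconstruct.
k  <- 2
X2 <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# The rank-2 reconstruction stays close to the original counts.
max(abs(X - X2))
```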
68. Back to Document Term Matrix
D1 D2 D3 D4
linux 3 4 1 0
modem 4 3 0 1
the 3 4 4 3
clutch 0 1 4 3
steering 2 0 3 3
petrol 0 1 3 4
69. What is a topic?
A list of probabilities for each of the
possible words in a vocabulary.
Example topic:
● dog: 5%
● cat: 5%
● house: 3%
● hamster: 2%
● turtle: 1%
● calculus: 0.000001%
● analytics: 0.000001%
● .......
● .......
● .......
Probabilistic Topic Modeling
70. What is a topic?
A list of probabilities for each of the
possible words in a given language.
Example topic:
● dog: 5%
● cat: 5%
● house: 3%
● hamster: 2%
● turtle: 1%
● calculus: 0.000001%
● analytics: 0.000001%
● .......
● .......
● .......
Probabilistic Topic Modeling
Pets
71. Digression: Coin tossing and dice rolling
[Figure: a fair coin (50% / 50%) and a weighted coin (33% / 66%)]
● Two possible outcomes: Heads 50%, Tails 50%
● Modeled by the binomial distribution:
○ parameters:
■ number of trials n (e.g., 10),
■ probabilities for heads/tails (e.g., 50/50)
● The coin can be weighted!
● The binomial distribution still applies:
○ only the probabilities are different
72. Digression: Coin tossing and dice rolling
● Analogously, rolling dice is modeled by the multinomial distribution
○ parameters:
■ number of trials n
■ probabilities (e.g., 1/6 for a fair die)
● Dice can also be rigged (i.e., weighted)
○ the same distribution applies, just with different probabilities.
[Figure: a fair die (16% on each face) and a rigged die (50% on one face, 10% on the others)]
73. A probabilistic spin to LSA: pLSA
Instead of finding lower-ranked matrix representation, we can try to find a mixture of
word->topic & topic->documents distributions that are most likely given the observed documents.
● We define a statistical model of how the documents are being made (generated).
○ This is called generative process in topic modeling terminology.
● Then we try to find parameters of that model that best fit the observed data.
75. Generative process illustrated
We just received a 50-word document from our news reporter John Doe.
He is allowed to write about only one of the 6 possible topics, using only 6 words that exist in the world.
This is how we imagine that he wrote that article:
● For the first word, he rolls a die that tells him the topic of the first word. Say it is topic 1 (IT)
● Then he rolls another die to pick which word to use to describe topic 1. Say it is word 1 (linux)
● The process is repeated for all 50 words in the document.
● One thing! The dice are weighted!!!
○ The first die, for picking topics, puts more weight on the IT topic than on the other 5 topics.
○ Also, the die for the IT topic puts more weight on the words ‘linux’ and ‘modem’.
○ Likewise, the die for topic 2 (cars) puts more weight on the words ‘petrol’ and ‘steering’
78. pLSA
Another, more useful formulation:
Find the most likely values for P(t|d) and P(w|t).
Typically solved using the EM algorithm.
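Written out, the pLSA mixture and its standard EM updates (the textbook formulation, with n(d, w) the count of word w in document d):

```latex
% pLSA models each word-document co-occurrence as a mixture over topics:
P(w \mid d) = \sum_{t} P(w \mid t)\, P(t \mid d)

% E-step: posterior of the topic for each (d, w) pair
P(t \mid d, w) = \frac{P(w \mid t)\, P(t \mid d)}{\sum_{t'} P(w \mid t')\, P(t' \mid d)}

% M-step: re-estimate both distributions from the counts n(d, w)
P(w \mid t) \propto \sum_{d} n(d, w)\, P(t \mid d, w), \qquad
P(t \mid d) \propto \sum_{w} n(d, w)\, P(t \mid d, w)
```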
79. pLSA challenges
● Tries to find the topic distributions and word distributions that best fit the data
○ Overfitting
○ Some documents have strong associations with only a couple of topics, while others have more evenly distributed associations.
80. LDA: an extension to pLSA
In LDA, we encode our assumptions about the data.
Two important assumptions:
1. On average, how many topics are there per document?
a. Few or many?
2. On average, how are words distributed across topics?
a. Are topics strongly associated with a few words or not?
Those assumptions are defined by two vectors α and β:
α: K-dimensional vector that defines how the K topics are distributed across documents.
Smaller αs favor fewer topics strongly associated with each document.
β: V-dimensional vector that defines how the V words are associated across topics.
Smaller βs favor fewer words strongly associated with each topic.
IMPORTANT: Typically, all elements in α and β are the same (in the topicmodels library, 0.1 is the default)
Uninformative prior: we don’t know what the prior distributions are, so we say all are equally likely.
Those assumptions are then used to generate the “dice” (topic-word associations, topic-document associations)
81. Dirichlet Distribution
● Multivariate distribution parametrized by an N-dimensional vector.
● Drawing a sample from a Dirichlet distribution parametrized with an N-dimensional vector returns an N-dimensional vector that sums to one.
N=3:
[Figure: densities over the 3-simplex for Params = [1, 1, 1] (uniform), Params = [10, 10, 10] (parameters bigger than 1: mass in the center) and Params = [.1, .1, .1] (parameters less than 1: mass in the corners)]
82. Dirichlet Distribution
● Multivariate distribution parametrized by an N-dimensional vector.
● Drawing a sample from a Dirichlet distribution parametrized with an N-dimensional vector returns an N-dimensional vector that sums to one.
N=3:
[Figure: samples on the T1–T2–T3 simplex for Params = [1, 1, 1], [2, 2, 2], [5, 5, 5], [.9, .9, .9], [.5, .5, .5], [.1, .1, .1] and the asymmetric [2, 5, 15]; e.g., [5, 5, 5] yields samples near [.33, .33, .33], while [.1, .1, .1] yields sparse samples close to a corner]
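To see this behavior concretely, Dirichlet samples can be drawn in base R by normalizing independent Gamma draws (a standard construction; no extra packages assumed):

```r
# A Dirichlet(alpha) sample = independent Gamma(alpha_i, 1) draws, normalized.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
set.seed(1)
rdirichlet1(c(10, 10, 10))   # parameters > 1: sample near [1/3, 1/3, 1/3]
rdirichlet1(c(.1, .1, .1))   # parameters < 1: most mass on a single component
```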
83. Dirichlet Distribution
How topics are distributed across documents and how words are distributed across topics are both drawn from Dirichlet distributions!
[Figure: the same simplex samples as on the previous slide, for the various parameter vectors]
84. Generative process outline
How we imagine that documents were generated:
1. We specify the number of topics K and our assumptions α ∈ R^K and β ∈ R^V that capture general associations between document-topic and topic-word relationships.
2. For each document we pick one sample from a Dirichlet distribution (defined by α) that defines the document’s distribution over topics (θ ∈ R^K).
3. For each topic we pick one sample from a Dirichlet distribution (defined by β) that defines the topic’s distribution over available words (φ ∈ R^V).
4. For each position in each document:
a. pick a topic Z from the document’s topic distribution,
b. then pick a word from the selected topic Z’s word distribution (φ_Z).
[Figure: plate diagram — number of topics K (specified), assumptions α and β (specified), number of documents (observed), number of words in each document (observed); e.g., K=3, α = [.1, .1, .1]]
85. Generative process for D1
D1: modem the steering linux modem linux the modem steering the modem linux
Our vocabulary is 6 words in total (V=6): linux, modem, the, clutch, steering, petrol
We decide to use three topics (K=3). We call them IT, cars, and pets.
As we have three topics, α has three elements. We set α to [.1, .1, .1].
As we have six words, β has six elements. We set β to [.1, .1, .1, .1, .1, .1].
86. Generative process for D1
D1: modem the steering linux modem linux the modem steering the modem linux
K=3, V=6, α ∈ R³ = [0.1, 0.1, 0.1], β ∈ R⁶ = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
From Dirichlet(β) we sample 3 times, once for each topic:
1. For topic IT we pick φ_IT. It happened to be [0.3, 0.5, 0.05, 0.05, 0.05, 0.05]
2. For topic cars we pick φ_cars. It happened to be [0.1, 0.05, 0.3, 0.15, 0.2, 0.2]
3. For topic pets we pick φ_pets. It happened to be [0.05, 0.1, 0.1, 0.20, 0.3, 0.25]
From Dirichlet(α) we sample M times, once for every document:
4. For document D1, we pick θ_D1. It happened to be [0.9, 0.05, 0.05]
5. For document D2, we pick θ_D2. It happened to be .....
............
6. For the first position in document D1, we pick a topic from Multinomial([0.9, 0.05, 0.05]). Say it is IT.
As the topic for the first position is IT, we pick a word from Multinomial([0.3, 0.5, 0.05, 0.05, 0.05, 0.05]). Oh look, it is ‘modem’.
7. For the second position in document D1, we pick a topic from Multinomial(θ_D1). Say it is cars.
As the topic for the second position is cars, we pick a word from Multinomial(φ_cars). Oh look, it is ‘the’.
8. .........
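The whole generative story can be simulated in a few lines of base R (a sketch; the seed and the sampled values are illustrative, not the exact numbers on the slide):

```r
set.seed(2015)
vocab <- c("linux", "modem", "the", "clutch", "steering", "petrol")
K     <- 3                                # topics: IT, cars, pets
alpha <- rep(0.1, K)                      # document-topic prior
beta  <- rep(0.1, length(vocab))          # topic-word prior
rdir  <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

phi   <- t(replicate(K, rdir(beta)))      # one word "die" per topic (K x V)
theta <- rdir(alpha)                      # topic "die" for one document

# Generate a 12-word document position by position:
z <- sample(K, size = 12, replace = TRUE, prob = theta)        # topic per position
w <- sapply(z, function(t) sample(vocab, 1, prob = phi[t, ]))  # word per position
paste(w, collapse = " ")
```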
88. What is LDA?
● Posterior inference of the θs, φs, and Zs
● Going “backwards” from the observations to the latent elements of the model.
● The problem is intractable, so the probabilities are typically estimated using:
○ MCMC (e.g., Gibbs sampling), or
○ variational Bayesian methods.
90. topicmodels R library
● “Interface” to the lda library. Provides less functionality, but is easier to use.
● Analysis steps:
a. reading data
b. data cleanup (stopwords, numbers, punctuation, short words, stemming, lemmatization)
c. building the document-term matrix
d. filtering the document-term matrix based on TF and TF-IDF thresholds
■ TF: we don’t want words that appear in only a few documents; they are useless for topic description
■ TF-IDF: we don’t want words that appear a lot but are too general (e.g., ‘have’)
e. either select K upfront, or evaluate several values of K for best fit
■ plot the evaluation of different values of K
f. show top words for each topic
g. additional analyses (covariates etc.)
91. topicmodels R library
library(tm)
library(topicmodels)
data <- ...  # load files into memory
corpus <- Corpus(VectorSource(data))
preprocessing.params <- list(stemming = T,
                             stopwords = T,
                             minWordLength = 3,
                             removeNumbers = T,
                             removePunctuation = T)
dtm <- DocumentTermMatrix(corpus, control = preprocessing.params)
dtm <- ...  # filter noise from the DTM (TF, TF-IDF)
lda.params <- list(seed = 1, iter = 2000)
lda.model <- LDA(dtm, k = 20, method = "Gibbs", control = lda.params)
# plot graphics, do further analysis ...
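Once the model is fitted (assuming the result of LDA(...) above was stored in, say, lda.model), the library provides accessors for inspecting it:

```r
library(topicmodels)
terms(lda.model, 10)               # top 10 terms per topic
topics(lda.model)                  # most likely topic for each document
head(posterior(lda.model)$topics)  # full document-topic distributions
```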
92. topicmodels R library
● Problem of picking the number of topics K
○ perplexity on a left-out test set
○ harmonic mean evaluation
● A bit more coding is required for:
○ filtering of words
○ plotting perplexity scores
○ calculating the frequency of each topic
○ finding the most likely topics for each document
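A hedged sketch of picking K by held-out perplexity (assumes dtm.train and dtm.test were split beforehand, and uses the default VEM estimation, for which perplexity() on new data is supported):

```r
library(topicmodels)
ks <- seq(2, 20, by = 2)
models <- lapply(ks, function(k)
  LDA(dtm.train, k = k, control = list(seed = 1)))
perp <- sapply(models, perplexity, newdata = dtm.test)
plot(ks, perp, type = "b", xlab = "number of topics K", ylab = "perplexity")
```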
93. lak_topic_models.R script
# include utility script
source("lak_topic_modeling.R")
# project name, used for saving analysis results
project.id <- "mooc news topic models"
# what values of K to check?
all.ks <- seq(2, 60, 2)
dtm <- build.dtm(load.dir="./data/mooc_news/",
tf.threshold=0.95,
tf.idf.threshold=0.9,
tf.idf.function="median",
stemming=T,
stopwords.removal=T,
minWordLength=3,
removeNumbers=T,
removePunctuation=T,
minDocFreq=3)
94. lak_topic_models.R script
# build all topic models with different values of K
all.topic.models <- build.topic.models(dtm, all.ks, alpha=0.1)
# get the best topic models based on harmonic mean metric
best.model <- get.best.model(all.topic.models)
# assign all documents to most likely topics
document.topic.assignments <- get.topic.assignments(best.model)
log.message("\n\n********************************************")
# print LDA results
print.topics.and.term(document.topic.assignments, best.model, n.terms=10)
# print document headlines
print.nth.line(document.topic.assignments, load.dir)
98. Course and dataset
- Connectivism and Connective Knowledge 2012
(CCK12)
- Aim of the course
- Data
- ~ 285 blogs (including comments)
99. Course topics
- What is Connectivism?
- Patterns of connectivity
- Connective Knowledge
- What Makes Connectivism Unique?
- Groups, Networks and Collectives
- Personal Learning Environments & Networks
- Complex Adaptive Systems
- Power & Authority
- Openness & Transparency
- Net Pedagogy: The role of the Educator
- Research & Analytics
- Changing views, changing systems: From grassroots to policy
102. Correlated Topic Model
- CTM allows the topics to be correlated
- CTM allows for better prediction
- Logistic normal instead of Dirichlet distribution
- More robust to overfitting
103. CTM in R
- Package "topicmodels"
- function CTM
CTM(x, k, method = "VEM", control = NULL, model = NULL, ...)
- Arguments
- x - DocumentTermMatrix object
- k - Integer, number of topics
- method - currently only "VEM" supported
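A minimal usage sketch (assumes a dtm built as in the earlier slides; k = 10 is an arbitrary illustration):

```r
library(topicmodels)
ctm.model <- CTM(dtm, k = 10, method = "VEM", control = list(seed = 1))
terms(ctm.model, 5)   # top 5 terms per topic, as with LDA
```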
105. Supervised LDA
- Each document is associated with an external variable
- for example, a movie review could be associated with a numerical rating.
- Defines a one-to-one correspondence between latent topics and user tags
- This allows sLDA to directly learn word-tag correspondences.
108. Relational Topic Models
- Given a new document RTM predicts which documents
it is likely to be linked to
- For example - tracking activities on Facebook in order to
predict a reaction on an advertisement
- RTM is also good for certain types of data that have
spatial/geographic dependencies
109. Hierarchical topic modeling
- LDA fails to draw the relationship between one topic and another.
- CTM considers the relationships but fails to indicate the level of abstraction of a topic.
[Figure: example topic hierarchy — a root topic of function words (“the, of, a, is, and”) branching into more specific topics, e.g., networking (“network, networks, protocol, packets, link, routing, adaptive”), databases (“database, constraint, algebra, boolean, relational, dependencies, local, consistency”), and logic (“logic, logics, query, theories, languages”)]
110. Structural topic modeling
- STM allows for the inclusion of covariates of interest.
- STM vs. LDA:
- topics can be correlated
- each document has its own prior distribution over topics, defined by covariate X rather than sharing a global mean
- word use within a topic can vary by covariate U
- STM provides fast, transparent, replicable analyses that require few a priori assumptions about the texts under study
111. STM in R
- stm: R package for Structural Topic Models
http://structuraltopicmodel.com
112. Topic Mapping
LDA
- accuracy, reproducibility
Network approach
- Preprocessing
- stemming -> "stars" to "star"
- stop word removal
- Pruning of connections
- remove nonsignificant words
- Clustering of words
- Infomap community detection algorithm
- Topic-model estimation
113. References
[1] Blei, David M., and John D. Lafferty. “A Correlated Topic Model of Science.” The Annals of Applied Statistics, Vol. 1, No. 1 (2007), pp. 17-35. Institute of Mathematical Statistics. URL: http://www.jstor.org/stable/4537420
[2] Blei, David M., and John D. Lafferty. “Dynamic topic models.” Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[3] Gerrish, Sean, and David Blei. “The ideal point topic model: Predicting legislative roll calls from text.” Proceedings of the Computational Social Science and the Wisdom of Crowds Workshop, Neural Information Processing Symposium, 2010.
115. Recap
● Topic modeling is used in a variety of domains:
○ humanities,
○ history,
○ engineering & computing,
○ social sciences, and
○ education
● All approaches model relationships between observed words and documents by a set of latent topics
● Used for exploration and getting a “sense of the data”
● Requires relatively little data pre-processing
● The Document-Term Matrix (DTM) is the foundation on which all topic modeling methods work
○ These are bag-of-words approaches, as the DTM does not contain ordering information
○ Punctuation, numbers, and short, rare, and uninformative words are typically removed
○ Stemming and lemmatization can also be applied
116. Algorithms
● Several different topic modeling algorithms:
○ LSA
■ finding smaller (lower-rank) matrices that closely approximate the DTM
○ pLSA
■ finding the topic-word and topic-document associations that best match the dataset for a specified number of topics K
○ LDA
■ finding the topic-word and topic-document associations that best match the dataset, where the associations come from Dirichlet distributions with given priors
117. Practical application
● Practical application of the algorithms requires the use of different libraries
○ Most require a certain level of programming proficiency
○ They require fiddling with parameters and human judgement
○ MALLET and the R libraries are the most widely used
○ C/C++ libraries typically have the more advanced algorithms
118. Where to go next
● lak_topic_modeling.R provides a nice template for LDA analyses
● Play around with the parameters and extend it for more advanced algorithms
● Try MALLET and other R libraries that provide correlated topic models or structural topic models
● Even better: find a (poor) grad student to do that for you
● Try out basic topic modeling using the topicmodels library
● Use covariates such as publication year or author to make interesting interpretations and insights into the dataset
119. Discussion & Questions
Topic Modeling in Learning Analytics
Q1: Do you plan on using topic modeling for
some of your own research projects?
Q2: Do you see particular problems that you
can address with topic modeling?