The document describes a new movie rating dataset called MovieTweetings that was collected from Twitter posts containing "I rated #IMDb". The dataset addresses the problem of outdated public movie rating datasets by providing a continuously updated collection of user movie ratings and metadata sourced from Twitter and IMDb. It contains over 120,000 ratings for nearly 12,000 movies from around 20,000 users. Example insights from the dataset include the most frequently rated recent releases and lists of the films with the highest and lowest average ratings. The dataset is publicly available to support recommender systems research.
2. Research datasets
Recsys research needs datasets
To evaluate, experiment and demonstrate
I need datasets
Available for download:
MovieLens 100K
MovieLens 1M
MovieLens 10M
Conclusion | Results | About Data | Twitter - IMDb | Intro
Oct. 12, 2013 Simon Dooms - Ghent University - CrowdRec 2013
4. Research datasets
Recsys research needs datasets
To evaluate, experiment and demonstrate
I needed datasets
Available for download:
MovieLens 100K ~ most recent movie: 1998
MovieLens 1M ~ most recent movie: 2000
MovieLens 10M ~ most recent movie: 2008
I need up-to-date movie ratings
5. Finding data
Data is all around us
9. Finding data
Data is all around us
BUT extremely unstructured
What we want:
1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
1::316::5::838983392
(user, item, rating, time)
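A record in this shape is straightforward to read programmatically. The following is a minimal Python sketch, assuming every line has exactly four `::`-separated fields as in the sample above; the `Rating` type and `parse_rating_line` helper are illustrative names, not part of any published tooling.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: str     # kept as text so an ID with leading zeros is not altered
    rating: float    # float so that fractional ratings would also parse
    timestamp: int   # seconds since the Unix epoch


def parse_rating_line(line: str) -> Rating:
    """Split one 'user::item::rating::time' record into typed fields."""
    user, item, rating, timestamp = line.strip().split("::")
    return Rating(int(user), item, float(rating), int(timestamp))


# The first three sample records from the slide.
sample = """1::122::5::838985046
1::185::5::838983525
1::231::5::838983392"""

ratings = [parse_rating_line(line) for line in sample.splitlines()]
print(ratings[0])
```

A line with a different number of fields raises `ValueError` at the unpacking step, which is usually preferable to silently accepting a malformed record.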
14. Structured data
“I rated Death Proof 10/10 #IMDb”
• User
• Item (movie)
• Rating
• Hashtag
15. Structured data
Search Twitter for
“I rated #IMDb”
Bingo!
16. Collecting data
We query the Twitter API for “I rated #IMDb”
Extract relevant information
Cross-reference with IMDb for extra genre data
18. Your data
MovieTweetings dataset available on GitHub
(https://github.com/sidooms/MovieTweetings)
Find it on the RecSys Wiki (category datasets)
Latest
• All ratings
• Automagically updated daily
Snapshots
• Fixed portion of dataset
• Added manually when appropriate
• 10K, 20K, 30K, 40K, 50K, 100K
DISCLAIMER: Depending on Twitter API, IMDb apps and me!
19. Some numbers
           MovieTweetings   MovieLens 100K   MovieLens 1M   MovieLens 10M
Ratings    121,404          100,000          1,000,209      10,000,054
Users      19,464           943              6,040          71,567
Items      11,655           1,682            3,900          10,681
(Results on September 30, 2013)
20. Some fun
Top 3 most rated movies
1. Iron Man 3 (2013)
2. Man of Steel (2013)
3. World War Z (2013)
Top 3 AVG rated movies (min 20 ratings)
1. The Shawshank Redemption (1994)
2. LOTR: The Return of the King (2003)
3. The Dark Knight (2008)
Bottom 3 worst AVG rated movies (min 20 ratings)
3. Scary MoVie (2013)
2. Piranha 3DD (2012)
1. Cosmopolis (2012)
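Rankings like these can be reproduced by averaging the ratings per movie and discarding movies with too few ratings. The sketch below is one possible way to do it in Python; the function name, the tie-breaking rule and the toy data are assumptions made for illustration, and only the minimum-ratings threshold comes from the slide.

```python
from collections import defaultdict


def average_ratings(ratings, min_ratings=20):
    """Return (item, mean rating, count) for items with enough ratings, best first."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of ratings, number of ratings]
    for _user, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    ranked = [
        (item, total / count, count)
        for item, (total, count) in totals.items()
        if count >= min_ratings
    ]
    # Highest mean first; ties are broken by the larger number of ratings.
    ranked.sort(key=lambda row: (-row[1], -row[2]))
    return ranked


# Made-up toy data, not taken from the dataset.
toy = [("u1", "A", 10), ("u2", "A", 8), ("u3", "B", 3), ("u1", "B", 5), ("u2", "C", 10)]
ranked = average_ratings(toy, min_ratings=2)
print(ranked[:3])   # best-rated items
print(ranked[-3:])  # worst-rated items
```

The threshold matters because a movie with a single 10/10 rating would otherwise top the list; in the toy data, item C is dropped for exactly that reason.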
21. Some conclusions
Outdated public datasets
Social media = Unstructured data available
Structured rating data through Twitter – IMDb
MovieTweetings: our Movie Rating Dataset
Always up-to-date
Includes most recent and most relevant movies
Unfiltered rating data
Publicly available
Death Proof (2007) really is an awesome movie
I am Simon Dooms from Ghent University, Belgium, and I will be presenting the MovieTweetings dataset, a movie rating dataset collected from Twitter.
Elephant in the room: research loves datasets. Recsys research especially needs datasets: we need them to evaluate our algorithms, to experiment and, when we want to demonstrate our final recommender systems, to drive the engines. I am no different; my research also needed datasets. For my PhD I am working with hybrid recommender systems, and I focus on the movie domain because movies are fun. So I needed data to test out new configurations and algorithms and did what we all do … download the MovieLens dataset (which comes in three sizes) and insert it into the system. Experiments went well and evaluations were okay, but then I started visually inspecting the end results of my system (the recommendation lists).
This is what I got. I should really watch Braveheart, Forrest Gump and Liar Liar. Three very good suggestions, but they also illustrate a problem. Because I use old datasets, I can only recommend old movies. This is not a problem for my personal experiments and offline evaluations. I can calculate all the RMSE I want, but this IS a problem when I want to take my system out of the lab and show it to actual users, maybe run some user-centric experiments.
We should be able to recommend new and interesting movies, but when I inspected the datasets I was working with, I realized that was impossible. When we use the MovieLens 100K dataset, we are in fact working with data that is 15 years old. So the most recent movies we can recommend are ‘Blade’ and ‘Saving Private Ryan’… The bigger MovieLens datasets are somewhat more recent, but still, even 2008 is 5 years ago. That was the year of the first ‘Twilight’ movie and the first ‘Iron Man’. So if I want to build a recommender system that produces relevant results, I need up-to-date movie ratings.
So I started to look for rating data. And luckily for me, in these modern times we are living in … data is all around us.
For example, take this IMDb movie page. While we get all kinds of information on the movie, there is also preference information to be found, like the fact that the movie is in a top 5000 list, has a total rating of 7.1, more than 7000 people liked it on Facebook, it had some nominations … and so on.
For another example, we go to Facebook, search for the same movie, and this page comes up. Again there is some basic information about the movie, but also preference information: more than 300 thousand people liked this movie/topic. I can click on this link and I get a new screen listing those 300 thousand users.
Yet another source is Twitter. When I search for tweets containing my movie title, I get lists like this one. All tweets contain the movie title, but in fact only two are actual opinions about the movie. Some are rather neutral or just accidentally happen to contain the movie title, like this second one here.
So data is all around us … but it is extremely unstructured and hard to interpret. What we want is a nice list of users, expressing numerical ratings for items with timestamps. So we restart our quest for data, and this time we focus on structured data.
Eventually we found our holy grail in the social share feature integrated in IMDb. You see them everywhere on the web nowadays, the ‘share’ button allowing you to advertise content to your social network. Very often when you click on these things, the original website already makes a suggestion as to what you should write. And luckily for us, IMDb has a very interesting suggestion…
At least it does so for its mobile client apps. They have an app for every major platform, but I have an iPhone, so we will be taking the iPhone tour.
I am on my iPhone and I start the IMDb app… I get this home screen. It allows me to search for movies, so I search for my movie and get this screen… Again, just like on the website, we see some basic information and the option to rate this movie… Now I click the ‘rate this’ link
…and get to the rating screen where I can select my rating. And most importantly, I can choose to share my rating. After saving I get the option to post to Twitter…
…which brings me to the most interesting screenshot. The IMDb app pre-formats my tweet in a structured way: ‘I rated Death Proof 10 out of 10 hashtag #IMDB’. So this tweet actually contains all we need to know: it has a user, an item, a rating and a hashtag that makes it easier for us to find the tweets.
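Because the app writes the tweet from a fixed template, the title and rating can be pulled out with a regular expression. The following Python sketch assumes the template is exactly 'I rated <title> <n>/10 #IMDb' as shown on the slide; real tweets may carry extra text such as a link, and the pattern and the `parse_rating_tweet` name are illustrative guesses, not the extraction code actually used for the dataset.

```python
import re
from typing import Optional, Tuple

# Assumed template: 'I rated <title> <n>/10 ... #IMDb'.
# Matching is case-insensitive because both #IMDb and #IMDB spellings appear.
TWEET_PATTERN = re.compile(
    r"I rated (?P<title>.+?) (?P<rating>\d{1,2})/10\b.*#IMDb",
    re.IGNORECASE,
)


def parse_rating_tweet(text: str) -> Optional[Tuple[str, int]]:
    """Return (title, rating) for a rating tweet, or None if the text does not fit."""
    match = TWEET_PATTERN.search(text)
    if match is None:
        return None
    rating = int(match.group("rating"))
    if not 1 <= rating <= 10:  # IMDb ratings run from 1 to 10
        return None
    return match.group("title"), rating


print(parse_rating_tweet("I rated Death Proof 10/10 #IMDb"))
```

The title group is non-greedy and must be followed by the rating itself, so titles that start with or contain digits are still captured whole.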
Now, to find structured ratings, all we need to do is go to Twitter and find all tweets containing ‘I rated’ and the hashtag #IMDB. Et voilà, behold the jackpot of ratings. Now all tweet results are relevant ratings and contain all the information we need to build ourselves an interesting rating dataset.
On a daily basis we query the Twitter API for tweets containing ‘I rated #IMDB’ and we extract the relevant information. We cross-reference this with the IMDb page to also provide some extra genre data, just like MovieLens does.
The end result of our efforts is three files: ratings, movies and users. In the ratings file we have user IDs, item IDs, ratings and timestamps, presented in the MovieLens style to make the dataset compatible with code working on MovieLens data. Note, however, that the ratings are on a 1 to 10 scale, as is customary for IMDb, and not 1 to 5 as in MovieLens. For the item ID we use the unique IMDb ID, which leads directly to the relevant IMDb information page when added as a suffix to the IMDb URL. The movies file contains, again much like the MovieLens dataset, some basic info on the movie such as title, year and genres. Finally, the users file makes the connection between the internal user ID we used in our ratings file and the true Twitter ID of the user. We use the ID and not the username handle because handles can be changed, but the user ID will always remain the same.
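Two practical consequences of this layout are turning an item ID into a page address and bringing the 1 to 10 ratings onto the MovieLens 1 to 5 range when code expects it. The Python sketch below illustrates both under loud assumptions: the URL scheme (`tt` followed by the ID zero-padded to seven digits) is the commonly seen IMDb title address but is not confirmed by the talk, the ID `0123456` is a placeholder, and the linear rescaling is only one of several reasonable conversions.

```python
def imdb_url(item_id) -> str:
    """Build an IMDb title URL from a numeric item ID (assumed URL scheme)."""
    # Assumption: IDs are the digits of the IMDb title ID, zero-padded to 7 places.
    return f"https://www.imdb.com/title/tt{int(item_id):07d}/"


def to_five_point(rating_10: float) -> float:
    """Linearly map a 1-10 rating onto 1-5 (one possible conversion, not the only one)."""
    return 1 + (rating_10 - 1) * 4 / 9


print(imdb_url("0123456"))   # placeholder ID, not a real dataset entry
print(to_five_point(10))
```

A linear map keeps the ordering of ratings intact; whether it preserves their meaning across the two communities is a separate question that the code does not settle.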
I use this dataset for my own research, but I figured it could probably be interesting for the entire recsys community, and so I made the dataset available online through GitHub. Information about the dataset has also been added to the RecSys wiki, so you can find the dataset in a number of ways. The data itself is made available in two formats: latest and snapshots. The latest repository will always contain all the data and is automagically updated daily. The snapshots are fixed portions of the dataset that make it easier to repeat experiments and to refer to the dataset in research. Currently we have snapshots of 10K up to 100K ratings. There is a little disclaimer I have to add: the continuation of this dataset currently depends on the Twitter API, the functionality of the IMDb apps, and my effort and time. I will do my best to maintain this as long as possible, but there is no way of knowing how long that will be.
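A fixed portion of a growing dataset can be cut in several ways. The Python sketch below shows one of them, taking the earliest N ratings by timestamp; this cut rule is an assumption made for illustration, since the talk does not say how the published snapshots were selected, and the `snapshot` helper and toy records are hypothetical.

```python
def snapshot(ratings, size):
    """Return the `size` earliest ratings, ordered by timestamp."""
    # Each rating is a (user, item, rating, timestamp) tuple.
    ordered = sorted(ratings, key=lambda r: r[3])
    return ordered[:size]


# Made-up toy records, not taken from the dataset.
toy = [
    (1, "0000003", 7, 300),
    (2, "0000001", 9, 100),
    (3, "0000002", 4, 200),
]
print(snapshot(toy, 2))
```

Cutting by time has the convenient property that a smaller snapshot is always a prefix of a larger one, so results on different sizes remain comparable.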
Okay, time for some numbers. We started building this dataset 7 months ago, and this is how many ratings we have gathered since then. Currently we are adding between 500 and 600 new ratings to the dataset each day, and at that pace we have collected about 120K ratings. If we compare numbers with MovieLens, we can see that our data is much sparser because of the high number of users and items contained in the dataset. Our dataset is unfiltered, so we also have users with fewer than 20 ratings.
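The sparsity comparison can be made concrete by dividing the number of ratings by the number of possible user-item pairs. The Python sketch below does this with the counts from the "Some numbers" slide; the `density` helper is an illustrative name, and the definition used (ratings divided by users times items) is the usual one but is not stated in the talk.

```python
def density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Fraction of the user-item matrix that holds a rating."""
    return n_ratings / (n_users * n_items)


# (ratings, users, items) as reported on the slide for September 30, 2013.
stats = {
    "MovieTweetings": (121_404, 19_464, 11_655),
    "MovieLens 100K": (100_000, 943, 1_682),
    "MovieLens 1M": (1_000_209, 6_040, 3_900),
    "MovieLens 10M": (10_000_054, 71_567, 10_681),
}

densities = {name: density(*counts) for name, counts in stats.items()}
for name, value in densities.items():
    ratings_per_user = stats[name][0] / stats[name][1]
    print(f"{name}: density {value:.4%}, {ratings_per_user:.1f} ratings per user")
```

With these counts MovieTweetings fills well under a tenth of a percent of its matrix, while every MovieLens variant fills more than one percent.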
Time to wrap up and conclude this presentation. We started with the notion that public datasets are still very often used in research, but they are becoming outdated and fail to incorporate new and relevant items. Lots of data can be found in social media, but it is almost always dubious and unstructured, and so hard to use in our systems. We found structured data through the social share features of the IMDb platform and built ourselves a new movie rating dataset based on that. The dataset is updated daily … will therefore always contain the most recent and relevant movies … provides unfiltered rating data … and is publicly available. And last but not least, you should really watch the movie Death Proof; it is awesome. Thank you.