SlideShare a Scribd company logo
WOULD THEY
PLAY?
B Y C H R I S T I N E O S A Z U W A
Using Python and music APIs to predict the likelihood of an
artist or band playing a music festival based on artist &
song attributes.
The Problem
Every year music festivals are announced with
dozens to hundreds of artists. Does your
favorite band fit in? Could the company’s
curating their list automate the process?
Given music data from the top music tech
companies in the world, can you predict the
likelihood of an artist playing a popular
festival?
LIST OF ALL BANDS HISTORICALLY THAT PLAYED EACH FESTIVAL
FOR 5-10 YEARS
HISTORICAL DATA ON THE BANDS THAT PLAYED AT THE TIME
THEY PLAYED INCLUDING:
SALES VOLUME, POPULARITY OF ARTIST, POPULARITY OF TOP
TRACKS, LISTENERSHIP, RECORD LABEL, SOCIAL MEDIA METRICS,
RADIO PLAYS, AVERAGE CONCERT TICKET PRICE, AWARDS WON,
APPEARANCE ON BILLBOARD CHARTS, ATTRIBUTES OF TOP SONGS,
GOOGLE TRENDS
EVERGREEN DATA ABOUT THE BANDS:
NUMBER OF BAND MEMBERS, PLACE OF ORIGIN, SIMILARITIES TO
OTHER ARTISTS, START YEAR, BREAK UP YEAR, GENRE, IF ARTIST HAVE
PLAYED THE FESTIVAL BEFORE
Acquiring & Cleaning the
Data
While all the data exists for the artists, the
challenge was finding and connecting data for
various artists together by scraping the web &
using public APIs.
Original:
Get 5-10 years of historical data for 10-15 large world-wide
festivals.
Revised:
As I started the process, I realized it would be challenging
to get all the festival data I wanted, so for the sake of time I
choose to limit to one festival to start:
Vans Warped Tour—a summer long tour music festival
usually played by pop punk, punk, hardcore & emo bands.
A special issue I ran into with Warped Tour data is that the artist change daily for the festival. One area of consistency I found is that Warped Tour releases a
compilation album each year with a song from most of the artists that played a significant amount of dates for the festival, so I choose to use that compilation as my
baseline.
Used “ItemSearch” function to find the
compilation albums, then iterated
through results to get track listings
from each year. Missing 2013-2015
Library/API: bottlenose
Used “search_album” function to find
compilation albums that were missing.
Then iterated though results to get track
listings from each year. Missing 2014.
Library/API: python-itunes
Wikia is a crowdsourced site with
sections dedicated to specific topics. I
used BS4 to pull the data for the artists
for the missing date of 2014.
No Library/API, used BeautifulSoup
Used spotipy’s “search” function to
locate each artist in the Warped Tour
compilation album list, then appended
their Spotify ID. From there, I was able
to use spotipy’s “artist” feature to
append artist popularity, artist genre.
Finally, I used spotipy’s
“artist_top_tracks” to find the artists
top 10 tracks by popularity on Spotify
and append an average of those
song’s attributes.
Music Streaming Service
Library/API: spotipy
Used “search” function to locate each
artist. Appended the
Gracenote/MusicBrainz ID for each
artist and also returned a list of band
members. From there, I did a count of
number of members and appended to
my data list.
Library/API: musicbrainzngs
Used the ID I got from MusicBrainz to
use pylast’s function
“get_artist_by_mbid”. From there, I
appended the artist play count on
LastFM and their overall listener count.
API: pylast
Used pygn’s “search” function to
search for the artist by name and then
append the decade that the artist
started in.
Library/API: pygn
This involved a lot of moving parts but thankfully, these API’s play well together.
Crowd Sourced Music
Database
Music Plays Tracker Music Database
These sites/APIs had data I wanted but it was too challenging to get.
Amazon: Lack of consistent data among the artists.
Google Trends: No official API, limited to 10 queries a minute. Too slow for
mass searching.
Pollstar: API only for artists. Used BeautifulSoup to parse the site for concert
ticket cost data, but very limited amount, resulted in too many missing values.
Billboard: Attempted to find the frequency in which an artist was on a
Billboard chart but API lacked the relevant charts.
Gracenote: Attempted to get artist origin data but rate limits too low for mass
searching on multiple queries, so I prioritized “era”.
Ultimate Music Database: Incredibly robust site, however; no API and very
dated, so even parsing with BeautifulSoup proved challenging.
Instead of manually cleaning the data, I opted to add in
exceptions to errors. Instead of having missing values to clean, I
structure my functions with “try”/“except”, so that in the case of
an error, the function continues, and would pass through a ‘0’.
I didn’t really have time…
band: str, name of band/artist
spotify_id: str, alpha numeric string assign to artist
(provided by Spotify)
musicbrainz_id: str, alpha numeric string assign to
artist (provided by MusicBrainz)
member_count: int, count of number of people
associated with or in the given group/artists
(provided by MusicBrainz)
spotify_popularity: int, popularity of artist based on
Spotify’s algorithm range 0-100 (provided by
Spotify)
spotify_genre: str, the genre assigned the artist
(provided by Spotify)
duration_ms: int, length of the top 10 songs (as
defined by Spotify) of the given artist averaged
(provided by Spotify/EchoNest)
acousticness, danceability, energy,
instrumentalness, key, liveness, loudness, mode,
speechiness, tempo, time_signature, valence:
int, attributes of the top 10 songs (as defined by
Spotify) of the given artist averaged (provided by
Spotify/EchoNest)
lastfm_play_count: int, the number of times an
artists song has been counted on LastFM (provided
by LastFM)
lastfm_listener_count: int, the number of people
that have listened to an artist on LastFM (provided
by LastFM)
Data Explained
Resulting
Dataset
After gathering data from
various websites, I was able
to acquire and successfully
attribute a small portion of
the initial data I sought out
for this project.
Thanks to the Songkick API (and python-songkick
library), I was able to pull data from Lollapalooza
(an indie music festival that takes place in Chicago
every year). This supplemented the Warped Tour
dataset & allow for a split, train test.
Correlations with Target
The Models
Since acquiring the right data took so long, I
did some minimal modeling using Linear
Regression, Logistic Regression, Decision Trees
& K-Nearest Neighbors
Used pandas throughout to import and export CSV
files in order to store data & manipulate.
pandas
z
From sklearn, imported linear_model, feature_selection,
import cross_validation, metrics, LinearRegression,
cross_val_score, export_graphviz, KNeighborsClassifier,
LogisticRegression, DecisionTreeClassifier in order to
run Logistic Regression, Linear Regression, Decision
Trees & KNN. From statsmodel I used to run a Linear
Regression model.
sklearn + statsmodel
m
Imported numpy in order to do basic math: mean,
square root, etc.
numpy
Imported seaborn for data visualizing the correlation
heatmap.
seaborn
Imported pygn, spotipy, billboard, pylast, songkick,
discogs_client, pytrends, musicbrainzngs, pytrends
Music APIs
5
Imported requests, sys, os, itertools, re, bs4, json, time,
string, collections
Others
Did the basics here to make sure everything worked!
Did the basics here to make sure everything worked!
accuracy confusion matrix
train/test scores
Linear
Regression
Results
For the sake of simplicity, I
opted to use this Linear
Regression model in my
function though the overall R2
is not very strong.
I needed to automate getting the data for each artist, for
each festival, so I created a function to run that process.
Given a list of bands, the function can iterate through that
list, append all the data necessary and output a pandas
data frame.
Getting New Festival Data
I created the ability to input an artist and then have a
function run that:
1. Use the above function that collects all the data for
the artist needed to run the model and places it in a
pandas data frame.
2. Runs the model
3. Outputs a plain text answer to the artist’s likelihood of
playing that festival
Predicting Band Likelihood to Play
Results
Getting new festival data:
Lollapalooza 2016:
94 rows × 23 columns
Getting a Prediction:
Conclusions
There is so much left to do!
Gathering data from the internet is time & memory intensive. In addition, restrictions & rate limits,
as well as continuously learning new libraries were strong barriers. Using “sleep” is important!
APIs & Websites
In order to have a large enough data set, I looked at bands that have played the festival across
several years, however; I don’t have the data to say how that band was doing within the year
they played. This automatically skews my data, thus finding more artist attributed data would
help the modeling process.
Lack of Historical Data
These commands were instrumental to making my functions work properly without having to have
completely clean data.
Understanding yield, return, except & try
When you have just a few lines of code it’s easy to remember what you’re doing but when you
approach hundreds, it’s important to make note of what you’re doing and use variables that
make sense. Also, important to know when to use a function & when to just use a variable.
Importance of Labels & Conventions
Next Steps
During this process I requested access to
several APIs such as Next Big Sound that
have yet to be granted, which would
provide me with additional artist data. In
addition, I came across the
Discography.com data late in my process &
hope to incorporate some of that data as
well.
Getting More
Data
Now that I have the functionality created
for one festival, I hope to replicate amongst
several festivals to eventually allow a user
to enter an artist name and it will query
data from multiple festivals and provide the
likelihood of them playing each one.
Adding More
Festivals
Within the scope of this project I focused
primarily on data acquisition and creating a
proof of concept, so given more time, I
intend to refine and improve upon my
modeling.
Improving The
Model
Ultimately, I’d like for this to be web based
so users can simply input an artists &
search, then it will display the results. In
order for this to happen and deliver the
results in a timely fashion, I may need to
start exploring cache & memory.
Creating a User
Interface
Thank You!

More Related Content

What's hot

Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendations
Sophia Ciocca
 
Metadata for musicians: discovery, attribution and payment
Metadata for musicians: discovery, attribution and paymentMetadata for musicians: discovery, attribution and payment
Metadata for musicians: discovery, attribution and payment
Kristin Thomson
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
Ching-Wei Chen
 
Analysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature IdeasAnalysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature Ideas
Sarah L. Miller
 
Metadata is Money at Musicbiz 2017
Metadata is Money at Musicbiz 2017Metadata is Money at Musicbiz 2017
Metadata is Money at Musicbiz 2017
Kristin Thomson
 
Deezer - Big data as a streaming service
Deezer - Big data as a streaming serviceDeezer - Big data as a streaming service
Deezer - Big data as a streaming service
Julie Knibbe
 
Last.fm API workshop - Stockholm
Last.fm API workshop - StockholmLast.fm API workshop - Stockholm
Last.fm API workshop - Stockholm
Matthew Ogle
 
Machine learning for creative AI applications in music (2018 nov)
Machine learning for creative AI applications in music (2018 nov)Machine learning for creative AI applications in music (2018 nov)
Machine learning for creative AI applications in music (2018 nov)
Yi-Hsuan Yang
 

What's hot (8)

Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendations
 
Metadata for musicians: discovery, attribution and payment
Metadata for musicians: discovery, attribution and paymentMetadata for musicians: discovery, attribution and payment
Metadata for musicians: discovery, attribution and payment
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Analysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature IdeasAnalysis of Spotify & New Feature Ideas
Analysis of Spotify & New Feature Ideas
 
Metadata is Money at Musicbiz 2017
Metadata is Money at Musicbiz 2017Metadata is Money at Musicbiz 2017
Metadata is Money at Musicbiz 2017
 
Deezer - Big data as a streaming service
Deezer - Big data as a streaming serviceDeezer - Big data as a streaming service
Deezer - Big data as a streaming service
 
Last.fm API workshop - Stockholm
Last.fm API workshop - StockholmLast.fm API workshop - Stockholm
Last.fm API workshop - Stockholm
 
Machine learning for creative AI applications in music (2018 nov)
Machine learning for creative AI applications in music (2018 nov)Machine learning for creative AI applications in music (2018 nov)
Machine learning for creative AI applications in music (2018 nov)
 

Viewers also liked

Problema de resistencia_de_materiales_volmir
Problema de resistencia_de_materiales_volmirProblema de resistencia_de_materiales_volmir
Problema de resistencia_de_materiales_volmir
Nitram Serrot
 
Pos un-smp-sma-smk-dan-unpk-tahun-2013
Pos un-smp-sma-smk-dan-unpk-tahun-2013Pos un-smp-sma-smk-dan-unpk-tahun-2013
Pos un-smp-sma-smk-dan-unpk-tahun-2013
Ahmad Gulam
 
Office button and home ribbon
Office button and home ribbonOffice button and home ribbon
Office button and home ribbonfajargaluh
 
Tugas Praktikum PTI 1 Ppt
Tugas Praktikum PTI 1 PptTugas Praktikum PTI 1 Ppt
Tugas Praktikum PTI 1 Ppt
Nida N
 
Verbtensespresentsimple 090314094456-phpapp02
Verbtensespresentsimple 090314094456-phpapp02Verbtensespresentsimple 090314094456-phpapp02
Verbtensespresentsimple 090314094456-phpapp02
theartih
 
Tomb Raider The Lost Dominion @ Open Labs
Tomb Raider The Lost Dominion @ Open LabsTomb Raider The Lost Dominion @ Open Labs
Tomb Raider The Lost Dominion @ Open Labs
Elio Qoshi
 
презентацияпроектаFunster
презентацияпроектаFunsterпрезентацияпроектаFunster
презентацияпроектаFunsterrustem_media
 
ura - the future is free (open source solutions startup)
ura - the future is free (open source solutions startup)ura - the future is free (open source solutions startup)
ura - the future is free (open source solutions startup)
Elio Qoshi
 
Learning styles nib
Learning styles nibLearning styles nib
Learning styles nib
theartih
 
Water Monitoring Day Training
Water Monitoring Day TrainingWater Monitoring Day Training
Water Monitoring Day TrainingMalama Ola
 
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
APGEF
 
Powerpoint peta pemikiran
Powerpoint peta pemikiranPowerpoint peta pemikiran
Powerpoint peta pemikiran
AC TS ABD HAMID BIN MOHAMMAD
 
Cultural influence cambodia
Cultural influence cambodiaCultural influence cambodia
Cultural influence cambodiatheartih
 
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
Sanh Huỳnh
 
Crear un esquema comparativo que represente el rol
Crear un esquema comparativo que represente el rolCrear un esquema comparativo que represente el rol
Crear un esquema comparativo que represente el rol
Skett A Mateo Morrobel
 

Viewers also liked (20)

Problema de resistencia_de_materiales_volmir
Problema de resistencia_de_materiales_volmirProblema de resistencia_de_materiales_volmir
Problema de resistencia_de_materiales_volmir
 
Pos un-smp-sma-smk-dan-unpk-tahun-2013
Pos un-smp-sma-smk-dan-unpk-tahun-2013Pos un-smp-sma-smk-dan-unpk-tahun-2013
Pos un-smp-sma-smk-dan-unpk-tahun-2013
 
Eduteach ppt
Eduteach pptEduteach ppt
Eduteach ppt
 
Office button and home ribbon
Office button and home ribbonOffice button and home ribbon
Office button and home ribbon
 
Tugas Praktikum PTI 1 Ppt
Tugas Praktikum PTI 1 PptTugas Praktikum PTI 1 Ppt
Tugas Praktikum PTI 1 Ppt
 
Cultural
Cultural  Cultural
Cultural
 
GoogleAnalytics
GoogleAnalyticsGoogleAnalytics
GoogleAnalytics
 
Verbtensespresentsimple 090314094456-phpapp02
Verbtensespresentsimple 090314094456-phpapp02Verbtensespresentsimple 090314094456-phpapp02
Verbtensespresentsimple 090314094456-phpapp02
 
Tomb Raider The Lost Dominion @ Open Labs
Tomb Raider The Lost Dominion @ Open LabsTomb Raider The Lost Dominion @ Open Labs
Tomb Raider The Lost Dominion @ Open Labs
 
презентацияпроектаFunster
презентацияпроектаFunsterпрезентацияпроектаFunster
презентацияпроектаFunster
 
ura - the future is free (open source solutions startup)
ura - the future is free (open source solutions startup)ura - the future is free (open source solutions startup)
ura - the future is free (open source solutions startup)
 
Learning styles nib
Learning styles nibLearning styles nib
Learning styles nib
 
Water Monitoring Day Training
Water Monitoring Day TrainingWater Monitoring Day Training
Water Monitoring Day Training
 
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
L’APGEF interviendra au VIIe Forum Economique Européen organisé par la ville ...
 
property-tables
property-tablesproperty-tables
property-tables
 
Powerpoint peta pemikiran
Powerpoint peta pemikiranPowerpoint peta pemikiran
Powerpoint peta pemikiran
 
Kahit saan!
Kahit saan! Kahit saan!
Kahit saan!
 
Cultural influence cambodia
Cultural influence cambodiaCultural influence cambodia
Cultural influence cambodia
 
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
Bài giảng giới thiệu dự án: "Nói không với nghiện Game online"
 
Crear un esquema comparativo que represente el rol
Crear un esquema comparativo que represente el rolCrear un esquema comparativo que represente el rol
Crear un esquema comparativo que represente el rol
 

Similar to Music

Poster vega north
Poster vega northPoster vega north
Poster vega north
AcxelVega
 
Trending Music ChatGPT Plugin
Trending Music ChatGPT PluginTrending Music ChatGPT Plugin
Trending Music ChatGPT Plugin
AvinashRulz
 
Back to the Future: Evolution of Music Moods From 1992 to 2022
Back to the Future: Evolution of Music Moods From 1992 to 2022Back to the Future: Evolution of Music Moods From 1992 to 2022
Back to the Future: Evolution of Music Moods From 1992 to 2022
AndriaLesane
 
RecordPlug & plugXchange
RecordPlug & plugXchangeRecordPlug & plugXchange
RecordPlug & plugXchange
Jimmy Ether
 
Here is app.js, artist.js and songs.js file. Can you look at the my .pdf
Here is app.js, artist.js and songs.js file. Can you look at the my .pdfHere is app.js, artist.js and songs.js file. Can you look at the my .pdf
Here is app.js, artist.js and songs.js file. Can you look at the my .pdf
aggarwalshoppe14
 
Hsjs.pdf
Hsjs.pdfHsjs.pdf
Hsjs.pdf
Jyotimoydas2
 
Program 01 For this program you will complete the program in.pdf
Program 01 For this program you will complete the program in.pdfProgram 01 For this program you will complete the program in.pdf
Program 01 For this program you will complete the program in.pdf
addtechglobalmarketi
 
FindStream investor deck
FindStream investor deckFindStream investor deck
FindStream investor deck
FindStream
 
Deezer and Spotify for brands and labels
Deezer and Spotify for brands and labelsDeezer and Spotify for brands and labels
Deezer and Spotify for brands and labels
PlayApp
 
How To Get On A Spotify Playlist
How To Get On A Spotify PlaylistHow To Get On A Spotify Playlist
How To Get On A Spotify Playlist
PlaylistPump PR Agency
 
Application Programming Interfaces
Application Programming InterfacesApplication Programming Interfaces
Application Programming Interfaces
Cindy Royal
 
Movingfans
MovingfansMovingfans
Movingfans
Movingsfans
 
Movingfans full
Movingfans fullMovingfans full
Movingfans full
Movingsfans
 
How to Discover New Music in 2023 5 Best Apps
How to Discover New Music in 2023 5 Best AppsHow to Discover New Music in 2023 5 Best Apps
How to Discover New Music in 2023 5 Best Apps
DarylMitchell9
 
Spotify Stream Prediction using Regression Models
Spotify Stream Prediction using Regression ModelsSpotify Stream Prediction using Regression Models
Spotify Stream Prediction using Regression Models
IRJET Journal
 
9a it4 teaminfinitives presentation
9a it4 teaminfinitives presentation9a it4 teaminfinitives presentation
9a it4 teaminfinitives presentationcardinalwisemanICT
 
9a it4 teaminfinitives pitch presentation
9a it4 teaminfinitives pitch presentation9a it4 teaminfinitives pitch presentation
9a it4 teaminfinitives pitch presentationCardinalwisemanICT2
 
Maintaining a community based music website with open source software
Maintaining a community based music website with open source softwareMaintaining a community based music website with open source software
Maintaining a community based music website with open source software
Andrew Goodwin
 
Roskildefestival big data
Roskildefestival big dataRoskildefestival big data
Roskildefestival big data
Henrik Hammer Eliassen
 

Similar to Music (20)

Poster vega north
Poster vega northPoster vega north
Poster vega north
 
Trending Music ChatGPT Plugin
Trending Music ChatGPT PluginTrending Music ChatGPT Plugin
Trending Music ChatGPT Plugin
 
Back to the Future: Evolution of Music Moods From 1992 to 2022
Back to the Future: Evolution of Music Moods From 1992 to 2022Back to the Future: Evolution of Music Moods From 1992 to 2022
Back to the Future: Evolution of Music Moods From 1992 to 2022
 
RecordPlug & plugXchange
RecordPlug & plugXchangeRecordPlug & plugXchange
RecordPlug & plugXchange
 
Music hack day
Music hack day Music hack day
Music hack day
 
Here is app.js, artist.js and songs.js file. Can you look at the my .pdf
Here is app.js, artist.js and songs.js file. Can you look at the my .pdfHere is app.js, artist.js and songs.js file. Can you look at the my .pdf
Here is app.js, artist.js and songs.js file. Can you look at the my .pdf
 
Hsjs.pdf
Hsjs.pdfHsjs.pdf
Hsjs.pdf
 
Program 01 For this program you will complete the program in.pdf
Program 01 For this program you will complete the program in.pdfProgram 01 For this program you will complete the program in.pdf
Program 01 For this program you will complete the program in.pdf
 
FindStream investor deck
FindStream investor deckFindStream investor deck
FindStream investor deck
 
Deezer and Spotify for brands and labels
Deezer and Spotify for brands and labelsDeezer and Spotify for brands and labels
Deezer and Spotify for brands and labels
 
How To Get On A Spotify Playlist
How To Get On A Spotify PlaylistHow To Get On A Spotify Playlist
How To Get On A Spotify Playlist
 
Application Programming Interfaces
Application Programming InterfacesApplication Programming Interfaces
Application Programming Interfaces
 
Movingfans
MovingfansMovingfans
Movingfans
 
Movingfans full
Movingfans fullMovingfans full
Movingfans full
 
How to Discover New Music in 2023 5 Best Apps
How to Discover New Music in 2023 5 Best AppsHow to Discover New Music in 2023 5 Best Apps
How to Discover New Music in 2023 5 Best Apps
 
Spotify Stream Prediction using Regression Models
Spotify Stream Prediction using Regression ModelsSpotify Stream Prediction using Regression Models
Spotify Stream Prediction using Regression Models
 
9a it4 teaminfinitives presentation
9a it4 teaminfinitives presentation9a it4 teaminfinitives presentation
9a it4 teaminfinitives presentation
 
9a it4 teaminfinitives pitch presentation
9a it4 teaminfinitives pitch presentation9a it4 teaminfinitives pitch presentation
9a it4 teaminfinitives pitch presentation
 
Maintaining a community based music website with open source software
Maintaining a community based music website with open source softwareMaintaining a community based music website with open source software
Maintaining a community based music website with open source software
 
Roskildefestival big data
Roskildefestival big dataRoskildefestival big data
Roskildefestival big data
 

Music

  • 1. WOULD THEY PLAY? B Y C H R I S T I N E O S A Z U W A Using Python and music APIs to predict the likelihood of an artist or band playing a music festival based on artist & song attributes.
  • 2. The Problem Every year music festivals are announced with dozens to hundreds of artists. Does your favorite band fit in? Could the company’s curating their list automate the process? Given music data from the top music tech companies in the world, can you predict the likelihood of an artist playing a popular festival?
  • 3. LIST OF ALL BANDS HISTORICALLY THAT PLAYED EACH FESTIVAL FOR 5-10 YEARS HISTORICAL DATA ON THE BANDS THAT PLAYED AT THE TIME THEY PLAYED INCLUDING: SALES VOLUME, POPULARITY OF ARTIST, POPULARITY OF TOP TRACKS, LISTENERSHIP, RECORD LABEL, SOCIAL MEDIA METRICS, RADIO PLAYS, AVERAGE CONCERT TICKET PRICE, AWARDS WON, APPEARANCE ON BILLBOARD CHARTS, ATTRIBUTES OF TOP SONGS, GOOGLE TRENDS EVERGREEN DATA ABOUT THE BANDS: NUMBER OF BAND MEMBERS, PLACE OF ORIGIN, SIMILARITIES TO OTHER ARTISTS, START YEAR, BREAK UP YEAR, GENRE, IF ARTIST HAVE PLAYED THE FESTIVAL BEFORE
  • 4. Acquiring & Cleaning the Data While all the data exists for the artists, the challenge was finding and connecting data for various artists together by scraping the web & using public APIs.
  • 5. Original: Get 5-10 years of historical data for 10-15 large world-wide festivals. Revised: As I started the process, I realized it would be challenging to get all the festival data I wanted, so for the sake of time I choose to limit to one festival to start: Vans Warped Tour—a summer long tour music festival usually played by pop punk, punk, hardcore & emo bands.
  • 6. A special issue I ran into with Warped Tour data is that the artist change daily for the festival. One area of consistency I found is that Warped Tour releases a compilation album each year with a song from most of the artists that played a significant amount of dates for the festival, so I choose to use that compilation as my baseline. Used “ItemSearch” function to find the compilation albums, then iterated through results to get track listings from each year. Missing 2013-2015 Library/API: bottlenose Used “search_album” function to find compilation albums that were missing. Then iterated though results to get track listings from each year. Missing 2014. Library/API: python-itunes Wikia is a crowdsourced site with sections dedicated to specific topics. I used BS4 to pull the data for the artists for the missing date of 2014. No Library/API, used BeautifulSoup
  • 7. Used spotipy’s “search” function to locate each artist in the Warped Tour compilation album list, then appended their Spotify ID. From there, I was able to use spotipy’s “artist” feature to append artist popularity, artist genre. Finally, I used spotipy’s “artist_top_tracks” to find the artists top 10 tracks by popularity on Spotify and append an average of those song’s attributes. Music Streaming Service Library/API: spotipy Used “search” function to locate each artist. Appended the Gracenote/MusicBrainz ID for each artist and also returned a list of band members. From there, I did a count of number of members and appended to my data list. Library/API: musicbrainzngs Used the ID I got from MusicBrainz to use pylast’s function “get_artist_by_mbid”. From there, I appended the artist play count on LastFM and their overall listener count. API: pylast Used pygn’s “search” function to search for the artist by name and then append the decade that the artist started in. Library/API: pygn This involved a lot of moving parts but thankfully, these API’s play well together. Crowd Sourced Music Database Music Plays Tracker Music Database
  • 8. These sites/APIs had data I wanted but it was too challenging to get. Amazon: Lack of consistent data among the artists. Google Trends: No official API, limited to 10 queries a minute. Too slow for mass searching. Pollstar: API only for artists. Used BeautifulSoup to parse the site for concert ticket cost data, but very limited amount, resulted in too many missing values. Billboard: Attempted to find the frequency in which an artist was on a Billboard chart but API lacked the relevant charts. Gracenote: Attempted to get artist origin data but rate limits too low for mass searching on multiple queries, so I prioritized “era”. Ultimate Music Database: Incredibly robust site, however; no API and very dated, so even parsing with BeautifulSoup proved challenging.
  • 9. Instead of manually cleaning the data, I opted to add in exceptions to errors. Instead of having missing values to clean, I structure my functions with “try”/“except”, so that in the case of an error, the function continues, and would pass through a ‘0’. I didn’t really have time…
  • 10. band: str, name of band/artist spotify_id: str, alpha numeric string assign to artist (provided by Spotify) musicbrainz_id: str, alpha numeric string assign to artist (provided by MusicBrainz) member_count: int, count of number of people associated with or in the given group/artists (provided by MusicBrainz) spotify_popularity: int, popularity of artist based on Spotify’s algorithm range 0-100 (provided by Spotify) spotify_genre: str, the genre assigned the artist (provided by Spotify) duration_ms: int, length of the top 10 songs (as defined by Spotify) of the given artist averaged (provided by Spotify/EchoNest) acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence: int, attributes of the top 10 songs (as defined by Spotify) of the given artist averaged (provided by Spotify/EchoNest) lastfm_play_count: int, the number of times an artists song has been counted on LastFM (provided by LastFM) lastfm_listener_count: int, the number of people that have listened to an artist on LastFM (provided by LastFM) Data Explained Resulting Dataset After gathering data from various websites, I was able to acquire and successfully attribute a small portion of the initial data I sought out for this project.
  • 11. Thanks to the Songkick API (and python-songkick library), I was able to pull data from Lollapalooza (an indie music festival that takes place in Chicago every year). This supplemented the Warped Tour dataset & allow for a split, train test.
  • 12.
  • 14. The Models Since acquiring the right data took so long, I did some minimal modeling using Linear Regression, Logistic Regression, Decision Trees & K-Nearest Neighbors
  • 15. Used pandas throughout to import and export CSV files in order to store data & manipulate. pandas z From sklearn, imported linear_model, feature_selection, import cross_validation, metrics, LinearRegression, cross_val_score, export_graphviz, KNeighborsClassifier, LogisticRegression, DecisionTreeClassifier in order to run Logistic Regression, Linear Regression, Decision Trees & KNN. From statsmodel I used to run a Linear Regression model. sklearn + statsmodel m Imported numpy in order to do basic math: mean, square root, etc. numpy Imported seaborn for data visualizing the correlation heatmap. seaborn Imported pygn, spotipy, billboard, pylast, songkick, discogs_client, pytrends, musicbrainzngs, pytrends Music APIs 5 Imported requests, sys, os, itertools, re, bs4, json, time, string, collections Others Did the basics here to make sure everything worked!
  • 16. Did the basics here to make sure everything worked! accuracy confusion matrix train/test scores
  • 17. Linear Regression Results For the sake of simplicity, I opted to use this Linear Regression model in my function though the overall R2 is not very strong.
  • 18. I needed to automate getting the data for each artist, for each festival, so I created a function to run that process. Given a list of bands, the function can iterate through that list, append all the data necessary and output a pandas data frame. Getting New Festival Data I created the ability to input an artist and then have a function run that: 1. Use the above function that collects all the data for the artist needed to run the model and places it in a pandas data frame. 2. Runs the model 3. Outputs a plain text answer to the artist’s likelihood of playing that festival Predicting Band Likelihood to Play
  • 19. Results Getting new festival data: Lollapalooza 2016: 94 rows × 23 columns Getting a Prediction:
  • 20. Conclusions There is so much left to do!
  • 21. Gathering data from the internet is time & memory intensive. In addition, restrictions & rate limits, as well as continuously learning new libraries were strong barriers. Using “sleep” is important! APIs & Websites In order to have a large enough data set, I looked at bands that have played the festival across several years, however; I don’t have the data to say how that band was doing within the year they played. This automatically skews my data, thus finding more artist attributed data would help the modeling process. Lack of Historical Data These commands were instrumental to making my functions work properly without having to have completely clean data. Understanding yield, return, except & try When you have just a few lines of code it’s easy to remember what you’re doing but when you approach hundreds, it’s important to make note of what you’re doing and use variables that make sense. Also, important to know when to use a function & when to just use a variable. Importance of Labels & Conventions
  • 22. Next Steps During this process I requested access to several APIs such as Next Big Sound that have yet to be granted, which would provide me with additional artist data. In addition, I came across the Discography.com data late in my process & hope to incorporate some of that data as well. Getting More Data Now that I have the functionality created for one festival, I hope to replicate amongst several festivals to eventually allow a user to enter an artist name and it will query data from multiple festivals and provide the likelihood of them playing each one. Adding More Festivals Within the scope of this project I focused primarily on data acquisition and creating a proof of concept, so given more time, I intend to refine and improve upon my modeling. Improving The Model Ultimately, I’d like for this to be web based so users can simply input an artists & search, then it will display the results. In order for this to happen and deliver the results in a timely fashion, I may need to start exploring cache & memory. Creating a User Interface