This document summarizes a study analyzing trends in lyrics of hit pop songs from 1970 to 2009. The study involved collecting lyrics from various online sources, cleaning and consolidating the data, then using text mining and sentiment analysis techniques to analyze trends. Key findings included themes of heartbreak and love being most common, average song length peaking in the 80s-90s and declining since, and a shift towards more female artists in recent decades. Sentiment analysis classified songs as positive, negative or neutral. The hybrid model achieved the best accuracy at 88% for this task. Popular artists tended to have longevity, spanning multiple decades.
2. INTRODUCTION
Originated in the 1950s
Derived from “popular” music
Mass audience appeal
Catchy rhythm and lyrics
Major form of entertainment for all ages
Medium of expression
3. WE FOCUS ON
Analyze lyrical patterns of hit pop songs
From 1970 to 2009
Observe trends in terms of frequent words, artists, length of the track, etc.
Notice variations in styles
4. HOW DID WE DO THIS?
Data from multiple sources
Combined into a large data set
Data summarization
Text mining and modeling
SAS Sentiment Studio and modeling
9. “It is a capital mistake to theorize before one has data.”
10. DATA PREPARATION
Data access:
COLUMN DESCRIPTION
SOURCE
Year of song release
Position
Artist Name
http://www.bobborst.com/popculture/top-100-songs-of-the-year/?year=1970
Song Name
Lyrics
http://www.metrolyrics.com/azlyrics.html
Gender
http://www.lyricsfreak.com/
http://en.wikipedia.org/wiki/
State
http://en.wikipedia.org/wiki/
Length
http://en.wikipedia.org/wiki/
11. DATA CONSOLIDATION
Four separate Excel workbooks were created, with 201 rows each.
The Consolidate function in Excel was used to merge the datasets.
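The same merge can be sketched outside Excel. The snippet below is a minimal illustration, assuming the merge amounts to stacking rows from workbooks that share the same columns. The function name `consolidate`, the column names used for sorting, and the sample rows are hypothetical and are not taken from the study.

```python
from typing import Dict, List

Row = Dict[str, object]


def consolidate(workbooks: List[List[Row]]) -> List[Row]:
    """Stack rows from several workbooks into one data set.

    Assumes every workbook shares the same column names; raises
    ValueError if a row's columns differ from the first row seen.
    """
    merged: List[Row] = []
    expected = None
    for rows in workbooks:
        for row in rows:
            columns = sorted(row)
            if expected is None:
                expected = columns
            elif columns != expected:
                raise ValueError(f"column mismatch: {columns} != {expected}")
            merged.append(dict(row))
    # Sort by year, then chart position, so the decades read in order.
    merged.sort(key=lambda r: (r["Year"], r["Position"]))
    return merged


# Hypothetical sample: two tiny "workbooks" standing in for the real ones.
seventies = [{"Year": 1970, "Position": 2, "Artist": "A", "Song": "S1"}]
eighties = [
    {"Year": 1980, "Position": 1, "Artist": "B", "Song": "S2"},
    {"Year": 1970, "Position": 1, "Artist": "C", "Song": "S3"},
]
combined = consolidate([eighties, seventies])
```

Checking the column names before stacking is a deliberate choice here: a silent mismatch between workbooks is the most likely way a manual merge goes wrong.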
12. DATA CONSOLIDATION
Preview of rows in an Excel workbook
Steps involved in data consolidation:
Conversion of data
Conversion to standard format
Importing data into environment
15. DATA DICTIONARY
Attribute | Description | Field type | Source | Example
Year | Year of appearance on list | Num(5) | http://www.bobborst.com | 2004
Position | Rank of song on list | Num(3) | http://www.bobborst.com | 4
Artist | Name of the singer | VarChar(50) | http://www.bobborst.com | Maroon 5
Song | Title of the track | VarChar(100) | http://www.bobborst.com | This love
Gender | Gender of the artist | Char(10) | http://www.wikipedia.com | Male
Lyrics | Lyrics of the song | VarChar(20000) | http://www.azlyrics.com | I was so high...
State | Name of the US state of origin, else NA | Char(50) | http://www.wikipedia.com | California
Length | Length of the track in seconds | Num(10) | http://www.wikipedia.com | 207
Theme* | Specifies theme of the song as rap/religion/men/women | VarChar(20) | Manually coded | Love
* We used themes such as Happiness, Love, Heartbreak, Optimism etc. as a categorical variable to signify the theme of
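For readers working outside a spreadsheet, the data dictionary maps naturally onto a typed record. The following is a minimal sketch, assuming plain Python types in place of the Num/Char/VarChar field types. The class name `SongRecord` and the use of `None` for "NA" are hypothetical choices, not part of the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SongRecord:
    """One row of the data set; field names follow the data dictionary."""
    year: int             # Year of appearance on list
    position: int         # Rank of song on list
    artist: str           # Name of the singer
    song: str             # Title of the track
    gender: str           # Gender of the artist
    lyrics: str           # Lyrics of the song
    state: Optional[str]  # US state of origin; None stands in for "NA"
    length: int           # Length of the track in seconds
    theme: str            # Manually coded theme, e.g. "Love"


# Values taken from the "Example" entries of the data dictionary.
example = SongRecord(
    year=2004, position=4, artist="Maroon 5", song="This love",
    gender="Male", lyrics="I was so high...", state="California",
    length=207, theme="Love",
)
```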
17. DATA UNDERSTANDING
Distribution of songs according to themes:
Heartbreak: 188
Love: 162
Happiness: 89
Dance: 82
Sorrow: 81
Optimism: 54
Women: 46
Men: 13
Hate: 13
Religion: 8
Instrumental: 5
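To compare themes on a common scale, the counts can be turned into percentage shares. The sketch below uses only the counts listed on this slide; the variable names are illustrative, and the shares are relative to the sum of the listed counts rather than to any data set size stated elsewhere.

```python
# Theme counts as listed on the slide.
theme_counts = {
    "Heartbreak": 188, "Love": 162, "Happiness": 89, "Dance": 82,
    "Sorrow": 81, "Optimism": 54, "Women": 46, "Men": 13,
    "Hate": 13, "Religion": 8, "Instrumental": 5,
}

# Sum of the listed counts, used as the denominator for every share.
total = sum(theme_counts.values())

# Percentage share of each theme, rounded to one decimal place.
theme_shares = {
    theme: round(100 * count / total, 1)
    for theme, count in theme_counts.items()
}

for theme, share in theme_shares.items():
    print(f"{theme:<13}{theme_counts[theme]:>4}{share:>7.1f}%")
```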
18. DATA UNDERSTANDING
Average length of songs by year
The average length of songs peaked between the late 80s and the 90s.
The current trend is towards shorter songs.
19. DATA UNDERSTANDING
Length versus position on chart
Shortest song: 1:40 minutes
Longest song: 8:57 minutes
Songs with lengths less than 2:30 minutes and beyond 7:30 minutes never made it to the
Overall Mean: 4:02
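The slide reports lengths as minutes:seconds, while the data dictionary stores Length in seconds. A minimal sketch of the conversion in both directions follows; the helper names `to_seconds` and `to_mmss` are hypothetical.

```python
def to_seconds(length: str) -> int:
    """Convert a 'minutes:seconds' string such as '4:02' to seconds."""
    minutes, seconds = length.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds: int) -> str:
    """Convert a length in seconds back to 'minutes:seconds'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Figures quoted on the slide, expressed in seconds.
shortest = to_seconds("1:40")
longest = to_seconds("8:57")
overall_mean = to_seconds("4:02")
```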
20. DATA UNDERSTANDING
Gender – by decade
For the years 1970 through 1979, 75.5% of the entries were by male singers and 24.5% were by female singers.
In the next decade, from 1980 to 1989, male entries fell to 71% and female entries rose to 29%.
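Percentages like these can be reproduced by grouping entries by decade and counting genders. The following is a minimal sketch, assuming each entry is available as a (year, gender) pair. The function name and the toy records are hypothetical and are not taken from the study data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def gender_share_by_decade(
    records: Iterable[Tuple[int, str]]
) -> Dict[int, Dict[str, float]]:
    """Return {decade: {gender: percentage of that decade's entries}}."""
    counts = defaultdict(Counter)
    for year, gender in records:
        decade = (year // 10) * 10  # e.g. 1978 -> 1970
        counts[decade][gender] += 1
    shares = {}
    for decade, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[decade] = {
            gender: round(100 * n / total, 1) for gender, n in counter.items()
        }
    return shares


# Hypothetical toy records, far smaller than the real data set.
sample = [
    (1971, "Male"), (1975, "Male"), (1978, "Male"), (1979, "Female"),
    (1983, "Male"), (1988, "Female"),
]
shares = gender_share_by_decade(sample)
```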
21. DATA UNDERSTANDING
Gender – by decade
During the 90s, the trend changed and entries were almost equal: male entries dropped to 53% while female entries rose to 47%.
During the following decade, from 2000 to 2009, male entries
25. MODELING
Aim: Predictive modeling to predict themes
Regression: Logistic regression with stepwise selection method
Text topics as input variable
Themes as target variable
Model | Misclassification rate | Average squared error
Logistic regression (stepwise) | 0.07625 | 0.070996
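The two fit statistics reported for the model can be illustrated with a small worked example. This is a sketch under stated assumptions: misclassification rate is taken as the fraction of wrong predictions, and average squared error as the mean squared gap between 0/1 class indicators and predicted probabilities, averaged over every song and class. The tool used in the study may use a different divisor, and the toy predictions below are hypothetical.

```python
from typing import Dict, List


def misclassification_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of songs whose predicted theme differs from the actual theme."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def average_squared_error(actual: List[str],
                          probabilities: List[Dict[str, float]],
                          classes: List[str]) -> float:
    """Mean squared gap between class indicators and predicted probabilities.

    Averages over every (song, class) pair; this divisor is an assumption.
    """
    total = 0.0
    for label, probs in zip(actual, probabilities):
        for cls in classes:
            indicator = 1.0 if cls == label else 0.0
            total += (indicator - probs.get(cls, 0.0)) ** 2
    return total / (len(actual) * len(classes))


# Hypothetical toy predictions for four songs and two themes.
classes = ["Love", "Heartbreak"]
actual = ["Love", "Love", "Heartbreak", "Heartbreak"]
probabilities = [
    {"Love": 0.9, "Heartbreak": 0.1},
    {"Love": 0.4, "Heartbreak": 0.6},
    {"Love": 0.2, "Heartbreak": 0.8},
    {"Love": 0.3, "Heartbreak": 0.7},
]
# Predicted theme is the class with the highest probability.
predicted = [max(p, key=p.get) for p in probabilities]
rate = misclassification_rate(actual, predicted)
ase = average_squared_error(actual, probabilities, classes)
```

In this toy case one of four songs is misclassified, so the rate is 0.25; the reported 0.07625 would correspond to roughly 7.6% of songs receiving the wrong theme.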
26. MODELING
Topics 2 and 5 were identified and considered significant inputs
27. SENTIMENT ANALYSIS
Aim: Categorize the songs into positive and negative themes
Three models were developed:
• Statistical
• Rule-based
• Hybrid
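The difference between the rule-based and hybrid approaches can be sketched with a toy example. This is not the SAS Sentiment Studio implementation, whose internals the slides do not describe. It assumes a rule-based model that counts words from small hand-made lexicons, and a hybrid that falls back on a statistical probability when the rules are inconclusive. The lexicons, function names, and the `statistical_prob_positive` parameter are all hypothetical.

```python
import re

# Hypothetical mini-lexicons; a real rule set would be far larger.
POSITIVE = {"love", "happy", "smile", "dance", "shine"}
NEGATIVE = {"cry", "lonely", "hate", "goodbye", "pain"}


def rule_based_score(lyrics: str) -> int:
    """Count positive words minus negative words in the lyrics."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


def hybrid_label(lyrics: str, statistical_prob_positive: float) -> str:
    """Use the rule score when it is decisive, else the statistical probability.

    statistical_prob_positive stands in for the output of a trained
    statistical model, which is not implemented here.
    """
    score = rule_based_score(lyrics)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "positive" if statistical_prob_positive >= 0.5 else "negative"
```

For example, `hybrid_label("la la la", 0.2)` has no lexicon hits, so the label comes from the statistical probability alone.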
38. CONCLUSIONS – POPULAR TRENDS
79 songs by the top ten most recurring artists were further analyzed.
Six female entries and four male entries were observed.
47 songs were sung by females as opposed to 32 by males.
Top themes were heartbreak (36.7%), love (22.8%) and
39. CONCLUSIONS – POPULAR TRENDS
There was no pattern in the place of origin.
3 out of 10 entries belonged to countries other than the U.S.
Most of these singers made it to the top hits in multiple years spanning decades.
40. CONCLUSIONS – POPULAR TRENDS
1: 12 entries, New York
2: 11 entries, Texas
3: 9 entries, Indiana
4: 8 entries
4: 8 entries, California
5: 7 entries
6: 6 entries, Texas
6: 6 entries, Pennsylvania
6: 6 entries, Michigan
6: 6 entries, Barbados