What's Popular in Music? Let's analyze!
I know nothing about music. This makes it an ideal topic for me to study with data, free of any (informed) misconceptions. This exploration is an expansion of what a team of two other students and I did for a Data Competition at the Fuqua School of Business. The competition involved creating the dataset. For this pet project, I have added some analysis of my own.
Acknowledgements
I want to thank the Master of Quantitative Management team at the Fuqua School of Business and the Fuqua School of Business as a whole for giving us the opportunity to participate in this competition and further build our data collection and analysis skills. I would also like to thank my teammates in the competition, without whom the dataset which I have analyzed further would not exist.
Data
The data comprises the top 20 songs and their respective artists from the year-end US Billboard Hot 100 for the years 2006 to 2018. The data was scraped from the US Billboard website. Song metrics were obtained from Chartmetric. These features include duration, genre, popularity, as well as technical details including beats per minute (BPM), danceability, valence, energy, and liveness.
Motivation
So why music? The music industry is a rapidly changing sector where trends emerge and fade away. Artists are constantly innovating and producing exciting music for everyone to enjoy. The US Billboard Hot 100 publishes lists of what is currently trending, and the pace of change can be intimidating to those unfamiliar with the world of music, given that there are even weekly lists. This data can be used to provide insights about music tastes that are valuable to artists and brands. Artists can use it to make sense of patterns in the trends, to understand what kinds of songs are gaining popularity, and to see how music tastes are changing. Brands too can capitalize on these insights. When selecting potential influencers or personalities for advertisements or sponsorships, brands can use this data to identify emerging artists who are appearing more often in, say, the Top 100.
Moreover, the data allows a more fundamental analysis of the songs themselves. Characteristics such as BPM, energy, danceability, valence, acousticness, and liveness can be analyzed to see whether songs cluster around particular values, and whether any such clusters are associated with a particular genre or artist. If an association is found, the data could then be used to train models that predict the genre or artist of a new song for which these characteristics are available.
My study, therefore, is an exploratory analysis of trends over time in key music metrics of the top 20 songs of each year between 2006 and 2018.
Analysis
Let's carry out a sanity check to ensure that our data has the correct years and number of observations. The year refers to the latest year a song appeared in the Top 20. Because a song that appears in more than one year is kept only once, a few years have fewer than 20 songs associated with them.
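A minimal sketch of what such a sanity check could look like, assuming the data is available as (year, title) pairs. The sample rows and the function names are made up for illustration; the real dataset is not reproduced here.

```python
from collections import Counter

# Hypothetical rows: (latest year in the Top 20, song title).
songs = [
    (2006, "Song A"),
    (2006, "Song B"),
    (2007, "Song C"),
    (2018, "Song D"),
]

def songs_per_year(rows):
    """Count how many songs are assigned to each year."""
    return Counter(year for year, _ in rows)

def check_years(rows, first_year=2006, last_year=2018, max_per_year=20):
    """Return (missing years, out-of-range years, years with too many songs)."""
    counts = songs_per_year(rows)
    expected = set(range(first_year, last_year + 1))
    missing = sorted(expected - set(counts))
    out_of_range = sorted(set(counts) - expected)
    over_full = sorted(y for y, n in counts.items() if n > max_per_year)
    return missing, out_of_range, over_full

counts = songs_per_year(songs)
missing, out_of_range, over_full = check_years(songs)
```

With the full dataset, `missing`, `out_of_range`, and `over_full` should all be empty, while `counts` may show a few years below 20.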
We would like to see how the average duration of songs in the Top 20 varies by year. The average duration appears to follow an almost wave-like pattern as the years go on. Can we expect the top 20 of 2019 to be slightly longer on average than the top 20 of 2018...or slightly shorter?
We are also interested in understanding how the popular genres change over time. Interestingly, while pop songs dominated the top 20 in the early years of the analysis, hip-hop and hip-hop/rap songs have become increasingly popular in the later years.
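The yearly averages can be computed along these lines. This is only a sketch: the (year, duration in seconds) pairs below are invented, and the year-over-year differences are simply one way I would make the "wave" easier to see.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (year, duration in seconds).
durations = [
    (2016, 210), (2016, 230),
    (2017, 200), (2017, 220),
    (2018, 190), (2018, 200),
]

def mean_duration_by_year(rows):
    """Average duration per year, ordered by year."""
    by_year = defaultdict(list)
    for year, seconds in rows:
        by_year[year].append(seconds)
    return {year: mean(values) for year, values in sorted(by_year.items())}

def year_over_year_change(averages):
    """Difference between each year's average and the previous year's."""
    years = sorted(averages)
    return {
        later: averages[later] - averages[earlier]
        for earlier, later in zip(years, years[1:])
    }

averages = mean_duration_by_year(durations)
changes = year_over_year_change(averages)
```

A run of changes that flips sign every few years would match the wave-like pattern, but with only 13 yearly averages it says little about 2019.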
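One way to put numbers on this shift is to compute each genre's share of the top 20 per year. The sketch below assumes (year, genre) pairs; the rows are placeholders, not the actual data.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (year, genre).
rows = [
    (2006, "pop"), (2006, "pop"), (2006, "hip-hop"),
    (2018, "hip-hop"), (2018, "hip-hop/rap"), (2018, "pop"),
]

def genre_share_by_year(rows):
    """For each year, the fraction of songs belonging to each genre."""
    counts = defaultdict(Counter)
    for year, genre in rows:
        counts[year][genre] += 1
    shares = {}
    for year, genre_counts in sorted(counts.items()):
        total = sum(genre_counts.values())
        shares[year] = {g: n / total for g, n in genre_counts.items()}
    return shares

shares = genre_share_by_year(rows)
```

Using shares rather than raw counts keeps the years comparable even when a year has fewer than 20 songs.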
Technical Characteristics
While working on the project, I was introduced for the first time to some technical characteristics of music: BPM (beats per minute), valence (the degree of positiveness), energy (a measure of intensity and activity), liveness (the level of confidence that a track is live, based on the detection of an audience), danceability (how suitable a track is for dancing), and acousticness (the likelihood that a song was created solely by acoustic means). I was interested in better understanding these measures and how they may relate to each other.
First, let's begin with a distribution of the songs based on all these metrics.
BPM
Most songs have a BPM of around 100-150.
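A claim like this can be checked by binning the values and computing the share that falls in the range. The BPM values and the 25-BPM bin width below are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical BPM values.
bpm_values = [92, 104, 118, 120, 126, 128, 140, 150, 174]

def bin_counts(values, width=25):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))

def share_in_range(values, low=100, high=150):
    """Fraction of values between low and high, both inclusive."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)

histogram = bin_counts(bpm_values)
share = share_in_range(bpm_values)
```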
Valence is more evenly spread out than energy.
It may be that more dancing-friendly songs are likelier to make it to the top 20 - of course, this must be compared with the danceability for all other songs too.
In all of these graphs, there does not seem to be any major shift from year to year; a larger sample is needed to check if there is any such shift.
Now, for the fun part. Are there any groups of songs that can be made out using these metrics? Is there a way to cluster these observations? I made use of k-means clustering.
Based on the graph below, the songs can be optimally divided into 3 clusters.
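For readers who want to see how a number of clusters is usually picked, here is a rough sketch of k-means with the within-cluster sum of squares computed for several values of k (the "elbow" approach). I am assuming the graph is an elbow plot of this kind. The standardization step, the single random seed, and the sample (BPM, energy, valence) rows are likewise my assumptions for illustration, not a record of the exact procedure used.

```python
import random

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        (sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
        for c, m in zip(columns, means)
    ]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia

# Hypothetical (BPM, energy, valence) rows.
songs = [
    (95, 0.40, 0.30), (98, 0.45, 0.35), (100, 0.42, 0.28),
    (125, 0.80, 0.75), (128, 0.85, 0.80), (122, 0.78, 0.70),
    (160, 0.60, 0.20), (165, 0.65, 0.25), (158, 0.62, 0.18),
]
scaled = standardize(songs)
inertia_by_k = {k: kmeans(scaled, k)[2] for k in range(1, 6)}
```

Plotting `inertia_by_k` against k and looking for the point where the curve flattens gives the suggested number of clusters. Standardizing first matters because BPM is on a much larger scale than the 0-to-1 metrics and would otherwise dominate the distances.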
Based on this analysis, it may not make sense to include all 6 variables in the clustering. Before we attempt to cluster on a subset of these variables, let us see if these clusters do allow us to make any meaningful distinctions on the data.
The clusters do not allow us to differentiate songs based on their Spotify score or their genre.
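A simple way to check this kind of statement is to cross-tabulate cluster labels against genre and compare the result with a baseline that ignores the clusters. The labels and genres below are invented, and "purity" is just one convenient summary I have picked for the sketch.

```python
from collections import Counter, defaultdict

# Hypothetical cluster labels and genres for eight songs.
cluster_labels = [0, 0, 1, 1, 2, 2, 0, 1]
genres = ["pop", "hip-hop", "pop", "hip-hop", "pop", "hip-hop", "pop", "pop"]

def crosstab(labels, categories):
    """Count of each category within each cluster."""
    table = defaultdict(Counter)
    for label, category in zip(labels, categories):
        table[label][category] += 1
    return {label: dict(counts) for label, counts in sorted(table.items())}

def purity(labels, categories):
    """Share of songs whose genre is the most common genre in their cluster."""
    table = crosstab(labels, categories)
    matched = sum(max(counts.values()) for counts in table.values())
    return matched / len(labels)

def baseline(categories):
    """Share of songs in the single most common genre, ignoring clusters."""
    return max(Counter(categories).values()) / len(categories)

table = crosstab(cluster_labels, genres)
cluster_purity = purity(cluster_labels, genres)
overall_baseline = baseline(genres)
```

When the purity is no better than the baseline, as in this toy example, knowing a song's cluster tells us nothing extra about its genre.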
Let's now repeat the clustering analysis but using only BPM, energy, and valence.
Here, we find that the optimal number of clusters is 2.
Even here, we do not see any meaningful relationships.
We can say that these six metrics, while they do tell us something about a song, do not tell us anything about its genre or the rating it received. The clusters do not differentiate songs by maximum rank or by year either.