Can We Predict If a Song Will Become a Hit?
- HUYEN TRAN
- May 2, 2023
- 5 min read

The music industry is highly unpredictable, and there is no guarantee that every song from an established singer will be a hit. Therefore, predicting a song's popularity is crucial for decision-makers in the industry. By leveraging data analysis and market research, music investors can make informed decisions about which artists and songs to invest in, ultimately maximizing their return on investment.
This project was a joint effort between me and five other classmates, Anusha Muniraju, Chieh-Hsin Wu, Shiva Prasad Reddy, Sri Harsha Somayajula, and Surabhi Suresh, to answer the question: "Is there a model for a hit song?"
Where did we find the data?
We used the Spotify API to extract the song features that feed into our project. With the Spotify API, we have access to a vast amount of music data from Spotify, making it an excellent tool for building various systems. The total dataset contains 34,080 observations, spanning the period from 1956 to 2022.
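As a minimal sketch of what pulling these features looks like, here is one common way to call the API using the spotipy client. The credentials and the track ID below are placeholders, not values from our project:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticate with placeholder app credentials (hypothetical values).
sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
    )
)

# Fetch audio features for a list of track IDs (example ID shown).
track_ids = ["3n3Ppam7vgaVa1iaRUc9Lp"]
features = sp.audio_features(track_ids)  # one dict per track
print(features[0]["danceability"], features[0]["loudness"])
```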
The dataset contains 23 variables: 14 numerical and 9 categorical. Here is the list of variables:
artist_name (object): Name of the artist.
track_id (object): Spotify unique track id.
track_name (object): Name of the track/song.
1 - acousticness (float): A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
2 - danceability (float): Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
3 - duration_ms (int): The duration of the track in milliseconds.
4 - energy (float): Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
5 - instrumentalness (float): Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
6 - key (int): The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
7 - liveness (float): Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
8 - loudness (float): The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
9 - mode (int): Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
10 - speechiness (float): Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
11 - tempo (float): The overall estimated tempo of a track in beats per minute (BPM).
12 - time_signature (int): An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
13 - valence (float): A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
14 - popularity (int): The popularity of the track, a value between 0 and 100, with 100 being the most popular. Popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.
Exploratory Data Analysis and Feature Engineering
We started by cleaning the data. We first checked for "0" values and "NA" values. "0" values appear in key, mode, and instrumentalness, but "0" is meaningful in those fields, so we decided to keep all of them. Next, we dropped the rows with "NA" values.
The date format was inconsistent, so we reformatted the dates to year only.
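A minimal sketch of these cleaning steps in pandas, assuming a hypothetical file name and a release_date column whose values lead with the year:

```python
import pandas as pd

df = pd.read_csv("spotify_tracks.csv")  # hypothetical file name

# "0" is meaningful in key, mode, and instrumentalness, so those rows stay;
# only rows with missing values are dropped.
df = df.dropna()

# Dates arrive in mixed formats; keep just the leading four-digit year
# (assumes the year comes first, e.g. "1956" or "1956-03-01").
df["year"] = df["release_date"].astype(str).str[:4].astype(int)
```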


In addition, we checked the distribution of the popularity column, which is our target variable. The variable is right-skewed. We then encoded the popularity column, which originally contains values from 0 to 100: values below 60 become 0, and values of 60 and above become 1.
Popularity >= 60 → 1
Popularity < 60 → 0
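In pandas, this binarization is a one-liner:

```python
# Binarize the target: 1 if popularity >= 60, else 0.
df["popularity_encoded"] = (df["popularity"] >= 60).astype(int)
print(df["popularity_encoded"].value_counts())  # 22401 zeros, 11679 ones
```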
The resulting dataset has a column popularity_encoded with 22,401 rows having value 0 and 11,679 rows having value 1, which shows that our data is imbalanced. Moreover, we checked the correlation between the variables, and it turns out that loudness and energy are highly correlated. As a result, we will not consider "energy" in building our model.
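A sketch of how one might reproduce this check and the resulting drop:

```python
# Pairwise correlations among the numeric columns.
corr = df.select_dtypes("number").corr()
print(corr.loc["loudness", "energy"])  # high correlation in our data

# Drop energy to avoid carrying two highly correlated features.
df = df.drop(columns=["energy"])
```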


Lastly, we divided our data into six genres: rock, rap, R&B, pop, Latin, and EDM.
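Assuming a genre column with these labels (the exact strings are hypothetical), the split is straightforward:

```python
genres = ["rock", "rap", "r&b", "pop", "latin", "edm"]
# One dataframe per genre, keyed by genre name.
genre_dfs = {g: df[df["genre"] == g] for g in genres}
```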
Modelling
Our target variable is binary, so we chose logistic regression and an XGBoost classifier. Overall, the XGBoost model achieved higher accuracy in all genres.
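Here is a minimal training sketch for one genre, using scikit-learn and xgboost. The feature list, column names, and the scale_pos_weight class-weighting choice are assumptions for illustration (the artist feature, which would need its own encoding, is omitted):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

feature_cols = ["acousticness", "danceability", "duration_ms",
                "instrumentalness", "key", "liveness", "loudness",
                "mode", "speechiness", "tempo", "time_signature", "valence"]

data = genre_dfs["pop"]  # repeat per genre
X, y = data[feature_cols], data["popularity_encoded"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Upweight the minority (popular) class: roughly #negatives / #positives.
xgb = XGBClassifier(
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="logloss",
).fit(X_train, y_train)

for name, model in [("logistic regression", logit), ("xgboost", xgb)]:
    print(name, accuracy_score(y_test, model.predict(X_test)))
```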


Logistic regression performs poorly at classifying the minority class in an imbalanced dataset:
It tends to optimize for overall accuracy and may not capture the underlying distribution of the data.
It misclassifies minority samples and has a higher false negative rate, which is detrimental in applications where correctly identifying the minority class is critical.

XGBoost is good at classifying the minority class in an imbalanced dataset:
It can effectively learn from the minority class by giving those samples more weight during training.
It can handle non-linear relationships and complex interactions between features.
XGBoost has built-in regularization to prevent overfitting, which is especially important in imbalanced datasets with sparse samples.
Here’s what we found
We ran feature importance analysis on the XGBoost model and found that the important features contributing to a hit song are artist, speechiness, duration, danceability, loudness, and liveness.
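One way to inspect these importances from the fitted model above (again a sketch; the artist feature would only appear here if it were encoded into feature_cols):

```python
import pandas as pd

# Rank features by the fitted XGBoost model's importance scores.
importances = pd.Series(xgb.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(6))
```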
The familiarity of the artist correlates with song popularity: having a popular artist increases the chance that the song will be a hit.
In R&B especially, song makers should consider higher loudness, because the cluster of popular R&B songs has higher loudness compared to less popular songs.
To improve our model's accuracy, we would need to analyze more features, such as region-based analysis and types of marketing campaigns.