import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
Jazz has been a major influence on various artists and even entire genres since its peak in the 1920s, known as the Jazz Age. Every music genre contains various subgenres; however, it seemed appropriate to analyze the overarching genre first and then break it down.
Our main motivation is to find which audio features make songs more or less popular according to Spotify's popularity rating.
## Read in data and limit to Jazz songs
totalData = pd.read_csv('SpotifyFeatures.csv')
jazzData = totalData[totalData['genre'] == 'Jazz']
jazzData.head()
## Find basic information on Jazz dataset
def main():
    print("Number of observations :: ", len(jazzData.index))
    print("Number of columns :: ", len(jazzData.columns))
    print("Headers :: ", jazzData.columns.values)

if __name__ == "__main__":
    main()
Given the above variety of variables, we decided to use only the numeric data. We then found that the duration of the songs caused the variance to skyrocket, so we dropped that feature and proceeded with our analysis. Additionally, given the large dataset, we will only analyze songs with popularity above the median.
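As a quick check of that claim (a sketch using only the numeric columns of the raw jazz subset), the per-column variances show the duration column dwarfing everything else:

## Sanity check (sketch): variances of the numeric jazz columns
## The duration column (in milliseconds) dominates, motivating its removal
print(jazzData.select_dtypes('number').var().sort_values(ascending=False))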
## Extracting our desired subset
features = ['popularity', 'acousticness', 'danceability', 'energy', ## Defining the desired song features
'instrumentalness', 'liveness', 'loudness', 'speechiness',
'tempo', 'valence']
jazzFeatures = jazzData[features] ## Extracting features from the dataset
## Limit data to popularity > median
median = np.median(jazzFeatures['popularity'])
jazzLimited = jazzFeatures[jazzFeatures['popularity'] > median]
## Find basic information of final Jazz dataset
def main():
    print("Number of observations :: ", len(jazzLimited.index))
    print("Number of columns :: ", len(jazzLimited.columns))
    print("Headers :: ", jazzLimited.columns.values)

if __name__ == "__main__":
    main()
## Drop popularity from dataset for pca analysis later on
pcaData = jazzLimited.drop(columns='popularity')
To produce clear, well-centered plots, we'll normalize the features, plot scatterplots, and then explore the actual matrix breakdown to see if there's any localized variance we can analyze. Additionally, we wanted to see if a heatmap could give our research some direction.
## Normalizing all the features
n = jazzLimited.shape[0]
jazzMeans = np.mean(jazzLimited, axis=0)
normJazz = (jazzLimited - jazzMeans) / np.sqrt(n)  ## Center each column, then scale by sqrt(n)
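A quick sanity check on that scaling (a sketch): dividing the centered data by sqrt(n) makes each column's sum of squares equal its population variance, which is what lets us read variances straight off this matrix later.

## Check: column-wise sum of squares of normJazz equals the population variance
print(np.allclose((normJazz**2).sum(), jazzLimited.var(ddof=0)))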
jazzCorr = normJazz.corr()
plt.suptitle("Jazz Features Correlation")
sns.heatmap(jazzCorr, cmap="YlGnBu")
Unfortunately, it doesn't seem like popularity is correlated with any specific features. Instead, we'll examine the scatterplots and try to find some trends.
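To put numbers on that, we can sort popularity's correlation row from the heatmap directly:

## Popularity's correlations with each feature, weakest to strongest
print(jazzCorr['popularity'].drop('popularity').sort_values())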
## PLOT NORMALIZED FEATURES
plt.figure(figsize=(20, 30))
plt.suptitle("Scatter Matrix of Song Features")
plt.subplots_adjust(wspace=0.3, hspace=0.3)
for i in range(len(features)-1):
    plt.subplot(3, 3, i+1)
    sns.scatterplot(x=normJazz.iloc[:, i+1], y=normJazz.iloc[:, 0])
    plt.plot([0, 0], [-0.4, 0.4], linewidth=1, color='r')
    plt.plot([normJazz.iloc[:, i+1].min(), normJazz.iloc[:, i+1].max()], [0, 0], linewidth=2, color='r')
    plt.xlabel(features[i+1])
    plt.ylabel(features[0])
While the scatterplots hint at some trends, these observations are subjective and don't provide concrete measurements. Since most of us can only visualize two or three dimensions, the most logical next step for our 10-dimensional data was dimensionality reduction. This makes the data easier to interpret and speeds up computation. Dimensionality reduction also helps us find functions of the predictors that correspond to interesting features, discard unwanted predictors, and remove contamination from measurement noise. We proceeded with Principal Component Analysis (PCA), a natural way to quantify and pinpoint the variation within our dataset. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. These components act as new axes, which reduce the dimensionality of our dataset with minimal loss of information. To interpret the components, we used loading plots of the new axes and examined which variables were most heavily weighted.
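For reference, here is a minimal sketch of PCA in its conventional sample-wise orientation (rows are songs). The cells below instead fit PCA on the transposed matrix, so each audio feature becomes a point in PC space.

from sklearn.decomposition import PCA

## Conventional PCA sketch: fit on songs-by-features (uses pcaData from above,
## with the same mean-centering and sqrt(n) scaling as normJazz)
pcaSketch = PCA()
pcaSketch.fit((pcaData - pcaData.mean()) / np.sqrt(pcaData.shape[0]))
print(np.round(pcaSketch.explained_variance_ratio_ * 100, 1))  ## % variance per PC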
from sklearn.decomposition import PCA
from sklearn import preprocessing
## PCA w/o popularity
pcaDataNorm = normJazz.drop(columns='popularity')
pca = PCA()
## Note: we fit on the transpose, so each audio feature (not each song) is
## treated as an observation; the PC scores below are per-feature points
pca.fit(pcaDataNorm.T)
pca_data = pca.transform(pcaDataNorm.T)
per_var = np.round(pca.explained_variance_ratio_*100, decimals = 1)
labels = ['PC 1', 'PC 2', 'PC 3', 'PC 4', 'PC 5', 'PC 6', 'PC 7', 'PC 8','PC 9']
plt.bar(x = range(1,len(per_var)+1), height = per_var, tick_label = labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title("Scree Plot")
plt.show()
We see here that nearly all the variation in our dataset is explained by the first two principal components. We'll look at the weights of the features within these components to better interpret our results.
pcaFeatures = ['acousticness', 'danceability', 'energy', ## Defining the desired song features
'instrumentalness', 'liveness', 'loudness', 'speechiness',
'tempo', 'valence']
pca_df = pd.DataFrame(pca_data, index = pcaFeatures, columns = labels)
plt.scatter(pca_df['PC 1'],pca_df['PC 2'])
plt.title('PCA Graph for the Jazz Genre')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df['PC 1'].loc[sample], pca_df['PC 2'].loc[sample]))
plt.show()
Here we see that tempo and loudness are our most heavily weighted features, which suggests that the variation among popular jazz songs is heavily influenced by these two. Since we utilized different functions from the ones we used in class, we thought it would be worthwhile to work through those alternatives and see whether they add any information to our findings.
## Getting matrices
u,s,vt = np.linalg.svd(normJazz, full_matrices = False)
u.shape, s, vt.shape
## Computing total variance by summing squared singular values
total_variance = np.sum(s**2)
print("total_variance: {:.3f} should approximately equal the sum of feature variances: {:.3f}"
.format(total_variance, np.sum(np.var(pcaData, axis=0))))
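As a cross-check (a sketch, not part of the original pipeline), sklearn's PCA fit sample-wise on the centered data should recover the same spectrum, up to the sample-versus-population variance scaling:

## The SVD route uses population scaling (1/n); sklearn's PCA uses the
## sample covariance (1/(n-1)), so the eigenvalues differ by n/(n-1)
pcaCheck = PCA().fit(jazzLimited - jazzMeans)
print(np.allclose(pcaCheck.explained_variance_, (s**2) * n / (n - 1)))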
## Project the data onto the first 2 principal components
jazz_2d = normJazz @ vt.T[:, 0:2]
plt.figure(figsize=(9, 6))
plt.title("PC2 vs. PC1 for Jazz Data")
sns.scatterplot(x=jazz_2d.iloc[:, 0], y=jazz_2d.iloc[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2');
## Finding how much variance is explained by our 2 principal components
two_dim_variance = (s[0]**2 + s[1]**2)/total_variance
print("Our first 2 principal components explain ",
100*round(two_dim_variance, 5),
"% of the variance in our data.")
plt.plot(np.arange(1, len(s) + 1), s**2)
plt.xlabel("Principal Component")
plt.ylabel("Variance (Component Scores)")
The results are consistent, and since we found the "defining" features for Jazz songs, we'll expand our scope and see whether other genres' tempo and loudness correlate with their popularity.
## Create df topSongs with the top 50% of songs (by popularity) from each genre
genres = totalData['genre'].unique()  ## Define all genres
frames = []
## Loop through each genre, keeping songs above that genre's median popularity
for genre in genres:
    genreSongs = totalData[totalData['genre'] == genre]
    frames.append(genreSongs[genreSongs['popularity'] > genreSongs['popularity'].median()])
topSongs = pd.concat(frames, ignore_index=True)  ## DataFrame.append was removed in pandas 2.x
## Find median data for each genre from our limited Dataset
groupedData = topSongs.groupby('genre').median(numeric_only=True)
groupedData.sort_values('popularity', ascending = False)
label = ['popularity', 'acousticness', 'danceability', 'energy', 'tempo', 'loudness']
target = groupedData[label]
## Plot
plt.figure(figsize=(15, 10))
plt.suptitle("Median Features for Each Genre")
plt.subplots_adjust(wspace=0.3, hspace=0.3)
for i in range(5):
    plt.subplot(3, 2, i+1)
    sns.regplot(x=target.iloc[:, i+1], y=target.iloc[:, 0])
    plt.xlabel(label[i+1])
    plt.ylabel(label[0])
Our regression plots across all genres are somewhat consistent with our PCA findings; however, they seem heavily skewed by outliers. This leads us to believe that even though we can find some correlation between popularity and song features, there is no definitive sound that makes a song popular. If there were, we would see no variation at all and all songs would have the same audio-feature metrics.
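Since outliers are the concern, a rank-based correlation (a sketch using Spearman rather than Pearson) gives a less outlier-sensitive read of the same genre-median table:

## Spearman rank correlation of each median feature with median popularity
print(target.corr(method='spearman')['popularity'].drop('popularity').sort_values())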
Finally, we'll examine specific artists within the jazz genre to see if the subgenre trend becomes more apparent.
## Earth, Wind & Fire, Miles Davis, Louis Armstrong
artists = ['Earth, Wind & Fire', 'Miles Davis', 'Louis Armstrong']
ewf = totalData[totalData['artist_name'] == artists[0]][features]         # Earth, Wind & Fire
milesDavis = totalData[totalData['artist_name'] == artists[1]][features]  # Miles Davis
armstrong = totalData[totalData['artist_name'] == artists[2]][features]   # Louis Armstrong
## Normalizing all the features
n = ewf.shape[0]
ewfMeans = np.mean(ewf, axis=0)
ewfNorm = (ewf - ewfMeans) / np.sqrt(n)
## Earth, Wind and Fire songs
ewfCorr = ewfNorm.corr()
plt.suptitle("Earth, Wind and Fire Correlation")
sns.heatmap(ewfCorr, cmap="YlGnBu")
We see here that there are much stronger correlations for specific artists. This is Simpson's paradox in action: the lack of trends and correlations at the genre level is caused by overly broad groupings. Genres are not specific enough in categorizing songs for a clear trend to emerge in our data.
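To make that point concrete (an illustrative sketch using objects already defined above), compare the genre-wide correlation of tempo and loudness with popularity against the same correlation restricted to Earth, Wind & Fire:

## Genre-wide vs. single-artist correlation with popularity
for feat in ['tempo', 'loudness']:
    print(feat,
          "| genre-wide:", round(jazzFeatures[feat].corr(jazzFeatures['popularity']), 3),
          "| Earth, Wind & Fire:", round(ewf[feat].corr(ewf['popularity']), 3))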
sns.pairplot(ewfNorm)
## Getting matrices
u,s,vt = np.linalg.svd(ewfNorm, full_matrices = False)
plt.plot(np.arange(1, len(s) + 1), s**2)
plt.xlabel("Principal Component")
plt.ylabel("Variance (Component Scores)")
## Finding how much variance is explained by our 2 principal components
total_variance = np.sum(s**2)
two_dim_variance = (s[0]**2 + s[1]**2)/total_variance
print("Our first 2 principal components explain ",
100*round(two_dim_variance, 5),
"% of the variance in our data.")
print(s)
## Normalizing all the features
n = milesDavis.shape[0]
davisMeans = np.mean(milesDavis, axis=0)
davisNorm = (milesDavis - davisMeans) / np.sqrt(n)
## Miles Davis Songs
davisCorr = davisNorm.corr()
plt.suptitle("Miles Davis Correlation", )
sns.heatmap(davisCorr, cmap="YlGnBu")
sns.pairplot(davisNorm)
## Getting matrices
u,s,vt = np.linalg.svd(davisNorm, full_matrices = False)
plt.plot(np.arange(1, len(s) + 1), s**2)
plt.xlabel("Principal Component")
plt.ylabel("Variance (Component Scores)")
## Finding how much variance is explained by our 2 principal components
total_variance = np.sum(s**2)
two_dim_variance = (s[0]**2 + s[1]**2)/total_variance
print("Our first 2 principal components explain ",
100*round(two_dim_variance, 5),
"% of the variance in our data.")
Tempo and loudness seem to be the most important features in determining song popularity; however, there is no definitive audio makeup that would make a song popular. Additionally, we saw Simpson's paradox in action and concluded that the variability of song features within genres is extremely high, and that subgenres or artists are a much more effective way of classifying or clustering music data.