
Kaggle EDA: Hey! Spotify!


https://www.kaggle.com/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres/data

 

550K Spotify Songs: Audio, Lyrics & Genres

Enhanced Music Dataset with Audio Features, Lyrics, Genres & Artist Metadata


For the record, I do have a Spotify account. Why? Because the 포슬립 soundtrack is on there...


import kagglehub

# Download latest version
path = kagglehub.dataset_download("serkantysz/550k-spotify-songs-audio-lyrics-and-genres")

print("Path to dataset files:", path)

If you install kagglehub with pip, you can pull the dataset straight from code without downloading it by hand.

 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Default plot theme
sns.set_theme(palette="icefire", style="darkgrid", font_scale=1)
sns.color_palette("icefire", as_cmap=True)

# Basic matplotlib settings for drawing plots (the font has to be installed locally)
plt.rcParams['font.family'] = '그리운 겔리롤'
# plt.rcParams['font.family'] = 'AppleGothic'
plt.rcParams['figure.figsize'] = 12, 9
plt.rcParams['font.size'] = 16
plt.rcParams['axes.unicode_minus'] = False

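The download takes longer than you'd expect, so do the setup while it runs. When it finishes it prints a path, and the CSV files are inside that directory, so just open them from there. The Kaggle page also has EDA notebooks other people have posted, which you can use for reference.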

 

# Turns out there are two files...
artist_df = pd.read_csv('/home/koreanraichu/.cache/kagglehub/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres/versions/1/artists.csv') 
song_df = pd.read_csv('/home/koreanraichu/.cache/kagglehub/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres/versions/1/songs.csv')

Can you just copy and paste this? Sure, if you happen to be on Ubuntu too. The path is specific to my machine.
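A more portable option is to build the file paths from the directory that kagglehub.dataset_download returned instead of hard-coding a home directory. This is a minimal sketch, assuming the CSVs sit directly inside that directory as artists.csv and songs.csv (as in the paths above); load_tables is a hypothetical helper name, not part of kagglehub.

```python
from pathlib import Path

import pandas as pd


def load_tables(dataset_dir):
    """Read artists.csv and songs.csv from the given dataset directory."""
    base = Path(dataset_dir)
    artist_df = pd.read_csv(base / "artists.csv")
    song_df = pd.read_csv(base / "songs.csv")
    return artist_df, song_df


# Usage, assuming `path` is the value returned by kagglehub.dataset_download(...):
# artist_df, song_df = load_tables(path)
```

This way the same cell should work regardless of the OS or user name, as long as the download step ran first.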


Checking the data

I'm going to look at both files.

 

.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71440 entries, 0 to 71439
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          71440 non-null  object
 1   name        71438 non-null  object
 2   followers   71440 non-null  int64 
 3   popularity  71440 non-null  int64 
 4   genres      71440 non-null  object
 5   main_genre  71440 non-null  object
dtypes: int64(2), object(4)
memory usage: 3.3+ MB

Oh come on, there are missing values…

 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550622 entries, 0 to 550621
Data columns (total 24 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id                      550622 non-null  object 
 1   name                    550619 non-null  object 
 2   album_name              550602 non-null  object 
 3   artists                 550622 non-null  object 
 4   danceability            550622 non-null  float64
 5   energy                  550622 non-null  float64
 6   key                     550622 non-null  int64  
 7   loudness                550622 non-null  float64
 8   mode                    550622 non-null  int64  
 9   speechiness             550622 non-null  float64
 10  acousticness            550622 non-null  float64
 11  instrumentalness        550622 non-null  float64
 12  liveness                550622 non-null  float64
 13  valence                 550622 non-null  float64
 14  tempo                   550622 non-null  float64
 15  duration_ms             550622 non-null  int64  
 16  lyrics                  550622 non-null  object 
 17  year                    550622 non-null  int64  
 18  genre                   550622 non-null  object 
 19  popularity              550622 non-null  int64  
 20  total_artist_followers  550622 non-null  int64  
 21  avg_artist_popularity   550622 non-null  float64
 22  artist_ids              550622 non-null  object 
 23  niche_genres            550622 non-null  object 
dtypes: float64(10), int64(6), object(8)
memory usage: 100.8+ MB

No wonder it took forever to open…

 

.describe()

          followers    popularity
count  7.144000e+04  71440.000000
mean   2.330490e+05     28.773138
std    2.204255e+06     18.006693
min    0.000000e+00      0.000000
25%    9.127500e+02     15.000000
50%    7.865000e+03     28.000000
75%    4.510750e+04     41.000000
max    1.685087e+08    100.000000
	id	name	genres	main_genre
count	71440	71438	71440	71440
unique	71440	69643	16134	10
top	7yhRUp1m94EmlkPRw7VoVQ	Chris Martin	[]	Electronic
freq	1	6	19580	20032

So is that id a primary key? Its unique count equals its row count, so it looks like one.
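One way to check: a primary key has to be unique and non-null. The describe() output above already hints at this, since unique equals count for id in both tables. Here is a small sketch of the check; looks_like_primary_key is a hypothetical helper, and the toy frame only stands in for artist_df.

```python
import pandas as pd


def looks_like_primary_key(df, column):
    """Return True if the column has no missing values and no duplicates."""
    col = df[column]
    return bool(col.notna().all() and col.is_unique)


# Toy stand-in for artist_df: ids are unique, names are not.
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "name": ["Chris Martin", "Chris Martin", None],
})
id_is_key = looks_like_primary_key(toy, "id")
name_is_key = looks_like_primary_key(toy, "name")
```

On the real data the call would be looks_like_primary_key(artist_df, "id") and the same for song_df.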

 

danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	year	popularity	total_artist_followers	avg_artist_popularity
count	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	550622.000000	5.506220e+05	550622.000000	550622.000000	5.506220e+05	550622.000000
mean	0.527173	0.671389	5.275487	-7.877679	0.667360	0.085574	0.243174	0.107573	0.224043	0.464568	122.815094	2.374116e+05	2007.144742	17.575954	2.533411e+06	48.076930
std	0.172603	0.245591	3.558157	3.858865	0.471159	0.093337	0.307418	0.239932	0.196800	0.249094	29.453634	9.555273e+04	13.575992	17.457960	9.227678e+06	19.154401
min	0.045500	0.000020	0.000000	-44.868000	0.000000	0.021900	0.000000	0.000000	0.006730	0.000000	30.946000	1.502700e+04	1900.000000	0.000000	0.000000e+00	0.000000
25%	0.408000	0.496000	2.000000	-9.907000	0.000000	0.035000	0.003650	0.000000	0.099600	0.259000	99.978000	1.840400e+05	2002.000000	0.000000	5.510225e+04	35.000000
50%	0.530000	0.716000	5.000000	-7.038000	1.000000	0.049700	0.075500	0.000185	0.141000	0.447000	121.729000	2.230000e+05	2010.000000	14.000000	2.693830e+05	49.000000
75%	0.651000	0.886000	8.000000	-5.095000	1.000000	0.092200	0.432000	0.032200	0.294000	0.660000	140.938000	2.707830e+05	2017.000000	30.000000	1.354562e+06	62.000000
max	0.988000	1.000000	11.000000	0.000000	1.000000	0.966000	0.996000	0.998000	1.000000	0.998000	245.941000	4.995315e+06	2025.000000	98.000000	2.951819e+08	100.000000
id	name	album_name	artists	lyrics	genre	artist_ids	niche_genres
count	550622	550619	550602	550622	550622	550622	550622	550622
unique	550622	351146	140154	95121	478160	10	95661	23638
top	036JzAN5DCANSZyeW6MjqG	Home	Greatest Hits	["Grateful Dead"]	Who's gonna tell you when\n It's too late?\n ♪...	Rock	["4TMHGUX5WI7OOm53PqSDAT"]	[]
freq	1	221	1958	1500	56	197168	1500	6099

 

 

.isna().sum()

id            0
name          2
followers     0
popularity    0
genres        0
main_genre    0
dtype: int64
id                         0
name                       3
album_name                20
artists                    0
danceability               0
energy                     0
key                        0
loudness                   0
mode                       0
speechiness                0
acousticness               0
instrumentalness           0
liveness                   0
valence                    0
tempo                      0
duration_ms                0
lyrics                     0
year                       0
genre                      0
popularity                 0
total_artist_followers     0
avg_artist_popularity      0
artist_ids                 0
niche_genres               0
dtype: int64

Can't I just fill these in with "unknown"? I'll take a proper look later anyway.

 

.head()

id	name	followers	popularity	genres	main_genre
0	6YROFUbu5zRCHi2xkir5pk	Brian Hyland	67223	47	[]	Pop
1	5tFRohaO5yEsuJxmMnlCO9	Barns Courtney	602647	62	[]	Electronic
2	3w1Q754jb31h5CXQCcnLNL	Capcom Sound Team	210392	58	['japanese vgm', 'soundtrack']	Electronic
3	3oDbviiivRWhXwIE8hxkVV	The Beach Boys	5139194	76	['baroque pop']	Classical
4	60zvRmhQHRxokEB1taAVpN	Beth Malone	1569	29	['musicals']	Classical
id	name	album_name	artists	danceability	energy	key	loudness	mode	speechiness	...	tempo	duration_ms	lyrics	year	genre	popularity	total_artist_followers	avg_artist_popularity	artist_ids	niche_genres
0	0Prct5TDjAnEgIqbxcldY9	!	UNDEN!ABLE	["HELLYEAH"]	0.415	0.605	7	-11.157	1	0.0575	...	100.059	79500	He said he came from Jamaica,\nhe owned a coup...	2016	Rock	0	769490	52.0	["4hxDvVq5t8ebPYPdBl1F9f"]	["groove metal", "metal"]
1	2ASl4wirkeYm3OWZxXKYuq	!!	Childhood Dreams	["Yxngxr1"]	0.788	0.648	7	-9.135	0	0.3150	...	79.998	114000	Fuck the bitch, now she running with my kids\n...	2019	Hip-Hop	29	143628	45.0	["2jwRHcdgkRhelYEMqndDKe"]	[]
2	5tA3ImW310llKo8EMBj2Ga	!!Noble Stabbings!!	Situationist Comedy	["Dillinger Four"]	0.171	0.957	2	-5.749	1	0.1490	...	175.317	197400	You like to stand on the other side\nPoint and...	2002	Rock	0	36619	35.0	["4YAN46l70QV0PGXlMg0iHi"]	["melodic hardcore", "pop punk", "punk", "skat...
3	0fROT4kK5oTm8xO8PX6EJF	!I'll Be Back!	!I'll Be Back!	["Ril\u00e8s"]	0.823	0.612	1	-7.767	1	0.2480	...	142.959	178533	It's been a while, shit, I missed the rehab, p...	2018	Hip-Hop	43	929303	63.0	["6pdcQa7by8IKuoVXvgknlI"]	["french rap"]
4	1xBFhv5faebv3mmwxx7DnS	!Lost!	!Lost!	["Ril\u00e8s"]	0.729	0.552	7	-8.562	0	0.0650	...	86.103	186197	I would like to give you all my time\nI would ...	2018	Hip-Hop	0	929303	63.0	["6pdcQa7by8IKuoVXvgknlI"]	["french rap"]

But Capcom… what on earth did it get filed under there…?

 

.columns

Index(['id', 'name', 'followers', 'popularity', 'genres', 'main_genre'], dtype='object')
Index(['id', 'name', 'album_name', 'artists', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'lyrics', 'year',
       'genre', 'popularity', 'total_artist_followers',
       'avg_artist_popularity', 'artist_ids', 'niche_genres'],
      dtype='object')

 

Preprocessing

Checking missing values

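na_artist = artist_df.query('name.isna()').index
# The ids can't be looked up anywhere, so I don't know why they're even there.
# Fill only the name column; assigning to artist_df.loc[idx] would overwrite the whole row.
artist_df.loc[na_artist, 'name'] = 'unknown'

artist_df.isna().sum()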

I don't know why the id is there, since you can't search by it.

 

na1_song = song_df.query('name.isna()').index
# What am I supposed to do when the song title is unknown...

na2_song = song_df.query('album_name.isna()').index
# And what do I do when the album is unknown...

# Fill the missing titles and album names with placeholders.
song_df.loc[na1_song, 'name'] = 'untitled'
song_df.loc[na2_song, 'album_name'] = 'various'

song_df.isna().sum()

These get patched over the same way.
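The same patch can also be written without collecting indexes first. This is a sketch using DataFrame.fillna with a per-column dict, on a made-up frame standing in for song_df; the placeholder strings are the ones used above.

```python
import pandas as pd

# Made-up rows standing in for song_df.
toy_songs = pd.DataFrame({
    "name": ["Home", None, "!Lost!"],
    "album_name": ["Greatest Hits", "Some Album", None],
    "popularity": [10, 0, 43],
})

# Only the listed columns are filled; numeric columns are left untouched.
toy_songs = toy_songs.fillna({"name": "untitled", "album_name": "various"})
remaining = int(toy_songs.isna().sum().sum())
```

Passing a dict keeps each column's placeholder explicit, which helps when different columns need different fill values.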

 

Binning

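# Bin the follower counts
# Over 100 million? Who is that, Bruno Mars? Yikes
artist_df['followers'].min(), artist_df['followers'].max()

# 11 evenly spaced edges from 0 to the maximum give 10 equal-width bins
follower_label = np.linspace(0, 168508682, 11)
labels = [f"Group {i}" for i in range(1, 11)]

# include_lowest=True keeps artists with 0 followers in Group 1 instead of NaN
artist_df['follower_group'] = pd.cut(artist_df['followers'], bins = follower_label, labels = labels, include_lowest = True)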

For now I just sliced the artists up with NumPy's help… I'm doing this on Linux, autocomplete isn't working, and Chrome and VS Code keep crashing on me in real time..
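One thing worth checking with equal-width bins: the follower counts are heavily skewed (the median is 7,865 while the maximum is about 168.5 million), so bins roughly 16.85 million wide will likely put almost every artist into Group 1. The sketch below uses made-up numbers to compare equal-width bins with quantile bins from pd.qcut. It is an alternative to consider, not what this post does.

```python
import numpy as np
import pandas as pd

# Made-up, heavily skewed follower counts (not taken from the dataset).
followers = pd.Series([0, 120, 900, 7_800, 45_000, 300_000, 5_000_000, 168_000_000])
labels = [f"Group {i}" for i in range(1, 5)]

# Equal-width bins: one huge value stretches the range.
edges = np.linspace(0, followers.max(), 5)
equal_width = pd.cut(followers, bins=edges, labels=labels, include_lowest=True)

# Quantile bins: each group gets roughly the same number of rows.
quantile = pd.qcut(followers, q=4, labels=labels)

equal_counts = equal_width.value_counts().sort_index()
quantile_counts = quantile.value_counts().sort_index()
```

Here the equal-width version puts seven of the eight values in Group 1, while the quantile version splits them two per group. Which one is appropriate depends on whether the groups should mean "absolute size" or "rank among artists".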

 

popularity_bins = list(range(0, 110, 10))
popularity_label = list(range(0, 100, 10))

# Passing include_lowest=True keeps 0 inside the first bin.
artist_df['popularity_group'] = pd.cut(artist_df['popularity'], bins = popularity_bins, labels = popularity_label, include_lowest=True)

What happens if you leave out include_lowest? The 0s become NaN.
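Here is a tiny demonstration of that behaviour on made-up popularity values. With the default right-closed bins the first interval is (0, 10], so an exact 0 falls outside it unless include_lowest=True is passed.

```python
import pandas as pd

popularity = pd.Series([0, 10, 55, 100])  # made-up values
popularity_bins = list(range(0, 110, 10))
popularity_label = list(range(0, 100, 10))

# Default: the lowest edge is excluded, so 0 is not assigned to any bin.
without_lowest = pd.cut(popularity, bins=popularity_bins, labels=popularity_label)

# With include_lowest=True the first bin also contains its left edge.
with_lowest = pd.cut(popularity, bins=popularity_bins, labels=popularity_label,
                     include_lowest=True)

print(without_lowest.isna().tolist())
print(with_lowest.tolist())
```

Note that because the bins are closed on the right, a popularity of exactly 10 ends up under label 0 and 100 under label 90.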

 

song_df['year'].min(), song_df['year'].max()

# Bin the songs by release year
song_era = list(range(1900, 2040, 10)) # Decade edges in steps of 10, starting from 1900.
song_era_label = [f"{i}s" for i in song_era[:-1]]

song_df['Era'] = pd.cut(song_df['year'], bins = song_era, labels = song_era_label, include_lowest = True, right = False)

The songs... there's a lot going on in there, but let's start with the release year...
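Because right=False makes each interval closed on the left, a year like 2010 lands in "2010s" rather than "2000s". Below is a quick check on made-up years. The floor-division cross-check is an extra sanity test added for illustration, not part of the workflow above.

```python
import pandas as pd

years = pd.Series([1900, 1999, 2010, 2025])  # made-up sample years
song_era = list(range(1900, 2040, 10))
song_era_label = [f"{i}s" for i in song_era[:-1]]

# Left-closed decade bins: [1900, 1910), [1910, 1920), ...
era = pd.cut(years, bins=song_era, labels=song_era_label, right=False)

# Cross-check: floor each year to its decade and format it the same way.
era_by_arithmetic = (years // 10 * 10).astype(str) + "s"
matches = (era.astype(str) == era_by_arithmetic).all()
```

If matches ever comes out False on the real column, the bin edges and labels are probably misaligned.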

 

popularity_bins = list(range(0, 110, 10))
popularity_label = list(range(0, 100, 10))

# Passing include_lowest=True keeps 0 inside the first bin.
song_df['popularity_group'] = pd.cut(song_df['popularity'], bins = popularity_bins, labels = popularity_label, include_lowest=True)

follower_label = np.linspace(0, 295181876, 11)
labels = [f"Group {i}" for i in range(1, 11)]

# Same here: total_artist_followers has a minimum of 0.
song_df['follower_group'] = pd.cut(song_df['total_artist_followers'], bins = follower_label, labels = labels, include_lowest = True)

The songs have followers and popularity too, so those get binned as well...

 

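I do need to get into the actual analysis, but there's one problem. I'm writing this on Linux, and at the slightest breath VS Code crashes, Chrome crashes, and Firefox crashes. On top of that, Tistory isn't autosaving, so when the browser dies everything is gone. So I'll post what I have up to here and put up the rest tomorrow…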
