
Learning R - 6.1. Manipulating data - General


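I looked through the Cookbook and there are a ton of series, each with a serious amount of material... Sounds like a joke, but it's true.
So I'm going to split this into smaller posts... though I still need to think about how to cover loading a CSV into a data frame. The volume really is brutal. Not kidding.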

By the way, tidying up data frames needs one library, called plyr.
There are ways to do it without plyr, but you'll want it later for processing things in bulk.

install.packages("plyr")

Go install it.


sort()

Vectors are sorted with sort(). And yes, if vectors show up here, data frames are coming too. (spoiler)

 

> v=sample(1:25)
> v
 [1] 11  2 12 18 23 21 22 14  3 19 13  9  1 16  5 20  6 10 25  8  4 15 24  7 17

Wouldn't it be nice to tidy up this data that makes you call on Jesus, Buddha, and Allah all at once?

> sort(v)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> sort(v,decreasing=TRUE)
 [1] 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

Solved. decreasing=TRUE plays the same role as the ascending argument in Python pandas: TRUE here means descending order, which corresponds to ascending=False in pandas.
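One thing the run above doesn't show is what happens with missing values. A small sketch with a made-up vector (the values are only for illustration): sort() drops NA unless na.last is given, and order() returns positions rather than values, which is what the data frame sorting further down relies on.

```r
v <- c(3, NA, 1, 2)   # made-up vector with one missing value

dropped <- sort(v)                   # NA is removed by default
kept    <- sort(v, na.last = TRUE)   # NA is kept and placed at the end
pos     <- order(v)                  # positions that would sort v (NA goes last)

print(dropped)
print(kept)
print(v[pos])   # indexing with the positions gives the same result as `kept`
```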

 

Sorting with the plyr library

> library(plyr)
> arrange(data,Molecular.Weight)

Sorted by molecular weight. By the way, I'm running this in a terminal and the output is so long that I can't even copy it, let alone take a screenshot.

> arrange(data,Molecular.Weight,HBA)

This one sorts by molecular weight and then by hydrogen bond acceptors (HBA).

 

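> arrange(data,desc(Type))
> arrange(data,Smiles,desc(HBA))

For reverse order, a numeric column can simply be prefixed with - (e.g. -Molecular.Weight). The minus sign does not work on factor or character columns such as Type, so wrap those in plyr's desc() instead. Note that HBA is stored as a factor in this data set, so it gets sorted by level order rather than by numeric value.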

 

Real ones don't install libraries (base R)

> data[order(data$Name),]

Single key (sort by name)

> data[order(data$Type,data$Molecular.Weight),]

Two keys (type, then molecular weight)

> data[do.call(order,as.list(data)),]

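This one sorts too: as.list(data) hands every column to order(), so rows are sorted by the first column, ties are broken by the second, and so on. (My data is too big for me to check the result by eye.)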
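To see what the do.call() version does on something small enough to read, here is a sketch with a hypothetical two-column data frame (not the ChEMBL data). It should give the same result as listing the columns by hand.

```r
# Toy data frame with made-up values
df <- data.frame(
  type   = c("b", "a", "b", "a"),
  weight = c(2, 9, 1, 3),
  stringsAsFactors = FALSE
)

# as.list(df) is a list of the columns; do.call() passes them to order()
# as separate arguments, i.e. order(df$type, df$weight)
sorted_all    <- df[do.call(order, as.list(df)), ]
sorted_manual <- df[order(df$type, df$weight), ]

print(sorted_all)
```

One caveat I'd assume holds: a column named like one of order()'s own arguments (decreasing, na.last, method) would be picked up as that argument, so this idiom is safest on data frames without such names.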

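> data[order(-data$Molecular.Weight),]
> data[order(data$Smiles,-xtfrm(data$Type)),]

Here too, a leading - reverses the order. It only works directly on numeric columns; for a factor such as Type, wrap the column in xtfrm() first.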
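The minus trick is easiest to check on a tiny example. The sketch below uses a made-up data frame with a factor column and a numeric column: xtfrm() turns the factor into a numeric sort key that can be negated, and decreasing = TRUE is an alternative when every key should be reversed.

```r
# Toy data frame with made-up values
df <- data.frame(
  type   = factor(c("Protein", "Antibody", "Protein", "Unknown")),
  weight = c(10, 30, 20, 40)
)

# Unary minus is not defined for factors, so negate the numeric key from xtfrm()
desc_type <- df[order(-xtfrm(df$type), df$weight), ]

# For a single numeric key, either -weight or decreasing = TRUE works
desc_weight <- df[order(df$weight, decreasing = TRUE), ]

print(desc_type)
print(desc_weight)
```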

 

sample()

The function that made a brief appearance above.

 

> v=11:20
> v
 [1] 11 12 13 14 15 16 17 18 19 20
> v=sample(v)
> v
 [1] 13 11 14 18 16 20 19 17 12 15

But why tidy everything up and then make a mess of it again... building a machine learning training set, maybe?

> data=data[sample(1:nrow(data)),]

For a data frame, use this: it shuffles the rows.
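If the shuffle needs to be repeatable (for a train/test split, say), fixing the seed first is the usual approach. A minimal sketch on a made-up data frame; the seed value 42 is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the shuffle can be repeated

df <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))  # made-up data

# seq_len(nrow(df)) behaves like 1:nrow(df), but returns an empty vector
# for a zero-row data frame instead of c(1, 0)
shuffled <- df[sample(seq_len(nrow(df))), ]

# Shuffling keeps every row exactly once; sorting by id restores the order
restored <- shuffled[order(shuffled$id), ]

print(shuffled)
```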

 

as.character(), as.numeric(), as.factor()

These convert to character, numeric, and factor, respectively.

> v
 [1] 13 11 14 18 16 20 19 17 12 15
> w=as.character(v)
> x=factor(v)

After converting like this, check the structure with str().

> str(v)
 int [1:10] 13 11 14 18 16 20 19 17 12 15
> str(w)
 chr [1:10] "13" "11" "14" "18" "16" "20" "19" "17" "12" "15"
> str(x)
 Factor w/ 10 levels "11","12","13",..: 3 1 4 8 6 10 9 7 2 5

(nods)

Character and factor convert back and forth directly, but going from a factor to a number is not direct: you have to pass through character first.

> x
 [1] 13 11 14 18 16 20 19 17 12 15
Levels: 11 12 13 14 15 16 17 18 19 20
> as.numeric(x)
 [1]  3  1  4  8  6 10  9  7  2  5
# hey, give me my original values back

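Hand a factor straight to as.numeric() and this is what you get: the internal level codes instead of the values. Hm? as numeric looks familiar? Excuse me, have you been using pandas?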

 

> z=unclass(x)
> z
 [1]  3  1  4  8  6 10  9  7  2  5
attr(,"levels")
 [1] "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"

For reference, unclass() also gives you numbers, but just like as.numeric() they are the integer level codes, not the original values.
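Putting the three conversions side by side on a short made-up factor makes the difference easier to see. The last line is a commonly cited alternative that converts each level only once; treat it as a sketch, not the only way.

```r
x <- factor(c(13, 11, 14, 18))   # levels are "11", "13", "14", "18"

codes   <- as.numeric(x)                 # level codes: position of each value in levels(x)
values  <- as.numeric(as.character(x))   # original numbers, recovered via the labels
values2 <- as.numeric(levels(x))[x]      # same result, indexing the converted levels by code

print(codes)
print(values)
```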

 

duplicated() and unique()

Sooner or later you run into duplicate values. Who has time to sit there counting them...

 

> a
 [1]   4  12   2  26  13   2  15  17  16   7  25  14   4 -12  21  10  10  19  18
[20]  16

But does this even have duplicates? It'd be kind of funny if it did.

> duplicated(a)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[13]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

Why are there any?

> a[duplicated(a)]
[1]  2  4 10 16
> unique(a[duplicated(a)])
[1]  2  4 10 16
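# unique() collapses repeats, like a set in Python

The two look identical here, but a[duplicated(a)] lists a value once for every extra copy, whereas unique() returns each value a single time even if it appears three or four times.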
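A value that appears three times shows the difference. The sketch below uses a made-up vector; it also shows one way to flag every copy of a duplicated value (including the first one) by combining duplicated() with its fromLast argument.

```r
a <- c(4, 12, 2, 2, 4, 2)   # made-up vector: 2 appears three times, 4 twice

later_copies <- duplicated(a)   # the first occurrence is never flagged
all_copies   <- duplicated(a) | duplicated(a, fromLast = TRUE)

extra  <- a[later_copies]           # one entry per extra copy
dupset <- unique(a[later_copies])   # each duplicated value once
every  <- a[all_copies]             # every element whose value occurs more than once

print(extra)
print(dupset)
print(every)
```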

> unique(a)
 [1]   4  12   2  26  13  15  17  16   7  25  14 -12  21  10  19  18
> a[!duplicated(a)]
 [1]   4  12   2  26  13  15  17  16   7  25  14 -12  21  10  19  18

Duplicates? They can be dropped.

This was a plain vector, but the same thing works on data frames.

> duplicated(data)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE

Hmm, but this is a bit... that data frame is seriously huge...

> duplicated(data$Type)
 [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

So I picked out just the molecule type column instead.

 

Where did all the unique values go......

> data[duplicated(data),]
 [1] ChEMBL.ID                       Name                           
 [3] Synonyms                        Type                           
 [5] Max.Phase                       Molecular.Weight               
 [7] Targets                         Bioactivities                  
 [9] AlogP                           Polar.Surface.Area             
[11] HBA                             HBD                            
[13] X.RO5.Violations                X.Rotatable.Bonds              
[15] Passes.Ro3                      QED.Weighted                   
[17] CX.Acidic.pKa                   CX.Basic.pKa                   
[19] CX.LogP                         CX.LogD                        
[21] Aromatic.Rings                  Structure.Type                 
[23] Inorganic.Flag                  Heavy.Atoms                    
[25] HBA..Lipinski.                  HBD..Lipinski.                 
[27] X.RO5.Violations..Lipinski.     Molecular.Weight..Monoisotopic.
[29] Molecular.Species               Molecular.Formula              
[31] Smiles

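The result for the whole data frame: only the column names come back, which means no row is a complete duplicate of another.

You might wonder why individual columns have so many repeats; the ChEMBL data has more blanks than you'd expect. That's why I ran dropna() before drawing the scatter plot in pandas, and dropping rows that way throws out quite a lot of usable data. (By default dropna() removes every row that has a blank in any column; limiting it to specific columns needs the subset argument.)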

 

> data[!duplicated(data$Name),]

This keeps only the first row for each Name, i.e. it drops the rows that repeat a value in one specific column.
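On a small made-up data frame the effect is easy to follow. Note that duplicated() treats missing names as equal to each other, so only the first NA row survives; that matches what happens with the many blank names in the real data.

```r
# Toy data frame with made-up values
df <- data.frame(
  name  = c("biotin", "biotin", "avidin", NA, NA),
  value = 1:5,
  stringsAsFactors = FALSE
)

first_per_name <- df[!duplicated(df$name), ]   # first row for each distinct name
repeats        <- df[duplicated(df$name), ]    # the rows that were dropped

print(first_per_name)
print(repeats)
```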

 

> unique(data$Name)
[1]                                  IODINE I 131 DERLOTUXIMAB BIOTIN
[3] biotin-geranylpyrophosphate      BIOTIN                          
4 Levels:  BIOTIN ... IODINE I 131 DERLOTUXIMAB BIOTIN

Oh, picking out the distinct values works too. (This DB is about biotin-related compounds.) But out of 78 compounds, these are the only ones with a name? That's a bit much...

 

Comparing values that include NA

> df <- data.frame( a=c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,NA,NA,NA),
+                   b=c(TRUE,FALSE,NA,TRUE,FALSE,NA,TRUE,FALSE,NA))
> df
      a     b
1  TRUE  TRUE
2  TRUE FALSE
3  TRUE    NA
4 FALSE  TRUE
5 FALSE FALSE
6 FALSE    NA
7    NA  TRUE
8    NA FALSE
9    NA    NA

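We'll use this one...

First off, NA is a missing value (NULL means there is nothing there at all, whereas NA is a slot filled with a "missing" marker), so comparison...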

> df$a == df$b
[1]  TRUE FALSE    NA FALSE  TRUE    NA    NA    NA    NA

doesn't work.

Python pandas, for its part, skips missing values by default in aggregations such as sum() and mean().

> data.frame(df, isSame = (df$a==df$b))
      a     b isSame
1  TRUE  TRUE   TRUE
2  TRUE FALSE  FALSE
3  TRUE    NA     NA
4 FALSE  TRUE  FALSE
5 FALSE FALSE   TRUE
6 FALSE    NA     NA
7    NA  TRUE     NA
8    NA FALSE     NA
9    NA    NA     NA

Whenever either side is NA, the comparison refuses to answer. The Cookbook gets around this by writing a function,

> compareNA=function(v1,v2){
+ same=(v1==v2)|(is.na(v1)&is.na(v2))
+ same[is.na(same)]=FALSE
+ return(same)
+ }

which is this one.

 

> compareNA(df$a,df$b)
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
> data.frame(df,isSame = compareNA(df$a,df$b))
      a     b isSame
1  TRUE  TRUE   TRUE
2  TRUE FALSE  FALSE
3  TRUE    NA  FALSE
4 FALSE  TRUE  FALSE
5 FALSE FALSE   TRUE
6 FALSE    NA  FALSE
7    NA  TRUE  FALSE
8    NA FALSE  FALSE
9    NA    NA   TRUE

So two NAs count as equal, and an NA against an actual value counts as different.
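The reason the function works is R's three-valued logic: NA | TRUE is TRUE, while NA | FALSE stays NA and is then overwritten with FALSE. A short sketch on made-up numeric vectors, with the function repeated so the block runs on its own:

```r
compareNA <- function(v1, v2) {
  # TRUE where both are equal, or where both are NA
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  # what is still NA had exactly one missing side: count it as different
  same[is.na(same)] <- FALSE
  same
}

x <- c(1, NA, 3, NA)   # made-up values
y <- c(1, NA, 4, 2)

result       <- compareNA(x, y)
n_mismatches <- sum(!result)   # counting works because no NA is left in the result

print(result)
print(n_mismatches)
```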

 

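Recoding data

As before, both functions used here need plyr... This section recodes two broad kinds of data: categorical and numeric.

The difference between the two is whether the value can be measured as a number. In this data set, the molecule type is an example of categorical data and the molecular weight is an example of numeric data.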

 

> data$Type
 [1] Small molecule Small molecule Small molecule                Small molecule
 [6] Small molecule                Small molecule Small molecule Small molecule
[11] Small molecule Small molecule Small molecule Small molecule Small molecule
[16] Small molecule Small molecule                               Small molecule
[21] Antibody       Small molecule Protein        Small molecule Small molecule
[26] Small molecule Small molecule Small molecule Small molecule Small molecule
[31]                Small molecule Small molecule Small molecule Small molecule
[36] Small molecule Small molecule                               Small molecule
[41] Small molecule Unknown        Small molecule Small molecule Small molecule
[46] Small molecule Small molecule Small molecule Small molecule Small molecule
[51] Small molecule Small molecule Small molecule Unknown        Small molecule
[56] Small molecule Small molecule Small molecule Small molecule               
[61] Small molecule Small molecule Small molecule Small molecule Small molecule
[66] Small molecule Small molecule Small molecule Small molecule Small molecule
[71] Small molecule Small molecule Small molecule Small molecule Small molecule
[76] Small molecule Small molecule Small molecule
Levels:  Antibody Protein Small molecule Unknown

This is the molecule type: small molecule, antibody, protein, or unknown. (What's with the blanks?) In other words, it's not something you can measure as a number.

 

> data$code=revalue(data$Type,c("Antibody"="A","Small molecule"="S","Protein"="P","Unknown"="U"))
> data$code
 [1] S S S   S S   S S S S S S S S S S     S A S P S S S S S S S   S S S S S S  
[39]   S S U S S S S S S S S S S S U S S S S S   S S S S S S S S S S S S S S S S
[77] S S
Levels:  A P S U

revalue()

 

> data$code=mapvalues(data$Type,from=c("Antibody","Small molecule","Protein","Unknown"),to=c("A","S","P","U"))
> data$code
 [1] S S S   S S   S S S S S S S S S S     S A S P S S S S S S S   S S S S S S  
[39]   S S U S S S S S S S S S S S U S S S S S   S S S S S S S S S S S S S S S S
[77] S S
Levels:  A P S U

mapvalues()

 

> data$code[data$Type=="Antibody"]="A"

Ah, real ones don't use libraries. The catch is that you have to type a line like that for every single value.
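One way to avoid typing a line per value in base R is a named lookup vector. This is only a sketch on a made-up type vector; lookup is a name I picked for illustration. The as.character() call matters if the column is a factor, because indexing with a factor would use its integer codes instead of its labels.

```r
type <- factor(c("Small molecule", "Antibody", "", "Unknown", "Protein"))  # made-up

# names are the old values, elements are the new codes
lookup <- c("Antibody" = "A", "Small molecule" = "S",
            "Protein" = "P", "Unknown" = "U")

code <- unname(lookup[as.character(type)])   # values without a match become NA

# keep the original value (here: the blank) where nothing matched
missing <- is.na(code)
code[missing] <- as.character(type)[missing]

print(code)
```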

 

On to the next example.

 

> data$Molecular.Weight
 [1]  258.34  453.57  980.28  618.71 1838.31  634.78  919.07  789.01  650.05
[10] 1839.29  582.12  847.94  729.44 1646.16  641.93  718.87  814.15  602.71
[19]  814.97  814.02      NA  584.84 3269.70 1041.18 1075.19  700.86  605.75
[28] 1942.31  444.56  565.67  947.12  500.67 1625.96  913.05  542.68  923.15
[37] 1351.45  590.66  574.66  667.87  913.15 2532.95 1265.49  668.86  555.53
[46]  381.55 1187.32  566.12  682.82  651.23 1323.40  806.99  631.75 2625.32
[55]  326.43  764.39  606.62 1460.41  747.94  786.91 1074.26  429.52  786.99
[64] 1058.26 1474.77  549.67 1234.09 1278.10  244.32  854.06  891.95  718.84
[73]  773.45  728.92  834.14  923.15 1753.23  615.75

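These are the molecular weights of the biotin-related molecules. The NA bugs me a little, but moving on. Molecular weight is one of those things you measure as a number.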

 

> data$category[data$Molecular.Weight > 500]="Large"
> data$category[data$Molecular.Weight <= 500]="Small"
> data$category
 [1] "Small" "Small" "Large" "Large" "Large" "Large" "Large" "Large" "Large"
[10] "Large" "Large" "Large" "Large" "Large" "Large" "Large" "Large" "Large"
[19] "Large" "Large" NA      "Large" "Large" "Large" "Large" "Large" "Large"
[28] "Large" "Small" "Large" "Large" "Large" "Large" "Large" "Large" "Large"
[37] "Large" "Large" "Large" "Large" "Large" "Large" "Large" "Large" "Large"
[46] "Small" "Large" "Large" "Large" "Large" "Large" "Large" "Large" "Large"
[55] "Small" "Large" "Large" "Large" "Large" "Large" "Large" "Small" "Large"
[64] "Large" "Large" "Large" "Large" "Large" "Small" "Large" "Large" "Large"
[73] "Large" "Large" "Large" "Large" "Large" "Large"

Wait, this one is a category too...? Anyway, here I split at a molecular weight of 500. We'll look at how to choose the cutoff some other time.

> data$category=cut(data$Molecular.Weight, breaks=c(0,500,Inf),labels=c("Small","Large"))
> data$category
 [1] Small Small Large Large Large Large Large Large Large Large Large Large
[13] Large Large Large Large Large Large Large Large <NA>  Large Large Large
[25] Large Large Large Large Small Large Large Large Large Large Large Large
[37] Large Large Large Large Large Large Large Large Large Small Large Large
[49] Large Large Large Large Large Large Small Large Large Large Large Large
[61] Large Small Large Large Large Large Large Large Small Large Large Large
[73] Large Large Large Large Large Large
Levels: Small Large

If typing several lines like that is a chore, split with cut() instead.
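cut() pays off more as the number of bins grows. A sketch with three bins on made-up weights; the 1000 cutoff and the "Medium" label are assumptions for illustration, not thresholds from the data.

```r
mw <- c(258.34, 500, 980.28, NA, 3269.70)   # made-up weights, one missing

# right = TRUE (the default) closes each interval on the right:
# (0,500], (500,1000], (1000,Inf]
size <- cut(mw, breaks = c(0, 500, 1000, Inf),
            labels = c("Small", "Medium", "Large"))

print(size)
print(table(size, useNA = "ifany"))   # counts per bin, including the NA
```

With the default right = TRUE, a value of exactly 500 lands in "Small", which agrees with the <= 500 rule used in the two-line version.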

 

Operations between columns

> data$ratio=data$HBA/data$HBD
Warning message:
In Ops.factor(data$HBA, data$HBD) :
  ‘/’ not meaningful for factors
> data$ratio
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA

But those columns clearly hold numbers, so why is everything NA...

 

> data$HBA
 [1] 4    7    12   10   None 9    17   13   12   None 9    14   8    None 10  
[16] 10   9    9    14   12        8    None None None 11   7    None 7    9   
[31] 17   7    None 19   5    15   None 10   9    7    16   None None 9    7   
[46] 6    None 8    10   9    None 8    11   None 4    10   7    None 10   14  
[61] None 4    13   None None 8    None None 3    11   15   10   9    10   11  
[76] 15   None 10  
Levels:  10 11 12 13 14 15 16 17 19 3 4 5 6 7 8 9 None
> data$HBD
 [1] 2    5    6    3    None 8    3    3    6    None 6    12   3    None 8   
[16] 8    5    2    3    9         3    None None None 8    3    None 3    6   
[31] 3    3    None 13   5    3    None 3    2    5    4    None None 7    6   
[46] 3    None 5    5    8    None 8    6    None 3    9    6    None 9    3   
[61] None 4    3    None None 5    None None 3    5    13   6    3    6    5   
[76] 3    None 5   
Levels:  12 13 2 3 4 5 6 7 8 9 None

Told you they were there.

 

> data$sum = as.numeric(data$HBA) + as.numeric(data$HBD)
> data$sum
 [1] 16 22 12  7 30 27 14 10 12 30 25  8 21 30 12 12 24 21 11 15  2 21 30 30 30
[26] 13 20 30 20 25 14 20 30 13 20 12 30  7 21 22 14 30 30 26 23 19 30 23  9 27
[51] 30 26 11 30 17 13 23 30 13 11 30 18 10 30 30 23 30 30 16 10 10 10 22 10 10
[76] 12 30  9

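(facepalm) These are factors, so converting them straight to numeric doesn't work: the sums above are built from level codes, not from the actual values.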

> data$sum = as.numeric(as.character(data$HBA)) + as.numeric(as.character(data$HBD))
Warning messages:
1: NAs introduced by coercion 
2: NAs introduced by coercion 
> data$sum
 [1]  6 12 18 13 NA 17 20 16 18 NA 15 26 11 NA 18 18 14 11 17 21 NA 11 NA NA NA
[26] 19 10 NA 10 15 20 10 NA 32 10 18 NA 13 11 12 20 NA NA 16 13  9 NA 13 15 17
[51] NA 16 17 NA  7 19 13 NA 19 17 NA  8 16 NA NA 13 NA NA  6 16 28 16 12 16 16
[76] 18 NA 15

You have to convert twice... The NAs come from the None and blank entries, which have no numeric value.
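To get rid of the coercion warnings, the placeholders can be turned into NA explicitly before converting. A sketch on made-up factors that mimic the HBA/HBD columns; to_number is a hypothetical helper name.

```r
hba <- factor(c("4", "None", "12", ""))   # made-up values in the same style
hbd <- factor(c("2", "None", "6", ""))

to_number <- function(f) {
  s <- as.character(f)              # labels, not level codes
  s[s %in% c("None", "")] <- NA     # mark the placeholders as missing up front
  as.numeric(s)                     # nothing left to trigger a coercion warning
}

ratio <- to_number(hba) / to_number(hbd)
print(ratio)
```

If the data comes from a CSV, passing na.strings = c("None", "") to read.csv() should have a similar effect at load time, since those entries are then read as NA in the first place.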

 

Mapping vector values

Didn't expect plyr to show up here as well...

 

> str=c("alpha","omicron","pi")
> revalue(str,c("alpha"="sigma"))
[1] "sigma"   "omicron" "pi"
> mapvalues(str,from=c("alpha"),to=c("sigma"))
[1] "sigma"   "omicron" "pi"

Vector values can also be mapped with revalue() and mapvalues(). Note that revalue() only works on factor and character vectors, so it can't map numbers.

 

> mapvalues(v,from=c(1,2),to=c(11,12))
 [1]  3  6  8  9 12  5  7  4 10 11

Speaking of sigma, it reminds me of Sigma from Rockman X... The X5 stage BGM was straight-up club music...

 

https://youtu.be/vOpQoAqTTpc

While we're at it, let's have a listen. Wait, how did we end up here?

 

> str[str=="alpha"]="sigma"
> str
[1] "sigma"   "omicron" "pi"
> v[v==2]=12
> v
 [1]  3  6  8  9 12  5  7  4 10  1

It works without a library too, although it gets tedious once there are many values...
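When there are many values, match() can do all the replacements in one pass in base R. The helper below is a hypothetical sketch (map_values is my own name, not a plyr or base function) that mimics the from/to style shown earlier.

```r
# Hypothetical helper: replace each value found in `from` with the
# corresponding entry of `to`, leaving everything else untouched
map_values <- function(x, from, to) {
  idx <- match(x, from)      # position of each element in `from`, NA if absent
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]
  x
}

nums  <- map_values(c(3, 1, 2, 5), from = c(1, 2), to = c(11, 12))
words <- map_values(c("alpha", "omicron", "pi"), from = "alpha", to = "sigma")

print(nums)
print(words)
```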

 

> sub("^sigma$","omicron",str)
[1] "omicron" "omicron" "pi"     
> gsub("o","O",str)
[1] "sigma"   "OmicrOn" "pi"

Apparently sub() and gsub() work too.