barcode

R 배워보기- 6.4. Manipulating data-Restructing data

Coding/R

들어가기 전에 

아니 새기들아 깔아야 하는 라이브러리가 있으면 미리 좀 알려달라고!!! (깊은 분노) 

아니 어느 레시피에서 재료설명도 없이 주저리 주저리 레시피 쓰다가 존내 당연하다는 듯 여러분 다들 집에 맨드레이크 있으시죠? 맨드레이크를 채썰어주세요. 하면서 레시피를 쓰냐!!! 집에 왜 그런게 있죠 아니 외가에서 무 받아온게 사람 모양이더라고 

아무튼... 좀 개빡치긴 했지만... 라이브러리 깔고 가세요... 

install.packages("tidyr")
install.packages("reshape2")
install.packages("doBy")

테이블 가로세로 바꾸기

테이블은 보통 가로로 길거나 세로로 길거나 둘 중 하나이다. 캡처는 못했지만, 전전직장에서 일하면서 SQL로 정리해뒀던 샘플 표는 가로로 정말 엄청나게 긴 표였다. (시료의 색깔, 크기, 규격, 회사명, 제품명, 합불여부까지 다 기재해서... 마스크는 똑같은 회사에서 만들더라도 KF 규격이 다르거나 색, 크기가 다르면 다 검사 받아야 한다)

아무튼... 이게 가로세로가 바뀐다고? 

> df=read.csv('/home/koreanraichu/example.csv',sep=";")
> df
          ID Interesred.in  Class
1 kimlab0213        Python  Basic
2   ahn_0526        Python Medium
3   peponi01             R  Basic
4  kuda_koma             R Expert
5 comma_life          Java  Basic
6 wheresjohn          Java Medium
7 hanguk_joa        Python Expert
8   sigma_00             R  Basic
9  kokoatalk          Java  Basic

이건 근데 가로로 긴거냐 세로로 긴거냐... 그냥 세로로 방대한거 아닌가 

> df_long=gather(df,ID,Interested.in,Class,factor_key=TRUE)
> df_long
     ID Interested.in
1 Class         Basic
2 Class        Medium
3 Class         Basic
4 Class        Expert
5 Class         Basic
6 Class        Medium
7 Class        Expert
8 Class         Basic
9 Class         Basic

tidyr 라이브러리의 gather()를 이용하면 이렇게 된다. (코드에는 생략되어 있으나 이 글을 읽는 여러분들이라면 아시리라 믿는다. tidyr 먼저 부르자)

> library(reshape2)

다음의 패키지를 부착합니다: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

> df_melt=melt(df,id.vars=c("ID","Class"))
> df_melt
          ID  Class      variable  value
1 kimlab0213  Basic Interested.in Python
2   ahn_0526 Medium Interested.in Python
3   peponi01  Basic Interested.in      R
4  kuda_koma Expert Interested.in      R
5 comma_life  Basic Interested.in   Java
6 wheresjohn Medium Interested.in   Java
7 hanguk_joa Expert Interested.in Python
8   sigma_00  Basic Interested.in      R
9  kokoatalk  Basic Interested.in   Java

reshape2의 melt()로는 이렇게 한다. 

> df_melt=melt(df,id.bvars=c("Interested.in","Class"),measure.vars=c("Interested.in","Class"),variable.names="Lenguages",value.name="ID")

디테일한 지정도 된다. 

이거는 넓은 표를 길게 했다 칩시다... 그럼 긴 표를 넓게 하려면 어떻게 해야 할까? 

> df_wide=spread(df,ID,Class)
> df_wide
  Interested.in ahn_0526 comma_life hanguk_joa kimlab0213 kokoatalk kuda_koma
1          Java     <NA>      Basic       <NA>       <NA>     Basic      <NA>
2        Python   Medium       <NA>     Expert      Basic      <NA>      <NA>
3             R     <NA>       <NA>       <NA>       <NA>      <NA>    Expert
  peponi01 sigma_00 wheresjohn
1     <NA>     <NA>     Medium
2     <NA>     <NA>       <NA>
3    Basic    Basic       <NA>
> df2_wide=spread(df2,Category.1,Price)
> df2_wide
       ID              Product.name   Category.2 Order Cancel Consumable
1  BC-001                 BCIG agar     Bacteria    10      2         NA
2  DM-001                      DMEM Cell culture    20      0         NA
3  GL-001               Slide glass Experimental    50      0         NA
4  GL-002               Cover glass Experimental    55      1         NA
5  LB-001                   LB agar     Bacteria     5      0         NA
6  LB-002                  LB broth     Bacteria     6      1         NA
7  MS-001                   MS agar        Plant     7      2         NA
8  MS-002                  MS broth        Plant     7      0         NA
9  MS-003 MS agar(w/ plant hormone)        Plant     2      1         NA
10 PI-001             Pasteur pipet Experimental    30      5      10000
11 PI-002           Pipet-AID(10ml) Experimental    45      0      15000
12 PI-003           Pipet-AID(25ml) Experimental    45      5      25000
13 PI-004                 Pipet-AID Experimental     2      0         NA
   Equipment Glasswear Medium
1         NA        NA  70000
2         NA        NA  90000
3         NA     20000     NA
4         NA     20000     NA
5         NA        NA  55000
6         NA        NA  50000
7         NA        NA  65000
8         NA        NA  60000
9         NA        NA  75000
10        NA        NA     NA
11        NA        NA     NA
12        NA        NA     NA
13    100000        NA     NA

tidyr 라이브러리에 있는 spread를 쓰거나... NA 진짜 보기싫네... 

> dcast(df2,Order~Cancel,value.var="Price",sum)
   Order      0     1     2     5
1      2 100000 75000     0     0
2      5  55000     0     0     0
3      6      0 50000     0     0
4      7  60000     0 65000     0
5     10      0     0 70000     0
6     20  90000     0     0     0
7     30      0     0     0 10000
8     45  15000     0     0 25000
9     50  20000     0     0     0
10    55      0 20000     0     0

dcast를 쓰자. 얘가 대충 피벗테이블 같은 역할이라는 듯... 형식은 dcast(원본 데이터, 행~열에 들어갈 항목, 값으로 들어갈 항목, 적용할 함수). 

 

Summarize

대충 설명만 들어봤을 때는 pandas의 그룹바이가 생각난다. 

 

data <- read.table(header=TRUE, text='
 subject sex condition before after change
       1   F   placebo   10.1   6.9   -3.2
       2   F   placebo    6.3   4.2   -2.1
       3   M   aspirin   12.4   6.3   -6.1
       4   F   placebo    8.1   6.1   -2.0
       5   M   aspirin   15.2   9.9   -5.3
       6   F   aspirin   10.9   7.0   -3.9
       7   F   aspirin   11.6   8.5   -3.1
       8   M   aspirin    9.5   3.0   -6.5
       9   F   placebo   11.5   9.0   -2.5
      10   M   placebo   11.9  11.0   -0.9
      11   F   aspirin   11.4   8.0   -3.4
      12   M   aspirin   10.0   4.4   -5.6
      13   M   aspirin   12.5   5.4   -7.1
      14   M   placebo   10.6  10.6    0.0
      15   M   aspirin    9.1   4.3   -4.8
      16   F   placebo   12.1  10.2   -1.9
      17   F   placebo   11.0   8.8   -2.2
      18   F   placebo   11.9  10.2   -1.7
      19   M   aspirin    9.1   3.6   -5.5
      20   M   placebo   13.5  12.4   -1.1
      21   M   aspirin   12.0   7.5   -4.5
      22   F   placebo    9.1   7.6   -1.5
      23   M   placebo    9.9   8.0   -1.9
      24   F   placebo    7.6   5.2   -2.4
      25   F   placebo   11.8   9.7   -2.1
      26   F   placebo   11.8  10.7   -1.1
      27   F   aspirin   10.1   7.9   -2.2
      28   M   aspirin   11.6   8.3   -3.3
      29   F   aspirin   11.3   6.8   -4.5
      30   F   placebo   10.3   8.3   -2.0
 ')

예제 테이블은 이건데... 아마도 임상실험을 상정하고 만든 데이터같다. 임상실험에서는 투약군과 위약군이 나뉜다. 투약군은 약을 투여하는거고 위약군은 가짜 약을 투여하는 군이라고 보면 된다. 

비운의 약 TGN1412의 경우 전임상에서는 이상 없었는데 1차 임상시험에서 투약군 6명이 전부 심각한 부작용으로 ICU행이 되어서 개발 중단된 약. (위약군 두 명만 멀쩡했다고...)

 

> library(plyr)
> cdata=ddply(data,c("sex","condition"),summarise,N=length(change),mean=mean(change),sd=sd(change),se=sd/sqrt(N))
> cdata
  sex condition  N      mean        sd        se
1   F   aspirin  5 -3.420000 0.8642916 0.3865230
2   F   placebo 12 -2.058333 0.5247655 0.1514867
3   M   aspirin  9 -5.411111 1.1307569 0.3769190
4   M   placebo  4 -0.975000 0.7804913 0.3902456
# Run the functions length, mean, and sd on the value of "change" for each group, 
# broken down by sex + condition

ddply()를 쓰고 싶다면 plyr 라이브러리를... 아직도 안 깔았음??? 

> cdataNA=ddply(dataNA,c("sex","condition"),summarise,N=sum(!is.na(change)),mean=mean(change,na.rm=TRUE),sd=sd(change,na.rm=TRUE),se=sd/sqrt(N))
> cdataNA
  sex condition  N      mean        sd        se
1   F   aspirin  4 -3.425000 0.9979145 0.4989572
2   F   placebo 12 -2.058333 0.5247655 0.1514867
3   M   aspirin  7 -5.142857 1.0674848 0.4034713
4   M   placebo  3 -1.300000 0.5291503 0.3055050

네? 결측값이요? na.rm=TRUE를 주면 된다. 

 

> ddply(dataNA,c("sex","condition"),summarise,N=sum(!is.na(change)),mean=mean(change),sd=sd(change,na.rm=TRUE),se=sd/sqrt(N))
  sex condition  N      mean        sd        se
1   F   aspirin  4        NA 0.9979145 0.4989572
2   F   placebo 12 -2.058333 0.5247655 0.1514867
3   M   aspirin  7        NA 1.0674848 0.4034713
4   M   placebo  3        NA 0.5291503 0.3055050

na.rm=TRUE를 안 주면(일단 평균에만 안 줬다) 이렇게 된다. pandas는 skipna가 자동으로 켜져있지만 R은 아님. 

 

> library(doBy)
> cdata=summaryBy(change~sex+condition,data=data,FUN=c(length,mean,sd))
> cdata
  sex condition change.length change.mean change.sd
1   F   aspirin             5   -3.420000 0.8642916
2   F   placebo            12   -2.058333 0.5247655
3   M   aspirin             9   -5.411111 1.1307569
4   M   placebo             4   -0.975000 0.7804913

doBy 라이브러리의 summaryBy()도 쓸 수 있다. 저게 ddply()보다 간결함. 

 

cdataNA <- summaryBy(change ~ sex + condition, data=dataNA,
                     FUN=c(length2, mean, sd), na.rm=TRUE)

결측값도 간결하게 처리해준다. 

네? 라이브러리 깔 여건이 안된다고요? 그렇다면 R에 있는 aggregate()를 쓰면 되는데... 쓰다보면 알겠지만 차라리 라이브러리 까는 게 낫다. 

> cdata=aggregate(data["subject"],by=data[c("sex","condition")],FUN=length)
> cdata
  sex condition subject
1   F   aspirin       5
2   M   aspirin       9
3   F   placebo      12
4   M   placebo       4

일단 만들고

> names(cdata)[names(cdata)=="subject"]="N"
> cdata
  sex condition  N
1   F   aspirin  5
2   M   aspirin  9
3   F   placebo 12
4   M   placebo  4

이름 바꿔주고 

> cdata=cdata[order(cdata$sex),]
> cdata
  sex condition  N
1   F   aspirin  5
3   F   placebo 12
2   M   aspirin  9
4   M   placebo  4

정렬해주고... 아직 안 끝났다. 이제 평균이랑 표준편차랑 se 구해야된다. 

 

> cdata.means=aggregate(data[c("before","after","change")],by=data[c("sex","condition")],FUN=mean)
> cdata.means
  sex condition   before     after    change
1   F   aspirin 11.06000  7.640000 -3.420000
2   M   aspirin 11.26667  5.855556 -5.411111
3   F   placebo 10.13333  8.075000 -2.058333
4   M   placebo 11.47500 10.500000 -0.975000
> cdata=merge(cdata,cdata.means)
> cdata
  sex condition  N   before     after    change
1   F   aspirin  5 11.06000  7.640000 -3.420000
2   F   placebo 12 10.13333  8.075000 -2.058333
3   M   aspirin  9 11.26667  5.855556 -5.411111
4   M   placebo  4 11.47500 10.500000 -0.975000

평균

 

> cdata.sd=aggregate(data["change"],by=data[c("sex","condition")],FUN=sd)
> cdata.sd
  sex condition    change
1   F   aspirin 0.8642916
2   M   aspirin 1.1307569
3   F   placebo 0.5247655
4   M   placebo 0.7804913
> names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
> cdata.sd
  sex condition change.sd
1   F   aspirin 0.8642916
2   M   aspirin 1.1307569
3   F   placebo 0.5247655
4   M   placebo 0.7804913
> cdata=merge(cdata,cdata.sd)
> cdata
  sex condition subject   before     after    change change.sd
1   F   aspirin       5 11.06000  7.640000 -3.420000 0.8642916
2   F   placebo      12 10.13333  8.075000 -2.058333 0.5247655
3   M   aspirin       9 11.26667  5.855556 -5.411111 1.1307569
4   M   placebo       4 11.47500 10.500000 -0.975000 0.7804913

표준편차(Standard deviation 줄이면 SD)

 

> cdata$change.se <- cdata$change.sd / sqrt(cdata$subject)
> cdata
  sex condition subject   before     after    change change.sd change.se
1   F   aspirin       5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
2   F   placebo      12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
3   M   aspirin       9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
4   M   placebo       4 11.47500 10.500000 -0.975000 0.7804913 0.3902456

se... 저거 참고로 중간에 표 한번 조져서 다시 만들었다...ㅋㅋㅋㅋㅋㅋ 

이 노가다를 하느니 그냥 라이브러리를 깔자... 

 

Contingency table

솔직히 이거 뭐하는건지는 나도 모름. 

 

> ctable=table(df)
> ctable
, , Class = Basic

            Interested.in
ID           Java Python R
  ahn_0526      0      0 0
  comma_life    1      0 0
  hanguk_joa    0      0 0
  kimlab0213    0      1 0
  kokoatalk     1      0 0
  kuda_koma     0      0 0
  peponi01      0      0 1
  sigma_00      0      0 1
  wheresjohn    0      0 0

, , Class = Expert

            Interested.in
ID           Java Python R
  ahn_0526      0      0 0
  comma_life    0      0 0
  hanguk_joa    0      1 0
  kimlab0213    0      0 0
  kokoatalk     0      0 0
  kuda_koma     0      0 1
  peponi01      0      0 0
  sigma_00      0      0 0
  wheresjohn    0      0 0

, , Class = Medium

            Interested.in
ID           Java Python R
  ahn_0526      0      1 0
  comma_life    0      0 0
  hanguk_joa    0      0 0
  kimlab0213    0      0 0
  kokoatalk     0      0 0
  kuda_koma     0      0 0
  peponi01      0      0 0
  sigma_00      0      0 0
  wheresjohn    1      0 0

일단 불러온 걸로 만들어 본 결과... 

 

> df=read.csv('/home/koreanraichu/example.csv',sep=";")
> df
          ID Interested.in  Class
1 kimlab0213        Python  Basic
2   ahn_0526        Python Medium
3   peponi01             R  Basic
4  kuda_koma             R Expert
5 comma_life          Java  Basic
6 wheresjohn          Java Medium
7 hanguk_joa        Python Expert
8   sigma_00             R  Basic
9  kokoatalk          Java  Basic

여기서 Class별로 언어 몇 개 있는지 세 준다. (Basic에서 파이썬 R 자바 몇개 이런 식) 그걸 ID별로 표기하느라 중구난방이 된 것. 이걸 수제작으로 그리면 

> df_counts=data.frame(Class=c("Basic","Basic","Basic","Medium","Medium","Medium","Expert","Expert","Expert"),Lan=c("Python","R","Java","Python","R","Java","Python","R","Java"),freq=c(1,2,2,1,0,1,1,1,0))
> df_counts
   Class    Lan freq
1  Basic Python    1
2  Basic      R    2
3  Basic   Java    2
4 Medium Python    1
5 Medium      R    0
6 Medium   Java    1
7 Expert Python    1
8 Expert      R    1
9 Expert   Java    0

중간에 오타나면 멘탈바사삭 확정. 

우리의 신입사원 김부추씨는 프로그래밍 언어 강의 플랫폼을 서비스하는 회사에서 근무한다. 회사에서 고객 데이터 집계를 위해 각 클래스별로 얼마나 듣는지 데이터를 달라는데... 저걸 다 세면 되나요? 아니 어느세월에... 

> table(df$Class,df$Interested.in)
        
         Java Python R
  Basic     2      1 2
  Expert    0      1 1
  Medium    1      1 0

이러면 나오잖음... 

인덱스는 난이도(class), 컬럼이 언어이다. 즉 컬럼별로 보자면 자바 셋, 파이썬 셋, R 셋. 난이도별로 보자면 초급 5명, 중급 2명, 쌉고수전문가 2명. 

네? 회사에 제출하는건데 저렇게 해도 되냐고요? 

> table(df$Class,df$Interested.in,dnn=c("Class","Interested.in"))
        Interested.in
Class    Java Python R
  Basic     2      1 2
  Expert    0      1 1
  Medium    1      1 0

이름을 넣으면 됨. 

> table(df[,c("Class","Interested.in")])
        Interested.in
Class    Java Python R
  Basic     2      1 2
  Expert    0      1 1
  Medium    1      1 0

물론 이것도 된다. 저기서 ID를 넣으면 데이터가 매우 괴랄해지므로 좀 묶을 수 있는 걸로 넣어보는 것을 추천한다. 

 

> countdf=as.data.frame(table(df))
> countdf
           ID Interested.in  Class Freq
1    ahn_0526          Java  Basic    0
2  comma_life          Java  Basic    1
3  hanguk_joa          Java  Basic    0
4  kimlab0213          Java  Basic    0
5   kokoatalk          Java  Basic    1
6   kuda_koma          Java  Basic    0
7    peponi01          Java  Basic    0
8    sigma_00          Java  Basic    0
9  wheresjohn          Java  Basic    0
10   ahn_0526        Python  Basic    0
11 comma_life        Python  Basic    0
12 hanguk_joa        Python  Basic    0
13 kimlab0213        Python  Basic    1
14  kokoatalk        Python  Basic    0
15  kuda_koma        Python  Basic    0
16   peponi01        Python  Basic    0
17   sigma_00        Python  Basic    0
18 wheresjohn        Python  Basic    0
19   ahn_0526             R  Basic    0
20 comma_life             R  Basic    0
21 hanguk_joa             R  Basic    0
22 kimlab0213             R  Basic    0
23  kokoatalk             R  Basic    0
24  kuda_koma             R  Basic    0
25   peponi01             R  Basic    1
26   sigma_00             R  Basic    1
27 wheresjohn             R  Basic    0
28   ahn_0526          Java Expert    0
29 comma_life          Java Expert    0
30 hanguk_joa          Java Expert    0
31 kimlab0213          Java Expert    0
32  kokoatalk          Java Expert    0
33  kuda_koma          Java Expert    0
34   peponi01          Java Expert    0
35   sigma_00          Java Expert    0
36 wheresjohn          Java Expert    0
37   ahn_0526        Python Expert    0
38 comma_life        Python Expert    0
39 hanguk_joa        Python Expert    1
40 kimlab0213        Python Expert    0
41  kokoatalk        Python Expert    0
42  kuda_koma        Python Expert    0
43   peponi01        Python Expert    0
44   sigma_00        Python Expert    0
45 wheresjohn        Python Expert    0
46   ahn_0526             R Expert    0
47 comma_life             R Expert    0
48 hanguk_joa             R Expert    0
49 kimlab0213             R Expert    0
50  kokoatalk             R Expert    0
51  kuda_koma             R Expert    1
52   peponi01             R Expert    0
53   sigma_00             R Expert    0
54 wheresjohn             R Expert    0
55   ahn_0526          Java Medium    0
56 comma_life          Java Medium    0
57 hanguk_joa          Java Medium    0
58 kimlab0213          Java Medium    0
59  kokoatalk          Java Medium    0
60  kuda_koma          Java Medium    0
61   peponi01          Java Medium    0
62   sigma_00          Java Medium    0
63 wheresjohn          Java Medium    1
64   ahn_0526        Python Medium    1
65 comma_life        Python Medium    0
66 hanguk_joa        Python Medium    0
67 kimlab0213        Python Medium    0
68  kokoatalk        Python Medium    0
69  kuda_koma        Python Medium    0
70   peponi01        Python Medium    0
71   sigma_00        Python Medium    0
72 wheresjohn        Python Medium    0
73   ahn_0526             R Medium    0
74 comma_life             R Medium    0
75 hanguk_joa             R Medium    0
76 kimlab0213             R Medium    0
77  kokoatalk             R Medium    0
78  kuda_koma             R Medium    0
79   peponi01             R Medium    0
80   sigma_00             R Medium    0
81 wheresjohn             R Medium    0

count로 바꾸는 것도 되긴 되는데... 하지 말자... (마른세수) 

 

> xtabs(Freq~Class+Interested.in,data=countdf)
        Interested.in
Class    Java Python R
  Basic     2      1 2
  Expert    0      1 1
  Medium    1      1 0

뭐야 이것도 됨?