
Learning R - 7. Statistical analysis (Part 1)


Before we begin

1) This is the flower of R, its whole reason for existing, so the content is absurdly long. Naver probably won't cut a post for length, but I'm still splitting this into Part 1 and Part 2. (The cookbook sub-chapters are split 3:3.)
2) And these people keep springing libraries on us out of nowhere with a "just load this package", so today too there are packages to install. Some show up in Part 1 and some in Part 2, but let's just install them all now.

install.packages("car")
install.packages("ggplot2")
install.packages("exact2x2")
install.packages("irr")

3) A lot of these analyses are new to me too, so I don't know what they are either. Theory explanations are therefore skipped.
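As a side note, here is a small convenience sketch of my own (not from the cookbook) that installs only whatever is missing from the list above before loading everything:

```r
# Hypothetical helper: install only the packages from this post that are
# missing, then load them all in one go.
pkgs <- c("car", "ggplot2", "exact2x2", "irr")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```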


Regression and correlation

For the record, the data we build here is a drop in the bucket... later on we'll install a library and load real data from it.

 

> set.seed(1)
> xvar=1:20+rnorm(20,sd=3)
> zvar=1:20/4+rnorm(20,sd=2)
> yvar=2*xvar+xvar*zvar/5+3+rnorm(20,sd=4)
> dat=data.frame(x=xvar,y=yvar,z=zvar)
> dat
            x          y          z
1  -0.8793614  0.2159694  2.0879547
2   2.5509300  8.1415762  2.0642726
3   0.4931142  6.8627566  0.8991300
4   8.7858424 17.5642539 -2.9787034
5   5.9885233 15.2038911  2.4896515
6   3.5385948  8.2293408  1.3877425
7   8.4622872 23.8173481  1.4384090
8  10.2149741 24.5805906 -0.9415048
9  10.7273441 26.7808960  1.2936999
10  9.0838348 30.7526228  3.3358831
11 15.5353435 52.6505709  5.4673591
12 13.1695297 34.2512053  2.7944245
13 11.1362783 35.6025037  4.0253432
14  7.3559003 18.1851647  3.3923899
15 18.3747928 49.1415013  0.9958809
16 15.8651992 52.7105687  3.1700109
17 16.9514292 47.1691760  3.4614201
18 20.8315086 58.7406015  4.3813732
19 21.4636636 78.0409159  6.9500507
20 21.7817040 74.4542008  6.5263515

Let's do the regression analysis with this.

 

> head(dat)
           x          y         z
1 -0.8793614  0.2159694  2.087955
2  2.5509300  8.1415762  2.064273
3  0.4931142  6.8627566  0.899130
4  8.7858424 17.5642539 -2.978703
5  5.9885233 15.2038911  2.489651
6  3.5385948  8.2293408  1.387743
> tail(dat)
          x        y         z
15 18.37479 49.14150 0.9958809
16 15.86520 52.71057 3.1700109
17 16.95143 47.16918 3.4614201
18 20.83151 58.74060 4.3813732
19 21.46366 78.04092 6.9500507
20 21.78170 74.45420 6.5263515

By the way, if the amount of data makes you go "hmm, that's a bit much", try head() or tail(). Just like pandas' head() and tail(), they pull out the head and tail of your data.

 

> head(dfa,5)
  rater1 rater2
1      3      3
2      3      6
3      3      4
4      4      6
5      5      2

You can specify the number of rows like this. Anyway, back to the analysis.

 

Correlation

For the correlation coefficient, this is all you need:

> cor(dat$x,dat$y)
[1] 0.9669456

Need to run a whole batch of correlations and can't do them one by one?

> cor(dat)
          x         y         z
x 1.0000000 0.9669456 0.5479301
y 0.9669456 1.0000000 0.6893575
z 0.5479301 0.6893575 1.0000000

Then just feed it the whole data frame and you get a matrix.

> round(cor(dat),2)
     x    y    z
x 1.00 0.97 0.55
y 0.97 1.00 0.69
z 0.55 0.69 1.00

It rounds for you, too.
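One thing cor() does not give you is a p-value; cor.test() does. A sketch on the same dat as above:

```r
# Rebuild the same dat as above for a self-contained example
set.seed(1)
xvar <- 1:20 + rnorm(20, sd = 3)
zvar <- 1:20/4 + rnorm(20, sd = 2)
yvar <- 2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd = 4)
dat <- data.frame(x = xvar, y = yvar, z = zvar)

# cor() gives only the coefficient; cor.test() also tests H0: correlation = 0
res <- cor.test(dat$x, dat$y)
res$estimate   # 0.9669456 -- same number cor() returned
res$p.value    # significance of the correlation
```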

In a Bradford assay, you run proteins of known amounts through the assay and draw a standard curve from them, computing r^2 along the way; the closer r^2 is to 1, the better it worked. That r-squared is the square of this r. Probably?

 

Linear regression

> fit=lm(y~x,data=dat)
> fit

Call:
lm(formula = y ~ x, data = dat)

Coefficients:
(Intercept)            x  
     -1.962        3.172
> fit=lm(dat$y~dat$x)
> fit

Call:
lm(formula = dat$y ~ dat$x)

Coefficients:
(Intercept)        dat$x  
     -1.962        3.172

This is how you do linear regression. The two calls are identical: the first refers to columns through the data frame (via data=), the second picks the columns out directly as vectors. Intercept is the intercept (here, the y-intercept).
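A quick way to eyeball the fit is base plot() plus abline(); a sketch on the same dat built above (ggplot2 from the install list would work too):

```r
# Rebuild dat for a self-contained example
set.seed(1)
xvar <- 1:20 + rnorm(20, sd = 3)
zvar <- 1:20/4 + rnorm(20, sd = 2)
yvar <- 2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd = 4)
dat <- data.frame(x = xvar, y = yvar, z = zvar)

fit <- lm(y ~ x, data = dat)
plot(y ~ x, data = dat)    # scatter plot of the raw data
abline(fit, col = "red")   # overlay the fitted line y = -1.962 + 3.172*x
coef(fit)                  # intercept and slope as a plain numeric vector
```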

 

> summary(fit)

Call:
lm(formula = dat$y ~ dat$x)

Residuals:
   Min     1Q Median     3Q    Max 
-8.341 -5.304 -1.047  4.505 11.925 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.9619     2.5497  -0.769    0.452    
dat$x         3.1718     0.1971  16.089 3.97e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.894 on 18 degrees of freedom
Multiple R-squared:  0.935,	Adjusted R-squared:  0.9314 
F-statistic: 258.9 on 1 and 18 DF,  p-value: 3.968e-12

If you want the details, use summary().
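The summary object is itself a list, so individual numbers can be pulled straight out of it, which is handy in scripts. A sketch on the same fit:

```r
# Rebuild dat and fit for a self-contained example
set.seed(1)
xvar <- 1:20 + rnorm(20, sd = 3)
zvar <- 1:20/4 + rnorm(20, sd = 2)
yvar <- 2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd = 4)
dat <- data.frame(x = xvar, y = yvar, z = zvar)
fit <- lm(y ~ x, data = dat)

s <- summary(fit)
s$r.squared                       # Multiple R-squared: 0.935
s$coefficients                    # full estimate/SE/t/p matrix
s$coefficients["x", "Pr(>|t|)"]   # just the p-value for the slope
```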

What's that? Several predictors?

> fit2=lm(y~x+z,data=dat)
> fit2

Call:
lm(formula = y ~ x + z, data = dat)

Coefficients:
(Intercept)            x            z  
     -3.145        2.762        2.190
> fit2=lm(dat$y~dat$x+dat$z)
> fit2

Call:
lm(formula = dat$y ~ dat$x + dat$z)

Coefficients:
(Intercept)        dat$x        dat$z  
     -3.145        2.762        2.190

Pick either one. (Intercept is, as mentioned, the intercept.)

 

> summary(fit2)

Call:
lm(formula = dat$y ~ dat$x + dat$z)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4155 -2.8132  0.0799  1.9206  6.6844 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.1445     1.7599  -1.787 0.091817 .  
dat$x         2.7620     0.1610  17.160  3.6e-12 ***
dat$z         2.1896     0.4713   4.646 0.000231 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.025 on 17 degrees of freedom
Multiple R-squared:  0.9714,	Adjusted R-squared:  0.968 
F-statistic: 288.3 on 2 and 17 DF,  p-value: 7.669e-14

But wait, isn't summary supposed to be a summary?

 

Fitting with an interaction term

> fit3=lm(y~x*z,data=dat)
> fit3

Call:
lm(formula = y ~ x * z, data = dat)

Coefficients:
(Intercept)            x            z          x:z  
     2.8057       2.1931      -0.4598       0.1950
> fit3=lm(y~x+z+x:z,data=dat)
> fit3

Call:
lm(formula = y ~ x + z + x:z, data = dat)

Coefficients:
(Intercept)            x            z          x:z  
     2.8057       2.1931      -0.4598       0.1950

You can throw an interaction term into the fit, too. What it actually does, I don't know either.
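For what it's worth, x:z is nothing mysterious: it is just the product x*z entering the model as one more predictor, which you can verify by building the column by hand (a sketch on the same dat):

```r
# Rebuild dat for a self-contained example
set.seed(1)
xvar <- 1:20 + rnorm(20, sd = 3)
zvar <- 1:20/4 + rnorm(20, sd = 2)
yvar <- 2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd = 4)
dat <- data.frame(x = xvar, y = yvar, z = zvar)

fit3 <- lm(y ~ x + z + x:z, data = dat)
dat$xz <- dat$x * dat$z                 # the interaction term, built manually
fit_manual <- lm(y ~ x + z + xz, data = dat)
all.equal(unname(coef(fit3)), unname(coef(fit_manual)))  # TRUE
```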

 

> summary(fit3)

Call:
lm(formula = y ~ x + z + x:z, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5313 -2.5372  0.1603  1.8645  6.7627 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.80570    2.41253   1.163  0.26189    
x            2.19312    0.22715   9.655 4.47e-08 ***
z           -0.45978    0.94445  -0.487  0.63299    
x:z          0.19497    0.06346   3.072  0.00729 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.29 on 16 degrees of freedom
Multiple R-squared:  0.982,	Adjusted R-squared:  0.9786 
F-statistic: 290.7 on 3 and 16 DF,  p-value: 3.673e-14

Of course this one can also be inspected in detail with summary().

 

T-test

> x=floor(runif(25)*100)
> y=floor(runif(25)*105)
> x
 [1] 99 49 48 17 75 45 51 20 22 59 57  7  3 64 92 59 56 52 98 50 68 60 23 25 72
> y
 [1] 47 18 78 11 90 64 58 34 47 52 18 55  7 29 22 29 93 46 81 92 43  6 35 75 35

If fetching data is a hassle, just generate some.

> data=data.frame(x,y)
> head(data)
   x  y
1 99 47
2 49 18
3 48 78
4 17 11
5 75 90
6 45 64

Okay, let's gooo!

 

A plain T-test. Welch's?

> t.test(x,y)

	Welch Two Sample t-test

data:  x and y
t = 0.56293, df = 47.977, p-value = 0.5761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.90445  19.38445
sample estimates:
mean of x mean of y 
    50.84     46.60

Type this and you're done.

 

> t.test(data$x,data$y)

This works too.

 

> t.test(x,y,var.equal=TRUE)

	Two Sample t-test

data:  x and y
t = 0.56293, df = 48, p-value = 0.5761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.90427  19.38427
sample estimates:
mean of x mean of y 
    50.84     46.60

With the var.equal=TRUE option you get Student's t-test; the default is Welch's t-test. Wait, did I say Welch's? No, not the juice... why is it spelled the same?
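One more option worth knowing: alternative= switches between the default two-sided test and one-sided tests. A sketch with freshly generated data (the seed is added here for reproducibility; the post's x and y were unseeded):

```r
set.seed(1)                             # seed is my addition
x <- floor(runif(25) * 100)
y <- floor(runif(25) * 105)

t.test(x, y)                            # two-sided (the default)
t.test(x, y, alternative = "greater")   # H1: mean(x) > mean(y)
t.test(x, y, alternative = "less")      # H1: mean(x) < mean(y)
```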

 

Paired T-test

> t.test(x,y,paired=TRUE)

	Paired t-test

data:  x and y
t = 0.6259, df = 24, p-value = 0.5373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.741436 18.221436
sample estimates:
mean of the differences 
                   4.24

With the paired=TRUE option, a paired T-test is possible too.
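A paired t-test is the same thing as a one-sample t-test on the per-pair differences, which makes for an easy sanity check (sketch; again with a seed added for reproducibility):

```r
set.seed(1)                        # seed is my addition
x <- floor(runif(25) * 100)
y <- floor(runif(25) * 105)

paired  <- t.test(x, y, paired = TRUE)
one_smp <- t.test(x - y, mu = 0)   # same test, done by hand on differences
all.equal(paired$p.value, one_smp$p.value)                       # TRUE
all.equal(unname(paired$statistic), unname(one_smp$statistic))   # TRUE
```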

 

One sample T-test

Huh? But what if I only have one sample?

 

> t.test(x,mu=0)

	One Sample t-test

data:  x
t = 9.6518, df = 24, p-value = 9.776e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 39.9686 61.7114
sample estimates:
mean of x 
    50.84
> t.test(y,mu=0)

	One Sample t-test

data:  y
t = 8.6554, df = 24, p-value = 7.617e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 35.48812 57.71188
sample estimates:
mean of x 
     46.6

Single-sample works too. mu=0 means you test against a population mean of 0. The population mean is mu (μ), the population standard deviation is sigma (σ), and sigma squared is the population variance.

 

Frequency analysis

> data <- read.table(header=TRUE, text='
+  condition result
+    control      0
+    control      0
+    control      0
+    control      0
+  treatment      1
+    control      0
+    control      0
+  treatment      0
+  treatment      1
+    control      1
+  treatment      1
+  treatment      1
+  treatment      1
+  treatment      1
+  treatment      0
+    control      0
+    control      1
+    control      0
+    control      1
+  treatment      0
+  treatment      1
+  treatment      0
+  treatment      0
+    control      0
+  treatment      1
+    control      0
+    control      0
+  treatment      1
+  treatment      0
+  treatment      1
+ ')
> head(data)
  condition result
1   control      0
2   control      0
3   control      0
4   control      0
5 treatment      1
6   control      0

Thanks for the data, but this length, yikes.

It was posted as "in this situation, use this analysis!", apparently...

 

X^2 (Chi-square)

> ct=table(data$result)
> ct

 0  1 
17 13

Note that frequency analysis always starts by building a contingency table of some form. To keep this post's length down, for tables of the same kind I'll skip the code and just note what they were built from.
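Side note: once you have a table of counts, prop.table() turns it into proportions, which is often easier to stare at. A sketch rebuilding the 17/13 table directly from its counts:

```r
# Same counts as the table above, rebuilt by hand
ct <- as.table(c(`0` = 17, `1` = 13))
prop.table(ct)   # shares of each outcome: 17/30 and 13/30
```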

 

> chisq.test(ct)

	Chi-squared test for given probabilities

data:  ct
X-squared = 0.53333, df = 1, p-value = 0.4652

A plain chi-square.

> pt=c(.75,.25) # the probabilities
> chisq.test(ct,p=pt)

	Chi-squared test for given probabilities

data:  ct
X-squared = 5.3778, df = 1, p-value = 0.02039

A chi-square computed under the premise that the expected frequencies differ.

 

> chi_res=chisq.test(ct,p=pt)
> str(chi_res)
List of 9
 $ statistic: Named num 5.38
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named num 1
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.0204
 $ method   : chr "Chi-squared test for given probabilities"
 $ data.name: chr "ct"
 $ observed : 'table' int [1:2(1d)] 17 13
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr [1:2] "0" "1"
 $ expected : Named num [1:2] 22.5 7.5
  ..- attr(*, "names")= chr [1:2] "0" "1"
 $ residuals: 'table' num [1:2(1d)] -1.16 2.01
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr [1:2] "0" "1"
 $ stdres   : 'table' num [1:2(1d)] -2.32 2.32
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr [1:2] "0" "1"
 - attr(*, "class")= chr "htest"

For this one, assign the chi-square result to a variable and look at it with str() to see the details.
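And once it's in a variable you can grab pieces straight off it; the expected counts, for instance, show what the data "should" look like under the given probabilities (a sketch rebuilding ct from its counts):

```r
# Rebuild the 17/13 table and rerun the unequal-frequencies chi-square
ct <- as.table(c(`0` = 17, `1` = 13))
chi_res <- chisq.test(ct, p = c(.75, .25))
chi_res$expected   # 22.5 7.5 -- expected counts under p = (.75, .25)
chi_res$p.value    # 0.02039, same as the printed output above
```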

 

Exact binomial test

> binom.test(ct,p=0.5)

	Exact binomial test

data:  ct
number of successes = 17, number of trials = 30, p-value = 0.5847
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3742735 0.7453925
sample estimates:
probability of success 
             0.5666667

This is an exact binomial test run on the contingency table built above. Funny thing though: the values come out the same even without specifying the probability (and even if you change the probability). Ah, the p-value does differ, mind you.

 

> binom.test(ct,p=0.25)

	Exact binomial test

data:  ct
number of successes = 17, number of trials = 30, p-value = 0.0002157
alternative hypothesis: true probability of success is not equal to 0.25
95 percent confidence interval:
 0.3742735 0.7453925
sample estimates:
probability of success 
             0.5666667

Unlike the chi-square test, you write only a single probability. It seems to work out the complementary one on its own.

 

> bin_res=binom.test(ct,p=0.5)
> str(bin_res)
List of 9
 $ statistic  : Named num 17
  ..- attr(*, "names")= chr "number of successes"
 $ parameter  : Named num 30
  ..- attr(*, "names")= chr "number of trials"
 $ p.value    : Named num 0.585
  ..- attr(*, "names")= chr "0"
 $ conf.int   : num [1:2] 0.374 0.745
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 0.567
  ..- attr(*, "names")= chr "probability of success"
 $ null.value : Named num 0.5
  ..- attr(*, "names")= chr "probability of success"
 $ alternative: chr "two.sided"
 $ method     : chr "Exact binomial test"
 $ data.name  : chr "ct"
 - attr(*, "class")= chr "htest"

Same as the chi-square: to see the details, assign it to a variable and look with str().

 

X^2 (Chi-square)

> ct=table(data$condition,data$result)
> ct
           
             0  1
  control   11  3
  treatment  6 10

This one tests independence between groups, so the table is a bit different. The earlier table was built from the result values alone; this one goes in as result by condition.

 

> chisq.test(ct)

	Pearson's Chi-squared test with Yates' continuity correction

data:  ct
X-squared = 3.593, df = 1, p-value = 0.05802

Well well, Pearson, we meet again.

> chisq.test(ct,correct=FALSE)

	Pearson's Chi-squared test

data:  ct
X-squared = 5.1293, df = 1, p-value = 0.02353

With the correct=FALSE option, Yates's correction for continuity is dropped. ...But is it really okay to just drop that?

 

Fisher’s exact test

> fisher.test(ct)

	Fisher's Exact Test for Count Data

data:  ct
p-value = 0.03293
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.966861 45.555016
sample estimates:
odds ratio 
  5.714369

This is Fisher's exact test, the analysis you use when the sample size is small. Yes, more data is generally better in statistics, but it doesn't always work out that way. Think rare-disease patients, for instance...

 

Cochran-Mantel-Haenszel test

fish <- read.table(header=TRUE, text='
  Location Allele   Habitat Count
 tillamook     94    marine    56
 tillamook     94 estuarine    69
 tillamook non-94    marine    40
 tillamook non-94 estuarine    77
   yaquina     94    marine    61
   yaquina     94 estuarine   257
   yaquina non-94    marine    57
   yaquina non-94 estuarine   301
     alsea     94    marine    73
     alsea     94 estuarine    65
     alsea non-94    marine    71
     alsea non-94 estuarine    79
    umpqua     94    marine    71
    umpqua     94 estuarine    48
    umpqua non-94    marine    55
    umpqua non-94 estuarine    48
')
> head(fish)
   Location Allele   Habitat Count
1 tillamook     94    marine    56
2 tillamook     94 estuarine    69
3 tillamook non-94    marine    40
4 tillamook non-94 estuarine    77
5   yaquina     94    marine    61
6   yaquina     94 estuarine   257

The Cochran-Mantel-Haenszel test takes yet another table format. Save me.

 

> ct=xtabs(Count~Allele+Habitat+Location,data=fish)
> ct
, , Location = alsea

        Habitat
Allele   estuarine marine
  94            65     73
  non-94        79     71

, , Location = tillamook

        Habitat
Allele   estuarine marine
  94            69     56
  non-94        77     40

, , Location = umpqua

        Habitat
Allele   estuarine marine
  94            48     71
  non-94        48     55

, , Location = yaquina

        Habitat
Allele   estuarine marine
  94           257     61
  non-94       301     57

Built as a plain contingency table, this one comes out three-dimensional, since you get a separate table per location, so

> ftable(ct)
                 Location alsea tillamook umpqua yaquina
Allele Habitat                                          
94     estuarine             65        69     48     257
       marine                73        56     71      61
non-94 estuarine             79        77     48     301
       marine                71        40     55      57

we flatten it back down to two dimensions with ftable().

> mantelhaen.test(ct)

	Mantel-Haenszel chi-squared test with continuity correction

data:  ct
Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.6005522 0.9593077
sample estimates:
common odds ratio 
         0.759022

And then you run the analysis... this analysis stuff is brutal.

 

McNemar’s test

> data <- read.table(header=TRUE, text='
+  subject time result
+        1  pre      0
+        1 post      1
+        2  pre      1
+        2 post      1
+        3  pre      0
+        3 post      1
+        4  pre      1
+        4 post      0
+        5  pre      1
+        5 post      1
+        6  pre      0
+        6 post      1
+        7  pre      0
+        7 post      1
+        8  pre      0
+        8 post      1
+        9  pre      0
+        9 post      1
+       10  pre      1
+       10 post      1
+       11  pre      0
+       11 post      0
+       12  pre      1
+       12 post      1
+       13  pre      0
+       13 post      1
+       14  pre      0
+       14 post      0
+       15  pre      0
+       15 post      1
+ ')
> head(data)
  subject time result
1       1  pre      0
2       1 post      1
3       2  pre      1
4       2 post      1
5       3  pre      0
6       3 post      1

McNemar's test runs on yet another dataset. Please, somebody save me.

 

> library(tidyr)
> data_wide <- spread(data, time, result)
> head(data_wide)
  subject post pre
1       1    1   0
2       2    1   1
3       3    1   0
4       4    0   1
5       5    1   1
6       6    1   0

Wait, haven't I seen this kind of reshaping somewhere before?

> ct=table(data_wide[,c("pre","post")])
> ct
   post
pre 0 1
  0 2 8
  1 1 4

Anyway, build the table...

 

> mcnemar.test(ct)

	McNemar's Chi-squared test with continuity correction

data:  ct
McNemar's chi-squared = 4, df = 1, p-value = 0.0455

Ana⭐lyze

 

> library(exact2x2)
Loading required package: exactci
Loading required package: ssanv
Loading required package: testthat

Attaching package: ‘testthat’

The following object is masked from ‘package:tidyr’:

    matches

> mcnemar.exact(ct)

	Exact McNemar test (with central confidence intervals)

data:  ct
b = 8, c = 1, p-value = 0.03906
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   1.072554 354.981246
sample estimates:
odds ratio 
         8

To use the exact version, you need to install the exact2x2 package. (Yes, the one from the install list up top.)

 

ANOVA

I've heard the name, but what is it? Wasn't that the name of some software?

 

> data <- read.table(header=TRUE, text='
+  subject sex   age before after
+        1   F   old    9.5   7.1
+        2   M   old   10.3  11.0
+        3   M   old    7.5   5.8
+        4   F   old   12.4   8.8
+        5   M   old   10.2   8.6
+        6   M   old   11.0   8.0
+        7   M young    9.1   3.0
+        8   F young    7.9   5.2
+        9   F   old    6.6   3.4
+       10   M young    7.7   4.0
+       11   M young    9.4   5.3
+       12   M   old   11.6  11.3
+       13   M young    9.9   4.6
+       14   F young    8.6   6.4
+       15   F young   14.3  13.5
+       16   F   old    9.2   4.7
+       17   M young    9.8   5.1
+       18   F   old    9.9   7.3
+       19   F young   13.0   9.5
+       20   M young   10.2   5.4
+       21   M young    9.0   3.7
+       22   F young    7.9   6.2
+       23   M   old   10.1  10.0
+       24   M young    9.0   1.7
+       25   M young    8.6   2.9
+       26   M young    9.4   3.2
+       27   M young    9.7   4.7
+       28   M young    9.3   4.9
+       29   F young   10.7   9.8
+       30   M   old    9.3   9.4
+ ')
> head(data)
  subject sex age before after
1       1   F old    9.5   7.1
2       2   M old   10.3  11.0
3       3   M old    7.5   5.8
4       4   F old   12.4   8.8
5       5   M old   10.2   8.6
6       6   M old   11.0   8.0

No idea what the cutoff between old and young in Age is, but moving on.

> data$subject=factor(data$subject)

Before running ANOVA on this data, you have to factor the subject column so it doesn't get treated as continuous data. For the record, ANOVA comes in one-way and two-way varieties.
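What the factoring guards against, in miniature (a hypothetical toy data frame, not the post's table):

```r
# Toy data frame: subject starts life as plain integers
df <- data.frame(subject = 1:4, score = c(9.5, 10.3, 7.5, 12.4))
class(df$subject)    # "integer" -- a model formula would fit one slope for it

df$subject <- factor(df$subject)
class(df$subject)    # "factor" -- a model formula now fits per-group effects
nlevels(df$subject)  # 4
```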

 

one-way ANOVA

> aov1=aov(before~sex,data=data)
> aov1
Call:
   aov(formula = before ~ sex, data = data)

Terms:
                     sex Residuals
Sum of Squares   1.52861  74.70105
Deg. of Freedom        1        28

Residual standard error: 1.633369
Estimated effects may be unbalanced
> summary(aov1)
            Df Sum Sq Mean Sq F value Pr(>F)
sex          1   1.53   1.529   0.573  0.455
Residuals   28  74.70   2.668
> model.tables(aov1,"means")
Tables of means
Grand mean
         
9.703333 

 sex 
     F      M
    10  9.532
rep 11 19.000

 

two-way ANOVA

> aov3=aov(after~sex*age,data=data)
> aov3
Call:
   aov(formula = after ~ sex * age, data = data)

Terms:
                      sex       age   sex:age Residuals
Sum of Squares   16.07755  38.96110  89.61138 103.51164
Deg. of Freedom         1         1         1        26

Residual standard error: 1.995299
Estimated effects may be unbalanced
> summary(aov3)
            Df Sum Sq Mean Sq F value  Pr(>F)    
sex          1  16.08   16.08   4.038  0.0550 .  
age          1  38.96   38.96   9.786  0.0043 ** 
sex:age      1  89.61   89.61  22.509 6.6e-05 ***
Residuals   26 103.51    3.98                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model.tables(aov3,"means")
Tables of means
Grand mean
         
6.483333 

 sex 
         F      M
     7.445  5.926
rep 11.000 19.000

 age 
       old  young
     7.874  5.556
rep 12.000 18.000

 sex:age 
     age
sex   old    young 
  F    6.260  8.433
  rep  5.000  6.000
  M    9.157  4.042
  rep  7.000 12.000
> aov3=aov(after~sex+age+sex:age,data=data)
> summary(aov3)
            Df Sum Sq Mean Sq F value  Pr(>F)    
sex          1  16.08   16.08   4.038  0.0550 .  
age          1  38.96   38.96   9.786  0.0043 ** 
sex:age      1  89.61   89.61  22.509 6.6e-05 ***
Residuals   26 103.51    3.98                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These two are the same code.

 

Tukey HSD post-hoc test

> TukeyHSD(aov1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = before ~ sex, data = data)

$sex
          diff       lwr       upr     p adj
M-F -0.4684211 -1.736038 0.7991961 0.4554053

One-way

 

> TukeyHSD(aov3)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = after ~ sex + age + sex:age, data = data)

$sex
         diff       lwr        upr     p adj
M-F -1.519139 -3.073025 0.03474709 0.0549625

$age
              diff       lwr        upr     p adj
young-old -2.31785 -3.846349 -0.7893498 0.0044215

$`sex:age`
                      diff        lwr        upr     p adj
M:old-F:old      2.8971429 -0.3079526  6.1022384 0.0869856
F:young-F:old    2.1733333 -1.1411824  5.4878491 0.2966111
M:young-F:old   -2.2183333 -5.1319553  0.6952887 0.1832890
F:young-M:old   -0.7238095 -3.7691188  2.3214997 0.9138789
M:young-M:old   -5.1154762 -7.7187601 -2.5121923 0.0000676
M:young-F:young -4.3916667 -7.1285380 -1.6547953 0.0008841

Two-way

 

You can also run a Tukey HSD analysis on top of the ANOVA results.

 

ANOVAs with within-subjects variables

> data_long <- gather(data, time, value, before:after)
> head(data_long)
  subject sex age   time value
1       1   F old before   9.5
2       2   M old before  10.3
3       3   M old before   7.5
4       4   F old before  12.4
5       5   M old before  10.2
6       6   M old before  11.0
> data_long$subject <- factor(data_long$subject)

The table was reshaped into long format so the within-subjects variable can be included. (Yes, that's the same subject column we factored earlier.)

 

> aov_time=aov(value~time+Error(subject/time),data=data_long)
> aov_time

Call:
aov(formula = value ~ time + Error(subject/time), data = data_long)

Grand Mean: 8.093333

Stratum 1: subject

Terms:
                Residuals
Sum of Squares   261.2473
Deg. of Freedom        29

Residual standard error: 3.001421

Stratum 2: subject:time

Terms:
                   time Residuals
Sum of Squares  155.526    63.144
Deg. of Freedom       1        29

Residual standard error: 1.475595
Estimated effects are balanced

One-way

 

> aov_age_time=aov(value~age*time+Error(subject/time),data=data_long)
> aov_age_time

Call:
aov(formula = value ~ age * time + Error(subject/time), data = data_long)

Grand Mean: 8.093333

Stratum 1: subject

Terms:
                      age Residuals
Sum of Squares   24.44011 236.80722
Deg. of Freedom         1        28

Residual standard error: 2.908161
1 out of 2 effects not estimable
Estimated effects are balanced

Stratum 2: subject:time

Terms:
                   time age:time Residuals
Sum of Squares  155.526   18.769    44.375
Deg. of Freedom       1        1        28

Residual standard error: 1.258897
Estimated effects may be unbalanced

Mixed design