
Learning R - 7. Statistical analysis (Part 2)


Maybe I should have split this into three parts...

You've all got the libraries installed, right?


Logistic regression

> dat=subset(mtcars,select=c(mpg,am,vs))
> dat
                     mpg am vs
Mazda RX4           21.0  1  0
Mazda RX4 Wag       21.0  1  0
Datsun 710          22.8  1  1
Hornet 4 Drive      21.4  0  1
Hornet Sportabout   18.7  0  0
Valiant             18.1  0  1
Duster 360          14.3  0  0
Merc 240D           24.4  0  1
Merc 230            22.8  0  1
Merc 280            19.2  0  1
Merc 280C           17.8  0  1
Merc 450SE          16.4  0  0
Merc 450SL          17.3  0  0
Merc 450SLC         15.2  0  0
Cadillac Fleetwood  10.4  0  0
Lincoln Continental 10.4  0  0
Chrysler Imperial   14.7  0  0
Fiat 128            32.4  1  1
Honda Civic         30.4  1  1
Toyota Corolla      33.9  1  1
Toyota Corona       21.5  0  1
Dodge Challenger    15.5  0  0
AMC Javelin         15.2  0  0
Camaro Z28          13.3  0  0
Pontiac Firebird    19.2  0  0
Fiat X1-9           27.3  1  1
Porsche 914-2       26.0  1  0
Lotus Europa        30.4  1  1
Ford Pantera L      15.8  1  0
Ferrari Dino        19.7  1  0
Maserati Bora       15.0  1  0
Volvo 142E          21.4  1  1

This is apparently one of R's built-in datasets...
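If you're wondering what each column actually means, the built-in help page spells it out. A quick look I added myself:

?mtcars      # documentation for the built-in dataset (explains mpg, vs, am, ...)
str(dat)     # quick peek at the three columns we kept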

 

The predictor is continuous, but the outcome is binary (it's either this or that).

> logr_vm=glm(vs~mpg,data=dat,family=binomial)
> logr_vm

Call:  glm(formula = vs ~ mpg, family = binomial, data = dat)

Coefficients:
(Intercept)          mpg  
    -8.8331       0.4304  

Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
Null Deviance:	    43.86 
Residual Deviance: 25.53 	AIC: 29.53
> logr_vm <- glm(vs ~ mpg, data=dat, family=binomial(link="logit"))
> logr_vm

Call:  glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)

Coefficients:
(Intercept)          mpg  
    -8.8331       0.4304  

Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
Null Deviance:	    43.86 
Residual Deviance: 25.53 	AIC: 29.53

Either of the two forms works; the results are identical anyway, since binomial defaults to the logit link...
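One thing I'll add myself: the coefficients above are on the log-odds scale, so if you want odds ratios, something like this should do it (exp() just undoes the log; confint() profiles the likelihood, so it takes a moment):

exp(coef(logr_vm))      # odds ratio per 1-unit increase in mpg
exp(confint(logr_vm))   # 95% confidence intervals on the odds-ratio scale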

 

> summary(logr_vm)

Call:
glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2127  -0.5121  -0.2276   0.6402   1.6980  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -8.8331     3.1623  -2.793  0.00522 **
mpg           0.4304     0.1584   2.717  0.00659 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.860  on 31  degrees of freedom
Residual deviance: 25.533  on 30  degrees of freedom
AIC: 29.533

Number of Fisher Scoring iterations: 6

Just like the regression from before, summary() shows the full details. Makes sense, I suppose, since this is regression too; the other one was just the linear kind.
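summary() reports Wald z-tests for each coefficient; if you'd rather have the deviance-based (likelihood-ratio) test, I believe this also works on the same fit (my own addition, not from the original post):

anova(logr_vm, test = "Chisq")   # null model vs. model with mpg, via the change in deviance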

 

> library(ggplot2)
> ggplot(dat, aes(x=mpg, y=vs)) + geom_point() + 
+   stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
`geom_smooth()` using formula 'y ~ x'

When I first saw geom I wondered if it was named that way because it draws points... lol. Oh, and you need ggplot2 installed to draw this graph. The next chapter uses it too, so just install it now.

 

> par(mar=c(4,4,1,1))
> plot(dat$mpg,dat$vs)
> curve(predict(logr_vm,data.frame(mpg=x),type="response"),add=TRUE)

This one should work even without ggplot2...

 

Both the predictor and the outcome are binary (this-or-that on both sides). What kind of sigmoid is that supposed to be?

> logr_va=glm(vs~am,data=dat,family=binomial)
> logr_va

Call:  glm(formula = vs ~ am, family = binomial, data = dat)

Coefficients:
(Intercept)           am  
    -0.5390       0.6931  

Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
Null Deviance:	    43.86 
Residual Deviance: 42.95 	AIC: 46.95

So what is am, anyway? Surely not AM/PM... and not an angle either. (In mtcars, am is the transmission type: 0 = automatic, 1 = manual.)
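Since am only takes the values 0 and 1, the fitted model boils down to two predicted probabilities. A quick check I added:

predict(logr_va, data.frame(am = c(0, 1)), type = "response")   # P(vs = 1) for am = 0 and am = 1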

 

> ggplot(dat, aes(x=am, y=vs)) + 
+   geom_point(shape=1, position=position_jitter(width=.05,height=.05)) + 
+   stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
`geom_smooth()` using formula 'y ~ x'

Hey, give me back my integral... the logistic curve went and turned into an integral sign.

 

> par(mar=c(4,4,1,1))
> plot(jitter(dat$am,.2),jitter(dat$vs,.2))
> curve(predict(logr_va, data.frame(am=x), type="response"), add=TRUE)

(facepalm)

 

An example where the predictors are a mix of binary and continuous, and the outcome is binary

> logr_vma=glm(vs~mpg+am,data=dat,family=binomial)
> logr_vma

Call:  glm(formula = vs ~ mpg + am, family = binomial, data = dat)

Coefficients:
(Intercept)          mpg           am  
   -12.7051       0.6809      -3.0073  

Degrees of Freedom: 31 Total (i.e. Null);  29 Residual
Null Deviance:	    43.86 
Residual Deviance: 20.65 	AIC: 26.65

(In other words, this is roughly how you run that kind of analysis.)

 

Multiple predictors with an interaction

> logr_vmai=glm(vs~mpg*am,data=dat,family=binomial)
> logr_vmai

Call:  glm(formula = vs ~ mpg * am, family = binomial, data = dat)

Coefficients:
(Intercept)          mpg           am       mpg:am  
   -20.4784       1.1084      10.1055      -0.6637  

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:	    43.86 
Residual Deviance: 19.12 	AIC: 27.12
> logr_vmai=glm(vs~mpg+am+mpg:am,data=dat,family=binomial)
> logr_vmai

Call:  glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat)

Coefficients:
(Intercept)          mpg           am       mpg:am  
   -20.4784       1.1084      10.1055      -0.6637  

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:	    43.86 
Residual Deviance: 19.12 	AIC: 27.12

Either of the two forms works; mpg*am is just shorthand for mpg + am + mpg:am. But why is there no graph for this one?
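The original doesn't plot this one, but if you want a picture, a sketch like this (my own addition) should draw a separate logistic curve for each am group, which is exactly what the interaction term allows:

library(ggplot2)
ggplot(dat, aes(x = mpg, y = vs, colour = factor(am))) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)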

 

Homogeneity of variance tests

> head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A

This one, and

> tg=ToothGrowth
> tg$dose=factor(tg$dose)
> head(tg)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

this one; we'll use both. The difference is whether there's one independent variable or several. Both look like built-in datasets.

 

> plot(count~spray,data=InsectSprays)

 

> plot(len ~ interaction(dose,supp), data=ToothGrowth)

The box plots for each dataset look roughly like this.

 

Bartlett’s test

> bartlett.test(count~spray,data=InsectSprays)

	Bartlett test of homogeneity of variances

data:  count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

When there is a single independent variable

 

> bartlett.test(len~interaction(supp,dose),data=ToothGrowth)

	Bartlett test of homogeneity of variances

data:  len by interaction(supp, dose)
Bartlett's K-squared = 6.9273, df = 5, p-value = 0.2261

When there are several independent variables (they have to be combined with interaction())

 

Levene’s test

This one needs yet another package... go install car first.
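If you haven't got it yet, something like this (the install is one time only):

install.packages("car")
library(car)    # leveneTest() comes from here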

 

> leveneTest(count~spray,data=InsectSprays)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  5  3.8214 0.004223 **
      66                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

When there is a single independent variable

 

> leveneTest(len~supp*dose,data=tg)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54

When there are several independent variables

 

Fligner-Killeen test

> fligner.test(count~spray,data=InsectSprays)

	Fligner-Killeen test of homogeneity of variances

data:  count by spray
Fligner-Killeen:med chi-squared = 14.483, df = 5, p-value = 0.01282

When there is a single independent variable

 

> fligner.test(len~interaction(supp,dose),data=ToothGrowth)

	Fligner-Killeen test of homogeneity of variances

data:  len by interaction(supp, dose)
Fligner-Killeen:med chi-squared = 7.7488, df = 5, p-value = 0.1706

When there are several independent variables (again, grouped with interaction())
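Just to line the three tests up side by side on InsectSprays, a little sketch I added (the indexing that pulls the p-value out of leveneTest's table is based on the layout shown above):

library(car)
pvals <- c(
  Bartlett = bartlett.test(count ~ spray, data = InsectSprays)$p.value,
  Levene   = leveneTest(count ~ spray, data = InsectSprays)[1, "Pr(>F)"],
  Fligner  = fligner.test(count ~ spray, data = InsectSprays)$p.value
)
round(pvals, 4)   # Bartlett and Fligner expose $p.value; Levene's sits in its ANOVA-style table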

 

Inter-rater reliability

Yes, go install irr and come back.

 

> library(irr)
Loading required package: lpSolve
> data(diagnoses)
> dat=diagnoses[,1:3]
> head(dat)
                   rater1                  rater2                  rater3
1             4. Neurosis             4. Neurosis             4. Neurosis
2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
3 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
4                5. Other                5. Other                5. Other
5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
6           1. Depression           1. Depression        3. Schizophrenia

And then we pull in this dataset...

 

Cohen's kappa

> kappa2(dat[,c(1,2)],"unweighted")
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 30 
   Raters = 2 
    Kappa = 0.651 

        z = 7 
  p-value = 2.63e-12

Huh? Weights? I'll explain those later. For now this is all you need.
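If you want to see what the kappa is actually summarizing, a plain cross-tab of the two raters (my own addition) puts the agreements on the diagonal:

table(dat[, 1], dat[, 2])   # rows = rater1, columns = rater2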

 

N raters: Fleiss’s Kappa, Conger’s Kappa

> kappam.fleiss(dat)
 Fleiss' Kappa for m Raters

 Subjects = 30 
   Raters = 3 
    Kappa = 0.534 

        z = 9.89 
  p-value = 0

Fleiss's kappa

 

> kappam.fleiss(dat,exact=TRUE)
 Fleiss' Kappa for m Raters (exact value)

 Subjects = 30 
   Raters = 3 
    Kappa = 0.55

Conger's exact kappa

 

Cohen's kappa (weighted)

Apparently, when the ratings are ordinal, weights get added...
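As far as I understand it, the weighted variants just penalize big disagreements more heavily than near-misses. Roughly, with o_ij the observed and e_ij the chance-expected cell proportions over k ordered categories:

\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, o_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}},
\qquad w_{ij} = \frac{|i-j|}{k-1} \;(\text{equal}) \quad\text{or}\quad w_{ij} = \frac{(i-j)^2}{(k-1)^2} \;(\text{squared})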

 

> data(anxiety)
> dfa=anxiety[,c(1,2)]
> head(dfa)
  rater1 rater2
1      3      3
2      3      6
3      3      4
4      4      6
5      5      2
6      5      4

We'll be using this data. ...Wait, was this supposed to be ordered? (The ratings run from 1 to 6, so apparently yes.)

 

> kappa2(dfa,"squared")
 Cohen's Kappa for 2 Raters (Weights: squared)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.297 

        z = 1.34 
  p-value = 0.18

Weight=squared

 

> kappa2(dfa,"equal")
 Cohen's Kappa for 2 Raters (Weights: equal)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.189 

        z = 1.42 
  p-value = 0.157

Weight=equal

 

> kappa2(dfa,"unweighted")
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.119 

        z = 1.16 
  p-value = 0.245

Unweighted

 

Cohen's kappa (weighted), with factors

Factor: yes, the thing with levels.

 

> dfa2=dfa
> dfa2$rater1=factor(dfa2$rater1,levels=1:6,labels=LETTERS[1:6])
> dfa2$rater2=factor(dfa2$rater2,levels=1:6,labels=LETTERS[1:6])
> head(dfa2)
  rater1 rater2
1      C      C
2      C      F
3      C      D
4      D      F
5      E      B
6      E      D

I took the data loaded above, converted the columns to factors, and assigned levels.

> levels(dfa2$rater1)
[1] "A" "B" "C" "D" "E" "F"
> levels(dfa2$rater2)
[1] "A" "B" "C" "D" "E" "F"

And this is the result. (They show up as letters because the labels were set to LETTERS[1:6].)

 

> kappa2(dfa2,"squared")
 Cohen's Kappa for 2 Raters (Weights: squared)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.297 

        z = 1.34 
  p-value = 0.18

Weight=squared

 

> kappa2(dfa2,"equal")
 Cohen's Kappa for 2 Raters (Weights: equal)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.189 

        z = 1.42 
  p-value = 0.157

Weight=equal

 

> kappa2(dfa2,"unweighted")
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 20 
   Raters = 2 
    Kappa = 0.119 

        z = 1.16 
  p-value = 0.245

Unweighted

 

ICC (intraclass correlation coefficient)

What kind of out-of-this-world jargon is this now... (despair)

 

> icc(anxiety,model="twoway",type="agreement")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : agreement 

   Subjects = 20 
     Raters = 3 
   ICC(A,1) = 0.198

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
 F(19,39.7) = 1.83 , p = 0.0543 

 95%-Confidence Interval for ICC Population Values:
  -0.039 < ICC < 0.494

Seriously, what kind of out-of-this-world term is this...
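If you'd rather score the average of the three raters instead of a single rater, I believe the same function takes unit = "average":

icc(anxiety, model = "twoway", type = "agreement", unit = "average")   # average-measures ICC instead of ICC(A,1)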