Maybe I should have split this into three parts...
You've all installed the libraries, right?
Logistic regression
> dat=subset(mtcars,select=c(mpg,am,vs))
> dat
mpg am vs
Mazda RX4 21.0 1 0
Mazda RX4 Wag 21.0 1 0
Datsun 710 22.8 1 1
Hornet 4 Drive 21.4 0 1
Hornet Sportabout 18.7 0 0
Valiant 18.1 0 1
Duster 360 14.3 0 0
Merc 240D 24.4 0 1
Merc 230 22.8 0 1
Merc 280 19.2 0 1
Merc 280C 17.8 0 1
Merc 450SE 16.4 0 0
Merc 450SL 17.3 0 0
Merc 450SLC 15.2 0 0
Cadillac Fleetwood 10.4 0 0
Lincoln Continental 10.4 0 0
Chrysler Imperial 14.7 0 0
Fiat 128 32.4 1 1
Honda Civic 30.4 1 1
Toyota Corolla 33.9 1 1
Toyota Corona 21.5 0 1
Dodge Challenger 15.5 0 0
AMC Javelin 15.2 0 0
Camaro Z28 13.3 0 0
Pontiac Firebird 19.2 0 0
Fiat X1-9 27.3 1 1
Porsche 914-2 26.0 1 0
Lotus Europa 30.4 1 1
Ford Pantera L 15.8 1 0
Ferrari Dino 19.7 1 0
Maserati Bora 15.0 1 0
Volvo 142E 21.4 1 1
This is a built-in dataset (mtcars), by the way...
Here the predictor is continuous but the outcome is binary (it's one thing or the other).
> logr_vm=glm(vs~mpg,data=dat,family=binomial)
> logr_vm
Call: glm(formula = vs ~ mpg, family = binomial, data = dat)
Coefficients:
(Intercept) mpg
-8.8331 0.4304
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 43.86
Residual Deviance: 25.53 AIC: 29.53
> logr_vm <- glm(vs ~ mpg, data=dat, family=binomial(link="logit"))
> logr_vm
Call: glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
Coefficients:
(Intercept) mpg
-8.8331 0.4304
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 43.86
Residual Deviance: 25.53 AIC: 29.53
Either form works; the results are identical anyway... (binomial defaults to the logit link.)
> summary(logr_vm)
Call:
glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2127 -0.5121 -0.2276 0.6402 1.6980
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.8331 3.1623 -2.793 0.00522 **
mpg 0.4304 0.1584 2.717 0.00659 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.860 on 31 degrees of freedom
Residual deviance: 25.533 on 30 degrees of freedom
AIC: 29.533
Number of Fisher Scoring iterations: 6
As with linear regression, summary() shows the full details. It is still a regression, after all; it just isn't a linear one.
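The coefficients in the summary are on the log-odds scale, which is hard to read directly. A minimal sketch of how to turn them into odds ratios and probabilities, reusing the logr_vm model from above (the mpg = 20 value is just an arbitrary example):

```r
# Refit the model from above so this block runs on its own.
dat <- subset(mtcars, select = c(mpg, am, vs))
logr_vm <- glm(vs ~ mpg, data = dat, family = binomial)

# exp() of a coefficient gives the odds ratio: the multiplicative
# change in the odds of vs == 1 per one-unit increase in mpg.
exp(coef(logr_vm))

# Predicted probability that vs == 1 for a car with mpg = 20.
predict(logr_vm, newdata = data.frame(mpg = 20), type = "response")
```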
> library(ggplot2)
> ggplot(dat, aes(x=mpg, y=vs)) + geom_point() +
+ stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
`geom_smooth()` using formula 'y ~ x'
At first I saw "geom" and wondered if it was spelled that way because of the points... lol. Oh, and you need ggplot2 installed to draw this graph. The next chapter uses it too, so just install it now.
> par(mar=c(4,4,1,1))
> plot(dat$mpg,dat$vs)
> curve(predict(logr_vm,data.frame(mpg=x),type="response"),add=TRUE)
This version works without ggplot2...
Next: both the predictor and the outcome are binary (each is one thing or the other). What kind of sigmoid is that supposed to be?
> logr_va=glm(vs~am,data=dat,family=binomial)
> logr_va
Call: glm(formula = vs ~ am, family = binomial, data = dat)
Coefficients:
(Intercept) am
-0.5390 0.6931
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 43.86
Residual Deviance: 42.95 AIC: 46.95
So what is am? Not a.m./p.m., obviously, and not an angle either. In mtcars it's the transmission type (0 = automatic, 1 = manual).
> ggplot(dat, aes(x=am, y=vs)) +
+ geom_point(shape=1, position=position_jitter(width=.05,height=.05)) +
+ stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
`geom_smooth()` using formula 'y ~ x'
Hey, give me back my curve. The logistic S has turned into a mysterious integral sign.
> par(mar=c(4,4,1,1))
> plot(jitter(dat$am,.2),jitter(dat$vs,.2))
> curve(predict(logr_va, data.frame(am=x), type="response"), add=TRUE)
(facepalm)
An example where the predictors are a mix of binary and continuous, and the outcome is binary.
> logr_vma=glm(vs~mpg+am,data=dat,family=binomial)
> logr_vma
Call: glm(formula = vs ~ mpg + am, family = binomial, data = dat)
Coefficients:
(Intercept) mpg am
-12.7051 0.6809 -3.0073
Degrees of Freedom: 31 Total (i.e. Null); 29 Residual
Null Deviance: 43.86
Residual Deviance: 20.65 AIC: 26.65
(In short: this is how you run the analysis.)
Multiple predictors with an interaction
> logr_vmai=glm(vs~mpg*am,data=dat,family=binomial)
> logr_vmai
Call: glm(formula = vs ~ mpg * am, family = binomial, data = dat)
Coefficients:
(Intercept) mpg am mpg:am
-20.4784 1.1084 10.1055 -0.6637
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 43.86
Residual Deviance: 19.12 AIC: 27.12
> logr_vmai=glm(vs~mpg+am+mpg:am,data=dat,family=binomial)
> logr_vmai
Call: glm(formula = vs ~ mpg + am + mpg:am, family = binomial, data = dat)
Coefficients:
(Intercept) mpg am mpg:am
-20.4784 1.1084 10.1055 -0.6637
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 43.86
Residual Deviance: 19.12 AIC: 27.12
Use either one; mpg*am and mpg+am+mpg:am are equivalent formulas. But why is there no graph for this one?
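Whether the interaction term actually buys anything can be checked with a likelihood-ratio test on the two nested models; a sketch (this step isn't part of the original walkthrough):

```r
# Refit both models so this block runs on its own.
dat <- subset(mtcars, select = c(mpg, am, vs))
logr_vma  <- glm(vs ~ mpg + am, data = dat, family = binomial)  # additive
logr_vmai <- glm(vs ~ mpg * am, data = dat, family = binomial)  # + interaction

# Likelihood-ratio test: is the drop in residual deviance
# (20.65 -> 19.12) worth the extra parameter?
anova(logr_vma, logr_vmai, test = "Chisq")
```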
Testing homogeneity of variance
> head(InsectSprays)
count spray
1 10 A
2 7 A
3 20 A
4 14 A
5 14 A
6 12 A
We'll use this, and
> tg=ToothGrowth
> tg$dose=factor(tg$dose)
> head(tg)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
also this. The difference is whether there is one independent variable or several. Both appear to be built-in datasets.
> plot(count~spray,data=InsectSprays)
> plot(len ~ interaction(dose,supp), data=ToothGrowth)
The box plots for each dataset look roughly like this.
Bartlett’s test
> bartlett.test(count~spray,data=InsectSprays)
Bartlett test of homogeneity of variances
data: count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05
With a single independent variable.
> bartlett.test(len~interaction(supp,dose),data=ToothGrowth)
Bartlett test of homogeneity of variances
data: len by interaction(supp, dose)
Bartlett's K-squared = 6.9273, df = 5, p-value = 0.2261
With multiple independent variables (they have to be combined with interaction()).
Levene’s test
This one needs yet another package... go install car first.
> leveneTest(count~spray,data=InsectSprays)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.8214 0.004223 **
66
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
With a single independent variable.
> leveneTest(len~supp*dose,data=tg)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 1.7086 0.1484
54
With multiple independent variables.
Fligner-Killeen test
> fligner.test(count~spray,data=InsectSprays)
Fligner-Killeen test of homogeneity of variances
data: count by spray
Fligner-Killeen:med chi-squared = 14.483, df = 5, p-value = 0.01282
With a single independent variable.
> fligner.test(len~interaction(supp,dose),data=ToothGrowth)
Fligner-Killeen test of homogeneity of variances
data: len by interaction(supp, dose)
Fligner-Killeen:med chi-squared = 7.7488, df = 5, p-value = 0.1706
With multiple independent variables (again, combine them with interaction()).
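All of these functions return standard htest objects, so as a sketch you can pull the p-values out and line them up (using the two base-R tests here, so no extra packages are needed). Note how Bartlett's p-value is much smaller: Bartlett's test is sensitive to non-normality, while Fligner-Killeen is more robust.

```r
# Both tests are in base R, so no packages are needed.
b <- bartlett.test(count ~ spray, data = InsectSprays)
f <- fligner.test(count ~ spray, data = InsectSprays)

# Each result is an htest object; $p.value extracts the p-value.
c(bartlett = b$p.value, fligner = f$p.value)
```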
Inter-rater reliability
Yes, go install irr and come back.
> library(irr)
Loading required package: lpSolve
> data(diagnoses)
> dat=diagnoses[,1:3]
> head(dat)
rater1 rater2 rater3
1 4. Neurosis 4. Neurosis 4. Neurosis
2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
3 2. Personality Disorder 3. Schizophrenia 3. Schizophrenia
4 5. Other 5. Other 5. Other
5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
6 1. Depression 1. Depression 3. Schizophrenia
And it pulls in this dataset too...
Cohen's kappa
> kappa2(dat[,c(1,2)],"unweighted")
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 30
Raters = 2
Kappa = 0.651
z = 7
p-value = 2.63e-12
Huh? Weights? I'll explain those later. For now, just run it like this.
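If you're curious what the unweighted kappa is actually computing, here is a sketch that reproduces it by hand from the agreement table: observed agreement minus chance agreement, rescaled. It should land on the Kappa = 0.651 reported by kappa2() above.

```r
library(irr)
data(diagnoses)
dat <- diagnoses[, 1:3]

# Cross-tabulate raters 1 and 2; both are factors with the same levels,
# so the table is square and its diagonal is the agreement cells.
tab <- table(dat$rater1, dat$rater2)
po <- sum(diag(tab)) / sum(tab)                      # observed agreement
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
(po - pe) / (1 - pe)                                 # Cohen's kappa
```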
N raters: Fleiss’s Kappa, Conger’s Kappa
> kappam.fleiss(dat)
Fleiss' Kappa for m Raters
Subjects = 30
Raters = 3
Kappa = 0.534
z = 9.89
p-value = 0
Fleiss's kappa
> kappam.fleiss(dat,exact=TRUE)
Fleiss' Kappa for m Raters (exact value)
Subjects = 30
Raters = 3
Kappa = 0.55
Conger's exact kappa
Cohen's kappa (weighted)
If the data is ordinal, weights come into play...
> data(anxiety)
> dfa=anxiety[,c(1,2)]
> head(dfa)
rater1 rater2
1 3 3
2 3 6
3 3 4
4 4 6
5 5 2
6 5 4
We'll use this data. ...Wait, this was ordinal?
> kappa2(dfa,"squared")
Cohen's Kappa for 2 Raters (Weights: squared)
Subjects = 20
Raters = 2
Kappa = 0.297
z = 1.34
p-value = 0.18
Weight=squared
> kappa2(dfa,"equal")
Cohen's Kappa for 2 Raters (Weights: equal)
Subjects = 20
Raters = 2
Kappa = 0.189
z = 1.42
p-value = 0.157
Weight=equal
> kappa2(dfa,"unweighted")
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 20
Raters = 2
Kappa = 0.119
z = 1.16
p-value = 0.245
No weighting
Cohen's kappa (weighted) + factors
Factor: yes, the thing with levels.
> dfa2=dfa
> dfa2$rater1=factor(dfa2$rater1,levels=1:6,labels=LETTERS[1:6])
> dfa2$rater2=factor(dfa2$rater2,levels=1:6,labels=LETTERS[1:6])
> head(dfa2)
rater1 rater2
1 C C
2 C F
3 C D
4 D F
5 E B
6 E D
The data loaded above was converted to factors and assigned levels.
> levels(dfa2$rater1)
[1] "A" "B" "C" "D" "E" "F"
> levels(dfa2$rater2)
[1] "A" "B" "C" "D" "E" "F"
And this is the result. (They show up as letters because the labels are LETTERS[1:6], i.e. A through F.)
> kappa2(dfa2,"squared")
Cohen's Kappa for 2 Raters (Weights: squared)
Subjects = 20
Raters = 2
Kappa = 0.297
z = 1.34
p-value = 0.18
Weight=squared
> kappa2(dfa2,"equal")
Cohen's Kappa for 2 Raters (Weights: equal)
Subjects = 20
Raters = 2
Kappa = 0.189
z = 1.42
p-value = 0.157
Weight=equal
> kappa2(dfa2,"unweighted")
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 20
Raters = 2
Kappa = 0.119
z = 1.16
p-value = 0.245
No weighting
ICC(Intraclass correlation coefficient)
Now what kind of otherworldly term is this... OTL
> icc(anxiety,model="twoway",type="agreement")
Single Score Intraclass Correlation
Model: twoway
Type : agreement
Subjects = 20
Raters = 3
ICC(A,1) = 0.198
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(19,39.7) = 1.83 , p = 0.0543
95%-Confidence Interval for ICC Population Values:
-0.039 < ICC < 0.494
It really does sound otherworldly, but the ICC simply measures how consistently multiple raters score the same subjects on a continuous scale.