So, what's a binomial distribution... It's the discrete probability distribution of the number of successes (X) when a Bernoulli trial with some fixed probability (p) is repeated n times independently. Pass or fail, that sort of thing. The important bit here: just because there are two outcomes does not mean the odds are fifty-fifty. That is a completely baseless prejudice.
But the probability here is so low, I'm not sure the binomial distribution can even deal with it.
How low are we talking?
What we'll try here is 1/4096 (wild, no Shiny Charm), 1/1365 (Shiny Charm), and 1/512 (Shiny Charm + international breeding).
Hitting the binomial distribution in R
> x_val=0:10
> y_val1=dbinom(x_val,100,1/4096)
The first line says x runs from 0 to 10, and the line below it is the binomial whatsit. Unpacked, it reads: 1) take an event with probability 1/4096, 2) run it 100 times, 3) what's the probability of succeeding 0 to 10 times? Let's standardize that middle number at about 1000. By the way, no ggplot here — this is a fresh R install on Windows...
Strictly speaking, the first two are encountering a wild Pokémon 1) without the Shiny Charm and 2) with the Shiny Charm, and 3) is international breeding (breeding parents from games of different languages, a.k.a. the Masuda method) plus the Shiny Charm, on the premise of hatching 1,000 eggs. So hatch 1,000 eggs and the chance of a shiny is about 1/4 or so... which means if you hatched a shiny, that's very good news. Still more generous than Lineage, which is the real punchline.
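Spelled out as code, it looks roughly like this (a minimal sketch in base graphics, since no ggplot here; the barplot() styling is my guess at the figure, not the original code):
> x_val=0:10
> y_val1=dbinom(x_val,1000,1/4096)   # wild, no Shiny Charm
> y_val2=dbinom(x_val,1000,1/1365)   # Shiny Charm
> y_val3=dbinom(x_val,1000,1/512)    # Shiny Charm + international breeding
> barplot(rbind(y_val1,y_val2,y_val3),beside=TRUE,names.arg=x_val,
+         col=c("black","skyblue","yellow"),
+         xlab="shinies in 1000 tries",ylab="probability")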
I drew the three of them together, bar-chart style. Black is 1/4096, sky blue is 1/1365, yellow is 1/512. You can see the distribution lean further to the side as the probability drops. Bar charts don't do it for you?
Does this one land better?
I raised the number of trials to 2000. Then what if we set the trial count to exactly 4096?
Conclusion: shinies are the product of flat-out insane grinding plus sheer luck.
The plots above are the plain binomial — i.e., 1) while running field laps or hatching eggs, 2) the probability of meeting or hatching exactly 0 to 10 shinies — and the one below is the cumulative binomial distribution.
y_val4=pbinom(x_val2,4096,1/4096)
This one accumulates the probabilities of getting 0–10 shinies out of 4096 encounters, so it does eventually hit 1. As before, the black line is 1/4096, the sky-blue line 1/1365, the red line 1/512... so why is the higher-probability one sitting further along? I don't fully understand it myself, but supposedly 1/512 succeeds more often, so the success counts are more spread out, and the higher the probability, the further back the curve goes.
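For reference, the other two curves are built the same way (a sketch; the 0–30 range for x_val2 is my assumption, and y_val4 is the one from the line above):
> x_val2=0:30
> y_val5=pbinom(x_val2,4096,1/1365)
> y_val6=pbinom(x_val2,4096,1/512)
> plot(x_val2,y_val4,type="l",xlab="shinies",ylab="cumulative probability")
> lines(x_val2,y_val5,col="skyblue")
> lines(x_val2,y_val6,col="red")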
> library(ellipse)
Attaching package: ‘ellipse’
The following object is masked from ‘package:car’:
ellipse
The following object is masked from ‘package:graphics’:
pairs
> cond=factor(rep(c("A","B"),each=200))
> data=data.frame(cond,rating=(c(rating,rating2)))
> head(data)
cond rating
1 A -0.6264538
2 A 0.1836433
3 A -0.8356286
4 A 1.5952808
5 A 0.3295078
6 A -0.8204684
> pdf("plots.pdf")
> plot(plot)
There were 50 or more warnings (use warnings() to see the first 50)
> dev.off()
pdf("파일명.pdf")만 쓰면 빈 pdf파일이 나오고 밑에 저장할 그래프를 하나씩 쓰면 페이지당 하나씩 저장된다.
> pdf("plots.pdf")
> plot(plot)
There were 50 or more warnings (use warnings() to see the first 50)
> plot(plot2)
There were 50 or more warnings (use warnings() to see the first 50)
> dev.off()
X11cairo
2
The Bradford assay is an experiment for quantifying protein. You mix in Bradford assay reagent and read OD595 — but for that you need a standard curve first. Dissolve known amounts of a protein (BSA; bovine serum albumin), add the Bradford reagent, measure OD595,
and draw a standard curve like this. Without one, an OD595 reading tells you nothing about how much protein you actually have.
Meet Jeyukssambap (a Swampert), who is taking a lab course; this week's topic was the Bradford assay. The assignment: draw a standard curve and quantify the protein concentration of an unknown sample. But to measure the amount you first need the standard curve, and then linear regression (a trendline, in Excel terms) for the R^2 and the linear equation, right? Trouble was, Jeyukssambap's computer ran Linux, and LibreOffice happened to be laggy that day. (It lags more than you'd expect.)
"Eh, this will work out somehow," he figured, and fired up R.
But why is he named Jeyukssambap? / No reason, it just came to mind. / You really will name a thing anything, huh.
Before drawing the graph
> library(ggplot2)
> setwd('/home/koreanraichu/')
Hey, where are you going — call ggplot2 first.
setwd() sets the working directory; my files live there, so I pin it there and work. If you don't pin it, you have to open files as '/home/koreanraichu/file.csv'; once it's set, plain 'file.csv' does the trick.
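Concretely (a tiny sketch; 'example.csv' is a placeholder file name):
> setwd('/home/koreanraichu/')
> df=read.csv('example.csv')   # no /home/koreanraichu/ prefix needed anymore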
> ggplot(data=data,aes(x=BSA,y=AVR,group=1))+geom_line(colour="#939597")+geom_smooth(method=lm,se=FALSE)
`geom_smooth()` using formula 'y ~ x'
With method=lm you get a straight line. Anyway, now that it's drawn, let's take a look...
> ggplot(data=data,aes(x=BSA,y=AVR,group=1))+geom_line(colour="#939597")+geom_smooth(method=lm,se=FALSE,colour="#000000")
`geom_smooth()` using formula 'y ~ x'
Wait, why does this just work?
Trendline equation and R^2
> summary(lm(AVR~BSA,data))
Call:
lm(formula = AVR ~ BSA, data = data)
Residuals:
1 2 3 4 5
-0.002969 -0.002556 0.005093 0.003154 -0.002723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.969e-03 2.776e-03 1.069 0.363366
BSA 4.588e-04 2.707e-05 16.948 0.000447 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.004421 on 3 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9862
F-statistic: 287.2 on 1 and 3 DF, p-value: 0.0004474
BSA is the x-axis, and Intercept is the (y-)intercept. R^2 is listed under Multiple R-squared.
What is with that number formatting... anyway, the trendline equation comes out to y = 0.0004588x + 0.002969.
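If you'd rather not squint at the summary() printout, the pieces can be pulled out directly (a small sketch):
> fit=lm(AVR~BSA,data)
> coef(fit)                # the intercept and the slope
> summary(fit)$r.squared   # the Multiple R-squared from above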
> ggplot(data=data,aes(x=BSA,y=AVR,group=1))+geom_line(colour="#939597")+geom_smooth(method=lm,se=FALSE,colour="#000000")+geom_text(x=100,y=0.08,label="y=0.0004588x+0.002969")+geom_text(x=100,y=0.077,label="R^2=0.9897")
`geom_smooth()` using formula 'y ~ x'
If you want it on the graph, use geom_text(). ...Stop reading that as points, I swear...
Axis titles and a graph title
Having drawn all that, Jeyukssambap was about to save the graph and quit R — and then looked again: those axis titles are a bit rough, no?
> ggplot(data=data,aes(x=BSA,y=AVR,group=1))+geom_line(colour="#939597")+geom_smooth(method=lm,se=FALSE,colour="#000000")+geom_text(x=100,y=0.08,label="y=0.0004588x+0.002969")+geom_text(x=100,y=0.077,label="R^2=0.9897")+xlab("BSA conc. (ug/ul)")+ylab("OD 595")+ggtitle("Standard curve")
`geom_smooth()` using formula 'y ~ x'
Go into 'find data' there and pretty much every dataset in Korea turns up. (The Jeju vegetable data came from there too.)
But some of the CSV files come with broken Korean characters, so watch out for that. I can't speak for other encodings, but breaking even in UTF-8 is kind of a problem, isn't it...
List of R's built-in datasets
boot
Data sets in package ‘boot’:
acme Monthly Excess Returns
aids Delay in AIDS Reporting in England and Wales
aircondit Failures of Air-conditioning Equipment
aircondit7 Failures of Air-conditioning Equipment
amis Car Speeding and Warning Signs
aml Remission Times for Acute Myelogenous Leukaemia
beaver Beaver Body Temperature Data
bigcity Population of U.S. Cities
brambles Spatial Location of Bramble Canes
breslow Smoking Deaths Among Doctors
calcium Calcium Uptake Data
cane Sugar-cane Disease Data
capability Simulated Manufacturing Process Data
catsM Weight Data for Domestic Cats
cav Position of Muscle Caveolae
cd4 CD4 Counts for HIV-Positive Patients
cd4.nested Nested Bootstrap of cd4 data
channing Channing House Data
city Population of U.S. Cities
claridge Genetic Links to Left-handedness
cloth Number of Flaws in Cloth
co.transfer Carbon Monoxide Transfer
coal Dates of Coal Mining Disasters
darwin Darwin's Plant Height Differences
dogs Cardiac Data for Domestic Dogs
downs.bc Incidence of Down's Syndrome in British
Columbia
ducks Behavioral and Plumage Characteristics of
Hybrid Ducks
fir Counts of Balsam-fir Seedlings
frets Head Dimensions in Brothers
grav Acceleration Due to Gravity
gravity Acceleration Due to Gravity
hirose Failure Time of PET Film
islay Jura Quartzite Azimuths on Islay
manaus Average Heights of the Rio Negro river at
Manaus
melanoma Survival from Malignant Melanoma
motor Data from a Simulated Motorcycle Accident
neuro Neurophysiological Point Process Data
nitrofen Toxicity of Nitrofen in Aquatic Systems
nodal Nodal Involvement in Prostate Cancer
nuclear Nuclear Power Station Construction Data
paulsen Neurotransmission in Guinea Pig Brains
poisons Animal Survival Times
polar Pole Positions of New Caledonian Laterites
remission Cancer Remission and Cell Activity
salinity Water Salinity and River Discharge
survival Survival of Rats after Radiation Doses
tau Tau Particle Decay Modes
tuna Tuna Sighting Data
urine Urine Analysis Data
wool Australian Relative Wool Prices
carData
Data sets in package ‘carData’:
AMSsurvey American Math Society Survey Data
Adler Experimenter Expectations
Angell Moral Integration of American Cities
Anscombe U. S. State Public-School Expenditures
Arrests Arrests for Marijuana Possession
BEPS British Election Panel Study
Baumann Methods of Teaching Reading Comprehension
Bfox Canadian Women's Labour-Force Participation
Blackmore Exercise Histories of Eating-Disordered and
Control Subjects
Burt Fraudulent Data on IQs of Twins Raised Apart
CES11 2011 Canadian National Election Study, With
Attitude Toward Abortion
CanPop Canadian Population Data
Chile Voting Intentions in the 1988 Chilean
Plebiscite
Chirot The 1907 Romanian Peasant Rebellion
Cowles Cowles and Davis's Data on Volunteering
Davis Self-Reports of Height and Weight
DavisThin Davis's Data on Drive for Thinness
Depredations Minnesota Wolf Depredation Data
Duncan Duncan's Occupational Prestige Data
Ericksen The 1980 U.S. Census Undercount
Florida Florida County Voting
Freedman Crowding and Crime in U. S. Metropolitan Areas
Friendly Format Effects on Recall
GSSvocab Data from the General Social Survey (GSS) from
the National Opinion Research Center of the
University of Chicago.
Ginzberg Data on Depression
Greene Refugee Appeals
Guyer Anonymity and Cooperation
Hartnagel Canadian Crime-Rates Time Series
Highway1 Highway Accidents
KosteckiDillon Treatment of Migraine Headaches
Leinhardt Data on Infant-Mortality
LoBD Cancer drug data use to provide an example of
the use of the skew power distributions.
Mandel Contrived Collinear Data
Migration Canadian Interprovincial Migration Data
Moore Status, Authoritarianism, and Conformity
MplsDemo Minneapolis Demographic Data 2015, by
Neighborhood
MplsStops Minneapolis Police Department 2017 Stop Data
Mroz U.S. Women's Labor-Force Participation
OBrienKaiser O'Brien and Kaiser's Repeated-Measures Data
OBrienKaiserLong O'Brien and Kaiser's Repeated-Measures Data in
"Long" Format
Ornstein Interlocking Directorates Among Major Canadian
Firms
Pottery Chemical Composition of Pottery
Prestige Prestige of Canadian Occupations
Quartet Four Regression Datasets
Robey Fertility and Contraception
Rossi Rossi et al.'s Criminal Recidivism Data
SLID Survey of Labour and Income Dynamics
Sahlins Agricultural Production in Mazulu Village
Salaries Salaries for Professors
Soils Soil Compositions of Physical and Chemical
Characteristics
States Education and Related Statistics for the U.S.
States
TitanicSurvival Survival of Passengers on the Titanic
Transact Transaction data
UN National Statistics from the United Nations,
Mostly From 2009-2011
UN98 United Nations Social Indicators Data 1998
USPop Population of the United States
Vocab Vocabulary and Education
WVS World Values Surveys
WeightLoss Weight Loss Data
Wells Well Switching in Bangladesh
Womenlf Canadian Women's Labour-Force Participation
Wong Post-Coma Recovery of IQ
Wool Wool data
caret
Data sets in package ‘caret’:
GermanCredit German Credit Data
Sacramento Sacramento CA Home Prices
absorp (tecator) Fat, Water and Protein Content of Meat Samples
bbbDescr (BloodBrain) Blood Brain Barrier Data
cars Kelly Blue Book resale data for 2005 model year
GM cars
cox2Class (cox2) COX-2 Activity Data
cox2Descr (cox2) COX-2 Activity Data
cox2IC50 (cox2) COX-2 Activity Data
dhfr Dihydrofolate Reductase Inhibitors Data
endpoints (tecator) Fat, Water and Protein Content of Meat Samples
fattyAcids (oil) Fatty acid composition of commercial oils
logBBB (BloodBrain) Blood Brain Barrier Data
mdrrClass (mdrr) Multidrug Resistance Reversal (MDRR) Agent Data
mdrrDescr (mdrr) Multidrug Resistance Reversal (MDRR) Agent Data
oilType (oil) Fatty acid composition of commercial oils
potteryClass (pottery)
Pottery from Pre-Classical Sites in Italy
scat Morphometric Data on Scat
scat_orig (scat) Morphometric Data on Scat
segmentationData Cell Body Segmentation
cluster
Data sets in package ‘cluster’:
agriculture European Union Agricultural Workforces
animals Attributes of Animals
chorSub Subset of C-horizon of Kola Data
flower Flower Characteristics
plantTraits Plant Species Traits Data
pluton Isotopic Composition Plutonium Batches
ruspini Ruspini Data
votes.repub Votes for Republican Candidate in Presidential
Elections
xclara Bivariate Data Set with 3 Clusters
colorspace
Data sets in package ‘colorspace’:
USSouthPolygon Polygon for County Map of US South States:
Alabama, Georgia, and South Carolina
max_chroma_table Compute Maximum Chroma for Given Hue and
Luminance in HCL
datasets
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure Personal Expenditure Data
UScitiesD Distances Between European Cities and Between
US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple Linear
Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of
Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and Between
US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced
Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at Nottingham,
1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their Sons
precip Annual Precipitation in US Cities
presidents Quarterly Approval Ratings of US Presidents
pressure Vapor Pressure of Mercury as a Function of
Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator
RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student's Sleep Data
stack.loss (stackloss)
Brownlee's Stack Loss Plant Data
stack.x (stackloss) Brownlee's Stack Loss Plant Data
stackloss Brownlee's Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state) US State Facts and Figures
state.division (state)
US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state) US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.month Monthly Sunspot Data, from 1749 to "Present"
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic Indicators
(1888) Data
treering Yearly Treering Data, -6000-1979
trees Diameter, Height and Volume for Black Cherry
Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland's Maunga
Whau Volcano
warpbreaks The Number of Breaks in Yarn during Weaving
women Average Heights and Weights for American Women
doBy
Data sets in package ‘doBy’:
NIRmilk NIRmilk
beets beets data
breastcancer Gene expression signatures for p53 mutation
status in 250 breast cancer samples
budworm budworm data
carcass Lean meat contents of 344 pig carcasses
carcassall Lean meat contents of 344 pig carcasses
codstom Diet of Atlantic cod in the Gulf of St.
Lawrence (Canada)
crimeRate crimeRate
cropyield Yield from Danish agricultural production of
grain and root crop.
dietox Growth curves of pigs in a 3x3 factorial
experiment
fatacid Fish oil in pig food
fev Forced expiratory volume in children
haldCement Heat development in cement under hardening.
math Mathematics marks for students
mathmark Mathematics marks for students
milkman Milk yield data for manually milked cows.
milkman_rdm1 Milk yield data for manually milked cows.
potatoes Weight and size of 20 potatoes
dplyr
Data sets in package ‘dplyr’:
band_instruments Band membership
band_instruments2 Band membership
band_members Band membership
starwars Starwars characters
storms Storm tracks data
ggplot2
Data sets in package ‘ggplot2’:
diamonds Prices of over 50,000 round cut diamonds
economics US economic time series
economics_long US economic time series
faithfuld 2d density estimate of Old Faithful data
luv_colours 'colors()' in Luv space
midwest Midwest demographics
mpg Fuel economy data from 1999 to 2008 for 38
popular models of cars
msleep An updated and expanded version of the mammals
sleep dataset
presidential Terms of 11 presidents from Eisenhower to Obama
seals Vector field of seal movements
txhousing Housing sales in TX
ipred
Data sets in package ‘ipred’:
DLBCL Diffuse Large B-Cell Lymphoma
GlaucomaMVF Glaucoma Database
Smoking Smoking Styles
dystrophy Detection of muscular dystrophy carriers.
irr
Data sets in package ‘irr’:
anxiety Anxiety ratings by different raters
diagnoses Psychiatric diagnoses provided by different
raters
video Different raters judging the credibility of
videotaped testimonies
vision Eye-testing case records
lattice
Data sets in package ‘lattice’:
USMortality Mortality Rates in US by Cause and Gender
USRegionalMortality Mortality Rates in US by Cause and Gender
barley Yield data from a Minnesota barley trial
environmental Atmospheric environmental conditions in New
York City
ethanol Engine exhaust fumes from burning ethanol
melanoma Melanoma skin cancer incidence
singer Heights of New York Choral Society singers
lava
Data sets in package ‘lava’:
bmd Longitudinal Bone Mineral Density Data (Wide
format)
bmidata Data
brisa Simulated data
calcium Longitudinal Bone Mineral Density Data
hubble Hubble data
hubble2 Hubble data
indoorenv Data
missingdata Missing data example
nldata Example data (nonlinear model)
nsem Example SEM data (nonlinear)
semdata Example SEM data
serotonin Serotonin data
serotonin2 Data
twindata Twin menarche data
lme4
Data sets in package ‘lme4’:
Arabidopsis Arabidopsis clipping/fertilization data
Dyestuff Yield of dyestuff by batch
Dyestuff2 Yield of dyestuff by batch
InstEval University Lecture/Instructor Evaluations by
Students at ETH
Pastes Paste strength by batch and cask
Penicillin Variation in penicillin testing
VerbAgg Verbal Aggression item responses
cake Breakage Angle of Chocolate Cakes
cbpp Contagious bovine pleuropneumonia
grouseticks Data on red grouse ticks from Elston et al.
2001
grouseticks_agg (grouseticks)
Data on red grouse ticks from Elston et al.
2001
sleepstudy Reaction times in a sleep deprivation study
lubridate
Data sets in package ‘lubridate’:
lakers Lakers 2008-2009 basketball data set
maptools
Data sets in package ‘maptools’:
SplashDams Data for Splash Dams in western Oregon
h1pl (gpcholes) Hisaji Ono's lake/hole problem
h2pl (gpcholes) Hisaji Ono's lake/hole problem
state.vbm US State Visibility Based Map
wrld_simpl Simplified world country polygons
MASS
Data sets in package ‘MASS’:
Aids2 Australian AIDS Survival Data
Animals Brain and Body Weights for 28 Species
Boston Housing Values in Suburbs of Boston
Cars93 Data from 93 Cars on Sale in the USA in 1993
Cushings Diagnostic Tests on Patients with Cushing's
Syndrome
DDT DDT in Kale
GAGurine Level of GAG in Urine of Children
Insurance Numbers of Car Insurance claims
Melanoma Survival from Malignant Melanoma
OME Tests of Auditory Perception in Children with
OME
Pima.te Diabetes in Pima Indian Women
Pima.tr Diabetes in Pima Indian Women
Pima.tr2 Diabetes in Pima Indian Women
Rabbit Blood Pressure in Rabbits
Rubber Accelerated Testing of Tyre Rubber
SP500 Returns of the Standard and Poors 500
Sitka Growth Curves for Sitka Spruce Trees in 1988
Sitka89 Growth Curves for Sitka Spruce Trees in 1989
Skye AFM Compositions of Aphyric Skye Lavas
Traffic Effect of Swedish Speed Limits on Accidents
UScereal Nutritional and Marketing Information on US
Cereals
UScrime The Effect of Punishment Regimes on Crime Rates
VA Veteran's Administration Lung Cancer Trial
abbey Determinations of Nickel Content
accdeaths Accidental Deaths in the US 1973-1978
anorexia Anorexia Data on Weight Change
bacteria Presence of Bacteria after Drug Treatments
beav1 Body Temperature Series of Beaver 1
beav2 Body Temperature Series of Beaver 2
biopsy Biopsy Data on Breast Cancer Patients
birthwt Risk Factors Associated with Low Infant Birth
Weight
cabbages Data from a cabbage field trial
caith Colours of Eyes and Hair of People in Caithness
cats Anatomical Data from Domestic Cats
cement Heat Evolved by Setting Cements
chem Copper in Wholemeal Flour
coop Co-operative Trial in Analytical Chemistry
cpus Performance of Computer CPUs
crabs Morphological Measurements on Leptograpsus
Crabs
deaths Monthly Deaths from Lung Diseases in the UK
drivers Deaths of Car Drivers in Great Britain 1969-84
eagles Foraging Ecology of Bald Eagles
epil Seizure Counts for Epileptics
farms Ecological Factors in Farm Management
fgl Measurements of Forensic Glass Fragments
forbes Forbes' Data on Boiling Points in the Alps
galaxies Velocities for 82 Galaxies
gehan Remission Times of Leukaemia Patients
genotype Rat Genotype Data
geyser Old Faithful Geyser Data
gilgais Line Transect of Soil in Gilgai Territory
hills Record Times in Scottish Hill Races
housing Frequency Table from a Copenhagen Housing
Conditions Survey
immer Yields from a Barley Field Trial
leuk Survival Times and White Blood Counts for
Leukaemia Patients
mammals Brain and Body Weights for 62 Species of Land
Mammals
mcycle Data from a Simulated Motorcycle Accident
menarche Age of Menarche in Warsaw
michelson Michelson's Speed of Light Data
minn38 Minnesota High School Graduates of 1938
motors Accelerated Life Testing of Motorettes
muscle Effect of Calcium Chloride on Muscle
Contraction in Rat Hearts
newcomb Newcomb's Measurements of the Passage Time of
Light
nlschools Eighth-Grade Pupils in the Netherlands
npk Classical N, P, K Factorial Experiment
npr1 US Naval Petroleum Reserve No. 1 data
oats Data from an Oats Field Trial
painters The Painter's Data of de Piles
petrol N. L. Prater's Petrol Refinery Data
phones Belgium Phone Calls 1950-1973
quine Absenteeism from School in Rural New South
Wales
road Road Accident Deaths in US States
rotifer Numbers of Rotifers by Fluid Density
ships Ships Damage Data
shoes Shoe wear data of Box, Hunter and Hunter
shrimp Percentage of Shrimp in Shrimp Cocktail
shuttle Space Shuttle Autolander Problem
snails Snail Mortality Data
steam The Saturated Steam Pressure Data
stormer The Stormer Viscometer Data
survey Student Survey Data
synth.te Synthetic Classification Problem
synth.tr Synthetic Classification Problem
topo Spatial Topographic Data
waders Counts of Waders at 15 Sites in South Africa
whiteside House Insulation: Whiteside's Data
wtloss Weight Loss Data from an Obese Patient
Matrix
Data sets in package ‘Matrix’:
CAex Albers' example Matrix with "Difficult" Eigen
Factorization
KNex Koenker-Ng Example Sparse Model Matrix and
Response Vector
USCounties USCounties Contiguity Matrix
wrld_1deg World 1-degree grid contiguity matrix
mgcv
Data sets in package ‘mgcv’:
columb Reduced version of Columbus OH crime data
columb.polys Reduced version of Columbus OH crime data
ModelMetrics
Data sets in package ‘ModelMetrics’:
testDF Test data
nlme
Data sets in package ‘nlme’:
Alfalfa Split-Plot Experiment on Varieties of Alfalfa
Assay Bioassay on Cell Culture Plate
BodyWeight Rat weight over time for different diets
Cefamandole Pharmacokinetics of Cefamandole
Dialyzer High-Flux Hemodialyzer
Earthquake Earthquake Intensity
Fatigue Cracks caused by metal fatigue
Gasoline Refinery yield of gasoline
Glucose Glucose levels over time
Glucose2 Glucose Levels Following Alcohol Ingestion
Gun Methods for firing naval guns
IGF Radioimmunoassay of IGF-I Protein
Machines Productivity Scores for Machines and Workers
MathAchSchool School demographic data for MathAchieve
MathAchieve Mathematics achievement scores
Meat Tenderness of meat
Milk Protein content of cows' milk
Muscle Contraction of heart muscle sections
Nitrendipene Assay of nitrendipene
Oats Split-plot Experiment on Varieties of Oats
Orthodont Growth curve data on an orthdontic measurement
Ovary Counts of Ovarian Follicles
Oxboys Heights of Boys in Oxford
Oxide Variability in Semiconductor Manufacturing
PBG Effect of Phenylbiguanide on Blood Pressure
Phenobarb Phenobarbitol Kinetics
Pixel X-ray pixel intensities over time
Quinidine Quinidine Kinetics
Rail Evaluation of Stress in Railway Rails
RatPupWeight The weight of rat pups
Relaxin Assay for Relaxin
Remifentanil Pharmacokinetics of Remifentanil
Soybean Growth of soybean plants
Spruce Growth of Spruce Trees
Tetracycline1 Pharmacokinetics of tetracycline
Tetracycline2 Pharmacokinetics of tetracycline
Wafer Modeling of Analog MOS Circuits
Wheat Yields by growing conditions
Wheat2 Wheat Yield Trials
bdf Language scores
ergoStool Ergometrics experiment with stool types
pbkrtest
Data sets in package ‘pbkrtest’:
beets Sugar beets data
budworm budworm data
plyr
Data sets in package ‘plyr’:
baseball Yearly batting records for all major league
baseball players
ozone Monthly ozone measurements over Central
America.
pROC
Data sets in package ‘pROC’:
aSAH Subarachnoid hemorrhage data
quantreg
Data sets in package ‘quantreg’:
Bosco Boscovich Data
CobarOre Cobar Ore data
Mammals Garland(1983) Data on Running Speed of Mammals
Peirce C.S. Peirce's Auditory Response Data
barro Barro Data
engel Engel Data
gasprice Time Series of US Gasoline Prices
uis UIS Drug Treatment study data
reshape
Data sets in package ‘reshape’:
french_fries Sensory data from a french fries experiment
smiths Demo data describing the Smiths
tips Tipping data
reshape2
Data sets in package ‘reshape2’:
french_fries Sensory data from a french fries experiment.
smiths Demo data describing the Smiths.
tips Tipping data
rpart
Data sets in package ‘rpart’:
car.test.frame Automobile Data from 'Consumer Reports' 1990
car90 Automobile Data from 'Consumer Reports' 1990
cu.summary Automobile Data from 'Consumer Reports' 1990
kyphosis Data on Children who have had Corrective Spinal
Surgery
solder Soldering of Components on Printed-Circuit
Boards
solder.balance (solder)
Soldering of Components on Printed-Circuit
Boards
stagec Stage C Prostate Cancer
sp
Data sets in package ‘sp’:
Rlogo Rlogo jpeg image
gt (Rlogo) Rlogo jpeg image
meuse Meuse river data set
meuse.area River Meuse outline
meuse.grid Prediction Grid for Meuse Data Set
meuse.grid_ll Prediction Grid for Meuse Data Set,
geographical coordinates
meuse.riv River Meuse outline
SparseM
Data sets in package ‘SparseM’:
X (triogramX) A Design Matrix for a Triogram Problem
lsq Least Squares Problems in Surveying
ssanv
Data sets in package ‘ssanv’:
example.of.Fisher.exact
Object of class 'power.htest'
stringr
Data sets in package ‘stringr’:
fruit Sample character vectors for practicing string
manipulations.
sentences Sample character vectors for practicing string
manipulations.
words Sample character vectors for practicing string
manipulations.
survival
Data sets in package ‘survival’:
aml (cancer) Acute Myelogenous Leukemia survival data
bladder (cancer) Bladder Cancer Recurrences
bladder1 (cancer) Bladder Cancer Recurrences
bladder2 (cancer) Bladder Cancer Recurrences
cancer NCCTG Lung Cancer Data
capacitor (reliability)
Reliability data sets
cgd Chronic Granulotamous Disease data
cgd0 (cgd) Chronic Granulotomous Disease data
colon (cancer) Chemotherapy for Stage B/C colon cancer
cracks (reliability) Reliability data sets
diabetic Ddiabetic retinopathy
flchain Assay of serum free light chain for 7874
subjects.
gbsg (cancer) Breast cancer data sets used in Royston and
Altman (2013)
genfan (reliability) Reliability data sets
heart Stanford Heart Transplant data
ifluid (reliability) Reliability data sets
imotor (reliability) Reliability data sets
jasa (heart) Stanford Heart Transplant data
jasa1 (heart) Stanford Heart Transplant data
kidney (cancer) Kidney catheter data
leukemia (cancer) Acute Myelogenous Leukemia survival data
logan Data from the 1972-78 GSS data used by Logan
lung (cancer) NCCTG Lung Cancer Data
mgus (cancer) Monoclonal gammopathy data
mgus1 (cancer) Monoclonal gammopathy data
mgus2 (cancer) Monoclonal gammopathy data
myeloid (cancer) Acute myeloid leukemia
myeloma (cancer) Survival times of patients with multiple
myeloma
nafld1 (nafld) Non-alcohol fatty liver disease
nafld2 (nafld) Non-alcohol fatty liver disease
nafld3 (nafld) Non-alcohol fatty liver disease
nwtco Data from the National Wilm's Tumor Study
ovarian (cancer) Ovarian Cancer Survival Data
pbc Mayo Clinic Primary Biliary Cholangitis Data
pbcseq (pbc) Mayo Clinic Primary Biliary Cirrhosis,
sequential data
rats (cancer) Rat treatment data from Mantel et al
rats2 (cancer) Rat data from Gail et al.
retinopathy Diabetic Retinopathy
rhDNase rhDNASE data set
rotterdam (cancer) Breast cancer data set used in Royston and
Altman (2013)
solder Data from a soldering experiment
stanford2 (heart) More Stanford Heart Transplant data
survexp.mn (survexp) Census Data Sets for the Expected Survival and
Person Years Functions
survexp.us (survexp) Census Data Sets for the Expected Survival and
Person Years Functions
survexp.usr (survexp) Census Data Sets for the Expected Survival and
Person Years Functions
tobin Tobin's Tobit data
transplant Liver transplant waiting list
turbine (reliability) Reliability data sets
udca Data from a trial of usrodeoxycholic acid
udca1 (udca) Data from a trial of usrodeoxycholic acid
udca2 (udca) Data from a trial of usrodeoxycholic acid
uspop2 (survexp) Projected US Population
valveSeat (reliability)
Reliability data sets
veteran (cancer) Veterans' Administration Lung Cancer study
tidyr
Data sets in package ‘tidyr’:
billboard Song rankings for Billboard top 100 in the year
2000
construction Completed construction in the US in 2018
fish_encounters Fish encounters
population World Health Organization TB data
relig_income Pew religion and income survey
smiths Some data about the Smith family
table1 Example tabular representations
table2 Example tabular representations
table3 Example tabular representations
table4a Example tabular representations
table4b Example tabular representations
table5 Example tabular representations
us_rent_income US rent and income data
who World Health Organization TB data
world_bank_pop Population data from the world bank
> bp=ggplot(dat,aes(xval,yval))+geom_point()
# Setting the tick marks on an axis
# This will show tick marks on every 0.25 from 1 to 10
# The scale will show only the ones that are within range (3.50-6.25 in this case)
bp + scale_y_continuous(breaks=seq(1,10,1/4))
# The breaks can be spaced unevenly
bp + scale_y_continuous(breaks=c(4, 4.25, 4.5, 5, 6,8))
# Suppress ticks and gridlines
bp + scale_y_continuous(breaks=NULL)
# Hide tick marks and labels (on Y axis), but keep the gridlines
bp + theme(axis.ticks = element_blank(), axis.text.y = element_blank())
Just like the x-axis, the y-axis ticks and labels can be hidden too.
When the axis scale is exponential or logarithmic
Our Jeyukssambap... the professor liked the clean R standard curve he turned in, and offered him a lab job over the break. (Again, why 'Jeyukssambap'?) It happened to be a field he was interested in, so he gladly accepted and showed up for day one... whereupon a lab senior pulled him aside and asked, carefully:
"The professor told me about you. Drew the standard curve in R, they said... Could you... maybe help me with something?"
Jeyukssambap said OK, and the senior showed him a graph.
(No, that is not the Japanese archipelago.) "The professor said he'd like a cleaner graph, and I don't know how to do it." "Cleaner how?" "He wants the data to look like a straight line."
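If exponential data is supposed to look straight, a log-scale axis is the usual ggplot2 trick (a sketch; dat, x and y are placeholder names):
> ggplot(dat,aes(x=x,y=y))+geom_point()+scale_y_log10()
# scale_x_log10() does the same for the x-axis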
Back to Jeyukssambap (the Swampert), deep in his lab course. Today's experiment was the Bradford assay. Each group treated their samples with Bradford assay reagent, measured OD595 (the column headers are a mess, but OD595 is correct... measured three times, hence 1, 2, 3), and got their readings. The assignment: draw the standard curve. 'Pfft, I'll knock this out in no time,' he thought — and then hit a snag.
His computer ran Linux, and LibreOffice was extra laggy that day... The only thing left to run was R. Now what?
> ggplot(data=tgc,aes(x=dose,y=len,colour=supp))+geom_line()+geom_point()+geom_errorbar(aes(ymin=len-ci,ymax=len+ci),width=.1)
# Use 95% confidence interval instead of SEM
> ggplot(dat,aes(x=rating))+geom_histogram()+geom_vline(aes(xintercept=mean(rating,na.rm=TRUE)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> ggplot(dat, aes(x=comd, y=rating,fill=comd)) + geom_boxplot()+stat_summary(fun.y=mean,geom="point",size=3,shape=5)
Warning message:
`fun.y` is deprecated. Use `fun` instead.
Look, it marks the mean too, I'm telling you.
scatter plot (geom_point())
Seen that point somewhere in the line charts, right? Paired with a line it puts points on a line graph; used on its own, it's a scatter plot.
For the record — I'm not sure if I mentioned this back in the Python posts — a scatter plot needs both X and Y values.
> dat <- data.frame(cond = rep(c("A", "B"), each=10),
+ xvar = 1:20 + rnorm(20,sd=3),
+ yvar = 1:20 + rnorm(20,sd=3))
> dat
cond xvar yvar
1 A -4.252354091 3.473157275
2 A 1.702317971 0.005939612
3 A 4.323053753 -0.094252427
4 A 1.780628408 2.072808278
5 A 11.537348371 1.215440358
6 A 6.672130388 3.608111411
7 A 0.004294848 7.529210289
8 A 9.971403007 6.156154355
9 A 9.007456032 8.935238147
10 A 11.766997972 8.928092187
11 B 8.840215645 13.202410972
12 B 5.974093783 17.644890794
13 B 15.034828849 10.485402010
14 B 10.985009895 10.138043066
15 B 13.543221961 18.681876114
16 B 11.435789493 19.143471805
17 B 16.977063388 18.832504955
18 B 17.220012698 18.939818864
19 B 17.793359218 19.718587761
20 B 19.319909163 19.647899863
> ggplot(dat,aes(x=xvar,y=yvar,color=cond))+geom_point(size=3)+scale_color_manual(values=c("#f5df4d","#939597"))+geom_smooth(method=lm,se=FALSE)
`geom_smooth()` using formula 'y ~ x'
> ggplot(dat,aes(x=xvar,y=yvar,color=cond))+geom_point(size=3)+scale_color_manual(values=c("#f5df4d","#939597"))+geom_smooth(method=lm,se=FALSE,fullrange=TRUE)
`geom_smooth()` using formula 'y ~ x'
> logr_vm=glm(vs~mpg,data=dat,family=binomial)
> logr_vm
Call: glm(formula = vs ~ mpg, family = binomial, data = dat)
Coefficients:
(Intercept) mpg
-8.8331 0.4304
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 43.86
Residual Deviance: 25.53 AIC: 29.53
> logr_vm <- glm(vs ~ mpg, data=dat, family=binomial(link="logit"))
> logr_vm
Call: glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
Coefficients:
(Intercept) mpg
-8.8331 0.4304
Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
Null Deviance: 43.86
Residual Deviance: 25.53 AIC: 29.53
Either of the two will do. The results are the same anyway...
> summary(logr_vm)
Call:
glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2127 -0.5121 -0.2276 0.6402 1.6980
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.8331 3.1623 -2.793 0.00522 **
mpg 0.4304 0.1584 2.717 0.00659 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.860 on 31 degrees of freedom
Residual deviance: 25.533 on 30 degrees of freedom
AIC: 29.533
Number of Fisher Scoring iterations: 6
As with linear regression, summary() gives you the extended view. Same regression family, I suppose — except that one's linear, isn't it?
> library(ggplot2)
> ggplot(dat, aes(x=mpg, y=vs)) + geom_point() +
+ stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
`geom_smooth()` using formula 'y ~ x'
When I first saw the geom I wondered if it was written that way because of the points... lol. Oh, and you need ggplot2 installed to draw this graph. The next chapter uses it too, so just install it now.
> bartlett.test(count~spray,data=InsectSprays)
Bartlett test of homogeneity of variances
data: count by spray
Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05
When there is one independent variable
> bartlett.test(len~interaction(supp,dose),data=ToothGrowth)
Bartlett test of homogeneity of variances
data: len by interaction(supp, dose)
Bartlett's K-squared = 6.9273, df = 5, p-value = 0.2261
When there are several independent variables (combine them with interaction())
Levene's test
You need to install something for this one too... go grab car.
> leveneTest(count~spray,data=InsectSprays)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.8214 0.004223 **
66
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When there is one independent variable
> leveneTest(len~supp*dose,data=tg)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 1.7086 0.1484
54
When there are several independent variables
Fligner-Killeen test
> fligner.test(count~spray,data=InsectSprays)
Fligner-Killeen test of homogeneity of variances
data: count by spray
Fligner-Killeen:med chi-squared = 14.483, df = 5, p-value = 0.01282
When there is one independent variable
> fligner.test(len~interaction(supp,dose),data=ToothGrowth)
Fligner-Killeen test of homogeneity of variances
data: len by interaction(supp, dose)
Fligner-Killeen:med chi-squared = 7.7488, df = 5, p-value = 0.1706
> dfa2=dfa
> dfa2$rater1=factor(dfa2$rater1,levels=1:6,labels=LETTERS[1:6])
> dfa2$rater2=factor(dfa2$rater2,levels=1:6,labels=LETTERS[1:6])
> head(dfa2)
rater1 rater2
1 C C
2 C F
3 C D
4 D F
5 E B
6 E D
The data loaded above were factored and then given level labels.
> levels(dfa2$rater1)
[1] "A" "B" "C" "D" "E" "F"
> levels(dfa2$rater2)
[1] "A" "B" "C" "D" "E" "F"
1) This is the flower of R — its main purpose — so the material is enormously long. I doubt Naver will cut a post for length, but I'm still splitting this into parts 1 and 2. (The cookbook sub-chapters split 3:3.) 2) Also, these people keep summoning libraries out of nowhere without warning, and today there are again packages to install. Some show up in part 2 and some in part 1 — just install them all up front.
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = 0.56293, df = 47.977, p-value = 0.5761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.90445 19.38445
sample estimates:
mean of x mean of y
50.84 46.60
Type that and you're done.
> t.test(data$x,data$y)
This works too.
> t.test(x,y,var.equal=TRUE)
Two Sample t-test
data: x and y
t = 0.56293, df = 48, p-value = 0.5761
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.90427 19.38427
sample estimates:
mean of x mean of y
50.84 46.60
Give it var.equal=TRUE and you get Student's t-test; the default is the Welch t-test. Hold on, did I say Welch's? No, not the juice... why is it spelled the same, though?
Paired T-test
> t.test(x,y,paired=TRUE)
Paired t-test
data: x and y
t = 0.6259, df = 24, p-value = 0.5373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.741436 18.221436
sample estimates:
mean of the differences
4.24
Give it paired=TRUE and a paired t-test works too.
One sample T-test
Huh? But what if I only have the one dataset?
> t.test(x,mu=0)
One Sample t-test
data: x
t = 9.6518, df = 24, p-value = 9.776e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
39.9686 61.7114
sample estimates:
mean of x
50.84
> t.test(y,mu=0)
One Sample t-test
data: y
t = 8.6554, df = 24, p-value = 7.617e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
35.48812 57.71188
sample estimates:
mean of x
46.6
One-sample works as well. mu=0 means we take the population mean to be 0. The population mean is mu (μ), and the population standard deviation is sigma (σ) — sigma squared being the population variance.
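And mu doesn't have to be 0 — you can test against any hypothesized population mean (sketch):
> t.test(x,mu=50)   # is the mean of x different from 50?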
Frequency analysis
> data <- read.table(header=TRUE, text='
+ condition result
+ control 0
+ control 0
+ control 0
+ control 0
+ treatment 1
+ control 0
+ control 0
+ treatment 0
+ treatment 1
+ control 1
+ treatment 1
+ treatment 1
+ treatment 1
+ treatment 1
+ treatment 0
+ control 0
+ control 1
+ control 0
+ control 1
+ treatment 0
+ treatment 1
+ treatment 0
+ treatment 0
+ control 0
+ treatment 1
+ control 0
+ control 0
+ treatment 1
+ treatment 0
+ treatment 1
+ ')
> head(data)
condition result
1 control 0
2 control 0
3 control 0
4 control 0
5 treatment 1
6 control 0
Grateful for the data, but the sheer volume of this, whew.
They'd even posted it as 'for this situation, use this analysis!'...
X^2 (chi-square)
> ct=table(data$result)
> ct
0 1
17 13
For the record, frequency analysis always starts by building a contingency table in some form. To keep the post length down, for tables of the same kind I'll skip the code and just note what they were built from.
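For example, the two-way flavor of the same table would be built like this (a sketch on the same data):
> ct2=table(data$condition,data$result)   # cross-tabulate condition x result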
> chisq.test(ct)
Chi-squared test for given probabilities
data: ct
X-squared = 0.53333, df = 1, p-value = 0.4652
Plain chi-square
> pt=c(.75,.25) # the probabilities
> chisq.test(ct,p=pt)
Chi-squared test for given probabilities
data: ct
X-squared = 5.3778, df = 1, p-value = 0.02039
Chi-square computed on the premise that the expected frequencies differ
> chi_res=chisq.test(ct,p=pt)
> str(chi_res)
List of 9
$ statistic: Named num 5.38
..- attr(*, "names")= chr "X-squared"
$ parameter: Named num 1
..- attr(*, "names")= chr "df"
$ p.value : num 0.0204
$ method : chr "Chi-squared test for given probabilities"
$ data.name: chr "ct"
$ observed : 'table' int [1:2(1d)] 17 13
..- attr(*, "dimnames")=List of 1
.. ..$ : chr [1:2] "0" "1"
$ expected : Named num [1:2] 22.5 7.5
..- attr(*, "names")= chr [1:2] "0" "1"
$ residuals: 'table' num [1:2(1d)] -1.16 2.01
..- attr(*, "dimnames")=List of 1
.. ..$ : chr [1:2] "0" "1"
$ stdres : 'table' num [1:2(1d)] -2.32 2.32
..- attr(*, "dimnames")=List of 1
.. ..$ : chr [1:2] "0" "1"
- attr(*, "class")= chr "htest"
For this one you store the chi-square result in a variable and look at it with str() to see the details.
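Once it's in a variable, the pieces come out with $ (sketch):
> chi_res$p.value    # 0.0204, same as the printout above
> chi_res$expected   # the expected counts, 22.5 and 7.5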
Exact binomial test
> binom.test(ct,p=0.5)
Exact binomial test
data: ct
number of successes = 17, number of trials = 30, p-value = 0.5847
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.3742735 0.7453925
sample estimates:
probability of success
0.5666667
This is an exact binomial test run on the contingency table built above. Funny thing though, the values come out the same even without giving a probability... (even changing the probability gives the same values). Well, the p-value does differ.
> binom.test(ct,p=0.25)
Exact binomial test
data: ct
number of successes = 17, number of trials = 30, p-value = 0.0002157
alternative hypothesis: true probability of success is not equal to 0.25
95 percent confidence interval:
0.3742735 0.7453925
sample estimates:
probability of success
0.5666667
Unlike chi-square, you state only the probability for the second column's value. It seems to work out the other one on its own.
> bin_res=binom.test(ct,p=0.5)
> str(bin_res)
List of 9
$ statistic : Named num 17
..- attr(*, "names")= chr "number of successes"
$ parameter : Named num 30
..- attr(*, "names")= chr "number of trials"
$ p.value : Named num 0.585
..- attr(*, "names")= chr "0"
$ conf.int : num [1:2] 0.374 0.745
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 0.567
..- attr(*, "names")= chr "probability of success"
$ null.value : Named num 0.5
..- attr(*, "names")= chr "probability of success"
$ alternative: chr "two.sided"
$ method : chr "Exact binomial test"
$ data.name : chr "ct"
- attr(*, "class")= chr "htest"
Give the correct=FALSE option and Yates's correction for continuity is dropped. ...Wait, are you allowed to just yank that out?
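(For the record, correct= is an argument of prop.test() and chisq.test(), not binom.test(); a sketch with prop.test():)
> prop.test(ct)                 # Yates's continuity correction on (the default)
> prop.test(ct,correct=FALSE)   # correction off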
Fisher’s exact test
> fisher.test(ct)
Fisher's Exact Test for Count Data
data: ct
p-value = 0.03293
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.966861 45.555016
sample estimates:
odds ratio
5.714369
This one is Fisher's exact test(?), an analysis used when the sample size is small. Yes, in statistics more data is generally better, but you can't always have it that way. Rare-disease patients, for example...
> mantelhaen.test(ct)
Mantel-Haenszel chi-squared test with continuity correction
data: ct
Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
0.6005522 0.9593077
sample estimates:
common odds ratio
0.759022
...and then you run the analysis. Goddamn, this analysis stuff is brutal.
McNemar’s test
> data <- read.table(header=TRUE, text='
+ subject time result
+ 1 pre 0
+ 1 post 1
+ 2 pre 1
+ 2 post 1
+ 3 pre 0
+ 3 post 1
+ 4 pre 1
+ 4 post 0
+ 5 pre 1
+ 5 post 1
+ 6 pre 0
+ 6 post 1
+ 7 pre 0
+ 7 post 1
+ 8 pre 0
+ 8 post 1
+ 9 pre 0
+ 9 post 1
+ 10 pre 1
+ 10 post 1
+ 11 pre 0
+ 11 post 0
+ 12 pre 1
+ 12 post 1
+ 13 pre 0
+ 13 post 1
+ 14 pre 0
+ 14 post 0
+ 15 pre 0
+ 15 post 1
+ ')
> head(data)
subject time result
1 1 pre 0
2 1 post 1
3 2 pre 1
4 2 post 1
5 3 pre 0
6 3 post 1
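One thing the code skips: the table below is built from data_wide, which hasn't been created yet — the long table has to be spread to wide first (a sketch, assuming tidyr is loaded):
> data_wide=spread(data,time,result)   # one row per subject, pre/post as columns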
> ct=table(data_wide[,c("pre","post")])
> ct
post
pre 0 1
0 2 8
1 1 4
Anyway, with the table built...
> mcnemar.test(ct)
McNemar's Chi-squared test with continuity correction
data: ct
McNemar's chi-squared = 4, df = 1, p-value = 0.0455
Ana⭐lysis
> library(exact2x2)
Loading required package: exactci
Loading required package: ssanv
Loading required package: testthat
Attaching package: ‘testthat’
The following object is masked from ‘package:tidyr’:
matches
> mcnemar.exact(ct)
Exact McNemar test (with central confidence intervals)
data: ct
b = 8, c = 1, p-value = 0.03906
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.072554 354.981246
sample estimates:
odds ratio
8
To use the exact version you need the exact2x2 package. (Yes, the one from above.)
ANOVA
Heard of it, but what is this? Wasn't that the name of some software? (this guy, honestly)
> data <- read.table(header=TRUE, text='
+ subject sex age before after
+ 1 F old 9.5 7.1
+ 2 M old 10.3 11.0
+ 3 M old 7.5 5.8
+ 4 F old 12.4 8.8
+ 5 M old 10.2 8.6
+ 6 M old 11.0 8.0
+ 7 M young 9.1 3.0
+ 8 F young 7.9 5.2
+ 9 F old 6.6 3.4
+ 10 M young 7.7 4.0
+ 11 M young 9.4 5.3
+ 12 M old 11.6 11.3
+ 13 M young 9.9 4.6
+ 14 F young 8.6 6.4
+ 15 F young 14.3 13.5
+ 16 F old 9.2 4.7
+ 17 M young 9.8 5.1
+ 18 F old 9.9 7.3
+ 19 F young 13.0 9.5
+ 20 M young 10.2 5.4
+ 21 M young 9.0 3.7
+ 22 F young 7.9 6.2
+ 23 M old 10.1 10.0
+ 24 M young 9.0 1.7
+ 25 M young 8.6 2.9
+ 26 M young 9.4 3.2
+ 27 M young 9.7 4.7
+ 28 M young 9.3 4.9
+ 29 F young 10.7 9.8
+ 30 M old 9.3 9.4
+ ')
> head(data)
subject sex age before after
1 1 F old 9.5 7.1
2 2 M old 10.3 11.0
3 3 M old 7.5 5.8
4 4 F old 12.4 8.8
5 5 M old 10.2 8.6
6 6 M old 11.0 8.0
No idea what cutoff separates old from young in Age, but moving on.
> data$subject=factor(data$subject)
Before running ANOVA on this data, factor the subject column so it doesn't get treated as continuous data. For reference, ANOVA comes in one-way and two-way.
one-way ANOVA
> aov1=aov(before~sex,data=data)
> aov1
Call:
aov(formula = before ~ sex, data = data)
Terms:
sex Residuals
Sum of Squares 1.52861 74.70105
Deg. of Freedom 1 28
Residual standard error: 1.633369
Estimated effects may be unbalanced
> summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 1.53 1.529 0.573 0.455
Residuals 28 74.70 2.668
> model.tables(aov1,"means")
Tables of means
Grand mean
9.703333
sex
F M
10 9.532
rep 11 19.000
two-way ANOVA
> aov3=aov(after~sex*age,data=data)
> aov3
Call:
aov(formula = after ~ sex * age, data = data)
Terms:
sex age sex:age Residuals
Sum of Squares 16.07755 38.96110 89.61138 103.51164
Deg. of Freedom 1 1 1 26
Residual standard error: 1.995299
Estimated effects may be unbalanced
> summary(aov3)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 16.08 16.08 4.038 0.0550 .
age 1 38.96 38.96 9.786 0.0043 **
sex:age 1 89.61 89.61 22.509 6.6e-05 ***
Residuals 26 103.51 3.98
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model.tables(aov3,"means")
Tables of means
Grand mean
6.483333
sex
F M
7.445 5.926
rep 11.000 19.000
age
old young
7.874 5.556
rep 12.000 18.000
sex:age
age
sex old young
F 6.260 8.433
rep 5.000 6.000
M 9.157 4.042
rep 7.000 12.000
> aov3=aov(after~sex+age+sex:age,data=data)
> summary(aov3)
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 16.08 16.08 4.038 0.0550 .
age 1 38.96 38.96 9.786 0.0043 **
sex:age 1 89.61 89.61 22.509 6.6e-05 ***
Residuals 26 103.51 3.98
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Those two are the same code.
Tukey HSD post-hoc test
> TukeyHSD(aov1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = before ~ sex, data = data)
$sex
diff lwr upr p adj
M-F -0.4684211 -1.736038 0.7991961 0.4554053
One-way
> TukeyHSD(aov3)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = after ~ sex + age + sex:age, data = data)
$sex
diff lwr upr p adj
M-F -1.519139 -3.073025 0.03474709 0.0549625
$age
diff lwr upr p adj
young-old -2.31785 -3.846349 -0.7893498 0.0044215
$`sex:age`
diff lwr upr p adj
M:old-F:old 2.8971429 -0.3079526 6.1022384 0.0869856
F:young-F:old 2.1733333 -1.1411824 5.4878491 0.2966111
M:young-F:old -2.2183333 -5.1319553 0.6952887 0.1832890
F:young-M:old -0.7238095 -3.7691188 2.3214997 0.9138789
M:young-M:old -5.1154762 -7.7187601 -2.5121923 0.0000676
M:young-F:young -4.3916667 -7.1285380 -1.6547953 0.0008841
Two-way
You can also run a Tukey HSD post-hoc analysis on top of the ANOVA results.
ANOVAs with within-subjects variables
> data_long <- gather(data, time, value, before:after)
> head(data_long)
subject sex age time value
1 1 F old before 9.5
2 2 M old before 10.3
3 3 M old before 7.5
4 4 F old before 12.4
5 5 M old before 10.2
6 6 M old before 11.0
> data_long$subject <- factor(data_long$subject)
The table was reshaped long so we can work with the subject variable. (Yes, the one we factored earlier.)
> aov_time=aov(value~time+Error(subject/time),data=data_long)
> aov_time
Call:
aov(formula = value ~ time + Error(subject/time), data = data_long)
Grand Mean: 8.093333
Stratum 1: subject
Terms:
Residuals
Sum of Squares 261.2473
Deg. of Freedom 29
Residual standard error: 3.001421
Stratum 2: subject:time
Terms:
time Residuals
Sum of Squares 155.526 63.144
Deg. of Freedom 1 29
Residual standard error: 1.475595
Estimated effects are balanced
one-way
> aov_age_time=aov(value~age*time+Error(subject/time),data=data_long)
> aov_age_time
Call:
aov(formula = value ~ age * time + Error(subject/time), data = data_long)
Grand Mean: 8.093333
Stratum 1: subject
Terms:
age Residuals
Sum of Squares 24.44011 236.80722
Deg. of Freedom 1 28
Residual standard error: 2.908161
1 out of 2 effects not estimable
Estimated effects are balanced
Stratum 2: subject:time
Terms:
time age:time Residuals
Sum of Squares 155.526 18.769 44.375
Deg. of Freedom 1 1 28
Residual standard error: 1.258897
Estimated effects may be unbalanced
> x=v
> x[is.na(x)]="D"
> x
[1] "A" "A" "A" "B" "B" "B" "B" "D" "D" "C" "C" "B" "C" "C" "C"
If you replace the NAs with something else,
> xr=rle(x)
> xr
Run Length Encoding
lengths: int [1:6] 3 4 2 2 1 3
values : chr [1:6] "A" "B" "D" "C" "B" "C"
the rle() result changes, because the NAs all became D.
> xr$values[xr$values=="D"]=NA
> xr
Run Length Encoding
lengths: int [1:6] 3 4 2 2 1 3
values : chr [1:6] "A" "B" NA "C" "B" "C"
Turning the Ds back into NA in that state is, again, different from simply running rle() on the original vector.
> x2=inverse.rle(xr)
> x2
[1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
# the vector whose NAs were patched along the way
> inverse.rle(vr)
[1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
# the vector whose NAs were left alone
Once they're back to plain vectors, though, they're all identical.
This sort of works on factors too, but...
> f=factor(v)
> f
[1] A A A B B B B <NA> <NA> C C B C C C
Levels: A B C
Here's a factor.
> f_levels=levels(f)
> f_levels
[1] "A" "B" "C"
Missing values can't be factor levels. ???: "What? Missing values? Missing values?! If missing values get to be levels too, who's going to raise the cows!" — what do missing values have to do with cows, anyway.
> fc=as.character(f)
> fc
[1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
Anyway, for factors you convert to character and then apply rle(). From here on it's effectively the same as the vector case, so I'll skip it.
But what if you feed a factor straight to rle(), you ask?
> rle(f)
Error in rle(f) : 'x' must be a vector of an atomic type
It errors out.
Filling NAs with the immediately preceding value
> x <- c(NA,NA, "A","A", "B","B","B", NA,NA, "C", NA,NA,NA, "A","A","B", NA,NA)
> x
[1] NA NA "A" "A" "B" "B" "B" NA NA "C" NA NA NA "A" "A" "B" NA NA
> goodIdx <- !is.na(x)          # logical index of the non-NA positions
> goodVals <- c(NA, x[goodIdx])
> goodVals
[1] NA "A" "A" "B" "B" "B" "C" "A" "A" "B"
# These are the non-NA values from x only
# Add a leading NA for later use when we index into this vector
This identifies the values just before the NAs (the values whose is.na() is FALSE).
> fillIdx=cumsum(goodIdx)+1
> fillIdx
[1] 1 1 2 3 4 5 6 6 6 7 7 7 7 8 9 10 10 10
# adding 1 is to avoid indexing from 0!
And now we set up the actual fill.
> goodVals[fillIdx]
[1] NA NA "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "A" "A" "B" "B" "B"
# The original vector with gaps filled
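The whole routine wraps up nicely into a helper (a sketch assembled from the steps above):
> fillNAgaps=function(x){
+   goodIdx=!is.na(x)           # the non-NA positions
+   goodVals=c(NA,x[goodIdx])   # leading NA so gaps before the first value stay NA
+   fillIdx=cumsum(goodIdx)+1
+   goodVals[fillIdx]
+ }
> fillNAgaps(x)   # returns the gap-filled vector in one call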
Tables are usually long in one direction — wide or tall. I couldn't screenshot it, but the sample table I kept in SQL at the job before last was an absurdly wide one. (Sample color, size, spec, company name, product name, pass/fail all recorded... even masks from the same manufacturer must be tested separately if the KF rating, color, or size differs.)
Anyway... you're telling me rows and columns can swap?
> df=read.csv('/home/koreanraichu/example.csv',sep=";")
> df
ID Interesred.in Class
1 kimlab0213 Python Basic
2 ahn_0526 Python Medium
3 peponi01 R Basic
4 kuda_koma R Expert
5 comma_life Java Basic
6 wheresjohn Java Medium
7 hanguk_joa Python Expert
8 sigma_00 R Basic
9 kokoatalk Java Basic
Is this one wide or tall, though... isn't it just vertically massive?
> df_long=gather(df,ID,Interested.in,Class,factor_key=TRUE)
> df_long
ID Interested.in
1 Class Basic
2 Class Medium
3 Class Basic
4 Class Expert
5 Class Basic
6 Class Medium
7 Class Expert
8 Class Basic
9 Class Basic
Using gather() from the tidyr library gives you this. (The call is missing from the code above, but I trust you, dear readers: load tidyr first.)
> library(reshape2)
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
> df_melt=melt(df,id.vars=c("ID","Class"))
> df_melt
ID Class variable value
1 kimlab0213 Basic Interested.in Python
2 ahn_0526 Medium Interested.in Python
3 peponi01 Basic Interested.in R
4 kuda_koma Expert Interested.in R
5 comma_life Basic Interested.in Java
6 wheresjohn Medium Interested.in Java
7 hanguk_joa Expert Interested.in Python
8 sigma_00 Basic Interested.in R
9 kokoatalk Basic Interested.in Java
> df2_wide=spread(df2,Category.1,Price)
> df2_wide
ID Product.name Category.2 Order Cancel Consumable
1 BC-001 BCIG agar Bacteria 10 2 NA
2 DM-001 DMEM Cell culture 20 0 NA
3 GL-001 Slide glass Experimental 50 0 NA
4 GL-002 Cover glass Experimental 55 1 NA
5 LB-001 LB agar Bacteria 5 0 NA
6 LB-002 LB broth Bacteria 6 1 NA
7 MS-001 MS agar Plant 7 2 NA
8 MS-002 MS broth Plant 7 0 NA
9 MS-003 MS agar(w/ plant hormone) Plant 2 1 NA
10 PI-001 Pasteur pipet Experimental 30 5 10000
11 PI-002 Pipet-AID(10ml) Experimental 45 0 15000
12 PI-003 Pipet-AID(25ml) Experimental 45 5 25000
13 PI-004 Pipet-AID Experimental 2 0 NA
Equipment Glasswear Medium
1 NA NA 70000
2 NA NA 90000
3 NA 20000 NA
4 NA 20000 NA
5 NA NA 55000
6 NA NA 50000
7 NA NA 65000
8 NA NA 60000
9 NA NA 75000
10 NA NA NA
11 NA NA NA
12 NA NA NA
13 100000 NA NA
Use dcast. It apparently plays more or less the role of a pivot table... The form is dcast(original data, rows ~ columns, the value column, function to apply).
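For instance, dcast() can undo the melt() from earlier (a sketch on df_melt from above):
> dcast(df_melt,ID+Class~variable,value.var="value")   # back to one Interested.in column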
Summarize
From the description alone, this smells like pandas' groupby.
data <- read.table(header=TRUE, text='
subject sex condition before after change
1 F placebo 10.1 6.9 -3.2
2 F placebo 6.3 4.2 -2.1
3 M aspirin 12.4 6.3 -6.1
4 F placebo 8.1 6.1 -2.0
5 M aspirin 15.2 9.9 -5.3
6 F aspirin 10.9 7.0 -3.9
7 F aspirin 11.6 8.5 -3.1
8 M aspirin 9.5 3.0 -6.5
9 F placebo 11.5 9.0 -2.5
10 M placebo 11.9 11.0 -0.9
11 F aspirin 11.4 8.0 -3.4
12 M aspirin 10.0 4.4 -5.6
13 M aspirin 12.5 5.4 -7.1
14 M placebo 10.6 10.6 0.0
15 M aspirin 9.1 4.3 -4.8
16 F placebo 12.1 10.2 -1.9
17 F placebo 11.0 8.8 -2.2
18 F placebo 11.9 10.2 -1.7
19 M aspirin 9.1 3.6 -5.5
20 M placebo 13.5 12.4 -1.1
21 M aspirin 12.0 7.5 -4.5
22 F placebo 9.1 7.6 -1.5
23 M placebo 9.9 8.0 -1.9
24 F placebo 7.6 5.2 -2.4
25 F placebo 11.8 9.7 -2.1
26 F placebo 11.8 10.7 -1.1
27 F aspirin 10.1 7.9 -2.2
28 M aspirin 11.6 8.3 -3.3
29 F aspirin 11.3 6.8 -4.5
30 F placebo 10.3 8.3 -2.0
')
Here's the example table... presumably modeled on a clinical trial. In a trial, subjects split into a treatment group and a placebo group: the treatment group gets the drug, and the placebo group gets a dummy.
The ill-fated drug TGN1412 showed no issues preclinically, but in the phase 1 trial all six dosed subjects landed in the ICU with severe adverse reactions, and development was halted. (Only the two placebo subjects were fine, reportedly...)
> library(plyr)
> cdata=ddply(data,c("sex","condition"),summarise,N=length(change),mean=mean(change),sd=sd(change),se=sd/sqrt(N))
> cdata
sex condition N mean sd se
1 F aspirin 5 -3.420000 0.8642916 0.3865230
2 F placebo 12 -2.058333 0.5247655 0.1514867
3 M aspirin 9 -5.411111 1.1307569 0.3769190
4 M placebo 4 -0.975000 0.7804913 0.3902456
# Run the functions length, mean, and sd on the value of "change" for each group,
# broken down by sex + condition
If you want ddply(), you need the plyr library... you still haven't installed it???
> cdataNA=ddply(dataNA,c("sex","condition"),summarise,N=sum(!is.na(change)),mean=mean(change,na.rm=TRUE),sd=sd(change,na.rm=TRUE),se=sd/sqrt(N))
> cdataNA
sex condition N mean sd se
1 F aspirin 4 -3.425000 0.9979145 0.4989572
2 F placebo 12 -2.058333 0.5247655 0.1514867
3 M aspirin 7 -5.142857 1.0674848 0.4034713
4 M placebo 3 -1.300000 0.5291503 0.3055050
What's that? Missing values? Just give it na.rm=TRUE.
> ddply(dataNA,c("sex","condition"),summarise,N=sum(!is.na(change)),mean=mean(change),sd=sd(change,na.rm=TRUE),se=sd/sqrt(N))
sex condition N mean sd se
1 F aspirin 4 NA 0.9979145 0.4989572
2 F placebo 12 -2.058333 0.5247655 0.1514867
3 M aspirin 7 NA 1.0674848 0.4034713
4 M placebo 3 NA 0.5291503 0.3055050
Without na.rm=TRUE (here it's only missing from the mean), this is what you get. pandas has skipna on by default; R does not.
> library(doBy)
> cdata=summaryBy(change~sex+condition,data=data,FUN=c(length,mean,sd))
> cdata
sex condition change.length change.mean change.sd
1 F aspirin 5 -3.420000 0.8642916
2 F placebo 12 -2.058333 0.5247655
3 M aspirin 9 -5.411111 1.1307569
4 M placebo 4 -0.975000 0.7804913
You can also use summaryBy() from the doBy library. It's more concise than ddply().
What's that? No way to install libraries? Then there's base R's aggregate()... though as you'll see, installing a library is honestly the better deal.
> cdata=aggregate(data["subject"],by=data[c("sex","condition")],FUN=length)
> cdata
sex condition subject
1 F aspirin 5
2 M aspirin 9
3 F placebo 12
4 M placebo 4
Build it first,
> names(cdata)[names(cdata)=="subject"]="N"
> cdata
sex condition N
1 F aspirin 5
2 M aspirin 9
3 F placebo 12
4 M placebo 4
rename it,
> cdata=cdata[order(cdata$sex),]
> cdata
sex condition N
1 F aspirin 5
3 F placebo 12
2 M aspirin 9
4 M placebo 4
sort it... and we're still not done. Now we need the mean, the standard deviation, and the SE.
> cdata.means=aggregate(data[c("before","after","change")],by=data[c("sex","condition")],FUN=mean)
> cdata.means
sex condition before after change
1 F aspirin 11.06000 7.640000 -3.420000
2 M aspirin 11.26667 5.855556 -5.411111
3 F placebo 10.13333 8.075000 -2.058333
4 M placebo 11.47500 10.500000 -0.975000
> cdata=merge(cdata,cdata.means)
> cdata
sex condition N before after change
1 F aspirin 5 11.06000 7.640000 -3.420000
2 F placebo 12 10.13333 8.075000 -2.058333
3 M aspirin 9 11.26667 5.855556 -5.411111
4 M placebo 4 11.47500 10.500000 -0.975000
The means.
> cdata.sd=aggregate(data["change"],by=data[c("sex","condition")],FUN=sd)
> cdata.sd
sex condition change
1 F aspirin 0.8642916
2 M aspirin 1.1307569
3 F placebo 0.5247655
4 M placebo 0.7804913
> names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
> cdata.sd
sex condition change.sd
1 F aspirin 0.8642916
2 M aspirin 1.1307569
3 F placebo 0.5247655
4 M placebo 0.7804913
> cdata=merge(cdata,cdata.sd)
> cdata
sex condition subject before after change change.sd
1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916
2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655
3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569
4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913
The standard deviation (SD for short). (Note the count column prints as subject rather than N here; this output apparently came from a run before the rename step above.)
> cdata$change.se <- cdata$change.sd / sqrt(cdata$subject)
> cdata
sex condition subject before after change change.sd change.se
1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 0.3865230
2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 0.1514867
3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 0.3769190
4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
Our new hire Kim Buchu took that csv file and organized it neatly in R. (order is the quantity ordered, cancel the quantity cancelled.) But at the interim report, the supervisor had one comment!
"Buchu, this is organized really cleanly. But that ID column... people won't know what it is at a glance. Just change ID to Product ID and it should be good to submit."
Kim Buchu's brain bluescreens. Wait, does the csv file itself have to be re-edited? After all that grinding...OTL (Who's Buchu? Oh, that's my Nihilego's name. So now the Nihilego is getting a job? Eh, just go with it.)
> library(plyr)
> rename(df2,c("ID"="Product.ID"))
Product.ID Product.name Category.1 Category.2 Price Order
1 LB-001 LB agar Medium Bacteria 55000 5
2 LB-002 LB broth Medium Bacteria 50000 6
3 MS-001 MS agar Medium Plant 65000 7
4 MS-002 MS broth Medium Plant 60000 7
5 DM-001 DMEM Medium Cell culture 90000 20
6 BC-001 BCIG agar Medium Bacteria 70000 10
7 MS-003 MS agar(w/ plant hormone) Medium Plant 75000 2
8 PI-001 Pasteur pipet Consumable Experimental 10000 30
9 PI-002 Pipet-AID(10ml) Consumable Experimental 15000 45
10 PI-003 Pipet-AID(25ml) Consumable Experimental 25000 45
11 PI-004 Pipet-AID Equipment Experimental 100000 2
12 GL-001 Slide glass Glasswear Experimental 20000 50
13 GL-002 Cover glass Glasswear Experimental 20000 55
Cancel
1 0
2 1
3 2
4 0
5 0
6 2
7 1
8 5
9 0
10 5
11 0
12 0
13 1
No need to touch the file at all: that one line does it. (You did install the plyr library, right?) One caveat: run that and then check names(df2), and it still says ID, because rename() returns a modified copy rather than changing df2 in place.
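So to actually keep the change, assign the result back:
> df2 = rename(df2, c("ID"="Product.ID"))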
What, you're due to report right now and have no time to install plyr?
> names(df2)[names(df2)=="ID"]="Product.ID"
> df2
Product.ID Product.name Category.1 Category.2 Price Order
1 LB-001 LB agar Medium Bacteria 55000 5
2 LB-002 LB broth Medium Bacteria 50000 6
3 MS-001 MS agar Medium Plant 65000 7
4 MS-002 MS broth Medium Plant 60000 7
5 DM-001 DMEM Medium Cell culture 90000 20
6 BC-001 BCIG agar Medium Bacteria 70000 10
7 MS-003 MS agar(w/ plant hormone) Medium Plant 75000 2
8 PI-001 Pasteur pipet Consumable Experimental 10000 30
9 PI-002 Pipet-AID(10ml) Consumable Experimental 15000 45
10 PI-003 Pipet-AID(25ml) Consumable Experimental 25000 45
11 PI-004 Pipet-AID Equipment Experimental 100000 2
12 GL-001 Slide glass Glasswear Experimental 20000 50
13 GL-002 Cover glass Glasswear Experimental 20000 55
Cancel
1 0
2 1
3 2
4 0
5 0
6 2
7 1
8 5
9 0
10 5
11 0
12 0
13 1
There you go.
> names(df2)=sub("ID","Product.ID",names(df2))
> df2
Product.Product.ID Product.name Category.1 Category.2 Price
1 LB-001 LB agar Medium Bacteria 55000
2 LB-002 LB broth Medium Bacteria 50000
3 MS-001 MS agar Medium Plant 65000
4 MS-002 MS broth Medium Plant 60000
5 DM-001 DMEM Medium Cell culture 90000
6 BC-001 BCIG agar Medium Bacteria 70000
7 MS-003 MS agar(w/ plant hormone) Medium Plant 75000
8 PI-001 Pasteur pipet Consumable Experimental 10000
9 PI-002 Pipet-AID(10ml) Consumable Experimental 15000
10 PI-003 Pipet-AID(25ml) Consumable Experimental 25000
11 PI-004 Pipet-AID Equipment Experimental 100000
12 GL-001 Slide glass Glasswear Experimental 20000
13 GL-002 Cover glass Glasswear Experimental 20000
Order Cancel
1 5 0
2 6 1
3 7 2
4 7 0
5 20 0
6 10 2
7 2 1
8 30 5
9 45 0
10 45 5
11 2 0
12 50 0
13 55 1
> names(df2)=gsub("D","d",names(df2))
> df2
Product.Product.Id Product.name Category.1 Category.2 Price
1 LB-001 LB agar Medium Bacteria 55000
2 LB-002 LB broth Medium Bacteria 50000
3 MS-001 MS agar Medium Plant 65000
4 MS-002 MS broth Medium Plant 60000
5 DM-001 DMEM Medium Cell culture 90000
6 BC-001 BCIG agar Medium Bacteria 70000
7 MS-003 MS agar(w/ plant hormone) Medium Plant 75000
8 PI-001 Pasteur pipet Consumable Experimental 10000
9 PI-002 Pipet-AID(10ml) Consumable Experimental 15000
10 PI-003 Pipet-AID(25ml) Consumable Experimental 25000
11 PI-004 Pipet-AID Equipment Experimental 100000
12 GL-001 Slide glass Glasswear Experimental 20000
13 GL-002 Cover glass Glasswear Experimental 20000
Order Cancel
1 5 0
2 6 1
3 7 2
4 7 0
5 20 0
6 10 2
7 2 1
8 30 5
9 45 0
10 45 5
11 2 0
12 50 0
13 55 1
sub() and gsub() obviously work too (no library needed). Mind the result above, though: df2 had already been renamed once, so sub() matched the ID inside Product.ID and produced Product.Product.ID, and gsub("D","d") then lowercased every capital D in every name.
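To avoid that kind of accident, you can anchor the pattern so it only matches a whole name; a small sketch:
> names(df2) = sub("^ID$", "Product.ID", names(df2))
# ^ and $ pin the match to the entire string, so an existing Product.ID is left alone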
Adding and removing columns
> df
ID Interesred.in Class
1 kimlab0213 Python Basic
2 ahn_0526 Python Medium
3 peponi01 R Basic
4 kuda_koma R Expert
5 comma_life Java Basic
6 wheresjohn Java Medium
7 hanguk_joa Python Expert
8 sigma_00 R Basic
9 kokoatalk Java Basic
Let's bring example 1 back out. It's modeled on a web platform that offers programming-language courses to its members (hence Python, R, Java... where's C? Which C? C#? C++?). So ID is the member ID, Interesred.in is the language they're interested in, and Class is the course level (Basic/Medium/Expert).
Hmm, but this alone doesn't tell us whether the owner of each ID actually completed the course. So let's add a P/F column.
> df$PF=c("P","P","F","F","P","F","P","F","F")
> df
ID Interesred.in Class PF
1 kimlab0213 Python Basic P
2 ahn_0526 Python Medium P
3 peponi01 R Basic F
4 kuda_koma R Expert F
5 comma_life Java Basic P
6 wheresjohn Java Medium F
7 hanguk_joa Python Expert P
8 sigma_00 R Basic F
9 kokoatalk Java Basic F
The column name? With the $ syntax, a name containing spaces throws an error unless you go through quoted indexing like below... (or just use underscores).
> df[,"Pass or Fail"]=c("P","P","F","F","P","F","P","F","F")
> df
ID Interesred.in Class Pass or Fail
1 kimlab0213 Python Basic P
2 ahn_0526 Python Medium P
3 peponi01 R Basic F
4 kuda_koma R Expert F
5 comma_life Java Basic P
6 wheresjohn Java Medium F
7 hanguk_joa Python Expert P
8 sigma_00 R Basic F
9 kokoatalk Java Basic F
Or how about this?
So Kim Sangchu pitched the idea. But the senior, Kim Yangsangchu (both are my Pokémon's names... a Lileep, maybe?), heard it and said:
"Good idea, but... with P/F you can tell whether the course was completed, but not how attendance went. How about dropping P/F and recording the attendance rate instead, or listing both?"
> df$PF=NULL
> df
ID Interesred.in Class
1 kimlab0213 Python Basic
2 ahn_0526 Python Medium
3 peponi01 R Basic
4 kuda_koma R Expert
5 comma_life Java Basic
6 wheresjohn Java Medium
7 hanguk_joa Python Expert
8 sigma_00 R Basic
9 kokoatalk Java Basic
So Kim Sangchu dropped the PF column.
For reference:
# Ways to add a column
data$size <- c("small", "large", "medium")
data[["size"]] <- c("small", "large", "medium")
data[,"size"] <- c("small", "large", "medium")
data$size <- 0 # Use the same value (0) for all rows
# Ways to remove the column
data$size <- NULL
data[["size"]] <- NULL
data[,"size"] <- NULL
data[[3]] <- NULL
data[,3] <- NULL
data <- subset(data, select=-size)
> df
ID Interesred.in Class Pass or Fail
1 kimlab0213 Python Basic P
2 ahn_0526 Python Medium P
3 peponi01 R Basic F
4 kuda_koma R Expert F
5 comma_life Java Basic P
6 wheresjohn Java Medium F
7 hanguk_joa Python Expert P
8 sigma_00 R Basic F
9 kokoatalk Java Basic F
Yes, that's the one from before. We're going to add per-class price information to it... so where's the price list?
> df5
Class Price
1 Basic Free
2 Medium 1000
3 Expert 2000
Right here.
Kim Sangchu got handed the job of Final(star)Fusing these tables (really, they're my Pokémon's names. Why are the names like that? Because of Buchu I went with every vegetable ending in -chu; sangchu is lettuce. Mercifully there's no Gochu, that one sounds a bit iffy)... So what, make a new column and type it all in by hand? Nope.
> merge(df,df5,"Class")
Class ID Interesred.in Pass or Fail Price
1 Basic kimlab0213 Python P Free
2 Basic peponi01 R F Free
3 Basic sigma_00 R F Free
4 Basic comma_life Java P Free
5 Basic kokoatalk Java F Free
6 Expert hanguk_joa Python P 2000
7 Expert kuda_koma R F 2000
8 Medium ahn_0526 Python P 1000
9 Medium wheresjohn Java F 1000
Just Final-Fusion them on the shared column, Class, and you're done.
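If the key columns had different names in the two frames, merge() also takes by.x/by.y; a hypothetical sketch (Course.level is a made-up column name, just for illustration):
> merge(df, df5, by.x="Class", by.y="Course.level")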
> df6[c(3,4,5,1,2)]
ID Interesred.in Pass or Fail Class Price
1 kimlab0213 Python P Basic Free
2 peponi01 R F Basic Free
3 sigma_00 R F Basic Free
4 comma_life Java P Basic Free
5 kokoatalk Java F Basic Free
6 hanguk_joa Python P Expert 2000
7 kuda_koma R F Expert 2000
8 ahn_0526 Python P Medium 1000
9 wheresjohn Java F Medium 1000
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
If anyone knows how to fix this error, please let me know. (I did check that the data frame columns being merged are factors with the same levels.) For what it's worth, merge() throws this when the name given as by= isn't a valid column in both data frames, so checking names() of both frames is the first thing to try.
> animals <- read.table(header=T, text='
+ size type name
+ small cat lynx
+ big cat tiger
+ small dog chihuahua
+ big dog "great dane"
+ ')
> observations <- read.table(header=T, text='
+ number size type
+ 1 big cat
+ 2 small dog
+ 3 small dog
+ 4 big dog
+ ')
Final-Fusing these two works as well.
> merge(observations,animals,c("size","type"))
size type number name
1 big cat 1 tiger
2 big dog 4 great dane
3 small dog 2 chihuahua
4 small dog 3 chihuahua
I wasn't sure why this works at first, but it's because merge() keys on every shared column, here size and type together, so each observation row is matched to the animal row with the same size-type pair.
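And if you want unmatched rows kept rather than dropped, merge() has an all argument (a sketch; all=TRUE keeps non-matching rows from both sides, padded with NA):
> merge(observations, animals, by=c("size","type"), all=TRUE)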
Comparing data frames
> dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X"))
> dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X"))
> dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z"))
> dfA
Subject Response
1 1 X
2 1 X
3 2 X
4 2 X
> dfB
Subject Response
1 1 X
2 2 Y
3 3 X
> dfC
Subject Response
1 1 Z
2 2 Y
3 3 Z
I've brought in the demo data from the Cookbook.
> dfA$Coder="A"
> dfB$Coder="B"
> dfC$Coder="C"
Add a column named Coder to each data frame.
> df7=rbind(dfA,dfB,dfC)
> df7
Subject Response Coder
1 1 X A
2 1 X A
3 2 X A
4 2 X A
5 1 X B
6 2 Y B
7 3 X B
8 1 Z C
9 2 Y C
10 3 Z C
Then Final-Fusion them row by row!
> df7=df7[,c("Coder","Subject","Response")]
> df7
Coder Subject Response
1 A 1 X
2 A 1 X
3 A 2 X
4 A 2 X
5 B 1 X
6 B 2 Y
7 B 3 X
8 C 1 Z
9 C 2 Y
10 C 3 Z
...and once the columns are reordered, the demo setup is done.
Note that before going further you have to define one function:
dupsBetweenGroups <- function (df, idcol) {
# df: the data frame
# idcol: the column which identifies the group each row belongs to
# Get the data columns to use for finding matches
datacols <- setdiff(names(df), idcol)
# Sort by idcol, then datacols. Save order so we can undo the sorting later.
sortorder <- do.call(order, df)
df <- df[sortorder,]
# Find duplicates within each id group (first copy not marked)
dupWithin <- duplicated(df)
# With duplicates within each group filtered out, find duplicates between groups.
# Need to scan up and down with duplicated() because first copy is not marked.
dupBetween = rep(NA, nrow(df))
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols])
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin]
# ============= Replace NA's with previous non-NA value ==============
# This is why we sorted earlier - it was necessary to do this part efficiently
# Get indexes of non-NA's
goodIdx <- !is.na(dupBetween)
# These are the non-NA values from x only
# Add a leading NA for later use when we index into this vector
goodVals <- c(NA, dupBetween[goodIdx])
# Fill the indices of the output vector with the indices pulled from
# these offsets of goodVals. Add 1 to avoid indexing to zero.
fillIdx <- cumsum(goodIdx)+1
# The original vector, now with gaps filled
dupBetween <- goodVals[fillIdx]
# Undo the original sort
dupBetween[sortorder] <- dupBetween
# Return the vector of which entries are duplicated across groups
return(dupBetween)
}
> dupRows = dupsBetweenGroups(df7, "Coder")
> cbind(df7,dup=dupRows)
Coder Subject Response dup
1 A 1 X TRUE
2 A 1 X TRUE
3 A 2 X FALSE
4 A 2 X FALSE
5 B 1 X TRUE
6 B 2 Y TRUE
7 B 3 X FALSE
8 C 1 Z FALSE
9 C 2 Y TRUE
10 C 3 Z FALSE
Even with the result bound on as a column, I couldn't tell what it was at first...
What it actually is: a function that flags values duplicated across the tables. TRUE if the same row also appears under a different Coder, FALSE if it appears nowhere else.
> cbind(df7,unique=!dupRows)
Coder Subject Response unique
1 A 1 X FALSE
2 A 1 X FALSE
3 A 2 X TRUE
4 A 2 X TRUE
5 B 1 X FALSE
6 B 2 Y FALSE
7 B 3 X TRUE
8 C 1 Z TRUE
9 C 2 Y FALSE
10 C 3 Z TRUE
This is the reverse: TRUE if there's no duplicate, FALSE if there is.
Splitting into subsets
(dfDup below is that cbind(df7, dup=dupRows) result from above, saved to a variable.)
> dfA=subset(dfDup,Coder=="A",select=-Coder)
> dfB=subset(dfDup,Coder=="B",select=-Coder)
> dfC=subset(dfDup,Coder=="C",select=-Coder)
> dfA
Subject Response dup
1 1 X TRUE
2 1 X TRUE
3 2 X FALSE
4 2 X FALSE
> dfB
Subject Response dup
5 1 X TRUE
6 2 Y TRUE
7 3 X FALSE
> dfC
Subject Response dup
8 1 Z FALSE
9 2 Y TRUE
10 3 Z FALSE
# oh, so select=-Coder meant "select everything except Coder"
> v=factor(c("grass","water","grass"),levels=c("grass","water","fire"))
> v
[1] grass water grass
Levels: grass water fire
Here's a factor. But look at the elements... there's a level with no elements in it?
> v=factor(v)
> v
[1] grass water grass
Levels: grass water
So I deleted it (re-running factor() drops the unused levels).
> x=factor(c("A","B","A"),levels=c("A","B","C"))
> y=c(1,2,3)
> z=factor(c("R","G","G"),levels=c("R","G","B"))
> df=data.frame(x,y,z)
> df
x y z
1 A 1 R
2 B 2 G
3 A 3 G
Factors can of course go into a data frame too. Print the data frame and it just comes out as a table,
> df$x
[1] A B A
Levels: A B C
but pull the column out as a factor on its own and the levels show up like this.
And here, too, there's a level with no elements?
> df=droplevels(df)
> df
x y z
1 A 1 R
2 B 2 G
3 A 3 G
> df$x
[1] A B A
Levels: A B
droplevels() deletes them.
Adjusting levels: changing the order
> v=factor(c("S","M","L","XL","M","L"))
> v
[1] S M L XL M L
Levels: L M S XL
G-Gourgeist-chan, hnngh... ahem, anyway: Pumpkaboo and Gourgeist have a size mechanic. But look at the levels... that's not the right order. It's alphabetical, not size order...OTL
> v=factor(v,levels=c("S","M","L","XL"))
> v
[1] S M L XL M L
Levels: S M L XL
You might be wondering why there are repeated items up there; well, data pulled from ChEMBL has more blanks and repeats than you'd expect. That's why I ran dropna() before drawing the scatter plot too, and dropping like that loses quite a lot (unless you pass subset=, dropna() doesn't limit itself to one column; any row with a blank anywhere is gone, no questions asked).
> data[!duplicated(data$Name),]
This keeps only the first occurrence of each Name, i.e. it dedupes on a specific column.
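duplicated() also takes fromLast, in case you'd rather keep the last occurrence instead of the first:
> data[!duplicated(data$Name, fromLast=TRUE),]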
> unique(data$Name)
[1] IODINE I 131 DERLOTUXIMAB BIOTIN
[3] biotin-geranylpyrophosphate BIOTIN
4 Levels: BIOTIN ... IODINE I 131 DERLOTUXIMAB BIOTIN
Picking out the distinct values works too, with unique(). (This DB is biotin-related compounds.) Though out of 80-odd compounds, these being the only ones with names is a bit much...
Comparing data with NA in it
> df <- data.frame( a=c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,NA,NA,NA),
+ b=c(TRUE,FALSE,NA,TRUE,FALSE,NA,TRUE,FALSE,NA))
> df
a b
1 TRUE TRUE
2 TRUE FALSE
3 TRUE NA
4 FALSE TRUE
5 FALSE FALSE
6 FALSE NA
7 NA TRUE
8 NA FALSE
9 NA NA
We'll work with this...
First off, NA is a missing value (NULL is truly empty, while NA is a slot filled with a missing-value marker), so comparing it...
> df$a == df$b
[1] TRUE FALSE NA FALSE TRUE NA NA NA NA
...doesn't work.
That's why pandas in Python just leaves NA out of its computations.
> data.frame(df, isSame = (df$a==df$b))
a b isSame
1 TRUE TRUE TRUE
2 TRUE FALSE FALSE
3 TRUE NA NA
4 FALSE TRUE FALSE
5 FALSE FALSE TRUE
6 FALSE NA NA
7 NA TRUE NA
8 NA FALSE NA
9 NA NA NA
If either side is NA, it refuses to compare at all. The Cookbook sorts this out by writing a function,
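whose definition didn't make it into my transcript; the Cookbook's compareNA is essentially this:
compareNA <- function(v1, v2) {
  # TRUE wherever both values are equal, or both are NA
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE  # a lone NA on either side becomes FALSE
  return(same)
}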
> data.frame(df,isSame = compareNA(df$a,df$b))
a b isSame
1 TRUE TRUE TRUE
2 TRUE FALSE FALSE
3 TRUE NA FALSE
4 FALSE TRUE FALSE
5 FALSE FALSE TRUE
6 FALSE NA FALSE
7 NA TRUE FALSE
8 NA FALSE FALSE
9 NA NA TRUE
and with it, NA paired with NA counts as equal.
Recoding data
Both of the approaches here need plyr again... We'll recode a categorical variable and a numeric one.
The difference between the two kinds of data is whether the value can be measured as a number. Our categorical example is the molecule type; the numeric example is the molecular weight.
> data$Type
[1] Small molecule Small molecule Small molecule Small molecule
[6] Small molecule Small molecule Small molecule Small molecule
[11] Small molecule Small molecule Small molecule Small molecule Small molecule
[16] Small molecule Small molecule Small molecule
[21] Antibody Small molecule Protein Small molecule Small molecule
[26] Small molecule Small molecule Small molecule Small molecule Small molecule
[31] Small molecule Small molecule Small molecule Small molecule
[36] Small molecule Small molecule Small molecule
[41] Small molecule Unknown Small molecule Small molecule Small molecule
[46] Small molecule Small molecule Small molecule Small molecule Small molecule
[51] Small molecule Small molecule Small molecule Unknown Small molecule
[56] Small molecule Small molecule Small molecule Small molecule
[61] Small molecule Small molecule Small molecule Small molecule Small molecule
[66] Small molecule Small molecule Small molecule Small molecule Small molecule
[71] Small molecule Small molecule Small molecule Small molecule Small molecule
[76] Small molecule Small molecule Small molecule
Levels: Antibody Protein Small molecule Unknown
This is the molecule type: split by whether it's a small molecule, an antibody, a protein, or unknown. (The ragged spacing is just the console wrapping the long level names.) In other words, not something you measure as a number.
> data$code=revalue(data$Type,c("Antibody"="A","Small molecule"="S","Protein"="P","Unknown"="U"))
> data$code
[1] S S S S S S S S S S S S S S S S A S P S S S S S S S S S S S S S
[39] S S U S S S S S S S S S S S U S S S S S S S S S S S S S S S S S S S S S
[77] S S
Levels: A P S U
That's revalue(); the same thing with mapvalues() and from/to vectors:
> data$code=mapvalues(data$Type,from=c("Antibody","Small molecule","Protein","Unknown"),to=c("A","S","P","U"))
> data$code
[1] S S S S S S S S S S S S S S S S A S P S S S S S S S S S S S S S
[39] S S U S S S S S S S S S S S U S S S S S S S S S S S S S S S S S S S S S
[77] S S
Levels: A P S U
And what's this category business...? Ah, here we split at a molecular weight of 500. More on the rationale for that cutoff some other time.
> data$category=cut(data$Molecular.Weight, breaks=c(0,500,Inf),labels=c("Small","Large"))
> data$category
[1] Small Small Large Large Large Large Large Large Large Large Large Large
[13] Large Large Large Large Large Large Large Large <NA> Large Large Large
[25] Large Large Large Large Small Large Large Large Large Large Large Large
[37] Large Large Large Large Large Large Large Large Large Small Large Large
[49] Large Large Large Large Large Large Small Large Large Large Large Large
[61] Large Small Large Large Large Large Large Large Small Large Large Large
[73] Large Large Large Large Large Large
Levels: Small Large
If typing several lines of comparisons is a pain, cut() does the splitting in one go, like above.
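For the record, the long way that cut() replaces would look roughly like this (a sketch matching the (0,500]/(500,Inf] split above):
> data$category = NA
> data$category[data$Molecular.Weight <= 500] = "Small"
> data$category[data$Molecular.Weight > 500] = "Large"
# (cut() additionally returns a factor; this version leaves plain strings)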
Arithmetic between columns
> data$ratio=data$HBA/data$HBD
Warning message:
In Ops.factor(data$HBA, data$HBD) :
  '/' not meaningful for factors
> data$ratio
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA
HBA and HBD came in as factors, so / is refused outright and the result is all NA. The workaround is to go through as.character() and then as.numeric():
> data$sum = as.numeric(as.character(data$HBA)) + as.numeric(as.character(data$HBD))
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
> data$sum
[1] 6 12 18 13 NA 17 20 16 18 NA 15 26 11 NA 18 18 14 11 17 21 NA 11 NA NA NA
[26] 19 10 NA 10 15 20 10 NA 32 10 18 NA 13 11 12 20 NA NA 16 13 9 NA 13 15 17
[51] NA 16 17 NA 7 19 13 NA 19 17 NA 8 16 NA NA 13 NA NA 6 16 28 16 12 16 16
[76] 18 NA 15
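The root cause is that these columns were read from the csv as factors in the first place. A sketch of heading that off at read time (the file name and the na.strings values are assumptions about this data):
> data = read.csv("chembl_biotin.csv", stringsAsFactors=FALSE, na.strings=c("", "N/A", "None"))
> data$ratio = data$HBA / data$HBD
# with the columns read as plain numerics, / works directly and blanks become real NAs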