Solution

2019-3

Packages

library(foreign)

Questions

This question is based on the diabetes dataset (diabetes.arff), which consists of 768 observations and 9 attributes. Brief descriptions of the attributes are as follows:

– preg: Number of times the patient has been pregnant

– plas: Plasma glucose concentration

– pres: Diastolic blood pressure (mm Hg)

– skin: Triceps skin fold thickness (mm)

– insu: 2-hour serum insulin (mu U/ml)

– mass: Body mass index (weight in kg/(height in m)^2)

– pedi: Diabetes pedigree function

– age: Age (years)

– class: Class variable (either tested_negative or tested_positive)

a)

Provide the R code for loading the data into a variable (Diabetes). (1 mark)

Diabetes <- read.arff('diabetes.arff')
Diabetes

preg plas pres skin insu mass  pedi age           class
1      6  148   72   35    0 33.6 0.627  50 tested_positive
2      1   85   66   29    0 26.6 0.351  31 tested_negative
……
767    1  126   60    0    0 30.1 0.349  47 tested_positive
768    1   93   70   31    0 30.4 0.315  23 tested_negative

b)

Provide the R code for generating the CSV equivalent of the diabetes dataset (diabetes.csv). (1 mark)

write.csv(Diabetes, 'diabetes.csv')

c)

Compare and contrast the similarities and differences of the ARFF format and the CSV format. (2 marks)

• Similarities: both ARFF and CSV are plain-text data file formats that store tabular data one record per line, with values separated by commas.

• Differences: 1) An ARFF file begins with a header that declares the relation name and each attribute's name and type, whereas a CSV file carries at most a single header row of column names. 2) File sizes differ: in this case "diabetes.arff" occupies 35 KB while "diabetes.csv" occupies 40 KB, and if the data are actually sparse, the sparse ARFF representation can save considerably more space. 3) When the dataset is large enough, loading CSV data is more time-consuming than loading ARFF data.
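The size comparison above can be reproduced in miniature. The sketch below (assuming the foreign package loaded earlier is installed; the example data frame is made up, so exact sizes will differ from the 35 KB / 40 KB figures for the diabetes files) writes the same data frame in both formats and compares the resulting file sizes:

```r
library(foreign)

# A small synthetic data frame standing in for the diabetes data.
df <- data.frame(x = runif(1000),
                 y = sample(c("a", "b"), 1000, replace = TRUE))

arff_file <- tempfile(fileext = ".arff")
csv_file  <- tempfile(fileext = ".csv")

write.arff(df, arff_file)                  # ARFF: typed header + data section
write.csv(df, csv_file, row.names = FALSE) # CSV: one header row of names

file.size(c(arff_file, csv_file))          # sizes in bytes
```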

d)

Provide the R code for generating a logistic regression model (model) using class as the response and the other attributes as predictors. (1 mark)

model<-glm(class~preg+plas+pres+skin+insu+mass+pedi+age,family=binomial,data=Diabetes)
summary(model)

Call:
glm(formula = class ~ preg + plas + pres + skin + insu + mass +
   pedi + age, family = binomial, data = Diabetes)

Deviance Residuals:
   Min       1Q   Median       3Q      Max  
-2.5566  -0.7274  -0.4159   0.7267   2.9297  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
preg         0.1231823  0.0320776   3.840 0.000123 ***
plas         0.0351637  0.0037087   9.481  < 2e-16 ***
pres        -0.0132955  0.0052336  -2.540 0.011072 *  
skin         0.0006190  0.0068994   0.090 0.928515    
insu        -0.0011917  0.0009012  -1.322 0.186065    
mass         0.0897010  0.0150876   5.945 2.76e-09 ***
pedi         0.9451797  0.2991475   3.160 0.001580 **
age          0.0148690  0.0093348   1.593 0.111192

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

(Dispersion parameter for binomial family taken to be 1)

   Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 723.45  on 759  degrees of freedom
AIC: 741.45

Number of Fisher Scoring iterations: 5

e)

Using the logistic regression result of the model, write down the equation of log-odds of the model. Please round off all the coefficient estimates to 4-decimal places. (1 mark)

• The log-odds of the model is:

log(p / (1 − p)) = −8.4047 + 0.1232·preg + 0.0352·plas − 0.0133·pres + 0.0006·skin − 0.0012·insu + 0.0897·mass + 0.9452·pedi + 0.0149·age

• where p = Pr(class = tested_positive | data), since tested_negative is the reference class.

f)

We learned that logistic regression uses a logistic function to model the class probability given a data point. R uses the first level value of a factor-type attribute as the reference class, and the fitted model gives the probability of the other (non-reference) level.

From the above output, we can tell that the reference class is tested_negative, so the model predicts Pr(class = tested_positive | data). Using this model, determine whether the following testing data point should be classified as tested_negative or tested_positive. Show the step-by-step (mathematical) working of how you arrive at this conclusion. Testing data point: preg=1, plas=123, pres=60, skin=20, insu=0, mass=30, pedi=0.3, age=40.

testdata<-data.frame(preg=1,plas=123,pres=60,skin=20,insu=0,mass=30,pedi=0.3,age=40)
mypre <- (-8.4047 + 0.1232*testdata$preg + 0.0352*testdata$plas - 0.0133*testdata$pres + 0.0006*testdata$skin - 0.0012*testdata$insu + 0.0897*testdata$mass + 0.9452*testdata$pedi + 0.0149*testdata$age)
(myp<-exp(mypre)/(1+exp(mypre)))

[1] 0.2373361

According to the above output, the probability of tested_positive is about 0.237, which is below 0.5, so the testing data point should be classified as tested_negative.
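The hand calculation can be checked without the dataset by hard-coding the rounded coefficients from d) and the testing values (a self-contained re-run of the arithmetic above):

```r
# Coefficients rounded to 4 dp, taken from summary(model) in part d).
coefs <- c(intercept = -8.4047, preg = 0.1232, plas = 0.0352, pres = -0.0133,
           skin = 0.0006, insu = -0.0012, mass = 0.0897, pedi = 0.9452,
           age = 0.0149)

# Testing data point, with a leading 1 for the intercept.
x <- c(1, preg = 1, plas = 123, pres = 60, skin = 20, insu = 0,
       mass = 30, pedi = 0.3, age = 40)

eta <- sum(coefs * x)             # log-odds of tested_positive
p   <- exp(eta) / (1 + exp(eta))  # probability of tested_positive
p                                 # approximately 0.2373, as above
```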

g)

Provide the R codes for verifying the probability value of f) using the predict() function in R. (2 marks)

pre <- predict(model, testdata)   # linear predictor (log-odds) by default
(p <- exp(pre)/(1 + exp(pre)))    # convert log-odds to a probability

1
0.2364237

This result is consistent with the probability value of f); the small difference (0.2364237 vs. 0.2373361) comes from rounding the coefficients to 4 decimal places in f).
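The manual exp(pre)/(1 + exp(pre)) transform is exactly base R's logistic function plogis(), so predict(model, testdata, type = "response") would return the probability directly. A minimal identity check:

```r
# plogis(x) = exp(x) / (1 + exp(x)) for any log-odds x, so applying it to
# the linear predictor reproduces the manual conversion above.
eta    <- seq(-3, 3, by = 0.5)           # some example log-odds values
manual <- exp(eta) / (1 + exp(eta))
stopifnot(all.equal(manual, plogis(eta)))
```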

h)

Suppose you want to change the reference class in R to tested_positive. You should use the relevel() function. Read the help pages and provide the R command to change the reference class to tested_positive so that predict() will be based on tested_positive. (1 mark)

#help(relevel)
Diabetes$class <- relevel(Diabetes$class, ref = "tested_positive")
levels(Diabetes$class)

[1] "tested_positive" "tested_negative"
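How relevel() reorders the levels can be seen on a small made-up factor mirroring the class attribute (no dataset needed):

```r
# By default factor levels are sorted alphabetically, so
# "tested_negative" comes first and is the reference class.
f <- factor(c("tested_negative", "tested_positive", "tested_negative"))
levels(f)

# relevel() moves the requested level to the front without
# changing the underlying data.
f2 <- relevel(f, ref = "tested_positive")
levels(f2)
```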

i)

If you were to generate a new model (model2) using tested_positive as the reference class, what is the difference in the regression model of model2 compared to model? (2 marks)

model2<-glm(class~preg+plas+pres+skin+insu+mass+pedi+age,family=binomial,data=Diabetes)
summary(model2)

Call:
glm(formula = class ~ preg + plas + pres + skin + insu + mass +
   pedi + age, family = binomial, data = Diabetes)

Deviance Residuals:
   Min       1Q   Median       3Q      Max  
-2.9297  -0.7267   0.4159   0.7274   2.5566  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  8.4046964  0.7166359  11.728  < 2e-16 ***
preg        -0.1231823  0.0320776  -3.840 0.000123 ***
plas        -0.0351637  0.0037087  -9.481  < 2e-16 ***
pres         0.0132955  0.0052336   2.540 0.011072 *  
skin        -0.0006190  0.0068994  -0.090 0.928515    
insu         0.0011917  0.0009012   1.322 0.186065    
mass        -0.0897010  0.0150876  -5.945 2.76e-09 ***
pedi        -0.9451797  0.2991475  -3.160 0.001580 **
age         -0.0148690  0.0093348  -1.593 0.111192    

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

(Dispersion parameter for binomial family taken to be 1)

   Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 723.45  on 759  degrees of freedom
AIC: 741.45

Number of Fisher Scoring iterations: 5

• All coefficients (including the intercept) change sign, while their absolute values, standard errors, and p-values stay the same. This is because model2 estimates the log-odds of tested_negative rather than tested_positive, and Pr(tested_negative) = 1 − Pr(tested_positive).
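The sign flip can be verified numerically: if eta is the log-odds of tested_positive, then the log-odds of tested_negative is exactly −eta. A quick check (the eta value is just an example):

```r
# plogis() maps log-odds to a probability; qlogis() is its inverse.
eta <- -1.16734                 # log-odds of tested_positive from part f)
p   <- plogis(eta)              # Pr(tested_positive)

# Swapping the reference class models Pr(tested_negative) = 1 - p,
# whose log-odds is the negation of eta.
qlogis(1 - p)
```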