There are 5 questions (3 questions in R and 2 in Python). Please write your answer only in the space provided. Files are available in the Tests folder in Blackboard. Please submit all your work to Blackboard before 1:10 pm.

R portion

Question 1 (20 marks)

This question is based on a reduced version of the bank marketing dataset, which I have provided for this exam. For logistic regression, the variable of interest (the target variable) is

· y – has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

The rest of the variables are shown below

· age (numeric)

· job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)

· marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)

· education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)

· default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)

· housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

· loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

· contact: contact communication type (categorical: ‘cellular’, ‘telephone’)

· month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

· day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)

· duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

· campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

· pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

· previous: number of contacts performed before this campaign and for this client (numeric)

· poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

For parts (a) and (b), please provide an R file named as follows. If your GWId is G11221234, the file should be named G11221234-1R.R. For parts (c), (d) and (e), answer only in the space provided. Please use the dataset named bank1.csv.

Your tasks are:

a) (4 marks) Using duration as the independent variable, develop four logistic regression models for four subsets of data. The data are subset based on the following two variables

loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’), and

default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)

Please ensure that you remove all rows where loan or default is ‘unknown’. That will leave you with only four categories (one per combination of loan and default). Make sure to recode all ‘yes’ values of y to 1 and all ‘no’ values to 0.

b) (4 marks) Please plot the four logistic regressions (actual and modeled data)

c) (4 marks) Write each of the four equations below

d) (4 marks) Explain the equation for loan=yes, default=yes

e) (4 marks) Which of these models can the bank use? Why?
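
For reference when writing the equations in part (c), a logistic regression with duration as the only predictor has the general form below. The symbols p, \beta_0 and \beta_1 are notation introduced here for illustration and do not come from the exam; p stands for the probability that y = 1, and each of the four subsets has its own fitted pair of coefficients.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 \cdot \text{duration}
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{duration})}}
$$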

Question 2 (20 marks)

Please use the cereal.csv dataset for this problem.

(a) (10 marks) Write a function called mysummary that

– takes the given data frame as input,

– calculates the average for calories, protein and fat,

– for the top-ranked 5 cereals based on rating, and

– returns the three averages for calories, protein and fat as a vector

Following that,

– (4 marks) Read the data file into a data frame called dfq2

– (4 marks) Call the mysummary function by passing dfq2 as the parameter

– (6 marks) Provide the output formatted as follows (see footnote 1):

Average calories is 106.88, average protein is 2.54, and average fat is 1.01

If your GWId is G11221234, the file should be named G11221234-2R.R

Question 3 (20 marks)

Using the input file named nst-est2016-01.csv, build the following R Shiny application. Depending on the slider input, only those states whose 2016 population lies within the slider range are displayed. Put your ui.R, server.R and the data file in a folder named after your GWId. So if your GWId is G11221234, the folder should be named G11221234.

[Figure: screenshot of the R Shiny application]

Submit your zipped folder.

Footnote 1: The actual numbers will be different depending on the file you use.

Python portion

Question 4 (20 marks)

You will use the Cigar.csv dataset for this question. It is a panel of cigarette consumption covering 46 states from 1963 to 1992. The total number of observations (rows) is 1380. The data fields are as follows:

· state: state abbreviation

· year: the year

· price: price per pack of cigarettes

· pop: population

· pop16: population above the age of 16

· cpi: consumer price index (1983=100)

· ndi: per capita disposable income

· sales: cigarette sales in packs per capita

· pimin: minimum price in adjoining states per pack of cigarettes

Your task is to create a box plot based on all rows for the five states with the highest average price per pack of cigarettes (price) across all years. It should look something like the plot shown below. The plot needs to be properly labelled.

Submit your response as a notebook file named G11221234-4P.ipynb, assuming your GWId is G11221234.
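
The selection step in Question 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not a model answer: the rows are a small made-up sample rather than Cigar.csv, the helper names top_states_by_mean_price and boxplot_prices are hypothetical, and only the state and price fields from the description are used. Plotting assumes matplotlib is installed; it is imported only when boxplot_prices is called.

```python
from collections import defaultdict
from statistics import mean


def top_states_by_mean_price(rows, n=5):
    """Return {state: [prices]} for the n states with the highest mean price.

    rows is an iterable of dicts with 'state' and 'price' keys, for example
    the rows produced by csv.DictReader. States appear highest mean first.
    """
    prices = defaultdict(list)
    for row in rows:
        prices[row["state"]].append(float(row["price"]))
    ranked = sorted(prices, key=lambda s: mean(prices[s]), reverse=True)
    # Keep every observation (all years) for each selected state.
    return {state: prices[state] for state in ranked[:n]}


def boxplot_prices(prices_by_state):
    """Draw one labelled box per state (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt

    labels = list(prices_by_state)
    fig, ax = plt.subplots()
    ax.boxplot([prices_by_state[s] for s in labels])
    ax.set_xticklabels(labels)
    ax.set_xlabel("State")
    ax.set_ylabel("Price per pack")
    ax.set_title("Price per pack: five states with highest average price")
    return fig


# Made-up sample rows (NOT taken from Cigar.csv), two "years" per state.
sample_rows = [
    {"state": "AA", "price": "30"}, {"state": "AA", "price": "50"},
    {"state": "BB", "price": "30"}, {"state": "BB", "price": "40"},
    {"state": "CC", "price": "50"}, {"state": "CC", "price": "70"},
    {"state": "DD", "price": "15"}, {"state": "DD", "price": "25"},
    {"state": "EE", "price": "40"}, {"state": "EE", "price": "50"},
    {"state": "FF", "price": "5"}, {"state": "FF", "price": "15"},
    {"state": "GG", "price": "50"}, {"state": "GG", "price": "60"},
]

top5 = top_states_by_mean_price(sample_rows)
print(list(top5))
```

With the real file, the rows could instead come from csv.DictReader applied to Cigar.csv, and boxplot_prices(top5) would then draw the labelled plot. Ranking on the per-state mean but plotting every row for the selected states matches the wording "based on all rows".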

Question 5 (20 marks)

This question requires you to scrape the following website.

The top part of that table is shown below

You will need to

(a) (10 marks) Scrape the data from this table

(b) (2 marks) Save the data to a csv file

(c) (2 marks) Load the data into a pandas data frame

(d) (6 marks) Provide a bar graph for the five largest rivers based on length.

To help you get started, you can search for this table using the following command.

table = soup.findAll("table", {"class": "wikitable"})

Submit your response as a notebook file named as G11221234-5P.ipynb assuming your GWId is G11221234.
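
The flow of Question 5 (extract the table rows, save them as CSV, load them back, pick the five longest) can be sketched as follows. This is only an illustration under stated assumptions: it parses a made-up inline HTML snippet instead of the website, and it swaps in the standard-library html.parser and csv modules for BeautifulSoup and pandas so that it runs without third-party packages or network access. The class name WikitableParser, the column names and the river names are all hypothetical, and the bar graph itself is left out.

```python
import csv
import io
from html.parser import HTMLParser


class WikitableParser(HTMLParser):
    """Collect cell text from every <table class="wikitable"> in a page."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._in_table = False
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


# Made-up page fragment; the names and lengths are placeholders, not real data.
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>River</th><th>Length (km)</th></tr>
<tr><td>River A</td><td>1,200</td></tr>
<tr><td>River B</td><td>3,400</td></tr>
<tr><td>River C</td><td>560</td></tr>
<tr><td>River D</td><td>2,100</td></tr>
<tr><td>River E</td><td>4,800</td></tr>
<tr><td>River F</td><td>900</td></tr>
<tr><td>River G</td><td>2,750</td></tr>
</table>
"""

# (a) Extract the rows from the table.
parser = WikitableParser()
parser.feed(SAMPLE_HTML)

# (b) Save as CSV. An in-memory buffer stands in for a file on disk here.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(parser.rows)

# (c) Load the CSV back as a list of dicts keyed by the header row.
buffer.seek(0)
records = list(csv.DictReader(buffer))


# (d) Five longest rivers; thousands separators are removed before converting.
def length_km(record):
    return float(record["Length (km)"].replace(",", ""))


longest = sorted(records, key=length_km, reverse=True)[:5]
print([r["River"] for r in longest])
```

In a notebook, the names and lengths in longest would then be passed to a bar-chart call, with the axes labelled.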

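# Q1 - Part (a)
df <- read.csv("/Users/Phantom/Desktop/sample final/bank1.csv")

# Subsetting on 'yes'/'no' also drops every row where loan or default is 'unknown'
lydy = subset(df, loan == 'yes' & default == 'yes')
lydn = subset(df, loan == 'yes' & default == 'no')
lndy = subset(df, loan == 'no' & default == 'yes')
lndn = subset(df, loan == 'no' & default == 'no')

# Recode the target: 'yes' -> 1, 'no' -> 0
lydy$y = ifelse(lydy$y == 'yes', 1,0)
lydn$y = ifelse(lydn$y == 'yes', 1,0)
lndy$y = ifelse(lndy$y == 'yes', 1,0)
lndn$y = ifelse(lndn$y == 'yes', 1,0)

# One logistic regression of y on duration per subset
lydy.mod = glm(y ~ duration, data = lydy, family = 'binomial')
lydn.mod = glm(y ~ duration, data = lydn, family = 'binomial')
lndy.mod = glm(y ~ duration, data = lndy, family = 'binomial')
lndn.mod = glm(y ~ duration, data = lndn, family = 'binomial')

# Q1 - Part (b): actual data (points) and fitted probabilities (curve) per subset
par(mfrow = c(2, 2))
plot(y ~ duration, data = lydy, main = 'loan = yes, default = yes')
curve(predict(lydy.mod, data.frame(duration = x), type = 'response'), add = TRUE)
plot(y ~ duration, data = lydn, main = 'loan = yes, default = no')
curve(predict(lydn.mod, data.frame(duration = x), type = 'response'), add = TRUE)
plot(y ~ duration, data = lndy, main = 'loan = no, default = yes')
curve(predict(lndy.mod, data.frame(duration = x), type = 'response'), add = TRUE)
plot(y ~ duration, data = lndn, main = 'loan = no, default = no')
curve(predict(lndn.mod, data.frame(duration = x), type = 'response'), add = TRUE)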

# Q2 

# write a function called mysummary


mysummary <- function(somedf){
  # find top 5 cereals based on rating
  newdf <- somedf[order(somedf$rating), ]
  ss <- tail(newdf, 5)
  # calculate means
  cal <- mean(ss$calories)
  pro <- mean(ss$protein)
  fat <- mean(ss$fat)
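  # return the three averages as a vector
  return(c(cal, pro, fat))
}

dfq2 <- read.csv("/Users/Phantom/Desktop/sample final/cereal.csv")

# call the function and format the output
x <- mysummary(dfq2)
s = sprintf("Average calories is %0.2f, average protein is %0.2f, and average fat is %0.2f", 
            x[1], x[2], x[3])
cat(s, "\n")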

# Q3