Stat 444 Winter 2019

Final Project
March 18, 2019

Deadlines

You have

Until 12 pm (noon) on April 14th: to upload your prediction files to Kaggle. The links to the Kaggle submission sites will be provided separately.

Until 12 pm (noon) on April 15th: to upload your final project files to the Learn Dropbox folders. The report, including the appendix, should be in PDF format and submitted to the Report folder. The Rmd file that generates the appendix should go to the Appendix folder. The Rmd files for smoothing, random forest, boosting and optional methods should go to the Smoothing, RandomForest, Boosting and Optional folders, respectively. The file naming convention should be "LastName FirstName xxx.yyy". "LastName" and "FirstName" are the last and first names of the team representative. "xxx" represents the purpose, for example, report, appendix, smoothing, etc. "yyy" is the file suffix, which can be pdf, Rmd, or zip if multiple files are included.

Only team representatives should upload the files.

If you make more than one submission before the deadline, the most recent one will be considered. No (re)submissions, under any circumstances, will be accepted after the deadline.

Data

The dataset available on Learn is a cleaned subset of the D.C. Residential Properties Kaggle dataset, where you can find detailed descriptions of the variables. Briefly, it contains information about single-unit, single-building, non-condo residential properties. The response is PRICE.

Grading of the Projects

Your project will be graded on two main aspects: prediction and report. The prediction accuracy is evaluated primarily in three categories: smoothing method, random forest, and boosting. An optional category is for any method that does not belong to the above three categories. The prediction accuracy in each category will be compared with that of other teams, and its grade is primarily determined by its ranking. The report includes exploration of the data, visualization, model building details, interesting findings, handling of missing data, outliers, etc.

Prediction

The data contain a variable fold, whose values are 1, 2, 3, 4, and 5. The prediction accuracy is evaluated by a 5-fold cross-validation with fold assignment specified by fold.

The data also contain an Id variable. To submit the result of a method to Kaggle, you should generate a CSV file which contains two variables: the Id and your predicted price (PRICE) for each observation, where the prediction is computed using only the data that are not in the same fold as the target observation.
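The out-of-fold procedure described above can be sketched as follows. This is a minimal illustration in Python, not the required implementation (your submitted code must be in R Markdown). The helper names (`out_of_fold_predictions`, `fit_geometric_mean`, `predict_constant`, `write_submission`), the toy rows, and the constant placeholder model are all assumptions made for the example; only the column names Id, fold and PRICE come from the dataset description.

```python
import csv
import io
import math
from statistics import mean


def out_of_fold_predictions(rows, fit, predict):
    """Return {Id: prediction}; each row is predicted by a model
    fitted only on rows whose fold differs from that row's fold."""
    preds = {}
    for k in sorted({r["fold"] for r in rows}):
        train = [r for r in rows if r["fold"] != k]
        held_out = [r for r in rows if r["fold"] == k]
        model = fit(train)  # never sees fold k
        for r in held_out:
            preds[r["Id"]] = predict(model, r)
    return preds


# Placeholder model (assumption): predict the geometric mean of the
# training prices. Substitute your smoother, random forest, or booster.
def fit_geometric_mean(train):
    return math.exp(mean(math.log(r["PRICE"]) for r in train))


def predict_constant(model, row):
    return model


def write_submission(preds, handle):
    """Write the two-column file: Id and predicted PRICE."""
    writer = csv.writer(handle)
    writer.writerow(["Id", "PRICE"])
    for id_ in sorted(preds):
        writer.writerow([id_, preds[id_]])


# Toy rows for illustration only; these are not values from the dataset.
rows = [
    {"Id": 1, "fold": 1, "PRICE": 100.0},
    {"Id": 2, "fold": 2, "PRICE": 200.0},
    {"Id": 3, "fold": 3, "PRICE": 400.0},
    {"Id": 4, "fold": 4, "PRICE": 800.0},
    {"Id": 5, "fold": 5, "PRICE": 1600.0},
]

preds = out_of_fold_predictions(rows, fit_geometric_mean, predict_constant)

buffer = io.StringIO()  # stand-in for an open CSV file
write_submission(preds, buffer)
```

A quick sanity check against leakage: changing the PRICE of an observation should never change its own prediction, because that observation's fold is excluded when the model is fitted.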

The error metric is RMLSE (Root-Mean-Squared-Logarithmic-Error), which is the Root-Mean-Squared-Error (RMSE) between the (natural) log of the predicted value and the log of the observed sales price. More specifically,

$$\mathrm{RMLSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\mathrm{PRICE}_i) - \log(\mathrm{prediction}_i)\bigr)^2},$$

where n is the number of observations.
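As a check on your own numbers, the metric can be computed directly from its definition. The following is a minimal sketch in Python; the function name `rmlse` and its argument names are assumptions made for the example, and the same few lines translate directly to R.

```python
import math


def rmlse(observed, predicted):
    """Root-mean-squared error between natural logs of observed and
    predicted prices. All values must be strictly positive."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log(obs) - math.log(pred)) ** 2
        for obs, pred in zip(observed, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# A perfect prediction gives an error of zero.
perfect = rmlse([100.0, 200.0], [100.0, 200.0])
```

Because the errors are taken on the log scale, predicting twice the true price and predicting half of it are penalized equally: both contribute (log 2)^2 to the sum.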

Your R Markdown file for each method should reproduce the RMLSE shown in Kaggle. Any discrepancy beyond rounding errors will be investigated and likely penalized.