Stat 444 Winter 2019

                         Final Project

                        March 18, 2019


You have

1. Until 12 pm (noon) on April 14th to upload your prediction files to Kaggle. The links to the Kaggle submission sites will be provided separately.

2. Until 12 pm (noon) on April 15th to upload your final project files to the Learn Dropbox folders. The report, including the appendix, should be in PDF format and submitted to the Report folder. The Rmd file that generates the appendix should go to the Appendix folder. The Rmd files for smoothing, random forest, boosting and optional methods should go to the Smoothing, RandomForest, Boosting and Optional folders, respectively. The file naming convention should be “LastName FirstName xxx.yyy”. “LastName” and “FirstName” are the last and first names of the team representative. “xxx” represents the purpose, for example, report, appendix, smoothing, etc. “yyy” is the file suffix, which can be pdf, Rmd, or zip if multiple files are included.

Only team representatives should upload the files.

If you make more than one submission before the deadline, the most recent one will be considered. No (re)submissions, under any circumstances, will be accepted after the deadline.


The dataset available on Learn is a cleaned subset of the D.C. Residential Properties Kaggle dataset, where you can find detailed descriptions of the variables. Briefly, it contains information about single-unit, single-building, non-condo residential properties. The response is PRICE.

Grading of the Projects

Your project will be graded on two main aspects: prediction and report. The prediction accuracy is evaluated primarily in three categories: smoothing method, random forest, and boosting. An optional category is for any method that does not belong to the above three categories. The prediction accuracy in each category will be compared with that of other teams, and its grade is primarily determined by its ranking. The report includes exploration of the data, visualization, model building details, interesting findings, handling of missing data, outliers, etc.


The data contain a variable fold, whose values are 1, 2, 3, 4, and 5. The prediction accuracy is evaluated by a 5-fold cross-validation with fold assignment specified by fold.

The data also contain an Id variable. To submit the result of a method to Kaggle, you should generate a csv file that contains two variables: the Id and your predicted price (PRICE) for each observation, where the prediction is computed using only the data that are not in the same fold as the target observation.
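The out-of-fold prediction scheme described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame is simulated, the predictor `x`, the linear model, and the file name `submission.csv` are hypothetical placeholders, and only the column names Id, PRICE and fold come from this handout.

```r
set.seed(1)

# Simulated stand-in for the project data. Only Id, PRICE and fold are
# named in the handout; x is a hypothetical predictor.
n <- 100
dat <- data.frame(Id = seq_len(n),
                  fold = rep(1:5, length.out = n),
                  x = runif(n, 1, 10))
dat$PRICE <- exp(10 + 0.2 * dat$x + rnorm(n, sd = 0.1))

# For each fold, fit on the other four folds and predict the held-out fold.
pred <- numeric(n)
for (k in sort(unique(dat$fold))) {
  test <- dat$fold == k
  fit <- lm(log(PRICE) ~ x, data = dat[!test, ])   # placeholder model
  pred[test] <- exp(predict(fit, newdata = dat[test, ]))
}

# Two-column file: Id and the predicted PRICE.
submission <- data.frame(Id = dat$Id, PRICE = pred)
write.csv(submission, file.path(tempdir(), "submission.csv"), row.names = FALSE)
```

Replacing the `lm()` call with a smoothing, random forest, or boosting fit leaves the fold loop unchanged, so every prediction is still made without using the target observation's own fold.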

The error metric is RMLSE (Root-Mean-Squared-Logarithmic-Error), which is the Root-Mean-Squared-Error (RMSE) between the (natural) log of the predicted value and the log of the observed sales price. More specifically, $\mathrm{RMLSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\mathrm{PRICE}_i) - \log(\mathrm{prediction}_i)\right)^2}$, where $n$ is the number of observations.
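As a minimal sketch, the metric can be computed in base R as shown here. The function name `rmlse` and the toy prices are illustrative assumptions, not part of the project materials.

```r
# Root-mean-squared logarithmic error between observed and predicted prices.
# Both vectors must be strictly positive because of the natural log.
rmlse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted),
            all(observed > 0), all(predicted > 0))
  sqrt(mean((log(observed) - log(predicted))^2))
}

# Toy check with made-up prices: perfect predictions give an error of 0.
rmlse(c(100, 200), c(100, 200))
```

Because the error is measured on the log scale, a prediction that is off by a given ratio is penalized equally for cheap and expensive properties.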

Your R Markdown file for each method should reproduce the RMLSE shown in Kaggle. Any discrepancy beyond rounding errors will be investigated and likely penalized.

Written Report

The page limit is 30 pages, excluding the title page and the Appendix. You should present your final models in the main report, as well as an outline of your model building or tuning parameter selection strategy. In principle, the main report should not contain any R code or R output, except graphs. The model building process, i.e. the technical details of how you arrive at the final models, should be included in the Appendix. You should make references to Appendix sections or pages in the main report.

Your report should include, but is not limited to, the following sections and information:

 • Executive summary: On a separate page. In the first paragraph, provide your findings in the context of the project. For example, highlight the implications/interpretations that you have come to during your modeling/prediction process in the context of the project. You should avoid technical (mathematical/statistical) language as much as possible in this part of the project. This is the section in reports which is usually read carefully by managers, who may not have any statistical background. In the second paragraph, highlight your achievements, such as the ranking achieved or additional methods attempted that have good results. Include a table of the prediction error and ranking for each method.

 • Introduction: (Briefly) describe the objective and organization of your report.

 • Data: Here you can perform some descriptive analysis on the data. For example, you can provide relevant tables or graphs.

 • Preprocessing (optional): This is the place to discuss missing data, outliers, and/or any problems existing in the data. If you perform any feature reduction and/or transformation and/or imputation, detail your solution/steps here. Some of these aspects, for example outlier detection and handling, can be postponed until later modeling steps.

 • Smoothing methods: Use smoothing methods (e.g. spline or local regression) for prediction. Describe your model building strategy and your final model.

 • Random Forests: Use random forest models on the data, describe the tuning parameter selection process and the parameters selected, and report the importance of variables.

 • Boosting: Use boosting methods on the data, describe the tuning parameter selection process and the parameters selected, and report the importance of variables.

 • Additional methods (optional): Describe any other method you have tried.

 • Statistical Conclusions: Compare your smoothing, random forests and boosting models or any other method you have attempted, and pick the best candidate with respect to some criteria (prediction error, ease of use, computation time, etc.). This is where you provide your statistical conclusion on the models.

 • Future work: Any aspect of the project that you wish could have been done better. Any weakness of the methods you tried that you would like to see improved. Any other statistical/machine learning methods you wish to learn/try in the future.

 • Contribution: For teams with more than one student, include a table of the tasks and the contribution percentage from each team member. The tasks can include, but are not limited to: data cleaning, methods (smoothing, random forests, etc.), and report writing. Each member should contribute at least 50% to one of the statistical methods. That is, the division of labor cannot be one member performing all statistical methods and model fitting and the other member doing everything else.


Use R Markdown to generate the appendix and attach the generated pdf after the main report. The appendix is not counted in the page limit. It should contain at least:

 • Modeling details: It is unlikely that all the models you have fitted are useful/interesting to report. Some models are only of intermediate value, and some models may need to be checked but turn out to fare worse than the current one. You can detail your model building process and tuning parameter selection here to justify the models you presented in the main text. Please cross-reference these details in the report.

Additional Information

 • Cleaning data with Python: Some of you have asked whether you can use Python. It is not recommended, but you can use Python to clean the data before analysis if you prefer. In this case, you need to embed Python in R, i.e. call Python from R. Also include comments so that your cleaning steps can be understood.

 • LaTeX: R Markdown will create a LaTeX file in addition to the pdf file if you set “keep_tex: yes” under the “output: pdf_document” section. You can modify the LaTeX file for better control of the layout and margins of the document if you prefer.
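For the LaTeX option above, a minimal sketch of an R Markdown header might look like the fragment below. The title is a hypothetical placeholder; `keep_tex` is the `pdf_document` option that retains the intermediate .tex file next to the generated pdf.

```yaml
---
title: "Appendix"        # hypothetical title
output:
  pdf_document:
    keep_tex: yes        # keep the intermediate .tex file
---
```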