Stat 300W Final Project. Due on the last day of class, April 8th at 7pm, printed and placed in the Dropbox.

Write an 800-1400 word essay, in LaTeX, on one of the following controversial topics in stats.

Write either a persuasive essay arguing for one side over the other, or a compare-and-contrast essay looking at the differences and similarities between the two sides.

Essay questions could include:

Take a side, or compare Bayesian vs Frequentist methods.

Take a side, or compare parametric vs non-parametric methods.

Take a side, or compare simulations vs real data.

How important is parsimony vs accuracy?

How valuable is null hypothesis testing, and what does it mask?

Should p-values be the gold standard?

How feasible are causality studies?

Is multiple testing a valid remedy?

Is imputation a valid remedy?

Is the flexibility of ultra-complex methods like neural networks worth the fact that they can’t be explained or described reasonably?

Why do we use the mean as the basis for everything instead of the median?

Discuss publication bias and how checks for asymmetry, like funnel plots, can expose it.

Discuss the pros and cons of bar graphs and pie graphs.

Writing Papers – Template for a technical report

A completed technical report might look like this:

- Executive Summary
- Introduction / Problem statement
- Methods
- Results
- Conclusion / Discussion

Executive Summary:

This is the LAST thing that you should write. This would be the tl;dr of the technical report. “tl;dr” stands for “too long; didn’t read”. More formally, this is called the executive summary, which means ‘if this report were given to a major decision maker who already has tons of things they need to know, what would you like them to know from the report that can be said in 100 words or fewer?’

Here you should write the research question as briefly as you can, one main result, and the name of the main method used. Nothing from the discussion / conclusion section is needed here.

Introduction:

The introduction typically follows a set formula.

Describe the research problem or state the research questions that were posed. If you can, tell why this research problem is important. The explanation of importance doesn’t have to be too specific to the research problem. If you are working with data about a medical problem, mention that many people suffer from this medical problem; in a research paper, this is a good opportunity to cite a well-known related paper that has found the scope of the problem for you.

If you don’t know why a problem is important and a quick literature search won’t tell you, leave the problem’s importance to a co-author whose expertise is more suited to this part. It’s much better to admit you don’t know something than to say something wrong.

Describe each section in the paper or report very briefly. (e.g. “In the methods section, we describe the data cleaning and the regression tree method that we used. In the results section, we describe the goal scoring rate of different hockey players. In the discussion section, we follow up with a comparison of this method to an older, more traditional one.”)

Methods

The methods: What did you do to get these results?

If this were a field science, you would list the days and describe the conditions under which you went out into the field and gathered information (e.g. ‘we collected our samples on sunny days in the North Okanagan valley between June 10th and September 20th, 2015’). In a data science, you would instead describe the dataset that you used, its format and size, and key variables and features (e.g. ‘We gathered the data from NHL.com’s event-tracking database using the nhlscrapr package along with our own patch. The data we collected included each goal, shot, hit, penalty, and faceoff recorded in each regular season game from October 2012 to April 2017’).

This is where the bulk of your writing should be. About 50% of your report will be the methods section. You don’t need to explain the entire data cleaning process, but you should mention where the data came from, and the tools / software that were used. It’s also good practice to mention when the data was taken (especially in the case of news reports which may be updated, altered, or archived such that scraping may produce different results later).

If there were any judgement calls in your data cleaning process, such as…

- what was done about extreme and influential cases,
- how problematic variables were used,
- how tuning parameters for complex methods were selected, and
- how missing values were either filled in or explained away,

…these should be included as well.

In short, you don’t have to give everything away, but an expert with the same software and data access should be able to recreate what you did.

A methods section serves two purposes:

The first purpose is to give legitimacy to your results. If you show results without explaining how you got them, a reader might assume that the results were invented or made up. With a methods section, the reader should be able to see a logical path between the data and the results.

The second purpose is to aid future readers in using your discovery. After all, publication is about making something public, and that includes giving access to the entire discovery process if you can.

After the data preparation is explained, describe the model you selected or the process you used to select the model. If you just did linear regression, say that. If you used a random forest, or the LASSO, or stepwise regression, say that instead.

Normally, you only need to include the final method that you decided upon. However, there is a good chance that the method you used wasn’t the only method that you tried. In a research paper, you wouldn’t necessarily mention these ‘dead ends’ because paper length is limited by the journal. In a technical report (or a thesis) these other approaches are useful to help you justify your choice and to show that alternatives were considered. You can explain why these rejected methods didn’t work or what about the results they produced was bad. Don’t overdo these dead-end explanations. The reader is typically much more interested in what you did and what worked than in what didn’t work.

Example: “After an exploratory analysis, we tried to classify events using random forests, dimension reduction, and neural nets. We decided to further pursue neural nets because they produced models with much lower out-of-sample errors than the other approaches.”

Results

It’s easiest to write the results first, even though they don’t appear first. Make any tables or figures you want to show as soon as the analysis work is done. Talk about your results a little. Explain the importance of any tables and figures; why are they there?

Mention the general trend (e.g. ‘there is a negative, non-linear trend between playing time per game and shots against goal’), and any notable observations (‘however, the New Jersey Devils break this trend’).

You don’t need to write much here. The charts should explain themselves.

Discussion / Conclusion:

In a technical report, this is where you take the results and give them meaning in the context of the research questions that were in the introduction. You can also quickly summarize what you did.

In a journal paper or a thesis, this section might also include future research questions that could be answered with more data or by a different analysis. A technical report should be more self-contained, and allusions to further work are not required.

In every case, no new information about the project should be introduced in the conclusion. If you have an interesting finding, it should be in the results. If that interesting finding doesn’t fit with the rest of the results, a new subsection for it can always be made, but keep it out of the discussion section.

Remember, when giving context to the results, don’t reach beyond your expertise. If the data is genetic, and you are not a geneticist or biologist, do not make conclusions about the importance of a gene. Often statistical publications are co-authored with subject experts; let those experts write about their topics and stick to the data analysis.

Notes: The Skeleton Method

Papers are big and intimidating to write, and imagining everything involved in writing one is pretty much impossible, at least for modern papers. Instead, it’s much easier to think about and write small parts of a paper at a time, and then do any necessary synthesis at the end.

That’s where the skeleton method comes in.

First, write the skeleton of your paper. That is, list the sections you want to include, and a general question for each section:

Example: A Simulator for Twenty20 Cricket (Davis, Perera, and Swartz)

- Introduction

What is the problem at hand and the context and motivation for solving it?

- Preliminaries

Describe the data you have and its features.

- Parameter Estimation

How do we estimate how good each player is?

- Extending the Simulator

What else could we do with this simulator, but for the sake of simplicity, we opted not to?

- Adequacy of the Simulator

How do we know the simulator is describing the game that we say it is?

- Discussion

What else could we do with this simulator? What does it imply?

Next, write a few short sentences that could answer these questions.

- Introduction

What is the problem at hand and the context and motivation for solving?

– Cricket is a growing worldwide game.

– But it’s underanalyzed compared to baseball.

– Let’s apply baseball style analytics to cricket.

- Preliminaries

Describe the data you have and its features?

– Data comes from ESPN CricInfo.

– We have the outcome and commentary of every ball thrown.

– This data comes as text and we format it.

- Parameter Estimation

How do we estimate how good each player is?

– We break down each player by the distribution of their outcomes.

– We make adjustments for the game situation in which each outcome happened.

– Game-situation adjustments are calculated in an appendix.

- Extending the Simulator

What else could we do with this simulator, but for the sake of simplicity, we opted not to?

– Consider the second inning.

– Account for home team advantage.

– Account for multiple leagues.

- Adequacy of the Simulator

How do we know the simulator is describing the game that we say it is?

– Does it produce a distribution of scores similar to that of actual games?

– Does it reach those scores in the same fashion as actual games do?

– Does it make reasonable predictions or estimates of well known players?

- Discussion

What else could we do with this simulator? What does it imply?

– We did what we said we would in the introduction.

– We could do things like rate players.

– We could (did) gamble using this.

Step 3: Develop each meaty bit on its own.

Finally, answer each of the sub-part questions with anything from one paragraph to two pages.

Take each part you added in step 2 and develop it into a rough draft.

START WITH THE EASIEST PARTS TO GET MOMENTUM.

You may find that some parts are larger than expected, and that others are unnecessary.

- Preliminaries

Describe the data you have and its features.

- Data comes from ESPN CricInfo.

– (How did you find the pages?) With some RCurl code

– (How did you know they were the right pages?) text matching (via regex) to detect

– (What kinds of pages did you want?) Commentaries and summaries from ODI and T20I

– (How many? When?) About 1000 games, from 2009-2015, gathered 2015-2016

– (What variables did you get from these pages?) Teams involved, match number, players

– (Any challenges?) Automatic navigation is challenging and requires some supervision.

– (Any other challenges?) Matching the summary data to the game commentaries

– (Any others?) A lot of players are referred to by nicknames, or have the same last name

– (Any notable features?) The commentaries are always written in the same format.

Example skeleton: Bootstrapping

- Introduction

- What is bootstrapping?

- How is it used?

- Why is it used?

– Sometimes samples are too small to get a good estimate of something.

– Examples involving small degrees of freedom, saturation

- Preliminaries

- When was it developed?

- Where is it popular these days?

- Main Topic 1 – Basic Idea

- Mathematical description

– Resampling (With vs. without replacement)

– Assumptions. (Actually very few)…

– Monte Carlo approximation

– Combinatorial exact description.

- Main Topic 2 – Extension or related idea

- Jackknife

- Non-Parametric???

- Discussion

What does the future hold for this method?

What are some limitations?

Summary of the ‘take-home message’
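
The resampling idea under Main Topic 1 can be sketched in a few lines. This is a minimal Python illustration of a percentile bootstrap, not something taken from a particular paper: the function name `bootstrap_ci`, the sample values, and the choice of the median as the statistic are all assumptions made for the example. Drawing a fixed number of random resamples is the Monte Carlo approximation; the exact version would enumerate every possible resample.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Resample n values WITH replacement from the original sample.
        resample = rng.choices(sample, k=n)
        estimates.append(stat(resample))
    estimates.sort()
    # Read the interval straight off the sorted bootstrap estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up sample, too small for a comfortable formula-based interval.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0]
low, high = bootstrap_ci(sample)
print(f"sample median = {statistics.median(sample)}, 95% bootstrap CI = ({low}, {high})")
```

Swapping `stat` for another function (a mean, a trimmed mean, a ratio) changes nothing else in the procedure, which is a large part of the method's appeal.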

Example Skeleton: Bias in social media

- Introduction

– What is social media sampling?

– Why is it a problem? (Bias)

– Why is bias a problem?

- Preliminaries

– Difference between “convenience sampling” and social media sampling.

– Convenience is most convenient to researcher.

– Social media has a random element to it, but not one we can usually account for mathematically.

– Network sampling / recruitment sampling.

– Boaty McBoatface, a case study

- Main Idea 1 – Social Media Algorithms

– How do these affect surveys? Create bias?

– The question of representativeness.

– Paid targeted advertising. How does it affect the surveys? How is it affected by them?

- Main Idea 2 – Implications, Bubble worlds
- Discussion

Example Skeleton: ‘Parametric vs Non-Parametric’

What makes a test parametric vs nonparametric? (The assumption of a distribution)

The argument that parametric tests are more powerful. (That is, when the assumed distribution holds, they will tend to produce smaller p-values than an equivalent non-parametric test on the same data and null hypothesis.)

Some common non-parametrics.

Wilcoxon rank-sum (vs t-test)

Spearman correlation (vs Pearson)

Fisher’s exact test (vs chi-squared)

Permutation test

Moran’s I for autocorrelation

Median test

Diagnostics for whether a distribution is appropriate, like Shapiro-Wilk for normality, or Anderson-Darling.
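
Of the non-parametric methods listed above, the permutation test is the easiest to write from scratch, because it assumes no distribution at all: it only asks how often a random relabelling of the groups gives a difference at least as large as the observed one. The following is a minimal Python sketch; the function name `permutation_test` and the two groups of values are made up for illustration.

```python
import random
import statistics

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(perm_a) - statistics.fmean(perm_b)) >= observed:
            count += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Made-up measurements for two groups.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
group_b = [10.2, 11.8, 10.9, 12.4, 11.1, 10.7]
p_value = permutation_test(group_a, group_b)
print(f"permutation p-value = {p_value:.4f}")
```

An essay on this topic could run the same data through a t-test and compare the two p-values.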

Conclude with a call to action that publications should not rely so heavily on parametric tests.

Example Skeleton: ‘Introduction to MCMC’

What is Monte Carlo? “Run the basic parts of the simulation over and over.”

What is a Markov Chain? “A set of probabilities that depends upon the last outcome of a sequence of values.” (Example: Stock price. The price of a stock tomorrow depends on two things:

1. The stock price today.

2. A random element (a set of probabilities).)

Parallelization, multiple computers working on different copies of the same thing. OR… Multiple complex situations from the same starting point.

(Example: Sports game. Always starts with the same set of players, same score 0-0 and the same field conditions. A set of random events follows).

An advantage of MCMC is that you can get empirical confidence intervals for things that you don’t have a nice formula for.

MCMC can handle very complex systems like network exploration by only knowing a few simple rules.
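
The stock price example can be turned into a short simulation. The sketch below follows the description in this skeleton: a Markov chain (tomorrow's price depends only on today's price plus a random element) is run over and over (the Monte Carlo part), and an empirical interval is read off the results. It is not a Metropolis-Hastings style sampler, which is what "MCMC" most often refers to in Bayesian work. The starting price, daily volatility, and function names are assumptions chosen for illustration.

```python
import random

def simulate_price(start, n_days, rng, daily_sd=0.01):
    """One run of the chain: each day's price depends only on the previous day's."""
    price = start
    for _ in range(n_days):
        price *= 1 + rng.gauss(0, daily_sd)  # the random element
    return price

def empirical_interval(start=100.0, n_days=30, n_runs=5000, level=0.95, seed=1):
    """Run the chain many times from the same start and take empirical quantiles."""
    rng = random.Random(seed)
    finals = sorted(simulate_price(start, n_days, rng) for _ in range(n_runs))
    tail = (1 - level) / 2
    return finals[int(tail * n_runs)], finals[int((1 - tail) * n_runs) - 1]

low, high = empirical_interval()
print(f"95% of simulated 30-day prices fall between {low:.2f} and {high:.2f}")
```

Each run is independent of the others, so the runs could be split across several computers, which is the parallelization point made above.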

Example Skeleton: ‘Parsimony vs. Accuracy’

How important is having a simple model that is easy to understand and use vs. having one that fits the data very well?

It is easy to see why accuracy is good (a better model fit is better); however, there is a trade-off.

You can always do a SATURATED model. (100 response values, use 100 predictor variables), but a saturated model does not do anything for new values. It could fit the first 100 values perfectly, but it will totally fail to fit the 101st value.

Less extreme is the OVERFITTED model, a model that uses more predictors than are meaningful. It will fit the data you have well, but it will work poorly for new information.
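
A saturated model is easy to demonstrate on a small scale. The sketch below is a Python illustration under stated assumptions: it uses 10 observations rather than 100, polynomial terms stand in for the predictor variables, and the noise values are made up. The saturated polynomial passes through every training point, then misses the next point badly, while a plain straight line does not.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_saturated(xs, ys):
    """Polynomial through every point (Lagrange form): as many parameters as observations."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def true_line(x):
    return 2.0 + 0.5 * x  # the relationship that generated the data (assumed)

noise = [0.3, -0.8, 0.5, 1.1, -0.4, -1.2, 0.9, 0.2, -0.7, 0.6]  # made-up noise
xs = [float(i) for i in range(10)]
ys = [true_line(x) + e for x, e in zip(xs, noise)]
x_new = 10.0  # the "11th value", not used in fitting

results = {}
for name, fit in [("line", fit_line), ("saturated", fit_saturated)]:
    model = fit(xs, ys)
    training_sse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys))
    new_error = abs(model(x_new) - true_line(x_new))
    results[name] = (training_sse, new_error)
    print(f"{name:9s}  training SSE = {training_sse:8.3f}   error at new x = {new_error:8.3f}")
```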

What are the advantages / disadvantages of trees over standard regression?

What does an overfitted polynomial model look like?

How do you know if you have overfit?

Cross-validation. (Split your data up into two parts, the ‘training’ set, and the ‘test’ set. Build the model using ONLY the training set, then predict values in the TEST set.)

K-fold Cross validation.

Evaluate models in ways that balance parsimony (simplicity) and accuracy.

Just accuracy: R-squared. Sum of squared error.

Both: Adj. R-squared (R-squared, with a PENALTY). AIC, BIC/SBC (Information Criteria). Cross-validated R-squared.
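
Two of the measures named above, adjusted R-squared and cross-validated R-squared, can be computed by hand for a simple regression. The following is a minimal Python sketch: the simulated data, the function names, and the choice of k = 5 folds are assumptions for illustration. Every point is predicted by a model that never saw it, so the cross-validated figure cannot exceed the in-sample one.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def r_squared(ys, predictions):
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

def adjusted_r_squared(r2, n, p):
    """R-squared with a penalty for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def k_fold_r_squared(xs, ys, k=5, seed=0):
    """Predict each point from a model fitted WITHOUT that point's fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            predictions[i] = a + b * xs[i]
    return r_squared(ys, predictions)

rng = random.Random(3)
xs = [float(i) for i in range(30)]
ys = [1.0 + 0.8 * x + rng.gauss(0, 3) for x in xs]  # simulated data

a, b = fit_line(xs, ys)
in_sample = r_squared(ys, [a + b * x for x in xs])
adjusted = adjusted_r_squared(in_sample, n=len(xs), p=1)
cross_validated = k_fold_r_squared(xs, ys, k=5)
print(f"R2 = {in_sample:.3f}, adjusted R2 = {adjusted:.3f}, 5-fold CV R2 = {cross_validated:.3f}")
```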

Example Skeleton: ‘Regression Trees, Random Forests’

Intro – Comparison to something familiar, linear regression.

Like a linear regression model, a tree takes a set of predictor x values and produces a single response y.

Unlike linear regression, a tree can only produce a response from a specific set of values.

What does a tree look like?

What is a regression tree? (example diagrams)

A set of cutoffs of values of x variables, as determined by an algorithm aiming to optimise some measure of fit, such as the residual sum of squares.

Not every x variable needs to be used. Some can be used twice.

Works well for continuous, discrete, or even dummy/binary x variables.
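
The smallest possible regression tree has a single cutoff, and building it shows the whole idea. The sketch below is a minimal Python illustration, not a full tree algorithm: it searches one x variable for the cutoff that minimises the squared error when each side is predicted by its own mean. The squared-error criterion, the function name `best_split`, and the data values are assumptions for the example.

```python
from statistics import fmean

def best_split(xs, ys):
    """Return (cutoff, left_mean, right_mean) for the best single split on x."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cutoff can separate identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, cutoff, left_mean, right_mean)
    return best[1:]

# Made-up data with an obvious jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.1, 4.9, 5.3, 5.0, 9.8, 10.1, 10.3, 9.9]
cutoff, left_mean, right_mean = best_split(xs, ys)

def predict(x):
    # The tree can only ever return one of these two values.
    return left_mean if x < cutoff else right_mean

print(f"split at x < {cutoff}: predict {left_mean:.3f}, otherwise {right_mean:.3f}")
```

A full tree repeats this search inside each side of the split, and a random forest averages many such trees built on resampled data.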

What is a classification tree?

What is a random forest?

What are the advantages / disadvantages of trees over standard regression?

A: Robustness to breaks from the usual assumptions of normality, constant variance, and no outliers.

A: Ability to make a model from supersaturated data (when there are more parameters p than observations n)

A: Easy to adapt to classification or logistic scenarios.

D: ‘Black box’, which provides predictions but not inference.

D: Reliance on tuning parameters.

D: Only provides predictions and not clear measures of uncertainty.

Example Skeleton: ‘P-values’

Introduction

Null hypothesis statistical tests (NHSTs) are the most commonly used tool in many fields to determine if something is noteworthy.

NHSTs have the advantage of producing a p-value, which can be interpreted in relatively simple terms, despite there being a great variety in the situations to which such tests can be applied.

However, there is frequent confusion or misinterpretation over the idea of ‘statistical significance’, with authors using it as a proxy for scientifically or clinically meaningful.

Is the convenience worth the risk of overreach?

Definitions

- What is a null hypothesis?
- What is a hypothesis test?
- What does a p-value of x mean?

Points to consider

- The issue of sample size. A very large sample will produce a small p-value for a test even when the effect size is small, and not of practical significance. A small sample may not give a small p-value no matter what happens (within reason).
- P-hacking, or selecting variables in order to find a small p-value by dumb luck.
- The issue of multiple testing.
- The arbitrary nature of 0.05 as a cutoff. (It’s just a historical accident, there is nothing special about 0.05)
- Alternative, related approaches like confidence intervals.
- Pre-registration of studies.
- Publication bias towards statistical significance.
- Alternatives like Magnitude Based Inference as shown at http://sportsci.org/jour/05/ambwgh.htm#_Toc122308049 and at http://bit.csc.lsu.edu/~jianhua/emrah.pdf, specifically at Figure 3.
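
The sample size point above is easy to show by simulation. The following is a minimal Python sketch under stated assumptions: the true effect is a made-up shift of 0.02 standard deviations (far too small to matter in practice), the standard deviation is treated as known, and the test is a two-sided one-sample z-test. With enough data, even this effect gives a tiny p-value.

```python
import math
import random

def z_test_p_value(sample, null_mean=0.0, known_sd=1.0):
    """Two-sided one-sample z-test p-value, with the standard deviation assumed known."""
    n = len(sample)
    z = (sum(sample) / n - null_mean) / (known_sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(7)
true_effect = 0.02  # tiny shift away from the null, in SD units (made-up)

p_values = {}
for n in (50, 2_000, 200_000):
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    p_values[n] = z_test_p_value(sample)
    print(f"n = {n:>7,}   p-value = {p_values[n]:.3g}")
```

The effect size is identical in all three runs; only the sample size changes. This is why a p-value on its own says little about practical significance.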