1 Probabilistic Interpretation of Ridge Regression

Consider a regression problem in which we have data $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. As usual, we can stack all our features in an $n \times p$ matrix $X$ and all our labels in an $n \times 1$ vector $y$. We would like to use ridge regression to make our prediction, i.e. $f(x) = x^\top \beta$, and:

\[
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 \;+\; \lambda \|\beta\|_2^2 \tag{1}
\]

where $\lambda$ is a hyperparameter that controls the amount of regularization. Note that when $\lambda = 0$ we recover the sum-of-squares loss from the ordinary least squares problem. In this problem we are going to show that ridge regression has a probabilistic interpretation in which the penalty corresponds to a prior probability distribution over the feature weights.
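For concreteness, here is a minimal NumPy sketch of evaluating the objective in (1); the synthetic data, the `ridge_loss` helper, and the parameter values are illustrative placeholders, not part of the problem.

```python
import numpy as np

def ridge_loss(beta, X, y, lam):
    """Ridge objective from (1): squared error plus an L2 penalty on beta."""
    residual = y - X @ beta
    return residual @ residual + lam * beta @ beta

# Synthetic example: n = 50 observations, p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta = rng.normal(size=3)

# With lam = 0 the penalty vanishes and we recover the least squares loss.
print(ridge_loss(beta, X, y, lam=0.0))
print(ridge_loss(beta, X, y, lam=1.0))
```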

a) Derive the closed form solution to (1) by setting the gradient of the loss function equal to 0. Now consider the following probabilistic model:

\[
y \mid X, \beta \;\sim\; \mathcal{N}\!\left(X\beta,\; I_{n \times n}\right),
\qquad
\beta \mid \lambda \;\sim\; \mathcal{N}\!\left(0,\; \tfrac{1}{\lambda} I\right) \tag{2}
\]

where $I_{n \times n}$ is the $n \times n$ identity matrix, and we will use $I$ as shorthand for the $p \times p$ identity matrix. Here, $\lambda > 0$ is a fixed and known precision parameter for the prior. Under this model, the full data has likelihood:

\[
\Pr(y \mid X, \beta) \;=\; (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\,\|y - X\beta\|_2^2\right) \tag{3}
\]

b) Using Bayes' rule, show that, up to a proportionality constant, the posterior pdf of $\beta \mid y, X, \lambda$ can be written as:

\[
\Pr(\beta \mid y, X, \lambda) \;\propto\; \Pr(y \mid X, \beta)\,\Pr(\beta \mid \lambda) \tag{4}
\]

Hint: “up to a proportionality constant” also includes “drop all terms that don’t include β,” because those terms do not enter into the estimation of β in practice.

c) Show that, up to a proportionality constant, the pdf in b) can be expressed as:

\[
\Pr(\beta \mid y, X, \lambda) \;\propto\; \exp\!\left(-\tfrac{1}{2}\left(\|y - X\beta\|_2^2 \;+\; \lambda\|\beta\|_2^2\right)\right) \tag{5}
\]

d) The result in part c) looks very similar to part of the pdf of a very popular distribution. What is this distribution? Give its parameters for the case in c). Show that the maximum a posteriori (MAP) estimate (this is just the maximum of the posterior distribution) for β under this distribution is also the solution to the ridge minimization problem in (1), i.e. show that:

\[
\hat{\beta}_{\text{MAP}} \;=\; \arg\max_{\beta} \Pr(\beta \mid y, X, \lambda) \;=\; \hat{\beta}.
\]

You can do this by appealing to known properties of the distribution you have found.

2 Statistical Learning Theory: Sub-Gaussian Bounds

For a random variable $X$ with $\mathbb{E}[X] = 0$, we say that $X$ is sub-Gaussian with variance proxy $\sigma^2$ if its distribution satisfies the following:

\[
\mathbb{E}\!\left[e^{sX}\right] \;\leq\; e^{\sigma^2 s^2 / 2} \qquad \text{for all } s \in \mathbb{R} \tag{6}
\]

Many common families of distributions (for example, Gaussian and bounded random variables) are sub-Gaussian, and assuming that the data's distribution is sub-Gaussian is generally considered a mild assumption, especially in comparison to assuming that the data comes from a specific distribution. While in class we have looked at distribution-free bounds, in this question we derive some useful bounds that come from assuming that the data is sub-Gaussian, a mild yet stronger assumption. In this problem we will show that sub-Gaussian random variables have tails that decay exponentially fast: the probability of observing a value at least $t$ away from 0 decays exponentially in $t$. We will also employ a method known as Chernoff's bounding method (Chernoff 1952), which is widely applied in statistical learning theory to construct many different types of bounds.
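As a quick numerical illustration (not part of the proofs below), here is a sketch comparing Monte Carlo tail probabilities of a Gaussian, which is sub-Gaussian with variance proxy $\sigma^2$, against an exponential bound of the form derived in part b); the sample size and the values of $t$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
samples = rng.normal(0.0, sigma, size=1_000_000)

for t in [0.5, 1.0, 2.0, 3.0]:
    empirical_tail = np.mean(samples >= t)      # Monte Carlo estimate of Pr(X >= t)
    bound = np.exp(-t**2 / (2 * sigma**2))      # sub-Gaussian tail bound
    print(f"t = {t}: empirical {empirical_tail:.5f} <= bound {bound:.5f}")
```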

a) Prove that, for all real numbers $t > 0$ and all $s > 0$, for a random variable $X$ with a sub-Gaussian distribution $D$:

\[
\Pr(X \geq t) \;\leq\; e^{-st \,+\, \sigma^2 s^2 / 2} \tag{7}
\]

Hint: $e^{sX}$ is always positive, so Markov's inequality applies in this case.
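For reference, Markov's inequality (a standard fact, restated here for convenience) says that for any nonnegative random variable $Z$ and any $a > 0$,

\[
\Pr(Z \geq a) \;\leq\; \frac{\mathbb{E}[Z]}{a}.
\]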

b) Starting from the result in part a), prove that:

\[
\Pr(X \geq t) \;\leq\; e^{-t^2 / (2\sigma^2)} \tag{8}
\]

Hint: The result in part a) is true for all $s > 0$, but to get the tightest possible bound we want the $s$ that makes the bound in a) as small as possible. This looks a lot like a minimization problem…
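As a generic calculus reminder (not specific to the bound above): a convex quadratic of the form $g(s) = as^2 - bs$ with $a > 0$ is minimized where its derivative vanishes,

\[
g'(s) \;=\; 2as - b \;=\; 0 \quad\Longrightarrow\quad s^{\ast} \;=\; \frac{b}{2a}.
\]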

c) Let $X_1, \ldots, X_n$ be $n$ independent sub-Gaussian random variables, all with the same variance proxy $\sigma^2$. Using the tools employed in parts a) and b), prove that, for all $t > 0$:

\[
\Pr\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i \;\geq\; t\right) \;\leq\; e^{-n t^2 / (2\sigma^2)} \tag{9}
\]

3 K-Means as an Optimization Problem

Consider the K-Means objective for data $x_1, \ldots, x_n$, cluster centers $\mu_1, \ldots, \mu_K$, and hard assignments $r_{ik} \in \{0, 1\}$ with $\sum_{k} r_{ik} = 1$:

\[
J(r, \mu) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik}\,\|x_i - \mu_k\|_2^2 \tag{10}
\]

c) Construct a toy dataset in which K-Means does not always converge to the globally optimal solution.

Hint: You can do this with only 3 or 4 data points and proper initialization of cluster centers.
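One possible construction along the lines of this hint, sketched with scikit-learn (the specific points and initializations below are illustrative choices, not the only ones that work):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two well-separated horizontal pairs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# "Good" initialization: one center per pair -> global optimum (inertia 1.0).
good = KMeans(n_clusters=2, init=np.array([[0.0, 0.5], [10.0, 0.5]]), n_init=1).fit(X)

# "Bad" initialization: both centers on the left pair. Lloyd's algorithm settles
# on the bottom-points / top-points split (inertia 100.0), a local optimum.
bad = KMeans(n_clusters=2, init=np.array([[0.0, 0.0], [0.0, 1.0]]), n_init=1).fit(X)

print("good inertia:", good.inertia_)
print("bad inertia: ", bad.inertia_)
```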

d) Why can K-Means fail to find the globally optimal solution? Given this result, does the Gaussian Mixture Model (GMM) always find the globally optimal solution? You can use known facts about both algorithms and their loss functions to answer this question.

4 EM Algorithm for Topic Modeling

In this question we will try to design an algorithm for discovering the abstract topics that occur in a set of documents. We have a set $D$ of $M$ documents, each of which is a mixture of latent topics. We also have a dictionary of words, $W$, of size $N$: intuitively, $W$ is the list of all unique words that occur at least once within $D$. Using these two sets, we denote the number of times word $w_j$ occurs in document $d_i$ by $n(w_j, d_i)$. It follows that we can represent a document $d_i$ as a vector of size $N$, each entry of which records how many times word $w_j$ appears in it. The topics $z$ are latent variables in a topic set $Z$ of size $K$. Intuitively, if a document is about a certain topic it will include particular words with higher probability. For example, “music”, “show” and “film” appear frequently in documents about Arts, while “school”, “student” and “teachers” usually occur if the document is about Education. Meanwhile, a document can be a mixture of several topics, e.g., a document can be about arts education for high school students. Inspired by this intuition, a reasonable way to model the generative process is as follows:

\[
\Pr(d_i, z_k, w_j) \;=\; \Pr(d_i)\,\Pr(z_k \mid d_i)\,\Pr(w_j \mid z_k),
\]

which means the topics depend only on the documents and the words depend only on the topics. For computational convenience, let $\Pr(d_i)$ be a uniform distribution, and let $\Pr(z_k \mid d_i)$ and $\Pr(w_j \mid z_k)$ be multinomial distributions, i.e.

\[
\Pr(z_k \mid d_i) \;=\; \alpha_{ik}, \quad \sum_{k=1}^{K} \alpha_{ik} = 1;
\qquad
\Pr(w_j \mid z_k) \;=\; \beta_{kj}, \quad \sum_{j=1}^{N} \beta_{kj} = 1. \tag{11}
\]

We are interested in learning $\alpha_{ik}$ and $\beta_{kj}$ because they are substantively interesting: $\alpha_{ik}$ represents the proportion of topic $k$ that makes up document $i$, and $\beta_{kj}$ the probability of seeing word $j$ under topic $k$. Therefore, a single word $w_j$ in a document $d_i$ is generated in this way: (1) choose a topic $z_k$ from the topic distribution $\Pr(z_k \mid d_i)$, where the probability of choosing topic $k$ is $\alpha_{ik}$; (2) choose a word from that topic's vocabulary distribution $\Pr(w_j \mid z_k)$, where the probability of choosing word $j$ is $\beta_{kj}$. We want to learn these parameters by maximizing the likelihood of the data using the EM algorithm.
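To make the generative story concrete, here is a minimal NumPy sketch of drawing a single (document, topic, word) triple; the sizes and the randomly drawn $\alpha$ and $\beta$ below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, N = 2, 3, 5                           # documents, topics, vocabulary size
alpha = rng.dirichlet(np.ones(K), size=M)   # alpha[i, k] = Pr(z_k | d_i); rows sum to 1
beta = rng.dirichlet(np.ones(N), size=K)    # beta[k, j]  = Pr(w_j | z_k); rows sum to 1

i = rng.integers(M)                         # Pr(d_i) is uniform over documents
k = rng.choice(K, p=alpha[i])               # choose a topic from Pr(z | d_i)
j = rng.choice(N, p=beta[k])                # choose a word from Pr(w | z_k)
print(f"document {i}, topic {k}, word {j}")
```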

a) For our set of $M$ documents and the dictionary $W$ of $N$ words, write down the log-likelihood of the model above, i.e. $\log p(D = d_i, W = w_j \mid \alpha, \beta)$.

Remember that in the EM (Expectation-Maximization) algorithm, we can figure out the parameters and the hidden variables by iteratively (1) computing the distribution of the latent variables using the old parameters and (2) finding new parameters that maximize the likelihood under that latent variable distribution. Now try to:

b) E-step: Derive the distribution of the latent variable, $p(z_k \mid w_j, d_i, \alpha^{\text{old}}, \beta^{\text{old}})$, by fixing the old parameters.

c) M-step: Find the $\alpha^{\text{new}}$ and $\beta^{\text{new}}$ that optimize the log-likelihood, using $p(z_k \mid w_j, d_i, \alpha^{\text{old}}, \beta^{\text{old}})$ as the distribution of $z$.

In practice, you would iterate these two steps until convergence to obtain the final parameters.

Just so you know, the model you have just been working with in this problem is a very famous and very popular model. 🙂

5 What Happens to My Gradient

In this problem you will get the chance to construct a neural network with architecture 784-$H$-$H$-1, where $H$ is the number of hidden units per layer, which you choose. This neural network can be used for binary classification, such as classifying the 0 and 1 digits in MNIST. The model can be represented as

\[
\hat{y} \;=\; \Pr(y = 1 \mid x) \;=\; \sigma\!\left(W_3\,\sigma\!\left(W_2\,\sigma\!\left(W_1 x\right)\right)\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}. \tag{12}
\]

The loss function for this model (or the negative log-likelihood) is the usual one:

\[
L(\theta) \;=\; -\sum_{i=1}^{n} \Big[\, y_i \log \hat{y}_i \;+\; (1 - y_i)\log\!\big(1 - \hat{y}_i\big) \,\Big] \tag{13}
\]

a) Derive the gradients of $L(\theta)$ with respect to the weight matrices, i.e. $\nabla_{W_l} L(\theta)$ for each layer $l$.

b) Use the preprocessed MNIST data provided in the attachment. MNIST is a dataset of images of handwritten digits, together with labels for the number each image corresponds to. We want to train a neural network to learn to recognize these handwritten digits. Use gradient descent with the gradients you derived in part a) to train a neural network on the data. Report the accuracy of the network on the train and test data. Since the sample size is large, we can approximate the true gradient with a stochastic gradient for computational convenience:

\[
\nabla_{\theta} L(\theta) \;\approx\; \frac{1}{|B|}\sum_{i \in B} \nabla_{\theta}\, \ell_i(\theta), \tag{14}
\]

where $B \subset \{1, \ldots, n\}$ is a randomly chosen mini-batch and $\ell_i(\theta)$ is the loss on the $i$-th sample,

which means that at each step, instead of calculating the gradient with all samples in the dataset, we randomly pick a mini-batch of samples and calculate the gradient using only those samples. In implementing the neural network, it would be beneficial to build and train the model in a general way: that is, build an NN with $d$ hidden layers such that we can easily modify $d$.
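Here is a minimal NumPy sketch of one way to organize such a general network with a configurable number of hidden layers; the sigmoid activation, the hidden width, and the $\mathcal{N}(0, 1)$ initialization are assumptions for illustration, and biases, backpropagation, and training are omitted.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow warnings in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def init_weights(input_dim=784, hidden_dim=64, n_hidden=2, rng=None):
    """Weight matrices for an input_dim-H-...-H-1 network with n_hidden hidden layers."""
    rng = rng if rng is not None else np.random.default_rng(0)
    sizes = [input_dim] + [hidden_dim] * n_hidden + [1]
    return [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(weights, X):
    """Forward pass with a sigmoid at every layer; output is Pr(y = 1 | x) per row."""
    a = X
    for W in weights:
        a = sigmoid(a @ W)
    return a.ravel()

# Usage on random data standing in for flattened MNIST images.
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(32, 784))          # a mini-batch of 32 samples
weights = init_weights(n_hidden=2, rng=rng)   # d = 2 hidden layers, as in the problem
print(forward(weights, X_batch).shape)        # -> (32,)
```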

c) In our current model, we have 2 hidden layers. What is the magnitude of $\nabla_{W_1} L(\theta)$? Now try to modify the number of hidden layers $d$ from 1 to 4. What happens to the magnitude of $\nabla_{W_1} L(\theta)$ when we increase $d$? In this question, you don't need to train the whole neural network; just initialize it with random weights, e.g. $\mathcal{N}(0, 1)$, and calculate the gradient.

d) Suppose that all weights are finite and that the weight matrices $W_2, W_3, \ldots, W_d$ are diagonalizable, with the absolute values of their eigenvalues upper-bounded by $4 - s$ for some $s > 0$. Prove that

\[
\big\| \nabla_{W_1} L(\theta) \big\| \;\longrightarrow\; 0 \qquad \text{as } d \to \infty. \tag{15}
\]