Assignment 1 – CSC3060 “AIDA”
30% of the module mark.
Deadline: 5pm Friday, 16th November 2018.
This version: 2018-10-04
Introduction
In this assignment, you will:
(a) Create a dataset of handwritten symbols (which you will use for your analyses and experiments in the rest of Assignment 1, and in Assignment 2)
(b) Calculate features (i.e. variables) from the handwritten symbols which may be useful for distinguishing between the different symbols automatically
(c) Perform statistical analysis of the datasets, using methods of statistical inference.
When you use a procedure that has an element of randomness, please use the seed value 3060 (your code should give the same results each time it runs).
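As a minimal sketch of what a reproducible random step could look like in Python (one of the languages permitted for Sections 1 and 2), the seed can be fixed in a dedicated generator. The helper name `pick_indices` is hypothetical and used only for illustration.

```python
import random

SEED = 3060  # seed value specified in the assignment


def pick_indices(n_items, k, seed=SEED):
    """Pick k distinct indices out of n_items, reproducibly."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    return rng.sample(range(n_items), k)


# Two separate calls give the same selection because the seed is fixed.
print(pick_indices(168, 5) == pick_indices(168, 5))  # prints True
```

Using a local `random.Random` instance rather than the module-level functions means that other code drawing random numbers cannot change the sequence.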
Sections 1 and 2 of this Assignment can be completed in one of the following programming languages: Python, R, Java. Section 3 must be completed in R.
Please read carefully the information about the assessment criteria and marking process at the end of this document.
Section 1 (8%): Creating a dataset
This section asks you to build a dataset of images composed of written numbers, letters and mathematical symbols. Each image is represented by a black & white matrix with size 20 rows by 20 columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As such, one image can be stored in a plaintext “.csv” file containing the matrix (and no headers), as in these examples:


The goal is to create a dataset containing eight handwritten images of each of the digits {1,2,3,4,5,6,7}, eight handwritten images of each of the letters {a,b,c,d,e,f,g}, and eight handwritten images of each of the mathematical symbols {<, >, =, ≤, ≥, ≠, ≈}. We will refer to these as the digit, letter and math datasets, respectively. Each image should be obtained by writing a hand-written symbol yourself (preferably with a touch screen, using the lab computers, although it is fine if you create them using the computer mouse). The quality of the drawing is not essential, as long as the symbol can easily be read by a human. The image will vary from sample to sample; however, each character should fit reasonably well in the 20×20 box (i.e. do not draw a tiny character in one corner of the 20×20 box; this will make your life easier when it comes to doing analyses!).
You may use whatever means you prefer to obtain the images and .csv files. However, a suggestion is to use the software GIMP (http://www.gimp.org). Using GIMP, you can create a new image with 20 by 20 points (pt), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This will give you a small white square, which you can magnify to e.g. 2000% in order to make it easier to draw on. To draw on the image, you can select the pencil tool and adjust the brush size to (e.g.) 1 pixel. The standard file formats of GIMP are useful to save the images, but we need a more easily readable format. One good option is to export as PGM, type ASCII. In this format, each image becomes a text file with a header consisting of the following four lines:
P2
# CREATOR: …
20 20
255
The third and fourth lines of the header specify the pixel array size and the maximum allowed pixel value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing fully white.)¹
The remaining lines of the file specify the pixel values, with one value on each line; the total number of pixel values should correspond to the specified array size (i.e. 20*20=400).
For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in Figure 1 above. You shall save each image matrix as a csv file following the specification above, and using the filename STUDENTNR_LABEL_INDEX.csv, where STUDENTNR is your student number (e.g. 123456), INDEX is a number from 1 to 8, indexing the set of eight images you must create for each symbol, and LABEL is a numeric code that uniquely identifies the type of symbol.
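The conversion described above could be sketched in Python as follows. This is one possible approach, not a required one: the function names are hypothetical, and the parser assumes a well-formed ASCII (P2) PGM file in which comments start with `#`.

```python
import csv

THRESHOLD = 128  # values below this count as black, per the rule above


def read_ascii_pgm(path):
    """Parse an ASCII (P2) PGM file into (width, height, list of pixel values)."""
    tokens = []
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.split("#", 1)[0]  # drop comments such as "# CREATOR: ..."
            tokens.extend(line.split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII PGM (P2) file")
    width, height = int(tokens[1]), int(tokens[2])
    pixels = [int(t) for t in tokens[4:]]  # tokens[3] is the maximum grey value
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match the declared size")
    return width, height, pixels


def to_binary_matrix(width, height, pixels):
    """Map grey values to 1 (black, < 128) or 0 (white, >= 128), row by row."""
    bits = [1 if value < THRESHOLD else 0 for value in pixels]
    return [bits[r * width:(r + 1) * width] for r in range(height)]


def save_matrix_csv(matrix, path):
    """Write the matrix as comma-separated rows with no header."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(matrix)


# Example use (the input filename is illustrative):
# width, height, pixels = read_ascii_pgm("e_8.pgm")
# save_matrix_csv(to_binary_matrix(width, height, pixels), "123456_25_8.csv")
```

Checking the pixel count against the declared size catches a truncated or wrongly exported file before it reaches the analysis.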
We will use the following codes to label the different types of images:
¹ For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format

For example, if your student number is 123456, then 123456_25_8.csv would be the eighth image you created for the letter ‘e’. (As well as creating the csv files, you may also want to keep the PGM files, in case you need to inspect the data later on).
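The naming convention can be captured in two small helpers, sketched here in Python. The function names are hypothetical; only the STUDENTNR_LABEL_INDEX.csv pattern and the 1 to 8 index range come from the specification.

```python
def image_filename(student_nr, label, index):
    """Build STUDENTNR_LABEL_INDEX.csv, e.g. 123456_25_8.csv."""
    if not 1 <= index <= 8:
        raise ValueError("INDEX must be between 1 and 8")
    return f"{student_nr}_{label}_{index}.csv"


def parse_image_filename(name):
    """Split a filename back into (student number, label code, index)."""
    stem = name[:-4] if name.endswith(".csv") else name
    student_nr, label, index = stem.split("_")
    return student_nr, int(label), int(index)


print(image_filename("123456", 25, 8))          # prints 123456_25_8.csv
print(parse_image_filename("123456_25_8.csv"))  # prints ('123456', 25, 8)
```

The student number is kept as a string so that any leading zeros survive the round trip.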
As part of your submission, upload the csv files that you create in a directory called “section1_images”, along with any code you wrote to create the csv files, in a folder called “section1_code” (see submission instructions at the end of this document).
It is important to upload the images in the correct csv format as these files will be used to verify your calculations in the next section.
In your report, briefly explain in your own words how you created the images and obtained the matrices from them.
Section 2 (10%): Feature engineering
Using each 20×20 matrix obtained from an image as described above, you must create an array of characteristics that describe some features of the image. Each feature will be a number (i.e. each feature is a numeric variable). There are 20 features in total. In the feature definitions that follow, a pixel has 8 neighbours, which will be referred to as follows:



Your task in this section is to write code to calculate each of the features above. In calculating pixel neighbours, you can assume that the images are padded on each side with white pixels. Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is your student number. This file will consist of 168 rows, with each row listing the comma-separated feature values for one of your 168 images. The first entry in the row will be the LABEL code, the second will be the image INDEX, and the remaining 20 entries will be the calculated features. For example, the features for your eighth “e” image may be as follows:
25,8,4,28,14,12,1.1667,8,8,1,2,4,11,8,7,12,11,1,2,1,0.11,22
The 8 rows that correspond to the 8 instances of a particular character should be grouped together in the features file, and the order of the 8 rows should correspond to the INDEX used in the image filenames. In other words, the 168 rows of STUDENTNR_features.csv should be sorted first by the label and secondly by the index.
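The white-padding assumption for pixel neighbours can be illustrated with a short Python sketch. It does not compute any specific feature from the list; the function names and the generic "count black neighbours" operation are assumptions chosen only to show how padding removes the need for edge checks.

```python
def pad_with_white(matrix):
    """Surround the matrix with a one-pixel border of white (0) pixels."""
    width = len(matrix[0])
    blank = [0] * (width + 2)
    return [blank[:]] + [[0] + list(row) + [0] for row in matrix] + [blank[:]]


# Offsets (row, column) of the 8 neighbours around a pixel.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def black_neighbours(matrix, row, col):
    """Count black neighbours of pixel (row, col) in the unpadded matrix."""
    padded = pad_with_white(matrix)
    r, c = row + 1, col + 1  # shift by one to account for the border
    return sum(padded[r + dr][c + dc] for dr, dc in NEIGHBOUR_OFFSETS)


tiny = [[1, 1],
        [0, 1]]
print(black_neighbours(tiny, 0, 0))  # prints 2
print(black_neighbours(tiny, 1, 0))  # prints 3
```

For a full 20×20 image it would be cheaper to pad once and reuse the padded matrix for every pixel; the sketch pads inside the function only to stay short.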
If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the feature values instead. (You will lose marks for not calculating the feature, but you can use the random values in the analyses that follow in the subsequent section).
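A placeholder column of this kind could be generated as in the Python sketch below. The function name is hypothetical, and treating both 0 and 10 as possible values is an assumption about what "between 0 and 10" means.

```python
import random


def placeholder_feature_values(n_images=168, seed=3060):
    """Return one random integer in [0, 10] per image, reproducibly."""
    rng = random.Random(seed)  # seed 3060, as required for random procedures
    return [rng.randint(0, 10) for _ in range(n_images)]  # randint includes both ends


values = placeholder_feature_values()
print(len(values))  # prints 168
```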
In your report, briefly describe and explain the code you have written to calculate the features above. If you ran into difficulties, you should still explain your thought processes and attempts to calculate the features. In the case of features 19 and 20, you should explain your rationale for choosing the features you did, as well as how they are calculated (i.e. you should give a justification for why you think these features should be useful).
You should put the file STUDENTNR_features.csv in a folder called section2_features. Put code for this section in a folder called section2_code. Your code should use relative paths; i.e. it should read the image matrices from “../section1_images” and save the feature file to “../section2_features”.
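The overall flow (read the image matrices via a relative path, compute features, sort by label and then index, and write the feature file) might look like the following Python sketch. It assumes the script is run from inside section2_code so that the relative paths resolve. The function names are hypothetical, and a single stand-in feature (a count of black pixels) takes the place of the 20 real ones.

```python
import csv
import os

IMAGE_DIR = os.path.join("..", "section1_images")
FEATURE_DIR = os.path.join("..", "section2_features")


def read_matrix(path):
    """Read a headerless CSV of 0/1 values into a list of integer rows."""
    with open(path, newline="") as handle:
        return [[int(v) for v in row] for row in csv.reader(handle) if row]


def count_black_pixels(matrix):
    """Illustrative stand-in feature: number of black (1) pixels."""
    return sum(sum(row) for row in matrix)


def build_feature_rows(image_dir):
    """Return [label, index, feature...] rows sorted by label, then index."""
    rows = []
    for name in os.listdir(image_dir):
        if not name.endswith(".csv"):
            continue
        _student, label, index = name[:-4].split("_")
        matrix = read_matrix(os.path.join(image_dir, name))
        rows.append([int(label), int(index), count_black_pixels(matrix)])
    rows.sort(key=lambda r: (r[0], r[1]))  # numeric sort, so 10 follows 9
    return rows


def write_feature_file(rows, student_nr, feature_dir):
    """Write the rows to STUDENTNR_features.csv and return the file path."""
    os.makedirs(feature_dir, exist_ok=True)
    path = os.path.join(feature_dir, f"{student_nr}_features.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)
    return path


# Example use (student number is illustrative):
# write_feature_file(build_feature_rows(IMAGE_DIR), "123456", FEATURE_DIR)
```

Converting LABEL and INDEX to integers before sorting avoids the string ordering in which "10" would come before "2".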
Section 3 (12%): Statistical analyses of feature data
In this section, you will perform statistical analyses of the feature data, in order to explore which features are important for distinguishing between different kinds of symbols.
You shall use descriptive statistics (mean, variance, etc.), null hypothesis testing, and confidence intervals to perform your analysis of the data. You are encouraged to provide tables, figures, and/or graphs in the report to support your discussions and findings. When performing tests, always consider whether multiple test correction is needed.
It is your responsibility to define the appropriate assumptions to run the tests, and to choose an appropriate test according to the data characteristics and the question that you are studying. You are not restricted to the hypothesis tests that were discussed in the lectures. Remember to always justify the approach that you choose to employ. You may assume a significance level of 0.05 for the analyses when running hypothesis testing.
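As an illustration of multiple test correction, the Bonferroni adjustment is one common and conservative option; the assignment does not prescribe a particular method. It divides the significance level by the number of tests m. The value m = 20 (one test per feature) is an assumption made only for this example.

```latex
\alpha_{\text{adj}} = \frac{\alpha}{m},
\qquad \text{e.g. } \alpha = 0.05,\; m = 20
\;\Rightarrow\; \alpha_{\text{adj}} = \frac{0.05}{20} = 0.0025
```

Under this adjustment, an individual test would be declared significant only if its p-value falls below 0.0025.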
In particular, in the report you should address each of the following subtasks, using appropriate statistical tests, tables, graphs, etc.
1. Estimate the probability distribution for nr_pix for each of the three symbol groups: letter, digit, and math. Visualise the distributions. Briefly describe the shape of the distributions.
2. Suppose you randomly sample a digit image from the set of digits. What is the probability that the number of pixels in the image is greater than 20?
3. Present summary statistics (e.g. mean and standard deviation) about all the features, for (a) the full set of 168 items, (b) the 56 digits, (c) the 56 letters, (d) the 56 math symbols. Briefly discuss the summary statistics, and whether they already suggest which features may be useful for discriminating digits and letters. For features you feel may be interesting, consider suitable visualisations (e.g. histogram of feature values for the three groups²).
4. Are there pairs of features which are highly associated with each other, and thus provide little extra information with respect to having only one of them in the data? Can you discard some features from your data set without losing much information? Justify your claims.