# UNSW SYDNEY

School of Mathematics and Statistics

MATH 2871

DATA MANAGEMENT FOR STATISTICAL ANALYSIS ASSIGNMENT 1

TERM 1 2020

INSTRUCTIONS:

1. Worth 20 marks (20% of final grade)

2. Number of Questions: 4

3. Due by 4 pm 20 April  – Week 10 -– via Moodle Turnitin ONLY

4. For this assignment, you are allowed to work in groups of 1 or 2 or 3 or 4 or 5 (max).

5. You must write all members of your group with their student numbers on each page at the top right hand corner of the assignment.

6. You must submit only one pdf file via Moodle Turnitin. The University requires each student to submit the assignment in Moodle Turnitin even though this

could be a group assignment. Please make sure you TYPE up your solutions. NO handwritten solutions.  You can use MICROSOFT WORD for example.

7. Your submission must include your SAS programming codes, but all the results must in only in one pdf file. Only a pdf file will be allowed for submission.

8. You need to create a libname orion whenever it is necessary. The data is from your Dataset on Moodle. If it says “create a report”, then provide the

appropriate SAS code only. Only provide the required output when it says to do so.

9. DO NOT ANSWER each question’ parts separately. For each question,

provide your SAS code which answers each part as one code. You will be hardly penalise if you don’t do this.

10. There is a 3 single-sided pages limit. After 3 pages, I stop grading.

Look at the data in the assignment folder called PAYROLL. This data set contains information for individual employees. We need to compute the annual payroll by department. Assume 2000 work hours per year for hourly employees.

a. Sort PAYROLL dataset by dept and stored the sorted data into Work.sorted

b. Create a new data set called work.budget from work.sorted by DEPT. The DEPT groups are ADMN10, ADMN20, ADMN30, CAM10, and CAM20.

c. If WageCat is S, then the variable YEARLY contains the value of WageRate multiplied by 12. If the WageCat is H, then the variable YEARLY contains the value of WageRate multiplied by 2000.

d. If the observation is the first observation for the variable DEPT, initialise PAYROLL to zero.

e. Add the value of YEARLY to the value of PAYROLL.

Print output of the report should look like

A local wireless mobile phone provider tracks monthly usage for all customers with plans that limit their usage by total number of minutes. In the assignment folder, this information is stored in MOBILE. Review the variables in this data set by proc contents. Your task is to identify each month as under the minute limits, at the limit, or over the limit for the customer’s plan. Unfortunately, someone came up the following code but it does not work.

a. Examine the SAS data set including the variable labels and attributes. Identify and correct the bugs in this code so that this program will run correctly.

b. Add comments to your revised program for ach fix so that you can understand the mistake.

c. Create a report that lists ID, number of plan minutes, and average minutes for customers that exceeds their limit every month. It should include the title, Mobile Plan Usage and include the date only. Do NOT print this report.

## Question 3. (worth 4 marks)

a. Suppose you have a variable called GENDER that stores a value of 1 for males and 2 for females. Discuss why creating a user defined format to print a 1 as ‘Male’ and a 2 as a ‘Female’ may be better than creating a new character a new character variable in a DATA step and then printing the character value.

b. Describe the advantage of using the CLASS statement rather the BY statement with PROC MEANS.

Below are raw data for 4 students. Each student did 2 assignments. Each assignment had 5 questions worth 10 marks each for a total of 50 marks for each assignment. For each student there are 3 records.

• Record 1 has the student’s first and last name and ID number.

• Record 2 contains the marks for assignment 1

• Record 3 contains the marks for assignment 2.

George Jensens 314225

a1 7 10 7 10 8

a2 7 7 8 9 10

Ashley Brown 324785

a1 2 1 6 7 9

a2 7 1 8 6 4

Samuel Chen 312705

a1 10 9 9 9 10

a2 8 7 9 7 0

Sandi Chu 302905

a1 10 10 10 10 9

a2 9 9 10 9 9

a) (i) Create a report, marks1a, that contains records with the student’s

first name, last name, ID and the total mark for each of the 2 assignments. (i.e. the 5 variables on each of the 4 records should be name, ID, a1mark, a2mark).

In the DATA step, you must use DATALINES statement only. You need to read first and last names separately.

(ii) Draw histograms for variables a1mark and a2mark respectively (hint: use proc sgplot).

Include your output for this question.

b) Consider now the data file below containing similar information, but

• data are separated by colons and commas.

• some data values are missing. This shows that the students did not attempt

those questions and should awarded 0 for those questions. Your data should be reads a csv file.

Jensens,George,314225 a1:7,10, 7,10,8

a2:7, , 7, 8,

Brown,Ashley,324785 a1:2, 1, , ,10

a2:8, ,8,6,8

Chen,Samuel,312705 a1:10,9,9,9,10

a2:9,8,9,10,8

Chu,Sandi, 302905 a1:10,9,8,8,9

a2:9,8,8,9,10

Create a report, marks1b, that contains records with the student’s first and

last name (together), ID and total marks for each of the two assignments (i.e. the 4 variables on each record should be in order first and last name, ID, a1mark, and a2mark. Include your report and output in your submission. Your data should be read a csv file.

Note: You need to some type ods for question 5.