The populations we want to compare are the exam performances of students in race/ethnicity groups B, C, and D.
The complete dataset can be downloaded from https://www.kaggle.com/spscientist/students-performance-in-exams. It includes scores from three exams and a variety of personal, social, and economic factors that have interaction effects on those scores. There are three quantitative variables and five categorical variables. Group B has 190 records, group C has 329 records, and group D has 262 records.
The qualitative variables we observed were “gender” and “parental level of education”. “gender” records whether the student is male or female. “parental level of education” gives the level of education of the student’s parents.
The quantitative variables we observed were “math score”, “reading score”, and “writing score”, which are the scores on the respective exams.
Sample and collect data
We use simple random sampling without replacement to get the subset of the complete dataset.
Step 1: Add a column of sequential id numbers for each of groups B, C, and D.
Step 2: Use the Sampling tool, which is part of the Data Analysis command, with the id number column as the input range. Then remove any duplicate items, making sure the final sample size is 30.
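The two steps above can also be sketched in Python. This is only an illustration, not the procedure we actually ran: the function name sample_ids and the seed are assumptions, and the group sizes are the ones reported earlier. Because random.sample never repeats an item, no separate duplicate-removal step is needed.

```python
import random

def sample_ids(group_size, sample_size=30, seed=None):
    """Draw a simple random sample of id numbers without replacement."""
    rng = random.Random(seed)           # seed is optional; it only makes the draw repeatable
    ids = range(1, group_size + 1)      # Step 1: id numbers 1, 2, ..., group_size
    return sorted(rng.sample(ids, sample_size))  # Step 2: sample without replacement

# Group sizes as reported in the text; the seed value is an arbitrary assumption.
group_sizes = {"B": 190, "C": 329, "D": 262}
samples = {group: sample_ids(size, seed=42) for group, size in group_sizes.items()}

for group, ids in samples.items():
    print(group, len(ids), ids[:5])
```

The sampled id numbers can then be used to look up the matching rows in each group.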
Frequency Tables and Pie Charts
For each qualitative variable, make a frequency table and a pie chart.
The pie charts above show the gender of the 30 student exam records sampled from race groups B, C, and D. We can see that the number of female students equals the number of male students in the group B sample, while female students are fewer than male students in the group C and group D samples.
2) Parental level of education
The pie charts above show the parental level of education for the 30 records sampled from race groups B, C, and D. We can see that group D has the highest proportion of parents with a bachelor’s degree. The proportion of parents with a high school education is the same in group B and group C.
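A frequency table of the kind described in this section can be sketched in Python as follows. The function name frequency_table is an assumption, and the gender list is a made-up placeholder for one sample of 30 records, not the real sampled data.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, percentage) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]

# Placeholder data for illustration only (not the sampled records).
gender = ["female"] * 15 + ["male"] * 15

for cat, n, pct in frequency_table(gender):
    print(f"{cat:<8}{n:>4}{pct:>7.1f}%")
```

The percentages in each row are the slice sizes that a pie chart of the same variable would display.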
The boxplots above on the left show the math scores of the 30 students sampled from race group B, group C, and group D. The boxplots show that group C has the least variation in math scores, while group D has the largest variation in math scores. Also, we can see that the median math score of group D is higher than those of group B and group C.
The boxplots above in the center show the reading scores of the 30 students sampled from race group B, group C, and group D. The boxplots show that group B has the largest variation. Once again, the median reading score of group D is higher than those of group B and group C.
The boxplots above on the right show the writing scores of the 30 students sampled from race group B, group C, and group D. The boxplots show that group B has the smallest variation. Once again, the median writing score of group D is higher than those of group B and group C.
Recall that group D has the highest proportion of parents with a bachelor’s degree. Students whose parents have a higher level of education may be more likely to score higher on their own exams.
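The quantities a boxplot displays (minimum, quartiles, median, maximum) can be computed directly, which makes the comparisons of median and variation above easier to check. The sketch below is an illustration only: the function name five_number_summary and the scores list are assumptions, and the "inclusive" quartile method is chosen on the assumption that it matches Excel's QUARTILE.INC.

```python
import statistics

def five_number_summary(scores):
    """Return the values a boxplot is drawn from, plus the interquartile range."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
        "iqr": q3 - q1,   # width of the box, one measure of variation
    }

# Made-up scores for illustration only (not the sampled records).
scores = [52, 58, 61, 64, 67, 70, 73, 79, 85]
print(five_number_summary(scores))
```

A group with a larger "iqr" has a wider box, and a group with a larger "median" has a higher center line.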
The table below shows the mean, the standard deviation, and the margin of error of the mean for each quantitative variable and population. The margins of error are calculated using a normal distribution and a 95% confidence level.
The charts below show the means and confidence intervals for each quantitative variable and population. Assuming a normal distribution of sample means, we can be 95% confident that the true mean of a population lies within the confidence interval shown for it.
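The margin of error and confidence interval described above follow the usual normal-distribution formula, margin = z * s / sqrt(n). The sketch below is an illustration rather than our Excel workbook: the function names are assumptions, the mean and standard deviation are those reported for group B reading scores, and n = 30 assumes that each group's sample has 30 records.

```python
import math
from statistics import NormalDist

def margin_of_error(sd, n, confidence=0.95):
    """Margin of error of a sample mean under a normal distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return z * sd / math.sqrt(n)

def confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper) bounds: mean minus and plus the margin of error."""
    margin = margin_of_error(sd, n, confidence)
    return mean - margin, mean + margin

# Group B reading scores: mean and SD from the text; n = 30 is an assumption.
low, high = confidence_interval(63.87, 15.10, 30)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```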
1) Variation of reading scores in group B and group C.
It appears from the boxplots that the variation among the reading scores in group B is less than that in group C. Also, the standard deviation of reading scores in the group B sample is 15.10, while the standard deviation of reading scores in the group C sample is 18.14.
To see whether this is a statistically significant finding at the 5% level of significance, an F-test was performed.
We use the “F-Test Two-Sample for Variances” command in Data Analysis, which gives the following table:
The result shows that the p-value for the test is 0.16, so at the 5% level of significance we do not have enough evidence to reject the null hypothesis that the variances of the reading scores in group B and group C are the same.
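The F-test above can be approximately re-created from the two reported standard deviations. This is a sketch under stated assumptions, not the Excel output: each sample is assumed to have 30 records, the function names are hypothetical, and the one-tail p-value is obtained by integrating the F density with Simpson's rule, which stands in for a library distribution function.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom (assumes d1 > 2)."""
    if x <= 0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = (0.5 * d1 * math.log(d1 / d2)
               + (0.5 * d1 - 1) * math.log(x)
               - 0.5 * (d1 + d2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)

def simpson(func, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

def f_test_one_tail(sd_large, n_large, sd_small, n_small):
    """Return (F, one-tail p-value) with the larger variance in the numerator."""
    f_stat = (sd_large / sd_small) ** 2
    d1, d2 = n_large - 1, n_small - 1
    cdf = simpson(lambda x: f_pdf(x, d1, d2), 0.0, f_stat)
    return f_stat, 1.0 - cdf

# SDs from the text (group C: 18.14, group B: 15.10); n = 30 per group is an assumption.
f_stat, p_value = f_test_one_tail(18.14, 30, 15.10, 30)
print(f"F = {f_stat:.3f}, one-tail p = {p_value:.3f}")
```

If the p-value is above 0.05, the null hypothesis of equal variances is not rejected, which supports using the equal-variances form of the t-test.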
2) Mean of reading scores in group B and group C.
Further, the mean reading scores in group B and group C are 63.87 and 68.33, respectively. To see whether this difference is statistically significant at the 5% level of significance, a two-sample t-test was conducted.
Using the “t-Test: Two-Sample Assuming Equal Variances” command in Data Analysis, we get the following result:
The results above show that the one-tail p-value for the test is 0.15, so again we do not have enough evidence to reject the null hypothesis that there is no difference between the mean reading score in group B and the mean reading score in group C.
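The equal-variances t-test above can also be sketched from the reported means and standard deviations. As before, this is an illustration rather than the Excel output: n = 30 per group is an assumption, the function names are hypothetical, and the one-tail p-value comes from integrating the t density with Simpson's rule in place of a library distribution function.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - 0.5 * (df + 1) * math.log1p(x * x / df))

def simpson(func, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3

def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t-test assuming equal variances; returns (t, df, one-tail p)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / std_err
    # The t density is symmetric, so the tail beyond |t| is 0.5 minus the area from 0 to |t|.
    p_one_tail = 0.5 - simpson(lambda x: t_pdf(x, df), 0.0, abs(t_stat))
    return t_stat, df, p_one_tail

# Reading scores: means and SDs from the text; n = 30 per group is an assumption.
t_stat, df, p_one_tail = pooled_t_test(63.87, 15.10, 30, 68.33, 18.14, 30)
print(f"t = {t_stat:.3f}, df = {df}, one-tail p = {p_one_tail:.3f}")
```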
3) Mean math score of group B and group D
As in 1) and 2), we first test the variances using an F-test, then test the means of the two samples using a t-test.
The test results are listed below:
The results show that, at the 0.05 significance level, we do not have enough evidence to conclude that the mean math score of group B differs from the mean math score of group D.