The University of Nottingham
SCHOOL OF MATHEMATICAL SCIENCES
SPRING SEMESTER 2023-2024
MATH1033 - STATISTICS
Your neat, clearly-legible solutions should be submitted electronically via the MATH1033 Moodle page by
18:00 on Wednesday 8th May 2024. Since this work is assessed, your submission must be entirely your
own work (see the University’s policy on Academic Misconduct). Submissions made more than one week
after the deadline date will receive a mark of zero. Please try to make your submission by the deadline.
General points about the coursework
1. Please use R Markdown to produce your report.
2. An R Markdown template file to get you started is available to download from Moodle. Do make use of
this, besides reading carefully the Hints and Tips section below.
3. Please submit your report as a self-contained html file (i.e. as produced by R Markdown) or pdf.
4. If you have any queries about the coursework, please ask me by email (of course, please limit this to
requests for clarification; don’t ask for any part of the solution or post any of your own).
Your task
The data file scottishData.csv contains a sample of the “Indicator” data that were used to compute the 2020
Scottish Index of Multiple Deprivation (SIMD), a tool used by government bodies to support policy-making. If
you are interested, you can see the SIMD and find out more about it here: https://simd.scot
Once you have downloaded the csv file, and once you’ve set the RStudio working directory to wherever you
put the file, you can load the data with dat <- read.csv("scottishData.csv"). The file contains data for a sample
of 400 “data zones” within Scotland. Data zones are small geographical areas in Scotland, of which there
are 6,976 in total, with each typically containing a population of between 500 and 1000 people. Of the 400
observations within the data file, 100 are from Glasgow City, 100 are from City of Edinburgh, and 200
are from elsewhere in Scotland. Glasgow and Edinburgh are the two largest cities in Scotland by population.
Table 1 shows a description of the different variables within the data set.
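As a quick sanity check after loading, the sample sizes described above can be tabulated. The following is only a minimal sketch: count_by_group is a hypothetical helper (not part of the coursework materials), and the toy rows, including the council-area labels "Fife" and "Highland", are made-up stand-ins so that the code runs without the csv file.

```r
# Hypothetical helper: counts data zones in Glasgow City, City of Edinburgh,
# and everywhere else, using the Council_area column named in the brief.
count_by_group <- function(d) {
  area  <- as.character(d$Council_area)
  group <- ifelse(area %in% c("Glasgow City", "City of Edinburgh"),
                  area, "Elsewhere")
  table(group)
}

# Made-up stand-in rows, used only so the sketch runs without the data file.
toy <- data.frame(
  Council_area = c("Glasgow City", "City of Edinburgh", "Fife", "Highland"),
  stringsAsFactors = FALSE
)
counts <- count_by_group(toy)
print(counts)

# With the real file in the working directory, the same check would be:
if (file.exists("scottishData.csv")) {
  dat <- read.csv("scottishData.csv")
  print(nrow(dat))            # the brief describes 400 observations
  print(count_by_group(dat))  # the brief describes a 100 / 100 / 200 split
}
```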
Your report should have the following section headings: Summary, Introduction, Methods, Results, Conclusions.
For detailed guidance, read carefully section page 4 of the notes, and the “How will the report be marked?”
section below.
The Results section of your report should include subsections per points 1-3 as follows. The bullet points
indicate what should be included within these subsections, along with suitable brief commentary.
1. A comparison of employment rate between Glasgow and Edinburgh.
• A single plot with side-by-side boxplots for the Employment_rate variable for each of
Glasgow and Edinburgh.
• A histogram of the Employment_rate variable with accompanying normal QQ plot, for
each of Glasgow and Edinburgh.
• Sample means and variances of the Employment_rate variable for the data zones in
each of Glasgow and Edinburgh.
• Test of whether there is a difference in variability of Employment_rate scores between
Glasgow and Edinburgh.
• Test of whether there is a difference in means of Employment_rate scores between
Glasgow and Edinburgh.
2. Investigation into how Employment_rate and other variables are associated.
• A matrix of pairwise scatterplots for the following variables: Employment_rate,
Attainment, Attendance, ALCOHOL, and Broadband. Also present pairwise correlation
coefficients between these variables.
• A regression of Employment_rate on Attendance, including a scatterplot showing a line
of best fit.
3. A further investigation into an aspect of your choosing.
• It’s up to you what you choose here. Possibilities include: an analysis similar to 1 above, but
involving the data zones outside Glasgow and Edinburgh; checking whether your findings in 2
above are similar across data zones from Glasgow, Edinburgh and elsewhere; or investigating
the other variables in the data set besides those in 1 and 2.
• Note that some variables will be very strongly correlated, but with fairly obvious/boring
explanation: for example “rate” variables (see Table 1) are just “count” variables
divided by population size, and data zones are designed to have similar population
sizes.
• Think freely and creatively about what is interesting to investigate, especially how you
could make good use of the methods that you are learning in the module.
Please include as an appendix the R code to produce the results in your report, but don’t include
R code or unformatted text/numerical output in the main part of the report itself.
Hints and tips:
1. Use the template .Rmd file provided on Moodle as your starting point.
2. Read carefully “How will the report be marked?” below. Then re-read it once more
just before you submit to make sure you have everything in place.
3. You may find the subset command useful. Some examples:
• glasgow <- subset(dat, Council_area == "Glasgow City") defines a new variable containing
data only for Glasgow.
• subset(dat, (Council_area != "City of Edinburgh" & Council_area != "Glasgow City"))
finds the data zones that are not in either Edinburgh or Glasgow.
4. The command names(dat) will tell you the names of the variables (columns) in dat.
5. dat[, c(16, 17, 18)] will pick out just the 16th, 17th and 18th columns (for example).
6. The pairs() function produces a matrix of pairwise scatterplots. cor() computes pairwise
correlation coefficients.
7. Do make sure that figures have clear titles, axis labels, etc.
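The commands mentioned in hints 3 to 6 can be tried out together on a small example. This is only an illustrative sketch: the data frame toy is made up (random values, not real SIMD data, and with far fewer columns than the real file), so the column positions used here are assumptions that will differ for dat.

```r
# Made-up stand-in data so the sketch runs without scottishData.csv.
# Column names follow the brief; the values are random, not real SIMD data.
set.seed(1)
toy <- data.frame(
  Council_area    = rep(c("Glasgow City", "City of Edinburgh", "Fife"), each = 5),
  Employment_rate = runif(15, 0, 30),
  Attendance      = runif(15, 60, 95),
  Broadband       = runif(15, 0, 10)
)

# Hint 3: keep only the rows for one council area, or exclude two of them.
glasgow   <- subset(toy, Council_area == "Glasgow City")
elsewhere <- subset(toy, Council_area != "City of Edinburgh" &
                         Council_area != "Glasgow City")

# Hint 4: list the column names.
print(names(toy))

# Hint 5: square brackets select columns by position (here the 2nd to 4th).
num_cols <- toy[, c(2, 3, 4)]

# Hint 6: scatterplot matrix and pairwise correlation matrix.
pairs(num_cols, main = "Pairwise scatterplots (toy data)")
cor_mat <- cor(num_cols)
print(round(cor_mat, 2))
```

Note that selecting columns uses square brackets; round brackets after dat would be read by R as a function call and give an error.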
How will the report be marked?
The marking criteria and approximate mark allocation are as follows:
Summary [4 marks] - have you explained (in non-technical language) (a) the aim of the analysis;
(b) (very briefly) the methods you have used; and (c) the key findings?
Introduction [5] - have you (a) explained the context, talked in a bit more detail about the aim;
(b) given some relevant background information; (c) described the available data; (d) explained
why the study is useful/important?
Methods [3] - have you described the statistical techniques you have used (in at least enough
detail that a fellow statistician can understand what you have done)?
Results [14, of which 7 are for the investigation of your choosing mentioned in point 3 above] -
have you presented suitable graphical/numerical summaries, tests and results, and interspersed
these with text giving explanation?
Conclusions [4] - have you (a) recapped your key findings, (b) discussed any limitations, and
(c) suggested possible further extensions of the work?
Presentation [10] - overall, does the report flow nicely, is the writing clear, and is the presentation
tidy (figures/tables well labelled and captioned)? Has Markdown been used well?
Table 1: A description of the different variables. “Standardised ratio” is such that a value of 100
is the Scotland average for a population with the same age and sex profile.