Data Processing & Visualisation Coursework

This coursework is worth 50% of the total marks available for this module. The

penalty for late or non-submission is an award of zero marks. You are

reminded of the need to comply with Cardiff University’s Student Guide to

Academic Integrity. Your work should be submitted using the official

Coursework Submission Cover sheet.

INSTRUCTIONS

Q1. The US annual “County Health Rankings” provide information of how health is influenced by where

people live, learn, work and play. They provide a starting point for change in communities. A data

analytics team attempts to estimate the premature death (i.e., “Years of Potential Life Lost Rate

(YPLLR)” based on number of deaths under age 75) in Florida. Other variables that they believe offer

some insight on the premature death include:

– Teen births, i.e., “Teen Birth Rate (TBR)” (Teen births/females ages 15-19 * 1,000);

– Violent crime, i.e., “Violent Crime Rate (VCR)” (violent crimes/population * 100,000) ;

– Adult smoking, i.e., “Percentage Smokers (PS)” (Percentage of adults that reported currently

smoking).

The text file named “2017Health.txt” (available on Learning Central) contains the data. Shown below is

the form of the data.

State County Years of

Potential Life

Lost Rate

Teen

Birth

Rate

Violent

Crime

Rate

Percentage

Smokers

Florida Alachua 6633 19 579 16

Florida Baker 8270 58 360 19

Florida Bay 9168 50 508 18

Florida Bradford 10346 61 461 18

Florida Brevard 7722 25 518 16

Florida Broward 5737 23 441 15

Florida Calhoun 6415 59 130 19

Florida Charlotte 7353 30 219 14

… … … … … …1) [cell1 – 1 mark] Download the file “CW-your student number.ipynb” from Learning Central, and

upload it to your IPython Notebook. Change the title of the file using your student number.

Write code to read the given data in text format (i.e., “2017Health.txt”) into required tabular data

structure: make the county (i.e., Alachua, Baker…) be the index of the returned data structure;

the first column of the returned data structure represents the “Years of Potential Life Lost

Rate”; the second column represents the “Teen Birth Rate”; the third column represents the

“Violent Crime Rate”; and the last column represents the “Percentage Smokers”.

– Display the returned tabular data structure in your programme.

2) [cell2 – 5 marks] Write code to analyse the data contained in the variable called “Percentage

Smokers”.

– Print the “mean of Percentage Smokers”.

– Print the “minimum of Percentage Smokers”.

– Print the “maximum of Percentage Smokers”.

– Print the “standard deviation of Percentage Smokers”.

– Print the “95% confidence interval of Percentage Smokers”.

3) [cell3 – 9 marks] Write code to plot a bar graph that uses bars to compare the “Percentage

Smokers” of North Florida, Central Florida and South Florida. North Florida: use the measures

of the following counties: Duval, Alachua, Leon, Flagler, Marion; Central Florida: use the

measures of the following counties: Orange, Polk, Hillsborough, Pinellas, Brevard; South

Florida: use the measures of the following counties: Miami-Dade, Broward, Lee, Palm Beach,

Sarasota.

– Visualise a single plot: the horizonal axis shows the data categories being compared (i.e.,

North Florida, Central Florida and South Florida); and the vertical axis represents the mean

measure of Percentage Smokers.

– Add error bars to the bar graph, showing the 95% confidence interval.

– Add appropriate title, horizontal axis label and vertical axis label to the bar graph.

4) [cell4 – 4 marks] Based on the following two predictor variables: “Teen Birth Rate (TBR)” and

“Percentage Smokers (PS)”, write code to build a linear regression model to estimate the

“Years of Potential Life Lost Rate (YPLLR)”.

– Print the resulting linear equation in the programme.

5) [cell5 – 6 marks] Based on the error of prediction (i.e., the absolute error/difference between

the measured “Years of Potential Life Lost Rate” and the predicted “Years of Potential Life Lost

Rate”), compare the following two linear models

Model A: YPLLR = 60.6 × TBR + 5297.06

Model B: YPLLR = 1.36 × VCR + 7254.3

and advise the data analytics team which model should be used. Write code to perform

appropriate statistical data analysis.

– Print the mean absolute error (MAE) of the model A.

– Print the mean absolute error (MAE) of the model B.

– Print the main results of the data analysis processes, including normality test and statistical

significance test.

– Print ONE sentence, stating your conclusion and justification on the difference in

performance between two models, in terms of predicting the “Years of Potential Life Lost Rate”.SUBMISSION INSTRUCTIONS

All submission should be via Learning Central unless agreed in advance with the Director of Teaching. The current

electronic coursework submission policy can be found at:

http://www.cs.cf.ac.uk/currentstudents/ElectronicCourseworkSubmissionPolicy.pdf

Description Type Name

Cover sheet Compulsory One PDF (.pdf) file [student number].pdf

Q1 Compulsory One IPython Notebook file (.ipynb) CW-student number.ipynb

CRITERIA FOR ASSESSMENT

Credit will be awarded against the following criteria.

Your CODE and RESULTS should be contained within an IPython Notebook that analyses and

visualises a given dataset (should be obtained via Learning Central: 17/18-CM2105 Data Processing

and Visualisation). This coursework assesses the intended learning outcomes of 1, 2, 3, 4.

The breakdown of marks (total=25 marks) will be for correctly computing and visualising:

1) [cell1 – 1 mark]: Manipulate data and display the restructured tabular data (see “Sample output 1”

below).

2) [cell2 – 5 marks]: Produce the required results of descriptive statistics.

3) [cell3 – 9 marks]: Create and visualise the required graph.

4) [cell4 – 4 marks]: Conduct and report the required results of regression analysis.

5) [cell5 – 6 marks]: Conduct and report procedures and results of statistical data analysis.

Sample output 1

Feedback on your performance will address each of these criteria.FURTHER DETAILS

Feedback on your coursework will address the above criteria and will be

returned in approximately four weeks. This will be supplemented with oral

feedback in lectures and labs. If you have any questions relating to your

individual solutions talk to the lecturer.

Assignment status: Already Solved By Our Experts

(USA, AUS, UK & CA PhD. Writers)

>> CLICK HERE TO ORDER 100% ORIGINAL PAPERS FROM Academic Writers Bay <<