In this assignment you will simulate data from a domain of your own choosing. It can be cyber-related, healthcare, industrial, financial/credit card fraud, commerce, or anything else. However, run your ideas past me before diving in. Recall from the dplyr tutorials that we simulated small amounts of data across several dataframes, linked the data we required using the join commands (e.g. left_join()), obtained summaries of the data, and used ggplot2 to highlight trends.
Choose your domain carefully and give a rationale for simulating it.
Define your dataframes and generate them using sample_n() and/or other commands. The charlatan package may be useful for generating personal names and other values. About 4-5 dataframes will suffice.
Think about seeding trends and patterns in your simulated data that you can “detect” later.
Use dplyr to extract the columns you need from the dataframes.
Use some form of analysis, such as summaries, to get statistics on your data. Break it down by a categorical variable, e.g. time, gender, or fraudulent vs. normal.
In the write-up, I will expect to see an introduction, a methods section, and then sections for Simulation of data, Transforming data, and Analysis of data. Plots, and the marks for them, belong in the Analysis section.
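The steps above (generate dataframes, seed a pattern, link and select columns, summarise by a category) can be sketched as follows. This is only a minimal illustration under assumed names: the customers and transactions tables, their columns, and the fraud probabilities are hypothetical and not a required design, and your own domain will need more dataframes than this.

```r
library(dplyr)

set.seed(42)  # make the simulation reproducible

# Hypothetical customers table (column names are illustrative only)
customers <- data.frame(
  customer_id = 1:200,
  region      = sample(c("North", "South", "East", "West"), 200, replace = TRUE)
)

# Hypothetical transactions table: draw customers with replacement using
# sample_n(), then attach a random hour of day and a skewed amount
n_tx <- 5000
transactions <- customers %>%
  sample_n(n_tx, replace = TRUE) %>%
  select(customer_id) %>%
  mutate(
    tx_id  = row_number(),
    hour   = sample(0:23, n(), replace = TRUE),
    amount = round(rlnorm(n(), meanlog = 3, sdlog = 0.5), 2)
  )

# Seed a pattern to "detect" later: transactions before 05:00 are assumed
# to be far more likely to be fraudulent (20% vs 1%)
transactions <- transactions %>%
  mutate(
    p_fraud = if_else(hour < 5, 0.20, 0.01),
    fraud   = rbinom(n(), size = 1, prob = p_fraud) == 1
  ) %>%
  select(-p_fraud)

# Link the tables, keep only the columns needed, and summarise by category
fraud_summary <- transactions %>%
  left_join(customers, by = "customer_id") %>%
  select(region, hour, amount, fraud) %>%
  mutate(night = hour < 5) %>%
  group_by(night) %>%
  summarise(
    transactions_n = n(),
    fraud_rate     = mean(fraud),
    mean_amount    = mean(amount),
    .groups        = "drop"
  )

print(fraud_summary)
```

Because the pattern was seeded deliberately, the summary should show a clearly higher fraud rate for the night-time group, which is the kind of "detection" the analysis section can then discuss.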
Part 1: Analysis of the Data (70 marks)
You will need to develop R code to support your analysis; use dplyr where possible to get the numeric answers. With ggplot2, choose the type of plot carefully and think about how you use it: you have many records, and the charts must remain readable. Place the R code in an appendix at the back of the report (it will not count towards the word limit). Divide the code into sections using # comments and include screenshots of the outputs.
Simulation of data (20 marks)
Transforming data (10 marks)
Analysis of data and plots (20 marks)
Write-up of the data analysis (similar in format to my R tutorials) (20 marks)
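On the point about keeping charts readable with many records, one possible approach is to aggregate with dplyr first and plot the summary rather than every row. The sketch below assumes a hypothetical transactions dataframe (tx) with made-up columns and fraud probabilities; it is an illustration, not a prescribed plot type.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # reproducible simulation

# Hypothetical data: 20,000 transactions, with fraud assumed to be more
# likely before 05:00 (20% vs 1%)
n_rows <- 20000
tx <- data.frame(
  hour   = sample(0:23, n_rows, replace = TRUE),
  amount = rlnorm(n_rows, meanlog = 3, sdlog = 0.5)
) %>%
  mutate(fraud = rbinom(n(), size = 1, prob = if_else(hour < 5, 0.20, 0.01)) == 1)

# Aggregate first: 24 rows are far easier to read than 20,000 points
hourly <- tx %>%
  group_by(hour) %>%
  summarise(fraud_rate = mean(fraud), .groups = "drop")

# One bar per hour, with labelled axes and a title
p <- ggplot(hourly, aes(x = hour, y = fraud_rate)) +
  geom_col() +
  labs(
    title = "Simulated fraud rate by hour of day",
    x     = "Hour of day",
    y     = "Proportion of transactions flagged as fraud"
  )

print(p)
```

Plotting the raw rows as a scatter would overplot heavily at this size; summarising first, or using a histogram or transparency, are common ways to keep the trend visible.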
Part 2: Scale-up Report (30 marks)
The second part is a written report. Assuming your Part 1 was an initial study for your organisation, what issues arise when you scale it up and start using it in practice?
Discussion of Cyber security, big data issues, and GDPR issues (20 marks)
Structure of report, neatness, and references; applies to both Part 1 and Part 2 (10 marks)
Penalties: Keep to the word limit of 3,000 (±10%); going beyond this will result in a loss of marks according to the university guidance on penalties.