Pick a dataset. There are recommended ones at the bottom of this page, or you can choose your own from the R datasets page (Links to an external site.) or any other publicly available dataset. If you pick your own dataset, I need to approve it first—just send me a short email telling me what analysis you want to run and why.
Construct a research question you want to ask of that dataset. This should be a clearly defined question about how two or more variables in your dataset interact. "What effect do sex, rank, and discipline have on salary?" is a well-defined research question. "What factors determine salary?" is not.
Pick an analytical method for answering that research question. This could be linear regression or any of the machine learning techniques.
Visualize the data being addressed in the research question above, and make sure that it is appropriate for your analysis. For example, if it is nonlinear data and you want to run a regression analysis, transform it so that it is linear.
Analyze the data with the method you chose. NOTE: If you picked linear regression, you MUST run a multivariate, not a bivariate, regression.
Interpret the results of your analysis.
Consider what weaknesses your analytical methods displayed, as well as the moral and societal implications of the analysis you just ran.
You should prepare a 3–4 page report using RMarkdown. It should include answers to the following questions for each step:
What dataset did you pick? Why did you pick that one? What does one line in the dataset represent? What are the columns of the dataset, and what do they mean?
What is your research question? Why does that relationship interest you? Why did you choose to exclude some variables from that question?
What method did you pick to answer your question and why? What strength does that method have over other ones?
Include at least two plots in your paper. All plots must be fully labeled and use appropriate encodings.
If you are running linear or logistic regression, write out the specific regression equation. If you are using a machine learning method, describe how you will implement that method, including equations if/when appropriate.
For linear regression, include a description of all coefficients and their uncertainty and p-values. For machine learning methods, this should include accuracy metrics for your model. In both cases, the step must include a sentence about specifically how to interpret the results, such as "a unit increase in variable x is associated with a 20-point increase in variable y," or "this method classifies variable x into categories y with 85% accuracy."
Describe the weaknesses of your method analytically and what its implications are from a scientific, moral, and societal perspective.
You should submit a RMD file and a PDF file of your report. The PDF file should hide the code you used to run your analysis unless it is pertinent to the discussion.
Stat2Data TipJoke, ReligionGDP, Grocery
Zelig swiss, free1, approval
plm RiceFarms, Males