This mini project is based on the Kaggle challenge “Diagnosis of COVID-19 and its Clinical Spectrum”.
A. Predict confirmed COVID-19 cases among suspected cases.
Based on the results of laboratory tests commonly collected for a suspected COVID-19 case during a visit to the emergency room, would it be possible to predict the test result for SARS-CoV-2 (positive/negative)?
B. Predict admission to general ward, semi-intensive unit or intensive care unit among confirmed COVID-19 cases.
Based on the results of laboratory tests commonly collected among confirmed COVID-19 cases during a visit to the emergency room, would it be possible to predict which patients will need to be admitted to a general ward, semi-intensive unit or intensive care unit?
1. Clean and prepare the data for regression analysis.
2. Explore the data. Choose your candidates of predictor variables. Explain and justify your choice.
3. Use regression analysis to investigate research question A. Which regression model would you like to use? Multiple linear regression? Logistic regression? Regression trees? Others? Choose a regression model and explain your reasoning. Build, interpret and test your regression model. What insights does your model reveal about the research question?
4. Similarly, investigate research question B.
5. Discuss the limitations of your study, and possibly the limitations of the dataset.
6. [Bonus 20 points] Do you think model selection and regularization methods can be useful in this study? If so, use them to enhance your model and results. If not, explain why.
7. [Bonus 20 points] Since this dataset was published in 2020, the world has learned much more about the novel coronavirus and COVID-19. Are your insights from Tasks 3 and 4 supported by the scientific research on COVID-19? Or do they contradict existing scientific knowledge of COVID-19? If so, what went wrong in the regression analysis?
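
As a rough illustration of how Tasks 1, 3 and 6 might fit together for research question A, the sketch below runs on synthetic data using only the Python standard library. Everything in it is an assumption made for illustration: the column names (`leukocytes`, `noise`, `mostly_missing`, `positive`) are hypothetical and are not the Kaggle column names, the 0.9 missing-value threshold and the penalty `l2=1.0` are arbitrary, and the hand-written gradient descent stands in for whatever statistics library you would normally use. Research question B has more than two outcomes and would need a different model.

```python
import math
import random
import statistics

def drop_sparse_columns(rows, max_missing=0.9):
    """Task 1: keep columns whose share of missing (None) values is <= max_missing."""
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / len(rows) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

def impute_median(rows, cols):
    """Task 1: replace missing values in each listed column with the column median."""
    for c in cols:
        med = statistics.median(r[c] for r in rows if r[c] is not None)
        for r in rows:
            if r[c] is None:
                r[c] = med
    return rows

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Tasks 3 and 6: logistic regression fitted by batch gradient descent.
    l2 > 0 adds a ridge penalty on the weights (the intercept is not penalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(p):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        w = [wj - lr * (gw[j] / n + l2 * wj) for j, wj in enumerate(w)]
    return w, b

# Synthetic stand-in for the lab data (hypothetical columns, already standardized).
rng = random.Random(0)
rows = []
for i in range(200):
    leukocytes = rng.gauss(0, 1)
    # Assumed relationship for the demo only: lower values make "positive" more likely.
    positive = 1 if rng.random() < sigmoid(-1.5 * leukocytes) else 0
    rows.append({
        "leukocytes": None if i % 5 == 0 else leukocytes,   # 20% missing
        "mostly_missing": 1.0 if i % 50 == 0 else None,     # 98% missing
        "noise": rng.gauss(0, 1),                           # unrelated predictor
        "positive": positive,
    })

cleaned = impute_median(drop_sparse_columns(rows), ["leukocytes"])
X = [[r["leukocytes"], r["noise"]] for r in cleaned]
y = [r["positive"] for r in cleaned]

w_plain, b_plain = fit_logistic(X, y)           # Task 3: plain logistic regression
w_ridge, b_ridge = fit_logistic(X, y, l2=1.0)   # Task 6: ridge-penalized variant

print("unpenalized weights:", w_plain)
print("ridge weights:      ", w_ridge)
```

Comparing the two weight vectors shows the effect Task 6 asks about: the penalty shrinks coefficients toward zero, which can help when many sparse lab measurements compete as predictors. On the real data you would also hold out a test set and report a classification metric for Task 3.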