In 2011, Reddit conducted a survey of its users. The survey data is attached as RedditShortDemoSurvey-1-Cleaned.csv.
Your assignment is as follows:
1. Clean the survey data and conduct a full EDA. (25 pts)
Aggregate all Countries to their Continents.
Drop US States
Look for and handle missing values
Create indicator variables for categoricals, and bin where you feel it is appropriate
Clean bad data (e.g. the value movies is present in “Are you a dog or a cat person?”)
Visualize the distributions of cleaned variables
2. Using Pearson's correlation coefficient, determine which variables are most highly collinear, and graph the results. (hint: [login to view URL]~mwaskom/software/seaborn/examples/[login to view URL]) (25 pts)
3. Create a random forest model that predicts Education Level based on the remaining variables. Use a grid search to optimize your model hyperparameters. Compute your model's AUC using 5-fold cross-validation. (25 pts)
4. Create a small PowerPoint presentation that would show a layperson: (25 pts)
What your model predicts
Explains what AUC is, and uses AUC to explain this model's ability to predict the dependent variable.
Explains Type I and Type II errors, and uses a confusion matrix to show your model's likelihood to commit Type I and Type II errors.
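For the presentation in item 4, a small worked example may help make AUC and the two error types concrete. The sketch below uses only the Python standard library and made-up labels and scores. It assumes Education Level has been collapsed to a binary target (1 = "has a college degree"), which the assignment does not specify, and a 0.5 decision threshold, which is also an assumption. The helper names confusion_counts and auc are hypothetical.

```python
from itertools import product


def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Assumed threshold: predict the positive class when the score is at least 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
type_i_rate = fp / (fp + tn)   # share of true negatives flagged as positive
type_ii_rate = fn / (fn + tp)  # share of true positives that were missed
model_auc = auc(y_true, scores)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Type I rate={type_i_rate:.2f}, Type II rate={type_ii_rate:.2f}")
print(f"AUC={model_auc:.4f}")
```

With these toy numbers, one of the four negatives is wrongly flagged (a Type I error) and one of the four positives is missed (a Type II error). The AUC of 0.8125 means the model ranks a randomly chosen positive above a randomly chosen negative about 81% of the time. For the real multi-class Education Level target, a one-vs-rest version of the same idea would be one option.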
I am an experienced developer with expertise in Flask, Express, and Django (Python). I develop web apps and RESTful services. Apart from application development, I have work experience in machine learning and natural language processing.
I have completed many courses in ML and data science.