Preference will be given to freelancers who can complete the project within 24-48 hours and who have experience with R and Random Forests. You will need to be very familiar with both, as I am not and cannot provide much assistance.
Essentially, I am looking for a small enhancement of the Random Forest process in the R GUI called Rattle. From what I can tell by looking at the R add-in called Party, it already includes a number of relevant functions, so the work might amount to adding perhaps 5-15 lines of code to what I already have (although I could certainly be off on that estimate).
Using the R GUI called Rattle, I can easily select my dataset (see below), choose a single Y, set the random seed, and choose the ratio of training to testing data. Next, I run the RF (Random Forest) model, choosing only the number of trees (default is 500) and the number of predictors sampled at each split (default is the integer part of the square root of the m total predictors). From this, R (through Rattle's code) gives me the Out-of-Bag Error and the traditional 2x2 classification grid for both training and testing data. Not including the 5 seconds it takes R to run the code, I can set up this scenario from scratch in less than 1 minute. Due to Rattle’s limitations, I can only run it for a single Y at a time. This limitation, together with the inability to aggregate those Out-of-Bag results, is the problem I need solved.
The algorithm above is outlined very succinctly at [url removed, login to view]~dzeng/BIOS740/[url removed, login to view] on the first page under the title “The algorithm”, and is covered in the listed points 1, 2, 3 and 1. Essentially, what I need done is the very next point they list, which says:
2. Aggregated the OOB predictions. (On the average, each data point would be out-of-bag around 36% of the times, so aggregate these predictions.) Calculate the error rate, and call it the OOB estimate of error rate.
However, because my data are skewed towards y-values of 0, I am really after the PPV (Positive Predictive Value, i.e. of the observations where a 1 is predicted for Yn, the share that actually are 1) and not the global OOB error of the models. I am therefore more interested in the raw prediction counts, so that I can calculate error rates myself.
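To make the PPV calculation concrete, here is a minimal base-R sketch of how it could be derived from aggregated prediction counts. The function name `ppv_from_counts` and the five-row example vectors are made up for illustration and are not part of Rattle or any package.

```r
# Hypothetical helper: PPV from aggregated prediction counts.
# actual : vector of true 0/1 values for one Y
# count1 : number of times each observation was predicted as 1
ppv_from_counts <- function(actual, count1) {
  true_pos  <- sum(count1[actual == 1])  # predicted 1, actually 1
  false_pos <- sum(count1[actual == 0])  # predicted 1, actually 0
  true_pos / (true_pos + false_pos)
}

# Made-up example: 5 observations, 10 predictions each
actual <- c(1, 0, 0, 1, 0)
count1 <- c(8, 1, 0, 6, 3)
ppv <- ppv_from_counts(actual, count1)  # 14 true positives out of 18 predicted 1s
print(ppv)
```

Because the denominator counts only predictions of 1, a model that rarely predicts 1 on data skewed towards 0 can still be judged on how reliable those few positive predictions are.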
I will supply a CSV data sample of ~4000 observations (~50/50 training/testing split) with multiple binary Y's and multiple binary X's and one continuous X (an integer ranging from 0 to ~30) for each observation. I can even supply the R code from Rattle for the procedure I am currently using.
I would like your R code to be able to accept the following inputs from me:
-observations in the format: Observation #, Y1…Yn, X1…Xm
-random seed value
-number of trees value (default is 500)
-number of predictors to be randomly sampled (default is the integer of the square root of m total predictors)
-number of rows at bottom of data list for holdout data (to be scored each round)
-number of rounds (which will be ~1,000 – 1,000,000)
I would like your R code to be able to supply the following outputs to me:
-CSV file with full original data plus the aggregated OOB prediction totals (for both training and testing data) for each observation for each Y (i.e. the number of times the OOB prediction was 0 for each observation for each Y and the number of times the OOB prediction was 1 for each observation for each Y)
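For illustration only, the requested inputs and output might be wired together roughly as follows. This is a sketch under several assumptions, not a finished solution: it uses the `randomForest` package (rather than Party), treats a "round" as one complete forest fit, scores the holdout rows with an ordinary prediction (they are never out-of-bag), and the function name `aggregate_oob` and the `_pred0`/`_pred1` column suffixes are invented here.

```r
# Sketch only: assumes the 'randomForest' package is installed and that
# the holdout rows sit at the bottom of the data frame.
library(randomForest)

aggregate_oob <- function(data, y_cols, x_cols, seed = 42, ntree = 500,
                          mtry = floor(sqrt(length(x_cols))),
                          n_holdout = 0, rounds = 10) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- seq_len(n - n_holdout)
  hold_idx <- if (n_holdout > 0) (n - n_holdout + 1):n else integer(0)
  out <- data
  for (y in y_cols) {
    count0 <- integer(n)
    count1 <- integer(n)
    y_train <- factor(data[train_idx, y], levels = c(0, 1))
    for (r in seq_len(rounds)) {
      fit <- randomForest(x = data[train_idx, x_cols, drop = FALSE],
                          y = y_train, ntree = ntree, mtry = mtry)
      # fit$predicted is the OOB class per training row (NA if never OOB)
      pred <- as.character(fit$predicted)
      if (length(hold_idx) > 0) {
        # Holdout rows are scored with a regular (non-OOB) prediction
        hold_pred <- predict(fit, data[hold_idx, x_cols, drop = FALSE])
        pred <- c(pred, as.character(hold_pred))
      }
      count0 <- count0 + (!is.na(pred) & pred == "0")
      count1 <- count1 + (!is.na(pred) & pred == "1")
    }
    out[[paste0(y, "_pred0")]] <- count0
    out[[paste0(y, "_pred1")]] <- count1
  }
  out
}
```

The returned data frame could then be saved with something like `write.csv(result, "oob_counts.csv", row.names = FALSE)`, where the file name is a placeholder. With the round counts mentioned above (up to 1,000,000), run time would need to be checked before relying on this approach.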
If you happen to be aware of an open source R GUI that will already do all of the above for me (and that I can understand and use), you can just help me install it and will not need to supply the R code. As long as it works for me, the project will be considered completed.