• We will provide one dataset with one target variable (“Score”), a timestamp, and 24 independent variables. The dataset contains ~55 thousand observations; however, your solution should scale to a much larger dataset.
• The goal is to write at most 6 sets of “greater than” and “less than” restrictions on the independent variables. Each set of restrictions will return a subsample of the dataset on which we evaluate an objective function.
• Specifically, the objective function is the sum of the target variable over the observations in the selected subsample. Each query (set of restrictions) must return at least 10 valid responses.
• In addition, any observation that comes less than 60 seconds after a valid observation in the subsample will be removed. So each query must return at least 10 responses that are at least 60 seconds apart from each other.
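As a concrete illustration of the two requirements above, here is a minimal sketch of how a single query could be evaluated. The column name W1R1, the toy data, and the lowered minimum of 3 responses (instead of 10, so the toy data passes) are all assumptions, not part of the brief:

```python
def evaluate_query(rows, lo, hi, gap=60, min_responses=3):
    """Apply lo <= W1R1 <= hi, drop any observation within `gap`
    seconds of the previous *kept* observation, and return
    (objective, n_kept), or None if too few responses survive."""
    kept, last_t = [], None
    for t, w1r1, score in sorted(rows):          # chronological order
        if not (lo <= w1r1 <= hi):
            continue                             # fails the restriction
        if last_t is not None and t - last_t < gap:
            continue                             # too close to a kept row
        kept.append(score)
        last_t = t
    if len(kept) < min_responses:
        return None                              # query is invalid
    return sum(kept), len(kept)

# Toy rows: (timestamp in seconds, W1R1, Score) -- purely illustrative.
rows = [
    (0,   0.02,  1.0),
    (30,  0.03,  2.0),   # only 30 s after a kept row -> removed
    (90,  0.02, -1.0),
    (200, 0.05,  3.0),   # outside the interval [0.01, 0.03]
    (300, 0.01,  4.0),
]
```

With the interval [0.01, 0.03], three observations survive the 60-second filter, so the objective is 1.0 - 1.0 + 4.0 = 4.0.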
• In other words, your goal in this project is to corner up to 6 regions of the dataset using intervals on the independent variables, and to maximize the density of positive values of the target within them.
• You can find the dataset in the Excel file “[url removed, login to view]”.
• You will see that some variables have a version A and a version B (for instance W2 R2). In such cases you may use one or the other, but not both.
• Your restrictions cannot use more decimal places than occur in the observations. For instance, a restriction on W1R1 cannot be 0.015; it must be either 0.01 or 0.02.
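One way to enforce this is to snap every candidate cutoff to the precision of the data. A small sketch, assuming W1R1 is recorded with 2 decimal places (the helper name and the floor/ceil pair are illustrative):

```python
import math

def snap_threshold(value, decimals=2):
    """Return the two admissible cutoffs bracketing `value` at the
    data's precision, e.g. 0.015 -> (0.01, 0.02) for 2 decimals."""
    step = 10 ** -decimals
    lo = round(math.floor(value / step) * step, decimals)
    hi = round(math.ceil(value / step) * step, decimals)
    return lo, hi
```

An optimizer can then search only over admissible cutoffs instead of a continuous range.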
• Regression analysis, neural networks, SVMs, and k-means clustering will not help you much: these methods classify observations by applying a weighted average of the independent variables. The classification rule must be on the independent variables directly; it cannot be on a weighted average of them or any other function of them.
• Make sure to sort the observations chronologically by timestamp.
• A good starting point is to plot the density of each independent variable for the subsample of positive target values and for the subsample of negative target values. You can then identify regions with a high density of positive target observations.
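That starting point can be sketched with plain histogram counts rather than density plots: bins where positives dominate negatives suggest candidate intervals. The function name, bin width, and toy inputs below are assumptions:

```python
from collections import Counter

def positive_density_by_bin(values, scores, width=0.01):
    """Count positive- and negative-Score observations per bin of one
    independent variable; returns {bin_left_edge: (n_pos, n_neg)}."""
    def bin_of(v):
        # Left edge of the bin containing v, at the data's 2-dp precision.
        return round(v // width * width, 2)
    pos = Counter(bin_of(v) for v, s in zip(values, scores) if s > 0)
    neg = Counter(bin_of(v) for v, s in zip(values, scores) if s <= 0)
    return {b: (pos.get(b, 0), neg.get(b, 0)) for b in set(pos) | set(neg)}
```

Bins with a high n_pos and a low n_neg are natural candidates for the interval restrictions described above.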
• All accepted bids will be awarded on completion.
• We are going to judge the performance of each bid both in sample (milestone 1) and out of sample (milestone 2). Good performance consists of a high aggregate sum of the target variable.
• After this stage we will ask you to provide details of how you would maintain the existing algorithms (milestone 3) over a much larger dataset of ~500 thousand observations.