Data Analysis
- Status: Closed
- Prize: $100
- Entries Received: 25
- Winner: pshyvw
Contest Brief
Taking the attached CSV file (or any other CSV in similar format) can you provide a solution that accepts a percentage parameter and have it available in executable fashion so that I can run it in the future.
The solution needs to do the following:
1. Delete all rows where COUNTRY = ANTEPOST
2. Delete all rows where FULL_DESCRIPTION contains 'Place Market' or 'Odds' or 'Forecast' or 'Specials' or 'Match' or 'Distance' or 'Without'
3. Sort by COUNTRY (column D), EVENT_ID, IN_PLAY DESC, LATEST_TAKEN
4. For each EVENT_ID do the following:
a. Delete all rows where LATEST_TAKEN is 2 or more minutes before the SCHEDULED_OFF (i.e. if SCHEDULED_OFF is 31/03/2014 13:30 then any times before 31/03/2014 13:29)
b. Determine the (first if more than one) SELECTION with the lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Arab Spring with ODDS of 1.7 in the race whose EVENT_ID is 113511428). This will be called the FAVOURITE.
c. Delete all other rows where IN_PLAY = PE
d. Determine the (first) SELECTION with the (same or) next lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Mr Greenspan in the race whose EVENT_ID is 113511428)
e. If there is less than a 5% difference (this is the supplied parameter) between the two ODDS for 4b and 4d, then delete all other rows apart from 4b and 4d and exit (i.e. move onto next EVENT_ID)
f. Delete all rows for the EVENT_ID where SELECTION FAVOURITE
g. Find the lowest value ODDS where IN_PLAY = IP. This will be called LOWEST_IP.
h. Delete all other rows where IN_PLAY = IP
5. Save a separate cut down version of the CSV after the above processing has taken place (i.e. each EVENT_ID should have 2 rows - one with a PE and one with an IP or two with PE if we exited at 4e above).
6. Split the cut down version of the CSV into two further files extracting the rows where EVENT = TO BE PLACED into a file of their own and saving the remaining rows in another file of their own.
03.07.2014 14:55:00 (UK)____AMENDED INSTRUCTIONS AS FOLLOWS (after swapping 4c and 4d points above and making further small tweaks)
4. For each EVENT_ID do the following:
a. Delete all rows where LATEST_TAKEN is 2 or more minutes before the SCHEDULED_OFF (i.e. if SCHEDULED_OFF is 31/03/2014 13:30 then any times before 31/03/2014 13:29)
b. Determine the (first if more than one) SELECTION with the lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Arab Spring with ODDS of 1.7 in the race whose EVENT_ID is 113511428). This will be called the FAVOURITE.
c. Determine the (first) SELECTION with the (same or) next lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Mr Greenspan in the race whose EVENT_ID is 113511428). This will be called the 2ND_FAVOURITE.
d. Delete all other rows where IN_PLAY = PE or IN_PLAY = NI (i.e. where EVENT_ID = 113511428)
e. If there is less than a 5% difference (this is the supplied parameter) between the two ODDS for 4b (1.7) and 4c (3.8), then delete all other rows apart from 4b and 4c and exit (i.e. move onto next EVENT_ID)
f. Sort by COUNTRY (column D), EVENT_ID, SELECTION and delete all rows for the EVENT_ID where SELECTION is NOT equal to FAVOURITE (determined in 4b) but keep the row where 2ND_FAVOURITE was determined
g. If COUNTRY is NOT equal to 'GB' or NOT equal to 'IRE', sort by COUNTRY (column D), EVENT_ID, ODDS and find the lowest value ODDS where IN_PLAY = IP. This will be called LOWEST_IP.
h. If COUNTRY is NOT equal to 'GB' or NOT equal to 'IRE', delete all other rows where IN_PLAY = IP (i.e. where EVENT_ID = 113511428)
5. Save a separate cut down version of the CSV after the above processing has taken place (i.e. each EVENT_ID should have 3 rows - two with a PE and one with an IP or two with PE if we exited at 4e above or COUNTRY is NOT equal to 'GB' or 'IRE').
6. Split the cut down version of the CSV into two further files extracting the rows where EVENT = TO BE PLACED into a file of their own and saving the remaining rows in another file of their own.
Recommended Skills
Employer Feedback
“Delighted with the end result. As the rating suggests, the quality, communication, expertise and professionalism was top notch. I\'m already thinking of the next task to give to this freelancer.”
Collie33, Ireland.
Public Clarification Board
-
pshyvw
- 11 years ago
FYI. To see if there is a real need to sort based on multiple fields(could be time consuming for large datasets), I ran some checks. For a given EVENT_ID, every SELECTION belongs to only one COUNTRY. Multi-field sorting at 4f & 4g is not needed in my opinion(unnecessary performance hit), if we can get the records based on the other filters to arrive at that step.
- 11 years ago
View 5 more messages
-
FixTheTask
- 11 years ago
And the part appears in my proposed image as well. it shows how i made it. it does not take more than some seconds to sort them.
- 11 years ago
-
tosha5252
- 11 years ago
If it does not take more than some seconds, you haven't done it correctly
- 11 years ago
-
FixTheTask
- 11 years ago
I ran some checks. For a given EVENT_ID, every SELECTION belongs to only one COUNTRY.
- 11 years ago
-
pshyvw
- 11 years ago
Incorporated all changes. New findings are that for each EVENT_ID, output csv file has: (a) 3 records with the third rec for IN_PLAY=IP only when COUNTRY is "GB" or "IRE". OR (b) 2 records with IN_PLAY= PE or NI (c) Interestingly, has 161 entries with only one record(FAVORITE). Found that these events have very few records 'during the last minute' (sometimes only 1 record during the last minute as in say EVENT_ID=113530124) and even if more than 1 record in last minute they have only one SELECTION. For this third category, we get only FAVORITE; there is no SECOND_FAVORITE as per modified rule 4c
- 11 years ago
-
pshyvw
- 11 years ago
Oops: Message got posted when I hit Enter. Please ignore previous message
- 11 years ago
-
pshyvw
- 11 years ago
Incorporated all changes. New findings:
- 11 years ago
-
haankrishan
- 11 years ago
sir had u selected the winner
- 11 years ago
-
Contest Holder - 11 years ago
Folks, just so you know you're on the right track I've saved a CSV/XLSX file for most of the steps in the instructions (for EVENT_ID = 113511428).
See files available from https://www.dropbox.com/sh/hh4dj7dtuhlz2r1/AAAEUCYMsV6L5msf8wLt8L36a.- 11 years ago
-
Contest Holder - 11 years ago
Folks, I've just realised that only UK and Ireland have in play markets so 4d now becomes:
d. Delete all other rows where IN_PLAY = PE or IN_PLAY = NI (i.e. where EVENT_ID = 113511428)
g and h will not be applicable when COUNTRY is NOT equal to 'GB' or NOT egual to 'IRE'
I'm updating the brief to reflect this and small typo in point e which I'm also updating.- 11 years ago
-
pshyvw
- 11 years ago
4g. I believe the objective is to get the 'lowest ODDS' for a given EVENT_ID and IN_PLAY=IP? Was wondering what the sorting by COUNTRY, EVENT_ID and ODDS would result in. In my opinion, EVENT_ID is redundant at 4g as it is the same for all records for consideration at this stage. Further, we have deleted all records other than for FAVORITE SELECTION. A SELECTION belongs to only one COUNTRY and as such COUNTRY too would be same for all records selected for this step. Can we simply sort based on ODDS for the relevant 'subsetted' dataset at that step? Data 'subsetting' is based on EVENT_ID & IN_PLAY=IP. While this may sound like an implementation issue, need confirmation whether sorting by COUNTRY can lead to a different record being selected (not necessarily the lowest ODDS for the dataset filtered only by EVENT_ID & IN_PLAY=IP).
- 11 years ago
-
Contest Holder - 11 years ago
Yes, the objective is to find the lowest ODDS for the FAVOURITE. There is no need to sort if you don't think it is necessary provided you retain the correct record/row. However, a SELECTION does exist for multiple EVENT_IDs (e.g. 113511429 and 113511429). A SELECTION could also race in the UK (where COUNTRY = GB) one day and then race in Ireland (where COUNTRY = IRE) a day or two later.
- 11 years ago
-
pshyvw
- 11 years ago
4c. If we look at the dataset just before executing this step, the (first) SELECTION with the 'next lowest ODDS' for EVENT_ID=113511428 during the minute before SCHEDULED OFF is also Arab Spring with 1.71. Since you were expecting Mr Greenspan with 3.8, I think 4c wording (initial part) should be "Determine the 'next' (other than FAVORITE) SELECTION with the lowest ODDS during the ...". Please confirm.
- 11 years ago
-
Contest Holder - 11 years ago
Yes you are correct - good observation - I will amend brief to match.
- 11 years ago
-
pshyvw
- 11 years ago
4b. Though not mentioned, it is implied from point 5 that the FAVORITE SELECTION is from IN_PLAY=PE. For EVENT_ID=100933526, we have all IN_PLAY=NI, for which we can determine 4b & 4c though not 4g. This is because 4g clearly specifies "where IN_PLAY=IP". Can we add the "where IN_PLAY=PE" to the specification at 4b & 4c? Or alternately, how do we handle cases like 100933526 where we get the first 2 NIs (not PEs) but no IP.
- 11 years ago
-
Contest Holder - 11 years ago
I've just realised that only UK and Ireland have in play markets so 4d now becomes:
d. Delete all other rows where IN_PLAY = PE or IN_PLAY = NI (i.e. where EVENT_ID = 113511428)
g and h will not be applicable when COUNTRY is NOT equal to 'GB' or NOT egual to 'IRE'- 11 years ago
-
momo2619
- 11 years ago
That contest is nearly 1 a day, and if the fastest should win, so one of those freelancer could finish it since yesterday.
Shall I continue or stop here? just one answer. Continue or not.- 11 years ago
-
momo2619
- 11 years ago
Just to update you, the application will be fast and simple.
if more details needed please let me know. like the interface or anything else.- 11 years ago
-
Contest Holder - 11 years ago
I haven't given any thought to interface but may have some comments when I see it.
- 11 years ago
-
bstoinev
- 11 years ago
How about new button in the ribbon that allows you to do this?
- 11 years ago
-
Contest Holder - 11 years ago
I don't understand your comment - ribbon?
- 11 years ago
-
artej11
- 11 years ago
Is mathematica script fine?( you will need mathematica to run it)
- 11 years ago
-
Contest Holder - 11 years ago
Can I download mathematica for free? If so, then it should be okay.
- 11 years ago
-
artej11
- 11 years ago
also what does f. Delete all rows for the EVENT_ID where SELECTION FAVOURITE mean?
- 11 years ago
-
artej11
- 11 years ago
Is it just delete all favourites?
- 11 years ago
-
Contest Holder - 11 years ago
Typo in original instructions. See amended brief where f is now:
f. Sort by COUNTRY (column D), EVENT_ID, SELECTION and delete all rows for the EVENT_ID where SELECTION is NOT equal to FAVOURITE (determined in 4b) but keep the row where 2ND_FAVOURITE was determined- 11 years ago
-
pshyvw
- 11 years ago
If we go by the specified sequence at 4c all PE records would get deleted and there is no way we can get another PE at 4d - would always be PE and IP
- 11 years ago
-
Contest Holder - 11 years ago
See amended brief as 4c and 4d were in the wrong order previously.
- 11 years ago
-
pshyvw
- 11 years ago
In 4b, how do we interpret "during the minute before the SCHEDULED_OFF" - does this mean that SCHEDULED_OFF - LATEST_TAKEN should be 0 or 1 as 2 and more are deleted?
- 11 years ago
-
Contest Holder - 11 years ago
Perform some date/time comparison and only choose the minute before the off. So of the off time was 13:30 then the minute before would be 13:29. I'm only interested in rows that contain 13:29.
- 11 years ago
-
mfvonh
- 11 years ago
Would you accept a Python script?
- 11 years ago
-
Contest Holder - 11 years ago
Assuming I can easily run it on a Windows environment then yes.
- 11 years ago
-
Contest Holder - 11 years ago
Contest brief updated to reflect new amendments to instructions.
- 11 years ago
-
Contest Holder - 11 years ago
5. Save a separate cut down version of the CSV after the above processing has taken place (i.e. each EVENT_ID should have 3 rows - two with a PE and one with an IP or two with PE if we exited at 4e above).
6. Split the cut down version of the CSV into two further files extracting the rows where EVENT = TO BE PLACED into a file of their own and saving the remaining rows in another file of their own.- 11 years ago
-
Contest Holder - 11 years ago
c. Determine the (first) SELECTION with the (same or) next lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Mr Greenspan in the race whose EVENT_ID is 113511428). This will be called the 2ND_FAVOURITE.
d. Delete all other rows where IN_PLAY = PE (i.e. where EVENT_ID = 113511428)
e. If there is less than a 5% difference (this is the supplied parameter) between the two ODDS for 4b (1.7) and 4c (3.8), then delete all other rows apart from 4b and 4d and exit (i.e. move onto next EVENT_ID)
f. Sort by COUNTRY (column D), EVENT_ID, SELECTION and delete all rows for the EVENT_ID where SELECTION is NOT equal to FAVOURITE (determined in 4b) but keep the row where 2ND_FAVOURITE was determined
g. Sort by COUNTRY (column D), EVENT_ID, ODDS and find the lowest value ODDS where IN_PLAY = IP. This will be called LOWEST_IP.
h. Delete all other rows where IN_PLAY = IP (i.e. where EVENT_ID = 113511428)- 11 years ago
-
Contest Holder - 11 years ago
Folks,
I spotted an error in my instructions which may answer some of your questions. I've slightly amended the instructions which basically swap points 4c and 4d. There are some other tiny tweaks which should help explain the scenario. Amended instructions as follows (in separate messages due to 1000 character limit):
4. For each EVENT_ID do the following:
a. Delete all rows where LATEST_TAKEN is 2 or more minutes before the SCHEDULED_OFF (i.e. if SCHEDULED_OFF is 31/03/2014 13:30 then any times before 31/03/2014 13:29)
b. Determine the (first if more than one) SELECTION with the lowest ODDS during the minute before the SCHEDULED_OFF (i.e. Arab Spring with ODDS of 1.7 in the race whose EVENT_ID is 113511428). This will be called the FAVOURITE.- 11 years ago
-
anujbatham
- 11 years ago
wait for my entry
- 11 years ago
-
ACMAuk
- 11 years ago
Stay tuned- Almost complete.........
- 11 years ago
-
shenghuan
- 11 years ago
1. Delete all rows where COUNTRY = ANTEPOST --- There were even no country named ANTEPOST ( or list of ANTEPOST provided).
- 11 years ago
-
shenghuan
- 11 years ago
Thank you.
- 11 years ago
-
soniapathania
- 11 years ago
hi..is these employers seeking data entryjobs real..because they are asking for registration fee.
- 11 years ago
-
FixTheTask
- 11 years ago
Please wait for something
- 11 years ago
-
momo2619
- 11 years ago
I am nearly finished the job itself. But if you need it in an executable,i am creating currently a c# application to do the task,
- 11 years ago
-
ACMAuk
- 11 years ago
Please wait for something VERY EXCELLENT!
- 11 years ago
-
shenghuan
- 11 years ago
Does the data set only contain IN_PLAY = IP or IN_PLAY = PE records?
- 11 years ago
-
Contest Holder - 11 years ago
For each EVENT_ID you should have one row where IN_PLAY = IP and one row where IN_PLAY = PE.
The row where IN_PLAY = PE should be a minute before the scheduled off time and the row where IN_PLAY = IP should be the lowest odds figure for the where SELECTION_ID is the same as that for the PE row.- 11 years ago
-
shenghuan
- 11 years ago
Your description is not clear. It does not mention if IN_PLAY = other should be kept.
- 11 years ago
-
shenghuan
- 11 years ago
My feeling is freelancer is only for logo design. So only images are allowed to be submitted.\
- 11 years ago
-
shenghuan
- 11 years ago
It only allow to submit figures. I change the .csv to .gif or ,png, but cannot submit.
- 11 years ago
-
shenghuan
- 11 years ago
An error occurred, please try again later.
- 11 years ago
How to get started with contests
-
Post Your Contest Quick and easy
-
Get Tons of Entries From around the world
-
Award the best entry Download the files - Easy!