This job is only for VERY experienced programmers.
Hi, I'm looking for someone to build me a method of parsing irregularly formatted CVs, preferably using Textract, but other options will be considered. It has to be able to parse out specific information related to artists. Maybe Google has options. You need to be experienced with AWS. I'm open to page-scraping technology as well. Either way, you will need to be AWS certified because this has to go onto my AWS production server at some point in the future.
The solution should be able to:
1. create a simple web form that lets us capture the metadata and gives the user a simple method of submitting the CV. The metadata captured is covered in point 9
2. parse from either a PDF document or an HTML page. If you convert the HTML page to a PDF before sending it to Textract then I will need to know what API you intend to use. Have a look at the document JohnDoe_sample
3. although the majority of CVs will be 1 column, the solution will also be required to handle 2-column text - how will you handle 2 columns?
4. the final product must be able to skip over data within the CV that cannot be parsed and continue moving through the CV without manual intervention
5. each record output will have a confidence value between 0 and 100
6. the output will need to be in a suitable format to be imported into our db - preferably as JSON
7. each record will include the artist_name, year of exhibition, title of exhibition, group or solo exhibition, venue
8. the only reliable way to distinguish one exhibition record from another in 95% of instances is to find a geographic location followed by either a line feed or a ".", for example VIC or Melbourne
9. there will be metadata including the name of the artist, DOB, email of the person requesting the scan, the date it was scanned and whether the person scanning it was the artist or their agent
10. the parsed document will be output to screen for verification and as a JSON file to my S3 bucket
11. a manifest/log file should be created capturing each scan, date etc. for administration
12. each CV should not take more than 180 seconds to process
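To give a feel for points 3 and 5, here is a rough Python sketch, not a prescribed solution. It assumes the CV has already been sent to Textract and that the response is a dict containing `Blocks` of type `LINE`, each with `Text`, `Confidence` and a `Geometry.BoundingBox`. The function names, the `split_at` midpoint and the idea of averaging line confidences are my own assumptions for illustration; a real CV may need smarter column detection.

```python
from typing import Dict, List


def lines_in_reading_order(response: Dict, split_at: float = 0.5) -> List[Dict]:
    """Order LINE blocks page by page, left column first, then top to bottom.

    split_at is an assumed page midpoint (0-1) used to separate the columns.
    """
    lines = [b for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]

    def sort_key(block: Dict):
        box = block["Geometry"]["BoundingBox"]
        column = 0 if box["Left"] < split_at else 1
        return (block.get("Page", 1), column, box["Top"])

    return [
        {"text": b["Text"], "confidence": b["Confidence"]}
        for b in sorted(lines, key=sort_key)
    ]


def mean_confidence(lines: List[Dict]) -> float:
    """One possible 0-100 score for a record: the average of its line confidences."""
    if not lines:
        return 0.0
    return round(sum(line["confidence"] for line in lines) / len(lines), 1)


def _line(text: str, left: float, top: float, confidence: float) -> Dict:
    """Build a minimal LINE block shaped like a Textract response entry."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Confidence": confidence,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top, "Width": 0.4, "Height": 0.02}},
    }


# Hand-made stand-in for a Textract response with two columns on one page.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        _line("right column, first line", 0.55, 0.10, 91.0),
        _line("left column, second line", 0.05, 0.20, 95.0),
        _line("left column, first line", 0.05, 0.10, 99.0),
    ]
}

ordered = lines_in_reading_order(sample_response)
for line in ordered:
    print(line["text"], line["confidence"])
print("record confidence:", mean_confidence(ordered))
```

A single-column CV passes through the same function unchanged, because every line falls on the left of the assumed midpoint.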
1. If you are interested let me know - I'll be looking at your experience with AWS, Textract and programming
2. I will then send you more detailed briefing documentation.
3. You will then send me your proposed technology solution. It isn't really helpful to just say "I can do it". I want to know how you will complete this job, what technology you think will give the best results, how many hours you estimate it will take to work out the parsing, how you will use the AWS environment and what you will need to set up.
4. If you quote above the range then you probably won't get the gig unless you can show how your solution is neat and brilliant.
The required output at the end of this is to be able to identify the exhibition details for the given artist. You need to extract the following information from each CV:
<YYYY> - <exhibition title>, <exhibition venue>, <exhibition city>
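As a starting point for discussion, the record format above and the location rule from point 8 could be sketched in Python roughly as follows. Everything named here is an assumption for illustration: the tiny `LOCATIONS` list, the function names and the regular expressions are hypothetical, and real CVs will need a much larger gazetteer and looser patterns. Chunks that do not fit the pattern are skipped rather than stopping the run, as point 4 asks.

```python
import json
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical, deliberately tiny location list; a real one would be far larger.
LOCATIONS = ["VIC", "NSW", "Melbourne", "Sydney"]
_LOC = "|".join(re.escape(loc) for loc in LOCATIONS)

# A record ends at a known location followed by "." or a line feed (or end of text).
RECORD_END = re.compile(rf"\b(?:{_LOC})\b\s*(?:\.|\n|$)")

# <YYYY> - <exhibition title>, <exhibition venue>, <exhibition city>
RECORD = re.compile(
    r"\b(?P<year>(?:19|20)\d{2})\b\s*-\s*(?P<title>[^,]+),\s*"
    r"(?P<venue>.+),\s*(?P<city>[^,]+?)\s*\.?\s*$"
)


def split_records(text: str) -> List[str]:
    """Cut the text after each location terminator."""
    chunks, start = [], 0
    for match in RECORD_END.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            chunks.append(chunk)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


def parse_record(chunk: str, artist_name: str) -> Optional[Dict]:
    """Return one exhibition record, or None if the chunk does not fit the format."""
    match = RECORD.search(" ".join(chunk.split()))  # collapse line breaks first
    if match is None:
        return None
    return {
        "artist_name": artist_name,
        "year": int(match["year"]),
        "title": match["title"].strip(),
        "venue": match["venue"].strip(),
        "city": match["city"].strip(),
    }


def parse_cv(text: str, artist_name: str) -> Tuple[List[Dict], List[str]]:
    """Parse every chunk; collect the ones that fail instead of raising."""
    records, skipped = [], []
    for chunk in split_records(text):
        record = parse_record(chunk, artist_name)
        if record is None:
            skipped.append(chunk)
        else:
            records.append(record)
    return records, skipped


cv_text = (
    "SOLO EXHIBITIONS\n"
    "2019 - Light Works, Gallery One, Melbourne VIC.\n"
    "Lives and works in Sydney.\n"
    "2018 - Group Show, Some Space, Sydney\n"
)
records, skipped = parse_cv(cv_text, "John Doe")
print(json.dumps(records, indent=2))
print("skipped:", skipped)
```

The skipped chunks are returned alongside the records so they can be shown on the verification screen or written to the manifest/log file.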
There is a more detailed breakdown of this job that I can forward to interested parties.
Thanks for your interest.
18 freelancers are bidding an average of $521 on this job
Yes sir, I'm an experienced programmer in parsing CSV and XML. I used to parse CSV files to import products into many ecommerce sites, so I'm confident I can parse it as per your requirements. Thanks
Hi, I can develop this irregular CSV interpreter module perfectly within a tight deadline. PHP expert who has 100% job completion here. Let's discuss the project over a private message. Thank you!