I need a Python 3 script that will extract specific parts of text from a PDF file and generate an .rtf file with those text elements isolated, and some extra text added.
The input file is a typical "Written Discovery" request used in litigation (when people are suing/being sued). The first few pages have the case information and preliminary definitions. Then there is a series of numbered requests that all share the same title, each with a different number. Following the title, there is the text of the discovery request.
In the example I am using (see file: [login to view URL]), each discovery request is titled "SAMPLE DISCOVERY REQUEST", but in "real world" documents the title could be any of several (e.g. SPECIAL INTERROGATORY, REQUEST FOR PRODUCTION, REQUEST FOR ADMISSION), although it will be the same throughout a single document.
I would like a script that uses OCR to extract all of the discovery requests and put them into an .rtf file with limited formatting (bold and underlined). Each request should be followed by the following text: RESPONSE TO [TITLE OF DISCOVERY REQUEST] (see file: [login to view URL])
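The output step could be sketched roughly as below, using only the standard library to write RTF by hand. This is a minimal sketch, not a finished implementation: the function names `rtf_escape` and `build_rtf`, the Times New Roman font, and the "NO. n:" heading format are all assumptions made for illustration, and the exact heading wording should follow the sample files.

```python
def rtf_escape(text: str) -> str:
    """Escape text for RTF: backslashes, braces, newlines, non-ASCII."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\line ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF \uN expects signed 16-bit UTF-16 code units, each
            # followed by a fallback character for old readers.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                code = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append(f"\\u{code}?")
    return "".join(out)


def build_rtf(requests, title):
    """Build an RTF document from (number, body) pairs.

    Headings are bold and underlined. The "NO. n:" suffix is an
    assumed format, not something confirmed by the job description.
    """
    parts = [r"{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}\f0\fs24"]
    for number, body in requests:
        heading = f"{title} NO. {number}:"
        response = f"RESPONSE TO {title} NO. {number}:"
        parts.append(r"{\b\ul " + rtf_escape(heading) + r"}\par")
        parts.append(rtf_escape(body) + r"\par")
        parts.append(r"{\b\ul " + rtf_escape(response) + r"}\par")
        parts.append(r"\par")  # blank line left for the response text
    parts.append("}")
    return "\n".join(parts)
```

The returned string can be written to disk with `open("out.rtf", "w", encoding="ascii")`, since the escaping leaves only ASCII characters in the output.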
What makes this tricky is that these documents always appear on "Pleading Paper", where each line is numbered on the left-hand side of the page. This causes the text in the OCR output to be interrupted by numbers (see file: OCR [login to view URL]).
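One way to deal with the stray line numbers is a regular-expression pass over the OCR text. The sketch below assumes the numbers run from 1 to 28 (a common pleading-paper layout) and that OCR puts each number either at the start of a text line or alone on its own line; the real OCR output may differ, and the name `strip_pleading_numbers` is hypothetical.

```python
import re

# Assumption: pleading paper numbers lines 1-28 down the left margin.
LINE_NO = re.compile(r"^\s*(?:[1-9]|1\d|2[0-8])(?:\s+|$)")


def strip_pleading_numbers(ocr_text: str) -> str:
    """Remove a leading margin number from each OCR line.

    Lines that contain nothing but a margin number are dropped.
    """
    cleaned = []
    for line in ocr_text.splitlines():
        line = LINE_NO.sub("", line, count=1).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

A known weakness of this approach: a genuine sentence line that happens to begin with a number from 1 to 28 would lose that number, so a production version would probably need position data from the OCR engine rather than text alone.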
The script will need to determine when the requests begin, what they are called, list all of them (in the example, there are 9, but 30-50 is more typical) and add in the "RESPONSE TO" language.
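Finding where the requests begin and what they are called could be approached by looking for an all-caps heading followed by a number, then taking the most frequent heading as the document's title. The sketch below works on text that has already had the margin numbers removed. It assumes headings look like "TITLE NO. 1:" at the start of a line, which is a guess about the sample file; `split_requests` and the `HEADING` pattern are illustrative names.

```python
import re
from collections import Counter

# Assumed heading shape: "ALL CAPS TITLE NO. 12:" at the start of a line.
HEADING = re.compile(
    r"^([A-Z][A-Z ]+?)\s+(?:NO\.?|NUMBER)\s*(\d+)\s*[:.]?\s*(.*)$"
)


def split_requests(clean_text: str):
    """Return (title, [(number, body), ...]) from cleaned OCR text."""
    lines = clean_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        m = HEADING.match(line.strip())
        if m:
            hits.append((i, m))
    if not hits:
        return "", []
    # The title that occurs most often is taken as the request title.
    title = Counter(m.group(1) for _, m in hits).most_common(1)[0][0]
    hits = [(i, m) for i, m in hits if m.group(1) == title]
    requests = []
    for k, (i, m) in enumerate(hits):
        end = hits[k + 1][0] if k + 1 < len(hits) else len(lines)
        body_lines = [m.group(3)] + lines[i + 1:end]
        body = " ".join(s.strip() for s in body_lines if s.strip())
        requests.append((int(m.group(2)), body))
    return title, requests
```

Everything before the first heading (case information and definitions) is ignored. Text after the last request, such as a signature block, would be swept into the final body, so a real script would need an end marker of some kind.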
This is the first step in what could be a much larger project for the right developer.
Please let me know if you can handle this project in a short timeframe, and if you have any questions.
6 freelancers are bidding an average of $214 on this job
Hello!
I am a Python developer.
I looked at your project and it seems interesting.
I have all the necessary skills required for this project.
Ping me to discuss in detail.