I need a Python script (Scrapy or Selenium; I am open to suggestions) to extract information from some specific websites (I have around 12), either daily (automated) or manually.
The pages are in Portuguese, but I can guide you to the key input fields and key pages to look for.
1. User input:
- Time period (if the page has this feature)
- Website to scrape.
- Keywords (can be a list of words) to look for.
- Local path where the downloaded files should be saved.
2. Workflow:
- Access the page.
- Search the tabs/links inside that domain that can contain useful information (I will provide the specific sections of each webpage to query).
- Download files (if the page serves DOC, HTML, or PDF) and search them for the keywords.
- Extract all related content (files or HTML text).
- Work around CAPTCHAs (if the page has one).
- Every extracted item must record the URL where the information/file is available on the webpage (this can be done via logs).
- Every extracted item must record the DATE on which the scraping was done (this can be done via logs).
- All key fields (e.g., the CSS selector for a date field) should be configurable per spider.
- The start URL for scraping each website should be configurable.
- If a page requires authentication (login/password), the user will fill in the configuration for it.
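To make the configurability requirements above concrete, here is a minimal sketch of what a per-site configuration plus URL/date logging could look like. The field names, the example site, and the helper function are my assumptions for illustration, not a final schema; each of the ~12 sites would get one entry like this:

```python
from datetime import date

# Hypothetical per-site configuration (assumed schema, not final).
# Each of the ~12 mapped websites would get one entry like this.
SITE_CONFIGS = {
    "example_site": {
        "start_url": "https://example.com/busca",    # configurable start URL
        "date_selector": "span.data-publicacao",     # configurable CSS selector
        "result_link_selector": "a.resultado",
        "auth": {"login": None, "password": None},   # filled in only if the site needs it
        "keywords": ["licitação", "edital"],
    },
}

def make_log_record(url: str, matched_keywords: list[str]) -> dict:
    """Attach the source URL and the scrape date to every extracted item."""
    return {
        "source_url": url,                        # where the file/text lives
        "scraped_at": date.today().isoformat(),   # date the scrape was made
        "keywords": matched_keywords,
    }

print(make_log_record("https://example.com/doc.pdf", ["edital"])["source_url"])
```

A Scrapy spider (or Selenium driver) would read its entry from `SITE_CONFIGS` at startup and emit one `make_log_record` per extracted file or text block, so the URL and scrape date are never lost.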
1. My plan is to pay per 4 mapped websites (so the total project consists of 3 "packs" of websites).
2. In a few cases the content will need to be extracted from images.
3. Start your bid with the word "forward" so I know you read the whole description.
4. If you can't properly extract the content from a website, I can give you another one to replace it; you still need to deliver 4 websites per milestone.
5. I WILL RELEASE THE MILESTONES ONLY AFTER YOU SEND ME THE CODE AND I AM TOTALLY SATISFIED (I WILL RUN TESTS TO CHECK FUNCTIONALITY).
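For point 2 above (content that lives inside images), a minimal OCR sketch. It assumes the Tesseract binary, the pytesseract and Pillow packages, and Tesseract's Portuguese language pack (`por`) are installed; the keyword check itself is plain Python and works on any extracted text:

```python
def find_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

def ocr_image(path: str) -> str:
    """OCR one downloaded image with Tesseract's Portuguese model.

    Assumes pytesseract, Pillow, and the Tesseract 'por' language
    pack are installed on the machine running the spiders.
    """
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path), lang="por")

# find_keywords works on any extracted text, OCR'd or not:
print(find_keywords("Edital de Licitação 2023", ["licitação", "contrato"]))
# → ['licitação']
```

The OCR step would slot in right before keyword matching: download the image, run `ocr_image`, then pass the text through the same `find_keywords` filter used for HTML and PDF content.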
I have many projects at hand, and it would be great to establish a good working relationship with you, since I constantly need someone to work with.