Distributed web page scraper (preferably on EC2)
$100-300 USD
Paid on delivery
As input to your script, I have a list of about 1M URLs. I want these URLs scraped, and inserted into a database. You do NOT need to recursively crawl the URLs. You just need to retrieve them.
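To make the "retrieve and insert" step concrete, here is a rough sketch of what a single worker might do. It is only an illustration under assumptions: SQLite stands in for whatever database is actually used, and the `pages` table, its columns, and the function names `fetch` and `scrape_into_db` are all hypothetical.

```python
import sqlite3
import urllib.request


def fetch(url, timeout=30):
    """Retrieve one URL (no recursion) and return (status, body bytes)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def scrape_into_db(conn, urls, fetch_fn=fetch):
    """Fetch each URL once and store the result, or the error, in a table.

    The table layout is an assumption made for this sketch only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, status INTEGER, body BLOB, error TEXT)"
    )
    for url in urls:
        try:
            status, body = fetch_fn(url)
            row = (url, status, body, None)
        except Exception as exc:
            # Record the failure so the URL can be retried later.
            row = (url, None, None, str(exc))
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)", row)
    conn.commit()
```

Passing the fetch function as a parameter keeps the storage logic testable without network access; a real run would likely add retries, rate limiting, and concurrent requests.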
I want a distributed scraper. In particular, I want to give a parameter N, and have the script automatically provision N scrapers, maybe N different Amazon EC2 instances, or some other cloud service. The N instances should avoid doing the same work.
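One possible way for the N instances to avoid overlapping work, offered as a sketch rather than a requirement: give every instance the full URL list plus its own index, and let each keep only the URLs whose stable hash falls in its shard. The names `shard_for` and `urls_for_worker` are hypothetical, and a shared work queue would be an equally valid approach.

```python
import hashlib


def shard_for(url: str, n: int) -> int:
    """Map a URL to a worker index in [0, n) using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process and would give different answers on different machines.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def urls_for_worker(urls, worker_id: int, n: int):
    """Yield only the URLs assigned to the given worker."""
    for url in urls:
        if shard_for(url, n) == worker_id:
            yield url
```

With this scheme, instance k of N would call `urls_for_worker(all_urls, k, N)` and scrape only what it yields, so no coordination between instances is needed at run time.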
I don't mind if you write a wrapper script around Scrapy or another existing web scraper implementation. You can do this if you already know Scrapy or Bixo and want to use it.
The script should require very little configuration. It should be convenient, ideally one-click. That way, the next time I have a batch of 1M URLs, I can easily run your script again.
Project ID: #3680209