We need to build infrastructure capable of scraping/indexing approximately 50-100 news websites.
The bots/scrapers must quickly request, extract and parse news article URLs as they are published to the various news websites, in near real time. The project has two main goals: 1) news content/URLs are parsed and entered into a database as fast as possible (within minutes of publication), and 2) ALL news content/article URLs are captured (no missed content). We are looking to build a robust, easily scalable news aggregator system of bots/scrapers that can handle a load of 10,000-20,000 news articles per 24 hours, and where adding *new* sources of news (publishers, websites) is reasonably easy. We are open to ideas on how to achieve this.
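One possible approach to both goals, offered only as a sketch: poll each publisher's sitemap or feed every minute or so and compare the listed URLs against those already captured. At 10,000-20,000 articles per 24 hours the average load is roughly 7-14 new articles per minute across all sources. The example assumes a source exposes a sitemap with `<loc>` entries. The function names `extractSitemapUrls` and `takeNewUrls` and the in-memory `Set` are illustrative assumptions, not part of this brief.

```typescript
// Hypothetical sketch: all names here are illustrative assumptions.

/** Collect every URL found inside <loc>...</loc> tags of a sitemap XML string. */
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null = pattern.exec(xml);
  while (match !== null) {
    const raw = match[1] || "";
    // Sitemaps escape "&" as "&amp;", so undo that before storing the URL.
    urls.push(raw.replace(/&amp;/g, "&"));
    match = pattern.exec(xml);
  }
  return urls;
}

/** Return only the URLs not seen before, and mark them as seen. */
function takeNewUrls(seen: Set<string>, urls: string[]): string[] {
  const fresh: string[] = [];
  for (const url of urls) {
    if (!seen.has(url)) {
      seen.add(url);
      fresh.push(url);
    }
  }
  return fresh;
}
```

In a real deployment the seen-set would more likely live in the SQL database (for example as a unique index on the article URL) so that restarts do not cause re-processing or missed articles.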
The scraper/bot should capture the following information:
- Unique article URL
- Body of text
- Author name
- Publisher URL (e.g. [login to view URL])
The indexed information should be entered into an SQL database with relevant fields, a unique article_ID, and tables matching the information listed above.
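As a rough illustration of how the four captured fields might map to a database row, the sketch below normalises the article URL so the same story is not stored twice, derives the publisher URL from it, and builds a parameterised INSERT. Everything named here is an assumption for illustration: the `articles` table, its column names, the `?` placeholder style (used by MySQL and SQLite drivers), and the idea that `article_ID` is an auto-increment key assigned by the database.

```typescript
// Hypothetical sketch: table, column and function names are assumptions.

interface ArticleRecord {
  articleUrl: string;        // unique article URL (normalised)
  bodyText: string;          // body of text
  authorName: string | null; // author name, null when no byline is found
  publisherUrl: string;      // publisher URL, e.g. the site origin
}

/** Strip the fragment and utm_* tracking parameters so duplicates collapse. */
function canonicalizeUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = "";
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.toLowerCase().startsWith("utm_")) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

/** Assemble a record; the publisher URL is taken to be the article's origin. */
function makeRecord(rawUrl: string, bodyText: string, authorName: string | null): ArticleRecord {
  const articleUrl = canonicalizeUrl(rawUrl);
  return { articleUrl, bodyText, authorName, publisherUrl: new URL(articleUrl).origin };
}

/** Build a parameterised INSERT; article_ID is left to the database. */
function buildInsert(record: ArticleRecord): { sql: string; params: (string | null)[] } {
  return {
    sql: "INSERT INTO articles (article_url, body_text, author_name, publisher_url) VALUES (?, ?, ?, ?)",
    params: [record.articleUrl, record.bodyText, record.authorName, record.publisherUrl],
  };
}
```

A unique constraint on the URL column would let the database itself reject duplicates, which is one way to keep "capture everything" from turning into "capture everything twice".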
A list of the relevant news websites to be indexed can be obtained on request. They are primarily Australian, New Zealand, US and European news websites. All English language, all properly formatted.
We utilise Linode servers (Ubuntu OS) and operate an SQL database. We would prefer the scraper/bots to be completed in Node.js.
We have a Proxy Rotator that can manage 'requests' from the scraper/bots, so as to 'mask' multiples of visits to news websites.
*** Please note: we are seeking developers with strong experience in scraping/content extraction automation. Our primary focus is to ensure that ALL new news articles added to a news website from the start point are captured, AND that all news articles are captured as quickly as possible. This content needs to be collected 24/7, and within minutes of publication on the news website. Please only post a proposal or express interest if you believe you have the necessary skills to complete this task. ***
39 freelancers are bidding an average of $2692 for this job
There are some questions before we place a firm bid. Let us know if you have some time so we can both converse regarding the questions / doubts we have.
Hello, Thanks for your post! As per my understanding: [login to view URL] want us to develop a bot which will crawl into 50-100 news websites to scrape info & store into a SQL database. [login to view URL] achieve this objectiv