Closed

Review and help for Python spiders running on Scrapinghub

Hello,

I have a list of about 10k URLs that I need to validate. All I have to do is scrape the home page if the website responds, and discard the others.

However, I've noticed that when I run the spider on Scrapinghub several times in a row, I get inconsistent results: the number of scraped items differs from run to run. The main difference is usually in the number of timed-out URLs.

I have set DOWNLOAD_TIMEOUT to 300 (with RETRY_ENABLED set to False), but I still get a number of errors like "[login to view URL] [login to view URL]: User timeout caused connection failure: Getting [login to view URL] took longer than 300.0 seconds.."
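For reference, the two settings mentioned above could be written roughly as follows. This is only a sketch: `DOWNLOAD_TIMEOUT` and `RETRY_ENABLED` are standard Scrapy setting names with the values given in this post, while the dictionary name `CUSTOM_SETTINGS` is an illustrative assumption (in a spider it would typically be assigned to the `custom_settings` class attribute).

```python
# Sketch of the two Scrapy settings described in the post.
# The dictionary name is hypothetical; the keys are standard Scrapy settings.
CUSTOM_SETTINGS = {
    "DOWNLOAD_TIMEOUT": 300,  # seconds the downloader waits before timing out
    "RETRY_ENABLED": False,   # failed requests are not retried
}
```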

I have tried some of the 'slowest' websites (with a request duration > 50 seconds) in the browser and they work fine. Even when I run the scraping on a single website on my local machine, it works fine and loads quickly (in less than 2 to 3 seconds).

When looking at the request logs, I found 300 URLs with a request duration of more than 50 seconds. Whenever I browse those websites, or launch a spider on only one of those URLs, the response is fast.

I have now isolated the 100 slowest requests (50 seconds or more) and created a new spider with those URLs.
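The selection step could be sketched like this. It assumes the request log has already been loaded into a list of dictionaries with hypothetical `url` and `duration_ms` keys; the real export format is in the attached files and may differ, and the sample records are made up for illustration.

```python
import heapq


def slowest_requests(records, min_duration_ms=50_000, limit=100):
    """Return up to `limit` records at or above the threshold, slowest first."""
    slow = [r for r in records if r["duration_ms"] >= min_duration_ms]
    return heapq.nlargest(limit, slow, key=lambda r: r["duration_ms"])


# Illustrative records only; these are not taken from the real logs.
sample = [
    {"url": "http://a.example", "duration_ms": 61_000},
    {"url": "http://b.example", "duration_ms": 350},
    {"url": "http://c.example", "duration_ms": 75_000},
]
slow_urls = [r["url"] for r in slowest_requests(sample)]
```

The resulting `slow_urls` list is what would be fed to the second spider as its start URLs.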

When I look at this spider's request logs, I see that the request durations are not at all the same as in the original run: they follow a pattern, rising from about 200 ms for the first request to around 2000 ms for the last request.
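One rough way to put a number on that drift is a least-squares slope of duration against request order. The sketch below is only an illustration: the function name is hypothetical, and the sample values are invented, with only the 200 ms and 2000 ms endpoints taken from the description above.

```python
def duration_slope(durations_ms):
    """Least-squares slope of request duration (ms) against request index."""
    n = len(durations_ms)
    if n < 2:
        raise ValueError("need at least two durations")
    mean_x = (n - 1) / 2
    mean_y = sum(durations_ms) / n
    cov = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations_ms))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Invented durations in request order, rising from 200 ms to 2000 ms.
sample = [200, 650, 1100, 1550, 2000]
slope = duration_slope(sample)  # average increase in ms per request
```

A clearly positive slope would suggest that durations grow with the request's position in the run rather than with the target site, which is the pattern described above.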

So my final question is: how can I avoid this 'instability'? I need to run these spiders regularly to maintain a list of working URLs, and I can't afford to have missing items.

I have attached a zip file ([login to view URL]) with all the files that support my investigation:

- [login to view URL]: here you can see 4 identical spiders giving different results

- [login to view URL]: the [login to view URL] file

- [login to view URL]: an overview of the spider

- [login to view URL]: the stats of the 4 spiders, showing a big difference in timeouts and HTTP status codes

- [login to view URL]: the request logs of 8830 URLs (see how the request duration goes gradually higher, and then cycles back)

- [login to view URL]: an extract of the 100 slowest requests from [login to view URL]

- [login to view URL]: the same spider, running on the 100 'slowest' URLs taken from [login to view URL] (see how the request duration goes gradually higher)

Skills: Python, Scrapy

About the employer:
(7 reviews) Noumea, New Caledonia

Project ID: #17676525

11 freelancers are bidding an average of $32/hour for this job

chirgeo

Hi. OK, I can investigate this issue and see what might be wrong. My guess is that this may also be related to the location from which the connections are made. To solve this issue we may require to use different prox ...

$40 USD / hour
(85 reviews)
7.1
adampohp79

Dear Sir, glad to meet you. I'm a web developer specializing in web scraping, crawling and indexing web pages. Skills: python, scrapy, selenium, requests, beautifulsoup, mechanize, lxml, urllib2, automation, bots, ...

$41 USD / hour
(32 reviews)
5.4
abhilashtv

Hi, ➲ 7+ years of full-time experience in Python / Django with 50,000+ Upwork hours billed and 50+ successful Python projects ➲ Upwork Top 10 Certification for Python and Django ➲ Guaranteed Results Policy: Pay only ...

$26 USD / hour
(12 reviews)
5.0
polarjin2017

Hi. How are you? I read your project description carefully. Owing to my rich experience in Python and Scrapinghub, I can say I can do this perfectly. I have many top skills like Python, Scrapy, CSS, HTML, PHP, ...

$41 USD / hour
(8 reviews)
4.6
roshanasim

I have worked in Python for 5 years. I have developed a mental health project expression recognition in python integration with android, natural language processing in Python, regular expressions handling, development ...

$42 USD / hour
(14 reviews)
4.4
WIFTCAP

Hi, we believe we can take care of your requirements. We will be happy to discuss your requirements in detail and take this forward. Previous work: we have worked on several projects in Python and Django, including ...

$25 USD / hour
(11 reviews)
4.3
seemasit

Hi, how are you :) I have gone through your project requirements. I am proficient in Python and I can develop scraping code for your project, so please come over to chat for further discussion about the project. ...

$27 USD / hour
(0 reviews)
0.0
Thesynapses

Hello, you landed on a perfect profile! I have gone through the job post & feel great pleasure in contacting you to initiate a [login to view URL] can take up your project and do it very perfection. Coming to our ...

$25 USD / hour
(0 reviews)
0.0
$27 USD / hour
(0 reviews)
0.0
HalosysIndia

Hello, sincere greetings! I've discreetly gone through your requirement and here is its preliminary solution. We have a team of Python developers with extensive expertise in the domain & you can interview to se ...

$35 USD / hour
(0 reviews)
0.0
divkis

Hi, this is a very interesting task and I would like to take it up, the reason being that I am an expert at scraping and have written scrapers for Amazon, e-commerce sites, real estate sites and social networking site ...

$25 USD / hour
(0 reviews)
0.0