News Aggregator/Crawler for 27 Sites - Perl/Python

Hi there,

We need experienced professionals in the following coding languages: Perl, or possibly Python. We are not sure whether it can be done in PHP, though we are open to suggestions.

The work will involve three main aspects:

1) A bot to crawl each site (27 sites).

2) A script to find related/similar news through linguistic text analysis (pattern analysis), or any other method you know of.

3) What is crawled then has to be indexed into a database and made searchable, possibly using this open source search software: [url removed, login to view]
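To make aspect 3 concrete, here is a minimal sketch of storing crawled items and searching them. It assumes SQLite purely for illustration; the table name, column names, and the helper functions open_store, save_article, and search are hypothetical, and the LIKE query is only a stand-in for whatever search software is finally chosen.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the article store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " url TEXT PRIMARY KEY, site TEXT, title TEXT, description TEXT)"
    )
    return conn

def save_article(conn, url, site, title, description):
    """Insert an article; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, site, title, description)"
        " VALUES (?, ?, ?, ?)",
        (url, site, title, description),
    )
    conn.commit()
    return cur.rowcount == 1

def search(conn, keyword):
    """Naive keyword search over title and description."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT url, title FROM articles"
        " WHERE title LIKE ? OR description LIKE ? ORDER BY url",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Using the URL as the primary key means a story that is still on the front page at the next crawl is simply ignored instead of being stored twice.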

We have 27 news sites that we want crawled/spidered by a bot. Since the 27 sites are all different, we assume the code for each may have to be slightly different as well.

Most sites will need to be crawled every 10 to 15 minutes, and some others every 30 minutes. Only the front page of each site is to be checked, though for some sites we may be able to just check the RSS feed and get the data from there.
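As a rough illustration of the schedule and the RSS option described above, the sketch below decides which sites are due for a crawl and pulls items out of an RSS 2.0 feed. The site names, feed URLs, and the helper functions due_sites and parse_rss are hypothetical; fetching the feed itself is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical site table: name -> (feed URL, polling interval in minutes).
SITES = {
    "example-fast": ("https://example.com/rss.xml", 10),
    "example-slow": ("https://example.org/rss.xml", 30),
}

def due_sites(last_crawled, now, sites=SITES):
    """Return the names of sites whose interval has elapsed.

    last_crawled maps site name -> time of last crawl in seconds;
    a site that has never been crawled is always due.
    """
    due = []
    for name, (_url, minutes) in sites.items():
        last = last_crawled.get(name)
        if last is None or now - last >= minutes * 60:
            due.append(name)
    return due

def parse_rss(xml_text):
    """Extract title, link and description from RSS 2.0 <item> elements."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items
```

In practice the loop would fetch each due feed (for example with urllib.request.urlopen), pass the text to parse_rss, and record the crawl time. Sites without a usable feed would need their own front-page parser.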

Which coding language to use: we know that Perl is a very good language for text processing, which is why we suggest that the work (the crawling plus the related-stories script) be done in Perl. The other reason is that we know that sites like [url removed, login to view] and [url removed, login to view] have been coded in Perl and, as you can see, the stories under “RELATED” on, for example, [url removed, login to view] are very good. It therefore seems that Perl, as a text-oriented language, can achieve good results here. At the end of the day we leave it to your expertise. We also know that a web spider can be written in PHP, but we are not sure of its capabilities.

Note: all the work you do has to be documented, so that if in the future you or another coder has to fix something, they will be able to read and understand the code. THEREFORE, WE WANT VERY PROFESSIONAL WORK.

So let us know your experience/expertise with scripts that spider web pages and, in addition, with scripts that perform linguistic text analysis and can find related news by analysing the news titles and news descriptions. We want to build a relationship with you as well. One of the reasons we will probably not work with previous suppliers is that they don’t have expertise in Perl or Python, so this is an opportunity for you or your company to join us as a long-term partner.
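One common way to find related news from titles and descriptions is TF-IDF weighting with cosine similarity. The sketch below is only an illustration of that technique, not a required approach; the function names and the 0.2 threshold are assumptions that would need tuning on real data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF so that terms present in every document keep a weight.
        vectors.append({
            term: tf * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def related(docs, index, threshold=0.2):
    """Return indices of documents similar to docs[index], best match first."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[index], v), i)
        for i, v in enumerate(vectors) if i != index
    ]
    return [i for score, i in sorted(scored, reverse=True) if score >= threshold]
```

Each entry in docs would be the title and description of one story joined together. Stop-word removal, stemming, and a time window on publication dates are likely refinements.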

You will also be working on your local production server until we are ready to move to our live server, which we still need to set up; once we are happy with your work, we will move it to our server.

You will only be doing programming; another company will do all the graphics and some other small things like the user account section, etc. So the three aspects mentioned above are all you are bidding for: crawling + related script + search. In other words, you will do the core of the whole project.

Send your bid as soon as possible. We have a detailed description of each section and can provide it upon request. If possible, it would be nice to know whether you have extensive experience with web crawlers or aggregators, and especially whether you can accomplish the “Related” news script, which is very important to us.

We can probably use an escrow account and pay as each step is completed.


Skills: Perl, Python


About the employer:
( 121 reviews ) Belfast, Ireland

Project ID: #272574