In progress

Website crawler

This project can be coded in PHP, C++ or Java, depending on which language your existing crawler code is written in and where your expertise lies. If you do not have the experience to do this job, or you are unreliable, then I ask you not to bid.

The server is LAMP, so the database used must be MySQL. The ability to schedule repeat crawls/info updates as a cron job would be nice.

CRAWLER PART 1

Crawl Yellow Pages/business listings to retrieve company names/info.

Retrieve any of the following found:

business name (required), address, category, emails, phone numbers, websites, logos, rating. Record the URL of the page the listing was found on.

If a business logo is found, it needs to be downloaded and renamed to match the recorded listing.

The regular expressions used to find and retrieve the information on the sites need to be modifiable, so that they can be altered and reused for other directories. They should therefore be saved to a MySQL table.
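
A minimal sketch of how table-driven patterns could work, written in Python only for brevity (the job itself asks for PHP, C++ or Java). The table name `crawl_patterns`, its columns, the sample patterns and the sample HTML are all assumptions for illustration, and `sqlite3` stands in for MySQL so the sketch runs without a server.

```python
import re
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs anywhere;
# the table and column names are hypothetical.
def load_patterns(conn, directory):
    """Return {field: compiled regex} for one directory site."""
    rows = conn.execute(
        "SELECT field, pattern FROM crawl_patterns WHERE directory = ?",
        (directory,),
    )
    return {field: re.compile(pattern, re.IGNORECASE) for field, pattern in rows}

def extract_listing(html, patterns):
    """Apply every stored pattern; keep the first capture group of each match."""
    listing = {}
    for field, regex in patterns.items():
        match = regex.search(html)
        if match:
            listing[field] = match.group(1).strip()
    return listing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_patterns (directory TEXT, field TEXT, pattern TEXT)")
conn.executemany(
    "INSERT INTO crawl_patterns VALUES (?, ?, ?)",
    [
        ("example-directory", "business_name", r"<h2[^>]*>(.*?)</h2>"),
        ("example-directory", "phone", r"Phone:\s*([+\d][\d\s().-]{5,})"),
    ],
)
patterns = load_patterns(conn, "example-directory")
sample = "<h2>Acme Plumbing</h2><p>Phone: +45 12 34 56 78</p>"
listing = extract_listing(sample, patterns)
```

Because the patterns live in rows rather than in code, adapting the crawler to another directory would mean editing or adding rows, not redeploying the crawler.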

Because the crawler will be used on sites in different languages, the language words in the regular expressions need to be interchangeable. If the language word 'dog' is in the expression and I choose to crawl using Punjabi, 'dog' needs to be replaced with 'gutta'.

The language word/phrase lists don't need to be supplied; only the ability to supply them at a later time is necessary.
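
One possible way to make the language words swappable is to store patterns as templates with placeholders and fill them in per language. The sketch below (Python, for illustration only) assumes a hypothetical `{{word}}` placeholder syntax and a plain dictionary as the word list; neither is specified in the requirements above.

```python
import re

# Hypothetical placeholder syntax: {{word}} marks a language word in a stored pattern.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def localize_pattern(template, words):
    """Replace each {{word}} with its translation, escaped for use in a regex.

    Words missing from the list fall back to the original word, so a
    pattern still works before a word list has been supplied.
    """
    def swap(match):
        word = match.group(1)
        return re.escape(words.get(word, word))
    return PLACEHOLDER.sub(swap, template)

punjabi = {"dog": "gutta"}  # word pair taken from the example in the posting
template = r"\b{{dog}}\b\s*:\s*(\w+)"
pattern = localize_pattern(template, punjabi)
```

The fallback to the original word means the word lists can be added later without changing the stored patterns.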

CRAWLER PART 2

For listings from CRAWLER PART 1 which didn't return web addresses, an automated website finder (using Google?) is needed. It will search for a business's website using the business's name, email addresses, phone numbers and other data.

CRAWLER PART 3

Crawl a list of websites and locate the customer contact information page.

When it is found, parse it and record the email address, phone number and contact name, or, if a question submission form exists, record that a form was found.
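
A rough sketch of the contact-page step, in Python for illustration. The keyword list, the email and phone patterns, and the names `PageScanner` and `scan_contact_page` are assumptions; real sites would need the patterns tuned, and contact-name extraction is not attempted here.

```python
import re
from html.parser import HTMLParser

CONTACT_WORDS = ("contact", "about", "support")  # assumed keywords, meant to be configurable
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

class PageScanner(HTMLParser):
    """Collect links that look like contact pages and note whether a form exists."""
    def __init__(self):
        super().__init__()
        self.contact_links = []
        self.has_form = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "form":
            self.has_form = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Match keywords against both the link text and the link target.
            label = ("".join(self._text) + " " + self._href).lower()
            if any(word in label for word in CONTACT_WORDS):
                self.contact_links.append(self._href)
            self._href = None

def scan_contact_page(html):
    """Return emails, phone numbers, contact-like links and a form flag for one page."""
    scanner = PageScanner()
    scanner.feed(html)
    return {
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "phones": [p.strip() for p in PHONE_RE.findall(html)],
        "form_found": scanner.has_form,
        "contact_links": scanner.contact_links,
    }
```

The same scanner can be run on a site's home page to find candidate contact links, then on each linked page to pull out the details.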

CRAWLER PART 4

Crawl a list of websites looking for product information and prices for specific categories (Bank, Insurance, Pension, Mobile, Mortgage and others).

CRAWLER PART 5

Crawl forums:

Search forums by thread and try to determine what the thread is about using sets of keywords with associated weights.

Thus a thread containing words like summer, bike and canoe could be weighted as being about leisure.

The keywords and their weights have to be easily configurable.

If a thread has enough weight to fall into a category then it will be recorded.
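
The weighting idea above could look roughly like this (a Python sketch for illustration). The categories, keywords, weights and threshold are made-up values; in practice they would come from an editable table or config file.

```python
import re

# Hypothetical keyword weights; in practice these would be loaded from an
# editable table or config file.
CATEGORY_KEYWORDS = {
    "leisure": {"summer": 1.0, "bike": 2.0, "canoe": 3.0},
    "finance": {"mortgage": 3.0, "pension": 3.0, "bank": 1.5},
}

def score_thread(text, category_keywords):
    """Sum keyword weights per category, counting every occurrence."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for category, weights in category_keywords.items():
        scores[category] = sum(weights.get(word, 0.0) for word in words)
    return scores

def classify_thread(text, category_keywords, threshold):
    """Return the best-scoring category if it reaches the threshold, else None."""
    scores = score_thread(text, category_keywords)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

thread = "Planning a summer trip: rent a bike or take the canoe down the river?"
category = classify_thread(thread, CATEGORY_KEYWORDS, threshold=5.0)
```

A thread that returns `None` simply has not gathered enough weight for any category and would not be recorded.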

ADMIN CONSOLE

The software must have a management console enabling the following functions:

Must be accessible using Windows (preferably web based so our mac/linux system can use it too)

Must be able to modify the list of sites to crawl,

Must be able to change regex for parsing,

Must be able to switch the language for the regexes,

Must be able to start/stop crawler

QUESTIONS

1. Which solution do you recommend?

2. Do you have any comments on the above?

3. Can you provide a demo of the crawlers that you find suitable for this project?

4. Can you be available in the future for paid maintenance and further development?

------------------------------------------

I'd like to automate gathering contact information from businesses, initially building lists of businesses in various countries/regions by scanning yellow pages and directories. After the lists are built, the businesses' websites need to be checked for their contact pages/contact info. Since it needs to work for different languages, the regexes need to be easily configurable so the language words/phrases can be changed.

The methods will differ from site to site, but that's why a customizable crawler is required. The first steps can all be the same: following links inside the directory site, looking for pages with keywords like category, phone and address, and checking whether the words exist multiple times on a page or whether groups of them exist.
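
As an illustration of that first step, the keyword check could be sketched as below (Python, for brevity). The function name and the two thresholds `min_repeats` and `min_distinct` are assumptions, not values taken from the requirements.

```python
import re

def looks_like_listing_page(text, keywords, min_repeats=3, min_distinct=2):
    """Heuristic: a page is a listing page if one keyword repeats often
    or several different keywords appear together."""
    lowered = text.lower()
    counts = {
        word: len(re.findall(r"\b" + re.escape(word.lower()) + r"\b", lowered))
        for word in keywords
    }
    repeated = any(count >= min_repeats for count in counts.values())
    distinct = sum(1 for count in counts.values() if count > 0)
    return repeated or distinct >= min_distinct, counts

keywords = ["category", "phone", "address"]
page_text = "Phone: 1 Address: A Phone: 2 Address: B Phone: 3"
is_listing, counts = looks_like_listing_page(page_text, keywords)
```

Returning the counts alongside the verdict makes it easier to tune the thresholds per directory.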

An example is [url removed, login to view]: the directory has the names of the businesses and their phone numbers, but the entries are not easily recognizable for a bot which wasn't written for YellowPages specifically. For a page like that it would try normal contact keywords, and once that has failed it could look at the number of phone numbers occurring on the page and cut it up into phone-number-to-phone-number chunks (stripping extra HTML but leaving tokens to denote font size etc.). Later that bulk data can be viewed and separated through custom methods.
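
The chunking fallback might be sketched like this (Python, for illustration). The phone pattern is an assumption and would need adjusting per country; for simplicity this version strips all tags and does not keep the font-size tokens mentioned above.

```python
import re

PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")  # assumed phone format
TAG_RE = re.compile(r"<[^>]+>")

def chunk_by_phone(html):
    """Cut the page text into chunks that each end with a phone number."""
    text = TAG_RE.sub(" ", html)
    chunks, start = [], 0
    for match in PHONE_RE.finditer(text):
        # Collapse runs of whitespace left behind by the removed tags.
        chunk = " ".join(text[start:match.end()].split())
        chunks.append(chunk)
        start = match.end()
    return chunks

sample = "<b>Acme Plumbing</b> 12 34 56 78<br><b>Best Bakery</b> 87 65 43 21<br>"
chunks = chunk_by_phone(sample)
```

Each chunk can then be stored as raw bulk data and split into name, address and so on by a later, site-specific pass.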

The accuracy won't be as good as it would be if it were created for a specific site, but the ability to tweak the regexes will allow it to become more precise. Even items like finding which link goes to the next page in a directory can be customizable.

Many directories have similar titles ('email' 'address' 'phone') so they'll be easier to lock onto.

Later it'll need an automated mailer to email the site's contact emails and read the responses to see if the email is the approved email for normal contact.

The other crawler is to crawl forums and build a knowledge base of information on various pre-determined subjects, looking for quality information over quantity.

Sample db code is included to give a basic idea of what I'm looking for.

Skills: C Programming, Java, PHP


About the employer:
(80 reviews) Copenhagen, Thailand

Project ID: #274457

Awarded to:

phpdeveloper12

Hi I did similar project in php. I had to extract all web domains [url removed, login to view] domain with email of webmasters and etc. So I could build your crawler fast. Need dedicated server with good bandwidth since crawler generate a lo…

$650 USD in 30 days
(0 reviews)
0.0

11 freelancers are bidding an average of $536 for this job

SigmaVisual

Please check PMB.

$500 USD in 7 days
(243 reviews)
7.8
creatorul

Professional work.

$750 USD in 20 days
(156 reviews)
7.5
codersam

Work is almost ready...Please check [url removed, login to view] this site crawling video from youtube dailymotion metacafe.

$450 USD in 7 days
(140 reviews)
7.3
aruhat

Hi, High level expertise in Scrapping, Please see PMB. Regards, Shyam

$500 USD in 10 days
(15 reviews)
6.2
klycoder

We are lowering our bid

$650 USD in 25 days
(1 review)
6.1
phpmagic

Great job. We can help you. We have 3 years experience with Flash, Adobe Suite, Corel Draw, 3D Max, Joomla, OSCommerce, WordPress, PHPList, OpenX and PHP Working . We have more expert programmers, they can work goood i…

$700 USD in 20 days
(38 reviews)
5.8
alan37

can be done.

$250 USD in 7 days
(2 reviews)
2.0
ruperts

Greetings! I can make this program for you. Please see PMB for details. Thanks, Engr. Ruperts

$700 USD in 30 days
(0 reviews)
0.0
seditiosus

I've coded many bots in different languages. This job can be easily done with PHP.

$400 USD in 8 days
(1 review)
0.0
coderofbd

Hi, Please check the PMB. Thanks.

$350 USD in 20 days
(0 reviews)
0.0