In progress

PHP website crawler and information extraction

Hi,

I need a crawler script that reads URLs from a website and scans the linked websites. The information I need from those websites is:

- Salutation: Mr/Ms (Herr/Frau)

- Forename

- Name

- Position

- Name of organization

- Street

- Address

- Phone and fax

- Email

- Website

- Title of the link leading to that website

The crawler should look for this information first under "contact" and then under "disclaimer". The crawler may also land on an intro page, which it has to skip.

If there are several data records on one webpage, they should be saved in the same line.

The output must be a CSV or Excel file.
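As a rough illustration of the "several records in the same line" requirement, the sketch below writes one CSV line per crawled page and appends further records as extra columns. It is written in Python purely for illustration (the requested script is PHP). The column names, the numbered-column layout and the semicolon delimiter are assumptions, not part of the brief.

```python
import csv
import io

# Hypothetical column names, one per requested field.
FIELDS = ["salutation", "forename", "name", "position", "organization",
          "street", "address", "phone", "fax", "email", "website", "link_title"]


def write_rows(pages, out):
    """Write one CSV line per page; each extra record adds another block of columns."""
    writer = csv.writer(out, delimiter=";")
    most = max((len(records) for records in pages), default=0)
    # Header repeats the field names with a running record number.
    writer.writerow([f"{field}_{i + 1}" for i in range(most) for field in FIELDS])
    for records in pages:
        row = []
        for record in records:
            # Missing fields are written as empty cells.
            row.extend(record.get(field, "") for field in FIELDS)
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows([[{"forename": "Uwe", "name": "Wagner"}, {"forename": "Max"}]], buffer)
    print(buffer.getvalue())
```

Here `pages` is a list with one entry per crawled webpage, and each entry is a list of record dictionaries found on that page.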

Because these websites are in German, the words to look for are:

contact -> Kontakt

disclaimer -> Impressum
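The lookup order described above could be sketched roughly as follows. This is Python for illustration only (the requested script is PHP), it uses only the standard library, and the class and function names are hypothetical. It matches the keywords against both the link text and the link target, which is an assumption, and it does not handle intro pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_info_pages(html, base_url):
    """Return absolute URLs to visit: "Kontakt" pages first, then "Impressum" pages."""
    parser = LinkCollector()
    parser.feed(html)
    ordered = []
    for keyword in ("kontakt", "impressum"):
        for text, href in parser.links:
            if keyword in text.lower() or keyword in href.lower():
                url = urljoin(base_url, href)
                if url not in ordered:
                    ordered.append(url)
    return ordered
```

The link text collected here is also what would go into the "link title" field of the output.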

Furthermore, the crawler should recognize whether there is a position description for the contact person, for example "Stadtwehrführer", "Kommandant" or "Stadtwehrleiter". A word ending in "*leiter" or "*führer", for instance, indicates a position.

The crawler should also recognize the name of the organization. "Feuerwehr" is the indicator.
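The two indicators could be checked with something like the sketch below. It is Python for illustration only (the requested script is PHP), and the function names are hypothetical. Treating the whole line that contains "Feuerwehr" as the organization name is an assumption, and the suffix rule will also match ordinary words such as "Begleiter", so real pages will likely need extra filtering.

```python
import re

# Any word ending in "leiter" or "führer", plus the literal "Kommandant".
POSITION_RE = re.compile(r"\b(\w*(?:leiter|führer)|Kommandant)\b", re.IGNORECASE)


def find_position(text):
    """Return the first word that looks like a position, or None."""
    match = POSITION_RE.search(text)
    return match.group(1) if match else None


def find_organization(text):
    """Return the first line that contains "Feuerwehr" (case-insensitive), or None."""
    for line in text.splitlines():
        if "feuerwehr" in line.lower():
            return line.strip()
    return None
```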

An example:

________________________________________

Verantwortlich:

Feuerwehr Bitterfeld-Wolfen (Name of organization)

PD-Chemiemark Areal A, Geb. 046

Ortsteil Wolfen

06766 Bitterfeld-Wolfen (postal code + city; German postal codes have 5 digits)

Vertreten durch:

Herr Uwe Wagner (Mister Forename Name)

Stadtwehrleiter (Position)

Kontakt:

Telefon+49 (0) 03494 6660564 (Phone Nr)

E-Mail: abcd(at)[url removed, login to view] (needs an intelligent scan; a correct email address is the most important thing)

____________________________________________________
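Two parts of the example above lend themselves to a small sketch: undoing the "(at)" obfuscation in the email address, and picking out the line with the 5-digit postal code and city. The code is Python for illustration only (the requested script is PHP). The function names, the handling of "[dot]"/"(punkt)" variants and the placeholder domain `example.org` in the usage note are assumptions, and the email pattern is a rough check rather than full validation.

```python
import re

# A line that starts with exactly five digits followed by the city name.
POSTAL_RE = re.compile(r"^[ \t]*(\d{5})[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)


def normalize_email(raw):
    """Undo "(at)" / "[dot]" style obfuscation and return the first address found."""
    text = re.sub(r"\s*[\(\[]\s*at\s*[\)\]]\s*", "@", raw, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\(\[]\s*(?:dot|punkt)\s*[\)\]]\s*", ".", text, flags=re.IGNORECASE)
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    return match.group(0).lower() if match else None


def find_postal_code_city(text):
    """Return (postal code, city) from the first matching line, or None."""
    match = POSTAL_RE.search(text)
    return (match.group(1), match.group(2)) if match else None
```

For instance, `normalize_email("E-Mail: abcd(at)example.org")` gives `"abcd@example.org"`, and the address block of the example yields `("06766", "Bitterfeld-Wolfen")`.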

The list of links can be found here:

[url removed, login to view]

Besides a CSV file with all extracted data, I need the script itself so that I can modify and tune it a little afterwards.

All phone and fax numbers must be in the same format!
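One possible way to get a uniform format is sketched below, in Python for illustration only (the requested script is PHP). The target format ("+49" followed by digits without separators), the default country code and the function name are assumptions; any other consistent format would do as well.

```python
import re


def normalize_phone(raw, country_code="49"):
    """Return the number as "+<country code><digits>", or None if no digits are found."""
    # Drop the optional "(0)" hint, then everything except digits and "+".
    cleaned = re.sub(r"\(\s*0\s*\)", "", raw)
    cleaned = re.sub(r"[^\d+]", "", cleaned)
    has_plus = cleaned.startswith("+")
    digits = cleaned.replace("+", "")
    if not digits:
        return None
    if has_plus:
        pass  # already in international form
    elif digits.startswith("00"):
        digits = digits[2:]  # "0049..." style prefix
    else:
        # National number: replace the leading "0" with the country code.
        digits = country_code + digits.lstrip("0")
    # Remove a stray "0" directly after the country code, as in "+49 (0) 03494 ...".
    if digits.startswith(country_code):
        digits = country_code + digits[len(country_code):].lstrip("0")
    return "+" + digits
```

With this sketch, "Telefon+49 (0) 03494 6660564", "03494 / 66 60 564" and "0049 3494 6660564" all come out as "+4934946660564".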

If you need further information, don't hesitate to contact me.

Best regards

Sebastian

PS: I attached an example file of "FF Bad Waldsee" in "Baden-Württ.".

Skills: PHP

About the employer:
(1 review) Köln, Germany

Project ID: #604513