Web Spider in C#

Completed. Posted May 11, 2011. Paid on delivery.

We require a web crawler written in C# that saves to disk rather than to a database.

## Deliverables

The executable should create a subfolder named 'data' under its current working directory. This folder will contain one subfolder for every web host found.

A Unicode text file ([login to view URL]) will reside in the same folder as the executable and contain a number of 'seed' URLs.

For each seed URL, the following operation should take place:

- Perform an HTTP GET against the page at this URL.

- Save the HTTP header to ./data/[login to view URL]

- Save the page content to ./data/[login to view URL]

- Save the current download time to ./data/[login to view URL]

- Save all files (images, style sheets, scripts, etc.) following the above structure (e.g. ./data/[login to view URL])

- Saved page data should contain the HTTP headers as well as the body.

- If a 4xx error is received, save the URL (not the content) to data/[login to view URL] if not already present in the file.

- If a redirect is received, save the URL (not the content) to data/[login to view URL] if not already present in the file.

- Save all URLs for a specific host to ./data/[login to view URL]

- Save all hostnames found to ./data/[login to view URL]

- Parse HTML for src attributes, hrefs, and links.

- Parse CSS for URLs.

- All URLs located during parsing, if not already existing in the file structure, should be saved to ./data/[login to view URL]
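The parsing steps above could be approached roughly as follows. This is a minimal sketch, not a required design: the class name `LinkExtractor` and its methods are hypothetical, and it assumes regular expressions are an acceptable substitute for a full HTML or CSS parser. It resolves relative links against the page URL, keeps only http and https links, and drops fragments.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative and not part of the posting.
public static class LinkExtractor
{
    // Matches href=... and src=... with double-quoted, single-quoted, or unquoted values.
    private static readonly Regex HtmlAttr = new Regex(
        @"\b(?:href|src)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Matches CSS url(...) with double-quoted, single-quoted, or unquoted values.
    private static readonly Regex CssUrl = new Regex(
        @"url\(\s*(?:""([^""]*)""|'([^']*)'|([^)\s]+))\s*\)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> FromHtml(Uri baseUri, string html)
    {
        return Extract(HtmlAttr, baseUri, html);
    }

    public static List<string> FromCss(Uri baseUri, string css)
    {
        return Extract(CssUrl, baseUri, css);
    }

    private static List<string> Extract(Regex pattern, Uri baseUri, string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in pattern.Matches(text))
        {
            // Exactly one of the three capture groups holds the raw value.
            string raw = m.Groups[1].Success ? m.Groups[1].Value
                       : m.Groups[2].Success ? m.Groups[2].Value
                       : m.Groups[3].Value;

            Uri absolute;
            if (Uri.TryCreate(baseUri, raw.Trim(), out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                // GetLeftPart(UriPartial.Query) keeps everything up to the query and drops #fragment.
                string url = absolute.GetLeftPart(UriPartial.Query);
                if (!found.Contains(url)) found.Add(url);
            }
        }
        return found;
    }
}
```

The sketch does not decode HTML entities such as `&amp;` and will also pick up attributes like `data-href`; a production crawler would likely need a real parser for those cases.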

Once the seed file has been processed, it should be deleted.

Once the seed file has been deleted, the program should begin processing every URL in ./data/found.txt.

Each URL should be deleted from [login to view URL] once it has been processed.

- Up to 10 background threads should be crawling simultaneously.

- A separate thread should handle the [login to view URL] file and scheduling URLs between download threads.

- Program should be a command line utility.

- A running log of progress should be visible on STDOUT.

- Program should be capable of running continuously for days without crashing, hanging, or memory leaks.

- Duplicate URLs should not be downloaded twice.

- [login to view URL] files can be ignored.
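To illustrate the threading and duplicate-handling requirements above, here is a minimal in-memory sketch. The names `Frontier` and `CrawlPool` are hypothetical, and it deliberately leaves out persisting the queue to the found.txt file, which the posting assigns to a separate scheduling thread. It avoids newer language features on the assumption that the tool targets a .NET version available on Windows XP SP3.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical thread-safe URL queue that refuses duplicates.
public sealed class Frontier
{
    private readonly object gate = new object();
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
    private int active;      // workers currently processing a URL
    private bool stopping;   // set by Stop(), e.g. on Ctrl+C

    // Returns false when the URL was already queued or processed, or the crawl is stopping.
    public bool Add(string url)
    {
        lock (gate)
        {
            if (stopping || !seen.Add(url)) return false;
            pending.Enqueue(url);
            Monitor.Pulse(gate);
            return true;
        }
    }

    // Blocks until a URL is available. Returns false once the queue is empty
    // and no worker is still active (so nothing new can arrive), or after Stop().
    public bool TryTake(out string url)
    {
        lock (gate)
        {
            while (pending.Count == 0 && active > 0 && !stopping)
                Monitor.Wait(gate);
            if (stopping || pending.Count == 0)
            {
                url = null;
                return false;
            }
            url = pending.Dequeue();
            active++;
            return true;
        }
    }

    // Must be called by a worker after it finishes each URL.
    public void Done()
    {
        lock (gate) { active--; Monitor.PulseAll(gate); }
    }

    public void Stop()
    {
        lock (gate) { stopping = true; Monitor.PulseAll(gate); }
    }
}

public static class CrawlPool
{
    // Runs 'process' on up to 'workerCount' background threads until the frontier drains.
    public static void Run(Frontier frontier, int workerCount, int delayMs, Action<string> process)
    {
        List<Thread> workers = new List<Thread>();
        for (int i = 0; i < workerCount; i++)
        {
            Thread worker = new Thread(delegate()
            {
                string url;
                while (frontier.TryTake(out url))
                {
                    try { process(url); }
                    catch (Exception ex) { Console.WriteLine("FAILED {0}: {1}", url, ex.Message); }
                    finally { frontier.Done(); }
                    if (delayMs > 0) Thread.Sleep(delayMs);
                }
            });
            worker.IsBackground = true;
            worker.Start();
            workers.Add(worker);
        }
        foreach (Thread started in workers) started.Join();
    }
}
```

A caller might invoke `CrawlPool.Run(frontier, 10, delayMs, DownloadAndSave)`, where `DownloadAndSave` is a placeholder for the download step, and call `frontier.Stop()` from a `Console.CancelKeyPress` handler. Note that the `seen` set grows without bound, so a crawler meant to run for days would probably need to back it with the on-disk structure instead.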

The application should take a number of optional command line arguments, as follows:

- /d= delay between downloads (in milliseconds). Default should be 0. Each thread will pause this number of milliseconds after saving a download before processing another.

- /r= refresh time in days. URLs already downloaded more than r days ago will be downloaded again, overwriting the existing data.

- Ctrl+C should cause the program to shut down gracefully.
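As a rough illustration of the two options above, the following sketch parses `/d=` and `/r=` and applies the refresh rule. The class name `CrawlerOptions` and the method `IsStale` are hypothetical. The posting gives no default for `/r=`, so the sketch assumes that omitting it means pages are never refreshed, and it assumes fractional days are acceptable.

```csharp
using System;
using System.Globalization;

// Hypothetical options holder; names are illustrative and not part of the posting.
public sealed class CrawlerOptions
{
    public int DelayMs;          // /d= delay between downloads in milliseconds; default 0
    public double? RefreshDays;  // /r= refresh time in days; null = never refresh (assumption)

    public static CrawlerOptions Parse(string[] args)
    {
        CrawlerOptions options = new CrawlerOptions();
        foreach (string arg in args)
        {
            if (arg.StartsWith("/d=", StringComparison.OrdinalIgnoreCase))
            {
                int delay;
                // NumberStyles.None accepts digits only, so negative values are rejected.
                if (!int.TryParse(arg.Substring(3), NumberStyles.None,
                                  CultureInfo.InvariantCulture, out delay))
                    throw new ArgumentException("Invalid /d= value: " + arg);
                options.DelayMs = delay;
            }
            else if (arg.StartsWith("/r=", StringComparison.OrdinalIgnoreCase))
            {
                double days;
                if (!double.TryParse(arg.Substring(3), NumberStyles.AllowDecimalPoint,
                                     CultureInfo.InvariantCulture, out days))
                    throw new ArgumentException("Invalid /r= value: " + arg);
                options.RefreshDays = days;
            }
            else
            {
                throw new ArgumentException("Unknown argument: " + arg);
            }
        }
        return options;
    }

    // True when a page saved at 'savedUtc' is older than the refresh window
    // and should be downloaded again.
    public bool IsStale(DateTime savedUtc, DateTime nowUtc)
    {
        return RefreshDays.HasValue && (nowUtc - savedUtc).TotalDays > RefreshDays.Value;
    }
}
```

For example, `crawler.exe /d=250 /r=7` (the executable name is a placeholder) would pause 250 ms between downloads on each thread and re-download pages saved more than seven days ago.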

In a nutshell, we want something similar to OpenWebSpider, with the following major differences:

- No database; saves data to file system.

- Supports Unicode URLs and content; saves HTML as UTF-16 text files.

- Far fewer command line options, much simpler utility.

- Automatically searches across domains, no requirement to limit to single domain.

Must run on Windows XP SP3, and Windows 7 (32 and 64 bit).

All files must be saved as UTF-16, not ASCII or UTF-8.

Program should correctly save English, Chinese, Thai, Japanese, Arabic, etc., URLs and content.

File paths on Windows should be escaped if they contain Unicode characters that are incompatible with the Windows NTFS file system.
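One possible way to meet the UTF-16 and path-escaping requirements is sketched below. The class `DiskStore` and its methods are hypothetical, and percent-encoding is only one assumed escaping scheme; the posting does not prescribe one. `Encoding.Unicode` is .NET's little-endian UTF-16 encoding.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; names are illustrative and not part of the posting.
public static class DiskStore
{
    // Characters that Windows rejects in a file or folder name.
    private static readonly char[] Invalid = { '<', '>', ':', '"', '/', '\\', '|', '?', '*' };

    // Percent-encodes invalid characters and control characters.
    // '%' itself is encoded too, so the mapping stays reversible.
    // Other Unicode characters (Chinese, Thai, Arabic, ...) pass through unchanged.
    public static string EscapeSegment(string segment)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in segment)
        {
            if (c < 32 || c == '%' || Array.IndexOf(Invalid, c) >= 0)
                sb.Append('%').Append(((int)c).ToString("X2"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Writes 'content' as UTF-16 under <dataRoot>/<escaped host>/<escaped fileName>
    // and returns the full path.
    public static string SaveUtf16(string dataRoot, string host, string fileName, string content)
    {
        string dir = Path.Combine(dataRoot, EscapeSegment(host));
        Directory.CreateDirectory(dir);
        string path = Path.Combine(dir, EscapeSegment(fileName));
        File.WriteAllText(path, content, Encoding.Unicode);
        return path;
    }
}
```

The sketch does not handle reserved device names such as `CON` or `NUL`, trailing dots and spaces, or the Windows path length limit, all of which a long-running crawler would likely run into.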

Skills: C# Programming, Microsoft, Script Installation, Shell Script, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing, Windows Desktop

Project ID: #3307187

About the project

7 bids. Remote project. Active May 12, 2011.

Awarded to:

rased108

See private message.

$199.75 USD in 25 days
(30 reviews)
4.6

7 freelancers are bidding an average of $282 for this job

arthurprs

See private message.

$382.5 USD in 25 days
(79 reviews)
5.1

InterntEngineer

See private message.

$255 USD in 25 days
(33 reviews)
4.8

izharuislam

See private message.

$340 USD in 25 days
(28 reviews)
4.2

ugexe

See private message.

$255 USD in 25 days
(20 reviews)
3.7

usoftsolutions

See private message.

$284.75 USD in 25 days
(1 review)
2.7

Tinasay1

See private message.

$255 USD in 25 days
(0 reviews)
0.0