Web Spider in C#
$100-500 USD
Paid on delivery
We require a web crawler written in C# that saves to disk rather than to a database.
## Deliverables
The executable should create a subfolder named 'data' under its current working directory; this folder will contain a subfolder for every web host found.
A unicode text file ([login to view URL]) will reside in the same folder as the executable and contain a number of 'seed' URLs.
For each seed URL, the following operation should take place:
- Perform an HTTP GET against the page at this URL.
- Save the HTTP header to ./data/[login to view URL]
- Save the page content to ./data/[login to view URL]
- Save the current download time to ./data/[login to view URL]
- Save all files (images, style sheets, scripts, etc.) following the above structure (e.g. ./data/[login to view URL])
- Saved page data should contain the HTTP headers as well as the body
- If a 4xx error is received, save the URL (not the content) to data/[login to view URL] if not already present in the file.
- If a redirect is received, save the URL (not the content) to data/[login to view URL] if not already present in the file.
- Save all URLs for a specific host to ./data/[login to view URL]
- Save all hostnames found to ./data/[login to view URL]
- Parse HTML for src attributes, hrefs, and links.
- Parse CSS for URLs
- All URLs located during parsing, if not already existing in the file structure, should be saved to ./data/[login to view URL]
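The HTML/CSS parsing steps above could be sketched with regular expressions, as below. A production crawler might prefer a real HTML parser, but this illustrates pulling src/href attribute values and CSS url() references and resolving them against the page's base URI. The class and method names here are illustrative, not part of the brief.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkExtractor
{
    // Matches src=... and href=... attribute values in HTML.
    static readonly Regex HtmlAttr = new Regex(
        @"(?:src|href)\s*=\s*[""']?([^""'\s>]+)", RegexOptions.IgnoreCase);

    // Matches url(...) references in CSS.
    static readonly Regex CssUrl = new Regex(
        @"url\(\s*[""']?([^""')]+?)[""']?\s*\)", RegexOptions.IgnoreCase);

    // Resolve every candidate against the page URI; skip malformed ones.
    static List<string> Resolve(MatchCollection matches, Uri baseUri)
    {
        var found = new List<string>();
        foreach (Match m in matches)
        {
            Uri abs;
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out abs))
                found.Add(abs.AbsoluteUri);
        }
        return found;
    }

    public static List<string> FromHtml(string html, Uri baseUri)
    {
        return Resolve(HtmlAttr.Matches(html), baseUri);
    }

    public static List<string> FromCss(string css, Uri baseUri)
    {
        return Resolve(CssUrl.Matches(css), baseUri);
    }
}
```

Uri.TryCreate handles both absolute and relative candidates, which covers the cross-domain requirement without extra logic.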
Once the seed file has been processed, it should be deleted.
Once the seed file has been deleted, the program should begin processing every URL in ./data/found.txt.
Each URL should be deleted from [login to view URL] once it has been processed.
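The drain-and-delete cycle over the found file might look like the sketch below. It assumes one URL per line and stores the file as UTF-16 (Encoding.Unicode), per the encoding requirement; FoundQueue is an illustrative name. Rewriting the whole file on each pop is deliberately simple rather than fast.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class FoundQueue
{
    // Pop the first pending URL from the found file, or null when empty.
    // The remaining URLs are written back, so a processed URL is removed
    // from the file as the spec requires.
    public static string Dequeue(string path)
    {
        if (!File.Exists(path)) return null;
        var lines = File.ReadAllLines(path, Encoding.Unicode)
                        .Where(l => l.Trim().Length > 0).ToList();
        if (lines.Count == 0) return null;
        string url = lines[0];
        lines.RemoveAt(0);
        File.WriteAllLines(path, lines.ToArray(), Encoding.Unicode);
        return url;
    }
}
```

For a crawl that runs for days, the scheduler thread would call Dequeue in a loop and hand each URL to a download thread.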
- Up to 10 background threads should be crawling simultaneously.
- A separate thread should handle the [login to view URL] file and scheduling URLs between download threads.
- Program should be a command line utility.
- A running log of progress should be visible on STDOUT.
- Program should be capable of running continuously for days without crashing, hanging, or memory leaks.
- Duplicate URLs should not be downloaded twice.
- [login to view URL] files can be ignored.
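One way to meet the threading requirements above is a single shared queue guarded by a monitor: the scheduler thread calls Enqueue as it reads the found file, and up to 10 worker threads block on the queue. Scheduler and Crawl are illustrative names; the sketch also folds in the no-duplicates rule via a HashSet.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class Scheduler
{
    readonly Queue<string> pending = new Queue<string>();
    readonly object gate = new object();
    readonly HashSet<string> seen = new HashSet<string>(); // skip duplicates
    volatile bool stopping;

    // Stand-in for the real download/parse/save step.
    public Action<string> Crawl = url => { };

    public void Enqueue(string url)
    {
        lock (gate)
        {
            if (seen.Add(url)) { pending.Enqueue(url); Monitor.Pulse(gate); }
        }
    }

    public void Stop() { lock (gate) { stopping = true; Monitor.PulseAll(gate); } }

    public Thread[] Start(int workers)
    {
        var threads = new Thread[workers];
        for (int i = 0; i < workers; i++)
        {
            threads[i] = new Thread(delegate()
            {
                while (true)
                {
                    string url;
                    lock (gate)
                    {
                        while (pending.Count == 0 && !stopping) Monitor.Wait(gate);
                        if (pending.Count == 0) return; // stopping and drained
                        url = pending.Dequeue();
                    }
                    Crawl(url);
                }
            });
            threads[i].IsBackground = true;
            threads[i].Start();
        }
        return threads;
    }
}
```

For a multi-day run, an in-memory HashSet of every URL ever seen would grow without bound; a real implementation would instead check the on-disk data folder (and the /r= refresh age) to decide whether a URL needs downloading.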
The application should take a number of optional command line arguments, as follows:
- /d= delay between downloads (in milliseconds). Default should be 0. Each thread will pause this many milliseconds after saving a download before processing another.
- /r= refresh time in days. URLs already downloaded more than r days ago will be downloaded again, overwriting the existing data.
- CTRL+C should cause the program to gracefully shut down.
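Parsing the two switches is straightforward; the names below (Options, DelayMs, RefreshDays) are illustrative. For CTRL+C, Console.CancelKeyPress lets the handler suppress the immediate kill (e.Cancel = true) and raise a flag the crawl loops poll, so each thread can finish its current download and exit cleanly.

```csharp
using System;

class Options
{
    public int DelayMs = 0;       // /d= pause after each download (ms)
    public int RefreshDays = -1;  // /r= re-download age in days; -1 = never refresh

    public static Options Parse(string[] args)
    {
        var o = new Options();
        foreach (string a in args)
        {
            if (a.StartsWith("/d=")) o.DelayMs = int.Parse(a.Substring(3));
            else if (a.StartsWith("/r=")) o.RefreshDays = int.Parse(a.Substring(3));
        }
        return o;
    }

    // Graceful CTRL+C: cancel the hard kill and set a flag that the
    // download loops check between pages.
    public static volatile bool ShutdownRequested;
    public static void HookCancelKey()
    {
        Console.CancelKeyPress += delegate(object s, ConsoleCancelEventArgs e)
        {
            e.Cancel = true;
            ShutdownRequested = true;
        };
    }
}
```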
In a nutshell, we want something similar to OpenWebSpider, with the following major differences:
- No database; saves data to file system.
- Supports unicode URLs and content; saves html as UTF-16 text files.
- Far fewer command line options, much simpler utility.
- Automatically searches across domains, no requirement to limit to single domain.
Must run on Windows XP SP3, and Windows 7 (32 and 64 bit).
All files must be saved as UTF-16, not ASCII or UTF-8.
Program should correctly save English, Chinese, Thai, Japanese, Arabic, etc, URLs and content.
File paths on Windows should be escaped if they contain Unicode characters that are incompatible with the Windows NTFS file system.
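For the encoding and path rules above: Encoding.Unicode produces UTF-16 (little-endian, with a byte-order mark), and Path.GetInvalidFileNameChars gives the characters Windows file names cannot contain. The sketch below percent-escapes those by code point; SafeSave is an illustrative name, and the %XXXX escape format is an assumption, since the brief does not specify one.

```csharp
using System;
using System.IO;
using System.Text;

static class SafeSave
{
    // Replace characters that are invalid in Windows/NTFS file names
    // with a %XXXX escape of the character's code point (assumed format).
    public static string SanitizeName(string name)
    {
        var sb = new StringBuilder();
        foreach (char c in name)
        {
            if (Array.IndexOf(Path.GetInvalidFileNameChars(), c) >= 0)
                sb.AppendFormat("%{0:X4}", (int)c);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Encoding.Unicode writes UTF-16 LE with a BOM, per the brief.
    public static void WriteUtf16(string path, string content)
    {
        File.WriteAllText(path, content, Encoding.Unicode);
    }
}
```

Because .NET strings are already UTF-16 internally, Chinese, Thai, Japanese, and Arabic content round-trips through this path without conversion loss.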
Project ID: #3307187