Website Scraper - File Downloader

I need a program that spiders [url removed, login to view] daily, downloads all media files it finds, and keeps the records in an MS SQL 2005 database. (I’ll give you the table.)

This project is intended for those who have done similar spiders/scrapers in the past.

I’m providing a flow chart to code to; together with the database table and the rest of this description, it is intended to be a complete guide for development. Read everything over before bidding. Answer the questions at the end when posting your bid; bids without answers will not be considered.

First of all, go to [url removed, login to view] and see the format of the site. There are five sections: movies, pictures, games, animations and links. Each section contains a multi-page list of files. Each item in the list contains a thumbnail, a title, a description and a link to a page that contains a ‘download’ link to the actual file. Program must be able to grab all of these items, so you obviously need to be able to make full use of RegEx.
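As a rough illustration of the item extraction described above, here is a minimal Python sketch. The HTML in SAMPLE_PAGE and the pattern in ITEM_RE are invented for the example, since the site's real markup is not shown in this description; the actual expressions would have to be written against the live pages and kept in the config file.

```python
import re

# Hypothetical markup: the real site's HTML is not shown in this brief,
# so this pattern is an assumption and would normally live in the config file.
ITEM_RE = re.compile(
    r'<a href="(?P<url>[^"]+)"><img src="(?P<thumb>[^"]+)"[^>]*></a>\s*'
    r'<h3>(?P<title>[^<]*)</h3>\s*'
    r'<p>(?P<description>[^<]*)</p>',
    re.IGNORECASE,
)

def extract_items(html):
    """Return one dict (url, thumb, title, description) per list item."""
    return [m.groupdict() for m in ITEM_RE.finditer(html)]

# Made-up sample page with two list items.
SAMPLE_PAGE = '''
<div class="item"><a href="/media/1"><img src="/thumbs/1.jpg" alt=""></a>
<h3>Funny clip</h3>
<p>A short movie.</p></div>
<div class="item"><a href="/media/2"><img src="/thumbs/2.gif"></a>
<h3>Second</h3>
<p>Another one.</p></div>
'''

items = extract_items(SAMPLE_PAGE)
```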

The following is a flow chart of the code logic:

For sections movie, pic, game, animation, links
{
    Start at page 1

    For each item in page
    {
        If ‘url’ exists in db
        {
            Matches + 1 // set matches = 0 at start of each section

            If (matches > 10)
            {
                Move on to next section
            }
        }
        Else
        {
            Add item to db and set all columns (url, type, etc)

            Download the thumbnail and save as [id] + [textension]

            If type is mlink or link // test by checking if domain is external
            {
                Mark record as downloaded
            }
        }
    } increment page
} next section

After all of the above is done,

For each record in db not marked as downloaded
{
    Go to ‘url’ of the record

    Grab the ‘url’ of the file (from ‘download’ link)

    Download the file and save as [id] + [extension]

    Mark as downloaded
}


Program must maintain the following MS SQL table:

[table Media]

id (int, auto identity column)

type (movie = 1, pic = 2, game = 3, flash = 4, mlink = 5, link = 6)

title (title of the file, grabbed from the page)

description (description of the file, also grabbed from the page)

date (date the file was downloaded)

extension (file extension, eg .avi, .swf, .mpeg, etc)

textension (file extension of the thumbnail, eg .jpg, .jpeg, .gif)

url (the url the thumbnail links to or, for text links, the url the text link points to. Normally this is the same as the page that contains the ‘download’ link.)

downloaded (bit, default is 0. set to 1 after the file has been successfully downloaded.)
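To make the column list above concrete, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for MS SQL 2005. The column types, the UNIQUE constraint on url, and the helper names add_item and url_exists are assumptions for illustration only; the real table definition is the one supplied with the project.

```python
import sqlite3
from enum import IntEnum

class MediaType(IntEnum):
    """Type codes exactly as listed in the table description."""
    MOVIE = 1
    PIC = 2
    GAME = 3
    FLASH = 4
    MLINK = 5
    LINK = 6

# SQLite stands in for MS SQL 2005 here; column types are guesses.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Media (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity column
        type INTEGER NOT NULL,
        title TEXT,
        description TEXT,
        date TEXT,
        extension TEXT,
        textension TEXT,
        url TEXT NOT NULL UNIQUE,              -- UNIQUE is an assumption
        downloaded INTEGER NOT NULL DEFAULT 0  -- bit, default 0
    )
""")

def add_item(conn, media_type, title, description, textension, url):
    """Insert a new record and return its id (used to name the files)."""
    cur = conn.execute(
        "INSERT INTO Media (type, title, description, textension, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (int(media_type), title, description, textension, url),
    )
    return cur.lastrowid

def url_exists(conn, url):
    """True if a record with this url is already in the table."""
    row = conn.execute("SELECT 1 FROM Media WHERE url = ?", (url,)).fetchone()
    return row is not None
```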

ALL of the requirements must be met:

Program must either run as a windows service or be executed at a daily interval by the windows task scheduler. If the program is not a windows service, it must begin work upon execution (so it can be safely triggered by task scheduler).

Program must grab all items from dumpalink's five sections: movies, pictures, games, animations, links. Files and thumbnails are downloaded. Titles, descriptions, urls are saved into a database. [see database table and flow chart]

Thumbnails are always downloaded, for all items. Most files can be downloaded from the ‘download’ link. For pictures, the actual picture is downloaded. In the links section, text links from the linkdump must be added to the database like everything else, but nothing needs to be downloaded.
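The flow chart tests for link-type items by checking whether the domain is external. One possible way to do that is sketched below in Python. SITE_BASE is a placeholder (the real site address is removed from this description), and treating "www." and the bare domain as the same host is an assumption; other subdomains would count as external under this rule.

```python
from urllib.parse import urljoin, urlparse

SITE_BASE = "http://www.example.com/"  # placeholder, not the real site

def _strip_www(host):
    """Treat 'www.example.com' and 'example.com' as the same host."""
    return host[4:] if host.startswith("www.") else host

def is_external(url, base=SITE_BASE):
    """True if url points to a different host than the spidered site."""
    base_host = urlparse(base).hostname or ""
    # urljoin resolves relative links such as '/media/1' against the site.
    host = urlparse(urljoin(base, url)).hostname or ""
    return _strip_www(host) != _strip_www(base_host)
```

Items for which is_external returns true would be stored as mlink or link and marked as downloaded right away, since there is no file to fetch for them.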

Each time the program runs it must spider ONLY new pages since its last execution. One way of doing this is to start checking each section at page 1 and increment until you hit at least 10 links that already exist in the database. [see flow chart] Also make sure you never download the same file twice, by checking with the database.
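The stop-after-N-matches rule can be sketched as follows in Python. This is only an outline under a few assumptions: fetch_page_items is a hypothetical callback that returns the items of one page, a plain set stands in for the database lookup, and an empty page is taken to mean the section has ended (the flow chart does not say how the last page is detected). The comparison follows the flow chart's "matches > 10".

```python
def scan_section(fetch_page_items, known_urls, max_matches=10):
    """Walk a section from page 1 upward and return the new items found.

    fetch_page_items(page) -> list of dicts with a 'url' key (hypothetical).
    known_urls is a set standing in for the 'url exists in db' check.
    """
    matches = 0          # reset at the start of each section
    new_items = []
    page = 1
    while True:
        items = fetch_page_items(page)
        if not items:    # assumption: an empty page ends the section
            return new_items
        for item in items:
            if item["url"] in known_urls:
                matches += 1
                if matches > max_matches:
                    return new_items      # move on to next section
            else:
                known_urls.add(item["url"])
                new_items.append(item)
        page += 1

# Made-up data: newest items first, three of them already known.
PAGES = {
    1: [{"url": "/m/5"}, {"url": "/m/4"}],
    2: [{"url": "/m/3"}, {"url": "/m/2"}],
    3: [{"url": "/m/1"}],
}
known = {"/m/3", "/m/2", "/m/1"}
found = scan_section(lambda p: PAGES.get(p, []), known, max_matches=1)
```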

All files must be downloaded into a single folder and named [id] + [extension]. For example, [url removed, login to view] would become 123.mpeg. Thumbnails must be saved into a separate folder, and named [id] + [textension].
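A short Python sketch of the naming rule, for illustration. The function names are made up, the example URL in the usage note is a placeholder, and lowercasing the extension is an assumption rather than a stated requirement.

```python
import os
from urllib.parse import urlparse

def extension_of(url):
    """Return the extension of the URL path, dot included ('' if none).

    Query strings are ignored; lowercasing is an assumption.
    """
    return os.path.splitext(urlparse(url).path)[1].lower()

def target_path(folder, record_id, extension):
    """Build [folder]/[id][extension], e.g. files/123.mpeg."""
    return os.path.join(folder, f"{record_id}{extension}")
```

For a record with id 123 and a file URL such as http://www.example.com/files/clip.mpeg, target_path(file_folder, 123, extension_of(url)) gives a file named 123.mpeg inside the configured file folder. The same function works for thumbnails by passing the thumbnail folder and the textension value.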

Program must check on every execution for any files that have not been downloaded successfully and make an attempt to do so. [see flow chart]
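The retry pass could look roughly like this in Python. It is a sketch only: records is a list of dicts standing in for the database rows, and resolve_file_url and download are hypothetical callbacks for "grab the url of the file from the download link" and "save as [id] + [extension]". A failed download leaves the flag at 0, so the next execution tries again.

```python
def retry_pending(records, resolve_file_url, download):
    """Attempt every record that is not yet marked as downloaded.

    records: list of dicts with 'id', 'url' and 'downloaded' keys
    (a stand-in for rows of the Media table).
    """
    for rec in records:
        if rec["downloaded"]:
            continue
        try:
            file_url = resolve_file_url(rec["url"])
            download(file_url, rec["id"])
        except OSError:
            # Network or disk failure: keep downloaded = 0 for the next run.
            continue
        rec["downloaded"] = 1
```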

Program must have a config file or another way to configure the following parameters:

1) path to file folder, path to the thumbnail folder

2) sql server connection string

3) all RegEx expressions used in the code

4) scan frequency if the program is a service (how often it spiders, in hours)

5) a filter regex to ignore certain links (before adding a record to the database a filter regex is called on the ‘url’ string, if match occurs that file or link is ignored)

6) number of matches before section scan is terminated (default 10)
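One possible shape for such a config file is an INI-style file, sketched here in Python with the standard configparser module. All section names, key names, paths, the connection string and the regex values are placeholders invented for the example; only the six parameters themselves come from the list above.

```python
import configparser
import re

# Placeholder values throughout; none of them come from the project itself.
SAMPLE_CONFIG = r"""
[paths]
file_folder = C:\spider\files
thumbnail_folder = C:\spider\thumbs

[database]
connection_string = Server=localhost;Database=MediaDb;Integrated Security=SSPI;

[regex]
item = <a href="([^"]+)">
ignore_filter = \.(exe|zip)$

[scan]
frequency_hours = 24
max_matches = 10
"""

def load_config(text):
    """Parse the INI text; interpolation is off so '%' in a regex is safe."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser

def is_ignored(url, parser):
    """Apply the filter regex to a url before it is added to the database."""
    return re.search(parser["regex"]["ignore_filter"], url) is not None

config = load_config(SAMPLE_CONFIG)
max_matches = config.getint("scan", "max_matches")
```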

Answer the following questions along with your bid; bids without answers will not be considered:

1. Can you work in C#? If not, what language do you plan to use for development?

2. Do you intend to develop a windows service or an executable triggered by task scheduler?

3. Have you done similar projects before?

4. Do you have MS SQL 2005? An express version is available for free from Microsoft.

The flow chart got completely mangled by GOF. See attached doc.

Skills: .NET, C Programming, Data Processing, Visual Basic, Windows Desktop

About the employer:
( 12 reviews ) oilville, United States

Project ID: #84739