Hello, I need this small job done - and I expect it to be a very easy ride for you (provided you have the required skills).
Basically, I've got a CMS in which I add articles (with HTML support).
I want a simple scraper script to get posts from a single forum, by logging in to that forum with a predefined username.
The scraper has two add options: add by keyword or add by user.
The scraper script must add X posts to my DB each time it runs, where X is a number I can enter in a box before telling the script to start.
The script must also feature a blacklist where I can block certain parts of the copied posts from appearing in my version of the content (e.g. if the source contains a keyword I want to block).
The first function is add by keyword, which searches the forum for posts that include the terms I specify and copies them (for example, search the forum for thread titles that contain "Google" and add 10 posts to my DB from within these threads).
The second is add by user, which fetches threads started by a given user and adds X posts sorted by date of posting (descending).
A combination of keyword and user scraping is a bonus.
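To make the two add options concrete, here is a minimal Python sketch of the selection step only (logging in and fetching pages are left out). The `Post` record and the `select_posts` function are hypothetical names, not part of any existing library. Sorting keyword matches newest first is an assumption carried over from the add by user rule.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Post:
    # Hypothetical record for one scraped forum post.
    thread_title: str
    thread_starter: str
    posted_at: datetime
    body_html: str


def select_posts(posts: List[Post], limit: int,
                 keyword: Optional[str] = None,
                 user: Optional[str] = None) -> List[Post]:
    """Return at most `limit` posts, newest first, that pass the given filters."""
    matches = []
    for post in posts:
        # Add by keyword: the thread title must contain the term (case-insensitive).
        if keyword is not None and keyword.lower() not in post.thread_title.lower():
            continue
        # Add by user: the thread must have been started by the given member.
        if user is not None and post.thread_starter.lower() != user.lower():
            continue
        matches.append(post)
    # Sort by date of posting, descending, then keep the first X posts.
    matches.sort(key=lambda p: p.posted_at, reverse=True)
    return matches[:limit]
```

Passing both `keyword` and `user` gives the combined mode; passing only one gives the corresponding single mode.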
More about the blacklist: content scraping can be messy (for example, different CSS or design tags which I do not need on my sites), so the blacklist must be very powerful.
It must have two options: block starting and block elements.
Block starting blocks everything after the HTML code I specify, and block elements simply removes the specified elements from the array in which the copied page is stored. The blacklist must be unique for every member of the target forum, and it must remember the settings.
In addition to blocking HTML tags and blocks, the blacklist must allow blocking text as well.
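The blacklist rules could look roughly like the following Python sketch. All names (`Blacklist`, `apply_blacklist`) are hypothetical. It assumes that block starting also drops the marker itself, and it uses regular expressions on the raw HTML string instead of the array representation mentioned above, so nested elements of the same tag are not handled robustly.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Blacklist:
    # Hypothetical per-member settings; every field name is illustrative.
    block_starting: List[str] = field(default_factory=list)  # cut the post from this HTML marker on
    block_elements: List[str] = field(default_factory=list)  # tag names to remove, with their contents
    block_text: List[str] = field(default_factory=list)      # plain text fragments to delete


def apply_blacklist(html: str, rules: Blacklist) -> str:
    """Return the post HTML with all blacklisted parts removed."""
    # Block starting: drop the marker and everything after it (assumption).
    for marker in rules.block_starting:
        index = html.find(marker)
        if index != -1:
            html = html[:index]
    # Block elements: remove paired tags with their contents, then stray or void tags.
    for tag in rules.block_elements:
        name = re.escape(tag)
        html = re.sub(rf"<{name}\b[^>]*>.*?</{name}\s*>", "", html,
                      flags=re.IGNORECASE | re.DOTALL)
        html = re.sub(rf"</?{name}\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Block text: delete exact text fragments.
    for text in rules.block_text:
        html = html.replace(text, "")
    return html
```

Keeping one `Blacklist` per forum member, for example in a dictionary keyed by member name that is saved between runs, would cover the per-member requirement.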
Important: The scraper must extract images, links and text, and every post inserted into my DB must follow a general template: images first, then description or text, and then links (of course, only from the post body within the source forum).
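As an illustration of that template, here is a minimal sketch using Python's standard `html.parser` module. The class `PostParts` and the function `render_post` are hypothetical names, and the exact output markup (one `<img>` per line, a single `<p>` for the text, one `<a>` per link) is an assumption about how the template might look.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PostParts(HTMLParser):
    """Collects image URLs, links and visible text from a post body."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[str] = []
        self.links: List[Tuple[str, str]] = []  # (href, link text)
        self.text: List[str] = []
        self._href: Optional[str] = None        # set while inside an <a> tag
        self._link_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self._href = attributes["href"]
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Fall back to the URL when the link has no visible text.
            label = "".join(self._link_text).strip() or self._href
            self.links.append((self._href, label))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)
        elif data.strip():
            self.text.append(data.strip())


def render_post(body_html: str) -> str:
    """Rebuild a post as: images first, then text, then links."""
    parser = PostParts()
    parser.feed(body_html)
    parser.close()  # flush any text still buffered by the parser
    lines = [f'<img src="{escape(src)}">' for src in parser.images]
    if parser.text:
        lines.append(f"<p>{escape(' '.join(parser.text))}</p>")
    lines += [f'<a href="{escape(href)}">{escape(label)}</a>'
              for href, label in parser.links]
    return "\n".join(lines)
```

Running the blacklist before this step would keep blocked markup and text out of the final template.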
This is a rough idea of what I want. If you think you can come up with something better, I'm open to hearing what you have to say. Again, all I want is to be able to 'leech' posts from predetermined members of a predetermined forum, whilst being able to remove/ignore/block the mess that usually comes with content scraping.
Note: the priority is for the script to work on a specific forum running phpBB 2.x, and it must also be phpBB compatible.
If you can make it work on more forums like vBulletin, etc., that is a bonus.