We would like an HTML extraction engine that first spiders a given site, then extracts the HTML from every page it spiders. I could not find an existing script that does this (aside from maybe php dig?).
Therefore, we would like the following in a php script/series of scripts:
-spider a given website completely, ignoring images, media files, xml, docs, pdf, etc.
-copy the url/path of each file spidered (example, /products/[url removed, login to view] or /[url removed, login to view])
-extract the html from each spidered page
-if the page contains any of the following tags, print off the values within the html tags: ( through , , first paragraph of text, )
-The rest of the text doesn't need to be printed off.
-The expected result is a php page that prints off the URL and the 'defined' text on the page.
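As a rough illustration of the extraction step only, here is a minimal sketch. It is written in Python purely to show the logic; the deliverable requested here is still PHP. The tag names in the requirement above did not survive in this listing, so the set used below (`<title>`, `<h1>` through `<h6>`, and the first `<p>`) is an assumption, and the names `DefinedTextParser` and `extract_defined_text` are hypothetical.

```python
from html.parser import HTMLParser

# ASSUMPTION: the real tag list was lost from the posting; these are placeholders.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class DefinedTextParser(HTMLParser):
    """Collects text from <title>, headings, and the first <p> only."""

    def __init__(self):
        super().__init__()
        self.found = []        # (tag, text) pairs in document order
        self._tag = None       # tag currently being captured, if any
        self._buf = []         # text fragments for the current tag
        self._seen_p = False   # True once the first paragraph is stored

    def handle_starttag(self, tag, attrs):
        if self._tag is not None:
            return  # already inside a wanted tag; nested tags are ignored
        wanted = tag == "title" or tag in HEADINGS
        if wanted or (tag == "p" and not self._seen_p):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace so the printed text is tidy.
            text = " ".join("".join(self._buf).split())
            self.found.append((tag, text))
            if tag == "p":
                self._seen_p = True
            self._tag = None


def extract_defined_text(html):
    """Return the (tag, text) pairs found in one page of HTML."""
    parser = DefinedTextParser()
    parser.feed(html)
    parser.close()
    return parser.found
```

This sketch assumes reasonably well-formed markup; a `<p>` that is never closed, for instance, would need extra handling. The final script would print each page's URL followed by these pairs.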
Attached is the code for a php file and relevant classes we've found and slightly modified. The script "test_pagelinks.php" basically looks at a URL, and gets all the links. Another part of the script goes to the URL, gets all the HTML and prints it out.
There are deficiencies with this script. First of all, it doesn't capture the home page, or index page. It also duplicates pages, so the same page can be indexed more than once. Additionally, there is no HTML parsing based on the parameters above.
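The first two deficiencies (missing home page, duplicated pages) might be addressed along the following lines. This is only a sketch, again in Python to illustrate the logic rather than as the PHP deliverable. It seeds the queue with the start URL itself so the home page is captured, and normalizes every link before checking it against a set of URLs already seen. The helper names (`normalize`, `crawl`, `LinkParser`), the `fetch` callback, and the list of skipped extensions are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# ASSUMPTION: an illustrative list of non-HTML extensions to ignore.
SKIP_EXT = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".doc", ".xml", ".mp3")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def normalize(url):
    """Drop the #fragment, lowercase the host, and use "/" for an empty path."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return parts._replace(netloc=parts.netloc.lower(),
                          path=parts.path or "/").geturl()


def crawl(start_url, fetch):
    """Breadth-first crawl of one host; fetch(url) returns HTML or None."""
    start = normalize(start_url)
    host = urlparse(start).netloc
    seen = {start}            # the start page is queued too, so it is captured
    queue = deque([start])
    pages = {}                # url -> html, each URL stored once
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = normalize(urljoin(url, href))
            parts = urlparse(link)
            if parts.scheme not in ("http", "https") or parts.netloc != host:
                continue      # skip mailto:, other sites, etc.
            if parts.path.lower().endswith(SKIP_EXT):
                continue      # skip images, documents, and similar files
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the fetch function in keeps the crawl logic testable without network access; in practice it could wrap something like `urllib.request.urlopen`. Note that this normalization does not treat `/index.html` and `/` as the same page, which a real site may require.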
So in closing, you can either modify the attached script, modify php dig components, or create your own (there may also be other components out there that can be used for this...).
Eventually this code will form the "cornerstone" of a WAP translator. The specs are still being worked out on that, but look for that project to be posted soon.
12 freelancers are bidding an average of $239 on this job
We will do everything you want (and more... :-))). Quick, professional, quality work: that is our answer to you and your organization. We have been working for more than 10 years. Any questions?
This seems doable. I'm wondering if it would be easier to do in Perl though, or start by using wget to spider then Perl or PHP to parse out what you need.