I am interested in having an XML reader and a page info reader (both easily editable).
Most of the data has to be read from XML, but a few sites don't have XML and should be read with phpQuery or something similar.
The main problem and goal: when a cron job runs this PHP script, it should check whether this data already exists from the same source (checking with regexp, fuzzy search methods or something similar), and if such data already exists, insert the data but with a different source id.
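To make the check more concrete, here is a rough sketch of how the comparison could look. It is only an idea, not a finished solution. The function names (normalizeTitle, titlesMatch, decideAction), the row keys (source_id, title, date) and the 90% threshold are all made up for illustration. It reads the rule as: skip when the same source already has a matching title and date, otherwise insert under the incoming source id. Stored rows are a plain array here instead of a real DB query.

```php
<?php
// Hypothetical helper: lower-case the title and collapse punctuation and
// whitespace so trivial differences do not defeat the comparison.
function normalizeTitle(string $title): string
{
    $title = trim($title);
    $title = function_exists('mb_strtolower')
        ? mb_strtolower($title, 'UTF-8')
        : strtolower($title);
    // Replace every run of non-letter, non-digit characters with one space.
    $title = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $title) ?? $title;
    return trim($title);
}

// Fuzzy comparison: similar_text() reports similarity as a percentage.
// The 90% threshold is an assumption and would need tuning on real data.
function titlesMatch(string $a, string $b, float $threshold = 90.0): bool
{
    $a = normalizeTitle($a);
    $b = normalizeTitle($b);
    if ($a === $b) {
        return true;
    }
    similar_text($a, $b, $percent);
    return $percent >= $threshold;
}

// Returns 'skip' when the same source already holds a matching title and date,
// otherwise 'insert' (the row is then stored under the incoming source id).
function decideAction(array $item, array $storedRows): string
{
    foreach ($storedRows as $row) {
        if ($row['source_id'] === $item['source_id']
            && $row['date'] === $item['date']
            && titlesMatch($row['title'], $item['title'])) {
            return 'skip';
        }
    }
    return 'insert';
}
```

In a real script $storedRows would come from a SELECT limited to recent rows, so the fuzzy comparison does not have to run against the whole table.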
These are the steps as I see them (we already have a DB structure that I think fits):
1. Fields from 3-4 XML sources should be mapped to fixed fields (for example, <title> from the 1st source and <name> from the 2nd source are both mapped to <mytitle>; we will explain which part of each XML belongs to <title> and so on).
2. Then each XML is parsed with those rules, and if an item is unique for that source and <some_kind_category>, the data is inserted.
3. Each time the script parses XML, it should check the <title>(s) and the nearby <date>: if they are the same as data already stored from that source, don't insert, but if some data from that source is different, insert it.
4. Reading the page source should work the same way as XML parsing.
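The mapping from step 1 could be kept as a small, easily editable array per source. This is only a sketch using PHP's built-in SimpleXML. The source ids, the element names (item, entry, published), the field name mydate and the function mapFeed are assumptions for illustration, not the real feeds.

```php
<?php
// Hypothetical per-source mapping: our fixed field name => element name in that feed.
$fieldMaps = [
    1 => ['mytitle' => 'title', 'mydate' => 'date'],
    2 => ['mytitle' => 'name',  'mydate' => 'published'],
];

// Parse one feed and return rows that use only our fixed field names.
// $itemTag is the element that wraps a single record in that feed.
function mapFeed(string $xml, int $sourceId, array $map, string $itemTag): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        libxml_clear_errors();
        return []; // unparseable feed: nothing to insert
    }

    $rows = [];
    foreach ($doc->xpath('//' . $itemTag) ?: [] as $item) {
        $row = ['source_id' => $sourceId];
        foreach ($map as $ourField => $theirField) {
            // A missing element becomes an empty string.
            $row[$ourField] = trim((string) $item->{$theirField});
        }
        $rows[] = $row;
    }
    return $rows;
}
```

Adding a new XML source would then mean adding one more entry to $fieldMaps. Sites without XML could produce rows with the same keys from a scraper (phpQuery or similar), so the comparison and insert logic stays the same for both.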
Selecting and printing on the web page I hope to do myself. I just need the XML parsing, the inserting with a source id, and the logic and ideas for comparing against already inserted data before inserting.