I need a web crawler engine which will do the following:
1. Loop through a set of HTTP sources (mainly WordPress and news sites)
2. Read the content of each source, uniquely identify articles, and store the following information in a MySQL database:
a. ID of the article
b. Signature (must be created based on the source and the contents of the article)
c. Category of the article
d. Title of the article
e. Main body of the article
f. Article image (if any)
g. HTTP source
h. Date & time
3. The crawler should categorize the articles according to their contents (e.g. sports, politics, health, science, etc.)
4. The crawler should find related articles and store these relations in a table
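
To make points 3 and 4 more concrete, here is a rough sketch of one possible approach, not a required design. It assumes simple keyword matching for categories and Jaccard word overlap for relatedness; the keyword lists, the function names, the "uncategorized" fallback and the 0.3 threshold are all illustrative placeholders.

```python
import re

# Hypothetical keyword lists; a real crawler would need much richer
# vocabularies or a trained classifier.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "goal", "league", "player"},
    "politics": {"election", "parliament", "minister", "vote", "government"},
    "health": {"hospital", "vaccine", "doctor", "disease", "patient"},
    "science": {"research", "study", "physics", "experiment", "scientists"},
}


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def categorize(text):
    """Pick the category whose keywords overlap most with the text."""
    words = tokenize(text)
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"


def jaccard(text_a, text_b):
    """Jaccard similarity of the two texts' word sets (0.0 to 1.0)."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def related_pairs(articles, threshold=0.3):
    """Return (id_a, id_b, score) for every article pair at or above threshold.

    `articles` maps article ID to body text. The rows returned here are
    what would be written to the relations table.
    """
    ids = sorted(articles)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = jaccard(articles[first], articles[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

The pairwise comparison is quadratic in the number of articles, so this only illustrates the idea for a small set; a larger corpus would likely need an index or a different similarity method.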
Should be done in Python or Perl.
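
In Python, the signature from point 2b could be computed along these lines. This is only a sketch under my own assumptions: the posting does not specify a hash algorithm, so SHA-256 is a placeholder choice, and the function names and the whitespace/case normalization are illustrative.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial formatting changes
    do not produce a different signature (an assumed policy)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def article_signature(source_url, title, body):
    """Hash the source together with the article contents.

    Returns a 64-character hex string that could be stored in a
    CHAR(64) column with a UNIQUE index to reject duplicates.
    """
    payload = "\n".join(normalize(part) for part in (source_url, title, body))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Usage: the same article fetched twice yields the same signature.
first = article_signature("http://example.com/news", "Title", "Body  text")
second = article_signature("http://example.com/news", "Title", "Body text")
is_duplicate = first == second
```

Because the source is part of the hashed payload, the same text published on two different sites gets two different signatures, which matches the requirement that the signature depends on both the source and the contents.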
I would be happy to take on this project. I fully understand your assignment and am sure that I could complete it on time using the Python programming language.