I need to save a snapshot (all HTML files) of the website [login to view URL]. It is an online forum that allows people to post and follow each other. I want to save the following information:
1. [login to view URL] saved as '[login to view URL]'
2. On the index page, there are 23 forums (note that I count Porn Addiction and Porn-Induced Sexual Dysfunctions as two separate forums). I need all pages of all threads in each of the 23 forums to be saved. For example, the first forum is shown as "Rebooting - Porn Addiction Recovery". Clicking on it leads to [login to view URL] The trailing number 2 in that link is the forum identifier. I want this page to be saved to "[login to view URL]". There are 583 pages of threads in this forum. You can save them to "[login to view URL]" all the way to "[login to view URL]". Each of these pages lists 50 threads (slightly more on the first page because of the information and announcements at the top). Each of these threads may span multiple pages as well, and I need all of those HTML pages saved too. For example, the first thread is "[login to view URL]". The trailing number 88344 is the thread identifier. I want its pages to be saved to "[login to view URL]" through "[login to view URL]" (this thread has 5 pages).
3. I want all the user profile pages to be saved as well. The website ([login to view URL]) shows there are 156,726 members. You can enumerate all of them from 1 to 156726 using the following link (shown for user 1): [login to view URL] For each user profile page, I need the HTML pages that show the 5 tabs: "Profile Posts" (it may have multiple pages, and all pages are needed), "Recent Activity" (click "Show older items" at the bottom until the button disappears so that everything is captured), "Postings" (no need to fetch all of them, since all postings are captured in the previous step), "Information", and "Groups". Moreover, I want to know the user_id of each user under "Following" and "Followers". For example, user 1 is following 8 other users and is followed by 826 users. I want 2 tables (CSV or SQLite) to store the Following/Followers information, each with 2 columns. Following table: user_id, following_user_id; Followers table: user_id, follower_user_id. The Following/Followers lists show only 20 users per page, so you need to click the "more" button repeatedly to enumerate all users.
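A minimal sketch of how the two relationship tables could look if the SQLite option is chosen. The table names `following` and `followers`, the helper names `open_follow_db` and `save_relations`, and the composite primary keys are assumptions made for illustration. Only the column names come from the description above.

```python
import sqlite3


def open_follow_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist.

    The composite primary key is an assumption: it prevents the same
    relationship from being stored twice when the scraper is re-run.
    """
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS following (
            user_id           INTEGER NOT NULL,
            following_user_id INTEGER NOT NULL,
            PRIMARY KEY (user_id, following_user_id)
        );
        CREATE TABLE IF NOT EXISTS followers (
            user_id          INTEGER NOT NULL,
            follower_user_id INTEGER NOT NULL,
            PRIMARY KEY (user_id, follower_user_id)
        );
        """
    )
    return conn


def save_relations(conn, table, user_id, other_ids):
    """Insert one row per related user; duplicates are silently skipped."""
    columns = {"following": "following_user_id", "followers": "follower_user_id"}
    if table not in columns:
        raise ValueError("table must be 'following' or 'followers'")
    sql = "INSERT OR IGNORE INTO {} (user_id, {}) VALUES (?, ?)".format(
        table, columns[table]
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, [(user_id, other) for other in other_ids])
```

The user IDs collected after each click on the "more" button can be passed to `save_relations` in batches of 20. Because of `INSERT OR IGNORE`, re-sending an ID that was already stored is harmless.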
1. The program should be able to finish running within 24 hours (multithreading might be needed; for example, several threads can handle several forums while one thread handles the user profile pages). The shorter the run time, the better, because I plan to scrape the website on different days to see how users and posts change.
2. Since I want to scrape this website on different days, it would be great to do some type of incremental scraping. The first run would save everything, and later runs would keep only "diff"-type files that record what has been added, changed, or deleted (users, user following relationships, threads). That would save a lot of hard disk space because I don't need duplicate copies of HTML files that are already saved.
3. Python 3.5+ and other packages that you find necessary
4. The program should log in to the forum before saving the HTML files. Registration is free. Login credentials can be provided upon request.
5. The program will run on Linux Ubuntu
6. Clear comments in the code so that I can modify later
7. Object oriented design is preferred
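One possible way to combine requirements 1 and 2 is sketched below: pages are fetched concurrently with a thread pool, each page is reduced to a content hash, and two runs are compared to find what was added, deleted, or changed. This is only an illustration under stated assumptions. The `fetch` callable, the page keys, and the function names are hypothetical, and the real program would plug in its own logged-in HTTP client and file storage.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def content_hash(html):
    """Return a SHA-256 fingerprint of one page's HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def take_snapshot(fetch, keys, workers=8):
    """Fetch all keys concurrently and return {key: content hash}.

    `fetch` is a hypothetical callable that takes a page key (for
    example a thread or user identifier) and returns the HTML as a
    string, or None when the page no longer exists.
    """
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, keys))  # results keep the order of keys
    return {k: content_hash(p) for k, p in zip(keys, pages) if p is not None}


def diff_snapshots(old, new):
    """Compare two snapshots and report added, deleted, and changed keys."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "deleted": sorted(old_keys - new_keys),
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }
```

With this approach, only pages listed under "added" or "changed" need to be written to disk on a later run, and the "deleted" list records what disappeared. Dynamic page content such as timestamps or session tokens would make hashes differ on every run, so the HTML may need to be normalized before hashing.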
18 freelancers are bidding an average of $283 on this job
Hey - I've checked [login to view URL] and can confirm that we can build a Python crawler as per your requirements. Please drop me a message so we can discuss every detail, thanks ~ Steve