Task: Web Crawling application to build
Important: Please read the whole document and then reply to the questions at the end of these guidelines along with your bid.
1. Introduction and Application Functions.
Our client is working on a research study on the etymology of the domain names used by businesses.
A business = any organization selling a product or service to other businesses and/or to consumers. (B2B + B2C)
They hired us (now, it is your mission) to develop an automated application which will determine, from a list of 10 million domain names, which of those domains are being used by businesses and which are not. In other words, it must differentiate the active "business websites" from all other domain names (by "other domain names", we mean: inactive websites or unused domains, news and general-information sites, personal websites, school websites, non-profit organization websites, forums, blogs, directories, and so on).
You need to develop an automated online (server-based) application which will "crawl" the active websites from that list of domain names and analyze their "navigation elements" (you may know these
as the "menu" or "website categories") to check whether those "navigation elements" contain the words commonly used by businesses.
In more detail, the application will:
a) "validate" each URL to check if it corresponds to an active (online) website.
b) locate the "navigation elements" (menu or website categories) of each active website.
c) check if the "menu" of each website contains at least one of the words commonly used by businesses for their website (we will call them "key words"), such as:
If one of those common business "key words" appears in the "menu", that is enough to imply a high chance that the website belongs to a business and is not a personal site, blog, directory, etc.
So, if at least one of those words is found in the menu of the website, the application will return a positive answer, add the URL to the "positive" list of "domains used by businesses", and then return to step a) with the next URL.
For each positive answer, the application database needs to record the "key word" which matched.
This needs to be a "broad match". For example, if the application is looking for "Service" and finds "Our Services" in the website's "menu", it needs to be validated positively as well.
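The "broad match" in step c) can be sketched in a few lines; Python is used here purely for illustration, since the brief leaves the language choice open, and the helper names are hypothetical. A case-insensitive substring test is enough to make "Service" match "Our Services":

```python
def broad_match(keyword: str, menu_text: str) -> bool:
    """Case-insensitive substring test, so "Service" matches "Our Services"."""
    return keyword.lower() in menu_text.lower()

def first_matching_keyword(keywords, menu_items):
    """Return (menu_item, matched_keyword) for the first broad match,
    or None if no menu item contains any keyword. The matched keyword
    is what the database would record for a positive answer."""
    for item in menu_items:
        for kw in keywords:
            if broad_match(kw, item):
                return item, kw
    return None
```

Section 2 refines this simple substring test into a more tolerant letter-based formula.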
2. "Crawling" Formula.
In order for the application to complete the task described in 1.c, we propose the following "formula":
a) the application will collect the links contained in the "navigation elements" of the website.
b) the application will collect the target file names of those links. For example, in
"http www. url . com / [url removed, login to view]", "products" is the target file name of the link.
It will also collect the anchor text of those links, the folder names in those links if there are any, and so on. (We will call these the "link names".)
c) the application will check whether 80% of the letters from at least one of the "key words" are included in the "link names". If there is a match, it will assign a "positive" response to the URL.
For example, with the "key word" "service", all of the following "link names" would return a positive match:
[url removed, login to view]
[url removed, login to view]
The reason we use 80% rather than 100% of the letters in the "key words" is to cover plurals and singulars, spelling mistakes, and word declensions (mainly in other languages where accents are used).
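One possible reading of this 80% rule, sketched in Python (the brief invites alternative formulas, so this is an assumption rather than a specification): count what fraction of the keyword's letters, with multiplicity, can be found in the link name.

```python
from collections import Counter

def letter_coverage(keyword: str, link_name: str) -> float:
    """Fraction of the keyword's letters (with multiplicity) that also
    appear in the link name, ignoring case and non-letter characters."""
    kw = Counter(c for c in keyword.lower() if c.isalpha())
    ln = Counter(c for c in link_name.lower() if c.isalpha())
    if not kw:
        return 0.0
    covered = sum(min(count, ln[c]) for c, count in kw.items())
    return covered / sum(kw.values())

def is_positive(keyword: str, link_name: str, threshold: float = 0.8) -> bool:
    """Apply the proposed 80% threshold to one keyword / link-name pair."""
    return letter_coverage(keyword, link_name) >= threshold
```

With this reading, "service" matches "our-services" (7 of 7 letters) as well as the Spanish "servicio" (6 of 7, about 86%). A fuzzy-similarity measure such as `difflib.SequenceMatcher` would be an alternative "formula" worth testing.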
This is a proposal. If you find a better "formula", you are welcome to propose it.
You will need to take considerable time and run many tests in order to design the best "formula" for the "crawling", as it is the most important part of the application.
You need to develop a multi-threaded application which will be able to "crawl" millions of URLs efficiently.
It needs to integrate a powerful multi-threaded "crawling" process, a "queue" and "scheduler" management system, and a robust database storage system.
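A minimal sketch of that worker-pool structure, again in Python and with the per-URL fetching and analysis left as a pluggable `check_url` callable (a hypothetical name; a production version would add timeouts, retries, politeness delays, a persistent queue, and database storage):

```python
import concurrent.futures

def crawl_all(urls, check_url, max_workers=8):
    """Run check_url over every URL with a pool of worker threads.

    check_url(url) is expected to return (is_positive, matched_keyword),
    where matched_keyword is None for negatives; a real implementation
    would fetch the page and apply the "crawling" formula there.
    """
    positives, negatives = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, (ok, keyword) in zip(urls, pool.map(check_url, urls)):
            (positives if ok else negatives).append((url, keyword))
    return positives, negatives
```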
3. "Introduction" Pages.
Many websites have intro pages (static or in Flash) on which the "menu" does not appear. You usually need to click a "Skip intro" or "Enter website" link to reach the main site.
In order to use the right elements for the "crawling", the application needs to differentiate an "intro page" from a "main homepage".
The "formula" we propose for that task is the following:
a) When it arrives on a website, the application will count the number of "inside links" on the "landing" page (first level).
b) If the number of "inside links" on the landing page is 3 or fewer, it will go to step c). If the number of those links is over 3, it will proceed with the normal "crawling formula" process.
c) It will go to one of the "inside pages" (if any) by following one of the "inside links" from the landing page, and will then run the "crawling" on both the "landing page" and that "inside page".
Again, this is a proposal. If you find a better "formula", you are welcome to propose it.
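The intro-page formula above reduces to a small decision rule; the sketch below (Python, hypothetical names) returns which pages the crawler should analyze, given the list of "inside links" found on the landing page:

```python
def pages_to_crawl(inside_links, threshold=3):
    """Per the proposed formula: if the landing page has `threshold`
    or fewer inside links, treat it as an intro page and crawl one
    inside page in addition to the landing page itself."""
    if inside_links and len(inside_links) <= threshold:
        return ["landing", inside_links[0]]
    return ["landing"]
```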
4. Web-based Interface.
Along with the application itself, you need to build a web-based interface which will help us monitor the "crawling" and allow us to extract the lists.
The main functions of the web-based interface will be:
a) URL List Import. (lists will be in .txt or .csv format)
The application needs to provide 2 types of import:
1) import by uploading the URL lists directly to the online application from the computer.
2) import by uploading the URL lists to a directory on the server by FTP. (for large lists, this will be more convenient)
There needs to be a page where we can check the status of the import and see the number of URLs that were found in the imported lists.
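Parsing the uploaded lists and counting their URLs is straightforward; here is a Python sketch, assuming one URL per line for .txt files and the URL in the first column for .csv files (both are assumptions, since the brief does not fix the file layout):

```python
import csv
import io

def read_url_list(text: str, filename: str):
    """Return the URLs found in an uploaded list.

    .csv files are assumed to carry the URL in the first column;
    .txt files are assumed to hold one URL per line. Blank entries
    are skipped, so len() of the result gives the import count
    shown on the status page.
    """
    if filename.lower().endswith(".csv"):
        rows = csv.reader(io.StringIO(text))
        return [row[0].strip() for row in rows if row and row[0].strip()]
    return [line.strip() for line in text.splitlines() if line.strip()]
```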
b) "Key Words"
The "Key Words" are the words commonly used by businesses in the "menu" (the menu = the navigation elements) of their website.
We need to be able to add/modify/delete words in the list of "Key Words" which will be used by the application for the "crawling" process.
(The "Key Words" can be from different languages as sites are in different languages)
c) The "Crawling" Process.
We need to be able to start/pause/end and check the status of the "crawling" process from the online interface.
d) List Extraction and Statistics.
Once the "crawling" has been completed, we need to be able to export the list of negative (non-match) and positive URLs.
We need also to be able to export the "positives" lists for each keyword match and have access to a detailed statistics page about the "crawling".
5. Important Remarks:
- A dedicated Linux server will be provided.
- These are the "general" guidelines for the application you have to build. We did not go into much detail because we want to give you a lot of freedom to build the application in the way you think will be most suitable for the task. Please develop a powerful, logical, efficient, and easy-to-use application.
- If we assign the project to you, we will place the funds into escrow, and you will then be required to submit a complete framework proposal (detailed specifications) describing how you will
build the application, the "crawling" formula, the testing, and the programming language and technology, along with the list of features you will include in the application. This will allow us to check, before you start "programming", that your "solution" suits us.
- PLEASE REPLY TO THE FOLLOWING 4 QUESTIONS ALONG WITH YOUR BID:
1) Have you already created or worked with "crawling" applications?
2) What do you think about the "crawling formula" we suggested?
3) What programming language(s) would you use to develop the application?
4) In what timeframe can you
a) provide your framework proposal (detailed specifications about the "structure" of the application you intend to build) (in .pdf or .doc)
and b) develop the application? (please provide an estimate)
Thank you for your bid.