We are looking for an experienced C# developer to create the following application:
+Application crawls and spiders the target sites:
+Crawls any Unicode character encoding without problems (Chinese, Japanese, and Korean scripts; Arabic, Hebrew, Turkish, Thai, Greek, Baltic, and Cyrillic; UTF-8 and the Windows-125x code pages)
+Spiders picture and video sources from page source code
+Checks website source code and returns: site title, site meta description, site keywords, site page size, search-term site URL, and many more fields (see the parsing/XPath sketch after this list)
+Exports crawled information to SQL and MySQL dump files (automatically generated MySQL CREATE TABLE and INSERT INTO ... VALUES statements covering title, meta description, keywords, page size, search-term site URL, etc., plus further SQL functionality; see the export sketch after this list)
+Reasonable duplicate-domain and duplicate-content detection to avoid re-crawling identical sites on different domains ([url removed, login to view] vs. [url removed, login to view], and the million other sites that serve the same content from multiple domains; see the hashing sketch after this list)
+Understands GET parameters and what counts as a "search result" across many site-specific search engines. For example, a page may link to a search-result page on another site's internal search with certain GET parameters; such result pages should not be crawled (see the URL-filter sketch after this list)
+Simple Chrome and Mozilla Firefox toolbar (extension) to add the current site for crawling and searching
+Blocks unwanted content
+Proxy and cookie management for anonymous access (see the proxy/cookie sketch after this list)
+Resolves CAPTCHAs in a simple way, either by manual input or via a third-party decaptcha service
+Stop / resume crawl
+Caches crawled items
+HTML cleaning algorithm
+Respects robots.txt files (see the robots.txt sketch after this list)
+Both SGML entities such as &agrave; and ISO Latin-1 characters can be indexed and searched.
+Support for http, https, ftp, nntp and news URL schemes.
+htdb virtual URL scheme support for indexing SQL databases.
+Built-in support for the text/html, text/xml, text/plain, audio/mpeg (MP3), and image/gif MIME types.
+Starts crawling from a list of URLs specified by the user;
+Detects broken links (and automatically ignores them);
+Filters the extracted data;
+Customized web crawler / web spider with crawling rules and multithreaded downloading (up to 50 threads; see the parallel-download sketch after this list);
+Extracts data from password-protected websites;
+Phrase search, regular expressions, attribute search, and similarity search
+Extracts data from highly dynamic websites, including AJAX sites.
+Applies regular expressions (regex) to the text or HTML source of web pages and scrapes the matching portions. This powerful technique offers more flexibility while scraping data (see the regex sketch after this list).
+Extracts data using XPath (covered by the parsing/XPath sketch after this list)
+Update every N minutes, to specify how often the program will re-scrape the target website (see the scheduler sketch after this list)
+Supports a wide range of character sets, with automated character-set and language detection (see the encoding sketch after this list).
+Provides phrase segmenting (tokenizing) for Chinese, Japanese, Korean and Thai.
+Can perform parallel and multi-threaded indexing for faster updating.
+Effective caching significantly reduces search times.
+Query logging stores the query, query parameters, and the number of results found.
+Duplicate-data detection and removal. Duplicate detection can also stop a scrape once previously seen data is reached, which is very useful when extracting data from websites such as forums (the hashing sketch after this list applies here too).
+Extracts data from documents such as PDF or DOCX files by using third-party document converters.
+The application should be able to return these results in less than 5 seconds
+Urgent requirement: scrape data from social media (Twitter, Facebook, and other social sites) and blogs in real time
+Efficiency: clever use of the processor, memory, and bandwidth (e.g. as few idle processes as possible)
+Indexes all pages of each website
+Exports a configurable number of results per file (100; 1,000; 10,000; 100,000; ...)
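
The sketches below illustrate one possible C# approach to several of the requirements above; all class, method, and parameter names are illustrative assumptions, not a required design. First, a minimal sketch of the automated character-set detection: take the charset from the Content-Type header, fall back to scanning the page head for a charset declaration, then default to UTF-8.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class CharsetSniffer
{
    // Decode a response body using the charset from the Content-Type header,
    // falling back to an HTML charset declaration, then to UTF-8.
    // On .NET Core/5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
    // once at startup so the Windows-125x code pages resolve.
    public static async Task<string> ReadDecodedAsync(HttpResponseMessage response)
    {
        byte[] raw = await response.Content.ReadAsByteArrayAsync();

        string charset = response.Content.Headers.ContentType?.CharSet;
        if (charset == null)
        {
            // Sniff the first 1 KB as ASCII for a <meta charset=...> declaration.
            string head = Encoding.ASCII.GetString(raw, 0, Math.Min(raw.Length, 1024));
            Match m = Regex.Match(head, @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
            if (m.Success) charset = m.Groups[1].Value;
        }

        try { return Encoding.GetEncoding(charset ?? "utf-8").GetString(raw); }
        catch (ArgumentException) { return Encoding.UTF8.GetString(raw); } // unknown label: assume UTF-8
    }
}
```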
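A minimal sketch of the source-code parsing and XPath extraction items, assuming the HtmlAgilityPack NuGet package (any XPath-capable HTML parser would do):

```csharp
using System.Text;
using HtmlAgilityPack; // NuGet package (assumed); provides XPath over HTML

class PageInfo
{
    public string Title, MetaDescription, MetaKeywords;
    public int PageSizeBytes;
}

static class PageParser
{
    // Pull title, meta description, meta keywords, and page size out of raw
    // HTML using XPath queries.
    public static PageInfo Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = doc.DocumentNode;

        // Helper: read the content attribute of a named meta tag.
        string Meta(string name) => root
            .SelectSingleNode($"//meta[@name='{name}']")
            ?.GetAttributeValue("content", null);

        return new PageInfo
        {
            Title = root.SelectSingleNode("//title")?.InnerText?.Trim(),
            MetaDescription = Meta("description"),
            MetaKeywords = Meta("keywords"),
            PageSizeBytes = Encoding.UTF8.GetByteCount(html),
        };
    }
}
```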
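A minimal sketch of the SQL/MySQL export, assuming a plain dump-file format; the table and column names (crawled_pages, etc.) are placeholders:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class MySqlDumpWriter
{
    // Write a MySQL dump file with an auto-generated CREATE TABLE and one
    // INSERT per crawled page.
    public static void Write(string path,
        IEnumerable<(string Url, string Title, string Meta, string Keywords, int PageSize)> rows)
    {
        var sb = new StringBuilder();
        sb.AppendLine("CREATE TABLE IF NOT EXISTS crawled_pages (");
        sb.AppendLine("  url VARCHAR(2048) NOT NULL,");
        sb.AppendLine("  title TEXT, meta_description TEXT, keywords TEXT,");
        sb.AppendLine("  page_size INT");
        sb.AppendLine(") CHARACTER SET utf8mb4;");

        foreach (var r in rows)
        {
            sb.AppendLine(
                "INSERT INTO crawled_pages (url, title, meta_description, keywords, page_size) VALUES (" +
                $"{Quote(r.Url)}, {Quote(r.Title)}, {Quote(r.Meta)}, {Quote(r.Keywords)}, {r.PageSize});");
        }
        File.WriteAllText(path, sb.ToString(), Encoding.UTF8);
    }

    // Escape a value for a single-quoted MySQL string literal.
    static string Quote(string s) =>
        s == null ? "NULL" : "'" + s.Replace("\\", "\\\\").Replace("'", "\\'") + "'";
}
```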
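A minimal sketch of duplicate-content detection across domains, assuming a deliberately crude normalization (strip tags, collapse whitespace) followed by a SHA-256 hash; the same IsDuplicate check can stop a scrape once previously seen data is reached:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

class DuplicateDetector
{
    readonly HashSet<string> _seen = new HashSet<string>();

    // Returns true if this page body was already crawled, possibly under a
    // different domain serving identical content.
    public bool IsDuplicate(string html)
    {
        // Normalize: drop markup, collapse whitespace, ignore case.
        string text = Regex.Replace(html, "<[^>]+>", " ");
        text = Regex.Replace(text, @"\s+", " ").Trim().ToLowerInvariant();

        using var sha = SHA256.Create();
        // Convert.ToHexString requires .NET 5+; use BitConverter on older runtimes.
        string hash = Convert.ToHexString(sha.ComputeHash(Encoding.UTF8.GetBytes(text)));
        return !_seen.Add(hash); // Add returns false when the hash was already present
    }
}
```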
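A minimal sketch of the search-result filter for GET parameters; the parameter-name list is an assumption and would need to be configurable per site:

```csharp
using System;
using System.Linq;
using System.Web; // HttpUtility, available in .NET Core 2.0+ and .NET Framework

static class SearchResultFilter
{
    // GET parameter names that typically mark a site-internal search-result
    // page (assumed defaults; extend per target site).
    static readonly string[] SearchParams = { "q", "query", "search", "s", "keyword", "kw" };

    // Returns true when the URL looks like another site's internal search
    // result, which the crawler should skip.
    public static bool LooksLikeSearchResult(Uri url)
    {
        var query = HttpUtility.ParseQueryString(url.Query);
        return query.AllKeys.Any(k => k != null && SearchParams.Contains(k.ToLowerInvariant()));
    }
}
```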
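A minimal sketch of proxy and cookie management for anonymous access, using the standard HttpClientHandler; the proxy address and credentials are placeholders:

```csharp
using System.Net;
using System.Net.Http;

static class AnonymousClientFactory
{
    // Build an HttpClient that routes through a proxy and keeps cookies for
    // the duration of a crawl session.
    public static HttpClient Create(string proxyUrl, string user, string password)
    {
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy(proxyUrl)
            {
                Credentials = new NetworkCredential(user, password),
            },
            UseProxy = true,
            CookieContainer = new CookieContainer(), // cookies persist across requests
            UseCookies = true,
        };
        return new HttpClient(handler);
    }
}
```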
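A minimal sketch of robots.txt handling that honors only the "User-agent: *" group's Disallow rules; a production crawler should also honor Allow, Crawl-delay, and per-agent groups:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RobotsTxt
{
    readonly List<string> _disallowed = new List<string>();

    // Collect Disallow path prefixes from the wildcard user-agent group.
    public RobotsTxt(string robotsTxtContent)
    {
        bool inStarGroup = false;
        foreach (string raw in robotsTxtContent.Split('\n'))
        {
            string line = raw.Split('#')[0].Trim(); // strip comments
            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                inStarGroup = line.Substring(11).Trim() == "*";
            else if (inStarGroup && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                string path = line.Substring(9).Trim();
                if (path.Length > 0) _disallowed.Add(path);
            }
        }
    }

    // A URL is allowed unless its path falls under a disallowed prefix.
    public bool IsAllowed(Uri url) =>
        !_disallowed.Any(p => url.AbsolutePath.StartsWith(p, StringComparison.Ordinal));
}
```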
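A minimal sketch of multithreaded downloading capped at 50 concurrent requests via SemaphoreSlim; failed requests (broken links) are skipped, per the requirement above:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ParallelDownloader
{
    static readonly HttpClient Http = new HttpClient();

    // Download many URLs concurrently, capped at maxThreads so the target
    // servers and the local machine are not overwhelmed.
    public static async Task<IDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls, int maxThreads = 50)
    {
        var gate = new SemaphoreSlim(maxThreads);
        var results = new Dictionary<string, string>();
        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync(); // blocks once maxThreads downloads are in flight
            try
            {
                string body = await Http.GetStringAsync(url);
                lock (results) results[url] = body;
            }
            catch (HttpRequestException) { /* broken link: ignore, per the spec */ }
            finally { gate.Release(); }
        });
        await Task.WhenAll(tasks);
        return results;
    }
}
```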
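A minimal sketch of regex-based scraping; the example pattern in the usage comment (pulling picture and video sources) is an assumption:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexScraper
{
    // Apply a user-supplied pattern to page text/HTML and return the first
    // capture group of each match (or the whole match if there is no group).
    public static IReadOnlyList<string> Scrape(string html, string pattern) =>
        Regex.Matches(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline)
             .Cast<Match>()
             .Select(m => m.Groups.Count > 1 ? m.Groups[1].Value : m.Value)
             .ToList();
}

// Hypothetical usage: extract img/video/source URLs from a page.
// var sources = RegexScraper.Scrape(html, @"<(?:img|video|source)[^>]+src\s*=\s*[""']([^""']+)");
```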
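Finally, a minimal sketch of the update-every-N-minutes requirement, assuming .NET 6+ for PeriodicTimer; cancelling the token stops the loop, so the same mechanism can back the stop/resume feature:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class Scheduler
{
    // Re-run a scrape every N minutes until the token is cancelled.
    public static async Task RunEveryAsync(Func<Task> scrape, int minutes, CancellationToken ct)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(minutes));
        try
        {
            while (await timer.WaitForNextTickAsync(ct))
                await scrape();
        }
        catch (OperationCanceledException)
        {
            // stop requested: exit cleanly so the crawl can be resumed later
        }
    }
}
```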