Cancelled

Website crawlers (wrappers) for 12 sites

Use HTMLUnit to write programs to crawl each of the sites below. Generate a pipe-delimited file from each site with one row per record. The format of the file is given below. You don't need to crawl the entire site, but it's your responsibility to make sure that your program works over the entire site. Build in an N-second delay between page fetches, where N is a command-line parameter. Retry each page request 4 times before skipping (with an N-second delay between each retry). If you skip a page, write out the fact that you skipped it to a logfile. Note that I am *not* asking you to collect email addresses from any of these sites.
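To make the shared fetch requirement concrete, here is a minimal sketch using HTMLUnit's WebClient and Log4J. The class and method names (PageFetcher, fetch) are illustrative only and are not part of this spec; each of the 12 programs would layer its site-specific crawl logic on top of something like this.

```java
import java.net.URL;

import org.apache.log4j.Logger;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Illustrative fetch helper: an N-second delay before every request, up to
// 4 retries per page, and a logfile entry when a page is finally skipped.
public class PageFetcher {
    private static final Logger log = Logger.getLogger(PageFetcher.class);
    private static final int MAX_ATTEMPTS = 5; // one initial try plus 4 retries

    private final WebClient client = new WebClient();
    private final long delayMillis;

    public PageFetcher(int delaySeconds) { // N is read from the command line
        this.delayMillis = delaySeconds * 1000L;
    }

    // Returns the page, or null if every attempt failed (the skip is logged).
    public HtmlPage fetch(String url) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            Thread.sleep(delayMillis); // N-second delay before each fetch and retry
            try {
                return (HtmlPage) client.getPage(new URL(url));
            } catch (Exception e) { // I/O errors, bad URLs, failing HTTP status codes
                log.warn("Attempt " + attempt + " failed for " + url, e);
            }
        }
        log.error("SKIPPED " + url); // record the skipped page in the logfile
        return null;
    }
}
```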

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition as follows: source code must be written in Java and compilable under JDK 1.5 with the HTMLUnit and Log4J jar files. Using HTMLUnit and its associated XPath libraries will make this project much easier to write and maintain over time, so their use is required.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased.

4) The work must be completed by June 1, 2005.

5) The specific program requirements follow. Note that in total, 12 programs are to be delivered. Specs for the last 3 programs are in the attached file.

For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions given, where each row has the format "Type|LinkText|URL" and Type is "Localities", "Surnames", or "Topics".

1. [url removed, login to view]

* follow the Localities, Surnames, and Topics links

* under Localities, follow the links for each location recursively. Location links are preceded by a yellow folder icon. Write out any message board links you find; message board links are found after a grey horizontal line on the page. (A rough sketch of this crawl follows this site's bullets.)

* for example, on the page [url removed, login to view], you would recursively follow the 14 location links, and write out the two message board links:

* Localities|General|[url removed, login to view]

* Localities|CanadaGenWeb|[url removed, login to view]

* under Surnames, follow each of the 1-, 2-, and 3-character name-prefix links recursively, and write out any message board links you find.

* for example, on the page [url removed, login to view], you would recursively follow the 26 three-character name-prefix links Sta..Stz, and write out the approximately 60 message board links, such as

* Surnames|St. Ama|[url removed, login to view]

* handle the Topics message boards similarly
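As referenced above, here is a rough sketch of the recursive Localities crawl. The two XPath expressions are assumptions about the board pages' markup (a folder-icon <img> next to each locality link, and message-board anchors appearing after the <hr>) and must be verified against the live site; the sketch also assumes an HTMLUnit version whose pages expose getByXPath (older releases go through the HtmlUnitXPath helper instead). The plain getPage() call stands in for the delayed, retrying fetch sketched earlier, and a real crawler would also track visited URLs to avoid revisiting pages.

```java
import java.io.PrintWriter;
import java.net.URL;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Illustrative recursive crawl of the Localities tree for site 1.
public class LocalityCrawler {
    private final WebClient client = new WebClient();
    private final PrintWriter out;

    public LocalityCrawler(PrintWriter out) {
        this.out = out;
    }

    public void crawl(String url) throws Exception {
        HtmlPage page = (HtmlPage) client.getPage(new URL(url));

        // Message-board links: assumed to be the anchors after the grey <hr>.
        for (Object o : page.getByXPath("//hr/following::a")) {
            HtmlAnchor a = (HtmlAnchor) o;
            out.println("Localities|" + a.asText() + "|"
                    + page.getFullyQualifiedUrl(a.getHrefAttribute()));
        }

        // Locality links: assumed to be the anchors next to the yellow folder icon.
        for (Object o : page.getByXPath(
                "//a[preceding-sibling::img[contains(@src, 'folder')]]")) {
            HtmlAnchor a = (HtmlAnchor) o;
            crawl(page.getFullyQualifiedUrl(a.getHrefAttribute()).toString());
        }
    }
}
```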

2. [url removed, login to view]

* follow the links under Surnames, Regional, and General Topics

* on each of the 26 surname pages, write out the surname links in the list

* for example, on the page [url removed, login to view], you would write out approximately 200 rows, the first of which is:

* Surnames|Qafzezi|[url removed, login to view]

* There are only two links to follow under Regional: U.S. States and Countries. On each of those two pages, write out the links for each location (US state or country).

* for example, on the page [url removed, login to view], you would write out roughly 100-150 rows, the first of which is

* Localities|Albania|[url removed, login to view]

* There's really just one page with Topics links: [url removed, login to view]. On this page, capture all links under the various headings and subheadings in the list.

* for example:

* Topics|General Genealogy|[url removed, login to view]

For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions given, where each row has the format "LinkText|URL".

1. [url removed, login to view]

* this one is easy; just capture the link text and URL for each of the links in the list (a minimal sketch follows the example below)

* for example:

* Louisiana, 1718-1925 Marriage Index|[url removed, login to view]
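Since this format has only two fields, the program for this site is close to the smallest possible crawler. A sketch, assuming the whole listing lives on the starting page; the "//a" selector is a placeholder and would need to be narrowed to the element that actually contains the list:

```java
import java.net.URL;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Illustrative "LinkText|URL" extraction from a single listing page.
public class LinkListExtractor {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage page = (HtmlPage) client.getPage(new URL(args[0]));
        // "//a" is a placeholder; restrict it to the container holding the list.
        for (Object o : page.getByXPath("//a")) {
            HtmlAnchor a = (HtmlAnchor) o;
            System.out.println(a.asText() + "|"
                    + page.getFullyQualifiedUrl(a.getHrefAttribute()));
        }
    }
}
```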

2. [url removed, login to view]

* follow the links recursively for each of the 1-, 2-, and sometimes 3-character prefixes. Write out the databases listed on each page. Instead of writing out just the link text, write out the entire line as the link text (see the sketch after the example below).

* for example, on page [url removed, login to view],a,ab&firstTitle=0, you would write out 5 rows, the first of which is:

* Abandoned iron mines of Andover and Byram Townships, Sussex County, New Jersey|[url removed, login to view]
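The "entire line" requirement above is the one detail that differs from the other sites, so here is a sketch of one way to handle it: take the text of the element containing the anchor rather than the anchor's own text. It assumes each database entry's printed line corresponds to the anchor's parent element, which (like the "//a" placeholder) must be confirmed against the page markup:

```java
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Illustrative "whole line as link text" extraction for the database listings.
public class DatabaseLineExtractor {
    static void writeRows(HtmlPage page) throws Exception {
        // "//a" is a placeholder; restrict it to the anchors in the database list.
        for (Object o : page.getByXPath("//a")) {
            HtmlAnchor a = (HtmlAnchor) o;
            // Text of the containing element, i.e. the whole printed line.
            String wholeLine = a.getParentNode().asText().trim();
            System.out.println(wholeLine + "|"
                    + page.getFullyQualifiedUrl(a.getHrefAttribute()));
        }
    }
}
```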

For each URL specified below, write a program that starts from that URL and generates a file according to the crawl instructions given, where each row has the format "Location|LinkText|URL" and Location is the state, province, or country.

1. [url removed, login to view]

* follow all links from "Western United States and Canada" to "Additional Lists"; on the linked-to pages, capture information for the archives & libraries (a sketch of deriving the Location from the page headings follows the examples below)

* example:

* Alaska|Alaska State Archives|[url removed, login to view]

* Alaska|Alaska State Library. Alaska Historical Collections|[url removed, login to view]
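As referenced above, here is a sketch of one way to derive the Location field from the state or province heading each link sits under. It assumes the headings are rendered as <h2>/<h3> elements (an assumption to verify against the actual pages) and relies on the XPath union returning nodes in document order, so each anchor is attributed to the most recent heading seen:

```java
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Illustrative "Location|LinkText|URL" extraction driven by the page headings.
public class HeadingGroupedExtractor {
    static void writeRows(HtmlPage page) throws Exception {
        String location = "";
        // Headings and anchors come back in document order, so remembering the
        // last heading seen attributes each link to its state or province.
        for (Object o : page.getByXPath("//h2 | //h3 | //a")) {
            if (o instanceof HtmlAnchor) {
                HtmlAnchor a = (HtmlAnchor) o;
                System.out.println(location + "|" + a.asText() + "|"
                        + page.getFullyQualifiedUrl(a.getHrefAttribute()));
            } else {
                location = ((HtmlElement) o).asText().trim(); // new heading
            }
        }
    }
}
```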

2. [url removed, login to view]

* Get data from this one page only; capture information from the list of the state archives and records programs. Skip the links to the State Coordinator and SHRAB.

* example:

* Alabama|Alabama Department of Archives and History|[url removed, login to view]

* Alaska|Alaska Division of Libraries and Archives|[url removed, login to view]

* Alaska|Archives and Records Management|[url removed, login to view]

* Arizona|Arizona History and Archives Division|[url removed, login to view]

3. [url removed, login to view]

* Follow links from "Communal" to "State and Regional"; on the linked-to pages, capture information in the Links section, and continue following the links in the Categories section

* get the Location field from the Category; it's OK if the Category is not a state, country, or province - just capture whatever it is

* example: here are the first couple of links from the Communal : Northern America : USA : Alabama category

* Alabama|Amador County - Archives|[url removed, login to view]

* Alabama|Birmingham Public Library - Archival Resources|[url removed, login to view]

4. [url removed, login to view]

* Follow the links from "Africa" to "Northern America", following the same crawl strategy as for site (3)

* example: here is the link from the Communities : Associations : Africa : Senegal category

* Senegal|Association des Amis des Archives du Senegal (AMIAS)|[url removed, login to view]

5. [url removed, login to view]

* Get data from this one page only; capture information from state libraries and organizations

* example:

* Alabama|Alabama Department of Archives & History|[url removed, login to view]

* Alabama|Alabama Public Library Service|[url removed, login to view]

## Platform

Must run on Java 1.5, using HTMLUnit and Dom4J.

Skills: Amazon Web Services, Engineering, MySQL, PHP, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing

About the employer:
( 39 reviews ) United States

Project ID: #3727444