Populate an Excel sheet with the URLs of staff pages from a list of university websites.
To identify the XPath to various elements in a page, one tool that can be used is the XPathChecker plugin for Firefox.
The first step in creating a template is to identify the start page for each institute/organization. This start URL is added to the StartURL field in the institutes table. In most cases the list of staff members' names is presented as either a table or a list. The XPath that identifies this table or list is added to the TableXPath field in the corresponding record. The XPath that identifies each staff member's profile page link is added to the URLXPath field. Since most web profiles are linked using a relative URL, the link found by URLXPath needs to be combined with a URL prefix made up of the institute's web server address and path. This prefix is added to the URLPrefix field.
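As a rough illustration of how the four fields fit together, here is a minimal Python sketch. It is not the project's script: the template dictionary, the example.edu addresses, the sample page, and the `profile_urls` helper are all made up for this example. It also uses the standard library's limited XPath subset, so URLXPath selects the link element and the href attribute is read separately; a full XPath engine could select the attribute directly. Joining with `urljoin` is one possible way to apply URLPrefix; plain string concatenation would be another.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical template record; the keys mirror the fields of the institutes table.
template = {
    "StartURL": "https://www.example.edu/dept/staff/",
    "TableXPath": ".//table[@id='staff']",
    "URLXPath": ".//td/a",
    "URLPrefix": "https://www.example.edu/dept/staff/",
}

# Stand-in for the downloaded start page (well-formed so the stdlib parser accepts it).
START_PAGE = """<html><body>
<table id="staff">
<tr><td><a href="profiles/jsmith.html">J. Smith</a></td></tr>
<tr><td><a href="profiles/alee.html">A. Lee</a></td></tr>
</table>
</body></html>"""


def profile_urls(page_source, template):
    """Return absolute profile URLs found inside the staff table or list."""
    root = ET.fromstring(page_source)
    table = root.find(template["TableXPath"])
    if table is None:
        return []  # TableXPath did not match anything on this page
    urls = []
    for link in table.findall(template["URLXPath"]):
        href = link.get("href")
        if href:
            # Combine the (usually relative) link with the institute's prefix.
            urls.append(urljoin(template["URLPrefix"], href))
    return urls


for url in profile_urls(START_PAGE, template):
    print(url)
```

Real staff pages are rarely well-formed XML, so an actual run would need a tolerant HTML parser in place of `ET.fromstring`.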
Once the StartURL, TableXPath, URLXPath and URLPrefix fields are populated, the script should be able to read the individual profile pages one by one. This can be verified by running the script and checking its on-screen output to confirm that the URLs are actually being retrieved.
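The on-screen check could look something like the following sketch. The `check_urls` function and its `fetch` parameter are hypothetical names introduced here, not part of the existing script; the fetch step is injectable so the loop can be tried without network access.

```python
from urllib.error import URLError
from urllib.request import urlopen


def check_urls(urls, fetch=None):
    """Print one status line per URL and return the URLs that could not be read."""
    if fetch is None:
        # Default fetcher: download the page body over HTTP(S).
        def fetch(url):
            with urlopen(url, timeout=10) as response:
                return response.read()

    failed = []
    for url in urls:
        try:
            body = fetch(url)
            print(f"OK    {len(body):>8} bytes  {url}")
        except (URLError, OSError, ValueError) as exc:
            print(f"FAIL  {url}  ({exc})")
            failed.append(url)
    return failed
```

A list of failures that covers every URL usually points to a wrong URLPrefix, while an empty list of URLs points to a TableXPath or URLXPath that does not match the page.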
Once the pages can be extracted, the template XPaths for the profile details need to be populated. The variables being captured include:
• Research Interests
Each of these details requires a separate XPath added to the template, with an optional regular expression to eliminate unwanted formatting and HTML tags. Please note that not all organizational units/staff members will have all of these details. A few trial runs will be needed to find the XPath that captures the majority of the details. For each detail, there are two ways of using the XPath. One is to get the value as a list of XPath nodes (‘V’) and the other is to get the values found by the XPath as a string (‘S’). The return type needs to be added to the corresponding type field in the table. If a regular expression is needed, the type would usually be ‘S’.
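The difference between the two return types can be sketched as follows. This is an illustration under assumptions, not the script's actual behaviour: the sample profile page, the `extract_detail` helper, and the convention that the regular expression keeps its first capture group are all invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for one downloaded profile page (well-formed for the stdlib parser).
PROFILE_PAGE = """<html><body>
<div class="research"><h3>Research Interests</h3>
<ul><li>Machine learning</li><li>Remote sensing</li></ul></div>
</body></html>"""


def extract_detail(root, xpath, return_type, pattern=None):
    """Return matched nodes for 'V', or their cleaned-up text for 'S'."""
    nodes = root.findall(xpath)
    if return_type == "V":
        return nodes
    # 'S': flatten the matched nodes to text and collapse runs of whitespace.
    text = " ".join(" ".join(node.itertext()) for node in nodes)
    text = re.sub(r"\s+", " ", text).strip()
    if pattern:
        # Assumed convention: keep only the first capture group of the regex.
        match = re.search(pattern, text)
        text = match.group(1).strip() if match else ""
    return text


root = ET.fromstring(PROFILE_PAGE)
as_nodes = extract_detail(root, ".//div[@class='research']/ul/li", "V")
as_string = extract_detail(
    root, ".//div[@class='research']", "S", r"Research Interests\s*(.*)"
)
print([node.text for node in as_nodes])
print(as_string)
```

Here the ‘V’ call yields one node per list item, while the ‘S’ call yields a single string from which the regular expression strips the heading.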