Matching Problem (II)

This project involves devising a method to match inconsistently coded data. We have a dataset of worksite inspections in which entries describing the same worksite are often recorded differently. For example, one entry might have its "address" cell recorded as "123 Elm Rd" while another has "123 Elm Road". In other cases, the same company's "company name" cell is recorded differently: "Acme Inc." might be misspelled as "Amce Inc." in one entry. We would like a program that matches these inconsistently coded entries. Two entries should count as a match when there is a high probability that they actually refer to the same worksite. The process must be automated because our dataset contains a few hundred thousand observations.
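
To make the kind of matching we have in mind concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not a finished design: the field names (address, company_name), the small abbreviation table, and the 0.85 threshold are all hypothetical and would need tuning against the real data. It normalizes each string (lowercase, punctuation removed, common abbreviations expanded) and then scores similarity with difflib.SequenceMatcher.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far longer.
ABBREVIATIONS = {
    "rd": "road",
    "st": "street",
    "ave": "avenue",
    "inc": "incorporated",
    "co": "company",
}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] between the normalized forms of a and b."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as the same worksite when both fields are similar enough.

    The threshold is an assumed starting point, not a validated value.
    """
    return (
        similarity(rec_a["address"], rec_b["address"]) >= threshold
        and similarity(rec_a["company_name"], rec_b["company_name"]) >= threshold
    )


if __name__ == "__main__":
    a = {"address": "123 Elm Rd", "company_name": "Acme Inc."}
    b = {"address": "123 Elm Road", "company_name": "Amce Inc."}
    print(is_match(a, b))
```

With the two example entries above, the addresses normalize to the same string and the company names differ only by a transposed pair of letters, so the pair is flagged as a match. Comparing every pair this way grows quadratically with the number of entries, so at a few hundred thousand observations a real version would probably need to restrict comparisons to entries that already share some key (for instance the same city or postal code, if the data has one).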

I have attached a sample of the data.

