I need a PHP or Perl script that calculates the "edit distance" between the contents of two text files, where "edit distance" is defined as the minimum total cost in points (not the minimum number of actions) of the "Edits" needed to transform the first text into the second.
An "Edit" can only be one of these actions:
- word insertion (1 point)
- word deletion (1 point)
- word substitution (1 point)
- block shift, where a block is a sequence of contiguous words moved to another position (0.2 points)
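To make the block shift action concrete, here is a minimal sketch of a single shift on a list of words. It is written in Python only for readability (the deliverable itself should be PHP or Perl), and the function name `shift_block` and its parameters are hypothetical, chosen for this illustration.

```python
def shift_block(words, start, end, dest):
    """Move the block words[start:end] so that it begins at index `dest`
    of the remaining words. Counts as one block shift (0.2 points)."""
    block = words[start:end]
    rest = words[:start] + words[end:]
    return rest[:dest] + block + rest[dest:]


# One shift: move the first sentence behind the second one.
text1 = "THIS APPLE IS RED THE SKY IS BLUE".split()
text2 = "THE SKY IS BLUE THIS APPLE IS RED".split()
assert shift_block(text1, 0, 4, 4) == text2

# Two shifts: "THIS APPLE IS RED" -> "THIS RED IS APPLE" (0.4 points).
step1 = shift_block("THIS APPLE IS RED".split(), 3, 4, 1)  # move RED
step2 = shift_block(step1, 2, 3, 3)                        # move APPLE
assert step2 == "THIS RED IS APPLE".split()
```

Applying a shift is the easy part. The script has to find the cheapest combination of shifts and word edits, which this sketch does not attempt.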
The comparison is case-insensitive, and characters other than A to Z, 0 to 9 and "-" (minus) are ignored.
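The normalization rule could be sketched as follows, again in Python purely for illustration. The function name `tokenize` is hypothetical. The sketch assumes that "ignored" means the character is deleted rather than treated as a word separator (so "it's" becomes "ITS"), and that words are separated by whitespace. Both are assumptions on top of the rule as stated.

```python
import re


def tokenize(text):
    """Return the list of normalized words in `text`.

    Assumption: ignored characters are removed outright, and words are
    delimited by whitespace only.
    """
    # Drop everything except ASCII letters, digits, "-" and whitespace.
    cleaned = re.sub(r"[^A-Za-z0-9\-\s]", "", text)
    # Uppercase so the comparison is case-insensitive, then split on whitespace.
    return cleaned.upper().split()


print(tokenize("This apple is red."))
print(tokenize("It's a well-known fact!"))
```

Under these assumptions a hyphen standing alone between spaces would survive as a one-character word, so that case may need its own rule.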
Examples:
text 1: "THIS APPLE IS RED"
text 2: "THIS IS RED"
edit distance = 1
(1 word deletion)
text 1: "THIS IS RED"
text 2: "THIS APPLE IS RED"
edit distance = 1
(1 word insertion)
text 1: "THIS APPLE IS RED"
text 2: "THIS CHERRY IS RED"
edit distance = 1
(1 word substitution)
text 1: "THIS APPLE IS RED"
text 2: "APPLE RED"
edit distance = 2
(2 word deletions)
text 1: "THIS APPLE IS RED"
text 2: "THIS RED APPLE IS GOOD"
edit distance = 1.2
(1 block shift + 1 word insertion)
text 1: "THIS APPLE IS RED"
text 2: "THIS RED IS APPLE"
edit distance = 0.4
(2 block shifts)
text 1: "THIS APPLE IS RED. THE SKY IS BLUE."
text 2: "THE SKY IS BLUE. THIS APPLE IS RED."
edit distance = 0.2
(1 block shift)
text 1: "THIS APPLE IS RED. THE SKY IS BLUE. THIS TABLE IS BROWN. BLA BLA BLA."
text 2: "THIS TABLE IS BROWN. THIS APPLE IS RED. BLA BLA BLA. THE SKY IS BLUE."
edit distance = 0.4
(2 block shifts)
text 1: "THIS APPLE IS RED. THE SKY IS BLUE. THIS TABLE IS BROWN. BLA BLA BLA."
text 2: "THIS TABLE IS BROWN. THIS APPLE IS RED. BLA. THE SKY IS BLUE."
edit distance = 2.4
(2 block shifts + 2 word deletions)
text 1: "THIS APPLE IS RED. THE SKY IS BLUE. THIS TABLE IS BROWN. BLA BLA BLA."
text 2: "THIS HAIR IS BROWN. THIS APPLE IS RED. BLA. THE SKY IS BLUE."
edit distance = 3.4
(2 block shifts + 2 word deletions + 1 word substitution)
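For the examples above that involve no block shift, the expected results match the standard word-level Levenshtein distance. The sketch below shows that baseline in Python, for illustration only. The function name `word_edit_distance` is hypothetical. It does not handle block shifts, so for texts that need shifts it only gives an upper bound on the distance defined here. It keeps two rows of the table at a time, so memory grows with the length of the second text only, while running time grows with the product of the two word counts.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between word lists `a` and `b`.

    Insertion, deletion and substitution cost 1 point each.
    Block shifts are NOT considered (baseline only).
    """
    # prev[j] = distance between the words of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty text takes i deletions
        for j, word_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if word_a == word_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# The four shift-free examples from the list above.
assert word_edit_distance("THIS APPLE IS RED".split(), "THIS IS RED".split()) == 1
assert word_edit_distance("THIS IS RED".split(), "THIS APPLE IS RED".split()) == 1
assert word_edit_distance("THIS APPLE IS RED".split(), "THIS CHERRY IS RED".split()) == 1
assert word_edit_distance("THIS APPLE IS RED".split(), "APPLE RED".split()) == 2
```

A complete solution would have to combine this with a search for shifted blocks, which is the hard part of the job.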
Requirements:
- it must be fast (less than 1 minute to calculate the edit distance between two completely different 100KB text files)
- it must work with text files of any length
Escrow offered.
Demo appreciated.