Do not write code to solve this problem. Be clear about what condition is being tested. You do not need to provide sample input and output for each case.
The program description includes a step called “Normalize the document”. Do not write test cases for this normalization step; there are a lot of things to consider there.
Program for which to create test cases
In the area of information management, we might want to find documents that are similar to one another to help people find relevant information. More specifically, given a set of documents D and another query document d, we want to find the document in D that is most similar to d.
A key question to ask is “What does it mean for two documents to be similar?”
For this problem, we will consider cosine similarity. The underlying idea is that two documents are similar if they use a similar set of words at similar frequencies. Given two documents d1 and d2, we convert the words in d1 and d2 into vectors (x1, x2, x3, …, xn) and (y1, y2, y3, …, yn) respectively. We then use the cosine of the angle between these two high-dimensional vectors as a measure of the similarity between the two documents. That measure will have a value between -1 and 1.
Here is how you convert a document into a vector:
1. Normalize the document:
a. Extract only the words of the document (no punctuation, no spaces, all lower case)
b. Throw away all of the really common words like articles (a, an, the, of, …)
c. Convert all verbs to their unconjugated form
d. Make every word singular
2. Count the number of occurrences of each word in the document
3. Create an order on the words in the document w1, w2, …, wk
4. Define the vector (x1, x2, x3, …, xk) for the document, where xi is the number of occurrences of word wi in the document.
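As an illustration only, steps 2 to 4 can be sketched in Python as follows. The function name document_vector is hypothetical, the input is assumed to be a list of words that has already been normalized (step 1 is deliberately left out), and alphabetical order is just one possible choice for the ordering in step 3.

```python
from collections import Counter


def document_vector(words):
    """Build the count vector for one document (steps 2 to 4).

    `words` is assumed to be an already-normalized list of words.
    """
    counts = Counter(words)                  # step 2: occurrences of each word
    ordered = sorted(counts)                 # step 3: alphabetical order (an assumption)
    vector = [counts[w] for w in ordered]    # step 4: xi = occurrences of wi
    return ordered, vector


ordered, vector = document_vector(["cat", "chase", "mouse", "cat"])
print(ordered)  # ['cat', 'chase', 'mouse']
print(vector)   # [2, 1, 1]
```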
If you are comparing two documents, then the set and order of the words in their vectors must be the same. Consequently, when dealing with 2 or more documents, we usually extract the words from all of the documents, create a master set of words, order the master set of words, and use that master set to define the vector of all documents.
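A minimal sketch of the master-set idea, again assuming already-normalized word lists and alphabetical ordering. The name shared_vectors is hypothetical. A word that does not occur in a given document simply gets a count of 0 in that document's vector.

```python
from collections import Counter


def shared_vectors(documents):
    """Return (master word list, one count vector per document).

    `documents` is assumed to be a list of already-normalized word lists.
    """
    # Master set: every word that appears in any of the documents.
    master = sorted(set().union(*documents))
    vectors = []
    for words in documents:
        counts = Counter(words)
        # A Counter returns 0 for words that this document does not contain.
        vectors.append([counts[w] for w in master])
    return master, vectors


master, vectors = shared_vectors([
    ["cat", "chase", "mouse"],
    ["dog", "chase", "cat", "cat"],
])
print(master)   # ['cat', 'chase', 'dog', 'mouse']
print(vectors)  # [[1, 1, 0, 1], [2, 1, 1, 0]]
```

Both vectors have the same length and the same word at each position, which is what the comparison requires.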
To find the cosine of the angle between two vectors, we use the dot product of two vectors. If we have vectors a and b, one formula for the dot product a . b is
a . b = || a || * || b || * cos( θ )
where || a || and || b || are the lengths of vectors a and b and θ is the angle between the vectors. We can rearrange the formula to be
cos( θ ) = ( a . b ) / ( || a || * || b || )
Two other formulas now help us. Given a = (a1, a2, …, an) and b = (b1, b2, …, bn), another formula for the dot product is
a . b = a1*b1 + a2*b2 + … + an*bn
and the length of a is
|| a || = sqrt( a1^2 + a2^2 + … + an^2 )
We now have all we need to calculate the cosine of the angle between a and b.
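Purely as an illustration of how these formulas combine, here is a short Python sketch. The names cosine_similarity and most_similar are hypothetical. Two behaviors are assumptions of this sketch rather than part of the description above: it raises an error for a zero-length vector (where the cosine is undefined), and it returns the first document when several share the highest score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must be defined over the same ordered words")
    dot = sum(x * y for x, y in zip(a, b))      # a1*b1 + a2*b2 + ... + an*bn
    norm_a = math.sqrt(sum(x * x for x in a))   # || a ||
    norm_b = math.sqrt(sum(y * y for y in b))   # || b ||
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as an error, since the
        # division below would otherwise be by zero.
        raise ValueError("cosine is undefined for a zero-length vector")
    return dot / (norm_a * norm_b)


def most_similar(query, vectors):
    """Index of the vector in `vectors` most similar to `query`."""
    # max() keeps the first index when several vectors tie for the best score.
    return max(range(len(vectors)),
               key=lambda i: cosine_similarity(query, vectors[i]))


print(round(cosine_similarity([1, 1, 0, 1], [2, 1, 1, 0]), 4))  # 0.7071
print(most_similar([1, 0], [[0, 1], [2, 0], [1, 1]]))           # 1
```

In the first call the dot product is 3 and the lengths are sqrt(3) and sqrt(6), so the result is 3 / sqrt(18), roughly 0.7071.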