You will create a script called tfidf.m that performs the following tasks:
1. Creates a cell array dict that contains all the terms in your collection of documents. [HINT: just use the code you wrote in lab 2 for this]
2. Creates a T × D matrix tdfm, where T is the number of terms in your dictionary and D is the number of documents. Each column of tdfm corresponds to the normalized term frequencies of the corresponding document. [HINT: use the function termfrequencies to get the unnormalized frequencies and norm to calculate the norm (length) of a vector.]
3. Computes the document frequency vector df and, from this, the inverse document frequency vector idf. [HINT: Let M be an m × n matrix. The MATLAB command sum(M>0,2) computes an m × 1 vector that contains the number of positive elements of each row in M; for a matrix of counts, this is the number of non-zero elements. This is closely related to df!]
You will now write a function for performing ranked retrieval queries. Open a new script file, type in
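The three steps above can be illustrated on a toy example. This is only a sketch under stated assumptions: the count matrix below is made up by hand rather than produced by termfrequencies, and idf is taken to be log(D / df), which is one common definition that this handout does not itself fix.

```matlab
% Toy unnormalized term counts (made up for illustration):
% 3 terms (rows) x 4 documents (columns).
counts = [2 0 1 0;
          0 3 0 0;
          1 1 1 1];

D = size(counts, 2);                 % number of documents

% Normalize each column to unit Euclidean length.
% (Assumes no document is empty; an all-zero column would divide by zero.)
tdfm = zeros(size(counts));
for d = 1:D
    tdfm(:, d) = counts(:, d) / norm(counts(:, d));
end

% Document frequency: in how many documents does each term occur?
df = sum(counts > 0, 2);             % T x 1 vector

% Inverse document frequency (ASSUMED definition: log(D / df)).
idf = log(D ./ df);
```

Note that a term occurring in every document (the third row here) gets an idf of zero under this definition, so it contributes nothing to a query score.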
function r = rankedquery(q, dict, idf, tdfm, N)
and save it under the default name rankedquery.m. This function takes as input:
1. a cell array q that contains the terms of our query,
2. the dictionary array dict,
3. the vector of idf values,
4. the term document frequency matrix tdfm,
5. and the number N of results that should be returned.
The result r is a vector containing the indices of the retrieved documents. The function should begin by converting the query terms into a document vector, in the same way as you computed this for the collection documents in the previous section. Then it should multiply each element of the query vector by the idf score for that term [HINT: use the MATLAB .* operator]. After this, the function should compute the dot product between the query vector and every document vector in the collection (i.e. every column in tdfm). Thankfully, you can do this with a single matrix multiplication. [HINT: Let M be an m × n matrix and u an m × 1 vector. Then the MATLAB command u'*M computes the dot product of u with every column of M.] Finally, the function should sort the scores from highest to lowest [HINT: use the sort command with the option 'descend'] and return just the first N results. That's it! You have now written your very own search engine. Try it out on a few queries.
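The scoring steps described above can be sketched on toy data as follows. This is a hedged illustration, not the reference solution: the dictionary, counts, and query are invented, idf is assumed to be log(D / df), and the query vector is built with a simple strcmp count instead of the termfrequencies function, whose exact interface is not given here. The body of your rankedquery function would follow the same steps.

```matlab
% Toy collection (invented for illustration).
dict   = {'apple'; 'banana'; 'cherry'};
counts = [2 0 1 0;
          0 3 0 0;
          1 1 1 1];
D    = size(counts, 2);
tdfm = counts * diag(1 ./ sqrt(sum(counts.^2, 1)));  % unit-length columns
idf  = log(D ./ sum(counts > 0, 2));                 % ASSUMED idf definition

q = {'banana', 'apple'};   % query terms
N = 2;                     % number of results wanted

% Step 1: turn the query into a normalized term vector over dict.
qv = zeros(numel(dict), 1);
for t = 1:numel(dict)
    qv(t) = sum(strcmp(dict{t}, q));   % occurrences of term t in the query
end
qv = qv / norm(qv);        % assumes at least one query term is in dict

% Step 2: weight each query element by the idf of its term.
qv = qv .* idf;

% Step 3: dot product with every document column in one multiplication.
scores = qv' * tdfm;       % 1 x D row vector of scores

% Step 4: sort from highest to lowest and keep the first N indices.
[~, order] = sort(scores, 'descend');
r = order(1:min(N, numel(order)));
```

The min(N, numel(order)) guard is a design choice of this sketch: it avoids an indexing error when N exceeds the number of documents.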