# Grouping similar sets algorithm

SITUATION:

I have a search engine. The search engine generates results when is searched for a keyword. What I need is to find all other keywords which generate similar results.

For example keyword k1 gives a result set R1 = {1,2,3,4,5,...,40}, which contains up to 40 document ids. I need to get a list of all other keywords K1 = {k1,...,kn} which generate results similar to what k1 generates.

The similarity function S(R1, R2) between two result sets R1 and R2 is computed as follows:

2 * (number of same elements both in R1 and R2) / ( (total number of elements in R1) + (total number of elements in R2) ). Example: Let R1 = {1,2,3} and R2 = {2,3,4,5}, then similarity S(R1, R2) = (2*|{2,3}|) / |{1,2,3}| + |{2,3,4,5}| = (2*2)/(3+4) = 4/7 = 0.57. Result sets R1 and R2 and similar is S(R1,R2) > t, where t is a threshold.

PROBLEM:

There are more than 200,000 keywords thus more than 200,000 result sets which need comparison. Right now it takes 4+ hours to compute all similarities in PHP script.

WHAT WE NEED:

We need a program (in C or Python for example) which runs in command line on Ubuntu Linux. The program gets an input of (keywords, result_set) pairs separated by newlines (see below). Then it computes and groups similar results sets and gives a list of keywords which give similar results when is searched for (see below).

The program should be able to analyse 200,000 (keyword,result_set) pairs within 30 minutes on an Apple MacBook laptop. Also the program should be able to report errors with error codes, if something goes wrong, like there is not enough memory of input is not correct.

Furthermore we should be able to change the similarity function and edit/compile the program source by ourselfs.

EXAMPLE:

Input (4 lines):

keyword | result set

k1 | 1;2;3;4;5

k2 | 1;2;3;8;9;10

k3 | 8;9;10

k4 | 11;12;13

Output (2 lines):

similar keywords

k1;k2

k2;k3

We'd like to get the program within 14 days after the project start.

Om arbejdsgiveren:
( 9 bedømmelser ) HENGELO OV, Netherlands

Projekt ID: #969937

## Tildelt til:

kevinxiaozi

Dear sir, It is Jaccard similarity measurement. I am strong in algorithm implementation and optimazation such as datamining, cryption, information retrieval, watermarking, AI, search and etc. I can use some data-st Flere

\$50 USD på 1 dag
(21 bedømmelser)
4.9

## 9 freelancers are bidding on average \$75 for this job

Appolon

\$150 USD in 10 dage
(2 bedømmelser)
1.9
Ashishaw

The time limit set here is just tentative? whereas the project can be delivered before that time. I have Programming expertise for about 6 years on algorithms and Kernel Development.

\$50 USD in 30 dage
(0 bedømmelser)
0.0
Sillybowen

It's not a difficult problem. I have 8 years experience of algorithm design. Please Consider my bids.

\$45 USD in 5 dage
(0 bedømmelser)
0.0
pkow

Your project is quite interesting and well described thus I'd like to make you an offer. I've just sent you a private message with some details.

\$150 USD in 5 dage
(0 bedømmelser)
0.0
baliganikhil

No problemo

\$50 USD in 0 dage
(0 bedømmelser)
0.0

\$100 USD in 10 dage
(0 bedømmelser)
0.0
wahisoufiane

dear sir, I've a very good experience on algorithm, C language and i work on ubuntu, i can do this job, so please give me a chance.

\$30 USD på 1 dag
(0 bedømmelser)
0.0
thanhdt

Hi, pls read my PM for you

\$50 USD in 2 dage
(0 bedømmelser)
0.0