Grouping similar sets algorithm


I have a search engine. The search engine generates results when is searched for a keyword. What I need is to find all other keywords which generate similar results.

For example keyword k1 gives a result set R1 = {1,2,3,4,5,...,40}, which contains up to 40 document ids. I need to get a list of all other keywords K1 = {k1,...,kn} which generate results similar to what k1 generates.

The similarity function S(R1, R2) between two result sets R1 and R2 is computed as follows:

2 * (number of same elements both in R1 and R2) / ( (total number of elements in R1) + (total number of elements in R2) ). Example: Let R1 = {1,2,3} and R2 = {2,3,4,5}, then similarity S(R1, R2) = (2*|{2,3}|) / |{1,2,3}| + |{2,3,4,5}| = (2*2)/(3+4) = 4/7 = 0.57. Result sets R1 and R2 and similar is S(R1,R2) > t, where t is a threshold.


There are more than 200,000 keywords thus more than 200,000 result sets which need comparison. Right now it takes 4+ hours to compute all similarities in PHP script.


We need a program (in C or Python for example) which runs in command line on Ubuntu Linux. The program gets an input of (keywords, result_set) pairs separated by newlines (see below). Then it computes and groups similar results sets and gives a list of keywords which give similar results when is searched for (see below).

The program should be able to analyse 200,000 (keyword,result_set) pairs within 30 minutes on an Apple MacBook laptop. Also the program should be able to report errors with error codes, if something goes wrong, like there is not enough memory of input is not correct.

Furthermore we should be able to change the similarity function and edit/compile the program source by ourselfs.


Input (4 lines):

keyword | result set

k1 | 1;2;3;4;5

k2 | 1;2;3;8;9;10

k3 | 8;9;10

k4 | 11;12;13

Output (2 lines):

similar keywords



We'd like to get the program within 14 days after the project start.

Evner: Algoritme, C programmering, Datasøgning, Databehandling

Se mere: what's an algorithm, what's algorithm, what is the algorithm, what is a programming algorithm, what is an algorithm in programming, what is an algorithm, what is algorithm in programming, what is algorithm, what is a algorithm in programming, what is a algorithm, what an algorithm, what algorithm, ubuntu programming, set of pairs, set algorithm, search algorithm in c, search algorithm example, search algorithm comparison, programming laptop, programming in ubuntu, programming errors, programming and algorithm, programming algorithm example, program algorithm example, program algorithm

Om arbejdsgiveren:
( 9 bedømmelser ) HENGELO OV, Netherlands

Projekt ID: #969937

Tildelt til:


Dear sir, It is Jaccard similarity measurement. I am strong in algorithm implementation and optimazation such as datamining, cryption, information retrieval, watermarking, AI, search and etc. I can use some data-st Flere

$50 USD på 1 dag
(21 bedømmelser)

9 freelancers are bidding on average $75 for this job


Please read the pm first.

$150 USD in 10 dage
(2 bedømmelser)

The time limit set here is just tentative? whereas the project can be delivered before that time. I have Programming expertise for about 6 years on algorithms and Kernel Development.

$50 USD in 30 dage
(0 bedømmelser)

It's not a difficult problem. I have 8 years experience of algorithm design. Please Consider my bids.

$45 USD in 5 dage
(0 bedømmelser)

Your project is quite interesting and well described thus I'd like to make you an offer. I've just sent you a private message with some details.

$150 USD in 5 dage
(0 bedømmelser)

No problemo

$50 USD in 0 dage
(0 bedømmelser)

I am ready.

$100 USD in 10 dage
(0 bedømmelser)

dear sir, I've a very good experience on algorithm, C language and i work on ubuntu, i can do this job, so please give me a chance.

$30 USD på 1 dag
(0 bedømmelser)

Hi, pls read my PM for you

$50 USD in 2 dage
(0 bedømmelser)