Develop a duplicate content checker that will work in the cloud

$1500-3000 USD

Closed
Posted about 7 years ago

Paid on delivery
Hello,

We need a tool to analyse large batches of files (100K to 200K files). Each file will be an article.

The tool must detect the following "sliding" n-grams: 3-grams, 4-grams, 5-grams and 6-grams.

Here is an example sentence: « de demander différents devis sur le site de Buuyers »

This sentence is in French. It is made up of seven 3-gram shingles:

1. de demander différents
2. demander différents devis
3. différents devis sur
4. devis sur le
5. sur le site
6. le site de
7. site de Buuyers

The tool must be able to detect all these n-grams by comparing each article to every other article in the batch of 100K to 200K articles we upload. It will then calculate the similarity ratio, based on the Jaccard index.

For example, if 2 articles have 105 3-grams in common and total 1026 words together, their similarity ratio is calculated this way:

105 / (1026 - 105) = 0.11 = 11% similarity between these 2 articles.

We must be able to add a list of stop-words in French and in English. These stop-words would then be excluded from the analysis. If, for example, "the", "this" and "is" are listed as stop-words, the sentence "this is the world champion" would count as only 2 words: "world" and "champion".

Once the comparison is done between all the articles, the tool must be able to extract the most different ones. We'll ask it to extract all the articles with a maximum similarity ratio of 15%, for example.

We must get a table showing how many articles we can extract for each maximum similarity ratio value, in increments of 1%. For example: 0 articles with a maximum similarity ratio of 0%, 1%, 2%, 3%, 4%, 5% and 6%, 17 articles at 7%, 196 articles at 8%, 1345 articles at 9%, 7635 articles at 10%, etc.
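The shingling, stop-word filtering, Jaccard ratio and extraction table described above could be sketched as follows. This is a minimal illustration, not the requested tool: all function names are hypothetical, the tokenizer (lowercasing and splitting on non-word characters) is an assumption, and the ratio is computed as the Jaccard index over sets of shingles, whereas the worked example in the posting divides by the combined word count minus the common n-grams. The all-pairs loop is brute force and would not scale to 200K articles as written.

```python
import re
from itertools import combinations


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on non-word characters, and drop stop-words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stop_words]


def shingles(words, n):
    """Return the set of sliding n-grams (as tuples) over a word list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard index of two shingle sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(docs, n=3, stop_words=frozenset()):
    """Map each document name to its highest ratio against any other document.

    Brute force over all pairs: acceptable for a demo only.
    """
    sets = {name: shingles(tokenize(text, stop_words), n)
            for name, text in docs.items()}
    best = dict.fromkeys(docs, 0.0)
    for x, y in combinations(docs, 2):
        ratio = jaccard(sets[x], sets[y])
        best[x] = max(best[x], ratio)
        best[y] = max(best[y], ratio)
    return best


def extraction_table(best, steps=range(0, 101)):
    """Count the articles whose maximum similarity is <= each 1% step."""
    return {pct: sum(1 for r in best.values() if r <= pct / 100)
            for pct in steps}
```

Calling `extraction_table(max_similarity(docs))` on a dictionary of `{filename: text}` gives the per-percent counts; the articles to keep at a given threshold are those whose entry in `max_similarity` is at or below it.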
A detailed view will show the most similar articles, with their common n-grams highlighted. Example of comparative view: [login to view URL]

The user would ask the tool to show articles having a similarity of 16%, for example, and the tool would then display random articles with this similarity ratio so that the user can inspect them.

We must get a report showing the most redundant shingles among all the articles, by n-gram size (3-grams, 4-grams, 5-grams and 6-grams), with a percentage indicating in how many articles each was found.

The tool must be able to automatically delete the articles whose similarity ratio is above a value that the user can set and adjust for each batch of files.

We must be able to load the batch of articles as zip archives, and to download the extracted articles in this format too. Let's say we upload 200K articles in 1 zip file and, after the analysis, want to keep the 8,736 articles with a maximum similarity ratio of 11%: we'd then get them in 1 zip file too.

The tool must show a counter indicating how long ago the analysis started and an estimate of the remaining time until completion. An email must be sent to several email addresses when the task is completed, or if there is any error or problem during the process.

Due to the nature of the analysis and the very large number of files to analyse, the tool must be able to fully use the resources of cloud computing and HPC. When running an analysis, we'll rent specific resources for a limited time, to get the analysis done as fast as possible. Bear in mind that the process mustn't take more than 1 day for 100K articles. We'll rent the necessary resources to make this possible, but the tool mustn't be the limiting factor. We might use 20 CPUs and 320 GB of RAM for one session, and 60 CPUs and 1 TB of RAM for another session, and the tool must then be 3x faster in the 2nd case.

When applying, please indicate your experience with the HPC platforms you might have used, and for what kind of applications.
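The report on the most redundant shingles could be built by counting, for each n-gram size, the number of articles in which every shingle occurs. The sketch below is one possible approach under stated assumptions: the function names and the `top` parameter are hypothetical, tokenization is a simple lowercase split on non-word characters, and stop-word removal is left out for brevity.

```python
import re
from collections import Counter


def shingles(text, n):
    """Set of sliding n-grams (as tuples) for one article."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def redundancy_report(articles, sizes=(3, 4, 5, 6), top=10):
    """For each n, list the most widespread n-grams.

    Each entry is (n-gram text, percentage of articles containing it).
    """
    report = {}
    for n in sizes:
        doc_freq = Counter()
        for text in articles:
            # shingles() returns a set, so an article counts once per n-gram
            doc_freq.update(shingles(text, n))
        report[n] = [(" ".join(gram), 100.0 * count / len(articles))
                     for gram, count in doc_freq.most_common(top)]
    return report
```

Counting per article rather than per occurrence matches the requirement of "a percentage indicating in how many articles they were found".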
Project ID: 13657295

About the project

43 proposals
Remote project
Active 7 years ago

43 freelancers are bidding an average of $2,401 USD for this job

Hello, sir. I am ranked in the top 5 on freelancer.com, as you can see on my profile: (https://www.freelancer.com/u/kchg.html). I'd like to discuss the project with you in detail. Kind Regards
$3,000 USD in 30 days
4.9 (478 reviews)
9.8

Dear Employer, I am developer Gang in China. I'm very interested in the project you recently posted. I'm a certified freelancer with over 600 good reviews from clients. I have a lot of experience in web development. I never disappoint my clients, and I'm able to lead your project to success and troubleshoot problems. I am sure my past results reflect this. I'm very excited to help make your task successful. Please feel free to contact me to discuss your job in depth.
$2,319 USD in 30 days
5.0 (853 reviews)
9.3

Hello RConseil! I'm a full-stack developer and I can help you do this task fast. I have already quoted a reasonable price. Please hire me! Thank you!
$2,429 USD in 30 days
5.0 (820 reviews)
8.7

Hi, I work towards providing reliable, relevant and robust IT solutions at the most competitive prices to my customers. I ensure 100% customer satisfaction, so let's start. Thanks
$1,546 USD in 40 days
4.9 (458 reviews)
8.2

Hi! I'm interested in this project. I have the highest reviews for jQuery (frontend) at freelancer.com and a good completion rate. I am an expert in PHP (Laravel, Yii, Symfony, CakePHP, CI, Zend) for the server side, and in integrating 3rd-party APIs (SOAP and RESTful). Note: Please reply if you are interested in my bid, and I'll let you know the cost and time for this project. My bid will change once we discuss the project and I'll quote you a reasonable price; the current bid is 75% of your maximum budget, which is not the right cost.
$2,250 USD in 20 days
5.0 (118 reviews)
7.8

Contact me. I can assist you. You can also check my portfolio: https://www.freelancer.com/u/micheal4299.html I also have experience in working on similar projects. I can get it done before deadline. Let me know if you are interested in working with me, I'll share more previous work. Thanks!
$1,500 USD in 15 days
5.0 (74 reviews)
7.5

I have good experience with web development, Ecommerce as well as Android and iOS. I can do your project easily. I am a fast coder and usually write bug-free code. I won about 35 competitions in algorithms and development. You can look at my resume in the portfolio section at http://freelancer.com/u/allenross356.html Please let me know if you would like to hire me.
$3,000 USD in 30 days
5.0 (153 reviews)
7.7

Hi! I am a professional programmer! I can do this project with the highest quality and satisfaction! Best regards!
$3,000 USD in 30 days
5.0 (40 reviews)
6.7

For the bid amount, I can provide the website version, Android application and iOS application with one year of support. I am available on all days to discuss and will provide 2 work updates per week (Tuesday and Friday). You can release the payment per milestone, after the milestone work is complete.
$1,500 USD in 44 days
5.0 (28 reviews)
6.0

A proposal has not yet been provided
$2,777 USD in 20 days
4.8 (82 reviews)
6.2

Hi, I have reviewed your requirements and I can do this job accordingly. We have extensive expertise in WordPress, Laravel, Node.js, React.js, CakePHP, CodeIgniter, Angular.js, Bootstrap, API integration, plugins, MySQL, JavaScript, HTML, jQuery, Magento, HTML5, the Yii framework, PSD to HTML and CSS, to name a few. We have built more than 200 websites in Magento and WordPress, including themes and customized themes. I am looking for a long-term working relationship with you. Looking forward to your positive response. For more references, please see my portfolio. Regards, Rajiv Sharma
$2,500 USD in 30 days
4.8 (31 reviews)
5.8

Hello, I'm an experienced developer who loves challenging problems. I'm good at algorithms and building large-scale programs. Here's how I can build this project: "Front" side: I'll use Laravel (PHP) to build an application where you can upload and download lists, view statuses, etc. Analyzing the text: I'll use C++/Python/NodeJS (I'll first test their performance on text and choose the best option) to analyze the article and calculate the n-grams. The program will take 1 single article and calculate the n-grams for it, then save the results in a cache (Redis list) so that the next time it has to calculate the n-grams for the same article, it doesn't need to re-analyze the whole article. Checking tool: I'll use NodeJS to write a multi-threaded checking tool, which grabs a pair of articles and analyzes it using the program described above. It will save the result of the comparison in Redis. Database side: Each article will have its own unique identifier. When the NodeJS checking tool saves the result of a comparison (which is the similarity ratio), it will append the two IDs and that will be the Redis key. The value will be the similarity ratio. So each record of a pair will be max 45 bytes in RAM. We'll have 20 billion pairs in the case of 200K articles, which is 838 GB of RAM in total (and can be reduced a lot!). Unfortunately I'm reaching the character limit, but I do have good ideas regarding memory optimization. Let's have a chat! Best, Nick.
$2,500 USD in 20 days
4.7 (17 reviews)
5.9

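The memory estimate in Nick's bid above can be checked with a few lines of arithmetic. This is only a sanity check of the quoted numbers: the 45 bytes per record comes from the bid, the function names are hypothetical, and the 838 figure is reproduced on the assumption that "GB" there means 2^30 bytes.

```python
def pair_count(n_articles):
    """Number of unordered article pairs: n * (n - 1) / 2."""
    return n_articles * (n_articles - 1) // 2


def store_size_gib(n_articles, bytes_per_pair=45):
    """Size of one record per pair, in GiB (2**30 bytes)."""
    return pair_count(n_articles) * bytes_per_pair / 2**30


pairs = pair_count(200_000)      # just under 20 billion pairs
size = store_size_gib(200_000)   # about 838 GiB at 45 bytes per pair
```

Storing only the pairs above a similarity threshold, rather than every pair, is one way such a store could shrink; the bid itself does not say which optimizations were intended.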
Dear Sir, I will develop each feature that you mentioned in your long description. I have a team of developers and designers, all with more than 2 years of experience. Please contact me so we can discuss it further. Thanks, Sushil
$1,500 USD in 30 days
4.9 (34 reviews)
5.2

Hello, We have been working through Freelancer for 7+ years. We are ranked in the top 3000 on freelancer.com and we hope it will give you an idea of our work quality and dedication. You will be safe while working with us. We have a dedicated in-house team for WordPress, PHP, CakePHP, Bootstrap, Magento, HTML, CSS, JavaScript, jQuery, AJAX, MySQL, PHP frameworks, APIs, PSD to HTML, Logo/banner design, SQL, JSP, ASP.NET, .NET, App Developer, App Designer, Apache, Websites Design and Development, Web Application Development, E-commerce Website Development, Marketplace Development, Web Portal Development, Custom Software and Plug-in Development, Web Applications Testing, Android Mobile App development, Game development, Unity3D, Multiplayer and all other IT skills. We will provide a proper proposal and timeline after you send us a message, as we can attach documents through PMB only. Please see our portfolio https://www.freelancer.com/u/Dilipjaipur.html?page=portfolio We can only quote a final price and time after a complete discussion with you; it may be less than the present bid amount and time. Thanks
$1,666 USD in 30 days
4.9 (10 reviews)
4.8

Hey there! We have seen your job post and are very interested in working with you. I have 15+ years of experience in Angular.js, Node.js, MongoDB, ASP.NET, C#, MSSQL, HTML, HTML5, .NET, MVC, MVC4, MVC5, CSS and related frameworks. 1 Responsive Design 2 SEO Friendly Website 3 Development in PHP|PHP5|YII|Laravel |Opencart| CODEIGNITER| MYSQL| Wordpress | ASP.NET |C#| MVC| MVC4| MVC5|MSSQL|AngularJS |Node.js |JQUERY|Android & IPhone Apps | SEO 4 ERP/CRM |ECommerce | CMS | Online Store | Stock Management | Payment Gateway Integration for International Transaction 5 MLM| VARIOUS MLM PORTAL| BINARY| MATRIX|LEVEL PLAN |BITCOIN|HYIP Some of the software projects and websites that I have undertaken and successfully delivered include: customized ERP system for schools and universities, management system for the tour and travel industry, real estate, ticket booking software, complete ecommerce solution, various MLM portals, bitcoin website, web application for a media company, booking software, various types of CRM/ERP. Please feel free to contact us for any further discussion. Hoping for a fruitful business and chance to serve you. Regards SJAK
$2,500 USD in 30 days
5.0 (4 reviews)
4.4

Hi, dear. I am a senior software developer. I have just checked your project description, and I am able to perform this task with my developer team. I am looking forward to your proposal...
$2,500 USD in 8 days
4.4 (18 reviews)
4.9

I don't understand why you are using that kind of computing resources/hardware. Maybe you are trying to solve the problem in a conventional/wrong way. I agree it is an HPC problem. But there are better and faster ways of doing this with a smaller, horizontally scalable stack, especially when we are talking about 'text analysis and string matching algorithms'. I propose to use Elasticsearch and write Python scripts that do the work and also expose the results for the front-end 'tool' to consume. The proposal is to design and set up an ES infrastructure and develop one script that can process 100K records in 10 minutes (actually I'm thinking we can bring it down to seconds for computing similarity ratios between 1 article and all other 'n' articles in the corpus off an n x n matrix, but I am not committing to that yet). The proposal does not include developing the other features of the tool. These can be taken up later in Phase 2 as a separate scope that anyone can build on. PS: I have experience with the Jaccard index and fuzzy matching in ES, which performs way better than even a multi-node Spark cluster.
$3,000 USD in 20 days
3.4 (1 review)
4.1

About the client

PARIS, France
4.2
36
Member since Jan 14, 2011