
Open
Ends in 3 days
Paid on delivery
I already have a working Python-based OCR pipeline that converts Tamil voter-list PDFs into Excel and then pushes the sheets to S3 for further processing. The PDFs are purely image-based. When I run the job in parallel on AWS today, the script sometimes skips entire voter entries and often mangles door numbers and other data. I need these two pain points eliminated and the whole flow hardened so it can run unattended across hundreds of constituency files. Keeping the AWS bill for extraction as low as possible is also required.

Your task is to review and refactor the existing code, tune the Tamil OCR (Tesseract, AWS Textract, or any library you find more accurate), modify the parsing logic, and make sure parallel execution on my current ECS setup completes without a single missed record. Once fixed, you will execute at scale, monitor the run, and hand back clean, fully populated Excel files.

Deliverables
• Revised and documented OCR/parse code
• One-click AWS deployment (Docker image + task definition)
• Successful full-dataset run with zero skipped voters and correct door numbers (mixed letter-number cases handled)
• Brief monitoring log and accuracy report for sign-off

The extraction must cost no more than 0.003 USD in AWS charges per PDF of 800 voters. This is the maximum allowed budget, and within it the extraction must skip no voters and reach 99% accuracy. The total extraction of 75,000 PDFs must be completed within 72 hours at most.
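
The budget and deadline figures in the description can be turned into per-run targets. The sketch below is only arithmetic on the stated numbers. The helper `min_workers` and its `seconds_per_pdf_measured` argument are hypothetical names introduced here, and the measured per-PDF time is something that would have to be benchmarked on the real pipeline.

```python
import math

# Figures taken from the job description.
PDF_COUNT = 75_000
MAX_COST_PER_PDF_USD = 0.003
VOTERS_PER_PDF = 800
DEADLINE_HOURS = 72

# Ceiling on AWS spend for the whole run.
total_budget_usd = PDF_COUNT * MAX_COST_PER_PDF_USD
# Ceiling on AWS spend per voter record.
cost_per_voter_usd = MAX_COST_PER_PDF_USD / VOTERS_PER_PDF
# Minimum sustained throughput needed to meet the deadline.
pdfs_per_hour = PDF_COUNT / DEADLINE_HOURS
# Wall-clock time available per PDF if a single worker did everything.
seconds_per_pdf_serial = DEADLINE_HOURS * 3600 / PDF_COUNT


def min_workers(seconds_per_pdf_measured: float) -> int:
    """Smallest number of parallel workers that still meets the deadline.

    seconds_per_pdf_measured is a hypothetical benchmark result:
    how long one worker needs for one PDF.
    """
    return math.ceil(seconds_per_pdf_measured / seconds_per_pdf_serial)


if __name__ == "__main__":
    print(f"total budget:       {total_budget_usd:.2f} USD")
    print(f"budget per voter:   {cost_per_voter_usd:.8f} USD")
    print(f"required rate:      {pdfs_per_hour:.1f} PDFs/hour")
    print(f"serial time budget: {seconds_per_pdf_serial:.3f} s/PDF")
    print(f"workers at 60 s/PDF: {min_workers(60)}")
```

The whole run has a ceiling of about 225 USD, and a single worker would have roughly 3.5 seconds per PDF, so any realistic per-PDF time implies parallel workers whose combined compute cost must still fit under that ceiling.
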
Project ID: 40269976
18 proposals
Open for bidding
Remote project
Active 14 hours ago
18 freelancers are bidding an average of ₹7,075 INR on this job

Hi, nice to meet you. (OCR expert) I have extensive experience extracting numeric and text data from PDFs and scanned images and converting it to CSV or TXT files automatically with Python. In this project, we have to perform these stages: preprocessing, segmentation, feature extraction, recognition, and postprocessing. Another method is to use an OCR engine or an ML engine. With that method, we first preprocess the image to make it clear. After that, we can use a pretrained model or train a model on custom data. Finally, we can save the result in various formats. I am confident about your project and can deliver a high-quality result. I will wait for your message to discuss the project in more detail. Thanks.
₹7,000 INR in 1 day
5.4

Hello, With a solid background in full‑stack Python and experience building robust OCR pipelines for government data, I understand the critical need for zero‑missed records in your voter‑list extraction. My approach will start with a thorough audit of the existing code, followed by selecting the most accurate OCR engine for Tamil text—likely a tuned Tesseract model or a Textract workflow optimized for image PDFs. I will refactor the parsing logic to reliably capture mixed alphanumeric door numbers, and implement deterministic retries to prevent data loss. The solution will be containerized and deployed to your ECS cluster with a single‑click task definition, ensuring seamless parallel execution. I will run a full‑dataset test, monitor throughput, and provide a concise accuracy report. I’m ready to deliver a production‑ready pipeline that meets the 0.003 USD per PDF budget and completes 75,000 PDFs within 72 hours. Let’s discuss how I can get this project running smoothly. Best Regards Naveen Thakur
₹1,500 INR in 1 day
5.1
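
The "deterministic retries" mentioned in this bid could look something like the sketch below. It is a minimal illustration, not the bidder's code: `run_with_retries` and its parameters are hypothetical names, the backoff schedule is fixed (no random jitter) so every run behaves the same way, and exhaustion raises an error instead of silently dropping the work item.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task up to `attempts` times with a fixed doubling backoff.

    The delay sequence is base, 2*base, 4*base, ... with no jitter, so the
    behaviour is reproducible. If every attempt fails, the last error is
    re-raised wrapped in RuntimeError so the failure is never silent.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test without waiting; a caller that gets `RuntimeError` would typically record the PDF in a failure list for later reprocessing rather than continue as if it had succeeded.
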

I will review and refactor the existing Python-based OCR pipeline to eliminate skipped voter entries and data mangling, tune the Tamil OCR for 99% accuracy, and optimize the AWS deployment for $0.003 per PDF, ensuring zero skipped voters and correct door numbers, completing the extraction of 75,000 PDFs within 72 hours. Waiting for your response in chat! Best Regards.
₹8,499 INR in 3 days
4.9

Hi, I’m a seasoned Applied ML Engineer (6+ YOE) and I can build high-volume OCR + parsing pipelines with zero-miss extraction at the lowest cloud cost.

How I’ll reduce the AWS bill while improving accuracy:
>> Replace expensive per-page OCR with Paddle PP-OCRv4/v5 Tamil (onnxruntime). ONNX gives superior accuracy with faster CPU inference, enabling spot/on-demand ECS at predictable cost, often far cheaper than Textract at scale
>> Add 2-stage OCR: fast ONNX OCR for all pages, and fall back to Tesseract only for low-confidence regions, which keeps accuracy high while controlling spend

Throughput plan: batch + pipeline parallelism, CPU ONNX with autoscaling to hit 75k PDFs in < 30 hrs

Deliverables:
>> Clean, documented refactor + deterministic parsing (no skipped voters)
>> Docker image + ECS task definition (1-click deploy)
>> Runbook + monitoring logs + accuracy report (including door-number precision)

Relevant experience:
>> Refactored image-PDF OCR systems (government forms) to stop skipped records by fixing page segmentation, row/box detection and concurrency-safe parsing
>> Built ID extraction with mixed letter-number patterns using post-OCR correction (regex + language-aware cleanup + confidence rules)
>> Optimized AWS pipelines (ECS/Lambda/S3) to cut OCR costs by moving from paid OCR calls to self-hosted ONNX inference + smart retry/QC

I can start by reviewing your repo + a small sample set (10–20 PDFs) and then scale to a full run with strict QA gates. I can provide everything in less than 2 days.
₹7,500 INR in 2 days
4.2
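
The two-stage idea in this bid (a fast first pass everywhere, a second engine only for low-confidence regions) can be sketched independently of any particular OCR library. Everything below is hypothetical: the `Region` type, the `threshold` of 0.85, and the `fallback` callable stand in for whatever engines and confidence scale the real pipeline uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    region_id: str
    text: str
    confidence: float  # assumed to be normalised to 0.0-1.0


def two_stage_ocr(
    first_pass: List[Region],
    fallback: Callable[[Region], Region],
    threshold: float = 0.85,
) -> List[Region]:
    """Keep confident first-pass results; re-run only the weak regions.

    The fallback result replaces the first-pass result only when it is
    more confident, so a second pass can never make a region worse by
    this measure. Every input region yields exactly one output region.
    """
    results: List[Region] = []
    for region in first_pass:
        if region.confidence >= threshold:
            results.append(region)
            continue
        second = fallback(region)
        results.append(second if second.confidence > region.confidence else region)
    return results
```

Because the function returns one result per input region, the region count before and after is a cheap invariant to assert when checking that nothing was dropped between the two passes.
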

I understand you require a robust solution to optimize your Python-based OCR pipeline that converts Tamil voter-list PDFs into Excel while ensuring zero skipped records and accurate door numbers during parallel AWS ECS execution. You also need the entire process hardened to run unattended across hundreds of constituency files within a strict cost and time limit. With over 15 years of experience and more than 200 completed projects, I specialize in Python, AWS, Docker, and OCR workflows, making me well-suited to tackle your Tamil voter list extraction challenges. My background in cloud automation and performance tuning aligns perfectly with your need for cost-effective, scalable data extraction using tools like Tesseract or AWS Textract. I will start by thoroughly reviewing and refactoring your existing code to fix the parsing logic and improve OCR accuracy, focusing on mixed letter-number door entries. I will containerize the solution with a one-click AWS ECS deployment, optimize parallel processing to avoid skipped records, and monitor a full-scale run of 75,000 PDFs within 72 hours. A detailed accuracy report and logs will be delivered upon completion. Let’s discuss your current pipeline and how I can help make this extraction process flawless and cost-efficient.
₹1,650 INR in 7 days
2.0

Hi there, I will review and refactor your Tamil OCR pipeline immediately. Since you need zero skipped voter records, accurate parsing of mixed letter-number door numbers, and ultra-low AWS extraction cost, I will optimize your existing Python code, tune Tesseract or evaluate Textract versus custom preprocessing, and harden the parsing logic while ensuring stable parallel execution across ECS with guaranteed record completeness. The final delivery will include revised and documented OCR and parsing code, a production-ready Docker image and ECS task definition for one-click deployment, full-dataset execution, a monitoring log, and a detailed accuracy validation report for sign-off. Q) Are you currently using native Tesseract with traineddata for Tamil only or a custom fine-tuned model, and can you share one problematic sample PDF for benchmarking accuracy improvements? I am ready to start now. Please share the details so I can review them. Best Regards, Usama F
₹5,999 INR in 4 days
2.1

⭐If you want, I can show you my recent OCR project⭐
Availability: Immediate | Focus: Robust and Scalable Python OCR

Hi, I can review and optimize your Tamil voter-list PDF OCR pipeline to work reliably on AWS ECS, fixing the following issues:
• Skipped voters: ensure no rows are skipped, even in parallel execution
• Poorly parsed data: ensure door numbers and alphanumeric characters are correctly extracted

My work proposal:
• Analyze and refactor existing code with a focus on reliability and secure parallelization
• Evaluate and fine-tune the OCR engine: Tesseract, AWS Textract, or another more accurate engine for Tamil
• Improve parsing logic to handle letter and number combinations in all fields
• Run a full test at scale, first with one PDF of 800 records and then with the entire 75,000-PDF dataset
• Optimize AWS costs: < $0.003 per PDF, meeting the budget limit
• Monitor and report on accuracy (minimum 99% accuracy)

Deliverables:
• Revised and documented Python code, ready for production
• Docker image + ECS task definition for one-click execution
• Final Excel spreadsheet with all correct records, along with monitoring and accuracy reports

With my experience in complex OCR pipelines and AWS ECS, I guarantee a reliable, cost-effective execution, ready to process the entire dataset within 72 hours. If you'd like, I can prepare a mini-test of 1–2 PDFs to validate accuracy before mass execution, ensuring no data is lost.
₹10,000 INR in 3 days
1.4

I will refactor and harden your existing Python OCR pipeline by optimising Tamil recognition (benchmarking Tesseract with tuned language packs against Amazon Textract for cost/accuracy), redesigning the parsing layer to eliminate skipped voter blocks and correctly handle mixed door-number formats, and making the workflow idempotent and fault-tolerant for parallel execution on Amazon ECS. The solution will include a production-grade Docker image, one-click task deployment, structured logging, and automated validation checks to guarantee zero missed records with ≥99% field accuracy while staying within the $0.003/PDF budget. Finally, I will execute the full 75,000-PDF batch within 72 hours, monitor throughput and cost, and deliver clean Excel outputs, S3 artifacts, and a concise accuracy + monitoring report for sign-off.
₹15,000 INR in 2 days
0.4
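
One common way to make a parallel workflow idempotent, as this bid proposes, is to derive the output location deterministically from the input and skip work whose output already exists. The sketch below uses a plain dict as a stand-in for the object store; `output_key_for`, `process_once`, and the key layout are hypothetical and not taken from the existing pipeline.

```python
from typing import Callable, MutableMapping


def output_key_for(pdf_key: str) -> str:
    """Map an input PDF key to exactly one output key (hypothetical layout)."""
    stem = pdf_key.rsplit(".", 1)[0]
    return f"output/{stem}.xlsx"


def process_once(
    pdf_key: str,
    store: MutableMapping[str, bytes],
    extract: Callable[[str], bytes],
) -> bool:
    """Process a PDF unless its output already exists.

    Returns True if extraction ran, False if it was skipped. Running the
    same key twice (a retry, or two workers receiving the same message)
    therefore leaves exactly one output and does not repeat the OCR cost.
    """
    out_key = output_key_for(pdf_key)
    if out_key in store:
        return False
    store[out_key] = extract(pdf_key)
    return True
```

With a real object store the existence check and the write are two separate calls, so two workers can still race; the deterministic key at least guarantees that they overwrite the same object instead of producing duplicates.
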

We are pleased to submit this proposal for the development of your software solution. Our team specializes in designing, developing, and deploying scalable, secure, and user-friendly applications tailored to business needs. We understand that your goal is to build a reliable and efficient system that improves operations, enhances user experience, and supports future growth. Our approach ensures high-quality development, timely delivery, and ongoing support.

2. Project Understanding
Based on the provided requirements, the project involves:
• Designing and developing a custom software application
• Creating a responsive and intuitive user interface
• Backend development and database integration
• API integrations (if required)
• Testing, deployment, and post-launch support

We will follow an agile development methodology to ensure flexibility and continuous improvement throughout the project lifecycle.
₹9,999 INR in 7 days
0.0

I am a perfect fit for your project because I understand the need to eliminate skipped voter entries and data mangling in your OCR pipeline while ensuring a clean, professional, and user-friendly workflow. Your requirement for seamless, automated, and cost-optimized extraction at scale on AWS aligns perfectly with my expertise. I specialize in Python, OCR technologies including Tesseract and AWS Textract, and AWS ECS deployment. While I am new to Freelancer, I have extensive experience, have completed other projects off-site, and have a good team behind me covering everything. I would love to chat more about your project! Regards, Justin davis
₹9,400 INR in 14 days
0.0

With a decade of experience in the digital publishing field, I have perfected the art of extracting and converting complex data accurately. Your project aligns seamlessly with my skillsets as it requires a robust OCR process that ensures zero skipped voters and correct door numbers, which I have excelled at throughout my career. Furthermore, my expertise in leveraging technologies like Tesseract and AWS Textract will ensure we utilize the most accurate tools for Tamil OCR. I understand the importance of unattended operation for large-scale projects, such as yours. My proficiency in Python will enable me to review and refactor your existing code, optimizing it for parallel execution on your current ECS setup while eliminating any chances of missing voter entries. I'll further guarantee an efficient one-click AWS deployment through a carefully designed Docker image and task definition. To tackle your concerns about cost optimization, I pledge to handle the extraction within the prescribed budget of 0.003 USD per pdf of 800 voters while maintaining a high accuracy rate. Additionally, with an aim for timely delivery, I commit to completing your project within 72 hours – ensuring you have 75,000 PDFs with clean, fully populated Excel files at hand. Trust me to provide you with regular monitoring logs and an accuracy report for your satisfactory sign-off.
₹7,000 INR in 7 days
0.0

I am interested in this project. I can help you convert Tamil voter-list PDFs into Excel accurately. I have experience with data entry and can ensure the final sheets are perfectly formatted. I am ready to start immediately.
₹7,000 INR in 7 days
0.0

Hi, I’ve read the issue carefully. The symptoms you describe don’t point to general OCR failure; they point to entries being dropped at the hand-off between text detection and parsing. Electoral PDFs often keep the same visual layout while slightly changing character spacing, line breaks, or mixed Tamil/number patterns, which causes rules that work on some pages to silently skip others. For this project I’m proposing a focused reliability fix, not a full pipeline rebuild.

**What I will do**
• Trace a few sample PDFs through your current extraction → parsing → output flow
• Identify the exact condition causing voter rows to be skipped
• Adjust detection/parsing rules so those records are captured instead of discarded
• Validate on a representative sample batch and show before/after results

**What this covers**
Stabilizing extraction for the provided template so records are no longer missed during normal runs.

**What this does not include**
Large-scale processing, infrastructure redesign, or ongoing monitoring. Those can be handled separately if needed after the fix is verified.

If you can share 2–3 sample PDFs and their current output, I’ll confirm expected improvement before starting. This approach keeps the task fast, predictable, and within budget while solving the actual data-loss issue.
₹3,299.99 INR in 2 days
0.0
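
The failure mode this bid describes, where parsing rules silently skip rows that do not match, can be avoided by routing every non-matching line to a review list instead of discarding it. The sketch below is illustrative only: the `ROW` pattern (a serial number, whitespace, then the rest of the entry) is a hypothetical simplification and not the real voter-list layout.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical row shape: serial number, whitespace, remaining fields.
ROW = re.compile(r"^\s*(?P<serial>\d+)\s+(?P<rest>\S.*)$")


def parse_rows(lines: List[str]) -> Tuple[List[Dict], List[Tuple[int, str]]]:
    """Parse OCR text lines without ever dropping one silently.

    Returns (parsed, rejected). Every non-blank input line ends up in
    exactly one of the two lists; rejected entries keep their 1-based
    line number so they can be traced back to the page.
    """
    parsed: List[Dict] = []
    rejected: List[Tuple[int, str]] = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines carry no record
        match = ROW.match(line)
        if match:
            parsed.append(
                {"serial": int(match["serial"]), "rest": match["rest"].strip()}
            )
        else:
            rejected.append((line_no, line))
    return parsed, rejected
```

The useful property is the invariant that parsed plus rejected equals the number of non-blank lines, which a run can assert per page to detect any rule that drops data.
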

Hi, I’ve worked on large-scale OCR pipelines under strict accuracy and cost constraints, and your issue is solvable with the right mix of OCR tuning and deterministic parsing. The skipped voters and door-number corruption likely stem from segmentation errors under parallel load and weak post-processing rules. I’ll refactor the pipeline to enforce record-boundary validation, add structured reconciliation checks, and tune Tamil OCR (custom Tesseract training + selective Textract fallback only where needed) to stay within your $0.003/PDF cap. I’ll also stabilize ECS concurrency to guarantee zero missed records across 75,000 PDFs within 72 hours. Before I proceed — are the PDFs uniform across constituencies or do layouts vary slightly?
₹7,000 INR in 7 days
0.0
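
A reconciliation check of the kind this bid mentions could compare the serial numbers that were extracted against the count each PDF is expected to contain. The sketch below assumes, as a labeled assumption, that serials run from 1 to N within a PDF and that N is known (for example 800, or a total read from the document); `reconcile` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Union


def reconcile(
    serials: Iterable[int], expected_count: int
) -> Dict[str, Union[List[int], bool]]:
    """Compare extracted serial numbers with the expected range 1..N.

    Reports serials that are missing, seen more than once, or outside
    the expected range. "ok" is True only when all three lists are empty.
    """
    seen = set()
    duplicates = set()
    for serial in serials:
        if serial in seen:
            duplicates.add(serial)
        seen.add(serial)
    expected = set(range(1, expected_count + 1))
    missing = sorted(expected - seen)
    unexpected = sorted(seen - expected)
    return {
        "missing": missing,
        "duplicates": sorted(duplicates),
        "unexpected": unexpected,
        "ok": not (missing or duplicates or unexpected),
    }
```

A PDF whose report is not `ok` would be held back for reprocessing instead of being written to the final Excel output, which turns "zero skipped voters" into a condition the run can check automatically.
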

The skipped entries and mangled door numbers are almost certainly two separate issues. Skips usually come from Tesseract losing the voter-block boundaries when page layout varies between constituencies. Door number corruption is a classic mixed-alphanumeric problem where OCR confuses "1" with "l" or "0" with "O".

Here's what I'd do:
1. Preprocessing: adaptive binarization (Sauvola instead of Otsu) + deskew per page. This alone usually fixes 30-40% of missed blocks.
2. Voter block detection: template matching or contour-based segmentation to find each voter entry before OCR, so nothing gets skipped regardless of layout.
3. Door numbers: post-OCR regex validation against known Tamil voter list formats, plus confidence thresholds to flag ambiguous characters for a second pass.
4. Cost: stick with Tesseract (tam lang pack) instead of Textract where possible. Textract is 10x the cost per page. Only fall back to Textract for pages where Tesseract confidence drops below threshold.
5. ECS hardening: retry logic per PDF, dead-letter queue for failures, progress tracking to S3.

I've done similar OCR pipelines with pytesseract and OpenCV for structured documents. Quick question: are the voter list PDFs all the same format across constituencies, or do layouts vary between them?
₹3,500 INR in 4 days
0.0
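
The door-number step in this bid (regex validation plus flagging ambiguous characters for a second pass) can be sketched as a small classifier. The `DOOR` pattern below is a hypothetical format (digits, an optional letter, optional `/` or `-` separated parts) and not a documented voter-list standard; the ambiguous set contains only the two look-alike glyphs the bid names.

```python
import re

# Hypothetical door-number shape, e.g. "12", "12A", "4/7B", "10-2".
DOOR = re.compile(r"\d+[A-Za-z]?(?:[/-]\d+[A-Za-z]?)*")

# Glyphs the bid names as commonly confused with the digits 1 and 0.
AMBIGUOUS = frozenset("lO")


def classify_door_number(raw: str) -> str:
    """Classify an OCR'd door number without guessing a correction.

    Returns one of:
      "empty"       - nothing was read
      "second_pass" - contains a look-alike glyph; re-OCR before trusting
      "ok"          - matches the assumed format
      "review"      - anything else; route to manual or secondary checks
    """
    value = raw.strip()
    if not value:
        return "empty"
    if any(ch in AMBIGUOUS for ch in value):
        return "second_pass"
    if DOOR.fullmatch(value):
        return "ok"
    return "review"
```

The sketch deliberately flags instead of auto-correcting: since door numbers legitimately mix letters and digits, rewriting every "O" to "0" would corrupt some valid values, so ambiguous cases go to a second pass.
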

Chennai, India
Member since Jan 23, 2026