
Completed
Posted
Paid on delivery
I’d like your help pushing a small language model toward record-level performance on the ARC-AGI benchmark by reproducing, and hopefully surpassing, the Evolutionary Test-Time Compute approach outlined in Jeremy Berman’s posts ([login to view URL] and follow-up). My main goal is stronger generalization, not just memorized accuracy, so the solution must show consistent gains across the public and hidden splits.

Everything will run in Python. You can pick whatever supporting libraries you like (PyTorch, JAX, NumPy, Hugging Face, etc.) as long as setup stays lightweight and the code is clearly documented.

Core scope
• Implement evolutionary prompt search: generate, mutate, rank, and select prompts on-the-fly against ARC tasks.
• Automate evaluation: the scoring script should mirror ARC’s official rubric so results are directly comparable to the leaderboard.
• Track sample efficiency: log compute time, number of queries, and score progression so we can see where improvements come from.
• Deliver reproducible runs: a single command should download the data, load the model, run the search, and print the final score.

Acceptance criteria
• Re-running the notebook with y keys reproduces the accuracy numbers.
• Plots clearly show default vs tuned accuracy.
• Codebase installs with `pip install -r [login to view URL]` and runs on a single GPU.
• Clear README explaining how to tweak evolutionary parameters for further experimentation.
• Document each experiment’s steps clearly.

The sooner we can iterate, the better. I’m ready to review progress immediately and will test each checkpoint as you push commits.
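The generate, mutate, rank, and select loop from the core scope can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Jeremy Berman's method or the official ARC scorer. The "model" is a toy stand-in in which each candidate prompt is just the name of a grid transform, fitness is exact grid equality on the training pairs, and `TOY_TRANSFORMS`, `fitness`, `mutate`, `SearchLog`, and `evolve` are all hypothetical names introduced here.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable

Grid = list[list[int]]

# ASSUMPTION: toy stand-in for a language model. Each "prompt" names a transform.
# A real run would send the prompt plus the ARC task to the model instead.
TOY_TRANSFORMS: dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: [row[:] for row in g],
    "transpose": lambda g: [list(col) for col in zip(*g)],
    "flip": lambda g: [row[::-1] for row in g],
}


def fitness(prompt: str, train_pairs: list[tuple[Grid, Grid]]) -> float:
    """Fraction of training pairs reproduced exactly (all-or-nothing per grid)."""
    solve = TOY_TRANSFORMS[prompt]
    hits = sum(solve(inp) == out for inp, out in train_pairs)
    return hits / len(train_pairs)


def mutate(prompt: str, rng: random.Random) -> str:
    """Toy mutation: replace the candidate with a randomly chosen transform name."""
    return rng.choice(sorted(TOY_TRANSFORMS))


@dataclass
class SearchLog:
    """Sample-efficiency counters: evaluations, wall time, best score per generation."""
    evaluations: int = 0
    seconds: float = 0.0
    best_per_generation: list[float] = field(default_factory=list)


def evolve(seeds, train_pairs, generations=3, population_size=4, survivors=2, seed=0):
    """Rank candidates, keep the top `survivors`, and refill with mutated children."""
    rng = random.Random(seed)  # fixed seed so a rerun repeats the same search
    log = SearchLog()
    start = time.perf_counter()
    population = list(seeds)
    ranked = []
    for _ in range(generations):
        # Rank: score every candidate; the stable sort keeps earlier candidates on ties.
        ranked = sorted(
            ((fitness(p, train_pairs), p) for p in population),
            key=lambda pair: pair[0],
            reverse=True,
        )
        log.evaluations += len(population)
        log.best_per_generation.append(ranked[0][0])
        # Select: the best candidates survive unchanged (elitism).
        parents = [p for _, p in ranked[:survivors]]
        # Generate and mutate: children are mutated copies of random parents.
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    log.seconds = time.perf_counter() - start
    best_score, best_prompt = ranked[0]
    return best_prompt, best_score, log


if __name__ == "__main__":
    pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # target: mirror each row
    prompt, score, log = evolve(["identity"], pairs)
    print(prompt, score, log.evaluations, log.best_per_generation)
```

Because survivors are carried over unchanged, the best score per generation cannot decrease, which makes the logged progression easy to plot. Survivors are re-scored every generation here for simplicity; caching their scores would reduce the evaluation count.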
Project ID: 40191191
66 proposals
Remote project
Active 2 mos ago
66 freelancers are bidding on average $367 USD for this job

Hello, I trust you're doing well. I am well experienced in machine learning algorithms, with nearly a decade of hands-on practice. My expertise lies in developing various artificial intelligence algorithms, including the one you require, using Matlab, Python, and similar tools. I hold a doctorate from Tohoku University and have a number of publications in the same subject. My portfolio, which showcases my past work, is available for your review. Your project piqued my interest, and I would be delighted to be part of it. Let's connect to discuss in detail. Warm regards. Please check my portfolio link: https://www.freelancer.com/u/sajjadtaghvaeifr
$500 USD in 7 days
7.2

Hello, I understand you’re looking for a focused implementation that reproduces and improves the Evolutionary Test-Time Compute approach for ARC-AGI, with an emphasis on genuine generalization rather than leaderboard overfitting. I specialize in small-model optimization and prompt search under tight compute budgets, and I can deliver a clean, fully reproducible Python pipeline that mirrors ARC’s official evaluation while making experimentation fast and transparent. The solution will implement on-the-fly evolutionary prompt search, including prompt generation, mutation, ranking, and selection against ARC tasks. Evaluation will strictly follow the official ARC scoring rubric so results are directly comparable across public and hidden splits. I will instrument the system to log compute usage, query counts, and score progression, allowing clear analysis of sample efficiency and where gains originate. Outputs will include side-by-side plots of default versus tuned performance. The codebase will be lightweight, GPU-friendly, and installable via a single requirements file, with a one-command entry point that downloads data, loads the model, runs the search, and prints final scores. Clear documentation and a structured README will explain each experimental step and how to adjust evolutionary parameters for further research. Thanks, Asif
$750 USD in 10 days
6.4

Hello, I have good working experience pushing Python-based models to their limits on benchmark tasks, and implementing evolutionary optimization pipelines is my specialty. I can reproduce Jeremy Berman’s Evolutionary Test-Time Compute approach on ARC-AGI with full transparency: clear logging, reproducible runs, GPU-efficient execution, and step-by-step documentation. My goal will not just be to match prior scores but to optimize generalization across public and hidden splits, while keeping setup lightweight and experimentation seamless. I am ready to start immediately and deliver iterative results you can review in real time. Best Regards, Arzoo Farooq
$410 USD in 7 days
6.4

I'm experienced in implementing evolutionary prompt search and automating evaluation in Python using libraries like PyTorch and NumPy. I will focus on achieving stronger generalization by reproducing and surpassing the Evolutionary Test-Time Compute approach for the ARC-AGI benchmark. My solution will track sample efficiency and ensure reproducible runs with clear documentation. I am ready to collaborate closely with you for quick iterations and immediate progress review.
$300 USD in 7 days
5.9

Hello, I have 10+ years of experience in AI, ML, and full-stack development. I can implement your ARC-AGI prompt tuning pipeline in Python with a clean, reproducible, and well-documented codebase that meets your requirements.

Requirement Confirmation & Planning
• Confirm model choice, dataset access, evaluation rules, and success metrics.
• Define the evolutionary prompt search workflow and logging requirements.

Development & Implementation
• Build the evolutionary prompt search module (generate, mutate, rank, select) running on ARC tasks.
• Integrate a lightweight Python stack (PyTorch/JAX, Hugging Face, NumPy) and ensure single-GPU compatibility.
• Implement automated evaluation using ARC’s official rubric and generate comparison plots for default vs tuned accuracy.

Experiment Tracking & Logging
• Log compute time, query counts, prompt evolution history, and score progression.
• Provide reproducible experiment scripts with fixed seeds and checkpointing.

Testing & Delivery
• Create a single-command runner to download data, load the model, execute search, and print final scores.
• Validate reproducibility with y-keys and ensure outputs match expected results.

Documentation & Handover
• Provide clear README with installation steps, evolutionary parameter tuning guidance, and experiment documentation for each run.

I’m ready to start immediately and can iterate quickly with checkpoint commits. Best regards,
$200 USD in 7 days
6.5

Hello, I’m excited about helping you push a small language model toward record-level performance on the ARC-AGI benchmark. My approach will focus on implementing evolutionary prompt search with automated evaluation, ensuring it matches ARC’s official rubric for accurate, comparable results. I’ll also set up tracking for sample efficiency to monitor compute time, queries, and score progression, delivering reproducible runs that work with a single command. Expect clear documentation for setup, experimentation tweaks, and reproducibility, along with a clean, well-organized codebase to facilitate easy iteration. Best regards, Juan
$300 USD in 5 days
5.8

Greetings, I appreciate the opportunity to help you enhance a small language model's performance on the ARC-AGI benchmark. Your focus on achieving strong generalization rather than just memorized accuracy is crucial, and I plan to tackle this by implementing an evolutionary prompt search that dynamically generates and ranks prompts against ARC tasks. Using Python with libraries like PyTorch and Hugging Face, I’ll create an automated evaluation system that mirrors ARC's rubric for consistent comparison, while also logging essential metrics like compute time and score progression. Ensuring reproducibility will be a priority, allowing easy runs with clear documentation for future experiments. I'm excited to collaborate closely and iterate quickly as we push for those record-level results. Best regards, Saba Ehsan
$300 USD in 30 days
5.5

Hello client! I can push the small language model toward record-level performance on the ARC-AGI benchmark. Based on my vast experience, I am prepared to deliver results that exceed expectations. Just drop me a message and let's start. Cheers, Fahad.
$200 USD in 2 days
5.3

Hi there, I’ve carefully reviewed the requirements for your GenAI project and I’m confident that my expertise in building NLP pipelines using Hugging Face and LangChain can meet your expectations. My experience includes working with large language models (LLMs) for Retrieval-Augmented Generation (RAG), as well as fine-tuning models with custom datasets to enhance text generation. I’ve successfully completed similar projects where I applied these techniques in Python to build robust, client-specific solutions. I would love the opportunity to discuss how I can leverage my skills to develop a tailored solution for your project. Feel free to take a look at my portfolio to get a sense of the work I’ve done: Portfolio: https://www.freelancer.com/u/webmasters486 Looking forward to hearing from you! Best regards, Muhammad Adil
$300 USD in 5 days
5.1

As a seasoned Machine Learning Engineer with over 8 years in the field, my expertise lines up perfectly with all your requirements for this project. Be it building NLP models, optimizing system performance, or having a solid command of Python and Software Architecture, I've got you covered. But beyond mere skills, I bring a deep understanding of the challenge this task poses and how we can overcome it. For achieving record-breaking performance on the ARC-AGI benchmark, I realize that focusing on better generalization rather than just memorized accuracy is crucial. I assure you that my solution will reflect this understanding and yield consistent gains across both public and hidden splits. Moreover, I will ensure that your benchmarks are met by automating evaluation that mirrors ARC's official rubric, tracking sample efficiency to gauge improvements, and keeping runs reproducible and easy to repeat. Lastly, my true knack lies in crafting systems that work beautifully in real-life scenarios. The fact that you value reliable engineering resonates deeply with me because producing robust and scalable AI systems has always been my priority. You can trust me not only to complete this project on time but also to provide a clear README documenting every step and even explaining how to tweak evolutionary parameters for further experimentation. Let's collaborate and create something extraordinary!
$500 USD in 7 days
5.0

Hello, I will implement the Evolutionary Test-Time Compute approach to maximize your small language model's performance on the ARC-AGI benchmark. The entire solution will be built in Python, leveraging libraries like NumPy, PyTorch, and Hugging Face. The core work involves implementing the evolutionary prompt search mechanism to generate, mutate, rank, and select effective prompts on-the-fly against ARC tasks. I will automate the evaluation process using a scoring script that accurately mirrors the official ARC rubric for direct leaderboard comparison. Crucially, I will implement logging to track sample efficiency, compute time, and score progression to ensure reproducible results and stronger generalization across splits.
1) Which specific small language model do you plan to use for the experiment (e.g., Llama 3 8B, Phi-3)?
2) Which specific evolutionary algorithm (e.g., genetic algorithm, simple hill climbing) do you prefer for the prompt search?
3) What is the highest number of parallel test-time compute threads that your hardware can support for the prompt generation/evaluation?
Thanks, Bharat
$500 USD in 7 days
4.9

I’ll implement the evolutionary prompt search for your Python-based model to push ARC-AGI performance while keeping results reproducible and clearly documented. I’ll automate evaluation, track sample efficiency, and make sure the code runs with a single command on one GPU. You’ll get plots comparing default vs tuned accuracy and a README showing how to tweak evolutionary parameters for further experiments. If the solution doesn’t reliably improve generalization as promised, you don’t pay. Portfolio: https://www.freelancer.com/u/shawanay Let’s discuss in chat. I have a few questions about the current model setup and data access before proceeding.
$300 USD in 9 days
4.0

Hello, I hope you are doing well. I’m a deep learning engineer specializing in prompt engineering, evolutionary search for prompts, and reproducible evaluation pipelines. I build lightweight, well-documented Python tooling for prompt tuning and ARC-style scoring, designed to run on a single GPU with minimal setup. In past work, I’ve implemented automated prompt search loops, robust scoring aligned with public and hidden ARC splits, and logging for compute, queries, and score progression. I deliver clean code, a clear README, and easily reproducible runs. I can handle the project end-to-end and help you iterate quickly toward stronger generalization, not just memorized accuracy. I will deliver a compact, reproducible solution and sensible defaults for evolutionary parameters. Please feel free to contact me so we can discuss more details. I am looking forward to the chance of working together. Best regards, Billy Bryan
$335 USD in 1 day
4.2

With my artisanal experience in data analytics and science with Python, I assure you a cut above the rest. We'll not just implement Jeremy Berman's approach to push language model performance on the ARC-AGI benchmark, but also enhance generalization for consistent gains across both public and hidden splits. Drawing from my rich eight-year exposure in this field, I guarantee top-notch work on implementing evolutionary prompt search, automating evaluation as per ARC rubrics, and tracking sample efficiency. Besides having a mastery of core tools like PyTorch and NumPy, I am an expert in various data visualization tools like Power BI and Looker. These skills will help ensure your codebase remains lightweight while generating clear plots for analysis. My understanding of TensorFlow/PyTorch also helps extend pre-existing models to deliver outstanding results. Further crucial skills evidenced by my substantial experience include ETL & Data Engineering (using Airflow/Talend/Alteryx), Analytics (using SQL/BigQuery), A/B Testing, and Experimentation (Statistical Analysis/Hypothesis Testing). So, beyond mere implementation, I guarantee full reproducibility of your runs, allowing for further experimentation. This assignment is no doubt in the right hands.
$350 USD in 7 days
4.1

Hi, I have carefully reviewed your project requirements to push a small language model towards record-level performance on the ARC-AGI benchmark using an evolutionary test-time compute approach. With my strong experience in natural language processing, deep learning, and prompt engineering, I am confident in reproducing and surpassing the methodology outlined by Jeremy Berman. I will implement an automated evolutionary prompt search in Python, using lightweight, well-documented code with libraries like PyTorch or Hugging Face to ensure reproducibility and efficiency. I will also build comprehensive evaluation metrics to track score progression and sample efficiency, producing clear plots comparing default versus tuned accuracy. The codebase will be pip-install friendly and run on a single GPU, with exhaustive documentation to facilitate easy experimentation. I propose an iterative process with frequent checkpoints for your review. I can start right away and provide the first runnable notebook within a week. What specific evolutionary parameters or constraints are you most interested in exploring during the prompt tuning? Best regards,
$450 USD in 23 days
4.2

Hello there, I reviewed your project Small LLM ARC-AGI Prompt Tuning - 30/01/2026 05:58 EST and understood the requirements at a high level. I focus on delivering clear, stable, and maintainable solutions aligned with the actual scope. I can work with Python, Software Architecture, and Machine Learning (ML), and I follow a clean development process with proper structure and error handling. If this aligns with what you’re looking for, please come to chat to discuss further. Best regards
$100 USD in 7 days
3.8

Hi, I'm excited about the opportunity to help you push a small language model towards record-level performance on the ARC-AGI benchmark. With over 10 years of experience in Machine Learning and Natural Language Processing, combined with proficiency in Python and libraries like PyTorch, I'm well-equipped to implement the evolutionary prompt search and ensure reproducible results as outlined. I'll ensure the code is cleanly documented and set up to run efficiently on a single GPU, with an easy-to-follow README for further experimentation. Let's collaborate to iterate efficiently on this, and I'm ready to start immediately. Best regards, Volodymyr
$335 USD in 1 day
3.9

Hello Najo666, My name is Bwalya, and I have 8 years of experience in Python, specializing in machine learning and natural language processing. I am excited about your project focusing on pushing a small language model towards record-level performance on the ARC-AGI benchmark. I have thoroughly reviewed your requirements and am confident in my ability to implement the evolutionary prompt search, automate evaluation, track sample efficiency, and ensure reproducible runs as per the project scope. I will use Python and relevant libraries such as PyTorch, JAX, NumPy, and Hugging Face to achieve the desired outcomes while maintaining lightweight setup and clear documentation. I would love to discuss the project further in the chat to clarify any details and ensure alignment with your expectations. Best regards, Bwalya
$300 USD in 7 days
3.4

⭐ If you award me, your smile shows up ⭐

Hi, Your project immediately stood out to me: it closely matches work I’ve completed successfully in the recent past. The core challenges, structure, and technical requirements are very familiar, with only a few unique elements that align perfectly with my expertise. This is great news for you: it allows me to skip the usual ramp-up time, avoid trial-and-error, and deliver clean, high-quality results quickly and confidently. I bring hands-on experience with Software Architecture, Natural Language Processing, Python, Machine Learning (ML), Deep Learning, Prompt Engineering and Hugging Face, along with proven workflows and best practices refined through multiple similar projects. You can view a directly relevant example in my portfolio here: https://www.freelancer.com/u/thomasb726

I’d be happy to discuss your specific goals in more detail and share tailored ideas based on what has worked best in comparable scenarios.

Why clients choose me, and continue working with me:
• Clear, proactive communication so you always know where the project stands
• Strong respect for your deadlines, budget, and business reputation
• Responsive, approachable, and focused on a smooth, stress-free process
• Reliable post-delivery support that often leads to long-term partnerships

If you’re looking for precise execution, high-quality results, and a dependable long-term partner, I’d love to connect and help bring your project to life. Best regards, Tom
$335 USD in 1 day
3.1

Dear Client, I have carefully reviewed your project requirements for tuning a small language model towards record-level performance on the ARC-AGI benchmark. With 8+ years of experience in Python and expertise in implementing evolutionary prompt search and automating evaluation processes, I am confident in my ability to deliver a solution that not only reproduces but surpasses the Evolutionary Test-Time Compute approach outlined in Jeremy Berman's posts. My approach will focus on implementing efficient prompt generation, mutation, ranking, and selection strategies while ensuring consistent gains across public and hidden splits. I will utilize Python libraries such as PyTorch, JAX, NumPy, or Hugging Face to keep the setup lightweight and the code well-documented. I would like to discuss your project further and share my ideas on how to achieve stronger generalization and reproducible runs. Looking forward to connecting with you to discuss this exciting opportunity in more detail. Best regards, Toma K.
$300 USD in 7 days
2.8

Norwich, United Kingdom
Payment method verified
Member since Feb 1, 2025