
Completed
Posted
Paid on delivery
I have to pull many thousands of PDF files from a publicly available but poorly structured online database. The pages are slow, there are no clear download links, and navigation relies on clunky JavaScript forms, so a straightforward "save as" approach would take far too long. You will receive a text file that contains the exact filename for every document I need. Those filenames appear in the HTML once the record is loaded, so they can be used as reliable anchors for the scrape. The order in which the files arrive does not matter; accuracy and completeness do. I expect an automated approach; Python with Selenium, Playwright, Scrapy, or any comparable tool is fine, as long as it can work around the site's fragile structure and occasional timeouts. If headless browsing or rate-limiting tricks are required, please build them in.

Deliverables:
• A zipped archive (or split archives) containing every requested PDF.
• The runnable script with clear, inline comments so I can repeat the process in future. I hope to be able to run this program every few weeks to capture up-to-date files.
• A brief README explaining environment setup, command-line usage, and any third-party libraries.

I will validate the job by spot-checking a random sample of filenames against the list I provide and by ensuring the script reproduces the full download set on my end without manual tweaks.

The above is AI generated for this job; the following is my own description. I want to create a readable store/database of AFCA decisions. Their website is afca.org.au. My plan is to create a ChatGPT (or similar) AI tool to summarise each determination, or search across all determinations for keywords or phrases. AFCA publish each determination in a PDF document. Obviously, I only need the text of each determination, so whether your tool captures each PDF or simply gathers the text as a separate .txt file is a matter for you. As far as storage size goes, .txt files will obviously be far smaller.
I don't need an Access or similar database created; I seek only the documents themselves for use in an AI environment. As far as indexing goes, we can start with: Date, Determination/Case number, Financial Firm. Creating the index in Excel or similar seems to be easiest; those details are captured on the first page of each determination. At the outset, there will be many, many thousands of determinations across the old and new databases, and their online search facility is very poor. Older determinations (2018-2024): [login to view URL]. Take note of this service advisory: we are aware that some PDF links show the message *error opening/reading pdf file*. If you see this message, please disregard it; simply click the link and the PDF will open as normal. Newer determinations (since 2024): [login to view URL]*gbf20z*_gcl_au*MjEwNjExNjQxNi4xNzY2MTg0OTg5. I would be happy starting with the newer determinations only to check for validity, then look at the older database.
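The index fields the brief names (Date, Determination/Case number, Financial Firm) sit on the first page of each determination, so once that page's text is extracted they can be pulled with pattern matching and written to an Excel-compatible CSV. A minimal sketch, assuming field labels like "Case number:" and "Financial firm:" appear on the first page (the real AFCA layout may use different labels, so these patterns are a starting point to adjust against actual documents):

```python
import csv
import re

# Hypothetical first-page field labels; adjust against real AFCA determinations.
FIELD_PATTERNS = {
    "date": re.compile(r"Date[:\s]+(\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE),
    "case_number": re.compile(r"Case\s*(?:number|no\.?)[:\s]+([\w-]+)", re.IGNORECASE),
    "financial_firm": re.compile(r"Financial\s+firm[:\s]+(.+)", re.IGNORECASE),
}

def parse_first_page(text: str) -> dict:
    """Pull Date / Case number / Financial firm from first-page text."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else ""
    return record

def write_index(records, path="index.csv"):
    """Write the Excel-openable index CSV the brief asks for."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["date", "case_number", "financial_firm"]
        )
        writer.writeheader()
        writer.writerows(records)
```

Because the CSV opens directly in Excel, no separate Access-style database is needed; the index is just one row per determination.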
Project ID: 40237292
178 proposals
Remote project
Active 19 days ago
178 freelancers are bidding an average of $498 AUD for this job

Hello, This is not just a bulk download task; it's the foundation of a structured legal corpus that you'll later use for AI summarisation and search. The solution needs to be reliable, repeatable, and built with scale in mind. My proposed approach would be:

Phase 1 – Newer database (since 2024)
• Build a Playwright-based crawler to handle the JavaScript-heavy interface and slow page loads.
• Implement controlled rate limiting, retry logic, and resume capability to handle timeouts and interruptions.
• Download each determination PDF and immediately extract clean text using PyMuPDF or pdfplumber.
• Parse the first page to extract Date, Case Number, and Financial Firm using structured pattern matching.
• Generate an index CSV with filename, date, case number, firm, and file path.

Phase 2 – Older database (2018–2024)
• Extend the same pipeline with adjustments for the legacy interface.

Deliverables would include:
• Complete set of PDFs (or text files if preferred)
• Clean extracted text files
• Structured index CSV
• Fully commented Python script
• README explaining environment setup and execution

The architecture will allow you to re-run the process periodically to capture new determinations without duplicating existing files. I have experience building structured, audit-ready data pipelines where completeness and reproducibility are critical. Happy to begin with the newer database as a proof of validity before expanding. Best, Jenifer
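The resume capability described above can be as simple as a JSON manifest of completed case ids, consulted before each fetch so periodic re-runs skip files already on disk. A minimal sketch under assumed names (`downloaded.json` and the case-id strings are placeholders, not AFCA specifics):

```python
import json
from pathlib import Path

# Manifest of case ids already fetched; lets interrupted or repeated
# runs resume without re-downloading anything.
MANIFEST = Path("downloaded.json")

def load_manifest() -> set:
    """Read the set of completed case ids (empty on first run)."""
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()

def mark_done(done: set, case_id: str) -> None:
    """Record a completed download immediately, so a crash loses nothing."""
    done.add(case_id)
    MANIFEST.write_text(json.dumps(sorted(done)))

def pending(case_ids, done) -> list:
    """Case ids still to fetch, preserving input order."""
    return [cid for cid in case_ids if cid not in done]
```

Writing the manifest after every file (rather than once at the end) is what makes the pipeline safe to interrupt mid-run.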
$550 AUD in 25 days
9.4

Hello, With my expertise in JavaScript, I can navigate and extract the required information from the clunky JavaScript forms you mentioned. I understand that the current website conditions are challenging, but I am well-versed in utilizing automation tools like Selenium, Playwright and Scrapy to handle such situations effectively. My experience in working with poorly structured websites will help me save your valuable time by providing an automated approach that avoids any rate-limiting issues. In terms of deliverables, I guarantee a zipped archive with all the requested PDFs accompanied by a comprehensible script ready for your future use. To ensure your satisfaction, I am happy to add clear inline comments to the script to clarify any steps or customization needed going forward. Furthermore, as an added bonus, I will provide you with a brief README file describing everything you need for environment setup, command-line usage and any third-party libraries used. Lastly, as a customer-focused professional, my aim is not only to complete the project but to ensure your ongoing satisfaction. Therefore, if there are any additional features or tweaks you'd like implemented in this process, kindly let me know; I'm more than ready to accommodate you. Choose me for this task and rest easy knowing the messy parts of extracting files from this complex database are well taken care of by an experienced professional. Thanks!
$350 AUD in 3 days
7.9

I can efficiently handle the "Mass PDF Database Extraction" project using my expertise in JavaScript, Python, Data Processing, Web Scraping, and Software Architecture. The budget can be adjusted after discussing the full scope, and I aim to work within your budget constraints. Please review my profile, active for 15 years, to see my extensive experience. Let's discuss the details and get started on the project. Your satisfaction is my priority, and I am eager to showcase my commitment. Looking forward to hearing from you.
$525 AUD in 10 days
7.9

⭐⭐⭐⭐⭐ Efficiently Gather PDF Files from AFCA's Online Database ❇️ Hi My Friend, I hope you are doing well. I reviewed your project details and see you are looking for a solution to pull thousands of PDF files from the AFCA website. You don’t need to look any further; Zohaib is here to help you! My team has successfully completed 50+ similar projects for data scraping. I will create an automated script using Python with Selenium or similar tools to navigate the site and gather the necessary files accurately. ➡️ Why Me? I can easily do your project of scraping AFCA decisions as I have 5 years of experience in web scraping, automation, and data extraction. My expertise includes Python, Selenium, and handling complex website structures. Additionally, I have a strong grip on API integration and data processing. ➡️ Let's have a quick chat to discuss your project in detail. I can show you samples of my previous work and how we can achieve your goals efficiently. Looking forward to discussing this with you in chat. ➡️ Skills & Experience: ✅ Web Scraping ✅ Python Programming ✅ Selenium Automation ✅ Data Extraction ✅ Error Handling ✅ API Integration ✅ Script Optimization ✅ File Management ✅ Data Indexing ✅ Task Scheduling ✅ Data Analysis ✅ Documentation Waiting for your response! Best Regards, Zohaib
$350 AUD in 2 days
8.1

Hi there, I understand that you need to extract thousands of PDF documents from the AFCA website to create a readable database of decisions. Given the challenges with the site's structure and navigation, I propose an automated solution using Python with Selenium or a similar tool to efficiently gather the necessary files. My approach will involve developing a script that accurately retrieves each determination, focusing on the text content while ensuring the integrity of the data. I will implement error handling for any potential issues with PDF links and ensure that the script can be run periodically for future updates. Deliverables will include a zipped archive of the extracted documents, a well-documented script for future use, and a brief README for setup and usage instructions. I prioritize clear communication and quality work, ensuring you receive a reliable solution. I look forward to the opportunity to assist you with this project. Best regards, Burhan Ahmad TechPlus
$750 AUD in 5 days
7.8

Greetings. I will approach the mass download by building a script to automate the process. I'll download the PDFs into a single folder and deliver it as a zip archive, and I can generate a spreadsheet with additional metadata if you require any. I have noted that the site will be brittle and finicky; I have handled many such sites, and it will not be a problem as long as waiting and retrying eventually works.

Timeline: Building the script should take 1-2 days; the completion time of the full scrape then depends on the number of PDFs on your list and the response time of the site.

The script will feature:
- Rate limiting to avoid overburdening the site and triggering bot detection.
- Fault tolerance to continue running when some pages fail to load.
- Retry logic to maximize the number of pages that are successfully handled.

Experience: I have a wealth of experience in Python web scraping and can use technologies such as Playwright, Selenium, BeautifulSoup, Pandas and more.

I am available to begin immediately and work until completion. Contact me if you wish to continue. Thanks.
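The retry logic and rate limiting listed above are commonly combined as exponential backoff with jitter; a minimal sketch, where `fetch` stands in for whatever Playwright/Selenium call actually loads a page, and the delay values are illustrative defaults rather than figures tuned for the AFCA site:

```python
import random
import time

def fetch_with_retry(fetch, url, attempts=5, base_delay=2.0):
    """Retry a flaky fetch with exponential backoff plus random jitter.

    `fetch` is any callable that raises on failure, e.g. a wrapper
    around Playwright's page.goto() or requests.get().
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # exponential backoff (2s, 4s, 8s, ...) plus jitter so
            # retries from parallel workers don't align
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The same sleep call doubles as the rate limiter: even successful requests can be spaced by a fixed `base_delay` between calls to keep the load on the site low.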
$250 AUD in 4 days
7.6

Hello Greetings, After reviewing your project description, I am confident and excited to work on this project for you. However, I have some crucial points and questions to clarify. Please leave a message in the chat to discuss this, and I can share my recent work that is similar to your requirements. Thanks for your time! I am excited to hear from you soon. Best regards
$5,000 AUD in 40 days
7.7

Having worked extensively with data extraction and processing, I'm confident in my ability to tackle your project that involves pulling thousands of PDF files from a rather troublesome online database. My strong suit lies in web scraping and automating complex tasks, skills which you mentioned as key requirements for this job. I'm fluent in Python and have hands-on experience with Selenium, Playwright, and Scrapy, tools that would be ideal for handling the clunky JavaScript forms and timeouts you're facing. Given that my work philosophy resonates with your needs to a tee, providing an automated solution that others can implement easily, I am accustomed to creating clear yet comprehensive documentation like READMEs. This will help you not only validate the job but also harness the power of this script for future updates. A zipped archive containing every requested PDF is what you're asking for, and I promise to deliver exactly that. My attention to detail, coupled with guaranteed accuracy and completeness, plays a crucial role in such high-volume projects. Best, Junaid.
$750 AUD in 7 days
7.4

Hello, Working with large databases and automating data extraction is one of my core specialties and a reason why I'm well-suited for your project. As an experienced freelancer who has worked extensively with Python, Selenium, and web scraping, I assure you that I can develop a robust and efficient solution tailored to your needs. I understand the complexity of dealing with clunky JavaScript forms and slow website navigation, but my skill in finding reliable anchors for scraping will ensure accuracy and completeness in the PDF extraction process. In addition to providing you with the zipped archive containing all the requested PDFs, the runnable script with inline comments, and a clear README file, I am committed to delivering an optimal user experience. My expertise in handling headless browsing and rate-limiting issues lets me build workarounds into the script without compromising speed or performance. Choose expertise. Choose reliability. Choose me! Thank you, Gaurav D.
$500 AUD in 7 days
7.3

Hello, With my comprehensive set of skills and extensive experience, I'm confident I can successfully tackle the challenging task you've outlined. Utilizing my automation, data management, and web scraping expertise, I will develop a thoroughly automated solution using robust technologies like Python with Selenium. This approach will allow for the creation of an adaptable, future-proofed process that you can run yourself, capturing up-to-date files regularly without any manual intervention. My proficiencies extend beyond just scraping; I excel at data processing and management as well. I intend to parse each AFCA determination PDF and intelligently extract the text elements you need for your AI application. To meet your storage constraints, I can store the determinations as separate .txt files, significantly reducing the overall size without compromising the data's integrity. Furthermore, given my background in full-stack web development and software architecture, creating a readable store/database for your collected determinations will be a smooth endeavor. I propose an Excel index for easy access and reference using categories like date, determination/case number, and financial firm. Time and again, I've excelled at building efficient solutions under similar difficult conditions. Let's discuss your project further to ensure we surpass all your expectations. Thanks!
$555 AUD in 1 day
7.3

Hello I have thoroughly reviewed your project description and am confident in my ability to assist you in completing it successfully. I believe it would be highly beneficial to delve deeper into the specifics of the job to determine the most effective way forward. I am open to scheduling an interview at your convenience, and I genuinely appreciate the chance to collaborate with you on this project. Your response is eagerly anticipated, and I'm excited about the prospect of working together. Thank you for considering my proposal. Looking forward to your prompt reply! Best regards Rekha!!!
$750 AUD in 7 days
7.3

Hello, Would you like to see a demonstration of how we can streamline the extraction of thousands of PDF decisions effortlessly? Our automated approach utilizes advanced web scraping techniques to ensure accuracy and completeness while circumventing site limitations. Let's discuss how we can effectively compile the AFCA decisions into a usable format for your AI tool. Best, Smith
$500 AUD in 7 days
7.1

Hi there, I've read your brief and understand you want to build an AI-ready library of AFCA determinations by systematically capturing each decision and creating a simple index (Date, Case number, Financial Firm). Starting with the newer database first is a smart way to validate. I'm a Python developer experienced with scraping JS-heavy and poorly structured sites. I focus on reliable, repeatable collection where completeness matters.

Approach:
• Python + Playwright/Selenium to handle AFCA's dynamic search
• Iterate results and open each determination
• Save clean .txt extracted from PDFs (smaller, AI-friendly), or PDFs if preferred
• Parse page 1 to build an Excel index (date/case/firm)
• Add retries, rate limits, and resume support

Deliverables:
• All texts or PDFs
• Excel index
• Commented script + short README for reruns

I suggest a small pilot on recent decisions first to confirm quality. Quick questions:
• Text only, or also keep PDFs?
• How many for the pilot batch?

Ready to begin once confirmed.
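Text pulled from PDFs (e.g. via PyMuPDF's page.get_text() or pdfplumber's extract_text()) usually needs light cleanup before it is AI-friendly: hyphenated line breaks, runs of spaces, and blank lines are common extraction artifacts. A small normalizer sketch; the cleanup rules here are generic assumptions, not AFCA-specific:

```python
import re

def clean_text(raw: str) -> str:
    """Normalize PDF-extracted text for .txt storage: rejoin words
    hyphenated across line breaks, collapse runs of spaces/tabs,
    and drop empty lines."""
    text = re.sub(r"-\n(\w)", r"\1", raw)  # "deter-\nmination" -> "determination"
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)
```

Run on the full extracted text of each determination before writing the .txt file, so keyword search and AI summarisation see clean, compact input.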
$350 AUD in 5 days
7.3

Hi Glenn P. I'm your web developer, ready to turn your project Mass PDF Database Extraction into reality! I'd love to discuss the details and create something amazing together. Feel free to message me anytime, and we can also hop on a quick video or audio call whenever it's convenient for you. I've developed many projects exactly like what you're looking for. If you want to see more relevant samples, just contact me through the chatbox, and I'll share them instantly.

★ Why Clients Trust Me
• 500+ successful web projects delivered
• 430+ positive client reviews
• Expert in JavaScript, Python, Data Processing, Web Scraping, Software Architecture, Scrapy, Data Extraction, Selenium, Automation, Data Management
• WordPress, Shopify, PHP, JavaScript, HTML, CSS, Plugin/Theme Development, Laravel, WebApp
• Clean, modern, responsive and SEO-optimized designs
• Fast delivery, great communication, and long-term support
• Available during EST hours for smooth collaboration

If you want a professional developer who delivers quality work on time and stress-free, let's connect. I'm excited to help build something amazing for you. Best regards, Kausar Parveen
$350 AUD in 3 days
6.9

Sure, I will start this work as per the given description. I have extensive experience with similar projects and am highly qualified to do this job with high quality. I am a passionate Python/full-stack developer with rich experience across many successful tasks. I have some queries so I can give you an accurate time and price; please ping me to get started, and I will provide great results. Thanks
$550 AUD in 7 days
7.1

With warm regards, I'm Sami from BN-Droids Digital Services, a highly skilled and experienced web development team offering cutting-edge digital solutions. We specialize in Python and tools like Scrapy for expert data extraction. Our portfolio includes extensive web scraping experience, covering even the most complex, niche databases such as yours with its difficult-to-navigate JavaScript forms.
$250 AUD in 7 days
6.9

Hi I can build a fully automated extraction tool that navigates AFCA’s poorly structured interfaces, loads each determination record, and reliably captures either the PDF or clean text using Python with Selenium or Playwright. The core challenge is handling dynamic JavaScript forms, broken link behaviors, and slow responses, and I’ll solve this with resilient selectors, retry logic, headless browsing, and safe rate-limiting to prevent timeouts. Once each determination is fetched, I can extract text directly to .txt for lighter storage and generate an index capturing Date, Case Number, and Financial Firm from the first page. The script will include inline comments, environment instructions, and a README so you can rerun it every few weeks without modification. I’ll also package all documents into zipped archives and ensure the scraper produces complete, reproducible results across both the new and old AFCA databases. Thanks, Hercules
$500 AUD in 7 days
7.0

Hi I can build a fully automated extraction tool that navigates AFCA’s poorly structured interfaces, loads each determination record, and reliably captures either the PDF or clean text using Python with Selenium or Playwright. The core challenge is handling dynamic JavaScript forms, broken link behaviors, and slow responses, and I’ll solve this with resilient selectors, retry logic, headless browsing, and safe rate-limiting to prevent timeouts. Once each determination is fetched, I can extract text directly to .txt for lighter storage and generate an index capturing Date, Case Number, and Financial Firm from the first page. The script will include inline comments, environment instructions, and a README so you can rerun it every few weeks without modification. I’ll also package all documents into zipped archives and ensure the scraper produces complete, reproducible results across both the new and old AFCA databases. Thanks,
$300 AUD in 1 day
6.9

Hello client, I’ve carefully reviewed your job description and have strong experience in these Scrapy, Data Processing, JavaScript, Web Scraping, Software Architecture, Data Extraction, Automation, Data Management, Python and Selenium. I can build a reliable web scraping solution tailored specifically to your needs. Whether using Node.js with Puppeteer/Cheerio or Python with Selenium/BeautifulSoup, I will extract, clean, and organize your data efficiently. I also handle anti-bot protections, pagination, and full automation as required. As you can see from my profile, my web scraping reviews are excellent, reflecting my commitment to quality work. I focus on writing clean, maintainable, and scalable code because I know the difference between 99% and 100%. If you hire me, I’ll do my best until you’re completely satisfied with the result. Let’s discuss your target website and preferred data format. Thanks, Denis
$300 AUD in 5 days
6.1

Your AFCA scraper will fail if you treat this like a simple download job. The real problem is not the volume; it's that their search interface returns inconsistent pagination tokens and the PDF links expire after session timeout. I've debugged similar government portals where a naive loop loses 30% of files to stale references.

Before I architect the solution, two questions. First, are you planning to run this on a local machine or a cloud instance? The newer portal uses client-side rendering that burns through memory if you don't implement proper browser cleanup between batches. Second, do the determination numbers follow a sequential pattern, or do I need to scrape the search results themselves to build the master list?

Here's the architectural approach:
- SELENIUM + HEADLESS CHROME: Implement rotating user agents and random delays between 2-8 seconds to avoid triggering their rate limiter. I'll add retry logic with exponential backoff for timeout errors.
- PYTHON + PDFPLUMBER: Extract text directly from PDFs in-memory without writing intermediate files. This cuts storage by 90% and speeds up processing since we skip disk I/O.
- DATA EXTRACTION: Parse the first page of each determination using regex patterns to capture date, case number, and financial firm. Export to CSV with UTF-8 encoding to handle special characters in firm names.
- SCRAPY PIPELINES: Build a validation layer that cross-references extracted filenames against your input list and flags missing documents before the job completes. This prevents discovering gaps three weeks later.
- AUTOMATION: Structure the script to accept date ranges as command-line arguments so your weekly runs only pull new determinations instead of re-scraping the entire archive.

I've built similar document extraction systems for legal databases that process 50K+ PDFs without manual intervention. The script will include a progress tracker that logs every successful extraction and writes failed URLs to a separate file for manual review. Let's schedule a 15-minute call to confirm the determination numbering scheme and whether AFCA blocks requests from AWS IPs; that determines if we need residential proxies.
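A validation layer of the kind described above reduces to a set comparison between the client's master filename list and the files actually fetched, run before the job reports success. A minimal sketch (filenames are placeholders):

```python
def validate_downloads(expected, downloaded):
    """Cross-check fetched filenames against the master list.

    Returns the missing files (expected but absent), any extras
    (fetched but not requested), and an overall completeness flag,
    so gaps surface at the end of a run rather than weeks later.
    """
    want, have = set(expected), set(downloaded)
    return {
        "missing": sorted(want - have),
        "extra": sorted(have - want),
        "complete": want <= have,
    }
```

In practice `expected` comes from the client's text file and `downloaded` from listing the output folder; the `missing` list can be fed straight back into a retry pass.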
$450 AUD in 10 days
7.0

KILSYTH SOUTH, Australia
Payment method verified
Member since Aug 8, 2014