PDF Parser

Retrieve PDF files from the S3 bucket and parse their data.


PROBLEM STATEMENT

Our client, a car insurance company in Canada, wanted to automate the appraisal process for inbound claims. They had thousands of approved invoices in their database, but all of them were inconsistently formatted PDFs written in French. Before any machine learning model could be built, that data had to be cleaned and organized. The client asked us to develop a single algorithm that could handle the many invoice variants, reliably parse every item in an invoice, and save the results to a CSV file.

SOLUTION

The tool retrieved the PDF invoices from an S3 bucket, extracted all of their text, and parsed each line with regular expressions. Because the invoices did not follow a standard format, the patterns had to cover the edge cases found across invoice variants while staying generic enough to scale. The parsed data was saved in CSV format.
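To make the parsing step concrete, here is a minimal sketch of line-by-line regex parsing followed by CSV output. It is not the project's actual code. The column layout (description, quantity, unit price, line total), the French number format with a comma as decimal separator, and the names `LINE_ITEM`, `parse_invoice_text`, and `write_csv` are all assumptions made for illustration.

```python
import csv
import io
import re

# Assumed amount format: "1 234,50" (space or non-breaking space between
# thousands, comma before the cents). Real invoices may differ.
AMOUNT = r"\d{1,3}(?:[ \u00a0]\d{3})*,\d{2}"

# Hypothetical line layout: description, quantity, unit price, line total,
# optionally followed by a dollar sign.
LINE_ITEM = re.compile(
    rf"^\s*(?P<description>.+?)\s+(?P<quantity>\d+)"
    rf"\s+(?P<unit_price>{AMOUNT})\s+(?P<total>{AMOUNT})\s*\$?\s*$"
)

FIELDS = ["description", "quantity", "unit_price", "total"]


def to_float(amount: str) -> float:
    """Convert a French-formatted amount such as '1 234,50' to a float."""
    cleaned = amount.replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return float(cleaned)


def parse_invoice_text(text: str) -> list:
    """Return one dict per line that matches the assumed line-item layout."""
    rows = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line)
        if match is None:
            continue  # headers, totals, and free text are skipped
        rows.append(
            {
                "description": match["description"],
                "quantity": int(match["quantity"]),
                "unit_price": to_float(match["unit_price"]),
                "total": to_float(match["total"]),
            }
        )
    return rows


def write_csv(rows: list, fileobj) -> None:
    """Write the parsed rows to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Lines that do not match are skipped rather than raising an error, which keeps one pattern usable across invoice variants. In practice, each variant's edge cases would likely need extra patterns or optional groups.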

Input

Access to S3 bucket

Output

CSVs of parsed invoices

Tools & Technologies


Python (boto3, pdftotext, csv)

API

Regex
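As a rough illustration of how boto3 and pdftotext fit together, the sketch below lists the PDF objects in a bucket and extracts the text of one of them. The function names, the bucket name, and the prefix are hypothetical, and the project's real retrieval code may have been organized differently.

```python
import io


def iter_pdf_keys(s3_client, bucket, prefix=""):
    """Yield the key of every object under the prefix that ends in .pdf."""
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(".pdf"):
                yield obj["Key"]


def fetch_pdf_text(s3_client, bucket, key):
    """Download one PDF into memory and return its text, page by page."""
    import pdftotext  # third-party; imported here so listing works without it

    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    pdf = pdftotext.PDF(io.BytesIO(body))
    return "\n".join(pdf)


# Example usage (bucket name and prefix are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   for key in iter_pdf_keys(s3, "example-invoice-bucket", "invoices/"):
#       text = fetch_pdf_text(s3, "example-invoice-bucket", key)
```

Passing the client in as an argument keeps the functions easy to exercise with a stub instead of a live AWS connection.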
