PDF Parser
Retrieve PDF files from the S3 bucket and parse their data.

PROBLEM STATEMENT
Our client, a Canadian car insurance company, wanted to automate the appraisal process for inbound claims. They had thousands of approved invoices in their database, but all of them were inconsistently formatted PDFs written in French. Before any machine learning model could be implemented, that data had to be brought into a clean, organized form. The client asked us to develop a general algorithm that could handle the different invoice variants, reliably parse every item in an invoice, and save the results to a CSV file.
SOLUTION
This tool retrieved the PDF invoices from an S3 bucket, extracted all the text from them, and parsed each line using regular expressions. Since the invoices did not follow a standard format, the expressions had to cover the edge cases found across invoices while staying generic enough to scale to new variants. The parsed data was saved in CSV format.
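The line-parsing and CSV steps can be illustrated with a minimal sketch. It is not the project's actual code: the line layout (description, quantity, unit price, total), the field names, and the helper names `normalize_amount`, `parse_invoice_text`, and `rows_to_csv` are all assumptions made for illustration. The S3 download and PDF text extraction are left out, since the libraries used for them are not stated here; the sketch starts from text that has already been extracted.

```python
import csv
import io
import re

# Assumed amount format: French-style decimals with a comma, and optional
# space or non-breaking-space thousands separators, e.g. "1 234,50".
AMOUNT = r"\d{1,3}(?:[ \u00a0]\d{3})*,\d{2}"

# Assumed line layout: description, quantity, unit price, line total.
# An optional "$" after each amount is tolerated.
LINE_ITEM = re.compile(
    r"^\s*(?P<description>.+?)\s+(?P<quantity>\d+)\s+"
    r"(?P<unit_price>" + AMOUNT + r")\s*\$?\s+"
    r"(?P<total>" + AMOUNT + r")\s*\$?\s*$"
)

FIELDS = ["description", "quantity", "unit_price", "total"]


def normalize_amount(raw):
    """Convert '1 234,50' to '1234.50' so downstream tools can read it."""
    return raw.replace(" ", "").replace("\u00a0", "").replace(",", ".")


def parse_invoice_text(text):
    """Return one dict per line that looks like an invoice item.

    Lines that do not match the pattern (headers, totals, notes) are skipped.
    """
    rows = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line)
        if match is None:
            continue
        rows.append({
            "description": match.group("description"),
            "quantity": match.group("quantity"),
            "unit_price": normalize_amount(match.group("unit_price")),
            "total": normalize_amount(match.group("total")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize parsed rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = (
        "Facture no 12345\n"
        "Pare-chocs avant 1 450,00 450,00\n"
        "Peinture 2 1 200,00 2 400,00\n"
        "Total 2 850,00\n"
    )
    print(rows_to_csv(parse_invoice_text(sample)))
```

A single pattern like this would cover only one invoice variant. Handling several variants would likely mean keeping a list of such patterns and trying each in turn, which is one way to stay generic without loosening any individual expression.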
Input
Access to S3 bucket
Output
CSVs of parsed invoices