PDF Parser Tool | Extract Data with Python PDF Parsing

PDF Parser

Retrieve PDF files from the S3 bucket and parse their data.

PROBLEM STATEMENT

Our client, a Canadian car insurance company, wanted to automate the appraisal process for inbound claims. They had thousands of approved invoices in their database, but all of them were inconsistently formatted PDFs written in French. Before any machine learning model could be applied, that data had to be cleaned and organized. The client asked us to build a general algorithm that could handle the different invoice variants, reliably parse every item in an invoice, and save the results to a CSV file.

SOLUTION

The tool retrieved the PDF invoices from an S3 bucket, extracted all the text from them, and parsed each line using regular expressions. Because the invoices did not share a standard format, the regex patterns had to cover the edge cases found across invoices while staying generic enough to scale to new variants. The parsed data was saved in CSV format.
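The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the line layout (description, quantity, unit price, total, with French decimal commas), the LINE_ITEM pattern, the field names, and the helper functions are all assumptions made for the example. The real invoices had several variants and needed more patterns than this.

```python
import csv
import io
import re

# Hypothetical layout: "<description>  <qty>  <unit price>  <total>",
# with French decimal commas (e.g. 125,50). Real invoices varied.
LINE_ITEM = re.compile(
    r"""^\s*
    (?P<description>\S.*?)\s{2,}   # free text, ended by a run of 2+ spaces
    (?P<quantity>\d+)\s+
    (?P<unit_price>\d+,\d{2})\s+
    (?P<total>\d+,\d{2})\s*$
    """,
    re.VERBOSE,
)

FIELDS = ["description", "quantity", "unit_price", "total"]


def extract_text_from_s3(bucket, key):
    """Download one PDF from S3 and return its text, pages joined together."""
    # Third-party imports are local so the parsing helpers run without them.
    import boto3
    import pdftotext

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return "\n\n".join(pdftotext.PDF(io.BytesIO(body)))


def parse_invoice_text(text):
    """Return one dict per line that matches the assumed line-item layout."""
    rows = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line)
        if match is None:
            continue  # headers, totals, and other non-item lines are skipped
        row = match.groupdict()
        # Normalize French decimal commas to dots for downstream tools.
        row["unit_price"] = row["unit_price"].replace(",", ".")
        row["total"] = row["total"].replace(",", ".")
        rows.append(row)
    return rows


def write_csv(rows, out_file):
    """Write parsed rows to an open text file (or any file-like object)."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

With a bucket and key of your own, usage would look like write_csv(parse_invoice_text(extract_text_from_s3(bucket, key)), open("items.csv", "w", newline="")). Keeping the pattern in one named, verbose regex makes it easier to add alternatives as new invoice variants turn up.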

Tools & Technologies

Python (boto3, pdftotext, csv)

Regex
