PDF Parser

Key Libraries Employed

DataFrame Creation

Pandas

Text Extraction

Regex

PDF Reading & Extraction

PDF Query

Robust. Responsive. Reliable.

An efficient, error-minimizing PDF parser.

250+ lines of code.

PDF Extraction

We employ PDF Query to extract text from 9000+ CommonApp student applications, each 15+ pages long.
213,000,000+ words extracted.

Text Extraction

We employ Regex to match specific patterns in the extracted text and pull out the key fields that later populate a DataFrame.
78+ Regex patterns written.
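
As a rough illustration of the pattern-per-field approach described above, here is a minimal sketch. The field names, labels, and patterns are hypothetical stand-ins (the real 78+ patterns and the actual application layout are not shown here); only the standard `re` module is used.

```python
import re

# Hypothetical patterns: one named pattern per field we want to capture.
# The real project's patterns and field labels are not public.
PATTERNS = {
    "first_name": re.compile(r"First Name:\s*(?P<value>[^\n]+)"),
    "email": re.compile(r"Email:\s*(?P<value>[\w.+-]+@[\w-]+\.[\w.]+)"),
    "gpa": re.compile(r"GPA:\s*(?P<value>\d\.\d{1,2})"),
}


def extract_fields(text):
    """Return one dict per document; fields with no match are set to None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group("value").strip() if match else None
    return record


# Made-up sample text standing in for one extracted application.
sample = "First Name: Ada\nEmail: ada@example.com\nGPA: 3.9\n"
print(extract_fields(sample))
```

Returning None for a missing field, rather than raising, is one way to keep a single malformed application from stopping a run over thousands of files.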

DataFrame Creator

We employ Pandas to assemble the extracted fields into a DataFrame for final delivery.
101 Columns.
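
The DataFrame step could look something like the sketch below. It assumes each application has already been reduced to a dict of fields; the three column names are hypothetical placeholders for the real 101 columns, and `records_to_frame` is an illustrative helper, not the project's actual function.

```python
import pandas as pd

# Hypothetical column order; the real sheet has 101 columns.
COLUMNS = ["first_name", "email", "gpa"]


def records_to_frame(records):
    """Build a DataFrame with a fixed column order from per-application dicts."""
    df = pd.DataFrame.from_records(records)
    # reindex guarantees every expected column exists, even if no record had it.
    df = df.reindex(columns=COLUMNS)
    # Values captured by regex are strings; convert numeric fields explicitly.
    df["gpa"] = pd.to_numeric(df["gpa"], errors="coerce")
    return df


# Made-up records; the second one is missing its GPA on purpose.
records = [
    {"first_name": "Ada", "email": "ada@example.com", "gpa": "3.9"},
    {"first_name": "Grace", "email": "grace@example.com"},
]
frame = records_to_frame(records)
print(frame)
```

Fixing the column list up front keeps the sheet layout stable regardless of which fields happen to be found in any given batch.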

Due to privacy concerns, the final sheet cannot be shown. I implore you to place your trust in me instead, and perhaps now visit the gallery!

Excel Converter

We employ Pandas to convert the DataFrame to an Excel file for delivery.
9000+ PDFs => 9000+ rows
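
A minimal sketch of the one-row-per-PDF export, under stated assumptions: `build_sheet` and `write_sheet` are hypothetical helper names, the file names and fields are made up, and writing `.xlsx` with `DataFrame.to_excel` requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def build_sheet(records_by_file):
    """Turn {pdf_filename: field_dict} into a DataFrame with one row per PDF."""
    rows = [{"source_pdf": name, **fields} for name, fields in records_by_file.items()]
    return pd.DataFrame(rows)


def write_sheet(df, path):
    """Write the sheet to an Excel file, omitting the pandas row index."""
    df.to_excel(path, index=False)


# Made-up input standing in for the parsed applications.
parsed = {
    "applicant_0001.pdf": {"first_name": "Ada", "gpa": 3.9},
    "applicant_0002.pdf": {"first_name": "Grace", "gpa": 4.0},
}
sheet = build_sheet(parsed)
print(len(parsed), "PDFs =>", len(sheet), "rows")
```

Calling `write_sheet(sheet, "applications.xlsx")` would then produce the deliverable; the output file name here is only an example.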