PDF Parser

Key Libraries Employed

  • DataFrame Creation: Pandas
  • Text Extraction: Regex
  • PDF Reading & Extraction: PDF Query

Robust. Responsive. Reliable.

An efficient, error-minimizing PDF parser.

250+ lines of code.

PDF Extraction

  • We employ PDF Query to extract information from 9,000+ CommonApp student applications, each 15+ pages long.
  • 213,000,000+ words extracted.
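
A minimal sketch of what this extraction step could look like, assuming "PDF Query" refers to the pdfquery Python package. The helper names list_pdfs and extract_text and the one-folder layout are hypothetical and not taken from the project.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by name for a stable row order."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def extract_text(pdf_path):
    """Return the text of every horizontal text line in one PDF as one string."""
    import pdfquery  # third-party; imported here so list_pdfs works without it

    pdf = pdfquery.PDFQuery(str(pdf_path))
    pdf.load()  # parse all pages into a queryable tree
    # LTTextLineHorizontal is the layout element for a single line of text.
    return pdf.pq("LTTextLineHorizontal").text()
```

Calling extract_text on each path from list_pdfs would give one text blob per application, ready for pattern matching.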

Text Extraction

  • We employ Regex (regular expressions) to match specific patterns in the extracted text and capture the key fields that are later loaded into a DataFrame.
  • 78+ Regex patterns written.
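
As an illustration of this step, here is a small sketch using Python's built-in re module. The three field labels and patterns are invented for the example; the project's actual 78+ patterns and the exact wording of the application forms are not shown here.

```python
import re

# Hypothetical field patterns, keyed by the column each one fills.
PATTERNS = {
    "email": re.compile(r"Email:\s*(\S+@\S+)"),
    "gpa": re.compile(r"GPA:\s*(\d\.\d{1,2})"),
    "graduation_year": re.compile(r"Graduation Year:\s*(\d{4})"),
}


def extract_fields(text):
    """Return one dict per document; a field with no match is None."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row
```

Returning None for a missing field, rather than raising, keeps one malformed application from stopping the whole run.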

DataFrame Creator

  • We employ Pandas to assemble the extracted fields into a DataFrame ready for final delivery.
  • 101 columns.
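
A sketch of how per-PDF results could be assembled, assuming each application has already been reduced to a dict of field values. The column names are hypothetical stand-ins for the real 101 columns.

```python
import pandas as pd

# Hypothetical column order; the real sheet has 101 columns.
COLUMNS = ["email", "gpa", "graduation_year"]


def build_frame(rows):
    """Turn a list of per-PDF dicts into a DataFrame with a fixed column order."""
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Values captured from text are strings; convert numeric columns explicitly.
    # errors="coerce" turns unparseable or missing values into NaN.
    df["gpa"] = pd.to_numeric(df["gpa"], errors="coerce")
    return df
```

Fixing the column list up front means every row lands in the same layout even when some fields were not found.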

Due to privacy concerns, the final sheet cannot be shown. I implore you to place your trust in me instead, and perhaps now visit the gallery!

Excel Converter

  • We employ Pandas to convert the DataFrame to an Excel file for delivery.
  • 9,000+ PDFs => 9,000+ rows.
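
The export can be sketched with pandas' DataFrame.to_excel. The row-count check and the function name write_excel are assumptions added for illustration, reflecting the one-row-per-PDF relationship stated above.

```python
import pandas as pd


def write_excel(df, path, expected_rows):
    """Write `df` to an .xlsx file after checking there is one row per source PDF."""
    if len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    # index=False keeps the pandas row index out of the sheet.
    # Writing .xlsx requires an Excel engine such as openpyxl to be installed.
    df.to_excel(path, index=False)
```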