PDF READER - Text extraction with PyTesseract OCR

Many industries use PDF forms for official documentation. The health care industry often uses PDFs for all types of patient data. Audits are typical for many of these processes, but in most cases the forms are entered into a spreadsheet manually. I’m trying to wrap my mind around manually entering 200+ pages of PDF data. These forms are standard.

Tesseract OCR can help with image-to-text solutions. If the PDF forms are standard, we can identify the various ROIs (regions of interest) on the document and do what we want with the text data. For this example we will take the text data and organize it in a pandas data frame.

Regions of interest work like other bounding boxes: pixel coordinates (x and y axis) are entered as parameters, and the OCR is applied to that region of the page.
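To make the bounding-box idea concrete, here is a minimal sketch of one way to describe an ROI and hand it to the OCR. The `ROI` class, the `read_roi` helper, and the coordinates are hypothetical and are not taken from the notebook; the sketch assumes the page has already been loaded as a Pillow image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ROI:
    """A named rectangular region, in pixels, measured from the top-left corner."""
    name: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int

    def as_box(self):
        """Return (left, upper, right, lower), the order Pillow's Image.crop expects."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def read_roi(image, roi):
    """Crop one region out of a Pillow image and run OCR on that crop only."""
    import pytesseract  # third-party; imported here so ROI itself needs no extras

    crop = image.crop(roi.as_box())
    # --psm 6 asks Tesseract to treat the crop as a single uniform block of text.
    return pytesseract.image_to_string(crop, config="--psm 6")


# Hypothetical coordinates, not taken from the UB-04 notebook.
patient_name = ROI(name="patient_name", x=40, y=120, width=300, height=35)
print(patient_name.as_box())  # (40, 120, 340, 155)
```

Because the form layout is standard, the same list of ROIs can be reused for every page once the coordinates have been measured on one sample page.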

There are 5 main sections of the code:

1. Convert PDFs to PNG images
2. Remove color and lines from the image
3. Identify ROIs
4. Extract & clean text data
5. Compile in a data frame and export as CSV
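The five sections could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: it assumes pdf2image (with Poppler), OpenCV, pytesseract, and pandas are installed, and the function names, the 40-pixel line kernel, the ROI coordinates, and the file names are all hypothetical.

```python
import re


def pdf_to_pngs(pdf_path, out_prefix, dpi=300):
    """Step 1: render each PDF page to a PNG file and return the file names."""
    from pdf2image import convert_from_path  # third-party, needs Poppler

    paths = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        path = f"{out_prefix}_{number}.png"
        page.save(path, "PNG")
        paths.append(path)
    return paths


def remove_color_and_lines(png_path):
    """Step 2: binarize the page and erase long horizontal and vertical rules."""
    import cv2  # third-party (opencv-python)

    gray = cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
    # Inverted Otsu threshold: ink becomes white (255) on a black background.
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # A wide kernel isolates horizontal lines, a tall one isolates vertical lines.
    for size in ((40, 1), (1, 40)):
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, size)
        lines = cv2.morphologyEx(ink, cv2.MORPH_OPEN, kernel)
        ink = cv2.subtract(ink, lines)
    return cv2.bitwise_not(ink)  # back to dark text on a white page


def clean_text(raw):
    """Step 4: collapse whitespace and trim stray edge characters left by OCR."""
    return re.sub(r"\s+", " ", raw).strip(" |_")


def read_fields(page, rois):
    """Steps 3 and 4: OCR each named region of a cleaned page (a NumPy array)."""
    import pytesseract  # third-party, needs the Tesseract binary

    record = {}
    for name, (x, y, w, h) in rois.items():
        crop = page[y:y + h, x:x + w]  # NumPy slices rows (y) first, then columns (x)
        record[name] = clean_text(pytesseract.image_to_string(crop, config="--psm 6"))
    return record


def export_csv(records, csv_path):
    """Step 5: one row per page, one column per ROI."""
    import pandas as pd  # third-party

    frame = pd.DataFrame(records)
    frame.to_csv(csv_path, index=False)
    return frame


# Hypothetical ROI table: field name -> (x, y, width, height) in pixels.
ROIS = {"patient_name": (40, 120, 300, 35), "total_charges": (820, 1410, 160, 35)}


def run(pdf_path, csv_path):
    """Chain the five steps for every page of one PDF."""
    records = []
    for png in pdf_to_pngs(pdf_path, "page"):
        records.append(read_fields(remove_color_and_lines(png), ROIS))
    return export_csv(records, csv_path)


print(clean_text("  DOE,\n JOHN |\n"))  # DOE, JOHN
```

Calling `run("claims.pdf", "claims.csv")` (hypothetical file names) would produce one CSV row per page. Removing the form's ruled lines before OCR matters because box borders that fall inside an ROI are often misread as characters such as `|` or `_`.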

Click the Jupyter Notebook link below to see the code for the sample UB-04 Medicare claim form. **The data is sample data, created for demonstration purposes only**

PDF Reader - JUPYTER NOTEBOOK

Important note: when using Tesseract OCR on Windows, you will need to set the location of the Tesseract executable on your machine. Line 10 of the import cell contains the line pytesseract.pytesseract.tesseract_cmd = r'C:\Users\AppData\Local\Tesseract-OCR\tesseract.exe'. This line is necessary for OCR to work on Windows; adjust the path to match your own installation.
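If the same notebook has to run on more than one machine, one option is to set the path only when it is needed. The sketch below is an assumption about how that could be done, not the notebook's code; `tesseract_location` and `configure_pytesseract` are hypothetical helper names, and the Windows path is the one quoted above.

```python
import platform
import shutil

# Path quoted in the note above; adjust it to wherever Tesseract is installed.
WINDOWS_TESSERACT = r"C:\Users\AppData\Local\Tesseract-OCR\tesseract.exe"


def tesseract_location(system=None):
    """Return the Tesseract executable to use, or None to keep pytesseract's default."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path  # already discoverable, nothing to override
    if (system or platform.system()) == "Windows":
        return WINDOWS_TESSERACT
    return None


def configure_pytesseract():
    """Point pytesseract at the executable when an explicit location is known."""
    import pytesseract  # third-party

    location = tesseract_location()
    if location:
        pytesseract.pytesseract.tesseract_cmd = location
```

Calling `configure_pytesseract()` once, right after the imports, keeps the rest of the notebook identical across operating systems.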

