PDF Parsing with AI
Parsing PDFs, whether extracting text, images, or metadata, can be daunting. It’s akin to deciphering someone’s handwriting interspersed with random sticky notes, doodles, and annotations in different colors. But fear not! In this article, we’ll explore how artificial intelligence (AI) can significantly simplify this process.
Wait, Why Should We Use AI for PDF Parsing?
Before we delve into the “how,” let’s understand why AI is beneficial for PDF parsing:
- External Packages: Many existing libraries for parsing PDFs require complex dependencies or offer limited functionality.
- Complex Layouts: PDFs often have intricate layouts, making text or image extraction challenging.
- Metadata Extraction: The lack of structure in PDFs makes metadata extraction difficult.
- Scalability: AI models, trained to handle various PDF layouts, offer greater scalability than traditional methods.
- Accuracy: AI models can achieve high accuracy in extracting text, images, and metadata.
- Customization: AI models can be fine-tuned to extract specific information types, such as tables, forms, or images.
How to Use AI for PDF Parsing
Now that we’ve covered the why, let’s dive into the how.
TL;DR
1. Upload the PDFs you want to convert to images as OpenAI files.
2. Create an OpenAI Assistant.
3. Use the Code Interpreter tool to convert the PDFs to images by passing in the OpenAI file IDs from step 1.
4. Receive the image files as output (an array of image objects).
5. Extract the image file IDs from the Assistant’s output.
6. Create a Chat Completions request.
7. Use a vision-capable model to extract values from the images using the file IDs from step 5.
8. Obtain the extracted values as structured output, i.e., a JSON object.
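To make the flow of data between these steps concrete, here is a minimal sketch of the pipeline as a single function. It is illustrative rather than the exact code from our project: the `PipelineSteps` interface and the `parsePdfs` function are hypothetical names, and each step is injected so that the real OpenAI calls can be swapped in (or stubbed out in tests).

```typescript
// Hypothetical step signatures; in practice each one would wrap an OpenAI API call.
interface PipelineSteps {
  // Step 1: upload one PDF and return its OpenAI file ID.
  uploadPdf(path: string): Promise<string>;
  // Steps 2-5: run the Assistant with Code Interpreter and return image file IDs.
  convertToImages(pdfFileIds: string[]): Promise<string[]>;
  // Steps 6-8: send the images to a vision-capable model and return anchor/value pairs.
  extractValues(imageFileIds: string[]): Promise<Record<string, string>>;
}

async function parsePdfs(
  paths: string[],
  steps: PipelineSteps,
): Promise<Record<string, string>> {
  // Upload every PDF in parallel and collect the resulting file IDs.
  const pdfFileIds = await Promise.all(paths.map((path) => steps.uploadPdf(path)));
  const imageFileIds = await steps.convertToImages(pdfFileIds);
  // Nothing to extract from if the conversion produced no images.
  if (imageFileIds.length === 0) {
    return {};
  }
  return steps.extractValues(imageFileIds);
}
```

Keeping the steps behind an interface means the orchestration can be exercised without network access, which helps when iterating on prompts separately from plumbing.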
Deeper Dive: Convert the PDF to an Image
The first step in using AI for PDF parsing is converting the PDF into images, which makes complex layouts and metadata easier to parse. We used an OpenAI Assistant with the Code Interpreter tool enabled for this task. Code Interpreter lets Assistants run Python code in a sandboxed environment, where it can process files in a range of formats and generate files containing data and images of graphs. It also works iteratively: when an attempt fails, it can revise the code and try again.
For guidance on the Code Interpreter tool, refer to: OpenAI Code Interpreter
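As a rough illustration, the request body for creating such an Assistant might be assembled like this. The field names follow the Assistants API (v2) as we understand it, so check them against the current documentation. The `buildAssistantRequest` helper and the instructions text are hypothetical and exist only for this sketch.

```typescript
// Shape of the request body this sketch produces (assumed from the Assistants API v2).
interface AssistantRequest {
  model: string;
  instructions: string;
  tools: { type: "code_interpreter" }[];
  tool_resources: { code_interpreter: { file_ids: string[] } };
}

function buildAssistantRequest(model: string, pdfFileIds: string[]): AssistantRequest {
  if (pdfFileIds.length === 0) {
    throw new Error("at least one PDF file ID is required");
  }
  return {
    model,
    // Hypothetical instructions; the prompts in the next section carry the real task.
    instructions: "You convert PDF files to PNG images by running Python code.",
    // Enabling the tool is what lets the Assistant execute Python.
    tools: [{ type: "code_interpreter" }],
    // Attaching the uploaded PDFs makes them available inside the sandbox.
    tool_resources: { code_interpreter: { file_ids: pdfFileIds.slice() } },
  };
}
```

The returned object could then be passed to whichever client you use to call the Assistants endpoint.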
Prompts Used:
const inputStructurePrompt = `
Please interpret the following structured tags to refer to the corresponding file IDs and/or plain text values:
\`{{ INPUT: ${file.filename} }}\` => \`${file.id}\` (MIME inputType: \`application/pdf\`)
`;

const convertPdftoPrompt = `
Convert every page of each PDF file into high-quality images optimized for use with vision APIs.
Inputs: PDF files to be converted.
Outputs: Downloadable 200 DPI PNG image files ready for vision API analysis.
`;
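Once the run completes, the generated image file IDs have to be pulled out of the Assistant’s messages. The sketch below assumes the message shapes we have seen from the Assistants API: `image_file` content parts, plus `file_path` annotations on text parts for downloadable files. The type names and the `extractGeneratedFileIds` helper are hypothetical, and only the fields the function reads are modeled.

```typescript
// Minimal shapes for the parts of an Assistants message this sketch reads.
interface FilePathAnnotation {
  type: "file_path";
  file_path: { file_id: string };
}
interface OtherAnnotation {
  type: "file_citation";
}
type ContentPart =
  | { type: "image_file"; image_file: { file_id: string } }
  | { type: "text"; text: { value: string; annotations: (FilePathAnnotation | OtherAnnotation)[] } };
interface AssistantMessage {
  role: "assistant" | "user";
  content: ContentPart[];
}

function extractGeneratedFileIds(messages: AssistantMessage[]): string[] {
  const ids: string[] = [];
  for (const message of messages) {
    // Only the Assistant's replies can contain generated files.
    if (message.role !== "assistant") continue;
    for (const part of message.content) {
      if (part.type === "image_file") {
        ids.push(part.image_file.file_id);
      } else {
        // Downloadable files appear as file_path annotations on text parts.
        for (const annotation of part.text.annotations) {
          if (annotation.type === "file_path") {
            ids.push(annotation.file_path.file_id);
          }
        }
      }
    }
  }
  // Drop duplicates while preserving first-seen order.
  return ids.filter((id, index) => ids.indexOf(id) === index);
}
```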
Deeper Dive: Extract Values from Images
Once the PDF is converted to images, OpenAI Chat Completions with a vision-capable model can extract values from these images. This step involved extensive iteration and prompt refinement. We tested and refined our prompt in the Chat Completions playground on OpenAI’s Developer Platform. We even asked ChatGPT how to improve the prompt and which best practices would produce the most reliable responses.
For more on the Chat Completions API, see OpenAI Chat Completions
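For illustration, a Chat Completions request body that pairs a prompt with page images could be built as follows. This sketch assumes the PNG bytes have already been downloaded and base64-encoded, since inline `image_url` data URLs are one documented way to pass images to Chat Completions. The `buildVisionRequest` helper and its type names are hypothetical. JSON mode (`response_format`) requires the prompt itself to mention JSON, which the prompt shown later in this section does.

```typescript
type UserContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail: "low" | "high" | "auto" } };

interface VisionRequest {
  model: string;
  response_format: { type: "json_object" };
  messages: { role: "user"; content: UserContentPart[] }[];
}

function buildVisionRequest(
  model: string,
  prompt: string,
  pngBase64Pages: string[],
): VisionRequest {
  const textPart: UserContentPart = { type: "text", text: prompt };
  // One image part per page, embedded as a data URL.
  const imageParts = pngBase64Pages.map(
    (data): UserContentPart => ({
      type: "image_url",
      // "high" detail asks the model to look at the image at higher resolution.
      image_url: { url: `data:image/png;base64,${data}`, detail: "high" },
    }),
  );
  return {
    model,
    // Ask for a JSON object so the response can be parsed directly.
    response_format: { type: "json_object" },
    messages: [{ role: "user", content: [textPart].concat(imageParts) }],
  };
}
```

Sending the prompt and all pages in a single user message keeps anchors and values from the same document in one context. Very long PDFs may need to be split across several requests.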
Best Practices Include:
- Make prompts direct and detailed.
- Provide examples of values to be extracted and indicators to look for.
- Have users highlight values for extraction.
- Bold any value anchors to enhance contrast for the vision model.
- Use a sans-serif font with even spacing for value anchors, making it easier for the vision model to parse.
- Employ structured outputs for seamless downstream application integration.
Prompts Used:
const imageParsingPrompt = `
Parse the image for anchors starting with '#fq'.
Inputs: Images to be parsed
Outputs: JSON output where anchors are properties, and monetary values are values.
Anchors may be followed by a space and fields in parentheses, considered part of the anchor.
If anchors are absent, return an empty JSON object.
For each anchor, extract the closest monetary value to the left.
Prioritize highlighted monetary values for extraction and strip any currency codes before returning.
Example anchors: #fq-1004, #fqdr-2000-1, #fqri-9008, #fqri-3400 (Date: "10/31/2024", Description: "Bank fees not recorded in GL")
Example monetary values: $2,000,000.00, 1,234 zl, 9876.54
`;
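Even with a careful prompt, it is worth validating the model’s JSON before handing it to downstream code. The following is a minimal sketch of such a check, not part of the original workflow. The `parseExtractedValues` helper is hypothetical, and it assumes amounts use a period as the decimal separator and commas only for grouping, as in the example values above.

```typescript
function parseExtractedValues(rawJson: string): Record<string, number> {
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("expected a JSON object");
  }
  const source = parsed as Record<string, unknown>;
  const result: Record<string, number> = {};
  for (const key of Object.keys(source)) {
    // Ignore any property that is not an anchor.
    if (key.indexOf("#fq") !== 0) continue;
    // Keep digits, the decimal point, and a minus sign; this drops currency
    // symbols and codes, grouping commas, and spaces.
    const cleaned = String(source[key]).replace(/[^0-9.\-]/g, "");
    const amount = Number(cleaned);
    // Skip values that do not reduce to a finite number.
    if (cleaned !== "" && isFinite(amount)) {
      result[key] = amount;
    }
  }
  return result;
}
```

Silently skipping unparseable values keeps the sketch short. In a production setting you might prefer to collect them and flag the page for manual review.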
And that’s it! We can efficiently parse PDFs with complex layouts and metadata by leveraging AI to convert PDFs into images and extract values.