Allen Yang

Posted on Feb 6

Precise Text and Tabular Data Extraction from PDFs in Python

#python

PDF (Portable Document Format) is one of the most widely used file formats. Reports, contracts, invoices, and academic papers are all routinely stored as PDFs. But when you need their text for data analysis, automation, or information retrieval, copying and pasting by hand is tedious and error-prone.

Fortunately, Python offers an elegant solution thanks to its powerful ecosystem and extensive library support. This article explores how to efficiently and accurately extract text from PDF documents using Spire.PDF for Python, helping you streamline workflows and enhance data processing automation.

Why Choose Python for PDF Text Extraction?

Python has earned a strong reputation in data processing and automation, offering several key advantages:

Rich ecosystem: A wide range of libraries is available for handling different data formats, including powerful PDF tools.
Easy to learn and use: Its clear and concise syntax allows even beginners to quickly start writing scripts.
Strong automation capabilities: Python integrates seamlessly into automated workflows, making batch processing straightforward.

Compared with manual extraction, a Python script processes files quickly and consistently, which matters most when you are dealing with large volumes of PDF files.

Introducing Spire.PDF for Python

Spire.PDF for Python is a powerful and professional library designed for creating, editing, converting, and parsing PDF documents. When it comes to text extraction, it offers several notable benefits:

High-accuracy text extraction: Precisely identifies and extracts text, even from documents with complex layouts and fonts.
Support for multiple PDF types: Handles both native PDFs and scanned PDFs (via built-in OCR).
Flexible API: Provides intuitive and easy-to-use interfaces for developers.
Handles complex layouts: Maintains logical reading order for multi-column documents, tables, and mixed text-image layouts.
Multilingual support: Capable of processing documents containing multiple languages.

Installation Guide

Before getting started, ensure that Spire.PDF for Python is installed. If not, you can install it easily using pip:

pip install Spire.Pdf

Basic Text Extraction: Retrieve Text from a PDF Document

Extract Text from an Entire PDF

A common requirement is extracting all text from a PDF document. Spire.PDF for Python provides a straightforward method to accomplish this.

from spire.pdf import *
from spire.pdf.common import *

def extract_full_text(pdf_path, output_txt_path):
    """
    Extract all text from a PDF document and save it to a TXT file.
    """
    # Create a PdfDocument object
    doc = PdfDocument()

    # Load the PDF file
    doc.LoadFromFile(pdf_path)

    # Store text from all pages
    full_text = ""

    # Iterate through each page in the PDF
    for i in range(doc.Pages.Count):
        page = doc.Pages.get_Item(i)

        # Create a PdfTextExtractor object
        text_extractor = PdfTextExtractor(page)

        # Extract text from the current page
        # PdfTextExtractOptions can configure extraction behavior; default settings are used here
        page_text = text_extractor.ExtractText(PdfTextExtractOptions())
        full_text += page_text + "\n"  # Add a newline to separate page content

    # Save the extracted text to a file
    with open(output_txt_path, "w", encoding="utf-8") as f:
        f.write(full_text)

    # Close the document
    doc.Close()
    print(f"Text successfully extracted from {pdf_path} and saved to {output_txt_path}")

# Example usage (assumes a file named sample.pdf exists in the working directory)
pdf_file = "sample.pdf"  # Replace with your PDF file path
output_file = "sample_full_text.txt"
extract_full_text(pdf_file, output_file)

Code Explanation:

PdfDocument() – Instantiates a PDF document object.
doc.LoadFromFile(pdf_path) – Loads the specified PDF file.
doc.Pages.Count – Retrieves the number of pages in the document.
doc.Pages.get_Item(i) – Accesses a specific page by index.
PdfTextExtractor(page) – Creates a text extractor for the current page.
text_extractor.ExtractText(PdfTextExtractOptions()) – Performs the text extraction. PdfTextExtractOptions() allows you to control behaviors such as extracting hidden text or adjusting extraction modes.
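
The function above calls doc.Close() only at the very end, so an exception raised while loading or extracting would skip the cleanup. One way to guard against that is a small context manager, shown here as a minimal sketch. The helper name open_pdf_document and its factory parameter are hypothetical and not part of Spire.PDF; the only assumption about the document object is that it exposes LoadFromFile() and Close() as used in the example.

```python
from contextlib import contextmanager


@contextmanager
def open_pdf_document(pdf_path, factory):
    """Yield a loaded document and always close it afterwards.

    `factory` is a zero-argument callable returning an object with
    LoadFromFile() and Close() methods (for example, PdfDocument).
    Hypothetical convenience helper, not part of the Spire.PDF API.
    """
    doc = factory()
    try:
        doc.LoadFromFile(pdf_path)
        yield doc
    finally:
        # Runs even if loading or extraction raises an exception
        doc.Close()


# Usage sketch (assumes Spire.PDF is installed and imported as in the article):
# with open_pdf_document("sample.pdf", PdfDocument) as doc:
#     page = doc.Pages.get_Item(0)
#     text = PdfTextExtractor(page).ExtractText(PdfTextExtractOptions())
```

Passing the class in as factory keeps the helper independent of the library import, which also makes it easy to exercise with a stand-in object.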

Extract Text from a Specific Page

Sometimes, you may only need text from a particular page. Spire.PDF for Python fully supports this scenario.

from spire.pdf import *
from spire.pdf.common import *

def extract_page_text(pdf_path, page_number, output_txt_path):
    """
    Extract text from a specified page in a PDF and save it to a TXT file.
    """
    if page_number < 1:
        print("Page number must be greater than or equal to 1.")
        return

    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    if page_number > doc.Pages.Count:
        print(f"Page number {page_number} exceeds the total page count {doc.Pages.Count}.")
        doc.Close()
        return

    # The Pages collection is zero-based, so subtract 1 from the page number
    page = doc.Pages.get_Item(page_number - 1)

    text_extractor = PdfTextExtractor(page)
    page_text = text_extractor.ExtractText(PdfTextExtractOptions())

    with open(output_txt_path, "w", encoding="utf-8") as f:
        f.write(page_text)

    doc.Close()
    print(f"Text from page {page_number} was successfully extracted from {pdf_path} and saved to {output_txt_path}")

# Example usage
extract_page_text("sample.pdf", 2, "sample_page_2_text.txt")  # Extract text from page 2
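
If you need several pages rather than one, the same 1-based to 0-based conversion can be applied to a whole selection. The sketch below is plain Python with no Spire.PDF calls; the function name parse_page_ranges and the "1-3,5" input format are assumptions made for illustration.

```python
def parse_page_ranges(spec, page_count):
    """Convert a 1-based page spec such as "1-3,5" into sorted 0-based indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject anything outside the document or written backwards
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                f"Invalid page range {part!r} for a {page_count}-page document"
            )
        # range() is end-exclusive, so pages start..end map to start-1..end-1
        indexes.update(range(start - 1, end))
    return sorted(indexes)


print(parse_page_ranges("1-3,5", page_count=6))  # [0, 1, 2, 4]
```

Each returned index can be passed directly to doc.Pages.get_Item(), with doc.Pages.Count supplied as page_count.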

Advanced Use Cases: Structured and Region-Based Extraction

Beyond full-document and single-page extraction, Spire.PDF for Python offers finer control, such as extracting text from a specific region of a page. This is especially useful for retrieving structured data from complex documents.

Extract Text from a Specific Region in a PDF

In many scenarios, you may only care about a particular area on a page (for example, the total amount on an invoice or the abstract in a report). Spire.PDF for Python allows precise extraction by defining a rectangular region.

from spire.pdf import *
from spire.pdf.common import *


def extract_region_text(pdf_path, page_number, x, y, width, height, output_txt_path):
    """
    Extract text from a specific region on a specified page in a PDF.
    Coordinates (x, y) represent the upper-left corner, while width and height define the region size.
    """
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    if page_number < 1 or page_number > doc.Pages.Count:
        print(f"Invalid page number {page_number}.")
        doc.Close()
        return

    page = doc.Pages.get_Item(page_number - 1)

    # Define the extraction area (RectangleF)
    # The coordinate system is typically measured in points, with the origin at the top-left.
    extract_area = RectangleF.FromLTRB(x, y, x + width, y + height)

    text_extractor = PdfTextExtractor(page)
    options = PdfTextExtractOptions()
    options.ExtractArea = extract_area  # Set the extraction region

    region_text = text_extractor.ExtractText(options)

    with open(output_txt_path, "w", encoding="utf-8") as f:
        f.write(region_text)

    doc.Close()
    print(f"Text extracted from the specified region on page {page_number} of {pdf_path} and saved to {output_txt_path}")

# Example usage: extract a 200x80 region starting at (20, 200) on page 1
pdf_file = "sample.pdf"
extract_region_text(pdf_file, 1, 20, 200, 200, 80, "sample_region_text.txt")

Tip: Determining the exact x, y, width, height values may require some experimentation, or you can use measurement tools in PDF readers such as Adobe Acrobat to help locate coordinates.
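
Measurement tools often report positions in millimeters or measure from the bottom-left corner of the page, which is the origin of the native PDF coordinate space. The helpers below are a rough sketch of converting such measurements into the top-left, point-based values the example above expects. The function names are hypothetical, and the page height has to be supplied by you in points.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimeters to PDF points (1 inch = 72 points = 25.4 mm)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def top_left_y(bottom_left_y, region_height, page_height):
    """Convert the y of a region's bottom edge, measured upward from the
    bottom of the page, into the y of its top edge measured downward
    from the top of the page. All values are in points."""
    return page_height - bottom_left_y - region_height


# A region whose bottom edge sits 100 pt above the page bottom and that is
# 80 pt tall, on a page 842 pt high, starts 662 pt below the top edge.
print(top_left_y(100, 80, 842))  # 662
```

The results can then be passed as x, y, width, and height to extract_region_text.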

Extract Table Data from PDFs

Tables are a common form of structured data in PDF documents. Spire.PDF for Python can identify and extract table data, which is essential for automated data entry and analysis.

from spire.pdf import *
from spire.pdf.common import *
import csv
import os

def extract_tables_from_pdf(pdf_path, output_dir):
    """
    Extract all table data from a PDF document and save each table as a CSV file.
    """
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    # Create the output directory if it does not exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    table_count = 0

    # Create the table extractor once and reuse it for every page
    table_extractor = PdfTableExtractor(doc)

    for i in range(doc.Pages.Count):
        # Detect and extract tables from page i (zero-based page index)
        tables = table_extractor.ExtractTable(i)

        if tables:
            print(f"Found {len(tables)} table(s) on page {i + 1}.")

            for table_index, table in enumerate(tables):
                table_count += 1

                # Construct the CSV file name
                csv_file_name = f"Page_{i + 1}_Table_{table_index + 1}.csv"
                csv_file_path = os.path.join(output_dir, csv_file_name)

                # Write to CSV
                with open(csv_file_path, mode="w", newline="", encoding="utf-8-sig") as csv_file:
                    writer = csv.writer(csv_file)

                    for row_index in range(table.GetRowCount()):
                        row_data = []
                        for column_index in range(table.GetColumnCount()):
                            cell_text = table.GetText(row_index, column_index)
                            row_data.append(cell_text)
                        writer.writerow(row_data)

                print(f"Table saved: {csv_file_path}")

    doc.Close()

    if table_count == 0:
        print(f"No tables found in {pdf_path}.")
    else:
        print(f"A total of {table_count} table(s) were extracted and saved.")

# Example usage
extract_tables_from_pdf("sample.pdf", "output_csv")

Code Explanation:

PdfTableExtractor(doc) – Creates a table extractor.
table_extractor.ExtractTable(page_index) – Automatically detects and extracts tables from the specified page.
table.GetRowCount() and table.GetColumnCount() – Retrieve the number of rows and columns.
table.GetText(row_index, column_index) – Gets the text content of a specific cell.
The extracted table data can be processed further. A common workflow for structured data is to convert it into a DataFrame and save it to Excel.
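
As a minimal sketch of that post-processing step, the helpers below turn the rows of one table (a list of lists of cell strings, as built in the loop above) into records keyed by the header row. The names clean_cell and rows_to_records are hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
def clean_cell(text):
    """Collapse line breaks and repeated spaces inside one cell."""
    return " ".join((text or "").split())


def rows_to_records(rows):
    """Treat the first row as the header and return a list of dicts."""
    if not rows:
        return []
    # Give empty header cells a placeholder name so keys stay unique enough
    header = [clean_cell(h) or f"column_{i + 1}" for i, h in enumerate(rows[0])]
    records = []
    for row in rows[1:]:
        cells = [clean_cell(c) for c in row]
        # Pad short rows so every record has every header key;
        # cells beyond the header width are dropped by zip()
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records


rows = [["Name", "Total\nAmount"], ["Widget", " 12.50 "], ["Gadget"]]
print(rows_to_records(rows))
```

If pandas is installed, the resulting list can be passed to pandas.DataFrame(records), and DataFrame.to_excel() can then write it out, provided an Excel writer such as openpyxl is available.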

Common Challenges and How Spire.PDF for Python Addresses Them

Text extraction from PDFs is not always seamless, especially when dealing with complex or low-quality files.

Scanned PDF text extraction (OCR): A scanned PDF is essentially a set of page images with no text layer, so an extractor that only reads the text layer returns nothing useful from it. The text has to be recognized with OCR (Optical Character Recognition). The OCR settings are not covered by the examples in this article, so check the Spire.PDF for Python documentation for how to enable OCR before relying on it.
Multilingual text handling: The library supports extracting text in multiple languages, making it ideal for international documents. Its internal mechanisms can recognize different character encodings and fonts.
Disordered layouts and non-standard PDFs: Documents with unusual layouts may present challenges. Spire.PDF for Python uses a powerful parsing engine to reconstruct logical text structure as accurately as possible, though additional post-processing may sometimes be required.
Exception handling and error management: In real-world applications, corrupted files or permission issues can cause extraction failures. It is recommended to add try-except blocks to improve program robustness.
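
The try-except advice can be sketched as a small batch runner that keeps going when one file fails and reports the failures at the end. This is an illustrative outline, not a Spire.PDF feature: extract_batch is a hypothetical name, and extract_fn stands for any function with the signature (pdf_path, output_txt_path), such as extract_full_text from earlier in this article.

```python
from pathlib import Path


def extract_batch(pdf_paths, extract_fn, output_dir):
    """Run extract_fn(pdf_path, output_txt_path) per file and collect failures."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    succeeded = []
    failed = {}
    for pdf_path in map(Path, pdf_paths):
        target = output_dir / (pdf_path.stem + ".txt")
        try:
            extract_fn(str(pdf_path), str(target))
            succeeded.append(pdf_path.name)
        except Exception as exc:
            # Record the reason and continue with the next file
            failed[pdf_path.name] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed


# Usage sketch:
# ok, errors = extract_batch(["a.pdf", "b.pdf"], extract_full_text, "output_txt")
# for name, reason in errors.items():
#     print(f"Skipped {name}: {reason}")
```

Catching the broad Exception type is a deliberate choice here, since the exact exception classes raised for corrupted or protected files are not documented in this article; narrow it once you know which errors to expect.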

Conclusion

Spire.PDF for Python provides developers with a comprehensive and powerful toolkit for extracting text from PDF documents efficiently and accurately. Whether you need simple full-text extraction or more advanced region- and table-based data retrieval, it delivers reliable solutions. With the guidance and code examples in this article, you should now have a solid understanding of the core techniques required to perform PDF text extraction using this library.
