Extract text from PDF File using Python

Extracting text from a PDF file with Python is a bit like transcribing a printed book into an editable document: the content is all there, but it has to be traced out of a format designed for display rather than editing. Python libraries such as PyPDF2 and pdfplumber do this tracing for you. Their accuracy can suffer, however, when a PDF contains complex formatting, images, or non-standard fonts. PDF (Portable Document Format) is a common way of preserving documents across platforms, so extracting text from PDF files is an important step in many data processing and analysis applications.

Extract text from a PDF File using Python.

Python’s PyPDF2 library lets you extract text from a PDF file. The overall process is outlined below.

Among the many Python libraries available through pip, PyPDF2 and pdfplumber are popular choices for extracting text from PDFs. With PyPDF2, you open the file, create a reader object, and then extract the text from each page through that reader. With pdfplumber, the opened PDF exposes its pages as a list, and you iterate over that list to obtain the text content of each page.

Methods to extract data from PDF files

This article will focus on extracting text from PDF files using Python. Python, a versatile programming language, provides different libraries to accomplish the task. Following are the methods that we will explore in this guide:

  • Using the PyPDF2 library
  • Using the PyMuPDF (fitz) library
  • Using the pdfplumber library
  • Using the nltk library (together with PyPDF2)
  • Using the PDFQuery library
  • Using PDFMiner to extract text

1) Using the PyPDF2 library to extract content from PDF

PyPDF2 is a popular Python library for working with PDF files, including extracting text from them. With it, you can open a PDF, extract its text, and search that text for specific keywords or phrases. The library provides functions and methods for traversing PDF pages and accessing their text content, which makes it a convenient tool for basic text extraction. However, PyPDF2 might not handle all PDFs perfectly, especially those with complex layouts or fonts, and the extracted text can lose some of its formatting because of the way PDF content is stored. Consider alternative libraries such as pdfplumber or PyMuPDF if such issues occur.

First, install the PyPDF2 package by entering the following in the terminal:

pip install PyPDF2

The process for extracting text using PyPDF2 is fairly simple and involves the following steps:

  1. Importing the library
  2. Loading the file
  3. Extracting text

The following example illustrates the process of using PyPDF2 to extract text:

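import PyPDF2

text = ''

# Open the PDF in binary read mode; the with block closes it afterwards
with open('sample.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)

    # Append the text of each page to the result
    for page in pdf_reader.pages:
        text += page.extract_text()

print(text)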

In the above example, after importing PyPDF2, we use the open() function to open the sample PDF file in binary read mode ('rb'). PdfReader() constructs an object representing the PDF file. A for loop iterates over each page of the PDF, and the extract_text() method extracts the text from the current page, which is appended to the text variable.
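The keyword search mentioned earlier in this section can be sketched roughly as follows. The helper names find_keyword_pages and load_page_texts are hypothetical (they are not part of PyPDF2), and the sketch assumes a plain, case-insensitive substring match is good enough for your documents.

```python
def find_keyword_pages(page_texts, keyword):
    """Return the 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, page_text in enumerate(page_texts, start=1)
        if needle in (page_text or "").lower()
    ]


def load_page_texts(pdf_path):
    """Read a PDF with PyPDF2 and return one string per page."""
    # Imported here so find_keyword_pages can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return [page.extract_text() or "" for page in reader.pages]
```

For example, find_keyword_pages(load_page_texts('sample.pdf'), 'stem') would list the pages of sample.pdf that mention the word. Keeping the search separate from the PDF loading means the same function also works on text produced by any of the other libraries in this article.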

2) Using the PyMuPDF (fitz) library to extract text from pdf

An alternative method for extracting text from a PDF file is the PyMuPDF (fitz) library, another powerful library for working with PDF files. With PyMuPDF, you can extract text from PDF files, including complex ones. Extraction can still be challenging when fonts, images, or formatting are non-standard, so you might need to perform additional processing on the extracted text, depending on your needs.

To install the PyMuPDF (fitz) library, just enter the following command in the terminal:

pip install PyMuPDF

The following example illustrates extracting text from a PDF using fitz.

import fitz

doc = fitz.open('sample.pdf')

text = ''

for page in doc:
    text += page.get_text()

print(text)

In the above code, we first import the ‘fitz’ module, which is provided by the PyMuPDF library. The open() function opens the ‘sample.pdf’ file and returns a document object, which we store in the ‘doc’ variable. A for loop iterates over the pages of the document, and for each page the ‘get_text()’ method extracts the text, which is appended to the ‘text’ variable declared before the loop.
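As an illustration of the additional processing that extracted text often needs, here is a minimal clean-up sketch. The function name clean_extracted_text is hypothetical, and the rules it applies (rejoining words hyphenated across line breaks, collapsing repeated spaces, dropping blank lines) are assumptions about what a typical document requires, not something PyMuPDF prescribes.

```python
import re


def clean_extracted_text(raw_text):
    """Normalise whitespace in text extracted from a PDF."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

You could call it as clean_extracted_text(page.get_text()) inside the loop. Note that the hyphen rule also merges genuinely hyphenated words that happen to break across lines, so adjust it if that matters for your text.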

3) Using the pdfplumber library

The ‘pdfplumber’ library provides another way to extract text from a PDF file. It is designed for extracting content from PDFs and can deal with complex files that contain varied text layouts, tables, and images. However, like PyMuPDF, it can still struggle with non-standard formatting, so further processing may be required to clean and structure the extracted content.

To install the library, we can just enter the following command in the terminal:

pip install pdfplumber

Let’s explore the code for this method:

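import pdfplumber

text = ''

# The with block closes the PDF once we are done with it
with pdfplumber.open('sample.pdf') as doc:
    for page in doc.pages:
        # extract_text() may return None for a page without text
        text += page.extract_text() or ''

print(text)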

After importing the ‘pdfplumber’ module, we open the PDF with its ‘open()’ function. We then use the ‘pages’ attribute to iterate over the pages, and the ‘extract_text()’ method extracts the text from the current page.
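Since pdfplumber can also work with tables, the sketch below shows one possible way to turn a table into a list of dictionaries. It relies on the page method extract_table(), which returns the largest table on a page as a list of rows (or None when no table is found). The helper names table_to_records and extract_largest_table are hypothetical, and treating the first row as the header is an assumption about the table's layout.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells come back as None, so replace them with empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_largest_table(pdf_path, page_index=0):
    """Return the largest table on the given page as records, or [] if none is found."""
    # Imported here so table_to_records works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[page_index].extract_table()
    return table_to_records(table)
```

How well this works depends heavily on how the table is drawn in the PDF, so expect to inspect the raw rows before relying on the header mapping.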

Comparison of the limitations of the three libraries for text extraction

PDF-to-text library in Python | Limitations
pdfplumber                    | Post-processing for non-standard formatting; varying accuracy with complexity.
PyMuPDF                       | Challenged by complex PDFs, non-standard fonts, and formatting.
PyPDF2                        | Limited for complex layouts and fonts; accuracy varies with intricate content.
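Because each library has different weak spots, one pragmatic pattern is to try several of them in turn and keep the first result that contains any text. The following is only a sketch of that idea: extract_with_fallback, with_pypdf2, and with_pdfplumber are hypothetical names, and the order in which the libraries are tried is an assumption you should adapt to your own documents.

```python
def extract_with_fallback(pdf_path, extractors):
    """Try each (name, function) pair in order; return (name, text) for the first that yields text."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # deliberately broad: any failure means "try the next one"
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: no text extracted")
    raise RuntimeError("All extractors failed: " + "; ".join(errors))


def with_pypdf2(pdf_path):
    import PyPDF2  # imported lazily so a missing library just triggers the fallback

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return "".join(page.extract_text() or "" for page in reader.pages)


def with_pdfplumber(pdf_path):
    import pdfplumber  # imported lazily for the same reason

    with pdfplumber.open(pdf_path) as pdf:
        return "".join(page.extract_text() or "" for page in pdf.pages)
```

A call such as extract_with_fallback('sample.pdf', [('PyPDF2', with_pypdf2), ('pdfplumber', with_pdfplumber)]) returns both the text and the name of the library that produced it, which is useful when comparing results.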

4) Use NLTK to extract the text from PDF in Python

textract is another Python library that simplifies text extraction from various document formats, including PDFs. You can install it with pip, Python's package manager.

With textract, you are not limited to one format; it can extract text from many different document formats. However, like the other libraries, it can struggle with complex formatting, images, and fonts, so further processing may be required to clean and structure the extracted content.

On Windows 10, textract may raise a ShellError because it relies on the external pdftotext command:

The command `pdftotext filename.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
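Exit code 127 is the conventional shell status for "command not found", which suggests that the pdftotext tool is not installed or not on the PATH. As a minimal sketch, you can check for such external commands up front with the standard library's shutil.which(); the helper name missing_commands is hypothetical.

```python
import shutil


def missing_commands(commands):
    """Return the commands from the given list that cannot be found on the PATH."""
    return [command for command in commands if shutil.which(command) is None]
```

Calling missing_commands(['pdftotext']) returns an empty list when the tool is available and ['pdftotext'] when it is not, so you can show a clear message before textract fails with a ShellError.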

One way around this is to skip textract and combine two other libraries: PyPDF2 to extract the text, and nltk, with its word_tokenize and stopwords features, to clean it up. Together they give a result comparable to what textract would produce.

import nltk
import PyPDF2
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer and stopword data used below
# (newer NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

text = ""

# Open the PDF file in binary read mode; the with block closes it afterwards
with open("parts of flower.pdf", 'rb') as pdf_obj_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_obj_file)

    # Iterate through each page of the PDF
    for page_obj in pdf_reader.pages:
        text += page_obj.extract_text()

# Tokenize the text into words
words = word_tokenize(text)

# Filter out stopwords
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]

# Convert the list of filtered words back to a string, one word per line
cleaned_text = "\n".join(filtered_words)

# Print the cleaned text
print(cleaned_text)

Output:

parts
flower
Stem
roots
Patels
sepal
Leaves
flowers

5) Use PDFQuery to read and extract text from multiple PDF files in Python

PDFQuery extracts data based on selectors, so you’ll need to adjust the selectors to match the structure of your PDF documents, and the extraction accuracy depends on each PDF’s content and layout. The overall workflow is:

  • Set up the environment and install the package
  • Load the PDF file with PDFQuery
  • Select the text elements and extract their text

Here is how you can extract the text from a PDF using the PDFQuery library in Python:

from pdfquery import PDFQuery

pdf_file = PDFQuery('parts of flower.pdf')
pdf_file.load()

# Use CSS-like selectors to locate the elements
text_data = pdf_file.pq('LTTextLineHorizontal')

# Extract the text from the elements, skipping any that have none
content = [t.text for t in text_data if t.text]

# Join the content list with new line character
formatted_content = '\n'.join(content)

# Print the formatted content
print(formatted_content)

Output:

Stem
roots
Patels
sepal
Leaves
flowers
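To show what adjusting the selectors can look like, here is a small sketch that limits the search to a rectangular region of the page using PDFQuery's in_bbox selector. The helper names bbox_selector and text_in_region are hypothetical, and the coordinates in the usage note are made-up values in PDF points (measured from the bottom-left corner of the page); you would need to find the real ones for your own document.

```python
def bbox_selector(x0, y0, x1, y1, tag="LTTextLineHorizontal"):
    """Build a selector for elements of the given tag lying inside the box (PDF points)."""
    return f'{tag}:in_bbox("{x0}, {y0}, {x1}, {y1}")'


def text_in_region(pdf_path, x0, y0, x1, y1):
    """Return the text found inside a rectangular region of the PDF."""
    # Imported here so bbox_selector can be used without pdfquery installed.
    from pdfquery import PDFQuery

    pdf = PDFQuery(pdf_path)
    pdf.load()
    return pdf.pq(bbox_selector(x0, y0, x1, y1)).text()
```

For example, text_in_region('parts of flower.pdf', 0, 700, 300, 800) would return whatever text lines fall inside that (assumed) box. Because load() with no arguments reads every page, the selector matches the same region on all pages.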

6) Use PDFMiner to read and extract data from multiple PDF files

PDFMiner is another Python library that extracts data from PDF documents, and it can be used to read and extract data from multiple PDF files.

PDFMiner analyzes how the text in the PDF is structured. Its high-level extract_text() function covers the common case of pulling all the text out of a file; for finer control, you can pass it additional options that tell PDFMiner which pages to read and how to interpret the layout.

# Import the extract_text function from pdfminer.high_level module
from pdfminer.high_level import extract_text

# Define a function to extract text from a PDF
def extract_text_from_pdf(pdf_file):
    # Use the extract_text function to extract text from the PDF
    text = extract_text(pdf_file)
    return text

# Provide the path to the PDF file you want to extract text from
pdf_data = "parts of flower.pdf"

# Call the defined function and store the extracted text in a variable
extracted_data = extract_text_from_pdf(pdf_data)

# Print the extracted text to the console
print(extracted_data)

Output:

parts of flower

Stem
roots
Patels
sepal
Leaves
flowers
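The example above handles a single file. To cover multiple PDF files, one option is to loop over a folder and apply the same extraction function to each file, as in this sketch. The helper name extract_from_folder and the folder name used in the usage note are hypothetical; the extractor is passed in as an argument so that any of the libraries in this article can be plugged in.

```python
from pathlib import Path


def extract_from_folder(folder, extractor, pattern="*.pdf"):
    """Apply extractor to every file in folder matching pattern.

    Returns a dict mapping each file name to its extracted text,
    in alphabetical order of file name.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob(pattern)):
        results[pdf_path.name] = extractor(str(pdf_path))
    return results
```

With PDFMiner, this could be used as extract_from_folder('pdfs', extract_text) after importing extract_text from pdfminer.high_level, where 'pdfs' stands for whichever folder holds your files.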

Writing Extracted Data to a CSV File

Once the text has been extracted from a PDF, writing it to a CSV file is straightforward. First, extract the text with one of the PDF libraries described above. Then open the output file using the ‘with’ context manager and write the extracted text to it using the csv.writer() function.

The script below iterates through each page, extracts the text, and writes it to a CSV file. On execution, it reads the PDF, processes the text, and creates a CSV file with the extracted content.

# Import the necessary libraries
import PyPDF2
import csv

# Open the PDF file in binary read mode
pdf_obj_file = open("parts of flower.pdf", 'rb')

# Create a PDF reader object
pdf_reader = PyPDF2.PdfReader(pdf_obj_file)

# Initialize an empty string to store extracted text
text = ""

# Iterate through each page of the PDF and extract text
for page_obj in pdf_reader.pages:
    text += page_obj.extract_text()

# Close the PDF file
pdf_obj_file.close()

# Path to the CSV file to be constructed
csv_path = "extracted_text.csv"

# Write the extracted text to a CSV file
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Extracted Text"])
    # writerow() needs the row as a list; here the row holds the full text
    csv_writer.writerow([text])

print("Extracted text has been written to", csv_path)
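The script above stores all of the text in a single cell. If you would rather keep track of where each piece of text came from, a variation is to write one row per page. This is a sketch under that assumption; the function name write_pages_to_csv and the column headings are hypothetical.

```python
import csv


def write_pages_to_csv(page_texts, csv_path):
    """Write one CSV row per page: the 1-based page number and that page's text."""
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Page", "Extracted Text"])
        for number, page_text in enumerate(page_texts, start=1):
            # Pages without text are written as empty cells.
            writer.writerow([number, page_text or ""])
```

With PyPDF2, the page_texts argument could be built as [page.extract_text() for page in pdf_reader.pages]. The csv module quotes cells that contain line breaks, so multi-line page text still ends up in a single cell.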

Conclusion

This article explored various methods for extracting text from a PDF file using Python. We looked in particular at the PyPDF2, PyMuPDF (fitz), and pdfplumber libraries, each with its own advantages and limitations. PyMuPDF (fitz) and pdfplumber provide more advanced features than PyPDF2: PyMuPDF, for example, offers OCR support (through Tesseract), while pdfplumber handles complex layouts and tables well, which makes it a suitable choice for many developers.

  • PyPDF2’s simplicity can suffice for basic text extraction, but complex layouts might lead to formatting issues.
  • pdfplumber excels with diverse PDF layouts.
  • PyMuPDF handles complex PDFs but may require additional processing.
  • PDFMiner analyzes how the text in the PDF is structured.
  • When using textract, ensure that dependencies such as pdftotext are available, and be aware of its limitations on Windows 10.

Understanding each library’s strengths and choosing the one that fits your documents is the key to effective PDF text extraction in Python.
