Reading Text from PDF Files in Python
Learning Objective
By the end of this lab, you will understand how to:
- Open a PDF file in Python
- Extract text from its pages
- Display part of the extracted text
Reading PDF Text with PyPDF2
PyPDF2 is a third-party Python library that lets us read PDF files.
It does not ship with Python, so the first step is to install it with pip:
pip install PyPDF2
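To confirm the install worked before running the program, one option is to ask Python whether it can find the module. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("PyPDF2"):
        print("PyPDF2 is available.")
    else:
        print("PyPDF2 not found. Try: pip install PyPDF2")
```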
In this lesson, we will write a program that extracts text from a PDF.
Code: read_pdf_text.py
# Import the PdfReader class from the PyPDF2 library
from PyPDF2 import PdfReader


def read_pdf_text(pdf_path):
    """
    Extract text from all pages of a PDF file.

    Args:
        pdf_path (str): Path to the PDF file

    Prints:
        First 500 characters of the extracted text
    """
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)  # Create a PDF reader object
        text = ""  # Store extracted text here

        # Loop through each page in the PDF
        for page in reader.pages:
            text += page.extract_text() + "\n\n"  # Add page text with spacing

    # Print a preview of the extracted text
    print(f"Extracted text from {pdf_path}:\n")
    print(text[:500] + "...")  # Print only the first 500 characters


# Example usage
read_pdf_text("sample.pdf")
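The program above prints its result directly. If you would rather reuse the text elsewhere, a variation can return it instead. The sketch below is one way to do that, not the only one. The names join_page_texts and extract_pdf_text are hypothetical, and the `or ""` guard reflects an assumption that a page may yield no text (for example, a scanned image).

```python
def join_page_texts(page_texts):
    """Join per-page strings with a blank line, treating None as empty."""
    return "\n\n".join(text or "" for text in page_texts)


def extract_pdf_text(pdf_path):
    """Return the text of every page in the PDF as one string."""
    # Imported inside the function so join_page_texts can be tried on its own
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
        # The join consumes every page while the file is still open
        return join_page_texts(page.extract_text() for page in reader.pages)
```

With this version, text = extract_pdf_text("sample.pdf") followed by print(text[:500]) gives a preview similar to the one the original program prints.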
Step-by-Step:
- Importing the Library
    from PyPDF2 import PdfReader
  - PdfReader is a tool from the PyPDF2 library that helps us read PDF files.
- Defining a Function
    def read_pdf_text(pdf_path):
  - This function takes the file path (location of the PDF) and extracts its text.
- Opening the File
    with open(pdf_path, 'rb') as file:
  - 'rb' means read in binary mode (needed for PDF files).
  - with open(...) ensures the file closes automatically after use.
- Creating a PDF Reader Object
    reader = PdfReader(file)
  - PdfReader helps us access the content of the PDF.
- Extracting Text
    for page in reader.pages:
        text += page.extract_text() + "\n\n"
  - Loops through each page.
  - page.extract_text() pulls the text from that page.
  - "\n\n" adds some spacing between pages.
- Previewing the Extracted Text
    print(text[:500] + "...")
  - Shows only the first 500 characters so the output doesn’t become too long.
- Running the Program
    read_pdf_text("sample.pdf")
  - Calls the function and tries to read a file named sample.pdf.
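If sample.pdf does not exist, the open() call inside the function raises FileNotFoundError. A small check before calling the function can give a friendlier message. This is a sketch only: check_pdf_path is a hypothetical helper, and testing the .pdf suffix is a convenience assumption, not proof that the file is a valid PDF.

```python
from pathlib import Path


def check_pdf_path(pdf_path):
    """Return None if the path looks usable, otherwise a short problem message."""
    path = Path(pdf_path)
    if not path.is_file():
        return f"File not found: {pdf_path}"
    if path.suffix.lower() != ".pdf":
        return f"Not a .pdf file: {pdf_path}"
    return None
```

One way to use it: store problem = check_pdf_path("sample.pdf"), call read_pdf_text("sample.pdf") when problem is None, and print problem otherwise.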
Example Output
If your PDF has text, the program will show something like:
Extracted text from sample.pdf:
This is the first page of the PDF...
It contains text that is now extracted using Python...
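If the preview comes back empty, the PDF may contain scanned images rather than text, in which case there is little or nothing for text extraction to return. One way to see which pages contributed text is to count characters per page. The sketch below works on a plain list of strings so it can be tried without a PDF; page_text_lengths is a hypothetical helper and sample_pages is made-up data.

```python
def page_text_lengths(page_texts):
    """Return (page_number, character_count) pairs, numbering pages from 1."""
    return [
        (number, len((text or "").strip()))
        for number, text in enumerate(page_texts, start=1)
    ]


# Made-up strings standing in for page.extract_text() results
sample_pages = ["This is the first page of the PDF...", "", None]
for number, count in page_text_lengths(sample_pages):
    status = "no text found" if count == 0 else f"{count} characters"
    print(f"Page {number}: {status}")
```

With a real file, the list could be built as [page.extract_text() for page in reader.pages] while the file is still open.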
Key Points
- Use PyPDF2 to read PDF files.
- When opening a PDF with open(), always use 'rb' (read-binary) mode.
- You can loop through pages and extract their text.
- Print a preview instead of the entire content for large PDFs.
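The preview in the program always appends "...", even when the text is shorter than 500 characters. A small helper can add the marker only when something was actually cut off. This is an illustrative sketch; preview and its limit parameter are hypothetical names.

```python
def preview(text, limit=500):
    """Return text unchanged if it fits, else the first `limit` characters plus '...'."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."
```

Replacing print(text[:500] + "...") with print(preview(text)) would keep the same behavior for long PDFs while leaving short text untouched.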
