Introduction to Python Programming

Reading Text from PDF Files in Python

Learning Objective

By the end of this lab, you will understand how to:

Open a PDF file in Python
Extract text from its pages
Display part of the extracted text

Reading PDF Text with PyPDF2

PyPDF2 is a third-party Python library for reading PDF files.

Because it is not part of the standard library, install it with pip before using it in your program:

pip install PyPDF2
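
To confirm that the installation worked, a quick check along these lines may help. This is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named module (hypothetical helper)."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is installed.")
else:
    print("PyPDF2 was not found. Run: pip install PyPDF2")
```

The check only looks the module up, so it is safe to run even when PyPDF2 is missing.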

In this lesson, we will write a program that extracts text from a PDF.

Code: `read_pdf_text.py`

# Import the PdfReader class from the PyPDF2 library
from PyPDF2 import PdfReader

def read_pdf_text(pdf_path):
    """
    Extract text from all pages of a PDF file.
    
    Args:
        pdf_path (str): Path to the PDF file
    
    Prints:
        First 500 characters of the extracted text
    """
    
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)  # Create a PDF reader object
        text = ""  # Store extracted text here

        # Loop through each page in the PDF
        for page in reader.pages:
            text += page.extract_text() + "\n\n"  # Add page text with a blank line between pages

        # Print a preview of the extracted text
        print(f"Extracted text from {pdf_path}:\n")
        print(text[:500] + "...")  # Print only the first 500 characters

# Example usage
read_pdf_text("sample.pdf")

Step-by-Step:

Importing the Library
```
from PyPDF2 import PdfReader
```
- PdfReader is the class in the PyPDF2 library that reads PDF files.
Defining a Function
```
def read_pdf_text(pdf_path):
```
- This function takes the file path (location of the PDF) and extracts its text.
Opening the File
```
with open(pdf_path, 'rb') as file:
```
- 'rb' means read in binary mode (needed for PDF files).
- with open(...) ensures the file closes automatically after use.
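
To see what binary mode gives you, the sketch below reads the first five bytes of a file and compares them with `%PDF-`, the signature that PDF files normally begin with. The function name `looks_like_pdf` and the temporary demo file are assumptions made for illustration; this is a rough sanity check, not a full validation of the file.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file begins with the bytes b'%PDF-' (hypothetical helper)."""
    with open(path, "rb") as file:  # binary mode returns bytes, not str
        header = file.read(5)
    return header == b"%PDF-"


# Demo with a tiny throwaway file, so the example runs without a real PDF
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    demo_path = tmp.name

print(looks_like_pdf(demo_path))  # True
os.remove(demo_path)
```

Because the file is opened with 'rb', `file.read(5)` returns a bytes object, which is why the comparison uses `b"%PDF-"` rather than a regular string.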
Creating a PDF Reader Object
```
reader = PdfReader(file)
```
- The reader object gives us access to the PDF's content, including its pages through reader.pages.
Extracting Text
```
for page in reader.pages:
    text += page.extract_text() + "\n\n"
```
- Loops through each page.
- page.extract_text() pulls the text from that page.
- "\n\n" adds two newline characters, which leaves a blank line between pages.
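
The loop can also be written as a small helper that collects each page's text and joins the pieces with a blank line. The sketch below is an illustration under stated assumptions: `join_page_texts` and `FakePage` are hypothetical names, and `FakePage` stands in for a real PDF page so the example runs without a PDF file. The `or ""` guard assumes a page might yield `None` when it has no extractable text, which is worth protecting against even if your version always returns a string.

```python
def join_page_texts(pages):
    """Combine the text of each page, separated by a blank line (hypothetical helper)."""
    parts = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat a missing result as empty text
        parts.append(page_text)
    return "\n\n".join(parts)


class FakePage:
    """Stand-in for a PDF page; it only mimics the extract_text() method."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("First page text"), FakePage("Second page text")]
print(join_page_texts(pages))
```

With a real file, the same helper could be called as `join_page_texts(reader.pages)`. Joining a list at the end also avoids the trailing blank line that `text += ... + "\n\n"` leaves after the last page.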
Previewing the Extracted Text
```
print(text[:500] + "...")
```
- Shows only the first 500 characters so the output doesn’t become too long.
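
Note that `text[:500] + "..."` adds the dots even when the text is shorter than 500 characters. A slightly more careful version, sketched below with the hypothetical helper name `preview`, appends them only when something was actually cut off.

```python
def preview(text, limit=500):
    """Return text unchanged if it is short, otherwise the first `limit` characters plus '...'."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


print(preview("Short text"))      # Short text
print(len(preview("x" * 1000)))   # 503
```

The second call returns 500 characters followed by three dots, which is why its length is 503.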
Running the Program
```
read_pdf_text("sample.pdf")
```
- Calls the function and tries to read a file named sample.pdf.
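
If sample.pdf is not in the folder you run the program from, `open()` raises a FileNotFoundError. One way to get a friendlier message is to check the path first. The sketch below uses only the standard library; `check_pdf_path` is a hypothetical helper name and not part of PyPDF2.

```python
from pathlib import Path


def check_pdf_path(pdf_path):
    """Return a message describing the problem, or None if the path looks usable."""
    path = Path(pdf_path)
    if not path.is_file():
        return f"File not found: {pdf_path}"
    if path.suffix.lower() != ".pdf":
        return f"Not a .pdf file: {pdf_path}"
    return None


problem = check_pdf_path("sample.pdf")
if problem:
    print(problem)
else:
    print("sample.pdf looks usable; read_pdf_text('sample.pdf') can be called next.")
```

The check looks only at the file name and whether the file exists, so a file that passes may still fail to open as a PDF.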

Example Output

If your PDF contains selectable text (not just scanned images), the program will show something like:

Extracted text from sample.pdf:

This is the first page of the PDF...
It contains text that is now extracted using Python...

Key Points

Use PyPDF2 to read PDF files.
Always open PDF files in 'rb' (read-binary) mode.
You can loop through pages and extract their text.
Print a preview instead of the entire content for large PDFs.

Exercise Files

read_pdf_text.zip

Size: 634.00 B