Looks like what you have is a large chunk of text data that you want to interpret line-by-line.
You can use the StringIO class to wrap that content as a seekable file-like object:
>>> from io import StringIO
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
0
>>> buf.readline()
'big\n'
In your case, do:
from io import StringIO

# Read each line of the PDF
pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore").decode("ascii"))
for line in pdfContent:
    doSomething(line.strip())
(Answer from MikeyB on Stack Overflow)
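The same buffer behaviour can be checked with a small self-contained sketch (the sample string is made up; `str.splitlines()` is shown as a simpler alternative when seeking is not needed):

```python
from io import StringIO

content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
buf = StringIO(content)

# Iterating the file-like buffer yields one line at a time, newline included.
lines = [line.strip() for line in buf]
print(lines)

# splitlines() gives the same result without the file-like wrapper.
print(content.splitlines() == lines)
```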
Using yield and PdfFileReader.pages can simplify things:
from pyPdf import PdfFileReader

def get_pdf_content_lines(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            for line in page.extractText().splitlines():
                yield line

for line in get_pdf_content_lines('/path/to/file.pdf'):
    print(line)
In addition, some may google "python get pdf content text", so here's how (this is how I got here):
from pyPdf import PdfFileReader

def get_pdf_content(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
    content = ' '.join(content.split())
    return content

print(get_pdf_content('/path/to/file.pdf'))
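The `' '.join(content.split())` step collapses every run of whitespace (newlines and tabs included) into single spaces; a minimal stand-alone sketch of that trick:

```python
# str.split() with no argument splits on any run of whitespace,
# so join/split normalizes newlines, tabs and repeated spaces.
raw = "  first line\nsecond\t line \n\n third "
normalized = ' '.join(raw.split())
print(normalized)  # first line second line third
```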
Hi! I had to create a dictionary (book) with the key "Page n" and the value as another dictionary (page n) with the key "line m" and the value as the text of the mth line on the nth page. However, I'm struggling with how to extract single lines.
from pypdf import PdfReader

reader = PdfReader("title.pdf")
book = {}
for k in range(0, len(reader.pages)):
    text = ""
    text += reader.pages[k].extract_text()
    book["Page " + str(k)] = text
Below is my code so far. I think the main problem is that reader.pages[k].extract_text() converts all the text of the kth page of the PDF, ignoring the fact that there are different lines. How can I separate each line?
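Since extract_text() keeps the '\n' characters between lines, splitting the page string is enough to separate the lines; a sketch on a made-up string (the sample stands in for extract_text() output):

```python
# Stand-in for the string returned by reader.pages[k].extract_text().
page_text = "first line\nsecond line\nthird line"

# splitlines() breaks the page into its individual lines,
# which can then be keyed as "line 1", "line 2", ...
page_dict = {f"line {m + 1}": line for m, line in enumerate(page_text.splitlines())}
print(page_dict)
```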
ChatGPT suggests a solution that doesn't work:
# Import the necessary libraries
import PyPDF2

# Open the PDF file
with open('example.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Initialize the book dictionary
    book = {}

    # Loop through each page
    for k in range(len(reader.pages)):
        # Extract text from the k-th page
        text = reader.pages[k].extract_text()

        # Initialize the page dictionary
        page_dict = {}

        # Split the text into lines
        lines = text.split('\n')

        # Loop through each line and add it to the page dictionary
        for m, line in enumerate(lines):
            page_dict[f'line {m+1}'] = line

        # Add the page dictionary to the book dictionary
        book[f'Page {k+1}'] = page_dict

# Print the book dictionary
print(book)
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # the Python 2 file() builtin is gone; use open()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))
Output
Hydraulic Fracturing Fluid Product Component Information Disclosure
Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:
12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032
Hydraulic Fracturing Fluid Composition:
Trade Name
Supplier
Purpose
Ingredients
Chemical Abstract Service Number
(CAS #)
Maximum Ingredient
Concentration
in Additive ( by mass)**
Comments
Maximum Ingredient
Concentration
in HF Fluid ( by mass)**
Water Sand HCL
Pumpco Pumpco
Proppant Hydrochloric Acid
Plexslick 921
Pumpco
Friction Reducer
Plexaid 673
Pumpco
Scale Inhibitor
Plexcide 24L
Pumpco
Biocide
Crystaline Silica
Hydrogen Chloride Water
Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water
Methyl Alcohol Organic phosphonic acid salts
Dazomat Sodium Hydroxide Water
7732-18-5 14808-60-7
7647-01-0 7732-18-5
64742-47-8 9003-06-9
7732-18-5
67-56-1
533-74-4 1310-73-2 7732-18-5
100.00 100.00
90.01799 9.84261
40.00 60.00
35.00 28.00 7.00 30.00
25.00 75.00
24.00 4.00 72.00
0.03353 0.05029
0.00941 0.00753 0.00188 0.00807
0.00276 0.00828
0.00424 0.00071 0.01271
- Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100
Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)
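The trailing .strip().split('\n\n') call above groups the extracted text into blank-line-separated blocks; the same slicing on a small made-up string:

```python
# Two consecutive newlines separate logical blocks in pdfminer's
# output, so '\n\n' is a natural block delimiter.
sample = "Header\n\nRow one\nRow two\n\nFooter\n"
blocks = sample.strip().split('\n\n')
print(blocks)  # ['Header', 'Row one\nRow two', 'Footer']
```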
textract works fine in Python 3, using the tesseract method. Example code:
import textract
text = textract.process("pdfs/testpdf1.pdf", method='tesseract')
print(text)
with open('textract-results.txt', 'w+') as f:
    f.write(str(text))
https://pypi.org/project/textract/
I'm using PyPDF2 to extract text from a PDF. This works fine: 'raw' text is imported and printed. However, I need to clean up the output, since there are line breaks all over the place. At the same time, not ALL line breaks need to be removed; some of them are valid.
I tried it with .strip(), .replace() and .splitlines() but everything I try leads to either one long line, 'ghost' spaces all over the place, or most line ends removed, but still some (unwanted) line ends left. So I have no clue on how to do this.
(I have the idea that 'real' line ends can be distinct from 'unwanted' line ends because they are double. But I'm not sure.)
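If 'real' line ends really are doubled, a regular expression that joins single newlines but preserves blank lines is one way to test that idea (a generic sketch, not specific to PyPDF2's output):

```python
import re

# Replace a lone '\n' (one that is not part of a '\n\n' pair)
# with a space, keeping doubled newlines as real paragraph breaks.
raw = "This sentence was\nwrapped by the PDF.\n\nThis is a real new paragraph."
cleaned = re.sub(r'(?<!\n)\n(?!\n)', ' ', raw)
print(cleaned)
```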
My script (as it currently looks)
import PyPDF2

raw_text = ''

def cleanlines(value):
    value = value.strip()
    return ''.join(value.splitlines())

with open('testfile.pdf', 'rb') as pdf_file:
    pdfReader = PyPDF2.PdfFileReader(pdf_file)
    for page_num in range(pdfReader.numPages):
        pdf_page = pdfReader.getPage(page_num)
        raw_text_part = (pdf_page.extractText())
        raw_text += raw_text_part.strip()

# raw_text = raw_text.replace("\n", " ")
# raw_text = raw_text.strip()
# raw_text = raw_text.replace(" ", "\n")

print(raw_text)
print(cleanlines(raw_text))
(A part of) the original PDF and the output can be seen here: https://imgur.com/a/40u4nl2
The output is pasted in Notepad++ with "all characters" on to make the line ends visible. I'm using Windows 10 and Python 3.7.0. Please forgive me the language; I used Samuel L. Ipsum for it (https://slipsum.com/).
I was looking for a simple solution to use for Python 3.x and Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package, really straightforward for reading PDFs.
Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.
from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])
Note that Tika is written in Java so you will need a Java runtime installed.
pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.
pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in when to insert a new line). The core advantage is that they are way faster. But they are not pure Python, which can mean that you cannot execute them everywhere. And some might have licenses too restrictive for you to use them.
Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:
- Anything special regarding tables (just that the text is there, not about the formatting)
- Arabic text (RTL languages)
- Mathematical formulas.
That means if your use-case requires those points, you might perceive the quality differently.
Having said that, the results from November 2022:
pypdf
I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)
First, install it:
pip install pypdf
And then use it:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
Please note that those packages are not maintained:
- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)
pymupdf
import fitz  # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text)
Other PDF libraries
- pikepdf does not support text extraction (source)