Looks like what you have is a large chunk of text data that you want to interpret line-by-line.

You can use the StringIO class to wrap that content as a seekable file-like object:

>>> import io
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = io.StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
0
>>> buf.readline()
'big\n'

In your case, on Python 3, do:

from io import StringIO

# Read each line of the PDF.
# getPDFContent is your own function and is assumed to return a str;
# the encode/decode round trip drops any non-ASCII characters.
text = getPDFContent("test.pdf").encode("ascii", "ignore").decode("ascii")
pdfContent = StringIO(text)
for line in pdfContent:
    doSomething(line.strip())
Answer from MikeyB on Stack Overflow
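
The same pattern can be wrapped in a small helper. This is only a minimal sketch for Python 3, assuming the extracted text is already a str; `iter_lines` and `sample_text` are made-up names for illustration and are not part of the answer above.

```python
import io

def iter_lines(text):
    """Yield each line of text with surrounding whitespace stripped."""
    # io.StringIO wraps a str as a seekable, file-like object,
    # so it can be iterated line by line like an open text file.
    for line in io.StringIO(text):
        yield line.strip()

# Placeholder standing in for whatever your PDF extraction returns.
sample_text = "big\nugly\ncontents\nof\nmultiple\npdf files"

lines = list(iter_lines(sample_text))
print(lines)
```

If you do not need seek(), `sample_text.splitlines()` gives the same lines with less ceremony.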
pypdf2.readthedocs.io › en › 3.x › user › extract-text.html
Extract Text from a PDF — PyPDF2 documentation
If you scan a document, the resulting PDF typically shows the image of the scan. Scanners then also run OCR software and put the recognized text in the background of the image. This result of the scanner's OCR software can be extracted by PyPDF2. However, in such cases it’s recommended to ...
Discussions

Extracting text from pdf using Python and Pypdf2 - Stack Overflow
My problem is P_lines cannot extract data line by line and results in one giant string. I want to extract text line by line to analyze it. Any suggestion on how to improve it? More on stackoverflow.com
How to clean output from PyPDF2
PDF files do not contain text per se; think of them as instructions on how to draw a page ("draw a line from A to B", "put the letter X on this position", etc.). This makes text extraction from PDF files nontrivial, as you need to render the PDF file to know where the letters are, and also know how wide each letter is to be sure which letters are next to each other.

PyPDF2's extractText is really simple: it just looks for "draw letter sequence XYZ" commands and writes all the letters it finds in the order the draw instructions appear in the PDF, and apparently adds a new line after each instruction. It seems your PDF file was created by the layout program with one instruction to write "Normal" then another to write "ly" (probably because of different kerning), etc.

tl;dr: PyPDF2 is not sophisticated enough to extract text reliably.

The docs for extractText say: "Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated."

As an alternative, you can try pdfminer.six. They also have a more in-depth explanation of how difficult extracting text can be. There's a good chance that pdfminer will do a better job. More on reddit.com
August 26, 2021
How to extract text from a PDF file via python? - Stack Overflow
If yes, then see if it helps by using a different encoding other than (utf-8) 2019-07-14T22:17:40.16Z+00:00 ... @Matin Thoma is it possible to preserve the format, when extracting, say python code from a PDF? 2023-01-24T14:04:25.517Z+00:00 ... After trying textract (which seemed to have too many dependencies) and pypdf2 ... More on stackoverflow.com
January 20, 2022
Line returns missing in text_extraction()
There was an error while loading. Please reload this page · PDF file: https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf More on github.com
August 31, 2023
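
The Reddit answer above describes a PDF as positioned draw instructions rather than running text. The following toy sketch illustrates why position and width matter when rebuilding lines. It does not use any PDF library: the `(x, y, width, text)` fragment format, the `rebuild_lines` helper, the `gap` threshold, and the sample coordinates are all assumptions invented for illustration, and real extractors are considerably more involved.

```python
from collections import defaultdict

def rebuild_lines(fragments, gap=1.0):
    """Rebuild reading-order lines from positioned text fragments.

    Each fragment is (x, y, width, text). Fragments sharing a y value are
    treated as one line; a space is inserted only when the horizontal gap
    between neighbours exceeds `gap`.
    """
    rows = defaultdict(list)
    for x, y, width, text in fragments:
        rows[y].append((x, width, text))

    lines = []
    # PDF coordinates grow upwards, so a larger y is higher on the page.
    for y in sorted(rows, reverse=True):
        parts = []
        prev_end = None
        for x, width, text in sorted(rows[y]):
            if prev_end is not None and x - prev_end > gap:
                parts.append(" ")
            parts.append(text)
            prev_end = x + width
        lines.append("".join(parts))
    return lines

# Hypothetical draw order: "Normal" and "ly" are separate instructions,
# and part of the second line happens to be drawn first.
fragments = [
    (10.0, 680.0, 30.0, "Second"),
    (10.0, 700.0, 36.0, "Normal"),
    (46.0, 700.0, 10.0, "ly"),
    (60.0, 700.0, 24.0, "text"),
    (44.0, 680.0, 20.0, "line"),
]

print(rebuild_lines(fragments))
```

Writing the fragments out in draw order with a newline after each one would split "Normal" from "ly"; grouping by position joins them again.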
People also ask

How can I extract text from a PDF using Python?
You can use IronPDF to extract text from PDF files in Python. It involves loading the PDF with the PdfDocument.FromFile method and iterating through pages to extract text line by line.
ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › python extract text from pdf line by line
Python Extract Text From PDF Line By Line (Tutorial)
What is required to start extracting text from PDFs in Python?
To extract text from PDFs in Python, you need to have Python installed, along with the IronPDF library, which can be installed via pip. An IDE like Visual Studio Code is recommended for writing and executing your scripts.
How do I execute a Python script for PDF text extraction?
After writing your script, you can execute it by running python main.py in your IDE's terminal, where main.py is the name of your script file.
reddit.com › r/learnpython › extract text line by line from a pdf
r/learnpython on Reddit: Extract text line by line from a PDF
July 28, 2024 -

Hi! I had to create a dictionary (book) with the key "Page n" and the value as another dictionary (page n) with the key "line m" and the value as the text of the mth line on the nth page. However, I'm struggling with how to extract single lines.

from pypdf import PdfReader

reader = PdfReader("title.pdf")

book = {}
for k in range(0, len(reader.pages)):
    text = ""
    text += reader.pages[k].extract_text()
    book["Page " + str(k)] = text

Above is my code so far. I think the main problem is that reader.pages[k].extract_text() returns all the text of the kth page as one string, ignoring the fact that there are different lines. How can I separate each line?

ChatGPT suggested a solution that doesn't work:

# Import the necessary libraries
import PyPDF2

# Open the PDF file
with open('example.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Initialize the book dictionary
    book = {}

    # Loop through each page
    for k in range(len(reader.pages)):
        # Extract text from the k-th page
        text = reader.pages[k].extract_text()

        # Initialize the page dictionary
        page_dict = {}

        # Split the text into lines
        lines = text.split('\n')

        # Loop through each line and add it to the page dictionary
        for m, line in enumerate(lines):
            page_dict[f'line {m+1}'] = line

        # Add the page dictionary to the book dictionary
        book[f'Page {k+1}'] = page_dict

# Print the book dictionary
print(book)
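
For reference, the nested dictionary the question asks for can be sketched independently of any PDF library. This assumes the text of each page is already available as a str and that the extractor emitted newline characters between lines, which varies from PDF to PDF. `build_book` and `page_texts` are hypothetical names, and `page_texts` stands in for the per-page results of extract_text().

```python
def build_book(page_texts):
    """Map 'Page n' -> {'line m': text} for a list of per-page strings."""
    book = {}
    for page_no, text in enumerate(page_texts, start=1):
        # splitlines() handles \n, \r\n and \r; blank lines are dropped.
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        book[f"Page {page_no}"] = {
            f"line {line_no}": line
            for line_no, line in enumerate(lines, start=1)
        }
    return book

# Stand-in for [page.extract_text() for page in reader.pages].
page_texts = ["Title\nFirst line\n\nSecond line", "Only line on page 2"]
book = build_book(page_texts)
print(book["Page 1"]["line 2"])
```

If the extracted text contains no newlines at all, every page ends up with a single "line 1" entry, which would point to the extraction step rather than the splitting logic.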
Top answer
1 of 4
2
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # file() was removed in Python 3
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))

Output

Hydraulic Fracturing Fluid Product Component Information Disclosure

Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:

12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032

Hydraulic Fracturing Fluid Composition:

Trade Name

Supplier

Purpose

Ingredients

Chemical Abstract Service Number

(CAS #)

Maximum Ingredient

Concentration

in Additive ( by mass)**

Comments

Maximum Ingredient

Concentration

in HF Fluid ( by mass)**

Water Sand HCL

Pumpco Pumpco

Proppant Hydrochloric Acid

Plexslick 921

Pumpco

Friction Reducer

Plexaid 673

Pumpco

Scale Inhibitor

Plexcide 24L

Pumpco

Biocide

Crystaline Silica

Hydrogen Chloride Water

Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water

Methyl Alcohol Organic phosphonic acid salts

Dazomat Sodium Hydroxide Water

7732-18-5 14808-60-7

7647-01-0 7732-18-5

64742-47-8 9003-06-9

7732-18-5

67-56-1

533-74-4 1310-73-2 7732-18-5

100.00 100.00

90.01799 9.84261

40.00 60.00

35.00 28.00 7.00 30.00

25.00 75.00

24.00 4.00 72.00

0.03353 0.05029

0.00941 0.00753 0.00188 0.00807

0.00276 0.00828

0.00424 0.00071 0.01271

* Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100

Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)

2 of 4
2

textract works fine in python3, using the tesseract method. Example code:

import textract
# textract.process returns bytes, so decode before printing or writing
text = textract.process("pdfs/testpdf1.pdf", method='tesseract').decode('utf-8')
print(text)
with open('textract-results.txt', 'w', encoding='utf-8') as f:
    f.write(text)

https://pypi.org/project/textract/

python-forum.io › thread-14303.html
Extract Line from PDF
Hey, I want to extract the line in which a specific keyword is found. For text documents this is very simple: loop through the text and print the line. So far, I have done it like this: with open('text.txt','r') as f...
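
Once the PDF text has been extracted to a str, the same loop-and-match approach carries over. A minimal sketch, assuming the extractor preserved line breaks; `lines_with_keyword` and the sample `extracted` string are hypothetical and not taken from the forum thread.

```python
def lines_with_keyword(text, keyword, ignore_case=True):
    """Return (line_number, line) pairs for lines containing keyword."""
    needle = keyword.lower() if ignore_case else keyword
    matches = []
    for number, line in enumerate(text.splitlines(), start=1):
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            matches.append((number, line.strip()))
    return matches

# Placeholder for text returned by a PDF extraction library.
extracted = "Invoice 2024-001\nCustomer: ACME\nTotal due: 120.00\nThank you"
print(lines_with_keyword(extracted, "total"))
```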
pypdf.readthedocs.io › en › stable › user › extract-text.html
Extract Text from a PDF — pypdf 6.9.2 documentation
The function provided in argument visitor_text of function extract_text has five arguments: text: the current text (as long as possible, can be up to a full line)
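
A visitor of this kind could, for example, group chunks by vertical position. The sketch below is hedged in several ways: only the `text` argument is described in the snippet above, and the remaining four parameter names (`cm`, `tm`, `font_dict`, `font_size`) follow my understanding of the pypdf documentation. It also assumes the vertical offset is in `tm[5]`; depending on the PDF and the pypdf version it may be in `cm[5]` instead. The `LineCollector` class is hypothetical, and the calls at the bottom are simulated rather than produced by pypdf.

```python
from collections import defaultdict

class LineCollector:
    """Group text chunks by vertical position (assumed to be tm[5])."""

    def __init__(self):
        self.rows = defaultdict(list)

    def visit(self, text, cm, tm, font_dict, font_size):
        # tm is treated as a 6-element matrix [a, b, c, d, e, f], where
        # e and f are taken as the x and y offsets of the chunk.
        if text.strip():
            self.rows[round(tm[5], 1)].append((tm[4], text.strip()))

    def lines(self):
        # Top of the page first (largest y), then left to right.
        return [
            " ".join(chunk for _, chunk in sorted(self.rows[y]))
            for y in sorted(self.rows, reverse=True)
        ]

collector = LineCollector()
# Simulated callbacks; with pypdf the visitor would instead be passed as
# page.extract_text(visitor_text=collector.visit).
identity = [1, 0, 0, 1, 0, 0]
collector.visit("world", identity, [1, 0, 0, 1, 120, 700], None, 12)
collector.visit("Hello", identity, [1, 0, 0, 1, 72, 700], None, 12)
collector.visit("Next line", identity, [1, 0, 0, 1, 72, 686], None, 12)
print(collector.lines())
```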
soudegesu.com › en › post › python › extract-text-from-pdf-with-pypdf2
Use PyPDF2 - extract text data from PDF file - Sou-Nan-De-Gesu
December 2, 2018 -

import PyPDF2

FILE_PATH = './files/executive_order.pdf'

with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    for page in reader.pages:
        pass

The PdfFileReader class has a pages property that is a list of PageObject instances. Iterating over the pages property with a for loop gives access to every page in order, starting from the first. Now extract the text from each page object.
reddit.com › r/learnpython › how to clean output from pypdf2
r/learnpython on Reddit: How to clean output from PyPDF2
August 26, 2021 -

I'm using PyPDF2 to extract text from a PDF. This works fine. 'raw' text is imported and printed. However I need to clean up the output, since there are line breaks all over the place. At the same time, not ALL linebreaks need to be removed, some of them are valid.

I tried it with .strip(), .replace() and .splitlines() but everything I try leads to either one long line, 'ghost' spaces all over the place, or most line ends removed, but still some (unwanted) line ends left. So I have no clue on how to do this.

(I have the idea that 'real' line ends can be distinct from 'unwanted' line ends because they are double. But I'm not sure.)

My script (as it currently looks)

import PyPDF2
raw_text = ''

def cleanlines(value):
	value = value.strip()
	return ''.join(value.splitlines())

with open('testfile.pdf', 'rb') as pdf_file:
	pdfReader = PyPDF2.PdfFileReader(pdf_file)

	for page_num in range(pdfReader.numPages):
		pdf_page = pdfReader.getPage(page_num)
		raw_text_part = (pdf_page.extractText())
		raw_text += raw_text_part.strip()
		
# raw_text = raw_text.replace("\n", " ")
# raw_text = raw_text.strip()
# raw_text = raw_text.replace("   ", "\n")
print (raw_text)
print (cleanlines(raw_text))

(a part of) the original PDF and the output can be seen here: https://imgur.com/a/40u4nl2

The output is pasted in Notepad++ with "all characters" on to make the line ends visible. I'm using Windows 10 and Python 3.7.0. Please forgive the language; I used Samuel L. Ipsum for it (https://slipsum.com/)
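
If the hunch in the post holds (real line ends are doubled, unwanted ones are single), the cleanup can be sketched as below. That premise is the poster's unverified guess, so treat this as an illustration under that assumption only. `unwrap_paragraphs` and the sample `raw` string are hypothetical.

```python
import re

def unwrap_paragraphs(raw_text):
    """Join single line breaks, keep blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline followed by an (optionally space-padded) blank line
    # is treated as the end of a paragraph.
    paragraphs = re.split(r"\n[ \t]*\n+", text)
    cleaned = []
    for para in paragraphs:
        # Collapse remaining newlines and repeated spaces into one space.
        joined = re.sub(r"\s+", " ", para).strip()
        if joined:
            cleaned.append(joined)
    return "\n".join(cleaned)

raw = "Normally, both your\nlegs would be\nrested.\n\nSecond paragraph\nhere."
print(unwrap_paragraphs(raw))
```

Unlike ''.join(value.splitlines()), this puts a space where each single line break was, so words on adjacent lines do not run together.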

woteq.com › home › how to read text line by line using pypdf2
How to read text line by line using PyPDF2 - Woteq Zone
February 10, 2026 - Learn how to extract and read text line by line from PDF documents using Python's PyPDF2 library. Step-by-step guide with code examples for developers working with PDF text extraction.
ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › python extract text from pdf line by line
Python Extract Text From PDF Line By Line (Tutorial)
January 19, 2026 - You can use IronPDF to extract text from PDF files in Python. It involves loading the PDF with the PdfDocument.FromFile method and iterating through pages to extract text line by line.
askpython.com › home › how to process text from pdf files in python?
How to Process Text from PDF Files in Python? - AskPython
October 13, 2020 - !This is the first page. ! Process finished with exit code 0 · Here we used the getPage method to store the page as an object. Then we used extractText() method to get text from the page object.
quora.com › How-can-I-read-the-PDF-file-line-by-line-using-Python
How to read the PDF file line by line using Python - Quora
Answer:

import PyPDF4
import re
import io

pdfFileObj = open(r'test.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(101)
pages_text = pageObj.extractText()

for line in io.StringIO(pages_text):
    print(line)

The above code with StringIO will help you read...
codespeedy.com › home › how to read pdf file in python line by line?
How to Read PDF File in Python Line by Line? - CodeSpeedy
January 31, 2020 - # Importing required modules import ... of lines text = pageObj.extractText().split(" ") # Finally the lines are stored into list # For iterating over list a loop is used for i in range(len(text)): # Printing the line # Lines are seprated using "\n" print(text[i],end="\n\n") # For Seprating ...
medium.com › @nutanbhogendrasharma › extracting-text-from-pdf-file-in-python-using-pypdf2-5cefb66f1230
Extracting Text From PDF File in Python Using PyPDF2 | by Nutan | Medium
August 10, 2022 - Extracting Text From PDF File in Python Using PyPDF2 In this blog we will extract text from pdf using PyPDF2 library. What is PyPDF2? PyPDF2 is a free and open source pure-python PDF library capable …
automatetheboringstuff.com › 2e › chapter15
Chapter 15 – Working with PDF and Word Documents
To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object. You can get a Page object by calling the getPage() method on a PdfFileReader object and passing it the page number of the page you’re interested in, in our ...
github.com › py-pdf › pypdf › blob › main › docs › user › extract-text.md
pypdf/docs/user/extract-text.md at main · py-pdf/pypdf
Mathematical Formulas: Should they be extracted? Formulas have indices and nested fractions. Whitespace characters: How many new lines should be extracted for 3 cm of vertical whitespace? How many spaces should be extracted if there is 3 cm of horizontal whitespace?
Author   py-pdf
friendlyuser.github.io › posts › tech › python › extract_text_from_pdf_in_python
How to extract text from a PDF file in Python - FriendlyUsers Tech Blog
Unable to extract text.") return False total_pages = pdf_reader.numPages print(f"Total pages: {total_pages}") with open(output_path, 'w', encoding='utf-8') as output_file: print(f"Extracting text to: {output_path}") for page in range(total_pages): text = pdf_reader.getPage(page).extractText() output_file.write(text) print("Text extraction completed.") return True except FileNotFoundError: print(f"Error: The file {pdf_path} was not found.") return False except PyPDF2.utils.PdfReadError: print(f"Error: Unable to read the PDF file {pdf_path}") return False except Exception as e: print(f"Error: An unexpected error occurred: {str(e)}") return False if __name__ == "__main__": if len(sys.argv) != 3: print("Usage: python script_name.py input_pdf_path output_txt_path") else: input_pdf_path = sys.argv[1] output_txt_path = sys.argv[2] extract_pdf_text(input_pdf_path, output_txt_path)
Top answer
1 of 16
323

I was looking for a simple solution for Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

2 of 16
244

pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in deciding when to insert a new line). The main point is that they are much faster. But they are not pure Python, which can mean that you cannot run them in your environment. And some might have licenses that are too restrictive for your use.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

  • Anything special regarding tables (just that the text is there, not about the formatting)
  • Arabic text (RTL-languages)
  • Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

Please note that those packages are not maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)
github.com › py-pdf › pypdf › issues › 2138
Line returns missing in text_extraction() · Issue #2138 · py-pdf/pypdf
August 31, 2023 - Can you also test the page.extract_text() function? It seems always combine sentences in multiline without space. the first page in my attached file. Originally posted by @yonglee7015 in #2135 (reply in thread)
Author   pubpub-zz
studytonight.com › post › extract-text-from-pdf-in-python-pypdf2-module
Extract Text from PDF in Python - PyPDF2 Module - Studytonight
June 28, 2023 - Learn how to extract Text from a PDF file in Python using the PyPDF2 module to fetch info from the PDF file and extract text from all pages with code examples.