I was looking for a simple solution for Python 3.x on Windows. Unfortunately, textract doesn't seem to support that combination, but if you want something simple for Windows/Python 3, check out the tika package. It's really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java, so you will need a Java runtime installed.
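
Since the Java requirement is easy to miss, here is a minimal sketch that checks for a `java` executable on the PATH before calling Tika. The helper names `java_available` and `extract_with_tika` are made up for illustration and are not part of the tika package. It also assumes that `content` may come back empty for PDFs without a text layer, so it falls back to an empty string.

```python
import shutil


def java_available(executable: str = "java") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


def extract_with_tika(path: str) -> str:
    """Extract text with tika, failing early if no Java runtime is found."""
    if not java_available():
        raise RuntimeError("No 'java' executable found on PATH; Tika needs a Java runtime.")
    # Imported here so the Java check above runs even if tika is not installed.
    from tika import parser  # pip install tika

    parsed = parser.from_file(path)
    # Assumption: 'content' can be missing or None when nothing was extracted.
    return parsed.get("content") or ""
```

Usage would be `print(extract_with_tika('sample.pdf'))`, with the same file name as in the snippet above.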

Answer from DJK on Stack Overflow
2 of 16
244

pypdf has improved a lot recently. Depending on the data, it is on par with or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in where line breaks are placed). Their main advantage is that they are much faster. However, they are not pure Python, which can mean that you cannot run them in your environment. Some also have licenses that may be too restrictive for your use.
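
If you are not sure whether the faster, non-pure-Python library can be installed where your code runs, one option is to pick whichever is available at runtime. This is only a sketch: `pick_backend` and `extract_text` are hypothetical helper names, and it assumes the module names `fitz` (PyMuPDF) and `pypdf` used elsewhere in this answer.

```python
import importlib.util


def pick_backend(candidates=("fitz", "pypdf")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def extract_text(path):
    """Extract text with PyMuPDF if present, otherwise fall back to pypdf."""
    backend = pick_backend()
    if backend == "fitz":
        import fitz  # pip install PyMuPDF

        with fitz.open(path) as doc:
            return "".join(page.get_text() for page in doc)
    if backend == "pypdf":
        from pypdf import PdfReader  # pip install pypdf

        reader = PdfReader(path)
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    raise ImportError("Neither PyMuPDF nor pypdf is installed.")
```

The order of `candidates` encodes the preference. Put `pypdf` first if the license of the alternative is a concern for you.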

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

  • Anything special regarding tables (just that the text is there, not about the formatting)
  • Arabic text (RTL languages)
  • Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
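
If some pages are scanned images without an OCR text layer, there may be nothing to extract from them. The following is a rough sketch for spotting such pages. `extract_pages` and `empty_page_numbers` are hypothetical helpers, not pypdf functions, and they work with any reader object that exposes `.pages` whose items have `extract_text()`.

```python
def extract_pages(reader):
    """Return one string per page, using "" where no text was extracted."""
    texts = []
    for page in reader.pages:
        # Guard against an empty or missing result on image-only pages.
        texts.append(page.extract_text() or "")
    return texts


def empty_page_numbers(texts):
    """Return 1-based numbers of pages whose text is empty or whitespace only."""
    return [number for number, text in enumerate(texts, start=1) if not text.strip()]


# Usage with pypdf (assumes example.pdf exists):
# from pypdf import PdfReader
# texts = extract_pages(PdfReader("example.pdf"))
# print("Pages without text:", empty_page_numbers(texts))
```

Pages reported here are candidates for running OCR separately.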

Please note that these packages are no longer maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)