In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. Answer from ImGallo on reddit.com
🌐
The Python Code
thepythoncode.com › article › extract-text-from-pdf-in-python
How to Extract Text from PDF in Python - The Python Code
Learn how to extract text as paragraphs line by line from PDF documents with the help of PyMuPDF library in Python.
🌐
Reddit
reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?
r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?
July 19, 2024 -

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!

Discussions

data mining - How to extract paragraphs from text document? - Data Science Stack Exchange
I have extracted text data from pdf files of annual reports of companies using pdftotext. The extracted file content looks like: Sample pdf file is here FORWARD-LOOKING STATEMENTS In this Annual ... More on datascience.stackexchange.com
🌐 datascience.stackexchange.com
PDF Extraction with python wrappers
Hi everyone, I need some advice. Some people recommended that I use Python wrappers (poppler pdftotext) to extract data from this PDF file, from page 4 to the end or a known limit (here: page 605). But I have never used poppler pdftotext before and need some help, please. More on discuss.python.org
🌐 discuss.python.org
December 5, 2023
How to extract text from a PDF file via python? - Stack Overflow
I'm extracting this PDF's text using the PyPDF2 Python package (version 1.27.2): import PyPDF2 with open("sample.pdf", "rb") as pdf_file: read_pdf = PyPDF2.PdfFileReader(pd... More on stackoverflow.com
🌐 stackoverflow.com
Extracting certain paragraphs from pdf
Have you checked out pdftools ? Will require some string manipulation after reading into R. Would probably read in, filter the resulting vector for strings that contain 'clients', then extract paragraphs (something like '^clients.+\n'). More on reddit.com
🌐 r/rstats
February 16, 2022
🌐
PyPDF
pypdf.readthedocs.io › en › stable › user › extract-text.html
Extract Text from a PDF — pypdf 6.9.2 documentation
Hyperlinks and Metadata: Should it be extracted at all? Where should it be placed in which format? Linearization: Assume you have a floating figure in between a paragraph. Do you first finish the paragraph, or do you put the figure text in between? Then there are issues where most people would agree on the correct output, but the way PDF stores information just makes it hard to achieve that:
🌐
Medium
medium.com › asposepdf › how-to-extract-text-from-pdf-python-547de98db6cc
How to Extract Text from PDF using Python | by PDF-Python | asposepdf | Medium
December 3, 2023 - This code performs a neat trick: it extracts text from a specific PDF page using the Aspose.PDF library. A handy method to grab text for future analysis or experimentation in Python! ParagraphAbsorber, similar to prior tools, aids in managing text as paragraphs within its unique collection.
🌐
GeeksforGeeks
geeksforgeeks.org › extract-text-from-pdf-file-using-python
Extract text from PDF File using Python - GeeksforGeeks
August 9, 2024 - The Python package pypdf can be used to achieve what we want (text extraction), although it can do more than what we need. This package can also be used to generate, decrypt, and merge PDF files.
🌐
Towards Data Science
towardsdatascience.com › home › latest › extracting text from pdf files with python: a comprehensive guide
Extracting text from PDF files with Python: A comprehensive guide | Towards Data Science
January 27, 2025 - Then, by comparing the distances of those characters from others it composes the appropriate words, sentences, lines, and paragraphs of text. (4) To achieve that, the library: Separates the individual pages from the PDF file using the high-level function extract_pages() and converts them into LTPage objects.
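The distance-based grouping described in this snippet can be illustrated with a toy sketch. This is not the library's actual algorithm: the `Char` tuple, the `group_into_words` function, and the `max_gap` threshold are hypothetical names chosen for illustration, and all characters are assumed to sit on a single line.

```python
from typing import NamedTuple


class Char(NamedTuple):
    """Hypothetical positioned character: its text plus left/right x-coordinates."""
    text: str
    x0: float
    x1: float


def group_into_words(chars: list[Char], max_gap: float = 1.5) -> list[str]:
    """Join characters into words; a horizontal gap wider than max_gap starts a new word."""
    words: list[str] = []
    current = ""
    prev_x1 = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev_x1 is not None and ch.x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += ch.text
        prev_x1 = ch.x1
    if current:
        words.append(current)
    return words


# Made-up sample: the letters of "Hi" and "PDF" separated by a wide gap.
sample = [
    Char("H", 0.0, 5.0), Char("i", 5.5, 7.0),
    Char("P", 12.0, 17.0), Char("D", 17.5, 22.5), Char("F", 23.0, 27.0),
]
words = group_into_words(sample)
print(words)
```

A real extractor would likely scale the threshold with font size and apply the same idea vertically to form lines and paragraphs.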
🌐
Posos
posos.co › blog-articles › how-to-extract-and-structure-text-from-pdf-files-with-python-and-machine-learning
How to extract and structure text from PDF files with Python ...
Processing time depends on the PDF's complexity (e.g. multi-column layouts, tables, etc.), but it takes up to approximately one second to parse a page (with Python and scikit-learn [7]). We are currently running our pipeline on more than 20k PDF files (some of them more than 500 pages long), transformed into 420k paragraphs (a text block with a set of titles).
🌐
Readthedocs
pypdf2.readthedocs.io › en › 3.0.0 › user › extract-text.html
Extract Text from a PDF — PyPDF2 documentation
If you scan a document, the resulting PDF typically shows the image of the scan. Scanners then also run OCR software and put the recognized text in the background of the image. This output of the scanner's OCR software can be extracted by PyPDF2. However, in such cases it’s recommended to use OCR software directly, as errors can accumulate: the OCR software is not perfect at recognizing the text.
🌐
Artifex
artifex.com › blog › text-extraction-with-pymupdf
Text Extraction with PyMuPDF - Artifex Software Inc.
July 12, 2022 - The method is about three times ... pure Python packages like pdfminer or PyPDF2. If you suspect that text in your document is physically not stored in reading sequence, simply use the sort parameter of the method: page.get_text(sort=True). This will return the page’s text paragraphs arranged in the sequence “top-left to bottom-right” and should deliver satisfying results for many or most documents. You can also restrict extraction to certain ...
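The "top-left to bottom-right" ordering mentioned in this snippet can be approximated by hand on block tuples. The sketch below is only an illustration, not PyMuPDF's own implementation: `sort_reading_order` is a hypothetical helper, the sample tuples are made up, and each block is assumed to start with four coordinates `(x0, y0, x1, y1)` followed by its text.

```python
def sort_reading_order(blocks):
    """Order block tuples top-to-bottom, then left-to-right.

    Coordinates are rounded so blocks whose tops differ by a fraction
    of a point are treated as being on the same row.
    """
    return sorted(blocks, key=lambda b: (round(b[1]), round(b[0])))


# Made-up blocks stored out of reading order: (x0, y0, x1, y1, text, ...).
sample_blocks = [
    (300.0, 300.2, 500.0, 340.0, "Third block (right column).\n", 2, 0),
    (72.0, 300.0, 280.0, 340.0, "Second block (left column).\n", 1, 0),
    (72.0, 100.0, 500.0, 140.0, "First block (title).\n", 0, 0),
]

ordered_text = [b[4].strip() for b in sort_reading_order(sample_blocks)]
print(ordered_text)
```

A plain coordinate sort like this will interleave the columns of a genuinely multi-column page, so it is a starting point rather than a general solution.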
Top answer
1 of 16
323

I was looking for a simple solution for Python 3.x on Windows. textract does not seem to support it, which is unfortunate, but if you want a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

2 of 16
244

pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in when to set a new line). The core difference is that they are much faster. But they are not pure Python, which can mean that you cannot run them in your environment. And some have licenses that may be too restrictive for your use.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

  • Anything special regarding tables (just that the text is there, not about the formatting)
  • Arabic text (RTL languages)
  • Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

Please note that those packages are not maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)
🌐
Quora
quora.com › Is-there-any-way-to-extract-a-paragraph-from-a-PDF
Is there any way to extract a paragraph from a PDF? - Quora
Answer: Recently I was working on a PDF parsing tool to extract information from any PDF. After studying the PDF format, I realised that a PDF doesn't have any structure like you have in a .doc, .docx, or HTML document. All information is positional in nature, i.e. information extraction is totally based o...
🌐
Neurond
neurond.com › blog › extract-text-from-pdf-pymupdf-and-python
Extract Text From PDF Resumes Using PyMuPDF And Python
From this, we’re able to create a new column in our span dataframe for the tag information.

span_tags = [tag[score] for score in span_scores]
span_df['tag'] = span_tags

That’s it. We’re now clear on which text is a heading and which is content in the document. This is very useful when extracting information, since we want all paragraphs below a heading to be grouped.
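Once each piece of text carries a heading or content tag, grouping paragraphs under their heading is a single pass over the sequence. The following is a minimal sketch under stated assumptions: the tag values `"heading"` and `"paragraph"`, the `group_by_heading` function, and the sample data are hypothetical and not taken from the article.

```python
def group_by_heading(tagged_spans):
    """Collect paragraph texts under the most recent heading.

    tagged_spans is an iterable of (tag, text) pairs in reading order.
    Text that appears before any heading goes under "(no heading)".
    """
    sections = {}
    current = "(no heading)"
    for tag, text in tagged_spans:
        if tag == "heading":
            current = text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return sections


# Made-up tagged spans in reading order.
sample_spans = [
    ("heading", "Introduction"),
    ("paragraph", "PDFs store text by position."),
    ("paragraph", "Structure must be inferred."),
    ("heading", "Conclusion"),
    ("paragraph", "Font size is a useful heading signal."),
]

sections = group_by_heading(sample_spans)
for title, paragraphs in sections.items():
    print(title, len(paragraphs))
```

Because a dict preserves insertion order, the sections come out in document order; repeated heading titles would be merged, which may or may not be what you want.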
🌐
Unstructured
unstructured.io › blog › how-to-process-pdf-in-python
Process PDFs in Python: Step-by-Step Guide | Unstructured
Unlike CSV or JSON files, PDFs encode content as a mix of text layers, images, and layout instructions, which means the logical structure of a document (headings, tables, paragraphs) is rarely preserved in a way that's easy to parse programmatically. Tables are especially problematic, since their cell boundaries are often implied by position rather than explicit markup. Popular options include PyPDF2, pdfplumber, and pdfminer.six, each with different trade-offs in terms of accuracy, speed, and support for complex layouts. For straightforward text extraction from well-formatted PDFs, these libraries work reasonably well.
🌐
IronPDF
ironpdf.com › ironpdf for python › blog › using ironpdf for python › python extract text from pdf
Python Extract Text From PDF (Developer Tutorial) | IronPDF for Python
January 19, 2026 - In this article, we'll demonstrate how to efficiently extract text from a PDF file using IronPDF for Python.
🌐
PSPDFKit
pspdfkit.com › blog › sdk › extract text from pdf using python
Extract Text from PDF in Python: A Comprehensive Guide ...
June 4, 2025 - Parsing PDFs in Python is easy with the right tools. This tutorial walks you through extracting text from PDFs using PyPDF for basic, selectable text, and the Nutrient Processor API for more advanced use cases like OCR, encrypted documents, and structured JSON output.
🌐
Stack Overflow
stackoverflow.com › questions › 76110821 › extract-specific-text-from-pdf-using-python
Extract specific text from pdf using python - Stack Overflow
import fitz  # PyMuPDF

doc = fitz.open("test.pdf")
page = doc[0]
blocks = page.get_text("blocks")  # extract text separated by paragraphs
# a block is a tuple starting with 4 floats followed by the lines in the paragraph
for b in blocks:
    lines = b[4].splitlines()  # lines in the paragraph
    for line in lines:
        # look for lines having 'Name:' and 'Color:'
        p1 = line.find("Name:")
        if p1 < 0:
            continue
        p2 = line.find("Color:", p1)
        if p2 < 0:
            continue
        text = line[p1+5:p2]  # all text in between
        p3 = text.find(",")  # find any comma
        if p3 >= 0:  # there, shorten text accordingly
            text = text[:p3]
        # finished
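The substring logic in this answer can be tried without a PDF by isolating it in a small helper. This is a sketch, not part of the original answer: `text_between` and the sample block are hypothetical, and the final `strip()` is an addition to drop the surrounding spaces.

```python
def text_between(line, start="Name:", end="Color:"):
    """Return the text between start and end, cut at the first comma.

    Returns None when either marker is missing from the line.
    """
    p1 = line.find(start)
    if p1 < 0:
        return None
    p2 = line.find(end, p1)
    if p2 < 0:
        return None
    text = line[p1 + len(start):p2]
    comma = text.find(",")
    if comma >= 0:
        text = text[:comma]
    return text.strip()


# Made-up block shaped like a text block tuple: 4 floats, then the text.
sample_block = (72.0, 100.0, 500.0, 140.0,
                "Item 1\nName: Widget, large Color: red\n", 0, 0)

found = [
    value
    for line in sample_block[4].splitlines()
    if (value := text_between(line)) is not None
]
print(found)
```

Using `len(start)` instead of a hard-coded offset keeps the helper correct if the markers are changed.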