In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. Answer from ImGallo on reddit.com
Medium
onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257
I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium
July 21, 2025 - pdfplumber (0.10s): Good for tables, text extraction needs configuration · Important caveat: These results reflect basic usage with minimal configuration. Each library has advanced features that could significantly change performance for specific use cases. You can find the link to all results in the references. Context matters more than raw performance. The “best” extractor depends entirely on what you’re building and how you’ll use the extracted text.
Reddit
reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?
r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?
July 19, 2024 -

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
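A minimal sketch of one way to get paragraph chunks (not from the thread; the pypdf usage in the comment and the file name are assumptions). Caveat: many extractors do not emit blank lines between paragraphs, so a layout-aware tool such as pdfplumber may be needed for reliable chunking.

```python
import re

def chunk_paragraphs(text: str) -> list[str]:
    """Split extracted text into paragraph chunks on blank lines."""
    # split on one or more blank (possibly whitespace-only) lines
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]

# Hypothetical usage with pypdf (pip install pypdf):
#   from pypdf import PdfReader
#   reader = PdfReader("paper.pdf")
#   full_text = "\n".join(page.extract_text() for page in reader.pages)
#   chunks = chunk_paragraphs(full_text)
```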

Discussions

How to extract text from a PDF file via python? - Stack Overflow
I'm extracting this PDF's text using the PyPDF2 Python package (version 1.27.2): import PyPDF2 with open("sample.pdf", "rb") as pdf_file: read_pdf = PyPDF2.PdfFileReader(pd... More on stackoverflow.com
stackoverflow.com
PDF Extraction with python wrappers
Hi everyone, I need some advice. Some people recommended that I use Python wrappers (Poppler's pdftotext) to extract data from this PDF file, from page 4 to the end or a known limit (here: page 605). But I have never used Poppler's pdftotext before and need some help, please. More on discuss.python.org
discuss.python.org
December 5, 2023
Best PDF library for extracting text from structured templates
I have encountered some similar challenges and have solved them by utilizing pdfplumber. I have parsed through a document of 100k pages with no issues. The extract_text method unfortunately leaks memory, which is why you get an error with large documents, but the objects generated by extract_words are correctly garbage collected, allowing you to process large documents without any problems. More on reddit.com
r/Python
December 2, 2024
[D] Choosing a pdf processing package in Python
If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is super robust and covers everything from text extraction to page manipulation. More on reddit.com
r/MachineLearning
January 8, 2024
GeeksforGeeks
geeksforgeeks.org › python › extract-text-from-pdf-file-using-python
Extract text from PDF File using Python - GeeksforGeeks
July 12, 2025 - The Python package pypdf can be used to achieve what we want (text extraction), although it can do more than we need. This package can also be used to generate, decrypt, and merge PDF files.
Ploomber
ploomber.io › blog › pdf-ocr
Python OCR libraries for converting PDFs into editable text
March 27, 2024 - Tesseract is an open-source OCR Engine that extracts printed or written text from images. It was originally developed by Hewlett-Packard, and development was later taken over by Google. ... Community Support: Tesseract has a large and active community of developers and contributors who continuously work on improving the engine, fixing bugs, and adding new features. First, we need to install Tesseract. ... Next, we’ll first convert the PDF pages to PIL objects and then extract text from these objects using pytesseract’s image_to_string method:
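The pipeline the snippet describes (render PDF pages to PIL images with pdf2image, then OCR each with pytesseract's image_to_string) can be sketched as below; the file path is a placeholder, and both libraries plus the Tesseract and Poppler binaries must be installed.

```python
def tidy(text: str) -> str:
    """Drop blank lines and trailing whitespace that OCR output often contains."""
    return "\n".join(line.rstrip() for line in text.splitlines() if line.strip())

def ocr_pdf(path: str, dpi: int = 300) -> str:
    """Render each PDF page to a PIL image, then OCR it with Tesseract."""
    # third-party deps: pip install pdf2image pytesseract
    # (the tesseract and poppler binaries must also be on PATH)
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return tidy("\n".join(pytesseract.image_to_string(img) for img in pages))
```

A higher dpi generally improves recognition accuracy at the cost of memory and speed.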
Medium
pradeepundefned.medium.com › a-comparison-of-python-libraries-for-pdf-data-extraction-for-text-images-and-tables-c75e5dbcfef8
A Comparison of python libraries for PDF Data Extraction for text, images and tables | by Pradeep Bansal | Medium
June 9, 2023 - Each Python library has its strengths and focuses on different aspects of PDF data extraction. If you primarily require text extraction, pdfminer.six is the best choice as it strives to preserve the original formatting of the text, including carriage return and newline characters, as closely as possible.
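For the text-extraction case the snippet recommends, pdfminer.six exposes a one-call high-level API; a minimal sketch (the wrapper name is invented):

```python
def extract_with_pdfminer(path: str, pages=None) -> str:
    """Layout-aware text extraction that keeps pdfminer.six's newlines."""
    # third-party dep: pip install pdfminer.six
    from pdfminer.high_level import extract_text

    # page_numbers is zero-based, e.g. pages={0, 1} for the first two pages
    return extract_text(path, page_numbers=pages)
```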
Nutrient
nutrient.io › blog › cloud › extract text from pdf pymupdf
How to extract text from a PDF using PyMuPDF and Python
October 9, 2025 - PyMuPDF vs. Nutrient for PDF text extraction — performance comparison, code examples, and migration guide for Python developers.
Top answer
1 of 16

I was looking for a simple solution to use for Python 3.x and Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika REST services, allowing Tika to be called natively from Python.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

2 of 16

pypdf recently improved a lot. Depending on the data, it is on par with or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in deciding when to insert a new line). The core difference is that they are far faster. However, they are not pure Python, which can mean you cannot run them in some environments, and some have licenses restrictive enough that you may not be able to use them.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

  • Anything special regarding tables (just that the text is there, not about the formatting)
  • Arabic text (RTL languages)
  • Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

Please note that those packages are not maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)
GitHub
github.com › opendatalab › PDF-Extract-Kit
GitHub - opendatalab/PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction · GitHub
conda create -n pdf-extract-kit-1.0 python=3.10
conda activate pdf-extract-kit-1.0
pip install -r requirements.txt
Metric Coders
metriccoders.com › post › a-guide-to-pdf-extraction-libraries-in-python
A Guide to PDF Extraction Libraries in Python - Metric Coders
January 24, 2025 - PyPDF2 is one of the most popular libraries for working with PDFs in Python. It’s lightweight and provides basic functionalities for reading and writing PDF files. ... Extract text from PDF pages.
Nutrient
nutrient.io › blog › sdk › extract text from pdf using python
Parse PDFs with Python: Step-by-step text extraction tutorial
June 4, 2025 - In this tutorial, you’ll learn how to parse PDF files in Python using: The open source PyPDF(opens in a new tab) library for quick and simple tasks. The Nutrient Processor API for advanced, reliable, and structured text extraction — including OCR and support for encrypted or scanned documents.
freeCodeCamp
freecodecamp.org › news › extract-data-from-pdf-files-with-python
How to Extract Data from PDF Files with Python
March 6, 2023 - Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing. Python's PDFQuery is a potent tool for extracting data from PDF files.
Medium
dheerajnbhat.medium.com › best-pdf-text-extractor-in-python-909f960e4bb
Best PDF text extractor in Python | by Dheeraj Bhat | Medium
February 3, 2023 - There are other libraries like PyPDF2, pikepdf, pdfminer.six, pdfplumber, etc., to name a few, but this benchmark clearly shows that PyMuPDF outperforms all of them in terms of speed and performance (as of writing this blog). So, if you are looking for an open-source library for text extraction purposes, PyMuPDF is the go-to library for your tasks!
Reddit
reddit.com › r/python › best pdf library for extracting text from structured templates
r/Python on Reddit: Best PDF library for extracting text from structured templates
December 2, 2024 -

Hello All,

I am currently working on a project where I have to extract data from around 8 different structured templates which together span 12 million+ pages across 10K PDF documents.

I am using a mix of regular expressions and a bounding-box approach, whereby 4 of these templates are regular-expression friendly and for the rest I am using bounding boxes to extract the data. In testing, the extraction works very well. There are no images or tables, just simple labels and values.

The library that I am currently using is pdfplumber for data extraction and pypdf for splitting the documents into small chunks for better memory utilization (pdfplumber sometimes throws an error when the page count goes above 4000, hence splitting them into smaller chunks temporarily). However, this approach is taking 5 seconds per page, which is a bit too much considering that I have to process 12M pages.

I did take a look at the other libraries mentioned in the link below, but I am not sure which one to choose, as I would love to work with an open-source library that has a good maintenance history and better performance.
https://github.com/py-pdf/benchmarks?tab=readme-ov-file

I'd appreciate your suggestions. Thanks in advance!
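The splitting step the poster describes, carving a large PDF into smaller chunks with pypdf before handing each chunk to pdfplumber, could look like the sketch below; the chunk size and output naming are invented for illustration.

```python
def page_ranges(total_pages: int, chunk_size: int):
    """Yield (start, stop) zero-based page index ranges covering the document."""
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)

def split_pdf(path: str, chunk_size: int = 1000):
    """Write each chunk of pages to its own file with pypdf."""
    # third-party dep: pip install pypdf
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    for i, (start, stop) in enumerate(page_ranges(len(reader.pages), chunk_size)):
        writer = PdfWriter()
        for page in reader.pages[start:stop]:
            writer.add_page(page)
        with open(f"{path}.part{i}.pdf", "wb") as fh:
            writer.write(fh)
```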

Top answer
1 of 5
I have encountered some similar challenges and have solved them by utilizing pdfplumber. I have parsed through a document of 100k pages with no issues. Here's how I did it. I see that you are using the regex module, which suggests to me that you have to iterate the text line by line:

text_lines = page.extract_text()
for line in text_lines.split("\n"):
    if re.match(pat, line):
        ...

The extract_text method unfortunately leaks memory, and that's the reason why you get an error with large documents (see GitHub for discussion). However, I noticed that the objects generated by extract_words are correctly garbage collected, allowing you to process large documents without any problems. Using extract_words has the added benefit that you can often get rid of regular expressions: it allows you to pinpoint specific parts of the document using the x0, x1, top, and bottom coordinates. Note: this code is untested and solely for illustrative purposes.

import re
from operator import itemgetter

import pdfplumber

pat = re.compile(r"...")  # your pattern here
page = pdfplumber.open("your.pdf").pages[0]

# crop the page if needed; the box is (x0, top, x1, bottom)
bounding_box = (0, 600, 100, 700)
page = page.crop(bounding_box)

settings = {"x_tolerance": 2, "y_tolerance": 2}
# a list of dictionaries, one per word, carrying extra information,
# e.g. the x0, x1, top, bottom coordinates
words = page.extract_words(**settings)

# split the document into "lines": words whose top coordinates
# differ by no more than 1.6 are clustered together
lines_by_top_coordinate = pdfplumber.utils.cluster_objects(
    words, itemgetter("top"), 1.6)

# iterate line by line
for idx, line in enumerate(lines_by_top_coordinate):
    # build the text line, allowing regexes if needed
    text_line = " ".join(w["text"] for w in line)
    if re.match(pat, text_line):
        ...

    # or get the text based on coordinates, no regexes needed
    text_line = " ".join(
        w["text"] for w in line if w["x0"] > 400 and w["x1"] < 500)

    # or, if the document has information located below an
    # indicator such as a header, take the next clustered line
    if idx + 1 < len(lines_by_top_coordinate):
        text_line = " ".join(
            w["text"] for w in lines_by_top_coordinate[idx + 1])

Granted, this method is not the fastest, but it allows you to be more precise with the data extraction and solves the large-document problem you are facing. Hope this helps! Let me know if you need further assistance.
2 of 5
Are you doing all the processing in sync? If so maybe something like celery could help you doing stuff in parallel.
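The parallelism suggested above does not require celery; a lighter-weight sketch of the same idea using the standard library's concurrent.futures, assuming a per-file extraction worker you would supply yourself:

```python
from concurrent.futures import ProcessPoolExecutor

def chunked(items, size):
    """Split a list of work items into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_parallel(pdf_paths, worker, max_workers=8):
    """Fan a CPU-bound per-file worker out over a process pool."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the order of the inputs in its results
        return list(pool.map(worker, pdf_paths))
```

Since PDF parsing is CPU-bound, processes (not threads) are the right pool type here; the worker must be a picklable top-level function.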
Medium
medium.com › analytics-vidhya › python-packages-for-pdf-data-extraction-d14ec30f0ad0
Python Packages for PDF Data Extraction | by Rucha Sawarkar | Analytics Vidhya | Medium
February 11, 2024 - Below is the list of packages I have used for extracting text from PDF files. ... We will go through each package in detail along with Python code.
Unstract
unstract.com › home › product › evaluating python libraries for converting pdf to text — a 2026 comparison and evaluation guide
Best Python PDF to Text Parser Libraries: A 2026 Evaluation
December 18, 2025 - Let's compare how PyPDF and PyMuPDF handle PDF to text extraction, and see how LLMWhisperer offers improvements over these traditional libraries.
Towards Data Science
towardsdatascience.com › home › latest › extracting text from pdf files with python: a comprehensive guide
Extracting text from PDF files with Python: A comprehensive guide | Towards Data Science
January 27, 2025 - Alternatively, you can include the Tesseract path directly in the Python script:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Lastly, we will import all the libraries at the beginning of our script:

# To read the PDF
import PyPDF2
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in the PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
# To perform OCR to extract text from images
import pytesseract
# To remove the additional created files
import os
Javatpoint
javatpoint.com › python-libraries-for-pdf-extraction
Python Libraries for PDF Extraction - Javatpoint
An overview tutorial of Python libraries for PDF extraction, with setup instructions and example programs.