best pdf text extractor python

What’s the Best Python Library for Extracting Text from PDFs?

reddit.com › r › LangChain › comments › 1e7cntq › whats_the_best_python_library_for_extracting_text

In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. Answer from ImGallo on reddit.com

Medium

onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257

I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium

July 21, 2025 - pdfplumber (0.10s): Good for tables, text extraction needs configuration · Important caveat: These results reflect basic usage with minimal configuration. Each library has advanced features that could significantly change performance for specific use cases. You can find the link to all results in the references. Context matters more than raw performance. The “best” extractor depends entirely on what you’re building and how you’ll use the extracted text.


r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?

July 19, 2024 -

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!

Top answer

1 of 27

38

In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables.

2 of 27

12

Use LlamaParse. It is very cheap and has a free version for up to 3,000 pages. Best in the world.
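For the paragraph-chunking need described in the question, here is a minimal sketch. It assumes the text has already been extracted by one of the libraries mentioned and that the extractor left a blank line between paragraphs, which many PDFs will not satisfy, so treat it as a starting point. The function name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split already-extracted text into paragraphs on blank lines.

    Assumption: the extractor emitted an empty line between paragraphs.
    Hard line wraps inside a paragraph are joined with single spaces.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse wrapped lines and repeated whitespace into one line.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

Section headings such as "Introduction" come out as their own short paragraphs, which gives a simple hook for grouping the paragraphs that follow them.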

Discussions

How to extract text from a PDF file via python? - Stack Overflow

I'm extracting this PDF's text using the PyPDF2 Python package (version 1.27.2): import PyPDF2 with open("sample.pdf", "rb") as pdf_file: read_pdf = PyPDF2.PdfFileReader(pd... More on stackoverflow.com

stackoverflow.com

PDF Extraction with python wrappers

Hi everyone, I need some advice. Some people recommended that I use Python wrappers (Poppler's pdftotext) to extract data from this PDF file, from page 4 to the end or a known limit (here: page 605). But I have never used Poppler's pdftotext before and need some help, please. More on discuss.python.org

discuss.python.org

19

0

December 5, 2023


[D] Choosing a pdf processing package in Python

If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is robust and covers everything from text extraction to page manipulation. More on reddit.com

r/MachineLearning

15

31

January 8, 2024


GeeksforGeeks

geeksforgeeks.org › python › extract-text-from-pdf-file-using-python

Extract text from PDF File using Python - GeeksforGeeks

July 12, 2025 - The Python package pypdf can be used to achieve what we want (text extraction), although it can do more than what we need. This package can also be used to generate, decrypt, and merge PDF files.

Ploomber

ploomber.io › blog › pdf-ocr

Python OCR libraries for converting PDFs into editable text

March 27, 2024 - Tesseract is an open-source OCR Engine that extracts printed or written text from images. It was originally developed by Hewlett-Packard, and development was later taken over by Google. ... Community Support: Tesseract has a large and active community of developers and contributors who continuously work on improving the engine, fixing bugs, and adding new features. First, we need to install Tesseract. ... Next, we’ll first convert the PDF pages to PIL objects and then extract text from these objects using pytesseract’s image_to_string method:
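The two steps the snippet describes (convert PDF pages to PIL images, then run `image_to_string` on each) could look roughly like the sketch below. It assumes the `pdf2image` and `pytesseract` packages are installed along with the Poppler and Tesseract binaries they wrap. The function `ocr_pdf` and its `convert` and `ocr` parameters are hypothetical names added for illustration, so that either step can be swapped out.

```python
from typing import Callable, Iterable, Optional


def ocr_pdf(
    pdf_path: str,
    convert: Optional[Callable[[str], Iterable]] = None,
    ocr: Optional[Callable[[object], str]] = None,
) -> str:
    """Render each PDF page to an image and run OCR on it.

    Defaults to pdf2image.convert_from_path and
    pytesseract.image_to_string; both are imported lazily so the
    function can also be called with substitutes.
    """
    if convert is None:
        from pdf2image import convert_from_path  # needs Poppler installed

        convert = convert_from_path
    if ocr is None:
        import pytesseract  # needs the Tesseract binary installed

        ocr = pytesseract.image_to_string
    # convert() returns one image per page; OCR them in page order.
    return "\n".join(ocr(image) for image in convert(pdf_path))
```

With both libraries installed, `ocr_pdf("scan.pdf")` would return the recognized text of all pages joined by newlines; the file name here is only a placeholder.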

Medium

pradeepundefned.medium.com › a-comparison-of-python-libraries-for-pdf-data-extraction-for-text-images-and-tables-c75e5dbcfef8

A Comparison of python libraries for PDF Data Extraction for text, images and tables | by Pradeep Bansal | Medium

June 9, 2023 - Each Python library has its strengths and focuses on different aspects of PDF data extraction. If you primarily require text extraction, pdfminer.six is the best choice as it strives to preserve the original formatting of the text, including carriage return and newline characters, as closely as possible.

Nutrient

nutrient.io › blog › cloud › extract text from pdf pymupdf

How to extract text from a PDF using PyMuPDF and Python

October 9, 2025 - PyMuPDF vs. Nutrient for PDF text extraction — performance comparison, code examples, and migration guide for Python developers.

Stack Overflow

stackoverflow.com › questions › 34837707 › how-to-extract-text-from-a-pdf-file-via-python

How to extract text from a PDF file via python? - Stack Overflow

Top answer

1 of 16

323

I was looking for a simple solution to use with Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

2 of 16

244

pypdf recently improved a lot. Depending on the data, it is on par with or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in when to set a new line). The main point is that they are much faster. But they are not pure Python, which can mean that you cannot run them in your environment. And some might have licenses too restrictive for you to use them.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

Anything special regarding tables (just that the text is there, not about the formatting)
Arabic text (RTL languages)
Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

Please note that those packages are not maintained:

PyPDF2, PyPDF3, PyPDF4
pdfminer (without .six)

pymupdf

import fitz # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)

Other PDF libraries

pikepdf does not support text extraction (source)
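The trade-off described in this answer (PyMuPDF is faster but not pure Python, while pypdf runs anywhere Python does) suggests a simple fallback pattern. The sketch below is one way to arrange it, not something the answer itself proposes. The library calls are the ones shown in the answer; the function names `extract_text`, `_with_pymupdf`, and `_with_pypdf` are hypothetical.

```python
def _with_pymupdf(path):
    import fitz  # PyMuPDF; raises ImportError if it is not installed

    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)


def _with_pypdf(path):
    from pypdf import PdfReader  # pure Python; raises ImportError if missing

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(path, backends=None):
    """Try each backend in order, skipping those that are not installed."""
    if backends is None:
        backends = (_with_pymupdf, _with_pypdf)
    for backend in backends:
        try:
            return backend(path)
        except ImportError:
            continue  # this library is unavailable; try the next one
    raise RuntimeError("no PDF text-extraction backend is installed")
```

Licensing still has to be checked separately; this only handles the case where a faster library cannot be installed.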


PyPDF

pypdf.readthedocs.io › en › stable › user › extract-text.html

Extract Text from a PDF — pypdf 6.9.2 documentation

You can extract text from a PDF: · Refer to extract_text() for more details

Python.org

discuss.python.org › python help

PDF Extraction with python wrappers - Python Help - Discussions on Python.org

Top answer

1 of 15

2

Michael Duarte Gonçalves: After discussing with some people, they suggested the following: Extract all XML from PDFs and later convert them into .csv files. The site already seems to give you the XML files, right? So you do not need to create those XMLs from the PDF files and there…

2 of 15

0

I don’t understand what help you are looking for. Did you get stuck somewhere? What have you tried doing so far, and what problem did you encounter?

GitHub

github.com › opendatalab › PDF-Extract-Kit

GitHub - opendatalab/PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction · GitHub

conda create -n pdf-extract-kit-1.0 python=3.10
conda activate pdf-extract-kit-1.0
pip install -r requirements.txt

Starred by 9.6K users

Forked by 720 users

Languages Python

Metric Coders

metriccoders.com › post › a-guide-to-pdf-extraction-libraries-in-python

A Guide to PDF Extraction Libraries in Python - Metric Coders

January 24, 2025 - PyPDF2 is one of the most popular libraries for working with PDFs in Python. It’s lightweight and provides basic functionalities for reading and writing PDF files. ... Extract text from PDF pages.

Nutrient

nutrient.io › blog › sdk › extract text from pdf using python

Parse PDFs with Python: Step-by-step text extraction tutorial

June 4, 2025 - In this tutorial, you’ll learn how to parse PDF files in Python using: The open source PyPDF library for quick and simple tasks. The Nutrient Processor API for advanced, reliable, and structured text extraction, including OCR and support for encrypted or scanned documents.

freeCodeCamp

freecodecamp.org › news › extract-data-from-pdf-files-with-python

How to Extract Data from PDF Files with Python

March 6, 2023 - Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing. Python's PDFQuery is a potent tool for extracting data from PDF files.

Medium

dheerajnbhat.medium.com › best-pdf-text-extractor-in-python-909f960e4bb

Best PDF text extractor in Python | by Dheeraj Bhat | Medium

February 3, 2023 - There are other libraries, such as PyPDF2, pikepdf, pdfminer.six, and pdfplumber, to name a few, but this benchmark clearly shows that PyMuPDF outperforms all of these in terms of speed and performance (as of writing this blog). So, if you are looking for an open-source library for text extraction, PyMuPDF is the go-to library for your tasks!

reddit.com › r/python › best pdf library for extracting text from structured templates

r/Python on Reddit: Best PDF library for extracting text from structured templates

December 2, 2024 -

Hello All,

I am currently working on a project where I have to extract data from around 8 different structured templates, which together span 12 million+ pages across 10K PDF documents.

I am using a mix of regular expressions and a bounding-box approach: 4 of these templates are regex-friendly, and for the rest I use bounding boxes to extract the data. In testing, the extraction works very well. There are no images or tables, just simple labels and values.

The library I am currently using is pdfplumber for data extraction, with pypdf for splitting the documents into small chunks for better memory utilization (pdfplumber sometimes throws an error when the page count goes above 4,000 pages, hence splitting them into smaller chunks temporarily). However, this approach takes 5 seconds per page, which is too slow considering that I have to process 12M pages.

I looked at the other libraries mentioned in the link below, but I am not sure which one to choose, as I would love to work with an open-source library that has a good maintenance history and better performance.
https://github.com/py-pdf/benchmarks?tab=readme-ov-file

I would appreciate your suggestions. Thanks in advance!
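The chunk-splitting step described in the question can be sketched as follows. This is an illustration under assumptions, not the poster's actual code: it assumes pypdf is used for the split, as the post says, and the names `page_ranges` and `split_pdf`, the default chunk size, and the output naming scheme are all hypothetical.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[range]:
    """Return consecutive page-index ranges of at most chunk_size pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        range(start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src_path: str, out_prefix: str, chunk_size: int = 1000) -> list[str]:
    """Write src_path as several smaller PDFs and return their paths."""
    from pypdf import PdfReader, PdfWriter  # lazy import; pip install pypdf

    reader = PdfReader(src_path)
    out_paths = []
    for n, pages in enumerate(page_ranges(len(reader.pages), chunk_size)):
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        out_path = f"{out_prefix}_{n:04d}.pdf"  # hypothetical naming scheme
        with open(out_path, "wb") as fh:
            writer.write(fh)
        out_paths.append(out_path)
    return out_paths
```

Keeping the range arithmetic in its own function makes the chunk boundaries easy to check without touching any PDF.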

Top answer

1 of 5

16

I have encountered challenges similar to yours and solved them with pdfplumber. I have parsed a document of 100k pages with no issues. Here's how I did it. I see that you are using the regex module, which suggests to me that you have to iterate over the text line by line.

text_lines = page.extract_text()
for line in text_lines.split("\n"):
    if re.match(pat, line): ...

The extract_text method unfortunately leaks memory, and that's the reason you get an error with large documents (see GitHub for discussion). However, I noticed that the objects generated by extract_words are correctly garbage collected, allowing you to process large documents without any problems. Using extract_words has the added benefit that you can often get rid of regular expressions. extract_words lets you pinpoint specific parts of the document using the x0, x1, top, and bottom coordinates. Note: this code is untested and solely for illustrative purposes.

import re
from operator import itemgetter
import pdfplumber

# page.crop if needed; the box is (x0, top, x1, bottom).
# Placeholder values: bottom must be greater than top for a real crop.
bounding_box = (0, 700, 100, 700)
page = page.crop(bounding_box)

settings = {"x_tolerance": 2, "y_tolerance": 2}
# settings are keyword arguments, so unpack the dict
words = page.extract_words(**settings)

# a list of lists of dictionaries that represents "new lines", essentially
# simulating the above example. Contains more information, e.g. coordinates.
# split document into lines that differ by 1.6 from each other based on top coordinate
lines_by_top_coordinate = pdfplumber.utils.cluster_objects(words, itemgetter("top"), 1.6)

# iterate line by line
for idx, line in enumerate(lines_by_top_coordinate):
    # build the text_line for possible regex if needed
    text_line = " ".join(w["text"] for w in line)
    # allows regexes (pat is your own compiled pattern)
    if re.match(pat, text_line): ...
    # or get the text based on coordinates, no regexes needed
    text_line = " ".join(w["text"] for w in line if w["x0"] > 400 and w["x1"] < 500)
    # or if the document has information that is located
    # below an indicator, e.g. a header, read the next line
    if idx + 1 < len(lines_by_top_coordinate):
        text_line = " ".join(w["text"] for w in lines_by_top_coordinate[idx + 1])

Granted, this method is not the fastest, but it allows you to be more precise with the data extraction and solves the large-document problem you are facing. Hope this helps! Let me know if you need further assistance. Edit: typos and fixes.
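To make the line-grouping idea in this answer concrete without needing pdfplumber, here is a pure-Python sketch that clusters word dictionaries by their `top` coordinate. It is an illustrative re-implementation of the idea, not pdfplumber's own `cluster_objects` code. The word dictionaries are assumed to carry the `text`, `x0`, `x1`, and `top` keys the answer relies on, and the names `group_words_into_lines` and `line_text` are hypothetical.

```python
def group_words_into_lines(words, tolerance=1.6):
    """Group word dicts into lines.

    A word joins the current line when its 'top' is within `tolerance`
    of the previously added word's 'top'.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and word["top"] - lines[-1][-1]["top"] <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order the words left to right within each line.
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def line_text(line, x_min=float("-inf"), x_max=float("inf")):
    """Join the words of a line, optionally keeping only a horizontal band."""
    return " ".join(
        w["text"] for w in line if w["x0"] > x_min and w["x1"] < x_max
    )
```

Calling `line_text(line, x_min=400, x_max=500)` mirrors the coordinate filter in the answer: it keeps only the words that fall inside a known column, which is how a value next to a label can be read without a regular expression.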

2 of 5

3

Are you doing all the processing synchronously? If so, something like Celery could help you do the work in parallel.
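The reply suggests Celery. As a lighter-weight illustration of the same idea, the sketch below uses the standard library's `concurrent.futures` instead; this is a swapped-in technique, not what the commenter proposed. `process_all` and `worker` are hypothetical names, where `worker` stands for whatever function extracts data from one chunk file.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def process_all(paths, worker, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Run worker(path) for every path in parallel and return {path: result}.

    With the default process pool, worker must be a module-level function
    so that it can be pickled. Pass executor_cls=ThreadPoolExecutor to use
    threads instead.
    """
    paths = list(paths)
    with executor_cls(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(worker, paths)))
```

PDF parsing is mostly CPU-bound, which is why processes are the default here. On platforms that start worker processes by spawning, the call should sit under an `if __name__ == "__main__":` guard.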

Medium

medium.com › analytics-vidhya › python-packages-for-pdf-data-extraction-d14ec30f0ad0

Python Packages for PDF Data Extraction | by Rucha Sawarkar | Analytics Vidhya | Medium

February 11, 2024 - Below is the list of packages I have used for extracting text from PDF files. ... We will go through each package in detail along with python code.

Unstract

unstract.com › home › product › evaluating python libraries for converting pdf to text — a 2026 comparison and evaluation guide

Best Python PDF to Text Parser Libraries: A 2026 Evaluation

December 18, 2025 - Let's compare how PyPDF and PyMuPDF handle PDF to text extraction, and see how LLMWhisperer offers improvements over these traditional libraries.

Towards Data Science

towardsdatascience.com › home › latest › extracting text from pdf files with python: a comprehensive guide

Extracting text from PDF files with Python: A comprehensive guide | Towards Data Science

January 27, 2025 - Alternatively, you can include their paths directly in the Python script using the following code:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

...

Lastly, we will import all the libraries at the beginning of our script.

# To read the PDF
import PyPDF2
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
# To perform OCR to extract text from images
import pytesseract
# To remove the additional created files
import os

Javatpoint

javatpoint.com › python-libraries-for-pdf-extraction

Python Libraries for PDF Extraction - Javatpoint


GitHub

github.com › mhadeli › Python-Text-Extraction

GitHub - mhadeli/Python-Text-Extraction: Extract text from multiple PDFs via Python script. · GitHub

Extract text from multiple PDFs via Python script. - mhadeli/Python-Text-Extraction

Author mhadeli