Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
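For the chunking part of the question, one rough approach is to extract the text with any library and then split it on blank lines, grouping the resulting paragraphs under known section headings. This is only a minimal sketch: the helper names and the `headings` tuple are made up for illustration, and it assumes the extracted text keeps blank lines between paragraphs, which not every PDF or extractor guarantees.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs at blank lines."""
    parts = re.split(r"\n\s*\n", text)
    # Collapse the line breaks inside each paragraph into single spaces.
    return [" ".join(part.split()) for part in parts if part.strip()]


def group_by_heading(paragraphs, headings=("Introduction", "Conclusion")):
    """Group paragraphs under the most recent heading seen.

    `headings` is a hypothetical list of section titles; paragraphs that
    appear before the first heading are skipped.
    """
    sections = {}
    current = None
    for paragraph in paragraphs:
        if paragraph in headings:
            current = paragraph
            sections[current] = []
        elif current is not None:
            sections[current].append(paragraph)
    return sections


sample = "Introduction\n\nFirst para\nline two.\n\nSecond para.\n\nConclusion\n\nDone."
sections = group_by_heading(split_paragraphs(sample))
```

Real documents usually need a fuzzier heading match (numbering, capitalization), so treat the exact-match test here as a placeholder.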
I was looking for a simple solution for Python 3.x on Windows. textract doesn't seem to support that combination, which is unfortunate, but if you want a simple solution for Windows/Python 3, check out the tika package. It is really straightforward for reading PDFs.
Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.
from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])
Note that Tika is written in Java so you will need a Java runtime installed.
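Since a missing Java runtime is the usual stumbling block, it can help to check for it before calling Tika. The following is a sketch only: `java_available` and `extract_with_tika` are hypothetical helper names, and it assumes Tika starts its server locally with a `java` executable found on the PATH.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


def extract_with_tika(path):
    """Extract text with tika-python, failing early if Java is missing."""
    if not java_available():
        raise RuntimeError("No Java runtime found on PATH; Tika needs one.")
    from tika import parser  # pip install tika (imported lazily)

    parsed = parser.from_file(path)
    # 'content' can be None when no text was extracted.
    return parsed.get("content") or ""
```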
pypdf has improved a lot recently. Depending on the data, it is on par with or better than pdfminer.six.
pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in deciding where to insert a line break). Their main advantage is that they are much faster. However, they are not pure Python, which can mean that you cannot run them in your environment. Some also have licenses that may be too restrictive for your use.
Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:
- Anything special regarding tables (just that the text is there, not about the formatting)
- Arabic text (RTL languages)
- Mathematical formulas.
That means if your use-case requires those points, you might perceive the quality differently.
Having said that, the results from November 2022:


pypdf
I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)
First, install it:
pip install pypdf
And then use it:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    # Append each page's text, separated by a newline
    text += page.extract_text() + "\n"
Please note that those packages are not maintained:
- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)
pymupdf
import fitz  # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        # Concatenate the plain text of every page
        text += page.get_text()

print(text)
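If you need paragraph-like chunks rather than one long string, PyMuPDF can also return text per layout block via `page.get_text("blocks")`, where each block is a tuple `(x0, y0, x1, y1, text, block_no, block_type)`. A minimal sketch follows; the helper names are made up for illustration, and layout blocks only approximate paragraphs, so results depend on how the PDF was produced.

```python
def blocks_to_paragraphs(blocks):
    """Turn PyMuPDF-style block tuples into cleaned paragraph strings."""
    paragraphs = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        cleaned = " ".join(text.split())  # collapse line breaks inside a block
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


def pdf_paragraphs(path):
    """Yield paragraph-like chunks from every page of a PDF."""
    import fitz  # PyMuPDF, imported lazily: pip install PyMuPDF

    with fitz.open(path) as doc:
        for page in doc:
            yield from blocks_to_paragraphs(page.get_text("blocks"))
```

Usage would look like `for paragraph in pdf_paragraphs("my.pdf"): print(paragraph)`.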
Other PDF libraries
- pikepdf does not support text extraction (source)
Hello All,
I am currently working on a project where I have to extract data from around 8 different structured templates, which together span more than 12 million pages across 10K PDF documents.
I am using a mix of regular expressions and a bounding-box approach: 4 of these templates are regular-expression friendly, and for the rest I am using bounding boxes to extract the data. In testing, the extraction works very well. There are no images or tables, just simple labels and values.
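To make the two approaches concrete, here is a rough sketch. The `label: value` line layout, the regular expression, and both function names are assumptions for illustration and are not taken from the actual templates. The bounding-box helper assumes pdfplumber, whose `page.crop()` takes an `(x0, top, x1, bottom)` tuple.

```python
import re

# Assumed layout: one "Label: value" pair per line.
LABEL_VALUE = re.compile(
    r"^\s*(?P<label>[^:\n]+?)\s*:\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def extract_fields(page_text):
    """Regex approach: collect label/value pairs from a page's text."""
    return {m.group("label"): m.group("value") for m in LABEL_VALUE.finditer(page_text)}


def extract_region(path, page_index, bbox):
    """Bounding-box approach: text inside bbox = (x0, top, x1, bottom)."""
    import pdfplumber  # imported lazily: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # crop() may raise if the box lies outside the page bounds.
        return page.crop(bbox).extract_text() or ""


fields = extract_fields("Invoice No: 12345\nCustomer Name :  Jane Doe\n")
```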
The libraries I am currently using are PDF Plumber for data extraction and PyPDF for splitting the documents into small chunks for better memory utilization (PDF Plumber sometimes throws an error when the page count goes above 4000 pages, hence the temporary splitting into smaller chunks). However, this approach takes 5 seconds per page, which is too slow considering that I have to process 12M pages.
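For reference, the splitting step can be sketched roughly as below. This assumes the current `pypdf` package (`PdfReader` / `PdfWriter`); the function names and the output file naming are hypothetical.

```python
def page_ranges(total_pages, chunk_size):
    """Return half-open (start, stop) page index ranges of at most chunk_size pages."""
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, out_prefix):
    """Write consecutive chunks of a PDF to separate files and yield their names."""
    from pypdf import PdfReader, PdfWriter  # imported lazily: pip install pypdf

    reader = PdfReader(path)
    for start, stop in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_name = f"{out_prefix}_{start + 1}-{stop}.pdf"  # 1-based page numbers
        with open(out_name, "wb") as handle:
            writer.write(handle)
        yield out_name
```

With a 4000-page limit in mind, something like `list(split_pdf("big.pdf", 1000, "chunk"))` would keep each piece well below it.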
I did take a look at the other libraries mentioned in the link below, but I am not sure which one to choose, as I would love to work with an open-source library that has a good maintenance history and better performance.
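One way to compare candidates is to time each one on a small sample of pages and extrapolate to the full workload. The sketch below uses hypothetical helper names and assumes perfectly parallel scaling across workers, which real jobs rarely achieve.

```python
import time


def seconds_per_page(extract_page, pages):
    """Average wall-clock seconds spent by extract_page on each item in pages."""
    if not pages:
        raise ValueError("need at least one page to time")
    start = time.perf_counter()
    for page in pages:
        extract_page(page)
    return (time.perf_counter() - start) / len(pages)


def projected_days(sec_per_page, total_pages, workers=1):
    """Projected run time in days, assuming ideal scaling across workers."""
    return sec_per_page * total_pages / workers / 86_400


# At 5 s/page, 12M pages take roughly 694 days on a single worker.
single_worker_days = projected_days(5, 12_000_000)
```

Any callable that processes one page can be passed as `extract_page`, so the same harness works for pdfplumber, pypdf, PyMuPDF, and so on.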
https://github.com/py-pdf/benchmarks?tab=readme-ov-file
I would appreciate your suggestions. Thanks in advance!