I was looking for a simple solution for Python 3.x on Windows. textract doesn't seem to support that combination, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.
Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.
from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])
Note that Tika is written in Java so you will need a Java runtime installed.
Answer from DJK on Stack Overflow
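One thing that tends to trip people up with the tika snippet: as far as I know, the 'content' value can be None when Tika finds no text (for example in an image-only PDF), and the text often comes back padded with blank lines. Here is a minimal sketch of a wrapper for that. The helper names content_or_empty and extract_with_tika are my own, not part of the tika package.

```python
from typing import Any, Mapping


def content_or_empty(parsed: Mapping[str, Any]) -> str:
    """Return the stripped text of a Tika result dict, or "" if there is none."""
    content = parsed.get("content")
    return content.strip() if content else ""


def extract_with_tika(path: str) -> str:
    """Parse a file with Tika and return its text (needs tika and a Java runtime)."""
    # Imported lazily so content_or_empty can be used without tika installed.
    from tika import parser  # pip install tika

    return content_or_empty(parser.from_file(path))


# Usage (hypothetical file name):
#     text = extract_with_tika("sample.pdf")
#     print(text)
```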
pypdf recently improved a lot. Depending on the data, it is on par with or better than pdfminer.six.
pymupdf, tika and PDFium are better than pypdf, but the difference has become rather small (mostly in where line breaks are placed). Their main advantage is that they are much faster. However, they are not pure Python, which can mean that you cannot run them in your environment. Some also have licenses that may be too restrictive for your use.
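If you are not sure whether the compiled library is available where your code runs, one option is to try the fast one first and fall back to pure-Python pypdf. This is just a sketch of that idea; extract_first_available and the two wrapper functions are names I made up for illustration.

```python
from typing import Callable, Iterable, Tuple

Extractor = Callable[[str], str]


def extract_pymupdf(path: str) -> str:
    import fitz  # pip install PyMuPDF (compiled extension)

    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)


def extract_pypdf(path: str) -> str:
    from pypdf import PdfReader  # pip install pypdf (pure Python)

    reader = PdfReader(path)
    return "\n".join(page.extract_text() for page in reader.pages)


def extract_first_available(
    path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Return (library name, text) from the first extractor that can be imported."""
    for name, extractor in extractors:
        try:
            return name, extractor(path)
        except ImportError:
            # Library not installed here; try the next one.
            continue
    raise RuntimeError("none of the listed PDF libraries is installed")


# Usage (hypothetical file name):
#     name, text = extract_first_available(
#         "example.pdf",
#         [("pymupdf", extract_pymupdf), ("pypdf", extract_pypdf)],
#     )
```

Only ImportError is caught on purpose: a genuine parsing error in the first library is still raised instead of being silently hidden by the fallback.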
Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:
- Anything special regarding tables (just that the text is there, not about the formatting)
- Arabic text (RTL languages)
- Mathematical formulas.
That means if your use-case requires those points, you might perceive the quality differently.
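If your documents fall into one of those categories, it may be worth running a small comparison on your own files. The sketch below is not the benchmark's actual method (I am not reproducing that here). It simply times each extractor and scores its output against a reference text you have checked by hand. compare_extractors and Result are hypothetical names.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Dict, List, NamedTuple


class Result(NamedTuple):
    name: str
    seconds: float
    similarity: float  # 0.0 (nothing in common) to 1.0 (identical to reference)


def compare_extractors(
    path: str, reference: str, extractors: Dict[str, Callable[[str], str]]
) -> List[Result]:
    """Time each extractor on one file and score its text against a reference."""
    results = []
    for name, extract in extractors.items():
        start = time.perf_counter()
        text = extract(path)
        seconds = time.perf_counter() - start
        # autojunk=False avoids a heuristic that can distort scores on long
        # texts; the comparison gets slow for very large documents.
        similarity = SequenceMatcher(None, reference, text, autojunk=False).ratio()
        results.append(Result(name, seconds, similarity))
    # Closest match to the reference first.
    return sorted(results, key=lambda r: r.similarity, reverse=True)


# Usage: pass your own functions that take a path and return text, e.g.
#     compare_extractors("example.pdf", known_good_text, {"pypdf": my_pypdf_fn})
```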
Having said that, the results from November 2022:

[Figure: benchmark results, November 2022]

pypdf
I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)
First, install it:
pip install pypdf
And then use it:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
Please note that these packages are no longer maintained:
- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)
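A common surprise with the pypdf snippet: pypdf does not do OCR, so pages that are only scanned images come back with no text. A rough sketch for spotting those pages follows. The function names extract_pages and empty_page_numbers are mine, not pypdf's.

```python
from typing import List, Sequence


def empty_page_numbers(page_texts: Sequence[str]) -> List[int]:
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]


def extract_pages(path: str) -> List[str]:
    """Return the extracted text of each page as a separate string."""
    # Imported lazily so empty_page_numbers works without pypdf installed.
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage (hypothetical file name):
#     pages = extract_pages("example.pdf")
#     print("pages without text:", empty_page_numbers(pages))
```

If most pages show up as empty, the file probably needs OCR rather than a different text-extraction library.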
pymupdf
import fitz  # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()

print(text)
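The loop above glues all pages together, so you lose track of where one page ends. If you need the page boundaries, something like this sketch could work. The marker format and the names join_with_page_markers and extract_with_markers are arbitrary choices of mine.

```python
from typing import Iterable


def join_with_page_markers(page_texts: Iterable[str]) -> str:
    """Join page texts, putting a '--- page N ---' header before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---\n{text.rstrip()}")
    return "\n\n".join(parts)


def extract_with_markers(path: str) -> str:
    """Extract text with PyMuPDF, keeping a visible marker per page."""
    # Imported lazily so join_with_page_markers works without PyMuPDF installed.
    import fitz  # pip install PyMuPDF

    with fitz.open(path) as doc:
        # The generator is fully consumed while the document is still open.
        return join_with_page_markers(page.get_text() for page in doc)


# Usage (hypothetical file name):
#     print(extract_with_markers("my.pdf"))
```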
Other PDF libraries
- pikepdf does not support text extraction (source)
Problem with extracting text from PDF in python with pyPDF2
So I am trying to extract text from a PDF, and the text that pyPDF churns out is garbage. Any ideas on how to solve this? The link to the PDF is here:
http://www.786investments.com/wp-content/uploads/2017/10/786_-Investments_Jun2017.pdf