What’s the Best Python Library for Extracting Text from PDFs?
Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange
[D] Choosing a pdf processing package in Python
Best OCR out there?
I'm not an OCR expert either, but I've used Tesseract a few times and it's quite impressive. Of course the OCR will not be 100% perfect, but if your input is a good-quality picture and the handwriting is ~ok, you should have something to work with.
Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
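One way to approach the paragraph chunking asked about here, once the text has already been extracted by whichever library you choose, is to split on blank lines and group the paragraphs under the heading that precedes them. The following is only a minimal sketch: the function name `chunk_by_section` and the `SECTION_HEADINGS` list are hypothetical, and it assumes headings sit alone on their own line and paragraphs are separated by blank lines, which will not hold for every PDF.

```python
import re

# Hypothetical section names; adjust to the headings your documents use.
SECTION_HEADINGS = ("Abstract", "Introduction", "Methods", "Results", "Conclusion")


def chunk_by_section(text, headings=SECTION_HEADINGS):
    """Group blank-line-separated paragraphs under the heading preceding them."""
    known = {h.lower() for h in headings}
    sections = {}
    current = "Preamble"  # paragraphs that appear before any known heading
    for block in re.split(r"\n\s*\n", text):
        # Collapse line wraps inside a paragraph into single spaces.
        paragraph = " ".join(block.split())
        if not paragraph:
            continue
        if paragraph.lower() in known:
            # A known heading starts a new section.
            current = paragraph
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(paragraph)
    return sections
```

Calling `chunk_by_section(text)["Introduction"]` would then give the Introduction's paragraphs as a list of strings, provided the heading was detected.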
In the end I went for OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF), which uses Tesseract for the actual OCR part (https://github.com/tesseract-ocr/tesseract). As I understand it, Tesseract is an OCR tool that has been open-sourced by Google.
OCRmyPDF has great documentation, also works from the command line and has many language packs:
ocrmypdf -l eng pdf_to_ocr.pdf new_pdf_with_ocr.pdf
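If you want to drive the command above from a Python script, a thin wrapper around `subprocess` is probably enough. This is a sketch, not OCRmyPDF's own Python API: the helper names `build_ocrmypdf_command` and `run_ocr` are made up for illustration, and it assumes the `ocrmypdf` executable is on your PATH.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argument list for the ocrmypdf call shown above."""
    return ["ocrmypdf", "-l", language, str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Run OCRmyPDF; raises CalledProcessError if it exits with an error."""
    command = build_ocrmypdf_command(input_pdf, output_pdf, language)
    subprocess.run(command, check=True)
```

For example, `run_ocr("pdf_to_ocr.pdf", "new_pdf_with_ocr.pdf")` runs the same command as the one-liner.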
In order to extract text from the PDF, the best tool I found is pdftotext (https://github.com/jalan/pdftotext), which is a Python wrapper for Poppler (https://poppler.freedesktop.org/). I am getting very satisfying results with this tool, far better than PyPDF2.
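For reference, basic usage of the pdftotext wrapper looks roughly like the sketch below. The function name `extract_text_with_pdftotext` is my own; it assumes the `pdftotext` package and the Poppler libraries are installed, and it relies on the `PDF` object being iterable page by page.

```python
def extract_text_with_pdftotext(path):
    """Return the text of every page, with pages separated by a blank line."""
    import pdftotext  # third-party: pip install pdftotext (requires Poppler)

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)  # iterating yields one string per page
        return "\n\n".join(pdf)
```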
Update: Here are some top-of-the-line PDF readers & writers for Python:
- PyMuPDF
- pikepdf
Be sure to check these out. Although for text extraction, I must say I still prefer pdftotext for basic usage as it nicely preserves layout order using spaces.
I'm now the maintainer of pypdf and PyPDF2. I merged everything back into pypdf. pypdf is the way to go. PyPDF2 will be deprecated.
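A minimal text-extraction sketch with pypdf, for anyone who wants to try it. The function name `extract_text_with_pypdf` is just for illustration, and it assumes `pypdf` is installed; the `or ""` guard is a defensive choice of mine for pages that yield no text.

```python
def extract_text_with_pypdf(path):
    """Return the text of all pages, one page per line break."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Fall back to an empty string for pages with no extractable text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```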
Original answer:
PyPDF2 has been maintained again since April 2022. We made massive improvements in text extraction and added type annotations. The docs were improved, and the interface is now more Pythonic.
Internally, we deprecated Python 3.5 and lower and added a lot of unit tests, which simplifies development and maintenance.
PyPDF2 is free and open source.
PyPDF2 is a pure-python library without any dependencies.
pip install pdfreader