What’s the Best Python Library for Extracting Text from PDFs?
Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange
My python based selfhosted PDF manager, viewer and editor reached 600 stars on github
[D] Choosing a pdf processing package in Python
Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
» pip install pdfreader
In the end I went with OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF), which uses Tesseract for the actual OCR (https://github.com/tesseract-ocr/tesseract). As I understand it, Tesseract is an open-source OCR engine whose development was sponsored by Google for many years.
OCRmyPDF has great documentation, also works from the command line and has many language packs:
ocrmypdf -l eng pdf_to_ocr.pdf new_pdf_with_ocr.pdf
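If you would rather trigger that same command from a Python script, a minimal sketch could look like the following. It assumes the `ocrmypdf` executable is installed and on your PATH; the helper names `build_ocr_command` and `run_ocr` are made up for illustration and are not part of OCRmyPDF.

```python
import subprocess
from typing import List


def build_ocr_command(src: str, dst: str, lang: str = "eng") -> List[str]:
    """Build the argument list for the shell command shown above."""
    return ["ocrmypdf", "-l", lang, src, dst]


def run_ocr(src: str, dst: str, lang: str = "eng") -> None:
    """Run ocrmypdf; assumes the executable is available on PATH."""
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst, lang), check=True)
```

Calling `run_ocr("pdf_to_ocr.pdf", "new_pdf_with_ocr.pdf")` should be equivalent to the command line above.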
In order to extract text from the PDF, the best tool I found is pdftotext (https://github.com/jalan/pdftotext), which is a Python wrapper for Poppler (https://poppler.freedesktop.org/). I am getting very satisfying results with this tool, far better than PyPDF2.
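For the paragraph-chunking part of the question, here is a rough sketch of how pdftotext could be combined with a blank-line split. The helper names `read_pdf_text` and `split_paragraphs` are hypothetical, and the sketch assumes paragraphs end up separated by blank lines in the extracted text, which depends heavily on how the PDF was produced.

```python
import re
from typing import List


def read_pdf_text(pdf_path: str) -> str:
    """Read all pages of a PDF into one string using pdftotext."""
    # Imported lazily so split_paragraphs works without pdftotext installed.
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    # The PDF object iterates over page strings; join them with a blank line.
    return "\n\n".join(pdf)


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and collapse whitespace inside each chunk."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

Something like `split_paragraphs(read_pdf_text("paper.pdf"))` would then give a list of paragraph-like chunks. Grouping them under headings such as Introduction or Conclusion would need extra logic on top, for example matching short chunks against known section titles.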
Update: Here are some top-of-the-line PDF readers & writers for Python:
- PyMuPDF
- PikePDF
Be sure to check these out. Although for text extraction, I must say I still prefer pdftotext for basic usage as it nicely preserves layout order using spaces.
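As a hedged illustration of the PyMuPDF route: `page.get_text("blocks")` returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`, and text blocks often line up roughly with paragraphs. The helper names below are made up for this sketch, and how well blocks match real paragraphs varies from document to document.

```python
from typing import List, Sequence, Tuple


def text_blocks(blocks: Sequence[Tuple]) -> List[str]:
    """Keep only text blocks (block_type == 0) and return their cleaned text."""
    result = []
    for block in blocks:
        text, block_type = block[4], block[6]
        if block_type == 0 and text.strip():
            # Collapse line breaks inside a block into single spaces.
            result.append(" ".join(text.split()))
    return result


def extract_blocks(pdf_path: str) -> List[str]:
    """Collect the text blocks of every page, in page order."""
    # Imported lazily so text_blocks can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    paragraphs: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            paragraphs.extend(text_blocks(page.get_text("blocks")))
    return paragraphs
```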
I'm now the maintainer of pypdf and PyPDF2. I merged everything back into pypdf. pypdf is the way to go. PyPDF2 will be deprecated.
Original answer:
PyPDF2 has been maintained again since April 2022. We made massive improvements in text extraction and added type annotations. The docs were improved, and the interface is now more Pythonic.
Internally, we dropped support for Python 3.5 and lower and added a lot of unit tests, which simplifies development and maintenance.
PyPDF2 is free and open source.
PyPDF2 is a pure-python library without any dependencies.
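For reference, basic text extraction with pypdf (the successor mentioned in the update above) can be sketched roughly as follows. `PdfReader`, `reader.pages` and `page.extract_text()` are part of pypdf; the wrapper names `extract_page_texts` and `join_pages` are only illustrative.

```python
from typing import Iterable, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extracted text of each page, in page order."""
    # Imported lazily so join_pages can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages may yield an empty string; fall back to "" to be safe.
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(page_texts: Iterable[str]) -> str:
    """Join page texts with blank lines, skipping pages without text."""
    return "\n\n".join(t.strip() for t in page_texts if t.strip())
```

Scanned PDFs without a text layer will come back empty here, so they would need an OCR pass first.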
Hi r/Python,
I am the developer of PdfDing - a selfhosted PDF manager, viewer and editor offering a seamless user experience on multiple devices. You can find the repo here.
Today I hit a big milestone: PdfDing reached over 600 stars on GitHub. A good portion of these stars probably comes from being included in the favorite selfhosted apps launched in 2024 on selfh.st.
What My Project Does
PdfDing is a selfhosted PDF manager, viewer and editor. Here is a quick overview of the project’s features:
- Seamless browser based PDF viewing on multiple devices. Remembers current position - continue where you stopped reading
- Stay on top of your PDF collection with multi-level tagging, starring and archiving functionalities
- Edit PDFs by adding annotations, highlighting and drawings
- Clean, intuitive UI with dark mode, inverted color mode and custom theme colors
- SSO support via OIDC
- Share PDFs with an external audience via a link or a QR Code with optional access control
- Markdown Notes
- Progress bars show the reading progress of each PDF at a quick glance
PdfDing relies heavily on Django, the Python based web framework. Beyond that, the tech stack includes Tailwind CSS, htmx, Alpine.js and PDF.js.
Target Audience
- Homelabs
- Small businesses
- Everyone who wants to read PDFs in style :)
Comparison
- PdfDing is all about reading and organizing your PDFs while being simple and intuitive. All features are added with the goal of improving the reading experience or making the management of your PDF collection simpler.
- Other solutions were either too resource hungry, did not allow reading PDFs in the browser on mobile devices (they download the files instead) or did not allow individual users to upload files.
Conclusion
As always I am happy if you star the repo or if someone wants to contribute.