🌐
GitHub
github.com › py-pdf › pypdf
GitHub - py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files · GitHub
from pypdf import PdfReader
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
Starred by 9.9K users
Forked by 1.6K users
Languages   Python
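The pypdf snippet in this result reads only the first page. As a minimal sketch, assuming pypdf is installed (`pip install pypdf`), the same calls can be looped over every page. The helper names `extract_all_text` and `read_pdf_text` are hypothetical and not part of pypdf; the first accepts any object that exposes `.pages` whose items have an `extract_text()` method.

```python
def extract_all_text(reader):
    """Return one string per page from a PdfReader-like object.

    Illustrative helper, not a pypdf function.
    """
    # Image-only pages may produce an empty string; `or ""` also guards
    # against reader-like objects whose extract_text() returns None.
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_text(path):
    """Open `path` with pypdf and return the text of every page."""
    # Imported lazily so extract_all_text stays usable without pypdf installed.
    from pypdf import PdfReader

    return extract_all_text(PdfReader(path))
```

Calling `read_pdf_text("example.pdf")` (a placeholder path) would return a list with one entry per page. Scanned PDFs generally need OCR before any text comes back.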
🌐
GitHub
github.com › py-pdf › awesome-pdf
GitHub - py-pdf/awesome-pdf: A curated list of resources around PDF files · GitHub
KOReader: a document viewer primarily aimed at e-ink readers · react-native-pdf: a react native PDF view component · PdfViewPager: Android widget to display PDF documents in your Activities or Fragments ... pdftotext: an application that converts Portable Document Format (PDF) files to plain text. Part of poppler-utils. pdfminer.six: a Python library for extracting information from PDF documents
Starred by 154 users
Forked by 26 users
Discussions

What’s the Best Python Library for Extracting Text from PDFs?
In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. More on reddit.com
🌐 r/LangChain
85
81
July 19, 2024
Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange
I'm looking for well-maintained and well-documented powerful PDF parsing libraries for Python (mainly to extract and parse data from various types of PDFs with different/unpredictable structures, including with the help of reliable and powerful OCR). Currently I'm aware of the following main projects: PDFMiner: https://github... More on softwarerecs.stackexchange.com
🌐 softwarerecs.stackexchange.com
November 14, 2019
My python based selfhosted PDF manager, viewer and editor reached 600 stars on github
Been looking for something like this for so long. Thanks a lot for making this. More on reddit.com
🌐 r/Python
8
188
February 6, 2025
[D] Choosing a pdf processing package in Python
If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is super robust and covers everything from text extraction to page manipulation. More on reddit.com
🌐 r/MachineLearning
15
31
January 8, 2024
🌐
GitHub
github.com › pikepdf › pikepdf
GitHub - pikepdf/pikepdf: A Python library for reading and writing PDF, powered by QPDF · GitHub
February 23, 2026 - Jupyter integration -- render PDF and page previews inline in notebooks · Binary wheels everywhere -- pre-built for Linux, macOS, Windows (x86-64 and ARM64) · Liberal license -- MPL-2.0, compatible with most open and closed source projects ... Python has several PDF libraries, each with different strengths.
Starred by 2.7K users
Forked by 220 users
Languages   Python 77.3% | C++ 22.1%
🌐
Medium
onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257
I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium
July 21, 2025 - I am building aiwand, a simple OS Python package that could help speed things up for extraction, structured data and provider switching. Check it out on GitHub. ... https://bella.amankumar.ai/examples/pdf/11-google-doc-document/google-doc-document.pdf (Test Python File)
🌐
GitHub
github.com › maxpmaxp › pdfreader
GitHub - maxpmaxp/pdfreader: Python API for PDF documents
Python API for PDF documents. Contribute to maxpmaxp/pdfreader development by creating an account on GitHub.
Starred by 124 users
Forked by 28 users
Languages   Python 100.0%
🌐
GitHub
github.com › pmaupin › pdfrw
GitHub - pmaupin/pdfrw: pdfrw is a pure Python library that reads and writes PDFs · GitHub
It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files. In comparison, pdfrw knows less about specific PDF file features (such as metadata), but focuses on trying to have a more Pythonic API for mapping the PDF file container syntax to Python, and (IMO) has a simpler and better PDF file parser.
Starred by 1.9K users
Forked by 277 users
Languages   Python 71.7% | Jupyter Notebook 28.3%
🌐
GitHub
github.com › ashutoshvarma › pyxpdf
GitHub - ashutoshvarma/pyxpdf: Fast and memory-efficient Python PDF Parser based on xpdf sources
pyxpdf is a fast and memory efficient python module for parsing PDF documents based on xpdf reader sources. docs · tests · package · license · Almost x20 times faster than pure python based pdf parsers (see Speed Comparison) Extract text ...
Starred by 44 users
Forked by 17 users
Languages   Cython 69.7% | Python 22.8% | C++ 5.1% | Makefile 1.3% | Shell 1.1%
🌐
Reddit
reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?
r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?
July 19, 2024

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
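
One naive way to approach the paragraph chunking asked about here, sketched under the assumption that the extracted text already separates paragraphs with blank lines (many PDF extractors do not guarantee this). `split_paragraphs` is a hypothetical helper built on the standard library only, not an API of any library listed on this page.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines.

    Hypothetical helper: assumes paragraphs are separated by one or more
    blank lines, which real PDF extraction output may not provide.
    """
    # A paragraph break is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    # Re-join lines wrapped inside a paragraph and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

Grouping the resulting paragraphs under headings such as Introduction or Conclusion would need an extra step, for example layout or font-size information from the extractor, which this sketch does not attempt.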

🌐
GitHub
github.com › py-pdf › pdf
GitHub - py-pdf/pdf: A modern pure-Python library for reading PDF files · GitHub
A modern pure-Python library for reading PDF files - py-pdf/pdf
Starred by 12 users
Forked by 4 users
Languages   Python 95.2% | Makefile 4.8%
🌐
GitHub
github.com › topics › pdf-reader
pdf-reader · GitHub Topics · GitHub
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents · ocr tesseract text-extraction tesseract-ocr pdf-to-text poppler optical-character-recognition pdf-reader pdftotext ...
🌐
GitHub
github.com › topics › pdf-document
pdf-document · GitHub Topics · GitHub
Multiple and Large PDF Documents Text Extraction. python pdf parser data-science pdf-document text-analytics pdfs pypdf2 extract-text pdfminer pdf-processing pdfs-textextract
🌐
GitHub
github.com › topics › pdf-viewer
pdf-viewer · GitHub Topics · GitHub
A PDF component Python library for integrating with the ComPDF API to build powerful PDF viewer and editor features.
🌐
GitHub
github.com › topics › python-pdf
python-pdf · GitHub Topics · GitHub
A professional, modular, and open-source Python command-line tool to extract data from PDFs — including plain text, tables, images, and OCR content — using best-in-class libraries like PyMuPDF, pdfplumber, and pytesseract.
🌐
GitHub
github.com › jstockwin › py-pdf-parser
GitHub - jstockwin/py-pdf-parser: A Python tool to help extracting information from structured PDFs. · GitHub
A Python tool to help extracting information from structured PDFs. - jstockwin/py-pdf-parser
Starred by 429 users
Forked by 49 users
Languages   Python 99.5% | Dockerfile 0.5%
🌐
GitHub
github.com › Agnik7 › PDF-Reader
GitHub - Agnik7/PDF-Reader: A simple PDF Reader made using Python
This is a Python-based application that reads you your PDF file. Say goodbye to traditional PDF reading and straining your eyes. Bring your documents to life using this interactive reader.
Author   Agnik7
🌐
GitHub
github.com › euske › pdfminer
GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six. · GitHub
April 15, 2024 - Python PDF Parser (Not actively maintained). Check out pdfminer.six. - euske/pdfminer
Starred by 5.3K users
Forked by 1.1K users
Languages   Python 99.6% | Makefile 0.4%
🌐
PyPI
pypi.org › project › pdfreader
pdfreader · PyPI
See GitHub for the latest source. ... Nevertheless it can be used as a part of such tools. See Tutorials & Documentation. Extracts texts (plain text and formatted text objects) Extract PDF forms data (pure strings and formatted text objects)
pip install pdfreader
Published   May 03, 2024
Version   0.1.15
🌐
GitHub
github.com › topics › pdf-parser
pdf-parser · GitHub Topics · GitHub
Scanipy stands for "scan it with Python"—it's your smart Python library for scanning and parsing complex PDF files like books, reports, articles, and academic papers. Utilizing cutting-edge Deep Learning algorithms, Scanipy transforms your PDFs into a treasure trove of extractable information: tables, images, equations, and text. ... file-upload api-rest authentification pdf-reader ...
🌐
Reddit
reddit.com › r/python › my python based selfhosted pdf manager, viewer and editor reached 600 stars on github
r/Python on Reddit: My python based selfhosted PDF manager, viewer and editor reached 600 stars on github
February 6, 2025

Hi r/Python,

I am the developer of PdfDing - a selfhosted PDF manager, viewer and editor offering a seamless user experience on multiple devices. You can find the repo here.

Today I reached a big milestone as PdfDing passed 600 stars on GitHub. A good portion of these stars probably comes from being included in the favorite selfhosted apps launched in 2024 on selfh.st.

What My Project Does

PdfDing is a selfhosted PDF manager, viewer and editor. Here is a quick overview of the project’s features:

  • Seamless browser-based PDF viewing on multiple devices. Remembers your current position, so you can continue where you stopped reading

  • Stay on top of your PDF collection with multi-level tagging, starring and archiving functionalities

  • Edit PDFs by adding annotations, highlights and drawings

  • Clean, intuitive UI with dark mode, inverted color mode and custom theme colors

  • SSO support via OIDC

  • Share PDFs with an external audience via a link or a QR Code with optional access control

  • Markdown Notes

  • Progress bars show the reading progress of each PDF at a quick glance

PdfDing relies heavily on Django, the Python-based web framework. Beyond that, the tech stack includes Tailwind CSS, htmx, Alpine.js and PDF.js.

Target Audience

  • Homelabs

  • Small businesses

  • Everyone who wants to read PDFs in style :)

Comparison

  • PdfDing is all about reading and organizing your PDFs while being simple and intuitive. All features are added with the goal of improving the reading experience or making the management of your PDF collection simpler.

  • Other solutions were either too resource-hungry, did not allow reading PDFs in the browser on mobile devices (they download the files instead), or did not allow individual users to upload files.

Conclusion

As always, I am happy if you star the repo or want to contribute.