best pdf reader python github - Brave Search

github.com › py-pdf › pypdf

GitHub - py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files · GitHub

    from pypdf import PdfReader
    reader = PdfReader("example.pdf")
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()

Starred by 9.9K users

Forked by 1.6K users

Languages Python

github.com › py-pdf › awesome-pdf

GitHub - py-pdf/awesome-pdf: A curated list of resources around PDF files · GitHub

KOReader: a document viewer primarily aimed at e-ink readers · react-native-pdf: a react native PDF view component · PdfViewPager: Android widget to display PDF documents in your Activities or Fragments ... pdftotext: an application that converts Portable Document Format (PDF) files to plain text. Part of poppler-utils. pdfminer.six: a Python library for extracting information from PDF documents

Starred by 154 users

Forked by 26 users

Discussions

Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange

I'm looking for well-maintained and well-documented powerful PDF parsing libraries for Python (mainly to extract and parse data from various types of PDFs with different/unpredictable structures, including with the help of reliable and powerful OCR). Currently I'm aware of the following main projects: PDFMiner: https://github...

softwarerecs.stackexchange.com

November 14, 2019

[D] Choosing a pdf processing package in Python

If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is super robust and covers everything from text extraction to page manipulation.

r/MachineLearning

15

31

January 8, 2024

Videos

How to Extract Text from PDF in Python | PDF Text Extraction Tutorial ...

Python Libraries to Extract Tables from PDFs - YouTube

PDF Parsing in Python | The non AI tutorial - YouTube

February 9, 2025

How To Read PDF Files In Python - YouTube

Extract PDF Content with Python - YouTube

August 29, 2022

github.com › pikepdf › pikepdf

GitHub - pikepdf/pikepdf: A Python library for reading and writing PDF, powered by QPDF · GitHub

February 23, 2026 - Jupyter integration -- render PDF and page previews inline in notebooks · Binary wheels everywhere -- pre-built for Linux, macOS, Windows (x86-64 and ARM64) Liberal license -- MPL-2.0, compatible with most open and closed source projects ... Python has several PDF libraries, each with different strengths.

Starred by 2.7K users

Forked by 220 users

Languages Python 77.3% | C++ 22.1%

onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257

I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium

July 21, 2025 - I am building aiwand , a simple OS Python Package which could help speed things up solving for extraction, structured data and providers switching. Check it out on github. ... https://bella.amankumar.ai/examples/pdf/11-google-doc-document/google-doc-document.pdf (Test Python File)

github.com › maxpmaxp › pdfreader

GitHub - maxpmaxp/pdfreader: Python API for PDF documents

Python API for PDF documents. Contribute to maxpmaxp/pdfreader development by creating an account on GitHub.

Starred by 124 users

Forked by 28 users

Languages Python 100.0%

github.com › pmaupin › pdfrw

GitHub - pmaupin/pdfrw: pdfrw is a pure Python library that reads and writes PDFs · GitHub

It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files. In comparison, pdfrw knows less about specific PDF file features (such as metadata), but focuses on trying to have a more Pythonic API for mapping the PDF file container syntax to Python, and (IMO) has a simpler and better PDF file parser.

Starred by 1.9K users

Forked by 277 users

Languages Python 71.7% | Jupyter Notebook 28.3%

github.com › ashutoshvarma › pyxpdf

GitHub - ashutoshvarma/pyxpdf: Fast and memory-efficient Python PDF Parser based on xpdf sources

pyxpdf is a fast and memory-efficient Python module for parsing PDF documents, based on xpdf reader sources. Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison). Extract text ...

Starred by 44 users

Forked by 17 users

Languages Cython 69.7% | Python 22.8% | C++ 5.1% | Makefile 1.3% | Shell 1.1%

reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?

r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?

July 19, 2024

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!

In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables.

LlamaParse: use it. It's super cheap and has a free version for up to 3000 pages. Best in the world.
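The question above asks for paragraph-level chunks grouped by section (Introduction, Conclusion, and so on). A minimal sketch of that post-processing step follows, assuming the text has already been extracted by one of the libraries named in the replies and that paragraphs are separated by blank lines. The heading list and the function names are hypothetical; real PDFs often need layout-aware extraction before a split like this works well.

```python
import re

# Hypothetical heading list; real documents may use different section names.
SECTION_HEADINGS = ("abstract", "introduction", "methods", "results",
                    "discussion", "conclusion")


def split_paragraphs(text):
    """Split on blank lines and collapse line breaks inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def chunk_by_section(text, headings=SECTION_HEADINGS):
    """Group paragraphs under the most recent known heading."""
    sections = {}
    current = "preamble"  # paragraphs that appear before any known heading
    for para in split_paragraphs(text):
        # Drop leading numbering such as "2. " before comparing with headings.
        key = re.sub(r"^[\d.\s]+", "", para).strip().lower()
        if key in headings:
            current = key
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(para)
    return sections
```

Calling `chunk_by_section(extracted_text)["introduction"]` would then give the introduction's paragraphs as a list of strings, provided the heading appears on its own line in the extracted text.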

github.com › py-pdf › pdf

GitHub - py-pdf/pdf: A modern pure-Python library for reading PDF files · GitHub

A modern pure-Python library for reading PDF files - py-pdf/pdf

Starred by 12 users

Forked by 4 users

Languages Python 95.2% | Makefile 4.8%

github.com › topics › pdf-reader

pdf-reader · GitHub Topics · GitHub

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents · ocr tesseract text-extraction tesseract-ocr pdf-to-text poppler optical-character-recognition pdf-reader pdftotext ...

github.com › topics › pdf-document

pdf-document · GitHub Topics · GitHub

Multiple and Large PDF Documents Text Extraction. python pdf parser data-science pdf-document text-analytics pdfs pypdf2 extract-text pdfminer pdf-processing pdfs-textextract

github.com › topics › pdf-viewer

pdf-viewer · GitHub Topics · GitHub

A PDF component Python library for integrating with the ComPDF API to build powerful PDF viewer and editor features.

github.com › topics › python-pdf

python-pdf · GitHub Topics · GitHub

A professional, modular, and open-source Python command-line tool to extract data from PDFs — including plain text, tables, images, and OCR content — using best-in-class libraries like PyMuPDF, pdfplumber, and pytesseract.

github.com › jstockwin › py-pdf-parser

GitHub - jstockwin/py-pdf-parser: A Python tool to help extracting information from structured PDFs. · GitHub

A Python tool to help extracting information from structured PDFs. - jstockwin/py-pdf-parser

Starred by 429 users

Forked by 49 users

Languages Python 99.5% | Dockerfile 0.5%

github.com › Agnik7 › PDF-Reader

GitHub - Agnik7/PDF-Reader: A simple PDF Reader made using Python

This is a Python-based application that reads you your PDF file. Say goodbye to traditional PDF reading and straining your eyes. Bring your documents to life using this interactive reader.

Author Agnik7

github.com › euske › pdfminer

GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six. · GitHub

April 15, 2024 - Python PDF Parser (Not actively maintained). Check out pdfminer.six. - euske/pdfminer

Starred by 5.3K users

Forked by 1.1K users

Languages Python 99.6% | Makefile 0.4%

pypi.org › project › pdfreader

pdfreader · PyPI

See GitHub for the latest source. ... Nevertheless it can be used as a part of such tools. See Tutorials & Documentation. Extracts texts (plain text and formatted text objects) Extract PDF forms data (pure strings and formatted text objects)

      pip install pdfreader

Published May 03, 2024

Version 0.1.15

Homepage http://github.com/maxpmaxp/pdfreader

github.com › topics › pdf-parser

pdf-parser · GitHub Topics · GitHub

Scanipy stands for "scan it with Python"—it's your smart Python library for scanning and parsing complex PDF files like books, reports, articles, and academic papers. Utilizing cutting-edge Deep Learning algorithms, Scanipy transforms your PDFs into a treasure trove of extractable information: tables, images, equations, and text. ... file-upload api-rest authentification pdf-reader ...

softwarerecs.stackexchange.com › questions › 70780 › python-pdf-parsing-any-modern-powerful-well-maintained-open-source-librarie

Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange

In the end I went with OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF), which uses Tesseract for the actual OCR part (https://github.com/tesseract-ocr/tesseract). As I understand it, Tesseract is an OCR tool that has been open-sourced by Google.

OCRmyPDF has great documentation, also works from the command line and has many language packs:

ocrmypdf -l eng pdf_to_ocr.pdf new_pdf_with_ocr.pdf

In order to extract text from the PDF, the best tool I found is pdftotext (https://github.com/jalan/pdftotext), which is a Python wrapper for Poppler (https://poppler.freedesktop.org/). I am getting very satisfying results with this tool, far better than PyPDF2.

Update: Here are some top-of-the-line PDF readers & writers for Python:

PyMuPDF
PikePDF

Be sure to check these out. Although for text extraction, I must say I still prefer pdftotext for basic usage as it nicely preserves layout order using spaces.
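The two-step workflow in this answer (OCR the PDF, then extract its text) can be scripted. The sketch below is one possible way to do it, not the answerer's own code. It assumes the `ocrmypdf` and Poppler `pdftotext` command-line tools are installed and on PATH, and it calls the Poppler CLI directly rather than the Python wrapper linked above. The function names are hypothetical.

```python
import shutil
import subprocess


def ocr_command(src, dst, language="eng"):
    """Build the OCRmyPDF call in the same form as the command quoted above."""
    return ["ocrmypdf", "-l", language, src, dst]


def text_command(src_pdf, dst_txt):
    """Build a Poppler pdftotext call; -layout keeps the physical layout."""
    return ["pdftotext", "-layout", src_pdf, dst_txt]


def ocr_then_extract(src, ocr_pdf, dst_txt, language="eng"):
    """Run OCR, then text extraction; raise if a tool is missing or fails."""
    steps = (ocr_command(src, ocr_pdf, language), text_command(ocr_pdf, dst_txt))
    for cmd in steps:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not on PATH")
        subprocess.run(cmd, check=True)
```

Keeping the command builders separate from the runner makes it possible to inspect or log the exact commands before anything is executed.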

I'm now the maintainer of pypdf and PyPDF2. I merged everything back into pypdf. pypdf is the way to go. PyPDF2 will be deprecated.

Original answer:

PyPDF2 has been maintained again since April 2022. We made massive improvements in text extraction and added type annotations. The docs were improved, and the interface is now more Pythonic.

Internally, we deprecated Python 3.5 and lower + added a lot of unit tests. This simplifies the development / maintenance.

PyPDF2 is free and open source.

PyPDF2 is a pure-python library without any dependencies.
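Since this answer recommends pypdf, here is a small sketch of whole-document text extraction with it. It only uses `PdfReader`, `reader.pages` and `page.extract_text()`, which also appear in the pypdf snippet at the top of this page. The helper names are hypothetical, and the import is done lazily so the page-handling logic can be exercised without pypdf installed.

```python
def extract_pages(reader):
    """Return one stripped string per page of a reader-like object.

    The `or ""` guards against an empty or missing result, for example
    on image-only pages that have no text layer.
    """
    return [(page.extract_text() or "").strip() for page in reader.pages]


def read_pdf_text(path):
    """Open a PDF with pypdf and return a list of per-page texts."""
    # Lazy import: pypdf is only needed when a real file is opened.
    from pypdf import PdfReader
    return extract_pages(PdfReader(path))
```

A call such as `"\n\n".join(read_pdf_text("example.pdf"))` would produce a single string with pages separated by blank lines; scanned PDFs without a text layer will come back as empty strings and need OCR first.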

reddit.com › r/python › my python based selfhosted pdf manager, viewer and editor reached 600 stars on github

r/Python on Reddit: My python based selfhosted PDF manager, viewer and editor reached 600 stars on github

February 6, 2025

Hi r/Python,

I am the developer of PdfDing - a selfhosted PDF manager, viewer and editor offering a seamless user experience on multiple devices. You can find the repo here.

Today I reached a big milestone as PdfDing reached over 600 stars on github. A good portion of these stars probably comes from being included in the favorite selfhosted apps launched in 2024 on selfh.st.

What My Project Does

PdfDing is a selfhosted PDF manager, viewer and editor. Here is a quick overview of the project’s features:

Seamless browser based PDF viewing on multiple devices. Remembers current position - continue where you stopped reading
Stay on top of your PDF collection with multi-level tagging, starring and archiving functionalities
Edit PDFs by adding annotations, highlighting and drawings
Clean, intuitive UI with dark mode, inverted color mode and custom theme colors
SSO support via OIDC
Share PDFs with an external audience via a link or a QR Code with optional access control
Markdown Notes
Progress bars show the reading progress of each PDF at a quick glance

PdfDing relies heavily on Django, the Python-based web framework. Beyond that, the tech stack includes Tailwind CSS, htmx, Alpine.js and pdf.js.

Target Audience

Homelabs
Small businesses
Everyone who wants to read PDFs in style :)

Comparison

PdfDing is all about reading and organizing your PDFs while being simple and intuitive. All features are added with the goal of improving the reading experience or making the management of your PDF collection simpler.
Other solutions were either too resource hungry, did not allow reading PDFs in the browser on mobile devices (they download the files instead) or did not allow individual users to upload files.

Conclusion

As always I am happy if you star the repo or if someone wants to contribute.

Been looking for something like this for so long. Thanks a lot for making this.

Does it have a full-fledged PostScript interpreter internally?