best pdf reader python github - Brave Search

github.com › py-pdf › pypdf

GitHub - py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files · GitHub

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

Starred by 9.9K users

Forked by 1.6K users

Languages Python
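
The snippet above reads only the first page. A minimal sketch of extending it to every page follows, assuming pypdf is installed. The helper names `extract_all_text` and `read_pdf_text` are illustrative and not part of pypdf.

```python
from typing import Iterable, List


def extract_all_text(pages: Iterable) -> List[str]:
    """Return the extracted text of every page, one string per page."""
    texts = []
    for page in pages:
        # Fall back to an empty string if a page yields no text
        # (for example, a scanned image-only page).
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path: str) -> str:
    """Open a PDF with pypdf and join the text of all pages."""
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n\n".join(extract_all_text(reader.pages))
```

Calling `read_pdf_text("example.pdf")` would return the whole document as one string, with pages separated by blank lines.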

github.com › py-pdf › awesome-pdf

GitHub - py-pdf/awesome-pdf: A curated list of resources around PDF files · GitHub

KOReader: a document viewer primarily aimed at e-ink readers · react-native-pdf: a react native PDF view component · PdfViewPager: Android widget to display PDF documents in your Activities or Fragments ... pdftotext: an application that converts Portable Document Format (PDF) files to plain text. Part of poppler-utils. pdfminer.six: a Python library for extracting information from PDF documents

Starred by 154 users

Forked by 21 users

Discussions

What’s the Best Python Library for Extracting Text from PDFs?

In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. More on reddit.com

r/LangChain

85

81

July 19, 2024

Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange

I'm looking for well-maintained and well-documented powerful PDF parsing libraries for Python (mainly to extract and parse data from various types of PDFs with different/unpredictable structures, including with the help of reliable and powerful OCR). Currently I'm aware of the following main projects: PDFMiner: https://github... More on softwarerecs.stackexchange.com

softwarerecs.stackexchange.com

November 14, 2019

[D] Choosing a pdf processing package in Python

If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is super robust and covers everything from text extraction to page manipulation. More on reddit.com

r/MachineLearning

15

31

January 8, 2024

Best OCR out there?

I'm not an OCR expert either, but I've used Tesseract a few times and it's quite impressive. Of course the OCR will not be 100% perfect, but if your input is a good-quality picture and the handwriting is ~ok, you should have something to work with.

More on reddit.com

r/LanguageTechnology

9

11

April 11, 2018

Videos

How to Extract Text from PDF in Python | PDF Text Extraction Tutorial ...

Python Libraries to Extract Tables from PDFs - YouTube

PDF Parsing in Python | The non AI tutorial - YouTube

February 9, 2025

How To Read PDF Files In Python - YouTube

Extract PDF Content with Python - YouTube

August 29, 2022

github.com › pikepdf › pikepdf

GitHub - pikepdf/pikepdf: A Python library for reading and writing PDF, powered by QPDF · GitHub

February 23, 2026 - Jupyter integration -- render PDF and page previews inline in notebooks · Binary wheels everywhere -- pre-built for Linux, macOS, Windows (x86-64 and ARM64) · Liberal license -- MPL-2.0, compatible with most open and closed source projects ... Python has several PDF libraries, each with different strengths.

Starred by 2.7K users

Forked by 219 users

Languages Python 77.3% | C++ 22.1%

github.com › maxpmaxp › pdfreader

GitHub - maxpmaxp/pdfreader: Python API for PDF documents

Python API for PDF documents.

Starred by 124 users

Forked by 28 users

Languages Python 100.0%

github.com › pmaupin › pdfrw

GitHub - pmaupin/pdfrw: pdfrw is a pure Python library that reads and writes PDFs · GitHub

It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files. In comparison, pdfrw knows less about specific PDF file features (such as metadata), but focuses on trying to have a more Pythonic API for mapping the PDF file container syntax to Python, and (IMO) has a simpler and better PDF file parser.

Starred by 1.9K users

Forked by 277 users

Languages Python 71.7% | Jupyter Notebook 28.3%

onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257

I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium

July 21, 2025 - I am building aiwand, a simple open-source Python package that could help speed up extraction, structured data, and provider switching. Check it out on GitHub. ... https://bella.amankumar.ai/examples/pdf/11-google-doc-document/google-doc-document.pdf (Test Python File)

github.com › py-pdf › pdf

GitHub - py-pdf/pdf: A modern pure-Python library for reading PDF files · GitHub

A modern pure-Python library for reading PDF files - py-pdf/pdf

Starred by 12 users

Forked by 4 users

Languages Python 95.2% | Makefile 4.8%

github.com › ashutoshvarma › pyxpdf

GitHub - ashutoshvarma/pyxpdf: Fast and memory-efficient Python PDF Parser based on xpdf sources

pyxpdf is a fast and memory-efficient Python module for parsing PDF documents, based on xpdf reader sources. Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison) · Extract text ...

Starred by 44 users

Forked by 17 users

Languages Cython 69.7% | Python 22.8% | C++ 5.1% | Makefile 1.3% | Shell 1.1%

reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?

r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?

July 19, 2024

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!

In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables.

Use LlamaParse: it is very cheap and has a free version for up to 3000 pages. Best in the world.
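
The question above asks for paragraph-level chunks grouped by section. Once any of the libraries mentioned has produced plain text, a rough sketch of that post-processing step could look like the following. It assumes paragraphs are separated by blank lines and that each section heading sits on its own line, which is often not true of real PDF output. The function names and the heading list are hypothetical.

```python
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and collapse line breaks inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


def group_by_heading(paragraphs: List[str], headings: List[str]) -> Dict[str, List[str]]:
    """Group paragraphs under the most recent matching heading.

    `headings` is a caller-supplied list such as ["Introduction", "Conclusion"];
    matching is case-insensitive and exact, which is a simplifying assumption.
    """
    wanted = {h.lower(): h for h in headings}
    sections: Dict[str, List[str]] = {}
    current = None
    for para in paragraphs:
        key = para.lower()
        if key in wanted:
            # Start a new section when a paragraph is exactly a known heading.
            current = wanted[key]
            sections[current] = []
        elif current is not None:
            sections[current].append(para)
    return sections
```

Text that appears before the first recognized heading is dropped in this sketch; a real pipeline would probably keep it under a default key.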

github.com › topics › pdf-reader

pdf-reader · GitHub Topics · GitHub

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents · ocr tesseract text-extraction tesseract-ocr pdf-to-text poppler optical-character-recognition pdf-reader pdftotext ...

github.com › topics › pdf-document

pdf-document · GitHub Topics · GitHub

Multiple and Large PDF Documents Text Extraction. python pdf parser data-science pdf-document text-analytics pdfs pypdf2 extract-text pdfminer pdf-processing pdfs-textextract

github.com › topics › pdf-viewer

pdf-viewer · GitHub Topics · GitHub

A PDF component Python library for integrating with the ComPDF API to build powerful PDF viewer and editor features.

github.com › topics › python-pdf

python-pdf · GitHub Topics · GitHub

A professional, modular, and open-source Python command-line tool to extract data from PDFs — including plain text, tables, images, and OCR content — using best-in-class libraries like PyMuPDF, pdfplumber, and pytesseract.

softwarerecs.stackexchange.com › questions › 70780 › python-pdf-parsing-any-modern-powerful-well-maintained-open-source-librarie

Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange

Finally I went for OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF), which uses Tesseract for the actual OCR part (https://github.com/tesseract-ocr/tesseract). As I understand it, Tesseract is an OCR tool that has been open-sourced by Google.

OCRmyPDF has great documentation, also works from the command line and has many language packs:

ocrmypdf -l eng pdf_to_ocr.pdf new_pdf_with_ocr.pdf

In order to extract text from the PDF, the best tool I found is pdftotext (https://github.com/jalan/pdftotext), which is a Python wrapper for Poppler (https://poppler.freedesktop.org/). I am getting very satisfying results with this tool, far better than PyPDF2.

Update: Here are some top-of-the-line PDF readers & writers for Python:

PyMuPDF
PikePDF

Be sure to check these out. Although for text extraction, I must say I still prefer pdftotext for basic usage as it nicely preserves layout order using spaces.
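
The OCR-then-extract workflow described in this answer can be sketched from Python as follows. This is only an illustration: it assumes the `ocrmypdf` executable is on the PATH and that the `pdftotext` Python package (with Poppler) is installed. The helper names `build_ocr_command` and `ocr_then_extract` are made up for this example.

```python
import subprocess
from typing import List


def build_ocr_command(src: str, dst: str, language: str = "eng") -> List[str]:
    """Build the same ocrmypdf invocation as the command line quoted above."""
    return ["ocrmypdf", "-l", language, src, dst]


def ocr_then_extract(src: str, dst: str, language: str = "eng") -> str:
    """Run OCR on src, write the result to dst, then return the text of dst."""
    # Imported lazily so build_ocr_command works without pdftotext installed.
    import pdftotext

    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst, language), check=True)
    with open(dst, "rb") as handle:
        pdf = pdftotext.PDF(handle)
        # A pdftotext.PDF object iterates over pages as strings.
        return "\n\n".join(pdf)
```

For example, `ocr_then_extract("pdf_to_ocr.pdf", "new_pdf_with_ocr.pdf")` would mirror the command shown earlier and then return the recognized text.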

I'm now the maintainer of pypdf and PyPDF2. I merged everything back into pypdf. pypdf is the way to go. PyPDF2 will be deprecated.

Original answer:

PyPDF2 has been maintained again since April 2022. We made massive improvements in text extraction and added type annotations. The docs were improved, and the interface is now more pythonic.

Internally, we deprecated Python 3.5 and lower and added a lot of unit tests. This simplifies development and maintenance.

PyPDF2 is free and open source.

PyPDF2 is a pure-python library without any dependencies.
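
Since the maintainer recommends pypdf over PyPDF2, code that must run in environments with either package could prefer pypdf and fall back to PyPDF2. The following is a hedged sketch of that idea; `load_pdf_reader_class` is a hypothetical helper, and very old PyPDF2 releases may not expose a `PdfReader` class at all.

```python
import importlib


def load_pdf_reader_class():
    """Return PdfReader from pypdf if installed, else from PyPDF2, else None."""
    for module_name in ("pypdf", "PyPDF2"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package not installed; try the next candidate.
            continue
        # Older releases may lack PdfReader, in which case None is returned.
        return getattr(module, "PdfReader", None)
    return None
```

A caller would check the result for `None` before constructing a reader, for instance `reader_cls("example.pdf")` when a class was found.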

github.com › euske › pdfminer

GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six. · GitHub

April 15, 2024 - Python PDF Parser (Not actively maintained). Check out pdfminer.six. - euske/pdfminer

Starred by 5.3K users

Forked by 1.1K users

Languages Python 99.6% | Makefile 0.4%

github.com › Agnik7 › PDF-Reader

GitHub - Agnik7/PDF-Reader: A simple PDF Reader made using Python

This is a Python-based application that reads you your PDF file. Say goodbye to traditional PDF reading and straining your eyes. Bring your documents to life using this interactive reader.

Author Agnik7

github.com › jstockwin › py-pdf-parser

GitHub - jstockwin/py-pdf-parser: A Python tool to help extracting information from structured PDFs. · GitHub

A Python tool to help extracting information from structured PDFs. - jstockwin/py-pdf-parser

Starred by 429 users

Forked by 49 users

Languages Python 99.5% | Dockerfile 0.5%

pypi.org › project › pdfreader

pdfreader · PyPI

See GitHub for the latest source. ... Nevertheless it can be used as a part of such tools. See Tutorials & Documentation. Extracts texts (plain text and formatted text objects) Extract PDF forms data (pure strings and formatted text objects)

pip install pdfreader

Published May 03, 2024

Version 0.1.15

Homepage http://github.com/maxpmaxp/pdfreader

github.com › topics › pdf-parser

pdf-parser · GitHub Topics · GitHub

Scanipy stands for "scan it with Python"—it's your smart Python library for scanning and parsing complex PDF files like books, reports, articles, and academic papers. Utilizing cutting-edge Deep Learning algorithms, Scanipy transforms your PDFs into a treasure trove of extractable information: tables, images, equations, and text. ... file-upload api-rest authentification pdf-reader ...

ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › best pdf reader for python

Best PDF Reader for Python (Free & Paid Tools) | IronPDF

January 19, 2026 - Only supports Python 3, which may be a limitation for environments using Python 2. PDFMiner is available under the MIT License, a permissive free software license. Like PyPDF2, it is open-source and free to use. There are no fees for utilizing PDFMiner in your projects, making it an economically attractive option for text extraction and analysis tasks. Selecting the best Python PDF library depends mainly on the specific PDF processing needs.