🌐
GitHub
github.com › py-pdf › pypdf
GitHub - py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files · GitHub
    from pypdf import PdfReader

    reader = PdfReader("example.pdf")
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()
Starred by 9.9K users
Forked by 1.6K users
Languages   Python
🌐
GitHub
github.com › py-pdf › awesome-pdf
GitHub - py-pdf/awesome-pdf: A curated list of resources around PDF files · GitHub
KOReader: a document viewer primarily aimed at e-ink readers · react-native-pdf: a react native PDF view component · PdfViewPager: Android widget to display PDF documents in your Activities or Fragments ... pdftotext: an application that converts Portable Document Format (PDF) files to plain text. Part of poppler-utils. pdfminer.six: a Python library for extracting information from PDF documents
Starred by 154 users
Forked by 21 users
Discussions

What’s the Best Python Library for Extracting Text from PDFs?
In my experience, PyMuPDF is the best open-source Python library for this, better than PDFplumber, PyPDF2, and others. For paid options, Azure Document Intelligence is excellent; it can even handle unstructured tables. More on reddit.com
🌐 r/LangChain
85
81
July 19, 2024
Python & PDF parsing: any modern, powerful, well-maintained open-source libraries out there? - Software Recommendations Stack Exchange
I'm looking for well-maintained and well-documented powerful PDF parsing libraries for Python (mainly to extract and parse data from various types of PDFs with different/unpredictable structures, including with the help of reliable and powerful OCR). Currently I'm aware of the following main projects: PDFMiner: https://github... More on softwarerecs.stackexchange.com
🌐 softwarerecs.stackexchange.com
November 14, 2019
[D] Choosing a pdf processing package in Python
If you’re ever building something more production-level or need deeper control (like merging, cropping, rotating, or handling PDFs and other formats across platforms), take a look at Apryse. It’s not open source, but their Python SDK is super robust and covers everything from text extraction to page manipulation. More on reddit.com
🌐 r/MachineLearning
15
31
January 8, 2024
Best OCR out there?

I'm not an OCR expert either, but I've used Tesseract a few times and it's quite impressive. Of course the OCR will not be 100% perfect, but if your input is a good-quality picture and the handwriting is ~ok, you should have something to work with.

More on reddit.com
🌐 r/LanguageTechnology
9
11
April 11, 2018
🌐
GitHub
github.com › pikepdf › pikepdf
GitHub - pikepdf/pikepdf: A Python library for reading and writing PDF, powered by QPDF · GitHub
February 23, 2026 - Jupyter integration -- render PDF and page previews inline in notebooks · Binary wheels everywhere -- pre-built for Linux, macOS, Windows (x86-64 and ARM64) · Liberal license -- MPL-2.0, compatible with most open and closed source projects ... Python has several PDF libraries, each with different strengths.
Starred by 2.7K users
Forked by 219 users
Languages   Python 77.3% | C++ 22.1%
🌐
GitHub
github.com › maxpmaxp › pdfreader
GitHub - maxpmaxp/pdfreader: Python API for PDF documents
Python API for PDF documents. Contribute to maxpmaxp/pdfreader development by creating an account on GitHub.
Starred by 124 users
Forked by 28 users
Languages   Python 100.0%
🌐
GitHub
github.com › pmaupin › pdfrw
GitHub - pmaupin/pdfrw: pdfrw is a pure Python library that reads and writes PDFs · GitHub
It can do decompression and decryption and seems to know a lot about items inside at least some kinds of PDF files. In comparison, pdfrw knows less about specific PDF file features (such as metadata), but focuses on trying to have a more Pythonic API for mapping the PDF file container syntax to Python, and (IMO) has a simpler and better PDF file parser.
Starred by 1.9K users
Forked by 277 users
Languages   Python 71.7% | Jupyter Notebook 28.3%
🌐
Medium
onlyoneaman.medium.com › i-tested-7-python-pdf-extractors-so-you-dont-have-to-2025-edition-c88013922257
I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition) | by Aman Kumar | Medium
July 21, 2025 - I am building aiwand, a simple OS Python package that could help speed up extraction, structured data, and provider switching. Check it out on GitHub. ... https://bella.amankumar.ai/examples/pdf/11-google-doc-document/google-doc-document.pdf (Test Python File)
🌐
GitHub
github.com › py-pdf › pdf
GitHub - py-pdf/pdf: A modern pure-Python library for reading PDF files · GitHub
A modern pure-Python library for reading PDF files - py-pdf/pdf
Starred by 12 users
Forked by 4 users
Languages   Python 95.2% | Makefile 4.8%
🌐
GitHub
github.com › ashutoshvarma › pyxpdf
GitHub - ashutoshvarma/pyxpdf: Fast and memory-efficient Python PDF Parser based on xpdf sources
pyxpdf is a fast and memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources. Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison) · Extract text ...
Starred by 44 users
Forked by 17 users
Languages   Cython 69.7% | Python 22.8% | C++ 5.1% | Makefile 1.3% | Shell 1.1%
🌐
Reddit
reddit.com › r/langchain › what’s the best python library for extracting text from pdfs?
r/LangChain on Reddit: What’s the Best Python Library for Extracting Text from PDFs?
July 19, 2024

Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
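
The paragraph chunking asked about here can be sketched in plain Python once the text has been extracted (for example with a library's text-extraction call). This is a minimal sketch, not any library's method. It assumes that paragraphs are separated by blank lines in the extracted text, which many PDFs do not guarantee, and that section headings appear as standalone paragraphs. The function names `split_paragraphs` and `group_by_heading` and the sample text are hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split already-extracted text into paragraphs on blank lines."""
    # Normalise line endings so the blank-line pattern behaves consistently.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (optionally containing whitespace) ends a paragraph.
    blocks = re.split(r"\n\s*\n", text)
    # Collapse hard line wraps inside each paragraph and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def group_by_heading(paragraphs: list[str], headings: set[str]) -> dict[str, list[str]]:
    """Group paragraphs under the most recent heading (assumed known in advance)."""
    sections: dict[str, list[str]] = {}
    current = None
    for para in paragraphs:
        if para in headings:
            current = para
            sections[current] = []
        elif current is not None:
            sections[current].append(para)
    return sections


# Hypothetical sample standing in for text extracted from a PDF.
sample = (
    "Introduction\n\n"
    "First intro\nparagraph.\n\n"
    "Second intro paragraph.\n\n"
    "Conclusion\n\n"
    "Final remarks."
)

paragraphs = split_paragraphs(sample)
sections = group_by_heading(paragraphs, {"Introduction", "Conclusion"})
```

Real PDFs often lose blank lines during extraction, so a layout-aware extractor or a heuristic based on font size or indentation may be needed before a split like this is useful.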

🌐
GitHub
github.com › topics › pdf-reader
pdf-reader · GitHub Topics · GitHub
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents · ocr tesseract text-extraction tesseract-ocr pdf-to-text poppler optical-character-recognition pdf-reader pdftotext ...
🌐
GitHub
github.com › topics › pdf-document
pdf-document · GitHub Topics · GitHub
Multiple and Large PDF Documents Text Extraction. python pdf parser data-science pdf-document text-analytics pdfs pypdf2 extract-text pdfminer pdf-processing pdfs-textextract
🌐
GitHub
github.com › topics › pdf-viewer
pdf-viewer · GitHub Topics · GitHub
A PDF component Python library for integrating with the ComPDF API to build powerful PDF viewer and editor features.
🌐
GitHub
github.com › topics › python-pdf
python-pdf · GitHub Topics · GitHub
A professional, modular, and open-source Python command-line tool to extract data from PDFs — including plain text, tables, images, and OCR content — using best-in-class libraries like PyMuPDF, pdfplumber, and pytesseract.
🌐
GitHub
github.com › euske › pdfminer
GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six. · GitHub
April 15, 2024 - Python PDF Parser (Not actively maintained). Check out pdfminer.six. - euske/pdfminer
Starred by 5.3K users
Forked by 1.1K users
Languages   Python 99.6% | Makefile 0.4%
🌐
GitHub
github.com › Agnik7 › PDF-Reader
GitHub - Agnik7/PDF-Reader: A simple PDF Reader made using Python
This is a Python-based application that reads you your PDF file. Say goodbye to traditional PDF reading and straining your eyes. Bring your documents to life using this interactive reader.
Author   Agnik7
🌐
GitHub
github.com › jstockwin › py-pdf-parser
GitHub - jstockwin/py-pdf-parser: A Python tool to help extracting information from structured PDFs. · GitHub
A Python tool to help extracting information from structured PDFs. - jstockwin/py-pdf-parser
Starred by 429 users
Forked by 49 users
Languages   Python 99.5% | Dockerfile 0.5%
🌐
PyPI
pypi.org › project › pdfreader
pdfreader · PyPI
See GitHub for the latest source. ... Nevertheless it can be used as a part of such tools. See Tutorials & Documentation. Extracts texts (plain text and formatted text objects) Extract PDF forms data (pure strings and formatted text objects)
pip install pdfreader
Published   May 03, 2024
Version   0.1.15
🌐
GitHub
github.com › topics › pdf-parser
pdf-parser · GitHub Topics · GitHub
Scanipy stands for "scan it with Python"—it's your smart Python library for scanning and parsing complex PDF files like books, reports, articles, and academic papers. Utilizing cutting-edge Deep Learning algorithms, Scanipy transforms your PDFs into a treasure trove of extractable information: tables, images, equations, and text. ... file-upload api-rest authentification pdf-reader ...
🌐
IronPDF
ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › best pdf reader for python
Best PDF Reader for Python (Free & Paid Tools) | IronPDF
January 19, 2026 - Only supports Python 3, which may be a limitation for environments using Python 2. PDFMiner is available under the MIT License, a permissive free software license. Like PyPDF2, it is open-source and free to use. There are no fees for utilizing PDFMiner in your projects, making it an economically attractive option for text extraction and analysis tasks. Selecting the best Python PDF library depends mainly on the specific PDF processing needs.