Hope the code below helps. I haven't tested it with large tables yet, so let me know if there is any scenario that could break it. I'm new to Python and would like to improve my knowledge :)

import os
import tabula

os.chdir("E:/Documents/myPy/")
# lattice=True (formerly spreadsheet=True) targets tables drawn with ruling lines
tables = tabula.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', lattice=True)

for i, table in enumerate(tables, start=1):
    # Promote the first row to the header, then drop it from the data
    table.columns = table.iloc[0]
    table = table.iloc[1:].reset_index(drop=True)
    table.columns.name = None
    # To write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # To write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)
Answer from Parvathirajan Natarajan on Stack Overflow
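The header-promotion step in the answer above can be tried without a PDF. This is a minimal sketch, assuming a table whose first row holds the column names; `promote_first_row_to_header` is a hypothetical helper name, not part of tabula or pandas, and the sample data is made up.

```python
import pandas as pd


def promote_first_row_to_header(table: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data."""
    if table.empty:
        # Nothing to promote; return a copy so the caller's frame is untouched.
        return table.copy()
    promoted = table.iloc[1:].reset_index(drop=True)
    promoted.columns = table.iloc[0].tolist()
    return promoted


# Stand-in for one DataFrame returned by the PDF reader (made-up data).
raw = pd.DataFrame([["name", "qty"], ["apple", "3"], ["pear", "5"]])
clean = promote_first_row_to_header(raw)
print(clean)
```

One scenario that can go wrong with this approach is a table whose first row is data rather than a header, so it is worth inspecting each table before promoting.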
🌐
The Python Code
thepythoncode.com › article › extract-pdf-tables-in-python-camelot
How to Extract Tables from PDF in Python - The Python Code
Learning how to extract tables from PDF files in Python using camelot and tabula libraries and export them into several formats such as CSV, excel, Pandas dataframe and HTML.
🌐
Readthedocs
camelot-py.readthedocs.io
Camelot: PDF Table Extraction for Humans — Camelot 1.0.9 documentation
Extract tables from PDFs in just a few lines of code. Try it yourself in our interactive quickstart notebook, or check out a simple example using this pdf.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True)  # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv')  # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df  # get a pandas DataFrame!
Discussions

Python Model for PDF table extraction
We had the same problem, ended up using azure document intelligence More on reddit.com
🌐 r/Python
19
26
December 31, 2024
python - How to extract a table as text from the PDF - Stack Overflow
Use pdfimages from https://pop... of the pdf into images. Use Tesseract to detect rotation and ImageMagick mogrify to fix it. Use OpenCV to find and extract tables. Use OpenCV to find and extract each cell from the table. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software. Use Tesseract to OCR each cell. Combine the extracted text of each cell into the format you need. I wrote a python package with ... More on stackoverflow.com
🌐 stackoverflow.com
PDF Table Extraction
if you're trying to get data that would be on the sec.gov website (10-Qs/10-Ks/etc) I would recommend looking at their API which will ultimately be much easier to extract the right data than trying to parse PDFs. You will also know that the information is accurate rather than hoping that the PDF extractor tool worked right. You can also checkout websites like bamsec.com which can get specific weird tables out of financial statements if you're looking for something more obscure/company specific. If you really want to parse PDFs I don't have a ton of ideas other than what you've tried, parsing PDFs is a pain and I'd really try to avoid it if I were you. Maybe see if you can find an XML version of the PDFs (assuming they're public info) that would be easier to parse using beautifulsoup in python or something similar. More on reddit.com
🌐 r/dataengineering
29
12
January 16, 2024
Extract text and tables of a PDF file in Python - Stack Overflow
With algodocs you can extract both - text and tables from system-generated pdfs and scanned images even with poor quality. See algodocs.com/blog/… ... Thanks Zhavat, great tool, but looks like it is not an open-source tool and has no python source code available. More on stackoverflow.com
🌐 stackoverflow.com
People also ask

Is there support for extracting images from PDFs using IronPDF in Python?
Yes, IronPDF supports extracting images from PDFs in Python, allowing you to isolate and save images from PDF documents as part of your data processing tasks.
🌐
ironpdf.com
ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › extract table from pdf python
Extract Table From PDF in Python (Developer Tutorial) | IronPDF ...
Can I convert HTML to PDF using IronPDF in Python?
Yes, IronPDF allows you to convert HTML to PDF in Python. You can render HTML strings or files as PDFs using IronPDF's methods, facilitating the creation of PDF documents from web content.
How do I troubleshoot common issues when extracting tables from PDF using IronPDF?
To troubleshoot extraction issues with IronPDF, ensure your Python environment is correctly set up with all necessary installations. Verify the PDF file is accessible and check your code syntax for using PdfDocument.FromFile() and ExtractAllText() methods. Consult the IronPDF documentation for further guidance.
🌐
Unstract
unstract.com › home › product › python libraries to extract table from pdf
Best Python Libraries to Extract Tables From PDF in 2026
December 16, 2025 - Support for visual debugging, which helps make sure the table extraction is accurate. Tabula is another popular library for getting tables out of PDFs, known for being easy to use and strong. It uses Java but provides a convenient Python wrapper. Tabula’s main features include: Easy installation and use within Python. The ability to handle multiple pages and get tables from all of them.
🌐
PyPI
pypi.org › project › pypdf-table-extraction
pypdf-table-extraction · PyPI
Here's how you can extract tables from PDFs. You can check out the quickstart notebook, or follow the example below. You can check out the PDF used in this example here.

>>> import pypdf_table_extraction
>>> tables = pypdf_table_extraction.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True)  # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv')  # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df  # get a pandas DataFrame!
      » pip install pypdf-table-extraction
    
Published   Apr 02, 2025
Version   1.0.2
🌐
GeeksforGeeks
geeksforgeeks.org › python › how-to-extract-pdf-tables-in-python
How to Extract PDF Tables in Python? - GeeksforGeeks
July 23, 2025 - It’s like a smart scanner that spots these tables and turns them into neat data frames you can easily handle in Python. It’s very handy if you want quick and clean results. ... Explanation: camelot.read_pdf() extracts tables from the PDF file "test.pdf".
🌐
Medium
medium.com › analytics-vidhya › how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673
How to extract multiple tables from a PDF through python and tabula-py | by Angelica Lo Duca | Analytics Vidhya | Medium
April 20, 2022 - However, it may happen that you ... to copy and paste each of them separately. ... Here, the python library tabula-py helps you to extract multiple tables separately....
🌐
Reddit
reddit.com › r/python › python model for pdf table extraction
r/Python on Reddit: Python Model for PDF table extraction
December 31, 2024 -

Hi

I am looking for a Python library or model that can extract tables from PDFs, with these additional requirements:

a) Able to differentiate two tables on the same page that have different widths

b) Able to understand a table that spans multiple pages in the same PDF

I tried Tabula and PyMuPDF, but neither gives good results. Please suggest some better models.
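Requirement (b) can sometimes be handled as a post-processing step on whatever per-page DataFrames an extractor returns. The sketch below is only a heuristic, not a feature of Tabula or PyMuPDF: it assumes a continued table keeps its column count and that continuation pages do not repeat the header row. `stitch_page_tables` is a hypothetical helper and the sample frames are made up.

```python
import pandas as pd


def stitch_page_tables(tables):
    """Merge consecutive tables that share the same number of columns.

    Heuristic assumption: a table continued on the next page keeps its
    column count, and continuation pages do not repeat the header row.
    """
    merged = []
    for table in tables:
        if merged and merged[-1].shape[1] == table.shape[1]:
            # Same width as the previous table: treat it as a continuation.
            continuation = table.copy()
            continuation.columns = merged[-1].columns
            merged[-1] = pd.concat([merged[-1], continuation], ignore_index=True)
        else:
            merged.append(table.reset_index(drop=True))
    return merged


# Made-up per-page results: a 2-column table split across two pages,
# followed by an unrelated 3-column table.
page1 = pd.DataFrame({"item": ["a", "b"], "price": [1, 2]})
page2 = pd.DataFrame({0: ["c"], 1: [3]})
other = pd.DataFrame({"x": [1], "y": [2], "z": [3]})

result = stitch_page_tables([page1, page2, other])
print(len(result))
print(result[0])
```

Tables of different widths stay separate, which loosely covers requirement (a), but two unrelated tables that happen to have the same width would be merged by mistake, so a real pipeline would need an extra check such as comparing column positions.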

🌐
DataScientYst
datascientyst.com › extract-table-from-pdf-with-python-pandas
How to Extract Table from PDF with Python and Pandas
February 14, 2025 - In this short tutorial, we'll see how to extract tables from PDF files with Python and Pandas. We will cover two cases of table extraction from PDF:

(1) Simple table with tabula-py

from tabula import read_pdf
df_temp = read_pdf('china.pdf')

(2) Table with merged cells

import pandas
🌐
IronPDF
ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › extract table from pdf python
Extract Table From PDF in Python (Developer Tutorial) | IronPDF for Python
July 29, 2025 - Afterward, the ExtractAllText function extracts all the table data from all the pages within the PDF files. Then, the split function is used to divide the extracted table data into multiple rows and display them on the console screen.
🌐
GitHub
github.com › atlanhq › camelot
GitHub - atlanhq/camelot: Camelot: PDF Table Extraction for Humans · GitHub
Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True)  # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv')  # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df  # get a pandas DataFrame!
Starred by 3.7K users
Forked by 362 users
Languages   Python 99.7% | Makefile 0.3%
🌐
Medium
medium.com › h7w › python-libraries-for-extracting-tables-from-pdfs-03f069fc4980
Python Libraries for Extracting Tables from PDFs | by Py-Core Python Programming | T3CH | Medium
March 19, 2025 -

import camelot
tables = camelot.read_pdf("example.pdf", pages="1", flavor="stream")
df = tables[0].df
print(df)
tables[0].to_csv("table.csv")
🌐
Readthedocs
tabula-py.readthedocs.io
tabula-py: Read tables in a PDF into DataFrame — tabula-py documentation
You can read tables from PDF and convert them into pandas’ DataFrame. tabula-py also converts a PDF file into CSV/TSV/JSON file. We highly recommend looking at the example notebook and trying it on Google Colab. For high-level API reference, see High level interfaces. ... I got an empty DataFrame. How can I resolve it? The result is different from tabula-java. Or, stream option seems not to work appropriately ... I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?
🌐
PyShark
pyshark.com › home › extract table from pdf using python
Extract Table from PDF using Python - Python for PDF - PyShark
January 4, 2024 - Learn how to extract tables from PDF files using Python. Complete code walkthrough with detailed examples using tabula-py library.
🌐
Medium
medium.com › @MemoonaTahira › working-with-embedded-tables-in-pdfs-using-python-64ce273f59de
How to Extract Embedded Tables from PDFs: Types of tables and Python Libraries Explained | by Memoona Tahira | Medium
February 8, 2025 - If the table is created using form fields (e.g., text fields, checkboxes, etc.), the data can be extracted directly from the form fields. Python libraries like PyPDF2 or pdfrw can extract this form data.
Top answer
1 of 4
29

This answer is for anyone encountering pdfs with images and needing to use OCR. I could not find a workable off-the-shelf solution; nothing that gave me the accuracy I needed.

Here are the steps I found to work.

  1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the pdf into images.

  2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.

  3. Use OpenCV to find and extract tables.

  4. Use OpenCV to find and extract each cell from the table.

  5. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.

  6. Use Tesseract to OCR each cell.

  7. Combine the extracted text of each cell into the format you need.

I wrote a python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs & Source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code, they take advantage of external tools like pdfimages and tesseract. I'll provide some brief examples for a couple of the steps that do require code.

  3. Finding tables:

This link was a good reference while figuring out how to find tables. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

import cv2

def find_tables(image):
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy shape is (rows, cols), i.e. (height, width)
    image_height, image_width = img_bin.shape
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images
  4. Extract cells from table.

This is very similar to step 3 (finding tables), so I won't include all the code. The part I will reference is the sorting of the cells.

We want to identify the cells from left-to-right, top-to-bottom.

We’ll find the rectangle with the most top-left corner. Then we’ll find all of the rectangles that have a center that is within the top-y and bottom-y values of that top-left rectangle. Then we’ll sort those rectangles by the x value of their center. We’ll remove those rectangles from the list and repeat.

def cell_in_same_row(c1, c2):
    c1_center = c1[1] + c1[3] - c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

orig_cells = [c for c in cells]
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
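Step 7, combining the OCR text into the format you need, is not shown in the answer. A minimal sketch of one way to do it, assuming the rows are already sorted lists of `(x, y, w, h)` cells as produced above. `rows_to_csv` and `ocr_cell` are hypothetical names: `ocr_cell` stands in for whatever OCR call is used on each cropped cell, and here it is faked with a dictionary of made-up text so the example runs on its own.

```python
import csv
import io


def rows_to_csv(rows, ocr_cell):
    """Turn sorted rows of cell rectangles into CSV text.

    rows: list of rows, each a list of (x, y, w, h) tuples, already sorted
          top-to-bottom and left-to-right.
    ocr_cell: hypothetical callable mapping one rectangle to its text.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Strip whitespace that OCR engines commonly leave around the text.
        writer.writerow([ocr_cell(cell).strip() for cell in row])
    return buffer.getvalue()


# Stand-in for real OCR: look the text up by rectangle (made-up data).
fake_text = {
    (0, 0, 50, 20): "Name ",
    (50, 0, 50, 20): "Qty\n",
    (0, 20, 50, 20): "Apple, red",
    (50, 20, 50, 20): "3",
}
sample_rows = [
    [(0, 0, 50, 20), (50, 0, 50, 20)],
    [(0, 20, 50, 20), (50, 20, 50, 20)],
]
csv_text = rows_to_csv(sample_rows, fake_text.get)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand means cells that contain commas or quotes are escaped correctly.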
2 of 4
23
  • I suggest extracting the table using tabula.
  • Pass your PDF as an argument to the tabula API and it will return the table as a DataFrame.
  • Each table in your PDF is returned as one DataFrame.
  • The tables are returned as a list of DataFrames; to work with DataFrames you need pandas.

This is my code for extracting tables from a PDF.

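import os
import pandas as pd
import tabula

file = "filename.pdf"
# os.path.join inserts the path separator between directory and file name
path = os.path.join('enter your directory path here', file)
# With multiple_tables=True, read_pdf returns a list of DataFrames
dfs = tabula.read_pdf(path, pages='1', multiple_tables=True)
print(dfs)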

Please refer to this repo of mine for more details.

🌐
Reddit
reddit.com › r/dataengineering › pdf table extraction
r/dataengineering on Reddit: PDF Table Extraction
January 16, 2024 -

Hi everyone,

I have a list of PDFs from which I need to extract table data in an automated way. I need one specific table, or some important data points from that table. The PDFs come from different sources, so the table structures differ from one another. I also need to locate the table in each PDF, because it appears on a different page every year. What would be the most robust way to extract the tables in this case?

Things I have experimented:

  1. Third-party Python packages (pdfplumber, tabula): results were not good enough. These packages couldn't extract tables cleanly or consistently; they split values and labels into chunks, among other problems.

  2. OpenAI GPT-4 chat completions endpoint: very inconsistent. It is difficult both to locate the table in the PDF and to extract the table or specific data points.

  3. OpenAI GPT-4 vision API endpoint: I take snapshots of PDF pages and try to extract data using the vision endpoint, but because the resolution is not high it makes mistakes.

I need as much automation as possible for this task. That's why I am even trying to locate the table in the PDF in an automated way. Do any of you have experience with a similar task? Does it even make sense to put effort into this? If so, what would be the best solution?

Sample PDF table which I am trying to extract (let's say I need Total revenue & expense for 2023):
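For the "locate the table" part of the question, one possible sketch is to extract every table from every page first and then search the resulting DataFrames for a label such as "Total revenue". This assumes the extractor already returns usable DataFrames, which the post says is not always the case. `find_tables_with_keyword` is a hypothetical helper and the sample tables and numbers are made up.

```python
import pandas as pd


def find_tables_with_keyword(tables, keyword):
    """Return (index, table) pairs whose header or cells mention keyword."""
    needle = keyword.lower()
    matches = []
    for index, table in enumerate(tables):
        header_text = [str(col).lower() for col in table.columns]
        cell_text = table.astype(str).to_numpy().ravel()
        haystack = header_text + [str(cell).lower() for cell in cell_text]
        if any(needle in text for text in haystack):
            matches.append((index, table))
    return matches


# Made-up stand-ins for tables extracted from different pages.
tables = [
    pd.DataFrame({"Employee": ["A"], "Role": ["Dev"]}),
    pd.DataFrame({"Item": ["Total revenue", "Total expense"], "2023": [120, 95]}),
]
hits = find_tables_with_keyword(tables, "total revenue")
print([index for index, _ in hits])
```

A case-insensitive substring match is deliberately loose; labels that differ between sources (for example "Revenues, total") would need a list of synonyms or fuzzy matching.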

🌐
GitHub
github.com › masterhimanshupoddar › extracting-multiple-tables-from-pdf-using-Tabula
GitHub - masterhimanshupoddar/extracting-multiple-tables-from-pdf-using-Tabula: Extract multiple tables or particular table from a pdf file · GitHub
# For extracting all the tables in a PDF file, we can pass multiple_tables=True directly as an argument to the function, e.g.
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
Starred by 8 users
Forked by 4 users
Languages   Python
🌐
DEV Community
dev.to › rishabdugar › pdf-extraction-retrieving-text-and-tables-together-using-python-14c2
PDF Extraction: Retrieving Text and Tables together using Python🐍 - DEV Community
September 22, 2024 - This approach provides a systematic way to extract and combine text and tables from PDFs using “pdfplumber”. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout.
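The positional idea can be sketched without a PDF. This is not the article's code: it assumes each text line and each table has already been reduced to a `top` coordinate (distance from the top of the page) plus its content, which the caller would have to obtain from pdfplumber or a similar library. `interleave_by_position` and the dictionary keys are hypothetical, and the sample values are made up.

```python
def interleave_by_position(text_lines, tables):
    """Merge text lines and tables into reading order.

    text_lines: list of dicts like {"top": float, "text": str}
    tables: list of dicts like {"top": float, "bottom": float, "rows": list}
    Text lines that fall inside a table's vertical span are skipped, since
    the same words are already present in the table rows.
    """
    def inside_table(line):
        return any(t["top"] <= line["top"] <= t["bottom"] for t in tables)

    blocks = [("text", line["top"], line["text"])
              for line in text_lines if not inside_table(line)]
    blocks += [("table", t["top"], t["rows"]) for t in tables]
    # Sort by vertical position so the original layout order is kept.
    blocks.sort(key=lambda block: block[1])
    return [(kind, content) for kind, _, content in blocks]


lines = [
    {"top": 10.0, "text": "Quarterly report"},
    {"top": 55.0, "text": "Item Amount"},  # also part of the table below
    {"top": 120.0, "text": "Notes follow."},
]
page_tables = [
    {"top": 50.0, "bottom": 100.0, "rows": [["Item", "Amount"], ["Rent", "10"]]},
]
layout = interleave_by_position(lines, page_tables)
print(layout)
```

Dropping text lines that sit inside a table's vertical span avoids emitting the same words twice, once as loose text and once as table cells.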