extract multiple tables from pdf python

How to extract more than one table present in a PDF file with tabula in Python?

stackoverflow.com › questions › 49733576 › how-to-extract-more-than-one-table-present-in-a-pdf-file-with-tabula-in-python

I hope the code below is helpful, though I haven't tested it with large tables. Let me know if there is any scenario where this code could fail. I'm new to Python, so the feedback will help me improve :)

import os
import tabula  # newer tabula-py releases expose read_pdf here instead of tabula.wrapper

os.chdir("E:/Documents/myPy/")
# lattice=True is the newer name for the old spreadsheet=True option
tables = tabula.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all', encoding='utf-8', lattice=True)

for i, table in enumerate(tables, start=1):
    # Promote the first row to the header, then drop it from the data
    table.columns = table.iloc[0]
    table = table.iloc[1:].reset_index(drop=True)
    table.columns.name = None
    # To write Excel
    table.to_excel('output' + str(i) + '.xlsx', header=True, index=False)
    # To write CSV
    table.to_csv('output' + str(i) + '.csv', sep='|', header=True, index=False)

Answer from Parvathirajan Natarajan on Stack Overflow

The Python Code

thepythoncode.com › article › extract-pdf-tables-in-python-camelot

How to Extract Tables from PDF in Python - The Python Code

Learning how to extract tables from PDF files in Python using camelot and tabula libraries and export them into several formats such as CSV, excel, Pandas dataframe and HTML.

Readthedocs

camelot-py.readthedocs.io

Camelot: PDF Table Extraction for Humans — Camelot 1.0.9 documentation

Extract tables from PDFs in just a few lines of code: Try it yourself in our interactive quickstart notebook. Or check out a simple example using this pdf.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

Discussions

PDF Table Extraction

if you're trying to get data that would be on the sec.gov website (10-Qs/10-Ks/etc) I would recommend looking at their API which will ultimately be much easier to extract the right data than trying to parse PDFs. You will also know that the information is accurate rather than hoping that the PDF extractor tool worked right. You can also checkout websites like bamsec.com which can get specific weird tables out of financial statements if you're looking for something more obscure/company specific. If you really want to parse PDFs I don't have a ton of ideas other than what you've tried, parsing PDFs is a pain and I'd really try to avoid it if I were you. Maybe see if you can find an XML version of the PDFs (assuming they're public info) that would be easier to parse using beautifulsoup in python or something similar. More on reddit.com

r/dataengineering

January 16, 2024
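
The advice above to prefer a structured XML version over the PDF can be sketched with Python's standard library. This is only an illustration: the element names (`statement`, `row`, `label`, `value`) and the figures are hypothetical placeholders, not the layout of any real filing. The answer mentions beautifulsoup; `xml.etree.ElementTree` is swapped in here only to keep the sketch dependency-free.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; a real document would follow its own schema.
SAMPLE_XML = """
<statement>
  <row><label>Total revenue</label><value>1200</value></row>
  <row><label>Total expense</label><value>950</value></row>
</statement>
"""

def rows_from_xml(xml_text):
    """Return a {label: value} dict built from the hypothetical <row> elements."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for row in root.iter("row"):
        label = row.findtext("label", default="").strip()
        value = row.findtext("value", default="").strip()
        table[label] = float(value)
    return table

figures = rows_from_xml(SAMPLE_XML)
print(figures["Total revenue"])
```

Because the labels arrive as explicit elements, no layout guessing is involved, which is the point the answer makes about avoiding PDF parsing where possible.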

Extract text and tables of a PDF file in Python - Stack Overflow

With algodocs you can extract both - text and tables from system-generated pdfs and scanned images even with poor quality. See algodocs.com/blog/… ... Thanks Zhavat, great tool, but looks like it is not an open-source tool and has no python source code available. More on stackoverflow.com

stackoverflow.com

Videos

How to Extract Tables from PDFs Using Python: Step-by-Step Tutorial ... (YouTube, 19:08, December 6, 2023)

Python Libraries to Extract Tables from PDFs - YouTube (YouTube, 31:39, March 10, 2025)

Extract All the Tables From PDF in 3 minutes With Python - YouTube (YouTube, 03:40, November 17, 2022)

r/Python on Reddit: Learn how to extract tables from PDF using ... (reddit.com, 14.5K, October 17, 2021)

Extract Tables from Complex PDF with Python | Skip Any Table You ... (YouTube, 01:56, June 30, 2025)

Extract text, links, images, tables from Pdf with Python | PyMuPDF, ... (YouTube, 17:00)

stackoverflow.com › questions › 49733576 › how-to-extract-more-than-one-table-present-in-a-pdf-file-with-tabula-in-python

dataframe - How to extract more than one table present in a PDF file with tabula in Python? - Stack Overflow

2 of 5

Even when using the tabula-py wrapper you can use all the same options as can be found on the Tabula Java Docs.

In your case you can simply add pages = "all":

from tabula import read_pdf
df = read_pdf(r"C:\Users\Himanshu Poddar\Desktop\pdf_file.pdf", pages = "all")

Unstract

unstract.com › home › product › python libraries to extract table from pdf

Best Python Libraries to Extract Tables From PDF in 2026

December 16, 2025 - Support for visual debugging, which helps confirm that the table extraction is accurate. Tabula is another popular library for getting tables out of PDFs, known for being easy to use and robust. It is written in Java but provides a convenient Python wrapper. Tabula’s main features include: Easy installation and use within Python. The ability to handle multiple pages and get tables from all of them.

PyPI

pypi.org › project › pypdf-table-extraction

pypdf-table-extraction · PyPI

Here's how you can extract tables from PDFs. You can check out the quickstart notebook. Or follow the example below. You can check out the PDF used in this example here.

>>> import pypdf_table_extraction
>>> tables = pypdf_table_extraction.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

pip install pypdf-table-extraction

Published Apr 02, 2025

Version 1.0.2

Homepage https://github.com/py-pdf/pypdf_table_extraction

GeeksforGeeks

geeksforgeeks.org › python › how-to-extract-pdf-tables-in-python

How to Extract PDF Tables in Python? - GeeksforGeeks

July 23, 2025 - It’s like a smart scanner that spots these tables and turns them into neat data frames you can easily handle in Python. It’s very handy if you want quick and clean results. The PDF file used here is PDF. ... Explanation: camelot.read_pdf() extracts tables from the PDF file "test.pdf".

Medium

medium.com › analytics-vidhya › how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673

How to extract multiple tables from a PDF through python and tabula-py | by Angelica Lo Duca | Analytics Vidhya | Medium

April 20, 2022 - However, it may happen that you ... to copy and paste each of them separately. ... Here, the python library tabula-py helps you to extract multiple tables separately....

reddit.com › r/python › python model for pdf table extraction

r/Python on Reddit: Python Model for PDF table extraction

December 31, 2024 -

I am looking for a Python library or model that can extract tables from PDFs, with these additional requirements:

a) Able to differentiate two tables of different widths on the same page

b) Able to understand a table that spans multiple pages in the same PDF

I tried Tabula and PyMuPDF, and neither gave good results. Please suggest some better models.
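
Requirement (b) above can be approached as a post-processing step on whatever an extraction library returns. The sketch below is a hypothetical helper, not part of Tabula or PyMuPDF: it assumes each extracted table is a list of rows, that the first row is the header, and that a table continued on the next page repeats the same header. Tables that continue without a repeated header would need a different heuristic.

```python
def merge_continued_tables(tables):
    """Merge consecutive tables that repeat the same header row.

    Each table is a list of rows; each row is a list of cell strings.
    The first row of every table is treated as its header (an assumption).
    """
    merged = []
    for table in tables:
        if not table:
            continue  # skip empty extractions
        header, body = table[0], table[1:]
        if merged and merged[-1][0] == header:
            # Same header as the previous table: treat it as a continuation.
            merged[-1].extend(body)
        else:
            merged.append([header] + body)
    return merged

# Hypothetical extraction results, in page order.
page_tables = [
    [["Item", "Qty"], ["A", "1"]],               # page 1
    [["Item", "Qty"], ["B", "2"]],               # page 2, header repeated
    [["Name", "City", "Zip"], ["X", "Y", "Z"]],  # a different, wider table
]
result = merge_continued_tables(page_tables)
print(len(result))
```

Comparing headers also keeps the wider third table separate, which touches on requirement (a) when the two tables have different columns.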

Top answer

1 of 5

We had the same problem, ended up using azure document intelligence

2 of 5

If you know the table headers, you can OCR the PDF and then search for and identify the tables by their headers.
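
A minimal sketch of that idea, assuming the OCR step has already produced one string per text line. The function name and the sample lines are hypothetical; the matching is a plain case-insensitive substring check, which real OCR noise may defeat.

```python
def find_header_lines(ocr_lines, expected_headers):
    """Return indexes of lines that contain every expected header (case-insensitive)."""
    wanted = [h.lower() for h in expected_headers]
    hits = []
    for index, line in enumerate(ocr_lines):
        lowered = line.lower()
        if all(h in lowered for h in wanted):
            hits.append(index)
    return hits

# Hypothetical OCR output, one string per text line.
ocr_lines = [
    "Annual report 2023",
    "Item   Quantity   Price",
    "Widget   4   9.99",
    "Notes on pricing",
]
header_rows = find_header_lines(ocr_lines, ["Item", "Quantity", "Price"])
print(header_rows)
```

Each returned index marks where a table probably starts, so the lines that follow it can be handed to whatever row parser fits the document.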

DataScientYst

datascientyst.com › extract-table-from-pdf-with-python-pandas

How to Extract Table from PDF with Python and Pandas

February 14, 2025 - In this short tutorial, we'll see how to extract tables from PDF files with Python and Pandas. We will cover two cases of table extraction from PDF:

(1) Simple table with tabula-py

from tabula import read_pdf
df_temp = read_pdf('china.pdf')

(2) Table with merged cells

import pandas

IronPDF

ironpdf.com › ironpdf for python › ironpdf for python blog › using ironpdf for python › extract table from pdf python

Extract Table From PDF in Python (Developer Tutorial) | IronPDF for Python

July 29, 2025 - Afterward, the ExtractAllText function extracts all the table data from all the pages within the PDF files. Then, the split function is used to divide the extracted table data into multiple rows and display them on the console screen.
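
The split step described above can be illustrated without IronPDF. In this sketch, `sample` stands in for whatever text an extraction call returned, and the assumption that columns are separated by two or more spaces (or a tab) is mine, not something the tutorial states.

```python
import re

def text_to_rows(extracted_text):
    """Split extracted text into rows, then into cells on tabs or runs of 2+ spaces."""
    rows = []
    for line in extracted_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between rows
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows

# Hypothetical extracted text; single spaces inside a cell are preserved.
sample = "Product  Units  Price\nApple pie  3  4.50\n\nTea  10  1.20\n"
for row in text_to_rows(sample):
    print(row)
```

Splitting on whitespace runs is fragile when a cell is empty or columns are separated by a single space, which is one reason dedicated table extractors exist.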

GitHub

github.com › atlanhq › camelot

GitHub - atlanhq/camelot: Camelot: PDF Table Extraction for Humans · GitHub

Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

Starred by 3.7K users

Forked by 362 users

Languages Python 99.7% | Makefile 0.3%

Medium

medium.com › h7w › python-libraries-for-extracting-tables-from-pdfs-03f069fc4980

Python Libraries for Extracting Tables from PDFs | by Py-Core Python Programming | T3CH | Medium

March 19, 2025 -

import camelot
tables = camelot.read_pdf("example.pdf", pages="1", flavor="stream")
df = tables[0].df
print(df)
tables[0].to_csv("table.csv")

Readthedocs

tabula-py.readthedocs.io

tabula-py: Read tables in a PDF into DataFrame — tabula-py documentation

You can read tables from PDF and convert them into pandas’ DataFrame. tabula-py also converts a PDF file into CSV/TSV/JSON file. We highly recommend looking at the example notebook and trying it on Google Colab. For high-level API reference, see High level interfaces. ... I got an empty DataFrame. How can I resolve it? The result is different from tabula-java. Or, stream option seems not to work appropriately ... I faced ParserError: Error tokenizing data. C error. How can I extract multiple tables?

PyShark

pyshark.com › home › extract table from pdf using python

Extract Table from PDF using Python - Python for PDF - PyShark

January 4, 2024 - Learn how to extract tables from PDF files using Python. Complete code walkthrough with detailed examples using tabula-py library.

Medium

medium.com › @MemoonaTahira › working-with-embedded-tables-in-pdfs-using-python-64ce273f59de

How to Extract Embedded Tables from PDFs: Types of tables and Python Libraries Explained | by Memoona Tahira | Medium

February 8, 2025 - If the table is created using form fields (e.g., text fields, checkboxes, etc.), the data can be extracted directly from the form fields. Python libraries like PyPDF2 or pdfrw can extract this form data.

Stack Overflow

stackoverflow.com › questions › 47533875 › how-to-extract-a-table-as-text-from-the-pdf

python - How to extract a table as text from the PDF - Stack Overflow

Top answer

1 of 4

This answer is for anyone encountering pdfs with images and needing to use OCR. I could not find a workable off-the-shelf solution; nothing that gave me the accuracy I needed.

Here are the steps I found to work.

Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the pdf into images.
Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
Use OpenCV to find and extract tables.
Use OpenCV to find and extract each cell from the table.
Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
Use Tesseract to OCR each cell.
Combine the extracted text of each cell into the format you need.

I wrote a python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs & Source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code, they take advantage of external tools like pdfimages and tesseract. I'll provide some brief examples for a couple of the steps that do require code.

Finding tables:

This link was a good reference while figuring out how to find tables. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

import cv2

def find_tables(image):
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy image shape is (rows, cols), i.e. (height, width)
    image_height, image_width = img_bin.shape
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images

Extract cells from table.

This is very similar to finding tables, so I won't include all the code. The part worth showing is how the cells are sorted.

We want to identify the cells from left-to-right, top-to-bottom.

We’ll find the rectangle with the most top-left corner. Then we’ll find all of the rectangles that have a center that is within the top-y and bottom-y values of that top-left rectangle. Then we’ll sort those rectangles by the x value of their center. We’ll remove those rectangles from the list and repeat.

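def cell_in_same_row(c1, c2):
    # Cells are (x, y, w, h). c1 is in c2's row if c1's vertical center
    # lies strictly between c2's top and bottom edges.
    c1_center = c1[1] + c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom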

orig_cells = [c for c in cells]
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
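
The sorting steps above can be tried on their own, without OpenCV, by wrapping them in a function and feeding it a few rectangles. This is a self-contained restatement of the same approach; the function name `group_cells_into_rows` and the sample rectangles are hypothetical.

```python
def cell_in_same_row(c1, c2):
    """True if the vertical center of c1 lies between the top and bottom of c2."""
    c1_center = c1[1] + c1[3] / 2
    return c2[1] < c1_center < c2[1] + c2[3]

def group_cells_into_rows(cells):
    """Group (x, y, w, h) rectangles into rows, left-to-right then top-to-bottom."""
    remaining = list(cells)
    rows = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        same_row = [c for c in rest if cell_in_same_row(c, first)]
        rows.append(sorted([first] + same_row, key=lambda c: c[0]))
        remaining = [c for c in rest if not cell_in_same_row(c, first)]
    # Order the rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows

# Hypothetical bounding rectangles for a slightly skewed 2x2 grid, shuffled.
cells = [(110, 52, 100, 40), (10, 10, 100, 40), (10, 50, 100, 40), (110, 12, 100, 40)]
rows = group_cells_into_rows(cells)
print(rows)
```

Testing the center of one cell against the vertical span of another tolerates the small offsets that a slightly rotated scan introduces, which a plain sort on y would not.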

2 of 4

I would suggest extracting the table using tabula.
Pass your PDF as an argument to the tabula API and it will return the table as a DataFrame.
Each table in your PDF is returned as one DataFrame.
The tables are returned as a list of DataFrames; to work with DataFrames you need pandas.

This is my code for extracting pdf.

import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here'  + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)

Please refer to this repo of mine for more details.

reddit.com › r/dataengineering › pdf table extraction

r/dataengineering on Reddit: PDF Table Extraction

January 16, 2024 -

Hi everyone,

I have a list of PDFs from which I need to extract table data in an automated way. I need one specific table, or some important data points from that table. The PDFs are from different sources, so the table structures differ from one another. I also need to locate the table in each PDF, because it appears on different pages every year. What would be the most robust way to extract the tables in this case?

Things I have experimented with:

3rd party Python packages (pdfplumber, tabula): results were not good enough; these packages couldn't extract tables neatly and consistently. They were splitting values/labels into chunks, etc.
OpenAI gpt-4 chat completions endpoint: very inconsistent. It is difficult both to locate the table in the PDF and to extract the table or specific data points.
OpenAI gpt-4 vision API endpoint: I take snapshots of PDF pages and try to extract data using the vision endpoint, but because the resolution is not high it makes mistakes.

I need as much automation as possible for this task. That's why I am even trying to locate the table in the PDF automatically. Do any of you have experience with a similar task? Does it even make sense to put effort into this? If so, what would be the best solution?

Sample PDF table which I am trying to extract (let's say I need Total revenue & expense for 2023):

Top answer

1 of 5

2 of 5

You can use the camelot-py module. I got better results than with tabula by using the stream method. The output is a pandas DataFrame, so further transformations are easier.

GitHub

github.com › masterhimanshupoddar › extracting-multiple-tables-from-pdf-using-Tabula

GitHub - masterhimanshupoddar/extracting-multiple-tables-from-pdf-using-Tabula: Extract multiple tables or particular table from a pdf file · GitHub

# For extracting all the tables in a pdf file we can directly pass multiple_tables = True as an argument to the function, e.g.
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)

Starred by 8 users

Forked by 4 users

Languages Python

Stack Overflow

stackoverflow.com › questions › 69262489 › extract-text-and-tables-of-a-pdf-file-in-python

Extract text and tables of a PDF file in Python - Stack Overflow

Top answer

1 of 1

The answer depends if the question is general or specific to a single form. Your approach is reasonable for the general case, but there will be variability. If you have a pdf form that is a single form or report that has been created with different data at each iteration consider converting the form from pdf to postscript then see if you can parse the postscript.

Two utilities do this: pdf2ps and pdftops. Try each. This approach works best if you know some PostScript. With some luck, the needed fields may be simple text strings. Worth a try.
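
Trying each utility in turn can be scripted. The sketch below assumes both tools accept the input PDF followed by the output PostScript file as positional arguments, and it skips any tool that is not installed; the helper names and the file names are hypothetical.

```python
import shutil
import subprocess

def build_ps_command(tool, pdf_path, ps_path):
    """Build the argument list for pdf2ps or pdftops (input first, then output)."""
    if tool not in ("pdf2ps", "pdftops"):
        raise ValueError(f"unsupported tool: {tool}")
    return [tool, pdf_path, ps_path]

def convert_to_postscript(pdf_path, ps_path):
    """Try each utility in turn; return the name of the one that succeeded, or None."""
    for tool in ("pdf2ps", "pdftops"):
        if shutil.which(tool) is None:
            continue  # not installed on this machine
        result = subprocess.run(build_ps_command(tool, pdf_path, ps_path))
        if result.returncode == 0:
            return tool
    return None

# Only the command is printed here; nothing is executed.
print(build_ps_command("pdftops", "report.pdf", "report.ps"))
```

Calling `convert_to_postscript("report.pdf", "report.ps")` would then run whichever tool is available, and the resulting file can be searched for the text strings the answer mentions.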

DEV Community

dev.to › rishabdugar › pdf-extraction-retrieving-text-and-tables-together-using-python-14c2

PDF Extraction: Retrieving Text and Tables together using Python🐍 - DEV Community

September 22, 2024 - This approach provides a systematic way to extract and combine text and tables from PDFs using “pdfplumber”. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout.
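
The positional idea can be sketched without any PDF library. The dictionaries below are an illustrative stand-in for what an extractor might report, not pdfplumber's actual return types: each text line has a `top` coordinate and each table has `top` and `bottom` coordinates, all hypothetical values measured from the top of the page.

```python
def merge_by_position(text_lines, tables):
    """Interleave text lines and tables in top-to-bottom order.

    text_lines: list of {"top": float, "text": str}
    tables: list of {"top": float, "bottom": float, "rows": list}
    Text lines whose top falls inside a table's vertical span are dropped,
    since the table already contains that text.
    """
    def inside_table(line):
        return any(t["top"] <= line["top"] <= t["bottom"] for t in tables)

    items = [("text", line["top"], line["text"])
             for line in text_lines if not inside_table(line)]
    items += [("table", t["top"], t["rows"]) for t in tables]
    items.sort(key=lambda item: item[1])
    return [(kind, content) for kind, _, content in items]

# Hypothetical positions for one page.
text_lines = [
    {"top": 20, "text": "Quarterly summary"},
    {"top": 65, "text": "Item Qty"},  # same text as the table header
    {"top": 140, "text": "End of report"},
]
tables = [{"top": 60, "bottom": 120, "rows": [["Item", "Qty"], ["A", "1"]]}]
layout = merge_by_position(text_lines, tables)
print(layout)
```

Dropping text lines that fall inside a table's span avoids emitting the same cell text twice, once as prose and once as table rows.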