extract table from pdf python github

A Python library to extract tabular data from PDFs - py-pdf/pypdf_table_extraction

Starred by 67 users

Forked by 17 users

Languages Python

PDF Table Extractor is a Python project for extracting tables from scanned PDF documents using optical character recognition (OCR) and image processing techniques - Baskar-forever/TableExtractor-Advanced-PDF-Table-Extraction

Starred by 43 users

Forked by 11 users

Languages Jupyter Notebook 58.6% | Python 41.4%

Discussions

How to extract Table from PDF in Python? - Stack Overflow

I have thousands of PDF files, composed only of tables, with this structure: [pdf file]. However, despite the files being fairly structured, I cannot read the tables without losing the structure. I tried Py...

stackoverflow.com

Extracting information (Text, Tables, Layouts) from PDFs using OCR.

Have you tried https://github.com/tesseract-ocr/tesseract?

r/Python

February 21, 2024

PDF Table Extraction

If you're trying to get data that would be on the sec.gov website (10-Qs, 10-Ks, etc.), I would recommend looking at their API, which will ultimately make it much easier to extract the right data than trying to parse PDFs. You will also know that the information is accurate, rather than hoping that the PDF extractor tool worked right. You can also check out websites like bamsec.com, which can get specific weird tables out of financial statements if you're looking for something more obscure or company-specific. If you really want to parse PDFs, I don't have a ton of ideas other than what you've tried. Parsing PDFs is a pain, and I'd really try to avoid it if I were you. Maybe see if you can find an XML version of the PDFs (assuming they're public info); that would be easier to parse using BeautifulSoup in Python or something similar.

r/dataengineering

January 16, 2024
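
The suggestion in the reply above, to look for an XML version and parse that instead of the PDF, can be sketched roughly as follows. This is only an illustration: the `report`/`table`/`row`/`cell` element names and the `table_rows` helper are made up for the example, and a real filing will use a different schema. The reply mentions BeautifulSoup; the sketch uses the standard library's `xml.etree.ElementTree` instead so that it runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this example only.
SAMPLE_XML = """<report>
  <table name="income">
    <row><cell>Revenue</cell><cell>1200</cell></row>
    <row><cell>Net income</cell><cell>300</cell></row>
  </table>
</report>"""


def table_rows(xml_text, table_name):
    """Return the rows of the named table as lists of cell strings."""
    root = ET.fromstring(xml_text)
    for table in root.iter("table"):
        if table.get("name") == table_name:
            return [
                [(cell.text or "").strip() for cell in row.findall("cell")]
                for row in table.findall("row")
            ]
    # No table with that name was found.
    return []


rows = table_rows(SAMPLE_XML, "income")
for row in rows:
    print(row)
```

Because the structure is explicit in the markup, there is no guessing about where one column ends and the next begins, which is the main advantage over reading the same table out of a PDF.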

[Complete Beginner] I want to extract tabular data from a PDF for subsequent analysis

I did a fair bit of research into extracting tabular data from PDFs recently. I found that doing it programmatically is an absolute mess, since PDF is a terribly unstructured format. My data was all text-selectable, but even a simple select and copy was awful because of all the extra or missing spaces and newlines. The best solution I found for selectable text was the "table copy" tool in KDE's Okular PDF viewer. Okular is free and open source, and it allows you to draw a rectangle and click in the row and column delimiters. It copies the selection as tab-separated values, which you can then just paste into a file. It's not exactly automatic, and it won't work with non-selectable files, but it still saved me a ton of time.
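
Once the table has been pasted into a file as tab-separated values, as described above, loading it for analysis takes only a few lines. The following is a minimal sketch using the standard library's `csv` module; the `parse_tsv` helper and the sample columns (`Item`, `Qty`, `Price`) are assumptions made up for the example, and the sample string stands in for the contents of the pasted file.

```python
import csv
import io


def parse_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.reader(
        io.StringIO(text.strip("\r\n")),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    # Trim stray spaces in each cell and skip fully blank lines.
    table = [
        [cell.strip() for cell in row]
        for row in reader
        if any(cell.strip() for cell in row)
    ]
    if not table:
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


# Stand-in for text pasted from the PDF viewer (hypothetical data).
sample = "Item\tQty\tPrice\nWidget\t 2\t3.50\nGadget\t10\t 1.25\n"

rows = parse_tsv(sample)
total = sum(int(r["Qty"]) * float(r["Price"]) for r in rows)
print(rows)
print(total)
```

To read from an actual file, pass the result of `open(path, encoding="utf-8").read()` to `parse_tsv`. Cells are kept as strings, so numeric columns need an explicit conversion, as in the `total` line.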