After struggling a little bit, I found a way.
For each page of the file, I had to pass the area of the table and the boundaries of the columns to tabula's read_pdf function.
Here is the working code:
import pypdf
from tabula import read_pdf

# pdf_file holds the path to the PDF (defined elsewhere)

# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)

# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),  # (top, left, bottom, right), in points
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),  # x coordinates of the column boundaries
)
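The "for each page" part could be sketched roughly as a loop over the page numbers. The helper names `build_page_options` and `read_all_pages` are hypothetical, and the sketch assumes that every page shares the same area and column boundaries, which may not hold for your file. If the layout differs between pages, pass different values for each page.

```python
from typing import Dict, List, Sequence


def build_page_options(
    n_pages: int, area: Sequence[float], columns: Sequence[float]
) -> List[Dict]:
    """Build one set of read_pdf keyword arguments per page (pages are 1-based)."""
    return [
        dict(
            pages=page,
            guess=False,
            stream=True,
            encoding="utf-8",
            area=area,
            columns=columns,
        )
        for page in range(1, n_pages + 1)
    ]


def read_all_pages(pdf_file: str, n_pages: int, area, columns) -> list:
    """Read every page with the same area/columns (assumption: identical layout)."""
    from tabula import read_pdf  # imported here so the helper above works without tabula

    tables = []
    for options in build_page_options(n_pages, area, columns):
        # read_pdf returns a list of DataFrames for the requested page
        tables.extend(read_pdf(pdf_file, **options))
    return tables
```

With the values from the answer, this might be called as `read_all_pages(pdf_file, n_pages, (96, 24, 558, 750), (24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748))`.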
Answer from fmarques on Stack Overflow
Use the tabula library (note that the package name on PyPI is tabula-py, not tabula):
pip install tabula-py
Then extract the tables:
import tabula

# url is the path or URL of the PDF
# this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)
# if you want to read all pages
dfs = tabula.read_pdf(url, pages="all")
# read_pdf returns a list of DataFrames; this selects the second one
dfs[1]
By the way, I also tried another way of reading PDF files, and it works better than tabula. I will post it soon.
PyPDF2 to parse through tables in PDF
I want to write a script that can read tables from PDFs for data visualization. I installed PyPDF2 and have been playing around with it, but I would like some additional resources to find the best way to do this. I can read the data I want from the PDF, but it just reads the whole page and the output is not well structured.
Similar files to what I am working on can be found here
check out this thread
https://www.reddit.com/r/learnpython/comments/7x9inm/project_help_pdf_extraction_to_csv/du7jyzf/?context=3
Hey there,
I'm kind of in a similar spot - trying to get tabular data out of a PDF file and reprocessing it for other uses. I feel your pain ;)
I made some progress on my little project last night. Basically, I ended up spending some time getting up to speed (definitely not 'running', but maybe a slow crawl) with regular expressions to strip out some of the extraneous formatting that got thrown into the PDFs I'm dealing with. I'm still a long way from a complete solution at this point, but at least it's progress. My point is: don't necessarily throw up your hands when the simpler tools like .strip(), .split() and .replace() don't solve everything.
Good luck!
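To make the regular-expression idea a bit more concrete, here is a minimal sketch. The function names `clean_line` and `split_columns` are made up for illustration, and it assumes that the unwanted formatting is non-ASCII or control characters and that columns are separated by two or more spaces. Real PDF output may need different patterns, and stripping non-ASCII characters would be wrong for text that legitimately contains them.

```python
import re
from typing import List


def clean_line(line: str) -> str:
    """Replace anything outside printable ASCII with a space, then trim."""
    line = re.sub(r"[^\x20-\x7E]", " ", line)
    return line.strip()


def split_columns(line: str) -> List[str]:
    """Split a cleaned line into cells wherever two or more spaces occur."""
    cleaned = clean_line(line)
    return re.split(r"\s{2,}", cleaned) if cleaned else []


# A made-up line with non-breaking spaces and a trailing newline
sample = "  Widget A\u00a0\u00a0   12    3.50  \n"
print(split_columns(sample))  # ['Widget A', '12', '3.50']
```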
Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is an interesting workaround as well (PyPDF2).
Here is Martin Thoma's code (in current PyMuPDF releases, getToC is named get_toc):
from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.get_toc()  # [[lvl, title, page, …], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
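The warning in that code matters in practice: when two bookmarks point at the same page, the later title overwrites the earlier one. One possible way around it, sketched here with a hypothetical helper `group_bookmarks`, is to keep a list of titles per page. The sketch works on a plain list shaped like the table of contents above ([lvl, title, page, …]), so it does not need PyMuPDF itself.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_bookmarks(toc: Sequence[Sequence]) -> Dict[int, List[str]]:
    """Collect every bookmark title for each page instead of keeping only the last."""
    grouped = defaultdict(list)
    for entry in toc:
        # Only the first three fields are used; extra fields are ignored
        level, title, page = entry[:3]
        grouped[page].append(title)
    return dict(grouped)


# Made-up table of contents with two bookmarks on page 1
example_toc = [[1, "Introduction", 1], [2, "Scope", 1], [1, "Methods", 4]]
print(group_bookmarks(example_toc))
# {1: ['Introduction', 'Scope'], 4: ['Methods']}
```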
You can use the following as a reference for extracting PDF outlines and their page numbers:
import PyPDF2  # uses the legacy PyPDF2 (< 3.0) API

targetPDFFile = 'your_pdf_filename.pdf'
pdfFileObj = open(targetPDFFile, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Use outlines instead of bookmarks; outlines are more accurate than bookmarks
result = {}


def outline_dict(bookmark_list):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call for nested outline levels
            outline_dict(item)
        else:
            try:
                pageNum = pdfReader.getDestinationPageNumber(item) + 1
                # print("key=" + str(pageNum) + ",title=" + item.title)
                # Items with the same page number will be overwritten
                result[pageNum] = item.title
            except Exception:
                print("except:" + str(item))


outline_dict(pdfReader.getOutlines())
print(result)
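For newer installations, the same recursion could be written against the pypdf API, where the reader exposes `outline` and `get_destination_page_number`. This is only a sketch: the function name `outline_pages` is hypothetical, and unlike the code above it keeps a list of titles per page so that entries sharing a page are not overwritten.

```python
from typing import Dict, List, Optional


def outline_pages(
    reader, outline=None, result: Optional[Dict[int, List[str]]] = None
) -> Dict[int, List[str]]:
    """Map 1-based page numbers to the outline titles that point at them."""
    if result is None:
        result = {}
    if outline is None:
        outline = reader.outline
    for item in outline:
        if isinstance(item, list):
            # Nested list: a deeper outline level, handled recursively
            outline_pages(reader, item, result)
        else:
            index = reader.get_destination_page_number(item)
            if index is not None:  # skip destinations that cannot be resolved
                result.setdefault(index + 1, []).append(item.title)
    return result
```

With pypdf installed, this might be called as `outline_pages(pypdf.PdfReader('your_pdf_filename.pdf'))`.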