Looks like what you have is a large chunk of text data that you want to interpret line-by-line.
You can use the StringIO class to wrap that content as a seekable file-like object:
>>> from io import StringIO
>>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
>>> buf = StringIO(content)
>>> buf.readline()
'big\n'
>>> buf.readline()
'ugly\n'
>>> buf.readline()
'contents\n'
>>> buf.readline()
'of\n'
>>> buf.readline()
'multiple\n'
>>> buf.readline()
'pdf files'
>>> buf.seek(0)
0
>>> buf.readline()
'big\n'
In your case, do:
from io import StringIO

# Read each line of the PDF
pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore").decode("ascii"))
for line in pdfContent:
    doSomething(line.strip())
(Answer from MikeyB on Stack Overflow)
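The same buffer behaviour can be checked with a small self-contained sketch (the sample string is made up; `str.splitlines()` is shown as a simpler alternative when seeking is not needed):

```python
from io import StringIO

content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
buf = StringIO(content)

# Iterating the file-like buffer yields one line at a time, newline included.
lines = [line.strip() for line in buf]
print(lines)

# splitlines() gives the same result without the file-like wrapper.
print(content.splitlines() == lines)
```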
Using yield and PdfFileReader.pages can simplify things:
from pyPdf import PdfFileReader

def get_pdf_content_lines(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            for line in page.extractText().splitlines():
                yield line

for line in get_pdf_content_lines('/path/to/file.pdf'):
    print(line)
In addition, some may google "python get pdf content text", so here's how (this is how I got here):
from pyPdf import PdfFileReader

def get_pdf_content(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
    content = ' '.join(content.split())
    return content

print(get_pdf_content('/path/to/file.pdf'))
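The `' '.join(content.split())` step collapses every run of whitespace (newlines and tabs included) into single spaces; a minimal stand-alone sketch of that trick:

```python
# str.split() with no argument splits on any run of whitespace,
# so join/split normalizes newlines, tabs and repeated spaces.
raw = "  first line\nsecond\t line \n\n third "
normalized = ' '.join(raw.split())
print(normalized)  # first line second line third
```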
Hi! I had to create a dictionary (book) with the key "Page n" and the value as another dictionary (page n) with the key "line m" and the value as the text of the mth line on the nth page. However, I'm struggling with how to extract single lines.
from pypdf import PdfReader

reader = PdfReader("title.pdf")
book = {}
for k in range(0, len(reader.pages)):
    text = ""
    text += reader.pages[k].extract_text()
    book["Page " + str(k)] = text
Below is my code so far. I think the main problem is that reader.pages[k].extract_text() converts all the text of the kth page of the PDF, ignoring the fact that there are different lines. How can I separate each line?
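Since extract_text() keeps the '\n' characters between lines, splitting the page string is enough to separate the lines; a sketch on a made-up string (the sample stands in for extract_text() output):

```python
# Stand-in for the string returned by reader.pages[k].extract_text().
page_text = "first line\nsecond line\nthird line"

# splitlines() breaks the page into its individual lines,
# which can then be keyed as "line 1", "line 2", ...
page_dict = {f"line {m + 1}": line for m, line in enumerate(page_text.splitlines())}
print(page_dict)
```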
ChatGPT suggests a solution that doesn't work:
# Import the necessary libraries
import PyPDF2

# Open the PDF file
with open('example.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Initialize the book dictionary
    book = {}

    # Loop through each page
    for k in range(len(reader.pages)):
        # Extract text from the k-th page
        text = reader.pages[k].extract_text()

        # Initialize the page dictionary
        page_dict = {}

        # Split the text into lines
        lines = text.split('\n')

        # Loop through each line and add it to the page dictionary
        for m, line in enumerate(lines):
            page_dict[f'line {m+1}'] = line

        # Add the page dictionary to the book dictionary
        book[f'Page {k+1}'] = page_dict

# Print the book dictionary
print(book)
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # the Python 2 file() builtin is gone; use open()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))
Output
Hydraulic Fracturing Fluid Product Component Information Disclosure
Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:
12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032
Hydraulic Fracturing Fluid Composition:
Trade Name
Supplier
Purpose
Ingredients
Chemical Abstract Service Number
(CAS #)
Maximum Ingredient
Concentration
in Additive ( by mass)**
Comments
Maximum Ingredient
Concentration
in HF Fluid ( by mass)**
Water Sand HCL
Pumpco Pumpco
Proppant Hydrochloric Acid
Plexslick 921
Pumpco
Friction Reducer
Plexaid 673
Pumpco
Scale Inhibitor
Plexcide 24L
Pumpco
Biocide
Crystaline Silica
Hydrogen Chloride Water
Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water
Methyl Alcohol Organic phosphonic acid salts
Dazomat Sodium Hydroxide Water
7732-18-5 14808-60-7
7647-01-0 7732-18-5
64742-47-8 9003-06-9
7732-18-5
67-56-1
533-74-4 1310-73-2 7732-18-5
100.00 100.00
90.01799 9.84261
40.00 60.00
35.00 28.00 7.00 30.00
25.00 75.00
24.00 4.00 72.00
0.03353 0.05029
0.00941 0.00753 0.00188 0.00807
0.00276 0.00828
0.00424 0.00071 0.01271
- Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100
Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)
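The trailing .strip().split('\n\n') call above groups the extracted text into blank-line-separated blocks; the same slicing on a small made-up string:

```python
# Two consecutive newlines separate logical blocks in pdfminer's
# output, so '\n\n' is a natural block delimiter.
sample = "Header\n\nRow one\nRow two\n\nFooter\n"
blocks = sample.strip().split('\n\n')
print(blocks)  # ['Header', 'Row one\nRow two', 'Footer']
```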
textract works fine in Python 3, using the tesseract method. Example code:
import textract
text = textract.process("pdfs/testpdf1.pdf", method='tesseract')
print(text)
with open('textract-results.txt', 'w+') as f:
    f.write(str(text))
https://pypi.org/project/textract/
I'm using PyPDF2 to extract text from a PDF. This works fine: 'raw' text is imported and printed. However, I need to clean up the output, since there are line breaks all over the place. At the same time, not ALL line breaks need to be removed; some of them are valid.
I tried it with .strip(), .replace() and .splitlines() but everything I try leads to either one long line, 'ghost' spaces all over the place, or most line ends removed, but still some (unwanted) line ends left. So I have no clue on how to do this.
(I have the idea that 'real' line ends can be distinct from 'unwanted' line ends because they are double. But I'm not sure.)
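If 'real' line ends really are doubled, a regular expression that joins single newlines but preserves blank lines is one way to test that idea (a generic sketch, not specific to PyPDF2's output):

```python
import re

# Replace a lone '\n' (one that is not part of a '\n\n' pair)
# with a space, keeping doubled newlines as real paragraph breaks.
raw = "This sentence was\nwrapped by the PDF.\n\nThis is a real new paragraph."
cleaned = re.sub(r'(?<!\n)\n(?!\n)', ' ', raw)
print(cleaned)
```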
My script (as it currently looks)
import PyPDF2

raw_text = ''

def cleanlines(value):
    value = value.strip()
    return ''.join(value.splitlines())

with open('testfile.pdf', 'rb') as pdf_file:
    pdfReader = PyPDF2.PdfFileReader(pdf_file)
    for page_num in range(pdfReader.numPages):
        pdf_page = pdfReader.getPage(page_num)
        raw_text_part = (pdf_page.extractText())
        raw_text += raw_text_part.strip()

# raw_text = raw_text.replace("\n", " ")
# raw_text = raw_text.strip()
# raw_text = raw_text.replace(" ", "\n")

print(raw_text)
print(cleanlines(raw_text))
(A part of) the original PDF and the output can be seen here: https://imgur.com/a/40u4nl2
The output is pasted in Notepad++ with "all characters" on to make the line ends visible. I'm using Windows 10 and Python 3.7.0. Please forgive me the language; I used Samuel L. Ipsum for it (https://slipsum.com/).
I was looking for a simple solution to use for Python 3.x and Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package, really straightforward for reading PDFs.
Tika-Python is a Python binding to the Apache Tika REST services allowing Tika to be called natively in the Python community.
from tika import parser # pip install tika
raw = parser.from_file('sample.pdf')
print(raw['content'])
Note that Tika is written in Java so you will need a Java runtime installed.
pypdf recently improved a lot. Depending on the data, it is on-par or better than pdfminer.six.
pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in when to insert a new line). The core advantage is that they are way faster. But they are not pure Python, which can mean that you cannot execute them everywhere. And some might have licenses too restrictive for you to use them.
Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:
- Anything special regarding tables (just that the text is there, not about the formatting)
- Arabic text (RTL languages)
- Mathematical formulas.
That means if your use-case requires those points, you might perceive the quality differently.
Having said that, the results from November 2022:
pypdf
I became the maintainer of pypdf and PyPDF2 in 2022! The community improved the text extraction a lot in 2022. Give it a try :-)
First, install it:
pip install pypdf
And then use it:
from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
Please note that those packages are not maintained:
- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)
pymupdf
import fitz  # install using: pip install PyMuPDF

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text)
Other PDF libraries
- pikepdf does not support text extraction (source)