You need to extract text from the PDF pages using extract_text:
import PyPDF2
with open('dummy.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    for page in reader.pages:
        print(page.extract_text())
Check the documentation here
Answer from حمزة نبيل on Stack Overflow
pip install PyPDF2
python - How do I use PyPDF2 to read and display the contents of my PDF when ran? - Stack Overflow
PyPDF vs PyPDF2 vs PyPDF3 vs PyPDF4 vs others
python - "no module named PyPDF2" error - Stack Overflow
Failing to write some PDFs with PyPDF2
The following code prints the extracted text instead of the reader object:
import PyPDF2
with open('dummy.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # print(reader) would only show the reader object, not its text
    for page in reader.pages:
        text = page.extract_text()
        print(text)
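If you want the whole document as one string rather than printing page by page, here is a minimal sketch of the same loop. The helper names extract_all_text and read_pdf_text are made up for illustration and are not part of PyPDF2; the only PyPDF2 calls assumed are PdfReader, .pages and extract_text() from the answers above.

```python
def extract_all_text(reader, separator="\n"):
    """Return the text of every page joined into one string.

    `reader` is anything exposing `.pages`, where each page has an
    `extract_text()` method, as PyPDF2.PdfReader does.
    """
    parts = []
    for page in reader.pages:
        # Treat a page that yields no text (e.g. a scanned image) as empty.
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def read_pdf_text(path):
    """Open `path` with PyPDF2 and return all of its text (hypothetical helper)."""
    import PyPDF2  # imported here so extract_all_text works without PyPDF2

    with open(path, "rb") as file:
        # Extraction happens while the file is still open.
        return extract_all_text(PyPDF2.PdfReader(file))
```

Calling read_pdf_text('dummy.pdf') should then give you the same text the loop above prints, with pages separated by a newline.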
Initially I just googled for a way to get the number of pages in a PDF file. The first result was PyPDF2, so I just used that.
After a while I got an error from a file, so I started looking around and realized that there are 4 different forks of this library!
What is going on here? Why are there so many forks?
In other news, later on I will be scraping some text from some pdf files. Which library would you recommend? I won't be needing OCR, the text is already in the files.
Thanks!
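For the page count and the "which fork do I actually have" part, a rough sketch might look like the following. The import names in CANDIDATES are my assumption about how the forks are imported, and installed_forks and count_pages are hypothetical helper names. The only library behaviour relied on is that a PyPDF2-style reader exposes .pages, as in the snippets above.

```python
import importlib.util

# Assumed import names of the forks; adjust if your environment differs.
CANDIDATES = ["pypdf", "PyPDF2", "PyPDF3", "PyPDF4"]


def installed_forks(names=CANDIDATES):
    """Return the subset of `names` that can be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is not None]


def count_pages(reader):
    """Number of pages, for a reader exposing `.pages` like PyPDF2.PdfReader."""
    return len(reader.pages)
```

With PyPDF2 installed, count_pages(PyPDF2.PdfReader(file)) inside a with open(..., 'rb') block gives the page count, and installed_forks() shows which of the forks are importable on your machine.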