I don't know why PyPDF2 can't extract the text from that PDF, but the pdftotext package can:

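import io

import pdftotext
# six keeps this import working on both Python 2 and 3; on Python 3 alone,
# "from urllib.request import urlopen" is equivalent
from six.moves.urllib.request import urlopen

# Download the PDF and wrap its bytes in an in-memory file object
url = 'https://www.sec.gov/litigation/admin/2015/34-76574.pdf'
remote_file = urlopen(url).read()
memory_file = io.BytesIO(remote_file)

# Parse the in-memory PDF
pdf = pdftotext.PDF(memory_file)

# Iterate over all the pages
for page in pdf:
    print(page)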

Extracted text:

                                  UNITED STATES OF AMERICA
                                                Before the
                        SECURITIES AND EXCHANGE COMMISSION
SECURITIES EXCHANGE ACT OF 1934
Release No. 76574 / December 7, 2015
ADMINISTRATIVE PROCEEDING
File No. 3-16987
                                                        ORDER INSTITUTING CEASE-AND-DESIST
In the Matter of                                        PROCEEDINGS, PURSUANT TO SECTION
                                                        21C OF THE SECURITIES EXCHANGE ACT
        KEFEI WANG                                      OF 1934, MAKING FINDINGS, AND
                                                        IMPOSING REMEDIAL SANCTIONS AND A
Respondent.                                             CEASE-AND-DESIST ORDER
                                                     I.
        The Securities and Exchange Commission (“Commission”) deems it appropriate and in the
public interest that cease-and-desist proceedings be, and hereby are, instituted pursuant to 21C of
the Securities Exchange Act of 1934 (“Exchange Act”) against Kefei Wang (“Respondent”).
                                                    II.
        In anticipation of the institution of these proceedings, Respondent has submitted an Offer
of Settlement (the “Offer”) which the Commission has determined to accept. Solely for the
purpose of these proceedings and any other proceedings brought by or on behalf of the
Commission, or to which the Commission is a party, and without admitting or denying the findings
herein, except as to the Commission’s jurisdiction over him and the subject matter of these
proceedings, which are admitted, and except as provided herein in Section V, Respondent consents
to the entry of this Order Instituting Cease-and-Desist Proceedings, Pursuant to Section 21C of the
Securities Exchange Act of 1934, Making Findings, and Imposing Remedial Sanctions and a
Cease-and-Desist Order (“Order”), as set forth below.

                                                  III.
        On the basis of this Order and Respondent’s Offer, the Commission finds1 that:
                                              Summary
        1.      Respondent violated Section 15(a)(1) of the Exchange Act by acting as an
unregistered broker-dealer in connection with his representation of clients who were seeking U.S.
residency through the Immigrant Investor Program. Respondent helped effect certain individuals’
securities purchases in an EB-5 Regional Center. Respondent received a commission from that
Regional Center for each investment he facilitated.
                                             Respondent
        2.      Kefei Wang, age 39, is a resident of China. During the relevant time period, he was
a U.S. resident and an owner of Nautilus Global Capital, LLC , a now defunct entity that was based
in Fremont, California.
                                             Background
        3.      The United States Congress created the Immigrant Investor Program, also known as
“EB-5,” in 1990 to stimulate the U.S. economy through job creation and capital investment by
foreign investors. The Program offers EB-5 visas to individuals who invest $1 million in a new
commercial enterprise that creates or preserves at least 10 full-time jobs for qualifying U.S.
workers (or $500,000 in an enterprise located in a rural area or an area of high unemployment). A
certain number of EB-5 visas are set aside for investors in approved Regional Centers. A Regional
Center is defined as “any economic unit, public or private, which is involved with the promotion of
economic growth, including increased export sales, improved regional productivity, job creation,
and increased domestic capital investment.” 8 C.F.R. § 204.6(e) (2015).
        4.      Typical Regional Center investment vehicles are offered as limited partnership
interests. The partnership interests are securities, usually offered pursuant to one or more
exemptions from the registration requirements of the U.S. securities laws. The Regional Centers
are often managed by a person or entity which acts as a general partner of the limited partnership.
The Regional Centers, the investment vehicles, and the managers are collectively referred to herein
as “EB-5 Investment Offerers.”
        5.      Various EB-5 Investment Offerers paid commissions to anyone who successfully
sold limited partnership interests to new investors.
1
        The findings herein are made pursuant to Respondent’s Offer of Settlement and are not
        binding on any other person or entity in this or any other proceeding.
                                                    2

              Respondent Received Commissions for His Clients’ EB-5 Investments
         6.      From at least January 2010 through May 2014, Respondent received a portion of
commissions from one EB-5 Investment Offerer totaling $40,000. The commissions constituted
his portion of the commissions that were paid pursuant to a written Agency Agreement between
Nautilus Global Capital and the EB-5 Investment Offerer. On one or more occasions the
commission was paid to a foreign bank account identified by the Respondent despite the fact that
the Respondent was U.S.-based during the relevant time period.
         7.      Respondent performed activities necessary to effectuate the transaction, including
recommending the specific EB-5 Investment Offerer referenced in paragraph 6 to his clients;
acting as a liaison between the EB-5 Investment Offerer and the investors; and facilitating the
transfer and/or documentation of investment funds to the EB-5 Investment Offerer. Respondent
received his portion of transaction-based commissions due to Nautilus Global Capital for its
services from that EB-5 Investment Offerer.
         8.      As a result of the conduct described above, Respondent violated Section 15(a)(1) of
the Exchange Act which makes it unlawful for any broker or dealer which is either a person other
than a natural person or a natural person not associated with a broker or dealer to make use of the
mails or any means or instrumentality of interstate commerce “to effect any transactions in, or to
induce or attempt to induce the purchase or sale of, any security” unless such broker or dealer is
registered in accordance with Section 15(b) of the Exchange Act.
                                                  IV.
         In view of the foregoing, the Commission deems it appropriate to impose the sanctions
agreed to in Respondent Kefei Wang’s Offer.
         Accordingly, pursuant to Section 21C of the Exchange Act, it is hereby ORDERED that:
         A.      Respondent shall cease and desist from committing or causing any violations and
any future violations of Section 15(a)(1) of the Exchange Act.
         B.      Respondent shall, within ten (10) days of the entry of this Order, pay disgorgement
of $40,000, prejudgment interest of $1,590, and a civil money penalty of $25,000 to the Securities
and Exchange Commission for transfer to the general fund of the United States Treasury in
accordance with Exchange Act Section 21F(g)(3). If timely payment of disgorgement and
prejudgment interest is not made, additional interest shall accrue pursuant to SEC Rule of Practice
600 [17 C.F.R. § 201.600]. If timely payment of the civil money penalty is not made, additional
interest shall accrue pursuant to 31 U.S.C. § 3717. Payment must be made in one of the following
ways:
         (1)     Respondent may transmit payment electronically to the Commission, which will
                 provide detailed ACH transfer/Fedwire instructions upon request;
                                                   3

         (2)       Respondent may make direct payment from a bank account via Pay.gov through the
                   SEC website at http://www.sec.gov/about/offices/ofm.htm; or
         (3)       Respondent may pay by certified check, bank cashier’s check, or United States
                   postal money order, made payable to the Securities and Exchange Commission and
                   hand-delivered or mailed to:
                   Enterprise Services Center
                   Accounts Receivable Branch
                   HQ Bldg., Room 181, AMZ-341
                   6500 South MacArthur Boulevard
                   Oklahoma City, OK 73169
         Payments by check or money order must be accompanied by a cover letter identifying
Kefei Wang as a Respondent in these proceedings, and the file number of these proceedings; a
copy of the cover letter and check or money order must be sent to Stephen L. Cohen, Associate
Director, Division of Enforcement, Securities and Exchange Commission, 100 F St., NE,
Washington, DC 20549-5553.
                                                    V.
         It is further Ordered that, solely for purposes of exceptions to discharge set forth in Section
523 of the Bankruptcy Code, 11 U.S.C. § 523, the findings in this Order are true and admitted by
Respondent, and further, any debt for disgorgement, prejudgment interest, civil penalty or other
amounts due by Respondent under this Order or any other judgment, order, consent order, decree
or settlement agreement entered in connection with this proceeding, is a debt for the violation by
Respondent of the federal securities laws or any regulation or order issued under such laws, as set
forth in Section 523(a)(19) of the Bankruptcy Code, 11 U.S.C. § 523(a)(19).
         By the Commission.
                                                          Brent J. Fields
                                                          Secretary
                                                     4

[Finished in 0.5s]
Answer from Martin Thoma on Stack Overflow
GitHub
github.com › mstamy2 › PyPDF2 › issues › 437
extract_text works for some PDF files, but not the others · Issue #437 · py-pdf/pypdf
June 22, 2018 - However, print(page_content) does return null if I use another PDF file, “55 HARRISON GARDEN.pdf” which I actually need to extract some information from:

    ### This code works for the ndvi file, but returns empty string for the harrison gdn file! I need to figure out why
    import PyPDF2

    # creating a pdf file object
    pdfFileObj = open('C:/Google Drive/Ward 29/data/55 HARRISON GARDEN.pdf', 'rb')

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)

    # getting the number of pages in pdf file
    number_of_pages = pdfReader.getNumPages()

    # creating a page object
    pageObj = pdfReader.getPage(0)
    page_content = pageObj.extractText()
    print(page_content)

    # closing the pdf file object
    pdfFileObj.close()
Author   babak-khamsehi
Top answer
1 of 2
13

2 of 2
-4

I think there might be an issue with how you are extracting the pages. Try looping over the pages and extracting each one separately, like so:

    for i in range(number_of_pages):
        pageObj = pdfReader.getPage(i)
        page = pageObj.extractText()
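
The same per-page loop can also be written against the newer snake_case interface. The sketch below assumes a reader object that exposes `pages`, with each page exposing `extract_text()`, which is how recent pypdf and PyPDF2 3.x releases name them. `extract_all_pages`, `FakePage` and `FakeReader` are hypothetical names used only for illustration and are not part of either library. Looping changes how the pages are visited, not whether their text can be extracted.

```python
def extract_all_pages(reader):
    """Return a list with the extracted text of every page, in order."""
    texts = []
    for page in reader.pages:
        # Treat a missing result as an empty string so callers always get str
        texts.append(page.extract_text() or "")
    return texts


# Stand-ins for a real reader so the sketch runs without a PDF file.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


if __name__ == "__main__":
    reader = FakeReader(["first page", "", "third page"])
    for number, text in enumerate(extract_all_pages(reader), start=1):
        print(number, repr(text))
```

With pypdf installed, the same function would presumably be called as `extract_all_pages(PdfReader(path))` after `from pypdf import PdfReader`.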
Discussions

Problem with extracting text from PDF in python with pyPDF2
Garbage in what way? PDF text extraction isn't easy due to the way PDFs are structured. Text doesn't flow as it would in a Word doc for example, instead it is positioned on the page in blocks. Try PyMuPDF - https://pymupdf.readthedocs.io/en/latest/app1.html
r/Python, July 24, 2023

pdf - Python - pypdf2 extractText() not working - Stack Overflow
I am trying to extract text and then edit it, but the text is not getting extracted. It shows the number of pages and the header elements correctly; only extractText() is not working....
stackoverflow.com

PyPDF2 and PyPDF4 fails to extract text from the PDF
While running the above the 4th line of code successfully returns the correct value i.e no. of pages in the PDF, however, the 6th line (PDF extraction) gives a one page long blank data. I’ve tried using PyPDF2 and PyPDF4 and ran the code in both Python terminal and sublimetext and in both ...
forum.freecodecamp.org, September 11, 2021

Parse PDF to extract text?

I think what you want is extractText, detailed in PyPDF2's documentation, here: https://pythonhosted.org/PyPDF2/PageObject.html#PyPDF2.pdf.PageObject.extractText

Try this line at the end:

print(pageObj.extractText())
r/learnpython, June 22, 2016
GitHub
github.com › mstamy2 › PyPDF2 › issues › 168
ExtractText yields nothing for apparently good PDF · Issue #168 · py-pdf/pypdf
August 1, 2015 -

    with open(filename, "rb") as pdf_file:
        try:
            pdf_obj = PdfFileReader(pdf_file)
            # gather properties
            prop_en = pdf_obj.getIsEncrypted()
            err = ""
            if not prop_en:
                # Look for any text on the first N pages
                prop_img = True
                prop_pg = pdf_obj.getNumPages()
                for i in xrange(min(prop_pg, 3)):
                    pagei = pdf_obj.getPage(i)
                    pageitext = pagei.extractText()
                    # Set property and stop searching at first text found
                    if len(pageitext) > 0:
                        prop_img = False
                        break
Author   chrisinmtown
Python Forum
python-forum.io › thread-22407.html
PyPDF2 processing problem
Hello, I've met problem using PyPDF2 module. For some books the text extraction works, for others - not (i.e. text is empty) page_number = 11 pageObj = pdfReader.getPage(page_number) text = pageObj.ex
Automate the Boring Stuff
automatetheboringstuff.com › 2e › chapter15
Chapter 15 – Working with PDF and Word Documents
While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As a result, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn’t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files.
freeCodeCamp
forum.freecodecamp.org › python
PyPDF2 and PyPDF4 fails to extract text from the PDF - Python - The freeCodeCamp Forum
September 11, 2021 -

    import PyPDF4 as p2
    pdffile = open("XXXX.pdf","rb")
    pdfread=p2.PdfFileReader(pdffile)
    print(pdfread.getNumPages())
    pageinfo=pdfread.getPage(0)
    print(pageinfo.extractText())

While running the above the 4th line of code successfully returns the correct value i.e no. of pages in the PDF, however, the 6th line (PDF extraction) gives a one page long blank data.
Readthedocs
pypdf2.readthedocs.io › en › 3.0.0 › user › extract-text.html
Extract Text from a PDF — PyPDF2 documentation
PyPDF2 is no OCR software; it will not be able to detect those failures. PyPDF2 will also never be able to extract text from images. And finally there are issues that PyPDF2 will deal with. If you find such a text extraction bug, please share the PDF with us so we can work on it!
Reddit
reddit.com › r › learnpython › comments › 4oz0le › parse_pdf_to_extract_text
r/learnpython - Parse PDF to extract text?
June 22, 2016 - PyPDF2 is good at splitting/ combining PDFs and transforming pdf files, but slightly unreliable for extracting text. PDFMiner is good at that. (PDFMiner3k for Python 3).
Sou-Nan-De-Gesu
soudegesu.com › en › post › python › extract-text-from-pdf-with-pypdf2
Use PyPDF2 - extract text data from PDF file - Sou-Nan-De-Gesu
December 2, 2018 - (b) The prohibitions in subsection (a) of this section apply except to the extent provided by statutes, or in regulations, orders, directives, or licenses that may be issued pursuant to this order, and notwithstanding any contract entered into or any license or permit granted prior to the date of this order. VerDate Sep<11>2014 18:13 Nov 01, 2018Jkt 247001PO 00000Frm 00003Fmt 4705Sfmt 4790E:\FR\FM\02NOE0.SGM02NOE0 · I can extract text in page, but some symbols are garbled like Title 3Ñ and ezuelaÕs.
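
One plausible explanation for symbols like `Ñ` and `Õ` appearing in place of punctuation is that the PDF font uses Mac OS Roman byte values that were read as Latin-1. This is an assumption about that particular file, not something the snippet confirms. Under that assumption, the minimal sketch below re-decodes the two garbled samples; `reinterpret_mac_roman` is a hypothetical helper name.

```python
def reinterpret_mac_roman(text):
    """Re-decode text whose bytes were Mac OS Roman but were read as Latin-1.

    Raises UnicodeEncodeError if the text contains characters outside Latin-1,
    which would mean the assumption does not hold for that input.
    """
    return text.encode("latin-1").decode("mac_roman")


if __name__ == "__main__":
    for sample in ("Title 3Ñ", "ezuelaÕs"):
        print(repr(sample), "->", repr(reinterpret_mac_roman(sample)))
```

If the result still looks wrong for a given document, the garbling likely has a different cause, such as a font with a custom encoding and no mapping back to Unicode.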
GitHub
github.com › mstamy2 › PyPDF2 › issues › 362
extractText returns only empty lines for some files · Issue #362 · py-pdf/pypdf
July 28, 2017 -

    import PyPDF2

    # File name
    file = 'sample.pdf'

    # Open File
    with open(file, "rb") as f:
        # Read in file
        pdfReader = PyPDF2.PdfFileReader(f)
        # Check number of pages
        number_of_pages = pdfReader.numPages
        print(number_of_pages)
        # Get first page
        pageObj = pdfReader.getPage(0)
        # Extract text from page 1
        text = pageObj.extractText()
        print(text)

It does get the number of pages correctly, so it is able to open the PDF. If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
Author   hoenie-ams
Medium
medium.com › @nutanbhogendrasharma › extracting-text-from-pdf-file-in-python-using-pypdf2-5cefb66f1230
Extracting Text From PDF File in Python Using PyPDF2 | by Nutan | Medium
August 10, 2022 - PyPDF2 can retrieve text and metadata from PDFs as well. There are several ways to install PyPDF2. The most common option is to use pip. PyPDF2 requires Python 3.6+ to run. Using pip we can install PyPDF2: ... Install the PyPDF2 library in your system, if it is not installed.
Stack Overflow
stackoverflow.com › questions › 66465096 › pypdf2-no-extracting-text-properly
python - PyPDF2 no extracting text properly - Stack Overflow
My code can read in the PDF file but I can't extract the text with PyPDF2. It worked with other PDF files before. Why is the text appearing in this encoded form and how can I fix it? Code: with open(
GitHub
github.com › mstamy2 › PyPDF2 › issues › 543
pageObj.extractText() does not work · Issue #543 · py-pdf/pypdf
March 29, 2020 - No errors come up, simply prints pages number=19 from the document but the extractText() doesn't work ... It's a 1.6 version. Also, I tried with 1.4v with same result. Does PyPDF2 work only with a particular version? ... I had this strange behavior as well and I've seen a lot of people complaining about this. At first, I expected that it was a problem in my code, but when I tried with the provided pdf files in this repository, it worked like a charm. Here is my code:

    from PyPDF2 import PdfFileReader
    input_pdf = PdfFileReader(open("SF424_page2.pdf", "rb"))
    thispage = input_pdf.getPage(0)
    print(thispage.extractText())
Top answer
1 of 2
2

It may be worth trying the latest version of PyPDF2, latest as I write this is 1.24.

With that said, I have found the extractText() feature to be very fragile. It works on some documents, fails on others. See some open issues:

https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168

I worked around the problem by using the Poppler command-line utility pdftotext instead, both to classify a doc as image vs text and to get all the content. Has been extremely stable for me - I've run it on thousands of PDF documents. In my experience it also extracts text without further ado from protected/encrypted PDFs.

For example (written for Python 2):

# Lets Python 2 accept the print(..., file=...) call below
from __future__ import print_function

import subprocess
import sys

def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.

    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # don't forget that final hyphen to say, write to stdout!!
    cmd_args = [ "pdftotext", "-f", "1", "-l", "3", filename, "-" ]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    count = len(std_out)
    return count

HTH
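
For Python 3, the same idea can be sketched with `subprocess.run`. This is a hedged variant rather than the answer author's code: `build_pdftotext_command` and `pdftotext_char_count` are hypothetical helper names, the `-f`, `-l` and trailing `-` arguments are taken from the example above, and the function returns `None` when the `pdftotext` binary is not on the PATH or the call fails.

```python
import shutil
import subprocess


def build_pdftotext_command(filename, first_page=1, last_page=3):
    """Build the argument list; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), filename, "-"]


def pdftotext_char_count(filename, first_page=1, last_page=3):
    """Return the number of characters extracted, or None if extraction is not possible."""
    if shutil.which("pdftotext") is None:
        # Poppler's command-line tools are not installed or not on the PATH
        return None
    result = subprocess.run(
        build_pdftotext_command(filename, first_page, last_page),
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        # pdftotext reported an error, for example an unreadable file
        return None
    return len(result.stdout.decode("utf-8", errors="replace"))
```

A count of zero on the first few pages would then suggest an image-only document, which matches how the answer uses the tool to classify a file as image versus text.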

2 of 2
1

You are doing two things in one line. Try to break things down to get closer to the problem. Change:

page_Content = Pdf_File.getPage(pg_idx).extractText()

into

page = Pdf_File.getPage(pg_idx)
page_Content = page.extractText()

To see where the error happens. Also run the program from the command line not from Eclipse just to make sure it is the same error. You say it happens at extractText() but this line does not show up in the traceback.

Studytonight
studytonight.com › post › extract-text-from-pdf-in-python-pypdf2-module
Extract Text from PDF in Python - PyPDF2 Module - Studytonight
June 28, 2023 - No, PyPDF2 is primarily designed for extracting text from text-based PDFs. It may not work well with scanned or image-based PDFs that lack textual content.
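
Since an image-only PDF typically yields empty or whitespace-only strings rather than an error, a quick check on the extracted text can flag such files before any further processing. The sketch below is a heuristic under stated assumptions: `looks_image_only` is a hypothetical helper, `page_texts` is whatever list of per-page strings your extractor produced, and the `min_chars` threshold of 20 is an arbitrary illustrative value.

```python
def looks_image_only(page_texts, min_chars=20):
    """Guess whether a PDF lacks a usable text layer.

    page_texts: list of strings, one per page, from any extractor.
    min_chars: minimum number of non-whitespace characters expected
               from a document that really contains text (illustrative value).
    """
    # Count characters after removing all whitespace from each page
    visible = sum(len("".join(text.split())) for text in page_texts)
    return visible < min_chars


if __name__ == "__main__":
    print(looks_image_only(["", "  \n"]))
    print(looks_image_only(["UNITED STATES OF AMERICA Before the SECURITIES AND EXCHANGE COMMISSION"]))
```

A file flagged this way would need OCR or a different tool, since no text extractor can recover text that exists only as pixels.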