I found the solution to this problem. If the PDF comes from an external source, it is sometimes already protected or encrypted.
If you get blank output after loading a PDF from an external source and adding protections, you are probably working with an encrypted document. I have a stream-processing system that works on PDF documents, and the following code works for me. If you are only working with PDF inputs, you could integrate the code below into your flow.
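import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;

public InputStream convertDocument(InputStream dataStream) throws Exception {
    // Just acts as a pass-through, since the input is already in PDF format.
    // The result is buffered in memory: writing to a PipedOutputStream and reading the
    // paired PipedInputStream on the same thread can deadlock once the pipe buffer fills.
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "2024768"); // for large files
    PDDocument doc = PDDocument.load(dataStream, true);
    try {
        if (doc.isEncrypted()) { // remove the existing security before adding protections
            doc.decrypt("");
            doc.setAllSecurityToBeRemoved(true);
        }
        doc.save(os);
    } finally {
        doc.close();
        dataStream.close();
    }
    return new ByteArrayInputStream(os.toByteArray());
}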
Now take the returned InputStream and use it in your security step:
// In-memory buffer; a piped stream pair used on a single thread can deadlock.
ByteArrayOutputStream os = new ByteArrayOutputStream();
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "2024768");
InputStream dataStream = secureData.data(); // the decrypted PDF stream from the previous step
PDDocument doc = PDDocument.load(dataStream, true);
AccessPermission ap = new AccessPermission();
// add whatever permissions you need
ap.setCanModify(false);
ap.setCanExtractContent(false);
ap.setCanPrint(false);
ap.setCanPrintDegraded(false);
ap.setReadOnly();
// random owner password, empty user password
StandardProtectionPolicy spp = new StandardProtectionPolicy(UUID.randomUUID().toString(), "", ap);
doc.protect(spp);
doc.save(os);
doc.close();
dataStream.close();
byte[] protectedPdf = os.toByteArray(); // the protected document
This should now return a proper document with no blank output!
The trick is to remove the encryption first!
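A note on the piped streams: a PipedOutputStream/PipedInputStream pair is meant to be written and read on different threads, because the writer blocks as soon as the pipe's small internal buffer is full. The following is a minimal sketch, using only the JDK, of handing an OutputStream-based result back as an InputStream. The class name PipedHandoffSketch and the StreamWriter callback are hypothetical names made up for this illustration, not part of PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

final class PipedHandoffSketch {

    /** Hypothetical callback: any code that writes its result to an OutputStream. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Starts the writer on its own thread and returns the reading end of the pipe. */
    static InputStream asInputStream(StreamWriter writer) throws IOException {
        PipedOutputStream os = new PipedOutputStream();
        PipedInputStream is = new PipedInputStream(os);
        Thread producer = new Thread(() -> {
            // try-with-resources closes the pipe so the reader sees end-of-stream
            try (OutputStream out = os) {
                writer.writeTo(out);
            } catch (IOException e) {
                // A real system would pass this failure on to the reader; the sketch only logs it.
                e.printStackTrace();
            }
        }, "pipe-writer");
        producer.setDaemon(true);
        producer.start();
        return is;
    }
}
```

In a PDF flow the callback would wrap the call that saves the document to the stream. If the documents comfortably fit in memory, a ByteArrayOutputStream is the simpler choice.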
Answer from NightWolf on Stack Overflow
java - Protecting PDF using PDFBox - Stack Overflow
java - How to check pdf file is password protected? - Stack Overflow
Java script embedded in pdf file
java - How to password protect a existing pdf file using java8 & iText? - Stack Overflow
Update
As per mkl's comment below this answer, the PDF specification permits two types of cross-reference structure: (1) cross-reference tables and (2) cross-reference streams. The following solution only addresses the first type. This answer needs to be updated to address the second type.
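To tell which of the two structures a given file uses, one option is to follow the last "startxref" entry: its value is the byte offset of either the "xref" keyword (a cross-reference table) or an object header such as "12 0 obj" (a cross-reference stream). Below is a minimal sketch of that check in plain Java. The class name XrefKindSketch and its Kind enum are hypothetical, and the sketch ignores hybrid files and damaged offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XrefKindSketch {

    enum Kind { TABLE, STREAM, UNKNOWN }

    // "startxref" followed by the decimal byte offset of the last cross-reference section
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d{1,18})");
    // An indirect object header, e.g. "12 0 obj"
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj");

    static Kind detect(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String content = new String(pdfBytes, StandardCharsets.ISO_8859_1);

        int keyword = content.lastIndexOf("startxref");
        if (keyword < 0) {
            return Kind.UNKNOWN;
        }
        Matcher m = START_XREF.matcher(content);
        if (!m.find(keyword) || m.start() != keyword) {
            return Kind.UNKNOWN;
        }
        long offset = Long.parseLong(m.group(1));
        if (offset >= content.length()) {
            return Kind.UNKNOWN;
        }
        int pos = (int) offset;

        if (content.startsWith("xref", pos)) {
            return Kind.TABLE;
        }
        Matcher obj = OBJ_HEADER.matcher(content).region(pos, content.length());
        return obj.lookingAt() ? Kind.STREAM : Kind.UNKNOWN;
    }
}
```

For a file that reports STREAM there is no "trailer" keyword to search for; the trailer entries, including "/Encrypt", live in the dictionary of the cross-reference stream object instead.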
====
All of the answers above rely on third-party libraries, which the OP already knows about. The OP is asking for a native Java approach. My answer is: yes, you can do it, but it will require a lot of work.
It is a two-step process:
Step 1: Figure out if the PDF is encrypted
As per Adobe's PDF 1.7 specs (pages 97 and 115), if the trailer dictionary contains the key "/Encrypt", the PDF is encrypted (the encryption could be simple password protection, RC4, AES, or some custom encryption). Here's some sample code:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

boolean isEncrypted = false;
try {
    byte[] byteArray = Files.readAllBytes(Paths.get("Resources/1.pdf"));
    // Decode the binary data as ISO-8859-1 so that every byte maps to exactly one char.
    // We are only interested in the textual parts of the PDF (the trailer), so this is fine.
    String pdfContent = new String(byteArray, StandardCharsets.ISO_8859_1);
    int lastTrailerIndex = pdfContent.lastIndexOf("trailer");
    if (lastTrailerIndex >= 0) {
        String tail = pdfContent.substring(lastTrailerIndex);
        int firstEOFIndex = tail.indexOf("%%EOF");
        // If no %%EOF marker follows the trailer, fall back to the rest of the file.
        String trailer = firstEOFIndex >= 0 ? tail.substring(0, firstEOFIndex) : tail;
        if (trailer.contains("/Encrypt")) {
            isEncrypted = true;
        }
    }
} catch (IOException e) {
    // The file could not be read; report it and leave isEncrypted as false.
    System.out.println(e);
}
Step 2: Figure out the encryption type
This step is more complex. I don't have a code sample yet, but here is the algorithm:
- Read the value of the key "/Encrypt" from the trailer that was read in Step 1 above, e.g. 288 0 R.
- Look for the bytes "288 0 obj". This marks the start of the "encryption dictionary" object in the document. The object ends at the string "endobj".
- Look for the key "/Filter" in this object. Its value identifies the document's security handler. If the value of "/Filter" is "/Standard", the document uses the built-in password-based security handler.
If you just want to know whether the PDF is encrypted, without worrying about whether the encryption takes the form of an owner/user password or some more advanced algorithm, you don't need Step 2 above.
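The Step 2 algorithm could be sketched roughly as follows in plain Java. This is an illustration under assumptions, not a tested parser: the class and method names are hypothetical, the trailer and file content are assumed to be decoded as text the way Step 1 does it, "/Encrypt" is assumed to be an indirect reference, and the file is assumed to use a classic cross-reference table.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncryptFilterSketch {

    // "/Encrypt 288 0 R": an indirect reference to the encryption dictionary
    private static final Pattern ENCRYPT_REF =
            Pattern.compile("/Encrypt\\s+(\\d+)\\s+(\\d+)\\s+R");
    // "/Filter /Standard": the name that follows the /Filter key
    private static final Pattern FILTER =
            Pattern.compile("/Filter\\s*/([^\\s/<>\\[\\]()]+)");

    /** Returns the security handler name (e.g. "Standard"), or empty if none is found. */
    static Optional<String> findSecurityHandler(String trailer, String pdfContent) {
        Matcher ref = ENCRYPT_REF.matcher(trailer);
        if (!ref.find()) {
            return Optional.empty();
        }
        // Object header such as "288 0 obj"; the lookbehind keeps "1288 0 obj" from matching.
        Pattern header = Pattern.compile(
                "(?<!\\d)" + ref.group(1) + "\\s+" + ref.group(2) + "\\s+obj");
        Matcher obj = header.matcher(pdfContent);
        int bodyStart = -1;
        while (obj.find()) {
            bodyStart = obj.end(); // keep the last definition (incremental updates append)
        }
        if (bodyStart < 0) {
            return Optional.empty();
        }
        int bodyEnd = pdfContent.indexOf("endobj", bodyStart);
        if (bodyEnd < 0) {
            return Optional.empty();
        }
        Matcher filter = FILTER.matcher(pdfContent.substring(bodyStart, bodyEnd));
        return filter.find() ? Optional.of(filter.group(1)) : Optional.empty();
    }
}
```

With the trailer and content strings from Step 1, a call such as EncryptFilterSketch.findSecurityHandler(trailer, pdfContent) would yield "Standard" for a document that uses the built-in password-based handler.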
Hope this helps.
You can use PDFBox:
http://pdfbox.apache.org/
Code example:
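PDDocument document = null;
try
{
    document = PDDocument.load( yourPDFfile );
    if( document.isEncrypted() )
    {
        // The document is encrypted.
    }
}
finally
{
    if( document != null )
    {
        document.close(); // always release the document
    }
}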
Using Maven?
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.0</version>
</dependency>
So I downloaded this PDF file. I checked it with Kaspersky, which detected no threat, and I also checked it with VirusTotal, where no threat was detected either. However, when I ran it through CAPE Sandbox, it showed that the PDF triggered 1 low IDS rule. Is this PDF considered dangerous? Thanks in advance.