Look at the newest version of mudraw, a command-line tool from the MuPDF family:
mudraw -o out.html -F html in.pdf
Use the newest version if possible. It has gained quite a few features and can do more than just PDF->HTML conversion:
$ mudraw
Usage: mudraw [options] file [pages]
-p - password
-o - output file name (%d for page number)
-F - output format (default inferred from output file name)
    raster: png, tga, pnm, pam, pbm, pwg, pcl
    vector: svg, pdf, trace
    text: txt, html, stext
-s - show extra information:
    m - show memory use
    t - show timings
    f - show page features
    5 - show md5 checksum of rendered image
-R - rotate clockwise (default: 0 degrees)
-r - resolution in dpi (default: 72)
-w - width (in pixels) (maximum width if -r is specified)
-h - height (in pixels) (maximum height if -r is specified)
-f - fit width and/or height exactly; ignore original aspect ratio
-B - maximum bandheight (pgm, ppm, pam, png output only)
-W - page width for EPUB layout
-H - page height for EPUB layout
-S - font size for EPUB layout
-c - colorspace (mono, gray, grayalpha, rgb, rgba, cmyk, cmykalpha)
-G - apply gamma correction
-I invert colors
-A - number of bits of antialiasing (0 to 8)
-D disable use of display list
-i ignore errors
pages comma separated list of page numbers and ranges
Update (April 2016)
The calling convention of the tool has been changed. It is still part of the MuPDF family, but you run it like this now:
mutool draw
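If you want to call the newer form from a script, here is a minimal sketch. It assumes `mutool` is on your PATH and that `mutool draw` accepts the same `-o` and `-F` options shown in the usage text above. The helper names `mutool_draw_cmd` and `convert` and the file names are made up for illustration; they are not part of MuPDF.

```python
import shutil
import subprocess


def mutool_draw_cmd(pdf_path, out_path, fmt="html", pages=None):
    """Build the argv for `mutool draw` (hypothetical helper, not a MuPDF API)."""
    cmd = ["mutool", "draw", "-o", out_path, "-F", fmt, pdf_path]
    if pages:
        # Optional page list such as "1-3,7", as described in the usage text.
        cmd.append(pages)
    return cmd


def convert(pdf_path, out_path, fmt="html", pages=None):
    """Run the conversion; raises if mutool is missing or exits non-zero."""
    if shutil.which("mutool") is None:
        raise FileNotFoundError("mutool not found on PATH")
    subprocess.run(mutool_draw_cmd(pdf_path, out_path, fmt, pages), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert() to actually run it.
    print(" ".join(mutool_draw_cmd("in.pdf", "out.html")))
```

Building the argument list separately from running it makes it easy to check what would be executed before touching any files.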
Answer from Kurt Pfeifle on Stack Exchange
pdf2htmlEX accurately converts PDFs to HTML and retains the formatting. However, the generated HTML code is hard to read and parse programmatically. It is free, open source, and works offline on a variety of platforms.
https://github.com/coolwanglu/pdf2htmlEX
https://github.com/coolwanglu/pdf2htmlEX/wiki/Download
If you're on Linux, try pdftohtml:
sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
On MacOS (with homebrew) pdftohtml can be installed with:
brew install pdftohtml
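If you need the resulting HTML inside a script rather than on disk, something along these lines might work. It is only a sketch: it reuses the flags from the command above, assumes `pdftohtml` is installed, and assumes that with `-noframes` the tool writes exactly the output path it is given. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftohtml_cmd(pdf_path, html_path):
    """Build the pdftohtml argv using the flags from the command above."""
    return ["pdftohtml", "-enc", "UTF-8", "-noframes", str(pdf_path), str(html_path)]


def pdf_to_html_text(pdf_path, workdir):
    """Convert one PDF and return the generated HTML as a string.

    Assumption: pdftohtml writes its output to exactly `out`.
    """
    out = Path(workdir) / (Path(pdf_path).stem + ".html")
    subprocess.run(pdftohtml_cmd(pdf_path, out), check=True)
    return out.read_text(encoding="utf-8")
```

For example, `pdf_to_html_text("infile.pdf", "/tmp")` would return the contents of `/tmp/infile.html` under the assumptions above.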
The open source ebook converter Calibre can also convert PDF files to HTML and is available on MacOS, Windows and Linux.
As I mentioned in the comment above, it is definitely possible to convert PDF to HTML using the tool Able2Extract7.
I have been using this tool for almost 2 years now and I am pretty happy with it. It lets you convert PDF to Word, Excel, PowerPoint, Publisher, HTML, OpenOffice, etc.
Important note: this tool is not freeware.
HTH
I am currently building a script for a client to convert PDF into HTML that fits their CMS. We also do this conversion for DOCX (to HTML) and that goes well. But PDF is a whole other format: less of a document format and more of a layout format. What would be the most efficient way to convert PDF to HTML in a way that strips the layout as much as possible, but keeps, for example, all markup (such as bold/italic text), images, etc.?
Does anyone have experience with this kind of conversion? I can use recommendations and advice!
Things I have tried:
- pdf2htmlEX: Very elegant for normal conversions for users in the browser, but it is so elegant that it keeps the layout, strips tags and replaces them with styling (CSS), and converts tables to background images; not something useful for me.
- pdftohtml: Not the prettiest output, disregards tables, and puts a lot of <br/> tags into the HTML.
Things I still want to try:
- Parsr: [EDIT] After some experience with Parsr, it might be exactly what I am looking for. And it seems it's capable of Markdown conversion (haven't seen it working yet), which would mean I can easily convert to HTML. [EDIT 2] This tool is performing amazingly well!
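For the pile of <br/> tags that pdftohtml leaves behind, a small post-processing step can get surprisingly far. The following is a rough sketch, not a real HTML parser: it assumes you feed it a body fragment, treats two or more consecutive <br/> tags as a paragraph break, and treats a single <br/> as a soft line wrap. Inline tags such as <b> and <i> pass through untouched. The function name is made up.

```python
import re

# Two or more consecutive <br> tags: treated as a paragraph boundary.
_BR_RUN = re.compile(r"(?:\s*<br\s*/?>\s*){2,}", re.IGNORECASE)
# A single <br> tag: treated as a soft line wrap.
_BR_ONE = re.compile(r"\s*<br\s*/?>\s*", re.IGNORECASE)


def br_to_paragraphs(fragment):
    """Wrap text blocks in <p> and replace lone <br/> tags with spaces."""
    paragraphs = []
    for block in _BR_RUN.split(fragment):
        text = _BR_ONE.sub(" ", block).strip()
        if text:
            paragraphs.append("<p>" + text + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(br_to_paragraphs("Hello<br/>world<br/><br/><b>Bold</b> text<br/>"))
```

Whether a single <br/> really is a soft wrap depends on the PDF, so this heuristic will need tuning per document type.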
You're unlikely to find a single offering that does all this, especially in the open source world. It's more likely that you'll end up relying on a mishmash of things, and may even need to chain some converters in order to get to HTML (e.g. PDF -> PS -> HTML).
OpenOffice supports conversion to HTML, and can be called from the command line.
http://pdftohtml.sourceforge.net/ looks reasonably good at converting pdf to html.
For Word documents in WordML or OpenXML format, it's conceivable that you could use XSLT transforms, since both input and output formats are XML. I've seen some stylesheets floating around the net that do this, but YMMV.
Incidentally, why is there a specific requirement for open source? MS PowerPoint already supports save-as-HTML, for example.
Open Office will convert pdf to html but you'll take a hit to design quality.
I suggest either Crocodoc, a paid service (it offers API flavours for different platforms such as Python, Ruby, Java and PHP, so developers can work with its APIs), or waiting for an official Adobe tool (it's in the works).
In my organization we get a couple of PDF magazines every week. Some of the magazines are only 10-20 pages and 10 MB in size. Others are several hundred pages and up to 250 MB in size.
Most people want to view them on their iPad, and until now we have just posted the PDFs to an Apache host where indexing is allowed.
It works, but the magazine is not shown in the browser before the entire PDF is downloaded. That can take quite a long time.
The ideal solution would be something like Issuu, but one that could be hosted locally. If I could just upload the PDFs to a folder via FTP and let the software convert the files automatically, it would be perfect.
Does such a fantastic piece of software exist?
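The "upload to a folder and convert automatically" part can at least be sketched with the MuPDF tool mentioned earlier. This is a hypothetical outline, not an existing product: it assumes `mutool` is installed, renders each new PDF to one PNG per page using the `-r` and `-o` (with `%d`) options from the usage text, and is meant to be run periodically, for example from cron. All function and folder names are invented for illustration.

```python
import subprocess
from pathlib import Path


def pending_pdfs(inbox, outbox):
    """Return PDFs in `inbox` that have no output folder in `outbox` yet."""
    inbox, outbox = Path(inbox), Path(outbox)
    return [p for p in sorted(inbox.glob("*.pdf")) if not (outbox / p.stem).is_dir()]


def page_render_cmd(pdf, outbox, dpi=100):
    """Build a `mutool draw` argv that writes one PNG per page (%d = page number)."""
    target = Path(outbox) / Path(pdf).stem / "page-%d.png"
    return ["mutool", "draw", "-r", str(dpi), "-o", str(target), str(pdf)]


def convert_new(inbox, outbox):
    """Render every PDF that has not been processed yet.

    Note: the output folder is created before rendering, so a PDF whose
    conversion fails is skipped on later runs until its folder is removed.
    """
    for pdf in pending_pdfs(inbox, outbox):
        (Path(outbox) / pdf.stem).mkdir(parents=True)
        subprocess.run(page_render_cmd(pdf, outbox), check=True)
```

Serving per-page images means a reader only downloads the pages actually viewed, which is the main complaint about the 250 MB PDFs. A viewer page on top of the images is still needed and is not covered here.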
I want to convert an HTML file to PDF, and my app is WPF.
I have found the IronPdf and Syncfusion HTML-to-PDF conversion libraries, but they are far too costly for my free project.
We are moving systems, and to do that I need the HTML of an invoice/quote. What would be the best way to convert a PDF into HTML code?
PS: I did ask ChatGPT, but it sucks at this.
edit: I am a Python developer and I don't know anything about HTML