Wednesday, September 12, 2012

Using mcommand to convert PDF to DJVU

In my last post, I mentioned my new (currently IronRuby-based) tool, mcommand. I have since worked out a safe way to convert PDFs to DJVU format for archival purposes. My approach: first rewrite ("wash") each PDF as a known-good file (per my previous printing issues), then convert the washed file.

Setup


1. Set up mcommand
2. Install the latest version of Ghostscript + WinDJView. I recommend putting Ghostscript in C:\misc\gs
3. Download pdf2djvu + extract it to its own folder, such as C:\misc\pdf2djvu

Procedure


1. Run mcommand with the following setup...

Command line: c:\misc\gs\bin\gswin64c.exe
Arguments: -dBATCH -dNOPAUSE -dOptimize=true -dCompatibilityLevel=1.6 -dDownsample=false -dEmbedAllFonts=true -sDEVICE=pdfwrite -sOutputFile=!N! !S!
Source Directory: (wherever your PDFs are)
Target Directory: (new folder to save PDFs to)
Source Ext & New Ext: .pdf
Program Directory: c:\misc\gs\bin
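
For anyone without mcommand handy, the batch run above can be roughly approximated in Python. This is only a sketch, not how mcommand itself works: it assumes that !S! stands for the source file and !N! for the new file (as the setup fields suggest), and the function names expand_template, build_jobs and run_jobs are my own hypothetical helpers.

```python
from pathlib import Path
import subprocess

def expand_template(args_template, source, new):
    """Split the argument template, then fill in !S! (source) and !N! (new file).

    Splitting first keeps a path that contains spaces inside a single argument.
    """
    return [tok.replace("!S!", str(source)).replace("!N!", str(new))
            for tok in args_template.split()]

def build_jobs(program, args_template, source_dir, target_dir,
               source_ext=".pdf", new_ext=".pdf"):
    """Build one command line per source file; nothing is executed here."""
    jobs = []
    for source in sorted(Path(source_dir).glob("*" + source_ext)):
        new = Path(target_dir) / (source.stem + new_ext)
        jobs.append([program] + expand_template(args_template, source, new))
    return jobs

def run_jobs(jobs, program_dir):
    """Run the commands one at a time from the program directory."""
    for cmd in jobs:
        subprocess.run(cmd, cwd=program_dir, check=False)

# Example (the source and target folders are placeholders):
# jobs = build_jobs(r"c:\misc\gs\bin\gswin64c.exe",
#                   "-dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=!N! !S!",
#                   r"C:\scans", r"C:\scans-washed")
# run_jobs(jobs, r"c:\misc\gs\bin")
```

Printing the list from build_jobs before calling run_jobs is a cheap way to check the substitutions before anything gets written.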

Update3: PDFs with CCITT / fax source data need a different washing method. A few days ago, I compiled a win32 build of the poppler tools. The same pdftocairo program I use on Linux for fixing PDF printouts in CUPS can also convert a CCITT PDF to a greyscale PDF! Caveats: on Windows, pdftocairo messes with fonts, and it seems to produce invalid PDFs from larger color documents.

Command line: C:\misc\poppler\pdftocairo.exe
Arguments: -pdf !S! !N!
Source Directory: (wherever your PDFs are)
Target Directory: (new folder to save PDFs to)
Source Ext & New Ext: .pdf
Program Directory: C:\misc\poppler

2. Look over the new PDF files and make sure they appear to be correct. If they do, you can overwrite the old PDFs (usual warning about keeping a backup handy). Watch for files of zero or very small size: they could be invalid.
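
The size check can be partly automated. The sketch below flags washed PDFs that are missing, tiny, or far smaller than their originals. The thresholds (1 KB and 5% of the original size) are arbitrary assumptions of mine, and find_suspect_pdfs is a hypothetical helper name; a file that passes can still be broken, so a visual check is still worthwhile.

```python
from pathlib import Path

def find_suspect_pdfs(source_dir, target_dir, min_bytes=1024, min_ratio=0.05):
    """Return (file name, reason) pairs for washed PDFs that look suspicious."""
    suspects = []
    for source in sorted(Path(source_dir).glob("*.pdf")):
        washed = Path(target_dir) / source.name
        if not washed.exists():
            suspects.append((source.name, "missing"))
            continue
        size = washed.stat().st_size
        if size < min_bytes:
            # Zero-byte or near-empty output usually means the rewrite failed.
            suspects.append((source.name, "tiny"))
        elif size < source.stat().st_size * min_ratio:
            suspects.append((source.name, "much smaller than original"))
    return suspects
```

Run it against the source and target folders before overwriting anything, and open each flagged file by hand.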

3. Run mcommand (you can leave the other instance running if you want) with the following setup. You can change or omit the "-j" parameter, depending on how many CPU cores you want the converter program to use. There is also an "--anti-alias" parameter for text: in my own testing it made the output files bigger, with only a slight improvement to the text.

Update: the "-j" parameter may be causing hiccups on large batches. If the process gets stuck, check your Task Manager, and omit the parameter on future runs.

Update2: the problem seems to be more with files that have CCITT (fax-machine) graphics in them. If you're converting PDFs that come from an MFC or fax machine, check the Task Manager for hung djvused.exe processes.


Command line: c:\misc\pdf2djvu\pdf2djvu.exe
Arguments: --verbatim-metadata -j2 -o !N! !S!
Source Directory: (wherever your PDFs are)
Target Directory: (folder to save DJVUs to)
Source Ext: .pdf
New Ext: .djvu
Program Directory: c:\misc\pdf2djvu
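
Since the converter can get stuck on some files, a scripted run might put a time limit on each conversion. The following is a minimal sketch using Python's subprocess module. The 600-second default is an arbitrary assumption, and run_with_timeout is a hypothetical helper of mine. Note that the timeout only kills the process that was started directly; any helper processes it spawned may still need to be ended from Task Manager.

```python
import subprocess

def run_with_timeout(cmd, timeout_seconds=600):
    """Run one conversion command; return 'ok', 'failed', or 'timed out'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this.
        return "timed out"
    return "ok" if result.returncode == 0 else "failed"

# Example (file names are placeholders; "-j" is omitted, as suggested above
# for batches that hang):
# status = run_with_timeout([r"c:\misc\pdf2djvu\pdf2djvu.exe",
#                            "--verbatim-metadata", "-o", "out.djvu", "in.pdf"])
```

Files that come back as "timed out" are good candidates for the pdftocairo wash before another attempt.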

Example screenshots of the process: original; washed; converted

