Fixing up Scanned PDFs

Contents

  1. Requirements
  2. Steps
    1. Convert Scanned PDF(s) to Images
    2. Make It Pretty
    3. Recognize Text

1. Requirements

Install the following tools:

imagemagick
pdfsandwich
scantailor

You probably already have imagemagick.

Say you have a scanned PDF file scan.pdf. Move it to a new directory and go to that directory.

2. Steps

Assume we have one or more PDFs in a directory /path/to/scan/.

2.1. Convert Scanned PDF(s) to Images

First we must get some images from the scanned PDF.

Option 1: pdfimages. Try running:

ls -U | grep -a -e '\.[Pp][Dd][Ff]$' | \
while read -r f
do
  g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
  pdfimages "$f" "$g"
done

The output images are in .pbm or .ppm format, depending on if they have color or not. Take a look at these images, if they are not 1 image per page, then you'll have to do option 2.

The tool we use in the next step does not recognize Netpbm formats, so convert the images to .png files.

ls -U | grep -a -e '\.p[bp]m$' | \
while read -r f
do
  g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
  pnmtopng "$f" > "$g".png
done

Option 2: imagemagick. This option takes a longer time and will not correlate exactly with the original scan.

ls -U | grep -a -e '\.[Pp][Dd][Ff]$' | \
while read -r f
do
  g=$(printf '%s' "$f" | sed -e 's/\.[^.]*$//')
  convert -density 400 "$f" "$g".png
done

2.2. Make It Pretty

Run scantailor and create a new project in the directory with all of the .png files. Let the output directory be out/. This tool lets you reorient the pages, split them, make text darker, etc. It is very helpful for file size and OCR.

After fixing up the formatting and exporting, stitch together the output images into all.pdf.

{ ls -v out/*.tif ; echo all.tif ; } | xargs -d '\n' tiffcp
tiff2pdf -o all.pdf -u m -p "letter" -F all.tif

You can also specify a title and author for the PDF using the -t and -a flags respectively.

2.3. Recognize Text

Here we use the pdfsandwich tool to do OCR. This runs tesseract in the background using multiple processes if available. Just run the following command to make a (hopefully) text-searchable final.pdf.

pdfsandwich -o final.pdf all.pdf