Ubuntu: How to turn a pdf into a text searchable pdf?



Question:

I have a number of scanned documents in pdf and I want to be able to search them. How can I do that?

Essentially I have to OCR the pdf and then blend the extracted text back into a new pdf. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF).

  1. pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
  2. pdfsandwich (of which the software center says it is a poor package and I should not install it)
  3. OCRfeeder (in the software center) exports to odt nicely, but does not react when exporting to pdf.
  4. Gscan2pdf exports an all black (but searchable) image as reported in this discussion.
  5. I don't think Pdfxchange viewer can handle doing ocr on the fly on files over 500 pages.

Is there a software package I am unaware of? Or a script that does this?


Solution:1

Following the comment of Glutanimate I have found a working solution. It is the OCRmyPDF script.

git clone https://github.com/fritz-hh/OCRmyPDF
cd OCRmyPDF
sh ./OCRmyPDF.sh -h  # to see the usage

If you get a message saying you should install GNU parallel, you can do so (following https://askubuntu.com/a/298598/115155) with the commands below. The second line is optional and depends on your flavor and version:

sudo apt-get install parallel
sudo rm /etc/parallel/config

Finally you can OCR your pdf with the command:

sh ./OCRmyPDF.sh input.pdf output.pdf  # change input and output to the files you want  

If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:

pdftk A=input.pdf cat A1-5 output output.pdf  
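
To check that a test run really produced a text layer, one option is to extract the text from the result and look for any letters or digits. The following is only a minimal sketch: it assumes the pdftotext tool (from the poppler-utils package) is installed, and the function names contains_text and pdf_is_searchable are made up for illustration.

```shell
#!/bin/sh
# Succeeds if standard input contains at least one letter or digit.
contains_text() {
    grep -q '[[:alnum:]]'
}

# Succeeds if some text can be extracted from the PDF given as $1.
# Assumption: pdftotext (poppler-utils) is installed; "-" writes to stdout.
pdf_is_searchable() {
    pdftotext "$1" - 2>/dev/null | contains_text
}

# Example usage (the file name is a placeholder):
#   pdf_is_searchable output.pdf && echo "text layer found"
```

A scanned pdf without OCR normally yields no letters or digits here, so the check fails for it. It says nothing about how accurate the recognized text is.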


Solution:2

pdfsandwich performs exactly this job. I wasn't aware that there is a package provided in the software center, but I'm providing Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.

If you have a scanned file scanned_file.pdf, simply call

pdfsandwich scanned_file.pdf  

which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.

Compared to most existing solutions, it autodetects the tesseract version installed and adapts its behavior accordingly. In addition, it performs preprocessing of the scanned images prior to the OCR process, such as de-skewing or removal of dark edges etc., which can considerably improve optical character recognition.

DISCLAIMER: I'm the developer of pdfsandwich and therefore heavily biased.


Solution:3

@don.joey answered with the ocrmypdf script. However, it can now be installed directly from the repositories (from Ubuntu 16.10 onwards).

sudo apt install ocrmypdf  

Then you have to install the tesseract languages you need.

To list which languages are already in your system, type:

tesseract --list-langs  

If a language you need is missing, install it. For instance, for Spanish:

sudo apt install tesseract-ocr-spa  
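
If you want a script to test for a language before starting a long OCR run, the listing can be searched for the language code. This is a rough sketch, not part of ocrmypdf or tesseract: the function names lang_in_list and tesseract_has_lang are hypothetical, and stderr is merged into stdout on the assumption that some tesseract versions print the list there.

```shell
#!/bin/sh
# Succeeds if the language code $1 appears as a whole line on standard input.
lang_in_list() {
    grep -qxF -- "$1"
}

# Succeeds if tesseract lists language $1 as installed.
# 2>&1 is there in case the list is printed to stderr.
tesseract_has_lang() {
    tesseract --list-langs 2>&1 | lang_in_list "$1"
}

# Example usage:
#   tesseract_has_lang spa || echo "install tesseract-ocr-spa first"
```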

Now you can produce a searchable PDF (whose quality will vary, depending on the scanned document) with the following command:

ocrmypdf -l 'spa' old.pdf new.pdf  

You can, of course, check its man page for some additional options.
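
Since the question mentions a number of scanned documents, the same command can be wrapped in a loop. The sketch below is one possible way to do it, under a few assumptions: the _ocr.pdf suffix for results and the function names ocr_output_name and ocr_directory are chosen here for illustration, and only the -l option shown above is used.

```shell
#!/bin/sh
# Build the output name: scans/report.pdf -> scans/report_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF in directory $1 with tesseract language $2 (e.g. spa).
ocr_directory() {
    dir=$1
    lang=$2
    for src in "$dir"/*.pdf; do
        [ -e "$src" ] || continue                  # no PDFs in the directory
        case $src in *_ocr.pdf) continue ;; esac   # skip earlier results
        dst=$(ocr_output_name "$src")
        [ -e "$dst" ] && continue                  # already processed
        ocrmypdf -l "$lang" "$src" "$dst" || echo "failed: $src" >&2
    done
}

# Example usage (the directory name is a placeholder):
#   ocr_directory ~/scans spa
```

Writing to a new file keeps the original scans untouched, so a bad OCR run can simply be deleted and repeated.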


Solution:4

OCRfeeder has a bug in

/usr/lib/python2.7/dist-packages/reportlab/pdfgen/textobject.py  

line 436 should read:

            lines = asUnicode(stuff).strip().split('\n')
            # bug here, was:
            # lines = '\n'.split(asUnicode(stuff).strip())

I changed this and it worked for me.

