I just searched Google for how to convert a scanned document (a typescript) into a document whose characters are recognized as text, just like any other Word document. But of course I forgot that I am using Ubuntu and not Windows. So I am wondering whether it is still possible to do the same on Ubuntu. I would really appreciate any help.

Thank you.


Tesseract is one option that worked great for me!

I used it as follows:

Install it, if you don't already have it, with:

 sudo apt-get install tesseract-ocr  


  • Convert the scanned .JPG file to .tif (the format Tesseract required
    when I used it; newer versions can also read JPG and PNG directly).
    This is done with ImageMagick as follows:

    convert foo.JPG foo.tif

  • Now simply let Tesseract do its magic (the output is saved to foo.txt):

    tesseract foo.tif foo
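
The two steps above can be wrapped in a small shell function. This is only a sketch: the function name `ocr_image` and the `DRY_RUN` variable are made up for illustration, and only the `convert` and `tesseract` calls with the arguments shown above come from this answer.

```shell
#!/bin/bash
# ocr_image FILE: convert FILE to TIFF, then OCR it into a .txt next to it.
# ocr_image and DRY_RUN are hypothetical names used only for this sketch.
ocr_image() {
    local src="$1"
    local base="${src%.*}"          # strip the extension: foo.JPG -> foo

    if [ -z "$src" ]; then
        echo "usage: ocr_image FILE" >&2
        return 1
    fi

    if [ -n "$DRY_RUN" ]; then
        # Only print the commands that would be run.
        echo "convert $src $base.tif"
        echo "tesseract $base.tif $base"
        return 0
    fi

    convert "$src" "$base.tif" || return 1
    tesseract "$base.tif" "$base"   # writes $base.txt
}
```

After sourcing the function, `ocr_image foo.JPG` should leave foo.txt beside the scan, and `DRY_RUN=1 ocr_image foo.JPG` only prints the two commands so you can check the file names first.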

I recently had to convert an old manual with multiple (36) pages into something digital. I whipped up a Bash script to do it.

Code here:

    #!/bin/bash
    # makeDoc.sh
    # Turn a set of scanned JPG pages into a single document file.
    # Requires the ImageMagick and Tesseract packages.
    # Author: Fred Fury

    echo "makeDoc.sh"
    echo "Convert a set of scanned JPG pages into a single document file."
    echo "Starting up..."
    for i in {01..36}
    do
        echo "converting $i.JPG to $i.tif..."
        convert "$i.JPG" "$i.tif"              # Convert the file to a Tesseract-usable format
        tesseract "$i.tif" "$i" &>/dev/null    # Convert the tif to txt, discarding Tesseract's messages
    done
    echo "Merging files into Output.doc"
    cat *.txt > Output.doc                     # Merge all the generated txt files into a single file (plain text, despite the .doc name)
    echo "Done."
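
The script above is hard-coded for 36 pages and merges every .txt file in the directory. If your page count differs, a variant along the following lines might be easier to reuse. It is a sketch under assumptions: the function name `ocr_pages` and the default output name `Output.txt` are invented here, and it expects the scans to be named so that they sort in page order (for example 01.JPG, 02.JPG, ...).

```shell
#!/bin/bash
# ocr_pages [OUTFILE]: OCR every *.JPG in the current directory, in sorted
# glob order, and merge only the text produced here into OUTFILE.
# ocr_pages and the default name Output.txt are hypothetical.
ocr_pages() {
    local out="${1:-Output.txt}"
    local page base

    : > "$out"                              # start with an empty output file
    for page in *.JPG; do
        [ -e "$page" ] || continue          # the glob matched nothing
        base="${page%.JPG}"
        echo "converting $page..."
        convert "$page" "$base.tif" || return 1
        tesseract "$base.tif" "$base" > /dev/null 2>&1 || return 1
        cat "$base.txt" >> "$out"           # append this page's text
    done
    echo "Done: $out"
}
```

Run it from the directory that holds the scans, for example `ocr_pages manual.txt`. Names without zero padding (1.JPG, 2.JPG, 10.JPG) would be merged out of page order, so rename them first.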

Also check out this page for some other solutions: "What's the best, simplest OCR solution?" This is where I found Tesseract.

Hope that helps!


I had a similar problem to this a while ago. Try uploading the file to online-convert.com. It will take a while, but the webapp can handle just about any format. Good luck!

