Score:9

Get printer-ready black text on white background in scanned pdf files (remove grayscale or color background)

br flag

The question "How can I turn photos of paper documents into a scanned document?" is related but not the same, as I'm talking about pdf files. The image processing described in the answers there seems complicated, especially because each image has to be processed separately. Given that my pdf has hundreds of pages, the solution I expect is not editing individual images but simply "scanning" digital photos and documents the way real ones are scanned. I mean something like a "virtual scanner" whose input would be a photo-based pdf or a collection of photos and whose output would be a "normal" scanned document. (Also, the Scantailor tool recommended there, and also here, seems to lack a Linux version now.)


This is not about OCR and not about converting image to text.

To clarify what I mean I will post a few examples.

There are pdf files based on text, not image, and they are text files (let's say docx or odt) exported to pdf. They look ready to be printed:

enter image description here

The above is not what I discuss here.

What I'm interested in are the pdfs in the images below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first are formed of images that look like pictures taken of book pages:

enter image description here

or

enter image description here

Such copies can hardly be re-printed on paper, as the background will be printed too.

The second ones are what one would expect from scanned text, and can be printed:

enter image description here

or

enter image description here

The picture-like pdf may already be OCR-processed and its text searchable, and still look like a collection of (page) photos: OCR is not the problem here.

What I want is the clear black-on-white look of the "scanned" pdf and the removal of all the "real" details (especially shadows) that are normal in a photo but should be absent in a printed page.


As @vanadium noticed in a comment, I am looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.

As @user535733 said in a comment, the problem here seems to be, at least to some extent, that of converting the greyscale (scanned/image) text to black-and-white.

pLumo avatar
in flag
First of all, I don't think your question is related to Ubuntu. Fixing the images is fairly easy, as shown in the linked question, but you want to automate the task somehow, so your question is more of a programming task that would fit better on other sites in the network. You also don't mention anything you tried, nor any idea of how to start. And I don't see why a JPEG-based PDF is any different from a picture, so picture processing is the correct answer, although not done manually.
br flag
@pLumo - I am looking for a Ubuntu tool to scan digital documents the way a real scanner does it for real documents.
Thomas Weller avatar
ru flag
IMHO you want the impossible: you don't want image processing, but that's exactly what the virtual scanner must do. Linux and Ubuntu make it easy to run a utility on a directory full of files. Contrast and brightness change are typically sufficient.
Thomas Weller avatar
ru flag
BTW: the rather grey example is not a photograph: it is scanned. Where the book folds, the light reflects and, due to the angle of reflection, the page becomes brighter towards the inside, although that area is further away from the light source and one would expect it to be darker. This would not happen in a photograph. One solution is to scan only one page at a time instead of two pages. People build special scanners to support this, so that books need not be laid flat.
br flag
@ThomasWeller - I don't mean I refuse any image processing, just the manual adjusting the way GIMP is recommended. Practically I'm looking for something like simple-scan or skanlite but that would input digital documents instead of real ones from a real scanner. (Also the fact that in that document pages are in fact scanned and just *look* like photos is not the issue: I need to make them look more bare/simple scanned text.)
vanadium avatar
cn flag
@ThomasWeller not quite the impossible. It is what is available on smartphones. OP is just looking for a software solution that automatically cleans up pictures of a document, much like Google Scan on a smartphone.
user535733 avatar
cn flag
Looks like you want to take a *greyscale image* that contains multiple pages and *process that image* , separating the individual pages, straightening the pages, converting the greyscale to black-and-white, and otherwise enhancing readability. None of this requires the confusing term "scan" in the title or body of the question.
vn flag
Does this answer your question? [How can I remove the gray-scale page background of a PDF document scan while preserving the text? (Binarization)](https://askubuntu.com/questions/396437/how-can-i-remove-the-gray-scale-page-background-of-a-pdf-document-scan-while-pre)
karel avatar
sa flag
@PabloBianchi I voted to leave this question open and close voted your linked duplicate question as a duplicate of this question because this question's answers are more up to date.
karel avatar
sa flag
@cipricus Please close vote [this question](https://askubuntu.com/q/396437/) as a duplicate of your more up-to-date question.
vn flag
@karel Are you sure the answers here are more up-to-date? I had the opposite impression, also with lower quality..
Score:10
in flag

scantailor is not maintained anymore but you can still build it from source and use it.

However, the original repository needs qt4, which is not easily installable in recent Ubuntu versions. You can use e.g. this fork that has adapted to qt5.

Prerequisites:

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:

git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

Disclaimer: I don't know the maintainer of this fork, and cannot say anything about the safety of his version.


Another option would be to use Scantailor advanced. You can install it via snap ...

sudo snap install scantailor-advanced

... or flatpak.

... or via ppa.

sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced

Quick test:

enter image description here

br flag
I have found a solution that works directly on pdf files and posted it along my "complementary" answer.
Score:2
br flag

As a direct solution on PDF (no manual image extraction):

While using ocrmypdf to restore OCR (as mentioned at the end of the complementary part of this answer), I noticed that ocrmypdf -h shows an option that sounds like exactly what is asked:

--remove-background Attempt to remove background from gray or color pages, setting it to white

The initial pdf already had OCR, which gives an error unless one of the following options is used:

-f, --force-ocr Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)

or

-s, --skip-text Skip OCR on any pages that already contain text, but include the page in final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages

Applying each of them separately to one of my large files, with hundreds of pages that already had OCR, crashed the process.

The best solution seems to me to first print the initial file to pdf (which removes OCR), and then run

ocrmypdf input.pdf output.pdf -l <LANG> --remove-background -v

For English, the -l option is not needed. -v is for verbose details in terminal.

The resulting pdf is larger than the input (because of the --remove-background option): reduce the size as said below.
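For more than one file, the command above can be wrapped in a loop. The following is only a minimal sketch under assumptions that are not part of the answer: the function names, the "_clean" output suffix and the DRY_RUN switch are made up for illustration. Depending on the installed ocrmypdf version, --remove-background may not be available, so check ocrmypdf -h first.

```shell
#!/usr/bin/env bash
# Sketch: run "ocrmypdf --remove-background" on every PDF in a directory.
# Assumed for illustration: the "_clean" suffix and the DRY_RUN switch.

# Derive the output name: "book.pdf" -> "book_clean.pdf"
clean_name() {
    printf '%s\n' "${1%.pdf}_clean.pdf"
}

# clean_all [DIR] [LANG]
# DIR defaults to the current directory, LANG to "eng" (English).
clean_all() {
    local dir="${1:-.}" lang="${2:-eng}" f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                      # no PDFs: glob stayed literal
        case "$f" in *_clean.pdf) continue ;; esac   # skip outputs of earlier runs
        # With DRY_RUN set, the command is only printed, not executed.
        ${DRY_RUN:+echo} ocrmypdf -l "$lang" --remove-background \
            "$f" "$(clean_name "$f")"
    done
}
```

Calling DRY_RUN=1 clean_all ~/scans ron prints the commands that would be run for Romanian, which is a cheap way to check the file names before starting a long job.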


About Scan Tailor, as a complement to the main answer

Even its icon illustrates the fact that it is intended exactly for what is asked here:

enter image description here

Here is how to use Scan Tailor with pdfs:

  1. Extract all pdf pages as image files, because this tool doesn't process pdf directly and needs images. Master PDF Editor can do this, but on my machine it crashes after extracting about 80 images; it can still be used by extracting the pages in several batches/ranges. (PDF Mod crashed before any processing.) What I prefer after a few trials is a reliable, albeit slower, CLI method, with a command like: pdftoppm MY_PDF.pdf NAME -tiff - as said here. Other formats can be used instead of tiff (which gives tif files), for example png or jpeg. Here is a set of Dolphin service menu actions for the various extraction options:
[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=pdf;tif;jpeg;
X-KDE-Submenu=PDF action: EXTRACT ALL pages
Icon=application-pdf

[Desktop Action pdf]
Name=Extract pages as pdf
Icon=application-pdf
Exec=bash -c 'pdf=$(pdftk "%u" burst); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';

[Desktop Action tif]
Name=Extract pages as tif
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%.*}" -tiff); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';


[Desktop Action jpeg]
Name=Extract pages as jpeg
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%.*}" -jpeg); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';
  2. Load and process the resulting images in Scan Tailor. Put the extracted image files in a separate folder and add that folder under New Project>Input Directory in Scan Tailor. (I have installed that program from the PPA, as said in a comment by @N0rbert under the main answer.) Pages that contain real images rather than text might look better if "Grayscale and Color" is selected for each of them instead of the default "Black and white" (meant here for text). Run the listed procedures one by one. Check the pages before running the last one ("Output").

enter image description here

  3. Create a new pdf out of the resulting images. (First check that the resulting tif files are as you want them.) There are many ways to create a new pdf. Again, the GUI tools that I've tried soon crashed or gave odd results, so I prefer to put the resulting tif files in a separate folder and run there the command img2pdf *.tif -o out.pdf - as said here. (This may need proper naming/numbering of the files. More on that here.)
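The two command-line steps above (extraction and re-assembly) can be sketched as two small shell functions; the Scan Tailor step in between stays manual. The function names, the "pages" folder, the "page" file prefix and the 300 dpi resolution are assumptions chosen for illustration, not something prescribed by the tools.

```shell
#!/usr/bin/env bash
# Sketch of steps 1 and 3; step 2 (Scan Tailor) is done in its GUI.

# Step 1: extract every page of a PDF as a TIFF file into a folder.
# extract_pages FILE.pdf [DIR]
extract_pages() {
    local pdf="$1" dir="${2:-pages}"
    mkdir -p "$dir" || return 1
    # -r sets the resolution in dpi; 300 is an assumed value, adjust as needed
    pdftoppm -r 300 -tiff "$pdf" "$dir/page"
}

# Step 3: assemble the processed TIFF files of a folder into one PDF.
# build_pdf DIR [OUT.pdf]
build_pdf() {
    local dir="$1" out="${2:-out.pdf}"
    set -- "$dir"/*.tif            # files in glob (name) order
    [ -e "$1" ] || { echo "no .tif files in $dir" >&2; return 1; }
    img2pdf "$@" -o "$out"
}
```

A possible sequence is extract_pages book.pdf pages, then processing the "pages" folder in Scan Tailor, then build_pdf on the folder where Scan Tailor wrote its output. Because the files are combined in name order, it is worth checking that the page numbers in the file names sort correctly.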

The resulting "tailored" pdf will be smaller than the initial one, but the size reduction varies depending on factors that I don't know. I imagine that the pages contained in the initial pdf should be extracted (at step 1) in the format they already have; I think jpeg and tif should be used instead of png. Use pdfimages -list your.pdf in a terminal to see the format, dpi and other details before processing with the commands above and below.

The final pdf can be further reduced with a command like:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

More details on that, here.

Here is a set of Dolphin service menu actions based on the above link:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=shrink;shrink0;shrink1;shrink2;
X-KDE-Submenu=PDF action: SHRINK
Icon=application-pdf

[Desktop Action shrink]
Name=Shrink pdf to "printer" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/printer    -sOutputFile="${f%.pdf}_printer.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink0]
Name=Shrink pdf to "prepress" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress    -sOutputFile="${f%.pdf}_prepress.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';


[Desktop Action shrink1]
Name=Shrink pdf to "ebook" size, 150dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook    -sOutputFile="${f%.pdf}_small.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink2]
Name=Shrink pdf to "screen" size, 72dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/screen    -sOutputFile="${f%.pdf}_smaller.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

I got some help from this answer too.
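Re-encoding with gs may not always give a smaller file, so it can be useful to compare sizes and keep the result only when it helps. Here is a minimal sketch of that idea; the function names and the "_small" suffix are assumptions for illustration, and the gs options are the ones from the command above.

```shell
#!/usr/bin/env bash
# Sketch: shrink a PDF with gs and keep the result only if it is smaller.

# smaller_than A B: succeeds if file A has fewer bytes than file B
smaller_than() {
    [ $(wc -c < "$1") -lt $(wc -c < "$2") ]
}

# shrink_pdf FILE.pdf: writes FILE_small.pdf next to the input
shrink_pdf() {
    local src="$1" out="${1%.pdf}_small.pdf"
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$out" "$src" || return 1
    if smaller_than "$out" "$src"; then
        echo "kept $out"
    else
        rm -f "$out"               # no gain: discard the re-encoded copy
        echo "no gain, removed $out"
    fi
}
```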


OCR (text search and copy capability) is lost during the above procedure, if present in the initial pdf. In order to get OCR, use ocrmypdf input.pdf output.pdf for English, as said here. For other languages, look for them with apt-cache search tesseract-ocr, and install them. Add -l <LANG> at the end of the command for specific languages; more here; see their names also here.

Here is a Dolphin service menu action for Romanian OCR with two options: one shows progress in a terminal but has a fixed output name, the other runs in the background but derives the output name from the input. (I would like to have both progress in a terminal and an output name based on the input, but don't know how; if someone can do it, please post here!) For English, replace "Romanian" and remove the -l ron option:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=ocr1;ocr2;
X-KDE-Submenu=PDF action: apply OCR
Icon=application-pdf

[Desktop Action ocr1]
Name=Apply OCR Romanian (see progress in terminal; output name: ocr_ro.pdf!)
Icon=application-pdf
Exec=konsole --noclose -e ocrmypdf "%u" ocr_ro.pdf -l ron

[Desktop Action ocr2]
Name=Apply OCR Romanian (background process: NO terminal! input>output name)
Icon=application-pdf
Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf" -l ron;'

(Extracting and processing images, as well as 'printing as pdf' removes OCR, but reducing size with ghostscript as above does not, so the "shrinking" can be applied before or after the OCR.)
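Regarding the open point above (progress in a terminal together with an output name based on the input), one possible approach is to move the naming logic into a small wrapper script and let konsole run that script. This is an untested sketch: the script name ocr-named.sh, its location and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# ocr-named.sh (hypothetical name): OCR a PDF, naming the output after the input.

# "book.pdf" -> "book_ocr.pdf"
ocr_output_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# ocr_named FILE.pdf [LANG]; LANG defaults to "ron" as in the menu above
ocr_named() {
    local src="$1" lang="${2:-ron}"
    ocrmypdf "$src" "$(ocr_output_name "$src")" -l "$lang"
}

# Run only when the script is called with at least one argument.
if [ "$#" -gt 0 ]; then
    ocr_named "$@"
fi
```

After making the script executable, an Exec line such as Exec=konsole --noclose -e /path/to/ocr-named.sh %f (the path being a placeholder) should then show the progress in a terminal while writing the result next to the input file.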

Score:1
tr flag

I've got pretty good results using ImageMagick and the following script: http://www.fmwconcepts.com/imagemagick/shadowhighlight/index.php

Here is the result using the following parameters:

./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 inputFile.png OutputFile.png

enter image description here
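To apply the same command to many page images, a loop is enough. Below is a minimal sketch that assumes the pages were already extracted as png files into one folder and that the shadowhighlight script sits in the current directory; the function name, the folder arguments and the DRY_RUN switch are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run the shadowhighlight script on every PNG of a folder.

# highlight_all IN_DIR OUT_DIR
highlight_all() {
    local in_dir="$1" out_dir="$2" f
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.png; do
        [ -e "$f" ] || continue        # no PNG files: glob stayed literal
        # With DRY_RUN set, the command is only printed, not executed.
        ${DRY_RUN:+echo} ./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 \
            "$f" "$out_dir/$(basename "$f")"
    done
}
```

The parameters are the ones from the command above; they will probably need tuning for other scans.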

br flag
You mean you can use simple-scan to input already existing digital images?
tr flag
Oh, so you are not looking for a scanner program but for automated image-processing software. If so, have a look at https://imagemagick.org/; it will do the job, but you will have to find the right settings.
br flag
can we apply that command to hundreds of pages at the same time?
Score:1
by flag

Just install GIMP (preferably as an AppImage). These are the options:

  1. Select Colors > Threshold and it is done: your image will be black and white. With this option you have to do it for each page.

  2. Select Image > Mode > Indexed > Use black and white (1-bit) palette.

With the second option, however many pages your pdf has, all of them will be converted to 1-bit black and white.

Edit on 02/11/2021: in response to the query raised by cipricus

Here are steps that I follow:

  1. Scan the pages with "simple scan" or Xsane (I found that simple scan works better in color), OR use an already available scanned pdf.
  2. Use File > Open, or drag and drop the pdf file into GIMP. Here you need to give the width x height of the image you need. (Decide whether you need 150 dpi or 300 dpi and set the width accordingly.)
  3. A pdf file with more than one page now opens as layers.
  4. Go to Image > Mode > Indexed > Use black and white (1-bit) palette.
  5. Export the pdf using File > "Export As".
  6. Check that each page of the exported pdf meets the requirements. If not, I process each defective page individually as follows:
     a) Select Image > Mode > Grayscale.
     b) If there is too much gray/noise on the page, select Colors > Exposure and adjust as needed.
     c) Select Colors > Threshold; the image becomes black and white. This has to be done for each defective page to reach the required quality.
     d) Insert the edited page as a layer among the layers of the original pdf file, delete the defective page's layer, and export the pdf again.
Hope this helps.
br flag
Do you mean that with the second option hundreds of pages/images can be selected and processed?
Ajay avatar
by flag
Yes, actually with the second option there is no need to select pages. You will just be changing the colour mode from RGB, Gray or CMYK to 1-bit black and white, so there will be only two shades, black and white, just like a photocopy.
br flag
Clearly only the second option can count here (processing each page in 400+ pages pdfs is not doable). Could you please elaborate a bit more on option 2? how to process the pdf? Should pages be extracted as images first? Or should the pdf be opened as such in Gimp?