Score:2

Search spreadsheets for strings in formulae


I have just discovered that my favourite full-text indexing program, Recoll, does not index the contents of formulae in spreadsheet cells. It does index cells containing just a number or text, and it does index the calculation result, but it does not index the formula which does the calculation.

Does anyone know how to search a directory full of spreadsheets for a string that appears within a formula, without opening each file in the spreadsheet application and searching there?

I am familiar with e.g. grep, but that is not clever about spreadsheet formats, and most tools that extract spreadsheet contents behave like Recoll: they ignore formulae and extract only results.

Here is an illustration of what I mean. Recoll and other tools can generally find this spreadsheet if I search for the value of G6 (1575.50), but I want to be able to find spreadsheets in which I used a particular type of formula to calculate a value. In this case I would be searching for IFERROR(.

[Image: screenshot of the example spreadsheet]

medoc
Recoll developer here. Could you please send me a sample spreadsheet? [email protected]
Graham
Kia ora Medoc, I could not work out how to attach a file, but I hope the image and extra paragraph explain the meaning better. See also my current best solution at the bottom of the thread.
Score:2
Jos

If your spreadsheets are made with LibreOffice Calc (or similar), they are really zip files. The following script will work:

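#!/bin/bash
# -print0 with read -d '' keeps file names containing spaces intact
find . -name "*.ods" -type f -print0 | while IFS= read -r -d '' f
do
        # Extract only content.xml into the current directory
        unzip -qq "$f" content.xml
        # Quote the search string so the shell does not expand it
        if grep -q "[search string]" ./content.xml; then
                echo "$f contains the search string"
        fi
        rm -f ./content.xml
done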

which does the following:

  1. find spreadsheets in folder
  2. for each spreadsheet file, unzip the file content.xml inside it
  3. do grep [search string] content.xml
  4. If there is a match, write a message to the user
  5. remove the file content.xml

Replace [search string] with the string to be searched for, or make it a variable to be supplied at run time.

Make sure there is not already a file content.xml in the folder, or it will be lost.

The -q flags are merely for suppressing output.

If you need a case insensitive search, add -i to the grep command.

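This will find the search string in the text of formulas in cells. Bear in mind that content.xml normally also stores the most recently calculated result of each formula, so result text can match as well. E.g. if your formula is =CONCAT("nice ";"day"), a search for nice or for day matches the formula itself, while a search for nice day can only match the stored result.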
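Because content.xml holds more than formulas, it can help to look only at formula text. The following is a minimal sketch, not part of the original answer, assuming that formulas are stored in table:formula="..." attributes. The function names list_ods_formulas and search_ods_formulas are made up for illustration. It uses unzip -p to write content.xml to standard output, so no temporary file is created.

```shell
#!/bin/bash
# Filter: read ODS content.xml data on stdin and print each formula,
# one per line. Assumption: formulas live in table:formula="..." attributes.
list_ods_formulas() {
    grep -o 'table:formula="[^"]*"' | sed -e 's/^table:formula="//' -e 's/"$//'
}

# search_ods_formulas STRING FILE.ods
# Returns 0 if STRING occurs literally (-F) in any formula of FILE.
search_ods_formulas() {
    unzip -p "$2" content.xml 2>/dev/null | list_ods_formulas | grep -qF -- "$1"
}

# Possible use inside a loop over files:
#   if search_ods_formulas 'IFERROR(' "$f"; then echo "$f"; fi
```

Formula text is XML-escaped inside the file, so a double quote in a formula appears as &quot; and a search string containing such characters would need the same escaping.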

Score:1

Thanks Jos, a really useful answer which broke the back of the problem. I added a couple of trivial changes which help in my use case:

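# -print0 with read -d '' keeps file names containing spaces intact
find . -iname "*.ods" -type f -print0 | while IFS= read -r -d '' f
do
        unzip -qq "$f" content.xml 2>/dev/null
        # -- stops grep from treating a search string that starts with - as an option
        if grep -q -- "$1" ./content.xml 2>/dev/null; then
                echo "$f contains the search string $1"
        fi
        rm -f ./content.xml
done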

-iname makes find case-insensitive in the filename

2>/dev/null throws away error messages, which are not useful here

$1 is the first command line parameter so I can use this as a script with a different search string each time.
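One thing to watch when the search string comes from $1: if the script is run without an argument, grep receives an empty pattern, which matches every line, so every spreadsheet would be reported. A minimal guard could look like the sketch below. The function name require_search_string is hypothetical and not part of the script above.

```shell
#!/bin/bash
# require_search_string "$@"
# Succeeds only if a non-empty first argument was supplied;
# otherwise prints a usage hint to stderr and returns 1.
require_search_string() {
    if [ -z "${1:-}" ]; then
        echo "usage: $0 SEARCH_STRING" >&2
        return 1
    fi
}

# Intended use near the top of the script:
#   require_search_string "$@" || exit 1
```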

Now that you have shown the principle: I know that Excel .xlsx files are really zip files too, but I think the internal structure is a little more complex. I will look at working with them too.

Score:0

My solution for .xlsx is a little more complex than for .ods but not much...

shopt -s globstar # make ** recurse into subdirectories (bash 4+)
for f in **/*.xlsx; do # Whitespace-safe and recursive
  # -o overwrite;
  # -j do not recreate directory structure
  # -d /path tells it where to extract to
  # The pattern is quoted so that unzip, not the shell, expands it
  unzip -qq -o -j -d /tmp "$f" 'xl/worksheets/*.xml' 2>/dev/null
  if grep -i -q -- "$1" /tmp/*.xml ; then
    echo "$f contains the search string $1"
  fi
  rm -f /tmp/*.xml
done

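Unlike the .ods version, you cannot predict the filenames of the xml files that result from unzip, so I unzip them to the system /tmp directory and then delete all xml files from there after each .xlsx is processed. That's what /tmp is for... Be aware that this also deletes any unrelated .xml files that happen to be in /tmp; extracting into a private directory created with mktemp -d avoids that.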
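As a variation on the /tmp approach, here is a minimal sketch that extracts into a private scratch directory made by mktemp -d and looks only at formula text. It assumes that worksheet XML stores each formula as the text of an <f> element. The function names xlsx_formulas and search_xlsx_formulas are made up for illustration.

```shell
#!/bin/bash
# Filter: read worksheet XML on stdin and print the text of each <f>
# element, one per line. Assumption: <f>...</f> holds the formula text.
xlsx_formulas() {
    grep -oE '<f( [^>]*)?>[^<]*</f>' | sed -E 's/^<f[^>]*>//; s/<\/f>$//'
}

# search_xlsx_formulas STRING FILE.xlsx
# Returns 0 if STRING occurs (case-insensitively, literally) in any formula.
search_xlsx_formulas() {
    local pattern=$1 file=$2 dir status
    dir=$(mktemp -d) || return 2      # private scratch directory
    unzip -qq -o -j -d "$dir" "$file" 'xl/worksheets/*.xml' 2>/dev/null
    cat "$dir"/*.xml 2>/dev/null | xlsx_formulas | grep -qiF -- "$pattern"
    status=$?
    rm -rf "$dir"                     # removes only what this call extracted
    return "$status"
}

# Possible use inside a loop over files:
#   if search_xlsx_formulas 'IFERROR(' "$f"; then echo "$f"; fi
```

Since each call works in its own directory, nothing else in /tmp is touched, and two searches can run at the same time without interfering.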

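The 'for' line at the top of this example handles paths and filenames with spaces in them properly, which a loop over $(find ...) does not, because the shell splits the output of find into words. Note that in bash, ** only recurses into subdirectories when the globstar option is enabled (shopt -s globstar).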
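To see the whitespace-safe glob in isolation, here is a small sketch. It assumes bash 4 or later, and the function name list_xlsx is hypothetical. It prints every .xlsx file below a directory, including files whose names contain spaces.

```shell
#!/bin/bash
shopt -s globstar   # bash 4+: let ** match directories recursively

# list_xlsx DIR
# Print each .xlsx path under DIR on its own line.
list_xlsx() {
    local f
    for f in "$1"/**/*.xlsx; do
        # When nothing matches, the pattern is left unexpanded; skip it
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: list_xlsx .
```

Each match arrives in $f as a single word, so quoting "$f" in later commands is all that is needed.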
