Score:2

What is the methodology for selecting symbol bit length and window size when performing Shannon Entropy Analysis?


When performing Shannon entropy analysis on something like an RNG or a file, you must (a code sketch of these steps follows the list):

  1. Select a symbol bit length and the number of samples you will analyse at a time (i.e. the window size)
  2. Read the input until the window is full
  3. Build a histogram of the collected symbols
  4. Take the histogram output and calculate the Shannon entropy
  5. Repeat from step 2, either by reading entirely new samples or by sliding the window (i.e. keeping a portion of the already-used samples)
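A minimal Python sketch of these steps, assuming 8-bit symbols and a fixed window size (the 1024-byte window, the step, and the file name "sample.bin" are illustrative choices, not values any particular tool uses):

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy of one window of 8-bit symbols, in bits per symbol."""
    counts = Counter(window)                  # step 3: histogram the symbols
    total = len(window)
    # step 4: H = -sum(p * log2(p)) over the histogram frequencies
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def windowed_entropy(data, window_size=1024, step=1024):
    """Yield (offset, entropy) per window; step < window_size slides the window."""
    for start in range(0, len(data) - window_size + 1, step):    # step 5
        yield start, shannon_entropy(data[start:start + window_size])

with open("sample.bin", "rb") as f:           # steps 1-2: fill windows from input
    data = f.read()
for offset, h in windowed_entropy(data):
    print(f"offset {offset:#010x}: {h:.3f} bits/byte")
```

Setting step equal to window_size gives disjoint windows; a smaller step keeps part of the previous window, i.e. the sliding variant from step 5.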

Tools like binwalk do this automatically under the hood and do a pretty good job of showing unusual portions of files; however, it is not entirely clear how they:

  • Select the symbol bit length
  • Select the window size
  • Decide whether any window sliding is performed

Is there a methodology to selecting these values in the context of RNG and file analysis?

Score:1

Liam, what you're asking is still an open question. There is no standardised methodology for calculating the entropy of a file in the general case. Even NIST have said so with their non-IID SP 800-90B calculations. The following questions are rhetorical, to illustrate the problem:

  1. What is the symbol bit length? Who knows. Shakespeare's works have line, act and paragraph demarcations. Are they included within your window? And they use weird words that could be represented by Huffman codes. (A sketch below makes the symbol-length point concrete.)

  2. What do you histogram? Really, what exactly would you histogram?

  3. How are the previous findings weighted?

The problem is not the window. It's the manipulation and weighting of said window that's the problem.
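To make question 1 concrete (a sketch, not from the answer; "hamlet.txt" stands in for any text file): scoring the very same bytes with 8-bit symbols versus 4-bit symbols yields different per-bit entropy estimates, so the number you report already depends on a choice the data itself does not dictate.

```python
import math
from collections import Counter

def entropy_per_symbol(symbols):
    """Shannon entropy of a symbol sequence, in bits per symbol."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

with open("hamlet.txt", "rb") as f:   # hypothetical input text
    data = f.read()

h8 = entropy_per_symbol(data)         # treat each byte as one 8-bit symbol
nibbles = [b >> 4 for b in data] + [b & 0x0F for b in data]
h4 = entropy_per_symbol(nibbles)      # treat each nibble as one 4-bit symbol

# Same file, different per-bit answers:
print(f"8-bit symbols: {h8 / 8:.3f} entropy bits per data bit")
print(f"4-bit symbols: {h4 / 4:.3f} entropy bits per data bit")
```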

See https://en.wikipedia.org/wiki/Kolmogorov_complexity, http://www.reallyreallyrandom.com/photonic/technical/90b_latest/ and http://www.reallyreallyrandom.com/photonic/technical/algorithms/ and follow the links.

In short, there is no such thing as Shannon Entropy analysis in the general case :-(

Well, it is at least comforting that I am not missing something obvious.
Paul Uszak
@LiamKelly God no. You're pushing the boundaries of how we calculate the entropy of general things. If you follow the links, you'll realise that it's quite tricky. The Shannon do-da formula only works for independent and identically distributed sources.
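To illustrate the IID caveat in this comment (a sketch, not anything from the thread): a byte stream that just counts 0..255 over and over is perfectly predictable, yet its byte histogram is uniform, so a naive Shannon calculation scores it at a full 8 bits/byte, indistinguishable from genuine randomness.

```python
import math
import os
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

predictable = bytes(range(256)) * 64   # a repeating counter: zero real entropy
random_bytes = os.urandom(256 * 64)    # OS randomness of the same length

print(shannon_entropy(predictable))    # exactly 8.0 bits/byte
print(shannon_entropy(random_bytes))   # also ~8.0 bits/byte
```

The histogram cannot see the ordering, which is precisely the structure the IID assumption throws away.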