Consider a scenario where a company wants to release microdata of its employees' total annual compensation for the following year to an analyst at a recruiting firm. The data is meant to indicate relative employee importance, so that the recruiting firm can suggest similar candidates for the company to hire. An employee's total offered compensation for year X is made up of 3 components:
- Performance-based income, determined by how the employee performed in the previous year (X-1).
- Potential-based income, determined by how the company predicts the employee will perform this year (X).
- Bonus, offered based on the employee's business team's collective financial returns in the previous year (X-1).
| (A) Employee | (B) Total annual compensation | (C) Performance-based income | (D) Potential-based income | (E) Bonus |
|---|---|---|---|---|
| Alice | 150 | 100 | 50 | 0 |
| Bob | 150 | 0 | 100 | 50 |
| Trudy | 150 | 0 | 100 | 0 |
In the above example, the company will only release columns A and B. Columns C, D & E are strategically sensitive data that the company wants to protect. Now suppose the analyst has some auxiliary data (say, from LinkedIn) showing that Bob and Trudy are new employees, so their performance-based income, which is based on last year's performance, should be \$0. LinkedIn data also shows that Alice and Trudy are in the company's cloud business, which registered a loss last year, so they likely have a \$0 bonus. Combining this auxiliary data with the complete released dataset (i.e. columns A and B), the analyst could potentially reconstruct the decomposition of B into C, D and E, leaking the company's strategically sensitive data.
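To make the attack concrete, here is a minimal sketch (assuming the additive form B = C + D + E and the zero-constraints described above; the dictionaries are my own encoding of the example) of how the analyst could flag rows whose decomposition is fully determined:

```python
# Sketch of the analyst's attack under the additive model B = C + D + E.
# Names and totals follow the example table; the zero-constraints come
# from the (hypothetical) LinkedIn auxiliary data.

released = {"Alice": 150, "Bob": 150, "Trudy": 150}

# Auxiliary knowledge: which components are known to be zero.
# C = performance-based income, E = bonus.
known_zero = {
    "Alice": {"E"},       # cloud business, so no bonus
    "Bob": {"C"},         # new hire, so no performance-based income
    "Trudy": {"C", "E"},  # both constraints apply
}

for name, total in released.items():
    free = {"C", "D", "E"} - known_zero[name]
    if len(free) == 1:
        # Only one unknown left: it must equal the released total,
        # so the decomposition is exactly reconstructed.
        print(f"{name}: {free.pop()} = {total} (exact reconstruction)")
    else:
        print(f"{name}: {len(free)} unknowns, decomposition not unique")
```

With Trudy's two constraints, a single unknown remains and the released total pins it down exactly; Alice and Bob each retain two unknowns, so their decompositions stay ambiguous.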
What noise injection strategies can we use for column B that provide an information-theoretic guarantee that the analyst will not be able to reconstruct the decomposition of column B, while also preserving the utility of the released data, i.e. allowing the analyst to join it with auxiliary data and determine what kind of candidates the company prefers?
Here are my thoughts:
Surveying the literature on differential privacy, I believe my problem statement is not covered (at least not directly), since the risk isn't really 'privacy'-related: we are knowingly releasing a dataset with employee identifiers in it. The risk is instead what I'd call "decomposition risk": we want to prevent the analyst from making definitive statements about the exact magnitude of any sub-component of an employee's final compensation.
Intuitively, I believe (a) we need to add some noise to column B before sharing it with the analyst, and (b) the added noise should be larger when fewer sub-components contribute to an employee's final compensation value.
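As a strawman for intuitions (a) and (b), here is a sketch using Laplace noise whose scale grows as auxiliary constraints leave fewer unknown components. The 3/k scale rule and the `base_scale` value are my own hypothetical heuristic, not a calibrated differential-privacy mechanism:

```python
import numpy as np

def noisy_total(total, n_unknown, base_scale=10.0, n_components=3, rng=None):
    """Perturb a released total with Laplace noise.

    Rows where auxiliary data leaves fewer unknown components get
    proportionally larger noise (hypothetical 3/k heuristic, not a
    calibrated DP mechanism).
    """
    rng = rng if rng is not None else np.random.default_rng()
    scale = base_scale * n_components / max(n_unknown, 1)
    return total + rng.laplace(0.0, scale)

rng = np.random.default_rng(0)
print(noisy_total(150, n_unknown=1, rng=rng))  # Trudy-like row: C and E pinned, heavy noise
print(noisy_total(150, n_unknown=2, rng=rng))  # Alice/Bob-like row: lighter noise
```

One obvious caveat with this heuristic: the noise scale itself depends on the auxiliary constraints, so if the analyst can estimate the noise magnitude, that magnitude may itself leak how many components contribute.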
The magnitude of (and the need for) added noise also depends on the functional form, i.e. how the sub-components are combined. In our example, since it is a simple addition, there is a higher risk that the analyst can guess how the sub-components contribute to the final value.
The closest literature I could find is on decomposing risks into risk components (Katja et al.), but that is from an analyst's lens: how they could decompose a metric into its contributing components. What I am interested in is an approach that provides an information-theoretic guarantee that the analyst will not be able to reconstruct the decomposition of column B even if they guess the functional form of the equation correctly.
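One toy way to see what such a guarantee might quantify: count how many decompositions remain consistent with what the analyst observes. A single consistent decomposition means exact reconstruction; noise that only reveals B up to an interval multiplies the count. The grid of multiples of 50 and the interval model are illustrative assumptions, as is the additive form:

```python
from itertools import product

def consistent_decompositions(b_low, b_high, zero_components, step=50, max_val=200):
    """All (C, D, E) on a grid, with the forced zeros, whose sum lies in [b_low, b_high]."""
    grid = range(0, max_val + 1, step)
    out = []
    for c, d, e in product(grid, repeat=3):
        if "C" in zero_components and c != 0:
            continue
        if "D" in zero_components and d != 0:
            continue
        if "E" in zero_components and e != 0:
            continue
        if b_low <= c + d + e <= b_high:
            out.append((c, d, e))
    return out

# Exact release of B = 150 under Trudy's constraints (C = E = 0):
print(len(consistent_decompositions(150, 150, {"C", "E"})))  # 1 -> exact reconstruction
# Noisy release known only up to +/-50:
print(len(consistent_decompositions(100, 200, {"C", "E"})))  # 3 decompositions survive
```

A guarantee could then be phrased in terms of the analyst's residual uncertainty (e.g. the entropy of their posterior over decompositions) never dropping below a threshold, even under the worst-case auxiliary constraints.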