Score:0

Compressing SHA256 to be a viable database id?

Don't know a lot about cryptography, so need some help on this.

I would like to use a SHA256 string as a unique id in my database for users, but scaling that would be difficult.

Is it possible to convert a SHA256 string to be a shorter unique version, that would not collide (or collide very rarely)?

Could passing the SHA256 string through CRC32, FNV164, or ADLER32 be a viable option in this case?

kelalaka
CRC is not an option. How much are you going to [truncate from SHA-256](https://crypto.stackexchange.com/q/64314/18298)? What is your _metric_ for "rarely"? Why can't you support the 256-bit output? Do you have billions of users? Hash functions give no guarantee of uniqueness. You might check [UUIDs](https://datatracker.ietf.org/doc/html/rfc4122).
Score:2

I think this is an XY problem and actually should be posted at Software Engineering SE. The goal described in the OP, the generation of user IDs, can be solved without any cryptography.

1. Scaling

Scaling matters when the load can increase substantially within a short time. But a new user ID is needed only when a new user registers, and registration normally takes a user 1 to 5 minutes. So you would need no more than 1 new ID per registering user per minute.

Many databases provide ID generators. PostgreSQL, MariaDB, and Oracle provide generators called "sequences". MySQL provides auto-increment IDs. These are fast when used in the straightforward way, and the databases also offer further performance optimizations such as pools of IDs. Platforms like Java and C# integrate well with these ID generators. Generating a new ID basically means incrementing an integer, and requests to the database are needed only rarely.
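A minimal sketch of a database-assigned ID, using SQLite from the Python standard library as a stand-in for the databases named above (an assumption made only so the example runs without a server). The `users` table and the `register_users` helper are hypothetical names for illustration.

```python
import sqlite3


def register_users(names):
    """Insert users and return the IDs the database assigned to them."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # the database generates this
        "name TEXT NOT NULL)"
    )
    assigned = []
    for name in names:
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        assigned.append(cur.lastrowid)  # ID chosen by the database
    conn.commit()
    conn.close()
    return assigned


if __name__ == "__main__":
    print(register_users(["alice", "bob", "carol"]))
```

The application never computes an ID itself here; it only reads back the integer the database handed out.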

Example: Suppose you use PostgreSQL and a sequence with a pool of 10,000 IDs, and a request from the application to the database to refresh the pool takes 10 ms. Then each application instance (i.e. each cluster node, Kubernetes pod, or similar) can generate 10,000 IDs per 10 ms, which is 1,000,000 new IDs per second. At that rate the generator produces about 7.2 billion IDs in 2 hours, roughly as many as there are people in the whole world.
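The pooling idea and the arithmetic in this example can be sketched as follows. This is an in-memory illustration only: `FakeSequence` stands in for a real database sequence, and `PooledIdGenerator` and `ids_per_second` are hypothetical names, not part of any database driver.

```python
import threading


class FakeSequence:
    """Stand-in for a database sequence; each call models one round trip."""

    def __init__(self, increment):
        self.increment = increment  # pool size reserved per round trip
        self.round_trips = 0
        self._next_start = 1
        self._lock = threading.Lock()

    def next_block_start(self):
        with self._lock:
            start = self._next_start
            self._next_start += self.increment
            self.round_trips += 1
            return start


class PooledIdGenerator:
    """Hands out IDs from a local pool and refills it only when empty."""

    def __init__(self, sequence):
        self._sequence = sequence
        self._next = 0
        self._limit = 0  # exclusive upper bound of the current pool
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._limit:  # pool exhausted: one round trip
                start = self._sequence.next_block_start()
                self._next = start
                self._limit = start + self._sequence.increment
            value = self._next
            self._next += 1  # the common case is just an increment
            return value


def ids_per_second(pool_size, refresh_ms):
    """Upper bound on throughput if only pool refreshes cost time."""
    return pool_size * 1000 // refresh_ms


if __name__ == "__main__":
    seq = FakeSequence(increment=10_000)
    gen = PooledIdGenerator(seq)
    ids = [gen.next_id() for _ in range(25_000)]
    print(ids[0], ids[-1], seq.round_trips)
    rate = ids_per_second(10_000, 10)
    print(rate, rate * 2 * 60 * 60)
```

With a pool of 10,000, handing out 25,000 IDs costs three round trips to the sequence; every other call is a local increment.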

Obviously, if such a standard ID generator is used, it will not be a bottleneck.

2. Shortening

How much data are you going to store per user? 1 KB, 10 KB, 100 KB? Suppose you store 1 KB of data per user and have as many users as Facebook or Twitter. Then 4 bytes per ID (about 4.3 billion distinct values) will be sufficient. Truncating SHA-256 from 32 to 4 bytes saves you 28 bytes per user, which is less than 3% of the storage. Against that saving you have to set the complexity of finding an algorithm that transforms SHA-256 to 4 bytes without many collisions, the effort to implement it correctly, the effort to handle the cases when collisions do happen, and the effort for bug fixing. The total cost of such a solution can be much higher than the cost of the 3% of storage saved. Calculate it and then you will know whether it makes sense in your case.
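A rough way to run that calculation, as a sketch under stated assumptions: 1 KB is taken as 1024 bytes, the truncated hash is assumed to behave like a uniformly random value, and the collision estimate uses the standard birthday approximation. The function names are hypothetical.

```python
import math

SHA256_BYTES = 32  # size of a full SHA-256 digest


def storage_saving_fraction(short_id_bytes, bytes_per_user):
    """Fraction of per-user storage saved by shortening the ID."""
    return (SHA256_BYTES - short_id_bytes) / bytes_per_user


def collision_probability(num_users, id_bits):
    """Birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes IDs are uniformly random id_bits-bit values.
    """
    exponent = -num_users * (num_users - 1) / 2 ** (id_bits + 1)
    return -math.expm1(exponent)  # equals 1 - exp(exponent), accurate near 0


if __name__ == "__main__":
    print(f"saving: {storage_saving_fraction(4, 1024):.2%}")
    for users in (10_000, 77_163, 1_000_000):
        p = collision_probability(users, 32)
        print(f"{users:>9} users, 32-bit ID: P(collision) = {p:.3f}")
```

Under these assumptions a 4-byte truncated hash reaches about a 50% chance of at least one collision at roughly 77,000 users, so collision handling would be unavoidable long before Facebook scale, whereas a sequential 4-byte ID from the database never collides.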
