Score:0

Snapshot of a Google Compute Engine disk is larger than the used space

in flag

I have a Google Compute Engine instance (VM) with a 2TB disk, of which around 80GB is used. I want to archive this VM so that I don't get billed for the whole 2TB, while keeping it ready to be recreated quickly if needed. Disk snapshots seemed to be the best option, since it is mentioned that I only get billed for the disk space used in that case. But when I try this, the snapshot comes out at around 600GB: almost 10 times the used space, though still less than the full 2TB.

I tried defragmenting the disk, but that didn't help. I also tried using "zerofree" to write zeros to unused space, and that reduced the snapshot size to 20GB, 4x lower than the used space. My guess is that zeroing helps with the compression of the disk, but zerofree takes a lot of effort and time to run.

Is there a better way to improve disk compression efficiency in this case? Is there a crucial step that I am missing when generating the disk snapshot?
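To illustrate the guess about compression, here is a minimal Python sketch. It assumes that leftover data in freed blocks behaves like random bytes (a worst case) and uses zlib as a stand-in compressor; how snapshots are actually compressed is not something I know, so the block size and counts are purely illustrative.

```python
import os
import zlib

BLOCK_SIZE = 4096   # illustrative block size, not a Compute Engine value
BLOCK_COUNT = 256   # 1 MiB region in total

# Stand-in for stale data left behind in blocks the file system has freed.
stale_region = os.urandom(BLOCK_SIZE * BLOCK_COUNT)
# The same region after a tool such as zerofree has overwritten it with zeros.
zeroed_region = bytes(BLOCK_SIZE * BLOCK_COUNT)

stale_size = len(zlib.compress(stale_region))
zeroed_size = len(zlib.compress(zeroed_region))

print(f"stale region compresses to  {stale_size} bytes")
print(f"zeroed region compresses to {zeroed_size} bytes")
```

The zeroed region compresses to a small fraction of the stale one, which would be consistent with the smaller snapshot after running zerofree.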

NOTE: I also tried Machine Images but that seems to use disk snapshots under the hood, and they cost more for some reason.

eg flag
I would suggest using `fstrim` rather than `zerofree`, but if zerofree worked and got the size of the snapshot down, then you're done. What exactly are you asking for?
John Hanley avatar
cn flag
@psusi - Why do you think fstrim would help with disk snapshots?
mrtksy avatar
in flag
@psusi zerofree takes around 6-7 hours to run for a 5TB drive, and I have multiple of those. My question is if there is a more efficient way to create a snapshot-like copy of a disk, but one that does not bill me for the whole 5TB.
eg flag
@JohnHanley, because (assuming the VM supports it) it can instruct the VM to discard the data and free the unused space, without the bother of writing zeros to that space.
eg flag
@mrtksy, could you just make a backup and then decommission the VM? Or maybe use a smaller system disk with an additional disk that you can plug in for data storage when you need it, and dispose of it when you don't (possibly after making a backup, if you may need to restore it later)?
Score:0
cn flag

Disks normally have file systems. File systems have user data and file system metadata. The details depend on the disk partitioning scheme and file system type. The snapshot consists of changed disk blocks. This includes disk data blocks that were allocated, modified, and then deallocated by the file system.
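The gap between "used space" and blocks that were ever written can be sketched with a toy model in Python. The `ToyDisk` class and its block counts are hypothetical and only illustrate the idea; they do not describe how Compute Engine tracks blocks internally.

```python
class ToyDisk:
    """Hypothetical toy model of a block device with a file system on top."""

    def __init__(self, block_count):
        self.block_count = block_count
        self.written = set()     # blocks the device has seen modified
        self.allocated = set()   # blocks the file system reports as in use

    def write_file(self, start, length):
        """Allocate and modify `length` blocks starting at block `start`."""
        for block in range(start, start + length):
            self.written.add(block)
            self.allocated.add(block)

    def delete_file(self, start, length):
        """Deleting only updates file system metadata; the block contents stay."""
        for block in range(start, start + length):
            self.allocated.discard(block)


disk = ToyDisk(block_count=2000)
disk.write_file(start=0, length=600)    # data is written...
disk.delete_file(start=80, length=520)  # ...and most of it is later deleted

used_blocks = len(disk.allocated)       # what a tool like df would report
changed_blocks = len(disk.written)      # blocks that were modified at some point

print(f"used: {used_blocks}, changed: {changed_blocks}")
```

In this model the file system reports 80 blocks in use while 600 blocks have been modified, which has the same shape as the 80GB used versus 600GB snapshot in the question.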

Your block-zero strategy is increasing the number of changed blocks, which means recovering the snapshot will take longer. Note: persistent disks can be recovered from snapshots in a lazy manner, which gives the appearance of fast recovery while the actual data restoration is still taking place. However, that process consumes disk bandwidth transferring the data in the background.

Recommendation:

Use tar or a similar archive tool and save the files to Cloud Storage as a compressed archive. Recreating a persistent disk, partitioning it, and formatting it is very easy and in most cases takes seconds. Then restore the saved files.
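As a minimal sketch of the archive-and-restore step, the following uses Python's standard `tarfile` module instead of the `tar` command. The helper names `archive_directory` and `restore_archive` and all paths are hypothetical, and uploading the archive to Cloud Storage is not shown.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps paths in the archive relative to the directory name.
        tar.add(str(source), arcname=source.name)
    return Path(archive_path)


def restore_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(target_dir))


# Round trip on a throwaway directory to show the two steps together.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "notes.txt").write_text("hello")

    archive = archive_directory(data_dir, Path(tmp) / "data.tar.gz")

    restore_dir = Path(tmp) / "restored"
    restore_dir.mkdir()
    restore_archive(archive, restore_dir)
    restored_text = (restore_dir / "data" / "notes.txt").read_text()
```

Only the files themselves end up in the archive, so its size tracks the used space rather than the size of the disk. On a real system the archive would also need to preserve ownership and permissions, which typically means running the archiver as root.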

mrtksy avatar
in flag
Thanks for the quick response. When you say "changed blocks", what is this change with respect to? To make my question clearer, I don't have hourly or daily snapshots. I have a single snapshot that I manually create from the disk, so it should contain all the information on the disk rather than any incremental information. Your idea makes perfect sense, but I want to capture the whole state of the disk, including any repos installed using apt. Do you have any recommendations for this use case? Thanks
John Hanley avatar
cn flag
@mrtksy Changed blocks are simply that: disk blocks that are modified, even if the new data is the same as what was previously there. If you want the convenience and features of snapshots, then use snapshots. If minimizing the disk space used is your goal, then either clone the files to another persistent disk and snapshot that, OR use a file archiver.