Score:5

Server

Best way to remove text from the beginning of a huge file

Christopher Schultz

3/30/23, 7:09 PM

I have a huge MySQL backup file (from mysqldump) with the tables in alphabetical order. My restore failed and I want to pick up where I left off with the next table in the backup file. (I have corrected the problem; this isn't really a question about MySQL restores, etc.)

What I would like to do is take my backup file, e.g. backup.sql, and trim off the beginning of the file until I see this line:

-- Table structure for `mytable`

Then everything after that will end up in my result file, say backup-secondhalf.sql. This is somewhat complicated by the fact that the file is bzip2-compressed, but that shouldn't be too big of a deal.

I think I can do it like this:

$ bunzip2 -c backup.sql.bz2 | grep --text --byte-offset --only-matching -e '-- Table structure for `mytable`' -m 1

This will give me the byte-offset in the file that I want to trim up to. Then:

$ bunzip2 -c backup.sql.bz2 | dd skip=[number from above] | bzip2 -c > backup-secondhalf.sql.bz2

Unfortunately, this requires me to run bunzip2 on the file twice and read through all those bytes twice.

Is there a way to do this all at once?

I'm not sure my sed-fu is strong enough to do a "delete all lines until regular expression, then let the rest of the file through" expression.

This is on Debian Linux, so I have GNU tools available.


text

grep

sed

jrw32982

4/4/23, 6:48 PM

If the lines can be arbitrarily long, how do you know that grep will be able to locate the `-- Table structure` target string? Also, is the target string always at the beginning of a line? If so, then a custom program should work even for arbitrarily long lines (N = length of the fixed target string): read a buffer, locate each newline in turn, and check that N chars follow the newline in the buffer (if not, shift the newline to the beginning of the buffer and fill the remainder); then compare the target string against the bytes after the newline and skip to the next newline if it does not match. No need for KMP.
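A rough Python sketch of that idea, not a program from this thread: instead of walking newline by newline, it searches each buffer for a newline followed by the target and carries over only the last N bytes between reads, so memory stays bounded however long the lines are. The name `find_line_offset` and the 1 MiB default chunk size are illustrative assumptions.

```python
def find_line_offset(stream, target, chunk_size=1 << 20):
    """Return the byte offset of the first line starting with target, or -1."""
    needle = b"\n" + target
    window = b"\n"   # virtual newline so a match on the very first line counts
    base = -1        # stream offset of window[0]; -1 is the virtual newline
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1                    # end of input, no line matched
        window += chunk
        pos = window.find(needle)
        if pos != -1:
            return base + pos + 1        # +1 skips the newline before target
        # Keep only the bytes that could still begin a match split across reads.
        drop = max(0, len(window) - (len(needle) - 1))
        base += drop
        window = window[drop:]
```

The returned offset is 0-based, like the one printed by `grep --byte-offset`, and an occurrence of the target in the middle of a line is ignored.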

jrw32982

4/4/23, 6:50 PM

If the data were already uncompressed in a regular (seekable) file, then `grep -m1` followed by `cat` would work.

Score:8


Mark Wagner

3/30/23, 11:54 PM

bunzip2 -c backup.sql.bz2 | \
  sed -n '/-- Table structure for `mytable`/,$p'

Explanation:

-n: suppress automatic printing of the pattern space.

Address range: starts at the first line matching the regex /-- Table structure for `mytable`/ and ends at $, which matches the last line.

Command: p prints the current pattern space.

Edit: depending on how you dumped the database you may have very long lines. GNU sed can handle them up to the amount of available memory.
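For readers less used to sed address ranges, the same "drop lines until the first match, then pass everything through" semantics can be sketched in Python. This is only an illustration of what the range does: the helper name `lines_from_match` is made up here, and it does a literal substring test rather than a regex match, which is equivalent for this particular pattern.

```python
from itertools import dropwhile


def lines_from_match(lines, needle):
    """Yield lines starting at the first one that contains needle.

    Mirrors sed -n '/needle/,$p' for a literal needle: earlier lines
    are discarded, the matching line and all later ones are kept.
    """
    return dropwhile(lambda line: needle not in line, lines)
```

Used as a filter, it would look like `sys.stdout.buffer.writelines(lines_from_match(sys.stdin.buffer, b"-- Table structure for `mytable`"))`. Like sed, it still holds one whole line in memory at a time, so the caveat about very long lines applies here too.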


Christopher Schultz

4/1/23, 1:32 AM

Indeed, I do have very long lines. This is a 64-bit system, so theoretically it may be willing to allocate up to 2^64 bytes to a single process. But my physical memory is limited to 64GiB and swap is nowhere near the gigabyte range. So I think the whole pattern space wouldn't fit into memory for those long lines.

Score:2


Christopher Schultz

4/1/23, 2:31 AM

NOTE: Not an actual answer

Since I was motivated to get this solved now, I went ahead and used grep to find the offset in the file I wanted; it worked great.

Running dd unfortunately requires that you set ibs=1, because skip= counts input blocks rather than bytes. That basically means no buffering, and performance is terrible. While waiting for dd to complete, I spent time writing my own custom-built C program to skip the bytes. After having done that, I see that tail could have done it for me just as easily:

$ bunzip2 -c restore.sql.bz2 | tail -c +[offset] | bzip2 -c > restore-trimmed.sql.bz2

I say "this doesn't answer my question" because it still requires two passes through the file: one to find the offset of the thing I'm looking for and another to trim the file.

If I were to go back to my custom program, I could implement a KMP search during the "read-only" phase of the program and then switch over to "read+write everything" after that.
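A minimal Python sketch of that single-pass idea, under a few assumptions: it uses `bytes.find` over overlapping chunks in place of KMP, it matches the marker anywhere in the stream (not only at the start of a line), and the names `copy_from_marker` and `trim_bz2` are made up for illustration. Memory use is bounded by the chunk size plus the marker length, regardless of line length, and the file is decompressed only once.

```python
import bz2


def copy_from_marker(src, dst, marker, chunk_size=1 << 20):
    """Copy src to dst starting at the first occurrence of marker.

    Returns True if the marker was found, False if nothing was written.
    """
    keep = len(marker) - 1   # bytes that could begin a match split across reads
    tail = b""
    # Phase 1: read only, searching for the marker.
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return False
        window = tail + chunk
        pos = window.find(marker)
        if pos != -1:
            dst.write(window[pos:])
            break
        tail = window[-keep:] if keep > 0 else b""
    # Phase 2: pass everything else through unchanged.
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return True
        dst.write(chunk)


def trim_bz2(in_path, out_path, marker):
    """Decompress in_path once and recompress everything from marker on."""
    with bz2.open(in_path, "rb") as src, bz2.open(out_path, "wb") as dst:
        return copy_from_marker(src, dst, marker)
```

For the file in the question this would be called as `trim_bz2("backup.sql.bz2", "backup-secondhalf.sql.bz2", b"-- Table structure for `mytable`")`. Python's bz2 module is likely slower than the bzip2 binaries; `copy_from_marker` also works on `sys.stdin.buffer` and `sys.stdout.buffer` if the compression is left to a shell pipeline.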


Score:0


mestia

3/30/23, 7:41 PM

I wonder if something like that would do the trick:

use strict;
use warnings;
use feature 'say';

use IO::Uncompress::Bunzip2 '$Bunzip2Error';

my $file = $ARGV[0] // die "need a file";

my $zh = IO::Uncompress::Bunzip2->new( $file, {
    AutoClose   => 1,
    Transparent => 1,
} ) or die "IO::Uncompress::Bunzip2 failed: $Bunzip2Error\n";

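# Stay silent until the marker line is seen, then print every line from there on.
my $trigger = undef;
while ( <$zh> ) {
    chomp;
    # Exact match against the whole line, not a regex.
    $trigger = 1 if $_ eq '-- Dumping data for table `experiments`';
    say if $trigger;
}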

So basically it starts printing from the line that matches the pattern. One can also pipe it directly to bzip2/gzip, like perl chop.pl input_sql.bz2 | bzip2 > out.sql.bz2. You would need libio-compress-perl on Debian.


Christopher Schultz

3/30/23, 8:10 PM

This may work, but it may also fail or run out of memory, depending on how Perl treats long lines. I believe `<>` reads each line entirely into memory, and that will likely blow up. Some of these lines are dozens of GiB long.
