Split: slicing up very large files

© 2013 Lawrence I. Charters

Washington Apple Pi Journal, Vol. 35, no. 1, Spring 2013, pp. 28-29.

This article is a geeky missive on handling a really, really large file using Unix commands in the OS X Terminal application. The concepts and commands are not complicated, but the tasks themselves are not what the average user does with a Mac. You have been warned.

Isn’t 2^28 a strange value?

Every month I create sets of web statistics for various websites. The process involves taking the raw web logs, compressed using gzip, and running them through a utility that turns the bare IP address of web visitors into a more English-like name, such as converting 24.28.199.168 into www.rr.com. The logs are then fed to a web analytics program that produces reports on the number of visitors, which pages were visited, what files were downloaded, and so on.

One site in particular usually produces a gzip-compressed log file of roughly 1.5 megabytes, which translates to around 60,000 lines of log entries. However, through a series of unusual events, this website became very popular one month, and produced ginormous.log.gz. Even gzip-compressed, the log file was 5.79 gigabytes. My utility that converts bare IP addresses into English-like names balked after reading 256 million lines of the log.

This prompted me to ask two questions: first, what programmer has a parameter set at 2^28 lines? Second, exactly how large was this log? I couldn’t answer the former, so I tackled the latter.
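For the curious, the arithmetic behind that suspicion can be checked in Terminal with the shell's built-in integer math. This is only a sanity check on the numbers; that the utility's limit really is a power of two remains a guess.

```shell
# 2 to the 28th power, written as a bit shift so it works in any POSIX shell
limit=$((1 << 28))
echo "$limit"                 # 268435456

# The same value in units of 1,048,576 (2 to the 20th), the "binary million"
echo $((limit / 1048576))     # 256
```

In other words, "256 million" in the binary sense is a little over 268 million in everyday counting.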

OS X has a built-in utility for handling gzip archives. If you have a file that ends in .gz, all you need to do is double-click on it and it decompresses the file, using a utility called, quite descriptively, Archive Utility. Archive Utility isn’t found with your normal applications, as users don’t really care where it is or how it works, but if you are curious, it is located in /System/Library/CoreServices.

Double-clicking on ginormous.log.gz produced an uncompressed log file of 250,363,402,476 bytes called ginormous.log. As you might imagine, this took a while; even on a multi-core Mac Pro, it took over two hours to decompress. Wondering how many entries there were, I used a Unix command to find out:

wc -l ginormous.log

which, after another two hours (the command needs to read everything in the log file), reported that ginormous.log contained 801,758,175 lines of log entries. This was clearly more than my IP conversion utility could handle, so the next question was: how do you usefully chop something that large into smaller pieces?
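In hindsight, counting lines does not strictly require a decompressed copy on disk: gzip can decompress to standard output, and wc can count whatever streams past. A minimal sketch, using a tiny stand-in file named sample.log (a hypothetical name, created here only so the example runs):

```shell
# Work in a scratch directory so nothing real is touched
cd "$(mktemp -d)"

# Build a tiny stand-in log (1,000 lines) and compress it
seq 1 1000 > sample.log
gzip sample.log                      # leaves sample.log.gz

# -d decompresses, -c writes to standard output instead of a file;
# tr strips the padding spaces that some versions of wc add
lines=$(gzip -dc sample.log.gz | wc -l | tr -d ' ')
echo "$lines"
```

On a file the size of the real log this would still take a long time, since every byte has to be decompressed and read, but it avoids needing a quarter of a terabyte of free disk space just to get a count.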

Split – the oddly named Unix command

Most Unix commands seem to have obscure names, such as the ls command for viewing directories. Split is the exception: split splits large files into smaller files. It has a number of options; to see a manual, open up Terminal and type:

man split

and remember to press q (just the letter q, nothing else) to stop reading the help file.

I decided to split ginormous.log into files of roughly 20 billion bytes each. The syntax for this is:

split -b [for byte count] 20000m [for 20,000 megabytes] ginormous.log

which looks like this without the comments:

split -b 20000m ginormous.log

If you don’t specify any additional options (and I didn’t want to spend hours producing useless files, so I kept it simple), split defaults to naming files in a particular pattern: xaa, xab, xac, xad, etc.
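Two details are worth checking before committing hours to a run: how many pieces to expect, and what they will be called. The sketch below assumes the m suffix means 1,048,576 bytes (an assumption; confirm it against man split on your system), and uses a small hypothetical file, demo.log, to show the naming pattern. It also shows that split accepts an optional prefix after the file name, which replaces the default x.

```shell
# Expected piece count for a 250,363,402,476-byte file cut into 20000m pieces,
# assuming "m" means 1,048,576 bytes (an assumption, see above)
size=250363402476
piece=$((20000 * 1048576))
count=$(( (size + piece - 1) / piece ))   # integer division, rounded up
echo "$count"

# Naming demo on a small stand-in file in a scratch directory
cd "$(mktemp -d)"
seq 1 1000 > demo.log                # a few thousand bytes of text
split -b 1000 demo.log demo.         # pieces of 1,000 bytes, prefix "demo."
ls demo.a*
```

With the prefix, the pieces come out as demo.aa, demo.ab, and so on, which can save a renaming step later.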

Four and a half hours later, I had twelve files, each about 20 gigabytes in size, named xaa through xal. I used an AppleScript command to rename them ginormousa.log, ginormousb.log, etc. Then I compressed them with gzip by typing:

gzip ginormous*

which cheerfully compressed every file in the directory that started with “ginormous.” Compressing took another two hours.
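The renaming step can also be done in Terminal rather than with AppleScript. A minimal sketch of that alternative, using empty stand-in files so the loop has something to rename:

```shell
cd "$(mktemp -d)"
touch xaa xab xac                    # stand-ins for the real split output

for f in xa?; do
    letter=${f#xa}                   # strip the leading "xa", leaving a, b, c...
    mv "$f" "ginormous${letter}.log"
done
ls
```

The pattern xa? matches any three-character name beginning with xa, so it covers xaa through xal without listing each file.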

And at that point, I could start doing the IP conversion, followed by the web analytics. All was right with the world, or at least rightly sized.

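Note: the pieces that split produces are only useful on their own for text files such as log files. If you split a large JPEG file into multiple pieces, none of the pieces would open as an image by itself. The original file is left untouched, however, and the pieces can be joined back together, in order, with the cat command.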
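Split pieces are not lost data, though: concatenating them in order reproduces the original byte for byte. A small sketch, with hypothetical file names, that splits a stand-in file, rejoins the pieces with cat, and uses cmp to confirm the result matches:

```shell
cd "$(mktemp -d)"
seq 1 1000 > original.log            # small stand-in for a large file
split -b 1000 original.log part.     # produces part.aa, part.ab, ...

# The shell expands part.* in alphabetical order, which is the order split wrote them
cat part.* > rejoined.log

# cmp -s is silent and succeeds only if the two files are identical
if cmp -s original.log rejoined.log; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

The same approach works for any kind of file, text or not, as long as every piece is present and joined in the right order.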