Snow Leopard’s file system changes

Mac OS X 10.6 “Snow Leopard” brought a slew of under-the-hood changes. Apple made two distinct but related changes that most users will probably never notice, and that many developers may never notice either. But if you happen to be someone who works with the file system, these changes may affect you. They are:

1. KB vs. KiB. That is, is a kilobyte 1000 bytes or 1024 bytes?

2. File-system-level compression of files.

Since I recently wrestled with both of these, I felt it’d be worthwhile to share what I discovered, as there isn’t much information out there on the topic. The only thing I found that remotely discussed the matter was this page of Ars Technica’s Snow Leopard review.

KB vs. KiB

Long ago in the days of 128 K of memory and 1.44 MB floppy disks, everything was based upon 1 KB = 1024 bytes (base-2). 1000 was close enough to 1024 that if you said something was “1 KB”, it didn’t really matter whether that meant 1000 bytes or 1024 bytes; it was close enough for approximation purposes to say “1000 bytes” and do the math in our heads based upon the 1000 approximation. While there was some loss in the calculation, 24 bytes didn’t amount to much. Today, however, such “rounding errors” can add up to quite a significant difference. Where this is seen most often is in the marketing of large multi-gigabyte hard drives or devices (e.g. portable music players): they’re advertised as the “32 GB model”, but users get them home and discover they hold much less than 32 GB. This tends to happen because of the KB vs. KiB issue.
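
To see how much the discrepancy amounts to, do the math: a drive marketed as “32 GB” holds 32,000,000,000 bytes (base-10), but an OS computing in base-2 divides that by 1024³ = 1,073,741,824 and reports roughly 29.8 GB. Over 2 “GB” vanish purely in the unit conversion.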

Until Snow Leopard, everything on Mac OS was done in base-2: file sizes, disk capacities, memory. With Snow Leopard, we now have a mixed bag, with some things done in base-2 (KiB) and some things in base-10 (KB). Of course, the “KB” abbreviation is still used for base-2 things as well as base-10 things, so confusion can still exist. But for the most part, I know users are unlikely to notice or care, so long as what they see makes sense (I’ll discuss this more when I get to point 2). So what’s there to know about the new base-10 setup?

It only applies to the user interface.

The numbers you see for file sizes in the Finder? Base-10. Get Info windows? Base-10. Most disk utilities? Base-10. Spotlight? Base-10. In Mac OS X 10.5, a Finder Find query for “size is less than 6 KB” would expand to “size is less than 6144 bytes”. In Mac OS X 10.6, the same query expands to “size is less than 6000 bytes”. So it’s not just about presenting data to the user, it’s also about getting/interpreting data from the user.

Is this a hard-and-fast rule? Not necessarily, as I believe it depends upon what you’re doing. For instance, a piece of software that allows you to make something (e.g. a disk image) “the size of a CD-ROM” might still present a user interface of “650 MB”. IMHO, that should not equate to 650,000,000 bytes but should still equate to the established 681,984,000 bytes (333,000 sectors of 2,048 bytes each). Should the user interface be changed to say “682 MB”? I don’t think so, as it’s been ingrained after many, many years that a CD-ROM is “650 MB”. So IMHO there isn’t a clear-cut way to deal with this issue, but in general you’ll want your user-level interactions and file size calculations to be base-10.

Of course, even this isn’t very clear cut. For instance, how are you going to round off the numbers? How many significant digits will you present? As well, if your code needs to support both Mac OS X 10.6 (Snow Leopard) and prior OS versions, you will have to conditionalize your code. It felt rather silly to have to create a function in my code like this (in pseudo-code):

// Pseudo-code: how many bytes the user interface should treat as one
// "kilobyte". ([Environment isRunningAtLeastSnowLeopard] stands in for
// whatever OS-version check your code base uses.)
+ (uint64_t)sizeOfKilobyte
{
    if ([Environment isRunningAtLeastSnowLeopard])
    {
        return 1000;    // 10.6 and later: UI sizes are base-10
    }

    return 1024;        // 10.5 and earlier: UI sizes are base-2
}

then any time I did such math I had to invoke this function, where before I could just hard-code in 1024. But here we are.
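
To make that concrete, here’s a minimal sketch of how such a function might get used when formatting a byte count for display. The -stringFromByteCount: method name is hypothetical (Apple didn’t ship a public byte-count formatter API in 10.6); the conditional constant is the point:

+ (NSString *)stringFromByteCount:(uint64_t)byteCount
{
    // 1000 on 10.6 and later, 1024 before (see sizeOfKilobyte above).
    double k = (double)[self sizeOfKilobyte];

    if (byteCount < k)
        return [NSString stringWithFormat:@"%llu bytes", (unsigned long long)byteCount];
    if (byteCount < k * k)
        return [NSString stringWithFormat:@"%.0f KB", byteCount / k];
    if (byteCount < k * k * k)
        return [NSString stringWithFormat:@"%.1f MB", byteCount / (k * k)];
    return [NSString stringWithFormat:@"%.2f GB", byteCount / (k * k * k)];
}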

Personally, I don’t like what Apple did, but likely only because it’s created a major headache for me to deal with. When I do my best to look at it objectively, it’s not really a bad change. Kilo does mean 1000, not 1024, so it is more accurate and likely less confusing to the lay person. As the joke goes, there are 10 kinds of people in this world: those who understand binary and those who don’t.

Bottom line: the change is here, and going forward it should be a good one. It will take time for all developers to adopt it and update their software to match, but hopefully in short order everyone will come along for the ride. Just remember: this is only a user-interface-level change. Under the hood, we’re still in a base-2 world.

Compression

For the products I work on, this is probably the biggest headache to deal with. On the one hand it’s pretty clever, and in a lot of ways it works out; on the other hand, it has some subtle side effects that mess with long-held assumptions about files and their sizes on disk.

What Apple has done is look for ways not just to reduce how much space files take up on disk (in fact, I suspect this may be the lesser reason), but more so to improve overall application performance. If a file is of an acceptable size, Apple compresses the file’s data and puts it into the “extended attributes” area of the file’s on-disk metadata. All of this magic happens automagically; it’s kernel-level voodoo going on with the file system at the lowest of levels. This is stuff that you’re unlikely to ever directly experience or notice going on… unless you’re a developer who works with the file system like I do. 🙂
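
If you want to spot such files yourself, one quick hint: from the Terminal, ls can include file flags in a long listing, and compressed files show the compressed flag (more on that flag below). For example:

ls -lO "/Applications/Address Book.app/Contents/MacOS"

Files stored with HFS+ compression show “compressed” in the flags column of the listing.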

The Ars Technica article explains the advantage of this approach:

You may be wondering, if this is all about data compression, how does storing eight uncompressed bytes plus a 17-byte preamble in an extended attribute save any disk space? The answer to that lies in how HFS+ allocates disk space. When storing information in a data or resource fork, HFS+ allocates space in multiples of the file system’s allocation block size (4 KB, by default). So those eight bytes will take up a minimum of 4,096 bytes if stored in the traditional way. When allocating disk space for extended attributes, however, the allocation block size is not a factor; the data is packed in much more tightly. In the end, the actual space saved by storing those 25 bytes of data in an extended attribute is over 4,000 bytes.

But compression isn’t just about saving disk space. It’s also a classic example of trading CPU cycles for decreased I/O latency and bandwidth. Over the past few decades, CPU performance has gotten better (and computing resources more plentiful—more on that later) at a much faster rate than disk performance has increased. Modern hard disk seek times and rotational delays are still measured in milliseconds. In one millisecond, a 2 GHz CPU goes through two million cycles. And then, of course, there’s still the actual data transfer time to consider.

Granted, several levels of caching throughout the OS and hardware work mightily to hide these delays. But those bits have to come off the disk at some point to fill those caches. Compression means that fewer bits have to be transferred. Given the almost comical glut of CPU resources on a modern multi-core Mac under normal use, the total time needed to transfer a compressed payload from the disk and use the CPU to decompress its contents into memory will still usually be far less than the time it’d take to transfer the data in uncompressed form.

That explains the potential performance benefits of transferring less data, but the use of extended attributes to store file contents can actually make things faster, as well. It all has to do with data locality.

If there’s one thing that slows down a hard disk more than transferring a large amount of data, it’s moving its heads from one part of the disk to another. Every move means time for the head to start moving, then stop, then ensure that it’s correctly positioned over the desired location, then wait for the spinning disk to put the desired bits beneath it. These are all real, physical, moving parts, and it’s amazing that they do their dance as quickly and efficiently as they do, but physics has its limits. These motions are the real performance killers for rotational storage like hard disks.

The HFS+ volume format stores all its information about files—metadata—in two primary locations on disk: the Catalog File, which stores file dates, permissions, ownership, and a host of other things, and the Attributes File, which stores “named forks.”

Extended attributes in HFS+ are implemented as named forks in the Attributes File. But unlike resource forks, which can be very large (up to the maximum file size supported by the file system), extended attributes in HFS+ are stored “inline” in the Attributes File. In practice, this means a limit of about 128 bytes per attribute. But it also means that the disk head doesn’t need to take a trip to another part of the disk to get the actual data.

As you can imagine, the disk blocks that make up the Catalog and Attributes files are frequently accessed, and therefore more likely than most to be in a cache somewhere. All of this conspires to make the complete storage of a file, including both its metadata and its data, within the B-tree-structured Catalog and Attributes files an overall performance win. Even an eight-byte payload that balloons to 25 bytes is not a concern, as long as it’s still less than the allocation block size for normal data storage, and as long as it all fits within a B-tree node in the Attributes File that the OS has to read in its entirety anyway.

Whew! A lot of information, but hopefully you can see the advantage of what Apple did. A lot of people talk about how Snow Leopard feels so much faster. The compression trick certainly plays into that speed increase.

Unfortunately, implementing this clever feature comes with some cost. Let’s take a look at a file and its sizes as reported by various OS mechanisms. The file in question is the /Applications/Address Book.app/Contents/MacOS/Address Book application binary as installed by Mac OS X 10.6. This is the same binary on the same machine, just with the machine rebooted into different OS versions.

First, let’s look at the binary sizes when examined under Mac OS X 10.5.8 (Leopard):

Size Mechanism                      Reported Size
Terminal                            0
Finder                              112 KB on disk (111,641 bytes)
FSCatalogInfo data fork logical     0
FSCatalogInfo data fork physical    0
FSCatalogInfo rsrc fork logical     111641
FSCatalogInfo rsrc fork physical    114688
lstat() st_size                     0
lstat() st_blocks                   224
kMDItemFSSize                       111641

Now, let’s look at the same binary when examined under Mac OS X 10.6 (Snow Leopard):

Size Mechanism                      Reported Size
Terminal                            322928
Finder                              323 KB on disk (322,928 bytes)
FSCatalogInfo data fork logical     322928
FSCatalogInfo data fork physical    114688
FSCatalogInfo rsrc fork logical     0
FSCatalogInfo rsrc fork physical    0
lstat() st_size                     322928
lstat() st_blocks                   224
kMDItemFSSize                       322928
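
A side note on reading the lstat() rows: st_blocks counts 512-byte blocks, so 224 blocks works out to exactly the 114,688-byte physical size shown above. A minimal C sketch of how those two rows can be gathered:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat sb;

    if (argc < 2 || lstat(argv[1], &sb) != 0)
        return 1;

    /* Logical size, in bytes. */
    printf("st_size:   %lld\n", (long long)sb.st_size);

    /* st_blocks is in 512-byte units: 224 * 512 = 114,688 bytes. */
    printf("st_blocks: %lld (%lld bytes)\n",
           (long long)sb.st_blocks, (long long)sb.st_blocks * 512);
    return 0;
}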

Same file, very different sizes. Which set is correct? They both are. There’s the rub.

There’s no way pre-Snow-Leopard systems can understand the new voodoo, so they report what’s actually there: the file is compressed, Apple is storing the compressed data off in the file’s metadata as a “resource fork”, and so that’s what we actually see under Leopard. The logical size of the file is correct, and the physical size of the file is correct; that’s the actual on-disk setup, not accounting for the storage taken up in the attributes area. Under Snow Leopard, Apple does its mystical voodoo so that you see the file as they want you to see it, not as the file actually is. That’s why the logical size is greater than the physical size. Yes, this breaks the decades-old maxim that a file’s logical size can never be greater than its physical size, but here we are. Is it incorrect? No. The logical size of the file is 323 KB (note the Finder applying base-10 here), and physically the file is only taking up 114 KB on disk. Thank you, compression.

Is this a problem for most people? No. Most people just want to know how big their file is, and for that they turn to the logical size, which is correct. If you FTP’d the file to another machine, burned it to a CD or DVD, or some other such thing, it would wind up coming out at 323 KB. Remember, this is all happening at the kernel level, mystical voodoo in the bowels of the OS. Any higher-level APIs that manipulate the file (e.g. read, write) will never see the compression and will manipulate the file as if it were uncompressed. So again, unless you somehow care about what’s going on down here, you may never know nor care about this voodoo.

If you’re someone who cares about the physical size, well… now things start to get interesting. For instance, let’s say you need to display the physical size to the user. What do you do? We can take a small cue from the Finder. If you look at a Get Info window, it says the size is something like “xxx MB on disk (yyy bytes)”. Traditionally that meant “<physical size> on disk (<logical size> bytes)”, but now, maybe not. If the physical size is less than the logical size, that sort of display is going to look very bizarre to the user. The best you can do is get both the physical size and the logical size. If the physical size is greater than or equal to the logical size, you ought to be able to assume things are fine and go with that emotion. If, however, the logical size is greater than the physical size, you may have to substitute “yyy” for “xxx”. Is this ideal? No, but what else can you do?
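
In plain C, that heuristic might look something like the following sketch (the function name and the unformatted output string are mine, purely for illustration):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: build a Finder-style "on disk" label, guarding
   against compressed files where physical < logical. */
void FormatOnDiskLabel(uint64_t logicalSize, uint64_t physicalSize,
                       char *buf, size_t bufLen)
{
    /* Normally physical >= logical. Compression can invert that, which
       would read bizarrely, so substitute the logical size instead. */
    uint64_t onDisk = (physicalSize >= logicalSize) ? physicalSize : logicalSize;

    snprintf(buf, bufLen, "%llu bytes on disk (%llu bytes)",
             (unsigned long long)onDisk, (unsigned long long)logicalSize);
}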

If you need to contend with this: while the Mac OS hides much of this away from you by default, there are ways for developers to get at the information.

  • You can detect whether a volume supports compression by checking the VOL_CAP_FMT_DECMPFS_COMPRESSION flag in the ATTR_VOL_CAPABILITIES attribute. See man 2 getattrlist. (See the sketch after this list.)
    • Note this value is an attribute of the 10.6 kernel, so its presence is determined in part by what OS version you’re running under.
  • You can check whether a specific file is compressed by checking the UF_COMPRESSED bit of the st_flags field returned by stat(). (Also shown in the sketch after this list.)
    • You could also check ATTR_CMN_FLAGS from getattrlist.
    • Note this value resides with the file in its metadata storage. Thus, if, say, you reboot the machine under a previous OS version, or the file resides on a portable volume that you mount on a pre-10.6 machine, this flag will still be set.
  • Note that the physical size of a fork is the number of allocation blocks reserved on disk for that fork. Realize this might be zero, which happens when the fork is small enough for the compressed contents to fit into the attributes B-tree.
    • The physical size is the number reported by ATTR_FILE_DATAALLOCSIZE and ATTR_FILE_RSRCALLOCSIZE from getattrlist, and thus by the dataPhysicalSize and rsrcPhysicalSize fields returned by FSGetCatalogInfo.
  • An interesting note about the storage location: if the file’s content is small, it gets compressed and stored in the extended attributes. If the file’s content is large, it gets compressed and stored in the resource fork.
    • Yes, I too thought resource forks were dead. Long live resource forks!
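
Here’s a minimal sketch of both checks in plain C, assuming the 10.6 SDK headers. Note that getting volume capabilities requires pointing getattrlist at the volume’s root (e.g. “/”), and the reply buffer follows the man page’s layout of a length word followed by the requested attributes:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/attr.h>
#include <sys/stat.h>

/* Does the volume rooted at volPath support decmpfs compression? */
static bool VolumeSupportsCompression(const char *volPath)
{
    struct attrlist attrList;
    struct {
        u_int32_t               length;
        vol_capabilities_attr_t caps;
    } __attribute__((packed)) reply;

    memset(&attrList, 0, sizeof(attrList));
    attrList.bitmapcount = ATTR_BIT_MAP_COUNT;
    attrList.volattr     = ATTR_VOL_INFO | ATTR_VOL_CAPABILITIES;

    if (getattrlist(volPath, &attrList, &reply, sizeof(reply), 0) != 0)
        return false;

    /* Honor the capability bit only if the volume declares it valid. */
    return (reply.caps.valid[VOL_CAPABILITIES_FORMAT] &
            reply.caps.capabilities[VOL_CAPABILITIES_FORMAT] &
            VOL_CAP_FMT_DECMPFS_COMPRESSION) != 0;
}

/* Is this particular file stored compressed on disk? */
static bool FileIsCompressed(const char *path)
{
    struct stat sb;

    if (lstat(path, &sb) != 0)
        return false;
    return (sb.st_flags & UF_COMPRESSED) != 0;
}

int main(void)
{
    printf("volume supports compression: %d\n", VolumeSupportsCompression("/"));
    printf("file is compressed:          %d\n",
           FileIsCompressed("/Applications/Address Book.app/Contents/MacOS/Address Book"));
    return 0;
}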

So, those are the gory details. They’re not too painful to get at, if you care about this information.

Conclusion

In the end, this is an issue that most users will probably never notice nor care about. In most respects, that’s how it should be. However, to make that wonderful Mac user experience of things “just working,” we developers often have to deal with a lot of hell and hassle to keep things going. Hopefully the above gives the lay person some idea (if you made it this far) of what we contend with, and gives fellow developers some clue as to what’s going on and how to handle it.

Much thanx to Quinn for help here. Without Quinn, much of my life as a software developer would be far more painful. I owe him many beers…. or probably a small brewery at this point. 🙂  Share and enjoy.

Updated: Check it out. You can compress files yourself using the --hfsCompression option for ditto. Now, just because you can do this doesn’t necessarily mean you should. But know it’s there.
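
Usage is straightforward (the paths here are placeholders):

ditto --hfsCompression /path/to/source /path/to/destination

The copy at the destination gets written with HFS+ compression where appropriate.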
