About File Size
Posted by Sven Koester, Last modified by Ibrahim Tannir on 04 September 2013 14:48
This article is about the size of files. It sounds trivial, but on second glance it turns out to hold some surprises.
As an example, let's use a text file that resides on the desktop and contains the text "Hello World". The file has 11 characters, each one represented by a byte, so its size should be 11 bytes. When calling a command that lists the file, like ls -l, you will probably get 11 for the size.
But that is not the whole truth. For instance, you may want to call du hello.txt. This command lists the disk space used by that file and will usually report 4 kB (4096 bytes), printed as a number of blocks whose unit depends on the system. This is because most file systems allocate space in blocks of 4 kB, and this file requires one such block.
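The difference between the two numbers can be made visible with a small Python sketch. It assumes a Unix-like system, where the st_blocks field is available and counted in 512-byte units; the function name file_sizes and the file name hello.txt are only chosen for illustration.

```python
import os
import tempfile


def file_sizes(path):
    """Return (apparent_size, allocated_size) in bytes for one file."""
    st = os.stat(path)
    apparent = st.st_size            # the size that `ls -l` shows
    allocated = st.st_blocks * 512   # st_blocks is counted in 512-byte units (Unix only)
    return apparent, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hello.txt")
        with open(path, "w") as f:
            f.write("Hello World")
        apparent, allocated = file_sizes(path)
        print("apparent size: ", apparent, "bytes")
        print("allocated size:", allocated, "bytes")
```

The apparent size is 11 bytes. The allocated size depends on the file system; on many systems it is 4096, but some file systems store very small files differently and report another value.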
And that's not all: the file has metadata like attributes, a name, an owner, a group, access rights, several time stamps (when it was created and last modified), extended attributes and so on. This space is usually not counted in the file size. And the path to that file also requires some space. On tape or in other sequential formats like a tar file, it is stored as a string.
On a side note: PresSTORE does not exactly calculate the size of attributes and paths but assumes 1 kB per file to store that information.
On disks, the path is stored in folders. That means there is a folder structure (requiring its own space) that leads to the file. Folders are actually simple lists, requiring some kBytes to store the file names and positions of the contained files. In today's file browsers such as Explorer or Finder, a folder is also shown with the size of its contents. That is an add-on, calculated by the program as the sum of the sizes of everything the folder contains, recursively. What exactly the sum of the sizes represents and how exactly it is calculated depends on the mechanism, and it is volatile, since the calculation may be done a while after the actual modification took place.
As a result, it is not possible to tell what the size obtained by a certain command or program actually represents. Even when the same program calculates the size of the same files on two different disks, the results may differ, due to the different block sizes as well as to the other metadata. Comparing a disk to a tape is even worse, as there the additional information like attributes and paths is not counted. (There is another article that deals with tapes and their size: http://portal.archiware.com/support/index.php?/Knowledgebase/Article/View/181)
These are the basics. However, today's file systems contain files where even the above size discussion is not complete:
Hardlinks: are multiple directory entries for the same file, possibly with different names. So there appear to be two files, but they occupy the same disk space. In sums of sizes, these files are sometimes counted once, sometimes twice. Some tools, including PresSTORE Backup, resolve hard links to copies. Some others like PresSTORE Synchronize keep them as links.
Sparse files: are special files with a big size (let's say gigabytes) that are sparsely filled, i.e. the content is stored at widely separated positions. Most file systems can handle such files in a way that only the portions that contain data are actually written to disk. The size of such a file is therefore bigger than the disk space it requires.
Softlinks: are files that contain a pointer to another file. They usually have no noticeable size. Softlinks can also be created for folders. PresSTORE treats softlinks as they are, as pointers (and does not follow them).
When building sums over file sizes, different tools may count these sizes differently.
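How much the counting method matters can be sketched in Python. The example below is only an illustration under stated assumptions, not the way PresSTORE or any particular tool computes sizes: it assumes a Unix-like system (st_blocks in 512-byte units), sums regular directory entries only, and the name directory_totals is hypothetical.

```python
import os
import tempfile


def directory_totals(root):
    """Walk root and return three different 'sizes' of its files, in bytes.

    per_name  : st_size added once per directory entry (hard links counted repeatedly)
    per_inode : st_size added once per distinct file, identified by (device, inode)
    allocated : st_blocks * 512 added once per distinct file
    The folders themselves are not counted.
    """
    per_name = per_inode = allocated = 0
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):  # does not descend into linked folders
        for name in filenames:
            # lstat looks at a softlink itself instead of following it
            st = os.lstat(os.path.join(dirpath, name))
            per_name += st.st_size
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue                    # another name for a file already counted
            seen.add(key)
            per_inode += st.st_size
            allocated += st.st_blocks * 512
    return {"per_name": per_name, "per_inode": per_inode, "allocated": allocated}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "data.bin")
        with open(original, "wb") as f:
            f.write(b"x" * 100)
        os.link(original, os.path.join(tmp, "hardlink.bin"))  # second name, same file
        with open(os.path.join(tmp, "sparse.bin"), "wb") as f:
            f.seek(1024 * 1024 - 1)                           # skip ahead, leaving a hole
            f.write(b"\0")
        print(directory_totals(tmp))
```

For the hard-linked file, per_name counts 200 bytes while per_inode counts 100. For the sparse file, the size is 1 MiB, while the allocated value is usually much smaller if the file system supports sparse files.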
Last but not least, there is the size calculation and display issue regarding what a Gigabyte actually is.
Older systems, like OS X up to 10.5 and also PresSTORE up to P4, used multiples of 1024. That is, a kilobyte is 1024 bytes, a megabyte is 1024 kilobytes and a gigabyte is 1024 megabytes. Newer systems use multiples of 1000 instead. You may now think that the difference is 2.4 %, but with bigger sizes that difference grows significantly: it is about 4.9 % for megabytes, 7.4 % for gigabytes and 10 % for terabytes.
As an example, let's look at the raw capacity of an LTO-6 tape, 2.5 TB or 2.5 * 1000 * 1000 * 1000 * 1000 bytes. When regarded with multiples of 1024, the capacity is somewhat smaller: 2,500,000,000,000 / 1024 / 1024 / 1024 / 1024 = 2.27 TB
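The arithmetic can be reproduced with a few lines of Python. This is a minimal sketch; the function names are chosen for illustration only.

```python
UNITS = ["kB", "MB", "GB", "TB"]


def prefix_gap_percent(power):
    """Percentage by which 1024**power exceeds 1000**power."""
    return (1024 ** power / 1000 ** power - 1) * 100


def in_binary_units(num_bytes, power):
    """Express a byte count in multiples of 1024**power."""
    return num_bytes / 1024 ** power


if __name__ == "__main__":
    for power, unit in enumerate(UNITS, start=1):
        print(f"{unit}: {prefix_gap_percent(power):.1f} %")
    # prints 2.4 %, 4.9 %, 7.4 % and 10.0 %

    lto6_bytes = 2.5 * 1000 ** 4                   # raw capacity of an LTO-6 tape
    print(f"{in_binary_units(lto6_bytes, 4):.2f}")  # prints 2.27
```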