INTRODUCTION
------------

The ``Disk Pool'' technology developed by Broadcom consists of a mechanism for storing a file system across parts of one or more mass storage devices, optionally with mirroring of the file system data. This mechanism includes features for dynamically changing the size and layout of the file system across mass storage devices and robustly recovering from loss or temporary unavailability of one or more of those mass storage devices or interruption of power in the middle of changes to the pool, including changes to the size and layout of the file system across devices.

Note that for ease and clarity of language, the word ``disk'' will be used to mean any mass storage device in the remainder of this document. In many cases, the mass storage device will be a hard disk drive (hence the term ``disk'') but it need not be. For example, solid state flash USB storage keys have been used successfully as the ``disks'' in Broadcom's ``Disk Pool'' implementation. The ``disks'' for a given pool can even be a combination of some conventional hard disk drives and some other mass storage devices.

OVERVIEW
--------

The external interface to the Disk Pool mechanism consists of a fairly conventional file system interface plus some specialized control functionality for dealing with changes in the size and layout of the file system across the disks. The file system interface matches pretty closely the standard file system interfaces used by various operating systems, including standard Linux, Unix, and Windows file systems. There is a tree of directories containing files, plus particular kinds of meta-data for each file or directory, including the name of the file or directory, the last access time, the owner, read and write permissions, etc. The file system interface allows creating, deleting, reading, and writing files and directories and reading and modifying the file and directory meta-data.

Internally, the Disk Pool is implemented in two major parts, the Low-Level Part and the File System Part. The two parts communicate through a simple interface that provides the File System Part with access to what appears to it as a flat array of data bytes. This flat array is called the Mid-Level Data Array.

The File System Part is a conventional file system. Any of a number of standard file systems, such as ReiserFS, JFS, ext3fs, or NTFS, may be used as the File System Part. Implementations of these file systems that were created to run on a simple disk or disk partition device may be used unchanged or nearly unchanged on the Mid-Level Data Array.

The heart of the Disk Pool technology is the Low-Level Part. The Low-Level Part must have available to it one or more Primary Disks on which to store data. Each Primary Disk must be a mass storage device of some sort and must be given over entirely to the Low-Level Part. That is, nothing else besides the Low-Level Part of the Disk Pool technology should be reading or writing any part of the disk if it is a Primary Disk. Note that this does not mean the whole disk needs to be used by a single pool; several pools may share the disk, but the whole of the disk is managed by the Low-Level Part of the Disk Pool technology.

The Low-Level Part keeps two main kinds of state: transitory state, which may reside on any medium, including volatile memory such as DRAM, but not on the Primary Disks; and persistent data, which is all kept on the Primary Disks. The transitory state is destroyed every time the machine running the Disk Pool mechanism is rebooted, which can happen because it had a fatal error, it was turned off, or it had its power interrupted. The persistent data used by the Low-Level Part can be further sub-divided into two categories: chunks of Raw Data and meta-data which specifies how the Raw Data is stitched together to form the Mid-Level Data Array.
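
To make the boundary between the two parts concrete, here is a minimal sketch of the kind of flat-array interface the Low-Level Part might present to the File System Part. The type and function names are hypothetical illustrations for this document only, not names taken from the actual implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical handle for an assembled pool's Mid-Level Data Array. */
    typedef struct mid_level_array mid_level_array;

    /* The File System Part sees only a flat array of bytes of some fixed
     * size.  It does not know (or care) how the Low-Level Part maps those
     * bytes onto Raw Data chunks spread across one or more Primary Disks. */
    uint64_t mid_level_array_size(const mid_level_array *array);

    /* Read `length` bytes starting at byte `offset` of the array into
     * `buffer`.  Returns 0 on success, non-zero on error (for example, if
     * part of the requested range is unavailable because a disk is
     * missing and not mirrored). */
    int mid_level_array_read(mid_level_array *array, uint64_t offset,
                             void *buffer, size_t length);

    /* Write `length` bytes from `buffer` into the array at byte `offset`. */
    int mid_level_array_write(mid_level_array *array, uint64_t offset,
                              const void *buffer, size_t length);

Because a conventional file system written for a plain disk or partition device already expects exactly this kind of flat array of bytes, it can be layered on such an interface with little or no change, which is what allows standard file systems to serve as the File System Part.
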
Each chunk of Raw Data should exactly match a contiguous chunk of the Mid-Level Data Array, with the exception that in some cases a chunk of Raw Data might have an undefined value because it is a stand-by chunk for a mirror, to be used in case another disk fails, or because mirror rebuilding is ongoing and this chunk of Raw Data has not yet been completely rebuilt. The meta-data describes how all the Raw Data chunks, which may be scattered across any number of disks, are mapped to the Mid-Level Data Array. This meta-data can include mirroring and striping in addition to or instead of arbitrary concatenation of Raw Data chunks.

It is desirable for at least some of the meta-data to be available even when one or more of the disks is unavailable. If a pool is set up to use mirroring, it is usually set up so that if any one disk is unavailable, at least one copy of each piece of the Mid-Level Data Array is still present on the remaining disk or disks -- that's the primary reason for using mirroring. In that case, with mirroring and one or more disks missing but all the data present, the Low-Level Part is designed to continue making the entire Mid-Level Data Array available. And even in the case where missing disks mean that not all the Mid-Level Data Array data is available, the Low-Level Part is designed to provide enough diagnostic information to issue a reasonable error message that tells the user which disk is not present. For those reasons, the meta-data is distributed across the same disks as the chunks of Raw Data for the pool.

The details of how this meta-data is distributed are key both to making the information available when one or more disks is absent and to making the system robust in the face of power interruption in the middle of an operation to change the size and/or layout of the Raw Data on the disks, or other changes in the meta-data such as the name of a pool. If the meta-data were all simply copied on each disk, that would be convenient for handling missing-disk cases, but not power interruption cases, since the power interruption might happen when one of the disks had been updated but another of the disks had not been updated. If the meta-data were all on one disk, dealing with the power-interruption problem would be easier, but if that one disk with the meta-data were missing, the system would break down.
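
As a concrete illustration of the simplest form of the mapping described above -- plain concatenation of Raw Data chunks, with no mirroring or striping -- the following sketch resolves a Mid-Level Data Array byte offset to a particular chunk of Raw Data. The types, field names, and sample numbers are hypothetical and only for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* One chunk of Raw Data: a contiguous run of bytes on some disk that
     * corresponds to a contiguous run of bytes of the Mid-Level Data Array. */
    struct raw_chunk {
        int      disk;          /* which Primary Disk holds this chunk     */
        uint64_t disk_offset;   /* byte offset of the chunk on that disk   */
        uint64_t array_offset;  /* byte offset of the chunk in the array   */
        uint64_t length;        /* length of the chunk in bytes            */
    };

    /* Resolve an array offset to (disk, byte offset on that disk).  The
     * chunks are assumed to cover the array with no gaps or overlaps.
     * Returns 0 on success, -1 if the offset is out of range. */
    static int resolve(const struct raw_chunk *chunks, int count,
                       uint64_t array_offset, int *disk, uint64_t *disk_offset)
    {
        for (int i = 0; i < count; i++) {
            if (array_offset >= chunks[i].array_offset &&
                array_offset < chunks[i].array_offset + chunks[i].length) {
                *disk = chunks[i].disk;
                *disk_offset = chunks[i].disk_offset +
                               (array_offset - chunks[i].array_offset);
                return 0;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* A single copy of the data made of two concatenated chunks on
         * two different disks (sample numbers only). */
        struct raw_chunk chunks[] = {
            { .disk = 0, .disk_offset = 5120, .array_offset = 0,    .length = 4096 },
            { .disk = 1, .disk_offset = 5120, .array_offset = 4096, .length = 8192 },
        };
        int disk;
        uint64_t off;
        if (resolve(chunks, 2, 6000, &disk, &off) == 0)
            printf("array offset 6000 -> disk %d, offset %llu\n",
                   disk, (unsigned long long)off);
        return 0;
    }
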
META-DATA LAYOUT
----------------

First of all, a fixed-size block of space at the start of every disk is set aside for some of the meta-data. This fixed-size block is called the Disk Header. The rest of the disk is divided into contiguous, non-overlapping regions called partitions, with the Disk Header specifying, among other things, the start and size of each partition. There may also be some space on any given disk that is not allocated to the Disk Header or any of the partitions; this space is available for creating new partitions (when creating new or expanding existing pools) or extending existing partitions (when expanding existing pools). This much of the data layout is similar to schemes that are widely used in other systems.

The specification of the layout of partitions is called a partition table, and there are a number of partition table formats that are used on various systems, including one used by most Linux, Windows, and DOS systems. The format of the partition table used by the Low-Level Part of the Disk Pool mechanism is a bit different from the formats used on other systems, but conceptually it is similar. Also, in many systems the partition table is placed in a fixed-size block at the start of the disk called the Master Boot Record. The Disk Header is similar to the Master Boot Record in some ways, but in other ways it is different. Both the Master Boot Record and the Disk Header contain the partition table, but where the Master Boot Record also usually contains the first part of the executable code of a boot loader in addition to the partition table, the Disk Header contains no executable code and instead contains other meta-information for the Low-Level Part of the Disk Pool mechanism.

Each partition consists of three components. The first two components are called Pool Info Blocks and each is always 512 bytes long and contains pool meta-data. The third and final component is called the Partition Payload and contains Raw Data plus, in some cases, some additional meta-data. The size of the Partition Payload is variable and can be any number of bytes, as specified by the partition table in the Disk Header.

DISK HEADER DETAILS
-------------------

The Disk Header is a total of 2560 bytes long and consists of the following fields:

* At offset 0 bytes, a 34 byte ``magic'' header field. This field contains a fixed set of data identifying this disk as part of a pool system used by the Low-Level Part of the Disk Pool mechanism. Any disk that doesn't have the first 34 bytes of the disk set to this special value won't be considered part of a disk pool system.

* At offset 34 bytes, a 6 byte NAS ID. This is six bytes of binary data that uniquely identifies the machine that formatted this disk and created this Disk Header.

* At offset 40 bytes, a 41 byte Disk Name. This is designed to be an ASCII or Unicode string that is human-readable and used as the name of the disk for communications with users. It must be zero-terminated, with every byte after the first zero byte also equal to zero, so the last byte must always be zero.

* At offset 81 bytes, 3 bytes of zero padding.

* At offset 84 bytes, 16 bytes of Disk Unique ID. This is 16 bytes of binary data that should be unique for each disk. It is generated randomly or semi-randomly when the Disk Header is created.

* At offset 100 bytes, a one-byte flag indicating which of the two partition tables is active. A value of zero means the first table and any other value means the second table.

* At offset 101 bytes, 411 unused bytes.

* At offset 512 bytes, the first partition table, which is 1024 bytes long.

* At offset 1536 bytes, the second partition table, which is also 1024 bytes long.

The second, third, and fifth of these fields (the NAS ID, the Disk Name, and the Disk Unique ID) together are used to reference a particular disk, to make it very unlikely that two different disks might be confused for one another.

At any given time, only one of the two partition tables is active, as indicated by the flag at offset 100 bytes. When changes are to be made to the partition table, the new information is written to the inactive partition table and then the flag is changed to make that table active. Since the active table is never being written, it won't ever be in an inconsistent state, even if the write operation is interrupted by a power failure or something else.
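
The Disk Header layout above can also be written down as a packed C structure, with compile-time checks that the field offsets match the list. The field names here are ours, chosen for readability; they are not necessarily the names used in the implementation:

    #include <stddef.h>
    #include <stdint.h>

    #pragma pack(push, 1)
    struct disk_header {
        uint8_t  magic[34];               /* offset 0: marks a Disk Pool disk      */
        uint8_t  nas_id[6];               /* offset 34: NAS that formatted disk    */
        char     disk_name[41];           /* offset 40: zero-terminated disk name  */
        uint8_t  padding1[3];             /* offset 81: zero padding               */
        uint8_t  disk_unique_id[16];      /* offset 84: random per-disk identifier */
        uint8_t  active_table;            /* offset 100: 0 = first table active,
                                             any other value = second table       */
        uint8_t  unused[411];             /* offset 101: unused                    */
        uint8_t  partition_table_1[1024]; /* offset 512: first partition table     */
        uint8_t  partition_table_2[1024]; /* offset 1536: second partition table   */
    };
    #pragma pack(pop)

    /* Verify the layout matches the offsets given in the text. */
    _Static_assert(offsetof(struct disk_header, nas_id) == 34, "NAS ID offset");
    _Static_assert(offsetof(struct disk_header, disk_name) == 40, "Disk Name offset");
    _Static_assert(offsetof(struct disk_header, disk_unique_id) == 84, "Disk Unique ID offset");
    _Static_assert(offsetof(struct disk_header, active_table) == 100, "active flag offset");
    _Static_assert(offsetof(struct disk_header, partition_table_1) == 512, "table 1 offset");
    _Static_assert(offsetof(struct disk_header, partition_table_2) == 1536, "table 2 offset");
    _Static_assert(sizeof(struct disk_header) == 2560, "Disk Header size");
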
The format of each partition table is as follows. Each table is 1024 bytes and is divided into 64 entries of 16 bytes each. Each entry can either specify a partition or be empty. An entry with all 16 bytes equal to zero is empty. If it is not empty, the first 8 bytes of the entry specify the starting block number and the other 8 bytes specify the size in blocks of the partition. Both the start and the size are in terms of 512-byte blocks. The starting block number is relative to the start of the disk and points to the Partition Payload of the partition. The size specifies the size of the Partition Payload. Note that this does not include the two Pool Info Blocks that implicitly come before the payload of each partition. For example, if the partition table specifies a starting block number of 811 and a size of 13 blocks, then the first Pool Info Block for that partition will be block number 809, the second Pool Info Block for that partition will be block number 810, and the Partition Payload will be blocks 811 through 823 inclusive. Note also that since the Disk Header always occupies the first 8 blocks, the lowest valid value for the starting block number of a partition table entry is 10.

POOL INFO BLOCK DETAILS
-----------------------

Each partition has two Pool Info Blocks, and together the Pool Info Blocks determine the pools. The reason there are two Pool Info Blocks per partition is the same as the reason there are two copies of the partition table -- so that while one copy is being modified, the other contains valid information in case the operation is interrupted before being completed. In the case of Pool Info Blocks, though, it is whole sets of Pool Info Blocks that are modified together. We'll refer to the two copies of the Pool Info Blocks as the A copy and the B copy, where the A copy is the one that comes first on the disk and the B copy is the one that comes second. When updating the information for a pool, while all the A copies for all the partitions in the pool are being changed, none of the B copies are being changed, so the B copies will always be in a consistent state if the A copies are not. Similarly, while the B copies are being updated, all the A copies are left alone, so that if the B copies are inconsistent, the A copies will all be consistent.
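
The write ordering implied by this scheme can be sketched as follows. The helper functions and types here are hypothetical, and the explicit sync between the two passes is our assumption about how such a scheme would typically be protected against write reordering; the point of the sketch is only the ordering of the writes:

    /* Hypothetical descriptions of a pool's partitions and of the 512-byte
     * Pool Info Block image to be written. */
    struct partition;
    struct pool_info;

    /* Hypothetical helpers: write one Pool Info Block copy for a partition
     * (copy 0 = the A copy, copy 1 = the B copy), and force all outstanding
     * writes out to the disks. */
    int write_pool_info_block(struct partition *p, int copy,
                              const struct pool_info *info);
    int sync_disks(void);

    /* Update the Pool Info Blocks of every partition in a pool.  While the
     * A copies are being rewritten the B copies are untouched, and vice
     * versa, so at any instant at least one complete, consistent set of
     * Pool Info Blocks exists on disk. */
    int update_pool_info(struct partition **parts, int count,
                         const struct pool_info *new_info)
    {
        /* Pass 1: rewrite every A copy. */
        for (int i = 0; i < count; i++)
            if (write_pool_info_block(parts[i], 0, new_info) != 0)
                return -1;

        /* Make sure every A copy is on disk before touching any B copy. */
        if (sync_disks() != 0)
            return -1;

        /* Pass 2: rewrite every B copy. */
        for (int i = 0; i < count; i++)
            if (write_pool_info_block(parts[i], 1, new_info) != 0)
                return -1;

        return sync_disks();
    }
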
Each Pool Info Block is 512 bytes and consists of the following fields (a struct-style summary of this layout appears at the end of this section):

* At offset 0 bytes, an 81 byte pool name. This is designed to be an ASCII or Unicode string that is human-readable and used as the name of the pool of which this Pool Info Block is a part. It must be zero-terminated, with every byte after the first zero byte also equal to zero, so the last byte must always be zero.

* At offset 81 bytes, 3 bytes of zero padding.

* At offset 84 bytes, 16 bytes of Pool Unique ID. This is 16 bytes of binary data that should be unique for each pool. It is generated randomly or semi-randomly when the pool is created.

* At offset 100 bytes, 6 bytes of the NAS ID of the machine that created this pool. This is the same value that is put in the field at offset 34 bytes in the Disk Header. The reason for putting it here is to help give each pool a unique tag -- by combining the Pool Unique ID, this NAS ID, and the creation time stamp we get a unique tag for the pool that is very unlikely to be shared by any other pool. If two different pools had the same tags, then there might be confusion about which partitions were in which of the pools. But any two pools created on the same NAS should have different time stamps, and two pools created on different NAS boxes should have different NAS IDs. The semi-random data of the Pool Unique ID further decreases the likelihood of a conflict. Note that this NAS ID might be different from the NAS ID of the machine currently using the pool, because disks might have been removed from one NAS box and put into another. Also, the NAS ID of the pool might be different from the NAS ID in the Disk Header of a disk used by that pool if the disk was first claimed by one NAS and then later the pool was created by another NAS. The NAS ID in the Disk Header is not changed when the partition tables are updated -- it is kept matching the NAS ID of the NAS that created the Disk Header on the disk, not the one that most recently changed it. This allows the NAS ID in the Disk Header to be part of a way to distinguish different disks, just as the NAS ID in the Pool Info Block helps allow different pools to be distinguished.

* At offset 106 bytes, 2 bytes of zero padding.

* At offset 108 bytes, 9 bytes of creation time/date stamp for the pool. As with the NAS ID field, this is recorded to help uniquely identify a pool and distinguish pools from one another. The first 4 bytes are the year, the next byte the month (1-12), the next the day of the month (1-31), the next the hour (0-23), the next the minutes (0-59), and the final byte the seconds (0-59). The time/date stamp is in terms of Universal Time.

* At offset 117 bytes, 3 bytes of zero padding.

* At offset 120 bytes, 4 bytes specifying the number of stripes in the pool. A value of 1 means striping is not used -- there must be at least two stripes for striping to be meaningful, and striping with a single stripe is equivalent to no striping. A value of zero is illegal here.

* At offset 124 bytes, 4 bytes specifying the number of mirrors in the pool. A value of 1 means mirroring is not used, that there is only one copy of the data. A value of zero is illegal here since that would imply no copies of the data, so the data wouldn't be stored anywhere.

* At offset 128 bytes, 4 bytes specifying the number of spares in the pool. A value of 0 means no spares are available.

* At offset 132 bytes, 4 bytes specifying the number of the pane of which this partition is a part. Panes are units which are composed through striping, mirroring, and spares to produce the entire data for the pool. The number of panes in a pool is (NST * (NM + NSP)), where NST is the number of stripes (the field at offset 120 bytes above), NM is the number of mirrors (the field at offset 124 bytes above), and NSP is the number of spares. That's because each complete copy of the data is made up of NST panes, over which the data is striped, and the number of copies of the data is NM plus NSP. (The spares aren't really copies of the data, but they are space to put copies of the data if one of the mirrors has a failure, so in terms of the space needed it's the same.) All the panes in a pool are numbered from zero to one less than the total number of panes, and it is that pane number that fills this field at offset 132 bytes. They are numbered as follows: first come all the stripes for the first mirror, then all the stripes for the second mirror, and so on, followed by all the stripes for the first spare, then all the stripes for the second spare, and so on.

* At offset 136 bytes, 4 bytes specifying the number of chunks in this pane. The data for a pane can be spread among multiple partitions, on one disk or across different disks. The data from all the partitions in the pane is concatenated to form the pane. This field specifies how many chunks there are for the pane of which this partition is a part. The minimum value is 1.

* At offset 140 bytes, 4 bytes specifying which chunk in the pane this partition represents. The chunks are numbered from zero to one less than the number of chunks in the pane, and this numbering specifies the order in which the partitions are concatenated to form the pane.

* At offset 144 bytes, 4 bytes specifying the RAID Chunk Size. This is an implementation detail used in the implementation of the code to concatenate partitions and do mirroring and striping. It is always the same value for all Pool Info Blocks in a given pool. Note that the term ``Chunk'' here is not the same as the term ``chunk'' used in describing the fields at offsets 136 and 140.

* At offset 148 bytes, 108 bytes of Partition Specification determining which partition is the start of the next pane of this pool. A Partition Specification is used in this field and several others. In each case, it consists of 81 bytes specifying the name of the disk on which the partition resides, 1 byte of zero padding, 6 bytes of NAS ID matching the NAS ID field in the Disk Header of the disk containing the partition, 16 bytes of Disk Unique ID of the disk containing the specified partition, and 4 bytes giving the number of the partition being specified (i.e. the zero-based index into the partition table corresponding to that partition), for a total of 81 + 1 + 6 + 16 + 4 = 108 bytes. If this partition is in the last pane, this field specifies the first pane in the pool, so these links form a circle through all the panes. If there is only one pane, this field just points back to the start of the same pane the partition is in, which can be this partition itself if the partition is the first (or only) partition in its pane.

* At offset 256 bytes, 108 bytes of Partition Specification determining the next chunk in the pane of this partition. If this partition is the last in the pane, this field points back to the first in the pane. So if the pane consists of only a single partition, this field specifies the partition itself.

* At offset 364 bytes, a 4 byte flag indicating whether or not this pool is currently in the midst of a resizing operation. A value of zero means it is not, a value of one means it is, and all other values are illegal. Information about resizing operations in progress is kept so that if the operation is interrupted, the next time the machine is booted it can clean up the pool to either the state before the resizing operation or the state after, so data isn't lost. The rest of the fields in the Pool Info Block (except for the last field, the validity marker) are only used when a resizing operation is in progress on the pool.

* At offset 368 bytes, 8 bytes indicating how far a resize operation has progressed in the backward data movement pass. When resizing a pool, each partition can grow or shrink in size (including shrinking to zero, for a partition being removed, or growing from zero, for a partition being added), so data might have to be moved forward or backward within the pane. The ordering of these data movements is very important for avoiding overwriting data before it is copied where it is needed.
This data movement is done by first moving forward through the pane copying data backward and then moving backward through the pane copying data forward. This field specifies the number of bytes of progress that have been made on the backward data movement pass. If the resize operation is interrupted and has to be cleaned up, the backward data movement is restarted at the correct point based on this field.

* At offset 376 bytes, 8 bytes indicating how far a resize operation has progressed in the forward data movement pass. This is the counterpart to the previous field and is also measured in terms of the number of bytes of progress made on this data movement pass.

* At offset 384 bytes, 8 bytes indicating the old size in kilobytes of this partition for a resize operation in progress. Note that this can be zero to indicate that this partition is being newly added in this resizing operation.

* At offset 392 bytes, 8 bytes indicating the new size in kilobytes of this partition for a resize operation in progress. Note that this can be zero to indicate that this partition is being removed by the resizing operation.

* At offset 400 bytes, 96 bytes of zero padding.

* At offset 496 bytes, 16 bytes of magic data indicating that the block is a valid Pool Info Block. There is a binary value that this field must match for this Pool Info Block to be considered valid. This ensures that if this block doesn't contain valid Pool Info Block data, it isn't interpreted as such. If this field is cleared, the system won't try to use this Pool Info Block.

When the system boots or when new disks are added, the software scans the disks for pools. It does this by first finding all the partitions by reading the Disk Headers. Then it reads both copies of the Pool Info Blocks for each partition. Pool Info Blocks that aren't valid are discarded. Then it tries to piece together pools. It starts with the A copies of the Pool Info Blocks and tries to piece them together into pools. Using the Partition Specifications at offsets 148 and 256 in each Pool Info Block, it tries to follow from one partition to the next to put together all the partitions in a pool. If one of the Partition Specifications points to a disk that cannot currently be found, the pool is considered incomplete and it is assumed that the rest of the pool is on a disk or disks that aren't currently attached to the system, either because they were temporarily removed or because they are broken. If one of the Partition Specifications points to a disk that can be found but the specified partition's A Pool Info Block doesn't match (any of the first 10 fields doesn't match), then the Pool Info Block that pointed to the mismatching partition and all others linked to it are discarded as being invalid. Then, after the A Pool Info Blocks have all been considered, the B copies of the Pool Info Blocks for partitions not yet used in valid pools are considered in the same way. So if a given pool has internally-consistent A copies of Pool Info Blocks specifying it, it will be found, and if it has internally-consistent B copies of Pool Info Blocks specifying it, it will be found. Since a pool being modified always has one set or the other valid, no pool should be lost this way.
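
For reference, the Pool Info Block layout described in the field list above can be summarized as a packed C structure with compile-time checks of the offsets. The field names are ours, not necessarily those used in the implementation, and since the text does not specify the byte order of the multi-byte integer fields, the fixed-width integer types below are only placeholders for fields of the stated widths:

    #include <stddef.h>
    #include <stdint.h>

    #pragma pack(push, 1)
    /* A Partition Specification, used to point at a partition on some disk. */
    struct partition_spec {
        char     disk_name[81];       /* name of the disk holding the partition    */
        uint8_t  padding[1];          /* zero padding                              */
        uint8_t  nas_id[6];           /* NAS ID from that disk's Disk Header       */
        uint8_t  disk_unique_id[16];  /* Disk Unique ID of that disk               */
        uint32_t partition_index;     /* zero-based index into its partition table */
    };

    struct pool_info_block {
        char     pool_name[81];          /* offset 0                                 */
        uint8_t  padding1[3];            /* offset 81                                */
        uint8_t  pool_unique_id[16];     /* offset 84                                */
        uint8_t  creator_nas_id[6];      /* offset 100                               */
        uint8_t  padding2[2];            /* offset 106                               */
        uint8_t  creation_time[9];       /* offset 108: year (4 bytes), month, day,
                                            hour, minute, second, in Universal Time */
        uint8_t  padding3[3];            /* offset 117                               */
        uint32_t num_stripes;            /* offset 120                               */
        uint32_t num_mirrors;            /* offset 124                               */
        uint32_t num_spares;             /* offset 128                               */
        uint32_t pane_number;            /* offset 132                               */
        uint32_t chunks_in_pane;         /* offset 136                               */
        uint32_t chunk_number;           /* offset 140                               */
        uint32_t raid_chunk_size;        /* offset 144                               */
        struct partition_spec next_pane; /* offset 148: start of the next pane       */
        struct partition_spec next_chunk;/* offset 256: next chunk in this pane      */
        uint32_t resize_in_progress;     /* offset 364: 0 = no, 1 = yes              */
        uint64_t backward_pass_progress; /* offset 368: bytes done, backward pass    */
        uint64_t forward_pass_progress;  /* offset 376: bytes done, forward pass     */
        uint64_t old_size_kb;            /* offset 384: old partition size in KB     */
        uint64_t new_size_kb;            /* offset 392: new partition size in KB     */
        uint8_t  padding4[96];           /* offset 400                               */
        uint8_t  valid_magic[16];        /* offset 496: validity marker              */
    };
    #pragma pack(pop)

    _Static_assert(sizeof(struct partition_spec) == 108, "Partition Specification size");
    _Static_assert(offsetof(struct pool_info_block, creation_time) == 108, "time stamp offset");
    _Static_assert(offsetof(struct pool_info_block, num_stripes) == 120, "stripes offset");
    _Static_assert(offsetof(struct pool_info_block, next_pane) == 148, "next pane offset");
    _Static_assert(offsetof(struct pool_info_block, next_chunk) == 256, "next chunk offset");
    _Static_assert(offsetof(struct pool_info_block, resize_in_progress) == 364, "resize flag offset");
    _Static_assert(offsetof(struct pool_info_block, valid_magic) == 496, "validity marker offset");
    _Static_assert(sizeof(struct pool_info_block) == 512, "Pool Info Block size");
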