The size of a block is usually a power of 2. By default, the block size of the ext2 file system is 4K. Research shows that 4K is an optimal size for a disk block. If the size of a file is not a multiple of 4K, part of the last disk block is wasted: half a block on average, and in the worst case almost an entire block, when only one byte of the file occupies it. Since all implementations still limited file names to 255 characters, this upper 8-bit value was always 0.
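The internal fragmentation described above can be computed directly. A minimal sketch (the helper name is illustrative, not part of any ext2 interface):

```c
#include <stdint.h>

/* Hypothetical helper: bytes wasted in the last block of a file
 * (internal fragmentation), given the block size and the file size.
 * On average this is half a block; in the worst case it is
 * block_size - 1, for a file extending one byte into its last block. */
static unsigned long wasted_bytes(unsigned long block_size,
                                  unsigned long file_size)
{
    unsigned long rem = file_size % block_size;
    return rem == 0 ? 0 : block_size - rem;
}
```

For example, a 4097-byte file on a 4K-block filesystem wastes 4095 bytes of its second block.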
Name of the entry. The ISO-Latin-1 character set is expected on most systems. The name must be no longer than 255 bytes after encoding. Using the standard linked-list directory format can become very slow once the number of files starts growing. To improve performance in such cases, a hashed index is used, which allows the particular file being searched for to be located quickly. In order to maintain backward compatibility with older implementations, the indexed directory also maintains a linked directory format side by side.
In case there is any discrepancy between the indexed and linked directories, the linked directory is preferred. This backward compatibility is achieved by placing fake directory entries at the beginning of block 0 of the indexed directory's data blocks.
An indirect index is another data block, allocated to the directory inode, that contains directory entries. The indexed directory entries are used to quickly look up the inode number associated with the hash of a filename. The first indexed directory entry, rather than containing an actual hash and block number, contains the maximum number of indexed directory entries that can fit in the block and the actual number of indexed directory entries stored in the block.
The format of this special entry is detailed in Table 4. The other directory entries are sorted by hash value starting from the smallest to the largest numerical value. Normally, two logical blocks of the file will need to be accessed, and one or two metadata index blocks.
The effect of the metadata index blocks can largely be ignored in terms of disk access time since these blocks are unlikely to be evicted from cache. There is some small CPU cost that can be addressed by moving the whole directory into the page cache.
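The lookup path described above can be sketched as a binary search over the hash-sorted index entries. The structure and field names below are illustrative assumptions modeled on the description (the authoritative on-disk layout is the one in Table 4):

```c
#include <stdint.h>

/* Illustrative index entry: covers hashes from `hash` up to (but not
 * including) the next entry's hash, and names the directory block
 * holding those entries. */
struct dx_entry {
    uint32_t hash;   /* lowest hash value this entry covers */
    uint32_t block;  /* directory block for that hash range */
};

/* Illustrative special first entry: carries no hash, but records how
 * many index entries fit in the block and how many are in use. */
struct dx_countlimit {
    uint16_t limit;  /* maximum number of entries that fit */
    uint16_t count;  /* number of entries currently stored */
};

/* Binary search for the block covering `hash`; entry 0 covers the
 * lowest hashes, so the search starts at entry 1. */
static uint32_t dx_find_block(const struct dx_entry *entries, int count,
                              uint32_t hash)
{
    int lo = 1, hi = count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (entries[mid].hash <= hash)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return entries[lo - 1].block;
}
```

Because the entries are sorted, locating the right leaf block costs O(log n) comparisons within a single cached index block.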
Insertion of new entries into the directory is considerably more complex than lookup, due to the need to split leaf blocks when they become full, and to satisfy the conditions that allow hash key collisions to be handled reliably and efficiently. The details of splitting and hash collision handling are somewhat messy, but I will be happy to dwell on them at length if anyone is interested; I'll just summarize here. In brief, when a leaf node fills up and we want to put a new entry into it, the leaf has to be split, and its share of the hash space has to be partitioned.
The most straightforward way to do this is to sort the entries by hash value and split somewhere in the middle of the sorted list. I used Combsort for this, although Quicksort would have been just as good in this case, since average-case performance is more important than worst-case. An alternative approach would be just to guess a median value for the hash key; the partition could then be done in linear time, but the resulting poorer partitioning of the hash key space outweighs the small advantage of the linear partition algorithm.
In any event, the number of entries needing sorting is bounded by the number that fit in a leaf. Some complexity is introduced by the need to handle sequences of hash key collisions.
It is desirable to avoid splitting such sequences between blocks, so the split point of a block is adjusted with this in mind. But the possibility still remains that, if the block fills up with identically-hashed entries, the sequence may have to be split. This situation is flagged by placing a 1 in the low bit of the index entry that points at the successor block, which is naturally interpreted by the index probe as an intermediate value without any special coding.
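The split-point adjustment just described can be sketched as follows. This is a simplified illustration, not the actual kernel code: it takes the hash-sorted entries of a full leaf, starts from the middle, and moves the split so a run of identical hashes is not divided; only if the entire block shares one hash must the run be split, which is what the low-bit flag records:

```c
#include <stdint.h>

/* Choose a split index for a full leaf whose entry hashes are sorted
 * ascending. Returns the split index; sets *must_flag to 1 when a
 * collision run unavoidably straddles the split (all entries share
 * one hash), meaning the successor's index entry needs its low bit
 * set, as described above. */
static int choose_split(const uint32_t *hashes, int n, int *must_flag)
{
    int split = n / 2;

    /* Move forward past any collision run straddling the middle. */
    while (split < n && hashes[split] == hashes[split - 1])
        split++;

    if (split == n) {
        /* The run reaches the end; try moving backward instead. */
        split = n / 2;
        while (split > 1 && hashes[split] == hashes[split - 1])
            split--;
    }

    *must_flag = (hashes[split] == hashes[split - 1]);
    return split;
}
```

In the common case the split lands near the middle; the adjustment only matters when many entries hash identically.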
Thus, handling the collision problem imposes no real processing overhead, just some extra code and a slight reduction in the hash key space. The hash key space remains sufficient for any conceivable number of directory entries, up into the billions. The exact properties of the hash function critically affect the performance of this indexing strategy, as I learned by trying a number of poor hash functions, at times intentionally.
A poor hash function will result in many collisions or poor partitioning of the hash space. To illustrate why the latter is a problem, consider what happens when a block is split such that it covers just a few distinct hash values. The probability of later entries hashing into the same small hash space is very small.
Once a block is split, if its hash space is too small it tends to stay half full forever, an effect I observed in practice. After some experimentation I came up with a hash function that gives reasonably good dispersal of hash keys across the entire 31-bit key space. But the current hash function is just a placeholder, waiting for a better version based on some solid theory. I currently favor the idea of using crc32 as the default hash function, but I welcome suggestions.
Inevitably, no matter how good a hash function I come up with, somebody will come up with a better one later. For this reason the design allows for additional hash functions to be added, with backward compatibility. This is accomplished simply, by including a hash function number in the index root. If a new, improved hash function is added, all the previous versions remain available, and previously created indexes remain readable.
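The versioning scheme can be sketched as a dispatch on the hash function number stored in the index root. The hash implementations below are deliberately simple placeholders for illustration; they are not the actual ext2 hash functions, and all names here are assumptions:

```c
#include <stdint.h>

/* Illustrative hash function numbers, as would be recorded in the
 * index root so old indexes stay readable when new functions appear. */
enum dx_hash_version { DX_HASH_LEGACY = 0, DX_HASH_V1 = 1 };

/* Placeholder hash: simple multiplicative hash, masked to the
 * 31-bit key space mentioned above (the low bit is reserved for
 * collision flagging). */
static uint32_t hash_legacy(const char *name)
{
    uint32_t h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h & 0x7fffffff;
}

/* Placeholder "improved" hash: shift-and-xor mixing, same key space. */
static uint32_t hash_v1(const char *name)
{
    uint32_t h = 0x12a3fe2d;  /* arbitrary seed, for illustration */
    while (*name)
        h = (h << 5) ^ (h >> 27) ^ (unsigned char)*name++;
    return h & 0x7fffffff;
}

/* Dispatch on the version stored in the index root. */
static uint32_t dx_hash(int version, const char *name)
{
    switch (version) {
    case DX_HASH_V1: return hash_v1(name);
    default:         return hash_legacy(name);
    }
}
```

Adding a new function means adding a case; indexes created with older versions keep working because their version number still selects the old code.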
Of course, the best strategy is to have a good hash function right from the beginning. The initial, quick hack has produced results that certainly have not been disappointing. OK, if you have read this far then this is no doubt the part you've been waiting for.
In short, the performance improvement over normal Ext2 has been stunning. With very small directories performance is similar to standard Ext2, but as directory size increases standard Ext2 quickly blows up quadratically, while htree-enhanced Ext2 continues to scale linearly.
Uli Luckas ran benchmarks for file creation in various sizes of directories, ranging from 10,000 to 90,000 files. The results are pleasing: total file creation time stays very close to linear, versus the quadratic increase with normal Ext2. All of these tests are CPU-bound, which may come as a surprise. The directories fit easily in cache, and the limiting factor in the case of standard Ext2 is the lookup of directory blocks in the buffer cache, and the low-level scan of directory entries.
In the case of htree indexing there are a number of costs to be considered, all of them pretty well bounded. Notwithstanding, there are a few obvious optimizations to be done; in time they will be applied, and we can expect to see another doubling or so in performance.
There will be a very slight performance hit when the directory gets big enough to need a second level. Because of caching this will be very small. Traversing the directory's metadata index blocks will be a bigger cost, and once again, this cost can be reduced by moving the directory blocks into the page cache. Typically, we will traverse 3 blocks to read or write a directory entry, and that number increases somewhat with really huge directories.
But this is really nothing compared to normal Ext2, which traverses several hundred blocks in the same situation. Most of the attributes of a file (and likewise of a directory, symlink, or device node) are stored in its inode; some other attributes are only available as extended attributes. Under most implementations, the owner and group are 16-bit values, but on some recent Linux and Hurd implementations the owner and group IDs are 32-bit.
Extended attributes are name:value pairs associated permanently with files and directories, similar to the environment strings associated with a process. An attribute may be defined or undefined; if it is defined, its value may be empty or non-empty. Extended attributes are extensions to the normal attributes which are associated with all inodes in the system. They are often used to provide additional functionality to a filesystem; for example, additional security features such as Access Control Lists (ACLs) may be implemented using extended attributes.
Extended attributes are accessed as atomic objects. Reading retrieves the whole value of an attribute and stores it in a buffer. Writing replaces any previous value with the new value. Extended attributes are stored on disk blocks allocated outside of any inode. Inodes which have all identical extended attributes may share the same extended attribute block. The attribute values are on the same block as their attribute entry descriptions, aligned to the end of the attribute block.
This allows for additional attributes to be added more easily. The size of entry headers varies with the length of the attribute name. The block header is followed by multiple entry descriptors. The entry descriptors are sorted by attribute name, so that two extended attribute blocks can be compared efficiently.
Attribute values are aligned to the end of the block, stored in no specific order. No additional gaps are left between them.
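The packing of values at the end of the block can be sketched with a small offset calculation. This is an illustration of the layout rule described above (descriptors grow from the front, values are packed back-to-front with no gaps); the function name and interface are assumptions, not part of the on-disk format:

```c
#include <stdint.h>

/* Given the block size, the total bytes already consumed by values
 * at the end of the block, and the size of a new value, return the
 * offset at which the new value would be stored. Values are packed
 * contiguously from the end of the block toward the front. */
static uint32_t ea_next_value_offset(uint32_t block_size,
                                     uint32_t values_bytes_used,
                                     uint32_t value_size)
{
    return block_size - values_bytes_used - value_size;
}
```

Keeping the descriptor area and the value area growing toward each other is what makes it cheap to add another attribute: the space between them is the block's free space.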
This value is incremented every time a link is created to this attribute block and decremented when a link is destroyed.
Whenever this value reaches 0, the attribute block can be freed. This effectively restricts the amount of extended attributes to what can fit in a single block. Procedure 5. Procedure to compute the Extended Attribute Header Hash: do a cyclic bit shift of 16 bits to the left of the 32-bit hash value, effectively swapping the upper and lower 16 bits of the hash.
Perform a bitwise OR between the extended attribute entry hash and the header hash being computed. The following bits are currently defined. Enabling this bit will cause random data to be written over the file's content several times before the blocks are unlinked. Make sure to study the implementation notes before relying on this option.
When supported by the implementation, setting this bit will cause the deleted data to be moved to a temporary location, where the user can restore the original file without any risk of data loss. This is most useful when using ext2 on a desktop or workstation.
Ext2fs implements fast symbolic links. A fast symbolic link does not use any data block on the filesystem. The target name is not stored in a data block but in the inode itself. This policy can save some disk space (no data block needs to be allocated) and speeds up link operations (there is no need to read a data block when accessing such a link). Of course, the space available in the inode is limited, so not every link can be implemented as a fast symbolic link.
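The decision is simply a length check against the in-inode space, as sketched below. This is an illustration assuming the 60-character limit stated in the text; the macro and function names are made up for the example:

```c
#include <string.h>

/* In-inode space available for a symlink target (the 60-character
 * limit described in the text). */
#define FAST_SYMLINK_MAX 60

/* Returns 1 if the target fits inside the inode (a "fast" symlink),
 * 0 if a data block must be allocated for it instead. */
static int is_fast_symlink(const char *target)
{
    return strlen(target) <= FAST_SYMLINK_MAX;
}
```

A link to `/usr/bin/gcc` is fast; a deeply nested 100-character path falls back to the normal data-block representation.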
The maximal size of the target name in a fast symbolic link is 60 characters. We plan to extend this scheme to small files in the near future. Ext2fs keeps track of the filesystem state.
A special field in the superblock is used by the kernel code to indicate the status of the file system. At boot time, the filesystem checker uses this information to decide if a filesystem must be checked. The kernel code also records errors in this field.
The filesystem checker tests this to force the check of the filesystem regardless of its apparently clean state. Always skipping filesystem checks may sometimes be dangerous, so Ext2fs provides two ways to force checks at regular intervals. A mount counter is maintained in the superblock.
A last check time and a maximal check interval are also maintained in the superblock. These two fields allow the administrator to request periodical checks.
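The checker's decision combines the superblock fields just described: the filesystem state, the mount counter against its maximum, and the last check time against the maximal check interval. A minimal sketch; the structure and field names are illustrative, not the actual superblock layout:

```c
#include <stdint.h>

/* Illustrative view of the superblock fields the checker consults. */
struct sb_check_info {
    int      clean;           /* state flag: 1 if unmounted cleanly */
    uint16_t mount_count;     /* mounts since the last check */
    uint16_t max_mount_count; /* force a check when reached */
    uint32_t last_check;      /* time of the last check (seconds) */
    uint32_t check_interval;  /* maximal interval between checks */
};

/* Returns 1 if the filesystem must be checked at boot time. */
static int needs_check(const struct sb_check_info *sb, uint32_t now)
{
    if (!sb->clean)
        return 1;  /* errors or unclean unmount: always check */
    if (sb->mount_count >= sb->max_mount_count)
        return 1;  /* too many mounts since the last check */
    if (now - sb->last_check >= sb->check_interval)
        return 1;  /* too much time since the last check */
    return 0;
}
```

Either threshold alone is enough to force a check, which is exactly why a clean state flag is not sufficient to skip one.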
When the maximal check interval has been reached, the checker ignores the filesystem state and forces a filesystem check. Ext2fs offers tools to tune the filesystem behavior. The tune2fs program can be used to modify, among other things, the error behavior.
Mount options can also be used to change the kernel error behavior. An attribute allows users to request secure deletion of files. When such a file is deleted, random data is written in the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor. Last, new types of files inspired by 4.4BSD have been added. Immutable files can only be read: nobody can write to or delete them. This can be used to protect sensitive configuration files.
Append-only files can be opened in write mode, but data is always appended at the end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log files, which can only grow. A filesystem is made up of block groups. However, block groups are not tied to the physical layout of the blocks on the disk, since modern drives tend to be optimized for sequential access and hide their physical geometry from the operating system.
Each block group contains a redundant copy of crucial filesystem control information (the superblock and the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the inode table, and data blocks).
The structure of a block group is represented in this table: Super Block | FS Descriptors | Block Bitmap | Inode Bitmap | Inode Table | Data Blocks. Using block groups is a big win in terms of reliability: since the control structures are replicated in each block group, it is easy to recover from a filesystem where the superblock has been corrupted.
In Ext2fs, directories are managed as linked lists of variable length entries. Each entry contains the inode number, the entry length, the file name and its length.
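The entry layout just described can be sketched as a C structure. The field names below follow the common ext2 convention but are shown as an illustration rather than an authoritative layout; the record length is what links one variable-length entry to the next:

```c
#include <stdint.h>

/* Sketch of a variable-length directory entry. The name bytes are
 * stored inline immediately after this header; rec_len is the offset
 * from the start of this entry to the start of the next one, so the
 * entries in a directory block form an on-disk linked list. */
struct ext2_dir_entry {
    uint32_t inode;     /* inode number; 0 marks an unused entry */
    uint16_t rec_len;   /* total length of this entry record */
    uint8_t  name_len;  /* length of the name, in bytes */
    uint8_t  file_type; /* entry type hint (file, dir, symlink, ...) */
    /* char name[]; name bytes follow, typically padded to 4 bytes */
};
```

Scanning a block means reading the header at offset 0, then repeatedly advancing by `rec_len` until the end of the block.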
By using variable-length entries, it is possible to implement long file names without wasting disk space in directories. Ext2fs also takes advantage of the buffer cache by performing readaheads: when a block has to be read, the kernel requests I/O on several contiguous blocks. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are normally performed during sequential reads on files, and Ext2fs extends them to directory reads, either explicit reads (readdir(2) calls) or implicit ones (namei kernel directory lookups). Ext2fs also contains many allocation optimizations.
Block groups are used to cluster together related inodes and data: the kernel code always tries to allocate data blocks for a file in the same group as its inode. This is intended to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
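This goal-directed policy can be sketched as: prefer a free block in the same group as the file's inode, and fall back to any other group with space. A simplified illustration only; the names and interface are assumptions, and the real allocator is considerably more refined:

```c
/* Pick a block group for a new data block. group_has_free[g] is
 * nonzero if group g has free blocks; inode_group is the group
 * holding the file's inode. Returns the chosen group, or -1 if
 * the filesystem is full. */
static int alloc_group(const unsigned char *group_has_free, int ngroups,
                       int inode_group)
{
    if (group_has_free[inode_group])
        return inode_group;            /* same group: minimizes seeks */
    for (int g = 0; g < ngroups; g++)  /* fallback: first group with space */
        if (group_has_free[g])
            return g;
    return -1;                         /* no free blocks anywhere */
}
```

The preference for the inode's own group is what keeps an inode and its data blocks physically close on disk.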
When writing data to a file, Ext2fs preallocates up to 8 adjacent blocks when allocating a new block. This preallocation achieves good write performance under heavy load. It also allows contiguous blocks to be allocated to files, which speeds up future sequential reads. These two allocation optimizations produce very good locality: related files are placed together through block groups, and related blocks are placed together through the 8-block clustering of block allocations.

The Ext2fs library

To allow user mode programs to manipulate the control structures of an Ext2 filesystem, the libext2fs library was developed.
This library provides routines which can be used to examine and modify the data of an Ext2 filesystem, by accessing the filesystem directly through the physical device. The Ext2fs library was designed to allow maximal code reuse through the use of software abstraction techniques.
For example, several different iterators are provided: one calls a user-provided function for each block of an inode, and another allows a user-provided function to be called for each file in a directory. Many of the Ext2fs utilities (mke2fs, e2fsck, tune2fs, dumpe2fs, and debugfs) use the Ext2fs library. This greatly simplifies the maintenance of these utilities, since any changes to reflect new features in the Ext2 filesystem format need only be made in one place: in the Ext2fs library.
This code reuse also results in smaller binaries, since the Ext2fs library can be built as a shared library image. Because the interfaces of the Ext2fs library are so abstract and general, new programs which require direct access to the Ext2fs filesystem can very easily be written.
For example, the Ext2fs library was used during the port of the 4.4BSD dump and restore backup utilities. Very few changes were needed to adapt these tools to Linux: only a few filesystem-dependent functions had to be replaced by calls to the Ext2fs library.
The Ext2fs library provides access to several classes of operations. The first class are the filesystem-oriented operations.
A program can open and close a filesystem, read and write the bitmaps, and create a new filesystem on the disk. Functions are also available to manipulate the filesystem's bad blocks list. The second class of operations affects directories. For this feature to be useful, the inode must be sufficiently large.
This feature is supported by ext2, ext3, and ext4. Setting the has_journal filesystem feature is equivalent to using the -j option. This feature is supported by ext3 and ext4, and ignored by the ext2 file system driver. The block size for an external journal must be the same as that of the file system which uses it. Very old kernels could not handle large files, so this feature flag was used to prohibit those kernels from mounting file systems that they could not understand.
The block groups used to store the backup superblock and block group descriptors are recorded in the superblock, but typically one will be located at the beginning of block group 1, and one in the last block group in the file system.