Wednesday, March 31, 2010

How Ext4 Extents Work?

How Ext4 Extents Work?

Earlier Ext2 and Ext3 had the limitation on the size of the file. It used 32 bit block number to access the data blocks. So that limited the maximum size of file to be 2^32 * blocksize(eg. 4k**) = 16TB*. Also the access time for large Files were slow because in had to go through lots of indirection.Ext4 Filesystem can support very large files it has 48 bits to adress a block. Also its uses extents to store data so access time is faster for large files.

The information for the data blocks are stored in the i_data of the inode structure. In a system without Extents, the first 12 entries contains the block numbers of the first 12 blocks of data. Then
 it contains the block number for the Indirect blocks. That block contains the array of block numbers which point to the data. Similarly, there is Double indirect block and triple indirect block. So if we need to get the data from a very large file, we need to go through those indirection.


How to determine if the Filesystem uses Extents or indirect Mapping?

To determine whether the inode has extent based mapping or indirect mapping. We need to look at the EXT2_EXTENTS_FL bit in the i_flags in inode structure. The root directory always has the indirect mapping instead of the block mapping.

What does the Ext4 Extents Data Structure Look like and where it is stored?

In extenxt based block mapping, the i_data of inode contains Extent structures.  There is a extent header, Extent and Extent index. The following structures define those structures.

/*
 * This is the extent on-disk structure.
 * It's used at the bottom of the tree.
 */
struct ext4_extent {
    uint32_t ee_block; /* first logical block extent covers */
    uint16_t ee_len; /* number of blocks covered by extent */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start_lo; /* low 32 bits of physical block */
};

/*
 * This is index on-disk structure.
 * It's used at all the levels except the bottom.
 */
typedef struct ext4_extent_idx {
    uint32_t  ei_block;       /* index covers logical blocks from 'block' */
    uint32_t  ei_leaf_lo;     /* pointer to the physical block of the next *
                                 * level. leaf or next index could be there */
    uint16_t  ei_leaf_hi;     /* high 16 bits of physical block */
    uint16_t   ei_unused;
};

/*
 * Each block (leaves and indexes), even inode-stored has header.
 */
typedef struct ext4_extent_header {
    uint16_t  eh_magic;       /* probably will support different formats */
    uint16_t  eh_entries;     /* number of valid entries */
    uint16_t  eh_max;         /* capacity of store in entries */
    uint16_t  eh_depth;       /* has tree real underlying blocks? */
    uint32_t  eh_generation;  /* generation of the tree */
};

(NOTE: They are stored as little endean and Linux code has __le32 types used)

The Extent is implemented as a B+ Tree. Only the leaf nodes has the Extent Structure. Others have Extent Index Structure. The extent information starts with the header and then either Extent or Extent Index.In the i_data there is only space for header and 3 more Extent structures. If more extent information needs to be stored, than whole bocks with extent index structures are used. In the non leaf nodes, extent index structures are used. It contains the block number of the block where next level of nodes are stored.
The logic can be described by the following 2 figures.

(Please click the images to see full image)






The eh_magic field in the header is always 0xf30a (little endean).
The eh_entries field determines the no of valid extents in that extent array.
The eh_depth is the depth of the B+ tree. if the depth is 0, the extents in that array are always extent structures and not the extent index structures.

I have implemented this feature in ext2read project (http://ext2read.sf.net).It hasn't been tested yet though for very large files. The Ext4 also uses HTree for directory entries. I am currently studying it and implementing.

Please let me know if you have any questions.
Manish Regmi (regmi dot manish at gmail dot com)


Monday, March 15, 2010

Ext4 and LVM support

It looks like the application does not work with Ext4 partitions. I am currently working on Ext4 and LVM support. I am completely redesigning the GUI now With QT4. Stay tuned....

Friday, March 12, 2010

Ext2read Documentation


Documentation

Every operating system provides facility for the persistent storage and management of data. Those data are stored in a container called files. For this purpose the operating system provides a file system that allows users to organize, manipulate and access the files.
To provide an efficient and convenient access to the files, the operating system imposes one or more file systems to allow data to be stored. Linux uses Ext2 (Second Extended File System) file system (it also supports tens of other file systems like fat, ntfs e.t.c.). Recently the ext2 file system has journaling support and called Ext3 file system (though the most data structures are same).
This tutorial will try to show you what is inside the hard disk. What is inside the Ext2/3 file system? And how our program works?

Inside Hard Disk

This part of the text will tell you the organization of the hard disk, how the partition information is stored in a dos/windows based systems. This text will also tell us how to retrieve the data from the disk and how ext2read does that.

Hard Disks are the most popular storage devices. We cannot imagine a computer without a hard disk. Hard disks offer high capacity and low costs.
The organization of disk.
In a disk, the data is stored in the surface of a circular disc called platters. The platters are double sided and the data is read by a read/write head. So, there are two heads per platter. There are many of them and each of them is divided into circular rings called tracks. The collection of tracks that are exactly above or below the other tracks is called a cylinder. Again each track is divided into a number of sectors (usually 64). Thus, sectors are the smallest addressable unit in disk. The disk can also be viewed as a large array of sectors.
Addressing the disk sectors.
The data in the disk can be addressed in mainly two ways. They are the CHS addressing and the LBA addressing.
The CHS (cylinder/head/sector) addressing uses the exact cylinder number, sector number and the head number to address a disk. The PC bios rely on this technique for booting the OS. The bios interrupt 0×13 function 0×2 (for details see Ralph brown’s interrupt list) can be used to read the disk using CHS addressing.
In the LBA (logical block addressing), the disk is viewed as a large array of sectors. The sector to address is number of sectors from the starting of disk. The first sector is 0. For reading the disk using LBA we use bios interrupt 0×13, function 0×42 (also called IBM’s int 13 extensions).
Reading the disk sectors
The disk sectors can be in many ways. Firstly, we can use the Operating System’s system call facility like open (), read () etc. Secondly, we can use IO instructions to read and write the IO ports of the disk (0×1f0 – 0×1f7 for 1st IDE). And thirdly using the bios (or DOS) interrupts.
Reading disk sectors from Dos
We can use bios interrupts to read the sectors in real mode OS like DOS. After inserting the values in the general purpose registers (function no. in AX, ) we call int 0×13. After returning from interrupt it will fill the buffer by sector’s content.
A C language example.
BOOL ReadSect(BYTE disk, int nsects,DWORD lsects,void* data)
{
union REGS iregs, oregs;
struct SREGS sregs;
iregs.h.ah = 0×02;
iregs.h.al = nsects;
iregs.x.bx = FP_OFF (data);
sregs.es = FP_SEG (data);
iregs.h.ch = (BYTE) track;
iregs.h.cl = (BYTE) sector;
iregs.h.dh = (BYTE) head;
iregs.h.dl = disk;
int86x (0×13, &iregs, &iregs, &sregs);
if (iregs.x.cflag)
Return FALSE;
Else
Return TRUE;
}
The above method has limitations of reading up to 1034 cylinders so we should use extend read method. See our source code for details. Bios and DOS interrupt details can be found on Ralph Brown’s interrupt list.
Reading sectors from UNIX and Windows.
In UNIX (and UNIX like systems) everything is file, so a disk is a file and is identified by a block device file /dev/hda (varies according to UNIX implementation). It can be Opened, read written and seeked as a simple file.
In Windows (NT), reading disk is similar to UNIX. The disk is identified as \\\\.\\PhysicalDrive0 and partitions as \\\\.\\c:. We simply open a file using CreateFile()(win32 API), read using ReadFile(), and seek using SetFilePointer(). Here are the examples.
/* Unix Implementation (User mode)*/
BOOL ReadSect(BYTE disk, int nsects,DWORD lsects,void* data)
{
int desc;
desc = open(“/dev/hda”, O_READ); /* file name differs from one implementation to other */
lseek (desc, lsects*512, SEEK_SET);
read (desc, data, nsects*512);
}
/* Windows Implementation */
BOOL ReadSect(BYTE disk, int nsects,DWORD lsects,void* data)
{
HANDLE hDevice;
DWORD dummy;
hDevice = CreateFile(“////.//PhysicalDrive0”, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE,NULL, OPEN_EXISTING, 0, NULL);
if(!hDevice)
return FALSE;
SetFilePointer(hDevice, (lsects*512), NULL, FILE_BEGIN);
if(!ReadFile(hDevice, data, (512*nsects), &dummy, NULL))
return FALSE;
CloseHandle(hDevice);
Return TRUE;
}
The Partition structure
In Dos like system, the partition table entry is stored in a data structure called Master Boot Record (MBR). MBR is the most important data structure in the disk. It lies on the first sector of the disk and is created by disk partitioning programs like fdisk. It is the most critical part of the disk. If it is damaged or erased accidentally, all the data will be lost.
The MBR structure contains of three parts: the master boot code, partition table and the signature. The master boot code is a small program which scans the partition table for the active partition, finds the starting sector of the active partition, loads the boot sector from the active partition to the memory and transfers the execution flow to the loaded code.
The last two bytes of the MBR is a magic number ‘0×55AA’ which is called the signature.
The information of the partitions lies in the Partition table entry. It starts at 446 bytes and is 64 bytes long. It contains 4 partition table entries each 16 bytes long. Thus, there can be only four primary partitions in a disk. The fields of the partition structure are given below in the form of C structure.
Struct partition {
BYTE bootIndicator;
BYTE startingHead;
unsigned startingSector :6;
unsigned startingCylinder:10;
BYTE systemId;
BYTE endingHead;
unsigned endingSector:6;
unsigned endingCylinder:10;
DWORD relativeSectors;
DWORD totalSectors;
};
  • bootIndicator (offset 0×01be):- The boot indicator tells if the partition is active or not. The value of 0×80 denotes active and 0×00 inactive (There must be only one active partition).
  • startingHead (offset 0×01bf), startingSector (offset 0×01c1), startingCylinder (offset 0×01c3):- The startingHead, startingSector and startingCylinder gives the starting position of the partition in chs mode.
  • systemId (offset 0×01c2):- This field gives the type of the partition. E.g. 0×83 for ext2 partition.
  • endingHead (offset 0×01c4), endingSector (offset 0×01c5), endingCylinder (offset 0×01c6):- These fields give the ending position of the partition.
  • RelativeSectors (offset 0×01c6):- Offset from the beginning of the disk in sectors.
  • totalSectors (offset 0×01ca):- The total sectors in the partition.
Among the four primary partitions one is the extended partition and contains the logical partitions. The Partition Table Entry in the Extended Partition points to the MBR like structure called the Extended Boot Record (EBR). There is a EBR for each logical partition. Among the four Partition table entry, first entry points to the boot sector of the logical partition. Second entry points to the EBR of the second logical partition. Thus, the logical partitions are the linked lists.



Inside Ext2/3 File System

This part of the text will describe about the layout of the ext2/3 file systems and the data structures it uses. We will also look how ext2read uses those data.

In the MBR/EBR the value for the Ext2/Ext3 partition is ‘0×83’. The Ext2/Ext3 file system contains several data structures for keeping the file system information. These data structures are also known as metadata structures. The important data structures contained in the Ext2/Ext3 File System are boot block, super block, descriptor table and inode table.
The Boot Block
The boot block of the Ext2/Ext3 filesystem is 1024 bytes long and does not contain any useful information (as far as I know. I would like to know if it contains any.).
The Super Block
The Super Block contains the information of the whole file system. It contains and the metadata like size, total inodes e.t.c. The Ext2/Ext3 super block has the following components.
struct EXT2_SUPER_BLOCK
{
DWORD s_inodes_count;
DWORD s_blocks_count;
DWORD s_r_blocks_count;
DWORD s_free_blocks_count;
DWORD s_free_inodes_count;
DWORD s_first_data_block;
DWORD s_log_block_size;
DWORD s_log_frag_size;
DWORD s_blocks_per_group;
DWORD s_frags_per_group;
DWORD s_inodes_per_group;
DWORD s_mtime;
WORD s_wtime;
WORD s_mnt_count;
WORD s_max_mnt_count;
WORD s_magic;
WORD s_state;
WORD s_pad;
WORD s_minor_rev_level;
DWORD s_lastcheck;
DWORD s_checkinterval;
DWORD s_creator_os;
DWORD s_rev_level;
WORD s_def_resuid;
WORD s_def_regid;
/* for EXT2_DYNAMIC_REV superblocks only */
DWORD s_first_ino;
WORD s_inode_size;
WORD s_block_group_nr;
DWORD s_feature_compat;
DWORD s_feature_incompat;
DWORD s_feature_ro_compat;
BYTE s_uuid[16];
char s_volume_name[16];
char s_last_mounted[64];
DWORD s_algorithm_usage_bitmap;
BYTE s_prealloc_blocks;
BYTE s_prealloc_dir_blocks;
WORD s_padding1;
DWORD s_reserved[204];
};
s_inodes_count :- Stores the total no of inodes.
s_blocks_count:- Stores the total no of blocks.
s_r_blocks_count:- Stores the total no of blocks reserved for exclusive use of superuser.
s_free_blocks_count:- Stores the total no of free blocks.
s_free_inodes_count:- Stores the total no of free inodes in the file System.
s_first_data_block:- Position of the first data block.
s_log_block_size:- used to compute logical block size in bytes. E.g if it is 1, block size is 1024. if it is 2, block size is 2048.
s_log_frag_size:- used to compute logical fragment size.
s_blocks_per_group:- Total number of blocks contained in the group.(see groups later.).
s_frags_per_group:- Total number of fragments in a group.
s_inodes_per_group:- Total number of inodes in a group.
s_mtime:- Time at which the last mount was performed. The time is stored in UNIX format as defined by posix.
s_wtime:- Time at which the last write was performed. The time is stored in UNIX format as defined by posix.
s_mnt_count:- The total number of time the fs system has been mounted in r/w mode without having checked. The Linux OS uses this value to automatically check the file system when the specified time reaches. The Specified time is s_max_mnt_count.
s_max_mnt_count:- The max no of times the fs can be mounted in r/w mode before a check must be done.
s_magic:- A number that identifies the file System. (eg. 0xef53 for ext2).
s_state; Gives the state of fs (eg. 0×001 is Unmounted cleanly). The Linux OS uses this value to determine.
s_pad:- Unused.
s_minor_rev_level:- Contains the minor number for the revision level.
s_lastcheck; The time of last File System check performed.
s_checkinterval; The max possible time between checks on the file system.
s_creator_os:- Owner Operating System of the file system. (linux=0, hurd=1, masix=2, FreeBSD=3, Lites=4 etc.).
s_rev_level:- Revision level of the file system. (0 -> original format, 1 -> v2 format with dynamic inode sizes.).
s_def_resuid:- Default uid for reserved blocks.
s_def_regid:- Default gid for reserved blocks.
s_first_ino:- First non-reserved inode.
s_inode_size:- Size of inode structure.
s_block_group_nr:- Block group no of this super block. There is another Super Block in File System for the rescue of damaged file system.
s_feature_compat:- Compatible feature set.
s_feature_incompat:- Incompatible feature set.
s_feature_ro_compat:- Read only compatible feature set.
s_uuid:- 128-bit uuid for volume.
s_volume_name:- volume name (e.g. /, /boot etc.).
s_last_mounted:- Directory where last mounted.
Reading the super block is pretty easy. Just read the starting sector +2 of the filesystem.
Group Descriptor
The ext2/ext3 file system is divided into groups called block group. The number of groups can be derived from the formula, block_group = s_blocks_count/s_blocks_per_group.
The attributes of the group is identified by group descriptor. There is an array of group descriptors describing each group. The group descriptor table can be found at the first block (block-1) following the superblock structure of the file system (block no starts from 0.). The structure of the group descriptor is as follows.
struct EXT2_GROUP_DESC
{
DWORD bg_block_bitmap;
DWORD bg_inode_bitmap;
DWORD bg_inode_table;
WORD bg_free_blocks_count;
WORD bg_free_inodes_count;
WORD bg_used_dirs_count;
WORD bg_pad;
DWORD bg_reserved[3];
};
bg_block_bitmap:- The block which contains the block bitmap for the group.
bg_inode_bitmap:- The block contains the inode bitmap for the group.
bg_inode_table:- The block contains the inode table first block (the starting block of the inode table.).
bg_free_blocks_count:- Number of free blocks in the group.
bg_free_inodes_count:- Number of free inodes in the group.
bg_used_dirs_count:- Number of inodes allocated to the directories.
bg_pad:- Padding (reserved).
bg_reserved:- Reserved.
Block Bitmap
The block bitmap represents the status of each block. It shows that the block is used (1) or free (0). E.g. 1001… show block 1 is used, block 2 is free, block 3 is free etc. Its correct location can be found by looking at bg_block_bitmap.
It is used to determine which block is free and which is used. It is used when making or copying files (Ext2read does not currently support writing to file system.).
Inode Bitmap
The Inode Bitmap works in the similar way as the block bitmap. The inode bitmap represents the status of each inode. It determines whether the inode is used (1) or free (0). E.g. 1001… show inode 0 is used, inode 1 is free, inode 2 is free etc. Its correct location can be found by looking at bg_inode_bitmap.
Inode Table
In Ext2/Ext3 file system, each file is identified by an inode. Each file has its own inode entry. The File Systems a table of all the inodes in the file system called the inode table. Furthermore, each block group has its own inode table. The starting location for the inode table can be identified by looking at bg_inode_table.
The inode gives the attributes like mode, size, uid, creation time etc. of the file.
The structure of the inode is as follows.
struct EXT2_INODE
{
WORD i_mode; /* File mode */
WORD i_uid; /* Low 16 bits of Owner Uid */
DWORD i_size; /* Size in bytes */
DWORD i_atime; /* Access time */
DWORD i_ctime; /* Creation time */
DWORD i_mtime; /* Modification time */
DWORD i_dtime; /* Deletion Time */
WORD i_gid; /* Low 16 bits of Group Id */
WORD i_links_count; /* Links count */
DWORD i_blocks; /* Blocks count */
DWORD i_flags; /* File flags */
DWORD osd1; /* OS dependent 1 */
DWORD i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
DWORD i_generation; /* File version (for NFS) */
DWORD i_file_acl; /* File ACL */
DWORD i_dir_acl; /* Directory ACL */
DWORD i_faddr; /* Fragment address */
BYTE l_i_frag; /* Fragment number */
BYTE l_i_fsize; /* Fragment size */
WORD i_pad1;
WORD l_i_uid_high; /* these 2 fields */
WORD l_i_gid_high; /* were reserved2[0] */
DWORD l_i_reserved2;
};
i_mode:- It describes the format and the access rights of the file. The result obtained by masking the value with EXT2_S_IFMT(0xF000) gives the file type. When it is masked with EXT2_IRWXU(0×01c0) gives the user access, when it is masked with EXT2_IRWX(0×0038) gives group access and when masked with EXT2_IRWXO(0×0007) gives others rights.
i_uid:- The id of the owner.
i_size:- The size of the file in bytes.
i_atime:- The last access time of the file. The time is number of seconds since 1st january 1970.
i_ctime:- The creation time of the file. The time is number of seconds since 1st january 1970.
i_mtime:- The last modification time of the file. The time is number of seconds since 1st january 1970.
i_dtime:- deletion time of the file. The time is number of seconds since 1st january 1970.
i_gid:- The group associated with this file.
i_links_count:- The number of times the inode is refered to.
i_blocks:- The number of blocks reserved for the file. The block is not the size of the block but the sector size ie. 512 bytes.
i_flags:- The behaviour flags of the file system.
osd1:- The OS dependent value.
i_block:- This array is used to locate the data of the file. The first twelve entries are the direct data blocks ie, they point directly to the data.
The 13th field is the indirect block. It points to the block which has the address of data blocks. The block holds 1 block of entries.
The 14th field is the bi- indirect block. It points to the block holding indirect entries.
The 15th field is the triple indirect block. It points to the block holding bi- indirect entries.
i_generation:- defines the file version. It is used by NFS.
i_file_acl:- The access control flags associated with the file.
i_dir_acl:- The access control flags associated with the directory.
The directory structure
The data blocks of the directory points to the directory structure. The directory structure of Ext2 is:
struct EXT2_DIR_ENTRY {
DWORD inode; /* Inode number */
WORD rec_len; /* Directory entry length */
WORD name_len; /* Name length */
char name[EXT2_NAME_LEN]; /* File name */
};
The directory entries are the array of struct EXT2_DIR_ENTRY. The size of the each structure is given by the rec_len.
inode:- The inode number of the entry.
rec_len:- The length of the record.
name_len:- The length of the name of the file.
name:- The name of the file. The string is not NULL terminated.

Ext2read 2.0


The newer version of Ext2read is released. It is now an explorer like application from where you can read ext2/ext3 partitions.
Download the codes and executables.