Chapter 13 - Virtual File System

The virtual file system (VFS) is the kernel subsystem that implements the file and filesystem-related interfaces provided to user space. All filesystems rely on the VFS not only to coexist, but also to interoperate.

Standard file system calls such as open(), read(), and write() can be used on files regardless of which filesystem or storage medium (a hard disk, for example) they reside on.

A single interface can read and write all of these files because Linux provides a common abstraction layer, the VFS, on top of the individual filesystems.
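To make the point concrete, here is a small userspace sketch (the helper name copy_fd is hypothetical, not from the original text). It moves data with nothing but read() and write(), and it neither knows nor cares whether the descriptors refer to a regular file, a pipe, or a device; the VFS dispatches each call to the right implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd using only read() and write().
 * Nothing here depends on what the descriptors refer to: a regular
 * file, a pipe, or a device all go through the same calls, and the
 * VFS routes each call to the right filesystem or driver.
 * Returns the number of bytes copied, or -1 on error. */
long copy_fd(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n < 0)
            return -1;                      /* read error */
        if (n == 0)
            break;                          /* end of file */
        for (ssize_t off = 0; off < n; ) {  /* write() may be partial */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return total;
}
```

For example, copy_fd(STDIN_FILENO, STDOUT_FILENO) behaves the same whether stdin is redirected from a file, a pipe, or a terminal.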

1. Unix file system

Unix traditionally provides four basic filesystem-related abstractions: files, directory entries, inodes, and mount points.

2. VFS objects and data structures

The VFS has four primary object types:

Superblock object

Inode object

Dentry (directory entry) object

File object

A dentry is not the same thing as a directory: a directory is just another kind of file, and the VFS treats it as a file.

Each of these objects has an associated operations object:

super_operations: the methods the kernel can invoke on a specific filesystem, such as write_inode() and sync_fs()

inode_operations: the methods the kernel can invoke on a specific file, such as create() and link()

dentry_operations: the methods the kernel can invoke on a specific directory entry, such as d_compare() and d_delete()

file_operations: the methods a process can invoke on an open file, such as read() and write()
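How an operations object works can be sketched in plain C as a table of function pointers. This is a simplified model under stated assumptions: toy_file, toy_file_operations, and toy_vfs_read are hypothetical names that only mimic struct file, struct file_operations, and the kernel's vfs_read(); the real structures have many more members.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy versions of VFS structures; NOT the kernel's definitions. */
struct toy_file;

struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_operations *f_op; /* like file->f_op */
    const char *data;                        /* backing "storage" */
    size_t pos;                              /* like file->f_pos */
};

/* "Filesystem A": serves bytes from an in-memory string. */
static long mem_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->pos;
    size_t n = len < left ? len : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* "Filesystem B": behaves like /dev/zero. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    memset(buf, 0, len);
    return (long)len;
}

const struct toy_file_operations mem_fops  = { .read = mem_read };
const struct toy_file_operations zero_fops = { .read = zero_read };

/* Generic layer: like vfs_read(), it only follows the pointer. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;              /* the kernel would return an error code */
    return f->f_op->read(f, buf, len);
}
```

The generic layer contains no filesystem-specific code; supporting another filesystem only means supplying another operations table.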

3. Superblock object

Every filesystem must implement the superblock object, which stores information describing that specific filesystem. It usually corresponds to the filesystem superblock or filesystem control block stored in a special sector on disk. Filesystems that are not disk-based (such as the memory-based sysfs) generate the superblock on the fly and keep it in memory. The superblock object is represented by struct super_block, defined in <linux/fs.h>:
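From user space, much of the superblock's information can be observed with the Linux statfs(2) system call. The sketch below assumes Linux with glibc, and show_superblock_info is a hypothetical helper. For most filesystems f_type carries the superblock's magic number (s_magic, for example 0xEF53 for ext2/ext3/ext4) and f_bsize the block size, although exactly which fields are filled in is up to each filesystem's own statfs implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/vfs.h>    /* Linux statfs(2) */

/* Query superblock-derived information for the filesystem that
 * contains path. Returns 0 on success, -1 on error. */
int show_superblock_info(const char *path, long *block_size,
                         unsigned long *magic)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0) {
        perror("statfs");
        return -1;
    }
    *block_size = (long)sfs.f_bsize;     /* usually sb->s_blocksize */
    *magic = (unsigned long)sfs.f_type;  /* usually sb->s_magic */
    printf("%s: f_type=0x%lx f_bsize=%ld f_blocks=%llu\n",
           path, *magic, *block_size,
           (unsigned long long)sfs.f_blocks);
    return 0;
}
```

Running it on / and on /proc typically prints different magic numbers, because each mounted filesystem has its own superblock.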

struct super_block {
        struct list_head        s_list;         /* Keep this first */
        dev_t                   s_dev;          /* search index; _not_ kdev_t */
        unsigned char           s_blocksize_bits;
        unsigned long           s_blocksize;
        loff_t                  s_maxbytes;     /* Max file size */
        struct file_system_type *s_type;
        const struct super_operations   *s_op;
        const struct dquot_operations   *dq_op;
        const struct quotactl_ops       *s_qcop;
        const struct export_operations *s_export_op;
        unsigned long           s_flags;
        unsigned long           s_iflags;       /* internal SB_I_* flags */
        unsigned long           s_magic;
        struct dentry           *s_root;
        struct rw_semaphore     s_umount;
        int                     s_count;
        atomic_t                s_active;
#ifdef CONFIG_SECURITY
        void                    *s_security;
#endif
        const struct xattr_handler **s_xattr;
#ifdef CONFIG_FS_ENCRYPTION
        const struct fscrypt_operations *s_cop;
        struct key              *s_master_keys; /* master crypto keys in use */
#endif
#ifdef CONFIG_FS_VERITY
        const struct fsverity_operations *s_vop;
#endif
#ifdef CONFIG_UNICODE
        struct unicode_map *s_encoding;
        __u16 s_encoding_flags;
#endif
        struct hlist_bl_head    s_roots;        /* alternate root dentries for NFS */
        struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
        struct block_device     *s_bdev;
        struct backing_dev_info *s_bdi;
        struct mtd_info         *s_mtd;
        struct hlist_node       s_instances;
        unsigned int            s_quota_types;  /* Bitmask of supported quota types */
        struct quota_info       s_dquot;        /* Diskquota specific options */

        struct sb_writers       s_writers;

        /*
         * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
         * s_fsnotify_marks together for cache efficiency. They are frequently
         * accessed and rarely modified.
         */
        void                    *s_fs_info;     /* Filesystem private info */

        /* Granularity of c/m/atime in ns (cannot be worse than a second) */
        u32                     s_time_gran;
        /* Time limits for c/m/atime in seconds */
        time64_t                   s_time_min;
        time64_t                   s_time_max;
#ifdef CONFIG_FSNOTIFY
        __u32                   s_fsnotify_mask;
        struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif

        char                    s_id[32];       /* Informational name */
        uuid_t                  s_uuid;         /* UUID */

        unsigned int            s_max_links;
        fmode_t                 s_mode;

        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct mutex s_vfs_rename_mutex;        /* Kludge */

        /*
         * Filesystem subtype.  If non-empty the filesystem type field
         * in /proc/mounts will be "type.subtype"
         */
        const char *s_subtype;

        const struct dentry_operations *s_d_op; /* default d_op for dentries */

        /*
         * Saved pool identifier for cleancache (-1 means none)
         */
        int cleancache_poolid;

        struct shrinker s_shrink;       /* per-sb shrinker handle */

        /* Number of inodes with nlink == 0 but still referenced */
        atomic_long_t s_remove_count;

        /* Pending fsnotify inode refs */
        atomic_long_t s_fsnotify_inode_refs;

        /* Being remounted read-only */
        int s_readonly_remount;

        /* per-sb errseq_t for reporting writeback errors via syncfs */
        errseq_t s_wb_err;

        /* AIO completions deferred from interrupt context */
        struct workqueue_struct *s_dio_done_wq;
        struct hlist_head s_pins;

        /*
         * Owning user namespace and default context in which to
         * interpret filesystem uids, gids, quotas, device nodes,
         * xattrs and security labels.
         */
        struct user_namespace *s_user_ns;

        /*
         * The list_lru structure is essentially just a pointer to a table
         * of per-node lru lists, each of which has its own spinlock.
         * There is no need to put them into separate cachelines.
         */
        struct list_lru         s_dentry_lru;
        struct list_lru         s_inode_lru;
        struct rcu_head         rcu;
        struct work_struct      destroy_work;

        struct mutex            s_sync_lock;    /* sync serialisation lock */

        /*
         * Indicates how deep in a filesystem stack this SB is
         */
        int s_stack_depth;

        /* s_inode_list_lock protects s_inodes */
        spinlock_t              s_inode_list_lock ____cacheline_aligned_in_smp;
        struct list_head        s_inodes;       /* all inodes */

        spinlock_t              s_inode_wblist_lock;
        struct list_head        s_inodes_wb;    /* writeback inodes */
} __randomize_layout;

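Superblock operations:

When a filesystem needs to perform an operation on its superblock, it follows the pointers from the superblock object to the desired method. For example, to write its superblock, a filesystem would call:

        sb->s_op->write_super(sb);

(write_super() belongs to older 2.6-era kernels; it was removed in later versions.)

Reading this, I wondered: under what circumstances does a superblock actually get written?

The s_op field is very important for every filesystem. It points to the superblock operations table (struct super_operations), which contains the filesystem's implementation of a series of methods, including:

Allocating inodes
Destroying inodes
Reading and writing inodes
Filesystem synchronization
and so on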

Let's look at the superblock operations one by one.

Creates and initializes a new inode object under the given superblock.

struct inode *alloc_inode(struct super_block *sb);

Deallocates the given inode.

void destroy_inode(struct inode *inode);

Called by the VFS when an inode is dirtied (modified).

void dirty_inode(struct inode *inode);

Writes the given inode to disk. The wait parameter specifies whether the operation should be synchronous.

void write_inode(struct inode *inode, int wait);

Called by the VFS when the last reference to an inode is dropped.

void drop_inode(struct inode *inode);

Deletes the given inode from disk.

void delete_inode(struct inode *inode);

Updates the on-disk superblock with the given superblock. The caller must hold the s_lock lock.

void write_super(struct super_block *sb);

Synchronizes filesystem metadata with the on-disk filesystem. The wait parameter specifies whether the operation is synchronous.

int sync_fs(struct super_block *sb, int wait);

Prevents changes to the filesystem, and then updates the on-disk superblock with the given superblock.

void write_super_lockfs(struct super_block *sb);

Unlocks the filesystem. This is the inverse of write_super_lockfs().

void unlockfs(struct super_block *sb);

Called by the VFS to obtain filesystem statistics. The statistics for the given filesystem are placed in statfs.

int statfs(struct super_block *sb, struct statfs *statfs);

Called by the VFS when the filesystem is remounted with new mount options.

int remount_fs(struct super_block *sb, int *flags, char *data);

Called by the VFS to release the inode and clear any pages containing related data.

void clear_inode(struct inode *inode);

Called by the VFS to interrupt a mount operation. It is used by network filesystems such as NFS.

void umount_begin(struct super_block *sb);

All of these functions are invoked by the VFS in process context.
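The sync_fs() method also helps answer the earlier question of when a superblock gets written. One way user space triggers it is the Linux syncfs(2) system call, which writes back the filesystem containing a given file descriptor; during that write-back the VFS calls the filesystem's sync_fs() method. The sync command and unmounting reach the same method. A minimal sketch, assuming Linux with glibc (flush_filesystem_of is a hypothetical helper name):

```c
#define _GNU_SOURCE             /* for syncfs() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush only the filesystem that contains path. The kernel maps the
 * descriptor to its superblock and syncs that one filesystem, which
 * includes calling the filesystem's sync_fs() superblock method.
 * Returns 0 on success, -1 on error. */
int flush_filesystem_of(const char *path)
{
    int fd = open(path, O_RDONLY);  /* works for files and directories */
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int rc = syncfs(fd);
    if (rc != 0)
        perror("syncfs");
    close(fd);
    return rc;
}
```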

Thinking about it, I don't seem to deal with superblocks in my daily work. How do they show up in everyday use? What does a Linux superblock actually look like?

My first guess was that it corresponds to a device.

First of all, what is a superblock?

A superblock is metadata that describes the filesystem stored on a block device:

1) The superblock sits at the start of the filesystem (in many layouts, right after the boot block). It stores information about the structure of the filesystem itself: for example, it records the size of each area, and it also tracks which disk blocks are free.

Linux supports many filesystems, such as ext2, ext3, ext4, sysfs, rootfs, and proc. Each superblock corresponds to one independent mounted filesystem.

What does a superblock do?

Every filesystem has a super_block structure, and each super_block is linked into a global list of superblocks (through s_list). When a file in the filesystem is opened, an inode structure must be allocated in memory, and these inode structures are linked to their superblock (through s_inodes).

After reading this, I think a Linux superblock is loosely comparable to a Windows drive such as C:, except that Linux does not expose these underlying concepts to users as directly.

4. Inode object

The inode object represents all the information the kernel needs to manipulate a file or directory. For Unix-style filesystems, this information is simply read from the on-disk inode; filesystems that have no on-disk inodes must obtain it from wherever it is stored on disk. Once loaded, the inode lives in memory, so later accesses do not have to go back to the disk.

i_bdev points to the block device structure (for block device files)

i_cdev points to the character device structure (for character device files)
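stat(2) does not expose i_bdev or i_cdev directly, but st_mode and st_rdev show the same information from user space: whether an inode is a block or character device, and which device number it refers to. A minimal sketch with a hypothetical helper inode_kind; on Linux, /dev/null is the character device with major number 1 and minor number 3.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor() */
#include <sys/types.h>

/* Classify the inode behind path: 'c' character device, 'b' block
 * device, 'd' directory, '-' regular file, '?' anything else,
 * 'E' on error. For device files *rdev receives the device number. */
char inode_kind(const char *path, dev_t *rdev)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return 'E';
    }
    *rdev = st.st_rdev;                    /* meaningful for device files */
    if (S_ISCHR(st.st_mode)) return 'c';   /* kernel counterpart: i_cdev */
    if (S_ISBLK(st.st_mode)) return 'b';   /* kernel counterpart: i_bdev */
    if (S_ISDIR(st.st_mode)) return 'd';
    if (S_ISREG(st.st_mode)) return '-';
    return '?';                            /* FIFO, socket, ... */
}
```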

Inode operations

The operation functions are as follows:

The VFS calls this function from the creat() and open() system calls to create a new inode associated with the given dentry object, using mode as the initial access mode.

int create(struct inode* dir, struct dentry* dentry, int mode)

Searches a directory for an inode corresponding to the filename specified in the given dentry.

struct dentry * lookup(struct inode* dir, struct dentry * dentry)

Other common operations include:

link, unlink, symlink, mkdir, rmdir, readlink

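mknod is an interesting one.

The mknod() inode operation is called by the mknod() system call to create a special file (a device file, named pipe, or socket) and associate it with a directory entry.

This is how device files are created, for example with the mknod command.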
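Creating a named pipe is the easiest way to exercise the mknod() path without special privileges; creating block or character device nodes normally requires CAP_MKNOD (typically root). A sketch with a hypothetical helper make_fifo:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a named pipe with mknod(2) and confirm the new inode's type.
 * Passing S_IFCHR or S_IFBLK with a device number instead would
 * create a device file, but that needs CAP_MKNOD.
 * Returns 1 if path is now a FIFO, 0 otherwise. */
int make_fifo(const char *path)
{
    if (mknod(path, S_IFIFO | 0600, 0) != 0) {
        perror("mknod");
        return 0;
    }
    struct stat st;     /* stat() on a FIFO does not block; open() would */
    return stat(path, &st) == 0 && S_ISFIFO(st.st_mode);
}
```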

This part is a bit confusing. What exactly does the inode object do? Is it related to directories or to files?

Here are some key points from Baidu Encyclopedia:

A Linux filesystem stores each file's inode number together with its filename in the directory.

A directory is therefore just a table that maps filenames to inode numbers. Each filename/inode-number pair in a directory is called a link. Each file has exactly one inode number, but one inode number can have several filenames, so the same file on disk can be reached through different paths.
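A hard link makes the "one inode, many names" rule visible. The sketch below (link_and_compare is a hypothetical helper) creates a file, adds a second name with link(2), and checks that both names report the same inode number and a link count of 2.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create orig, add alias as a second name for the same inode, and
 * verify both names resolve to one inode. Returns the inode's link
 * count (2 for a fresh file), or -1 on error or mismatch. */
long link_and_compare(const char *orig, const char *alias)
{
    int fd = open(orig, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    close(fd);
    if (link(orig, alias) != 0) {   /* new directory entry, same inode */
        perror("link");
        return -1;
    }

    struct stat a, b;
    if (stat(orig, &a) != 0 || stat(alias, &b) != 0)
        return -1;
    printf("%s -> inode %lu, %s -> inode %lu, links=%lu\n",
           orig, (unsigned long)a.st_ino, alias, (unsigned long)b.st_ino,
           (unsigned long)a.st_nlink);
    if (a.st_ino != b.st_ino || a.st_dev != b.st_dev)
        return -1;
    return (long)a.st_nlink;
}
```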

First, look at the data directory:

zhanglei@ubuntu:~$ stat data
  File: data
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 805h/2053d	Inode: 11536259    Links: 32
Access: (0775/drwxrwxr-x)  Uid: ( 1000/zhanglei)   Gid: ( 1000/zhanglei)
Access: 2021-12-30 16:40:23.940211255 +0800
Modify: 2021-12-29 10:27:41.774194959 +0800
Change: 2021-12-29 10:27:41.774194959 +0800
 Birth: -

Next, look at index.html and test.log inside the data directory:

zhanglei@ubuntu:~/data$ stat index.html 
  File: index.html
  Size: 14848     	Blocks: 32         IO Block: 4096   regular file
Device: 805h/2053d	Inode: 11535969    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/zhanglei)   Gid: ( 1000/zhanglei)
Access: 2021-12-27 16:41:13.534683164 +0800
Modify: 2021-11-04 11:46:49.090311376 +0800
Change: 2021-11-04 11:46:49.090311376 +0800
 Birth: -


zhanglei@ubuntu:~/data$ stat test.log 
  File: test.log
  Size: 2124      	Blocks: 8          IO Block: 4096   regular file
Device: 805h/2053d	Inode: 11535967    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/zhanglei)   Gid: ( 1000/zhanglei)
Access: 2021-11-02 19:30:36.301627388 +0800
Modify: 2021-11-02 19:30:36.209629723 +0800
Change: 2021-11-02 19:30:36.209629723 +0800
 Birth: -

This illustrates the point well: a directory is just a table pairing filenames with inode numbers, and each file has its own inode number (11535969 for index.html and 11535967 for test.log).
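The "directory is a table of (name, inode number)" model can also be checked directly: readdir(3) returns exactly those pairs, in d_name and d_ino. A minimal sketch with hypothetical helpers dir_entry_inode and entry_matches_stat; on ordinary local filesystems the d_ino from the directory entry matches the st_ino that stat reports, although some stacked filesystems (overlayfs in certain configurations, for example) may not guarantee this.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Scan the directory's (name, inode number) entries for name.
 * Returns the inode number stored in the entry, or 0 if not found. */
ino_t dir_entry_inode(const char *dir, const char *name)
{
    DIR *d = opendir(dir);
    if (d == NULL) {
        perror("opendir");
        return 0;
    }
    ino_t ino = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, name) == 0) {
            ino = e->d_ino;
            break;
        }
    }
    closedir(d);
    return ino;
}

/* 1 if the directory entry's inode number equals what stat() reports. */
int entry_matches_stat(const char *dir, const char *name)
{
    char path[4096];
    struct stat st;

    snprintf(path, sizeof path, "%s/%s", dir, name);
    if (stat(path, &st) != 0)
        return 0;
    return dir_entry_inode(dir, name) == st.st_ino;
}
```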

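5. Dentry object

The VFS treats directories as files. For example, in the path /bin/vi, both bin and vi are files: bin is a special directory file and vi is a regular file. Path lookup, however, needs directory-specific operations, so to make it easier the VFS introduces the directory entry (dentry): each component of a path, such as /, bin, and vi, is a dentry object.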
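The following sketch splits a path into the components the VFS walks, one dentry per component, starting at the root. It is a simplification with a hypothetical helper path_components: the kernel's real path walk also handles ".", "..", symbolic links, and mount points.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>

/* Split a path into the components the VFS would walk, one dentry per
 * component, starting from the root dentry "/" for absolute paths.
 * Stores at most max names in out (pointing into buf, which is
 * modified in place) and returns the number of components stored. */
int path_components(char *buf, const char **out, int max)
{
    int n = 0;
    char *save = NULL;

    if (buf[0] == '/' && n < max)
        out[n++] = "/";                     /* the root dentry */
    for (char *tok = strtok_r(buf, "/", &save);
         tok != NULL && n < max;
         tok = strtok_r(NULL, "/", &save))
        out[n++] = tok;                     /* "bin", then "vi", ... */
    return n;
}
```

For "/bin/vi" this yields three components, "/", "bin", and "vi", matching the three dentries described above.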

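Dentry states:

A valid dentry object is in one of three states: used, unused, or negative. A used dentry points to a valid inode and is currently referenced. An unused dentry still points to a valid inode but is not referenced by anyone; it stays cached in case it is needed again.

A negative dentry is not associated with a valid inode, either because the inode was deleted or because the path name was never correct in the first place. The dentry is kept anyway so that future lookups of the same name are resolved quickly.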
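The three states can be expressed as a small classification function. This is a toy model: toy_dentry and its fields only echo struct dentry (newer kernels keep the reference count in d_lockref rather than a plain d_count).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; not the kernel's definitions. */
struct toy_inode { unsigned long i_ino; };

struct toy_dentry {
    const char *d_name;
    struct toy_inode *d_inode;  /* NULL for a negative dentry */
    int d_count;                /* number of users holding a reference */
};

enum dentry_state { DENTRY_USED, DENTRY_UNUSED, DENTRY_NEGATIVE };

enum dentry_state dentry_state(const struct toy_dentry *d)
{
    if (d->d_inode == NULL)
        return DENTRY_NEGATIVE; /* cached "this name does not exist" */
    if (d->d_count > 0)
        return DENTRY_USED;     /* valid inode, currently referenced */
    return DENTRY_UNUSED;       /* valid inode, cached but unreferenced */
}
```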

Dentries are cached in the dentry cache (dcache), which consists of three parts:

1. Lists of used dentries, linked off their associated inode through the inode's i_dentry field. Because a given inode can have multiple links, it can have multiple dentry objects, so a list is used.

2. A doubly linked least recently used (LRU) list of unused and negative dentry objects. New entries are always inserted at the head, so entries near the head are newer than entries near the tail. When the kernel must remove entries to reclaim memory, it removes them from the tail.
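The LRU behaviour (insert at the head, evict from the tail) can be sketched with a circular doubly linked list in the style of the kernel's list_head. All names here are hypothetical. Newer kernels keep unused dentries on a per-superblock list_lru (the s_dentry_lru field in the struct super_block listing above) rather than on one global list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy LRU node: circular list with a sentinel head. */
struct lru_node {
    struct lru_node *prev, *next;
    const char *name;
};

void lru_init(struct lru_node *head)
{
    head->prev = head->next = head;     /* empty list points to itself */
}

/* Newest entries are inserted right after the head. */
void lru_add_head(struct lru_node *head, struct lru_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Reclaim: unlink and return the oldest entry (the tail),
 * or NULL if the list is empty. */
struct lru_node *lru_evict_tail(struct lru_node *head)
{
    struct lru_node *victim = head->prev;
    if (victim == head)
        return NULL;
    victim->prev->next = head;
    head->prev = victim->prev;
    victim->prev = victim->next = NULL;
    return victim;
}
```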

3. A hash table and hashing function used to quickly resolve a given path into the associated dentry object.

The hash table is the dentry_hashtable array. Each element is a pointer to a list of dentries that hash to the same value.
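A toy version of the hash table: entries are keyed by (parent dentry, name), because the same name can appear in many directories, and each bucket holds a chain of entries that hash to the same value. The FNV-1a hash and all names are assumptions of this sketch, not the kernel's actual hash function or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BITS 8
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

/* Hypothetical toy dcache entry. */
struct toy_dentry {
    const struct toy_dentry *d_parent;
    const char *d_name;
    struct toy_dentry *d_hash_next;     /* chain within one bucket */
};

static struct toy_dentry *toy_dentry_hashtable[TOY_HASH_SIZE];

/* FNV-1a over the name, mixed with the parent pointer. */
static unsigned toy_d_hash(const struct toy_dentry *parent, const char *name)
{
    unsigned h = 2166136261u;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h ^= *p;
        h *= 16777619u;
    }
    h ^= (unsigned)((uintptr_t)parent >> 4);
    return h & (TOY_HASH_SIZE - 1);
}

void toy_d_add(struct toy_dentry *d)
{
    unsigned b = toy_d_hash(d->d_parent, d->d_name);
    d->d_hash_next = toy_dentry_hashtable[b];
    toy_dentry_hashtable[b] = d;
}

struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                const char *name)
{
    unsigned b = toy_d_hash(parent, name);
    for (struct toy_dentry *d = toy_dentry_hashtable[b]; d; d = d->d_hash_next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;
    return NULL;  /* miss: the real VFS would call the fs's lookup() */
}
```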

The dentry operations are simple:

Called before the VFS uses a dentry from the dcache, to determine whether it is still valid. Most filesystems set this method to NULL because they consider their dcache entries always valid.

int d_revalidate(struct dentry *dentry, struct nameidata *);

Creates a hash value from the given dentry. The VFS calls this function whenever it adds a dentry to the hash table.

int d_hash(struct dentry *dentry, struct qstr *name);

Compares the filenames name1 and name2. The caller must hold dcache_lock.

int d_compare(struct dentry *dentry, struct qstr *name1, struct qstr *name2);

Called by the VFS when the dentry's reference count reaches zero.

int d_delete(struct dentry *dentry);

Called by the VFS when the dentry object is about to be freed.

void d_release(struct dentry *dentry);

Called by the VFS when a dentry object loses its associated inode (for example, because the entry was deleted from disk).

void d_iput(struct dentry *dentry, struct inode *inode);
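d_hash() and d_compare() are mainly overridden by filesystems with unusual name semantics; vfat, for example, compares names case-insensitively. The two methods must agree: if d_compare() treats two names as equal, d_hash() must put them in the same bucket, or the dcache would miss. A hypothetical sketch on plain C strings (the real methods take struct dentry and struct qstr arguments):

```c
#include <assert.h>
#include <ctype.h>

/* Case-insensitive hash: fold case before hashing so that names which
 * compare equal below always land in the same bucket. */
unsigned ci_d_hash(const char *name)
{
    unsigned h = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        h = h * 31u + (unsigned)tolower(*p);
    return h;
}

/* Case-insensitive compare. Like the kernel's d_compare(), it returns
 * 0 when the names match and nonzero otherwise. */
int ci_d_compare(const char *name1, const char *name2)
{
    const unsigned char *a = (const unsigned char *)name1;
    const unsigned char *b = (const unsigned char *)name2;
    for (; *a && *b; a++, b++)
        if (tolower(*a) != tolower(*b))
            return 1;
    return *a != *b;    /* equal only if both names ended together */
}
```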

6. File object

The file object represents a file that has been opened by a process. Processes deal directly with file objects rather than with superblocks, inodes, or dentries. A file object is created by the open() system call and destroyed by close().

I won't list the file operations here; I'm already quite familiar with them.
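The difference between a file object and a file descriptor is easy to observe. In this sketch (demo_file_objects is a hypothetical helper), two separate open() calls on the same path create two file objects with independent offsets, while dup() creates a second descriptor for the same file object, so the offset is shared.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* path must name a readable file of at least 4 bytes.
 * Returns 0 if offsets are shared through dup() and independent
 * across separate open() calls, -1 otherwise. */
int demo_file_objects(const char *path)
{
    int a = open(path, O_RDONLY);       /* file object #1 */
    if (a < 0) {
        perror("open");
        return -1;
    }
    int b = open(path, O_RDONLY);       /* file object #2, own offset */
    int a2 = dup(a);                    /* new fd, same file object as a */
    if (b < 0 || a2 < 0) {
        perror("open/dup");
        if (b >= 0) close(b);
        if (a2 >= 0) close(a2);
        close(a);
        return -1;
    }

    char buf[4];
    int ok = read(a, buf, sizeof buf) == (ssize_t)sizeof buf;

    off_t off_a  = lseek(a,  0, SEEK_CUR);  /* 4: advanced by read() */
    off_t off_a2 = lseek(a2, 0, SEEK_CUR);  /* 4: same file object */
    off_t off_b  = lseek(b,  0, SEEK_CUR);  /* 0: separate file object */
    printf("a=%lld a2=%lld b=%lld\n",
           (long long)off_a, (long long)off_a2, (long long)off_b);

    close(a);
    close(a2);
    close(b);
    return (ok && off_a == 4 && off_a2 == 4 && off_b == 0) ? 0 : -1;
}
```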

Besides the four VFS objects, the kernel uses some standard data structures to manage other data related to filesystems.

Each filesystem type has exactly one file_system_type structure, regardless of how many instances of it are mounted.

When a filesystem is mounted, a vfsmount structure is created to represent that particular mount.

As I have said many times before, each process has a file descriptor table (files_struct) that maps each fd to an open file object.

The second process-related structure is fs_struct, which contains filesystem information about the process.

This structure holds the current working directory and root directory of the process.

The third structure is the namespace structure, pointed to by the mnt_namespace field of the process descriptor.

Its list member is a doubly linked list of the mounted filesystems that make up the namespace.

Its count member is a reference count that keeps the structure from being freed while processes are still using it.

By default, all processes share the same namespace. Only a process created by clone() with the CLONE_NEWNS flag gets its own copy of the namespace, so most processes share the default one.

I'm a little curious: is this the namespace used in K8s? Ha ha ha
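On that question: container runtimes, including the ones Kubernetes uses, are built on Linux namespaces, and the mount namespace described above is one of them. From user space, a process's mount namespace shows up as the link /proc/<pid>/ns/mnt. In this sketch (read_mnt_ns and shares_mnt_ns_with are hypothetical helpers), two processes share a mount namespace when those links read the same.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the mount-namespace link of process pid into buf,
 * e.g. "mnt:[4026531840]". Returns 0 on success, -1 on error. */
int read_mnt_ns(pid_t pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/ns/mnt", (int)pid);
    ssize_t n = readlink(path, buf, len - 1);
    if (n < 0) {
        perror("readlink");
        return -1;
    }
    buf[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}

/* 1 if pid is in the caller's mount namespace, 0 if not, -1 on error. */
int shares_mnt_ns_with(pid_t pid)
{
    char mine[64], theirs[64];
    if (read_mnt_ns(getpid(), mine, sizeof mine) != 0 ||
        read_mnt_ns(pid, theirs, sizeof theirs) != 0)
        return -1;
    return strcmp(mine, theirs) == 0;
}
```

For example, shares_mnt_ns_with(getppid()) normally returns 1 for an ordinary process started from a shell, while a process inside a container usually sees a different mnt:[...] value than processes on the host.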


Added by SteveMT on Fri, 31 Dec 2021 17:42:39 +0200
