Computer-Fundamentals/osv2/5.fileSystem.md
tarun-elango 3c0881290e more subjects
2026-04-26 14:53:29 -04:00

File Systems and Disk Scheduling in Operating Systems

File systems and disk scheduling are where operating systems stop being abstract software platforms and start dealing with stubborn physical reality.

As a software engineer, you usually work with a friendly model:

  • open a file
  • write bytes
  • read them back later
  • trust that paths name things reliably
  • assume storage is slower than RAM, but still manageable

Underneath that model, the operating system is solving a much harder problem:

  • storage devices expose blocks, not named files
  • crashes can happen in the middle of updates
  • persistence has to survive process death, kernel reboot, and power loss
  • many processes want to read and write the same storage at once
  • the hardware may be mechanical, flash-based, network-backed, or hidden behind a hypervisor

This guide is written for practical understanding rather than memorization. The goal is to build the kind of mental model that helps in interviews, debugging, performance tuning, database work, and infrastructure design.

1. Why File Systems Exist

At the hardware level, a disk or SSD does not naturally contain directories, filenames, permissions, or "the log file for service X". It exposes addressable storage locations. Historically these were sectors on a rotating disk. Today they may be logical block addresses backed by flash translation layers, RAID controllers, or cloud storage systems.

If the OS exposed raw storage directly to applications, every program would need to solve the same problems itself:

  • where to place data
  • how to find it later
  • how to avoid overwriting other data
  • how to reuse freed space
  • how to handle crashes halfway through an update
  • how to represent ownership, permissions, and timestamps

That would be chaos.

The file system exists to provide a durable logical namespace over raw block storage.

1.1 The abstraction gap

Think of RAM and disk as very different kinds of storage:

  • RAM is fast, byte-addressable, and volatile.
  • Disk is slow, block-oriented, and persistent.

Applications want a simple logical model:

  • named files
  • hierarchical directories
  • append, overwrite, truncate, rename
  • metadata such as owner and mode bits

The file system is the translation layer between that logical model and the physical storage layout.

1.2 Logical view vs physical storage

The logical view says:

  • there is a file called /var/log/app.log
  • it has permissions 0640
  • it is owned by root:adm
  • it has a size of 48 MB

The physical view says:

  • a directory entry maps app.log to inode 912341
  • inode 912341 stores metadata and block mapping information
  • the file's contents live in a set of physical blocks or extents
  • some data may still be dirty in page cache and not yet on media

That distinction is central to both interview answers and production debugging.

1.3 Naming and organization matter

A file system does more than store bytes. It also gives structure:

  • directories let humans and programs organize data
  • metadata enables permissions, accounting, and auditing
  • links allow multiple names for the same underlying object
  • mount points let multiple storage backends appear in one namespace

Without a file system, persistence would still exist, but it would look more like manual block management than application-friendly storage.

flowchart LR
	A["Application<br/>open read write fsync"] --> B["System call layer"]
	B --> C["VFS<br/>common file API"]
	C --> D["Concrete file system<br/>ext4 xfs tmpfs procfs nfs"]
	D --> E["Page cache and writeback"]
	E --> F["Block layer and I/O scheduler"]
	F --> G["Device controller<br/>SATA SAS NVMe RAID"]
	G --> H["Physical media<br/>HDD platters or NAND flash"]

2. Disk Structure Basics

Before understanding file systems, it helps to understand what they sit on top of.

2.1 Sectors, blocks, and clusters

These terms are often mixed together, but they refer to different layers.

Sectors

A sector is the basic addressable unit exposed by a storage device. Historically it was often 512 bytes. Modern disks commonly use 4 KiB physical sectors, though they may emulate 512-byte logical sectors for compatibility.

This matters because partial-sector updates are not a native physical operation. The device often has to read-modify-write internally.

Blocks

A file system block is the allocation and I/O unit chosen by the file system. Common Linux file systems often use 4 KiB blocks because that aligns well with memory pages.

The file system usually allocates storage in block-sized units, not arbitrary byte ranges.

Clusters

A cluster usually means a group of sectors used as a larger allocation unit. The term is common in FAT and NTFS discussions. Conceptually, it is similar to a file system allocation block.

2.2 Tracks, cylinders, platters

These are HDD concepts and still matter for intuition even though modern drives hide the real geometry behind logical block addressing.

  • A platter is a physical disk surface coated with magnetic material.
  • A track is a circular ring on a platter.
  • A sector is a subdivision of a track.
  • A cylinder is the set of tracks at the same radius across multiple platters.

In old textbooks, the OS and controller cared more directly about these physical details. In modern systems, they are largely abstracted away, but the performance consequences remain.

2.3 HDD access costs

For rotating media, access time is roughly:


$$\text{Access Time} \approx \text{Queueing Delay} + \text{Seek Time} + \text{Rotational Latency} + \text{Transfer Time}$$

The important intuition is this:

  • moving the head is expensive
  • waiting for the platter to rotate is expensive
  • actually transferring the bytes is often cheap by comparison

For example, on a 7200 RPM disk:

  • one full rotation takes about 8.33 ms
  • average rotational latency is about half a rotation, about 4.17 ms
  • seek time may be several milliseconds
  • transfer of a few kilobytes may be a small fraction of a millisecond

That is why random I/O on HDDs is dramatically slower than sequential I/O.
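The arithmetic behind those numbers can be checked with a short sketch. The 4 ms seek time, 150 MB/s transfer rate, and 4 KiB request size are illustrative assumptions, not measurements of any particular drive.

```python
def hdd_access_ms(rpm, seek_ms, transfer_mb_s, request_bytes, queue_ms=0.0):
    """Estimate one random HDD access using the access-time formula above."""
    rotation_ms = 60_000 / rpm               # time for one full rotation
    rotational_latency_ms = rotation_ms / 2  # on average, half a rotation
    transfer_ms = request_bytes / (transfer_mb_s * 1_000_000) * 1000
    return queue_ms + seek_ms + rotational_latency_ms + transfer_ms


rotation_ms = 60_000 / 7200
total = hdd_access_ms(7200, seek_ms=4.0, transfer_mb_s=150, request_bytes=4096)
print(f"rotation {rotation_ms:.2f} ms, one random 4 KiB access {total:.2f} ms")
```

Under these assumptions the transfer term is a few hundredths of a millisecond, so almost the entire cost is positioning. A sequential read pays that positioning cost once, while random reads pay it on every request.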

2.4 SSD behavior is different

SSDs remove seek time and rotational latency because there is no mechanical arm and no spinning platter. That changes the performance profile, but it does not make storage magically free.

SSDs still have:

  • controller queues
  • internal mapping overhead
  • erase-before-write constraints
  • garbage collection
  • wear leveling
  • tail-latency spikes under heavy write load

So the dominant costs shift from mechanics to controller behavior, queueing, flash management, and software stack overhead.

3. File System Layout on Disk

Every file system needs to answer the same basic on-disk questions:

  • where is the file system itself described
  • where are metadata records stored
  • where is actual file data stored
  • how is free space tracked
  • how can the system recover after a crash

Different file systems answer these differently, but a classic Unix-style layout is a good mental model.

3.1 Major on-disk components

Common structures include:

  • Boot block: space near the beginning used for boot-related code or reserved metadata
  • Superblock: global file system metadata such as block size, inode count, feature flags, UUID, and state
  • Inode table: persistent metadata records for files and directories
  • Data blocks: the actual file contents and directory contents
  • Free space metadata: bitmaps, lists, or extent trees describing unallocated space
  • Journal area: write-ahead log used for crash recovery in journaling file systems
  • Directory structures: name-to-inode mappings

3.2 ext4 conceptual organization

ext4 is not just one giant array of blocks. Conceptually, it organizes storage into block groups. Each group keeps related metadata and data somewhat localized.

This design helps reduce long seeks on HDDs and improves locality in general:

  • a file's inode can live near its data blocks
  • free-space metadata is distributed rather than fully centralized
  • directory data and child inodes can often be allocated near each other

An oversimplified ext4-style layout looks like this:

flowchart TB
	FS["Whole File System"] --> BB["Boot Block"]
	FS --> SB["Primary Superblock"]
	FS --> GDT["Group Descriptor Table"]
	FS --> J["Journal Area<br/>internal or external"]
	FS --> BG1["Block Group 0"]
	FS --> BG2["Block Group 1"]
	FS --> BG3["Block Group N"]

	subgraph Group["Typical Block Group"]
		direction TB
		SBK["Backup superblock<br/>in some groups"] --> BBM["Block bitmap"] --> IBM["Inode bitmap"] --> IT["Inode table"] --> DB["Data blocks / extents"]
	end

	BG1 -. same pattern .-> Group

3.3 Why this layout exists

If all metadata lived in one place and all file data in another, every operation would bounce back and forth between distant regions of the disk. Block groups reduce that cost.

This is a recurring OS design theme:

  • separate concerns logically
  • keep related data physically close when possible

3.4 Superblock intuition

The superblock is the file system's identity card and rulebook. It tells the kernel things like:

  • block size
  • total blocks and free blocks
  • inode counts
  • mount state
  • supported features such as journaling, extents, checksums, large files

If the superblock is lost or corrupted, the file system may become unmountable. That is why many file systems keep backup copies.

4. Inodes

If you remember one thing about Unix file systems, remember this:

the filename is not the file.

The durable object is the inode plus its data. The name is just a directory entry that points to that inode.

4.1 What an inode is

An inode is a metadata structure representing a filesystem object such as:

  • regular file
  • directory
  • symbolic link
  • device node
  • FIFO
  • socket

An inode stores metadata, not the human-readable filename.

Typical inode contents include:

  • file type
  • permissions and mode bits
  • owner UID and GID
  • file size
  • link count
  • timestamps such as atime, mtime, ctime, and sometimes birth time
  • pointers or extents describing where file data lives
  • flags and extended metadata

4.2 Why filenames are separate from inodes

Separating names from inode metadata enables several important behaviors:

  • multiple hard links can refer to the same inode
  • rename can update directory entries without moving file data
  • open file descriptors can continue to refer to an inode even after the name is removed

This is why Unix file systems feel flexible and why operations like mv within the same file system can be atomic metadata operations.

4.3 Inode number

Every inode has an inode number that is unique within a file system. On Linux you can see it with:

ls -li somefile
stat somefile

When a directory maps report.txt to inode 481002, that inode number is what the kernel uses to find the file's metadata.
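The same fields are available to programs through `stat`. A minimal sketch using Python's `os.stat`, with a throwaway file in a temporary directory (the file name is only an example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "report.txt")
    with open(path, "w") as f:
        f.write("hello\n")

    st = os.stat(path)
    # st_ino is unique only together with st_dev (the file system it lives on).
    inode_number, device, links = st.st_ino, st.st_dev, st.st_nlink
    print(f"inode={inode_number} dev={device} nlink={links} size={st.st_size}")
```

Note that the name `report.txt` appears nowhere in the `stat` result. It lives in the directory, not in the inode.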

4.4 Classic block pointers

In the classic Unix model, the inode contains direct references to data blocks plus a small tree of indirection for large files.

The standard interview picture is:

  • direct pointers
  • one single indirect pointer
  • one double indirect pointer
  • one triple indirect pointer

flowchart TB
	I["Inode"] --> D1["Direct block 1"]
	I --> D2["Direct block 2"]
	I --> D3["Direct block N"]
	I --> S["Single indirect"]
	I --> DD["Double indirect"]
	I --> TD["Triple indirect"]
	S --> S1["Data block"]
	S --> S2["Data block"]
	DD --> L1["Indirect block"]
	L1 --> L2["Data block"]
	TD --> T1["Double indirect layer"]
	T1 --> T2["Indirect block"]
	T2 --> T3["Data block"]

4.5 Direct pointers

Direct pointers are the fastest and simplest path. If the inode directly names the data blocks, the kernel can resolve file offset to block with minimal metadata traversal.

Small files are cheap in this design because they often fit entirely within direct blocks.

4.6 Indirect pointers

For larger files, the inode cannot hold every block address directly. So it stores pointers to blocks that themselves contain block addresses.

With 4 KiB blocks and 8-byte block pointers:

  • one indirect block can hold 4096 / 8 = 512 pointers
  • one double indirect pointer can reference 512^2 data blocks
  • one triple indirect pointer can reference 512^3 data blocks

That is how a small fixed-size inode can represent very large files.
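To make the numbers concrete, here is a small sketch that works out which pointer level covers a given logical block and the resulting maximum file size. It uses the 4 KiB block and 8-byte pointer figures from above. The count of 12 direct pointers is an assumption borrowed from the classic textbook layout, not something every file system uses.

```python
BLOCK_SIZE = 4096
POINTER_SIZE = 8
DIRECT_POINTERS = 12  # assumption: the classic textbook inode layout

PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 512


def level_for_block(index):
    """Return which pointer level covers logical block number `index`."""
    levels = [
        ("direct", DIRECT_POINTERS),
        ("single indirect", PTRS_PER_BLOCK),
        ("double indirect", PTRS_PER_BLOCK ** 2),
        ("triple indirect", PTRS_PER_BLOCK ** 3),
    ]
    for name, capacity in levels:
        if index < capacity:
            return name
        index -= capacity  # skip past the blocks this level covers
    raise ValueError("offset exceeds the maximum file size")


max_blocks = (DIRECT_POINTERS + PTRS_PER_BLOCK
              + PTRS_PER_BLOCK ** 2 + PTRS_PER_BLOCK ** 3)
max_bytes = max_blocks * BLOCK_SIZE
print(level_for_block(5), "|", level_for_block(600), "|",
      f"{max_bytes / 2**30:.1f} GiB max")
```

With these parameters the triple indirect level dominates: it alone covers 512 GiB, while the direct pointers cover only 48 KiB.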

4.7 Large file handling

As the file grows:

  1. direct pointers fill first
  2. then the single indirect block is allocated
  3. then double indirect
  4. then triple indirect

The tradeoff is obvious:

  • small files are efficient
  • very large files require more metadata lookups

This matters for random reads into huge files, because fetching a block may require traversing multiple layers of metadata unless cached.

4.8 ext4 and extents

Modern file systems such as ext4 improve on the classic pointer-per-block model using extents.

An extent describes a contiguous range, for example:

  • logical file blocks 1000 through 1255
  • stored in physical blocks 880000 through 880255

That is much more compact than storing hundreds of separate block pointers when the file is mostly contiguous.

So the classic direct/indirect tree is still essential interview knowledge, but in real ext4 the common case is often an inode containing or pointing to an extent tree.
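A flat, sorted list of extents is enough to show the lookup idea. This is a simplified sketch: real ext4 stores extents in a tree rooted in the inode, and the `Extent` type and `lookup` function here are hypothetical names, not kernel structures.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    logical_start: int   # first logical file block covered
    physical_start: int  # first physical block on the device
    length: int          # number of contiguous blocks


def lookup(extents, logical_block) -> Optional[int]:
    """Map a logical block to a physical block; None means a hole.

    `extents` must be sorted by logical_start and must not overlap.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1  # last extent starting <= block
    if i < 0:
        return None
    e = extents[i]
    if logical_block < e.logical_start + e.length:
        return e.physical_start + (logical_block - e.logical_start)
    return None


# The second extent is the example from the text; the first is made up.
extents = [Extent(0, 500000, 1000), Extent(1000, 880000, 256)]
print(lookup(extents, 1000), lookup(extents, 1255), lookup(extents, 1256))
```

Two records describe 1256 blocks here. The pointer-per-block model would need 1256 separate entries for the same file.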

5. Directories and Name Resolution

Directories are not magical containers. In Unix-like systems, a directory is a special file whose contents encode mappings from names to inode numbers.

5.1 Directory as a special file

A directory entry conceptually stores something like:

  • filename
  • inode number
  • sometimes entry type hints

That means the directory itself has an inode, occupies blocks, and is subject to permissions.

Large directories may use indexed structures. For example, ext4 uses hashed directory indexing for scalability.

5.2 Absolute vs relative paths

  • An absolute path starts from the root, such as /usr/bin/python3.
  • A relative path starts from the process current working directory, such as logs/app.log.

The starting point changes, but the resolution logic is the same.

5.3 Path traversal step by step

Suppose a process calls:

open("/var/log/app/current.log", O_RDONLY);

The kernel conceptually does this:

  1. Start from the root directory inode because the path is absolute.
  2. Confirm execute (search) permission on that directory, which is what allows traversal.
  3. Look up var in that directory.
  4. Load the inode for var.
  5. Look up log inside var.
  6. Repeat for app.
  7. Look up current.log.
  8. Resolve symlinks if encountered, subject to limits.
  9. Perform final permission checks.
  10. Create an open file object and return a file descriptor.

The kernel tries to avoid repeating all of this work by caching directory entries and inodes in memory.
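The component-by-component walk can be modeled with a toy in-memory table. This sketch deliberately leaves out permission checks, symlinks, `..`, mount points, and caching. The inode numbers and the `INODES` table are made up for illustration.

```python
# Toy "file system": inode number -> inode record. All values are invented.
INODES = {
    2: {"type": "dir", "entries": {"var": 10}},
    10: {"type": "dir", "entries": {"log": 11}},
    11: {"type": "dir", "entries": {"app": 12}},
    12: {"type": "dir", "entries": {"current.log": 13}},
    13: {"type": "file", "data": b"hello\n"},
}
ROOT_INO = 2


def resolve(path, cwd_ino=ROOT_INO):
    """Walk `path` one component at a time; return the final inode number."""
    # Absolute paths start at the root, relative paths at the working directory.
    ino = ROOT_INO if path.startswith("/") else cwd_ino
    for name in path.split("/"):
        if name in ("", "."):
            continue
        inode = INODES[ino]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in inode["entries"]:
            raise FileNotFoundError(name)
        ino = inode["entries"][name]  # the directory maps name -> inode number
    return ino


print(resolve("/var/log/app/current.log"))
print(resolve("app/current.log", cwd_ino=11))
```

Both calls end at the same inode. Only the starting directory differs, which is the whole difference between absolute and relative paths.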

flowchart LR
	A["Path string<br/>/var/log/app/current.log"] --> B["Start at root or cwd"]
	B --> C["Lookup next component<br/>in directory"]
	C --> D["Check permissions<br/>and follow symlink rules"]
	D --> E["Load or reuse inode and dentry"]
	E --> F["Repeat until final component"]
	F --> G["Create open file object<br/>return file descriptor"]

5.4 Hard links

A hard link is another directory entry pointing to the same inode.

ln original.txt alias.txt
ls -li original.txt alias.txt

Both names refer to the same inode. Neither is the "real" one. The file's data is reclaimed only when:

  • link count reaches zero
  • and no process still has the inode open

Hard links normally cannot span file systems, because inode numbers are meaningful only inside one file system.
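The link-count behavior is easy to observe from a program. A minimal sketch using a temporary directory, with file names mirroring the shell example above:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")
    with open(original, "w") as f:
        f.write("shared contents\n")

    os.link(original, alias)  # second directory entry, same inode
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    links_before = os.stat(original).st_nlink

    os.unlink(original)       # remove one name; the data stays reachable
    links_after = os.stat(alias).st_nlink
    with open(alias) as f:
        survived = f.read()

print(same_inode, links_before, links_after, repr(survived))
```

Removing `original.txt` only decrements the link count. The contents remain reachable through `alias.txt` because the inode still has one name.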

5.5 Symbolic links

A symbolic link is a separate file whose contents are a path string.

ln -s /var/log/app/current.log current-link

The symlink has its own inode. Accessing it causes another lookup step. Symlinks can cross file system boundaries because they store names, not inode references.
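A short sketch shows the difference between the link and its target. It assumes a POSIX system where unprivileged processes may create symlinks. File names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "current.log")
    link = os.path.join(d, "current-link")
    with open(target, "w") as f:
        f.write("log line\n")

    os.symlink(target, link)
    stored_path = os.readlink(link)      # the symlink's contents: a path string
    own_inode = os.lstat(link).st_ino    # lstat does not follow the link
    target_inode = os.stat(link).st_ino  # stat follows it to the target

    os.unlink(target)                    # the link now dangles
    dangling = os.path.lexists(link) and not os.path.exists(link)

print(own_inode != target_inode, dangling)
```

Unlike a hard link, the symlink does nothing to keep the target alive. Once the target's name is removed, the link still exists but resolves to nothing.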

5.6 rename is usually metadata work

When you rename a file within the same file system, the OS often just updates directory entries. The data blocks usually do not move.

That is why same-filesystem rename() is fast and why it is used for atomic replacement patterns.

Important production detail:

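  • rename within the same mounted file system is atomic from the point of view of other processes: a lookup of the destination name finds either the old file or the new one, never a missing entry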
  • rename across file systems is not a simple metadata update and usually turns into copy plus unlink behavior at a higher layer

5.7 unlink and open files

unlink() removes a directory entry. It does not necessarily free storage immediately.

If a process still has the file open:

  • the name disappears from the directory
  • the inode still exists in memory and on disk
  • the storage is reclaimed only after the last reference is gone

This is a classic Linux debugging case:

  • a service keeps a deleted log file open
  • rm appears to succeed
  • disk space does not return

Useful command:

lsof +L1

That finds open files with link count below one.
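The deleted-but-open state can be reproduced in a few lines. This is a sketch assuming a local Linux file system. Network file systems may behave differently.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "app.log")
    with open(path, "w") as f:
        f.write("still here\n")

    fd = os.open(path, os.O_RDONLY)
    os.unlink(path)                  # the name is gone from the directory

    name_exists = os.path.exists(path)
    links = os.fstat(fd).st_nlink    # expected to be 0 on local Linux file systems
    data = os.read(fd, 100)          # the inode and its data are still readable
    os.close(fd)                     # last reference dropped: space can be reclaimed

print(name_exists, links, data)
```

Until the `os.close` call, the blocks stay allocated even though no path leads to them. This is the state a long-running service is in when its log file is removed underneath it.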

6. File Allocation Methods

How should a file system place file data on disk? This is a foundational design question.

6.1 Contiguous allocation

Store the file in one continuous run of blocks.

Strengths:

  • excellent sequential performance
  • simple block computation
  • minimal metadata overhead

Weaknesses:

  • hard for growing files
  • external fragmentation becomes a problem
  • finding large continuous free regions gets harder over time

This is conceptually ideal for reading but awkward for real workloads where files grow unpredictably.

6.2 Linked allocation

Each block points to the next block.

Strengths:

  • files can grow easily
  • no need for one large contiguous area

Weaknesses:

  • random access is poor
  • pointer corruption is dangerous
  • pointer overhead consumes space

FAT is the classic teaching example, though its pointer structure is centralized in a table rather than embedded directly inside data blocks.

6.3 Indexed allocation

Store block pointers in a separate index structure.

Strengths:

  • good random access
  • flexible growth
  • clean separation between metadata and data

Weaknesses:

  • extra metadata reads may be needed
  • index structures themselves consume space

Unix inode designs are a form of indexed allocation.

6.4 Extent-based allocation

Store ranges instead of single-block pointers.

Strengths:

  • compact metadata for large contiguous regions
  • better sequential locality
  • lower pointer overhead

Weaknesses:

  • fragmentation still matters when files grow in many places
  • allocation policy becomes more complex

Modern file systems such as ext4 and XFS use extent-based strategies because they are a practical compromise between performance and flexibility.

6.5 The real tradeoff

The design space is about balancing:

  • sequential performance
  • random access efficiency
  • metadata overhead
  • growth flexibility
  • fragmentation resistance

There is no single perfect method. The right answer depends on workload shape.

7. Free Space Management

If the file system knows where used blocks are, it also needs to know where free blocks are.

This is harder than it first appears because allocation policy directly affects performance.

7.1 Bitmaps

A bitmap uses one bit per block or inode to indicate whether it is free.

Strengths:

  • compact
  • easy to scan for contiguous runs
  • good fit for extent allocation

Weaknesses:

  • scanning can be expensive on large file systems if poorly optimized

Many modern file systems, including ext-family systems, use bitmap-based tracking.
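A minimal first-fit bitmap allocator illustrates why contiguous runs are easy to find with this structure. The `BlockBitmap` class is a teaching sketch with hypothetical names. Real allocators add locality hints, per-group bitmaps, and faster word-at-a-time scans.

```python
class BlockBitmap:
    """One bit per block: 1 = in use, 0 = free."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def _set(self, block, used):
        if used:
            self.bits[block // 8] |= 1 << (block % 8)
        else:
            self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def allocate_run(self, length):
        """First-fit search for `length` contiguous free blocks."""
        run_start, run_len = 0, 0
        for block in range(self.nblocks):
            if self.is_used(block):
                run_start, run_len = block + 1, 0  # run broken, restart after it
                continue
            run_len += 1
            if run_len == length:
                for b in range(run_start, run_start + length):
                    self._set(b, True)
                return run_start
        return None  # no run of that size exists

    def free_run(self, start, length):
        for b in range(start, start + length):
            self._set(b, False)


bm = BlockBitmap(16)
a = bm.allocate_run(4)
b = bm.allocate_run(4)
bm.free_run(a, 4)
c = bm.allocate_run(2)
d = bm.allocate_run(3)  # does not fit in the 2-block gap left behind
print(a, b, c, d)
```

The last allocation skips the leftover two-block gap and lands further out. That is external fragmentation appearing in a 16-block toy.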

7.2 Free lists

A free list stores free blocks as a linked list or chain.

Strengths:

  • simple conceptually

Weaknesses:

  • poor at finding large contiguous regions quickly
  • cache-unfriendly for large-scale allocation decisions

7.3 Grouping

Grouping stores the addresses of several free blocks together, often with one free block pointing to a batch of others.

This reduces traversal overhead compared with a single pointer chain.

7.4 Counting

Counting stores free space as runs, such as:

  • start block 90000
  • length 2048 blocks

This is efficient when free space tends to be contiguous. It fits naturally with extent-oriented allocators.
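The run representation can be sketched as a list of `(start, length)` pairs. The helper names below are hypothetical, and the allocator is plain first-fit with no merging of adjacent runs on free.

```python
def free_runs(used):
    """Collapse a per-block used/free list into (start, length) free runs."""
    runs = []
    start = None
    for block, in_use in enumerate(used):
        if not in_use and start is None:
            start = block                       # a free run begins here
        elif in_use and start is not None:
            runs.append((start, block - start)) # the run ended at the previous block
            start = None
    if start is not None:
        runs.append((start, len(used) - start))
    return runs


def allocate(runs, length):
    """First-fit: carve `length` blocks from the first run that is big enough."""
    for i, (start, run_len) in enumerate(runs):
        if run_len >= length:
            if run_len == length:
                del runs[i]
            else:
                runs[i] = (start + length, run_len - length)
            return start
    return None


used = [True, False, False, True, False, False, False, False]
runs = free_runs(used)
print(runs)
print(allocate(runs, 3), runs)
```

Eight blocks of state collapse into two records. The less fragmented the free space, the shorter this list gets, which is why the approach pairs well with extents.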

7.5 Allocation efficiency matters

Free space management is not just bookkeeping. It affects:

  • fragmentation
  • locality
  • metadata contention
  • future read and write performance

For example, ext4 tries to allocate blocks near the owning inode or directory when possible, because good placement today prevents performance pain tomorrow.

8. Journaling and Crash Consistency

This is one of the most important practical topics in operating systems.

8.1 The crash consistency problem

Many file operations require multiple physical updates.

Creating a new file might involve:

  • allocating a free inode
  • allocating data blocks
  • writing file data
  • updating the inode
  • updating the directory entry
  • updating free-space metadata

If the machine crashes in the middle, the disk may contain a partially applied update. That can leave the file system inconsistent.

8.2 Why partial writes are dangerous

The key issue is not just losing the last few bytes of a file. The bigger issue is breaking metadata invariants.

Examples:

  • an inode claims a block that the free-space bitmap still marks free
  • a directory entry points to an uninitialized inode
  • file size says 64 KB but only half the block mapping is valid

Without recovery logic, the whole file system may become corrupt.

8.3 Write-ahead logging

Journaling uses write-ahead logging.

The basic idea is:

  1. write a description of the metadata changes to a journal
  2. mark the journal transaction committed
  3. later apply those changes to the main file system structures

After a crash, the kernel can replay committed journal entries and restore a consistent state.
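Recovery can be sketched as a replay loop over the journal. The dictionary keys below are made-up stand-ins for on-disk metadata locations. A real journal logs whole blocks and uses checksums and sequence numbers to decide what counts as committed.

```python
def recover(disk, journal):
    """Replay committed transactions onto `disk`; ignore incomplete ones."""
    for txn in journal:
        if not txn["committed"]:
            continue  # the crash happened before commit: discard the transaction
        for location, value in txn["updates"]:
            disk[location] = value  # idempotent, so replaying twice is harmless
    return disk


# Hypothetical state found after a crash.
disk = {"inode_bitmap[7]": 0, "inode[7].size": 0, "dir[/tmp].entry[a]": None}
journal = [
    {"committed": True, "updates": [
        ("inode_bitmap[7]", 1),
        ("inode[7].size", 4096),
        ("dir[/tmp].entry[a]", 7),
    ]},
    {"committed": False, "updates": [("inode[7].size", 8192)]},
]
recover(disk, journal)
print(disk)
```

The committed transaction is applied as a unit and the uncommitted one is dropped entirely. That all-or-nothing property is what keeps the metadata invariants intact.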

flowchart LR
	A["User operation<br/>create write rename"] --> B["Prepare metadata updates"]
	B --> C["Write journal transaction"]
	C --> D["Commit journal entry"]
	D --> E["Checkpoint updates<br/>to home locations"]
	E --> F["Clear or advance journal"]

8.4 Metadata journaling

In metadata journaling, the file system journals metadata changes but not necessarily the file data blocks themselves.

This protects structural consistency while avoiding the overhead of writing all file data twice.

8.5 Full data journaling

In full data journaling, both metadata and file data are written to the journal before reaching home locations.

Strengths:

  • strong crash guarantees

Weaknesses:

  • higher write amplification
  • lower write throughput

This mode is safer but more expensive.

8.6 Ordered journaling

In ordered journaling, metadata is journaled, and the file system ensures that dirty data blocks reach their home locations before the metadata commit that would expose them as valid.

This is a practical compromise and historically the default mode for ext3/ext4.

It avoids a particularly ugly failure mode where metadata says new data exists but the actual data blocks still contain garbage or old contents.

8.7 ext3/ext4-style intuition

In ext-style systems you will often hear about modes like:

  • data=journal
  • data=ordered
  • data=writeback

Interpret them as a spectrum:

  • stronger consistency usually means more writes and less throughput
  • weaker ordering often means better performance but more surprising crash behavior

8.8 fsck vs journaling

fsck and journaling solve related but different problems.

fsck:

  • scans and repairs the file system by checking invariants
  • can be very slow on large volumes
  • may need to reconstruct or discard damaged state

Journaling:

  • keeps a short recent log of intended updates
  • makes crash recovery much faster
  • usually restores metadata consistency quickly after unclean shutdown

Journaling does not mean application-level data is always safe. It mainly protects filesystem consistency.

8.9 Practical crash-safe file replacement

The robust pattern for replacing a config file is not:

  1. open target
  2. overwrite target
  3. close target

The safer pattern is:

  1. write a new temporary file
  2. fsync() the temporary file
  3. rename() it over the old file
  4. fsync() the parent directory

That last step is often forgotten. Directory metadata also needs durability.
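The four steps translate fairly directly into code. This is a sketch for POSIX systems. The `atomic_write` name and the `.tmp-` prefix are arbitrary choices, and it does not preserve the old file's ownership or permissions.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data`: write temp file, fsync, rename, fsync directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same file system as the target,
    # otherwise the rename cannot be a single metadata operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer into the kernel
            os.fsync(f.fileno())    # step 2: make the file contents durable
        os.replace(tmp_path, path)  # step 3: atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Step 4: make the directory entry change itself durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A reader that opens the target at any moment sees either the complete old contents or the complete new contents. The overwrite-in-place pattern cannot offer that, because a crash can leave the file half written.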

9. The Virtual File System (VFS)

The kernel needs one common interface that works across many filesystem types.

Applications should be able to call:

  • open
  • read
  • write
  • stat
  • rename

without caring whether the target lives on ext4, XFS, tmpfs, procfs, or a network file system.

That abstraction layer is the Virtual File System.

9.1 Why VFS exists

The VFS solves a very practical kernel engineering problem:

  • many file systems
  • one syscall API
  • shared kernel mechanisms for caching, pathname lookup, permissions, mounts, and open file descriptors

9.2 Key kernel objects on Linux

Linux uses several central structures:

  • superblock: one mounted file system instance
  • inode: one filesystem object
  • dentry: one directory-cache name component
  • file: one open file description, including current offset and flags

BSD-derived systems often talk about vnodes. The concept is similar: a filesystem-neutral kernel object representing a file-like node.
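The difference between a file descriptor and the open file description behind it is visible from user space. In this sketch, a duplicated descriptor shares one offset with the original, while a second `open` of the same path gets an independent one:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abcdefgh")

    fd1 = os.open(path, os.O_RDONLY)
    fd2 = os.dup(fd1)                 # new descriptor, same open file description
    fd3 = os.open(path, os.O_RDONLY)  # separate open file description

    first = os.read(fd1, 4)     # advances the shared offset to 4
    via_dup = os.read(fd2, 4)   # continues from offset 4
    via_open = os.read(fd3, 4)  # independent offset, starts at 0

    for fd in (fd1, fd2, fd3):
        os.close(fd)

print(first, via_dup, via_open)
```

All three descriptors reach the same inode. What differs is how many `file` objects sit between the descriptors and that inode.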

9.3 open path through VFS

When a process calls open():

  1. syscall enters kernel
  2. VFS parses the pathname
  3. VFS walks dentries and mount points
  4. target filesystem provides lookup operations
  5. permissions are checked
  6. inode is resolved
  7. a struct file is created
  8. a file descriptor is installed in the process's file descriptor table

flowchart LR
	A["open /path/file"] --> B["VFS pathname walk"]
	B --> C["Dentry cache lookup"]
	C --> D["Filesystem-specific lookup<br/>ext4 xfs nfs tmpfs"]
	D --> E["Resolve inode and permissions"]
	E --> F["Create open file object"]
	F --> G["Return file descriptor"]

9.4 read and write through VFS

The VFS also normalizes read and write behavior:

  • read() usually goes through page cache first
  • write() usually updates page cache first unless special flags are used
  • filesystem-specific code handles block mapping, journaling, and writeback details

9.5 Why multiple filesystem types fit the same model

The VFS lets very different backends look file-like:

  • ext4: general-purpose journaling disk filesystem
  • xfs: high-performance extent-based filesystem
  • tmpfs: memory-backed filesystem whose pages live in the page cache and can be pushed out to swap
  • procfs: synthetic kernel information exposed as files
  • nfs: file operations forwarded over the network

That is one of the most powerful Unix design patterns: treat many resources as files, but keep the kernel abstraction flexible enough that the implementation can vary radically underneath.

10. Page Cache and Buffer Cache

This section matters a lot in real systems.

When developers say "the app wrote the file," what often really happened is "the kernel copied data into page cache and promised to flush it later."

10.1 Why caching exists

Disk access is vastly slower than RAM access. Even SSD access is still orders of magnitude slower than CPU caches and main memory.

So the kernel caches file data in memory to:

  • avoid repeated device reads
  • coalesce writes
  • allow read-ahead and write-behind
  • reduce syscall-visible latency

10.2 Read path

For a typical buffered read():

  1. process calls read(fd, buf, n)
  2. kernel checks whether the needed file pages are already in page cache
  3. on a cache hit, data is copied from cache to user buffer
  4. on a cache miss, the filesystem maps file offsets to disk blocks
  5. block I/O is issued
  6. device DMA fills memory pages
  7. pages enter page cache
  8. data is copied to user space

The kernel may also trigger readahead if it detects sequential access.

10.3 Write path

For a typical buffered write():

  1. process calls write(fd, buf, n)
  2. kernel copies data into page cache pages for that file
  3. those pages are marked dirty
  4. write() may return before storage is durable
  5. background writeback threads later flush dirty pages to storage
  6. journaling and barriers determine metadata ordering and durability behavior

This is why writes can look fast until the system is forced to flush.

flowchart TB
	R1["read()"] --> R2["Check page cache"]
	R2 -->|hit| R3["Copy to user buffer"]
	R2 -->|miss| R4["Map file offset to blocks"]
	R4 --> R5["Issue block I/O"]
	R5 --> R6["Fill page cache via DMA"]
	R6 --> R3

	W1["write()"] --> W2["Copy user data to page cache"]
	W2 --> W3["Mark pages dirty"]
	W3 --> W4["Background writeback or fsync"]
	W4 --> W5["Filesystem commit and device flush"]

10.4 Dirty pages and writeback

Dirty pages are cached file pages whose contents differ from what is currently durable on storage.

Linux allows dirty data to accumulate up to configured thresholds. When too much dirty data builds up:

  • background flushers start writeback
  • writers may be throttled
  • latency spikes can appear

Useful Linux signals:

grep -E 'Dirty|Writeback' /proc/meminfo
vmstat 1
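For monitoring scripts, the same counters can be pulled out programmatically. The parser below is a sketch. The sample text only imitates the `/proc/meminfo` format, and its numbers are invented.

```python
def dirty_and_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from meminfo-style text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])  # first field after the colon
    return result


# Made-up sample in the /proc/meminfo layout; the values are illustrative only.
sample = """MemTotal:       16384000 kB
Dirty:              1520 kB
Writeback:             0 kB
WritebackTmp:          0 kB
"""
print(dirty_and_writeback_kb(sample))
```

On a Linux host, pass the contents of `/proc/meminfo` instead of `sample`. A `Dirty` value that keeps climbing during a write burst is data the kernel has accepted but not yet written back.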

10.5 sync, fsync, fdatasync

These are often misunderstood.

  • sync() asks the kernel to flush dirty data system-wide
  • fsync(fd) asks for the file's data and required metadata to be durable
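  • fdatasync(fd) is like fsync but may skip metadata that is not needed to read the data back, such as timestamps. Metadata that is needed, such as a changed file size, is still flushed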

In production, fsync is the operation that makes people discover how expensive durability really is.
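In code, the calls look like this. It is a sketch with an arbitrary file name. `os.fdatasync` is not available on every platform, so the example falls back to `os.fsync`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "journal.dat")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b"record 1\n")   # lands in page cache; not yet durable
        os.fsync(fd)                  # data plus the metadata needed to find it
        os.write(fd, b"record 2\n")
        if hasattr(os, "fdatasync"):
            os.fdatasync(fd)          # data, and only the metadata needed to read it
        else:
            os.fsync(fd)
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)

print(size)
```

Neither call makes the new directory entry for `journal.dat` durable. That requires an fsync on the parent directory.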

10.6 Why fsync is expensive

fsync is not just "write these bytes." It may require:

  • flushing dirty file pages
  • writing journal records
  • waiting for metadata commit
  • issuing cache flush commands to the device
  • waiting for controller acknowledgment

On cloud block storage, there may also be hypervisor or network replication in the path.

10.7 Buffer cache vs page cache

Historically, Unix systems distinguished a buffer cache, which held disk blocks indexed by device and block number, from a page cache, which held file contents indexed by file and offset.

In modern Linux, the picture is mostly unified around the page cache, though buffer-head-like metadata structures may still exist internally for block mapping and bookkeeping.

The practical takeaway is not the historical naming. It is this:

  • cached file I/O and memory management are deeply intertwined
  • file data lives in memory pages that the kernel manages like other memory-backed objects

10.8 O_DIRECT and why databases care

Some workloads use O_DIRECT to reduce or bypass page cache involvement.

Reasons include:

  • avoiding double caching between the kernel and database buffer pool
  • more explicit control over write ordering and eviction

But O_DIRECT is not a universal win. It can reduce cache pollution in some systems and hurt performance badly in others.

11. mmap()

mmap() lets a file appear as part of a process address space.

This is where memory management and file systems directly meet.

11.1 File-backed memory

With mmap, the process does not call read() for every access. Instead, it accesses memory addresses. The kernel loads file-backed pages on demand.

11.2 Page faults drive loading

When the process first touches a file-backed page that lies inside the mapping but is not yet present in its page tables:

  1. CPU raises a page fault
  2. kernel sees the faulting address belongs to a file mapping
  3. kernel finds the corresponding file offset
  4. page cache is checked or filled
  5. page table is updated
  6. the process resumes

This is lazy loading backed by the same file/page cache machinery.

11.3 Shared vs private mappings

  • MAP_SHARED: writes can propagate back to the file and be visible to other mappers
  • MAP_PRIVATE: copy-on-write view; modifications affect private pages, not the file
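The two mapping types can be compared with Python's `mmap` module, whose `ACCESS_COPY` and `ACCESS_WRITE` modes correspond to private and shared mappings on Unix. This is a small sketch on a temporary file:

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mapped.bin")
    with open(path, "wb") as f:
        f.write(b"hello world")

    with open(path, "r+b") as f:
        # Copy-on-write mapping: changes stay in this process's private pages.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as private:
            private[0:5] = b"HELLO"
            private_view = bytes(private[0:5])
        with open(path, "rb") as check:
            after_private = check.read()

        # Write-through mapping: changes propagate to the file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as shared:
            shared[0:5] = b"HELLO"
            shared.flush()  # request writeback of the modified pages
        with open(path, "rb") as check:
            after_shared = check.read()

print(private_view, after_private, after_shared)
```

The private mapping sees its own edit, but the file does not change. The shared mapping's edit shows up in an ordinary `read()` because both paths go through the same cached pages.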

11.4 Performance implications

mmap can be powerful because it:

  • avoids explicit copy loops in some cases
  • integrates with demand paging
  • lets the kernel manage readahead and caching naturally

But it also has costs:

  • page fault overhead
  • tricky error handling semantics: I/O errors surface as signals such as SIGBUS rather than as return values
  • harder control of writeback timing
  • possible major faults under memory pressure

11.5 Database usage

Some systems, such as LMDB, lean heavily on mmap. Others, such as PostgreSQL, prefer explicit buffered I/O and their own buffer management for tighter control.

The key question is not whether mmap is good or bad. It is whether you want the kernel's paging policy or the application's own caching policy to dominate.

12. File System Performance and Debugging

Storage performance problems are rarely just "disk is slow." You need to reason layer by layer.

12.1 Sequential vs random I/O

Sequential I/O is friendly because:

  • readahead works well
  • extent locality helps
  • HDDs avoid repeated seeks
  • SSDs can stream through controller pipelines efficiently

Random I/O is harder because:

  • HDDs pay seek and rotation repeatedly
  • metadata lookups are less cache-friendly
  • SSDs may still suffer queueing and mapping overhead

12.2 The small files problem

Small files are deceptively expensive.

A 200-byte file may consume:

  • one inode
  • one directory entry
  • one or more data blocks depending on layout
  • journaling traffic for metadata
  • many cache and lookup operations relative to useful payload

This is why systems with millions of tiny files can become metadata-bound rather than bandwidth-bound.
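The data-block part of that cost is easy to estimate. This sketch assumes whole-block allocation with a 4096-byte block size, which is a common default but still an assumption here; some file systems can store very small files inline and avoid the data block entirely. The function names are hypothetical.

```python
import math

def allocated_bytes(file_size, block_size=4096):
    """Bytes of data blocks consumed, assuming whole-block allocation."""
    return math.ceil(file_size / block_size) * block_size

def slack_fraction(file_size, block_size=4096):
    """Fraction of the allocated space that holds no payload."""
    alloc = allocated_bytes(file_size, block_size)
    return 0.0 if alloc == 0 else (alloc - file_size) / alloc

print(allocated_bytes(200))           # 4096
print(round(slack_fraction(200), 3))  # 0.951
```

Under these assumptions roughly 95% of the block is wasted, and that is before counting the inode, directory entry, and journal traffic.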

12.3 Metadata bottlenecks

Not all filesystem work is data transfer. Sometimes the bottleneck is:

  • inode lookup
  • path traversal
  • directory locking
  • journal commit throughput
  • inode or dentry cache churn

If a workload creates and deletes huge numbers of files, metadata can dominate the total cost.

12.4 Fragmentation

Fragmentation hurts locality.

  • HDDs suffer more because the head moves between scattered regions
  • SSDs suffer less from location changes, but fragmented metadata and smaller I/Os can still reduce efficiency

Useful Linux tool:

filefrag -v bigfile

12.5 Caching effects

Two runs of the same program can have totally different performance depending on cache state.

Questions to ask:

  • was the data already in page cache
  • did directory entries come from dentry cache
  • did readahead trigger
  • were writes absorbed in cache and delayed

12.6 fsync cost and sync storms

If many threads or services call fsync frequently, the system can enter a pattern of repeated forced flushes and journal commits.

Symptoms include:

  • low average throughput
  • very high p99 latency
  • bursts of stalled writers

This is common in:

  • databases
  • durable message queues
  • log-heavy services
  • applications doing crash-safe file replacement incorrectly or too often
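For reference, a commonly used pattern for crash-safe file replacement on POSIX systems looks roughly like the sketch below: write a temporary file in the same directory, fsync it, rename it over the target, then fsync the directory. The helper name `atomic_write` is hypothetical, and the exact guarantees depend on the file system and mount options.

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same directory, so rename stays on one file system
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # user-space buffer -> page cache
            os.fsync(f.fileno())   # page cache -> stable storage
        os.replace(tmp, path)      # atomic rename over the old name
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Persist the directory entry change as well.
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "config.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as f:
        print(f.read())  # b'{"version": 2}'
```

Each call costs two fsyncs, which is exactly why doing this for every small update can trigger the sync storms described above.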

12.7 Write amplification

One logical write can become many physical writes because of:

  • journaling
  • metadata updates
  • SSD erase block behavior
  • RAID parity updates
  • copy-on-write designs in some file systems or storage layers

This is why small synchronous writes can perform much worse than application developers expect.

12.8 Practical Linux debugging intuition

Useful commands when debugging storage behavior:

iostat -x 1
pidstat -d 1
vmstat 1
df -h
df -i
strace -tt -T -e openat,read,write,fsync your_program

How to interpret them:

  • high disk utilization with low throughput often suggests small random I/O or a sync-heavy workload
  • low disk utilization with slow app I/O may suggest lock contention, page faults, throttling, or network-backed storage delay
  • inode exhaustion from df -i can break systems even when df -h shows free space
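The inode check can also be done programmatically. This is a small sketch using `os.statvfs`, which is available on Unix-like systems; the function name `inode_usage` is an assumption, and some file systems report zero total inodes because they allocate them dynamically.

```python
import os

def inode_usage(path="/"):
    """Return the fraction of inodes in use, or None if the file system reports no fixed count."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    if total == 0:
        return None  # dynamically allocated inodes: no meaningful ratio
    return (total - free) / total

usage = inode_usage("/")
print("inode usage:", "n/a" if usage is None else f"{usage:.1%}")
```

A value close to 100% explains "No space left on device" errors even when plenty of bytes are free.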

13. Modern Storage: SSD Internals

To understand modern I/O behavior, you need a rough model of how flash works.

13.1 Flash pages and erase blocks

NAND flash is not overwritten in place the way applications imagine files are.

Typical properties:

  • reads happen at page granularity
  • writes happen at page granularity
  • erases happen at much larger erase-block granularity

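You cannot overwrite an already-written page in place: it must be erased first, and erasing happens only at erase-block granularity. So the controller writes new data to a fresh page, remaps the logical address, and eventually reclaims the stale pages.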

13.2 Flash Translation Layer (FTL)

The SSD exposes logical block addresses, but internally it maps them to physical flash locations using the Flash Translation Layer.

The FTL handles:

  • logical-to-physical remapping
  • garbage collection
  • wear leveling
  • bad block management

This means the device itself is already doing sophisticated scheduling and mapping behind the OS's back.
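A toy model makes the remapping concrete. The sketch below is a deliberately simplified illustration, not how any real controller is implemented: it keeps a logical-to-physical map, always writes to the next free page, and marks the previous copy stale. Garbage collection and wear leveling are left out, and the class name `ToyFTL` is hypothetical.

```python
class ToyFTL:
    """Out-of-place writes with a logical-to-physical page map (no garbage collection)."""

    def __init__(self, num_pages):
        self.mapping = {}                  # logical page number -> physical page number
        self.state = ["free"] * num_pages  # each physical page: free, valid, or stale
        self.next_free = 0

    def write(self, lpn):
        if self.next_free >= len(self.state):
            raise RuntimeError("no free pages: garbage collection would be needed")
        old = self.mapping.get(lpn)
        if old is not None:
            self.state[old] = "stale"      # old copy cannot be overwritten in place
        ppn = self.next_free
        self.state[ppn] = "valid"
        self.mapping[lpn] = ppn
        self.next_free += 1
        return ppn

ftl = ToyFTL(num_pages=8)
ftl.write(0)
ftl.write(1)
ftl.write(0)                       # "overwrite" of logical page 0
print(ftl.mapping)                 # {0: 2, 1: 1}
print(ftl.state.count("stale"))    # 1
```

The logical address stays the same across the overwrite while the physical location changes, which is why logical locality says little about physical layout.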

13.3 Wear leveling

Flash cells wear out after repeated program/erase cycles. Wear leveling spreads writes across the device to avoid burning out hot regions.

From the OS perspective, that means logical locality is not the same as physical locality.

13.4 TRIM

When a file is deleted, the file system marks its blocks free in its own metadata, but the SSD may not know those pages are no longer needed unless the OS sends discard or TRIM information.

TRIM helps the controller:

  • reclaim invalid pages earlier
  • reduce garbage collection pressure
  • improve sustained write performance

13.5 Why SSD scheduling differs from HDD scheduling

Classical disk scheduling optimizes head movement on rotating disks.

SSDs do not need seek optimization, but they still care about:

  • queue depth
  • request merging
  • read/write balance
  • latency control under heavy writeback
  • internal parallelism across channels and dies

So the scheduling problem changes rather than disappearing.

14. Why Disk Scheduling Exists

Disk scheduling exists because storage requests arrive faster and in a worse order than the device can serve them efficiently.

14.1 The raw problem on HDDs

If ten processes ask for blocks in random order, a naive scheduler may send the disk head zig-zagging across the platter constantly.

That destroys throughput and increases latency.

So the OS reorders requests to improve one or more of:

  • total throughput
  • average latency
  • tail latency
  • fairness
  • starvation resistance

14.2 Throughput vs fairness

The scheduler is not only trying to make the disk fast. It is deciding whose requests wait.

A policy that always serves the closest request may maximize seek efficiency but starve requests far away.

That is why scheduling is always a tradeoff between mechanical efficiency and fairness.

15. Disk Access Time Components

Understanding disk scheduling starts with understanding what contributes to request latency.

15.1 Seek time

The time required to move the disk arm to the correct track.

On HDDs this can be one of the dominant costs for random I/O.

15.2 Rotational latency

Once the head reaches the right track, the platter still has to rotate until the desired sector passes under the head.

Average rotational latency is about half a rotation.

15.3 Transfer time

Once positioned correctly, the actual data transfer begins. For small requests, transfer time may be much smaller than seek plus rotation.

15.4 Queueing delay

This is the time the request waits before the device starts serving it.

Under heavy load, queueing delay can dominate everything else, even on SSDs.

15.5 What dominates in practice

  • On HDD random I/O, seek and rotational latency dominate.
  • On HDD sequential I/O, transfer time becomes more important.
  • On SSDs, controller behavior and queueing are often more important than media access time.
  • On cloud block devices, network and virtualization delay may dominate.
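A back-of-the-envelope calculation shows why seek and rotation dominate small random reads on an HDD. The drive parameters below (8 ms average seek, 7200 RPM, 150 MB/s transfer rate) are illustrative assumptions, not measurements, and queueing delay is ignored.

```python
def hdd_access_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s):
    """Estimated service time for one request: seek + average rotation + transfer."""
    rotation_ms = 60_000 / rpm                 # time for one full rotation
    rotational_latency_ms = rotation_ms / 2    # on average, half a rotation
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

t = hdd_access_ms(seek_ms=8.0, rpm=7200, request_bytes=4096, transfer_mb_per_s=150)
print(f"{t:.2f} ms per request, about {1000 / t:.0f} random IOPS")
# 12.19 ms per request, about 82 random IOPS
```

Under these assumptions the transfer itself is under 0.03 ms, so nearly all of the time goes to positioning the head.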

16. Classical Disk Scheduling Algorithms

These algorithms are taught because they build intuition about queue management and fairness, even though modern devices often do additional reordering internally.

Assume the current head position is cylinder 53 and the pending requests, in arrival order, are:

98, 183, 37, 122, 14, 124, 65, 67

16.1 FCFS

First-Come, First-Served serves requests in arrival order.

Order here:

53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67

Why it exists:

  • simple
  • fair in arrival order
  • no starvation from reordering

Why it performs poorly on HDDs:

  • terrible seek behavior under random workloads
  • high average head movement
  • convoy effects where unlucky order hurts everyone

FCFS is a good baseline for fairness, not for mechanical efficiency.
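To put a number on "high average head movement", the total distance for the FCFS order can be computed directly. This sketch treats the listed order as arrival order; the function name `total_head_movement` is an assumption.

```python
def total_head_movement(start, order):
    """Sum of absolute cylinder distances when serving `order` starting from `start`."""
    total, pos = 0, start
    for cylinder in order:
        total += abs(cylinder - pos)
        pos = cylinder
    return total

requests = [98, 183, 37, 122, 14, 124, 65, 67]  # arrival order
print(total_head_movement(53, requests))  # 640
```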

16.2 SSTF

Shortest Seek Time First serves the request closest to the current head position.

Order here:

53 -> 65 -> 67 -> 37 -> 14 -> 98 -> 122 -> 124 -> 183

Why it is attractive:

  • reduces immediate seek distance
  • often improves average throughput over FCFS

Its main problem:

  • starvation

If requests keep arriving near the current head, distant requests may wait a very long time.

This is the storage equivalent of a scheduler that always favors the most convenient work item.
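The SSTF order can be generated with a greedy loop. This is a minimal sketch for a static request list; the function names are assumptions, and ties are broken by list order, which is an arbitrary choice.

```python
def sstf_order(start, requests):
    """Repeatedly pick the pending request closest to the current head position."""
    pending, pos, order = list(requests), start, []
    while pending:
        nearest = min(pending, key=lambda c: abs(c - pos))
        pending.remove(nearest)
        order.append(nearest)
        pos = nearest
    return order

def movement(start, path):
    """Total cylinders travelled along `path` starting from `start`."""
    return sum(abs(b - a) for a, b in zip([start] + path, path))

order = sstf_order(53, [98, 183, 37, 122, 14, 124, 65, 67])
print(order)                # [65, 67, 37, 14, 98, 122, 124, 183]
print(movement(53, order))  # 236
```

With a static queue nothing starves. Starvation appears only when new requests keep arriving near the head, which this sketch does not model.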

16.3 SCAN

SCAN, often called the elevator algorithm, moves in one direction servicing requests until it reaches the end, then reverses.

If the head moves upward first:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> 37 -> 14

Why it helps:

  • avoids zig-zagging
  • provides better fairness than SSTF
  • gives more predictable wait times

SCAN sweep pattern: low tracks -> head moves upward, serving requests on the way -> reach high end -> reverse direction -> serve remaining lower requests

16.4 C-SCAN

Circular SCAN moves in only one servicing direction.

If it services upward:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> jump to start -> 14 -> 37

Why it exists:

  • more uniform waiting time than SCAN
  • the cylinders are treated as a circular list, so the last one wraps around to the first

The jump back is not free physically, but no requests are serviced during that return.

C-SCAN sweep pattern: low tracks -> move upward, serving requests -> reach high end -> jump to low end without servicing -> resume upward scan
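SCAN and C-SCAN can be compared on the same request list. The text does not state the disk size, so this sketch assumes cylinders 0 to 199, and it counts the C-SCAN return jump as head movement (some textbooks do not). Function names are hypothetical.

```python
def movement(start, path):
    """Total cylinders travelled along `path` starting from `start`."""
    return sum(abs(b - a) for a, b in zip([start] + path, path))

def scan_up(start, requests, max_cyl):
    """SCAN moving upward first: sweep to the top end, then reverse."""
    up = sorted(c for c in requests if c >= start)
    down = sorted((c for c in requests if c < start), reverse=True)
    return up + [max_cyl] + down

def cscan_up(start, requests, max_cyl):
    """C-SCAN: sweep to the top end, jump to cylinder 0, continue upward."""
    up = sorted(c for c in requests if c >= start)
    wrapped = sorted(c for c in requests if c < start)
    return up + [max_cyl, 0] + wrapped

requests = [98, 183, 37, 122, 14, 124, 65, 67]
MAX_CYL = 199  # assumed disk size, not given in the text

print(movement(53, scan_up(53, requests, MAX_CYL)))   # 331
print(movement(53, cscan_up(53, requests, MAX_CYL)))  # 382
```

C-SCAN travels further in total here, which is the price it pays for more uniform waiting times.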

16.5 LOOK

LOOK is like SCAN, but it does not go all the way to the physical end if no request is there.

Order here:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 37 -> 14

It saves unnecessary movement compared with SCAN.

16.6 C-LOOK

C-LOOK is like C-SCAN, but it jumps from the highest pending request to the lowest pending request rather than going to the absolute disk end.

Order here:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 14 -> 37

This keeps the predictable one-direction service pattern while avoiding some extra movement.
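The savings from LOOK and C-LOOK show up when the travel distance is computed for the same example. This is a small sketch with hypothetical function names; the C-LOOK jump from the highest to the lowest request is counted as movement.

```python
def movement(start, path):
    """Total cylinders travelled along `path` starting from `start`."""
    return sum(abs(b - a) for a, b in zip([start] + path, path))

def look_up(start, requests):
    """LOOK moving upward first: reverse at the last pending request."""
    up = sorted(c for c in requests if c >= start)
    down = sorted((c for c in requests if c < start), reverse=True)
    return up + down

def clook_up(start, requests):
    """C-LOOK: after the highest request, jump to the lowest and continue upward."""
    up = sorted(c for c in requests if c >= start)
    wrapped = sorted(c for c in requests if c < start)
    return up + wrapped

requests = [98, 183, 37, 122, 14, 124, 65, 67]
print(movement(53, look_up(53, requests)))   # 299
print(movement(53, clook_up(53, requests)))  # 322
```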

16.7 How to think about them in interviews

Do not memorize only names. Memorize the tradeoffs:

  • FCFS: simple and fair, poor throughput
  • SSTF: good average seek, starvation risk
  • SCAN: good balance of throughput and fairness
  • C-SCAN: more uniform wait time
  • LOOK and C-LOOK: like the SCAN family, but avoid unnecessary travel to the disk ends

17. NCQ and Modern Scheduling

Classical scheduling was designed for an OS directly shaping a single disk queue. Modern storage is more layered.

17.1 Native Command Queuing

With NCQ on SATA drives and deeper queues on NVMe, the device controller can accept multiple outstanding requests and reorder them internally.

That means:

  • the OS is no longer the only scheduler
  • the controller can optimize for media characteristics the OS cannot see directly
  • device firmware can exploit internal parallelism and locality

17.2 Why classical algorithms matter less today

They matter less as exact production policies because:

  • drives reorder internally
  • SSDs do not need head-movement minimization
  • NVMe devices have many queues and much lower latency

But the mental models still matter because real systems still need request management for:

  • fairness
  • read versus write prioritization
  • latency bounds
  • cgroup isolation
  • merge behavior
  • saturated device queues

17.3 Multi-queue block layer

Modern Linux uses a multi-queue block layer (blk-mq) so many CPUs can submit I/O without one global lock becoming the bottleneck.

This matters especially for NVMe, where the hardware itself supports many submission and completion queues.

18. Linux I/O Scheduler Concepts

For interviews and production work, it is more useful to know why each scheduler exists than to memorize every historical default.

18.1 noop

noop does very little beyond simple merging and FIFO-like dispatch. In the multi-queue block layer, the closest equivalent choice is named none.

It exists for cases where lower layers already do the smart work, such as:

  • hardware RAID controllers
  • SSDs with strong internal scheduling
  • virtualized environments where the guest should avoid over-optimizing unknown physical layout

18.2 deadline

deadline tries to prevent starvation by giving requests deadlines while still allowing some reordering for efficiency.

Why it exists:

  • SSTF-like behavior can starve distant requests
  • databases care about bounded latency, not only throughput
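The core idea can be shown with a toy simulation. This is only an illustration of "nearest first, unless something has waited too long"; it is not the Linux implementation, which is more elaborate. One near request arrives per tick while a far request waits, and all names and numbers are assumptions.

```python
def pick_next(pending, head, now, max_wait):
    """pending holds (arrival_tick, sector) tuples."""
    oldest = min(pending, key=lambda r: r[0])
    if now - oldest[0] >= max_wait:
        return oldest                                    # deadline expired: serve it now
    return min(pending, key=lambda r: abs(r[1] - head))  # otherwise serve the nearest

def simulate(max_wait, ticks=10):
    """Return the tick at which the far request (sector 180) is served, or None."""
    head, pending, served_far_at = 50, [(0, 180)], None
    for now in range(ticks):
        pending.append((now, 50 + now % 3))  # a steady stream of nearby requests
        req = pick_next(pending, head, now, max_wait)
        pending.remove(req)
        head = req[1]
        if req[1] == 180 and served_far_at is None:
            served_far_at = now
    return served_far_at

print(simulate(max_wait=float("inf")))  # None: pure nearest-first starves sector 180
print(simulate(max_wait=5))             # 5: the deadline forces service
```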

18.3 CFQ

CFQ means Completely Fair Queuing. It tried to give processes fair access to the disk by assigning time slices and queues.

It mattered more in older single-queue block-layer designs and was removed from mainline Linux together with the legacy block layer in kernel 5.0. It remains historically important because it focused on fairness and interactive responsiveness.

18.4 mq-deadline

mq-deadline adapts the deadline idea to the multi-queue block layer.

Think of it as a modern version of deadline designed for current Linux storage stacks.

18.5 BFQ

BFQ means Budget Fair Queueing. It focuses on fairness in terms of bandwidth and can help interactive workloads stay responsive under I/O load.

This is useful when one noisy workload would otherwise dominate the device.

18.6 kyber

kyber is designed for fast multi-queue devices, focusing on latency control by managing queue depth for different request classes.

18.7 The real lesson

Schedulers exist because storage workloads differ:

  • some want maximum throughput
  • some want predictable latency
  • some want fairness across tenants or processes
  • some run on devices that already reorder aggressively

Useful Linux check:

cat /sys/block/<device>/queue/scheduler
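The file lists the available schedulers on one line, with the active one in square brackets. A small parser for that format might look like the sketch below; the sample line is illustrative rather than output captured from a specific machine, and the function name is hypothetical.

```python
def parse_scheduler(line):
    """Return (active, available) from a line such as '[mq-deadline] kyber bfq none'."""
    names = line.split()
    active = next(
        (n[1:-1] for n in names if n.startswith("[") and n.endswith("]")),
        None,
    )
    available = [n.strip("[]") for n in names]
    return active, available

print(parse_scheduler("[mq-deadline] kyber bfq none"))
# ('mq-deadline', ['mq-deadline', 'kyber', 'bfq', 'none'])
```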

19. RAID and Scheduling Interaction

RAID changes both performance and failure behavior. It also changes how scheduling should be interpreted, because one logical request may map to multiple physical operations.

19.1 RAID 0

RAID 0 stripes data across disks.

Effects:

  • higher throughput
  • potentially more parallel I/O
  • no redundancy

Scheduling implication:

  • sequential I/O can scale well across members
  • random I/O may gain parallelism, but a single member failure loses the array

19.2 RAID 1

RAID 1 mirrors data.

Effects:

  • reads may be served from either mirror
  • writes must update all mirrors

Scheduling implication:

  • read scheduling can exploit multiple copies
  • write latency is set by the slowest mirror, because a write completes only when every copy is updated

19.3 RAID 5

RAID 5 stripes data plus distributed parity.

Effects:

  • space-efficient redundancy
  • painful small-write behavior due to parity updates

For small writes, the controller may need read-modify-write cycles. That increases latency and write amplification.
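The read-modify-write cycle follows from how parity is computed. Here is a minimal sketch using XOR parity over three tiny data blocks; the block contents and names are made up for illustration, and real arrays work on much larger stripes.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"\x0f\x0f", b"\xf0\x01", b"\x33\x10"
parity = xor_blocks(d0, d1, d2)

# Small write to d1: read old data and old parity, then write new data and new parity.
new_d1 = b"\xaa\x55"
new_parity = xor_blocks(parity, d1, new_d1)

print(new_parity == xor_blocks(d0, new_d1, d2))  # True: matches a full recompute
print(xor_blocks(new_d1, d2, new_parity) == d0)  # True: a lost block can be rebuilt
```

One logical write therefore turns into two reads and two writes, which is where the extra latency and write amplification come from.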

19.4 RAID 10

RAID 10 combines mirroring and striping.

Effects:

  • strong performance
  • better failure tolerance than RAID 0
  • better random-write behavior than RAID 5

This is why RAID 10 is often preferred for database workloads that care about both latency and resilience.

19.5 Rebuilds change everything

During rebuild:

  • background recovery I/O competes with foreground workload
  • queueing increases
  • latency spikes are common

In production, an array that benchmarks well while healthy may behave very differently while degraded or rebuilding.

20. Real-World Production Relevance

This is where the theory becomes operationally useful.

20.1 Database latency spikes

A database write path often depends on:

  • page cache or direct I/O strategy
  • journal or WAL fsync frequency
  • device cache flush latency
  • controller queue depth
  • RAID behavior
  • cloud storage variance

If p99 commit latency spikes every few seconds, think about:

  • journal commits
  • dirty page throttling
  • device garbage collection
  • burst-credit exhaustion on cloud volumes
  • noisy neighbors on shared infrastructure

20.2 Log-heavy systems

Append-only logging sounds simple, but under durability requirements it can become sync-heavy and metadata-heavy.

Questions to ask:

  • is every append followed by fsync
  • is log rotation causing rename and reopen churn
  • are deleted log files still open
  • is the device saturated by small sync writes

20.3 fsync bottlenecks

If an application insists on durability after every request, then your performance ceiling is often set by the storage system's durable commit latency, not by CPU.

That is why group commit and batching matter so much in databases and messaging systems.
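A rough sketch of the batching idea: buffer several records, then make them durable with a single write and a single fsync. The class `GroupCommitLog` is hypothetical and skips concurrency, acknowledgements, and short-write handling that a real implementation would need.

```python
import os
import tempfile

class GroupCommitLog:
    """Append-only log that amortizes one fsync over many records."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []
        self.fsync_calls = 0

    def append(self, record: bytes):
        self.pending.append(record)  # not durable yet

    def commit(self):
        if not self.pending:
            return
        os.write(self.fd, b"".join(self.pending))  # real code would loop on short writes
        os.fsync(self.fd)                          # one durable commit for the whole batch
        self.fsync_calls += 1
        self.pending.clear()

    def close(self):
        self.commit()
        os.close(self.fd)

with tempfile.TemporaryDirectory() as d:
    log = GroupCommitLog(os.path.join(d, "wal.log"))
    for i in range(100):
        log.append(b"record\n")
    log.commit()
    print(log.fsync_calls)  # 1
    log.close()
```

The trade-off is that a record is only durable once its batch commits, so callers wait slightly longer individually in exchange for much higher aggregate throughput.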

20.4 Noisy neighbors

In shared environments, your latency can rise because someone else is consuming queue depth, bandwidth, or controller time.

This may happen on:

  • shared SSD arrays
  • virtualized hosts
  • cloud block devices
  • multi-tenant storage services

The file system inside the guest may look healthy while the real bottleneck is outside the guest.

20.5 Cloud block storage behavior

Many cloud "disks" are not local disks. They may be network-attached block devices with caching, replication, and throttling policies outside your VM.

That means:

  • latency can have network-like variance
  • throughput may depend on provisioned IOPS or burst budgets
  • fsync may imply remote durability work

20.6 SSD misconceptions

Common bad assumptions:

  • "SSDs make scheduling irrelevant"
  • "If average latency is low, tail latency must also be low"
  • "delete immediately frees space inside the SSD"
  • "write completion means the data is definitely durable on flash"

All of these can be false depending on controller behavior, queueing, caches, and durability settings.

20.7 Practical debugging checklist

When a system is "slow on disk," ask in this order:

  1. Is the workload read-heavy, write-heavy, metadata-heavy, or sync-heavy?
  2. Is it hitting page cache or actual storage?
  3. Is latency caused by queueing, flushes, seek behavior, or cloud/network effects?
  4. Is fragmentation or file layout hurting locality?
  5. Is the bottleneck the file system, block layer, controller, device, RAID, or remote backend?
  6. Are open deleted files, inode exhaustion, or journal pressure involved?

21. Interview Mental Models

If you need to explain these topics clearly in an interview, center your answers around a few strong mental models.

21.1 A file is name plus metadata plus data blocks

More precisely:

  • the directory maps name to inode
  • the inode stores metadata and block mapping
  • the data lives elsewhere

That one model explains hard links, rename, unlink, open deleted files, and path lookup.
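The model can be checked from user space on a Unix-like system. The sketch below creates a hard link, confirms both names resolve to the same inode, then removes every name while a file descriptor is still open. The function name and file names are illustrative.

```python
import os
import tempfile

def demo_names_and_inodes():
    """Return (same_inode, link_count, data_read_after_unlink)."""
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
        with open(a, "wb") as f:
            f.write(b"payload")

        os.link(a, b)  # a second directory entry for the same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        links = os.stat(a).st_nlink

        f = open(b, "rb")
        os.unlink(a)
        os.unlink(b)       # no names remain, but the inode is still open
        data = f.read()    # the data blocks are still reachable through the descriptor
        f.close()          # now the file system can reclaim the inode and blocks
        return same_inode, links, data

print(demo_names_and_inodes())  # (True, 2, b'payload')
```

This is also why deleting a large log file does not free space until the process holding it open closes the descriptor.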

21.2 File systems are consistency machines

A file system is not just a storage format. It is a consistency mechanism that preserves invariants across crashes.

That is why journaling, ordering, barriers, and fsync exist.

21.3 Buffered write does not mean durable write

If you say this confidently in an interview and then explain page cache, dirty pages, and fsync, you are operating at the right depth.

21.4 Scheduling is about tradeoffs, not one best algorithm

The right algorithm depends on what you optimize:

  • throughput
  • fairness
  • latency
  • starvation resistance
  • device characteristics

21.5 Modern systems are layered

Application-visible I/O behavior is shaped by all of these:

  • VFS
  • concrete filesystem
  • page cache
  • block layer
  • scheduler
  • controller
  • device internals
  • virtualization or cloud storage

If you debug only one layer, you will miss real bottlenecks.

22. Final Takeaway

File systems and disk scheduling matter because persistence is not just about storing bytes. It is about building a reliable, performant illusion over unreliable timing, slow media, complex metadata, and failure-prone updates.

For interviews, the winning mental model is:

  • directories give names
  • inodes give identity and metadata
  • block mapping gives location
  • journaling gives recoverability
  • page cache gives speed
  • fsync gives durability semantics
  • scheduling gives controlled access to scarce storage time

For production systems, the winning habit is to ask which layer is actually responsible for the latency or correctness problem you are seeing.

That is the difference between knowing storage APIs and understanding operating systems.