tarun-elango 3c0881290e more subjects

2026-04-26 14:53:29 -04:00

File Systems and Disk Scheduling in Operating Systems

File systems and disk scheduling are where operating systems stop being abstract software platforms and start dealing with stubborn physical reality.

As a software engineer, you usually work with a friendly model:

open a file
write bytes
read them back later
trust that paths name things reliably
assume storage is slower than RAM, but still manageable

Underneath that model, the operating system is solving a much harder problem:

storage devices expose blocks, not named files
crashes can happen in the middle of updates
persistence has to survive process death, kernel reboot, and power loss
many processes want to read and write the same storage at once
the hardware may be mechanical, flash-based, network-backed, or hidden behind a hypervisor

This guide is written for practical understanding rather than memorization. The goal is to build the kind of mental model that helps in interviews, debugging, performance tuning, database work, and infrastructure design.

1. Why File Systems Exist

At the hardware level, a disk or SSD does not naturally contain directories, filenames, permissions, or "the log file for service X". It exposes addressable storage locations. Historically these were sectors on a rotating disk. Today they may be logical block addresses backed by flash translation layers, RAID controllers, or cloud storage systems.

If the OS exposed raw storage directly to applications, every program would need to solve the same problems itself:

where to place data
how to find it later
how to avoid overwriting other data
how to reuse freed space
how to handle crashes halfway through an update
how to represent ownership, permissions, and timestamps

That would be chaos.

The file system exists to provide a durable logical namespace over raw block storage.

1.1 The abstraction gap

Think of RAM and disk as very different kinds of storage:

RAM is fast, byte-addressable, and volatile.
Disk is slow, block-oriented, and persistent.

Applications want a simple logical model:

named files
hierarchical directories
append, overwrite, truncate, rename
metadata such as owner and mode bits

The file system is the translation layer between that logical model and the physical storage layout.

1.2 Logical view vs physical storage

The logical view says:

there is a file called /var/log/app.log
it has permissions 0640
it is owned by root:adm
it has a size of 48 MB

The physical view says:

a directory entry maps app.log to inode 912341
inode 912341 stores metadata and block mapping information
the file's contents live in a set of physical blocks or extents
some data may still be dirty in page cache and not yet on media

That distinction is central to both interview answers and production debugging.

1.3 Naming and organization matter

A file system does more than store bytes. It also gives structure:

directories let humans and programs organize data
metadata enables permissions, accounting, and auditing
links allow multiple names for the same underlying object
mount points let multiple storage backends appear in one namespace

Without a file system, persistence would still exist, but it would look more like manual block management than application-friendly storage.

flowchart LR
	A["Application<br/>open read write fsync"] --> B["System call layer"]
	B --> C["VFS<br/>common file API"]
	C --> D["Concrete file system<br/>ext4 xfs tmpfs procfs nfs"]
	D --> E["Page cache and writeback"]
	E --> F["Block layer and I/O scheduler"]
	F --> G["Device controller<br/>SATA SAS NVMe RAID"]
	G --> H["Physical media<br/>HDD platters or NAND flash"]

2. Disk Structure Basics

Before understanding file systems, it helps to understand what they sit on top of.

2.1 Sectors, blocks, and clusters

These terms are often mixed together, but they refer to different layers.

Sectors

A sector is the basic addressable unit exposed by a storage device. Historically it was often 512 bytes. Modern disks commonly use 4 KiB physical sectors, though they may emulate 512-byte logical sectors for compatibility.

This matters because partial-sector updates are not a native physical operation. The device often has to read-modify-write internally.

Blocks

A file system block is the allocation and I/O unit chosen by the file system. Common Linux file systems often use 4 KiB blocks because that aligns well with memory pages.

The file system usually allocates storage in block-sized units, not arbitrary byte ranges.

Clusters

A cluster usually means a group of sectors used as a larger allocation unit. The term is common in FAT and NTFS discussions. Conceptually, it is similar to a file system allocation block.

2.2 Tracks, cylinders, platters

These are HDD concepts and still matter for intuition even though modern drives hide the real geometry behind logical block addressing.

A platter is a rigid physical disk coated with magnetic material; each side is a recording surface.
A track is a circular ring on a platter.
A sector is a subdivision of a track.
A cylinder is the set of tracks at the same radius across multiple platters.

In old textbooks, the OS and controller cared more directly about these physical details. In modern systems, they are largely abstracted away, but the performance consequences remain.

2.3 HDD access costs

For rotating media, access time is roughly:


\text{Access Time} \approx \text{Queueing Delay} + \text{Seek Time} + \text{Rotational Latency} + \text{Transfer Time}

The important intuition is this:

moving the head is expensive
waiting for the platter to rotate is expensive
actually transferring the bytes is often cheap by comparison

For example, on a 7200 RPM disk:

one full rotation takes about 8.33 ms
average rotational latency is about half a rotation, about 4.17 ms
seek time may be several milliseconds
transfer of a few kilobytes may be a small fraction of a millisecond

That is why random I/O on HDDs is dramatically slower than sequential I/O.
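
The arithmetic above can be reproduced in a few lines. This is a rough sketch, not a drive model: the 4 ms seek and 0.05 ms transfer figures are illustrative assumptions, and the function names are invented for this example.

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def avg_access_ms(rpm: float, seek_ms: float, transfer_ms: float,
                  queue_ms: float = 0.0) -> float:
    """Queueing + seek + average rotational latency + transfer."""
    # On average the target sector is half a rotation away.
    return queue_ms + seek_ms + rotation_ms(rpm) / 2 + transfer_ms


full = rotation_ms(7200)   # about 8.33 ms
half = full / 2            # about 4.17 ms

# Assumed figures for illustration only: 4 ms seek, 0.05 ms transfer.
random_read = avg_access_ms(7200, seek_ms=4.0, transfer_ms=0.05)

print(f"full rotation:        {full:.2f} ms")
print(f"avg rotational wait:  {half:.2f} ms")
print(f"one random read:      {random_read:.2f} ms")
print(f"random reads/second:  {1000 / random_read:.0f}")
```

Under these assumptions a random read costs several milliseconds, almost none of which is spent moving data. That puts a single spindle in the range of a hundred or so random reads per second.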

2.4 SSD behavior is different

SSDs remove seek time and rotational latency because there is no mechanical arm and no spinning platter. That changes the performance profile, but it does not make storage magically free.

SSDs still have:

controller queues
internal mapping overhead
erase-before-write constraints
garbage collection
wear leveling
tail-latency spikes under heavy write load

So the dominant costs shift from mechanics to controller behavior, queueing, flash management, and software stack overhead.

3. File System Layout on Disk

Every file system needs to answer the same basic on-disk questions:

where is the file system itself described
where are metadata records stored
where is actual file data stored
how is free space tracked
how can the system recover after a crash

Different file systems answer these differently, but a classic Unix-style layout is a good mental model.

3.1 Major on-disk components

Common structures include:

Boot block: space near the beginning used for boot-related code or reserved metadata
Superblock: global file system metadata such as block size, inode count, feature flags, UUID, and state
Inode table: persistent metadata records for files and directories
Data blocks: the actual file contents and directory contents
Free space metadata: bitmaps, lists, or extent trees describing unallocated space
Journal area: write-ahead log used for crash recovery in journaling file systems
Directory structures: name-to-inode mappings

3.2 ext4 conceptual organization

ext4 is not just one giant array of blocks. Conceptually, it organizes storage into block groups. Each group keeps related metadata and data somewhat localized.

This design helps reduce long seeks on HDDs and improves locality in general:

a file's inode can live near its data blocks
free-space metadata is distributed rather than fully centralized
directory data and child inodes can often be allocated near each other

An oversimplified ext4-style layout looks like this:

flowchart TB
	FS["Whole File System"] --> BB["Boot Block"]
	FS --> SB["Primary Superblock"]
	FS --> GDT["Group Descriptor Table"]
	FS --> J["Journal Area<br/>internal or external"]
	FS --> BG1["Block Group 0"]
	FS --> BG2["Block Group 1"]
	FS --> BG3["Block Group N"]

	subgraph Group["Typical Block Group"]
		direction TB
		SBK["Backup superblock<br/>in some groups"] --> BBM["Block bitmap"] --> IBM["Inode bitmap"] --> IT["Inode table"] --> DB["Data blocks / extents"]
	end

	BG1 -. same pattern .-> Group

3.3 Why this layout exists

If all metadata lived in one place and all file data in another, every operation would bounce back and forth between distant regions of the disk. Block groups reduce that cost.

This is a recurring OS design theme:

separate concerns logically
keep related data physically close when possible

3.4 Superblock intuition

The superblock is the file system's identity card and rulebook. It tells the kernel things like:

block size
total blocks and free blocks
inode counts
mount state
supported features such as journaling, extents, checksums, large files

If the superblock is lost or corrupted, the file system may become unmountable. That is why many file systems keep backup copies.

4. Inodes

If you remember one thing about Unix file systems, remember this:

the filename is not the file.

The durable object is the inode plus its data. The name is just a directory entry that points to that inode.

4.1 What an inode is

An inode is a metadata structure representing a filesystem object such as:

regular file
directory
symbolic link
device node
FIFO
socket

An inode stores metadata, not the human-readable filename.

Typical inode contents include:

file type
permissions and mode bits
owner UID and GID
file size
link count
timestamps such as atime, mtime, ctime, and sometimes birth time
pointers or extents describing where file data lives
flags and extended metadata

4.2 Why filenames are separate from inodes

Separating names from inode metadata enables several important behaviors:

multiple hard links can refer to the same inode
rename can update directory entries without moving file data
open file descriptors can continue to refer to an inode even after the name is removed

This is why Unix file systems feel flexible and why operations like mv within the same file system can be atomic metadata operations.

4.3 Inode number

Every inode has an inode number that is unique within a file system. On Linux you can see it with:

ls -li somefile
stat somefile

When a directory maps report.txt to inode 481002, that inode number is what the kernel uses to find the file's metadata.

4.4 Classic block pointers

In the classic Unix model, the inode contains direct references to data blocks plus a small tree of indirection for large files.

The standard interview picture is:

direct pointers
one single indirect pointer
one double indirect pointer
one triple indirect pointer

flowchart TB
	I["Inode"] --> D1["Direct block 1"]
	I --> D2["Direct block 2"]
	I --> D3["Direct block N"]
	I --> S["Single indirect"]
	I --> DD["Double indirect"]
	I --> TD["Triple indirect"]
	S --> S1["Data block"]
	S --> S2["Data block"]
	DD --> L1["Indirect block"]
	L1 --> L2["Data block"]
	TD --> T1["Double indirect layer"]
	T1 --> T2["Indirect block"]
	T2 --> T3["Data block"]

4.5 Direct pointers

Direct pointers are the fastest and simplest path. If the inode directly names the data blocks, the kernel can resolve file offset to block with minimal metadata traversal.

Small files are cheap in this design because they often fit entirely within direct blocks.

4.6 Indirect pointers

For larger files, the inode cannot hold every block address directly. So it stores pointers to blocks that themselves contain block addresses.

With 4 KiB blocks and 8-byte block pointers:

one indirect block can hold 4096 / 8 = 512 pointers
one double indirect pointer can reference 512^2 data blocks
one triple indirect pointer can reference 512^3 data blocks

That is how a small fixed-size inode can represent very large files.
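
To make the scale concrete, the sketch below adds up how many data blocks each pointer level can reach. The block and pointer sizes come from the text above; the count of 12 direct pointers is an assumption borrowed from classic Unix designs, not something this guide specifies.

```python
BLOCK_SIZE = 4096        # bytes, from the example above
POINTER_SIZE = 8         # bytes, from the example above
DIRECT_POINTERS = 12     # assumption: classic textbook value

PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE   # 512


def addressable_blocks() -> dict:
    """Data blocks reachable through each pointer level."""
    return {
        "direct": DIRECT_POINTERS,
        "single indirect": PTRS_PER_BLOCK,
        "double indirect": PTRS_PER_BLOCK ** 2,
        "triple indirect": PTRS_PER_BLOCK ** 3,
    }


blocks = addressable_blocks()
max_bytes = sum(blocks.values()) * BLOCK_SIZE

for level, count in blocks.items():
    print(f"{level:16} {count:>12} blocks  {count * BLOCK_SIZE:>16} bytes")
print(f"maximum file size: about {max_bytes / 2**40:.2f} TiB")
```

Almost all of the capacity comes from the triple indirect level, which is why that pointer exists even though most files never use it.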

4.7 Large file handling

As the file grows:

direct pointers fill first
then the single indirect block is allocated
then double indirect
then triple indirect

The tradeoff is obvious:

small files are efficient
very large files require more metadata lookups

This matters for random reads into huge files, because fetching a block may require traversing multiple layers of metadata unless cached.
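
The lookup cost can be sketched as a function from byte offset to pointer level. This is a simplified model that assumes 12 direct pointers, 4 KiB blocks, and 8-byte pointers; `pointer_level` is a hypothetical helper, not a kernel interface.

```python
def pointer_level(offset: int, block_size: int = 4096,
                  ptr_size: int = 8, direct: int = 12):
    """Return (level name, indirect blocks read) for a byte offset.

    The second value counts the indirect blocks that must be fetched
    before the data block itself can be located, assuming none of
    them is cached.
    """
    per_block = block_size // ptr_size
    block = offset // block_size
    levels = [
        ("direct", direct, 0),
        ("single indirect", per_block, 1),
        ("double indirect", per_block ** 2, 2),
        ("triple indirect", per_block ** 3, 3),
    ]
    for name, span, hops in levels:
        if block < span:
            return name, hops
        block -= span          # skip past the range this level covers
    raise ValueError("offset is beyond the maximum file size")


print(pointer_level(0))                 # first block of the file
print(pointer_level(100 * 1024 * 1024)) # 100 MiB into the file
print(pointer_level(10 * 1024 ** 3))    # 10 GiB into the file
```

In this model, a cold random read deep inside a multi-gigabyte file needs three extra metadata reads before the data block can be fetched.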

4.8 ext4 and extents

Modern file systems such as ext4 improve on the classic pointer-per-block model using extents.

An extent describes a contiguous range, for example:

logical file blocks 1000 through 1255
stored in physical blocks 880000 through 880255

That is much more compact than storing hundreds of separate block pointers when the file is mostly contiguous.

So the classic direct/indirect tree is still essential interview knowledge, but in real ext4 the common case is often an inode containing or pointing to an extent tree.
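
A minimal sketch of extent lookup follows. It models an extent list as a sorted Python list rather than ext4's real on-disk extent tree; the `Extent` type and `lookup` function are invented for illustration, and the first extent in the sample data is made up.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    logical_start: int    # first logical block covered
    physical_start: int   # physical block it maps to
    length: int           # number of contiguous blocks


def lookup(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Map a logical block to a physical block, or None for a hole.

    `extents` must be sorted by logical_start and must not overlap.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    e = extents[i]
    if logical_block < e.logical_start + e.length:
        return e.physical_start + (logical_block - e.logical_start)
    return None


extents = [
    Extent(0, 500000, 1000),     # hypothetical earlier extent
    Extent(1000, 880000, 256),   # the example from the text
]
print(lookup(extents, 1000))   # first block of the second extent
print(lookup(extents, 1255))   # last block of the second extent
print(lookup(extents, 1256))   # past the end: no mapping
```

Two records describe 1256 blocks here, where a pointer-per-block scheme would need 1256 pointers plus indirect blocks.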

5. Directories and Name Resolution

Directories are not magical containers. In Unix-like systems, a directory is a special file whose contents encode mappings from names to inode numbers.

5.1 Directory as a special file

A directory entry conceptually stores something like:

filename
inode number
sometimes entry type hints

That means the directory itself has an inode, occupies blocks, and is subject to permissions.

Large directories may use indexed structures. For example, ext4 uses hashed directory indexing for scalability.

5.2 Absolute vs relative paths

An absolute path starts from the root, such as /usr/bin/python3.
A relative path starts from the process current working directory, such as logs/app.log.

The starting point changes, but the resolution logic is the same.

5.3 Path traversal step by step

Suppose a process calls:

open("/var/log/app/current.log", O_RDONLY);

The kernel conceptually does this:

Start from the root directory inode because the path is absolute.
Look up var in that directory.
Confirm execute (search) permission on each directory being traversed.
Load the inode for var.
Look up log inside var.
Repeat for app.
Look up current.log.
Resolve symlinks if encountered, subject to limits.
Perform final permission checks.
Create an open file object and return a file descriptor.

The kernel tries to avoid repeating all of this work by caching directory entries and inodes in memory.
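
The component-by-component walk can be sketched with a toy in-memory directory table. This deliberately leaves out permissions, symlinks, `..`, mount points, and caching. The inode numbers are made up, except that 2 is used for the root on the assumption of an ext-style layout.

```python
ROOT_INO = 2  # assumption: ext-style root inode number

# Toy "disk": directory inode number -> {name: child inode number}.
DIRECTORIES = {
    2: {"var": 11},
    11: {"log": 12},
    12: {"app": 13},
    13: {"current.log": 912341},
}


def resolve(path: str, cwd_ino: int = ROOT_INO) -> int:
    """Walk a path one component at a time and return the final inode."""
    # Absolute paths start at the root, relative paths at the cwd.
    ino = ROOT_INO if path.startswith("/") else cwd_ino
    for component in path.split("/"):
        if component in ("", "."):
            continue
        entries = DIRECTORIES.get(ino)
        if entries is None:
            raise NotADirectoryError(component)
        if component not in entries:
            raise FileNotFoundError(component)
        ino = entries[component]
    return ino


print(resolve("/var/log/app/current.log"))
print(resolve("app/current.log", cwd_ino=12))   # relative to /var/log
```

Each loop iteration corresponds to one directory lookup. A real kernel short-circuits most of them through the dentry cache.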

flowchart LR
	A["Path string<br/>/var/log/app/current.log"] --> B["Start at root or cwd"]
	B --> C["Lookup next component<br/>in directory"]
	C --> D["Check permissions<br/>and follow symlink rules"]
	D --> E["Load or reuse inode and dentry"]
	E --> F["Repeat until final component"]
	F --> G["Create open file object<br/>return file descriptor"]

5.4 Hard links

A hard link is another directory entry pointing to the same inode.

ln original.txt alias.txt
ls -li original.txt alias.txt

Both names refer to the same inode. Neither is the "real" one. The file's data is reclaimed only when:

link count reaches zero
and no process still has the inode open

Hard links normally cannot span file systems, because inode numbers are meaningful only inside one file system.
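
The same behavior can be checked programmatically. The sketch below assumes a POSIX system and a temporary directory on a file system that supports hard links; the file names mirror the shell example above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")

    with open(original, "w") as f:
        f.write("hello\n")

    os.link(original, alias)              # second name, same inode

    a, b = os.stat(original), os.stat(alias)
    same_inode = (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
    link_count = a.st_nlink

    os.unlink(original)                   # remove one name only
    with open(alias) as f:
        surviving = f.read()              # data is still reachable
    link_count_after = os.stat(alias).st_nlink

print(same_inode, link_count, link_count_after, repr(surviving))
```

Removing `original.txt` only decrements the link count. The data stays reachable through `alias.txt` because the inode still has one name.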

5.5 Symbolic links

A symbolic link is a separate file whose contents are a path string.

ln -s /var/log/app/current.log current-link

The symlink has its own inode. Accessing it causes another lookup step. Symlinks can cross file system boundaries because they store names, not inode references.
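
A short sketch, assuming a POSIX system, shows that the link and its target are distinct objects and that the link stores only a path. The file names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "current.log")
    link = os.path.join(d, "current-link")

    with open(target, "w") as f:
        f.write("x")
    os.symlink(target, link)

    link_ino = os.lstat(link).st_ino      # inode of the symlink itself
    followed_ino = os.stat(link).st_ino   # stat() follows the link
    target_ino = os.stat(target).st_ino
    stored_path = os.readlink(link)       # the symlink's contents

    os.unlink(target)
    # The link still exists, but the path it stores no longer resolves.
    dangling = os.path.lexists(link) and not os.path.exists(link)

print(link_ino != target_ino, followed_ino == target_ino, dangling)
```

Unlike a hard link, the symlink does not keep the target alive: once the target's name is gone, the link dangles.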

5.6 rename is usually metadata work

When you rename a file within the same file system, the OS often just updates directory entries. The data blocks usually do not move.

That is why same-filesystem rename() is fast and why it is used for atomic replacement patterns.

Important production detail:

rename within the same mounted file system can be atomic
rename across file systems is not a simple metadata update: the rename() system call fails with EXDEV, and tools such as mv fall back to copy plus unlink, which is not atomic

5.7 unlink and deleted-but-open files

unlink() removes a directory entry. It does not necessarily free storage immediately.

If a process still has the file open:

the name disappears from the directory
the inode still exists in memory and on disk
the storage is reclaimed only after the last reference is gone

This is a classic Linux debugging case:

a service keeps a deleted log file open
rm appears to succeed
disk space does not return

Useful command:

lsof +L1

That finds open files with link count below one.
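
The deleted-but-open state is easy to reproduce. This sketch assumes a POSIX system, where an unlinked file that is still open reports a link count of zero.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"still here")

os.unlink(path)                       # remove the only name

name_gone = not os.path.exists(path)
links = os.fstat(fd).st_nlink         # inode has no names left

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)               # contents remain readable

os.close(fd)                          # last reference: space can be freed

print(name_gone, links, data)
```

Until the final `os.close`, the blocks stay allocated even though no path leads to them. That is the situation `lsof +L1` is meant to reveal.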

6. File Allocation Methods

How should a file system place file data on disk? This is a foundational design question.

6.1 Contiguous allocation

Store the file in one continuous run of blocks.

Strengths:

excellent sequential performance
simple block computation
minimal metadata overhead

Weaknesses:

hard for growing files
external fragmentation becomes a problem
finding large continuous free regions gets harder over time

This is conceptually ideal for reading but awkward for real workloads where files grow unpredictably.

6.2 Linked allocation

Each block points to the next block.

Strengths:

files can grow easily
no need for one large contiguous area

Weaknesses:

random access is poor
pointer corruption is dangerous
pointer overhead consumes space

FAT is the classic teaching example, though its pointer structure is centralized in a table rather than embedded directly inside data blocks.
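
A toy table in the spirit of FAT illustrates why random access is poor. It is a sketch only: the dictionary, the `END` marker, and the block numbers are invented and do not reflect the real FAT on-disk format.

```python
END = -1  # hypothetical end-of-chain marker

# Toy allocation table: block number -> next block in the same file.
fat = {4: 9, 9: 2, 2: END}


def chain(table: dict, start: int) -> list:
    """Follow the links from `start` and return every block in order."""
    blocks, seen = [], set()
    current = start
    while current != END:
        if current in seen:
            raise ValueError("corrupt table: cycle detected")
        seen.add(current)
        blocks.append(current)
        current = table[current]
    return blocks


def nth_block(table: dict, start: int, n: int) -> int:
    """Find the n-th block of a file; cost grows linearly with n."""
    current = start
    for _ in range(n):
        current = table[current]
        if current == END:
            raise IndexError("file is shorter than requested")
    return current


print(chain(fat, 4))
print(nth_block(fat, 4, 2))
```

Reaching block n requires following n links, and one corrupted entry can cut off or cross-link the rest of the file.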

6.3 Indexed allocation

Store block pointers in a separate index structure.

Strengths:

good random access
flexible growth
clean separation between metadata and data

Weaknesses:

extra metadata reads may be needed
index structures themselves consume space

Unix inode designs are a form of indexed allocation.

6.4 Extent-based allocation

Store ranges instead of single-block pointers.

Strengths:

compact metadata for large contiguous regions
better sequential locality
lower pointer overhead

Weaknesses:

fragmentation still matters when files grow in many places
allocation policy becomes more complex

Modern file systems such as ext4 and XFS use extent-based strategies because they are a practical compromise between performance and flexibility.

6.5 The real tradeoff

The design space is about balancing:

sequential performance
random access efficiency
metadata overhead
growth flexibility
fragmentation resistance

There is no single perfect method. The right answer depends on workload shape.

7. Free Space Management

If the file system knows where used blocks are, it also needs to know where free blocks are.

This is harder than it first appears because allocation policy directly affects performance.

7.1 Bitmaps

A bitmap uses one bit per block or inode to indicate whether it is free.

Strengths:

compact
easy to scan for contiguous runs
good fit for extent allocation

Weaknesses:

scanning can be expensive on large file systems if poorly optimized

Many modern file systems, including ext-family systems, use bitmap-based tracking.
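
The "easy to scan for contiguous runs" property can be sketched with a linear scan. The bitmap is modelled as a Python list of 0 (free) and 1 (used) for readability; real implementations work on packed bits and use word-at-a-time tricks.

```python
from typing import Optional, Sequence


def find_free_run(bitmap: Sequence[int], length: int) -> Optional[int]:
    """Return the start of the first run of `length` free blocks.

    bitmap[i] is 0 when block i is free and 1 when it is in use.
    Returns None if no run is long enough.
    """
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_len = 0               # run broken, start over
            continue
        if run_len == 0:
            run_start = i
        run_len += 1
        if run_len == length:
            return run_start
    return None


bitmap = [1, 0, 0, 1, 0, 0, 0, 1]
print(find_free_run(bitmap, 3))
print(find_free_run(bitmap, 4))
```

The scan is simple, but its cost is proportional to the bitmap size. That is the weakness noted above for very large file systems.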

7.2 Free lists

A free list stores free blocks as a linked list or chain.

Strengths:

simple conceptually

Weaknesses:

poor at finding large contiguous regions quickly
cache-unfriendly for large-scale allocation decisions

7.3 Grouping

Grouping stores the addresses of several free blocks together, often with one free block pointing to a batch of others.

This reduces traversal overhead compared with a single pointer chain.

7.4 Counting

Counting stores free space as runs, such as:

start block 90000
length 2048 blocks

This is efficient when free space tends to be contiguous. It fits naturally with extent-oriented allocators.
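
A minimal first-fit allocator over (start, length) runs shows how this representation is used. It is a sketch under simplifying assumptions: no merging of adjacent runs on free, no locality heuristics, and the small run at block 100 is made up.

```python
from typing import List, Optional, Tuple


def allocate(free_runs: List[Tuple[int, int]], n: int) -> Optional[int]:
    """First-fit: carve n contiguous blocks out of the first run that fits.

    free_runs is a list of (start_block, length) and is updated in place.
    Returns the first allocated block, or None if nothing is big enough.
    """
    for i, (start, length) in enumerate(free_runs):
        if length < n:
            continue
        if length == n:
            del free_runs[i]                       # run fully consumed
        else:
            free_runs[i] = (start + n, length - n)  # shrink from the front
        return start
    return None


runs = [(100, 4), (90000, 2048)]
print(allocate(runs, 100), runs)
print(allocate(runs, 4), runs)
print(allocate(runs, 5000), runs)
```

Finding space for a large contiguous request is a check against each run's length rather than a bit-by-bit scan.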

7.5 Allocation efficiency matters

Free space management is not just bookkeeping. It affects:

fragmentation
locality
metadata contention
future read and write performance

For example, ext4 tries to allocate blocks near the owning inode or directory when possible, because good placement today prevents performance pain tomorrow.

8. Journaling and Crash Consistency

This is one of the most important practical topics in operating systems.

8.1 The crash consistency problem

Many file operations require multiple physical updates.

Creating a new file might involve:

allocating a free inode
allocating data blocks
writing file data
updating the inode
updating the directory entry
updating free-space metadata

If the machine crashes in the middle, the disk may contain a partially applied update. That can leave the file system inconsistent.

8.2 Why partial writes are dangerous

The key issue is not just losing the last few bytes of a file. The bigger issue is breaking metadata invariants.

Examples:

an inode claims a block that the free-space bitmap still marks free
a directory entry points to an uninitialized inode
file size says 64 KB but only half the block mapping is valid

Without recovery logic, the whole file system may become corrupt.

8.3 Write-ahead logging

Journaling uses write-ahead logging.

The basic idea is:

write a description of the metadata changes to a journal
mark the journal transaction committed
later apply those changes to the main file system structures

After a crash, the kernel can replay committed journal entries and restore a consistent state.
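
Replay can be sketched with a toy journal. The record format here (`begin`, `update`, `commit` tuples) and the dictionary standing in for on-disk structures are invented for illustration; real journals log block images or logical records with checksums.

```python
def recover(journal: list, disk: dict) -> dict:
    """Apply only transactions that reached their commit record."""
    pending = {}
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "begin":
            pending[txid] = []
        elif kind == "update":
            _, _, key, value = record
            pending[txid].append((key, value))
        elif kind == "commit":
            # Committed: safe to apply to the home locations.
            for key, value in pending.pop(txid):
                disk[key] = value
    # Anything still in `pending` never committed and is discarded.
    return disk


journal = [
    ("begin", 1),
    ("update", 1, "inode 7", "allocated"),
    ("update", 1, "dir /a", "entry f -> inode 7"),
    ("commit", 1),
    ("begin", 2),
    ("update", 2, "inode 8", "allocated"),
    # crash happened here: transaction 2 has no commit record
]

disk = recover(journal, {})
print(disk)
```

Transaction 1 is applied as a unit and transaction 2 leaves no trace, so the structures never reflect a half-finished update.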

flowchart LR
	A["User operation<br/>create write rename"] --> B["Prepare metadata updates"]
	B --> C["Write journal transaction"]
	C --> D["Commit journal entry"]
	D --> E["Checkpoint updates<br/>to home locations"]
	E --> F["Clear or advance journal"]

8.4 Metadata journaling

In metadata journaling, the file system journals metadata changes but not necessarily the file data blocks themselves.

This protects structural consistency while avoiding the overhead of writing all file data twice.

8.5 Full data journaling

In full data journaling, both metadata and file data are written to the journal before reaching home locations.

Strengths:

strong crash guarantees

Weaknesses:

higher write amplification
lower write throughput

This mode is safer but more expensive.

8.6 Ordered journaling

In ordered journaling, metadata is journaled, and the file system ensures that dirty data blocks reach their home locations before the metadata commit that would expose them as valid.

This is a practical compromise and historically the default mode for ext3/ext4.

It avoids a particularly ugly failure mode where metadata says new data exists but the actual data blocks still contain garbage or old contents.

8.7 ext3/ext4-style intuition

In ext-style systems you will often hear about modes like:

data=journal
data=ordered
data=writeback

Interpret them as a spectrum:

stronger consistency usually means more writes and less throughput
weaker ordering often means better performance but more surprising crash behavior

8.8 fsck vs journaling

fsck and journaling solve related but different problems.

fsck:

scans and repairs the file system by checking invariants
can be very slow on large volumes
may need to reconstruct or discard damaged state

Journaling:

keeps a short recent log of intended updates
makes crash recovery much faster
usually restores metadata consistency quickly after unclean shutdown

Journaling does not mean application-level data is always safe. It mainly protects filesystem consistency.

8.9 Practical crash-safe file replacement

The robust pattern for replacing a config file is not:

open target
overwrite target
close target

The safer pattern is:

write a new temporary file
fsync() the temporary file
rename() it over the old file
fsync() the parent directory

That last step is often forgotten. Directory metadata also needs durability.
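
The four steps can be sketched as one helper. This assumes a POSIX system where a directory can be opened and passed to `fsync`; `atomic_write` is a hypothetical name, and the sketch does not preserve the old file's ownership or mode (the temporary file is created with mode 0600).

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))

    # 1. Write a temporary file in the SAME directory, so the later
    #    rename stays within one file system.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # 2. make the new contents durable
        os.replace(tmp, path)         # 3. atomic rename over the target
    except BaseException:
        try:
            os.unlink(tmp)            # clean up if the rename never happened
        except FileNotFoundError:
            pass
        raise

    # 4. Make the directory entry change durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as `atomic_write("/etc/myapp/app.conf", new_bytes)`, where the path is only an example. Creating the temporary file next to the target is deliberate: a temporary file in `/tmp` could live on a different file system, and the rename would then fail instead of being atomic.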

9. The Virtual File System (VFS)

The kernel needs one common interface that works across many filesystem types.

Applications should be able to call:

open
read
write
stat
rename

without caring whether the target lives on ext4, XFS, tmpfs, procfs, or a network file system.

That abstraction layer is the Virtual File System.

9.1 Why VFS exists

The VFS solves a very practical kernel engineering problem:

many file systems
one syscall API
shared kernel mechanisms for caching, pathname lookup, permissions, mounts, and open file descriptors

9.2 Key kernel objects on Linux

Linux uses several central structures:

superblock: one mounted file system instance
inode: one filesystem object
dentry: one directory-cache name component
file: one open file description, including current offset and flags

BSD-derived systems often talk about vnodes. The concept is similar: a filesystem-neutral kernel object representing a file-like node.
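
The difference between an inode and an open file description is observable from user space. The sketch below assumes a POSIX system: `os.dup` yields a second descriptor for the same open file description, while a second `os.open` creates a new description for the same inode.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abcdef")

dup_fd = os.dup(fd)                     # same open file description
other_fd = os.open(path, os.O_RDONLY)   # new description, same inode

os.lseek(fd, 2, os.SEEK_SET)            # move the offset through fd only

shared_offset = os.lseek(dup_fd, 0, os.SEEK_CUR)         # follows fd
independent_offset = os.lseek(other_fd, 0, os.SEEK_CUR)  # unaffected
same_inode = os.fstat(fd).st_ino == os.fstat(other_fd).st_ino

for descriptor in (fd, dup_fd, other_fd):
    os.close(descriptor)
os.unlink(path)

print(shared_offset, independent_offset, same_inode)
```

The offset lives in the open file description, not in the inode and not in the descriptor number. That is why duplicated descriptors move together while independently opened ones do not.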

9.3 open path through VFS

When a process calls open():

syscall enters kernel
VFS parses the pathname
VFS walks dentries and mount points
target filesystem provides lookup operations
permissions are checked
inode is resolved
a struct file is created
a file descriptor is installed in the process's file descriptor table

flowchart LR
	A["open /path/file"] --> B["VFS pathname walk"]
	B --> C["Dentry cache lookup"]
	C --> D["Filesystem-specific lookup<br/>ext4 xfs nfs tmpfs"]
	D --> E["Resolve inode and permissions"]
	E --> F["Create open file object"]
	F --> G["Return file descriptor"]

9.4 read and write through VFS

The VFS also normalizes read and write behavior:

read() usually goes through page cache first
write() usually updates page cache first unless special flags are used
filesystem-specific code handles block mapping, journaling, and writeback details

9.5 Why multiple filesystem types fit the same model

The VFS lets very different backends look file-like:

ext4: general-purpose journaling disk filesystem
xfs: high-performance extent-based filesystem
tmpfs: RAM-backed filesystem whose pages can be swapped out under memory pressure
procfs: synthetic kernel information exposed as files
nfs: file operations forwarded over the network

That is one of the most powerful Unix design patterns: treat many resources as files, but keep the kernel abstraction flexible enough that the implementation can vary radically underneath.

10. Page Cache and Buffer Cache

This section matters a lot in real systems.

When developers say "the app wrote the file," what often really happened is "the kernel copied data into page cache and promised to flush it later."

10.1 Why caching exists

Disk access is vastly slower than RAM access. Even SSD access is still orders of magnitude slower than CPU caches and main memory.

So the kernel caches file data in memory to:

avoid repeated device reads
coalesce writes
allow read-ahead and write-behind
reduce syscall-visible latency

10.2 Read path

For a typical buffered read():

process calls read(fd, buf, n)
kernel checks whether the needed file pages are already in page cache
on a cache hit, data is copied from cache to user buffer
on a cache miss, the filesystem maps file offsets to disk blocks
block I/O is issued
device DMA fills memory pages
pages enter page cache
data is copied to user space

The kernel may also trigger readahead if it detects sequential access.

10.3 Write path

For a typical buffered write():

process calls write(fd, buf, n)
kernel copies data into page cache pages for that file
those pages are marked dirty
write() may return before the data is durable on storage
background writeback threads later flush dirty pages to storage
journaling and barriers determine metadata ordering and durability behavior

This is why writes can look fast until the system is forced to flush.

flowchart TB
	R1["read()"] --> R2["Check page cache"]
	R2 -->|hit| R3["Copy to user buffer"]
	R2 -->|miss| R4["Map file offset to blocks"]
	R4 --> R5["Issue block I/O"]
	R5 --> R6["Fill page cache via DMA"]
	R6 --> R3

	W1["write()"] --> W2["Copy user data to page cache"]
	W2 --> W3["Mark pages dirty"]
	W3 --> W4["Background writeback or fsync"]
	W4 --> W5["Filesystem commit and device flush"]

10.4 Dirty pages and writeback

Dirty pages are cached file pages whose contents differ from what is currently durable on storage.

Linux allows dirty data to accumulate up to configured thresholds. When too much dirty data builds up:

background flushers start writeback
writers may be throttled
latency spikes can appear

Useful Linux signals:

grep -E 'Dirty|Writeback' /proc/meminfo
vmstat 1
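
For monitoring scripts, the same two counters can be pulled out of `/proc/meminfo` text. The parser below is a small sketch, and the sample input is made up to keep the example self-contained; on a Linux host the real text would come from reading `/proc/meminfo`.

```python
def dirty_writeback_kib(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback values (in kB) from meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])
    return result


# Made-up sample in the /proc/meminfo format.
sample = (
    "MemTotal:       16384000 kB\n"
    "Dirty:              1204 kB\n"
    "Writeback:             0 kB\n"
    "WritebackTmp:          0 kB\n"
)
print(dirty_writeback_kib(sample))

# On Linux:
# with open("/proc/meminfo") as f:
#     print(dirty_writeback_kib(f.read()))
```

Sampling these values once per second during a slow period is a quick way to see whether writers are producing dirty data faster than writeback can drain it.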

10.5 sync, fsync, fdatasync

These are often misunderstood.

sync() asks the kernel to flush dirty data system-wide
fsync(fd) asks for the file's data and required metadata to be durable
fdatasync(fd) is like fsync but may skip metadata that is not needed to read the data back, such as timestamps

In production, fsync is the operation that makes people discover how expensive durability really is.

10.6 Why fsync is expensive

fsync is not just "write these bytes." It may require:

flushing dirty file pages
writing journal records
waiting for metadata commit
issuing cache flush commands to the device
waiting for controller acknowledgment

On cloud block storage, there may also be hypervisor or network replication in the path.
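
A rough way to observe the cost is to time the same writes with and without `fsync`. This is an informal sketch, not a benchmark: the write count and size are arbitrary, and the result depends heavily on where the temporary directory lives (on a RAM-backed tmpfs the difference may be negligible).

```python
import os
import tempfile
import time


def timed_writes(count: int, durable: bool) -> float:
    """Write `count` 4 KiB chunks and return the elapsed seconds."""
    fd, path = tempfile.mkstemp()
    chunk = b"x" * 4096
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, chunk)
            if durable:
                os.fsync(fd)      # wait until the device reports durability
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


buffered = timed_writes(20, durable=False)
synced = timed_writes(20, durable=True)
print(f"buffered: {buffered * 1000:.2f} ms")
print(f"fsync each write: {synced * 1000:.2f} ms")
```

On a disk-backed file system the second figure is usually far larger, because every iteration waits on the steps listed above instead of returning once the data is in the page cache.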

10.7 Buffer cache vs page cache

Historically, Unix systems described a buffer cache that held raw disk blocks, including file system metadata, and a separate page cache for file data pages.

In modern Linux, the picture is mostly unified around the page cache, though buffer-head-like metadata structures may still exist internally for block mapping and bookkeeping.

The practical takeaway is not the historical naming. It is this:

cached file I/O and memory management are deeply intertwined
file data lives in memory pages that the kernel manages like other memory-backed objects

10.8 O_DIRECT and why databases care

Some workloads use O_DIRECT to reduce or bypass page cache involvement.

Reasons include:

avoiding double caching between the kernel and database buffer pool
more explicit control over write ordering and eviction

But O_DIRECT is not a universal win. It can reduce cache pollution in some systems and hurt performance badly in others.

11. mmap()

mmap() lets a file appear as part of a process address space.

This is where memory management and file systems directly meet.

11.1 File-backed memory

With mmap, the process does not call read() for every access. Instead, it accesses memory addresses. The kernel loads file-backed pages on demand.

11.2 Page faults drive loading

When the process first touches an unmapped file-backed page:

CPU raises a page fault
kernel sees the faulting address belongs to a file mapping
kernel finds the corresponding file offset
page cache is checked or filled
page table is updated
the process resumes

This is lazy loading backed by the same file/page cache machinery.

11.3 Shared vs private mappings

MAP_SHARED: writes can propagate back to the file and be visible to other mappers
MAP_PRIVATE: copy-on-write view; modifications affect private pages, not the file
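
The two behaviors can be compared with Python's `mmap` module. This sketch relies on the module's portable access modes: on Unix, `ACCESS_WRITE` corresponds to a shared writable mapping and `ACCESS_COPY` to a private copy-on-write mapping.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")

# Shared mapping: stores reach the page cache pages backing the file.
with mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE) as shared:
    shared[0:5] = b"HELLO"
    shared.flush()
with open(path, "rb") as f:
    after_shared = f.read()

# Private mapping: stores hit copy-on-write pages owned by this process.
with mmap.mmap(fd, 0, access=mmap.ACCESS_COPY) as private:
    private[0:5] = b"jello"
    private_view = private[:]
with open(path, "rb") as f:
    after_private = f.read()

os.close(fd)
os.unlink(path)

print(after_shared, private_view, after_private)
```

The shared store changes what later readers of the file see. The private store is visible only through that mapping and disappears when the mapping is closed.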

11.4 Performance implications

mmap can be powerful because it:

avoids explicit copy loops in some cases
integrates with demand paging
lets the kernel manage readahead and caching naturally

But it also has costs:

page fault overhead
tricky error handling semantics
harder control of writeback timing
possible major faults under memory pressure

11.5 Database usage

Some systems, such as LMDB, lean heavily on mmap. Others, such as PostgreSQL, prefer explicit buffered I/O and their own buffer management for tighter control.

The key question is not whether mmap is good or bad. It is whether you want the kernel's paging policy or the application's own caching policy to dominate.

12. File System Performance and Debugging

Storage performance problems are rarely just "disk is slow." You need to reason layer by layer.

12.1 Sequential vs random I/O

Sequential I/O is friendly because:

readahead works well
extent locality helps
HDDs avoid repeated seeks
SSDs can stream through controller pipelines efficiently

Random I/O is harder because:

HDDs pay seek and rotation repeatedly
metadata lookups are less cache-friendly
SSDs may still suffer queueing and mapping overhead

12.2 The small files problem

Small files are deceptively expensive.

A 200-byte file may consume:

one inode
one directory entry
one or more data blocks depending on layout
journaling traffic for metadata
many cache and lookup operations relative to useful payload

This is why systems with millions of tiny files can become metadata-bound rather than bandwidth-bound.

12.3 Metadata bottlenecks

Not all filesystem work is data transfer. Sometimes the bottleneck is:

inode lookup
path traversal
directory locking
journal commit throughput
inode or dentry cache churn

If a workload creates and deletes huge numbers of files, metadata can dominate the total cost.

12.4 Fragmentation

Fragmentation hurts locality.

HDDs suffer more because the head moves between scattered regions
SSDs suffer less from location changes, but fragmented metadata and smaller I/Os can still reduce efficiency

Useful Linux tool:

filefrag -v bigfile

12.5 Caching effects

Two runs of the same program can have totally different performance depending on cache state.

Questions to ask:

was the data already in page cache
did directory entries come from dentry cache
did readahead trigger
were writes absorbed in cache and delayed

12.6 fsync cost and sync storms

If many threads or services call fsync frequently, the system can enter a pattern of repeated forced flushes and journal commits.

Symptoms include:

low average throughput
very high p99 latency
bursts of stalled writers

This is common in:

databases
durable message queues
log-heavy services
applications doing crash-safe file replacement incorrectly or too often

12.7 Write amplification

One logical write can become many physical writes because of:

journaling
metadata updates
SSD erase block behavior
RAID parity updates
copy-on-write designs in some file systems or storage layers

This is why small synchronous writes can perform much worse than application developers expect.

12.8 Practical Linux debugging intuition

Useful commands when debugging storage behavior:

iostat -x 1
pidstat -d 1
vmstat 1
df -h
df -i
strace -tt -T -e openat,read,write,fsync your_program

How to interpret them:

high disk utilization with low throughput often suggests small random I/O or a sync-heavy workload
low disk utilization with slow app I/O may suggest lock contention, page faults, throttling, or network-backed storage delay
inode exhaustion from df -i can break systems even when df -h shows free space
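
The inode check can also be done programmatically. This is a minimal sketch using os.statvfs, which is available on Unix-like systems only; the helper name inode_usage is an assumption made for illustration.

```python
import os

def inode_usage(path="/"):
    """Return (total_inodes, free_inodes, used_fraction) for the file system holding path."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    # Some file systems allocate inodes dynamically and report 0 total inodes.
    used_fraction = (total - free) / total if total else None
    return total, free, used_fraction

total, free, used = inode_usage("/")
print(f"inodes: total={total} free={free} used_fraction={used}")
```

A used fraction close to 1.0 means new files cannot be created even if plenty of data blocks remain free.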

13. Modern Storage: SSD Internals

To understand modern I/O behavior, you need a rough model of how flash works.

13.1 Flash pages and erase blocks

NAND flash is not overwritten in place the way applications imagine files are.

Typical properties:

reads happen at page granularity
writes happen at page granularity
erases happen at much larger erase-block granularity

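A page that has already been written cannot be overwritten in place. It has to be erased first, and erases only happen a whole erase block at a time. The controller therefore writes updates to fresh pages, remaps the logical address, and eventually reclaims the stale pages.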

13.2 Flash Translation Layer (FTL)

The SSD exposes logical block addresses, but internally it maps them to physical flash locations using the Flash Translation Layer.

The FTL handles:

logical-to-physical remapping
garbage collection
wear leveling
bad block management

This means the device itself is already doing sophisticated scheduling and mapping behind the OS's back.
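
To make the remapping idea concrete, here is a deliberately tiny toy model. It is an assumption-laden sketch, not how any real controller works: there is no garbage collection, no wear leveling, and no erase-block grouping. It shows only that overwriting the same logical block lands in a new physical page and leaves a stale page behind.

```python
class ToyFTL:
    """Toy flash translation layer: out-of-place writes plus a logical-to-physical map."""

    def __init__(self, physical_pages):
        self.l2p = {}                            # logical block address -> physical page
        self.state = ["free"] * physical_pages   # "free", "valid", or "stale"
        self.next_free = 0

    def write(self, lba):
        """Place new data for lba in a fresh page and invalidate the old copy."""
        if self.next_free >= len(self.state):
            raise RuntimeError("no free pages; a real FTL would garbage-collect here")
        old = self.l2p.get(lba)
        if old is not None:
            self.state[old] = "stale"            # old copy waits for a block erase
        self.state[self.next_free] = "valid"
        self.l2p[lba] = self.next_free
        self.next_free += 1

    def stale_pages(self):
        return self.state.count("stale")


ftl = ToyFTL(physical_pages=8)
for _ in range(3):
    ftl.write(lba=7)          # the host "overwrites" the same logical block three times
print(ftl.l2p[7], ftl.stale_pages())  # 2 2
```

The host sees one block at address 7. The device has consumed three physical pages, two of which are now garbage waiting to be reclaimed.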

13.3 Wear leveling

Flash cells wear out after repeated program/erase cycles. Wear leveling spreads writes across the device to avoid burning out hot regions.

From the OS perspective, that means logical locality is not the same as physical locality.

13.4 TRIM

When the OS deletes blocks, the SSD may not know those pages are no longer needed unless the OS sends discard or TRIM information.

TRIM helps the controller:

reclaim invalid pages earlier
reduce garbage collection pressure
improve sustained write performance

13.5 Why SSD scheduling differs from HDD scheduling

Classical disk scheduling optimizes head movement on rotating disks.

SSDs do not need seek optimization, but they still care about:

queue depth
request merging
read/write balance
latency control under heavy writeback
internal parallelism across channels and dies

So the scheduling problem changes rather than disappearing.

14. Why Disk Scheduling Exists

Disk scheduling exists because storage requests arrive faster and in a worse order than the device can serve them efficiently.

14.1 The raw problem on HDDs

If ten processes ask for blocks in random order, a naive scheduler may send the disk head zig-zagging across the platter constantly.

That destroys throughput and increases latency.

So the OS reorders requests to improve one or more of:

total throughput
average latency
tail latency
fairness
starvation resistance

14.2 Throughput vs fairness

The scheduler is not only trying to make the disk fast. It is deciding whose requests wait.

A policy that always serves the closest request may maximize seek efficiency but starve requests far away.

That is why scheduling is always a tradeoff between mechanical efficiency and fairness.

15. Disk Access Time Components

Understanding disk scheduling starts with understanding what contributes to request latency.

15.1 Seek time

The time required to move the disk arm to the correct track.

On HDDs this can be one of the dominant costs for random I/O.

15.2 Rotational latency

Once the head reaches the right track, the platter still has to rotate until the desired sector passes under the head.

Average rotational latency is about half a rotation.

15.3 Transfer time

Once positioned correctly, the actual data transfer begins. For small requests, transfer time may be much smaller than seek plus rotation.
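
The three mechanical components add up in a simple way. The sketch below estimates one random HDD access. The input numbers (8 ms seek, 7200 RPM, 150 MB/s) are illustrative assumptions, not measurements of any particular drive.

```python
def hdd_access_time_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s):
    """Estimate one random HDD access: seek + average rotational latency + transfer."""
    rotation_ms = 60_000 / rpm                  # time for one full revolution
    rotational_latency_ms = rotation_ms / 2     # on average, wait half a revolution
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

# Assumed example: 8 ms seek, 7200 RPM, 4 KiB request, 150 MB/s sustained transfer.
print(round(hdd_access_time_ms(8.0, 7200, 4096, 150), 2))  # 12.19
```

With these numbers the transfer itself takes under 0.03 ms, while seek plus rotation take about 12 ms, which is why small random reads on HDDs are dominated by positioning.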

15.4 Queueing delay

This is the time the request waits before the device starts serving it.

Under heavy load, queueing delay can dominate everything else, even on SSDs.

15.5 What dominates in practice

On HDD random I/O, seek and rotational latency dominate.
On HDD sequential I/O, transfer time becomes more important.
On SSDs, controller behavior and queueing are often more important than media access time.
On cloud block devices, network and virtualization delay may dominate.

16. Classical Disk Scheduling Algorithms

These algorithms are taught because they build intuition about queue management and fairness, even though modern devices often do additional reordering internally.

Assume current head position is 53 and pending requests are:

98, 183, 37, 122, 14, 124, 65, 67

16.1 FCFS

First-Come, First-Served serves requests in arrival order.

Order here:

53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67

Why it exists:

simple
fair in arrival order
no starvation from reordering

Why it performs poorly on HDDs:

terrible seek behavior under random workloads
high average head movement
convoy effects where unlucky order hurts everyone

FCFS is a good baseline for fairness, not for mechanical efficiency.
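
The cost of the FCFS order above can be checked with a few lines. The helper name total_head_movement is an illustrative assumption; it just sums track distances.

```python
def total_head_movement(start, order):
    """Sum the absolute track distance the head travels visiting order in sequence."""
    moved, position = 0, start
    for track in order:
        moved += abs(track - position)
        position = track
    return moved

requests = [98, 183, 37, 122, 14, 124, 65, 67]   # arrival order from the example
fcfs_order = list(requests)                       # FCFS never reorders
print(total_head_movement(53, fcfs_order))        # 640
```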

16.2 SSTF

Shortest Seek Time First serves the request closest to the current head position.

Order here:

53 -> 65 -> 67 -> 37 -> 14 -> 98 -> 122 -> 124 -> 183

Why it is attractive:

reduces immediate seek distance
often improves average throughput over FCFS

Its main problem:

starvation

If requests keep arriving near the current head, distant requests may wait a very long time.

This is the storage equivalent of a scheduler that always favors the most convenient work item.
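
A minimal sketch of SSTF on the example queue, assuming all requests are already pending and no new ones arrive. The function name sstf_order is illustrative.

```python
def sstf_order(start, requests):
    """Repeatedly pick the pending request closest to the current head position."""
    pending, position, order = list(requests), start, []
    while pending:
        nearest = min(pending, key=lambda track: abs(track - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order

order = sstf_order(53, [98, 183, 37, 122, 14, 124, 65, 67])
print(order)  # [65, 67, 37, 14, 98, 122, 124, 183]

movement = sum(abs(b - a) for a, b in zip([53] + order, order))
print(movement)  # 236
```

On this queue SSTF moves the head 236 tracks. The starvation risk shows up only when new nearby requests keep arriving, which this static sketch does not model.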

16.3 SCAN

SCAN, often called the elevator algorithm, moves in one direction servicing requests until it reaches the end, then reverses.

If the head moves upward first:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> 37 -> 14

Why it helps:

avoids zig-zagging
provides better fairness than SSTF
gives more predictable wait times

flowchart LR
	A["Low tracks"] --> B["Head moves upward<br/>serving requests on the way"] --> C["Reach high end"] --> D["Reverse direction"] --> E["Serve remaining lower requests"]
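
The SCAN order above can be reproduced with a short sketch. The example does not state the disk size, so the highest track number of 199 used here is an assumption, and scan_order is an illustrative name.

```python
def scan_order(start, requests, max_track):
    """SCAN sweeping upward first: returns (service order, total head movement)."""
    upward = sorted(t for t in requests if t >= start)
    downward = sorted((t for t in requests if t < start), reverse=True)
    order = upward + downward
    movement = max_track - start                  # sweep up to the physical end
    if downward:
        movement += max_track - downward[-1]      # reverse down to the lowest request
    return order, movement

order, movement = scan_order(53, [98, 183, 37, 122, 14, 124, 65, 67], max_track=199)
print(order)     # [65, 67, 98, 122, 124, 183, 37, 14]
print(movement)  # 331
```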

16.4 C-SCAN

Circular SCAN moves in only one servicing direction.

If it services upward:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> jump to start -> 14 -> 37

Why it exists:

more uniform waiting time than SCAN
tracks are treated as a circular list, so the lowest track logically follows the highest

The jump back is not free physically, but no requests are serviced during that return.

flowchart LR
	A["Low tracks"] --> B["Move upward<br/>serve requests"] --> C["Reach high end"] --> D["Jump to low end<br/>without servicing"] --> E["Resume upward scan"]
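
A sketch of C-SCAN on the same queue. The track range 0 to 199 is an assumption, since the example gives no disk size, and cscan_order is an illustrative name. This version counts the return sweep as head travel; some textbooks leave it out.

```python
def cscan_order(start, requests, min_track, max_track):
    """C-SCAN servicing upward only: returns (service order, total head movement)."""
    upward = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    order = upward + wrapped
    movement = max_track - start                  # sweep up to the physical end
    if wrapped:
        movement += max_track - min_track         # return sweep, nothing serviced
        movement += wrapped[-1] - min_track       # resume upward to the last request
    return order, movement

order, movement = cscan_order(53, [98, 183, 37, 122, 14, 124, 65, 67], 0, 199)
print(order)     # [65, 67, 98, 122, 124, 183, 14, 37]
print(movement)  # 382
```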

16.5 LOOK

LOOK is like SCAN, but it does not go all the way to the physical end if no request is there.

Order here:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 37 -> 14

It saves unnecessary movement compared with SCAN.

16.6 C-LOOK

C-LOOK is like C-SCAN, but it jumps from the highest pending request to the lowest pending request rather than going to the absolute disk end.

Order here:

53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 14 -> 37

This keeps the predictable one-direction service pattern while avoiding some extra movement.
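
LOOK and C-LOOK need no disk size, because the head only travels between pending requests. The sketch below reproduces both orders from the example; the function names are illustrative assumptions.

```python
def look_order(start, requests):
    """LOOK: sweep upward to the highest request, then reverse."""
    upward = sorted(t for t in requests if t >= start)
    downward = sorted((t for t in requests if t < start), reverse=True)
    return upward + downward

def clook_order(start, requests):
    """C-LOOK: sweep upward, jump to the lowest request, sweep upward again."""
    upward = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    return upward + wrapped

def head_movement(start, order):
    """Total track distance travelled, counting the C-LOOK jump as movement."""
    return sum(abs(b - a) for a, b in zip([start] + order, order))

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(look_order(53, queue), head_movement(53, look_order(53, queue)))
# [65, 67, 98, 122, 124, 183, 37, 14] 299
print(clook_order(53, queue), head_movement(53, clook_order(53, queue)))
# [65, 67, 98, 122, 124, 183, 14, 37] 322
```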

16.7 How to think about them in interviews

Do not memorize only names. Memorize the tradeoffs:

FCFS: simple and fair, poor throughput
SSTF: good average seek, starvation risk
SCAN: good balance of throughput and fairness
C-SCAN: more uniform wait time
LOOK and C-LOOK: like SCAN family but avoid unnecessary travel

17. NCQ and Modern Scheduling

Classical scheduling was designed for an OS directly shaping a single disk queue. Modern storage is more layered.

17.1 Native Command Queuing

With NCQ on SATA drives and deeper queues on NVMe, the device controller can accept multiple outstanding requests and reorder them internally.

That means:

the OS is no longer the only scheduler
the controller can optimize for media characteristics the OS cannot see directly
device firmware can exploit internal parallelism and locality

17.2 Why classical algorithms matter less today

They matter less as exact production policies because:

drives reorder internally
SSDs do not need head-movement minimization
NVMe devices have many queues and much lower latency

But the mental models still matter because real systems still need request management for:

fairness
read versus write prioritization
latency bounds
cgroup isolation
merge behavior
saturated device queues

17.3 Multi-queue block layer

Modern Linux uses a multi-queue block layer (blk-mq) so many CPUs can submit I/O without one global lock becoming the bottleneck.

This matters especially for NVMe, where the hardware itself supports many submission and completion queues.

18. Linux I/O Scheduler Concepts

For interviews and production work, it is more useful to know why each scheduler exists than to memorize every historical default.

18.1 noop

noop does very little beyond simple merging and FIFO-like dispatch. In the multi-queue block layer, the closest equivalent is called none.

It exists for cases where lower layers already do the smart work, such as:

hardware RAID controllers
SSDs with strong internal scheduling
virtualized environments where the guest should avoid over-optimizing unknown physical layout

18.2 deadline

deadline tries to prevent starvation by giving requests deadlines while still allowing some reordering for efficiency.

Why it exists:

SSTF-like behavior can starve distant requests
databases care about bounded latency, not only throughput

18.3 CFQ

CFQ means Completely Fair Queuing. It tried to give processes fair access to the disk by assigning time slices and queues.

CFQ belonged to the older single-queue block layer and was removed along with it in Linux 5.0. It remains historically important because it focused on fairness and interactive responsiveness.

18.4 mq-deadline

mq-deadline adapts the deadline idea to the multi-queue block layer.

Think of it as a modern version of deadline designed for current Linux storage stacks.

18.5 BFQ

BFQ means Budget Fair Queueing. It focuses on fairness in terms of bandwidth and can help interactive workloads stay responsive under I/O load.

This is useful when one noisy workload would otherwise dominate the device.

18.6 kyber

kyber is designed for fast multi-queue devices, focusing on latency control by managing queue depth for different request classes.

18.7 The real lesson

Schedulers exist because storage workloads differ:

some want maximum throughput
some want predictable latency
some want fairness across tenants or processes
some run on devices that already reorder aggressively

Useful Linux check:

cat /sys/block/<device>/queue/scheduler
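
The file lists the available schedulers on one line, with the active one in square brackets. A small sketch for reading that format follows. The helper name parse_scheduler_line and the sample line are illustrative assumptions; the real contents depend on the kernel and device.

```python
def parse_scheduler_line(text):
    """Split a queue/scheduler line into (active, available) scheduler names."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]       # the bracketed entry is the active scheduler
            active = name
        available.append(name)
    return active, available

print(parse_scheduler_line("[mq-deadline] kyber bfq none"))
# ('mq-deadline', ['mq-deadline', 'kyber', 'bfq', 'none'])
```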

19. RAID and Scheduling Interaction

RAID changes both performance and failure behavior. It also changes how scheduling should be interpreted, because one logical request may map to multiple physical operations.

19.1 RAID 0

RAID 0 stripes data across disks.

Effects:

higher throughput
potentially more parallel I/O
no redundancy

Scheduling implication:

sequential I/O can scale well across members
random I/O may gain parallelism, but a single member failure loses the array

19.2 RAID 1

RAID 1 mirrors data.

Effects:

reads may be served from either mirror
writes must update all mirrors

Scheduling implication:

read scheduling can exploit multiple copies
write latency is influenced by the slower mirror path

19.3 RAID 5

RAID 5 stripes data plus distributed parity.

Effects:

space-efficient redundancy
painful small-write behavior due to parity updates

For small writes, the controller may need read-modify-write cycles. That increases latency and write amplification.
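
The read-modify-write cycle follows from how XOR parity works: the new parity is the old parity XOR the old data XOR the new data. The sketch below checks that identity on a made-up three-chunk stripe. The chunk values and function names are illustrative assumptions.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    """Parity update for one chunk: new_parity = old_parity ^ old_data ^ new_data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)

# One stripe: three data chunks plus their parity.
d0, d1, d2 = b"\x0f\x0f", b"\xf0\x01", b"\x33\x10"
parity = xor_bytes(xor_bytes(d0, d1), d2)

# Overwrite only d1. The array must read old d1 and old parity first.
new_d1 = b"\xaa\x55"
new_parity = raid5_small_write(d1, parity, new_d1)

# Same result as recomputing parity from the full stripe.
print(new_parity == xor_bytes(xor_bytes(d0, new_d1), d2))  # True
```

One logical write therefore costs two reads (old data, old parity) and two writes (new data, new parity), which is the small-write penalty.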

19.4 RAID 10

RAID 10 combines mirroring and striping.

Effects:

strong performance
better failure tolerance than RAID 0
better random-write behavior than RAID 5

This is why RAID 10 is often preferred for database workloads that care about both latency and resilience.

19.5 Rebuilds change everything

During rebuild:

background recovery I/O competes with foreground workload
queueing increases
latency spikes are common

In production, an array that benchmarks well while healthy may behave very differently while degraded or rebuilding.

20. Real-World Production Relevance

This is where the theory becomes operationally useful.

20.1 Database latency spikes

A database write path often depends on:

page cache or direct I/O strategy
journal or WAL fsync frequency
device cache flush latency
controller queue depth
RAID behavior
cloud storage variance

If p99 commit latency spikes every few seconds, think about:

journal commits
dirty page throttling
device garbage collection
burst-credit exhaustion on cloud volumes
noisy neighbors on shared infrastructure

20.2 Log-heavy systems

Append-only logging sounds simple, but under durability requirements it can become sync-heavy and metadata-heavy.

Questions to ask:

is every append followed by fsync
is log rotation causing rename and reopen churn
are deleted log files still open
is the device saturated by small sync writes

20.3 fsync bottlenecks

If an application insists on durability after every request, then your performance ceiling is often set by the storage system's durable commit latency, not by CPU.

That is why group commit and batching matter so much in databases and messaging systems.
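
A minimal sketch of the batching idea, assuming a single writer and a fixed batch size. The class name GroupCommitLog and its fields are hypothetical; real databases add timers, concurrent waiters, and error handling.

```python
import os
import tempfile

class GroupCommitLog:
    """Append records to a file, paying one fsync per batch instead of per record."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = 0
        self.fsync_calls = 0

    def append(self, record):
        os.write(self.fd, record + b"\n")   # lands in page cache, not yet durable
        self.pending += 1
        if self.pending >= self.batch_size:
            self.commit()

    def commit(self):
        """Make every record appended so far durable with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.pending = 0

    def close(self):
        self.commit()
        os.close(self.fd)


with tempfile.TemporaryDirectory() as d:
    log = GroupCommitLog(os.path.join(d, "wal.log"), batch_size=10)
    for i in range(100):
        log.append(b"record %d" % i)
    log.close()
    print(log.fsync_calls)  # 10
```

The tradeoff is that a record must not be acknowledged as durable until the commit covering it has returned, so batching trades a little per-request latency for much higher throughput.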

20.4 Noisy neighbors

In shared environments, your latency can rise because someone else is consuming queue depth, bandwidth, or controller time.

This may happen on:

shared SSD arrays
virtualized hosts
cloud block devices
multi-tenant storage services

The file system inside the guest may look healthy while the real bottleneck is outside the guest.

20.5 Cloud block storage behavior

Many cloud "disks" are not local disks. They may be network-attached block devices with caching, replication, and throttling policies outside your VM.

That means:

latency can have network-like variance
throughput may depend on provisioned IOPS or burst budgets
fsync may imply remote durability work

20.6 SSD misconceptions

Common bad assumptions:

"SSDs make scheduling irrelevant"
"If average latency is low, tail latency must also be low"
"delete immediately frees space inside the SSD"
"write completion means the data is definitely durable on flash"

All of these can be false depending on controller behavior, queueing, caches, and durability settings.

20.7 Practical debugging checklist

When a system is "slow on disk," ask in this order:

Is the workload read-heavy, write-heavy, metadata-heavy, or sync-heavy?
Is it hitting page cache or actual storage?
Is latency caused by queueing, flushes, seek behavior, or cloud/network effects?
Is fragmentation or file layout hurting locality?
Is the bottleneck the file system, block layer, controller, device, RAID, or remote backend?
Are open deleted files, inode exhaustion, or journal pressure involved?

21. Interview Mental Models

If you need to explain these topics clearly in an interview, center your answers around a few strong mental models.

21.1 A file is name plus metadata plus data blocks

More precisely:

the directory maps name to inode
the inode stores metadata and block mapping
the data lives elsewhere

That one model explains hard links, rename, unlink, open deleted files, and path lookup.
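
The model is easy to verify from user space. This sketch assumes a file system that supports hard links; the file names and the function hard_link_demo are illustrative.

```python
import os
import tempfile

def hard_link_demo():
    """Show that two names can share one inode and that unlink removes only a name."""
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a.txt"), os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("payload")
        os.link(a, b)                        # second directory entry, same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        links_before = os.stat(a).st_nlink
        os.unlink(a)                         # drops one name; inode and data remain
        with open(b) as f:
            survived = f.read()
        links_after = os.stat(b).st_nlink
        return same_inode, links_before, survived, links_after

print(hard_link_demo())  # (True, 2, 'payload', 1)
```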

21.2 File systems are consistency machines

A file system is not just a storage format. It is a consistency mechanism that preserves invariants across crashes.

That is why journaling, ordering, barriers, and fsync exist.

21.3 Buffered write does not mean durable write

If you say this confidently in an interview and then explain page cache, dirty pages, and fsync, you are operating at the right depth.

21.4 Scheduling is about tradeoffs, not one best algorithm

The right algorithm depends on what you optimize:

throughput
fairness
latency
starvation resistance
device characteristics

21.5 Modern systems are layered

Application-visible I/O behavior is shaped by all of these:

VFS
concrete filesystem
page cache
block layer
scheduler
controller
device internals
virtualization or cloud storage

If you debug only one layer, you will miss real bottlenecks.

22. Final Takeaway

File systems and disk scheduling matter because persistence is not just about storing bytes. It is about building a reliable, performant illusion over unreliable timing, slow media, complex metadata, and failure-prone updates.

For interviews, the winning mental model is:

directories give names
inodes give identity and metadata
block mapping gives location
journaling gives recoverability
page cache gives speed
fsync gives durability semantics
scheduling gives controlled access to scarce storage time

For production systems, the winning habit is to ask which layer is actually responsible for the latency or correctness problem you are seeing.

That is the difference between knowing storage APIs and understanding operating systems.
