File Systems and Disk Scheduling in Operating Systems
File systems and disk scheduling are where operating systems stop being abstract software platforms and start dealing with stubborn physical reality.
As a software engineer, you usually work with a friendly model:
- open a file
- write bytes
- read them back later
- trust that paths name things reliably
- assume storage is slower than RAM, but still manageable
Underneath that model, the operating system is solving a much harder problem:
- storage devices expose blocks, not named files
- crashes can happen in the middle of updates
- persistence has to survive process death, kernel reboot, and power loss
- many processes want to read and write the same storage at once
- the hardware may be mechanical, flash-based, network-backed, or hidden behind a hypervisor
This guide is written for practical understanding rather than memorization. The goal is to build the kind of mental model that helps in interviews, debugging, performance tuning, database work, and infrastructure design.
1. Why File Systems Exist
At the hardware level, a disk or SSD does not naturally contain directories, filenames, permissions, or "the log file for service X". It exposes addressable storage locations. Historically these were sectors on a rotating disk. Today they may be logical block addresses backed by flash translation layers, RAID controllers, or cloud storage systems.
If the OS exposed raw storage directly to applications, every program would need to solve the same problems itself:
- where to place data
- how to find it later
- how to avoid overwriting other data
- how to reuse freed space
- how to handle crashes halfway through an update
- how to represent ownership, permissions, and timestamps
That would be chaos.
The file system exists to provide a durable logical namespace over raw block storage.
1.1 The abstraction gap
Think of RAM and disk as very different kinds of storage:
- RAM is fast, byte-addressable, and volatile.
- Disk is slow, block-oriented, and persistent.
Applications want a simple logical model:
- named files
- hierarchical directories
- append, overwrite, truncate, rename
- metadata such as owner and mode bits
The file system is the translation layer between that logical model and the physical storage layout.
1.2 Logical view vs physical storage
The logical view says:
- there is a file called /var/log/app.log
- it has permissions 0640
- it is owned by root:adm
- it has a size of 48 MB
The physical view says:
- a directory entry maps app.log to inode 912341
- inode 912341 stores metadata and block mapping information
- the file's contents live in a set of physical blocks or extents
- some data may still be dirty in page cache and not yet on media
That distinction is central to both interview answers and production debugging.
1.3 Naming and organization matter
A file system does more than store bytes. It also gives structure:
- directories let humans and programs organize data
- metadata enables permissions, accounting, and auditing
- links allow multiple names for the same underlying object
- mount points let multiple storage backends appear in one namespace
Without a file system, persistence would still exist, but it would look more like manual block management than application-friendly storage.
flowchart LR
A["Application<br/>open read write fsync"] --> B["System call layer"]
B --> C["VFS<br/>common file API"]
C --> D["Concrete file system<br/>ext4 xfs tmpfs procfs nfs"]
D --> E["Page cache and writeback"]
E --> F["Block layer and I/O scheduler"]
F --> G["Device controller<br/>SATA SAS NVMe RAID"]
G --> H["Physical media<br/>HDD platters or NAND flash"]
2. Disk Structure Basics
Before understanding file systems, it helps to understand what they sit on top of.
2.1 Sectors, blocks, and clusters
These terms are often mixed together, but they refer to different layers.
Sectors
A sector is the basic addressable unit exposed by a storage device. Historically it was often 512 bytes. Modern disks commonly use 4 KiB physical sectors, though they may emulate 512-byte logical sectors for compatibility.
This matters because partial-sector updates are not a native physical operation. The device often has to read-modify-write internally.
Blocks
A file system block is the allocation and I/O unit chosen by the file system. Common Linux file systems often use 4 KiB blocks because that aligns well with memory pages.
The file system usually allocates storage in block-sized units, not arbitrary byte ranges.
Clusters
A cluster usually means a group of sectors used as a larger allocation unit. The term is common in FAT and NTFS discussions. Conceptually, it is similar to a file system allocation block.
2.2 Tracks, cylinders, platters
These are HDD concepts and still matter for intuition even though modern drives hide the real geometry behind logical block addressing.
- A platter is a rigid disk coated with magnetic material; each of its two sides can serve as a recording surface.
- A track is a circular ring on a platter.
- A sector is a subdivision of a track.
- A cylinder is the set of tracks at the same radius across multiple platters.
In old textbooks, the OS and controller cared more directly about these physical details. In modern systems, they are largely abstracted away, but the performance consequences remain.
2.3 HDD access costs
For rotating media, access time is roughly:
\text{Access Time} \approx \text{Queueing Delay} + \text{Seek Time} + \text{Rotational Latency} + \text{Transfer Time}
The important intuition is this:
- moving the head is expensive
- waiting for the platter to rotate is expensive
- actually transferring the bytes is often cheap by comparison
For example, on a 7200 RPM disk:
- one full rotation takes about 8.33 ms
- average rotational latency is about half a rotation, about 4.17 ms
- seek time may be several milliseconds
- transfer of a few kilobytes may be a small fraction of a millisecond
That is why random I/O on HDDs is dramatically slower than sequential I/O.
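The arithmetic above can be checked with a short Python sketch. The function names are made up for illustration, and the 5 ms seek time and 150 MB/s transfer rate are assumed round numbers, not measurements of any particular drive.

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_latency_ms(rpm: float) -> float:
    """On average the target sector is half a rotation away."""
    return rotation_ms(rpm) / 2.0


def random_access_ms(rpm: float, seek_ms: float, nbytes: int, mb_per_s: float) -> float:
    """Seek + average rotational latency + transfer time (queueing ignored)."""
    transfer_ms = nbytes / (mb_per_s * 1_000_000) * 1000.0
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms


if __name__ == "__main__":
    print(f"rotation: {rotation_ms(7200):.2f} ms")
    print(f"avg rotational latency: {avg_rotational_latency_ms(7200):.2f} ms")
    # Assumed values: 5 ms seek, 4 KiB request, 150 MB/s sustained transfer.
    print(f"one random 4 KiB read: {random_access_ms(7200, 5.0, 4096, 150.0):.2f} ms")
```

With these assumed numbers the transfer term is a few hundredths of a millisecond, so nearly all of the cost of a random read is positioning, not data movement.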
2.4 SSD behavior is different
SSDs remove seek time and rotational latency because there is no mechanical arm and no spinning platter. That changes the performance profile, but it does not make storage magically free.
SSDs still have:
- controller queues
- internal mapping overhead
- erase-before-write constraints
- garbage collection
- wear leveling
- tail-latency spikes under heavy write load
So the dominant costs shift from mechanics to controller behavior, queueing, flash management, and software stack overhead.
3. File System Layout on Disk
Every file system needs to answer the same basic on-disk questions:
- where is the file system itself described
- where are metadata records stored
- where is actual file data stored
- how is free space tracked
- how can the system recover after a crash
Different file systems answer these differently, but a classic Unix-style layout is a good mental model.
3.1 Major on-disk components
Common structures include:
- Boot block: space near the beginning used for boot-related code or reserved metadata
- Superblock: global file system metadata such as block size, inode count, feature flags, UUID, and state
- Inode table: persistent metadata records for files and directories
- Data blocks: the actual file contents and directory contents
- Free space metadata: bitmaps, lists, or extent trees describing unallocated space
- Journal area: write-ahead log used for crash recovery in journaling file systems
- Directory structures: name-to-inode mappings
3.2 ext4 conceptual organization
ext4 is not just one giant array of blocks. Conceptually, it organizes storage into block groups. Each group keeps related metadata and data somewhat localized.
This design helps reduce long seeks on HDDs and improves locality in general:
- a file's inode can live near its data blocks
- free-space metadata is distributed rather than fully centralized
- directory data and child inodes can often be allocated near each other
An oversimplified ext4-style layout looks like this:
flowchart TB
FS["Whole File System"] --> BB["Boot Block"]
FS --> SB["Primary Superblock"]
FS --> GDT["Group Descriptor Table"]
FS --> J["Journal Area<br/>internal or external"]
FS --> BG1["Block Group 0"]
FS --> BG2["Block Group 1"]
FS --> BG3["Block Group N"]
subgraph Group["Typical Block Group"]
direction TB
SBK["Backup superblock<br/>in some groups"] --> BBM["Block bitmap"] --> IBM["Inode bitmap"] --> IT["Inode table"] --> DB["Data blocks / extents"]
end
BG1 -. same pattern .-> Group
3.3 Why this layout exists
If all metadata lived in one place and all file data in another, every operation would bounce back and forth between distant regions of the disk. Block groups reduce that cost.
This is a recurring OS design theme:
- separate concerns logically
- keep related data physically close when possible
3.4 Superblock intuition
The superblock is the file system's identity card and rulebook. It tells the kernel things like:
- block size
- total blocks and free blocks
- inode counts
- mount state
- supported features such as journaling, extents, checksums, large files
If the superblock is lost or corrupted, the file system may become unmountable. That is why many file systems keep backup copies.
4. Inodes
If you remember one thing about Unix file systems, remember this:
the filename is not the file.
The durable object is the inode plus its data. The name is just a directory entry that points to that inode.
4.1 What an inode is
An inode is a metadata structure representing a filesystem object such as:
- regular file
- directory
- symbolic link
- device node
- FIFO
- socket
An inode stores metadata, not the human-readable filename.
Typical inode contents include:
- file type
- permissions and mode bits
- owner UID and GID
- file size
- link count
- timestamps such as atime, mtime, ctime, and sometimes birth time
- pointers or extents describing where file data lives
- flags and extended metadata
4.2 Why filenames are separate from inodes
Separating names from inode metadata enables several important behaviors:
- multiple hard links can refer to the same inode
- rename can update directory entries without moving file data
- open file descriptors can continue to refer to an inode even after the name is removed
This is why Unix file systems feel flexible and why operations like mv within the same file system can be atomic metadata operations.
4.3 Inode number
Every inode has an inode number that is unique within a file system. On Linux you can see it with:
ls -li somefile
stat somefile
When a directory maps report.txt to inode 481002, that inode number is what the kernel uses to find the file's metadata.
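The same information is available programmatically. This is a minimal sketch using Python's os.stat on a POSIX system; the describe helper and the dictionary keys are illustrative names, not part of any standard API.

```python
import os
import tempfile


def describe(path: str) -> dict:
    """Return the inode-level identity of a path, as reported by stat()."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,    # inode number, unique within one file system
        "device": st.st_dev,   # identifies which file system the inode belongs to
        "links": st.st_nlink,  # number of directory entries pointing at the inode
        "size": st.st_size,    # file size in bytes
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        print(describe(tmp.name))
```

Note that none of the returned fields is the filename: the name lives in the directory, not in the inode.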
4.4 Classic block pointers
In the classic Unix model, the inode contains direct references to data blocks plus a small tree of indirection for large files.
The standard interview picture is:
- direct pointers
- one single indirect pointer
- one double indirect pointer
- one triple indirect pointer
flowchart TB
I["Inode"] --> D1["Direct block 1"]
I --> D2["Direct block 2"]
I --> D3["Direct block N"]
I --> S["Single indirect"]
I --> DD["Double indirect"]
I --> TD["Triple indirect"]
S --> S1["Data block"]
S --> S2["Data block"]
DD --> L1["Indirect block"]
L1 --> L2["Data block"]
TD --> T1["Double indirect layer"]
T1 --> T2["Indirect block"]
T2 --> T3["Data block"]
4.5 Direct pointers
Direct pointers are the fastest and simplest path. If the inode directly names the data blocks, the kernel can resolve file offset to block with minimal metadata traversal.
Small files are cheap in this design because they often fit entirely within direct blocks.
4.6 Indirect pointers
For larger files, the inode cannot hold every block address directly. So it stores pointers to blocks that themselves contain block addresses.
With 4 KiB blocks and 8-byte block pointers:
- one indirect block can hold 4096 / 8 = 512 pointers
- one double indirect pointer can reference 512^2 data blocks
- one triple indirect pointer can reference 512^3 data blocks
That is how a small fixed-size inode can represent very large files.
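To make the capacity concrete, here is a small calculation under the same 4 KiB block and 8-byte pointer figures. The count of 12 direct pointers is an assumption (a common textbook value); the text above does not fix it.

```python
BLOCK_SIZE = 4096     # bytes per block, as in the text
POINTER_SIZE = 8      # bytes per block pointer, as in the text
DIRECT_POINTERS = 12  # assumption: a common textbook value

PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 512


def max_file_blocks() -> int:
    """Data blocks reachable through direct, single, double and triple indirection."""
    return (
        DIRECT_POINTERS
        + PTRS_PER_BLOCK
        + PTRS_PER_BLOCK ** 2
        + PTRS_PER_BLOCK ** 3
    )


def max_file_bytes() -> int:
    return max_file_blocks() * BLOCK_SIZE


if __name__ == "__main__":
    print(f"pointers per indirect block: {PTRS_PER_BLOCK}")
    print(f"max data blocks: {max_file_blocks()}")
    print(f"max file size: {max_file_bytes() / 2**30:.2f} GiB")
```

The triple indirect level dominates: it alone contributes 512 GiB, while the direct pointers contribute only 48 KiB.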
4.7 Large file handling
As the file grows:
- direct pointers fill first
- then the single indirect block is allocated
- then double indirect
- then triple indirect
The tradeoff is obvious:
- small files are efficient
- very large files require more metadata lookups
This matters for random reads into huge files, because fetching a block may require traversing multiple layers of metadata unless cached.
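The lookup cost can be sketched as a function from logical block index to indirection level. This is a toy model, assuming 12 direct pointers and 512 pointers per indirect block; pointer_level is a hypothetical helper, not a kernel function.

```python
def pointer_level(block_index: int, direct: int = 12, ptrs_per_block: int = 512):
    """Return (level name, indirect blocks read) for a logical block index.

    The second value is the number of metadata blocks that must be read,
    beyond the inode itself, before the data block address is known.
    """
    if block_index < 0:
        raise ValueError("block index must be non-negative")
    if block_index < direct:
        return ("direct", 0)
    block_index -= direct
    if block_index < ptrs_per_block:
        return ("single indirect", 1)
    block_index -= ptrs_per_block
    if block_index < ptrs_per_block ** 2:
        return ("double indirect", 2)
    block_index -= ptrs_per_block ** 2
    if block_index < ptrs_per_block ** 3:
        return ("triple indirect", 3)
    raise ValueError("block index exceeds the maximum file size")


if __name__ == "__main__":
    for index in (0, 11, 12, 524, 1_000_000):
        print(index, pointer_level(index))
```

With 4 KiB blocks, anything past roughly the first gigabyte of a file in this model sits behind three uncached metadata reads.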
4.8 ext4 and extents
Modern file systems such as ext4 improve on the classic pointer-per-block model using extents.
An extent describes a contiguous range, for example:
- logical file blocks 1000 through 1255
- stored in physical blocks 880000 through 880255
That is much more compact than storing hundreds of separate block pointers when the file is mostly contiguous.
So the classic direct/indirect tree is still essential interview knowledge, but in real ext4 the common case is often an inode containing or pointing to an extent tree.
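A rough sketch of extent lookup, using the example range above. The Extent type and map_block function are illustrative only; real ext4 stores extents in an on-disk tree with its own format.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    logical_start: int   # first logical file block covered
    physical_start: int  # first physical block on the device
    length: int          # number of contiguous blocks


def map_block(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Translate a logical file block to a physical block, or None for a hole.

    `extents` must be sorted by logical_start and must not overlap.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    extent = extents[i]
    offset = logical_block - extent.logical_start
    if offset >= extent.length:
        return None  # falls in a gap between extents
    return extent.physical_start + offset


if __name__ == "__main__":
    # One record describes 256 blocks that would need 256 separate pointers.
    extents = [Extent(0, 5000, 8), Extent(1000, 880000, 256)]
    print(map_block(extents, 1000))
    print(map_block(extents, 1255))
    print(map_block(extents, 500))
```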
5. Directories and Name Resolution
Directories are not magical containers. In Unix-like systems, a directory is a special file whose contents encode mappings from names to inode numbers.
5.1 Directory as a special file
A directory entry conceptually stores something like:
- filename
- inode number
- sometimes entry type hints
That means the directory itself has an inode, occupies blocks, and is subject to permissions.
Large directories may use indexed structures. For example, ext4 uses hashed directory indexing for scalability.
5.2 Absolute vs relative paths
- An absolute path starts from the root, such as /usr/bin/python3.
- A relative path starts from the process's current working directory, such as logs/app.log.
The starting point changes, but the resolution logic is the same.
5.3 Path traversal step by step
Suppose a process calls:
open("/var/log/app/current.log", O_RDONLY);
The kernel conceptually does this:
- Start from the root directory inode because the path is absolute.
- Look up var in that directory.
- Confirm execute permission on the directory for traversal.
- Load the inode for var.
- Look up log inside var.
- Repeat for app.
- Look up current.log.
- Resolve symlinks if encountered, subject to limits.
- Perform final permission checks.
- Create an open file object and return a file descriptor.
The kernel tries to avoid repeating all of this work by caching directory entries and inodes in memory.
flowchart LR
A["Path string<br/>/var/log/app/current.log"] --> B["Start at root or cwd"]
B --> C["Lookup next component<br/>in directory"]
C --> D["Check permissions<br/>and follow symlink rules"]
D --> E["Load or reuse inode and dentry"]
E --> F["Repeat until final component"]
F --> G["Create open file object<br/>return file descriptor"]
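The component-by-component walk can be modeled in a few lines. This is a deliberately simplified toy: the DIRECTORIES table and its inode numbers are invented, and it omits permission checks, symlinks, mount points, ".." handling, and caching.

```python
from typing import Dict, List

ROOT_INODE = 2  # ext-family file systems use inode 2 for the root directory

# Toy "disk": directory inode number -> {name: child inode number}.
# All numbers below are invented for illustration.
DIRECTORIES: Dict[int, Dict[str, int]] = {
    2: {"var": 10},
    10: {"log": 11},
    11: {"app": 12},
    12: {"current.log": 912341},
}


def resolve(path: str, cwd_inode: int = ROOT_INODE) -> List[int]:
    """Walk a path one component at a time; return the inodes visited."""
    current = ROOT_INODE if path.startswith("/") else cwd_inode
    visited = [current]
    for name in path.split("/"):
        if name in ("", "."):
            continue
        entries = DIRECTORIES.get(current)
        if entries is None:
            raise NotADirectoryError(name)  # tried to traverse a non-directory
        if name not in entries:
            raise FileNotFoundError(name)
        current = entries[name]
        visited.append(current)
    return visited


if __name__ == "__main__":
    print(resolve("/var/log/app/current.log"))
    print(resolve("app/current.log", cwd_inode=11))
```

Absolute and relative paths differ only in the starting inode; the loop is identical, which is the point made in section 5.2.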
5.4 Hard links
A hard link is another directory entry pointing to the same inode.
ln original.txt alias.txt
ls -li original.txt alias.txt
Both names refer to the same inode. Neither is the "real" one. The file's data is reclaimed only when:
- link count reaches zero
- and no process still has the inode open
Hard links normally cannot span file systems, because inode numbers are meaningful only inside one file system.
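The link-count behavior is easy to observe from Python. This sketch assumes a POSIX system and a file system that supports hard links; hard_link_demo is an illustrative helper.

```python
import os
import tempfile


def hard_link_demo(directory: str) -> dict:
    original = os.path.join(directory, "original.txt")
    alias = os.path.join(directory, "alias.txt")
    with open(original, "w") as f:
        f.write("payload")

    os.link(original, alias)  # second directory entry for the same inode
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    links_before = os.stat(alias).st_nlink

    os.unlink(original)  # removes one name; the inode and data survive
    with open(alias) as f:
        data = f.read()

    return {
        "same_inode": same_inode,
        "links_before": links_before,
        "links_after": os.stat(alias).st_nlink,
        "data": data,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(hard_link_demo(d))
```

Removing the "original" name changes nothing about the data, because neither name was ever privileged.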
5.5 Symbolic links
A symbolic link is a separate file whose contents are a path string.
ln -s /var/log/app/current.log current-link
The symlink has its own inode. Accessing it causes another lookup step. Symlinks can cross file system boundaries because they store names, not inode references.
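For contrast with hard links, this sketch shows that a symlink stores a path and has its own inode, and that it dangles once the target name is removed. It assumes a POSIX system; symlink_demo is an illustrative helper.

```python
import os
import stat
import tempfile


def symlink_demo(directory: str) -> dict:
    target = os.path.join(directory, "current.log")
    link = os.path.join(directory, "current-link")
    with open(target, "w") as f:
        f.write("log line\n")
    os.symlink(target, link)

    result = {
        # The link's contents are just the stored path string.
        "stored_path": os.readlink(link),
        # lstat() inspects the link itself; stat() follows it to the target.
        "is_symlink": stat.S_ISLNK(os.lstat(link).st_mode),
        "own_inode": os.lstat(link).st_ino != os.stat(link).st_ino,
    }

    os.unlink(target)
    # The link still exists, but following it now fails: it is dangling.
    result["dangling"] = os.path.lexists(link) and not os.path.exists(link)
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(symlink_demo(d))
```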
5.6 rename is usually metadata work
When you rename a file within the same file system, the OS often just updates directory entries. The data blocks usually do not move.
That is why same-filesystem rename() is fast and why it is used for atomic replacement patterns.
Important production detail:
- rename within the same mounted file system can be atomic
- rename across file systems is not a simple metadata update and usually turns into copy plus unlink behavior at a higher layer
5.7 unlink and deleted-but-open files
unlink() removes a directory entry. It does not necessarily free storage immediately.
If a process still has the file open:
- the name disappears from the directory
- the inode still exists in memory and on disk
- the storage is reclaimed only after the last reference is gone
This is a classic Linux debugging case:
- a service keeps a deleted log file open
- rm appears to succeed
- disk space does not return
Useful command:
lsof +L1
That finds open files with link count below one.
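The deleted-but-open state can be reproduced in a few lines. This is a sketch for POSIX systems; unlink_while_open is an illustrative helper.

```python
import os
import tempfile


def unlink_while_open(directory: str):
    path = os.path.join(directory, "service.log")
    with open(path, "wb") as f:
        f.write(b"still here")

    fd = os.open(path, os.O_RDONLY)  # plays the role of the long-running service
    try:
        os.unlink(path)                      # what `rm` does
        name_exists = os.path.exists(path)   # the directory entry is gone
        link_count = os.fstat(fd).st_nlink   # no names left for this inode
        data = os.read(fd, 100)              # but the data is still readable
    finally:
        os.close(fd)  # only now can the file system reclaim the blocks
    return name_exists, link_count, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(unlink_while_open(d))
```

Until the descriptor is closed, the inode and its blocks stay allocated, which is why the space does not show up as free.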
6. File Allocation Methods
How should a file system place file data on disk? This is a foundational design question.
6.1 Contiguous allocation
Store the file in one continuous run of blocks.
Strengths:
- excellent sequential performance
- simple block computation
- minimal metadata overhead
Weaknesses:
- hard for growing files
- external fragmentation becomes a problem
- finding large continuous free regions gets harder over time
This is conceptually ideal for reading but awkward for real workloads where files grow unpredictably.
6.2 Linked allocation
Each block points to the next block.
Strengths:
- files can grow easily
- no need for one large contiguous area
Weaknesses:
- random access is poor
- pointer corruption is dangerous
- pointer overhead consumes space
FAT is the classic teaching example, though its pointer structure is centralized in a table rather than embedded directly inside data blocks.
6.3 Indexed allocation
Store block pointers in a separate index structure.
Strengths:
- good random access
- flexible growth
- clean separation between metadata and data
Weaknesses:
- extra metadata reads may be needed
- index structures themselves consume space
Unix inode designs are a form of indexed allocation.
6.4 Extent-based allocation
Store ranges instead of single-block pointers.
Strengths:
- compact metadata for large contiguous regions
- better sequential locality
- lower pointer overhead
Weaknesses:
- fragmentation still matters when files grow in many places
- allocation policy becomes more complex
Modern file systems such as ext4 and XFS use extent-based strategies because they are a practical compromise between performance and flexibility.
6.5 The real tradeoff
The design space is about balancing:
- sequential performance
- random access efficiency
- metadata overhead
- growth flexibility
- fragmentation resistance
There is no single perfect method. The right answer depends on workload shape.
7. Free Space Management
If the file system knows where used blocks are, it also needs to know where free blocks are.
This is harder than it first appears because allocation policy directly affects performance.
7.1 Bitmaps
A bitmap uses one bit per block or inode to indicate whether it is free.
Strengths:
- compact
- easy to scan for contiguous runs
- good fit for extent allocation
Weaknesses:
- scanning can be expensive on large file systems if poorly optimized
Many modern file systems, including ext-family systems, use bitmap-based tracking.
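A minimal bitmap allocator shows why bitmaps pair well with contiguous allocation. The BlockBitmap class is a teaching sketch with a first-fit linear scan; it is not how ext4's allocator is actually implemented.

```python
from typing import Optional


class BlockBitmap:
    """One bit per block: 1 = used, 0 = free."""

    def __init__(self, nblocks: int) -> None:
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def is_used(self, block: int) -> bool:
        return bool(self.bits[block >> 3] & (1 << (block & 7)))

    def _set(self, block: int, used: bool) -> None:
        if used:
            self.bits[block >> 3] |= 1 << (block & 7)
        else:
            self.bits[block >> 3] &= ~(1 << (block & 7)) & 0xFF

    def allocate_run(self, length: int) -> Optional[int]:
        """First-fit: mark `length` contiguous free blocks used, return the start."""
        if length <= 0:
            raise ValueError("length must be positive")
        run_start, run_len = 0, 0
        for block in range(self.nblocks):
            if self.is_used(block):
                run_len = 0
                continue
            if run_len == 0:
                run_start = block
            run_len += 1
            if run_len == length:
                for b in range(run_start, run_start + length):
                    self._set(b, True)
                return run_start
        return None  # no contiguous run of that size exists

    def free_run(self, start: int, length: int) -> None:
        for b in range(start, start + length):
            self._set(b, False)


if __name__ == "__main__":
    bitmap = BlockBitmap(16)
    print(bitmap.allocate_run(4), bitmap.allocate_run(4))
    bitmap.free_run(0, 4)
    print(bitmap.allocate_run(6))
```

The linear scan is also the weakness named above: on a very large volume, an unoptimized search for a long free run touches a lot of bitmap.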
7.2 Free lists
A free list stores free blocks as a linked list or chain.
Strengths:
- simple conceptually
Weaknesses:
- poor at finding large contiguous regions quickly
- cache-unfriendly for large-scale allocation decisions
7.3 Grouping
Grouping stores the addresses of several free blocks together, often with one free block pointing to a batch of others.
This reduces traversal overhead compared with a single pointer chain.
7.4 Counting
Counting stores free space as runs, such as:
- start block 90000
- length 2048 blocks
This is efficient when free space tends to be contiguous. It fits naturally with extent-oriented allocators.
7.5 Allocation efficiency matters
Free space management is not just bookkeeping. It affects:
- fragmentation
- locality
- metadata contention
- future read and write performance
For example, ext4 tries to allocate blocks near the owning inode or directory when possible, because good placement today prevents performance pain tomorrow.
8. Journaling and Crash Consistency
This is one of the most important practical topics in operating systems.
8.1 The crash consistency problem
Many file operations require multiple physical updates.
Creating a new file might involve:
- allocating a free inode
- allocating data blocks
- writing file data
- updating the inode
- updating the directory entry
- updating free-space metadata
If the machine crashes in the middle, the disk may contain a partially applied update. That can leave the file system inconsistent.
8.2 Why partial writes are dangerous
The key issue is not just losing the last few bytes of a file. The bigger issue is breaking metadata invariants.
Examples:
- an inode claims a block that the free-space bitmap still marks free
- a directory entry points to an uninitialized inode
- file size says 64 KB but only half the block mapping is valid
Without recovery logic, the whole file system may become corrupt.
8.3 Write-ahead logging
Journaling uses write-ahead logging.
The basic idea is:
- write a description of the metadata changes to a journal
- mark the journal transaction committed
- later apply those changes to the main file system structures
After a crash, the kernel can replay committed journal entries and restore a consistent state.
flowchart LR
A["User operation<br/>create write rename"] --> B["Prepare metadata updates"]
B --> C["Write journal transaction"]
C --> D["Commit journal entry"]
D --> E["Checkpoint updates<br/>to home locations"]
E --> F["Clear or advance journal"]
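The commit-then-replay rule can be captured in a toy model. ToyJournal and its record layout are invented for illustration; real journals such as ext4's jbd2 log whole blocks and use checksums and sequence numbers.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str, str]  # (kind, transaction id, key, value)


class ToyJournal:
    def __init__(self) -> None:
        self.log: List[Record] = []      # journal area, survives a crash
        self.home: Dict[str, str] = {}   # home locations, survive a crash

    def log_update(self, txid: int, key: str, value: str) -> None:
        self.log.append(("update", txid, key, value))

    def commit(self, txid: int) -> None:
        self.log.append(("commit", txid, "", ""))

    def replay(self) -> None:
        """Crash recovery: apply committed transactions, drop the rest."""
        committed = {txid for kind, txid, _, _ in self.log if kind == "commit"}
        for kind, txid, key, value in self.log:
            if kind == "update" and txid in committed:
                self.home[key] = value
        self.log.clear()


if __name__ == "__main__":
    journal = ToyJournal()
    journal.log_update(1, "inode 42", "allocated")
    journal.log_update(1, "dirent report.txt", "inode 42")
    journal.commit(1)
    journal.log_update(2, "bitmap block 7", "used")  # crash before commit
    journal.replay()
    print(journal.home)
```

Transaction 2 never reaches the home locations, so the file system ends up as if it had never started: all-or-nothing per transaction.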
8.4 Metadata journaling
In metadata journaling, the file system journals metadata changes but not necessarily the file data blocks themselves.
This protects structural consistency while avoiding the overhead of writing all file data twice.
8.5 Full data journaling
In full data journaling, both metadata and file data are written to the journal before reaching home locations.
Strengths:
- strong crash guarantees
Weaknesses:
- higher write amplification
- lower write throughput
This mode is safer but more expensive.
8.6 Ordered journaling
In ordered journaling, metadata is journaled, and the file system ensures that dirty data blocks reach their home locations before the metadata commit that would expose them as valid.
This is a practical compromise and historically the default mode for ext3/ext4.
It avoids a particularly ugly failure mode where metadata says new data exists but the actual data blocks still contain garbage or old contents.
8.7 ext3/ext4-style intuition
In ext-style systems you will often hear about modes like:
- data=journal
- data=ordered
- data=writeback
Interpret them as a spectrum:
- stronger consistency usually means more writes and less throughput
- weaker ordering often means better performance but more surprising crash behavior
8.8 fsck vs journaling
fsck and journaling solve related but different problems.
fsck:
- scans and repairs the file system by checking invariants
- can be very slow on large volumes
- may need to reconstruct or discard damaged state
Journaling:
- keeps a short recent log of intended updates
- makes crash recovery much faster
- usually restores metadata consistency quickly after unclean shutdown
Journaling does not mean application-level data is always safe. It mainly protects filesystem consistency.
8.9 Practical crash-safe file replacement
The robust pattern for replacing a config file is not:
- open target
- overwrite target
- close target
The safer pattern is:
- write a new temporary file in the same directory as the target, so the later rename stays within one file system
- fsync() the temporary file
- rename() it over the old file
- fsync() the parent directory
That last step is often forgotten. Directory metadata also needs durability.
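The four steps translate fairly directly into code. This is a sketch for POSIX systems; atomic_write is a hypothetical helper, and it does not preserve the old file's ownership or mode bits (mkstemp creates the temporary file with mode 0600).

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))

    # 1. Temporary file in the same directory, so rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # 2. make the new contents durable
        os.replace(tmp_path, path)   # 3. atomic rename over the target
    except BaseException:
        try:
            os.unlink(tmp_path)      # clean up if anything failed before the rename
        except FileNotFoundError:
            pass
        raise

    # 4. Make the directory entry change itself durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "app.conf")
        atomic_write(target, b"mode = safe\n")
        with open(target, "rb") as f:
            print(f.read())
```

Each call pays for two fsync operations, so this pattern fits configuration files and small state files rather than high-frequency updates.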
9. The Virtual File System (VFS)
The kernel needs one common interface that works across many filesystem types.
Applications should be able to call:
- open
- read
- write
- stat
- rename
without caring whether the target lives on ext4, XFS, tmpfs, procfs, or a network file system.
That abstraction layer is the Virtual File System.
9.1 Why VFS exists
The VFS solves a very practical kernel engineering problem:
- many file systems
- one syscall API
- shared kernel mechanisms for caching, pathname lookup, permissions, mounts, and open file descriptors
9.2 Key kernel objects on Linux
Linux uses several central structures:
- superblock: one mounted file system instance
- inode: one filesystem object
- dentry: one directory-cache name component
- file: one open file description, including current offset and flags
BSD-derived systems often talk about vnodes. The concept is similar: a filesystem-neutral kernel object representing a file-like node.
9.3 open path through VFS
When a process calls open():
- syscall enters kernel
- VFS parses the pathname
- VFS walks dentries and mount points
- target filesystem provides lookup operations
- permissions are checked
- inode is resolved
- a struct file is created
- a file descriptor is installed in the process's file descriptor table
flowchart LR
A["open /path/file"] --> B["VFS pathname walk"]
B --> C["Dentry cache lookup"]
C --> D["Filesystem-specific lookup<br/>ext4 xfs nfs tmpfs"]
D --> E["Resolve inode and permissions"]
E --> F["Create open file object"]
F --> G["Return file descriptor"]
9.4 read and write through VFS
The VFS also normalizes read and write behavior:
- read() usually goes through page cache first
- write() usually updates page cache first unless special flags are used
- filesystem-specific code handles block mapping, journaling, and writeback details
9.5 Why multiple filesystem types fit the same model
The VFS lets very different backends look file-like:
- ext4: general-purpose journaling disk filesystem
- xfs: high-performance extent-based filesystem
- tmpfs: RAM-backed filesystem whose pages can be swapped out under memory pressure
- procfs: synthetic kernel information exposed as files
- nfs: file operations forwarded over the network
That is one of the most powerful Unix design patterns: treat many resources as files, but keep the kernel abstraction flexible enough that the implementation can vary radically underneath.
10. Page Cache and Buffer Cache
This section matters a lot in real systems.
When developers say "the app wrote the file," what often really happened is "the kernel copied data into page cache and promised to flush it later."
10.1 Why caching exists
Disk access is vastly slower than RAM access. Even SSD access is still orders of magnitude slower than CPU caches and main memory.
So the kernel caches file data in memory to:
- avoid repeated device reads
- coalesce writes
- allow read-ahead and write-behind
- reduce syscall-visible latency
10.2 Read path
For a typical buffered read():
- process calls read(fd, buf, n)
- kernel checks whether the needed file pages are already in page cache
- on a cache hit, data is copied from cache to user buffer
- on a cache miss, the filesystem maps file offsets to disk blocks
- block I/O is issued
- device DMA fills memory pages
- pages enter page cache
- data is copied to user space
The kernel may also trigger readahead if it detects sequential access.
10.3 Write path
For a typical buffered write():
- process calls write(fd, buf, n)
- kernel copies data into page cache pages for that file
- those pages are marked dirty
- write() may return before the data is durable on storage
- background writeback threads later flush dirty pages to storage
- journaling and barriers determine metadata ordering and durability behavior
This is why writes can look fast until the system is forced to flush.
flowchart TB
R1["read()"] --> R2["Check page cache"]
R2 -->|hit| R3["Copy to user buffer"]
R2 -->|miss| R4["Map file offset to blocks"]
R4 --> R5["Issue block I/O"]
R5 --> R6["Fill page cache via DMA"]
R6 --> R3
W1["write()"] --> W2["Copy user data to page cache"]
W2 --> W3["Mark pages dirty"]
W3 --> W4["Background writeback or fsync"]
W4 --> W5["Filesystem commit and device flush"]
10.4 Dirty pages and writeback
Dirty pages are cached file pages whose contents differ from what is currently durable on storage.
Linux allows dirty data to accumulate up to configured thresholds. When too much dirty data builds up:
- background flushers start writeback
- writers may be throttled
- latency spikes can appear
Useful Linux signals:
grep -E 'Dirty|Writeback' /proc/meminfo
vmstat 1
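For scripted monitoring, the same counters can be parsed from /proc/meminfo. This sketch is Linux-specific; parse_meminfo and dirty_summary are illustrative helper names, and values are reported in kB as the file itself does.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def dirty_summary(text: str) -> dict:
    """Pick out the two fields that track pending and in-flight writeback."""
    info = parse_meminfo(text)
    return {key: info.get(key, 0) for key in ("Dirty", "Writeback")}


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(dirty_summary(f.read()))
    except FileNotFoundError:
        print("/proc/meminfo not available on this system")
```

Sampling this once per second during a write-heavy test shows dirty data building up and then draining when writeback or fsync kicks in.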
10.5 sync, fsync, fdatasync
These are often misunderstood.
- sync() asks the kernel to flush dirty data system-wide
- fsync(fd) asks for the file's data and required metadata to be durable
- fdatasync(fd) is like fsync but may skip metadata updates that are not needed to read the data back, such as timestamps
In production, fsync is the operation that makes people discover how expensive durability really is.
10.6 Why fsync is expensive
fsync is not just "write these bytes." It may require:
- flushing dirty file pages
- writing journal records
- waiting for metadata commit
- issuing cache flush commands to the device
- waiting for controller acknowledgment
On cloud block storage, there may also be hypervisor or network replication in the path.
10.7 Buffer cache vs page cache
Historically, Unix systems described a buffer cache that held disk blocks, including file system metadata, and a separate page cache that held file data pages.
In modern Linux, the picture is mostly unified around the page cache, though buffer-head-like metadata structures may still exist internally for block mapping and bookkeeping.
The practical takeaway is not the historical naming. It is this:
- cached file I/O and memory management are deeply intertwined
- file data lives in memory pages that the kernel manages like other memory-backed objects
10.8 O_DIRECT and why databases care
Some workloads use O_DIRECT to reduce or bypass page cache involvement.
Reasons include:
- avoiding double caching between the kernel and database buffer pool
- more explicit control over write ordering and eviction
But O_DIRECT is not a universal win. It can reduce cache pollution in some systems and hurt performance badly in others.
11. mmap()
mmap() lets a file appear as part of a process address space.
This is where memory management and file systems directly meet.
11.1 File-backed memory
With mmap, the process does not call read() for every access. Instead, it accesses memory addresses. The kernel loads file-backed pages on demand.
11.2 Page faults drive loading
When the process first touches a file-backed page that is not yet present in its page tables:
- CPU raises a page fault
- kernel sees the faulting address belongs to a file mapping
- kernel finds the corresponding file offset
- page cache is checked or filled
- page table is updated
- the process resumes
This is lazy loading backed by the same file/page cache machinery.
11.3 Shared vs private mappings
- MAP_SHARED: writes can propagate back to the file and be visible to other mappers
- MAP_PRIVATE: copy-on-write view; modifications affect private pages, not the file
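The difference is visible from Python's mmap module, where on Unix ACCESS_COPY corresponds to a private copy-on-write mapping and ACCESS_WRITE to a shared one. mapping_demo is an illustrative helper, and the 8-byte payloads are arbitrary.

```python
import mmap
import os
import tempfile


def mapping_demo(path: str):
    with open(path, "wb") as f:
        f.write(b"original")

    # Private mapping: the write lands in a private copy of the page.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
            m[0:8] = b"PRIVATE!"
            private_view = bytes(m[0:8])
    with open(path, "rb") as f:
        after_private = f.read()

    # Shared mapping: the write goes to page cache pages backing the file.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            m[0:8] = b"SHARED!!"
            m.flush()  # ask for the dirty pages to be written back
    with open(path, "rb") as f:
        after_shared = f.read()

    return private_view, after_private, after_shared


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(mapping_demo(os.path.join(d, "data.bin")))
```

The private write is visible only through that mapping, while the shared write shows up in an ordinary read of the file.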
11.4 Performance implications
mmap can be powerful because it:
- avoids explicit copy loops in some cases
- integrates with demand paging
- lets the kernel manage readahead and caching naturally
But it also has costs:
- page fault overhead
- tricky error handling semantics
- harder control of writeback timing
- possible major faults under memory pressure
11.5 Database usage
Some systems, such as LMDB, lean heavily on mmap. Others, such as PostgreSQL, prefer explicit buffered I/O and their own buffer management for tighter control.
The key question is not whether mmap is good or bad. It is whether you want the kernel's paging policy or the application's own caching policy to dominate.
12. File System Performance and Debugging
Storage performance problems are rarely just "disk is slow." You need to reason layer by layer.
12.1 Sequential vs random I/O
Sequential I/O is friendly because:
- readahead works well
- extent locality helps
- HDDs avoid repeated seeks
- SSDs can stream through controller pipelines efficiently
Random I/O is harder because:
- HDDs pay seek and rotation repeatedly
- metadata lookups are less cache-friendly
- SSDs may still suffer queueing and mapping overhead
12.2 The small files problem
Small files are deceptively expensive.
A 200-byte file may consume:
- one inode
- one directory entry
- one or more data blocks depending on layout
- journaling traffic for metadata
- many cache and lookup operations relative to useful payload
This is why systems with millions of tiny files can become metadata-bound rather than bandwidth-bound.
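A rough back-of-the-envelope sketch of that overhead follows. The 4 KiB block size and 256-byte inode size are assumed values chosen for illustration, and the model ignores directory entries, journaling, and file systems that store tiny files inline in the inode.

```python
import math


def small_file_footprint(payload_bytes, block_size=4096, inode_size=256):
    """Return (allocated_bytes, payload_fraction) under the assumed sizes."""
    data_blocks = math.ceil(payload_bytes / block_size)
    allocated = data_blocks * block_size + inode_size
    return allocated, payload_bytes / allocated


allocated, fraction = small_file_footprint(200)
print(f"{allocated} bytes allocated, {fraction:.1%} of it is payload")
```

Under these assumptions a 200-byte file occupies 4352 bytes, so less than 5% of the space it consumes is useful payload.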
12.3 Metadata bottlenecks
Not all filesystem work is data transfer. Sometimes the bottleneck is:
- inode lookup
- path traversal
- directory locking
- journal commit throughput
- inode or dentry cache churn
If a workload creates and deletes huge numbers of files, metadata can dominate the total cost.
12.4 Fragmentation
Fragmentation hurts locality.
- HDDs suffer more because the head moves between scattered regions
- SSDs suffer less from location changes, but fragmented metadata and smaller I/Os can still reduce efficiency
Useful Linux tool:
filefrag -v bigfile
12.5 Caching effects
Two runs of the same program can have totally different performance depending on cache state.
Questions to ask:
- was the data already in page cache
- did directory entries come from dentry cache
- did readahead trigger
- were writes absorbed in cache and delayed
12.6 fsync cost and sync storms
If many threads or services call fsync frequently, the system can enter a pattern of repeated forced flushes and journal commits.
Symptoms include:
- low average throughput
- very high p99 latency
- bursts of stalled writers
This is common in:
- databases
- durable message queues
- log-heavy services
- applications doing crash-safe file replacement incorrectly or too often
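For the last item, a commonly used pattern for crash-safe replacement is sketched below in Python. It assumes a POSIX system (directory fsync is not available on Windows), and the function name is hypothetical. Note that it costs two fsync calls per replacement, which is exactly why doing it too often feeds a sync storm.

```python
import os
import tempfile


def replace_file_durably(path, data: bytes):
    """Write data so that a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> page cache
            os.fsync(tmp.fileno())  # page cache -> stable storage
        os.replace(tmp_path, path)  # atomic rename over the old name
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the directory entry change as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the first fsync is the classic incorrect variant: the rename can become durable before the file contents do.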
12.7 Write amplification
One logical write can become many physical writes because of:
- journaling
- metadata updates
- SSD erase block behavior
- RAID parity updates
- copy-on-write designs in some file systems or storage layers
This is why small synchronous writes can perform much worse than application developers expect.
12.8 Practical Linux debugging intuition
Useful commands when debugging storage behavior:
iostat -x 1
pidstat -d 1
vmstat 1
df -h
df -i
strace -tt -T -e openat,read,write,fsync your_program
How to interpret them:
- high disk utilization with low throughput often suggests small random I/O or sync-heavy workload
- low disk utilization with slow app I/O may suggest lock contention, page faults, throttling, or network-backed storage delay
- inode exhaustion, visible in df -i, can break systems even when df -h shows free space
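The inode check can also be done from inside a program. This is a small sketch that assumes a Unix-like system where os.statvfs is available; the function name is illustrative.

```python
import os


def inode_usage(path="/"):
    """Return the fraction of inodes in use, or None if the file system reports no fixed count."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes
    free = st.f_ffree    # free inodes
    if total == 0:
        return None
    return (total - free) / total


usage = inode_usage("/")
print("inode usage:", "n/a" if usage is None else f"{usage:.1%}")
```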
13. Modern Storage: SSD Internals
To understand modern I/O behavior, you need a rough model of how flash works.
13.1 Flash pages and erase blocks
NAND flash is not overwritten in place the way applications imagine files are.
Typical properties:
- reads happen at page granularity
- writes happen at page granularity
- erases happen at much larger erase-block granularity
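You cannot overwrite an already-written page in place. It has to be erased first, and erasure only happens a whole erase block at a time. So the controller writes new data elsewhere, remaps the logical address, and eventually reclaims the stale pages.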
13.2 Flash Translation Layer (FTL)
The SSD exposes logical block addresses, but internally it maps them to physical flash locations using the Flash Translation Layer.
The FTL handles:
- logical-to-physical remapping
- garbage collection
- wear leveling
- bad block management
This means the device itself is already doing sophisticated scheduling and mapping behind the OS's back.
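A toy model makes the remapping idea concrete. The sketch below is a deliberately simplified illustration, not how any real controller works: it has no erase blocks, garbage collection, or wear leveling, and all names are hypothetical.

```python
class ToyFTL:
    """Toy model: out-of-place page writes behind a logical-to-physical map."""

    def __init__(self, physical_pages):
        self.flash = [None] * physical_pages  # physical page contents
        self.stale = set()                    # physical pages holding outdated data
        self.l2p = {}                         # logical page -> physical page
        self.next_free = 0                    # simple append-only write pointer

    def write(self, logical_page, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; a real FTL would garbage-collect here")
        old = self.l2p.get(logical_page)
        if old is not None:
            self.stale.add(old)               # old copy is invalidated, not overwritten
        self.flash[self.next_free] = data
        self.l2p[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.flash[self.l2p[logical_page]]

    def trim(self, logical_page):
        """Model a discard: the page becomes reclaimable without a rewrite."""
        old = self.l2p.pop(logical_page, None)
        if old is not None:
            self.stale.add(old)
```

Writing the same logical page twice lands on two different physical pages, which is why logical locality does not imply physical locality.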
13.3 Wear leveling
Flash cells wear out after repeated program/erase cycles. Wear leveling spreads writes across the device to avoid burning out hot regions.
From the OS perspective, that means logical locality is not the same as physical locality.
13.4 TRIM
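When the file system frees blocks, the SSD does not know those pages are no longer needed unless the OS sends discard (TRIM) information. To the device, a freed block still looks like valid data until it is overwritten or trimmed.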
TRIM helps the controller:
- reclaim invalid pages earlier
- reduce garbage collection pressure
- improve sustained write performance
13.5 Why SSD scheduling differs from HDD scheduling
Classical disk scheduling optimizes head movement on rotating disks.
SSDs do not need seek optimization, but they still care about:
- queue depth
- request merging
- read/write balance
- latency control under heavy writeback
- internal parallelism across channels and dies
So the scheduling problem changes rather than disappearing.
14. Why Disk Scheduling Exists
Disk scheduling exists because storage requests often arrive faster than the device can serve them, and in an order that is inefficient for the device.
14.1 The raw problem on HDDs
If ten processes ask for blocks in random order, a naive scheduler may send the disk head zig-zagging across the platter constantly.
That destroys throughput and increases latency.
So the OS reorders requests to improve one or more of:
- total throughput
- average latency
- tail latency
- fairness
- starvation resistance
14.2 Throughput vs fairness
The scheduler is not only trying to make the disk fast. It is deciding whose requests wait.
A policy that always serves the closest request may maximize seek efficiency but starve requests far away.
That is why scheduling is always a tradeoff between mechanical efficiency and fairness.
15. Disk Access Time Components
Understanding disk scheduling starts with understanding what contributes to request latency.
15.1 Seek time
The time required to move the disk arm to the correct track.
On HDDs this can be one of the dominant costs for random I/O.
15.2 Rotational latency
Once the head reaches the right track, the platter still has to rotate until the desired sector passes under the head.
Average rotational latency is about half a rotation.
15.3 Transfer time
Once positioned correctly, the actual data transfer begins. For small requests, transfer time may be much smaller than seek plus rotation.
15.4 Queueing delay
This is the time the request waits before the device starts serving it.
Under heavy load, queueing delay can dominate everything else, even on SSDs.
15.5 What dominates in practice
- On HDD random I/O, seek and rotational latency dominate.
- On HDD sequential I/O, transfer time becomes more important.
- On SSDs, controller behavior and queueing are often more important than media access time.
- On cloud block devices, network and virtualization delay may dominate.
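To put numbers on the HDD case, here is a small worked sketch. The 7200 RPM speed, 9 ms average seek, and 150 MB/s transfer rate are assumed, illustrative values rather than measurements of any particular drive, and queueing delay is left out.

```python
def hdd_access_components_ms(rpm, avg_seek_ms, request_bytes, transfer_mb_per_s):
    """Return (seek, rotational latency, transfer) in milliseconds."""
    rotation_ms = 60_000 / rpm                 # one full rotation
    rotational_latency_ms = rotation_ms / 2    # on average, half a rotation
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return avg_seek_ms, rotational_latency_ms, transfer_ms


seek, rotation, transfer = hdd_access_components_ms(7200, 9.0, 4096, 150)
total = seek + rotation + transfer
print(f"seek={seek:.2f} ms rotation={rotation:.2f} ms transfer={transfer:.3f} ms")
print(f"total={total:.2f} ms, roughly {1000 / total:.0f} random IOPS")
```

With these assumptions a random 4 KiB read costs about 13 ms, of which the transfer itself is under 0.03 ms.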
16. Classical Disk Scheduling Algorithms
These algorithms are taught because they build intuition about queue management and fairness, even though modern devices often do additional reordering internally.
Assume current head position is 53 and pending requests are:
98, 183, 37, 122, 14, 124, 65, 67
16.1 FCFS
First-Come, First-Served serves requests in arrival order.
Order here:
53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67
Why it exists:
- simple
- fair in arrival order
- no starvation from reordering
Why it performs poorly on HDDs:
- terrible seek behavior under random workloads
- high average head movement
- convoy effects where unlucky order hurts everyone
FCFS is a good baseline for fairness, not for mechanical efficiency.
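The cost of that order can be measured as total head movement in tracks. A minimal sketch using the example queue above (function names are illustrative):

```python
def total_head_movement(start, order):
    """Sum of track distances travelled when serving requests in the given order."""
    total, position = 0, start
    for track in order:
        total += abs(track - position)
        position = track
    return total


def fcfs(requests):
    """FCFS serves requests exactly in arrival order."""
    return list(requests)


requests = [98, 183, 37, 122, 14, 124, 65, 67]
print(total_head_movement(53, fcfs(requests)))  # 640
```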
16.2 SSTF
Shortest Seek Time First serves the request closest to the current head position.
Order here:
53 -> 65 -> 67 -> 37 -> 14 -> 98 -> 122 -> 124 -> 183
Why it is attractive:
- reduces immediate seek distance
- often improves average throughput over FCFS
Its main problem:
- starvation
If requests keep arriving near the current head, distant requests may wait a very long time.
This is the storage equivalent of a scheduler that always favors the most convenient work item.
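A minimal sketch of the greedy selection, assuming a static queue with no new arrivals (which is exactly the case where starvation does not show up):

```python
def sstf(start, requests):
    """Repeatedly pick the pending request closest to the current head position."""
    pending = list(requests)
    position, order = start, []
    while pending:
        nearest = min(pending, key=lambda track: abs(track - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


def path_length(start, order):
    """Total head movement for a service order."""
    return sum(abs(b - a) for a, b in zip([start] + order, order))


order = sstf(53, [98, 183, 37, 122, 14, 124, 65, 67])
print(order, path_length(53, order))
```

For the example queue this yields 236 tracks of head movement.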
16.3 SCAN
SCAN, often called the elevator algorithm, moves in one direction servicing requests until it reaches the end, then reverses.
If the head moves upward first:
53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> 37 -> 14
Why it helps:
- avoids zig-zagging
- provides better fairness than SSTF
- gives more predictable wait times
flowchart LR
A["Low tracks"] --> B["Head moves upward<br/>serving requests on the way"] --> C["Reach high end"] --> D["Reverse direction"] --> E["Serve remaining lower requests"]
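A sketch of SCAN for the example queue follows. The source does not state the disk size, so the last track number (199 here) is an assumption taken from the usual textbook setup.

```python
def scan_upward(start, requests, max_track):
    """SCAN moving toward higher tracks first, reversing at the disk end."""
    upper = sorted(t for t in requests if t >= start)
    lower = sorted((t for t in requests if t < start), reverse=True)
    order = upper + lower
    # Textbook SCAN travels all the way to the last track before reversing.
    movement = max_track - start
    if lower:
        movement += max_track - lower[-1]  # back down to the lowest request
    return order, movement


order, movement = scan_upward(53, [98, 183, 37, 122, 14, 124, 65, 67], max_track=199)
print(order, movement)
```

With the assumed 200-track disk the head travels 331 tracks in total.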
16.4 C-SCAN
Circular SCAN moves in only one servicing direction.
If it services upward:
53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> end -> jump to start -> 14 -> 37
Why it exists:
- more uniform waiting time than SCAN
- tracks are treated as a circular list, so the lowest track logically follows the highest
The jump back is not free physically, but no requests are serviced during that return.
flowchart LR
A["Low tracks"] --> B["Move upward<br/>serve requests"] --> C["Reach high end"] --> D["Jump to low end<br/>without servicing"] --> E["Resume upward scan"]
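A matching sketch for C-SCAN is shown below. Track bounds of 0 and 199 are assumed for illustration, and the return jump is counted as head movement here, which is one common convention.

```python
def c_scan_upward(start, requests, min_track, max_track):
    """C-SCAN: serve upward only, jump back to the lowest track, continue upward."""
    upper = sorted(t for t in requests if t >= start)
    wrapped = sorted(t for t in requests if t < start)
    order = upper + wrapped
    movement = max_track - start
    if wrapped:
        movement += max_track - min_track   # return jump, no requests served
        movement += wrapped[-1] - min_track  # sweep up to the last wrapped request
    return order, movement


order, movement = c_scan_upward(53, [98, 183, 37, 122, 14, 124, 65, 67], 0, 199)
print(order, movement)
```

Under these assumptions the total is 382 tracks. That is more movement than SCAN, traded for more uniform waiting time.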
16.5 LOOK
LOOK is like SCAN, but it does not go all the way to the physical end if no request is there.
Order here:
53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 37 -> 14
It saves unnecessary movement compared with SCAN.
16.6 C-LOOK
C-LOOK is like C-SCAN, but it jumps from the highest pending request to the lowest pending request rather than going to the absolute disk end.
Order here:
53 -> 65 -> 67 -> 98 -> 122 -> 124 -> 183 -> 14 -> 37
This keeps the predictable one-direction service pattern while avoiding some extra movement.
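LOOK and C-LOOK need no knowledge of the disk size, only of the pending requests. A minimal sketch for both, with illustrative function names and the C-LOOK jump counted as movement:

```python
def path_length(start, order):
    """Total head movement for a service order."""
    return sum(abs(b - a) for a, b in zip([start] + order, order))


def look_upward(start, requests):
    """LOOK: sweep up to the highest request, then reverse and sweep down."""
    upper = sorted(t for t in requests if t >= start)
    lower = sorted((t for t in requests if t < start), reverse=True)
    return upper + lower


def c_look_upward(start, requests):
    """C-LOOK: sweep up, jump to the lowest pending request, sweep up again."""
    upper = sorted(t for t in requests if t >= start)
    lower = sorted(t for t in requests if t < start)
    return upper + lower


queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(path_length(53, look_upward(53, queue)))    # 299
print(path_length(53, c_look_upward(53, queue)))  # 322
```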
16.7 How to think about them in interviews
Do not memorize only names. Memorize the tradeoffs:
- FCFS: simple and fair, poor throughput
- SSTF: good average seek, starvation risk
- SCAN: good balance of throughput and fairness
- C-SCAN: more uniform wait time
- LOOK and C-LOOK: like SCAN family but avoid unnecessary travel
17. NCQ and Modern Scheduling
Classical scheduling was designed for an OS directly shaping a single disk queue. Modern storage is more layered.
17.1 Native Command Queuing
With NCQ on SATA drives and deeper queues on NVMe, the device controller can accept multiple outstanding requests and reorder them internally.
That means:
- the OS is no longer the only scheduler
- the controller can optimize for media characteristics the OS cannot see directly
- device firmware can exploit internal parallelism and locality
17.2 Why classical algorithms matter less today
They matter less as exact production policies because:
- drives reorder internally
- SSDs do not need head-movement minimization
- NVMe devices have many queues and much lower latency
But the mental models still matter because real systems still need request management for:
- fairness
- read versus write prioritization
- latency bounds
- cgroup isolation
- merge behavior
- saturated device queues
17.3 Multi-queue block layer
Modern Linux uses a multi-queue block layer (blk-mq) so many CPUs can submit I/O without one global lock becoming the bottleneck.
This matters especially for NVMe, where the hardware itself supports many submission and completion queues.
18. Linux I/O Scheduler Concepts
For interviews and production work, it is more useful to know why each scheduler exists than to memorize every historical default.
18.1 noop
noop does very little beyond simple merging and FIFO-like dispatch. On kernels that use the multi-queue block layer, the closest equivalent is named none.
It exists for cases where lower layers already do the smart work, such as:
- hardware RAID controllers
- SSDs with strong internal scheduling
- virtualized environments where the guest should avoid over-optimizing unknown physical layout
18.2 deadline
deadline tries to prevent starvation by giving requests deadlines while still allowing some reordering for efficiency.
Why it exists:
- SSTF-like behavior can starve distant requests
- databases care about bounded latency, not only throughput
18.3 CFQ
CFQ means Completely Fair Queuing. It tried to give processes fair access to the disk by assigning time slices and queues.
It mattered more in older single-queue block-layer designs and was removed together with that legacy layer in Linux 5.0. It is historically important because it focused on fairness and interactive responsiveness.
18.4 mq-deadline
mq-deadline adapts the deadline idea to the multi-queue block layer.
Think of it as a modern version of deadline designed for current Linux storage stacks.
18.5 BFQ
BFQ means Budget Fair Queueing. It focuses on fairness in terms of bandwidth and can help interactive workloads stay responsive under I/O load.
This is useful when one noisy workload would otherwise dominate the device.
18.6 kyber
kyber is designed for fast multi-queue devices, focusing on latency control by managing queue depth for different request classes.
18.7 The real lesson
Schedulers exist because storage workloads differ:
- some want maximum throughput
- some want predictable latency
- some want fairness across tenants or processes
- some run on devices that already reorder aggressively
Useful Linux check:
cat /sys/block/<device>/queue/scheduler
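That file lists the available schedulers on one line, with the active one in square brackets. A small parsing sketch follows; the sample line in the usage is illustrative, and the set of schedulers on a real machine depends on the kernel build.

```python
import re
from pathlib import Path


def parse_scheduler_line(text):
    """Return (active_scheduler, all_available) from a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", text)
    active = match.group(1) if match else None
    available = text.replace("[", "").replace("]", "").split()
    return active, available


def read_scheduler(device):
    """Read the scheduler line for a block device name such as 'sda' (Linux only)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return parse_scheduler_line(path.read_text())


print(parse_scheduler_line("[mq-deadline] kyber bfq none\n"))
```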
19. RAID and Scheduling Interaction
RAID changes both performance and failure behavior. It also changes how scheduling should be interpreted, because one logical request may map to multiple physical operations.
19.1 RAID 0
RAID 0 stripes data across disks.
Effects:
- higher throughput
- potentially more parallel I/O
- no redundancy
Scheduling implication:
- sequential I/O can scale well across members
- random I/O may gain parallelism, but a single member failure loses the array
19.2 RAID 1
RAID 1 mirrors data.
Effects:
- reads may be served from either mirror
- writes must update all mirrors
Scheduling implication:
- read scheduling can exploit multiple copies
- write latency is influenced by the slower mirror path
19.3 RAID 5
RAID 5 stripes data plus distributed parity.
Effects:
- space-efficient redundancy
- painful small-write behavior due to parity updates
For small writes, the controller may need read-modify-write cycles. That increases latency and write amplification.
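The read-modify-write cycle follows from how XOR parity works. The sketch below assumes simple XOR parity over equally sized blocks; the names are illustrative and no real controller API is implied.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def raid5_small_write(old_data, old_parity, new_data):
    """Compute new parity for a single-block update.

    Costs two reads (old data, old parity) and two writes (new data, new parity).
    """
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = xor_bytes(xor_bytes(d0, d1), d2)

new_d1 = b"\xaa\xbb"
new_parity = raid5_small_write(d1, parity, new_d1)

# The shortcut gives the same parity as recomputing over the full stripe.
assert new_parity == xor_bytes(xor_bytes(d0, new_d1), d2)
```

One logical block write therefore turns into four physical I/Os, which is the small-write penalty in concrete terms.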
19.4 RAID 10
RAID 10 combines mirroring and striping.
Effects:
- strong performance
- better failure tolerance than RAID 0
- better random-write behavior than RAID 5
This is why RAID 10 is often preferred for database workloads that care about both latency and resilience.
19.5 Rebuilds change everything
During rebuild:
- background recovery I/O competes with foreground workload
- queueing increases
- latency spikes are common
In production, an array that benchmarks well while healthy may behave very differently while degraded or rebuilding.
20. Real-World Production Relevance
This is where the theory becomes operationally useful.
20.1 Database latency spikes
A database write path often depends on:
- page cache or direct I/O strategy
- journal or WAL fsync frequency
- device cache flush latency
- controller queue depth
- RAID behavior
- cloud storage variance
If p99 commit latency spikes every few seconds, think about:
- journal commits
- dirty page throttling
- device garbage collection
- burst-credit exhaustion on cloud volumes
- noisy neighbors on shared infrastructure
20.2 Log-heavy systems
Append-only logging sounds simple, but under durability requirements it can become sync-heavy and metadata-heavy.
Questions to ask:
- is every append followed by fsync
- is log rotation causing rename and reopen churn
- are deleted log files still open
- is the device saturated by small sync writes
20.3 fsync bottlenecks
If an application insists on durability after every request, then your performance ceiling is often set by the storage system's durable commit latency, not by CPU.
That is why group commit and batching matter so much in databases and messaging systems.
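The batching idea can be sketched as a small append-only log that pays one fsync per batch instead of one per record. This is a simplified, single-threaded illustration with hypothetical names. A real system would also commit on a timer and acknowledge callers only after the fsync covering their record has returned.

```python
import os


class GroupCommitLog:
    """Append-only log that issues one fsync per batch of records."""

    def __init__(self, path, batch_size):
        self._file = open(path, "ab")
        self._batch_size = batch_size
        self._pending = 0
        self.fsync_calls = 0

    def append(self, record: bytes):
        self._file.write(record + b"\n")
        self._pending += 1
        if self._pending >= self._batch_size:
            self.commit()

    def commit(self):
        """Make all pending records durable with a single flush + fsync."""
        if self._pending == 0:
            return
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1
        self._pending = 0

    def close(self):
        self.commit()
        self._file.close()
```

With a batch size of 4, ten appends cost three fsync calls instead of ten. The price is that a record is not durable until its batch commits.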
20.4 Noisy neighbors
In shared environments, your latency can rise because someone else is consuming queue depth, bandwidth, or controller time.
This may happen on:
- shared SSD arrays
- virtualized hosts
- cloud block devices
- multi-tenant storage services
The file system inside the guest may look healthy while the real bottleneck is outside the guest.
20.5 Cloud block storage behavior
Many cloud "disks" are not local disks. They may be network-attached block devices with caching, replication, and throttling policies outside your VM.
That means:
- latency can have network-like variance
- throughput may depend on provisioned IOPS or burst budgets
- fsync may imply remote durability work
20.6 SSD misconceptions
Common bad assumptions:
- "SSDs make scheduling irrelevant"
- "If average latency is low, tail latency must also be low"
- "delete immediately frees space inside the SSD"
- "write completion means the data is definitely durable on flash"
All of these can be false depending on controller behavior, queueing, caches, and durability settings.
20.7 Practical debugging checklist
When a system is "slow on disk," ask in this order:
- Is the workload read-heavy, write-heavy, metadata-heavy, or sync-heavy?
- Is it hitting page cache or actual storage?
- Is latency caused by queueing, flushes, seek behavior, or cloud/network effects?
- Is fragmentation or file layout hurting locality?
- Is the bottleneck the file system, block layer, controller, device, RAID, or remote backend?
- Are open deleted files, inode exhaustion, or journal pressure involved?
21. Interview Mental Models
If you need to explain these topics clearly in an interview, center your answers around a few strong mental models.
21.1 A file is name plus metadata plus data blocks
More precisely:
- the directory maps name to inode
- the inode stores metadata and block mapping
- the data lives elsewhere
That one model explains hard links, rename, unlink, open deleted files, and path lookup.
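A short Python sketch shows the model in action, assuming a POSIX file system that supports hard links. The function name and file names are illustrative.

```python
import os
import tempfile


def demo_names_and_inodes():
    """Return (same_inode, link_count, data_read_after_unlink)."""
    directory = tempfile.mkdtemp()
    first = os.path.join(directory, "a.txt")
    second = os.path.join(directory, "b.txt")

    with open(first, "w") as f:
        f.write("payload")

    os.link(first, second)  # second directory entry for the same inode
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    handle = open(second)
    os.unlink(first)   # removes a name, not the data
    os.unlink(second)  # no names left, but the open handle keeps the inode alive
    data = handle.read()
    handle.close()     # now the inode and its blocks can be freed

    os.rmdir(directory)
    return same_inode, link_count, data


print(demo_names_and_inodes())
```

Both names resolve to one inode with a link count of 2, and the data stays readable through the open handle after every name is gone.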
21.2 File systems are consistency machines
A file system is not just a storage format. It is a consistency mechanism that preserves invariants across crashes.
That is why journaling, ordering, barriers, and fsync exist.
21.3 Buffered write does not mean durable write
If you say this confidently in an interview and then explain page cache, dirty pages, and fsync, you are operating at the right depth.
21.4 Scheduling is about tradeoffs, not one best algorithm
The right algorithm depends on what you optimize:
- throughput
- fairness
- latency
- starvation resistance
- device characteristics
21.5 Modern systems are layered
Application-visible I/O behavior is shaped by all of these:
- VFS
- concrete filesystem
- page cache
- block layer
- scheduler
- controller
- device internals
- virtualization or cloud storage
If you debug only one layer, you will miss real bottlenecks.
22. Final Takeaway
File systems and disk scheduling matter because persistence is not just about storing bytes. It is about building a reliable, performant illusion over unreliable timing, slow media, complex metadata, and failure-prone updates.
For interviews, the winning mental model is:
- directories give names
- inodes give identity and metadata
- block mapping gives location
- journaling gives recoverability
- page cache gives speed
- fsync gives durability semantics
- scheduling gives controlled access to scarce storage time
For production systems, the winning habit is to ask which layer is actually responsible for the latency or correctness problem you are seeing.
That is the difference between knowing storage APIs and understanding operating systems.