1020 lines
34 KiB
Markdown
1020 lines
34 KiB
Markdown
# Compilers and Interpreters: Interview Guide for Software Engineers
|
|
|
|
This guide is aimed at interview preparation rather than language theory for its own sake. The goal is to help you build a mental model that is strong enough to answer both conceptual questions and practical engineering questions such as:
|
|
|
|
- Why does Java start slower but often run faster than Python?
|
|
- What exactly happens between writing `main.cpp` and running a binary?
|
|
- Why do some languages need a runtime even after they are compiled?
|
|
- What does a parser or an AST actually do in a real compiler?
|
|
- Why do link errors happen even when compilation succeeds?
|
|
|
|
The most important meta-point is this: modern language implementations are usually hybrids. In interviews, avoid talking as if every language is either "compiled" or "interpreted" in a pure sense. C++, Java, Python, JavaScript, and Go all sit at different points on a spectrum.
|
|
|
|
## 1. What Compilers and Interpreters Are
|
|
|
|
At a high level, a programming language implementation must turn human-readable source code into behavior that a machine can execute.
|
|
|
|
### Compiler
|
|
|
|
A compiler translates source code into another form before the program runs. That target form might be:
|
|
|
|
- Native machine code for a specific CPU and operating system
|
|
- Assembly that is later assembled into machine code
|
|
- Bytecode for a virtual machine
|
|
- An intermediate representation consumed by later stages
|
|
|
|
The key idea is that significant translation work happens ahead of execution.
|
|
|
|
Examples:
|
|
|
|
- C and C++ are usually compiled ahead of time into native binaries.
|
|
- Go is compiled ahead of time into native machine code.
|
|
- Java source is compiled by `javac` into JVM bytecode.
|
|
|
|
### Interpreter
|
|
|
|
An interpreter executes a program more directly, usually by reading a representation of the program and performing the required actions step by step. Historically that meant walking source code or a parse tree directly. In modern systems, it often means executing bytecode inside a virtual machine.
|
|
|
|
Examples:
|
|
|
|
- CPython compiles Python source to bytecode, then interprets that bytecode in a virtual machine.
|
|
- JavaScript engines often start by interpreting bytecode before JIT-compiling hot code paths.
|
|
|
|
### The important nuance
|
|
|
|
People often say:
|
|
|
|
- "C++ is compiled"
|
|
- "Python is interpreted"
|
|
- "Java uses JIT"
|
|
|
|
These are directionally correct, but incomplete.
|
|
|
|
More precise versions are:
|
|
|
|
- C++ is usually compiled ahead of time to native code and then linked into an executable.
|
|
- CPython compiles source to bytecode, then interprets bytecode in a VM.
|
|
- Java compiles source to bytecode, then the JVM interprets and JIT-compiles hot paths to native code.
|
|
|
|
That precision is often what separates a shallow interview answer from a strong one.
|
|
|
|
## 2. Compiler vs Interpreter vs JIT
|
|
|
|
### Ahead-of-time compilation
|
|
|
|
Ahead-of-time, or AOT, compilation means translating source code before the program starts running.
|
|
|
|
Strengths:
|
|
|
|
- Low runtime translation overhead
|
|
- Strong optimization opportunities before deployment
|
|
- Produces native binaries that the CPU can execute directly
|
|
|
|
Tradeoffs:
|
|
|
|
- Output is often platform-specific
|
|
- Less runtime information is available for optimization
|
|
- Build steps can be more complex
|
|
|
|
### Interpretation
|
|
|
|
Interpretation means execution happens through a program that understands another program.
|
|
|
|
Strengths:
|
|
|
|
- Portability when the interpreter exists on many platforms
|
|
- Faster edit-run feedback in many workflows
|
|
- Easier dynamic behavior in many implementations
|
|
|
|
Tradeoffs:
|
|
|
|
- Higher execution overhead
|
|
- Fewer opportunities for low-level optimization in the simple case
|
|
- Performance depends heavily on interpreter design
|
|
|
|
### JIT, or just-in-time compilation
|
|
|
|
JIT compilation happens during execution. The runtime observes the program while it runs, identifies hot code paths, and compiles those parts into machine code on the fly.
|
|
|
|
Strengths:
|
|
|
|
- Can use runtime information such as actual types, branch behavior, and call frequency
|
|
- Often gets better peak performance than a pure interpreter
|
|
- Can optimize only the parts of the program that matter in practice
|
|
|
|
Tradeoffs:
|
|
|
|
- Startup cost and warm-up time
|
|
- More complex runtime system
|
|
- Optimized code may need to be discarded if assumptions become invalid, which is called deoptimization
|
|
|
|
### Comparison table
|
|
|
|
| Aspect | Compiler (AOT) | Interpreter | JIT |
|
|
| --- | --- | --- | --- |
|
|
| When translation happens | Before execution | During execution | During execution |
|
|
| Typical target | Native code or bytecode | Direct execution of source, AST, or bytecode | Native code generated from bytecode or IR |
|
|
| Startup time | Often good after build step | Often good for simple scripts | Usually slower due to warm-up |
|
|
| Peak performance | Usually strong | Usually weaker | Often very strong |
|
|
| Portability | Often lower for native binaries | Often high | High if VM exists |
|
|
| Runtime complexity | Lower | Moderate | High |
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A[Source Code] --> B[AOT Compiler]
|
|
B --> C[Native Binary]
|
|
C --> D[CPU Executes]
|
|
|
|
E[Source Code] --> F[Interpreter or VM]
|
|
F --> G[Operations Executed Step by Step]
|
|
|
|
H[Source Code] --> I[Bytecode Compiler]
|
|
I --> J[VM]
|
|
J --> K[Profiler Identifies Hot Paths]
|
|
K --> L[JIT Compiler]
|
|
L --> M[Native Machine Code]
|
|
M --> N[CPU Executes Hot Code]
|
|
```
|
|
|
|
## 3. High-Level Language to Machine Code Flow
|
|
|
|
For many languages, the implementation pipeline looks like this:
|
|
|
|
1. Read source text.
|
|
2. Break the text into tokens.
|
|
3. Parse tokens into a structural representation such as an AST.
|
|
4. Run semantic checks such as name resolution and type checking.
|
|
5. Lower the program into an intermediate representation.
|
|
6. Optimize the intermediate representation.
|
|
7. Generate machine code or bytecode.
|
|
8. Link with libraries and runtime support.
|
|
9. Load the result into memory.
|
|
10. Execute with help from the runtime, operating system, and hardware.
|
|
|
|
This is the high-level picture interviewers usually want you to hold in your head.
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A[Source Code] --> B[Lexical Analysis]
|
|
B --> C[Parsing]
|
|
C --> D[AST]
|
|
D --> E[Semantic Analysis]
|
|
E --> F[Intermediate Representation]
|
|
F --> G[Optimization]
|
|
G --> H[Code Generation]
|
|
H --> I[Object Code or Bytecode]
|
|
I --> J[Linking and Loading]
|
|
J --> K[Runtime Execution]
|
|
```
|
|
|
|
### Front end vs back end
|
|
|
|
Compilers are often divided into two broad parts:
|
|
|
|
- Front end: language-specific work such as lexing, parsing, semantic analysis, and type checking
|
|
- Back end: target-specific work such as optimization, register allocation, instruction selection, and code generation
|
|
|
|
The intermediate representation between them is what makes retargeting possible. A single front end can feed multiple back ends, and a single back end can sometimes support multiple languages.
|
|
|
|
## 4. Lexical Analysis, or Tokenization
|
|
|
|
Lexical analysis turns a stream of characters into a stream of tokens.
|
|
|
|
Tokens are language-level units such as:
|
|
|
|
- Keywords: `if`, `while`, `return`
|
|
- Identifiers: variable names and function names
|
|
- Literals: `42`, `3.14`, `"hello"`
|
|
- Operators: `+`, `-`, `==`, `&&`
|
|
- Delimiters: `(`, `)`, `{`, `}`, `,`, `;`
|
|
|
|
Given this code:
|
|
|
|
```txt
|
|
total = price * 2 + tax
|
|
```
|
|
|
|
A lexer may produce something like:
|
|
|
|
- `IDENT(total)`
|
|
- `ASSIGN`
|
|
- `IDENT(price)`
|
|
- `STAR`
|
|
- `INT(2)`
|
|
- `PLUS`
|
|
- `IDENT(tax)`
|
|
|
|
### Why tokenization matters
|
|
|
|
It simplifies later stages. The parser does not want to reason about raw characters. It wants structured units with categories and values.
|
|
|
|
### What lexers usually handle
|
|
|
|
- Whitespace and comments
|
|
- Keyword recognition
|
|
- Numeric and string literal formats
|
|
- Error reporting for invalid characters or malformed literals
|
|
- Longest-match behavior for operators such as `=` versus `==`
|
|
|
|
### Implementation intuition
|
|
|
|
Lexers are often built using regular expressions or finite automata. You do not usually need to derive automata in interviews unless the role is compiler-heavy, but you should know the relationship:
|
|
|
|
- Regular languages are enough for token patterns.
|
|
- Context-free grammars are needed for nested syntax structures.
|
|
|
|
## 5. Parsing and Syntax Trees
|
|
|
|
Parsing takes the token stream and checks whether it matches the grammar of the language.
|
|
|
|
If lexing answers, "What are the words?", parsing answers, "What is the sentence structure?"
|
|
|
|
### Parse tree vs AST
|
|
|
|
A parse tree preserves the grammar structure in a very literal way.
|
|
|
|
An abstract syntax tree, or AST, keeps the meaningful program structure while dropping unnecessary grammar detail.
|
|
|
|
For example, the expression:
|
|
|
|
```txt
|
|
a + b * c
|
|
```
|
|
|
|
should parse so that multiplication binds tighter than addition.
|
|
|
|
An AST might look like this:
|
|
|
|
```mermaid
|
|
graph TD
|
|
Add[+]
|
|
Add --> A[a]
|
|
Add --> Mul[*]
|
|
Mul --> B[b]
|
|
Mul --> C[c]
|
|
```
|
|
|
|
That tree encodes precedence correctly. If the parser built `(a + b) * c`, the AST would have a different shape and the meaning would change.
|
|
|
|
### Common parsing approaches
|
|
|
|
- Recursive descent: simple and common for hand-written parsers
|
|
- LL parsers: top-down parsing families
|
|
- LR parsers: bottom-up parsing families, common in parser generators
|
|
- Pratt parsers: elegant for expression parsing with precedence and associativity
|
|
|
|
In interviews, you usually do not need to compare LR item sets or parsing tables unless that is the focus. It is more valuable to explain what parsing accomplishes and why operator precedence and associativity matter.
|
|
|
|
## 6. Semantic Analysis
|
|
|
|
After parsing, a program may be syntactically valid but still semantically invalid.
|
|
|
|
Examples:
|
|
|
|
- Using a variable before it is declared
|
|
- Calling a function with the wrong number of arguments
|
|
- Adding a string to an integer in a statically typed language that disallows it
|
|
- Returning a value from a function declared as `void`
|
|
- Referencing a private symbol where it is not visible
|
|
|
|
Semantic analysis is where the compiler checks meaning.
|
|
|
|
### Typical semantic checks
|
|
|
|
- Name resolution: what declaration does this identifier refer to?
|
|
- Scope checking: is the name visible here?
|
|
- Type checking: are operations applied to compatible types?
|
|
- Control-flow checks: does every path return a value when required?
|
|
- Definite assignment checks: was a variable initialized before use?
|
|
- Access control checks: public, private, protected, package visibility, and similar rules
|
|
|
|
### Static vs dynamic semantic checks
|
|
|
|
Some languages shift more checks to runtime.
|
|
|
|
- In Java or Go, many errors are caught before execution.
|
|
- In Python or JavaScript, some checks happen only when that code path actually runs.
|
|
|
|
This is why syntactically valid Python code can import successfully but still fail later when a particular function is executed.
|
|
|
|
## 7. Symbol Tables
|
|
|
|
A symbol table is the compiler's record of names and what they mean.
|
|
|
|
It typically stores entries for things like:
|
|
|
|
- Variables
|
|
- Functions and methods
|
|
- Classes, structs, interfaces, and types
|
|
- Constants and enums
|
|
- Modules and packages
|
|
- Labels and sometimes temporary compiler-generated symbols
|
|
|
|
For each symbol, the compiler may track:
|
|
|
|
- Name
|
|
- Kind, such as variable or function
|
|
- Type information
|
|
- Scope level
|
|
- Storage location or stack offset
|
|
- Linkage and visibility
|
|
- Mutability or const-ness
|
|
|
|
### Why symbol tables matter
|
|
|
|
They are used across multiple stages:
|
|
|
|
- Semantic analysis uses them for name lookup and type checking.
|
|
- Optimization may use them for aliasing and data-flow facts.
|
|
- Code generation uses them to map names to memory locations or registers.
|
|
- Linkers use symbol information to resolve references across object files.
|
|
|
|
### Scope example
|
|
|
|
```txt
|
|
int x = 1;
|
|
{
|
|
string x = "inner";
|
|
print(x);
|
|
}
|
|
print(x);
|
|
```
|
|
|
|
The inner `x` shadows the outer `x`. A symbol table usually models this by pushing a new scope when entering a block and popping it when leaving.
|
|
|
|
## 8. Static vs Dynamic Typing
|
|
|
|
Typing is about when type constraints are checked and how much type information is known before execution.
|
|
|
|
### Static typing
|
|
|
|
In statically typed languages, many type errors are caught before the program runs.
|
|
|
|
Examples:
|
|
|
|
- Java
|
|
- Go
|
|
- C++
|
|
|
|
Benefits:
|
|
|
|
- Earlier error detection
|
|
- Better compiler reasoning for optimization
|
|
- Stronger tooling and refactoring support
|
|
|
|
Tradeoffs:
|
|
|
|
- More upfront constraints
|
|
- Some patterns may require more annotations or more complex type systems
|
|
|
|
### Dynamic typing
|
|
|
|
In dynamically typed languages, values carry type information at runtime and many checks happen during execution.
|
|
|
|
Examples:
|
|
|
|
- Python
|
|
- JavaScript
|
|
|
|
Benefits:
|
|
|
|
- Flexible programming style
|
|
- Faster prototyping in many cases
|
|
- Easier expression of some dynamic patterns
|
|
|
|
Tradeoffs:
|
|
|
|
- More runtime checks
|
|
- Some errors appear later
|
|
- Implementations often need more runtime machinery
|
|
|
|
### Important nuance for interviews
|
|
|
|
Static versus dynamic typing is not the same thing as compiled versus interpreted.
|
|
|
|
- Go is statically typed and compiled.
|
|
- Java is statically typed and compiled to bytecode, then JIT-executed.
|
|
- Python is dynamically typed but still goes through compilation to bytecode.
|
|
- TypeScript is statically typed at development time, but JavaScript at runtime remains dynamically typed.
|
|
|
|
## 9. Intermediate Representation, or IR
|
|
|
|
An intermediate representation is a program form used between the source language and the target machine code.
|
|
|
|
Why compilers use IR:
|
|
|
|
- It decouples language-specific parsing from machine-specific code generation.
|
|
- It gives the optimizer a cleaner structure to work on.
|
|
- It allows multiple source languages or multiple target architectures to share infrastructure.
|
|
|
|
### Common IR forms
|
|
|
|
- ASTs
|
|
- Three-address code
|
|
- Control-flow graphs
|
|
- Static single assignment, or SSA, form
|
|
- Bytecode for a VM
|
|
|
|
### Simple example
|
|
|
|
Source code:
|
|
|
|
```txt
|
|
x = a * 2 + b
|
|
```
|
|
|
|
A simple three-address IR could look like:
|
|
|
|
```txt
|
|
t1 = a * 2
|
|
t2 = t1 + b
|
|
x = t2
|
|
```
|
|
|
|
That form is easier for later optimization passes than the original source text.
|
|
|
|
### SSA intuition
|
|
|
|
In SSA form, each variable is assigned only once. That makes data-flow analysis cleaner.
|
|
|
|
Instead of repeatedly updating `x`, the compiler may create versions like `x1`, `x2`, and `x3`. SSA is heavily used in modern optimizing compilers such as LLVM-based systems and JIT compilers.
|
|
|
|
## 10. Optimization Basics
|
|
|
|
Optimization is about improving some objective while preserving program meaning.
|
|
|
|
That objective might be:
|
|
|
|
- Faster execution
|
|
- Smaller binary size
|
|
- Lower memory use
|
|
- Lower power consumption
|
|
- Better startup latency
|
|
|
|
There is no universal "best" optimization. Interviewers often want to hear that optimization is a tradeoff, not magic.
|
|
|
|
### Common optimization passes
|
|
|
|
- Constant folding: compute constant expressions at compile time
|
|
- Constant propagation: carry known constant values forward
|
|
- Dead code elimination: remove code that cannot affect results
|
|
- Common subexpression elimination: avoid recomputing the same expression
|
|
- Inlining: replace a call with the function body when profitable
|
|
- Loop-invariant code motion: move repeated work out of loops
|
|
- Strength reduction: replace expensive operations with cheaper ones when possible
|
|
- Escape analysis: determine whether an object can stay on the stack instead of the heap
|
|
- Devirtualization: replace indirect dispatch with direct calls when the target is known
|
|
|
|
### Why optimizers need care
|
|
|
|
They must preserve observable behavior. That gets tricky because "observable behavior" includes more than just printed output. It can include:
|
|
|
|
- Memory ordering
|
|
- Exception behavior
|
|
- Volatile reads and writes
|
|
- Undefined behavior rules in languages like C and C++
|
|
- Reflection and dynamic code loading in runtime-heavy languages
|
|
|
|
### Interview insight
|
|
|
|
If asked why a compiler cannot always optimize more aggressively, a strong answer mentions aliasing, unknown side effects, dynamic dispatch, separate compilation boundaries, and the need to preserve semantics.
|
|
|
|
## 11. Code Generation
|
|
|
|
Code generation turns IR into target code such as machine instructions or bytecode.
|
|
|
|
Important subproblems include:
|
|
|
|
- Instruction selection: which machine instructions implement the operation?
|
|
- Register allocation: which values stay in CPU registers versus spilling to memory?
|
|
- Stack frame layout: where do locals, saved registers, and return addresses live?
|
|
- Calling conventions: how are arguments passed and return values received?
|
|
- Emitting metadata: debug info, relocation entries, unwind info, and symbol information
|
|
|
|
### Example intuition
|
|
|
|
For `return a + b;`, a backend may:
|
|
|
|
1. Load `a` and `b` from argument locations or registers.
|
|
2. Emit an add instruction.
|
|
3. Place the result in the return register dictated by the calling convention.
|
|
4. Emit function epilogue code and `ret`.
|
|
|
|
The exact instructions differ by architecture and ABI.
|
|
|
|
## 12. Linking and Loading
|
|
|
|
Compilation alone usually does not produce a complete runnable program.
|
|
|
|
The compiler often produces object files. Those object files contain:
|
|
|
|
- Machine code for the compiled translation unit
|
|
- Unresolved references to symbols defined elsewhere
|
|
- Symbol tables
|
|
- Relocation information
|
|
- Debug metadata
|
|
|
|
### Linking
|
|
|
|
The linker combines object files and libraries into a final executable or shared library.
|
|
|
|
It performs tasks such as:
|
|
|
|
- Resolving symbol references
|
|
- Laying out code and data sections
|
|
- Applying relocations
|
|
- Merging in startup code and runtime support
|
|
|
|
Typical interview examples:
|
|
|
|
- Compilation succeeds, but linking fails with an undefined reference
|
|
- Two object files define the same global symbol and the linker reports a duplicate symbol error
|
|
|
|
### Loading
|
|
|
|
The loader, usually with help from the operating system, maps the executable into memory, sets up the process address space, loads shared libraries, resolves dynamic symbols, and transfers control to the entry point.
|
|
|
|
On real systems, startup also includes runtime initialization such as:
|
|
|
|
- Setting up the heap
|
|
- Initializing thread-local storage
|
|
- Running static initializers
|
|
- Preparing language runtime state
|
|
|
|
```mermaid
|
|
flowchart LR
|
|
A[Object Files] --> B[Linker]
|
|
C[Libraries] --> B
|
|
B --> D[Executable or Shared Library]
|
|
D --> E[OS Loader]
|
|
E --> F[Process Memory Image]
|
|
F --> G[Program Entry Point]
|
|
```
|
|
|
|
## 13. Static vs Dynamic Linking
|
|
|
|
### Static linking
|
|
|
|
With static linking, library code is copied into the final binary at link time.
|
|
|
|
Benefits:
|
|
|
|
- Easier deployment because dependencies are bundled
|
|
- Often fewer runtime dependency issues
|
|
- Common in Go builds and some systems programming deployments
|
|
|
|
Tradeoffs:
|
|
|
|
- Larger binaries
|
|
- Library bug fixes usually require rebuilding and redeploying the binary
|
|
- Multiple processes may each carry their own copy of the code
|
|
|
|
### Dynamic linking
|
|
|
|
With dynamic linking, the program depends on shared libraries that are loaded at runtime.
|
|
|
|
Benefits:
|
|
|
|
- Smaller binaries
|
|
- Shared code can be updated independently
|
|
- Common system libraries can be shared across many processes
|
|
|
|
Tradeoffs:
|
|
|
|
- Dependency versioning issues
|
|
- Runtime failures if the expected shared library is missing or incompatible
|
|
- More complicated deployment and startup behavior
|
|
|
|
### Interview framing
|
|
|
|
If asked when you would prefer static or dynamic linking, the right answer is contextual:
|
|
|
|
- Prefer static linking when deployment simplicity matters and binary size is acceptable.
|
|
- Prefer dynamic linking when shared updates, ecosystem integration, or reduced duplication matter more.
|
|
|
|
## 14. Bytecode and Virtual Machines
|
|
|
|
Bytecode is a lower-level representation than source code, but usually more portable than machine code. A virtual machine executes bytecode rather than directly running source text.
|
|
|
|
### Why bytecode exists
|
|
|
|
- Portability across platforms
|
|
- Faster startup than parsing source every time
|
|
- Easier verification and tooling than raw machine code
|
|
- A stable target for multiple language implementations or runtime optimizers
|
|
|
|
### Stack VM vs register VM
|
|
|
|
Many bytecode VMs are stack-based.
|
|
|
|
Example model:
|
|
|
|
```txt
|
|
LOAD a
|
|
LOAD b
|
|
ADD
|
|
STORE x
|
|
```
|
|
|
|
The VM implicitly uses the operand stack.
|
|
|
|
Other VMs use register-like instructions, which can reduce dispatch overhead but may be more complex to generate.
|
|
|
|
### Real-world examples
|
|
|
|
- Java: source compiles to JVM bytecode in `.class` files, executed by the JVM
|
|
- Python: CPython compiles source to bytecode in `.pyc` caches and executes it in the CPython VM
|
|
- JavaScript: engines such as V8 commonly lower code to internal bytecode before optimization tiers run
|
|
|
|
### Verification and safety
|
|
|
|
VM-based systems often perform bytecode verification or structural checks before execution. This is part of why Java could promise "write once, run anywhere" more plausibly than shipping raw native binaries.
|
|
|
|
## 15. Runtime Systems
|
|
|
|
The runtime system is everything the program needs while executing beyond just the generated instructions.
|
|
|
|
Depending on the language, the runtime may handle:
|
|
|
|
- Memory allocation
|
|
- Garbage collection
|
|
- Exception handling and stack unwinding
|
|
- Thread scheduling
|
|
- Reflection and metadata lookup
|
|
- Dynamic dispatch support
|
|
- Security checks or sandboxing
|
|
- Foreign function interfaces
|
|
- Class loading or module loading
|
|
|
|
### Key insight
|
|
|
|
A compiled language can still need a significant runtime.
|
|
|
|
For example:
|
|
|
|
- Go compiles to native code but relies on a runtime for goroutine scheduling and garbage collection.
|
|
- Java compiles to bytecode but relies heavily on the JVM runtime.
|
|
- C++ compiles to native code, but still depends on runtime support for exceptions, RTTI, startup code, and parts of the standard library.
|
|
|
|
## 16. Garbage Collection Basics
|
|
|
|
Garbage collection, or GC, automatically reclaims memory that is no longer reachable by the program.
|
|
|
|
### Core idea
|
|
|
|
The runtime starts from roots such as:
|
|
|
|
- Stack references
|
|
- CPU registers
|
|
- Global variables
|
|
- Static fields
|
|
|
|
It then discovers reachable objects. Objects that are no longer reachable can be reclaimed.
|
|
|
|
### Common GC strategies
|
|
|
|
#### Mark-and-sweep
|
|
|
|
1. Mark reachable objects.
|
|
2. Sweep through memory and reclaim unmarked objects.
|
|
|
|
Simple mental model, but can introduce pauses.
|
|
|
|
#### Copying collection
|
|
|
|
Move live objects from one space to another and reclaim the old space wholesale.
|
|
|
|
This can make allocation fast and reduce fragmentation.
|
|
|
|
#### Generational GC
|
|
|
|
Based on the observation that most objects die young.
|
|
|
|
Young objects are collected frequently. Older surviving objects are promoted and collected less often.
|
|
|
|
This is common in production runtimes such as the JVM.
|
|
|
|
#### Reference counting
|
|
|
|
Track how many references point to each object.
|
|
|
|
When the count drops to zero, reclaim the object immediately.
|
|
|
|
Benefit:
|
|
|
|
- Predictable reclamation in many cases
|
|
|
|
Weakness:
|
|
|
|
- Cycles require extra handling
|
|
|
|
CPython is the classic interview example here: reference counting plus a cyclic garbage collector.
|
|
|
|
### Important tradeoffs
|
|
|
|
- Throughput versus pause time
|
|
- Memory overhead versus collection frequency
|
|
- Simplicity versus sophistication
|
|
- Predictability versus peak performance
|
|
|
|
### Practical examples
|
|
|
|
- Java: tracing GC, often generational and highly optimized
|
|
- Go: concurrent GC designed to limit pause times
|
|
- CPython: reference counting with cycle detection
|
|
- C++: typically manual memory management or smart pointers rather than mandatory GC
|
|
|
|
## 17. Language-Specific Mental Models
|
|
|
|
## C++
|
|
|
|
Typical flow:
|
|
|
|
1. Preprocessing expands includes and macros.
|
|
2. The compiler parses and type-checks each translation unit.
|
|
3. The backend generates machine code into object files.
|
|
4. The linker resolves symbols and produces the final executable.
|
|
|
|
What to emphasize in interviews:
|
|
|
|
- Native ahead-of-time compilation
|
|
- Strong optimization opportunities
|
|
- Separate compilation and linking are central
|
|
- ABI, templates, inline functions, and ODR issues can matter
|
|
- Memory management is largely explicit, though abstractions such as RAII and smart pointers help
|
|
|
|
Good interview phrasing:
|
|
|
|
"C++ is not just compile and run. It has a distinct preprocessing, compilation, assembly, and linking pipeline, and many real build issues happen at link time rather than compile time."
|
|
|
|
## Java
|
|
|
|
Typical flow:
|
|
|
|
1. `javac` compiles source to JVM bytecode.
|
|
2. The JVM loads classes and verifies bytecode.
|
|
3. Code may initially run in an interpreter or baseline tier.
|
|
4. Hot methods are JIT-compiled to native code.
|
|
5. The runtime manages memory, GC, class loading, and synchronization.
|
|
|
|
What to emphasize:
|
|
|
|
- Bytecode portability
|
|
- Heavyweight runtime with class loading and GC
|
|
- Tiered execution with JIT for hot code
|
|
- Startup versus warm performance tradeoff
|
|
|
|
## Python
|
|
|
|
Typical flow in CPython:
|
|
|
|
1. Python source is parsed into an AST.
|
|
2. The AST is compiled into Python bytecode.
|
|
3. The CPython VM executes the bytecode.
|
|
4. Memory management uses reference counting plus cycle detection.
|
|
|
|
What to emphasize:
|
|
|
|
- Python is not "just interpreted" in the simplistic sense
|
|
- CPython does have a compilation stage, but not usually to native code
|
|
- Runtime type checks and object model overhead are significant
|
|
- Different implementations exist, such as PyPy with a JIT
|
|
|
|
## JavaScript
|
|
|
|
Typical flow in engines like V8:
|
|
|
|
1. Source is parsed.
|
|
2. The engine produces AST and internal bytecode.
|
|
3. Bytecode runs in an interpreter or baseline execution tier.
|
|
4. The engine profiles runtime behavior.
|
|
5. Hot code is optimized by JIT tiers.
|
|
6. If assumptions break, optimized code can be deoptimized.
|
|
|
|
What to emphasize:
|
|
|
|
- Highly dynamic language semantics make optimization hard
|
|
- JITs rely on speculative assumptions about object shapes and types
|
|
- Warm-up and deoptimization are important real-world performance concepts
|
|
|
|
## Go
|
|
|
|
Typical flow:
|
|
|
|
1. `go build` compiles source ahead of time into native machine code.
|
|
2. The linker produces a native executable, often with static linking in common deployments.
|
|
3. The Go runtime provides GC, goroutine scheduling, stack growth, and other services.
|
|
|
|
What to emphasize:
|
|
|
|
- Native compilation with a substantial runtime
|
|
- Simple deployment model compared with many dynamic ecosystems
|
|
- Concurrency is language-level, but implemented with runtime support
|
|
- `go run` still compiles first, then runs the resulting binary
|
|
|
|
## 18. One Command, Five Different Execution Stories
|
|
|
|
This is a very interview-friendly way to compare languages.
|
|
|
|
### `g++ main.cpp && ./a.out`
|
|
|
|
- Preprocess source
|
|
- Compile to object code
|
|
- Link libraries
|
|
- Produce native executable
|
|
- OS loader maps it into memory and starts execution
|
|
|
|
### `java Main`
|
|
|
|
- The source was already compiled to bytecode
|
|
- JVM loads classes
|
|
- Bytecode is verified
|
|
- Methods run and hot methods may be JIT-compiled
|
|
- GC and runtime services stay active throughout execution
|
|
|
|
### `python app.py`
|
|
|
|
- Source is parsed
|
|
- Bytecode is produced or loaded from cache when valid
|
|
- CPython VM executes bytecode
|
|
- Runtime type checks and object dispatch happen as execution proceeds
|
|
|
|
### `node app.js`
|
|
|
|
- V8 parses JavaScript
|
|
- Internal bytecode is generated
|
|
- Code starts running in execution tiers
|
|
- Hot paths may be JIT-optimized
|
|
- Runtime assumptions may trigger deoptimization if they become invalid
|
|
|
|
### `go run main.go`
|
|
|
|
- Go compiles source to a temporary native binary
|
|
- The binary is linked
|
|
- The binary is executed
|
|
- The Go runtime manages scheduling, GC, and other services at runtime
|
|
|
|
## 19. Common Interview Questions and Strong Answer Shapes
|
|
|
|
### 1. What is the difference between a compiler and an interpreter?
|
|
|
|
Strong answer shape:
|
|
|
|
"A compiler performs substantial translation before execution, often to native code or bytecode. An interpreter executes a representation of the program directly, often statement by statement or bytecode instruction by bytecode instruction. In practice many systems combine both approaches, for example CPython compiles to bytecode and Java uses both bytecode interpretation and JIT compilation."
|
|
|
|
### 2. What is JIT and why is it useful?
|
|
|
|
Strong answer shape:
|
|
|
|
"JIT compilation moves some compilation work to runtime so the system can optimize hot code paths using real execution data such as observed types and branch behavior. That improves peak performance, but it adds warm-up cost and runtime complexity."
|
|
|
|
### 3. What is an AST and why do compilers use it?
|
|
|
|
Strong answer shape:
|
|
|
|
"An AST captures the structural meaning of a program without unnecessary grammar detail. It is easier than raw tokens for semantic analysis, type checking, optimization, and code generation."
|
|
|
|
### 4. What does semantic analysis do?
|
|
|
|
Strong answer shape:
|
|
|
|
"It checks meaning after parsing: name resolution, scope rules, type rules, control-flow constraints, and related consistency checks. A program can be syntactically valid but semantically invalid."
|
|
|
|
### 5. Why do we need an intermediate representation?
|
|
|
|
Strong answer shape:
|
|
|
|
"IR decouples the source language from the target architecture and gives optimizers a uniform representation to transform. It is one of the main reasons modern compilers are modular and retargetable."
|
|
|
|
### 6. What is the difference between linking and loading?
|
|
|
|
Strong answer shape:
|
|
|
|
"Linking resolves symbols and combines object files and libraries into a final binary or shared library. Loading is the operating system and runtime step that maps the program into memory, loads shared libraries, sets up the process, and transfers control to the entry point."
|
|
|
|
### 7. Why can a compiled language still need a runtime?
|
|
|
|
Strong answer shape:
|
|
|
|
"Because generated code still relies on services such as memory management, exception handling, thread scheduling, metadata lookup, or dynamic dispatch. Go and Java are clear examples."
|
|
|
|
### 8. Why is Python generally slower than C++?
|
|
|
|
Strong answer shape:
|
|
|
|
"Python usually executes through a VM with dynamic typing and higher object model overhead, while C++ is compiled ahead of time to optimized native code with far fewer runtime checks. That difference shows up in dispatch cost, memory layout, and optimization opportunities."
|
|
|
|
### 9. What is the difference between static and dynamic linking?
|
|
|
|
Strong answer shape:
|
|
|
|
"Static linking copies library code into the final binary at build time, while dynamic linking resolves shared libraries at runtime. Static linking simplifies deployment, while dynamic linking reduces binary size and allows shared library updates."
|
|
|
|
### 10. What happens when the compiler says nothing but the linker fails?
|
|
|
|
Strong answer shape:
|
|
|
|
"Compilation checked each translation unit in isolation, but the final symbol references could not be resolved across units or libraries. That usually indicates an undefined symbol, duplicate definition, or library ordering or ABI issue."
|
|
|
|
## 20. Practical Scenarios Interviewers Like
|
|
|
|
### Scenario 1: The program builds, but you get an undefined reference error
|
|
|
|
What it tests:
|
|
|
|
- Understanding of separate compilation
|
|
- Symbol resolution
|
|
- Link-time failures versus compile-time failures
|
|
|
|
Good explanation:
|
|
|
|
"The source compiled fine because each file was valid on its own, but at link time the toolchain could not find a definition for a referenced symbol. I would check whether the function is actually defined, whether the declaration matches the definition, whether the right object files or libraries were linked, and whether name mangling or ABI mismatch is involved."
|
|
|
|
### Scenario 2: Java is slow on first requests and faster later
|
|
|
|
What it tests:
|
|
|
|
- JIT warm-up
|
|
- Tiered compilation
|
|
- Runtime profiling
|
|
|
|
Good explanation:
|
|
|
|
"That often reflects JIT behavior. The JVM initially runs code in lower tiers and profiles execution. Frequently executed methods are then optimized into machine code, so steady-state throughput improves after warm-up."
|
|
|
|
### Scenario 3: Python memory usage keeps growing
|
|
|
|
What it tests:
|
|
|
|
- GC basics
|
|
- Reachability versus leaks
|
|
- Runtime behavior
|
|
|
|
Good explanation:
|
|
|
|
"In a GC or reference-counted language, memory growth does not necessarily mean the collector is broken. It may mean objects remain reachable through caches, global references, cycles, or long-lived containers. I would investigate object lifetimes and retained references rather than assuming raw malloc-style leaks."
|
|
|
|
### Scenario 4: JavaScript gets slower after a code change that adds many object shapes
|
|
|
|
What it tests:
|
|
|
|
- JIT assumptions
|
|
- Speculation and deoptimization
|
|
- Dynamic language optimization limits
|
|
|
|
Good explanation:
|
|
|
|
"Many JavaScript engines optimize based on stable object shapes and observed type patterns. If the code change makes shapes more polymorphic, the optimizer may lose its assumptions, trigger deoptimizations, and reduce inline cache effectiveness."
|
|
|
|
### Scenario 5: A Go service has noticeable GC pauses
|
|
|
|
What it tests:
|
|
|
|
- Runtime systems
|
|
- Allocation behavior
|
|
- GC tradeoffs
|
|
|
|
Good explanation:
|
|
|
|
"I would look at allocation rate, object lifetimes, and memory pressure. Even with a good concurrent GC, excessive short-lived allocation or large heaps can raise collector work and pause costs."
|
|
|
|
## 21. How to Talk About This Topic in Interviews
|
|
|
|
When you answer, try to move through three levels:
|
|
|
|
1. State the concept precisely.
|
|
2. Explain where it fits in the end-to-end execution pipeline.
|
|
3. Ground it in one real language implementation.
|
|
|
|
For example:
|
|
|
|
"An AST is the structural representation the parser produces after tokenization. The compiler then uses it for semantic analysis and later lowering into IR. In CPython, the source is parsed to an AST before compilation to bytecode."
|
|
|
|
That style of answer shows both theoretical knowledge and practical intuition.
|
|
|
|
## 22. Quick Summary to Remember
|
|
|
|
- A compiler translates code ahead of execution, but the output may be native code or bytecode.
|
|
- An interpreter executes a program representation directly, often through a VM.
|
|
- JIT compilation performs runtime compilation of hot code paths.
|
|
- Lexing turns characters into tokens.
|
|
- Parsing turns tokens into structured syntax such as an AST.
|
|
- Semantic analysis checks meaning, not just syntax.
|
|
- Symbol tables track what names refer to.
|
|
- IR gives compilers a better form for optimization and code generation.
|
|
- Optimization preserves semantics while improving speed, size, memory use, or startup.
|
|
- Code generation maps IR onto instructions, registers, calling conventions, and stack layout.
|
|
- Linking resolves symbols and produces final binaries.
|
|
- Loading maps the program into memory and starts execution.
|
|
- Runtime systems provide services such as GC, class loading, scheduling, and exception handling.
|
|
- Bytecode and VMs trade some raw speed for portability and runtime flexibility.
|
|
- The most accurate interview answers are language-specific and pipeline-aware.
|
|
|
|
## 23. Final Mental Model
|
|
|
|
If you remember only one sentence, remember this:
|
|
|
|
Source code becomes executable behavior through a pipeline of analysis, transformation, packaging, loading, and runtime support, and different languages move work between those stages in different ways.
|
|
|
|
That single idea ties together compilers, interpreters, JITs, bytecode, VMs, linking, loading, runtimes, and garbage collection.
|