tarun-elango 62197e52c0 ml

Co-authored-by: Copilot <copilot@github.com>

2026-04-30 19:59:29 -04:00

Dimensionality Reduction Handbook: PCA and t-SNE

Dimensionality reduction is the problem of replacing a high-dimensional representation with a smaller one that preserves what matters. In real engineering work, that usually means one of four things: making data cheaper to store, making models easier to train, making noise easier to suppress, or making complex structure easier for humans to inspect.

This sounds simple until you try to do it on real data. A production dataset might contain hundreds of metrics per service, thousands of sensor readings per machine, or 768-dimensional embeddings from a language or vision model. At that point, the question is not only how to compress the data. The harder questions are what information must survive the compression, what distortion is acceptable, and whether the reduced representation will be used by software, humans, or both.

This handbook is written as a long-term reference for engineers and computer engineering students. It explains dimensionality reduction from first principles, then goes deep on PCA and t-SNE, and finally covers practical concerns such as preprocessing, debugging, failure modes, production deployment, hardware implications, and the common mistakes that make reduced data look more trustworthy than it really is.

1. Why Dimensionality Reduction Exists

High-dimensional data appears everywhere in engineering:

telemetry pipelines with hundreds of metrics per host, service, or device
industrial systems with many temperature, vibration, voltage, and current channels
image, audio, and language embeddings with hundreds or thousands of dimensions
networking and RF systems with many correlated signal features
manufacturing and quality systems with large sets of measured process variables

The problem is that more dimensions do not automatically mean more useful information.

Common reasons to reduce dimensionality:

many features are redundant and carry nearly the same signal
some features are mostly noise, logging artifacts, or unstable measurements
distance-based methods become harder to trust in high dimensions
models can become slower, less stable, and harder to explain
humans cannot reason directly about a 200-dimensional cloud of points

Dimensionality reduction is useful when it improves a real engineering outcome, such as:

lower storage, memory, or network cost
faster training or inference
better signal-to-noise ratio
easier anomaly detection or clustering
clearer human inspection of embedding spaces or sensor states

It is not automatically useful just because the dataset has many columns. If every feature carries unique operational meaning and the downstream system depends on those meanings, aggressive reduction can make the system worse.

2. First Principles

2.1 What a dimension really is

A dimension is one coordinate used to describe an observation.

If you record a machine state with 40 sensor values, then each machine state is a point in a 40-dimensional measurement space. If you represent a user session with a 256-dimensional embedding, then each session is a point in a 256-dimensional space.

But that does not mean the underlying behavior really has 40 or 256 independent degrees of freedom.

Example:

CPU usage, power draw, fan speed, and heat output often move together under load
multiple accelerometer channels may rise together during a fault
many embedding dimensions are correlated mixtures of a smaller number of latent patterns

This is the key idea: the observed data may live in a high-dimensional measurement space while the underlying phenomenon lives on a much lower-dimensional structure.

2.2 Redundancy, noise, and latent variables

Real systems often generate dimensions for three different reasons:

genuinely independent factors
redundant measurements of the same factor
noise, drift, or instrumentation side effects

Suppose you monitor a motor with these channels:

rotor speed
shaft vibration in multiple directions
current draw
winding temperature
ambient temperature
control duty cycle

The true operating state may depend mostly on a small set of hidden factors such as load, alignment, cooling efficiency, and wear. The raw measurements are only indirect views of those hidden factors.

Dimensionality reduction tries to find a smaller representation that captures those major patterns without carrying every raw degree of freedom forward.

2.3 Why high dimensions are hard

High-dimensional spaces create several engineering problems.

Distance becomes less informative

As dimensionality increases, many points begin to look similarly far from one another. This is part of the curse of dimensionality. Distance-based methods, nearest-neighbor search, density estimation, and clustering often become less stable because the contrast between near and far points shrinks.
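
The shrinking contrast can be checked numerically. The following is a minimal sketch on synthetic uniform data, not a benchmark; the helper name relative_contrast and the chosen sizes are illustrative assumptions.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(max - min) / min of the distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Same number of points, very different dimensionality.
contrast_low = relative_contrast(n_points=1000, n_dims=2)
contrast_high = relative_contrast(n_points=1000, n_dims=1000)
print(f"2 dims:    {contrast_low:.2f}")
print(f"1000 dims: {contrast_high:.2f}")
```

In 2 dimensions the farthest point is many times farther than the nearest one. In 1000 dimensions the nearest and farthest points sit at almost the same distance, which is why nearest-neighbor results carry less information there.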

Data demand increases

When the feature count grows, you usually need more data to estimate structure reliably. Otherwise, the model starts learning quirks of the sample rather than stable behavior.

Compute and memory costs grow

More dimensions mean larger matrices, more expensive pairwise distance calculations, more memory bandwidth, and slower iteration cycles.

Interpretation gets worse

Even if a model trains successfully, engineers still need to debug it. A 500-dimensional feature representation can be powerful, but it is hard to inspect directly.

2.4 Projection, embedding, and reconstruction

Dimensionality reduction methods do not all do the same thing.

projection methods map the data onto fewer axes, often preserving as much useful structure as possible
embedding methods construct a new space where some notion of similarity is preserved
compression-oriented methods may allow approximate reconstruction of the original data
visualization-oriented methods may sacrifice reconstruction and global geometry in exchange for human-readable structure

This distinction matters a lot.

PCA gives a reusable linear transform and supports approximate reconstruction.

t-SNE mainly gives a visualization-oriented embedding. It is excellent for exploring local neighborhoods, but it is not a general-purpose replacement for the original feature space in production pipelines.

2.5 Dimensionality reduction is not feature selection

Feature selection keeps a subset of the original features.

Dimensionality reduction usually creates new derived features.

Example:

feature selection might keep temperature, current, and vibration_rms
PCA might replace 30 raw telemetry channels with 5 components, each being a weighted combination of many channels

The difference is important for interpretability and deployment. Feature selection preserves raw semantics. PCA creates compressed coordinates that are usually more efficient but less directly intuitive. t-SNE creates coordinates that are usually not semantically interpretable at all.

2.6 Linear vs nonlinear reduction

PCA is a linear method. It looks for straight-line directions in feature space.

This works very well when the data cloud is roughly shaped like a tilted, stretched ellipsoid or when most variation can be described by linear combinations of the features.

t-SNE is nonlinear. It does not try to preserve a global linear structure. Instead, it tries to keep nearby points nearby in a lower-dimensional map.

This makes t-SNE far more flexible for visualizing complex manifolds, but also much less suitable as a stable reusable transform for new incoming data.

2.7 End-to-end mental model

flowchart TD
	A[High-dimensional raw data] --> B[Clean, encode, and scale features]
	B --> C{What must be preserved?}
	C -->|Compression, denoising, reusable features| D[PCA]
	C -->|Human inspection of local neighborhoods| E[t-SNE]
	D --> F[Downstream model, storage, or monitoring]
	E --> G[Analyst visualization and debugging]
	F --> H[Versioning and drift checks]
	G --> H

The most important design question is not "Which algorithm is best?" but "What kind of information must survive the reduction?"

3. Choosing the Objective Before Choosing the Algorithm

Before you use any dimensionality reduction method, ask these questions.

3.1 Do you need a reusable transform for future data?

If yes, PCA is usually a strong candidate.

You can fit PCA on a training set, save the mean vector and component matrix, and later transform new data consistently.

t-SNE does not naturally work that way. Standard t-SNE is mostly an offline embedding method for a fixed dataset.

3.2 Do you need interpretability?

If you need to explain what the compressed dimensions mean, PCA is at least somewhat interpretable because each component has loadings on original features.

t-SNE axes do not have stable semantic meaning. A point moving left or right in a t-SNE plot does not translate into an interpretable real-world direction.

3.3 Is the goal compression or visualization?

If the goal is:

reducing storage or compute cost
denoising before a downstream model
building a stable lower-dimensional feature pipeline

then PCA is usually the right starting point.

If the goal is:

visually inspecting embeddings
understanding local neighborhood structure
spotting subgroups, label noise, or failure patterns

then t-SNE is often the better tool.

3.4 What structure matters: global or local?

PCA tries to preserve large-scale variance structure in a linear way.

t-SNE tries much harder to preserve local neighborhoods than global geometry.

That means:

PCA distances and directions can still carry broad geometric meaning
t-SNE distances between far-apart clusters are often not reliable
PCA component axes are real linear combinations of original features
t-SNE plot axes are mostly arbitrary coordinates used for display

3.5 Practical decision table

Goal	Better first choice	Why
Reusable compressed features	PCA	stable transform for new data
Denoising correlated numeric signals	PCA	keeps dominant linear structure
Visualizing embeddings for human inspection	t-SNE	preserves local neighborhoods well
Offline cluster exploration in 2D	t-SNE, often after PCA	reveals local grouping better than PCA plots
Edge-device compression	PCA	cheap matrix multiply at inference time
Production serving transform	PCA	deterministic, versionable, fast
Executive dashboard with one fixed map	t-SNE only with care	useful visually, but not geometrically literal

4. PCA

4.1 Core intuition

PCA, or Principal Component Analysis, finds orthogonal directions in the data that capture as much variance as possible.

The practical intuition is simpler than the formal definition:

move the cloud of points so its center is at the origin
rotate the coordinate system until the first axis points along the strongest spread
make the second axis point along the next strongest spread, subject to being orthogonal to the first
keep only the top few axes and drop the rest

If the dropped directions carry mostly small fluctuations or redundant detail, then the data has been compressed without losing much useful structure.

4.2 Geometric picture: rotate first, then drop axes

A lot of engineers first imagine PCA as simply deleting columns. That is not what it does.

PCA first creates a better coordinate system.

Imagine a 2D cloud shaped like a long diagonal ellipse. If you keep the original x and y axes, both are somewhat informative. But if you rotate the axes so one axis follows the long direction of the ellipse, then most of the information sits along that new axis. The shorter axis becomes mostly minor variation.

That is why PCA works: it aligns the representation with the real direction of variation before discarding dimensions.

flowchart LR
	A[Centered data cloud] --> B[Compute covariance structure]
	B --> C[Find orthogonal directions of strongest spread]
	C --> D[Sort components by explained variance]
	D --> E[Project onto top k components]
	E --> F[Use compressed representation or reconstruct approximately]

4.3 Why centering matters

PCA assumes variance is measured around the mean.

So the first basic step is to subtract the feature-wise mean from every sample.

Why this matters:

without centering, the first component can point toward the offset of the cloud from the origin rather than the real direction of variation
covariance estimates become distorted
the resulting components do not represent the true spread of the data

In practice, centering is not optional for standard PCA.
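
A small experiment makes the effect visible. This is a sketch on a synthetic 2D cloud, assuming the cloud spreads mostly along y but sits far from the origin along x; the helper first_direction is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Spread is largest along y, but the whole cloud is offset to x = 50.
X = rng.normal(size=(500, 2)) * np.array([0.5, 3.0]) + np.array([50.0, 0.0])

def first_direction(M):
    # The first right singular vector is the first principal direction
    # only when M has been centered.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]

uncentered = first_direction(X)
centered = first_direction(X - X.mean(axis=0))
print("uncentered:", np.round(np.abs(uncentered), 3))
print("centered:  ", np.round(np.abs(centered), 3))
```

Without centering, the first direction points along x, toward the offset. After centering, it points along y, where the data actually varies.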

4.4 Why scaling often matters too

Suppose you run PCA on these features:

temperature in degrees Celsius
current in amps
uptime in seconds
revenue in dollars

If you do not scale them and one feature has a much larger numeric range, that feature can dominate the covariance structure.

So the major question is not "Should I always standardize?" The real question is whether the raw variance scale is meaningful.

Use standardization when:

features use different units
you care about relative variation, not raw magnitude
you want balanced influence from different channels

Do not blindly standardize when:

the magnitude itself is operationally meaningful
the feature scales already encode an intentional weighting

This is a modeling choice, not a housekeeping step.
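
The following sketch shows how a single large-range feature can take over the first component. The data is synthetic and the variable names (temperature_c, current_a, uptime_s) are assumptions chosen to mirror the example features above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
load = rng.normal(size=n)  # hidden factor driving temperature and current
temperature_c = 60 + 5 * load + rng.normal(scale=1.0, size=n)
current_a = 10 + 2 * load + rng.normal(scale=0.5, size=n)
uptime_s = rng.uniform(0, 1_000_000, size=n)  # unrelated, but numerically huge
X = np.column_stack([temperature_c, current_a, uptime_s])

def first_component(M):
    Mc = M - M.mean(axis=0)
    _, _, vt = np.linalg.svd(Mc, full_matrices=False)
    return vt[0]

raw_pc1 = first_component(X)
std_pc1 = first_component((X - X.mean(axis=0)) / X.std(axis=0))
print("raw PC1 loadings:         ", np.round(np.abs(raw_pc1), 3))
print("standardized PC1 loadings:", np.round(np.abs(std_pc1), 3))
```

On raw values the first component is almost entirely uptime, simply because seconds produce large numbers. After standardization the first component loads on temperature and current together, which is the shared load factor.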

4.5 Covariance, eigenvectors, and what PCA is finding

After centering the data matrix X (one row per sample, one column per feature, n rows in total), PCA studies the covariance matrix:

Cov = (1 / (n - 1)) * X^T * X

This matrix tells you how features vary together.

large diagonal values mean a feature varies a lot
large positive off-diagonal values mean two features tend to increase together
large negative off-diagonal values mean one tends to increase when the other decreases

PCA finds eigenvectors of this covariance matrix.

Engineering interpretation:

an eigenvector is a direction in feature space
its eigenvalue tells you how much variance exists along that direction
the direction with the largest eigenvalue becomes principal component 1
the next largest orthogonal direction becomes principal component 2

Another practical view is through SVD.

In real software libraries, PCA is often computed using Singular Value Decomposition because it is numerically stable and efficient. For engineers, the important point is that SVD and covariance-eigendecomposition are two routes to the same answer: the right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix, and each eigenvalue equals the corresponding squared singular value divided by n - 1.
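
The connection between the two routes can be verified directly. This is a minimal NumPy sketch on synthetic correlated data; it is not how any particular library implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic features
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)

# Directions agree up to sign, so compare absolute dot products.
alignment = np.abs(np.sum(eigvecs.T * vt, axis=1))
print(np.round(eigvals, 4))
print(np.round(svd_vals, 4))
print(np.round(alignment, 6))
```

The eigenvalues match the squared singular values divided by n - 1, and each eigenvector matches a right singular vector up to a sign flip.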

4.6 Step-by-step PCA algorithm

collect the data matrix
impute or handle missing values if needed
center the features
optionally scale features
compute the principal directions
sort directions by explained variance
keep the top k components
project each point onto those components

The projection is:

Z = X_centered * W_k

where W_k contains the top k principal directions.

The approximate reconstruction is:

X_hat = Z * W_k^T + mean

This reconstruction view is important because it connects PCA directly to information loss.
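
The projection and reconstruction formulas translate almost line for line into NumPy. The sketch below uses hypothetical helper names (pca_fit, pca_project, pca_reconstruct) and synthetic data generated from two hidden factors.

```python
import numpy as np

def pca_fit(X, k):
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W_k = vt[:k].T                      # shape (n_features, k)
    return mean, W_k

def pca_project(X, mean, W_k):
    return (X - mean) @ W_k             # Z = X_centered * W_k

def pca_reconstruct(Z, mean, W_k):
    return Z @ W_k.T + mean             # X_hat = Z * W_k^T + mean

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))      # two hidden factors
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(300, 6))

mean, W_k = pca_fit(X, k=2)
Z = pca_project(X, mean, W_k)
X_hat = pca_reconstruct(Z, mean, W_k)
rmse = float(np.sqrt(np.mean((X - X_hat) ** 2)))
print(Z.shape, X_hat.shape, round(rmse, 4))
```

Because the six observed features are driven by two factors plus small noise, two components reconstruct the data with an error close to the noise level.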

4.7 Why PCA works

Variance view

PCA keeps the directions where the data changes the most.

If a direction has tiny variance, then most points are already close to each other along that direction. Dropping it does not change the points much under squared-error reconstruction.

Reconstruction-error view

PCA can also be understood as finding the low-dimensional linear subspace that minimizes total squared reconstruction error.

That means PCA is not keeping variance for its own sake. It is preserving the directions that matter most if your loss function is squared distance back to the original data.

This is why the two common descriptions are equivalent:

maximize retained variance
minimize reconstruction error

They are two views of the same optimization problem.
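
The equivalence can be checked numerically: the total squared reconstruction error equals the variance in the discarded directions, scaled by n - 1. This is a sketch on synthetic data with an arbitrary choice of k = 3.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

_, s, vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)                # variance along each principal direction

k = 3
W_k = vt[:k].T
X_hat = Xc @ W_k @ W_k.T                # project, then map back (still centered)

sq_error = float(np.sum((Xc - X_hat) ** 2))
discarded = float((n - 1) * eigvals[k:].sum())
retained_fraction = float(eigvals[:k].sum() / eigvals.sum())
print(sq_error, discarded, round(retained_fraction, 4))
```

Every unit of variance that is retained is a unit of squared error that is avoided, so maximizing one is the same as minimizing the other.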

4.8 Explained variance and how to choose the number of components

Each component has an explained variance ratio. This tells you what fraction of the total variance that component captures.

Common strategies for choosing k:

cumulative explained variance threshold such as 90 percent, 95 percent, or 99 percent
scree plot, looking for a bend where additional components add little value
downstream model performance on validation data
reconstruction error target
deployment constraints such as memory budget or inference latency

Important warning:

High explained variance does not automatically mean high task usefulness.

A rare but operationally critical fault pattern might live in a low-variance direction. If you keep components only by a variance threshold, you can remove the very signal you care about.
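
The cumulative-threshold strategy is simple enough to write by hand. The helper choose_k and the ratio values below are illustrative assumptions, not measurements from a real dataset.

```python
import numpy as np

def choose_k(explained_variance_ratio, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(cumulative))      # guard against rounding just below 1.0

ratios = np.array([0.60, 0.25, 0.08, 0.04, 0.02, 0.01])  # made-up example values
k_80 = choose_k(ratios, 0.80)
k_90 = choose_k(ratios, 0.90)
k_95 = choose_k(ratios, 0.95)
print(k_80, k_90, k_95)
```

Treat the result as a starting point. Given the warning above, it is worth confirming the chosen k against downstream task performance and against reconstruction error on rare cases.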

4.9 Interpreting components and loadings

Each principal component is a weighted combination of the original features.

Those weights are often called loadings.

Example interpretation:

if a component has strong positive weights on CPU, memory, and network throughput, it may represent overall workload intensity
if another component contrasts temperature against fan speed, it may capture cooling behavior or thermal response

Important cautions:

the sign of a component is arbitrary; multiplying a component by -1 gives an equivalent solution
component scores are uncorrelated by construction, but uncorrelated does not mean statistically independent, and it says nothing about causal independence
component meaning can change if you retrain on a different population or time window

4.10 Whitening

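Whitening rescales the PCA scores so that every retained component has unit variance, by dividing each component's scores by the square root of its eigenvalue.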

Why engineers sometimes use it:

some downstream models behave better when each retained dimension has comparable scale
certain clustering or ICA-style workflows prefer decorrelated, normalized coordinates

Why engineers misuse it:

whitening throws away the original variance magnitudes
if variance magnitude itself carries meaning, whitening can hide that structure

Use whitening only when the downstream objective justifies it.
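
A minimal sketch of whitening on synthetic data, assuming features with deliberately different spreads. It shows both what whitening normalizes and what it discards.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
_, s, vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
W_k = vt[:k].T
eigvals = s[:k] ** 2 / (n - 1)

Z = Xc @ W_k                        # ordinary scores: variance equals the eigenvalue
Z_white = Z / np.sqrt(eigvals)      # whitened scores: unit variance per component

plain_var = Z.var(axis=0, ddof=1)
white_var = Z_white.var(axis=0, ddof=1)
print(np.round(plain_var, 3))
print(np.round(white_var, 3))
```

After whitening, the first component no longer looks more important than the third. That is the intended effect, and also the information loss described above.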

4.11 When PCA works well

PCA is a strong choice when:

features are numeric and substantially correlated
the useful structure is approximately linear
you want a reusable transform for future data
you need denoising or compression
you want fast inference and simple deployment

It is especially effective in systems where many measurements are different views of a smaller physical or behavioral process.

4.12 Where PCA fails or becomes misleading

PCA struggles when:

the important structure is strongly nonlinear
outliers dominate the covariance structure
features are mostly categorical or poorly encoded
batch effects dominate true signal
low-variance directions contain the rare events you care about
interpretability is required but components combine too many unrelated features

Classic failure case:

If the data lies on a curved manifold, like a spiral or a folded surface, PCA may need many linear components to approximate what is really a simple nonlinear pattern.

4.13 Real-world PCA use cases

Sensor compression for edge and industrial systems

Suppose an edge controller reads 64 correlated sensor channels every few milliseconds. Transmitting all channels upstream can be expensive in bandwidth, storage, and power.

PCA can compress those channels into a small number of components that still represent the major operating modes. The controller or gateway sends the compressed state instead of every raw reading.

This is not only a software optimization. It affects radio bandwidth, bus utilization, storage cost, and even battery life.

Denoising telemetry before anomaly detection

In observability systems, many metrics move together because they reflect the same workload shift. PCA can compress those correlated metrics into a smaller set of factors. The anomaly detector then operates on the major modes rather than every noisy metric independently.

Image, spectral, and signal preprocessing

PCA is often used to reduce correlated channels before classification or clustering.

Examples:

hyperspectral imaging, where adjacent wavelength bands are highly correlated
vibration and acoustic feature banks, where many frequency summaries overlap
embedded vision pipelines, where compact representations reduce memory movement

Embedding compression

Large language or vision embeddings often have hundreds or thousands of dimensions. PCA can reduce them for:

cheaper nearest-neighbor indexing
faster downstream classifiers
smaller storage footprint
lower memory pressure in serving systems

4.14 Production engineering details for PCA

Fit on training data only

If PCA is part of a predictive pipeline, fit it only on the training split. Then apply the learned mean and components to validation, test, and production data.

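Otherwise statistics from the validation or test data leak into the transform, and evaluation results look better than they should.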

Persist the full transform artifact

For reproducible deployment, store:

feature order
imputation rules
scaling parameters
PCA mean vector
component matrix
explained variance metadata
training dataset version or time window

If any one of these changes silently, the compressed coordinates stop being comparable.
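
One way to keep these pieces together is a single serializable artifact. The sketch below is an assumption about layout, not a standard format: the field names, the helper apply_artifact, and every numeric value are hypothetical, and the JSON round trip stands in for a real save and load.

```python
import json
import numpy as np

def apply_artifact(artifact, record):
    """Transform one record (feature name -> value or None) using only saved parameters."""
    x = np.array(
        [artifact["impute_values"][i] if record.get(name) is None else record[name]
         for i, name in enumerate(artifact["feature_order"])],
        dtype=float,
    )
    x = (x - np.array(artifact["scale_mean"])) / np.array(artifact["scale_std"])
    x = x - np.array(artifact["pca_mean"])
    return x @ np.array(artifact["components"]).T   # components: (k, n_features)

artifact = {  # all values below are made up for illustration
    "feature_order": ["temperature", "current", "vibration_rms"],
    "impute_values": [60.0, 10.0, 0.2],
    "scale_mean": [60.0, 10.0, 0.2],
    "scale_std": [5.0, 2.0, 0.05],
    "pca_mean": [0.0, 0.0, 0.0],
    "components": [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]],
    "explained_variance_ratio": [0.7, 0.2],
    "dataset_version": "hypothetical-training-window",
}

restored = json.loads(json.dumps(artifact))          # stand-in for save and load
z = apply_artifact(restored, {"temperature": 65.0, "current": None, "vibration_rms": 0.25})
print(z)
```

Because feature order, imputation, scaling, and components travel in one object, a serving process cannot accidentally combine a new scaler with an old component matrix.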

Use incremental or randomized PCA when scale demands it

For very large datasets:

incremental PCA processes data in batches
randomized SVD can speed up approximate computation

These are practical engineering choices when exact full-matrix decomposition is too expensive.
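
A minimal sketch of the batch-wise approach using scikit-learn's IncrementalPCA. The generator batch_stream is a hypothetical stand-in for reading chunks from disk or a queue, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 20))       # three hidden factors, 20 observed features

def batch_stream(n_batches=10, batch_size=200):
    # Stand-in for a data source that is too large to load at once.
    for _ in range(n_batches):
        latent = rng.normal(size=(batch_size, 3))
        yield latent @ mixing + 0.05 * rng.normal(size=(batch_size, 20))

ipca = IncrementalPCA(n_components=3)
for batch in batch_stream():
    ipca.partial_fit(batch)             # updates mean and components batch by batch

cumulative = float(ipca.explained_variance_ratio_.sum())
Z = ipca.transform(next(batch_stream(n_batches=1)))
print(round(cumulative, 4), Z.shape)
```

For the randomized route, scikit-learn's PCA accepts svd_solver="randomized". Both options trade a small amount of exactness for memory or speed.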

Be careful with sparse data

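For sparse matrices such as bag-of-words or some event-count features, ordinary centered PCA can destroy sparsity and become memory-heavy, because subtracting the mean turns most zero entries into nonzero values.

In those cases, methods such as truncated SVD, which factorizes the matrix without centering it, are often more operationally sensible than textbook PCA.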
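
A minimal sketch with scikit-learn's TruncatedSVD on a synthetic sparse count matrix. The matrix shape, density, and component count are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Roughly 1 percent of entries are nonzero small counts.
mask = rng.random((1000, 500)) < 0.01
counts = rng.integers(1, 5, size=(1000, 500))
X_sparse = sparse.csr_matrix((mask * counts).astype(np.float64))

svd = TruncatedSVD(n_components=20, random_state=0)
Z = svd.fit_transform(X_sparse)         # input stays sparse; no centering is applied
print(Z.shape, X_sparse.nnz)
```

Because no mean is subtracted, the first component often reflects overall magnitude rather than variation around the mean, so interpret it with that in mind.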

Monitor drift in component space

Once PCA is deployed, monitor:

projected feature distributions
explained variance stability on retraining
reconstruction error trends
changes in component interpretation

If the data generating process changes, the old components may stop representing the system well.
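
Reconstruction error is one of the easier signals to monitor. The sketch below fits PCA on synthetic training data and then simulates drift by changing how the hidden factors map to features; the function name and the comparison by median are illustrative choices.

```python
import numpy as np

def reconstruction_error(X, mean, W_k):
    """Per-sample squared reconstruction error under a fixed, already-fitted PCA."""
    Xc = X - mean
    residual = Xc - Xc @ W_k @ W_k.T
    return np.sum(residual**2, axis=1)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 8))
X_train = rng.normal(size=(1000, 2)) @ mixing + 0.1 * rng.normal(size=(1000, 8))

mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
W_k = vt[:2].T

# Simulated drift: the same kind of data, generated by a different mixing matrix.
new_mixing = rng.normal(size=(2, 8))
X_drift = rng.normal(size=(1000, 2)) @ new_mixing + 0.1 * rng.normal(size=(1000, 8))

baseline = float(np.median(reconstruction_error(X_train, mean, W_k)))
drifted = float(np.median(reconstruction_error(X_drift, mean, W_k)))
print(round(baseline, 4), round(drifted, 4))
```

In a real system the alert threshold would come from the baseline distribution on held-out data, tracked per device family or operating mode rather than as one global number.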

Hardware implications

PCA at inference time is a mean subtraction followed by a matrix multiply. That is attractive because:

it maps well to SIMD, BLAS, and GPU operations
it is predictable in latency
it can be implemented efficiently on DSPs, NPUs, or even fixed-point pipelines when needed
it reduces downstream memory movement if fewer features need to be processed later

In many systems, the memory bandwidth saved after PCA matters at least as much as the floating-point cost of the projection itself.
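
One common simplification, sketched below under the assumption of 64 input channels and 8 retained components, is to fold the mean subtraction into a precomputed bias. The serving path then becomes a single multiply and add. The random orthonormal matrix stands in for fitted principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 64, 8
mean = rng.normal(size=n_features)
# Orthonormal columns standing in for a fitted component matrix.
W_k, _ = np.linalg.qr(rng.normal(size=(n_features, k)))

# Offline: (x - mean) @ W_k == x @ W_k - mean @ W_k, so precompute the second term.
bias = -(mean @ W_k)

def project(x_batch):
    # One matrix multiply plus a bias add, the same shape as a dense layer.
    return x_batch @ W_k + bias

X_new = rng.normal(size=(16, n_features))
Z_fast = project(X_new)
Z_reference = (X_new - mean) @ W_k
print(np.allclose(Z_fast, Z_reference))
```

This form maps directly onto hardware and runtimes that already accelerate dense layers, and it removes one pass over the input vector.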

4.15 Practical PCA implementation pattern

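from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train and X_valid are assumed to be numeric arrays or DataFrames of shape
# (n_samples, n_features), already split before this point.
pca_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        # A float n_components keeps the fewest components that reach that
        # cumulative explained variance; it is paired here with svd_solver="full".
        ("pca", PCA(n_components=0.95, svd_solver="full")),
    ]
)

# Fit every step on the training split only, then reuse the fitted parameters.
X_train_reduced = pca_pipeline.fit_transform(X_train)
X_valid_reduced = pca_pipeline.transform(X_valid)

# Cumulative explained variance of the retained components.
explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print(explained.cumsum())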

What this gets right:

preprocessing is tied to the PCA artifact
validation data is transformed with training-fitted parameters only
component count is chosen by cumulative explained variance, which is a useful starting point

What still needs engineering judgment:

whether 95 percent variance is the right target
whether standardization is appropriate for the domain
whether downstream task performance agrees with the variance-based choice

4.16 Common PCA mistakes

running PCA on raw mixed-unit features and assuming the result is meaningful
fitting PCA before splitting train and test data
keeping components only by a variance rule without checking task impact
interpreting every component as a real physical cause
forgetting that component signs are arbitrary
using PCA on strongly nonlinear structure and expecting a small number of components to capture it
ignoring outliers that dominate the covariance matrix
failing to persist the exact preprocessing and component artifacts used in production

4.17 Debugging PCA in practice

If PCA results look wrong, inspect the following in order:

feature units and scaling
missing-value handling
outlier influence
cumulative explained variance and scree shape
component loadings in original feature names
reconstruction error by segment or operating mode
stability across time windows or retraining runs

Useful debugging questions:

Did one high-variance feature hijack the first component?
Are the top components mostly representing batch or device identity instead of behavior?
Does reconstruction error spike for rare but important cases?
Did the effective rank change after a software release or sensor replacement?

Useful practical checks:

compare PCA with and without standardization
compare PCA with and without obvious outliers
inspect loadings as a sorted table, not only a plot
evaluate downstream model accuracy with several k values
examine reconstruction error per class, device family, or incident type

4.18 Interview-level understanding of PCA

You should be able to explain these clearly:

PCA finds orthogonal directions of maximum variance in centered data
the top k components define the best k-dimensional linear subspace under squared reconstruction error
eigenvectors give component directions and eigenvalues give captured variance
SVD is the practical numerical tool often used to compute PCA
standardization can radically change PCA because covariance depends on scale
PCA is good for compression, denoising, and linear structure, but not for arbitrary nonlinear manifolds

5. t-SNE

5.1 Core intuition

t-SNE, or t-distributed Stochastic Neighbor Embedding, is primarily a visualization method.

Its goal is not to preserve a faithful global geometry of the original space. Its goal is to build a low-dimensional map in which nearby points in the original space stay nearby in the map.

That makes it especially useful for asking questions like:

do embeddings form recognizable local groups?
are there mislabeled examples sitting inside another class neighborhood?
do several operating modes appear inside what we thought was one category?

It is much less appropriate for questions like:

what is the exact distance between these two far-apart clusters?
can I use these 2D coordinates as stable production features?
does a larger cluster area mean larger variance or higher density in the original space?

5.2 What t-SNE is actually trying to preserve

t-SNE converts pairwise relationships into probabilities.

In the original space, each point treats nearby points as likely neighbors and distant points as unlikely neighbors.

In the low-dimensional map, t-SNE tries to produce a similar neighbor-probability pattern.

The result is a map where local neighborhoods are often very informative, even when global layout is distorted.

5.3 Step-by-step idea behind t-SNE

start with high-dimensional points
for each point, define a probability distribution over other points so nearby points get higher probability
choose a low-dimensional map, usually 2D
define another probability distribution in the map
move the low-dimensional points to make the map probabilities resemble the original probabilities

The optimization objective is based on KL divergence:

minimize KL(P || Q)

where:

P represents neighbor relationships in the original space
Q represents neighbor relationships in the low-dimensional map

Important intuition:

Because the divergence is asymmetric, t-SNE cares strongly about not losing true neighbors. It is usually more tolerant of creating some extra apparent neighbors than of separating points that really belonged together locally.
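
The asymmetry is easy to see in the individual terms of the sum. The sketch below is a toy with three entries; real t-SNE applies the same formula to probabilities over all pairs of points.

```python
import numpy as np

def kl_terms(p, q):
    """Per-entry contributions p * log(p / q) to KL(P || Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return p * np.log(p / q)

# Entry 0: a true neighbor (high p) that the map places far away (low q).
# Entry 2: a non-neighbor (low p) that the map places close (high q).
p = np.array([0.80, 0.15, 0.05])
q = np.array([0.05, 0.15, 0.80])

terms = kl_terms(p, q)
total = float(terms.sum())
print(np.round(terms, 4), round(total, 4))
```

The lost neighbor contributes a large positive cost, while the false neighbor contributes a term that is small in magnitude. The optimizer therefore works much harder to keep true neighbors together than to keep non-neighbors apart.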

5.4 High-dimensional probabilities and perplexity

In the original space, t-SNE uses Gaussian-like neighborhoods around each point.

The width of that neighborhood is adjusted per point to match a target perplexity.

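Perplexity is formally 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. In practice it is best understood as an approximate effective neighborhood size.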

Low perplexity means:

focus strongly on very local structure
more sensitivity to small groups and noise
higher risk of fragmented islands

High perplexity means:

broader neighborhood definition
smoother map structure
more emphasis on medium-scale relationships

There is no universal best perplexity. It depends on sample size, density, and what structure you want to inspect.
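
The per-point width adjustment can be sketched as a search over the Gaussian width. The code below is a simplified illustration for a single point, with hypothetical helper names and a fixed search range; real implementations differ in details such as the parameterization and stopping rule.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """Neighbor probabilities for one point, given squared distances to all others."""
    logits = -sq_dists / (2.0 * sigma**2)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity_of(p):
    entropy = -np.sum(p * np.log(p + 1e-12))    # entropy in nats
    return float(np.exp(entropy))               # same value as 2 ** (entropy in bits)

def find_sigma(sq_dists, target_perplexity, n_steps=60):
    lo, hi = 1e-3, 1e3                          # assumed search range
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)                  # bisect in log space
        if perplexity_of(conditional_probs(sq_dists, mid)) > target_perplexity:
            hi = mid                            # neighborhood too wide
        else:
            lo = mid                            # neighborhood too narrow
    return float(np.sqrt(lo * hi))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
sq_dists = np.sum((X[1:] - X[0]) ** 2, axis=1)  # from point 0 to every other point

sigma_5 = find_sigma(sq_dists, target_perplexity=5.0)
sigma_50 = find_sigma(sq_dists, target_perplexity=50.0)
print(round(sigma_5, 3), round(sigma_50, 3))
```

A higher target perplexity yields a wider kernel, so each point spreads its neighbor probability over more points.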

5.5 Why t-SNE uses a Student t distribution in low dimension

If the low-dimensional map also used a Gaussian neighborhood model, points that are moderately far apart in the original space could crowd together in 2D.

This is part of the crowding problem.

The Student t distribution has heavier tails. That gives the low-dimensional map more room to push dissimilar points farther apart.

Practical interpretation:

nearby neighbors can still stay close
unrelated groups can separate more cleanly
the map becomes visually clearer for local structure

This is one of the main reasons t-SNE produces readable cluster-like maps where naive projections often fail.
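
The difference in tail weight is large enough to see in a few numbers. This sketch compares unnormalized similarity kernels at a handful of map distances; the distances are arbitrary and the Gaussian width is an assumption.

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])     # map distances, arbitrary units

gaussian = np.exp(-d**2)                    # unnormalized Gaussian kernel
student_t = 1.0 / (1.0 + d**2)              # unnormalized Student t kernel, 1 degree of freedom

ratio = student_t / gaussian
for dist, g, t in zip(d, gaussian, student_t):
    print(f"d={dist:4.1f}  gaussian={g:.2e}  student_t={t:.2e}")
```

At small distances the two kernels are similar. At large distances the Gaussian similarity collapses toward zero while the Student t similarity decays slowly, so a moderate similarity can be represented by a much larger map distance.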

flowchart TD
	A[High-dimensional points] --> B[Convert distances into neighbor probabilities]
	B --> C[Initialize a 2D or 3D map]
	C --> D[Compute low-dimensional similarities with heavy tails]
	D --> E[Move points to reduce mismatch]
	E --> F[Local neighborhoods preserved]
	F --> G[Global geometry may still distort]

5.6 Early exaggeration and optimization behavior

t-SNE is solved by iterative optimization, not by one closed-form matrix decomposition.

One important phase is early exaggeration. During the early stage, neighbor attractions are temporarily amplified. This helps clusters or neighborhoods pull together before finer structure is refined.

Why this matters in practice:

poor optimization settings can make the map unstable or misleading
random initialization can change the visual arrangement
too few iterations can leave the map half-formed
the final picture is an optimization result, not a unique ground truth

5.7 What the t-SNE plot means and what it does not mean

What it often means:

points near each other are often genuinely similar in the original space
a local group may represent a meaningful subpopulation
isolated points may indicate rare patterns, outliers, or mislabeled examples

What it does not reliably mean:

the distance between two separated clusters is a literal measure of dissimilarity
cluster area corresponds directly to variance or population density
the x and y axes have semantic interpretation
two runs with different seeds or hyperparameters are aligned coordinate systems

This distinction is critical. Engineers often over-read t-SNE figures because they look intuitive.

5.8 When t-SNE works well

t-SNE is a strong choice when:

the main goal is visual inspection
local neighborhoods matter more than global geometry
you want to inspect embeddings from a model
you suspect there are subgroups, label issues, or failure patterns hidden in a high-dimensional representation

Typical successful use cases:

visualizing image, audio, or language embeddings
inspecting device-state embeddings derived from telemetry windows
checking whether classes overlap or split into subfamilies
exploring defect-image embeddings in manufacturing quality systems

5.9 Where t-SNE fails or becomes misleading

t-SNE struggles when:

you treat it like a general-purpose production transform
you care about exact global geometry
you compare maps generated from different runs as if coordinates were stable
hyperparameters are chosen blindly
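sample size is too small or too large for the chosen settings (for example, perplexity cannot exceed the number of other points, so a small sample forces a low perplexity)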
the dataset contains many duplicates or near-duplicates that distort optimization

Classic failure case:

An engineer sees two clusters far apart in a t-SNE plot and concludes they are fundamentally different populations. In reality, t-SNE only promised to preserve local neighborhoods. The gap between clusters may be visually convenient rather than geometrically literal.

5.10 Real-world t-SNE use cases

Embedding model debugging

You train a classifier or contrastive model and want to inspect whether related examples actually live near each other. t-SNE helps show:

clean class separation
mixed or ambiguous classes
mislabeled data points
failure pockets where one class splits into several modes

Telemetry and operating-state inspection

Suppose you generate 128-dimensional embeddings from time windows of sensor data. t-SNE can reveal whether machine behavior separates into:

normal operation
startup transients
overload events
unusual fault-like states

This is especially useful for analyst workflows where humans need to inspect representative examples from each region.

Semiconductor, imaging, and hardware test pipelines

t-SNE is often useful for visualizing embeddings from defect images, board inspection features, or wafer-test signatures. The map helps engineers spot previously hidden defect families or drift between manufacturing lots.

5.11 Production engineering details for t-SNE

Treat t-SNE as an analysis artifact, not a default serving transform

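Most production systems should not feed t-SNE coordinates directly into critical online models. Standard t-SNE only learns coordinates for the points it was fit on and provides no stable function for placing new points into an existing map; scikit-learn's TSNE, for example, exposes fit_transform but no transform method.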

It is usually better suited for:

offline reports
experiment dashboards
quality inspection tools
model debugging notebooks and review systems

Pre-reduce with PCA for speed and denoising

A common professional workflow is:

clean and scale the data
use PCA to reduce to 30 to 50 dimensions
run t-SNE on that reduced representation

Why this helps:

removes noisy minor dimensions
speeds up pairwise similarity computation
makes t-SNE optimization more stable

Control randomness

For comparability, log and version:

random seed
perplexity
learning rate
iteration count
initialization method
subset or sampling rule

Without this, two analysts may produce different pictures from the same dataset and debate the wrong issue.
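A minimal sketch of what "log and version" could look like in code. The class name, field names, and fingerprint scheme are all hypothetical and use only the Python standard library; the point is that identical settings should produce an identical, comparable identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class TsneRunConfig:
    """Hypothetical record of everything needed to reproduce one map."""
    random_seed: int
    perplexity: float
    learning_rate: float
    iterations: int
    init: str
    sampling_rule: str

    def fingerprint(self) -> str:
        # Sort keys so the same settings always hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

config = TsneRunConfig(
    random_seed=42,
    perplexity=30.0,
    learning_rate=200.0,
    iterations=1000,
    init="pca",
    sampling_rule="stratified-by-label, 5000 points",
)
same_config = TsneRunConfig(**asdict(config))
other_seed = TsneRunConfig(**{**asdict(config), "random_seed": 7})

print(config.fingerprint())
```

Storing the fingerprint next to each saved figure makes it easy to tell whether two pictures came from the same settings before anyone starts debating what they show.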

Large-scale considerations

t-SNE can become expensive because exact t-SNE computes similarities between every pair of points, so time and memory grow roughly quadratically with the number of samples.

Professional approaches at scale include:

sub-sampling for human inspection
PCA pre-reduction
approximate neighbor search
faster implementations such as Barnes-Hut or FFT-based variants

The important operational point is that human-readable maps usually do not require embedding every single point ever observed. A carefully sampled, representative subset is often more useful.
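One possible sampling rule is to cap the number of points per group so that rare states stay visible on the map. The helper below is a sketch under that assumption; the function name, labels, and cap are illustrative. Capped sampling deliberately over-represents rare groups, so it is not a proportional sample.

```python
import numpy as np

def stratified_subsample(labels, max_per_group, seed=0):
    """Return sorted row indices with at most max_per_group rows per label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    chosen = []
    for value in np.unique(labels):
        group = np.flatnonzero(labels == value)
        take = min(max_per_group, group.size)
        chosen.append(rng.choice(group, size=take, replace=False))
    return np.sort(np.concatenate(chosen))

# Hypothetical label mix: mostly normal operation, few faults.
labels = np.array(["normal"] * 9000 + ["overload"] * 900 + ["fault"] * 100)
idx = stratified_subsample(labels, max_per_group=500, seed=42)
values, counts = np.unique(labels[idx], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

The returned indices can be logged alongside the other run settings so the same subset is reused when maps are compared.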

New-point handling

If your workflow truly requires mapping new points into an existing visualization, standard t-SNE is awkward. You may need:

parametric variants
approximation schemes
a different method for serving-time transforms

In many systems, the better design is simple:

use PCA or another stable transform for production features
use t-SNE only for offline visualization snapshots

5.12 Practical t-SNE implementation pattern

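from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# X: numeric array of shape (n_samples, n_features), assumed to be loaded already.
# It needs at least 50 samples and 50 features for the PCA step below.

# Standardize so no feature dominates distances because of its units.
X_scaled = StandardScaler().fit_transform(X)
# Pre-reduce to 50 dimensions to drop noisy minor directions and cut runtime.
X_pca = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

# Fixed seed and PCA initialization keep repeated runs comparable.
# perplexity must be smaller than the number of samples.
X_tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    init="pca",
    random_state=42,
).fit_transform(X_pca)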

What this gets right:

scaling happens before distance-based work
PCA removes noise and improves runtime
the random seed and initialization are fixed for reproducibility

What still needs judgment:

whether perplexity=30 matches the sample size and local structure
whether the dataset should be sub-sampled first
whether comparisons across runs use the exact same data subset and settings

5.13 Common t-SNE mistakes

treating the axes like meaningful physical directions
assuming large gaps between clusters imply large original-space distances
using t-SNE output as default features for production classifiers or regressors
running t-SNE directly on raw unscaled mixed-unit data
ignoring the effect of perplexity, learning rate, and random seed
comparing two separately generated maps as if their coordinates are aligned
embedding too many points without a sampling strategy and then over-trusting the picture
forgetting that t-SNE is mainly an exploratory visualization tool

5.14 Debugging t-SNE in practice

If the t-SNE plot looks wrong, check the following:

feature scaling and preprocessing
whether PCA pre-reduction should be added
sample size versus perplexity
random seed and initialization
iteration count and optimization convergence
duplicate or near-duplicate points
whether you are asking the plot to answer a global-geometry question it was never designed for

Useful symptom checks:

Symptom	Likely cause	What to check
Many tiny disconnected islands	perplexity too low, noise too high, or duplicates	increase perplexity, deduplicate, pre-reduce with PCA
One dense blob with little separation	poor embeddings, too much noise, or unsuitable settings	inspect upstream representation, try PCA first, adjust perplexity
Plot changes dramatically across runs	unstable optimization or insufficient control of randomness	fix seed, fix subset, increase iterations, compare several settings
Analysts over-interpret distances between clusters	misuse of t-SNE itself	restate that local neighborhoods are the reliable part

Useful practical checks:

run several perplexity values and see which structures remain stable
inspect actual nearest neighbors in the original space for points of interest
compare t-SNE on raw features versus PCA-compressed features
color the plot by known labels, time windows, device family, or error type
verify whether isolated islands correspond to real semantic differences or data quality issues
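The nearest-neighbor check above can be made quantitative. The sketch below measures, for each point, what fraction of its k nearest neighbors in the original space are still among its k nearest neighbors in the map. The function names are illustrative, and the brute-force distance matrix is only reasonable for a few thousand points.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_overlap(original, embedded, k=10):
    """Mean fraction of each point's original k neighbors kept in the map."""
    nn_orig = knn_indices(original, k)
    nn_emb = knn_indices(embedded, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_orig, nn_emb)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
perfect = neighbor_overlap(X, X.copy(), k=10)                     # same geometry
unrelated = neighbor_overlap(X, rng.normal(size=(200, 2)), k=10)  # random map
print(perfect, unrelated)
```

A score near 1 means local neighborhoods survived the reduction; a score near the random baseline means the map should not be trusted for neighborhood questions.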

5.15 Interview-level understanding of t-SNE

You should be able to explain these clearly:

t-SNE preserves local neighborhoods better than global geometry
it converts similarities into probabilities in both spaces and minimizes KL divergence between them
perplexity acts like an effective neighborhood size, not a magic cluster-count control
the Student t distribution helps reduce crowding in low-dimensional maps
t-SNE is excellent for visualization but usually poor as a default production feature transform
the axes are not semantically interpretable and separate runs are not directly aligned coordinate systems

6. PCA vs t-SNE

6.1 Decision intuition

flowchart TD
	A[Need a lower-dimensional representation] --> B{Must new incoming data be transformed consistently?}
	B -->|Yes| C{Need compression, denoising, or reusable features?}
	C -->|Yes| D[Start with PCA]
	C -->|No| D
	B -->|No| E{Main goal is human visualization of local neighborhoods?}
	E -->|Yes| F[Use t-SNE, often after PCA]
	E -->|No| G[Revisit objective or choose another method]

6.2 Practical comparison table

Concern	PCA	t-SNE
Core idea	find linear directions of maximum variance	preserve local neighborhoods in a low-dimensional map
Reusable transform for new data	yes	usually no
Supports approximate reconstruction	yes	no
Axes somewhat interpretable	sometimes	no
Good for compression and serving	yes	usually no
Good for 2D visualization of embeddings	sometimes, but limited	yes
Preserves global linear structure	relatively well	not reliably
Preserves local neighborhoods	moderately	strongly
Deterministic given fixed solver	mostly yes	more optimization-sensitive
Scales to production pipelines	often yes	mostly offline analysis

6.3 A practical rule of thumb

If you need a lower-dimensional representation that software will use repeatedly, start with PCA.

If you need a lower-dimensional map that humans will inspect visually, test t-SNE.

If you need both, a very common workflow is PCA first, then t-SNE on the PCA output.

7. Common Combined Workflow: PCA Before t-SNE

This combined pattern appears constantly in practice.

Why it works:

PCA removes low-variance noise and redundancy
t-SNE then focuses on neighborhood structure in a cleaner space
runtime improves because t-SNE handles fewer dimensions
plots often become more stable and easier to inspect

Example workflow:

clean and scale the features
reduce from 500 dimensions to 50 with PCA
run t-SNE from 50 dimensions down to 2
color the map by label, device type, failure class, or time window
inspect local neighborhoods and representative samples

Important caution:

This does not mean t-SNE inherits PCA interpretability. The final 2D map is still a t-SNE map, not a pair of principal components.

8. Production Scenarios and System Design

8.1 A common production architecture

flowchart LR
	A[Telemetry, logs, images, sensors, or embeddings] --> B[Feature store or preprocessing service]
	B --> C[Training pipeline]
	C --> D[PCA artifact: mean, components, scaling]
	C --> E[t-SNE snapshots for analyst review]
	D --> F[Batch or online inference pipeline]
	E --> G[Dashboards and debugging tools]
	F --> H[Drift and reconstruction monitoring]
	G --> H

This reflects a common professional split:

PCA belongs naturally in the serving or batch-processing path
t-SNE belongs naturally in the analysis and visualization path
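A sketch of what the "PCA artifact: mean, components, scaling" box could contain, written in plain NumPy so the stored pieces are explicit. The function names and dictionary layout are assumptions for illustration, not a fixed format.

```python
import numpy as np

def fit_pca_artifact(X_train, n_components):
    """Fit on training data and return everything serving needs."""
    mean = X_train.mean(axis=0)
    scale = X_train.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for constant features
    Z = (X_train - mean) / scale
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return {"mean": mean, "scale": scale, "components": vt[:n_components]}

def apply_pca_artifact(artifact, X_new):
    """Project new rows with the stored statistics; nothing is re-fit."""
    Z = (X_new - artifact["mean"]) / artifact["scale"]
    return Z @ artifact["components"].T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 12))  # hypothetical training features
X_new = rng.normal(size=(3, 12))      # hypothetical serving-time rows

artifact = fit_pca_artifact(X_train, n_components=4)
scores_batch = apply_pca_artifact(artifact, X_new)
scores_single = apply_pca_artifact(artifact, X_new[:1])
print(scores_batch.shape)
```

Because the transform is a fixed affine map, a row gets the same scores whether it arrives alone or in a batch, which is the property the serving path relies on.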

8.2 Industry example: observability platform

Imagine an observability system with hundreds of metrics per service instance.

Possible design:

use PCA to compress metric windows into a smaller factor representation for anomaly models
use t-SNE on sampled embeddings offline so reliability engineers can inspect incident families
monitor drift in PCA component distributions after service rollouts

This separates operational transforms from analyst-facing visual tools.
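Drift monitoring in component space can start very simply. The sketch below compares the mean of each component's scores after a rollout with a reference window, in units of the reference standard deviation. The data is simulated and the 0.5 threshold is an arbitrary example value; a real system would tune it and likely track more than the mean.

```python
import numpy as np

def component_drift(reference_scores, new_scores):
    """Per-component shift of the new mean, in reference standard deviations."""
    ref_mean = reference_scores.mean(axis=0)
    ref_std = reference_scores.std(axis=0)
    ref_std = np.where(ref_std == 0, 1.0, ref_std)
    return np.abs(new_scores.mean(axis=0) - ref_mean) / ref_std

rng = np.random.default_rng(0)
reference = rng.normal(size=(5000, 3))      # scores from the reference window
after_rollout = rng.normal(size=(5000, 3))  # scores after a rollout
after_rollout[:, 1] += 2.0                  # simulate a shift on the second component

drift = component_drift(reference, after_rollout)
flagged = np.flatnonzero(drift > 0.5)       # 0.5 is an arbitrary example threshold
print(flagged)
```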

8.3 Industry example: industrial and embedded systems

Imagine a factory line with vibration, current, acoustic, and temperature features.

Possible design:

use PCA at the edge gateway to reduce bandwidth and storage
feed compressed features into a lightweight anomaly detector or state classifier
use t-SNE offline on fault windows to inspect whether failures cluster into distinct families

This is a software and hardware co-design problem. Compression changes communication cost, local memory pressure, and the complexity of the downstream control software.

8.4 Industry example: ML embedding platform

Suppose you operate a search or recommendation system with 768-dimensional embeddings.

Possible design:

test PCA compression to 128 or 64 dimensions to reduce ANN index size and memory footprint
use t-SNE on sampled embeddings to inspect semantic neighborhoods and identify drift or label issues
validate that compressed embeddings still preserve retrieval quality, not just variance
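The memory argument is easy to put numbers on. The arithmetic below assumes float32 storage and a hypothetical catalog of 10 million vectors, and it ignores index overhead, which varies by ANN library.

```python
def embedding_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring any index overhead."""
    return n_vectors * dim * bytes_per_value

n_vectors = 10_000_000  # hypothetical catalog size
full = embedding_bytes(n_vectors, 768)
reduced = embedding_bytes(n_vectors, 128)

print(f"768-d: {full / 1e9:.2f} GB")     # 768-d: 30.72 GB
print(f"128-d: {reduced / 1e9:.2f} GB")  # 128-d: 5.12 GB
print(f"ratio: {full / reduced:.0f}x")   # ratio: 6x
```

The storage saving is guaranteed by the arithmetic; whether retrieval quality survives the compression still has to be measured on real queries.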

8.5 Design considerations that matter in production

version the exact preprocessing and reduction artifacts
document the intended use of each representation
do not let analysts compare unmatched t-SNE runs as if they were stable coordinates
monitor drift in both original and reduced spaces
validate reduction choices against business or system metrics, not visual appeal alone

9. Failure Modes and Troubleshooting

9.1 General failure modes across both methods

Scale mismatch

If features live on very different numeric scales, both PCA and t-SNE can reflect scale artifacts instead of real structure.

Data leakage

If you fit scaling or PCA on the full dataset before splitting, your evaluation becomes optimistic.
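A leakage-safe ordering, sketched in NumPy with simulated data: split first, then compute every statistic (mean, standard deviation, principal directions) from the training rows only and reuse them on the held-out rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 8))  # hypothetical feature matrix

# Split first, then fit every statistic on the training rows only.
order = rng.permutation(len(X))
train_idx, test_idx = order[:800], order[800:]
X_train, X_test = X[train_idx], X[test_idx]

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std  # reuse training statistics, never re-fit

_, _, vt = np.linalg.svd(Z_train, full_matrices=False)
components = vt[:3]
scores_test = Z_test @ components.T
print(scores_test.shape)
```

The held-out rows will not have exactly zero mean after scaling, and that is expected: it reflects how the transform behaves on data it has never seen.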

Batch effects

Data from different devices, sites, firmware versions, or time windows may dominate the structure. The method then learns environment differences instead of the phenomenon you care about.

Rare-event suppression

The information you care most about may be low variance, rare, or local. Generic reduction can remove it.

Visualization overconfidence

Humans trust pictures too easily. A beautiful low-dimensional plot is not proof that the representation preserves the right structure for the actual task.

9.2 Troubleshooting flow

flowchart TD
	A[Reduced representation looks wrong] --> B{Features cleaned, encoded, and scaled correctly?}
	B -->|No| C[Fix preprocessing first]
	B -->|Yes| D{Is the method matched to the goal?}
	D -->|No| E[Use PCA for reusable features or t-SNE for visualization]
	D -->|Yes| F{Is the signal dominated by outliers, batch effects, or drift?}
	F -->|Yes| G[Inspect segments, remove artifacts, retrain]
	F -->|No| H{Are hyperparameters and component counts validated?}
	H -->|No| I[Sweep k, perplexity, or learning settings]
	H -->|Yes| J[Check downstream utility, reconstruction, and stability]

9.3 Symptom-driven troubleshooting table

Symptom	Likely issue	Practical check
First principal component mostly reflects one raw feature	scale mismatch or feature dominance	compare with standardization and inspect loadings
PCA works in training but degrades in production	drift or preprocessing mismatch	verify feature order, mean, scaling, and component artifact version
t-SNE plot looks dramatically different every week	new sample subset or uncontrolled randomness	fix sampling strategy and random seed
Nice separation in t-SNE but poor downstream model	visualization is not preserving task-relevant structure	validate on the real task, not the plot
Rare failure cases disappear after PCA	low-variance but important signal got removed	inspect per-class reconstruction error and keep more components
Reduced features unstable across firmware versions	batch effect or distribution shift	stratify by version, inspect separate loadings or maps
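The "inspect per-class reconstruction error" check from the table can be sketched as follows. The data is simulated so that a rare class differs from normal rows only along a direction with little overall variance; names and sizes are illustrative.

```python
import numpy as np

def per_class_reconstruction_error(X, labels, mean, components):
    """Mean squared reconstruction error for each label value."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    row_error = ((centered - reconstructed) ** 2).sum(axis=1)
    return {str(v): float(row_error[labels == v].mean()) for v in np.unique(labels)}

rng = np.random.default_rng(0)
feature_scale = np.array([10.0, 1.0, 1.0, 1.0, 0.1])
normal = rng.normal(size=(2000, 5)) * feature_scale
rare = rng.normal(size=(20, 5)) * feature_scale
rare[:, 4] += 3.0  # the rare class differs only along a low-variance feature
X = np.vstack([normal, rare])
labels = np.array(["normal"] * 2000 + ["rare"] * 20)

mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
errors = per_class_reconstruction_error(X, labels, mean, vt[:2])
print(errors)
```

A class whose error is much higher than the rest is being poorly represented by the kept components, which is a signal to keep more components or handle that class separately.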

9.4 Best debugging habits

always inspect the original feature pipeline before blaming the reduction method
evaluate stability across seeds, samples, and time windows
connect plots back to real examples, not just abstract points
log exact hyperparameters and artifacts so a picture can be reproduced
check downstream utility, not only visual aesthetics

10. Software and Hardware Connections

Dimensionality reduction is not only an algorithm choice. It often affects system architecture.

10.1 Edge and embedded devices

On embedded or edge systems, sending all raw channels upstream can be expensive. PCA can act like a learned compression stage:

fewer values transmitted over CAN, SPI, Ethernet, or radio links
less storage written to flash or local disk
lower memory footprint for downstream state estimators
potentially lower power usage because less data is moved and processed

This is why dimensionality reduction sometimes belongs in hardware-software co-design discussions, not only in ML notebooks.

10.2 Imaging and spectral systems

Camera pipelines, hyperspectral systems, and board inspection tools often produce many correlated channels. PCA can compress channel structure before classification or anomaly scoring.

The system-level benefits may include:

lower bandwidth between capture and inference stages
better cache behavior in CPU or GPU pipelines
reduced accelerator memory use
easier archival of compact representations for later forensic analysis

10.3 Networking, RF, and signal systems

In communication and signal-processing contexts, multiple observed features may reflect a smaller number of underlying propagation, interference, or device-state factors.

PCA-style reduction can help with:

compressing channel-state summaries
reducing correlated monitoring counters
building simpler anomaly detectors for network behavior

t-SNE, in contrast, is more useful here for offline inspection of learned embeddings or event signatures than for inline signal pipelines.

10.4 Numerical and compute considerations

PCA is friendly to linear algebra hardware and optimized libraries.

t-SNE is more dominated by pairwise relationships and iterative optimization.

That means:

PCA usually fits more naturally into latency-sensitive production code
t-SNE usually fits more naturally into offline analytics or debugging jobs
memory movement and neighbor-search cost often dominate t-SNE at scale

11. Common Interview and Design Questions

11.1 Why does PCA maximize variance?

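Because, for centered data, total variance splits exactly into the variance retained by the kept directions plus the mean squared reconstruction error. Maximizing retained variance is therefore the same as minimizing squared reconstruction error, and dropping the lowest-variance directions adds the least error.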
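The link between retained variance and squared reconstruction error can be checked numerically. This sketch uses simulated correlated data; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))  # correlated features
Xc = X - X.mean(axis=0)

_, _, vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
W = vt[:k]  # top-k principal directions (rows)

scores = Xc @ W.T
reconstruction = scores @ W

n = len(Xc)
total_variance = (Xc ** 2).sum() / n
retained_variance = (scores ** 2).sum() / n
reconstruction_error = ((Xc - reconstruction) ** 2).sum() / n

# Any other orthonormal pair of directions leaves at least as much error.
Q, _ = np.linalg.qr(rng.normal(size=(6, k)))
random_error = ((Xc - Xc @ Q @ Q.T) ** 2).sum() / n

print(total_variance, retained_variance + reconstruction_error)
print(reconstruction_error, random_error)
```

The first line prints two equal numbers (up to floating-point rounding), and the second shows that the principal directions leave no more error than an arbitrary projection of the same rank.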

11.2 Why do we center data before PCA?

Because PCA is about variation around the mean. Without centering, the leading direction tends to point from the origin toward the data mean instead of along the direction of greatest spread.
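A small simulated example of the centering effect: the cloud varies mostly along [1, -1] but sits far from the origin at (100, 100). The setup and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Most variation lies along [1, -1]; the whole cloud sits far from the origin.
along = rng.normal(scale=3.0, size=n)
across = rng.normal(scale=0.3, size=n)
X = (np.outer(along, [1.0, -1.0]) + np.outer(across, [1.0, 1.0])) / np.sqrt(2)
X += np.array([100.0, 100.0])

def first_direction(data):
    """Leading right singular vector of the data matrix."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

uncentered = first_direction(X)
centered = first_direction(X - X.mean(axis=0))

mean_dir = np.array([1.0, 1.0]) / np.sqrt(2)
spread_dir = np.array([1.0, -1.0]) / np.sqrt(2)
# Absolute values because the sign of a singular vector is arbitrary.
print(abs(uncentered @ mean_dir), abs(centered @ spread_dir))
```

Both printed values are close to 1: without centering the first direction tracks the mean offset, and with centering it tracks the actual spread.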

11.3 When would PCA hurt a system?

When important signal is nonlinear, rare, low-variance, or buried under strong batch effects. Also when raw feature semantics matter more than compactness.

11.4 Why is t-SNE good for visualization but bad for serving features?

Because it is built to preserve local neighborhoods for a fixed dataset, not to provide a stable, interpretable, reusable coordinate transform for new incoming data.

11.5 What does perplexity control in t-SNE?

It roughly controls effective neighborhood size. Low values emphasize fine local structure. Higher values smooth over broader neighborhoods.
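The "effective neighborhood size" reading follows from the definition of perplexity as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. A small standard-library sketch with made-up distributions:

```python
import math

def perplexity(probabilities):
    """2 ** (Shannon entropy in bits) of a discrete distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

uniform_8 = [1 / 8] * 8          # eight equally likely neighbors
peaked = [0.9] + [0.1 / 7] * 7   # one dominant neighbor, seven faint ones

print(perplexity(uniform_8))  # 8.0
print(perplexity(peaked))
```

Eight equally weighted neighbors give a perplexity of 8, while a distribution dominated by a single neighbor gives a value between 1 and 2 even though eight neighbors technically have nonzero weight.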

11.6 Why can two t-SNE plots of the same data look different?

Because the optimization can land in different valid low-dimensional arrangements, especially when initialization, seed, and settings change.

11.7 When should you use PCA before t-SNE?

Usually when the original feature space is high-dimensional and noisy. PCA reduces redundancy and speeds up t-SNE while often improving visual stability.

11.8 What is the biggest professional mistake with dimensionality reduction?

Treating the reduced representation as truth rather than as a task-dependent approximation.

12. Best Practices Checklist

start with the operational goal, not the algorithm name
decide whether you need reusable features, visualization, or both
scale features when units differ, but do it intentionally
fit preprocessing and PCA only on training data when building predictive pipelines
use PCA for stable compression, denoising, and serving-time transforms
use t-SNE for offline neighborhood visualization and debugging
do not over-interpret t-SNE distances, axes, or cluster areas
validate component count or t-SNE settings against downstream usefulness
inspect stability across seeds, time windows, and subsets
version every artifact needed to reproduce the reduced representation
monitor drift after deployment
remember that good-looking plots are evidence, not proof

13. Key Takeaways

Dimensionality reduction is really about preserving the right information while discarding the rest.

PCA is usually the right tool when engineers need a stable, reusable, efficient transform that compresses correlated numeric structure. It is strong for denoising, storage reduction, inference pipelines, and systems where a linear low-rank approximation is good enough.

t-SNE is usually the right tool when humans need to inspect local neighborhoods in a complex embedding space. It is powerful for model debugging, exploratory analysis, and discovering hidden subgroup structure, but it should be treated as a visualization method rather than a default production feature transform.

The core engineering skill is not memorizing the algorithm names. The real skill is deciding what structure matters, what distortion is acceptable, and how the reduced representation will actually be used inside a real system.
