NAS — Let AI Design Its Own Neural Networks
Neural Architecture Search (NAS) replaces manual architecture engineering with algorithmic search, and it has discovered architectures that match or outperform human-designed networks on standard benchmarks. Methods such as DARTS, and NAS-derived families such as EfficientNet, search over depth, width, and connectivity patterns for a given task and hardware constraint.
- DARTS enables differentiable architecture search with gradient-based optimization
- EfficientNet's compound scaling jointly scales depth, width, and resolution
- Once-for-All networks train one supernetwork to deploy many subnets
"The future of architecture is architecture that designs itself."
Neural Architecture Search
NAS automates the design of neural network architectures, replacing manual engineering with algorithmic search. It has discovered architectures that outperform human-designed networks.
NAS Framework
Definition: NAS Components
NAS consists of three components:
- Search space: Set of possible architectures (operations, connectivity)
- Search strategy: Algorithm to explore (reinforcement learning, evolutionary, gradient-based)
- Performance estimation: Evaluate architectures efficiently (proxy tasks, weight sharing)
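The three components can be seen working together in a minimal random-search sketch. Everything named here (`SEARCH_SPACE`, `sample_architecture`, `estimate_performance`) is hypothetical, and the scoring function is a made-up stand-in for actually training and validating a network.

```python
import random

# Search space: one choice per design decision (values are illustrative).
SEARCH_SPACE = {
    "depth": [2, 4, 6],
    "width": [16, 32, 64],
    "op": ["conv_3x3", "sep_conv_5x5", "max_pool_3x3"],
}

def sample_architecture(rng):
    """Search strategy (plain random search): pick one option per decision."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Performance estimation: a made-up proxy score, NOT a real accuracy.

    A real system would train briefly on a proxy task or reuse shared weights.
    """
    score = 0.1 * arch["depth"] + 0.01 * arch["width"]
    if arch["op"] == "max_pool_3x3":
        score -= 0.2
    return score

def random_search(num_trials, seed=0):
    """Sample architectures and keep the one with the best estimated score."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search(num_trials=20)
print(best_arch, round(best_score, 3))
```

Swapping `sample_architecture` for an RL controller, an evolutionary loop, or gradient descent changes the search strategy while leaving the other two components in place.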
Search Space
Definition: Cell-Based Search Space
Most modern NAS uses a cell-based search space:
- Normal cell: Preserves spatial dimensions
- Reduction cell: Halves spatial dimensions, doubles channels
- Each cell is a directed acyclic graph (DAG) of nodes (typically 4-6), where each node is a feature map
- Each edge applies an operation (conv 3x3, sep conv 5x5, max pool, etc.)
A network is built by stacking these cells.
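Cell Output
$$x^{(j)} = \sum_{i<j} o^{(i,j)}\left(x^{(i)}\right)$$
Here,
- $x^{(j)}$ = Output of intermediate node j
- $o^{(i,j)}\left(x^{(i)}\right)$ = Output of the operation on edge (i,j)
- $\sum_{i<j}$ = Sum over all incoming edges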
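The node rule can be traced by hand on scalars. The sketch below assumes two input nodes and uses toy edge operations (`double`, `identity`, `zero`) in place of convolutions and pooling; the function name `cell_forward` is illustrative.

```python
def cell_forward(inputs, edge_ops, num_intermediate):
    """Compute each intermediate node as the sum over its incoming edges.

    inputs: outputs of the input nodes (nodes 0 and 1 here).
    edge_ops: dict mapping an edge (i, j) to the callable applied on it.
    """
    states = list(inputs)
    first = len(inputs)
    for j in range(first, first + num_intermediate):
        # Node j receives one edge from every earlier node i < j.
        states.append(sum(edge_ops[(i, j)](states[i]) for i in range(j)))
    return states

double = lambda x: 2.0 * x
identity = lambda x: x
zero = lambda x: 0.0

edge_ops = {
    (0, 2): double, (1, 2): identity,
    (0, 3): zero, (1, 3): identity, (2, 3): double,
}
states = cell_forward([1.0, 3.0], edge_ops, num_intermediate=2)
print(states)  # [1.0, 3.0, 5.0, 13.0]
```

Node 2 is 2·1 + 3 = 5, and node 3 is 0 + 3 + 2·5 = 13, which matches the sum over incoming edges.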
DARTS (Differentiable Architecture Search)
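Definition: DARTS
DARTS (Liu et al., 2019) makes NAS differentiable by relaxing the discrete choice of operation on each edge to a softmax-weighted mixture of all candidate operations.
The architecture parameters $\alpha$ and the network weights $w$ are then optimized together as a bilevel problem:
$$\min_{\alpha} \; \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha)$$
DARTS Continuous Relaxation
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_{o}^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x)$$
Here,
- $\alpha_{o}^{(i,j)}$ = Architecture weight for operation o on edge (i,j)
- $\mathcal{O}$ = Set of candidate operations
- $o(x)$ = Output of operation o on input x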
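As a small numeric illustration of the relaxation, the sketch below mixes three toy scalar operations with softmax weights. The operations and the helper names `softmax` and `mixed_op` are assumptions made for this example; plain Python is used so the arithmetic stays visible.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alphas, ops):
    """Softmax(alpha)-weighted sum of every candidate operation applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy candidates: zero, identity, doubling.
ops = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]

uniform = mixed_op(3.0, [0.0, 0.0, 0.0], ops)   # equal weights: (0 + 3 + 6) / 3
peaked = mixed_op(3.0, [0.0, 0.0, 10.0], ops)   # nearly all weight on doubling
print(uniform, peaked)
```

With equal architecture weights the edge behaves like an average of its candidates. As one weight grows, the mixture approaches that single operation, which is why taking the largest weight is a natural way to discretize after search.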
DARTS Training
DARTS alternates between:
- Step 1: Update the network weights $w$ on training data (minimize $\mathcal{L}_{train}$)
- Step 2: Update the architecture parameters $\alpha$ on validation data (minimize $\mathcal{L}_{val}$)
After search, the final architecture is derived by selecting the operation with the highest $\alpha$ on each edge.
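The alternation can be sketched on a toy problem with hand-written gradients. This is a first-order approximation (the architecture step treats the current weights as fixed), and the setup is an assumption for illustration: one edge, two candidates (a linear op `w * x` and a zero op), and targets `y = 2x`.

```python
import math

def softmax2(alpha):
    """Softmax over the two architecture parameters."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grads(w, alpha, data):
    """MSE of the mixed op p_lin * (w * x) + p_zero * 0, with its gradients."""
    p_lin, p_zero = softmax2(alpha)
    n = len(data)
    loss = grad_w = grad_p_lin = 0.0
    for x, y in data:
        err = p_lin * w * x - y
        loss += err * err / n
        grad_w += 2.0 * err * p_lin * x / n    # dL/dw
        grad_p_lin += 2.0 * err * w * x / n    # dL/dp_lin
    # Chain rule through the softmax; the zero op has no gradient of its own.
    grad_alpha = [p_lin * p_zero * grad_p_lin, -p_lin * p_zero * grad_p_lin]
    return loss, grad_w, grad_alpha

train = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
val = [(x, 2.0 * x) for x in (-0.75, -0.25, 0.25, 0.75)]

w, alpha = 0.0, [0.0, 0.0]
initial_val_loss, _, _ = loss_and_grads(w, alpha, val)
for _ in range(200):
    # Step 1: update the weights on training data.
    _, grad_w, _ = loss_and_grads(w, alpha, train)
    w -= 0.1 * grad_w
    # Step 2: update the architecture parameters on validation data.
    _, _, grad_alpha = loss_and_grads(w, alpha, val)
    alpha = [a - 0.1 * g for a, g in zip(alpha, grad_alpha)]
final_val_loss, _, _ = loss_and_grads(w, alpha, val)

# Discretize: keep the operation with the largest architecture parameter.
chosen = ["linear", "zero"][alpha.index(max(alpha))]
print(chosen, initial_val_loss, final_val_loss)
```

In a real search the two steps would use separate optimizers over the weight and architecture parameter groups of a network; the learning rates and step count here are arbitrary.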
EfficientNet
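Definition: EfficientNet
EfficientNet (Tan and Le, 2019) uses compound scaling to scale depth, width, and resolution together with a single coefficient:
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$$
subject to
$$\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1, \; \beta \ge 1, \; \gamma \ge 1$$
- $\alpha$: depth coefficient
- $\beta$: width coefficient
- $\gamma$: resolution coefficient
- $\phi$: compound coefficient controlling overall scale
The baseline network (EfficientNet-B0) was found via NAS; the larger variants are obtained by scaling that baseline with this rule.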
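Compound Scaling
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$$
Here,
- $d$ = Depth (number of layers), as a multiple of the baseline
- $w$ = Width (number of channels), as a multiple of the baseline
- $r$ = Resolution (input image size), as a multiple of the baseline
- $\alpha, \beta, \gamma$ = Depth, width, and resolution coefficients
- $\phi$ = Compound coefficient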
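A short sketch of the scaling rule follows. The coefficients 1.2, 1.1, and 1.15 are the values reported for the EfficientNet baseline by Tan and Le (2019). The baseline sizes (16 layers, 32 channels, 224 pixels) and the rounding with `math.ceil` are assumptions for illustration only.

```python
import math

# Depth, width, and resolution coefficients reported by Tan and Le (2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth, base_width, base_resolution):
    """Scale a baseline (layers, channels, input size) by compound coefficient phi."""
    depth = math.ceil(base_depth * ALPHA ** phi)
    width = math.ceil(base_width * BETA ** phi)
    resolution = math.ceil(base_resolution * GAMMA ** phi)
    return depth, width, resolution

def flops_multiplier(phi):
    """Approximate FLOPs growth: (alpha * beta^2 * gamma^2) ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

for phi in range(4):
    print(phi, compound_scale(phi, 16, 32, 224), round(flops_multiplier(phi), 2))
```

Because the product of the coefficients is about 1.92, each unit increase in the compound coefficient roughly doubles the FLOPs, which is what the constraint is designed to achieve.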
Once-for-All (OFA)
Definition: Once-for-All Network
OFA (Cai et al., 2020) trains a single supernetwork that contains all possible sub-networks:
- Progressive shrinking: Train the full network, then progressively fine-tune smaller subnets
- Elastic depth/width/kernel: Support flexible depth, width, and kernel size
- Deployment-specific search: Find optimal subnet for target hardware without retraining
This removes the need to retrain a network for each deployment target; only an inexpensive subnet search is run per target.
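Deployment-specific search can be sketched as picking the best subnet configuration that fits a budget. This is not the OFA implementation: the elastic values, `cost_proxy`, and `quality_proxy` are made-up stand-ins for a measured latency table and a trained accuracy predictor. Exhaustive enumeration is used only because this toy space is tiny.

```python
from itertools import product

# Elastic dimensions of a hypothetical supernetwork (values are illustrative).
DEPTHS = [2, 3, 4]
WIDTH_MULTS = [0.5, 0.75, 1.0]
KERNEL_SIZES = [3, 5, 7]

def cost_proxy(depth, width_mult, kernel):
    """Made-up compute cost: grows with depth, width squared, and kernel area."""
    return depth * (width_mult ** 2) * (kernel ** 2)

def quality_proxy(depth, width_mult, kernel):
    """Made-up stand-in for an accuracy predictor: larger subnets score higher."""
    return depth + 2.0 * width_mult + 0.1 * kernel

def select_subnet(budget):
    """Return the best-scoring (depth, width, kernel) within budget, or None."""
    feasible = [
        cfg for cfg in product(DEPTHS, WIDTH_MULTS, KERNEL_SIZES)
        if cost_proxy(*cfg) <= budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: quality_proxy(*cfg))

for budget in (10.0, 50.0, 200.0):
    print(budget, select_subnet(budget))
```

No training happens inside `select_subnet`; in the supernetwork setting the chosen subnet would inherit its weights directly.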
Search Strategies
Definition: Reinforcement Learning (NASNet)
Use an RNN controller to generate architecture descriptions:
- The controller predicts an architecture as a sequence of decisions
- Train the sampled architecture and evaluate it on a validation set
- Use the validation accuracy as the reward to update the controller with a policy gradient method (REINFORCE in the original NAS work; NASNet used PPO)
Pros: Flexible search space. Cons: Very expensive (on the order of thousands of GPU days).
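The controller update can be sketched with a REINFORCE loop over a single decision. This is a heavy simplification: the policy is a table of logits rather than an RNN, and `FAKE_REWARD` is a made-up lookup standing in for training each sampled architecture and measuring validation accuracy.

```python
import math
import random

OPS = ["conv_3x3", "sep_conv_5x5", "max_pool_3x3"]
# Made-up rewards standing in for validation accuracy of each choice.
FAKE_REWARD = {"conv_3x3": 0.6, "sep_conv_5x5": 0.9, "max_pool_3x3": 0.3}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_controller(steps=300, lr=0.1, seed=0):
    """REINFORCE on a one-decision policy with a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)   # controller parameters
    baseline = 0.0              # reduces the variance of the gradient estimate
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(OPS)), weights=probs)[0]
        reward = FAKE_REWARD[OPS[action]]
        advantage = reward - baseline
        # d log pi(action) / d logit_k = 1[k == action] - probs[k]
        for k in range(len(OPS)):
            indicator = 1.0 if k == action else 0.0
            logits[k] += lr * advantage * (indicator - probs[k])
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)

probs = train_controller()
print(dict(zip(OPS, (round(p, 3) for p in probs))))
```

The cost noted above comes from the reward: in real RL-based NAS each reward requires training a network, which this sketch replaces with a dictionary lookup.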
Definition: Evolutionary Search (AmoebaNet)
Use an evolutionary algorithm to evolve architectures:
- Maintain a population of architectures
- Tournament selection based on fitness (validation accuracy)
- Mutation: change operations or connectivity
- Crossover: combine architectures (optional; AmoebaNet relies on mutation only)
Pros: Parallelizable. Cons: Still expensive.
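The loop above can be sketched as tournament selection plus mutation, with the oldest individual removed each cycle (in the spirit of aging evolution; crossover is left out). The `fitness` function is a made-up proxy standing in for validation accuracy, and the operation names and sizes are illustrative.

```python
import random
from collections import deque

OPS = ["conv_3x3", "sep_conv_5x5", "max_pool_3x3", "skip_connect"]
NUM_EDGES = 4

def fitness(arch):
    """Made-up fitness: fraction of edges that use sep_conv_5x5."""
    return sum(1.0 for op in arch if op == "sep_conv_5x5") / len(arch)

def mutate(arch, rng):
    """Change the operation on one randomly chosen edge."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(OPS)
    return tuple(child)

def evolve(cycles=200, population_size=10, tournament_size=3, seed=0):
    rng = random.Random(seed)
    population = deque(
        tuple(rng.choice(OPS) for _ in range(NUM_EDGES))
        for _ in range(population_size)
    )
    best = max(population, key=fitness)
    initial_best = fitness(best)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        contenders = rng.sample(list(population), tournament_size)
        parent = max(contenders, key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # aging: remove the oldest, not the worst
        if fitness(child) > fitness(best):
            best = child
    return best, initial_best

best, initial_best = evolve()
print(best, fitness(best))
```

Each call to `fitness` would be a full training run in practice, and the calls within a generation are independent, which is where the parallelism comes from.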
PyTorch Implementation
Example: DARTS Implementation
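import torch
import torch.nn as nn
import torch.nn.functional as F


# Simplified stand-ins for the usual DARTS primitives.
class Zero(nn.Module):
    """'none' operation: outputs zeros, i.e. no connection on this edge."""
    def forward(self, x):
        return torch.zeros_like(x)


class DilConv(nn.Module):
    """ReLU -> depthwise (optionally dilated) conv -> pointwise conv -> BN."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation=1):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out, affine=False),
        )

    def forward(self, x):
        return self.op(x)


class SepConv(DilConv):
    """Depthwise-separable conv: the same block with dilation fixed to 1."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__(C_in, C_out, kernel_size, stride, padding, dilation=1)


# Candidate operations (all preserve spatial size and channel count)
OPS = {
    'none': lambda C: Zero(),
    'skip_connect': lambda C: nn.Identity(),
    'sep_conv_3x3': lambda C: SepConv(C, C, 3, 1, 1),
    'sep_conv_5x5': lambda C: SepConv(C, C, 5, 1, 2),
    'dil_conv_3x3': lambda C: DilConv(C, C, 3, 1, 2, 2),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, 1, 1),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, 1, 1),
}


class MixedOp(nn.Module):
    """Mixed operation: weighted sum of all candidate operations."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in OPS])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class DARTSCell(nn.Module):
    """DARTS search cell: each node sums mixed ops over all earlier nodes."""
    def __init__(self, C, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleList()
        for j in range(2, num_nodes + 2):  # nodes 0 and 1 are the cell inputs
            for i in range(j):
                self.ops.append(MixedOp(C))

    def forward(self, s0, s1, weights):
        states = [s0, s1]
        edge = 0
        for j in range(2, self.num_nodes + 2):
            node_inputs = []
            for i in range(j):
                node_inputs.append(self.ops[edge](states[i], weights[edge]))
                edge += 1
            states.append(sum(node_inputs))
        # Simplification: return the last node only. The original DARTS cell
        # concatenates all intermediate nodes along the channel dimension.
        return states[-1]


class DARTS(nn.Module):
    def __init__(self, C=16, num_classes=10, layers=8, num_nodes=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, 1, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList([
            DARTSCell(C, num_nodes) for _ in range(layers)
        ])
        self.classifier = nn.Linear(C, num_classes)
        # Architecture parameters, shared by every cell (one row per edge).
        # Reduction cells are omitted here, so only normal-cell alphas exist.
        num_edges = sum(range(2, num_nodes + 2))
        self.alphas_normal = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )

    def arch_parameters(self):
        """Parameters for the architecture optimizer (validation step)."""
        return [self.alphas_normal]

    def forward(self, x):
        weights = F.softmax(self.alphas_normal, dim=-1)
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, weights)
        out = F.adaptive_avg_pool2d(s1, 1).flatten(1)
        return self.classifier(out)


def derive_architecture(model):
    """Return the strongest non-'none' operation name for each edge."""
    names = list(OPS)
    alphas = F.softmax(model.alphas_normal, dim=1).detach().clone()
    alphas[:, names.index('none')] = -1.0  # never select the zero op
    return [names[k] for k in alphas.argmax(dim=1).tolist()]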
Practical Considerations
NAS Best Practices
- Search on proxy dataset: Use CIFAR-10 for search, transfer to ImageNet
- Use weight sharing: Train supernetwork once, evaluate subnets by weight inheritance
- Set a GPU budget: DARTS needs roughly 1-4 GPU days; RL-based search needs roughly 1000-2000 GPU days
- Regularize search: Add latency constraint for hardware-aware NAS
- Avoid degenerate solutions: DARTS may collapse to skip connections, so apply decay or other regularization to the skip-connection ops
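One common way to express the latency constraint from the list above is to add an expected-latency term to the search loss. The sketch below assumes a per-operation latency table (`LATENCIES_MS`, made-up values) and a trade-off weight `lam`; both would be measured or tuned for real hardware.

```python
import math

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def expected_latency(alphas, op_latencies_ms):
    """Sum over edges of the softmax(alpha)-weighted per-operation latency."""
    total = 0.0
    for edge_alphas in alphas:
        weights = softmax(edge_alphas)
        total += sum(w * lat for w, lat in zip(weights, op_latencies_ms))
    return total

def hardware_aware_loss(task_loss, alphas, op_latencies_ms, lam=0.01):
    """Task loss plus a latency penalty weighted by lam."""
    return task_loss + lam * expected_latency(alphas, op_latencies_ms)

# Made-up latencies in milliseconds for: skip connection, conv 3x3, conv 5x5.
LATENCIES_MS = [0.0, 2.0, 4.0]

uniform_alphas = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # two edges, equal weights
skip_heavy_alphas = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]

print(expected_latency(uniform_alphas, LATENCIES_MS))
print(expected_latency(skip_heavy_alphas, LATENCIES_MS))
print(hardware_aware_loss(1.0, uniform_alphas, LATENCIES_MS, lam=0.5))
```

Because the penalty is a smooth function of the architecture parameters, it can be minimized by the same gradient step as the validation loss. It also rewards cheap operations such as skip connections, so it interacts with the collapse issue in the last bullet.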
Practice Exercises
- DARTS on CIFAR-10: Implement and run DARTS search. Visualize the discovered architecture.
- Compound scaling: Reproduce EfficientNet scaling experiments. Plot accuracy vs. FLOPs.
- Hardware-aware NAS: Add latency objective to DARTS. Find Pareto-optimal architectures.
- Once-for-All: Train OFA supernetwork on MNIST. Extract subnets for different FLOP budgets.
Key Takeaways
Summary: Neural Architecture Search
- NAS automates architecture design with search space, strategy, and estimation
- DARTS: Differentiable relaxation enables gradient-based architecture search
- Cell-based search: Discover cells, then stack to form network
- EfficientNet: Compound scaling of depth, width, and resolution
- Once-for-All: Train one supernetwork, deploy many subnets
- Weight sharing reduces search cost from 1000+ GPU days to 1-4 GPU days
- Hardware-aware NAS optimizes for target latency/memory
- Discovered architectures often outperform human-designed ones
- Practical NAS: search on proxy tasks, transfer to target
- See also: MLOps for deployment tracking