NAS — Let AI Design Its Own Neural Networks

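Neural Architecture Search (NAS) replaces manual architecture engineering with algorithmic search and has discovered architectures that match or outperform human-designed networks on standard benchmarks. Methods such as DARTS search over operations and connectivity, while EfficientNet pairs a NAS-found baseline with a scaling rule for depth, width, and resolution under a given compute or hardware budget.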

Key point 1 — DARTS enables differentiable architecture search with gradient-based optimization
Key point 2 — EfficientNet's compound scaling jointly optimizes depth, width, and resolution
Key point 3 — Once-for-All networks train one supernetwork to deploy many subnets

"The future of architecture is architecture that designs itself."

Neural Architecture Search

NAS automates the design of neural network architectures, replacing manual engineering with algorithmic search. It has discovered architectures that outperform human-designed networks.

NAS Framework

Definition: NAS Components

NAS consists of three components:

Search space $\mathcal{A}$ : Set of possible architectures (operations, connectivity)
Search strategy: Algorithm to explore $\mathcal{A}$ (reinforcement learning, evolutionary, gradient-based)
Performance estimation: Evaluate architectures efficiently (proxy tasks, weight sharing)
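
The three components can be sketched as a toy random-search loop. Everything here is an illustrative assumption: the `SEARCH_SPACE` dictionary, the made-up `estimate_performance` proxy score, and random search itself (standing in for the RL, evolutionary, or gradient-based strategies) are hypothetical and do not come from any NAS library.

```python
import random

# Search space: one choice per architectural decision (hypothetical values).
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [16, 32, 64],
    "op": ["conv3x3", "sep_conv5x5", "max_pool"],
}


def sample_architecture(rng):
    """Search strategy (plain random search): pick one value per decision."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def estimate_performance(arch):
    """Performance estimation: a made-up proxy score instead of real training."""
    op_bonus = {"conv3x3": 0.02, "sep_conv5x5": 0.03, "max_pool": 0.0}[arch["op"]]
    return 0.5 + 0.03 * arch["depth"] + 0.002 * arch["width"] + op_bonus


def random_search(num_trials=20, seed=0):
    """Sample candidates from the search space and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_trials)]
    return max(candidates, key=estimate_performance)


best = random_search()
print(best, round(estimate_performance(best), 3))
```

Swapping `sample_architecture` for a smarter strategy, or `estimate_performance` for a cheaper or more faithful estimator, changes one component without touching the other two.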

Search Space

Definition: Cell-Based Search Space

Most modern NAS uses a cell-based search space:

Normal cell: Preserves spatial dimensions
Reduction cell: Halves spatial dimensions, doubles channels
Each cell is a DAG with $N$ nodes (typically 4-6)
Edges are operations (conv 3x3, sep conv 5x5, max pool, etc.)

A network is built by stacking these cells.

Cell Output

o_j = \sum_{i < j} \bar{o}^{(i,j)}(o_i)

Here,

$o_j$ = Output of intermediate node $j$
$\bar{o}^{(i,j)}(o_i)$ = Output of the operation on edge $(i,j)$, applied to the output of node $i$
$\sum_{i < j}$ = Sum over all incoming edges from predecessor nodes
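
As a worked example of the cell-output sum, the sketch below evaluates a tiny cell on scalar inputs. The function name `cell_forward` and the toy edge operations are hypothetical; real cells operate on feature maps and use the operations listed above.

```python
def cell_forward(inputs, edge_ops, num_intermediate):
    """Compute all node outputs of a cell DAG.

    inputs: outputs of the input nodes (here two scalars).
    edge_ops: dict mapping (i, j) to the callable applied on edge i -> j.
    """
    states = list(inputs)
    for j in range(len(inputs), len(inputs) + num_intermediate):
        # Node j sums the edge operations applied to every predecessor i < j.
        states.append(sum(edge_ops[(i, j)](states[i]) for i in range(j)))
    return states


# Toy operations standing in for conv / pooling / zero.
identity = lambda x: x
double = lambda x: 2 * x
zero = lambda x: 0.0

# 2 input nodes (0, 1) and 2 intermediate nodes (2, 3).
edge_ops = {(0, 2): identity, (1, 2): double,
            (0, 3): zero, (1, 3): identity, (2, 3): identity}

states = cell_forward([1.0, 3.0], edge_ops, num_intermediate=2)
print(states)  # [1.0, 3.0, 7.0, 10.0]

# Edge count for 2 input nodes and N intermediate nodes: 2 + 3 + ... + (N + 1).
num_edges = sum(range(2, 4 + 2))  # N = 4
print(num_edges)  # 14
```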

DARTS (Differentiable Architecture Search)

Definition: DARTS

DARTS (Liu et al., 2019) makes NAS differentiable by relaxing discrete architecture choices to continuous weights:

\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)

Architecture parameters $\alpha$ and network weights $w$ are optimized jointly via bilevel optimization.

DARTS Objective

\min_\alpha \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)

DARTS Continuous Relaxation

\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)

Here,

$\alpha_o^{(i,j)}$ = Architecture weight for operation $o$ on edge $(i,j)$
$\mathcal{O}$ = Set of candidate operations
$o(x)$ = Output of operation $o$ on input $x$
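
A small numeric sketch of the relaxation, using scalars instead of feature maps. The three candidate operations and the helper names `softmax` and `mixed_op` are assumptions chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """Softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Three toy candidates on a scalar input (hypothetical stand-ins).
ops = [lambda x: 0.0,      # 'none'
       lambda x: x,        # 'skip_connect'
       lambda x: 2.0 * x]  # stand-in for a learned convolution

# Equal alphas give a plain average of the candidates: (0 + 3 + 6) / 3.
uniform_out = mixed_op(3.0, [0.0, 0.0, 0.0], ops)

# A large alpha on one operation makes the mixture approach that operation.
peaked_out = mixed_op(3.0, [0.0, 0.0, 10.0], ops)

print(uniform_out, peaked_out)
```

Because the output is a smooth function of the alphas, gradients of a loss with respect to them are well defined, which is what makes the search differentiable.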

DARTS Training

DARTS alternates between:

Step 1: Update weights $w$ on training data (minimize $\mathcal{L}_{\text{train}}$ )
Step 2: Update architecture params $\alpha$ on validation data (minimize $\mathcal{L}_{\text{val}}$ )

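After search, the final architecture is derived by selecting the operation with the highest $\alpha$ on each edge. The original DARTS procedure additionally ignores the zero ("none") operation and keeps only the two strongest incoming edges per node.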
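
The alternation can be illustrated on a deliberately tiny problem with hand-written gradients. This is a sketch under strong assumptions, not the DARTS implementation: there is one edge with two candidates (a linear operation `w * x` and the zero operation), one training point and one validation point, and the architecture step uses the first-order approximation (treating `w` as fixed).

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mixed_output(x, w, alpha):
    """Mix a linear op (w * x) with the zero op.

    With two candidates and the zero op's alpha fixed at 0, the softmax
    weight of the linear op reduces to sigmoid(alpha).
    """
    return sigmoid(alpha) * w * x


def search(steps=200, lr_w=0.1, lr_alpha=0.1):
    x_train, y_train = 1.0, 2.0   # toy training example (target y = 2x)
    x_val, y_val = 2.0, 4.0       # toy validation example
    w, alpha = 0.5, 0.0
    for _ in range(steps):
        # Step 1: update the network weight w on the training loss.
        p = sigmoid(alpha)
        err = mixed_output(x_train, w, alpha) - y_train
        w -= lr_w * 2.0 * err * p * x_train
        # Step 2: update alpha on the validation loss, holding w fixed.
        p = sigmoid(alpha)
        err = mixed_output(x_val, w, alpha) - y_val
        alpha -= lr_alpha * 2.0 * err * w * x_val * p * (1.0 - p)
    val_loss = (mixed_output(x_val, w, alpha) - y_val) ** 2
    return w, alpha, val_loss


w, alpha, val_loss = search()
# Derivation step: keep the candidate with the larger architecture weight.
chosen = "linear" if alpha > 0 else "zero"
print(chosen, val_loss)
```

In a real search both steps are mini-batch updates through a framework's autograd, and the two data splits are disjoint halves of the training set.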

EfficientNet

Definition: EfficientNet

EfficientNet (Tan and Le, 2019) uses compound scaling to jointly scale depth, width, and resolution:

d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$

$\alpha$ : depth coefficient
$\beta$ : width coefficient
$\gamma$ : resolution coefficient
$\phi$ : compound coefficient controlling overall scale

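The baseline network (EfficientNet-B0) is found via NAS; $\alpha$, $\beta$, $\gamma$ are then fixed by a small grid search, and larger models are obtained by increasing $\phi$.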

Compound Scaling

d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi \quad \text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2

Here,

$d$ = Depth multiplier (scales the number of layers)
$w$ = Width multiplier (scales the number of channels)
$r$ = Resolution multiplier (scales the input image size)
$\phi$ = Compound coefficient
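
The arithmetic behind the constraint can be checked directly. The sketch below assumes the coefficients reported in the EfficientNet paper (1.2, 1.1, 1.15) and the usual approximation that FLOPs grow with depth times width squared times resolution squared; the function names are hypothetical.

```python
# Coefficients reported for EfficientNet (grid-searched at phi = 1).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi):
    """Approximate FLOPs growth: d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


# The constraint value alpha * beta^2 * gamma^2 is close to 2 (about 1.92),
# so each unit increase of phi roughly doubles the FLOPs.
print(round(flops_factor(1), 3))

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(flops_factor(phi), 2))
```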

Once-for-All (OFA)

Definition: Once-for-All Network

OFA (Cai et al., 2020) trains a single supernetwork whose weights are shared by a very large family of sub-networks:

Progressive shrinking: Train the full network, then progressively fine-tune smaller subnets
Elastic depth/width/kernel: Support flexible depth, width, and kernel size
Deployment-specific search: Find optimal subnet for target hardware without retraining

This eliminates the need to train a separate network for each deployment target; only a lightweight subnet search is run per target.
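
The deployment-specific search step can be sketched as picking the best subnet configuration under a latency budget. The elastic choices, the latency model, and the accuracy predictor below are all made-up placeholders; in practice both predictors would be fitted to measurements of subnets taken from the trained supernetwork.

```python
import itertools

# Elastic dimensions of a hypothetical supernetwork.
DEPTHS = [2, 3, 4]
WIDTH_RATIOS = [3, 4, 6]
KERNELS = [3, 5, 7]


def predicted_latency_ms(depth, width, kernel):
    """Made-up latency model; a real one is measured on the target device."""
    return 0.5 * depth * width * (kernel / 3.0)


def predicted_accuracy(depth, width, kernel):
    """Made-up accuracy predictor; larger subnets score higher."""
    return 0.60 + 0.04 * depth + 0.02 * width + 0.005 * kernel


def best_subnet(latency_budget_ms):
    """Exhaustively pick the most accurate subnet that meets the budget."""
    feasible = [cfg for cfg in itertools.product(DEPTHS, WIDTH_RATIOS, KERNELS)
                if predicted_latency_ms(*cfg) <= latency_budget_ms]
    if not feasible:
        return None  # no subnet is fast enough for this budget
    return max(feasible, key=lambda cfg: predicted_accuracy(*cfg))


for budget in (5.0, 10.0, 30.0):
    print(budget, best_subnet(budget))
```

No training happens inside `best_subnet`: each candidate only needs a predictor lookup, which is why the per-device cost is small compared with training the supernetwork.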

Search Strategies

Definition: Reinforcement Learning (NASNet)

Use an RNN controller to generate architecture descriptions:

Controller predicts architecture as a sequence of decisions
Train architecture, evaluate on validation set
Use accuracy as reward to update controller via REINFORCE

Pros: Flexible search space. Cons: Very expensive (thousands of GPU days).
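
The REINFORCE update can be sketched on a heavily simplified controller. Assumptions: the controller is a single softmax over one decision rather than an RNN over a sequence, and the reward table replaces "train the architecture and measure validation accuracy". All names and numbers are hypothetical.

```python
import math
import random

OPS = ["conv3x3", "sep_conv5x5", "max_pool"]
# Made-up validation accuracies standing in for "train and evaluate".
REWARD = {"conv3x3": 0.5, "sep_conv5x5": 0.9, "max_pool": 0.2}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_controller(steps=200, batch_size=32, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)  # controller parameters
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a batch of architectures from the current policy.
        actions = rng.choices(range(len(OPS)), weights=probs, k=batch_size)
        rewards = [REWARD[OPS[a]] for a in actions]
        baseline = sum(rewards) / batch_size  # reduces gradient variance
        grad = [0.0] * len(OPS)
        for a, r in zip(actions, rewards):
            for k in range(len(OPS)):
                # d log pi(a) / d logit_k = 1[a == k] - probs[k]
                indicator = 1.0 if a == k else 0.0
                grad[k] += (r - baseline) * (indicator - probs[k])
        # Gradient ascent on expected reward.
        logits = [z + lr * g / batch_size for z, g in zip(logits, grad)]
    return softmax(logits)


probs = train_controller()
print({op: round(p, 3) for op, p in zip(OPS, probs)})
```

The cost noted above comes from the reward: in real RL-based NAS every sampled architecture must be trained before its accuracy is known, whereas here it is a dictionary lookup.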

Definition: Evolutionary Search (AmoebaNet)

Use genetic algorithms to evolve architectures:

Population of architectures
Tournament selection based on fitness (validation accuracy)
Mutation: change operations, connectivity
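Crossover: combine architectures (optional; AmoebaNet's regularized evolution relies on mutation only and removes the oldest individual each cycle)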

Pros: Parallelizable. Cons: Still expensive.
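
A minimal sketch of tournament selection with mutation and aging, in the spirit of regularized evolution. The fitness function is a made-up stand-in for validation accuracy (agreement with a hidden target cell), crossover is omitted, and all names are hypothetical.

```python
import collections
import random

OPS = ["conv3x3", "sep_conv5x5", "max_pool", "skip"]
NUM_EDGES = 6
# Hidden "ideal" cell, used only to define a made-up fitness function.
TARGET = ["sep_conv5x5", "conv3x3", "skip", "sep_conv5x5", "max_pool", "conv3x3"]


def fitness(arch):
    """Stand-in for validation accuracy: fraction of edges matching TARGET."""
    return sum(a == t for a, t in zip(arch, TARGET)) / NUM_EDGES


def mutate(arch, rng):
    """Change the operation on one randomly chosen edge."""
    child = list(arch)
    edge = rng.randrange(NUM_EDGES)
    child[edge] = rng.choice([op for op in OPS if op != child[edge]])
    return child


def evolve(cycles=300, population_size=20, sample_size=5, seed=0):
    rng = random.Random(seed)
    population = collections.deque(
        [rng.choice(OPS) for _ in range(NUM_EDGES)] for _ in range(population_size)
    )
    best = max(population, key=fitness)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample is the parent.
        parent = max(rng.sample(list(population), sample_size), key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # aging: discard the oldest individual
        if fitness(child) > fitness(best):
            best = child
    return best


best = evolve()
print(best, fitness(best))
```

Each child's fitness evaluation is independent of the others, which is why this family of methods parallelizes well across workers.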

PyTorch Implementation

Example: DARTS Implementation

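import torch
import torch.nn as nn
import torch.nn.functional as F


class Zero(nn.Module):
    """'none' operation: returns zeros, i.e. no connection on this edge."""
    def forward(self, x):
        return torch.zeros_like(x)


class SepConv(nn.Sequential):
    """Depthwise-separable conv: ReLU -> depthwise -> pointwise -> BatchNorm.

    Simplified: the original DARTS SepConv applies this block twice.
    """
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation=1):
        super().__init__(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out, affine=False),
        )


class DilConv(SepConv):
    """Dilated separable conv; same layout, called with an explicit dilation."""


# Candidate operations, each built for C channels with stride 1
OPS = {
    'none': lambda C: Zero(),
    'skip_connect': lambda C: nn.Identity(),
    'sep_conv_3x3': lambda C: SepConv(C, C, 3, 1, 1),
    'sep_conv_5x5': lambda C: SepConv(C, C, 5, 1, 2),
    'dil_conv_3x3': lambda C: DilConv(C, C, 3, 1, 2, 2),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, 1, 1),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, 1, 1),
}


class MixedOp(nn.Module):
    """Softmax-weighted mixture of all candidate operations on one edge."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in OPS])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class DARTSCell(nn.Module):
    """Normal cell. Architecture weights are passed in so all cells share them."""
    def __init__(self, C, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleList()

        for j in range(2, num_nodes + 2):  # nodes 0 and 1 are the cell inputs
            for i in range(j):
                self.ops.append(MixedOp(C))

    def forward(self, s0, s1, weights):
        states = [s0, s1]
        edge = 0

        for j in range(2, self.num_nodes + 2):
            node_inputs = []
            for i in range(j):
                node_inputs.append(self.ops[edge](states[i], weights[edge]))
                edge += 1
            states.append(sum(node_inputs))

        # Simplification: return the last node. The original DARTS cell
        # concatenates all intermediate nodes instead.
        return states[-1]


class DARTS(nn.Module):
    def __init__(self, C=16, num_classes=10, layers=8, num_nodes=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, 1, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList([
            DARTSCell(C, num_nodes) for _ in range(layers)
        ])
        self.classifier = nn.Linear(C, num_classes)

        # Architecture parameters, shared by every cell. This simplified model
        # stacks normal cells only; a full implementation would add reduction
        # cells with their own alphas_reduce.
        num_edges = sum(range(2, num_nodes + 2))
        self.alphas_normal = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )

    def forward(self, x):
        weights = F.softmax(self.alphas_normal, dim=-1)  # one row per edge
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, weights)
        out = F.adaptive_avg_pool2d(s1, 1).view(s1.size(0), -1)
        return self.classifier(out)


def derive_architecture(model):
    """Return the name of the strongest non-'none' operation on each edge."""
    names = list(OPS)
    alphas = model.alphas_normal.detach().clone()
    alphas[:, names.index('none')] = float('-inf')  # never select 'none'
    return [names[k] for k in alphas.argmax(dim=1).tolist()]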

Practical Considerations

NAS Best Practices

Search on proxy dataset: Use CIFAR-10 for search, transfer to ImageNet
Use weight sharing: Train supernetwork once, evaluate subnets by weight inheritance
Set GPU budget: DARTS: ~1-4 GPU days, RL: ~1000-2000 GPU days
Regularize search: Add latency constraint for hardware-aware NAS
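Avoid degenerate solutions: DARTS may collapse to cells dominated by skip connections; mitigate this by stopping the search early or by regularizing skip operations (for example, dropout on skip connections that is decayed during the search)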
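
One common way to add a latency constraint to a differentiable search is to penalize the expected latency of the relaxed cell. The sketch below assumes a hypothetical per-operation latency table and a penalty weight `lam`; it shows only how the penalty is computed, not a full search.

```python
import math

OP_NAMES = ["skip_connect", "max_pool_3x3", "sep_conv_3x3", "sep_conv_5x5"]
# Hypothetical per-operation latencies in ms (measured on device in practice).
OP_LATENCY_MS = [0.0, 1.0, 3.0, 5.0]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(alphas_per_edge):
    """Sum over edges of the softmax-weighted latency of the candidates."""
    return sum(
        sum(p * lat for p, lat in zip(softmax(alphas), OP_LATENCY_MS))
        for alphas in alphas_per_edge
    )


def search_loss(val_loss, alphas_per_edge, lam=0.01):
    """Validation loss plus a latency penalty weighted by lam."""
    return val_loss + lam * expected_latency(alphas_per_edge)


uniform = [[0.0] * 4, [0.0] * 4]            # two edges, no preference yet
prefers_skip = [[5.0, 0.0, 0.0, 0.0]] * 2   # both edges favour the cheap op

print(expected_latency(uniform))       # 2 edges * mean latency 2.25 = 4.5
print(expected_latency(prefers_skip))  # lower, since skip_connect costs 0 ms
print(search_loss(1.0, uniform))
```

Because the penalty rewards cheap operations, it can push the search further toward skip connections, so `lam` is usually tuned together with whatever collapse mitigation is in use.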

Practice Exercises

DARTS on CIFAR-10: Implement and run DARTS search. Visualize the discovered architecture.
Compound scaling: Reproduce EfficientNet scaling experiments. Plot accuracy vs. FLOPs.
Hardware-aware NAS: Add latency objective to DARTS. Find Pareto-optimal architectures.
Once-for-All: Train OFA supernetwork on MNIST. Extract subnets for different FLOP budgets.

Key Takeaways

Summary: Neural Architecture Search

NAS automates architecture design with search space, strategy, and estimation
DARTS: Differentiable relaxation enables gradient-based architecture search
Cell-based search: Discover cells, then stack to form network
EfficientNet: Compound scaling of depth, width, and resolution
Once-for-All: Train one supernetwork, deploy many subnets
Weight sharing reduces search cost from 1000+ GPU days to 1-4 GPU days
Hardware-aware NAS optimizes for target latency/memory
Discovered architectures often outperform human-designed ones
Practical NAS: search on proxy tasks, transfer to target
See also: MLOps for deployment tracking
