Neural Architecture Search — Automated ML

NAS — Let AI Design Its Own Neural Networks

Neural Architecture Search (NAS) replaces manual architecture engineering with algorithmic search and has produced architectures that match or outperform human-designed networks. Methods ranging from DARTS to the search behind EfficientNet choose depth, width, and connectivity patterns for a given task and hardware constraint.

  • Key point 1 — DARTS enables differentiable architecture search with gradient-based optimization
  • Key point 2 — EfficientNet's compound scaling jointly optimizes depth, width, and resolution
  • Key point 3 — Once-for-All networks train one supernetwork to deploy many subnets

"The future of architecture is architecture that designs itself."

Neural Architecture Search

NAS automates the design of neural network architectures, replacing manual engineering with algorithmic search. It has discovered architectures that outperform human-designed networks.


NAS Framework

Definition: NAS Components

NAS consists of three components:

  1. Search space $\mathcal{A}$: Set of possible architectures (operations, connectivity)
  2. Search strategy: Algorithm to explore $\mathcal{A}$ (reinforcement learning, evolutionary, gradient-based)
  3. Performance estimation: Evaluate architectures efficiently (proxy tasks, weight sharing)
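
The three components above can be sketched as a plain random-search loop. This is a minimal illustration, not a method from the lesson: the search space, the `estimate_performance` scoring rule, and all names are hypothetical stand-ins for real training on a proxy task.

```python
import random

# Search space: hypothetical toy choices, not taken from any specific paper.
SEARCH_SPACE = {
    "depth": [2, 4, 6, 8],
    "width": [16, 32, 64],
    "op": ["conv3x3", "sep_conv3x3", "max_pool3x3"],
}


def sample_architecture(rng):
    """Search strategy (here: uniform random sampling from the space)."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def estimate_performance(arch):
    """Performance estimation stand-in.

    A real system would train `arch` briefly on a proxy task. This made-up
    score rewards capacity and penalizes pooling-only architectures.
    """
    score = 0.5 + 0.03 * arch["depth"] + 0.002 * arch["width"]
    if arch["op"] == "max_pool3x3":
        score -= 0.2
    return score


def random_search(num_trials=50, seed=0):
    """Sample, score, and keep the best architecture seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search()
print(best_arch, round(best_score, 3))
```

Swapping `sample_architecture` for an RL controller, an evolutionary step, or a gradient update changes the search strategy while the rest of the loop stays the same.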

Figure: Neural Architecture Search (NAS) framework. The search space $\mathcal{A}$ holds candidate architectures (operations such as Conv, Pool, FC; connectivity patterns; hyperparameters; depth and width choices). The search strategy explores it (reinforcement learning, evolutionary algorithms, gradient-based methods such as DARTS, Bayesian optimization). Performance estimation scores candidates (proxy tasks on small data, weight sharing, early stopping, learning-curve prediction). The loop iterates until the best architecture $\alpha^*$ yields the final model.

Search Space

Definition: Cell-Based Search Space

Most modern NAS uses a cell-based search space:

  • Normal cell: Preserves spatial dimensions
  • Reduction cell: Halves spatial dimensions, doubles channels
  • Each cell is a DAG with $N$ nodes (typically 4-6)
  • Edges are operations (conv 3x3, sep conv 5x5, max pool, etc.)

A network is built by stacking these cells.

Cell Output

$$o_j = \sum_{i < j} \bar{o}_i^{(i,j)}$$

Here,

  • $o_j$ = Output of intermediate node $j$
  • $\bar{o}_i^{(i,j)}$ = Output of the operation on edge $(i,j)$, applied to the output of node $i$
  • $\sum$ = Sum over all incoming edges
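
The cell-output rule can be traced with scalars. The following is a small sketch under assumed names: `double`, `identity`, and `negate` are hypothetical stand-ins for convolution or pooling layers, and nodes 0 and 1 play the role of the two cell inputs.

```python
# Hypothetical scalar "operations" standing in for conv/pool layers.
def double(x):
    return 2.0 * x


def identity(x):
    return x


def negate(x):
    return -x


def cell_forward(inputs, edge_ops, num_intermediate):
    """Compute o_j as the sum over i < j of the edge operation applied to o_i."""
    states = list(inputs)  # input nodes come first
    for j in range(len(inputs), len(inputs) + num_intermediate):
        states.append(sum(edge_ops[(i, j)](states[i]) for i in range(j)))
    return states


# One operation per edge (i, j) of the DAG, with i < j.
edge_ops = {
    (0, 2): double, (1, 2): identity,
    (0, 3): identity, (1, 3): negate, (2, 3): double,
}
states = cell_forward([1.0, 3.0], edge_ops, num_intermediate=2)
print(states)  # [1.0, 3.0, 5.0, 8.0]
```

Node 2 receives `double(1.0) + identity(3.0) = 5.0`, and node 3 receives `1.0 - 3.0 + double(5.0) = 8.0`.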

DARTS (Differentiable Architecture Search)

Definition: DARTS

DARTS (Liu et al., 2019) makes NAS differentiable by relaxing discrete architecture choices to continuous weights:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)$$

Architecture parameters $\alpha$ and network weights $w$ are optimized jointly via bilevel optimization.

DARTS Objective
$$\min_\alpha \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)$$
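
Solving the inner problem exactly for every value of $\alpha$ is impractical. The DARTS paper approximates $w^*(\alpha)$ with a single training step from the current weights. The step size $\xi$ below is the symbol used in that paper and does not appear elsewhere in this lesson:

```latex
\nabla_\alpha \mathcal{L}_{\text{val}}\bigl(w^*(\alpha), \alpha\bigr)
\approx
\nabla_\alpha \mathcal{L}_{\text{val}}\bigl(w - \xi \nabla_w \mathcal{L}_{\text{train}}(w, \alpha),\ \alpha\bigr)
```

Setting $\xi = 0$ gives the cheaper first-order approximation, which treats the current $w$ as if it were $w^*(\alpha)$.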

DARTS Continuous Relaxation

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)$$

Here,

  • $\alpha_o^{(i,j)}$ = Architecture weight for operation $o$ on edge $(i,j)$
  • $\mathcal{O}$ = Set of candidate operations
  • $o(x)$ = Output of operation $o$ on input $x$
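
To make the relaxation concrete, here is a minimal scalar sketch of one mixed edge. The three candidate operations are hypothetical stand-ins for the set $\mathcal{O}$, and `mixed_op` is an illustrative name, not part of any library.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Hypothetical scalar candidates: zero, identity, and scaling by 2.
ops = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]

uniform_out = mixed_op(3.0, [0.0, 0.0, 0.0], ops)   # equal weights: average of 0, 3, 6
peaked_out = mixed_op(3.0, [0.0, 0.0, 20.0], ops)   # dominated by the last op
print(uniform_out, peaked_out)
```

With equal $\alpha$ the edge outputs the average of all candidates. As one $\alpha$ grows, the edge output approaches that single operation, which is what makes the final discrete choice meaningful.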

DARTS Training

DARTS alternates between:

  1. Step 1: Update weights $w$ on training data (minimize $\mathcal{L}_{\text{train}}$)
  2. Step 2: Update architecture params $\alpha$ on validation data (minimize $\mathcal{L}_{\text{val}}$)
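
The alternation can be seen on a toy problem small enough to differentiate by hand. This is a sketch of the first-order variant only, with assumed data ($y = 2x$), assumed learning rates, and a single mixed edge choosing between a linear operation `w * x` and a zero operation. All names are hypothetical.

```python
import math


def p_linear(alpha):
    """Softmax weight of the linear op when competing with the zero op."""
    return 1.0 / (1.0 + math.exp(alpha[1] - alpha[0]))


def predict(x, w, alpha):
    # Mixed op over two candidates: linear (w * x) and zero (always 0).
    return p_linear(alpha) * w * x


def grad_w(data, w, alpha):
    """Gradient of the mean squared error with respect to the weight w."""
    p = p_linear(alpha)
    return sum(2 * (predict(x, w, alpha) - y) * p * x for x, y in data) / len(data)


def grad_alpha(data, w, alpha):
    """Gradient of the mean squared error with respect to both alphas."""
    p = p_linear(alpha)
    g = sum(2 * (predict(x, w, alpha) - y) * w * x * p * (1 - p)
            for x, y in data) / len(data)
    return [g, -g]  # [d/d alpha_linear, d/d alpha_zero]


train = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]  # hypothetical training split
val = [(x, 2.0 * x) for x in (1.5, 2.5)]         # hypothetical validation split

w, alpha = 0.0, [0.0, 0.0]
lr_w, lr_alpha = 0.02, 0.02
for step in range(300):
    # Step 1: update weights w on the training split.
    w -= lr_w * grad_w(train, w, alpha)
    # Step 2: update architecture parameters alpha on the validation split.
    ga = grad_alpha(val, w, alpha)
    alpha = [a - lr_alpha * g for a, g in zip(alpha, ga)]

print(alpha, w)
```

Because the zero operation cannot fit the data, the validation gradient pushes $\alpha$ toward the linear operation while $w$ is fitted on the training split.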

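After search, the final architecture is derived by selecting the operation with the highest $\alpha$ on each edge. The original DARTS procedure additionally ignores the zero ('none') operation and keeps only the two strongest incoming edges for each node.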
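
A sketch of this discretization step in plain Python follows. It assumes the rule from the DARTS paper (skip 'none', rank edges by the softmax weight of their best remaining operation, keep the top two per node). The operation list and the `alphas` values are made up for illustration.

```python
import math

OP_NAMES = ["none", "skip_connect", "sep_conv_3x3", "max_pool_3x3"]  # illustrative subset


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def derive_cell(alphas, num_nodes, top_k=2):
    """Return {node j: [(input node i, op name), ...]} for the discrete cell.

    alphas maps each edge (i, j) to a list of scores, one per OP_NAMES entry.
    """
    none_idx = OP_NAMES.index("none")
    genotype = {}
    for j in range(2, num_nodes + 2):  # nodes 0 and 1 are the cell inputs
        candidates = []
        for i in range(j):
            probs = softmax(alphas[(i, j)])
            # Best operation on this edge, ignoring 'none'.
            best = max((k for k in range(len(probs)) if k != none_idx),
                       key=lambda k: probs[k])
            candidates.append((probs[best], i, OP_NAMES[best]))
        candidates.sort(reverse=True)  # strongest edges first
        genotype[j] = [(i, op) for _, i, op in candidates[:top_k]]
    return genotype


alphas = {
    (0, 2): [0.1, 0.2, 1.5, 0.3],
    (1, 2): [2.0, 0.1, 0.2, 0.9],
    (0, 3): [0.0, 1.0, 0.2, 0.1],
    (1, 3): [0.0, 0.1, 0.1, 2.0],
    (2, 3): [3.0, 0.5, 0.4, 0.3],
}
genotype = derive_cell(alphas, num_nodes=2)
print(genotype)
```

Edge (2, 3) is dropped for node 3 because most of its softmax mass sits on 'none', so its best real operation is weaker than those on the other two incoming edges.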


EfficientNet

Definition: EfficientNet

EfficientNet (Tan and Le, 2019) uses compound scaling to jointly scale depth, width, and resolution:

$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$

  • $\alpha$: depth coefficient
  • $\beta$: width coefficient
  • $\gamma$: resolution coefficient
  • $\phi$: compound coefficient controlling overall scale

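The baseline network (EfficientNet-B0) is found via NAS. The coefficients $\alpha$, $\beta$, $\gamma$ are then fixed by a small grid search on that baseline, and larger models are obtained by increasing $\phi$.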

Compound Scaling

$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi \quad \text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

Here,

  • $d$ = Depth (number of layers)
  • $w$ = Width (number of channels)
  • $r$ = Resolution (input image size)
  • $\phi$ = Compound coefficient
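
As a worked example, the snippet below plugs in the coefficients reported in the EfficientNet paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) and a 224-pixel baseline resolution. The function name and the idea of returning raw multipliers are assumptions for illustration. Released model variants round or adjust these numbers.

```python
# Coefficients reported for the EfficientNet-B0 baseline in Tan and Le (2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Return (depth multiplier, width multiplier, resolution) for a given phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


# FLOPs grow roughly with depth * width^2 * resolution^2, so each unit of phi
# multiplies the cost by about this factor.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 3))  # 1.92

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r))
```

Since the factor is close to 2, increasing $\phi$ by one roughly doubles the FLOPs budget, which is the point of the constraint.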

Once-for-All (OFA)

Definition: Once-for-All Network

OFA (Cai et al., 2020) trains a single supernetwork that contains all possible sub-networks:

  1. Progressive shrinking: Train the full network, then progressively fine-tune smaller subnets
  2. Elastic depth/width/kernel: Support flexible depth, width, and kernel size
  3. Deployment-specific search: Find optimal subnet for target hardware without retraining

This removes the need to retrain a network for each deployment target: only a lightweight search over subnets of the trained supernetwork is required.
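
The deployment-specific search can be pictured as picking the best subnet configuration under a cost budget. In this sketch the elastic dimensions, the cost formula, and the accuracy predictor are all made-up stand-ins. A real OFA setup would use measured latency and a learned accuracy predictor.

```python
import itertools

# Hypothetical elastic dimensions of the supernetwork.
DEPTHS = [2, 3, 4]
WIDTH_MULTS = [0.5, 0.75, 1.0]
KERNELS = [3, 5, 7]


def estimated_cost(depth, width, kernel):
    """Made-up cost model standing in for a latency or FLOPs lookup table."""
    return depth * (width ** 2) * (kernel ** 2)


def predicted_accuracy(depth, width, kernel):
    """Made-up accuracy predictor: larger subnets score higher."""
    return 0.60 + 0.05 * depth + 0.10 * width + 0.01 * kernel


def best_subnet(budget):
    """Exhaustively pick the most accurate subnet whose cost fits the budget."""
    feasible = [
        cfg for cfg in itertools.product(DEPTHS, WIDTH_MULTS, KERNELS)
        if estimated_cost(*cfg) <= budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: predicted_accuracy(*cfg))


for budget in (10, 50, 200):
    print(budget, best_subnet(budget))
```

No training happens inside `best_subnet`: each configuration would reuse weights from the shared supernetwork, so only the selection differs per device.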


Search Strategies

Definition: Reinforcement Learning (NASNet)

Use an RNN controller to generate architecture descriptions:

  • Controller predicts architecture as a sequence of decisions
  • Train architecture, evaluate on validation set
  • Use accuracy as reward to update controller via REINFORCE

Pros: Flexible search space. Cons: Very expensive (roughly 1000-2000 GPU days).
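
The REINFORCE update can be illustrated with a much simpler controller than an RNN. The sketch below assumes a single architectural decision, a softmax policy over three operations, and a fixed reward table in place of actually training each sampled network. Every name and number is hypothetical.

```python
import math
import random

OPS = ["conv3x3", "sep_conv3x3", "max_pool3x3"]
# Hypothetical validation accuracies standing in for "train, then evaluate".
REWARD = {"conv3x3": 0.70, "sep_conv3x3": 0.90, "max_pool3x3": 0.40}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_controller(steps=5000, lr=0.1, seed=0):
    """Train the controller logits with REINFORCE and a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Controller samples an architecture decision.
        action = rng.choices(range(len(OPS)), weights=probs)[0]
        reward = REWARD[OPS[action]]
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - p_k.
        for k in range(len(OPS)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
    return softmax(logits)


probs = train_controller()
print(dict(zip(OPS, (round(p, 3) for p in probs))))
```

In a real RL-based search each reward requires training a network, which is where the large GPU cost comes from.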

Definition: Evolutionary Search (AmoebaNet)

Use genetic algorithms to evolve architectures:

  • Population of architectures
  • Tournament selection based on fitness (validation accuracy)
  • Mutation: change operations, connectivity
  • Crossover: combine architectures (optional; AmoebaNet itself uses mutation only)

Pros: Parallelizable. Cons: Still expensive.
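
The loop below sketches aging evolution, the variant used for AmoebaNet, with tournament selection and mutation only. The architecture encoding (one operation per edge) and the `fitness` function are made-up stand-ins for training and validating each candidate.

```python
import collections
import random

OPS = ["conv3x3", "sep_conv3x3", "max_pool3x3", "skip_connect"]
NUM_EDGES = 6


def fitness(arch):
    """Made-up stand-in for validation accuracy: fraction of sep_conv3x3 edges."""
    return sum(op == "sep_conv3x3" for op in arch) / len(arch)


def mutate(arch, rng):
    """Change the operation on one randomly chosen edge."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(OPS)
    return tuple(child)


def evolve(population_size=20, sample_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = collections.deque(
        tuple(rng.choice(OPS) for _ in range(NUM_EDGES))
        for _ in range(population_size)
    )
    best = max(population, key=fitness)
    initial_best = fitness(best)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        parent = max(rng.sample(list(population), sample_size), key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # aging: discard the oldest individual
        if fitness(child) > fitness(best):
            best = child
    return best, initial_best


best, initial_best = evolve()
print(best, fitness(best), initial_best)
```

Each cycle evaluates one child independently of the others, which is why this family of methods parallelizes well across workers.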


PyTorch Implementation

Example: DARTS Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

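# Helper operations referenced by OPS below. These are simplified sketches;
# the original DARTS code applies each separable convolution twice.
class Zero(nn.Module):
    """'none' operation: outputs zeros with the same shape as the input."""
    def __init__(self, C):
        super().__init__()

    def forward(self, x):
        return torch.zeros_like(x)


class SepConv(nn.Module):
    """Depthwise-separable convolution: ReLU -> depthwise -> pointwise -> BN."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
        )

    def forward(self, x):
        return self.op(x)


class DilConv(nn.Module):
    """Dilated depthwise-separable convolution."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
        )

    def forward(self, x):
        return self.op(x)


# Candidate operations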
OPS = {
    'none': lambda C: Zero(C),
    'skip_connect': lambda C: nn.Identity(),
    'sep_conv_3x3': lambda C: SepConv(C, C, 3, 1, 1),
    'sep_conv_5x5': lambda C: SepConv(C, C, 5, 1, 2),
    'dil_conv_3x3': lambda C: DilConv(C, C, 3, 1, 2, 2),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, 1, 1),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, 1, 1),
}


class MixedOp(nn.Module):
    """Mixed operation with architecture weights."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in OPS])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class DARTSCell(nn.Module):
    """DARTS cell with learnable architecture parameters."""
    def __init__(self, C, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleList()

        for j in range(2, num_nodes + 2):  # input nodes = 0, 1
            for i in range(j):
                self.ops.append(MixedOp(C))

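        # Architecture parameters (one row per edge, one column per candidate op).
        # Simplification: each cell owns its own alphas here, whereas the original
        # DARTS shares them across cells. alphas_reduce is unused because this
        # example builds normal cells only.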
        num_edges = sum(range(2, num_nodes + 2))
        self.alphas_normal = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )
        self.alphas_reduce = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )

    def forward(self, s0, s1):
        states = [s0, s1]
        edges = 0

        for j in range(2, self.num_nodes + 2):
            node_inputs = []
            for i in range(j):
                weights = F.softmax(self.alphas_normal[edges], dim=0)
                node_inputs.append(self.ops[edges](states[i], weights))
                edges += 1
            states.append(sum(node_inputs))

        return states[-1]


class DARTS(nn.Module):
    def __init__(self, C=16, num_classes=10, layers=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, 1, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList([
            DARTSCell(C) for _ in range(layers)
        ])
        self.classifier = nn.Linear(C, num_classes)

    def forward(self, x):
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1)
        out = F.adaptive_avg_pool2d(s1, 1).view(s1.size(0), -1)
        return self.classifier(out)


def derive_architecture(model):
    """Extract final architecture from trained DARTS model."""
    alphas = F.softmax(model.cells[0].alphas_normal, dim=1)
    ops_per_edge = alphas.argmax(dim=1)
    return ops_per_edge

Practical Considerations

NAS Best Practices

  1. Search on proxy dataset: Use CIFAR-10 for search, transfer to ImageNet
  2. Use weight sharing: Train supernetwork once, evaluate subnets by weight inheritance
  3. Set GPU budget: DARTS: ~1-4 GPU days, RL: ~1000-2000 GPU days
  4. Regularize search: Add latency constraint for hardware-aware NAS
  5. Avoid degenerate solutions: DARTS may collapse to skip connections — use decay on skip ops
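
One common way to express the latency constraint from item 4 is to add the expected latency under the softmax over $\alpha$ to the validation loss. The sketch below assumes a per-operation latency table and a penalty weight `lam`. Both are hypothetical values, not measurements.

```python
import math

# Hypothetical per-operation latencies in milliseconds on some target device.
LATENCY_MS = {"none": 0.0, "skip_connect": 0.1, "sep_conv_3x3": 1.8, "max_pool_3x3": 0.4}
OP_NAMES = list(LATENCY_MS)


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(alphas_per_edge):
    """Sum over edges of the softmax(alpha)-weighted operation latency."""
    total = 0.0
    for alphas in alphas_per_edge:
        weights = softmax(alphas)
        total += sum(w * LATENCY_MS[name] for w, name in zip(weights, OP_NAMES))
    return total


def hardware_aware_loss(val_loss, alphas_per_edge, lam=0.1):
    """Validation loss plus a latency penalty weighted by lam (assumed value)."""
    return val_loss + lam * expected_latency(alphas_per_edge)


uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
conv_heavy = [[0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
print(expected_latency(uniform), expected_latency(conv_heavy))
print(hardware_aware_loss(1.0, uniform))
```

Because the penalty is a smooth function of $\alpha$, it can be minimized with the same gradient step that updates the architecture parameters.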

Practice Exercises

  1. DARTS on CIFAR-10: Implement and run DARTS search. Visualize the discovered architecture.

  2. Compound scaling: Reproduce EfficientNet scaling experiments. Plot accuracy vs. FLOPs.

  3. Hardware-aware NAS: Add latency objective to DARTS. Find Pareto-optimal architectures.

  4. Once-for-All: Train OFA supernetwork on MNIST. Extract subnets for different FLOP budgets.


Key Takeaways

Summary: Neural Architecture Search

  • NAS automates architecture design with search space, strategy, and estimation
  • DARTS: Differentiable relaxation enables gradient-based architecture search
  • Cell-based search: Discover cells, then stack to form network
  • EfficientNet: Compound scaling of depth, width, and resolution
  • Once-for-All: Train one supernetwork, deploy many subnets
  • Weight sharing reduces search cost from 1000+ GPU days to 1-4 GPU days
  • Hardware-aware NAS optimizes for target latency/memory
  • Discovered architectures often outperform human-designed ones
  • Practical NAS: search on proxy tasks, transfer to target
  • See also: MLOps for deployment tracking
