Neural Network Configuration

Network Architecture Visualization

(Interactive panel: select a network architecture to visualize.)

Training Progress

(Live dashboard: current epoch, training accuracy, training loss, validation accuracy, and training status, updated as training runs.)

Network Explanation

Architecture
Forward Pass
Backpropagation
Optimization

Architecture

The neural network architecture defines how data flows through the network. Different architectures are suitable for different types of data:

Feedforward Neural Network (FFN)

A basic architecture where information moves in one direction from input to output through hidden layers. Each neuron in a layer connects to all neurons in the next layer.

a^{(l)} = σ(W^{(l)} a^{(l-1)} + b^{(l)})

Where σ is the activation function, W are weights, b are biases, and a are activations.
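
To make the layer equation concrete, here is a minimal NumPy sketch of one fully connected layer. The layer sizes, the random initialization, and the use of ReLU as the activation are illustrative choices, not something the page prescribes.

    import numpy as np

    def dense_layer(a_prev, W, b, activation):
        """One feedforward layer: a = activation(W @ a_prev + b)."""
        z = W @ a_prev + b      # linear transformation z = W a + b
        return activation(z)    # elementwise non-linearity

    def relu(z):
        return np.maximum(0.0, z)

    rng = np.random.default_rng(0)
    a0 = rng.normal(size=(4, 1))          # input column vector with 4 features
    W1 = 0.1 * rng.normal(size=(3, 4))    # weights of a 4 -> 3 layer
    b1 = np.zeros((3, 1))                 # biases
    a1 = dense_layer(a0, W1, b1, relu)    # hidden-layer activations
    print(a1.shape)                       # (3, 1)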

Convolutional Neural Network (CNN)

Specialized for grid-like data (images). Uses convolutional layers that preserve spatial relationships through local receptive fields and shared weights.

(I * K)(i, j) = ∑_{m,n} I(i+m, j+n) K(m, n)

Where I is the input, K is the kernel, and * is the convolution operation.
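
A direct, naive translation of the sum above into NumPy (no padding or stride, and following the usual deep-learning convention of not flipping the kernel); the toy image and kernel are made up for illustration.

    import numpy as np

    def conv2d(I, K):
        """Valid convolution: out(i, j) = sum over m, n of I(i+m, j+n) * K(m, n)."""
        H, W = I.shape
        kh, kw = K.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    kernel = np.array([[1.0, -1.0]])                   # horizontal difference kernel
    print(conv2d(image, kernel))                       # output has shape (5, 4)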

Transformer

Uses self-attention mechanisms to process sequential data while handling long-range dependencies better than RNNs.

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where Q, K, V are queries, keys, and values respectively, and d_k is the dimension of the keys.
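
A single-head, unmasked version of this formula in NumPy; the sequence lengths and dimensions below are arbitrary, and real transformer layers add multiple heads, masking, and learned projections.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # query-key similarity scores
        weights = softmax(scores, axis=-1)     # each query's weights over the keys sum to 1
        return weights @ V                     # weighted sum of the value vectors

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 8))       # 3 queries, d_k = 8
    K = rng.normal(size=(5, 8))       # 5 keys
    V = rng.normal(size=(5, 16))      # 5 values, d_v = 16
    print(attention(Q, K, V).shape)   # (3, 16)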

Forward Pass

The forward pass computes the output of the network for a given input by propagating data through each layer:

Key Steps:

  1. Input data is normalized/processed as needed
  2. Data passes through each layer (linear transformation + activation)
  3. The output layer produces predictions
  4. Loss is computed between predictions and true values

z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
a^{(l)} = σ(z^{(l)})

This computation is repeated for each layer until the final output is produced.
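
The steps above can be strung together in a short NumPy sketch: a loop of linear transforms and activations, followed by a loss. The two-layer network, the ReLU/softmax choices, and the cross-entropy loss are assumptions made for this example.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def softmax(z):
        e = np.exp(z - z.max(axis=0, keepdims=True))
        return e / e.sum(axis=0, keepdims=True)

    def forward(x, params):
        """Repeat z = W a + b, a = sigma(z) for every layer; softmax on the output layer."""
        a = x
        for i, (W, b) in enumerate(params):
            z = W @ a + b
            a = softmax(z) if i == len(params) - 1 else relu(z)
        return a

    def cross_entropy(y_pred, y_true):
        """Loss between the predictions and a one-hot target."""
        return -float(np.sum(y_true * np.log(y_pred + 1e-12)))

    rng = np.random.default_rng(0)
    sizes = [4, 8, 3]   # 4 inputs -> 8 hidden units -> 3 classes (arbitrary)
    params = [(0.1 * rng.normal(size=(o, i)), np.zeros((o, 1)))
              for i, o in zip(sizes[:-1], sizes[1:])]
    x = rng.normal(size=(4, 1))             # one input example
    y = np.array([[1.0], [0.0], [0.0]])     # one-hot true label
    y_pred = forward(x, params)
    print(cross_entropy(y_pred, y))         # loss for this example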

Activation Functions

Non-linear functions that let the network represent relationships a purely linear model cannot (a short sketch follows the list):

  • Sigmoid: σ(z) = 1/(1 + e^{-z})
  • ReLU: ReLU(z) = max(0, z)
  • Softmax: σ(z)_i = e^{z_i} / ∑_j e^{z_j} (for classification)
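
The three functions listed above, written out in NumPy; the max-shift in softmax is a standard numerical-stability trick rather than part of the formula.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # 1 / (1 + e^{-z})

    def relu(z):
        return np.maximum(0.0, z)         # max(0, z), elementwise

    def softmax(z):
        e = np.exp(z - np.max(z))         # shifting by max(z) avoids overflow
        return e / e.sum()                # e^{z_i} / sum_j e^{z_j}

    z = np.array([-2.0, 0.0, 3.0])
    print(sigmoid(z))
    print(relu(z))
    print(softmax(z))                     # the three softmax outputs sum to 1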

Backpropagation

Backpropagation efficiently computes gradients of the loss function with respect to each parameter by applying the chain rule:

Key Equations (Output layer δ):

δ^{(L)} = ∇_a J ⊙ σ'(z^{(L)})

Where J is the loss function, ⊙ is element-wise multiplication, and σ' is the derivative of the activation function.

Hidden layers δ:

δ^{(l)} = (W^{(l+1)})^T δ^{(l+1)} ⊙ σ'(z^{(l)})

Parameter gradients:

∇_{W^{(l)}} J = δ^{(l)} (a^{(l-1)})^T
∇_{b^{(l)}} J = δ^{(l)}
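
These equations translate almost line for line into NumPy. The sketch below assumes sigmoid activations in every layer and a squared-error loss (so ∇_a J = a^{(L)} - y); the formulas above hold for any σ and J.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(x, y, params):
        """Return (dJ/dW, dJ/db) for each layer of a sigmoid MLP with squared-error loss."""
        # Forward pass, caching every z and a.
        a, activations, zs = x, [x], []
        for W, b in params:
            z = W @ a + b
            a = sigmoid(z)
            zs.append(z)
            activations.append(a)

        grads = [None] * len(params)
        # Output layer: delta = (a_L - y) * sigma'(z_L), elementwise.
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grads[-1] = (delta @ activations[-2].T, delta)
        # Hidden layers: delta_l = W_{l+1}^T delta_{l+1} * sigma'(z_l), elementwise.
        for l in range(len(params) - 2, -1, -1):
            delta = params[l + 1][0].T @ delta * sigmoid_prime(zs[l])
            grads[l] = (delta @ activations[l].T, delta)
        return grads

    rng = np.random.default_rng(0)
    sizes = [4, 5, 2]
    params = [(0.1 * rng.normal(size=(o, i)), np.zeros((o, 1)))
              for i, o in zip(sizes[:-1], sizes[1:])]
    x, y = rng.normal(size=(4, 1)), np.array([[0.0], [1.0]])
    for dW, db in backprop(x, y, params):
        print(dW.shape, db.shape)   # gradient shapes match the W and b they belong to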

Computational Graph

Backpropagation works by traversing the computation graph backward from the loss (a minimal sketch follows the steps below):

  1. Compute derivative of loss w.r.t. output
  2. For each layer, compute gradient of loss w.r.t. parameters and inputs
  3. Propagate gradients backward using the chain rule
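
A tiny scalar reverse-mode autodiff class makes this traversal explicit. The Value class and its API are invented for this illustration, in the spirit of minimal autograd demos, and are not a reference to any particular library.

    class Value:
        """A node in the computation graph: stores data, its gradient, and a local backward rule."""
        def __init__(self, data, parents=()):
            self.data, self.grad = data, 0.0
            self._parents = parents
            self._backward = lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def backward():              # d(out)/d(self) = d(out)/d(other) = 1
                self.grad += out.grad
                other.grad += out.grad
            out._backward = backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward():              # product rule: pass out.grad scaled by the other factor
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = backward
            return out

        def backward(self):
            # 1) topologically order the graph, 2) seed dL/dL = 1, 3) apply the chain rule in reverse.
            order, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):
                v._backward()

    # loss = (w * x + b)^2, a toy "loss" built from two operations
    w, x, b = Value(2.0), Value(3.0), Value(1.0)
    y = w * x + b
    loss = y * y
    loss.backward()
    print(loss.data, w.grad, b.grad)   # 49.0, 42.0 (= 2*y*x), 14.0 (= 2*y)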

Optimization

Optimizers adjust network parameters to minimize the loss function:

Gradient Descent Update Rule:

θ = θ - η ∇_θ J(θ)

Where θ are the parameters and η is the learning rate.
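
A one-line implementation of the update, applied to a toy quadratic loss J(θ) = ||θ||² / 2 whose gradient is simply θ; the loss, starting point, and learning rate are chosen only to show the rule in action.

    import numpy as np

    def sgd_step(theta, grad, lr=0.1):
        """Gradient descent: theta <- theta - eta * grad."""
        return theta - lr * grad

    theta = np.array([4.0, -2.0])
    for _ in range(50):
        theta = sgd_step(theta, grad=theta, lr=0.1)   # gradient of ||theta||^2 / 2 is theta
    print(theta)   # has shrunk toward the minimum at [0, 0]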

Adam Optimizer

Adaptive Moment Estimation combines ideas from RMSprop and momentum:

m_t = β_1 m_{t-1} + (1 - β_1) ∇J(θ)
v_t = β_2 v_{t-1} + (1 - β_2) (∇J(θ))^2
θ = θ - η · m̂_t / (√v̂_t + ε)

Where m and v are estimates of first and second moments of gradients, and hats indicate bias-corrected versions.
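
The same toy quadratic loss, now updated with Adam; the hyperparameters below are the common defaults (β_1 = 0.9, β_2 = 0.999, ε = 1e-8) plus an arbitrary learning rate for the demo.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update with bias-corrected moment estimates."""
        m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)              # bias corrections (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    theta = np.array([4.0, -2.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, 501):
        theta, m, v = adam_step(theta, grad=theta, m=m, v=v, t=t, lr=0.05)
    print(theta)   # ends up near the minimum at [0, 0]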

Learning Rate Scheduling

Techniques to adapt the learning rate during training (sketched in code after the list):

  • Step decay (reduce at fixed intervals)
  • Exponential decay (continuous reduction)
  • Warmup (gradually increase at the start, then hand off to a decay schedule)
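
Each of the three schedules can be written as a small function of the epoch; the decay factors, intervals, and warmup length below are arbitrary example values.

    import math

    def step_decay(lr0, epoch, drop=0.5, every=10):
        """Multiply the rate by drop once every `every` epochs."""
        return lr0 * drop ** (epoch // every)

    def exponential_decay(lr0, epoch, k=0.05):
        """Continuous reduction: lr0 * e^{-k * epoch}."""
        return lr0 * math.exp(-k * epoch)

    def warmup_then_decay(lr0, epoch, warmup=5, k=0.05):
        """Ramp up linearly for `warmup` epochs, then decay exponentially."""
        if epoch < warmup:
            return lr0 * (epoch + 1) / warmup
        return lr0 * math.exp(-k * (epoch - warmup))

    for epoch in (0, 4, 10, 30):
        print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch), warmup_then_decay(0.1, epoch))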
