
DNN Computations

Computational View

| Aspect | Questions |
|---|---|
| Memory: Parameters size | Does it fit on-chip? <br> How long does it take to load from off-chip? <br> Can I overlap loading with computation? <br> Is there re-use of loaded parameters? |
| Memory: Activation size | Does it fit on-chip? <br> How long does it need to stay on-chip? |
| Compute: MACs (Multiply-ACcumulates) | What is the available parallelism in each layer? <br> Does it fit the HW? <br> Can I overlap compute with memory access? |
| Data dependencies | Which (parts of) tensors need to be ready before starting to process a layer? |
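A back-of-the-envelope estimate is often enough to answer the parameter-size and load-time questions above. In the sketch below, the model size, precision, on-chip SRAM capacity, and off-chip bandwidth are illustrative assumptions, not figures from these notes.

```python
# Back-of-the-envelope estimate: do the parameters fit on-chip, and how long
# does loading them from off-chip take? (Illustrative numbers, not measurements.)

num_params = 25_000_000          # assumed model size (roughly ResNet-50-class)
bytes_per_param = 2              # assumed FP16 storage
on_chip_sram_bytes = 40 * 2**20  # assumed 40 MiB of on-chip SRAM
dram_bandwidth = 900e9           # assumed 900 GB/s off-chip bandwidth

param_bytes = num_params * bytes_per_param
fits_on_chip = param_bytes <= on_chip_sram_bytes
load_time_s = param_bytes / dram_bandwidth

print(f"Parameters: {param_bytes / 2**20:.1f} MiB, fit on-chip: {fits_on_chip}")
print(f"Load time from off-chip: {load_time_s * 1e6:.1f} us")
```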

Operation in an epoch

| | Bottleneck operation | Compute utilization ratio | Memory utilization ratio |
|---|---|---|---|
| Forward pass | Multiplication | | |
| Backward pass | Multiplication | 2x of forward pass | 1x of forward pass |
| Gradient update | Subtraction | | |

The backward pass costs roughly 2x the forward-pass compute because it computes gradients with respect to both the activations and the weights.
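As a concrete reference, a minimal PyTorch training step shows where each of these operations occurs; the model and data below are toy placeholders.

```python
# One training step, showing where each bottleneck operation appears
# (a minimal PyTorch sketch; model and data are placeholders).
import torch

model = torch.nn.Linear(128, 10)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
lr = 1e-2

loss = torch.nn.functional.cross_entropy(model(x), y)  # forward pass: mostly multiplications
loss.backward()                                        # backward pass: ~2x the forward-pass multiplications
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad                               # gradient update: a subtraction per parameter
        p.grad = None
```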

I/O-Bound vs CPU-Bound

Memory-Bound vs Compute-Bound

- Does it take more time to do the computation or to fetch data from memory? (A rough roofline-style check is sketched after the table below.)
- Depends on:
    - Memory bandwidth
    - Compute speed/parallelism
    - Size of operands
    - Size of (intermediate) results
    - On-chip buffering resources
| | Memory-Bound | Compute-Bound |
|---|---|---|
| Parameters to fetch | Many | Few |
| Stalls the other resource | Compute | Memory |
| Implication | Even if you have the best GPU, it won't make a difference | Even if you have the best RAM/hard drive, it won't make a difference |
| Not concerning | ❌ | ✅ |
| | IDK | Unnecessarily over-optimized memory controller |
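A common way to estimate which regime a layer is in is to compare its arithmetic intensity (operations per byte moved) against the machine balance (peak ops per byte of bandwidth). The peak numbers below are illustrative assumptions.

```python
# Roofline-style check: compare arithmetic intensity (ops per byte moved)
# against the machine balance (peak ops per byte of bandwidth).
# All hardware numbers are assumed for illustration.

peak_flops = 100e12       # assumed 100 TFLOP/s peak compute
peak_bandwidth = 1.0e12   # assumed 1 TB/s off-chip bandwidth
machine_balance = peak_flops / peak_bandwidth  # ops per byte at the crossover point

def bound_type(ops: float, bytes_moved: float) -> str:
    """Classify a layer as memory- or compute-bound from its arithmetic intensity."""
    intensity = ops / bytes_moved
    return "compute-bound" if intensity > machine_balance else "memory-bound"

# Example: a fully-connected layer with 4096x4096 FP16 weights and batch size 1
ops = 2 * 4096 * 4096                 # one multiply-accumulate per weight
bytes_moved = 2 * 4096 * 4096         # each FP16 weight fetched once, barely reused
print(bound_type(ops, bytes_moved))   # -> memory-bound (intensity ~1 op/byte)
```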

Double-Buffering

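Double-buffering overlaps memory transfers with computation: while the accelerator works on the data in one buffer, the next chunk is loaded into a second buffer, so the compute units do not stall waiting for memory. Below is a minimal sketch, with `load_chunk` and `compute` as hypothetical stand-ins for an off-chip transfer and a compute kernel.

```python
# Minimal double-buffering sketch (illustrative, not a specific framework's API):
# while one buffer is being processed, the next chunk is prefetched into the other,
# overlapping memory transfers with compute.
from concurrent.futures import ThreadPoolExecutor

def load_chunk(i):      # stand-in for an off-chip -> on-chip transfer
    return f"chunk {i}"

def compute(chunk):     # stand-in for the actual computation on a chunk
    return f"processed {chunk}"

def run(num_chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(load_chunk, 0)              # fill buffer 0
        for i in range(num_chunks):
            current = pending.result()                          # wait for the prefetched buffer
            if i + 1 < num_chunks:
                pending = prefetcher.submit(load_chunk, i + 1)  # start filling the other buffer
            results.append(compute(current))                    # compute overlaps with the prefetch
    return results

print(run(4))
```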

A DNN “fits” on a processor when its parameters and activations fit in the processor’s external memory.

Why do GPUs have smaller external memories, i.e., why is VRAM smaller than RAM? Because VRAM is much faster, and hence more expensive.

Model Checkpointing

- General: store all activations from the forward pass, to use during the backward pass
- Checkpointing: skip storing some of those activations and recalculate them on the fly during the backward pass

Implication: less memory usage, but more computation
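For example, activation checkpointing is available in PyTorch via `torch.utils.checkpoint`; in the sketch below the model is a toy stand-in, and the activations inside the checkpointed block are recomputed during the backward pass instead of being kept in memory.

```python
# Activation checkpointing with PyTorch (toy model as a stand-in):
# the checkpointed block's intermediate activations are not stored during the
# forward pass; they are recomputed on the fly during the backward pass.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 512), torch.nn.ReLU())
head = torch.nn.Linear(512, 10)

x = torch.randn(32, 512, requires_grad=True)
h = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are not stored
loss = head(h).sum()
loss.backward()                                # triggers a second forward pass through `block`
```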

Common DNN Layers

| | Type of bound | Compute complexity (operations) | Memory complexity (no. of parameters) | Comment |
|---|---|---|---|---|
| Convolution | Compute | \(k \times 2 \times rs \times wh \times c\) | \(k \times rs \times c\) | |
| Depth-wise convolution | Compute | \(k \times 2 \times rs \times wh\) | \(k \times rs\) | |
| Linear/Fully-Connected | Memory | | \(k\) | |
| Batched Linear | Equal | | \(k\) | |
| Pooling | Equal | \(O(1)\) | \(O(1)\) | Can reuse hardware for convolutions with a max/avg filter |
| Normalization | Equal | \(O(1)\) | \(O(1)\) | Batch-norm becomes a simple scale+shift operation during inference |
| Activation Functions | Equal | \(O(1)\) | \(O(1)\) | An AF that cannot be computed in-place needs gradients to be computed before & after the AF <br> Some AFs have parameters |

Convolution notation:

| Symbol | Meaning |
|---|---|
| \(w\) | Input width |
| \(h\) | Input height |
| \(c\) | No. of input channels |
| \(s\) | Filter width |
| \(r\) | Filter height |
| \(k\) | No. of filters in the convolution (i.e. no. of weight tensors) |
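The convolution formulas above translate directly into a quick cost calculator; the example layer dimensions below are illustrative.

```python
# Operations and parameter counts for a convolution layer, using the
# formulas from the table above (example dimensions are illustrative).

def conv_ops(w, h, c, r, s, k):
    """Operations: k x 2 x rs x wh x c (one multiply + one accumulate per filter tap)."""
    return k * 2 * (r * s) * (w * h) * c

def conv_params(c, r, s, k):
    """Parameters: k x rs x c (one r x s filter per input channel, per output filter)."""
    return k * (r * s) * c

# Example: 224x224 input, 3 channels, 64 filters of size 3x3
print(conv_ops(w=224, h=224, c=3, r=3, s=3, k=64))  # ~173M operations
print(conv_params(c=3, r=3, s=3, k=64))             # 1,728 parameters
```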