Learning Notes
Yangyang Li
yangyang.li@northwestern.edu
Updated on December 21, 2023
Contents

Acronyms

Preface
0.1 Features of this template
    0.1.1 crossref
    0.1.2 ToC (Table of Contents)
    0.1.3 header and footer
    0.1.4 bib
    0.1.5 preface, index, quote (epigraph) and appendix

I Machine Learning

1 Probability
  1.1 Basic Concepts
  1.2 Maximum Likelihood Estimation
  1.3 Maximum A Posteriori Estimation
  1.4 Gaussian Distribution
  1.5 Bayesian Network
  1.6 Probability Graph
      1.6.1 Variables Elimination
      1.6.2 Belief propagation
      1.6.3 Max-product Algorithm
      1.6.4 Factor Graph
  1.7 Expectation Maximum
  1.8 Gaussian Mixture Model
  1.9 Hidden Markov Model

2 Notes
  2.1 Complexity of matrix multiply
  2.2 Representation of matrix
  2.3 Training Tricks

II Algorithm and Data Structure

3 Algorithm
  3.1 Graph

III Programming

4 Leetcode
  4.1 Problem Sets
  4.2 Tips
  4.3 Problems
      4.3.1 1 Two Sum
      4.3.2 20 valid parentheses
      4.3.3 21 Merge Two Sorted Lists
      4.3.4 121 Best time to buy and sell stock
      4.3.5 226 Invert Binary Tree
      4.3.6 242 Valid Anagram
      4.3.7 704 Binary Search
      4.3.8 733 Flood Fill
      4.3.9 235 lowest-common-ancestor-of-a-binary-search-tree
      4.3.10 110 balanced-binary-tree
      4.3.11 141 linked-list-cycle
      4.3.12 232 implement-queue-using-stacks
      4.3.13 278 first-bad-version
      4.3.14 383 ransom-note
      4.3.15 70 climbing-stairs
      4.3.16 409 longest-palindrome
      4.3.17 206 reverse-linked-list
      4.3.18 169 majority-element
      4.3.19 67 add-binary
      4.3.20 543 diameter-of-binary-tree
      4.3.21 876 middle-of-the-linked-list
      4.3.22 104 maximum-depth-of-binary-tree
      4.3.23 217 contains-duplicate
      4.3.24 53 maximum-subarray
      4.3.25 57 insert-interval
      4.3.26 542 01-matrix
      4.3.27 973 k-closest-points-to-origin
      4.3.28 3 longest-substring-without-repeating-characters
      4.3.29 15 3sum
      4.3.30 102 binary-tree-level-order-traversal
      4.3.31 133 clone-graph
      4.3.32 207 course-schedule
      4.3.33 322 coin-change
      4.3.34 238 product-of-array-except-self
      4.3.35 155 min-stack
      4.3.36 98 validate-binary-search-tree
      4.3.37 200 number-of-islands
      4.3.38 33 search-in-rotated-sorted-array
      4.3.39 39 combination-sum
      4.3.40 86.partition-list
      4.3.41 23.merge-k-sorted-lists
      4.3.42 142.linked-list-cycle-ii.cpp
      4.3.43 160.intersection-of-two-linked-lists.cpp
      4.3.44 26.remove-duplicates-from-sorted-array
      4.3.45 27.remove-element.cpp
      4.3.46 283.move-zeroes.cpp
      4.3.47 5.longest-palindromic-substring
      4.3.48 669.trim-a-binary-search-tree.cpp
      4.3.49 124.binary-tree-maximum-path-sum.cpp
      4.3.50 46.permutations.cpp
      4.3.51 51.n-queens.cpp
      4.3.52 78.subsets.cpp
      4.3.53 77.combinations.cpp
      4.3.54 90.subsets-ii.cpp
      4.3.55 40.combination-sum-ii.cpp
      4.3.56 47.permutations-ii.cpp
      4.3.57 111.minimum-depth-of-binary-tree.cpp
      4.3.58 752.open-the-lock.cpp
      4.3.59 76.minimum-window-substring.cpp
      4.3.60 438.find-all-anagrams-in-a-string.cpp
      4.3.61 337.house-robber-iii.cpp
      4.3.62 116. Populating Next Right Pointers in Each Node

5 C++
  5.1 Code Snippets
      5.1.1 Random Number Generation
      5.1.2 Quick Sort
  5.2 Code Mining
      5.2.1 Deep Learning

6 Rust

IV Research

7 Paper Writing
  7.1 Resources
  7.2 Writing Tools
  7.3 Writing Tips
  7.4 PhD Theses Tips
  7.5 Abstract
  7.6 Introduction
  7.7 Literature Reviews
  7.8 Method
  7.9 Result
  7.10 Conclusion

V Book Reading

8 Dive into Deep Learning
  8.1 Linear Neural Networks
      8.1.1 Exercise 3.1.6
      8.1.2 Exercise 3.3.5
      8.1.3 Exercise 3.5.6
      8.1.4 Exercise 4.1.5
      8.1.5 Exercise 4.5.5
      8.1.6 Exercise 7.1.6

9 Generative Deep Learning
  9.1 Intro to Deep Generative Models
  9.2 Variational Autoencoders
  9.3 Generative Adversarial Networks
      9.3.1 Generative Adversarial Nets (GAN) Training Tips
  9.4 Autoregressive Models
  9.5 Normalizing Flow Models
  9.6 Energy-Based Models
  9.7 Diffusion Models

Appendices

Appendix A Formulas
  A.1 Gaussian distribution

Bibliography

Alphabetical Index
List of Figures

1.1 A simple Bayesian network.
1.2 Belief propagation.
2.1 Batch Normalization
2.2 Transposed Convolution [Zha+23, Chapter 14.10]
2.3 Transposed convolution with a 2×2 kernel with stride of 2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation. [Zha+23]
9.1 Generative Models Taxonomy
9.2 Adding and subtracting features to and from faces
9.3 Artifacts when using convolutional transpose layers
9.4 The Wasserstein GAN with Gradient Penalty (WGAN-GP) critic training process
9.5 Inputs and outputs of the generator and critic in a CGAN
9.6 How a single sequence flows through a recurrent layer
9.7 A Long Short-Term Memory (LSTM) Cell
9.8 A single Gated Recurrent Unit (GRU) cell
9.9 The forward diffusion process

List of Theorems

A.1 Theorem (Central limit theorem)

List of Definitions

A.1 Definition (Gaussian distribution)
Acronyms

CGAN Conditional Generative Adversarial Nets
DCGAN Deep Convolutional GAN
DDM Denoising Diffusion Models
GAN Generative Adversarial Nets
GRU Gated Recurrent Unit
KL Kullback-Leibler
LSTM Long Short-Term Memory
MAP Maximum A Posteriori Estimation
MLE Maximum Likelihood Estimation
PDF Probability Density Function
RNN Recurrent Neural Network
VAE Variational Autoencoder
WGAN-GP Wasserstein GAN with Gradient Penalty
Preface
Contents
0.1 Features of this template
0.1 Features of this template
"TeX, stylized within the system as LaTeX, is a typesetting system which was designed and written by Donald Knuth and first released in 1978. TeX is a popular means of typesetting complex mathematical formulae; it has been noted as one of the most sophisticated digital typographical systems."
- Wikipedia
0.1.1 crossref
dierent styles of clickable denitions and theorems
nameref: Gaussian distribution
autoref: Denition A.1,
cref: Denition A.1,
hyperref: Gaussian,
0.1.2 ToC (Table of Contents)
mini ToC of sections at the beginning of each chapter
lists of theorems, definitions, and figures
the chapter titles are bidirectionally linked
0.1.3 header and footer
fancyhdr
right header: section name and link to the beginning of the section
le header: chapter title and link to the beginning of the chapter
footer: page number linked to ToC of the whole document
0.1.4 bib
titles of references are linked to the publisher webpage, e.g., [Kit+02]
backref (go to the page where the reference is cited) e.g., [Chi09]
customized video entry in reference like in [Bab16]
0.1.5 preface, index, quote (epigraph) and appendix
index page at the end of this document…
Part I
Machine Learning
Chapter 1
Probability
Contents
1.1 Basic Concepts
1.2 Maximum Likelihood Estimation
1.3 Maximum A Posteriori Estimation
1.4 Gaussian Distribution
1.5 Bayesian Network
1.6 Probability Graph
1.7 Expectation Maximum
1.8 Gaussian Mixture Model
1.9 Hidden Markov Model
1.1 Basic Concepts
Permutation is an arrangement of objects in which the order is important.
$$P(n, r) = \frac{n!}{(n-r)!}$$
Combination is an arrangement of objects in which the order is not important.
$$C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n-r)!}$$
in which $0 \le r \le n$.
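As a quick numeric check of these formulas (an illustrative Python snippet using only the standard library):

import math

# Choosing and arranging r = 3 objects out of n = 5:
n, r = 5, 3
print(math.perm(n, r))                       # P(5, 3) = 5!/2! = 60
print(math.comb(n, r))                       # C(5, 3) = 10
print(math.perm(n, r) // math.factorial(r))  # C = P / r! = 10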
1.2 Maximum Likelihood Estimation
$$X = (x_1, x_2, \dots, x_N)^T, \quad x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T \qquad (1.1)$$
in which $N$ is the number of samples and $p$ is the number of features. The data is sampled from a distribution $p(x|\theta)$, where $\theta$ is the parameter of the distribution.
For $N$ i.i.d. samples, the likelihood function is $p(X|\theta) = \prod_{i=1}^{N} p(x_i|\theta)$.
In order to get $\theta$, we use Maximum Likelihood Estimation (MLE) to maximize the likelihood function.
$$\theta_{MLE} = \operatorname*{argmax}_{\theta} \log p(X|\theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log p(x_i|\theta) \qquad (1.2)$$
1.3 Maximum A Posteriori Estimation
In Bayes' theorem, $\theta$ is not a constant value but a random variable, $\theta \sim p(\theta)$. Hence,
$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} = \frac{p(X|\theta)\,p(\theta)}{\int_{\theta} p(X|\theta)\,p(\theta)\,d\theta} \qquad (1.3)$$
In order to get $\theta$, we use Maximum A Posteriori Estimation (MAP) to maximize the posterior function.
$$\theta_{MAP} = \operatorname*{argmax}_{\theta} p(\theta|X) = \operatorname*{argmax}_{\theta} \frac{p(X|\theta)\,p(\theta)}{p(X)} \qquad (1.4)$$
After $\theta$ is estimated, we calculate $\frac{p(X|\theta)\,p(\theta)}{\int_{\theta} p(X|\theta)\,p(\theta)\,d\theta}$ to get the posterior distribution. We can use the posterior distribution to predict the probability of a new sample $x$.
$$p(x_{new}|X) = \int_{\theta} p(x_{new}|\theta) \cdot p(\theta|X)\,d\theta \qquad (1.5)$$
1.4 Gaussian Distribution
Gaussian distribution is also called normal distribution.
$$\theta = (\mu, \sigma^2), \quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \qquad (1.6)$$
For MLE,
$$\theta = (\mu, \Sigma) = (\mu, \sigma^2), \quad \theta_{MLE} = \operatorname*{argmax}_{\theta} \log p(X|\theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log p(x_i|\theta) \qquad (1.7)$$
Generally, the Probability Density Function (PDF) of a Gaussian distribution is:
$$p(x|\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \qquad (1.8)$$
in which $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and $\det$ is the determinant of the matrix, i.e., the product of all eigenvalues of the matrix.
Hence,
$$\log p(X|\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) = \sum_{i=1}^{N} \log \left[\frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right)\right] \qquad (1.9)$$
Let's only consider the 1-dimensional case for brevity; then
$$\log p(X|\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) = \sum_{i=1}^{N} \log \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2}\,\frac{(x_i-\mu)^2}{\sigma^2}\right)\right] \qquad (1.10)$$
Let's get the optimal value for $\mu$:
$$\mu_{MLE} = \operatorname*{argmax}_{\mu} \log p(X|\theta) = \operatorname*{argmin}_{\mu} \sum_{i=1}^{N} \frac{1}{2}(x_i-\mu)^2 \qquad (1.11)$$
So,
$$\frac{\partial \log p(X|\theta)}{\partial \mu} = \sum_{i=1}^{N} (\mu - x_i) = 0 \;\Longrightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1.12)$$
Let's get the optimal value for $\sigma^2$:
$$\begin{aligned}
\sigma_{MLE} &= \operatorname*{argmax}_{\sigma} \log p(X|\theta) \\
&= \operatorname*{argmax}_{\sigma} \sum_{i=1}^{N} \log \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2}\,\frac{(x_i-\mu)^2}{\sigma^2}\right)\right] \\
&= \operatorname*{argmax}_{\sigma} \sum_{i=1}^{N} \left[-\log \sqrt{2\pi\sigma^2} - \frac{(x_i-\mu)^2}{2\sigma^2}\right] \\
&= \operatorname*{argmin}_{\sigma} \sum_{i=1}^{N} \left[\log \sigma + \frac{(x_i-\mu)^2}{2\sigma^2}\right]
\end{aligned}$$
Hence,
$$\frac{\partial}{\partial \sigma} \sum_{i=1}^{N} \left[\log \sigma + \frac{(x_i-\mu)^2}{2\sigma^2}\right] = 0 \;\Longrightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2 \qquad (1.13)$$
The estimator $\mu_{MLE}$ is unbiased:
$$E_D[\mu_{MLE}] = E_D\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E_D[x_i] = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu \qquad (1.14)$$
However, $\sigma^2_{MLE}$ is biased:
$$\begin{aligned}
E_D\!\left[\sigma^2_{MLE}\right] &= E_D\!\left[\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_{MLE})^2\right] && (1.15) \\
&= E_D\!\left[\frac{1}{N}\sum_{i=1}^{N} \left(x_i^2 - 2x_i\mu_{MLE} + \mu^2_{MLE}\right)\right] = E_D\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2_{MLE}\right] && (1.16) \\
&= E_D\!\left[\frac{1}{N}\sum_{i=1}^{N} \left(x_i^2 - \mu^2\right) + \mu^2 - \mu^2_{MLE}\right] && (1.17) \\
&= \sigma^2 - \left(E_D\!\left[\mu^2_{MLE}\right] - \mu^2\right) && (1.18) \\
&= \sigma^2 - \left(E_D\!\left[\mu^2_{MLE}\right] - E_D\!\left[\mu_{MLE}\right]^2\right) && (1.19) \\
&= \sigma^2 - \operatorname{Var}[\mu_{MLE}] = \sigma^2 - \operatorname{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] && (1.20) \\
&= \sigma^2 - \frac{1}{N^2}\sum_{i=1}^{N} \operatorname{Var}[x_i] = \frac{N-1}{N}\,\sigma^2 && (1.21)
\end{aligned}$$
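A quick numerical check of this $\frac{N-1}{N}$ factor (an illustrative NumPy sketch; the sample size and trial count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 100_000
mu, sigma2 = 0.0, 1.0

# Draw many datasets of size N and average the MLE variance estimate.
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mle_var = samples.var(axis=1, ddof=0)  # divides by N, i.e., the MLE

print(mle_var.mean())        # empirically close to (N-1)/N * sigma2 = 0.8
print((N - 1) / N * sigma2)  # 0.8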
1.5 Bayesian Network

Figure 1.1: A simple Bayesian network.

1.6 Probability Graph

This section is not finished yet; it needs to be reviewed (p. 54).
1.6.1 Variables Elimination
1.6.2 Belief propagation
Belief propagation is mainly used for tree data structures, and is equivalent to Section 1.6.1 with caching.
Figure 1.2: Belief propagation.
1.6.3 Max-product Algorithm
1.6.4 Factor Graph
1.7 Expectation Maximum

$$\Theta^{(t+1)} = \operatorname*{argmax}_{\Theta} \int_{z} \log P(x, z\,|\,\Theta) \cdot P(z\,|\,x, \Theta^{(t)})\,dz$$

Continue on p. 60.
1.8 Gaussian Mixture Model

$$Q\!\left(\Theta, \Theta^{(t)}\right) = \int_{z} \log P(x, z\,|\,\Theta) \cdot P(z\,|\,x, \Theta^{(t)})\,dz$$
1.9 Hidden Markov Model
Chapter 2
Notes
2.1 Complexity of matrix multiply
Assume $A$ is $m \times n$ and $B$ is $n \times p$. The naive algorithm takes $O(mnp)$ time.
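As an illustration (not from the original notes), a minimal Python version of the naive algorithm:

def matmul(A, B):
    """Naive O(m*n*p) matrix multiply: A is m x n, B is n x p."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):          # m rows of A
        for j in range(p):      # p columns of B
            for k in range(n):  # inner dimension n
                C[i][j] += A[i][k] * B[k][j]
    return C

assert matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]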
2.2 Representation of matrix
Row major, column major, stride.

How to calculate the output dimension of a convolutional layer:
$$\left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor$$
in which $n_h$ is the height of the input, $k_h$ the height of the filter, $p_h$ the (total) padding on the height, and $s_h$ the stride on the height; similarly for $n_w$, $k_w$, $p_w$, and $s_w$.
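A hedged sanity check of the formula with PyTorch (the sizes are arbitrary; note that $p_h$ above is the total padding, so nn.Conv2d's per-side padding=2 corresponds to $p_h = 4$):

import torch
import torch.nn as nn

# n = 32, k = 5, stride s = 3, total padding p = 4 (2 per side).
# The formula gives floor((32 - 5 + 4 + 3) / 3) = floor(34 / 3) = 11.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, padding=2, stride=3)
out = conv(torch.zeros(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 1, 11, 11])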
2.3 Training Tricks
Batch normalization
$$BN(x) = \gamma \odot \frac{x - \hat{\mu}_B}{\hat{\sigma}_B} + \beta$$
Batch normalization is a technique that drastically reduces this problem. The solution is surprisingly simple. During training, a batch normalization layer calculates the mean and standard deviation of each of its input channels across the batch and normalizes by subtracting the mean and dividing by the standard deviation. There are then two learned parameters for each channel, the scale (gamma) and shift (beta), Fig. 2.1. The output is simply the normalized input, scaled by gamma and shifted by beta [Fos22].

When it comes to prediction, we may only want to predict a single observation, so there is no batch over which to calculate the mean and standard deviation. To get around this problem, during training a batch normalization layer also calculates the moving average of the mean and standard deviation of each channel and stores this value as part of the layer to use at test time [Fos22].

How many parameters are contained within a batch normalization layer? For every channel in the preceding layer, two weights need to be learned: the scale (gamma) and shift (beta). These are the trainable parameters. The moving average and standard deviation also need to be calculated for each channel, but since they are derived from the data passing through the layer rather than trained through backpropagation, they are called nontrainable parameters. In total, this gives four parameters for each channel in the preceding layer, where two are trainable and two are nontrainable [Fos22].
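A minimal NumPy sketch of the training-time computation described above (function and variable names are my own):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, channels). Normalize per channel, then scale and shift."""
    mu = x.mean(axis=0)          # per-channel mean over the batch
    var = x.var(axis=0)          # per-channel variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # gamma (scale) and beta (shift) are learned

x = np.random.randn(8, 4)        # a batch of 8 examples with 4 channels
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6))   # ~0 per channel
print(y.std(axis=0).round(3))    # ~1 per channel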
Figure 2.1: Batch Normalization
Figure 2.2: Transposed Convolution [Zha+23, Chapter 14.10]
Transposed Convolution. Ignoring channels for now, let's begin with the basic transposed convolution operation with a stride of 1 and no padding. Suppose that we are given an input tensor $n_h \times n_w$ and a kernel $k_h \times k_w$. Sliding the kernel window with a stride of 1 for $n_w$ times in each row and $n_h$ times in each column yields a total of $n_h n_w$ intermediate results. Each intermediate result is an $(n_h + k_h - 1) \times (n_w + k_w - 1)$ tensor initialized with zeros. To compute each intermediate tensor, each element in the input tensor is multiplied by the kernel so that the resulting $k_h \times k_w$ tensor replaces a portion of each intermediate tensor. Note that the position of the replaced portion in each intermediate tensor corresponds to the position of the element in the input tensor used for the computation. In the end, all the intermediate results are summed over to produce the output [Zha+23, Chapter 14.10].
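A small Python sketch of this operation (the loop mirrors the description above; the example values follow the worked example in [Zha+23, Chapter 14.10]):

import numpy as np

def trans_conv(X, K):
    """Basic transposed convolution: stride 1, no padding."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = np.zeros((n_h + k_h - 1, n_w + k_w - 1))
    for i in range(n_h):
        for j in range(n_w):
            # Each input element scales the kernel; the scaled kernel is
            # added into Y at the position of that input element.
            Y[i:i + k_h, j:j + k_w] += X[i, j] * K
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
print(trans_conv(X, K))
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]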
Dierent from in the regular convolution where padding is applied to input, it is applied to output in the
transposed convolution. For example, when specifying the padding number on either side of the height and
width as 1, the rst and last rows and columns will be removed from the transposed convolution output.
Consider implementing the convolution by multiplying matrices. Given an input vector $x$ and a weight matrix $W$, the forward propagation function of the convolution can be implemented by multiplying its input with the weight matrix and outputting a vector $y = Wx$. Since backpropagation follows the chain rule and $\nabla_x y = W^\top$, the backpropagation function of the convolution can be implemented by multiplying its input with the transposed weight matrix $W^\top$. Therefore, the transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer: its forward propagation and backpropagation functions multiply their input vector with $W^\top$ and $W$, respectively [Zha+23].
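To make the exchange of $W$ and $W^\top$ concrete, here is a sketch for a 3×3 input and a 2×2 kernel with stride 1 and no padding (the kernel2matrix construction follows [Zha+23, Chapter 14.10]):

import numpy as np

def kernel2matrix(K):
    """Build the 4x9 matrix W so that conv(X, K) == (W @ x).reshape(2, 2)."""
    k, W = np.zeros(5), np.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

X = np.arange(9.0).reshape(3, 3)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
W = kernel2matrix(K)

Y = (W @ X.reshape(-1)).reshape(2, 2)    # regular convolution as y = Wx
Z = (W.T @ Y.reshape(-1)).reshape(3, 3)  # transposed convolution as W^T y
print(Y)        # the 2x2 cross-correlation output
print(Z.shape)  # (3, 3): W^T maps back to the input shape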
Figure 2.3: Transposed convolution with a 2×2 kernel with stride of 2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation. [Zha+23]
Part II
Algorithm and Data Structure
Chapter 3
Algorithm
Contents
3.1 Graph
3.1 Graph
Part III
Programming
Chapter 4
Leetcode
Contents
4.1 Problem Sets
4.2 Tips
4.3 Problems
4.1 Problem Sets

https://leetcode.com/list/rab78cw1/
https://www.techinterviewhandbook.org/grind75?weeks=8

dfs recursive = bfs iterative (use stack)
two pointers
check size (is empty, compare, larger, etc.)
hash map (fixed array)
4.2 Tips

postorder: LRN
preorder: NLR
inorder: LNR
virtual head pointers
two pointers (left and right, fast and slow)
BST (value of left child is smaller, that of right child is larger)
Backtracking (termination condition, available paths, current path, check for duplicates in available paths or results)
BFS (queue; while; for)
sliding windows (two pointers) for substring problems
best-time-to-buy-and-sell-template.cpp
(p + m - 1) / m equals ceil(p / m)
be careful about int overflow and underflow
binary search problem: 1. left, right, f(x) increase/decrease, target
4.3 Problems
4.3.1 1 Two Sum
hash map
4.3.2 20 valid parentheses
stack
4.3.3 21 Merge Two Sorted Lists
recursive
4.3.4 121 Best time to buy and sell stock
record value
4.3.5 226 Invert Binary Tree
DFS recursive
BFS iterative via stack
4.3.6 242 Valid Anagram
count freq then verify freq
4.3.7 704 Binary Search
two pointers
4.3.8 733 Flood Fill
graph matrix
DFS (recursive) or BFS (iterative)
4.3.9 235 lowest-common-ancestor-of-a-binary-search-tree
Aributes of BST, the value of le subtree is less than root, that of le larger than root.
Recursive
iterative
4.3.10 110 balanced-binary-tree
recursive DFS
use -1 means unbalanced
4.3.11 141 linked-list-cycle
two points fast and slow
check if they meet again
4.3.12 232 implement-queue-using-stacks
Two stacks, input and output
move when output is empty
amortized cost for each operation is O(1).
4.3.13 278 first-bad-version
Binary Search
use val directly instead of index
4.3.14 383 ransom-note
hash map (xed array)
check size rst
4.3.15 70 climbing-stairs
Fibonacci like
recursive (top-down, with memoization)
iterative (bottom-up, space O(1))
4.3.16 409 longest-palindrome
use even char freq
calculate odd char freq (s.size() - odd + (odd > 0))
4.3.17 206 reverse-linked-list
recursive
iterative (three pointers (prev, curr, next))
4.3.18 169 majority-element
Boyer-Moore majority vote
4.3.19 67 add-binary
carry
ASCII to int, and vice versa
4.3.20 543 diameter-of-binary-tree
postorder traversal
4.3.21 876 middle-of-the-linked-list
fast and slow pointers
get n and n/2 at same time
4.3.22 104 maximum-depth-of-binary-tree
DFS recursive
postorder
4.3.23 217 contains-duplicate
Hash set
unordered_map is a hash map
map is a BST
4.3.24 53 maximum-subarray
DP
use subproblem to x global problem
4.3.25 57 insert-interval
three situations
not—overlap—not
4.3.26 542 01-matrix
DP rst le top, then boom, right
BFS, rst 0, then unseen
4.3.27 973 k-closest-points-to-origin
sort / nth_element / partial_sort
max-heap
randomized quicksort
4.3.28 3 longest-substring-without-repeating-characters
sliding window (two pointers)
4.3.29 15 3sum
sort
two pointers
4.3.30 102 binary-tree-level-order-traversal
stack/queue BFS
DFS recursive / keep level
4.3.31 133 clone-graph
DFS
Amazing
4.3.32 207 course-schedule
https://www.geeksforgeeks.org/kahns-algorithm-vs-dfs-approach-a-comparative-analysis/
BFS: Kahn's algorithm (topological sort)
DFS: better for larger V
topological sort
4.3.33 322 coin-change
DP
4.3.34 238 product-of-array-except-self
prex
sux
space O(1)
4.3.35 155 min-stack
keep value and min at the same time
4.3.36 98 validate-binary-search-tree
Recursive
4.3.37 200 number-of-islands
find a 1, then clear the island via DFS
4.3.38 33 search-in-rotated-sorted-array
Binary Search
check if mid and target are in same side
if not on the same side, treat one side as inf/-inf so the left/right pointer advances
4.3.39 39 combination-sum
Backtracking
4.3.40 86.partition-list
dummy pointers
two pointers
4.3.41 23.merge-k-sorted-lists
heap/priority queue
merge two lists iteratively
4.3.42 142.linked-list-cycle-ii.cpp
two pointers
4.3.43 160.intersection-of-two-linked-lists.cpp
contact two lists
make sure two lists have same distance from the end
4.3.44 26.remove-duplicates-from-sorted-array
fast and slow pointers
4.3.45 27.remove-element.cpp
fast and slow pointers
4.3.46 283.move-zeroes.cpp
fast and slow pointers
snowball
4.3.47 5.longest-palindromic-substring
dp
two pointers (expand from center/narrow from both sides)
4.3.48 669.trim-a-binary-search-tree.cpp
BST (value of le child is smaller, that of right child is larger)
postorder
4.3.49 124.binary-tree-maximum-path-sum.cpp
postorder
discard nodes with negative value
4.3.50 46.permutations.cpp
Backtracking
4.3.51 51.n-queens.cpp
Backtracking(current path, options, terminated option)
4.3.52 78.subsets.cpp
backtracking
4.3.53 77.combinations.cpp
backtracking
4.3.54 90.subsets-ii.cpp
backtracking
4.3.55 40.combination-sum-ii.cpp
backtracking
4.3.56 47.permutations-ii.cpp
backtracking
4.3.57 111.minimum-depth-of-binary-tree.cpp
BFS
4.3.58 752.open-the-lock.cpp
BFS
4.3.59 76.minimum-window-substring.cpp
sliding windows
4.3.60 438.find-all-anagrams-in-a-string.cpp
sliding windows
4.3.61 337.house-robber-iii.cpp
213.house-robber-ii.cpp
198.house-robber.cpp
DP
recursive with memoization
DP table -> optimized table
4.3.62 116. Populating Next Right Pointers in Each Node
Three children tree traverse
BFS O(N), space O(1)
Chapter 5
C++
Contents
5.1 Code Snippets
5.2 Code Mining
5.1 Code Snippets
5.1.1 Random Number Generation
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>

int main() {
  std::random_device rd;
  std::mt19937 rng(rd());
  std::uniform_int_distribution<int> dist6(1, 6);
  std::generate_n(std::ostream_iterator<int>(std::cout, " "), 10,
                  [&dist6, &rng]() { return dist6(rng); });
  return 0;
}
5.1.2 Quick Sort
#include <cstdlib>
#include <ctime>
#include <utility>
#include <vector>

using std::vector;

class Solution {
public:
  void sort(vector<int>& nums, int low, int high) {
    if (low >= high) return;
    int p = partition(nums, low, high);
    sort(nums, low, p - 1);   // sort the left part, excluding the pivot
    sort(nums, p + 1, high);  // sort the right part
  }

  int partition(vector<int>& nums, int low, int high) {
    int pivot = low, l = pivot + 1, r = high;
    while (l <= r) {
      if (nums[l] < nums[pivot]) ++l;
      else if (nums[r] >= nums[pivot]) --r;
      else std::swap(nums[l], nums[r]);
    }
    std::swap(nums[pivot], nums[r]);  // place the pivot at its final position
    return r;
  }

  void shuffle(vector<int>& nums) {
    std::srand((unsigned) time(nullptr));
    int n = nums.size();
    for (int i = 0; i < n; ++i) {
      int r = i + rand() % (n - i);
      std::swap(nums[i], nums[r]);
    }
  }

  vector<int> sortArray(vector<int>& nums) {
    shuffle(nums);  // shuffle the array before sorting to avoid the worst case
    sort(nums, 0, nums.size() - 1);
    return nums;
  }
};
On average, the algorithm takes $O(n \log n)$ time. In the worst case, it takes $O(n^2)$ time. We shuffle the array before sorting to avoid the worst case.
5.2 Code Mining

5.2.1 Deep Learning

https://ensmallen.org/docs.html
https://github.com/mlpack/mlpack/blob/master/doc/quickstart/cpp.md
https://github.com/flashlight/flashlight?tab=readme-ov-file#build-options
https://github.com/pytorch/pytorch
https://mp.weixin.qq.com/s/rx7SPYr-sEOz_GOYRfSDOw
Chapter 6
Rust
Part IV
Research
Chapter 7
Paper Writing
Contents
7.1 Resources
7.2 Writing Tools
7.3 Writing Tips
7.4 PhD Theses Tips
7.5 Abstract
7.6 Introduction
7.7 Literature Reviews
7.8 Method
7.9 Result
7.10 Conclusion
7.1 Resources
https://github.com/MLNLP-World/Paper-Writing-Tips
https://github.com/dspinellis/latex-advice
https://github.com/guanyingc/latex_paper_writing_tips
https://github.com/yzy1996/English-Writing
https://github.com/guanyingc/cv_rebuttal_template
https://www.overleaf.com/gallery/tagged/tikz/page
https://color.adobe.com/zh/explore
https://github.com/acmi-lab/cmu-10717-the-art-of-the-paper/tree/main/resources
https://github.com/daveshap/ChatGPT_Custom_Instructions/blob/main/academic_copy_reviser.md
https://github.com/sylvainhalle/textidote?tab=readme-ov-file
https://github.com/egeerardyn/awesome-LaTeX
https://texfaq.org
https://github.com/xinychen/awesome-latex-drawing
https://pudding.cool/2022/02/plain
Table 7.1: Writing Tools
Tool URL
Latex Table Generator tablesgenerator.com
Latex Equation Editor https://www.codecogs.com/latex/eqneditor.php
Latex Equation and Figure Editor https://www.mathcha.io/editor
Equation Image to Latex https://mathpix.com/
Figure Inspiration https://echarts.apache.org/examples/
Figure Creation https://altair-viz.github.io/
Online Figure Editor https://www.tldraw.com/
Identify Font https://www.fontsquirrel.com/matcherator
Identify Color https://imagecolorpicker.com/
Identify Math Symbol https://detexify.kirelabs.org/classify.html
Beer Labeling Tool https://labelstud.io/
Fix Wrong Citation https://github.com/yuchenlin/rebiber
Have a good name for model http://acronymify.com/
Deadline Reminder https://aideadlin.es/?sub=ML,CV,NLP
Great Paper/Model Info https://paperswithcode.com/
Same Meaning Words https://www.thesaurus.com/
Compare Same Meaning Words https://wikidiff.com/neglect/omit
Common Words for Academic Writing http://www.esoda.org/
Doi to Bib https://www.doi2bib.org/
Figure Gallery of Vega https://vega.github.io/vega-lite/examples/
Figure Gallery of Graphviz https://graphviz.org/gallery/
UML Diagram https://plantuml.com/
Color Picker https://colorbrewer2.org
Abstract/Title Generator https://x.writefull.com/
Online BibTeX Formaer https://flamingtempura.github.io/bibtex-tidy
https://www.youtube.com/watch?v=VK51E3gHENc&ab_channel=MicrosoftResearch
https://www.cs.jhu.edu/~jason/advice/write-the-paper-first.html
https://github.com/yzy1996/English-Writing
7.2 Writing Tools
7.3 Writing Tips
Your dra can describe your motivation, formal problem, model, algorithms, and experiments before you
actually build anything. (Ideally, it will also explain why you did it this way rather than some other way, and
point out gaps that remain for future work.) By showing others the dra at this stage, you’ll get important
feedback before you invest time in the “wrong” work.
If you run into trouble while doing the work, then I may have diculty diagnosing or even understanding
the problem you are facing. We may waste a whole meeting just geing aligned. But if you can show me
a precise writeup of what you’re doing, then I can be much more helpful. In general, meetings are very
productive when we have a concrete document in front of us.
Of course, there are many kinds of concrete documents. You could instead write up the notes from our (usually extensive) discussions as private documents for further discussion. Still, writing up in the form of a final paper makes you (1) integrate everything in one place, (2) decide which ideas will be made central for this paper, and (3) focus on the coherence and impact of the end product. (Additional discussion and brainstorming can still go into the document—those subsections can be cut or moved to an appendix for the conference paper, but retained for a longer version as a tech report, journal article, or dissertation.)[1]

[1] https://www.cs.jhu.edu/~jason/advice/write-the-paper-first.html
What needs to be done? Assuming the idea is indeed a good one, writing a draft makes you sharpen the message of the paper. Then you can figure out what work needs to be done to support exactly that message.
Your introduction will make some claims. Often you will realize that some interesting additional experiment would really test those claims.

Writing the literature review will help you design your experiments. It may influence what datasets you use, what comparisons you do, and what you are trying to prove you can do that other people can't. However, don't write the lit review first. Write out your own ideas before comparing them with the literature. This increases the chance that you'll find new angles on the problem.

Writing the experimental section is possible even before you've done the experiments. Explain the full experimental design. Make empty result tables with row and column headers and explanatory captions. Make empty graphs with axes and captions. The actual results will be missing, but it will be clear what work is necessary. If you are having trouble switching your brain from experiments to writing, just write up your current experiments first. Writing an intro will also help, since it lets you explain why you are doing the experiments.

Honest writing may lead you to realize that proving your point requires more work than you'd thought (which is why you'd better write early).
How to give a wonderful talk?
What is the question?
Why is it interesting?
How do you propose to answer it?
e.g. means "for example".
i.e. means "that is".
et al. means "and others of the same kind".
etc. means "and others"; it does not specify people.
Avoid words with extreme meanings (see Table 7.2 for softer alternatives).
1. Don’t spend too much time reading other people’s research, waiting for inspiration to strike you.
(a) Reading research should be a regular part of any economist’s routine.
(b) But it is not a substitute for doing your own.
2. Don’t set the bar too high. You don’t need to win the Nobel.
3. But don’t set the bar too low!
(a) Be wary of picking a topic that is of interest to a small number of people (e.g. you and your adviser).
(b) In particular, think about topics that will interest those beyond this island.
Table 7.2: Alternative Terms for Common Words
Original Term Alternative Terms
obvious straightforward
always usually, generally, oen
never seldom, rare
avoid, eliminate reduce, mitigate, alleviate, relieve
a lot of much, many
do (verb) perform, conduct, carry out
big large, signicant
like such as, for example
think believe, consider, argue, claim, suggest
talk discuss, describe, present, show
look at examine, investigate, explore, study
get obtain, acquire, receive, gain
keep maintain, retain, preserve
climb increase, rise, grow, improve, ascend
really very, extremely, highly
4. Your ideas and results won't sell themselves.
5. How you communicate your work is of crucial importance.
6. There is no point in having an interesting piece of research that nobody understands or sees the point of.
7. Many economists think of themselves as primarily experts in technical methods: econometrics, economic theory, data expertise.
8. This "white coat" mentality—that we are mainly scientists who then do a write-up of our results—is deeply wrong.
9. Writing is an essential part of the research process, not a last-minute thing to be rushed.
10. Do not wait to write.
11. Identify your key idea explicitly.
12. Tell a story (paper structure).
13. Nail your contributions to the mast (introduction).
14. Related work.
15. Put your readers first (the main body).
16. Listen to your readers.
LaTeX Writing Tips

use \usepackage{microtype} to improve the appearance of the text.
use \tablename~\ref{tab} to reference a table.
use \figurename~\ref{fig} to reference a figure.
use Eq.~\eqref{eq1} to reference an equation.
comments from different people can be distinguished by different colors, e.g., \newcommand{\yl}[1]{{\color{blue}{[(YL): #1]}}}
use \newcommand{\method}{ABC\xspace} to define a model name or method.
mathematical notation: https://www.deeplearningbook.org/contents/notation.html
7.4 PhD eses Tips
e PhD esis Tips from https://www.karlwhelan.com/Teaching/PhD/phd-writing-talk.pdf
What makes a good thesis? Forget about objectivity—beauty is in the eye of the beholder. A good thesis
is one that readers think is good. So, you need to explain well what you are doing to somebody else. Your
ideas and results won’t sell themselves. How you communicate your work is of crucial importance. ere is
no point in having an interesting piece of research that nobody understands or sees the point of. e “white
coat” mentality—that we are mainly scientists who then do a write-up of our results—is misguided. Writing
is an essential part of the research process, not a last-minute thing to be rushed. is is particularly true of
PhD theses which are read very carefully by externs, who are hoping that you have explained what you have
done in a clear fashion.
Writing Skills: More Important Than You Think. What makes a good thesis? Forget about objectivity—beauty is in the eye of the beholder. A good thesis is one that readers think is good. So, you need to explain well what you are doing to somebody else. Your ideas and results won't sell themselves. How you communicate your work is of crucial importance. There is no point in having an interesting piece of research that nobody understands or sees the point of. The "white coat" mentality—that we are mainly scientists who then do a write-up of our results—is misguided. Writing is an essential part of the research process, not a last-minute thing to be rushed. This is particularly true of PhD theses, which are read very carefully by externs, who are hoping that you have explained what you have done in a clear fashion.
Start by Avoiding Very Bad Writing. People can be quite sensitive about their writing. If you are a native English speaker and you have a degree, you probably think you know how to write and communicate well. Chances are, you might be wrong. Most MA graduates and even many professional PhD-qualified economists write very poorly. Good writing is hard to define. Bad writing is easy to spot. A badly-written thesis will have:

1. Mis-spelled words.
2. Missing words.
3. Sentences that don't make sense or aren't proper sentences.
4. Poor use of punctuation: full stops, commas, etc.

This annoys the reader because it makes things harder to read (and understand), but also because it's so easy to prevent. It suggests you didn't take the time to be careful.
What To Do About It? Read, re-read, edit, and re-edit. And do this as you go along. Read and edit after you've written a page or so. This can correct most of the common errors of style, grammar, and spelling that occur in the writing process. Most of you actually do know what a sentence is. In addition to catching typos, a quick re-read and edit allows you to check that what you've written gets across what you've been trying to say. Sometimes you can think you've made a point clearly, but then you read what you've written and it's not so good. Read your stuff aloud or slowly to yourself. Does it sound right? Are you writing proper sentences? Are you over-using jargon or certain particular phrases? Use spell checks but don't rely on them. Grammar: there are rules about how to use commas, colons, semi-colons, and full stops, and about what defines a sentence. Try to learn them.
7.5 Abstract
7.6 Introduction
1. Introductions are crucial because they set up the reader to understand what your topic is and what you
are going to do in your paper.
2. Quickly explain two things:
(a) Why is the topic of your paper interesting?
(b) What did YOU do? What is YOUR contribution? A new question? An existing question but new methodology? Existing question, existing methodology, new data (e.g. no previous Irish application)?
3. Be willing to give an outline of what your results are, but don't get into too many details.
4. Because of its importance, spend a lot of time on the introduction.
5. But don't make it too long. Three pages is the limit. Two is better.
6. I often start writing the introduction as soon as I have some results and then keep adjusting it as the paper evolves.
7. Conclusions, on the other hand, should be kept short and to the point. Don't repeat lots of stuff from the intro.
7.7 Literature Reviews
1. You need to explain your contribution.
2. So it needs to be put in context.
3. is will require discussion of previous studies in this area.
4. Remember, though, the purpose is to set up your contribution, and distinguish it from previous work.
5. Don’t simply provide a long list of separate descriptions of weakly related studies. (X (2007) did this.
Y(2009) did that …)
6. Grouping studies together by type may be a beer way of explaining it than listing lots of separate
individual studies.
7. A well-focused literature review is usually a useful component of a PhD thesis. It shows externs and
your supervisor that you have read the literature.
8. However, it is not essential that articles submied for publication have a literature review. You may be
able to summarise the existing literature in your introduction or work it into your opening section that
sets up what you are doing.
7.8 Method
7.9 Result
7.10 Conclusion
Part V
Book Reading
Chapter 8
Dive into Deep Learning
8.1 Linear Neural Networks
8.1.1 Exercise 3.1.6
Exercise 1: Let's solve this analytically.
To minimize the sum of squared differences, we take the derivative of the function with respect to $b$, set it equal to zero, and solve for $b$. The function to minimize is:
$$f(b) = \sum_{i=1}^{n} (x_i - b)^2$$
Taking the derivative with respect to $b$ gives:
$$f'(b) = \sum_{i=1}^{n} -2(x_i - b)$$
Setting this equal to zero and solving for $b$ gives:
$$0 = \sum_{i=1}^{n} -2(x_i - b)$$
Solving for $b$ gives:
$$b = \frac{1}{n}\sum_{i=1}^{n} x_i$$
So the value of $b$ that minimizes the sum of squared differences is the mean of the $x_i$.
This problem and its solution relate to the normal distribution because the sum of squared differences is the basis for the maximum likelihood estimation of the mean of a normal distribution. In other words, the mean of a normal distribution is the value that maximizes the likelihood of the observed data, which is equivalent to minimizing the sum of squared differences from the mean.
If we change the loss function to the sum of absolute differences, the problem becomes finding the median of the $x_i$. This is because the median is the value that minimizes the sum of absolute differences. Unlike the mean, the median is not sensitive to extreme values, so it is a more robust measure of central tendency when there are outliers in the data.
Exercise 2: An ane function is a function composed of a linear transformation and a translation. In the
context of machine learning, an ane function oen takes the form f(x) = xw+b), where xis an input
vector, wis a weight vector, and (b) is a bias term.
We can show that this ane function is equivalent to a linear function on the augmented vector (x,1)
by considering a new weight vector w= (w, b)and an augmented input vector x= (x,1). en the ane
function can be wrien as a linear function:
f(x) = xw+b=x′⊤w
Here’s the proof:
e original ane function is f(x) = xw+b.
We dene x= (x,1))and(w= (w, b).
e dot product x′⊤w)is(xw+ 1 ·b, which is exactly the original ane function.
erefore, the ane function xw+bis equivalent to the linear function x′⊤won the augmented vector
(x,1). is equivalence is oen used in machine learning to simplify the notation and the implementation of
algorithms.
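A quick NumPy check of this equivalence (illustrative values):

import numpy as np

rng = np.random.default_rng(0)
x, w, b = rng.normal(size=3), rng.normal(size=3), 0.7

x_aug = np.append(x, 1.0)  # x' = (x, 1)
w_aug = np.append(w, b)    # w' = (w, b)

print(np.isclose(x @ w + b, x_aug @ w_aug))  # True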
Exercise 3: A quadratic function of the form $f(x) = b + \sum_i w_i x_i + \sum_{j \le i} w_{ij} x_i x_j$ can be implemented in a deep network using a fully connected layer followed by an element-wise multiplication operation and another fully connected layer. Here's how you can do it:
Input Layer: The input to the network is the vector $x$.
First Fully Connected Layer: This layer applies a linear transformation to the input vector. The weights of this layer represent the $w_i$ terms in the quadratic function. The output of this layer is $b + \sum_i w_i x_i$.
Element-wise Multiplication Layer: This layer computes the element-wise product of the input vector with itself, resulting in the vector $x \odot x$, where $\odot$ denotes element-wise multiplication. This operation corresponds to the $x_i x_j$ terms in the quadratic function. (Strictly, $x \odot x$ only yields the squared terms $x_i^2$; to obtain all cross terms $x_i x_j$, one can use the outer product $x x^\top$ instead.)
Second Fully Connected Layer: This layer applies a linear transformation to the output of the element-wise multiplication layer. The weights of this layer represent the $w_{ij}$ terms in the quadratic function. The output of this layer is $\sum_{j \le i} w_{ij} x_i x_j$.
Summation Layer: This layer adds the outputs of the first fully connected layer and the second fully connected layer to produce the final output of the network, which is the value of the quadratic function.
Note that this network does not include any activation functions, because the quadratic function is a polynomial function, not a nonlinear function. If you want to model a more complex relationship between the input and the output, you can add activation functions to the network.
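A hedged PyTorch sketch of such a network (for simplicity it forms the full outer product, i.e., all $x_i x_j$ pairs rather than only $j \le i$, which is equivalent in expressive power; the class name is my own):

import torch
import torch.nn as nn

class Quadratic(nn.Module):
    """f(x) = b + sum_i w_i x_i + sum_ij w_ij x_i x_j."""

    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, 1)                # b + sum_i w_i x_i
        self.quad = nn.Linear(d * d, 1, bias=False)  # sum_ij w_ij x_i x_j

    def forward(self, x):  # x: (batch, d)
        outer = (x.unsqueeze(2) * x.unsqueeze(1)).flatten(1)  # all x_i x_j
        return self.linear(x) + self.quad(outer)

model = Quadratic(d=4)
print(model(torch.randn(2, 4)).shape)  # torch.Size([2, 1])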
Exercise 4: If the design matrix $X^\top X$ does not have full rank, it means that there are linearly dependent columns in the matrix $X$. This can lead to several problems:
The matrix $X^\top X$ is not invertible, which means that the normal equations used to solve the linear regression problem do not have a unique solution. This makes it impossible to find a unique set of regression coefficients that minimize the sum of squared residuals.
The presence of linearly dependent columns in $X$ can lead to overfitting, as the model can "learn" to use these redundant features to perfectly fit the training data, but it will not generalize well to new data.
One common way to fix this issue is to add a small amount of noise to the entries of $X$. This can help to make the columns of $X$ linearly independent, which ensures that $X^\top X$ has full rank and is invertible. Another common approach is to use regularization techniques, such as ridge regression or lasso regression, which add a penalty term to the loss function that discourages the model from assigning too much importance to any one feature.
If you add a small amount of coordinate-wise independent Gaussian noise to all entries of $X$, the expected value of the design matrix $X^\top X$ remains the same. This is because the expected value of a Gaussian random variable is its mean, and adding a constant to a random variable shifts its mean by that constant. So if the noise has mean zero, the expected value of $X^\top X$ does not change.
Stochastic gradient descent (SGD) can still be used when $X^\top X$ does not have full rank, but it may not converge to a unique solution. This is because SGD does not rely on the invertibility of $X^\top X$, but instead iteratively updates the regression coefficients based on a randomly selected subset of the data. However, the presence of linearly dependent columns in $X$ can cause the loss surface to have multiple minima, which means that SGD may converge to different solutions depending on the initial values of the regression coefficients.
Exercise 5: e negative log-likelihood of the data under the model is given by:
e likelihood function for the data is P(y|X) = Qn
i=1 1
2exp−|yi(x
iw+b)|, where yiis the (i)-th
target value, xiis the (i)-th input vector, wis the weight vector, and (b) is the bias term.
Taking the negative logarithm of this gives the negative log-likelihood:
log P(y|X) = Pn
i=1 log 2 + |yi(x
iw+b)|.
Unfortunately, there is no closed-form solution for this problem. e absolute value in the log-likelihood
function makes it non-dierentiable at zero, which means that we cannot set its derivative equal to zero and
solve for wand b.
A minibatch stochastic gradient descent algorithm to solve this problem would involve the following steps: Initialize $w$ and $b$ with random values. For each minibatch of data, compute the gradient of the negative log-likelihood with respect to $w$ and $b$; the gradient will be different depending on whether $y_i - (x_i^\top w + b)$ is positive or negative. Update $w$ and $b$ by taking a step in the direction of the negative gradient. Repeat until the algorithm converges. One issue that could arise with this algorithm is that the updates could become very large if $y_i - (x_i^\top w + b)$ is close to zero, because the gradient of the absolute value function is not defined at zero. This could cause the algorithm to diverge.
One possible solution to this problem is to use a variant of gradient descent that includes a regularization term, such as gradient descent with momentum or RMSProp. These algorithms modify the update rule to prevent the updates from becoming too large. Another solution is to add a small constant to the absolute value inside the logarithm to ensure that it is always differentiable.
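A minimal PyTorch sketch of this minibatch SGD procedure on synthetic data (in practice autograd supplies a subgradient of $|\cdot|$ at zero, so the loop runs without special handling; all values are illustrative):

import torch

torch.manual_seed(0)
X = torch.randn(256, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.3 + 0.1 * torch.randn(256)

w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.05)

for step in range(500):
    idx = torch.randint(0, 256, (32,))               # sample a minibatch
    loss = (y[idx] - (X[idx] @ w + b)).abs().mean()  # Laplace NLL up to constants
    opt.zero_grad()
    loss.backward()
    opt.step()

print(w.data, b.data)  # should approach [1, -2, 0.5] and 0.3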
Exercise 6: e composition of two linear layers in a neural network essentially results in another linear
layer. is is due to the property of linearity: the composition of two linear functions is another linear
function.
Mathematically, if we have two linear layers L1and L2such that L1(x) = Ax +band L2(x) = Cx +d,
then the composition L2(L1(x)) = L2(Ax +b) = C(Ax +b) + d= (CA)x+ (Cb +d), which is another
linear function.
is means that a neural network with two linear layers has the same expressive power as a neural net-
work with a single linear layer. It cannot model complex, non-linear relationships between the input and the
output.
To overcome this limitation, we typically introduce non-linearities between the linear layers in the form
of activation functions, such as the ReLU (Rectied Linear Unit), sigmoid, or tanh functions. ese non-linear
activation functions allow the neural network to model complex, non-linear relationships and greatly increase
its expressive power.
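A one-screen NumPy check of this collapse (random matrices):

import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(4, 3)), rng.normal(size=4)
C, d = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

composed = C @ (A @ x + b) + d           # L2(L1(x))
collapsed = (C @ A) @ x + (C @ b + d)    # the single equivalent layer
print(np.allclose(composed, collapsed))  # True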
Exercise 7: Regression for realistic price estimation of houses or stock prices can be challenging due to the complex nature of these markets. Prices can be influenced by a wide range of factors, many of which may not be easily quantifiable or available for use in a regression model.
The additive Gaussian noise assumption may not be appropriate for several reasons. First, prices cannot be negative, but a Gaussian distribution is defined over the entire real line, meaning it allows for negative values. Second, the Gaussian distribution is symmetric, but price changes in real markets are often not symmetric. For example, prices may be more likely to increase slowly but decrease rapidly, leading to a skewed distribution of price changes. Finally, the Gaussian distribution assumes that large changes are extremely unlikely, but in real markets, large price fluctuations can and do occur.
Regression to the logarithm of the price can be a better approach because it can help to address some of the issues with the Gaussian noise assumption. Taking the logarithm of the prices can help to stabilize the variance of the price changes and make the distribution of price changes more symmetric. It also ensures that the predicted prices are always positive, since the exponential of any real number is positive.
When dealing with penny stocks, or stocks with very low prices, there are several additional factors to consider. One issue is that the prices of these stocks may not be able to change smoothly due to the minimum tick size, or the smallest increment by which the price can change. This can lead to a discontinuous distribution of price changes, which is not well modeled by a Gaussian distribution. Another issue is that penny stocks are often less liquid than higher-priced stocks, meaning there may not be enough buyers or sellers to trade at all possible prices. This can lead to large price jumps, which are also not well modeled by a Gaussian distribution.
The Black-Scholes model for option pricing is a celebrated model in financial economics that makes specific assumptions about the distribution of stock prices. It assumes that the logarithm of the stock price follows a geometric Brownian motion, which implies that the stock price itself follows a log-normal distribution. This model has been very influential, but it has also been criticized for its assumptions, which may not hold in real markets. For example, it assumes that the volatility of the stock price is constant, which is often not the case in reality.
Exercise 8: e Gaussian additive noise model may not be appropriate for estimating the number of apples
sold in a grocery store for several reasons: e Gaussian model allows for negative values, but the number
of apples sold cannot be negative. e Gaussian model is a continuous distribution, but the number of apples
sold is a discrete quantity. e Gaussian model assumes that large deviations from the mean are extremely
unlikely, but in reality, there could be large uctuations in the number of apples sold due to various factors
(e.g., seasonal demand, promotions).
e Poisson distribution is a discrete probability distribution that expresses the probability of a given
number of events (in this case, the number of apples sold) occurring in a xed interval of time or space. e
parameter λis the rate at which events occur. e expected value of a Poisson-distributed random variable
is indeed λ. is can be shown as follows:
e expected value (E[k]) of a random variable (k) distributed according to a Poisson distribution is given
by:
E[k] =
X
k=0
k·p(k|λ) =
X
k=0
k·λkeλ
k!
By the properties of the Poisson distribution, this sum equals λ.
The loss function associated with the Poisson distribution is typically the negative log-likelihood. Given observed counts $y_1, y_2, \dots, y_n$ and predicted rates $\lambda_1, \lambda_2, \dots, \lambda_n$, the negative log-likelihood is:
$$L(\lambda, y) = \sum_{i=1}^{n} (\lambda_i - y_i \log \lambda_i)$$
If we want to estimate $\log \lambda$ instead of $\lambda$, we can modify the loss function accordingly. One possible choice is to use the squared difference between the logarithm of the predicted rate and the logarithm of the observed count:
$$L(\log \lambda, y) = \sum_{i=1}^{n} (\log \lambda_i - \log y_i)^2$$
This loss function penalizes relative errors in the prediction of the rate, rather than absolute errors. It can be more appropriate when the counts vary over a wide range, as it gives equal weight to proportional errors in the prediction of small and large counts.
8.1.2 Exercise 3.3.5
1. What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument using the framework's API?
PyTorch: In PyTorch's DataLoader, there is an argument called drop_last. If set to True, it will drop the last incomplete batch. By default, it is set to False.

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           shuffle=True, drop_last=True)
2. Suppose that we want to generate a huge dataset, where both the size of the parameter vector w and the number of examples num_examples are large.
1. What happens if we cannot hold all data in memory?
2. How would you shuffle the data if it is held on disk? Your task is to design an efficient algorithm that does not require too many random reads or writes. Hint: pseudorandom permutation generators allow you to design a reshuffle without the need to store the permutation table explicitly (Naor and Reingold, 1999).

1. What happens if we cannot hold all data in memory?
If we cannot hold all the data in memory:
Performance Impact: Reading data directly from disk is much slower than reading from memory. If the training algorithm frequently accesses the disk, it can significantly slow down the training process.
Out-of-Core Processing: You'll need to use out-of-core or "external memory" algorithms, which are designed to process data that is too large to fit into a computer's main memory. These algorithms break the data into smaller chunks that can fit into memory, process each chunk, and then combine the results.
Streaming: Data can be streamed from the disk in mini-batches. Only one mini-batch is loaded into memory at a time, and after processing it, the next mini-batch is loaded. This is common in deep learning when dealing with large datasets.
Distributed Computing: Another approach is to distribute the dataset across multiple machines in a cluster. Each machine processes a subset of the data. Frameworks like TensorFlow and PyTorch have support for distributed training.
2. How would you shue the data if it is held on disk?
Shuing large datasets on disk without too many random reads or writes can be challenging. Here’s a
method using pseudorandom permutation generators:
1. Pseudorandom Permutation: Use a pseudorandom permutation generator to generate a sequence of
indices that represents the shued order of your data. e beauty of pseudorandom permutation generators
is that they can generate the nth element of the sequence without generating the rst n-1 elements, allowing
for ecient random access.
2. Sequential Reads and Writes: - Read the data from the disk sequentially in large chunks (to benet
from sequential read speeds). - For each item in the chunk, use the pseudorandom permutation generator to
determine its new location in the shued dataset. - Write the shued data back to the disk sequentially in
large chunks.
3. In-Place Shuing: If you have some memory available (but not enough to hold the entire dataset),
you can load chunks of data into memory, shue them using the pseudorandom permutation, and then write
them back to disk. is reduces the number of disk writes.
49
Chapter 8 Dive into Deep Learning 8.1 Linear Neural Networks
4. External Sorting: Another approach is to use techniques from external sorting. Divide the data into
chunks that t into memory, shue each chunk in memory, and then merge the chunks together in a shued
manner.
5. Temporary Storage: If possible, use a fast intermediate storage (like an SSD) to temporarily store data
chunks during the shuing process. is can speed up both reads and writes.
Remember, the key is to minimize random disk accesses, as they are much slower than sequential accesses.
Leveraging pseudorandom permutations allows for an ecient reshuing without the need to store large
permutation tables explicitly.
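One way to realize such a generator is a keyed Feistel network with cycle-walking. The sketch below is my own illustration of the idea behind Naor and Reingold (1999), with a hash-based round function chosen for convenience:

import hashlib

def feistel_permutation(i, n, key=b"seed", rounds=4):
    """Map index i to a unique pseudorandom position in [0, n)
    without storing a permutation table."""
    bits = max(2, (n - 1).bit_length())
    bits += bits % 2                  # make the two halves equal width
    half = bits // 2
    mask = (1 << half) - 1

    def f(r, x):                      # keyed round function on one half
        h = hashlib.blake2b(key + bytes([r]) + x.to_bytes(8, "big"),
                            digest_size=8)
        return int.from_bytes(h.digest(), "big") & mask

    while True:                       # cycle-walk until we land inside [0, n)
        left, right = i >> half, i & mask
        for r in range(rounds):
            left, right = right, left ^ f(r, right)
        i = (left << half) | right
        if i < n:
            return i

# Every index maps to a unique shuffled position; no table is stored.
n = 1000
assert sorted(feistel_permutation(i, n) for i in range(n)) == list(range(n))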
3. Implement a data generator that produces new data on the fly, every time the iterator is called.

import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __init__(self, num_samples, tensor_size):
        """
        Args:
            num_samples (int): Number of samples in the dataset.
            tensor_size (tuple): Size of the tensor to be generated.
        """
        self.num_samples = num_samples
        self.tensor_size = tensor_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Generate a random tensor on the fly
        sample = torch.randn(self.tensor_size)
        return sample

# Parameters
num_samples = 1000
tensor_size = (3, 32, 32)  # Example: 3-channel, 32x32 image

# Create dataset and dataloader
random_dataset = RandomDataset(num_samples, tensor_size)
dataloader = DataLoader(random_dataset, batch_size=32, shuffle=True)

# Iterate over the dataloader
for batch in dataloader:
    print(batch.shape)  # Should print torch.Size([32, 3, 32, 32])
                        # for all batches except possibly the last one
4. How would you design a random data generator that generates the same data each time it is called?
Seed the sampling deterministically per index, e.g. torch.manual_seed(idx) (or, better, a dedicated per-index torch.Generator) before drawing the sample.
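For instance, a sketch extending the RandomDataset above (a per-index torch.Generator keeps the global RNG untouched):

import torch

class DeterministicRandomDataset(torch.utils.data.Dataset):
    """Like RandomDataset, but sample idx is identical on every access."""

    def __init__(self, num_samples, tensor_size):
        self.num_samples = num_samples
        self.tensor_size = tensor_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        g = torch.Generator().manual_seed(idx)  # per-index seed
        return torch.randn(self.tensor_size, generator=g)

ds = DeterministicRandomDataset(10, (3, 32, 32))
print(torch.equal(ds[0], ds[0]))  # True: repeated access yields the same tensor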
8.1.3 Exercise 3.5.6
1. How would you need to change the learning rate if you replace the aggregate loss over the minibatch with an average over the loss on the minibatch? When you replace the aggregate (summed) loss over the minibatch with an average over the loss on the minibatch, you are essentially scaling the loss by a factor of $\frac{1}{\text{batch size}}$. If you are using gradient descent or a variant thereof, the update rule is:
$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L$$
Where: - $\theta$ represents the parameters of the model. - $\eta$ is the learning rate. - $\nabla L$ is the gradient of the loss function $L$ with respect to the parameters.
If you change the loss from an aggregate to an average, the gradient $\nabla L$ will also be scaled by $\frac{1}{\text{batch size}}$. To compensate for this scaling and to ensure that the magnitude of the parameter updates remains the same, you would need to multiply the learning rate $\eta$ by the batch size.
In other words, if your original learning rate was $\eta_{\text{original}}$ when using the aggregate loss, and you switch to using the average loss, your new learning rate should be:
$$\eta_{\text{new}} = \eta_{\text{original}} \times \text{batch size}$$
However, in practice, many deep learning frameworks, including TensorFlow and PyTorch, compute the average loss over a minibatch by default. So, if you're already using one of these frameworks, you likely don't need to adjust the learning rate when switching between aggregate and average loss, unless you were explicitly scaling the loss by the batch size yourself.
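A quick numerical check of the scaling claim (a sketch; the quadratic model and data are arbitrary):

import torch

w = torch.ones(3, requires_grad=True)
x = torch.randn(8, 3)  # a minibatch of 8 samples

loss_sum = (x @ w).pow(2).sum()    # aggregate loss
grad_sum = torch.autograd.grad(loss_sum, w)[0]

loss_mean = (x @ w).pow(2).mean()  # average loss
grad_mean = torch.autograd.grad(loss_mean, w)[0]

# The gradient shrinks by exactly 1/batch_size, so eta must grow by batch_size
print(torch.allclose(grad_sum, grad_mean * x.shape[0]))  # True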
4. What is the eect on the solution if you change the learning rate and the number of epochs?
Does it keep on improving? e learning rate and the number of epochs are two critical hyperparameters
in training neural networks and other machine learning models. eir values can signicantly impact the
convergence and performance of the model. Let’s discuss the eects of changing each:
1. Learning Rate (η): - Too High: If the learning rate is set too high, the model might overshoot the
minima in the loss landscape, leading to divergence. e loss might oscillate or even explode. - Too Low: If
the learning rate is too low, the model will converge very slowly. It might get stuck in local minima or plateaus
in the loss landscape. - Adaptive: Many modern optimizers (like Adam, RMSprop) adjust the learning rate
dynamically based on recent gradients, which can help in nding a good balance.
2. Number of Epochs: - Too Few: If the number of epochs is too low, the model might undert the data, as it
hasn’t seen the data enough times to learn the underlying paerns. - Too Many: If the number of epochs is too
high, the model might overt the data, especially if there isn’t an early stopping mechanism or regularization
in place. Overing means the model performs well on the training data but poorly on unseen data.
Does it keep on improving?
- Not necessarily. As training progresses: - e loss typically decreases and the model’s performance on
the training set improves. - However, aer a certain point, the model might start to overt, and its performance
on a validation set might start to degrade. - e optimal number of epochs is oen where the performance
on the validation set is maximized.
- e learning rate plays a crucial role in this: - A good learning rate can help the model converge to a
good solution faster. - Learning rate schedules or decay can be benecial. Starting with a larger learning rate
(to progress quickly) and reducing it over time (to ne-tune) can be eective.
In summary, there’s a balance to strike. It’s essential to monitor both training and validation metrics
during training to determine the best values for the learning rate and the number of epochs. Experimentation
and hyperparameter tuning, possibly with techniques like cross-validation or Bayesian optimization, can help
nd the optimal values for a given problem and dataset.
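For instance, a step decay schedule in PyTorch might look like the sketch below; the model, learning rate, and decay factors are placeholder values.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())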
8.1.4 Exercise 4.1.5
1. We can explore the connection between exponential families and softmax in some more depth
1. Compute the second derivative of the cross-entropy loss $l(y, \hat{y})$ for softmax. 2. Compute the variance of the distribution given by $\mathrm{softmax}(o)$ and show that it matches the second derivative computed above.
1. Second Derivative of Cross-Entropy Loss for Softmax. The cross-entropy loss for softmax is given by:
$$l(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$$
Where $\hat{y}_i$ is the softmax output:
$$\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$
Let's compute the second derivative with respect to the logits $o$.
First, the derivative of $\hat{y}_i$ with respect to $o_k$ is:
$$\frac{\partial \hat{y}_i}{\partial o_k} = \hat{y}_i(\delta_{ik} - \hat{y}_k)$$
Where $\delta_{ik}$ is the Kronecker delta (it is 1 when $i = k$ and 0 otherwise). It follows that $\frac{\partial \log \hat{y}_j}{\partial o_k} = \delta_{jk} - \hat{y}_k$. Using this, the second derivative is:
$$\frac{\partial^2 l}{\partial o_i \partial o_k} = -\sum_j y_j \frac{\partial^2 \log(\hat{y}_j)}{\partial o_i \partial o_k} = -\sum_j y_j \frac{\partial(\delta_{jk} - \hat{y}_k)}{\partial o_i} = \sum_j y_j \frac{\partial \hat{y}_k}{\partial o_i} = \hat{y}_k(\delta_{ik} - \hat{y}_i)$$
where the last step uses $\frac{\partial \hat{y}_k}{\partial o_i} = \hat{y}_k(\delta_{ki} - \hat{y}_i)$ and the fact that $\sum_j y_j = 1$. On the diagonal ($i = k$), this gives $\hat{y}_i(1 - \hat{y}_i)$.
2. Variance of the Distribution Given by Softmax
The variance $\mathrm{Var}[X]$ of a discrete random variable $X$ with possible outcomes $x_1, x_2, \ldots$ and probabilities $p_1, p_2, \ldots$ is:
$$\mathrm{Var}[X] = \sum_i p_i (x_i - E[X])^2$$
For the softmax distribution, $X$ can take on values corresponding to the logits $o$ with probabilities $\hat{y}$. Thus, the variance becomes:
$$\mathrm{Var}[X] = \sum_i \hat{y}_i (o_i - E[X])^2$$
Where $E[X] = \sum_i \hat{y}_i o_i$.
More directly, view the softmax distribution as a one-hot random vector that equals the $i$-th standard basis vector with probability $\hat{y}_i$. Its covariance matrix is
$$\mathrm{Cov}_{ik} = \hat{y}_i \delta_{ik} - \hat{y}_i \hat{y}_k,$$
which is exactly the second derivative $\hat{y}_k(\delta_{ik} - \hat{y}_i)$ derived above. In the context of exponential families, this is a known property: the Hessian of the log-partition function (and hence of the negative log-likelihood) equals the covariance of the sufficient statistics.
To show that they match, equate the two expressions: the Hessian entry $\hat{y}_k \delta_{ik} - \hat{y}_i \hat{y}_k$ and the covariance entry $\hat{y}_i \delta_{ik} - \hat{y}_i \hat{y}_k$ are identical, since the Kronecker delta makes the first terms agree. The key insight is the relationship between the curvature of the loss (second derivative) and the spread of the distribution (variance).
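A small numerical sanity check of this identity, using PyTorch's autograd (a sketch; the logits are random and y is an arbitrary one-hot target):

import torch

o = torch.randn(5)                # random logits
y = torch.zeros(5); y[2] = 1.0    # an arbitrary one-hot target

def loss(o):
    return -(y * torch.log_softmax(o, dim=0)).sum()

H = torch.autograd.functional.hessian(loss, o)
p = torch.softmax(o, dim=0)
cov = torch.diag(p) - torch.outer(p, p)   # covariance of the one-hot variable
print(torch.allclose(H, cov, atol=1e-6))  # True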
5. Somax gets its name from the following mapping: RealSof tMax(a, b) = log(exp(a) + exp(b))
1. Prove that RealSoMax(a, b)>max(a, b):
Given:
RealSoMax(a, b) = log(exp(a) + exp(b))
Now, since exp(a)and exp(b)are both positive, we have:
exp(a) + exp(b)>exp(a)
exp(a) + exp(b)>exp(b)
Taking the logarithm of both sides (logarithm is a monotonically increasing function):
log(exp(a) + exp(b)) >log(exp(a)) = a
log(exp(a) + exp(b)) >log(exp(b)) = b
us, RealSoMax(a, b)>max(a, b).
2. How small can you make the difference between both functions?
Without loss of generality, let's set $b = 0$ and $a \geq b$. Then:
$$\mathrm{RealSoftMax}(a, 0) = \log(\exp(a) + 1)$$
The difference between the two functions is:
$$\mathrm{RealSoftMax}(a, 0) - a = \log(\exp(a) + 1) - a = \log(1 + \exp(-a))$$
As $a$ becomes large, the difference approaches 0, but it is always positive because of the "+1" inside the logarithm.
3. Prove that this holds for $\lambda^{-1}\mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$:
Given:
$$\lambda^{-1}\mathrm{RealSoftMax}(\lambda a, \lambda b) = \lambda^{-1}\log(\exp(\lambda a) + \exp(\lambda b))$$
By part 1, $\mathrm{RealSoftMax}(\lambda a, \lambda b) > \max(\lambda a, \lambda b) = \lambda \max(a, b)$, since $\lambda > 0$. Dividing both sides by $\lambda > 0$ preserves the inequality, so we can conclude that $\lambda^{-1}\mathrm{RealSoftMax}(\lambda a, \lambda b) > \max(a, b)$.
4. Show that for $\lambda \to \infty$ we have $\lambda^{-1}\mathrm{RealSoftMax}(\lambda a, \lambda b) \to \max(a, b)$:
As $\lambda$ becomes very large, the term $\exp(\lambda a)$ will dominate $\exp(\lambda b)$ if $a > b$. Thus:
$$\lambda^{-1}\log(\exp(\lambda a) + \exp(\lambda b)) \to \lambda^{-1}\log(\exp(\lambda a)) = a$$
Similarly, if $b > a$, the result will be $b$. Therefore, as $\lambda \to \infty$, $\lambda^{-1}\mathrm{RealSoftMax}(\lambda a, \lambda b) \to \max(a, b)$.
5. Construct an analogous softmin function:
The softmin function can be constructed using the negative of the inputs to the softmax function:
$$\mathrm{RealSoftMin}(a, b) = -\mathrm{RealSoftMax}(-a, -b) = -\log(\exp(-a) + \exp(-b))$$
6. Extend this to more than two numbers:
For $n$ numbers $a_1, a_2, \ldots, a_n$:
$$\mathrm{RealSoftMax}(a_1, a_2, \ldots, a_n) = \log(\exp(a_1) + \exp(a_2) + \ldots + \exp(a_n))$$
Similarly, the softmin for $n$ numbers is:
$$\mathrm{RealSoftMin}(a_1, a_2, \ldots, a_n) = -\log(\exp(-a_1) + \exp(-a_2) + \ldots + \exp(-a_n))$$
8.1.5 Exercise 4.5.5
format          format min  format max  exp min     exp max   test min  exp(max-1)  exp(max+1)
torch.float64   -1.80e+308  1.80e+308   -1.80e+308  7.10e+02  0.00e+00  6.61e+307   inf
torch.float32   -3.40e+38   3.40e+38    -3.40e+38   8.87e+01  0.00e+00  1.25e+38    inf
torch.bfloat16  -3.39e+38   3.39e+38    -3.39e+38   8.85e+01  0.00e+00  1.00e+38    inf
torch.float16   -6.55e+04   6.55e+04    -6.55e+04   1.11e+01  0.00e+00  2.41e+04    inf
torch.int8      -1.28e+02   1.27e+02    -1.28e+02   4.84e+00  0.00e+00  4.67e+01    8.90e+01
Deep learning uses many different number formats, including FP64 double precision (used extremely rarely), FP32 single precision, BFLOAT16 (good for compressed representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest and largest argument of the exponential function for which the result does not lead to numerical underflow or overflow.
1. FP64 (Double Precision): - Smallest positive normalized number: $2^{-1022}$ - Largest finite representable number: $(2 - 2^{-52}) \times 2^{1023}$ - For the exponential function, the range of $x$ for which $e^x$ does not underflow or overflow is approximately $[-1022\log(2), 1023\log(2)] \approx [-708, 709]$.
2. FP32 (Single Precision): - Smallest positive normalized number: $2^{-126}$ - Largest finite representable number: $(2 - 2^{-23}) \times 2^{127}$ - For the exponential function, the range of $x$ is approximately $[-126\log(2), 127\log(2)] \approx [-87, 88]$.
3. BFLOAT16: - This format has the same exponent bits as FP32 but fewer precision bits. - The range for the exponential function is therefore similar to FP32.
4. FP16: - Smallest positive normalized number: $2^{-14}$ - Largest finite representable number: $(2 - 2^{-10}) \times 2^{15}$ - For the exponential function, the range of $x$ is approximately $[-14\log(2), 15\log(2)] \approx [-9.7, 11.1]$.
5. TF32: - This is a format introduced by NVIDIA with the same 8 exponent bits as FP32 but only 10 fraction bits (19 bits in total, including the sign bit). - Since the exponent width matches FP32, the safe range for the exponential function is the same as for FP32.
6. INT8: - This is an integer format, so it doesn't directly apply to the exponential function in the same way as the floating-point formats. However, if used in quantized neural networks, the representable range is $[-128, 127]$.
For the exact ranges, especially for newer formats like TF32, one would need to refer to the official documentation or specifications provided by the manufacturers.
To avoid overflow or underflow when computing the exponential function, it's common to clip or bound the input values to the function within the safe range for the given number format.
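The floating-point bounds can also be read off programmatically; a sketch using torch.finfo (tiny is the smallest positive normalized number):

import math
import torch

for dtype in (torch.float64, torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    # exp(x) overflows above log(max) and underflows to zero below log(tiny)
    print(dtype, math.log(info.tiny), math.log(info.max))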
8.1.6 Exercise 7.1.6
1. Assume that the size of the convolution kernel is $\Delta = 0$. Show that in this case the convolution kernel implements an MLP independently for each set of channels. This leads to the Network in Network architectures. When the size of the convolution kernel is $\Delta = 0$, it means that the kernel
eectively has a size of 1×1. A 1×1convolution kernel operates on each pixel individually, without
considering its neighbors.
Let’s break down the implications of this:
1. Independent Operation on Each Pixel: A 1×1convolution operates on each pixel independently. For
an input with Cin channels, each pixel is represented by a Cin-dimensional vector. e 1×1convolution
transforms this Cin-dimensional vector into a Cout-dimensional vector, where Cout is the number of output
channels.
2. MLP for Each Set of Channels: e transformation from Cin to Cout for each pixel can be viewed as a
fully connected layer (or MLP) applied independently to each spatial location. is is because the weights of
the 1×1convolution kernel can be seen as the weights of a fully connected layer that connects each input
channel to each output channel.
3. Network in Network (NiN): e idea of using 1×1convolutions to implement an MLP for each set of
channels led to the Network in Network (NiN) architecture. In NiN, multiple 1×1convolutional layers are
stacked, allowing for deep per-pixel transformations. is introduces additional non-linearity for each pixel
without changing the spatial dimensions.
In summary, a 1×1convolution kernel eectively applies an MLP to each pixel independently, transform-
ing the channels at each spatial location. is concept is central to the Network in Network (NiN) architecture,
which uses 1×1convolutions to introduce additional non-linearities in the network without altering the spa-
tial dimensions.
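A quick check of the equivalence between a 1x1 convolution and a per-pixel fully connected layer (a sketch with arbitrary channel counts):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 5, 5)  # (batch, C_in, H, W)
conv = nn.Conv2d(8, 16, kernel_size=1, bias=False)

# Reuse the same weights as a fully connected layer applied per pixel
fc = nn.Linear(8, 16, bias=False)
fc.weight.data = conv.weight.data.view(16, 8)

out_conv = conv(x)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True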
5. What happens with convolutions when an object is at the boundary of an image? When an object is at the boundary of an image, and we apply convolutions, several issues and considerations arise:
1. Loss of Information: The primary issue is that for pixels on the boundary, there aren't enough neighboring pixels to apply the convolution kernel fully. This means that these boundary pixels might not be processed adequately, leading to potential loss of information about objects at the image boundary.
2. Padding: To address the boundary issue, one common approach is to add padding around the image. Padding involves adding extra pixels around the boundary of the image. There are different types of padding: - Zero Padding: Adding pixels with a value of zero. - Reflect Padding: The boundary pixels are reflected. For instance, if the last row of an image is $[a, b, c]$, it becomes $[a, b, c, c, b, a]$ after reflect padding. - Replicate Padding: The boundary pixels are replicated. Using the previous example, it becomes $[a, b, c, c, c, c]$.
Padding ensures that every pixel, including those on the boundary, has a full neighborhood of pixels to apply the convolution kernel.
3. Valid vs. Same Convolutions: - Valid Convolution: No padding is used. The resulting feature map after convolution is smaller than the input image because the kernel can only be applied to positions where it fits entirely within the image boundaries. - Same Convolution: Padding is added such that the output feature map has the same spatial dimensions as the input image. A short shape check follows this list.
4. Edge Effects: Even with padding, objects at the boundary might be affected differently than objects in the center of the image. This is because the artificial values introduced by padding might not represent the actual image content. This can lead to edge artifacts in the output feature map.
5. Dilated Convolutions: In dilated convolutions, where the kernel elements are spaced out by introducing gaps, the boundary effects can be even more pronounced, especially if the dilation rate is high.
6. Pooling Layers: The boundary effects can also impact pooling layers (like max-pooling or average-pooling). If the pooling window doesn't fit entirely within the image or feature map boundaries, padding might be needed, or the pooling window might be truncated.
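The shape check mentioned in point 3, as a short sketch:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # "valid": output shrinks
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # "same": output size preserved
print(valid(x).shape)  # torch.Size([1, 1, 6, 6])
print(same(x).shape)   # torch.Size([1, 1, 8, 8])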
6. Prove that the convolution is symmetric, $f * g = g * f$. To prove that convolution is symmetric, we'll start with the definition of the convolution operation for two functions $f(t)$ and $g(t)$:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
Now, let's express the convolution of $g$ and $f$:
$$(g * f)(t) = \int_{-\infty}^{\infty} g(\tau)\, f(t - \tau)\, d\tau$$
To show that these two expressions are equivalent, we'll make a substitution:
Let $\lambda = t - \tau$. This implies $d\tau = -d\lambda$.
When $\tau = -\infty$, $\lambda = t + \infty = \infty$. When $\tau = \infty$, $\lambda = t - \infty = -\infty$.
Substituting these values into the convolution of $g$ and $f$, we get:
$$(g * f)(t) = \int_{\infty}^{-\infty} g(t - \lambda)\, f(\lambda)\, (-d\lambda)$$
Now, reversing the limits of integration:
$$(g * f)(t) = \int_{-\infty}^{\infty} g(t - \lambda)\, f(\lambda)\, d\lambda$$
Comparing this with the expression for $(f * g)(t)$, we see that the two expressions are equivalent:
$$(f * g)(t) = (g * f)(t)$$
Thus, we've proven that the convolution operation is symmetric:
$$f * g = g * f$$
8.1.7 Exercise 7.2.8
4. How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel tensors? Cross-correlation can be represented as a matrix multiplication by converting the input and kernel tensors into appropriate matrix and vector forms, respectively. This process is often referred to as "Toeplitzization" or "im2col" (image to column) transformation. Here's how you can achieve this:
1. Flatten the Kernel: - First, you reshape the kernel into a column vector. - If the kernel is of size $k \times k$, the resulting vector will be of size $k^2 \times 1$.
2. Toeplitz Matrix for Input: - For each position in the output, extract a $k \times k$ patch from the input, and flatten it into a row vector. - Stack these row vectors on top of each other to form a matrix. This matrix has a special structure called a Toeplitz matrix. - If the input is of size $n \times n$ and the kernel is $k \times k$, then the resulting matrix will be of size $(n - k + 1)^2 \times k^2$.
3. Matrix Multiplication: - Multiply the Toeplitz matrix (from step 2) with the flattened kernel (from step 1). The result will be a column vector representing the flattened output of the cross-correlation operation. - Reshape this column vector back into a 2D matrix to get the final output.
Mathematically, if $M$ is the Toeplitz matrix derived from the input and $K$ is the column vector derived from the kernel, then the output $O$ is given by:
$$O = M \times K$$
This representation is particularly useful for certain computational platforms and algorithms that can efficiently handle matrix multiplications. In deep learning frameworks, this transformation is often used under the hood to speed up convolution operations, especially when using GPUs.
Note: is explanation assumes 2D inputs and kernels for simplicity, but the concept can be extended to
higher dimensions.
import torch

def im2col(inputs, kernel_size):
    # Extract k x k patches from the input tensor and flatten each into a row
    k, _ = kernel_size
    unfold = torch.nn.Unfold(kernel_size=k, stride=1, padding=0)
    cols = unfold(inputs)          # shape: (N, C*k*k, (n-k+1)^2)
    return cols.transpose(1, 2)    # shape: (N, (n-k+1)^2, C*k*k)
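A quick usage check of the helper above (a sketch with arbitrary shapes): the im2col matrix multiplied by the flattened kernel should reproduce torch's built-in cross-correlation.

x = torch.randn(1, 1, 4, 4)
w = torch.randn(1, 1, 3, 3)

cols = im2col(x, (3, 3))                      # shape: (1, 4, 9)
out = (cols @ w.view(9, 1)).view(1, 1, 2, 2)  # matrix multiplication, then reshape

ref = torch.nn.functional.conv2d(x, w)        # conv2d is cross-correlation in PyTorch
print(torch.allclose(out, ref, atol=1e-6))    # True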
Chapter 9
Generative Deep Learning
9.1 Intro to Deep Generative Models
What is Generative Modeling?
Generative modeling can be broadly defined as follows: Generative modeling is a branch of machine learning that involves training a model to produce new data that is similar to a given dataset [Fos22].
Discriminative models vs Generative models
The Generative Modeling Framework
We have a dataset of observations $X$.
We assume that the observations have been generated according to some unknown distribution $p_{data}$.
We want to build a generative model $p_{model}$ that mimics $p_{data}$, to generate observations that appear to have been drawn from $p_{data}$.
Therefore, the desirable properties of $p_{model}$ are:
1. Accuracy: if $p_{model}$ is high for a generated observation, it should look like it has been drawn from $p_{data}$. If $p_{model}$ is low, it should look like it has not been drawn from $p_{data}$.
2. Generation: it should be possible to easily sample a new observation from $p_{model}$.
3. Representation: it should be possible to understand how different high-level features in the data are represented by $p_{model}$.
Generative Model Taxonomy (Figure 9.1)
Explicitly model the density function, but constrain the model in some way, so that the density function is tractable (i.e., it can be calculated).
Explicitly model a tractable approximation of the density function.
Implicitly model the density function, through a stochastic process that directly generates data.
Figure 9.1: Generative Models Taxonomy
If you do choose to use batch normalization before activation, you can remember the order using the
acronym BAD (batch normalization, activation, then dropout)
9.2 Variational Autoencoders
The Reparameterization Trick Rather than sample directly from a normal distribution with parameters z_mean and z_log_var, we can sample epsilon from a standard normal and then manually adjust the sample to have the correct mean and variance. This is known as the reparameterization trick, and it is important because it means gradients can backpropagate freely through the layer. By keeping all of the randomness of the layer contained within the variable epsilon, the partial derivative of the layer output with respect to its input can be shown to be deterministic (i.e., independent of the random epsilon), which is essential for backpropagation through the layer to be possible.
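A minimal sketch of the sampling layer in PyTorch (the variable names follow the text; this is illustrative, not the book's Keras implementation):

import torch

def sample_latent(z_mean, z_log_var):
    # epsilon carries all the randomness; the shift and scale are deterministic,
    # so gradients flow through z_mean and z_log_var.
    epsilon = torch.randn_like(z_mean)
    return z_mean + torch.exp(0.5 * z_log_var) * epsilon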
Kullback-Leibler (KL) Divergence
$$D_{KL}\left[N(\mu, \sigma)\,\|\,N(0, 1)\right] = \frac{1}{2}\sum_{i=1}^{n}\left(\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2\right).$$
In summary, the KL divergence term penalizes the network for encoding observations to z_mean and z_log_var variables that differ significantly from the parameters of a standard normal distribution, namely z_mean = 0 and z_log_var = 0.
Why does this addition to the loss function help?
Firstly, we now have a well-defined distribution that we can use for choosing points in the latent space: the standard normal distribution. Secondly, since this term tries to force all encoded distributions toward the standard normal distribution, there is less chance that large gaps will form between point clusters. Instead, the encoder will try to use the space around the origin symmetrically and efficiently.
In the original Variational Autoencoder (VAE) paper, the loss function for a VAE was simply the addition of the reconstruction loss and the KL divergence loss term. A variant on this (the β-VAE) includes a factor that weights the KL divergence to ensure that it is well balanced with the reconstruction loss. If we weight the reconstruction loss too heavily, the KL loss will not have the desired regulatory effect and we will see the same problems that we experienced with the plain autoencoder. If the KL divergence term is weighted too heavily, the KL divergence loss will dominate and the reconstructed images will be poor. This weighting term is one of the parameters to tune when you're training your VAE [Fos22].
Figure 9.2: Adding and subtracting features to and from faces
Latent Space Arithmetic One benefit of mapping images into a lower-dimensional latent space is that we can perform arithmetic on vectors in this latent space that has a visual analogue when decoded back into the original image domain (Figure 9.2).
For example, suppose we want to take an image of somebody who looks sad and give them a smile. To do this we first need to find a vector in the latent space that points in the direction of increasing smile. Adding this vector to the encoding of the original image in the latent space will give us a new point which, when decoded, should give us a more smiley version of the original image [Fos22].
9.3 Generative Adversarial Networks
Upsampling vs Transposed Convolution The UpSampling2D layer simply repeats each row and column of its input in order to double the size. The Conv2D layer with stride 1 then performs the convolution operation. It is a similar idea to convolutional transpose, but instead of filling the gaps between pixels with zeros, upsampling just repeats the existing pixel values.
It has been shown that the Conv2DTranspose method can lead to artifacts, or small checkerboard patterns in the output image (see Figure 9.3), that spoil the quality of the output. However, they are still used in many of the most impressive GANs in the literature and have proven to be a powerful tool in the deep learning practitioner's toolbox.
Figure 9.3: Artifacts when using convolutional transpose layers
Adding Noise to the Labels A useful trick when training GANs is to add a small amount of random noise to the training labels. This helps to improve the stability of the training process and sharpen the generated images. This label smoothing acts as a way to tame the discriminator, so that it is presented with a more challenging task and doesn't overpower the generator.
Another requirement of a successful generative model is that it doesn’t only reproduce images from the
training set. To test this, we can find the image from the training set that is closest to a particular generated example. A good measure for distance is the L1 distance, defined as:
$$L1(\mathbf{a}, \mathbf{b}) = \frac{1}{n}\sum_{i=1}^{n} |a_i - b_i|.$$
9.3.1 GAN Training Tips
While GANs are a major breakthrough for generative modeling, they are also notoriously difficult to train. We will explore some of the most common problems and challenges encountered when training GANs in this section, alongside potential solutions. In the next section, we will look at some more fundamental adjustments to the GAN framework that we can make to remedy many of these problems [Fos22].
Discriminator overpowering the generator If the discriminator becomes too strong, the signal from the loss function becomes too weak to drive any meaningful improvements in the generator. In the worst-case scenario, the discriminator perfectly learns to separate real images from fake images and the gradients vanish completely, leading to no training whatsoever.
If you find your discriminator loss function collapsing, you need to find ways to weaken the discriminator. Try the following suggestions:
Increase the rate parameter of the Dropout layers in the discriminator to dampen the amount of information that flows through the network.
Reduce the learning rate of the discriminator.
Reduce the number of convolutional filters in the discriminator.
Add noise to the labels when training the discriminator.
Flip the labels of some images at random when training the discriminator.
Generator overpowering the discriminator If the discriminator is not powerful enough, the generator will find ways to easily trick the discriminator with a small sample of nearly identical images. This is known as mode collapse.
For example, suppose we were to train the generator over several batches without updating the discriminator in between. The generator would be inclined to find a single observation (also known as a mode) that always fools the discriminator and would start to map every point in the latent input space to this image. Moreover, the gradients of the loss function would collapse to near 0, so it wouldn't be able to recover from this state.
Even if we then tried to retrain the discriminator to stop it being fooled by this one point, the generator would simply find another mode that fools the discriminator, since it has already become numb to its input and therefore has no incentive to diversify its output.
If you find that your generator is suffering from mode collapse, you can try strengthening the discriminator using the opposite suggestions to those listed in the previous section. Also, you can try reducing the learning rate of both networks and increasing the batch size.
Uninformative Loss Since the deep learning model is compiled to minimize the loss function, it would be natural to think that the smaller the loss function of the generator, the better the quality of the images produced. However, since the generator is only graded against the current discriminator and the discriminator is constantly improving, we cannot compare the loss function evaluated at different points in the training process. Indeed, the loss function of the generator actually increases over time, even though the quality of the images is clearly improving. This lack of correlation between the generator loss and image quality sometimes makes GAN training difficult to monitor.
Hyperparameters As we have seen, even with simple GANs, there are a large number of hyperparameters to tune. As well as the overall architecture of both the discriminator and the generator, there are the parameters that govern batch normalization, dropout, learning rate, activation layers, convolutional filters, kernel size, striding, batch size, and latent space size to consider. GANs are highly sensitive to very slight changes in all of these parameters, and finding a set of parameters that works is often a case of educated trial and error, rather than following an established set of guidelines.
WGAN-GP WGAN-GP (Figure 9.4) brings a meaningful loss metric that correlates with the generator's convergence and sample quality, and improved stability of the optimization process. For reference, the binary cross-entropy loss minimized by a standard GAN discriminator is:
$$-\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right) \tag{9.1}$$
One last consideration we should note before training a WGAN-GP is that batch normalization shouldn't be used in the critic. This is because batch normalization creates correlation between images in the same batch, which makes the gradient penalty loss less effective. Experiments have shown that WGAN-GPs can still produce excellent results even without batch normalization in the critic. We have now covered all of the key differences between a standard GAN and a WGAN-GP. To recap:
A WGAN-GP uses the Wasserstein loss.
The WGAN-GP is trained using labels of 1 for real and -1 for fake.
There is no sigmoid activation in the final layer of the critic.
Include a gradient penalty term in the loss function for the critic.
Train the critic multiple times for each update of the generator.
There are no batch normalization layers in the critic.
Figure 9.4: e WGAN-GP critic training process
Conditional Generative Adversarial Nets (CGAN)
Figure 9.5: Inputs and outputs of the generator and critic in a CGAN
Summary We explored three different GAN models: the deep convolutional GAN (DCGAN), the more sophisticated WGAN-GP, and the CGAN.
All GANs are characterized by a generator versus discriminator (or critic) architecture, with the discriminator trying to "spot the difference" between real and fake images and the generator aiming to fool the discriminator. By balancing how these two adversaries are trained, the GAN generator can gradually learn how to produce similar observations to those in the training set.
We first saw how to train a DCGAN to generate images of toy bricks. It was able to learn how to realistically represent 3D objects as images, including accurate representations of shadow, shape, and texture. We also explored the different ways in which GAN training can fail, including mode collapse and vanishing gradients [Fos22].
9.4 Autoregressive Models
Working with Text Data There are several key differences between text and image data that mean that many of the methods that work well for image data are not so readily applicable to text data. In particular:
Text data is composed of discrete chunks (either characters or words), whereas pixels in an image are points in a continuous color spectrum. We can easily make a green pixel more blue, but it is not obvious how we should go about making the word cat more like the word dog, for example. This means we can easily apply backpropagation to image data, as we can calculate the gradient of our loss function with respect to individual pixels to establish the direction in which pixel colors should be changed to minimize the loss. With discrete text data, we can't obviously apply backpropagation in the same way, so we need to find a way around this problem.
Text data has a time dimension but no spatial dimension, whereas image data has two spatial dimensions but no time dimension. The order of words is highly important in text data, and words wouldn't make sense in reverse, whereas images can usually be flipped without affecting the content. Furthermore, there are often long-term sequential dependencies between words that need to be captured by the model: for example, the answer to a question or carrying forward the context of a pronoun. With image data, all pixels can be processed simultaneously.
Text data is highly sensitive to small changes in the individual units (words or characters). Image data is generally less sensitive to changes in individual pixel units (a picture of a house would still be recognizable as a house even if some pixels were altered), but with text data, changing even a few words can drastically alter the meaning of the passage, or make it nonsensical. This makes it very difficult to train a model to generate coherent text, as every word is vital to the overall meaning of the passage.
Text data has a rules-based grammatical structure, whereas image data doesn't follow set rules about how the pixel values should be assigned. For example, it wouldn't make grammatical sense in any context to write "The cat sat on the having." There are also semantic rules that are extremely difficult to model; it wouldn't make sense to say "I am in the beach," even though grammatically, there is nothing wrong with this statement.
Tokenization If you use word tokens:
All text can be converted to lowercase, to ensure capitalized words at the start of sentences are tokenized the same way as the same words appearing in the middle of a sentence. In some cases, however, this may not be desirable; for example, some proper nouns, such as names or places, may benefit from remaining capitalized so that they are tokenized independently.
The text vocabulary (the set of distinct words in the training set) may be very large, with some words appearing very sparsely or perhaps only once. It may be wise to replace sparse words with a token for unknown word, rather than including them as separate tokens, to reduce the number of weights the neural network needs to learn.
Words can be stemmed, meaning that they are reduced to their simplest form, so that different tenses of a verb remain tokenized together. For example, browse, browsing, browses, and browsed would all be stemmed to brows.
You will need to either tokenize the punctuation, or remove it altogether.
Using word tokenization means that the model will never be able to predict words outside of the training vocabulary.
If you use character tokens:
The model may generate sequences of characters that form new words outside of the training vocabulary; this may be desirable in some contexts, but not in others.
Capital letters can either be converted to their lowercase counterparts, or remain as separate tokens.
The vocabulary is usually much smaller when using character tokenization. This is beneficial for model training speed as there are fewer weights to learn in the final output layer.
LSTM A recurrent layer has the special property of being able to process sequential input data $x_1, \ldots, x_n$. It consists of a cell that updates its hidden state, $h_t$, as each element of the sequence $x_t$ is passed through it, one timestep at a time.
The hidden state is a vector with length equal to the number of units in the cell. It can be thought of as the cell's current understanding of the sequence. At timestep $t$, the cell uses the previous value of the hidden
state, $h_{t-1}$, together with the data from the current timestep $x_t$, to produce an updated hidden state vector, $h_t$. This recurrent process continues until the end of the sequence. Once the sequence is finished, the layer outputs the final hidden state of the cell, $h_n$. It's important to remember that all of the cells in this diagram share the same weights (as they are really the same cell).
We represent the recurrent process by drawing a copy of the cell at each timestep and show how the hidden state is constantly being updated as it flows through the cells (Figure 9.6).
Figure 9.6: How a single sequence flows through a recurrent layer
The hidden state is updated in six steps (Figure 9.7):
Figure 9.7: An LSTM Cell
1. The hidden state of the previous timestep, $h_{t-1}$, and the current word embedding, $x_t$, are concatenated and passed through the forget gate. This gate is simply a dense layer with weights matrix $W_f$, bias $b_f$, and a sigmoid activation function. The resulting vector, $f_t$, has length equal to the number of units in the cell and contains values between 0 and 1 that determine how much of the previous cell state, $C_{t-1}$, should be retained.
2. The concatenated vector is also passed through an input gate that, like the forget gate, is a dense layer with weights matrix $W_i$, bias $b_i$, and a sigmoid activation function. The output from this gate, $i_t$, has length equal to the number of units in the cell and contains values between 0 and 1 that determine how much new information will be added to the previous cell state, $C_{t-1}$.
3. The concatenated vector is passed through a dense layer with weights matrix $W_C$, bias $b_C$, and a tanh activation function to generate a vector $\tilde{C}_t$ that contains the new information that the cell wants to consider keeping. It also has length equal to the number of units in the cell and contains values between -1 and 1.
4. $f_t$ and $C_{t-1}$ are multiplied element-wise and added to the element-wise multiplication of $i_t$ and $\tilde{C}_t$. This represents forgetting parts of the previous cell state and then adding new relevant information to produce the updated cell state, $C_t$.
5. The concatenated vector is passed through an output gate: a dense layer with weights matrix $W_o$, bias $b_o$, and a sigmoid activation. The resulting vector, $o_t$, has length equal to the number of units in the cell and stores values between 0 and 1 that determine how much of the updated cell state, $C_t$, to output from the cell.
6. $o_t$ is multiplied element-wise with the updated cell state, $C_t$, after a tanh activation has been applied, to produce the new hidden state, $h_t$.
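The six steps written compactly as equations, with $[h_{t-1}, x_t]$ denoting concatenation and $\odot$ element-wise multiplication:
\begin{align*}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{align*}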
GRU Another type of commonly used Recurrent Neural Network (RNN) layer is the GRU (Figure 9.8). The key differences from the LSTM unit are as follows:
1. The forget and input gates are replaced by reset and update gates.
2. There is no cell state or output gate, only a hidden state that is output from the cell.
Figure 9.8: A single GRU cell
9.5 Normalizing Flow Models
Normalizing flows share similarities with both autoregressive models and VAEs. Like autoregressive models, normalizing flows are able to explicitly and tractably model the data-generating distribution $p(x)$. Like VAEs, normalizing flows attempt to map the data into a simpler distribution, such as a Gaussian distribution. The key difference is that normalizing flows place a constraint on the form of the mapping function, so that it is invertible and can therefore be used to generate new data points.
Summary A normalizing flow model is an invertible function defined by a neural network that allows us to directly model the data density via a change of variables. In the general case, the change of variables equation requires us to calculate a highly complex Jacobian determinant, which is impractical for all but the simplest of examples.
To sidestep this issue, the RealNVP model restricts the form of the neural network, such that it adheres to the two essential criteria: it is invertible and has a Jacobian determinant that is easy to compute.
It does this through stacking coupling layers, which produce scale and translation factors at each step. Importantly, the coupling layer masks the data as it flows through the network, in a way that ensures that the Jacobian is lower triangular and therefore has a simple-to-compute determinant. Full visibility of the input data is achieved through flipping the masks at each layer.
By design, the scale and translation operations can be easily inverted, so that once the model is trained it is possible to run data through the network in reverse. This means that we can target the forward transformation process toward a standard Gaussian, which we can easily sample from. We can then run the sampled points backward through the network to generate new observations.
The RealNVP paper also shows how it is possible to apply this technique to images, by using convolutions inside the coupling layers, rather than densely connected layers. The GLOW paper extended this idea to remove the necessity for any hardcoded permutation of the masks. The FFJORD model introduced the concept of continuous-time normalizing flows, by modeling the transformation process as an ODE defined by a neural network.
Overall, we have seen how normalizing flows are a powerful generative modeling family that can produce high-quality samples, while maintaining the ability to tractably describe the data density function [Fos22].
9.6 Energy-Based Models
Energy-based models attempt to model the true data-generating distribution using a Boltzmann distribution (9.2), where $E(x)$ is known as the energy function of an observation $x$.
$$p(x) = \frac{e^{-E(x)}}{\int_{\hat{x} \in X} e^{-E(\hat{x})}\, d\hat{x}} \tag{9.2}$$
9.7 Diusion Models
Denoising Diusion Models (DDM) e core idea behind a denoising diusion model is simple—we train
a deep learning model to denoise an image over a series of very small steps.
e Forward Diusion Process Suppose we have an image x0that we want to corrupt gradually over
a large number of steps (T= 1000), so that eventually it is indistinguishable from standard Gaussian noise
(i.e., xTshouldhavezeormeanandunitvariance).
We can dene a function qthat adds a small amount of Gaussian noise with variance βtto an image xt1
to generate a new image xt. If we keep applying this function, we will generate a sequence of progressively
noisier images (x0,...xT), as shown in Figure 9.9.
We can write this update process mathematically as follows (here, ϵt1is a standard Gaussian with zeor
mean and unit variance):
xt=p1βtxt1+pβtϵt1.
Note that we also scale the input image xt1, to ensure that the variance of the output image xtremains
constant over time. is way, if we normalize our original image x0to have zeor mean and unit variance, the
xtwill approximate a standard Gaussian distribution for large enough T, by induction, as follows.
If we assume that $x_{t-1}$ has zero mean and unit variance, then $\sqrt{1 - \beta_t}\, x_{t-1}$ will have variance $1 - \beta_t$ and $\sqrt{\beta_t}\, \epsilon_{t-1}$ will have variance $\beta_t$, using the rule that $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$. Adding these together, we obtain a new distribution $x_t$ with zero mean and variance $(1 - \beta_t) + \beta_t = 1$, using the rule that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ for independent $X$ and $Y$. Therefore, if $x_0$ is normalized to zero mean and unit variance, then we guarantee that this is also true for all $x_t$, including the final image $x_T$, which will approximate a standard Gaussian distribution. This is exactly what we need, as we want to be able to easily sample $x_T$ and then apply a reverse diffusion process through our trained neural network model.
In other words, our forward noising process $q$ can also be written as follows:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right).$$
The Reparameterization Trick It would also be useful to be able to jump straight from an image $x_0$ to any noised version of the image $x_t$ without having to go through $t$ applications of $q$.
If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, then we can write the following:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon = \ldots = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$$
Note that the second step uses the fact that we can add two Gaussians to obtain a new Gaussian. We therefore have a way to jump from the original image $x_0$ to any step of the forward diffusion process $x_t$. Moreover, we can define the diffusion schedule using the $\bar{\alpha}_t$ values, instead of the original $\beta_t$ values, with the interpretation that $\bar{\alpha}_t$ is the variance due to the signal (the original image, $x_0$) and $1 - \bar{\alpha}_t$ is the variance due to the noise ($\epsilon$).
The forward diffusion process $q$ can therefore also be written as follows:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right).$$
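A minimal sketch of this jump in PyTorch, with an illustrative linear $\beta_t$ schedule (the schedule values are placeholders, not the book's):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # bar{alpha}_t = prod(1 - beta_i)

def q_sample(x0, t):
    """Jump straight from x0 to x_t: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t]
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps

x0 = torch.randn(3, 32, 32)  # stand-in for a normalized image
x500 = q_sample(x0, 500)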
Diusion Schedule
Appendix A
Formulas
A.1 Gaussian distribution
Definition A.1 (Gaussian distribution). Gaussian distribution.
Theorem A.1 (Central limit theorem).
Bibliography
[Bab16] László Babai. "Graph Isomorphism in Quasipolynomial Time". Jan. 19, 2016. arXiv: 1512.03547 [cs, math].
[Chi09] Andrew M. Childs. "Universal Computation by Quantum Walk". Physical Review Letters 102.18 (May 4, 2009), p. 180501. arXiv: 0806.1972.
[Fos22] David Foster. Generative Deep Learning. O'Reilly Media, Inc., June 28, 2022. 444 pp.
[Kit+02] Alexei Yu Kitaev et al. Classical and Quantum Computation. Vol. 47. American Mathematical Soc., 2002.
[Zha+23] Aston Zhang et al. Dive into Deep Learning. Cambridge University Press, Dec. 7, 2023. 581 pp.
Alphabetical Index
G
Gaussian distribution . . . . . . 73
I
index . . . . . . . . . . . . . . . . . . 2