Learning Notes
Yangyang Li
yangyang.li@northwestern.edu
Updated on December 21, 2023
Contents
Acronyms 7
Preface 1
0.1 Features of this template ....... 1
0.1.1 crossref ............ 1
0.1.2 ToC (Table of Contents) .... 1
0.1.3 header and footer ....... 1
0.1.4 bib ............... 2
0.1.5 preface, index, quote (epigraph) and appendix ..... 2
I Machine Learning 3
1 Probability 5
1.1 Basic Concepts ............. 5
1.2 Maximum Likelihood Estimation . . . 5
1.3 Maximum A Posteriori Estimation . . 6
1.4 Gaussian Distribution ......... 6
1.5 Bayesian Network ........... 8
1.6 Probability Graph ........... 8
1.6.1 Variables Elimination ..... 8
1.6.2 Belief propagation ...... 8
1.6.3 Max-product Algorithm . . . 9
1.6.4 Factor Graph ......... 9
1.7 Expectation Maximum ........ 9
1.8 Gaussian Mixture Model ....... 9
1.9 Hidden Markov Model ........ 9
2 Notes 11
2.1 Complexity of matrix multiply . . . . 11
2.2 Representation of matrix ....... 11
2.3 Training Tricks ............ 11
II Algorithm and Data Structure 15
3 Algorithm 17
3.1 Graph .................. 17
III Programming 19
4 Leetcode 21
4.1 Problems Sets ............. 21
4.2 Tips ................... 21
4.3 Problems ................ 22
4.3.1 1 Two Sum .......... 22
4.3.2 20 valid parentheses ..... 22
4.3.3 21 Merge Two Sorted List . . 22
4.3.4 121 Best time to buy and sell stock ..... 22
4.3.5 226 Invert Binary Tree . . . . 22
4.3.6 242 Valid Anagram ...... 22
4.3.7 704 Binary Search ....... 22
4.3.8 733 Flood Fill ......... 22
4.3.9 235 lowest-common-ancestor-of-a-binary-search-tree ..... 22
4.3.10 110 balanced-binary-tree . . . 22
4.3.11 141 linked-list-cycle ..... 23
4.3.12 232 implement-queue-using-stacks ..... 23
4.3.13 278 first-bad-version ..... 23
4.3.14 383 ransom-note ....... 23
4.3.15 70 climbing-stairs ....... 23
4.3.16 409 longest-palindrome . . . 23
4.3.17 206 reverse-linked-list . . . . 23
4.3.18 169 majority-element ..... 23
4.3.19 67 add-binary ......... 23
4.3.20 543 diameter-of-binary-tree . 23
4.3.21 876 middle-of-the-linked-list . 24
4.3.22 104 maximum-depth-of-binary-tree ..... 24
4.3.23 217 contains-duplicate . . . . 24
4.3.24 53 maximum-subarray . . . . 24
4.3.25 57 insert-interval ....... 24
4.3.26 542 01-matrix ......... 24
4.3.27 973 k-closest-points-to-origin 24
4.3.28 3 longest-substring-without-repeating-characters ..... 24
4.3.29 15 3sum ............ 24
4.3.30 102 binary-tree-level-order-traversal ..... 25
4.3.31 133 clone-graph ........ 25
4.3.32 207 course-schedule ..... 25
4.3.33 322 coin-change ........ 25
4.3.34 238 product-of-array-except-self ..... 25
4.3.35 155 min-stack ......... 25
4.3.36 98 validate-binary-search-tree 25
4.3.37 200 number-of-islands . . . . 25
4.3.38 33 search-in-rotated-sorted-array ..... 25
4.3.39 39 combination-sum ..... 25
4.3.40 86.partition-list ........ 26
4.3.41 23.merge-k-sorted-lists . . . . 26
4.3.42 142.linked-list-cycle-ii.cpp . . 26
4.3.43 160.intersection-of-two-linked-lists.cpp ..... 26
4.3.44 26.remove-duplicates-from-sorted-array ..... 26
4.3.45 27.remove-element.cpp . . . . 26
4.3.46 283.move-zeroes.cpp ..... 26
4.3.47 5.longest-palindromic-substring 26
4.3.48 669.trim-a-binary-search-tree.cpp ..... 26
4.3.49 124.binary-tree-maximum-path-sum.cpp ..... 26
4.3.50 46.permutations.cpp ..... 26
4.3.51 51.n-queens.cpp ........ 27
4.3.52 78.subsets.cpp ......... 27
4.3.53 77.combinations.cpp ..... 27
4.3.54 90.subsets-ii.cpp ........ 27
4.3.55 40.combination-sum-ii.cpp . . 27
4.3.56 47.permutations-ii.cpp . . . . 27
4.3.57 111.minimum-depth-of-binary-tree.cpp ..... 27
4.3.58 752.open-the-lock.cpp . . . . 27
4.3.59 76.minimum-window-substring.cpp ..... 27
4.3.60 438.find-all-anagrams-in-a-string.cpp ..... 27
4.3.61 337.house-robber-iii.cpp . . . 27
4.3.62 116. Populating Next Right Pointers in Each Node ..... 27
5 C++ 29
5.1 Code Snippets ............. 29
5.1.1 Random Number Generation 29
5.1.2 Quick Sort ........... 29
5.2 Code Mining ............. 30
5.2.1 Deep Learning ........ 30
6 Rust 31
IV Research 33
7 Paper Writing 35
7.1 Resources ................ 35
7.2 Writing Tools ............. 36
7.3 Writing Tips .............. 36
7.4 PhD eses Tips ............ 39
7.5 Abstract ................ 40
7.6 Introduction .............. 40
7.7 Literature Reviews ........... 40
7.8 Method ................. 41
7.9 Result .................. 41
7.10 Conclusion ............... 41
V Book Reading 43
8 Dive into Deep Learning 45
8.1 Linear Neural Networks ........ 45
8.1.1 Exercise 3.1.6 ......... 45
8.1.2 Exercise 3.3.5 ......... 49
8.1.3 Exercise 3.5.6 ......... 51
8.1.4 Exercise 4.1.5 ......... 52
8.1.5 Exercise 4.5.5 ......... 54
8.1.6 Exercise 7.1.6 ......... 54
9 Generative Deep Learning 59
9.1 Intro to Deep Generative Models . . . 59
9.2 Variational Autoencoders ....... 59
9.3 Generative Adversarial Networks . . 61
9.3.1 Generative Adversarial Nets (GAN) Training Tips ..... 62
9.4 Autoregressive Models ........ 64
9.5 Normalizing Flow Models ....... 67
9.6 Energy-Based Models ......... 71
9.7 Diusion Models ............ 71
Appendices 73
Appendix A Formulas 73
A.1 Gaussian distribution ......... 73
Bibliography 75
Alphabetical Index 77
List of Figures
1.1 A simple Bayesian network. ..................................... 8
1.2 Belief propagation. .......................................... 9
2.1 Batch Normalization ......................................... 12
2.2 Transposed Convolution [Zha+23, Chapter 14.10] ......................... 13
2.3 Transposed convolution with a kernel with stride of 2×2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation. [Zha+23] ..... 14
9.1 Generative Models Taxonomy .................................... 60
9.2 Adding and subtracting features to and from faces ......................... 61
9.3 Artifacts when using convolutional transpose layers ........................ 62
9.4 The Wasserstein GAN with Gradient Penalty (WGAN-GP) critic training process ........ 64
9.5 Inputs and outputs of the generator and critic in a CGAN ..................... 65
9.6 How a single sequence flows through a recurrent layer ...................... 68
9.7 A Long Short-Term Memory (LSTM) Cell ............................. 69
9.8 A single Gated Recurrent Unit (GRU) cell .............................. 70
9.9 The forward diffusion process .................................... 72
List of eorems
A.1 Theorem (Central limit theorem) . . . 73
List of Denitions
A.1 Denition (Gaussian distribution) . . 73
Acronyms
CGAN Conditional Generative Adversarial Nets 64
DCGAN deep convolutional GAN 64
DDM Denoising Diffusion Models 71
GAN Generative Adversarial Nets 4, 61-64
GRU Gated Recurrent Unit 5, 67, 70
KL Kullback-Leibler 60
LSTM Long Short-Term Memory 5, 66, 67, 69
MAP Maximum A Posteriori Estimation 6
MLE Maximum Likelihood Estimation 5, 6
PDF Probability Density Function 6
RNN Recurrent Neural Network 67
VAE Variational Autoencoder 60, 67
WGAN-GP Wasserstein GAN with Gradient Penalty 5, 63, 64
Preface
Contents
0.1 Features of this template ................................... 1
0.1 Features of this template
TeX, stylized within the system as LaTeX, is a typesetting system which was designed and
written by Donald Knuth and first released in 1978. TeX is a popular means of typesetting
complex mathematical formulae; it has been noted as one of the most sophisticated digital
typographical systems.
- Wikipedia
0.1.1 crossref
dierent styles of clickable denitions and theorems
nameref: Gaussian distribution
autoref: Denition A.1,
cref: Denition A.1,
hyperref: Gaussian,
0.1.2 ToC (Table of Contents)
mini toc of sections at the beginning of each chapter
list of theorems, definitions, figures
the chapter titles are bi-directionally linked
0.1.3 header and footer
fancyhdr
right header: section name and link to the beginning of the section
le header: chapter title and link to the beginning of the chapter
footer: page number linked to ToC of the whole document
0.1.4 bib
titles of references are linked to the publisher webpage, e.g., [Kit+02]
backref (go to the page where the reference is cited), e.g., [Chi09]
customized video entry in the references, as in [Bab16]
0.1.5 preface, index, quote (epigraph) and appendix
index page at the end of this document…
Part I
Machine Learning
Chapter 1
Probability
Contents
1.1 Basic Concepts ......................................... 5
1.2 Maximum Likelihood Estimation .............................. 5
1.3 Maximum A Posteriori Estimation ............................. 6
1.4 Gaussian Distribution ..................................... 6
1.5 Bayesian Network ....................................... 8
1.6 Probability Graph ....................................... 8
1.7 Expectation Maximum .................................... 9
1.8 Gaussian Mixture Model ................................... 9
1.9 Hidden Markov Model ..................................... 9
1.1 Basic Concepts
Permutation is an arrangement of objects in which the order is important.
P(n, r) = \frac{n!}{(n-r)!}
Combination is an arrangement of objects in which the order is not important.
C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n-r)!}
in which 0 \le r \le n.
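These quantities can be computed directly from the definitions; below is a minimal C++ sketch (using running products instead of full factorials to limit overflow):

#include <cstdint>
#include <iostream>

// P(n, r) = n! / (n - r)!, computed as n * (n-1) * ... * (n-r+1).
std::uint64_t permutation(std::uint64_t n, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 0; i < r; ++i) result *= (n - i);
    return result;
}

// C(n, r) = P(n, r) / r!; dividing at each step keeps intermediate values exact.
std::uint64_t combination(std::uint64_t n, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) result = result * (n - r + i) / i;
    return result;
}

int main() {
    std::cout << permutation(5, 2) << '\n';  // 20
    std::cout << combination(5, 2) << '\n';  // 10
}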
1.2 Maximum Likelihood Estimation
X = (x_1, x_2, \dots, x_N)^T, \quad x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T   (1.1)
in which N is the number of samples and p is the number of features. The data is sampled from a distribution p(x \mid \theta), where \theta is the parameter of the distribution.
For N i.i.d. samples, the likelihood function is p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta).
In order to get \theta, we use Maximum Likelihood Estimation (MLE) to maximize the likelihood function:
\theta_{MLE} = \argmax_{\theta} \log p(X \mid \theta) = \argmax_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)   (1.2)
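As a small illustration of Eq. (1.2) (an added sketch, not a derivation from these notes): for N i.i.d. Bernoulli observations with P(x_i = 1) = \theta, the log-likelihood \sum_i [x_i \log\theta + (1 - x_i)\log(1 - \theta)] is maximized at the sample mean, so the MLE reduces to simple counting:

#include <iostream>
#include <numeric>
#include <vector>

// MLE of a Bernoulli parameter: the maximizer of the log-likelihood is the sample mean.
double bernoulli_mle(const std::vector<int>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
}

int main() {
    std::vector<int> flips = {1, 0, 1, 1, 0, 1};  // hypothetical observations
    std::cout << "theta_MLE = " << bernoulli_mle(flips) << '\n';  // 4/6 ~ 0.667
}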
1.3 Maximum A Posteriori Estimation
In Bayes' theorem, \theta is not a constant value but a random variable with a prior distribution, \theta \sim p(\theta). Hence,
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta}   (1.3)
In order to get \theta, we use Maximum A Posteriori Estimation (MAP) to maximize the posterior:
\theta_{MAP} = \argmax_{\theta} p(\theta \mid X) = \argmax_{\theta} \frac{p(X \mid \theta)\, p(\theta)}{p(X)}   (1.4)
After \theta is estimated, we can compute p(X \mid \theta)\, p(\theta) \big/ \int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta to obtain the posterior distribution, and then use the posterior to predict the probability of a new sample x:
p(x_{new} \mid X) = \int_{\theta} p(x_{new} \mid \theta)\, p(\theta \mid X)\, d\theta   (1.5)
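As a worked special case (a sketch under a conjugate-prior assumption, not taken from these notes): for data x_i \sim N(\theta, \sigma^2) with known \sigma^2 and prior \theta \sim N(\mu_0, \tau^2), the posterior is Gaussian and
\theta_{MAP} = \frac{\sum_i x_i/\sigma^2 + \mu_0/\tau^2}{N/\sigma^2 + 1/\tau^2},
which shrinks the sample mean toward the prior mean \mu_0. A minimal C++ sketch:

#include <iostream>
#include <numeric>
#include <vector>

// MAP estimate of a Gaussian mean with known variance sigma2 and a
// conjugate Gaussian prior N(mu0, tau2) on the mean (illustrative values only).
double gaussian_mean_map(const std::vector<double>& x, double sigma2,
                         double mu0, double tau2) {
    double sum = std::accumulate(x.begin(), x.end(), 0.0);
    double n = static_cast<double>(x.size());
    return (sum / sigma2 + mu0 / tau2) / (n / sigma2 + 1.0 / tau2);
}

int main() {
    std::vector<double> x = {1.2, 0.8, 1.1, 0.9};              // hypothetical data, mean 1.0
    std::cout << gaussian_mean_map(x, 1.0, 0.0, 1.0) << '\n';  // 0.8, pulled toward mu0 = 0
}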
1.4 Gaussian Distribution
Gaussian distribution is also called normal distribution.
\theta = (\mu, \sigma^2), \quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2   (1.6)
For MLE,
\theta = (\mu, \Sigma) = (\mu, \sigma^2), \quad \theta_{MLE} = \argmax_{\theta} \log p(X \mid \theta) = \argmax_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)   (1.7)
Generally, the Probability Density Function (PDF) of a Gaussian distribution is
p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)   (1.8)
in which \mu is the mean vector, \Sigma is the covariance matrix, and \det denotes the determinant of a matrix, i.e., the product of all its eigenvalues.
Hence,
\log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\left(-\frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\right)\right]   (1.9)
Let's only consider the one-dimensional case for brevity; then
\log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right)\right]   (1.10)
Let's get the optimal value for \mu:
\mu_{MLE} = \argmax_{\mu} \log p(X \mid \theta) = \argmin_{\mu} \sum_{i=1}^{N} \frac{1}{2}(x_i - \mu)^2   (1.11)
So,
\frac{\partial \log p(X \mid \theta)}{\partial \mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N} (\mu - x_i) = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i   (1.12)
Let's get the optimal value for \sigma:
\sigma_{MLE} = \argmax_{\sigma} \log p(X \mid \theta)
= \argmax_{\sigma} \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right)\right]
= \argmax_{\sigma} \sum_{i=1}^{N} \left[-\log\sqrt{2\pi\sigma^2} - \frac{(x_i - \mu)^2}{2\sigma^2}\right]
= \argmin_{\sigma} \sum_{i=1}^{N} \left[\log\sigma + \frac{(x_i - \mu)^2}{2\sigma^2}\right]
Hence,
\frac{\partial}{\partial \sigma} \sum_{i=1}^{N} \left[\log\sigma + \frac{(x_i - \mu)^2}{2\sigma^2}\right] = 0 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2   (1.13)
\mu_{MLE} is unbiased:
E_D[\mu_{MLE}] = E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E_D[x_i] = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu   (1.14)
Figure 1.1: A simple Bayesian network.

However, E_D\left[\sigma^2_{MLE}\right] is biased:
E_D\left[\sigma^2_{MLE}\right] = E_D\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{MLE})^2\right]   (1.15)
= E_D\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2 x_i \mu_{MLE} + \mu_{MLE}^2\right)\right]   (1.16)
= E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\,\mu_{MLE}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i + \mu_{MLE}^2\right] = E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu_{MLE}^2\right]   (1.17)
= E_D\left[\left(\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2\right) + \left(\mu^2 - \mu_{MLE}^2\right)\right]   (1.18)
= \sigma^2 - \left(E_D\left[\mu_{MLE}^2\right] - \mu^2\right)   (1.19)
= \sigma^2 - \left(E_D\left[\mu_{MLE}^2\right] - E_D\left[\mu_{MLE}\right]^2\right)   (1.20)
= \sigma^2 - \mathrm{Var}\left[\mu_{MLE}\right] = \sigma^2 - \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]   (1.21)
= \sigma^2 - \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}\left[x_i\right] = \frac{N-1}{N}\sigma^2   (1.22)
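The (N-1)/N factor is easy to check numerically; a minimal C++ sketch comparing the MLE divisor N with the unbiased divisor N-1:

#include <iostream>
#include <numeric>
#include <vector>

// Sample mean.
double mean(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
}

// Variance estimate with a configurable divisor: N gives the biased MLE,
// N - 1 gives the unbiased estimator.
double variance(const std::vector<double>& x, double divisor) {
    double mu = mean(x), ss = 0.0;
    for (double v : x) ss += (v - mu) * (v - mu);
    return ss / divisor;
}

int main() {
    std::vector<double> x = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
    double n = static_cast<double>(x.size());
    std::cout << "MLE (divide by N):        " << variance(x, n) << '\n';        // 4.0
    std::cout << "unbiased (divide by N-1): " << variance(x, n - 1.0) << '\n';  // ~4.57
}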
1.5 Bayesian Network
1.6 Probability Graph
This section is not finished yet; it needs to be reviewed (p. 54).
1.6.1 Variables Elimination
1.6.2 Belief propagation
Belief propagation is mainly used for tree-structured graphs and is equivalent to variable elimination (Section 1.6.1) with caching.
Figure 1.2: Belief propagation.
1.6.3 Max-product Algorithm
1.6.4 Factor Graph
1.7 Expectation Maximum
\Theta^{(t+1)} = \argmax_{\Theta} \int_{z} \log P(x, z \mid \Theta) \cdot P(z \mid x, \Theta^{(t)})\, dz
continue on p60
1.8 Gaussian Mixture Model
Q\left(\Theta, \Theta^{(t)}\right) = \int_{z} \log P(x, z \mid \Theta) \cdot P(z \mid x, \Theta^{(t)})\, dz
1.9 Hidden Markov Model
Chapter 2
Notes
2.1 Complexity of matrix multiply
Assume A is m \times n and B is n \times p. The naive algorithm takes O(mnp) time.
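A minimal C++ sketch of the naive algorithm, assuming dense row-major matrices stored as flat vectors:

#include <cstddef>
#include <vector>

// C (m x p) = A (m x n) * B (n x p), all row-major in flat vectors.
// The three nested loops perform m*n*p multiply-adds, hence O(mnp).
std::vector<double> matmul(const std::vector<double>& A, const std::vector<double>& B,
                           int m, int n, int p) {
    std::vector<double> C(static_cast<std::size_t>(m) * p, 0.0);
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < n; ++k)      // i-k-j order is cache-friendlier for row-major data
            for (int j = 0; j < p; ++j)
                C[i * p + j] += A[i * n + k] * B[k * p + j];
    return C;
}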
2.2 Representation of matrix
Row major, col major, stride.
The output dimension of a convolutional layer is
\left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor
in which n_h is the height of the input, k_h the height of the filter, p_h the (total) padding along the height, and s_h the stride along the height; n_w, k_w, p_w, and s_w are defined analogously for the width.
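A small C++ helper following this formula (here p_h and p_w are taken as the total padding added along each dimension):

#include <iostream>
#include <utility>

// Output spatial size of a convolution: floor((n - k + p + s) / s) per dimension.
std::pair<int, int> conv_output_size(int nh, int nw, int kh, int kw,
                                     int ph, int pw, int sh, int sw) {
    int out_h = (nh - kh + ph + sh) / sh;  // integer division acts as floor for non-negative values
    int out_w = (nw - kw + pw + sw) / sw;
    return {out_h, out_w};
}

int main() {
    // e.g., an 8x8 input, 3x3 kernel, no padding, stride 2 gives a 3x3 output
    auto [h, w] = conv_output_size(8, 8, 3, 3, 0, 0, 2, 2);
    std::cout << h << " x " << w << '\n';
}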
2.3 Training Tricks
Batch normalization:
BN(x) = \gamma \frac{x - \hat{\mu}_B}{\hat{\sigma}_B} + \beta
Batch normalization is a technique that drastically reduces the problem of unstable gradients during training. The solution is surprisingly simple. During training, a batch normalization layer calculates the mean and standard deviation of each of its input channels across the batch and normalizes by subtracting the mean and dividing by the standard deviation. There are then two learned parameters for each channel, the scale (gamma) and the shift (beta); see Fig. 2.1. The output is simply the normalized input, scaled by gamma and shifted by beta [Fos22].
When it comes to prediction, we may only want to predict a single observation, so there is no batch over which to calculate the mean and standard deviation. To get around this problem, during training a batch normalization layer also calculates the moving average of the mean and standard deviation of each channel and stores this value as part of the layer to use at test time [Fos22].
How many parameters are contained within a batch normalization layer? For every channel in the preceding layer, two weights need to be learned: the scale (gamma) and the shift (beta). These are the trainable parameters. The moving average and standard deviation also need to be calculated for each channel, but since they are derived from the data passing through the layer rather than trained through backpropagation, they are called nontrainable parameters. In total, this gives four parameters for each channel in the preceding layer, where two are trainable and two are nontrainable [Fos22].
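A minimal C++ sketch of the training-time computation described above, assuming a fully connected layer so that each feature column is one channel: normalize each channel over the batch, apply the learned gamma and beta, and update the moving statistics that would be used at prediction time.

#include <cmath>
#include <cstddef>
#include <vector>

// Batch normalization (training mode) for a batch stored as x[sample][channel].
// gamma/beta are the trainable scale and shift; running_mean/running_var are the
// non-trainable moving statistics kept for use at test time.
std::vector<std::vector<double>> batch_norm_train(
    const std::vector<std::vector<double>>& x,
    const std::vector<double>& gamma, const std::vector<double>& beta,
    std::vector<double>& running_mean, std::vector<double>& running_var,
    double momentum = 0.9, double eps = 1e-5) {
    const std::size_t n = x.size(), c = x[0].size();
    std::vector<std::vector<double>> out(n, std::vector<double>(c));
    for (std::size_t j = 0; j < c; ++j) {
        double mean = 0.0, var = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += x[i][j];
        mean /= static_cast<double>(n);
        for (std::size_t i = 0; i < n; ++i) var += (x[i][j] - mean) * (x[i][j] - mean);
        var /= static_cast<double>(n);
        for (std::size_t i = 0; i < n; ++i)
            out[i][j] = gamma[j] * (x[i][j] - mean) / std::sqrt(var + eps) + beta[j];
        // Exponential moving averages: the two non-trainable parameters per channel.
        running_mean[j] = momentum * running_mean[j] + (1.0 - momentum) * mean;
        running_var[j]  = momentum * running_var[j]  + (1.0 - momentum) * var;
    }
    return out;
}

At test time the stored running_mean and running_var replace the batch statistics in the same normalization formula.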