Learning Notes
Yangyang Li
yangyang.li@northwestern.edu
Updated on December 21, 2023
Contents
Acronyms 7
Preface 1
0.1 Features of this template ....... 1
0.1.1 crossref ............ 1
0.1.2 ToC (Table of Contents) .... 1
0.1.3 header and footer ....... 1
0.1.4 bib ............... 2
0.1.5 preface, index, quote (epigraph) and appendix ..... 2
I Machine Learning 3
1 Probability 5
1.1 Basic Concepts ............. 5
1.2 Maximum Likelihood Estimation . . . 5
1.3 Maximum A Posteriori Estimation . . 6
1.4 Gaussian Distribution ......... 6
1.5 Bayesian Network ........... 8
1.6 Probability Graph ........... 8
1.6.1 Variables Elimination ..... 8
1.6.2 Belief propagation ...... 8
1.6.3 Max-product Algorithm . . . 9
1.6.4 Factor Graph ......... 9
1.7 Expectation Maximum ........ 9
1.8 Gaussian Mixture Model ....... 9
1.9 Hidden Markov Model ........ 9
2 Notes 11
2.1 Complexity of matrix multiply . . . . 11
2.2 Representation of matrix ....... 11
2.3 Training Tricks ............ 11
II Algorithm and Data Structure 15
3 Algorithm 17
3.1 Graph .................. 17
III Programming 19
4 Leetcode 21
4.1 Problems Sets ............. 21
4.2 Tips ................... 21
4.3 Problems ................ 22
4.3.1 1 Two Sum .......... 22
4.3.2 20 valid parentheses ..... 22
4.3.3 21 Merge Two Sorted List . . 22
4.3.4 121 Best time to buy and sell stock ..... 22
4.3.5 226 Invert Binary Tree . . . . 22
4.3.6 242 Valid Anagram ...... 22
4.3.7 704 Binary Search ....... 22
4.3.8 733 Flood Fill ......... 22
4.3.9 235 lowest-common-ancestor-of-a-binary-search-tree ..... 22
4.3.10 110 balanced-binary-tree . . . 22
4.3.11 141 linked-list-cycle ..... 23
4.3.12 232 implement-queue-using-stacks ..... 23
4.3.13 278 first-bad-version ..... 23
4.3.14 383 ransom-note ....... 23
4.3.15 70 climbing-stairs ....... 23
4.3.16 409 longest-palindrome . . . 23
4.3.17 206 reverse-linked-list . . . . 23
4.3.18 169 majority-element ..... 23
4.3.19 67 add-binary ......... 23
4.3.20 543 diameter-of-binary-tree . 23
4.3.21 876 middle-of-the-linked-list . 24
4.3.22 104 maximum-depth-of-binary-tree ..... 24
4.3.23 217 contains-duplicate . . . . 24
4.3.24 53 maximum-subarray . . . . 24
4.3.25 57 insert-interval ....... 24
4.3.26 542 01-matrix ......... 24
4.3.27 973 k-closest-points-to-origin 24
4.3.28 3 longest-substring-without-repeating-characters ..... 24
4.3.29 15 3sum ............ 24
4.3.30 102 binary-tree-level-order-traversal ..... 25
4.3.31 133 clone-graph ........ 25
4.3.32 207 course-schedule ..... 25
4.3.33 322 coin-change ........ 25
4.3.34 238 product-of-array-except-self ..... 25
4.3.35 155 min-stack ......... 25
4.3.36 98 validate-binary-search-tree 25
4.3.37 200 number-of-islands . . . . 25
4.3.38 33 search-in-rotated-sorted-array ..... 25
4.3.39 39 combination-sum ..... 25
4.3.40 86.partition-list ........ 26
4.3.41 23.merge-k-sorted-lists . . . . 26
4.3.42 142.linked-list-cycle-ii.cpp . . 26
4.3.43 160.intersection-of-two-linked-lists.cpp ..... 26
4.3.44 26.remove-duplicates-from-sorted-array ..... 26
4.3.45 27.remove-element.cpp . . . . 26
4.3.46 283.move-zeroes.cpp ..... 26
4.3.47 5.longest-palindromic-substring 26
4.3.48 669.trim-a-binary-search-tree.cpp ..... 26
4.3.49 124.binary-tree-maximum-path-sum.cpp ..... 26
4.3.50 46.permutations.cpp ..... 26
4.3.51 51.n-queens.cpp ........ 27
4.3.52 78.subsets.cpp ......... 27
4.3.53 77.combinations.cpp ..... 27
4.3.54 90.subsets-ii.cpp ........ 27
4.3.55 40.combination-sum-ii.cpp . . 27
4.3.56 47.permutations-ii.cpp . . . . 27
4.3.57 111.minimum-depth-of-binary-tree.cpp ..... 27
4.3.58 752.open-the-lock.cpp . . . . 27
4.3.59 76.minimum-window-substring.cpp ..... 27
4.3.60 438.find-all-anagrams-in-a-string.cpp ..... 27
4.3.61 337.house-robber-iii.cpp . . . 27
4.3.62 116. Populating Next Right Pointers in Each Node ..... 27
5 C++ 29
5.1 Code Snippets ............. 29
5.1.1 Random Number Generation 29
5.1.2 Quick Sort ........... 29
5.2 Code Mining ............. 30
5.2.1 Deep Learning ........ 30
6 Rust 31
IV Research 33
7 Paper Writing 35
7.1 Resources ................ 35
7.2 Writing Tools ............. 36
7.3 Writing Tips .............. 36
7.4 PhD eses Tips ............ 39
7.5 Abstract ................ 40
7.6 Introduction .............. 40
7.7 Literature Reviews ........... 40
7.8 Method ................. 41
7.9 Result .................. 41
7.10 Conclusion ............... 41
V Book Reading 43
8 Dive into Deep Learning 45
8.1 Linear Neural Networks ........ 45
8.1.1 Exercise 3.1.6 ......... 45
8.1.2 Exercise 3.3.5 ......... 49
8.1.3 Exercise 3.5.6 ......... 51
8.1.4 Exercise 4.1.5 ......... 52
8.1.5 Exercise 4.5.5 ......... 54
8.1.6 Exercise 7.1.6 ......... 54
9 Generative Deep Learning 59
9.1 Intro to Deep Generative Models . . . 59
9.2 Variational Autoencoders ....... 59
9.3 Generative Adversarial Networks . . 61
9.3.1 Generative Adversarial Nets (GAN) Training Tips ..... 62
9.4 Autoregressive Models ........ 64
9.5 Normalizing Flow Models ....... 67
9.6 Energy-Based Models ......... 71
9.7 Diusion Models ............ 71
Appendices 73
Appendix A Formulas 73
A.1 Gaussian distribution ......... 73
Bibliography 75
Alphabetical Index 77
List of Figures
1.1 A simple Bayesian network. ..................................... 8
1.2 Belief propagation. .......................................... 9
2.1 Batch Normalization ......................................... 12
2.2 Transposed Convolution [Zha+23, Chapter 14.10] ......................... 13
2.3 Transposed convolution with a kernel with stride of 2×2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation. [Zha+23] ..... 14
9.1 Generative Models Taxonomy .................................... 60
9.2 Adding and subtracting features to and from faces ......................... 61
9.3 Artifacts when using convolutional transpose layers ........................ 62
9.4 The Wasserstein GAN with Gradient Penalty (WGAN-GP) critic training process ........ 64
9.5 Inputs and outputs of the generator and critic in a CGAN ..................... 65
9.6 How a single sequence flows through a recurrent layer ...................... 68
9.7 A Long Short-Term Memory (LSTM) Cell ............................. 69
9.8 A single Gated Recurrent Unit (GRU) cell .............................. 70
9.9 The forward diffusion process .................................... 72
List of eorems
A.1 Theorem (Central limit theorem) . . . 73
List of Denitions
A.1 Denition (Gaussian distribution) . . 73
Acronyms
CGAN Conditional Generative Adversarial Nets 64
DCGAN deep convolutional GAN 64
DDM Denoising Diffusion Models 71
GAN Generative Adversarial Nets 4, 61-64
GRU Gated Recurrent Unit 5, 67, 70
KL Kullback-Leibler 60
LSTM Long Short-Term Memory 5, 66, 67, 69
MAP Maximum A Posteriori Estimation 6
MLE Maximum Likelihood Estimation 5, 6
PDF Probability Density Function 6
RNN Recurrent Neural Network 67
VAE Variational Autoencoder 60, 67
WGAN-GP Wasserstein GAN with Gradient Penalty 5, 63, 64
Preface
Contents
0.1 Features of this template ................................... 1
0.1 Features of this template
TeX, stylized within the system as LaTeX, is a typesetting system which was designed and
written by Donald Knuth and first released in 1978. TeX is a popular means of typesetting
complex mathematical formulae; it has been noted as one of the most sophisticated digital
typographical systems.
- Wikipedia
0.1.1 crossref
dierent styles of clickable denitions and theorems
nameref: Gaussian distribution
autoref: Denition A.1,
cref: Denition A.1,
hyperref: Gaussian,
0.1.2 ToC (Table of Contents)
mini toc of sections at the beginning of each chapter
list of theorems, definitions, figures
the chapter titles are bi-directionally linked
0.1.3 header and footer
fancyhdr
right header: section name and link to the beginning of the section
le header: chapter title and link to the beginning of the chapter
footer: page number linked to ToC of the whole document
0.1.4 bib
titles of references are linked to the publisher webpage, e.g., [Kit+02]
backref (go to the page where the reference is cited), e.g., [Chi09]
customized video entry in the references, as in [Bab16]
0.1.5 preface, index, quote (epigraph) and appendix
index page at the end of this document…
Part I
Machine Learning
Chapter 1
Probability
Contents
1.1 Basic Concepts ......................................... 5
1.2 Maximum Likelihood Estimation .............................. 5
1.3 Maximum A Posteriori Estimation ............................. 6
1.4 Gaussian Distribution ..................................... 6
1.5 Bayesian Network ....................................... 8
1.6 Probability Graph ....................................... 8
1.7 Expectation Maximum .................................... 9
1.8 Gaussian Mixture Model ................................... 9
1.9 Hidden Markov Model ..................................... 9
1.1 Basic Concepts
Permutation is an arrangement of objects in which the order is important.
P(n, r) = \frac{n!}{(n-r)!}
Combination is an arrangement of objects in which the order is not important.
C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n-r)!}
in which 0 \le r \le n.
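These quantities can be computed directly from the definitions; below is a minimal C++ sketch (using running products instead of full factorials to limit overflow):

#include <cstdint>
#include <iostream>

// P(n, r) = n! / (n - r)!, computed as n * (n-1) * ... * (n-r+1).
std::uint64_t permutation(std::uint64_t n, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 0; i < r; ++i) result *= (n - i);
    return result;
}

// C(n, r) = P(n, r) / r!; dividing at each step keeps intermediate values exact.
std::uint64_t combination(std::uint64_t n, std::uint64_t r) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i) result = result * (n - r + i) / i;
    return result;
}

int main() {
    std::cout << permutation(5, 2) << '\n';  // 20
    std::cout << combination(5, 2) << '\n';  // 10
}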
1.2 Maximum Likelihood Estimation
X = (x_1, x_2, \dots, x_N)^T, \quad x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T   (1.1)
in which N is the number of samples and p is the number of features. The data is sampled from a distribution p(x \mid \theta), where \theta is the parameter of the distribution.
For N i.i.d. samples, the likelihood function is p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta).
In order to get \theta, we use Maximum Likelihood Estimation (MLE) to maximize the likelihood function:
\theta_{MLE} = \argmax_{\theta} \log p(X \mid \theta) = \argmax_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)   (1.2)
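As a small illustration of Eq. (1.2) (an added sketch, not a derivation from these notes): for N i.i.d. Bernoulli observations with P(x_i = 1) = \theta, the log-likelihood \sum_i [x_i \log\theta + (1 - x_i)\log(1 - \theta)] is maximized at the sample mean, so the MLE reduces to simple counting:

#include <iostream>
#include <numeric>
#include <vector>

// MLE of a Bernoulli parameter: the maximizer of the log-likelihood is the sample mean.
double bernoulli_mle(const std::vector<int>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
}

int main() {
    std::vector<int> flips = {1, 0, 1, 1, 0, 1};  // hypothetical observations
    std::cout << "theta_MLE = " << bernoulli_mle(flips) << '\n';  // 4/6 ~ 0.667
}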
1.3 Maximum A Posteriori Estimation
In Bayes' theorem, \theta is not a constant value but a random variable with a prior distribution, \theta \sim p(\theta). Hence,
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta}   (1.3)
In order to get \theta, we use Maximum A Posteriori Estimation (MAP) to maximize the posterior:
\theta_{MAP} = \argmax_{\theta} p(\theta \mid X) = \argmax_{\theta} \frac{p(X \mid \theta)\, p(\theta)}{p(X)}   (1.4)
After \theta is estimated, we can compute p(X \mid \theta)\, p(\theta) \big/ \int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta to obtain the posterior distribution, and then use the posterior to predict the probability of a new sample x:
p(x_{new} \mid X) = \int_{\theta} p(x_{new} \mid \theta)\, p(\theta \mid X)\, d\theta   (1.5)
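As a worked special case (a sketch under a conjugate-prior assumption, not taken from these notes): for data x_i \sim N(\theta, \sigma^2) with known \sigma^2 and prior \theta \sim N(\mu_0, \tau^2), the posterior is Gaussian and
\theta_{MAP} = \frac{\sum_i x_i/\sigma^2 + \mu_0/\tau^2}{N/\sigma^2 + 1/\tau^2},
which shrinks the sample mean toward the prior mean \mu_0. A minimal C++ sketch:

#include <iostream>
#include <numeric>
#include <vector>

// MAP estimate of a Gaussian mean with known variance sigma2 and a
// conjugate Gaussian prior N(mu0, tau2) on the mean (illustrative values only).
double gaussian_mean_map(const std::vector<double>& x, double sigma2,
                         double mu0, double tau2) {
    double sum = std::accumulate(x.begin(), x.end(), 0.0);
    double n = static_cast<double>(x.size());
    return (sum / sigma2 + mu0 / tau2) / (n / sigma2 + 1.0 / tau2);
}

int main() {
    std::vector<double> x = {1.2, 0.8, 1.1, 0.9};              // hypothetical data, mean 1.0
    std::cout << gaussian_mean_map(x, 1.0, 0.0, 1.0) << '\n';  // 0.8, pulled toward mu0 = 0
}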
1.4 Gaussian Distribution
Gaussian distribution is also called normal distribution.
\theta = (\mu, \sigma^2), \quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2   (1.6)
For MLE,
\theta = (\mu, \Sigma) = (\mu, \sigma^2), \quad \theta_{MLE} = \argmax_{\theta} \log p(X \mid \theta) = \argmax_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)   (1.7)
Generally, the Probability Density Function (PDF) of a Gaussian distribution is
p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)   (1.8)
in which \mu is the mean vector, \Sigma is the covariance matrix, and \det denotes the determinant of a matrix, i.e., the product of all its eigenvalues.
Hence,
\log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{(2\pi)^p \det(\Sigma)}} \exp\left(-\frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\right)\right]   (1.9)
Let's only consider the one-dimensional case for brevity; then
\log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right)\right]   (1.10)
Let's get the optimal value for \mu:
\mu_{MLE} = \argmax_{\mu} \log p(X \mid \theta) = \argmin_{\mu} \sum_{i=1}^{N} \frac{1}{2}(x_i - \mu)^2   (1.11)
So,
\frac{\partial \log p(X \mid \theta)}{\partial \mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N} (\mu - x_i) = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i   (1.12)
Let's get the optimal value for \sigma:
\sigma_{MLE} = \argmax_{\sigma} \log p(X \mid \theta)
= \argmax_{\sigma} \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right)\right]
= \argmax_{\sigma} \sum_{i=1}^{N} \left[-\log\sqrt{2\pi\sigma^2} - \frac{(x_i - \mu)^2}{2\sigma^2}\right]
= \argmin_{\sigma} \sum_{i=1}^{N} \left[\log\sigma + \frac{(x_i - \mu)^2}{2\sigma^2}\right]
Hence,
\frac{\partial}{\partial \sigma} \sum_{i=1}^{N} \left[\log\sigma + \frac{(x_i - \mu)^2}{2\sigma^2}\right] = 0 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2   (1.13)
\mu_{MLE} is unbiased:
E_D[\mu_{MLE}] = E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N} E_D[x_i] = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu   (1.14)
Figure 1.1: A simple Bayesian network.

However, E_D\left[\sigma^2_{MLE}\right] is biased:
E_D\left[\sigma^2_{MLE}\right] = E_D\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{MLE})^2\right]   (1.15)
= E_D\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2 x_i \mu_{MLE} + \mu_{MLE}^2\right)\right]   (1.16)
= E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\,\mu_{MLE}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i + \mu_{MLE}^2\right] = E_D\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu_{MLE}^2\right]   (1.17)
= E_D\left[\left(\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2\right) + \left(\mu^2 - \mu_{MLE}^2\right)\right]   (1.18)
= \sigma^2 - \left(E_D\left[\mu_{MLE}^2\right] - \mu^2\right)   (1.19)
= \sigma^2 - \left(E_D\left[\mu_{MLE}^2\right] - E_D\left[\mu_{MLE}\right]^2\right)   (1.20)
= \sigma^2 - \mathrm{Var}\left[\mu_{MLE}\right] = \sigma^2 - \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]   (1.21)
= \sigma^2 - \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}\left[x_i\right] = \frac{N-1}{N}\sigma^2   (1.22)
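The (N-1)/N factor is easy to check numerically; a minimal C++ sketch comparing the MLE divisor N with the unbiased divisor N-1:

#include <iostream>
#include <numeric>
#include <vector>

// Sample mean.
double mean(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
}

// Variance estimate with a configurable divisor: N gives the biased MLE,
// N - 1 gives the unbiased estimator.
double variance(const std::vector<double>& x, double divisor) {
    double mu = mean(x), ss = 0.0;
    for (double v : x) ss += (v - mu) * (v - mu);
    return ss / divisor;
}

int main() {
    std::vector<double> x = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
    double n = static_cast<double>(x.size());
    std::cout << "MLE (divide by N):        " << variance(x, n) << '\n';        // 4.0
    std::cout << "unbiased (divide by N-1): " << variance(x, n - 1.0) << '\n';  // ~4.57
}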
1.5 Bayesian Network
1.6 Probability Graph
This section is not finished yet; it needs to be reviewed (p. 54).
1.6.1 Variables Elimination
1.6.2 Belief propagation
Belief propagation is mainly used for tree-structured graphs and is equivalent to variable elimination (Section 1.6.1) with caching.
Figure 1.2: Belief propagation.
1.6.3 Max-product Algorithm
1.6.4 Factor Graph
1.7 Expectation Maximum
\Theta^{(t+1)} = \argmax_{\Theta} \int_{z} \log P(x, z \mid \Theta) \cdot P(z \mid x, \Theta^{(t)})\, dz
continue on p60
1.8 Gaussian Mixture Model
Q\left(\Theta, \Theta^{(t)}\right) = \int_{z} \log P(x, z \mid \Theta) \cdot P(z \mid x, \Theta^{(t)})\, dz
1.9 Hidden Markov Model
Chapter 2
Notes
2.1 Complexity of matrix multiply
Assume A is m \times n and B is n \times p. The naive algorithm takes O(mnp) time.
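A minimal C++ sketch of the naive algorithm, assuming dense row-major matrices stored as flat vectors:

#include <cstddef>
#include <vector>

// C (m x p) = A (m x n) * B (n x p), all row-major in flat vectors.
// The three nested loops perform m*n*p multiply-adds, hence O(mnp).
std::vector<double> matmul(const std::vector<double>& A, const std::vector<double>& B,
                           int m, int n, int p) {
    std::vector<double> C(static_cast<std::size_t>(m) * p, 0.0);
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < n; ++k)      // i-k-j order is cache-friendlier for row-major data
            for (int j = 0; j < p; ++j)
                C[i * p + j] += A[i * n + k] * B[k * p + j];
    return C;
}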
2.2 Representation of matrix
Row major, col major, stride.
The output dimension of a convolutional layer is
\left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \times \left\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \right\rfloor
in which n_h is the height of the input, k_h the height of the filter, p_h the (total) padding along the height, and s_h the stride along the height; n_w, k_w, p_w, and s_w are defined analogously for the width.
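A small C++ helper following this formula (here p_h and p_w are taken as the total padding added along each dimension):

#include <iostream>
#include <utility>

// Output spatial size of a convolution: floor((n - k + p + s) / s) per dimension.
std::pair<int, int> conv_output_size(int nh, int nw, int kh, int kw,
                                     int ph, int pw, int sh, int sw) {
    int out_h = (nh - kh + ph + sh) / sh;  // integer division acts as floor for non-negative values
    int out_w = (nw - kw + pw + sw) / sw;
    return {out_h, out_w};
}

int main() {
    // e.g., an 8x8 input, 3x3 kernel, no padding, stride 2 gives a 3x3 output
    auto [h, w] = conv_output_size(8, 8, 3, 3, 0, 0, 2, 2);
    std::cout << h << " x " << w << '\n';
}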
2.3 Training Tricks
Batch normalization:
BN(x) = \gamma \frac{x - \hat{\mu}_B}{\hat{\sigma}_B} + \beta
Batch normalization is a technique that drastically reduces the problem of unstable gradients during training. The solution is surprisingly simple. During training, a batch normalization layer calculates the mean and standard deviation of each of its input channels across the batch and normalizes by subtracting the mean and dividing by the standard deviation. There are then two learned parameters for each channel, the scale (gamma) and the shift (beta); see Fig. 2.1. The output is simply the normalized input, scaled by gamma and shifted by beta [Fos22].
When it comes to prediction, we may only want to predict a single observation, so there is no batch over which to calculate the mean and standard deviation. To get around this problem, during training a batch normalization layer also calculates the moving average of the mean and standard deviation of each channel and stores this value as part of the layer to use at test time [Fos22].
How many parameters are contained within a batch normalization layer? For every channel in the preceding layer, two weights need to be learned: the scale (gamma) and the shift (beta). These are the trainable parameters. The moving average and standard deviation also need to be calculated for each channel, but since they are derived from the data passing through the layer rather than trained through backpropagation, they are called nontrainable parameters. In total, this gives four parameters for each channel in the preceding layer, where two are trainable and two are nontrainable [Fos22].
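A minimal C++ sketch of the training-time computation described above, assuming a fully connected layer so that each feature column is one channel: normalize each channel over the batch, apply the learned gamma and beta, and update the moving statistics that would be used at prediction time.

#include <cmath>
#include <cstddef>
#include <vector>

// Batch normalization (training mode) for a batch stored as x[sample][channel].
// gamma/beta are the trainable scale and shift; running_mean/running_var are the
// non-trainable moving statistics kept for use at test time.
std::vector<std::vector<double>> batch_norm_train(
    const std::vector<std::vector<double>>& x,
    const std::vector<double>& gamma, const std::vector<double>& beta,
    std::vector<double>& running_mean, std::vector<double>& running_var,
    double momentum = 0.9, double eps = 1e-5) {
    const std::size_t n = x.size(), c = x[0].size();
    std::vector<std::vector<double>> out(n, std::vector<double>(c));
    for (std::size_t j = 0; j < c; ++j) {
        double mean = 0.0, var = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += x[i][j];
        mean /= static_cast<double>(n);
        for (std::size_t i = 0; i < n; ++i) var += (x[i][j] - mean) * (x[i][j] - mean);
        var /= static_cast<double>(n);
        for (std::size_t i = 0; i < n; ++i)
            out[i][j] = gamma[j] * (x[i][j] - mean) / std::sqrt(var + eps) + beta[j];
        // Exponential moving averages: the two non-trainable parameters per channel.
        running_mean[j] = momentum * running_mean[j] + (1.0 - momentum) * mean;
        running_var[j]  = momentum * running_var[j]  + (1.0 - momentum) * var;
    }
    return out;
}

At test time the stored running_mean and running_var replace the batch statistics in the same normalization formula.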