Deep Learning

Brief Intro to Deep Learning

Deep learning:

  • Data: large datasets, e.g., ImageNet, etc.;
  • Model: deep neural networks, e.g., ResNet-152, etc.;
  • Learning algorithm: backpropagation, i.e., stochastic gradient descent (SGD).

In recurrent neural networks (RNNs): same neural network gets reused many times.

ht=F(xt,ht1,W).h^t = F(x^t, h^{t-1}, W).

Back propagation:

  • Loss is always a scalar value;
  • For the backward propogation, the gradient is in the form: JTvJ^T v, where vv is a vector, and JJ is a Jacobian matrix, TT means transpose;
  • For the forward propogation, the gradient is in the form: JvJ v.
  • In BF process, the shape of a Jacobian matrix is m×nm \times n, where mm is the dimension of output, and nn is the dimension of input.

Consider Vector-by-Matrix Gradients:

Case 1: z=Wx\mathbf{z} = \mathbf{W}\mathbf{x}, where xRn×1\mathbf x \in \mathbb{R}^{n \times 1}, WRm×n\mathbf{W} \in \mathbb{R}^{m \times n}, zRm×1\mathbf z \in \mathbb{R}^{m \times 1}.

zx=W,\begin{aligned} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} &= \mathbf{W}, \\ \end{aligned}

Case 2: z=xW\mathbf{z} = \mathbf{x} \mathbf{W}, where xR1×m\mathbf x \in \mathbb{R}^{1 \times m}, WRm×n\mathbf{W} \in \mathbb{R}^{m \times n}, zR1×n\mathbf z \in \mathbb{R}^{1 \times n}.

zx=WT,\begin{aligned} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} &= \mathbf{W}^T, \\ \end{aligned}

Case 3: z=x\mathbf z = \mathbf x, then zx=I\frac{\partial \mathbf z}{\partial \mathbf x} = \mathbf{I}.

Case 4: XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}, then scalarXRm×n\frac{\partial scalar}{\partial \mathbf{X}} \in \mathbb{R}^{m \times n}, just as the same as the shape of X\mathbf{X}.

Case 5: Z=XW\mathbf{Z} = \mathbf{X}\mathbf{W}, where XRm×n\mathbf{X} \in \mathbb{R}^{m \times n}, WRn×w\mathbf{W} \in \mathbb{R}^{n \times w}. Assume LossZ=δRm×w\frac{\partial Loss}{\partial \mathbf{Z}} = \mathbf{\delta} \in \mathbb{R}^{m \times w}.

LossX=δWT,LossW=XTδ.\begin{aligned} \frac{\partial Loss}{\partial \mathbf{X}} &= \mathbf{\delta} \mathbf{W}^T, \\ \frac{\partial Loss}{\partial \mathbf{W}} &= \mathbf{X}^T \mathbf{\delta}. \end{aligned}

Here we consider the simple example:

h1=xW1,(1×4)=(1×3)×(3×4)y^=h1W2,(1×2)=(1×4)×(4×2)L=y^22.(1×1)=(1×2)×(2×1)\begin{aligned} \mathbf h_1 = \mathbf x \mathbf{W}_1, \qquad (1 \times 4) = (1 \times 3) \times (3 \times 4)\\ \hat{\mathbf y} = \mathbf h_1 \mathbf{W}_2, \qquad (1 \times 2) = (1 \times 4) \times (4 \times 2)\\ L = \|\hat{\mathbf y}\|_2^2. \qquad (1 \times 1) = (1 \times 2) \times (2 \times 1) \end{aligned}

Then, for the backward propagation process, we have:

Ly^=2y^,(1×2)=(1×2)LW2=(y^W2)TLy^=h1TLy^=2h1Ty^,(4×2)=(4×1)×(1×2)Lh1=Ly^y^h1=Ly^W2T=2y^W2T,(1×4)=(1×2)×(2×4)LW1=(h1W1)T[Ly^y^h1]=xTLh1=2xTy^W2T,(3×4)=(3×1)×(1×4)Lx=Ly^y^h1h1x=2y^W2TW1T,(1×3)=(1×2)×(2×4)×(4×3)\begin{aligned} \frac{\partial L}{\partial \hat{\mathbf y}} &= 2 \hat{\mathbf y}, \qquad (1 \times 2) = (1 \times 2) \\ \frac{\partial L}{\partial \mathbf{W}_2} &= (\frac{\partial \hat{\mathbf y}}{\partial \mathbf{W}_2})^T \frac{\partial L}{\partial \hat{\mathbf y}}= \mathbf h_1^T \frac{\partial L}{\partial \hat{\mathbf y}} = 2 \mathbf h_1^T \hat{\mathbf y}, \qquad (4 \times 2) = (4 \times 1) \times (1 \times 2)\\ \frac{\partial L}{\partial \mathbf h_1} &= \frac{\partial L}{\partial \hat{\mathbf y}}\frac{\partial \hat{\mathbf y}}{\partial \mathbf h_1}=\frac{\partial L}{\partial \hat{\mathbf y}} \mathbf{W}_2^T = 2 \hat{\mathbf y} \mathbf{W}_2^T , \qquad (1 \times 4) = (1 \times 2) \times (2 \times 4)\\ \frac{\partial L}{\partial \mathbf{W}_1} &= (\frac{\partial \mathbf h_1}{\partial \mathbf{W}_1})^T [\frac{\partial L}{\partial \hat{\mathbf y}}\frac{\partial \hat{\mathbf y}}{\partial \mathbf h_1}] = \mathbf x^T \frac{\partial L}{\partial \mathbf h_1} = 2 \mathbf x^T \hat{\mathbf y} \mathbf{W}_2^T, \qquad (3 \times 4) = (3 \times 1) \times (1 \times 4) \\ \frac{\partial L}{\partial \mathbf x} &= \frac{\partial L}{\partial \hat{\mathbf y}}\frac{\partial \hat{\mathbf y}}{\partial \mathbf h_1} \frac{\partial \mathbf h_1}{\partial \mathbf x} = 2\hat{\mathbf y} \mathbf{W}_2^T \mathbf{W}_1^T, \qquad (1 \times 3) = (1 \times 2) \times (2 \times 4) \times (4 \times 3) \end{aligned}

We can get similar results if xR3×1\mathbf x \in \mathbb{R}^{3 \times 1}.

Invariant and Equivariant

Invariant: A mathematical object (or a class of mathematical objects) remains unchanged after operations or transformations of a certain type are applied to the objects F(g(x))=F(x)F(g(x)) = F(x), e.g., max pooling;

  • Symmetry Group: all transformations under which the object is invariant

Equivariant: Applying a transformation and then computing the function produces the same result as computing the function and then applying the transformation F(g(x))=g(F(x))F(g(x)) = g(F(x)).

  • Convolution is translation equivariant, i.e., Conv(Shift(X)) = Shift(Conv(X))!

  • Global pooling gives you shift-invariance!

Permutation Invariance

Birkhoff Polytope:

Bn={PRn×ni,j,Pij0,i=1nPij=1,j=1nPij=1}B_n = \{P \in \mathbb{R}^{n \times n} | \forall i, j, P_{ij} \geq 0, \sum_{i=1}^n P_{ij} = 1, \sum_{j=1}^n P_{ij} = 1\}

This type of matrix is Doubly Stochastic Matrix.

Birkhoff–von Neumann Theorem:

  1. Birkhoff Polytope is the convex hull of permutation matrices
  2. Permutation matrices = Vertices of Birkhoff Polytope (S_n):

Assume XRn×3\mathbf{X} \in \mathbb{R}^{n \times 3}, permutation matrix PRn×n\mathbf{P} \in \mathbb{R}^{n \times n}. YR1×K\mathbf Y \in \mathbb{R}^{1 \times K} is the output probability of classes.

Permutation Invariance : Y=f(PX)\mathbf Y = f(\mathbf{PX}), PSn\forall \mathbf{P} \in S_n.

Assume HRn×d\mathbf{H} \in \mathbb{R}^{n \times d} is the representations, and H=f(X)\mathbf{H} = f(\mathbf{X}), then,

Permutation Equivariance: PH=Pf(X)=f(PX)\mathbf{PH} = \mathbf{P}f(\mathbf{X}) = f(\mathbf{PX}), where PSn\mathbf{P} \in S_n is a permutation matrix.

  • equivariant first, then move to invariant.

Valid set funtions: the function is invariant to the order of the input.

Theorem: A function ff operating on a set XX having elements from a countable universe. If ff is a valid set function, then there exists a function gg such that f(X)=ρ(xXϕ(x))f(X) = \rho(\sum_{x \in X} \phi(x)).

Proof: For the mapping: f(X)Rf(X) \to R, the domain of ff is all subsets in XX. For example: if X={a,b,c}X = \{a, b, c\}, then the domain is a power set {ϕ,a,b,c,{a,b},{a,c},{b,c},{a,b,c}}\{\phi, a, b, c, \{a, b\}, \{a, c\}, \{b, c\}, \{a, b, c\}\}.

Sufficiency: summation is permutation invariant!

Necessity: find an unique representation of any set and then map it:

why choose 4: base 3,4 … are ok, but base 2 is not. Since base 2 cannot guarantee the uniqueness of the representation.

Deep learning for Sequences

Applications:

  • Language Model: P(xt+1xt,,x1)P(x^{t+1}| x^{t}, \cdots, x^1), where xt+1x^{t+1} is the word we want to predict.

  • Machine Translation.

Key challenges:

  • Variable length input and output;
  • Order change may be crucial for cognition;
  • complex statistical dependencies (e.g. long-range ones).

Transformer


The output representation of the final encoder is the input of each decoder.


The decoder includes self-attention and encoder-decoder attention.
For the self-attention, it uses masked multi-head attention, why???

How to encode the input sequence?

Input embedding: Construct the one-hot vector for each word. N words, each word will be mapping to a D dimension vector. Then, we can get a NxD matrix. D is the hyper-parameter.

  • Will the input embedding manners affect the performance of the model?

    In general, we use the same vocabulary dictionary for the input embedding.

positional encoding:

PE(pos,2i)=sin(pos/100002i/dmodel)PE(pos,2i+1)=cos(pos/100002i/dmodel)\begin{aligned} PE(pos, 2i) &= sin(pos/10000^{2i/d_{model}}) \\ PE(pos, 2i+1) &= cos(pos/10000^{2i/d_{model}}) \\ \end{aligned}

pos is the index of the word in the sentence. (0-30)
2i2i and 2i+12i+1 is the index of the column, d_model is the number of columns, it is a hyper-parameter(120). For each word(token), we encode it to a vector with dimension d_model according to its position.

Here we use denominator 100002i/dmodel10000^{2i/d_model} to make sure the positional encoding is different for different tokens. The sin and cos are periodic functions, if we don’t use the denominator, then the positional encoding could be same for different tokens.

  • If there are two different sentences with the same size, will the positional encodings be the same?? yes.

Self attention:
XRtokes×dimX \in R^{tokes \times dim}, WQRdim×dimW_Q \in R^{dim \times dim},
Q=XWQRtokens×dimQ = XW_Q \in R^{tokens \times dim}.

softmax(QKTdim)Rtokens×tokenssoftmax(\frac{QK^T}{\sqrt{dim}}) \in R^{tokens \times tokens} is the attention matrix, the dimension must be the same as the number of tokens.

we apply softmax in individual row, then the output of softmax is tokens×tokenstokens \times tokens.

The final output dimension is tokens×dimtokens \times dim.

  • Why does it need to divide by dim\sqrt{dim}?

    To keep the variance of each entry QKT[i,k]QK^T[i,k] to be 1. [approach to 1]. if we don’t preserve the variance, then the gradient will be larger and larger, and the model will be unstable.

Multi-head attention
it can capture different dependency of the input sequence. One choice is to input the same input embedddings for each attention head, and then aggregate the output of each attention head. Another choice is to split the input embeddings into different parts, and then input different parts to different attention heads. (The problem is not a convex problem, thus the weights of each attention head may be different.)

Layer normalization

it is applied to each row of the output of multi-head attention. It is similar to batch normalization, but it is applied to each row, not each column.
we want to learn γ\gamma and β\beta, because we want to learn the distribution of each row.

Masked multi-head attention in decoder
The decoder must be autoregressive. We need to input the previous words to predict the next word and prevent attending from future. For a attention matrix, we can mask the upper triangle, then the values masked are zeros. But, if we mask the upper triangle, the sum of each row is not equal to 1, thus we need to adjust the attention matrix. Normally, the attention matrix is:

A=softmax(QKTdim)\begin{aligned} A = softmax(\frac{QK^T}{\sqrt{dim}}) \end{aligned}

The dimension of the attention matrix is outputtokens×outputtokensoutputtokens \times outputtokens. If we mask AijA_{ij}, then the input of softmax function will be:

Aij=expkQikKjkdimexp()\begin{aligned} A_{ij} = \frac{\exp \frac{ \sum_k Q_{ik}K_{jk} - \infty}{\sqrt{dim}}}{\sum \exp()} \end{aligned}

Shifted right: we already generate one token, and we want to predict the next token, then we always focus on the right part of the output sequence.

Cross attention
it is used in the decoder. The input of the decoder is the output of the encoder. The encoder-decoder attention is similar to the self attention, but the query is the output of the decoder, and the key and value are the output of the encoder. Here the output of encoder is the embaddings, and the docoder can generate the key and value from the embeddings. Cross attention can capture the relationship between the input sentence and the output sentence.

Graph Neural Networks Message Passing Models

Graph: multi-edges, nodes have types, edges have types.

  • connectivity: adjacency list G=(V,E)G = (V, E) and adjacency matrix AA. V=n,E=m|V| = n, |E| = m. ARn×nA \in \mathbb{R}^{n \times n}.
  • features: node features XX, edge features, graph features.

if you want to permute a graph, then you need to left multiply the permutation matrix to the adjacency matrix (change rows) and also right multiply the transpose permutation matrix to the adjacency matrix(change columns):

A=PAPT\begin{aligned} A' = PAP^T \end{aligned}

For a graph A1A_1, if there exists a permutation matrix PP, such that A2=PA1PTA_2 = PA_1P^T, then A1A_1 and A2A_2 are graph isomorphic.

For a graph AA, if there exists a permutation matrix PP, such that A=PAPTA = PAP^T, then AA is graph automorphism.

Given graph data (A,X)(A, X) and f(A,X)Rn×df(A, X) \in \mathbb{R}^{n \times d}:

  • invariance: f(PAPT,PX)=f(A,X)f(PAP^T, PX)=f(A, X), PSn\forall P \in S_n.

  • equivariance: f(PAPT,PX)=Pf(A,X)f(PAP^T, PX) = Pf(A, X), PSn\forall P \in S_n.

Key challenges:

  • unordered neighbors;
  • variable size of neighbors;
  • varying graph partitions.

Message Passing in GNNs

Feedforward networks: layers do not share message passing module. [don’t share weights]. Usually, we call it ‘layers’.

Recurrent networks: layers share message passing module.[reuse weights]. Usually, we call it ‘steps’ instead of ‘layers’.

Even if we increase the number of nodes and edges, the model can still work.

ONLY USE ONE NETWORK FOR ALL NEIGHBORS, same across edges. if we use different networks for each neighbor, changing the number of nodes will affect the model. The model is not invariant to the number of nodes.

For undirected graph, the message passing is symmetric, i.e., m_ij = m_ji. It doesn’t related to the order of nodes. For directed graph, m_ij may not equal to m_ji.

We can also use transformers or gcn or LSTM to implement the message passing. Be careful that LSTM is not permutation invariant.

Tips: parallel message passing: compute messages for all nodes/edges and compute updates for all nodes in parallel. Use dense operators on GPUs.

privacy and robust to attack in gnn.

we can also use transformers in encoding graph structure as an attention mask.

Graph Convolutional Networks

laplace operator is the eigenfuction, why?

For undirected graph, Graph Laplacian: L=DAL = D - A, where DD is the degree matrix, AA is the adjacency matrix.
Laplacian matrix can compute the difference between the node and its neighbors.
It is symmetric, diagonally dominant, positive semi-definite(eigenvalues are nonnegative), and the number of zero eigenvalues is the number of connected components.

  • Translation group;
  • Roto-translation group:
    SO(n): QRn×n,QTQ=QQT=I,det(Q)=1.Q \in \mathbb{R}^{n \times n}, Q^TQ = Q Q^T = I, det(Q) = 1.

g=(X,Rθ),g=(X,Rθ)g = (X, R_\theta), g' = (X', R_{\theta'})

gg=gg(x0)=g(Rθx0+X)=RθRθx0+RθX+X=(RθX+X,Rθ+θ)g \cdot g' = g \cdot g' (x_0) = g(R_{\theta'}x_0 + X') = R_{\theta}R_{\theta'}x_0 + R_{\theta}X' + X = (R_{\theta'}X' + X, R_{\theta + \theta'})

  • Scale-translation group.
  • Affine group.

cross-correlations: fg(x)=f(xx)g(x)dxf \star g (x) = \int f(x'-x)g(x')dx', in mathmetics, the order of convolution is inverse to the order in DL.

Autoregressive Models

Autoregressive model:

P(x1,,xn)=i=1nP(xix1,,xi1)=i=1nP(xix<i)\begin{aligned} P(x_1, \cdots, x_n) = \prod_{i=1}^n P(x_i|x_1, \cdots, x_{i-1}) = \prod_{i=1}^n P(x_i|x_{<i}) \end{aligned}

For images, each xix_i is a pixel value, e.g., {0,…,255}. n = height * width. Each term P(xix<i)P(x_i|x_{<i}) can be modeled by a single CNN/RNN/…

Why do we consider the same model for each term? Otherwise, the number of model is O(n).

PixelCNNs: conditioned on the pixels above and to the left of the pixel being predicted. At each step, we will mask the pixels below and right of the pixel being predicted. Then we use convolution on the image, but it will yield high computation cost.

PixelRNNs: vectorize the image as a sequence of pixels, and then use RNN to model the sequence.

Masked Filter: we mask the filter to make sure the convolution is autoregressive. But it will yield blind spots.

blind spots: the model cannot see the pixels below and right of the pixel being predicted.

resovle blind spots: use a stack of masked convolutions.

The cons of softmax: if the number of dimension is very large, then the softmax will be very small. Thus we use the discretized mixture logistic distribution:

P(x)=k=1Kπkσ(xμksk)\begin{aligned} P(x) = \sum_{k=1}^K \pi_k \sigma(\frac{x - \mu_k}{s_k}) \end{aligned}

where σ\sigma is the sigmoid function, πk\pi_k is the weight of the kk-th component, μk\mu_k is the mean of the kk-th component, sks_k is the scale of the kk-th component.

Due to the sequential nature of autoregressive sampling, it is slow.

  • will autoregressive model focus more on the nearby locations? yes, because we use masked convolutions.

  • For directed graphs, we can generate the lower triangle of the adjacency matrix, and then generate the upper triangle.

Generative Adversarial Networks (GANs)

Generative models: generate data from noise.

Min-max loss:

minθmaxϕExpdata(x)[logDϕ(x)]+Ezpz(z)[log(1Dϕ(Gθ(z)))]\begin{aligned} \min_{\theta} \max_{\phi}\mathbb{E}_{x \sim p_{data}(x)}[\log D_{\phi}(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D_{\phi}(G_{\theta}(z)))] \end{aligned}

where DD is the discriminator, GG is the generator, xx is the real data, zz is the noise, pdatap_{data} is the distribution of the real data, pzp_z is the distribution of the noise.

The fake images are generated from noise (normal distribution), we can not get the distribution of fake images. Why? Because the mapping from noise to fake images is not invertible. Thus, we cannot get the likelihood of the fake images.

GAN is also called likihood-free model. We cannot get the likelihood of the fake images.

For GANs, we can get images through one forward pass, but for autoregressive model, we need to generate images pixel in a sequential manner.

The output of the discriminator is a scalar, the probability of the input image being real. The output of the generator is an image.

Fix generator, the optimal discriminator is:

Dϕ(x)=pdata(x)pdata(x)+pGθ(x)\begin{aligned} D_{\phi}^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{G_{\theta}}(x)} \end{aligned}

where pGθ(x)p_{G_{\theta}}(x) is the distribution of the fake images.
When pGθ(x)=pdata(x)p_{G_{\theta}}(x) = p_{data}(x), then Dϕ(x)=0.5D_{\phi}^*(x) = 0.5. However, we don’t know the distribution of the fake images and the distribution of the real images.

inner-loop is the discriminator, outer-loop is the generator. why?
focus more on the generator, because the discriminator is easy to train ??

[paper] Graph convolutional autoencoders with co-learning of graph structure and node attributes

In this paper, they design the special graph encoder and decoder for the tasks undertaken by the graph autoencoders. The task of the encoder is to embed the nodes into a new space, and then the latent representation of each node is close to its neighbors[the encoder is a low-pass filter]. The decoder restores the original space from the embedded space by making the latent representation of each node away from its neighbors[the decoder is a high-pass filter].

In this paper, they encode both the graph structure and the node attributes in the latent space with an improved GCN, which is a completely low-pass graph filter. Then, to reconstruct the node attributes X , they design a new high-pass graph decoder. At the same time, we use the inner product layer to reconstruct the graph structure information. Last, the graph encoder and two sub-decoders are jointly optimized in a unified framework in such a way that each can be beneficial to the other and finally lead to a better graph embedding.

Normalized adjacency matrix and Laplacian matrices

1, The normalized adjacency matrix is defined as:

A^=D1/2AD1/2,\hat{A} = D^{-1/2}A D^{-1/2},

where AA is the adjacency matrix of graph GG. D=diag(d)D = diag(d), d(i)d(i) is the degree of node ii.

2, The normalized Laplacian matrix is defined as:

Ls=IA^=ID1/2AD1/2.L_s = I - \hat{A} = I - D^{-1/2}A D^{-1/2}.

Note that Ls=IA^=D1/2(DA)D1/2=D1/2LD1/2L_s = I - \hat{A} = D^{-1/2}(D - A) D^{-1/2} = D^{-1/2}L D^{-1/2}, where L=DAL = D - A is the unnormalized Laplacian matrix of graph GG.

For the largest eigenvalue λs\lambda^s of AA and the maximum degree Δ\Delta of a node in a graph, we have $ d_{avg} \leq \lambda^s \leq \Delta$.
Normalizing the adjacency matrix can make its largest eigenvalue 1.

3, Let α1α2...αn\alpha_1 \geq \alpha_2 \geq ... \geq \alpha_n be the eigenvalues of A^\hat{A}, λ1sλ2s...λns\lambda^s_1 \leq \lambda^s_2 \leq ... \leq \lambda^s_n be the eigenvalues of LsL_s, then

1=α1...αn1,0=λ1s...λns2.1 = \alpha_1 \geq ... \geq \alpha_n \geq -1, \quad 0=\lambda^s_1 \leq ... \leq \lambda^s_n \leq 2.

Graph convolutional networks

GCN generalizes the convolutional neural networks on non-Euclidean domains. It uses the first-order approximation of Chebyshev polynomials:

gθxθ(IN+D1/2AD1/2)X.g_{\theta} \star x \approx \theta (I_N + D^{-1/2}AD^{-1/2})X.

The spectral radius of (IN+D1/2AD1/2)(I_N + D^{-1/2}AD^{-1/2}) is 2, and repeated application of this operator will cause numerical instabilities. To solve this problem, GCN uses a renormalization trick by adding a self-loop to each node, which is equivalent to adding the identity matrix INI_N to the adjacency matrix AA: A~=A+I\tilde{A} = A + I, the associated degree matrix D~=D+I\tilde{D} = D + I. The new symmetrically normalized matrix is A~GCN=D~1/2A~D~1/2\tilde{A}_{GCN} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}. The one-layer GCN is

Z(m+1)=σ(A~GCNZ(m)W(m)),Z^{(m+1)} = \sigma(\tilde{A}_{GCN}Z^{(m)}W^{(m)}),

where Z(m)Z^{(m)} is the latent representation matrix learned by the mm-th layer, Z(0)=XZ^{(0)} = X.

Graph signal processing

In graph signal processing , the eigenvalues and eigenvectors of the graph Laplacian correspond to the frequencies and Fourier basis.

The graph laplacian is defined as L=DAL = D-A. By eigen-decomposition, L=UΛU1L = U \Lambda U^{-1}, where Λ=diag(λ1,...,λn)\Lambda = diag(\lambda_1, ..., \lambda_n), U=(u1,u2,...,un)U = (u_1, u_2, ..., u_n). The eigenvalues λi,i[n]\lambda_i, i \in [n] can be considered to be frequencies, and the associated eigenvectors ui,i[n]u_i, i \in [n] can be considered to be a Fourier basis.

A graph signal ff can be decomposed into a linear combination of basis signals uiu_i:

f=Uc=i=1nciui,f = Uc = \sum_{i=1}^n c_i u_i,

where c=(c1,...,cn)Tc = (c_1, ..., c_n)^T, cic_i is the coefficient of uiu_i, the magnitude of cic_i represents the importance of uiu_i in ff.

The smoothness of the basis signal uiu_i is measured by the corresponding eigenvalue λi\lambda_i. The smaller the eigenvalue λi\lambda_i, the smoother the basis signal uiu_i.

ej,kEaj,k[ui(j)ui(k)]2=uiTLui=λiuiTui=λi.\sum_{e_{j,k} \in E} a_{j,k}[u_i(j) - u_i(k)]^2 = u_i^T L u_i = \lambda_i u_i^T u_i = \lambda_i.

The basic idea of graph filtering is to design a proper graph filter to produce the required signals for the downstream tasks. A graph filter is a function that takes a graph signal as input and outputs a new signal. A linear graph filter can be represented as a matrix GRN×NG \in \mathbb{R}^{N \times N}, which is defined as

G=Up(Λ)U1,G = U p(\Lambda) U^{-1},

where p(Λ)=diag(p(λ1),...,p(λn))p(\Lambda) = diag(p(\lambda_1), ..., p(\lambda_n)). p()p(\cdot) is the frequency response function.

The output signal can be written as

y=Gf=Up(Λ)U1Uc=Up(Λ)c=i=1np(λi)ciui.y = Gf = U p(\Lambda) U^{-1} Uc = U p(\Lambda) c = \sum_{i=1}^n p(\lambda_i) c_i u_i.

Definition 1 (completely low-pass graph filter). A completely low-pass graph filter is a graph filter whose frequency response function p():RR+p(\cdot): \mathbb{R} \to \mathbb{R}^{+} is a decreasing function with λ\lambda.

  • According to definition 1, the completely low-pass graph filter obtains a smooth graph output signal yy that consists of mostly low-frequency basis signals, and as a result, the latent representation of each node is close to its neighbors.

Definition 2 (completely high-pass graph filter). A completely high-pass graph filter is a graph filter whose frequency response function p():RR+p(\cdot): \mathbb{R} \to \mathbb{R}^{+} is an increasing function with λ\lambda.

According to definition 2, the completely high-pass graph filter obtains an unsmooth graph output signal yy that consists of mostly high-frequency basis signals, which makes the latent representation of each node far away from its neighbors.

For GCN, the graph filter of GCN is

A~GCN=D~1/2A~D~1/2=ILs=U(IΛs)U1.\tilde{A}_{GCN} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = I - L_s = U (I - \Lambda^s) U^{-1}.

The frequency response function of GCN is p(λis)=1λisp(\lambda^s_i) = 1 - \lambda_i^s. Since the range of λis\lambda_i^s is [0,2][0, 2], the frequency response function of GCN is a decreasing function with λis\lambda_i^s. GCN is completely low-pass graph filter when λis[0,1]\lambda_i^s \in [0, 1], but not in [1,2][1, 2]. When λis[1,2]\lambda_i^s \in [1, 2], p(λis)p(\lambda^s_i) will take a negative value that will introduce noise and undermine the performance. Thus, GCN is not a completely low-pass graph filter.

The difference between Adam and AdamW

https://towardsdatascience.com/why-adamw-matters-736223f31b5d

Why regularization can reduce overfitting?

http://neuralnetworksanddeeplearning.com/chap3.html#regularization

Cosine decay schedule with warm up period

Loshchilov and Hutter, SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017. https://arxiv.org/abs/1608.03983

https://scorrea92.medium.com/cosine-learning-rate-decay-e8b50aa455b

Reference

EECE 571F (2023 Winter Term 1): Deep Learning with Structures

http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf

https://jalammar.github.io/illustrated-transformer/

UvA - An Introduction to Group Equivariant Deep Learning

Jie Wang, Jiye Liang, Kaixuan Yao, Jianqing Liang, Dianhui Wang,Graph convolutional autoencoders with co-learning of graph structure and node attributes,Pattern Recognition,Volume 121,2022,108215,ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2021.108215.

https://people.orie.cornell.edu/dpw/orie6334/Fall2016/lecture7.pdf

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2021-2026 Xue Yu
  • Visitors: | Views: