Jekyll2021-02-23T23:49:20-05:00https://pyjhzwh.github.io/feed.xmlYunjie Pan’s Personal Websitepersonal websiteYunjie Panpanyj@umich.edu[2021-02-20] Analyzing and Mitigating Data Stalls in DNN Training2021-02-20T00:00:00-05:002021-02-20T00:00:00-05:00https://pyjhzwh.github.io/posts/reading-paper<p>Most DNN accelerator papers I read focus on DNN inference rather than training. From this paper, I learned that the bottlenecks for DNN training are the I/O for fetching data from storage and the CPU-side preprocessing.</p>
<h1 id="background">Background</h1>
<p><img src="../../images/dnn_training_data_pipeline.png" alt="Data Pipeline in DNN training" /></p>
<p>The figure above shows the data pipeline in DNN training.
(1) A minibatch of data items is fetched from storage.
(2) The data items are pre-processed; e.g., for image classification, data items are decompressed, then randomly cropped, resized, and flipped.
(3) The minibatch is then processed at the GPU to obtain the model’s prediction
(4) A loss function is used to determine how much the prediction deviates from the right answer
(5) Model weights are updated using computed gradients</p>
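The five steps above can be sketched as a toy training loop (everything model-related is a stub of my own; in a real pipeline, steps (1)-(2) run on the CPU while steps (3)-(5) run on the GPU, which is exactly where the stalls studied here arise):

```python
import random

def fetch_minibatch(dataset, batch_size):
    # step (1): fetch a random minibatch of items from "storage"
    return random.sample(dataset, batch_size)

def preprocess(item):
    # step (2): stand-in for decode + random crop/resize/flip
    return item * 2

def train_step(batch):
    # steps (3)-(5): forward pass, loss, and weight update, all stubbed
    prediction = sum(batch)
    loss = abs(prediction - len(batch))
    return loss

dataset = list(range(1000))
batch = [preprocess(x) for x in fetch_minibatch(dataset, 32)]
loss = train_step(batch)
```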
<h1 id="analyzing-data-stalls">Analyzing data stalls</h1>
<h2 id="technique">Technique</h2>
<p>Existing profiling support in frameworks like PyTorch and TensorFlow is inaccurate and insufficient for analyzing data stalls:
1) They cannot accurately split the time spent between data fetch (from disk or cache) and pre-processing operations.
2) Frameworks like PyTorch and libraries like DALI use several concurrent processes (or threads) to fetch and pre-process data; but GPU processes wait to synchronize weight updates at batch boundaries, so a data stall on one GPU may inflate the apparent GPU compute time of the other GPUs.</p>
<p>This paper develops a tool, DS-Analyzer, to overcome these limitations using a differential approach:
1) Measure the ingestion rate with no fetch or prep stalls.
2) Measure prep stalls by training on a subset of the given dataset that is entirely cached in memory.
3) Measure fetch stalls by clearing all caches and comparing against the time measured in 2).</p>
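The differential arithmetic behind the three measurements can be sketched in a few lines (the function name and the example timings are made up for illustration):

```python
def analyze_stalls(t_synthetic, t_cached, t_cold):
    """Differential stall analysis in the spirit of DS-Analyzer.

    t_synthetic: epoch time with data served instantly (no fetch/prep stalls)
    t_cached:    epoch time with the dataset (subset) fully cached in memory
    t_cold:      epoch time with all caches cleared
    """
    prep_stall = t_cached - t_synthetic    # extra time due to pre-processing
    fetch_stall = t_cold - t_cached        # extra time due to storage I/O
    return prep_stall, fetch_stall

# hypothetical epoch times in seconds
prep, fetch = analyze_stalls(t_synthetic=100.0, t_cached=130.0, t_cold=175.0)
# prep stall = 30.0 s, fetch stall = 45.0 s
```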
<h2 id="results">Results</h2>
<ul>
<li>Pay a one-time download cost for the dataset, and reap the benefits of local-SSD accesses thereafter: the cost of downloading the entire dataset during the first epoch is amortized over the remaining epochs.</li>
</ul>
<p>When datasets cannot be fully cached:</p>
<ul>
<li>Fetch stalls are common if the dataset is not fully cached in memory, which is obvious.</li>
<li>The OS page cache is inefficient for DNN training because it leads to thrashing.</li>
<li>Lack of coordination among caches leads to redundant I/O in distributed training.</li>
</ul>
<p>When datasets could fit in memory:</p>
<ul>
<li>DNNs need 3–24 CPU cores per GPU for pre-processing.</li>
<li>DALI is able to reduce, but not eliminate prep stalls.</li>
<li>As compute gets faster (either due to large batch sizes, or the GPU getting faster), data stalls squander the benefits due to fast compute.</li>
<li>Redundant pre-processing across concurrent jobs in hyperparameter (HP) search results in high prep stalls</li>
</ul>
<h1 id="mitigate-data-stalls">Mitigate data stalls</h1>
<ul>
<li>MinIO cache (single-server training)
<ul>
<li>DNN data access pattern: repetitive across epochs and random within an epoch.</li>
<li>items, once cached, are never replaced in the DNN cache</li>
</ul>
</li>
<li>Partitioned MinIO cache (distributed-server training)
<ul>
<li>On a cache miss, transferring a data item over a commodity TCP stack from another server’s cache is much faster than fetching it from local storage.</li>
<li>Whenever a local cache miss happens in the subsequent epoch at any server, the item is first looked up in the metadata; if present, it is fetched from the respective server over TCP, else from its local storage.</li>
</ul>
</li>
<li>Coordinated Prep (single-server training)
<ul>
<li>coordinate the concurrent jobs so that the dataset is pre-processed exactly once per epoch and shared among them, instead of each job preparing it redundantly</li>
</ul>
</li>
</ul>[2021-01-26] GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms2021-01-26T00:00:00-05:002021-01-26T00:00:00-05:00https://pyjhzwh.github.io/posts/reading-paper<p>This paper presents a CPU-FPGA heterogeneous platform for GCN training: the CPU handles the communication-intensive operations, while the computation-intensive parts are offloaded to the FPGA.</p>
<h1 id="background-and-motivation">Background and Motivation</h1>
<p>It is challenging to accelerate Graph Convolutional Networks because of:
(1) substantial and irregular data communication to propagate information within the graph;
(2) intensive computation to propagate information along the neural network layers;
(3) degree imbalance of graph nodes, which can significantly degrade the performance of feature propagation.</p>
<p>GCN acceleration compared with existing graph analytics problems:
(1) traditional graph analytics often propagates scalars along the graph edges, while GCNs propagate long feature vectors;
(2) traditional graph analytics often propagates information within the full graph, while GCNs propagate within minibatches.</p>
<h1 id="optimizations">Optimizations</h1>
<p>Training Algorithm Selection:</p>
<ul>
<li>minibatch training by sampling the training graph: the algorithm samples subgraphs rather than sampling per GCN layer</li>
</ul>
<p>Redundancy Reduction:</p>
<ul>
<li>perform pre-processing to compute the partial sum</li>
<li>common pairs of neighbors, with pair size fixed at 2 (could it be larger, or dynamic according to the graph topology?)</li>
</ul>
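A toy sketch of the pair-based redundancy reduction with scalar features (the function names and data are hypothetical; the real design operates on feature vectors and chooses which pairs to precompute so as to maximize reuse):

```python
from itertools import combinations
from collections import Counter

def find_common_pairs(adj):
    # count how often each unordered pair of neighbors co-occurs across
    # the aggregation lists of all nodes (pair size fixed at 2)
    counts = Counter()
    for neighbors in adj.values():
        for pair in combinations(sorted(neighbors), 2):
            counts[pair] += 1
    return counts

def aggregate_with_reuse(adj, feats):
    # sum-aggregate neighbor features, computing each common pair's
    # partial sum once and reusing it wherever the pair appears
    counts = find_common_pairs(adj)
    shared = {p for p, c in counts.items() if c > 1}
    pair_sum = {p: feats[p[0]] + feats[p[1]] for p in shared}
    out = {}
    for v, neighbors in adj.items():
        total, remaining = 0, set(neighbors)
        for p in shared:
            if p[0] in remaining and p[1] in remaining:
                total += pair_sum[p]          # reuse precomputed partial sum
                remaining -= {p[0], p[1]}
        out[v] = total + sum(feats[u] for u in remaining)
    return out

adj = {0: [1, 2, 3], 4: [1, 2]}     # nodes 0 and 4 share neighbors {1, 2}
feats = {1: 10, 2: 20, 3: 5}
out = aggregate_with_reuse(adj, feats)   # {0: 35, 4: 30}
```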
<h1 id="architecture-design">Architecture Design</h1>
<p><img src="../../images/GraphACT_arch.png" alt="GraphACT FPGA overview" />
CPU: communication-intensive parts, including graph sampling</p>
<p>FPGA: computation-intensive parts, including the forward and backward passes</p>
<p>How to improve training throughput:</p>
<ul>
<li>reduce the overhead in external memory access: set the minibatch size so that the subgraph could fit in BRAM capacity</li>
<li>increase the utilization of the on-chip resources
<ul>
<li>feature aggregation module: a 1D accumulator array, parallelizing along the feature dimension</li>
<li>weight transformation module: 2D systolic array to compute the dense matrix product</li>
</ul>
</li>
</ul>
<h1 id="evaluation">Evaluation</h1>
<p>Compared with a CPU baseline, 12x to 15x speedup; compared with a GPU baseline, 1.1x to 1.5x faster convergence rate.
The authors claim higher accuracy compared with one previous work, but I wonder whether the accuracy would still differ if all the works used the same subgraph sampling algorithm.</p>
<h1 id="insights">Insights</h1>
<p>Although I feel this work focuses more on engineering than on novel architecture design, the challenges it identifies are insightful to me: memory access and load balance.
I wonder whether this design is scalable. It seems that the BRAM size is the bottleneck; could we do more optimization on the memory access?</p>[2021-01-18] [CS224W] Graph Neural Network2021-01-18T00:00:00-05:002021-01-18T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<h1 id="basics-of-graph-neural-network">Basics of Graph Neural Network</h1>
<p>Idea: Generate node embeddings based on local network neighborhoods
Neighborhood aggregation: Average information from neighbors and apply a neural network</p>
<p>\(h_v^0 = x_v \\
h_v^k = \sigma(W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1}), \forall k \in {1, \cdots, K}\\
z_v = h_v^K\)
$W_k$ and $B_k$ are trainable parameters</p>
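The update rule above, specialized to scalar features and scalar weights $W_k$, $B_k$, fits in a few lines of plain Python (a toy sketch of my own; real implementations use weight matrices and sparse operations):

```python
import math

def sigma(x):
    # sigmoid as the nonlinearity
    return 1.0 / (1.0 + math.exp(-x))

def gcn_layer(adj, h, W, B):
    # one round of neighborhood aggregation with scalar features:
    # h_v' = sigma(W * mean(h_u for u in N(v)) + B * h_v)
    out = {}
    for v, neighbors in adj.items():
        agg = sum(h[u] for u in neighbors) / len(neighbors)
        out[v] = sigma(W * agg + B * h[v])
    return out

adj = {0: [1, 2], 1: [0], 2: [0]}
h0 = {0: 1.0, 1: 0.5, 2: -0.5}   # h_v^0 = x_v
h1 = gcn_layer(adj, h0, W=0.5, B=0.25)
```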
<h1 id="graph-convolutional-networks-and-graphsage">Graph Convolutional Networks and GraphSAGE</h1>
\[h_v^k = \sigma([W_k \cdot AGG(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}]), k \in \{1, \cdots, K\}\]
<p>AGG variants:
mean, pool, LSTM</p>
<p>Efficient Implementation:</p>
<ul>
<li>sparse matrix operations</li>
</ul>
<h1 id="graph-attention-networks">Graph Attention Networks</h1>
<p>Specify arbitrary importances to different neighbors of each node in the graph
Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$
\(e_{vu} = a(W_kh_u^{k-1}, W_kh_v^{k-1}) \\
\alpha_{vu} = \frac{exp(e_{vu})}{\sum_{k \in N(v)} exp(e_{vk})} \\
h_v^k = \sigma(\sum_{u \in N(v)} \alpha_{vu}W_kh_u^{k-1})\)
where $e_{vu}$ indicates the importance of node u’s message to node v</p>
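The softmax normalization of the raw scores $e_{vu}$ into attention weights $\alpha_{vu}$ is a one-liner (a toy sketch with made-up scores; the attention mechanism producing the scores is left abstract, as in the notes):

```python
import math

def attention_weights(e):
    # alpha_vu = exp(e_vu) / sum_k exp(e_vk), over v's neighborhood
    z = sum(math.exp(x) for x in e.values())
    return {u: math.exp(x) / z for u, x in e.items()}

# hypothetical raw scores e_vu for three neighbors of some node v
alpha = attention_weights({1: 2.0, 2: 1.0, 3: 0.1})
```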
<p>Attention mechanism $a$</p>
<ul>
<li>e.g. use a simple single-layer neural network</li>
<li>parameters of $a$ are trained jointly</li>
</ul>[2021-01-17] [CS224W] Graph Representation Learning2021-01-17T00:00:00-05:002021-01-17T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<h1 id="network-embedding">Network embedding</h1>
<p>Task: We map each node in a network into a low-dimensional space
Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.</p>
<ol>
<li>Define an encoder (i.e., a mapping from nodes to embeddings)
\(ENC(v) = z_v\)</li>
<li>Define a node similarity function (i.e., a measure of similarity in the original network)</li>
<li>Optimize the parameters of the encoder so that:
\(similarity(u,v) \approx z_v^T z_u\)</li>
</ol>
<h1 id="random-walk-embeddings">Random-walk Embeddings</h1>
<ol>
<li>Estimate the probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$: $P_R(v|u)$</li>
<li>Optimize embeddings to encode these random walk statistics: $similarity(u,v) = cos(\theta) \propto P_R(v|u)$</li>
</ol>
<p>Unsupervised Feature Learning
Idea: learn node embeddings such that nearby nodes are close together in the network
Given a node $u$, how do we define nearby nodes? $N_R(u)$: the neighbourhood of $u$ obtained by some strategy $R$</p>
<p>Log-likelihood objective:
\(max_z \sum_{u \in V} log P(N_R(u) | z_u)\)
where $N_R(u)$ is the neighborhood of node $u$ by strategy $R$.</p>
<p>For random walk optimization:</p>
<ol>
<li>Run short fixed-length random walks starting from each node on the graph using some strategy R</li>
<li>For each node $u$ collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$</li>
<li>Optimize embeddings according to: given node $u$, predict its neighbors $N_R(u)$
\(max_z \sum_{u \in V} log P(N_R(u) | z_u)\)</li>
</ol>
<p>\(L = \sum_{u \in V} \sum_{v \in N_R(u)} -log P(v | z_u)\)
Parameterize $P(v | z_u)$ using softmax:
\(P(v | z_u) = \frac{exp(z_u^Tz_v)}{\sum_{n \in V} exp(z_u^Tz_n)}\)
Why softmax? Intuition: $\sum_i exp(x_i) \approx \max_i exp(x_i)$</p>
<p>But it is computationally expensive.</p>
<p>Solution: Negative Sampling
\(log(\frac{exp(z_u^Tz_v)}{\sum_{n \in V}exp(z_u^Tz_n)}) \\
\approx log(\sigma(z_u^Tz_v)) - \sum_{i=1}^k log(\sigma(z_u^Tz_{n_i})), n_i \sim P_V\)
where $\sigma()$ is the sigmoid function</p>
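A sketch of the resulting per-edge loss, written in the standard word2vec form where the negative term uses $\sigma(-z_u^T z_n)$, i.e. $1 - \sigma(z_u^T z_n)$ (plain Python, toy embeddings; function names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(z_u, z_v, neg_embeddings):
    # negative-sampling estimate of -log P(v | z_u):
    # -log(sigmoid(z_u . z_v)) - sum_i log(sigmoid(-z_u . z_n_i))
    loss = -math.log(sigmoid(dot(z_u, z_v)))
    for z_n in neg_embeddings:
        loss -= math.log(sigmoid(-dot(z_u, z_n)))
    return loss

# positive pair aligned, one orthogonal negative sample
loss = neg_sampling_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

Note that only $k$ negative samples are scored per positive pair, instead of normalizing over the full softmax denominator of $|V|$ terms.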
<h1 id="node2vec-biased-walks">node2vec: Biased Walks</h1>
<p>Idea: use flexible, biased random walks that can trade off between local and global views of the network
BFS: micro-view of neighbourhood
DFS: macro-view of neighbourhood
Two parameters:</p>
<ul>
<li>Return parameter p: Return back to the previous node</li>
<li>In-out parameter q: Moving outwards (DFS) vs. inwards (BFS)</li>
</ul>
<p>Algorithm:
1) Compute random walk probabilities
2) Simulate $r$ random walks of length $l$ starting from each node $u$
3) Optimize the node2vec objective using Stochastic Gradient Descent
Linear-time complexity
all 3 steps are individually parallelizable</p>
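Step 2) can be sketched as a second-order walk where the transition weight depends on $p$ and $q$ (a simplified unweighted version; the first step is taken uniformly since there is no previous node yet):

```python
import random

def biased_walk_step(graph, prev, cur, p, q):
    # node2vec second-order step: weight 1/p to return to `prev`,
    # 1 for neighbors of `prev` (BFS-like), 1/q otherwise (DFS-like)
    neighbors = graph[cur]
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)   # return to previous node
        elif x in graph[prev]:
            weights.append(1.0)       # stays close to prev
        else:
            weights.append(1.0 / q)   # moves outward
    return random.choices(neighbors, weights=weights)[0]

def node2vec_walk(graph, start, length, p, q):
    walk = [start, random.choice(graph[start])]
    while len(walk) < length:
        walk.append(biased_walk_step(graph, walk[-2], walk[-1], p, q))
    return walk

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = node2vec_walk(graph, 0, 5, p=2.0, q=0.5)
```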
<h1 id="translating-embeddings-for-modeling-multi-relational-data">Translating Embeddings for Modeling Multi-relational Data</h1>
<p>knowledge graph completion - link prediction</p>[2021-01-15] [CS224W] Spectral Clustering2021-01-15T00:00:00-05:002021-01-15T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<h1 id="graph-partitioning">Graph Partitioning</h1>
<p>Graph cut: Set of edges with one endpoint in each group
\(cut(A,B) = \sum_{i \in A, j \in B} w_{ij}\)
where $w_{ij}$ is the weighted edges between i and j</p>
<p>Graph Cut Criterion:</p>
<ul>
<li>Minimum cut
<ul>
<li>problems: only consider external cluster connections</li>
</ul>
</li>
<li>Conductance
<ul>
<li>$\phi(A,B) = \frac{cut(A,B)}{min(vol(A), vol(B))}$, where $vol(A)$ is the total weighted degree of nodes in A</li>
<li>Produces more balanced partitions</li>
<li>problem: Computing the best cut is NP-hard</li>
</ul>
</li>
</ul>
<p>Adjacency matrix ($A$)
Degree matrix ($D$)
Laplacian matrix ($L$): $L = D - A$</p>
<p>We would like to find the 2nd-smallest eigenvalue of $L$ and its eigenvector:
\(\lambda_2 = min_{x: x^Tw_1 = 0} \frac{x^TLx}{x^Tx} = min_{\sum_i x_i = 0} \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_i x_i^2}\)</p>
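The second equality relies on the identity $x^T L x = \sum_{(i,j)\in E}(x_i - x_j)^2$, which is easy to check numerically (a small sketch; the function name is mine):

```python
def laplacian_quadratic(edges, x):
    # compute x^T L x two ways: via L = D - A, and via the
    # edge-sum identity sum_{(i,j) in E} (x_i - x_j)^2
    n = len(x)
    A = [[0] * n for _ in range(n)]
    deg = [0] * n
    for i, j in edges:
        A[i][j] = A[j][i] = 1
        deg[i] += 1
        deg[j] += 1
    # x^T (D - A) x
    quad = sum(deg[i] * x[i] * x[i] for i in range(n))
    quad -= sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)
    return quad, edge_sum

# path graph 0-1-2 with a balanced test vector (sum of entries is 0)
quad, edge_sum = laplacian_quadratic([(0, 1), (1, 2)], [1.0, 0.0, -1.0])
```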
<h1 id="spectral-clustering-algorithm">Spectral Clustering Algorithm</h1>
<p>1) Pre-processing
Construct a matrix representation of the graph
2) Decomposition
Compute the eigenvalues and eigenvectors of the matrix (we only care about the 2nd-smallest eigenvalue and its eigenvector)
Map each point to a lower-dimensional representation based on one or more eigenvectors
3) Grouping
Assign points to two or more clusters, based on the new representation</p>
<h1 id="motif-based-spectral-clustering">Motif-based spectral clustering</h1>
<p>Motif cut: generalize the edge cut to motifs
$vol_M(S)$ = #(motif end-points in $S$)
\(\phi_M(S) = \frac{\#(\text{motifs cut})}{vol_M(S)}\)</p>
<p>Three steps
1) Pre-processing:
$W_{ij}^{(M)}$ = # of times edge $(i,j)$ participates in the motif $M$
2) Decomposition (standard spectral clustering):
set $L^{(M)} = D^{(M)} - W^{(M)}$, get the 2nd-smallest eigenvalue and its eigenvector
3) Grouping
Sort nodes by their values in $x$: $x_1, x_2, \ldots, x_n$. Let $S_r = \{x_1, \ldots, x_r\}$ and compute the motif conductance of each $S_r$</p>[2021-01-13] [CS224W] Community Structure in Networks2021-01-13T00:00:00-05:002021-01-13T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<h1 id="communities">Communities</h1>
<p>Triadic closure = high clustering coefficient
Edge overlap:
\(O_{ij} = \frac{|N(i) \cap N(j) \setminus \{i,j\}|}{|N(i) \cup N(j) \setminus \{i,j\}|}\)
where $N(i)$ is the set of neighbors of node $i$</p>
<p>Network communities: sets of tightly connected nodes</p>
<p>Modularity $Q$: a measure of how well a network is partitioned into communities
\(Q \propto \sum_{s \in S} [(\#\text{ edges within group } s) - \underset{\text{needs a null model}}{(\text{expected } \#\text{ edges within group } s)}]\)</p>
<p>Null Model - Configuration Model
Given real $G$ on $n$ nodes and $m$ edges, construct a rewired network $G'$
The expected number of edges between nodes $i$ and $j$ of degrees $k_i$ and $k_j$ is $k_ik_j/(2m)$</p>
<p>\(Q(G,S) = \frac{1}{2m} \sum_{s \in S}\sum_{i \in s} \sum_{j \in s} (A_{ij} - k_ik_j/(2m))\)
$A_{ij} = 1$ if there is an edge between $i$ and $j$</p>
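The modularity formula can be checked directly on a small graph (a sketch of my own; two triangles joined by one edge, with each triangle as a community, should give $Q = 5/14$):

```python
def modularity(edges, communities):
    # Q(G,S) = 1/(2m) * sum over pairs i,j in the same community of
    # (A_ij - k_i*k_j/(2m)), with A_ij = 1 when (i,j) is an edge
    m = len(edges)
    adj = set()
    deg = {}
    for i, j in edges:
        adj |= {(i, j), (j, i)}
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    Q = 0.0
    for group in communities:
        for i in group:
            for j in group:
                a_ij = 1.0 if (i, j) in adj else 0.0
                Q += a_ij - deg.get(i, 0) * deg.get(j, 0) / (2.0 * m)
    return Q / (2.0 * m)

# two triangles {0,1,2} and {3,4,5} connected by the edge (2,3)
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
Q = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
```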
<h1 id="louvain-algorithm---greedy-algorithm-for-community-detection">Louvain Algorithm - Greedy algorithm for community detection</h1>
<p>$O(n \log n)$ run time
Each pass is made of 2 phases:
Phase 1: Modularity is optimized by allowing only local changes to node-communities memberships
Phase 2: The identified communities are aggregated into super-nodes to build a new network</p>
<h1 id="bigclam---detecting-overlapping-communities">BigCLAM - Detecting Overlapping Communities</h1>
<p>Step 1)
Define a generative model for graphs that is based on node community affiliations
Community Affiliation Graph Model (AGM)
Step 2)
Given graph $G$, make the assumption that $G$ was generated by AGM
Find the best AGM that could have generated $G$: maximize the graph likelihood $P(G|F)$</p>[2021-01-12] [CS224W] Motifs and Structural Roles in Networks2021-01-12T00:00:00-05:002021-01-12T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<h1 id="subgraphs-motifs">Subgraphs, Motifs</h1>
<p>Network motifs: recurring, significant patterns of interconnections</p>
<ul>
<li>induced subgraphs - consider all edges connecting pairs of vertices in subset</li>
<li>recurrence - allow overlapping of motifs</li>
<li>significance of a motif: motifs are overrepresented in a network when compared to randomized networks</li>
</ul>
<p>How to get the randomized networks? Configuration model: generate a random graph with a given degree sequence $k_1, k_2, \ldots, k_N$ (the same degrees as the real network). Each $G^{rand}$ has the same #(nodes), #(edges), and degree distribution as $G^{real}$.</p>
<ul>
<li>Spokes: represent each node as a set of spokes (mini-nodes, one per unit of degree); randomly pair up the mini-nodes</li>
<li>Switching: select a pair of edges A->B, C->D at random; exchange the endpoints to give A->D, C->B</li>
</ul>
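The switching method can be sketched as repeated double-edge swaps (a simplified version of my own; the rejection rules for self-loops and duplicate edges, and the attempt cap, are my additions to keep the graph simple and the loop bounded):

```python
import random

def switch_edges(edges, n_switches, seed=0):
    # degree-preserving randomization: pick edges A-B and C-D at random
    # and rewire to A-D and C-B, rejecting invalid swaps
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = set(edges) | {(b, a) for a, b in edges}
    done, attempts = 0, 0
    while done < n_switches and attempts < 100 * n_switches:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or merge endpoints
        if (a, d) in present or (c, b) in present:
            continue  # would duplicate an existing edge
        present -= {(a, b), (b, a), (c, d), (d, c)}
        present |= {(a, d), (d, a), (c, b), (b, c)}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degree_sequence(edges):
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

edges = [(0, 1), (2, 3), (4, 5), (0, 2), (1, 3)]
rewired = switch_edges(edges, 2)
```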
<h1 id="graphlets">Graphlets</h1>
<p>Definition: connected non-isomorphic subgraphs
Graphlet degree vector counts #(graphlets) that a node touches at a particular orbit (takes into account the symmetries of a subgraph)
Graphlet degree vector provides a measure of a node’s local network topology</p>
<h1 id="structural-roles-in-networks">Structural Roles in Networks</h1>
<p>Role: A collection of nodes which have similar positions in a network
Structural equivalence: Nodes u and v are structurally equivalent if they have the same relationships to all other nodes</p>
<h1 id="structure-role-discovery-method---roix">Structural Role Discovery Method - RolX</h1>
<p>Recursive feature extraction turns network connectivity into structural features. Use the neighborhood features to generate new recursive features.</p>
<ul>
<li>Neighborhood features:
<ul>
<li>Local features: all measures of the node degree (in-degree, out-degree, total degree, etc.)</li>
<li>Egonet features: Egonet includes the node, its neighbors, and any edges in the induced subgraph on these nodes. #(within egonet edges), #(edges entering/leaving egonet)</li>
</ul>
</li>
<li>recursive features:
<ul>
<li>Use the set of current node features to generate additional features; Two types of aggregate functions: mean and sum</li>
</ul>
</li>
</ul>[2021-01-11] [CS224W] Properties of Networks and Random Graph Models2021-01-11T00:00:00-05:002021-01-11T00:00:00-05:00https://pyjhzwh.github.io/posts/studying<p>I would like to learn GNNs to see if there is any opportunity to optimize/accelerate them. But the GNN survey paper has too many expressions and notations that are hard to understand, so I will watch the CS224W Machine Learning with Graphs lectures to learn the material in a smoother way. I will write down some key points I learned.</p>
<h1 id="key-netowrk-properties">Key Network Properties</h1>
<p>Degree distribution P(k): Probability that a randomly chosen node has degree k</p>
<p>Path length h</p>
<ul>
<li>Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes</li>
<li>Network Diameter: The maximum (shortest path) distance between any pair of nodes in a graph</li>
<li>Average path length for a connected graph or a strongly connected directed graph: $\bar{h} = \frac{1}{2E_{max}} \sum_{i, j \neq i} h_{ij}$, where $h_{ij}$ is the distance from node $i$ to node $j$, and $E_{max}$ is the max number of edges (total number of node pairs) $= n(n-1)/2$</li>
</ul>
<p>Clustering coefficient C: how connected are i’s neighbors to each other</p>
<ul>
<li>$C_i = 2e_i/(k_i(k_i-1)) \in [0,1]$, where $e_i$ is the number of edges between the neighbors of node $i$.</li>
<li>Average clustering coefficient: $C = \frac{1}{N} \sum_i^N C_i$</li>
</ul>
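The clustering-coefficient formulas translate directly into code (a small sketch on a triangle with a tail; the function name is mine):

```python
def clustering_coefficients(adj):
    # C_i = 2*e_i / (k_i*(k_i-1)), where e_i counts edges among i's
    # neighbors; nodes of degree < 2 get C_i = 0 by convention
    C = {}
    for i, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            C[i] = 0.0
            continue
        e = sum(1 for a in neighbors for b in neighbors
                if a < b and b in adj[a])
        C[i] = 2.0 * e / (k * (k - 1))
    return C

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # triangle + tail
C = clustering_coefficients(adj)
avg = sum(C.values()) / len(C)
```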
<p>Connected components s:</p>
<ul>
<li>Size of the largest connected component: the largest set where any two vertices can be joined by a path</li>
<li>Largest component = Giant component</li>
<li>Found by BFS</li>
</ul>
<h1 id="erdös-renyi-random-graphs">Erdös-Renyi Random Graphs</h1>
<p>It is the simplest model of graphs.
$G_{np}$: an undirected graph on $n$ nodes where each edge $(u,v)$ appears i.i.d. with probability $p$</p>
<ul>
<li>Binomial degree distribution</li>
<li>very small clustering coefficient</li>
<li>path length and largest connected component is similar to the real network</li>
</ul>
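$G_{np}$ takes only a few lines to generate (a sketch; for $n = 100$, $p = 0.05$ the expected edge count is $p \cdot n(n-1)/2 = 247.5$):

```python
import random

def gnp(n, p, seed=0):
    # Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible undirected
    # edges appears independently with probability p
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

edges = gnp(100, 0.05)
```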
<h1 id="small-world-model">Small World Model</h1>
<p>Goal: high clustering but also short length
Two components to the model:</p>
<ul>
<li>Start with a low-dimensional regular lattice - Has high clustering coefficient</li>
<li>Rewire: Introduce randomness (“shortcuts”) - reduce diameter
Drawback: Does not lead to the correct degree distribution</li>
</ul>
<h1 id="kronecker-graph-model">Kronecker Graph Model</h1>
<p>Idea: recursive graph generation to capture the self-similarity of networks.
A Kronecker graph is obtained by growing a sequence of graphs, iterating the Kronecker product over the initiator matrix $K_1$:
\(K_1^{[m]} = K_m = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{m \text{ times}} = K_{m-1} \otimes K_1\)
The generated Kronecker graphs look very close to real networks.</p>[2021-01-07] Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training2021-01-07T00:00:00-05:002021-01-07T00:00:00-05:00https://pyjhzwh.github.io/posts/reading-paper<p>This is a paper about GPU memory footprint reduction using selective recomputation during the LSTM training process. Although I do not know much about LSTM networks, this paper is quite clear and easy to understand.</p>
<h1 id="backgound-and-observation">Backgound and Observation</h1>
<p>LSTM layers produce intermediate results about 9x larger than their output size, and by the nature of backpropagation, some feature maps have to be stashed in memory during the forward pass to compute the gradients.
Two observations during LSTM training on a GPU:
(i) the feature maps of the attention and RNN layers consume most of the GPU memory (feature maps are the data entries saved in the forward pass to compute the gradients during the backward pass);
(ii) the runtime is unevenly distributed across different layers (fully-connected layers dominate the runtime while other layers are relatively lightweight).
There is also an observation comparing the training throughput of ResNet-50 (a CNN model) with NMT (an LSTM model). For ResNet-50, the training throughput saturates as the batch size increases, meaning the GPU compute units are almost fully utilized. For NMT, the training throughput increases linearly with the batch size, but the increase stops when the model hits the GPU memory capacity wall. From this comparison, the authors conclude that the performance of LSTM RNN-based models is limited by GPU memory capacity.</p>
<h1 id="ideas">Ideas</h1>
<p><img src="../../images/Echo_recomputation.png" alt="recomputation" />
The dependencies shown in (a) mean that the feature map has to be stashed, creating the GPU memory bottleneck, so (b) removes the dependencies via recomputation. This recomputation idea had been proposed before, but did not perform well because previous work failed to estimate the footprint reduction and runtime overhead accurately.
For footprint reduction, a practical recomputation strategy should compare the feature maps that are released against those that are newly allocated, and consider the global impact of those allocations. But evaluating all possible recomputation plans has large runtime complexity, so the authors partition the whole computation graph into small subgraphs; compute-heavy layers form natural boundaries since they are never recomputed.
For runtime overhead estimation, the authors look into layer properties and are less conservative about the recomputation overhead. For example, fully-connected layer gradients have no dependency on the layer output, which means no recomputation is needed; ReLU activations and dropout layers can be stored in a 1-bit binary format.</p>
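The ReLU trick can be sketched in a few lines: the backward pass only needs to know which inputs were positive, so a 1-bit mask per element suffices and the full-precision feature map can be dropped (a toy scalar version; the names are mine):

```python
def relu_forward(x):
    # forward pass: keep only a 1-bit mask per element instead of
    # stashing the full-precision activations
    y = [max(v, 0.0) for v in x]
    mask = [1 if v > 0 else 0 for v in x]
    return y, mask

def relu_backward(grad_out, mask):
    # backward pass: the ReLU gradient needs only the mask
    return [g if m else 0.0 for g, m in zip(grad_out, mask)]

y, mask = relu_forward([1.5, -2.0, 0.5, -0.1])
dx = relu_backward([1.0, 1.0, 1.0, 1.0], mask)  # [1.0, 0.0, 1.0, 0.0]
```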
<h1 id="implementation--evaluation">Implementation & Evaluation</h1>
<p>The paper integrates Echo into NNVM, the computation graph compiler for MXNet, and uses a compiler-based optimization pass to automatically insert the recomputation into the program.
Echo reduces the GPU memory footprint by 1.89× on average and 3.13× at maximum with marginal runtime overhead.</p>[2020-11-14] High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs2020-11-14T00:00:00-05:002020-11-14T00:00:00-05:00https://pyjhzwh.github.io/posts/reading-paper<p>This is a paper using processing-in-memory (PIM) for NN acceleration in ASPLOS 2020. I knew little about PIM techniques and the SRAM structure; from this paper I learned some basic knowledge of SRAM and PIM, as well as a lot about the implementation tools and benchmark-measuring tools.</p>
<h2 id="sram-pim">SRAM, PIM</h2>
<ul>
<li>memory organization of SRAM: SRAM -> slices -> banks -> subbanks -> subarrays -> partitions -> cells.</li>
<li>Any PIM strategy which disturbs this tightly coupled sub-array organization will either negatively impact the data access (memory property) or increase the PIM execution latency.</li>
<li>One of the most common PIM techniques is to assert multiple wordlines where the input operands are located and establish a bitline discharge to obtain the computed output. A Boolean operation takes 1 cycle, but this is limited to bitwise operations.</li>
<li>Multiple-row activation (MRA) will cause a write-bias condition and may cause data corruption.</li>
<li>Interconnect dominates the energy and latency of SRAM.</li>
<li>PIM performance can be determined by the number of operations per PIM cycle (PIM-OPC).</li>
</ul>
<h2 id="motivation">Motivation</h2>
<p>Most existing accelerators for DNN workloads incorporate quantization and use various bit precisions for different layers. Such cases make a better use case for LUT-based approaches.
Interconnect dominates the energy and latency of SRAM, so it is desirable to avoid data movement through the interconnect and compute within a subarray; data parallelism can be enabled by concurrent processing at individual subarrays.
Integrating specialized hardware within each subarray would significantly degrade memory density and negatively impact normal memory operation. Hence BFree: a LUT-based compute scheme that is configurable for multiple operations.</p>
<h2 id="bfree-architecture">BFree Architecture</h2>
<p><img src="../../images/BFree-LUT.png" alt="BFree architecture" />
The BFree architecture incorporates compute capabilities at sub-array granularity in a conventional cache memory organization.
The configuration block (CB) stores metadata.
The BFree compute engine (BCE) uses a three-stage in-order pipeline: 1. read instructions from the CB and decode them; 2. compute the LUT addresses; 3. access the LUT entries and further accumulate/process them.
Bitlines for the LUT rows alone are decoupled to reduce access cost.
To reduce the number of LUT entries, they store only the entries for 4-bit operands in the sub-arrays. For higher-precision operands, the BCE decomposes the operands into 4-bit operands and accumulates the partial products. Further, the number of entries can be reduced to 49 by utilizing fundamental multiplication properties - only storing the products where both operands are odd numbers.
Other operations like division or activation functions can also be implemented with LUT-based approaches.</p>
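The 4-bit decomposition can be sketched in software (a toy functional model of mine, not the hardware datapath; here the LUT keeps all 16x16 products rather than the paper's reduced 49 entries):

```python
def build_lut():
    # product LUT for two 4-bit operands (16x16 entries; the paper
    # shrinks this further by exploiting multiplication properties)
    return {(a, b): a * b for a in range(16) for b in range(16)}

def lut_multiply(x, y, lut):
    # multiply two 8-bit operands by splitting each into 4-bit nibbles
    # and accumulating shifted partial products from the LUT:
    # x*y = sum over nibble pairs of lut[xi, yj] << (4*(i + j))
    xs = [x & 0xF, (x >> 4) & 0xF]
    ys = [y & 0xF, (y >> 4) & 0xF]
    acc = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            acc += lut[(xi, yj)] << (4 * (i + j))
    return acc

lut = build_lut()
product = lut_multiply(200, 123, lut)  # 24600
```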
<h2 id="evaluations">Evaluations</h2>
<p>Tools for the BFree design: SPICE simulation for the SRAM, LUT, and BCE design; Synopsys DC and ICC compilers for automatic place and route; Synopsys PrimeTime-PX for power evaluation.
Benchmarks: 1) an in-cache technique - Neural Cache; 2) Eyeriss (not sure how they compare Eyeriss energy/performance - did the authors implement the Eyeriss design?); 3) CPU, using the PyTorch profiler and Intel RAPL tools to measure power; 4) GPU, using the TensorFlow profiler and the nvidia-smi tool to measure GPU power.
This PIM solution achieves 1.72x better performance while being 3.14x more energy efficient compared to the state-of-the-art DNN in-memory accelerators when running Inception v3. The authors' analysis shows 101x/3x speedups and 91x/11x better energy efficiency than CPU and GPU, respectively.</p>