- Implement parameter sharing similar to mask stuff.
- Print weight histogram

- Maybe add parts from the source code to the report
- Implement Huffman coding, maybe just as a part need thinkin

- Do iterative training, where you prune x percent whenever test error exceed training error by a significant amount.

This is my final project for the Computer Vision class thought by Rob Fergus in Fall 2016. I got the project of implementing compressing ideas of Song Han presented in paper `Learning both Weights and Connections for Efficient Neural Networks`

. I start with a brief summary of the report and my-work to ease the reading.

- I provide a summary of the main paper.
- I provide more summaries of related papers, that I read to clarify the ideas in the main paper.
- I chose Lenet-5 architecture trained on MNIST for my experiments, since it is big enough two compress and small enought to easily do experiments on.
- I start thinking and iterating at how to implement Pruning with Torch, in summary:
- Torch doesn't has sparse Tensor implementation and implementing one requires some time. Therefore I decided to focus on the idea and simulate pruning instead of spending too much time on implementing sparse tensors/operations.
- After couple of trial I've decided to modify original
`torch.nn.Linear`

and`torch.nn.SpatialConvolutional`

modules to mask connections and enable straight-forward training. I made a Pull Request to the project #1073 - Compared the sensity of different layers and effect of retraining/tuning with magnitude based iterative prunning: 3 epoch is enough for retraining and all-layers have a similar around 10x without-loss compression potential.
- Realized that it doesn't make sense not use GPU's. Implemented
`-cuda`

support and enjoyed it. - Attemted to implement five pruner function to investigate different strategies. Completed some and TODO compare them
- Investigated effect of iterative pruning(pruning a portion of the target rate at each prune/retrain iteration) vs. one-time pruning(pruning the network in one prune/retrain iteration). TODO result
- Implemented accuracy loss based iterative circular prunning. TODO results

Song Han starts the paper with a focus on energy consumption of neural networks and motivates the need of compressing neural networks by pointing out that a smaller network would fit in memory and therefore the energy consumption would be less. He first talks about the Related Work

- He first points out several quantization and low-rank approximation methods and says that those methods are still valid and can be applied after network pruning.[11-12-13-14]
- Then he mentions two methods utilizing global average pooling instead of the fully connected layer. I think this idea follows the fact that most parameters in popular models are due to the fully connected layers. [15-16]
- Then he mentions the early pruning ideas e.g. biased weight [17] decay and some others [18-19]
- Lastly the Hashed Nets are mentioned and an anticipation about pruning may help it presented[20]

To prune the network importance of each weight is learned with a training method that is not mentioned. After this learning step, connections whose importance weights below a certion threshold are removed. Then the network is retrained and this part is crucial.

**Regularization**: L1 and L2 regularizations are used during training and it is observed that even though L1(forces sparse weights) gives better after pruning accuracy, the accuracy after retraining observed to be better with L2**Dropout Factor**: Dropout should be adjusted during retraining due to the pruned connections with $D_{new}=D_{old}\sqrt{\frac{C_{new}}{C_{old}}}$**Local Pruning and Parameter Co-adaptation**: No reinitilization is made after pruning because it is obviously stupid to do. You can just train with smaller size if that was possible. ConvNet part and Fully Connected Layer's are pruned separetely. It is mentioned that computation is less then original since partial retraining is made.*I am not sure why this is not made layer by layer if vanishing gradients are the problem.***Iterative Prunning**: Prunning made through iterating and doing*greedy search*at each iteration. I am not sure which algorithm*greedy search*supposed to point here. It is a vague term. Doing one pass agressive prunning led bad accuracy. My guesses about about iterative*greedy search*are:- Choose threshold iteratively.
- ?

**Pruning Neurons**Neurons are removed if there are either no incoming or no outgoing connections before retraining.

Caffee is used and a mask is implemented over the weights such that it disregards the masked outparameters.

**Lenet5**: After pruning retrained with LR/10 and a nice visualization is provided to show the outside weight are pruned(attention) [ ] Check: weights and importance weights give simalar plot like this.**Alex-Net**: 73h to train on NVDIA Titan X GPU. 173h to retrain with LR/100.**VGG**: Most of the reduction is at fully connected layer.

There is still one point that is not clear to me how the pruning is made with L1 or L2. I need to think about this. But basically in this section it is shown that iterative prunning with L2-regularization gave best results. One need to prune different regions separetely. Because FC layers are more prunable.

- There is a free launch, which is prune %50 without retraining, same accuracy.
- Layers are pruned layer by layer. Sensitivity increases with deepness of the layer. It is not mentioned but the reason might be that the initial results effect more results and it may propogate and increase!
- Each layer's sensitivity is used as threshold to prune each layer.
- Pruned layers are stored as a sparse matrix (a overhead of 15%, probably binary mask).
- Weight distribution of before/after pruning is given. Weights around 0 is disapeared.

Yann Le Cun's pruning paper emphasizing the importance of pruning as a regularizer and performance-optimizer. The idea of deleting parameters with small `saliency`

is proposed. Magnitude of weights proposed as simple measure of saliency in the earlier literature and its similarity to the weight decay mentioned. This paper proposes a better more accurate measure of saliency.

- Saliency = change in the error function by deleting the parameter: HARD to COMPUTE
- Taylor Series Approximation: $$\delta E=\sum_{i}g_i\delta u_i + \frac{1}{2}\sum_{i}h_{ii}\delta u_i^2 + \frac{1}{2}\sum_{i\ne j}h_{ij}\delta u_i \delta u_j + O(||\delta U ||^3)$$ The first term is neglected, since it is assumed that the model is converged and gradients are near zero. The third term is neglected by assuming $\delta E$'s of each parameter is independent of others and therefore the overall $\delta E$ is equal to the sum of individual $\delta E$'s. The last term is also neglected since the cost function is nearly quadratic. Then we left with: $$\delta E=\frac{1}{2}\sum_{i}h_{ii}\delta u_i^2 $$
- Diagonal approximation of hessian matrix is calculated by (where $V_k$ is set of connections sharing the same weight): $$ h_{kk}=\sum_{(i,j) \in V_k}\frac{\delta^2 E}{\delta w_{ij}}= \sum_{(i,j) \in V_k}\frac{\delta^2 E}{\delta a_{i}^2}x_j^2= \sum_{(i,j) \in V_k} 2 x_j^2 (f'(a_i)^2-(d_i-x_i)f''(a_i))$$
- The recipe of pruning connections according to the second order approximation given in the paper, where at each iteration some portion of the lowest saliency connections are pruned.
- A clear improvement over magnitude based prunning reported.
- A really interesting result is reported, too. MSE is increased aroun 50% percent after iterative pruning, however the test accuracy decreased, which is a clear indication showing the importance of loss function, i.e. MSE is not the right metric.

This paper focuses on some transfer learning tasks where the models trained on Image-net transfered to solve smaller classification problems. One significant difference is instead of pruning weights whole neuron is pruned.

- The method is same:
- transfer&converge
- iterate with prune+retrain
- Stop when the goal is reached

- They used Spearman's rank correlation to compare different saliency metrics with the optimal 'oracle' method(prune and measure $\delta E$), even though they pick only one neuron at each time.
- First order Taylor approximation gives best results compare to 1)sum of weights(of a neuron) 2)mean 3)std 4)mutual info, which is trivial and expected. Why should mean of weigths of a neuron be a metric for pruning...
- The first order approximation is basically the first term in the functions above $$\delta E=\sum_{i}g_i\delta u_i$$

It always feels good to read old papers. I visited this paper to learn more about Weight Decay and its connection to bias function(regularizer). They reported sparser connections are achieved as a result of applying exponential bias.

This paper proposes around 2x speedup at convolutional layers by deriving low-rank approximations of the filters and 5-10x parameter reduction at fully connected layers. The motivation of the paper based on the findings of *Denil et al.* regarding the redundancies in network parameterts.

- First filters of first two convolutional layer is clustered into equally sized groups.
- For each group a low-rank approximation calculated by using SVD

The main paper published at ICLR 2016 combining pruning idea with other methods like quantizing and huffman coding.

- The pruning part explained rather short and they emphasize that they laern the connections, with a regularizer.
- Quantization and weight sharing done first doing k-means clustering for each layer.
- They compared three intilization method for clustering and linear initiliaziation works best, since it keeps the high value weights.
- Shared weights tuned with backprop.
- It is shown that quatization works better with pruning.
- After all tuning is done, huffman encoding is made to comress the data offline.
- Up to 40x compression rate reached as a result.

I started with first training my first network `LeNet-5`

on HPC and got a test error of 0.96% in 30 Epochs with deafult training parameters. It occupies around 5Mb and has 313k parameters. My goal is to get 10x compression in size following the three methods outlined in the paper. The parameter breakdown is below:

I wanted to implement every part in Torch. After diving in I realized this might be a hard task. The reason is basically there is no Sparse-Tensor implementation and no space gain is made through making the weigth matrices(connections) sparse. After struggling a bit, I decided to aim an encoding and decoding method. Because implementing Sparse Tensor's and all the required operations is another project by itself I believe. Layers like SpatialConvolution and Linear is implemented for optimization and source code is not that easy to understand and modify. Therefore I decided to use full weight matrices throughout my experiments and represent connectivity by having non-zero weights.

First I've started with `Pruner`

module. After couple of iterations I've decided to intialize Pruner module with `setVariables`

call, which includes a model a pruner function(mask generating), a trainer,a tester and relevant torchnet engine. With these parameters I gave full power to the Pruner module to re-train and test model. After initialization one is ready to prune the network. `Pruner:prune`

call gets a mask-generating function(there was two implementations `Pruner.maskThreshold`

and `Pruner.maskPercentage`

at the beginning), a table of layer-id's and a table of mask-generating function parameters (either threshold or percentages in this case). Basically prune uses the layer-id's to get the weight tensor of target layer. Since this is a development code there are no type-checks and the provided id should be a valid one(a layer with `.weight`

field like nn.SpatialConvolution, nn.Linear). Then a mask is generated by calling the provided function with provided parameters and selected weight Tensor. The result is a binary mask with the same size as the weight-Tensor. The mask is saved in each layer and resulting model is tested. `prune`

repeats this proccess for each layer-id and returns the percantage of retained connections for each layer-id and the test-accurcy.

After pruning one can call `Pruner:reTrain`

function with nEpochs to retrain the network. Test-accury after testing is returned.

I've played with this and got some initial results by just masking according to the absolute value of the weights and got similar, sometimes better results with around 50% pruning of each layer without retraining. The individual sensitivity of each layer is below.

conv1-fcc1-fcc3 | conv2-fcc2 |
---|---|

Then I've implemented retraining. To do that after going through several possibilities like definining new `nn`

modules or `torchnet.engine`

hooks, I decided to alter the `nn.Linear`

and `nn.SpatialConvolution`

code and implement the pruning via masking cleanly. This choice helped me to prune the layers just by adding the binary mask and retrain them properly.

I didn't need to implement CUDA for other homeworks, but this time I wanted to learn how to do it and see the difference. I've realized that it is pretty straight forward: a generic function `isCuda(inp)`

which calls `inp:cuda()`

if cuda flag is provided does the necessary work. I've got a 2x speedup on Lenet-5 model compare to its multithreaded version.

There are 2 main methods proposed in the literature as pruning metric.

- Taylor series based approximations:
- Using 1st order approximation
- Using 2nd order diagonal approximation
- Combining these two

- Regularization based learning methods
- L1 based
- L2 based

An important point here to made is, all of the methods above try to approximate or optimize the L