Game of Numbers: RigL at ICLR 2020


Six months ago our paper was rejected at ICLR 2020, like many other papers. At the time I thought about writing about our experience on Twitter, but stopped myself since I didn’t want to be the person who talks about the bad when it happens to them, as if they were the only one who matters. Two months later (March 2020), it still felt important to share this experience and I wrote this post, but decided not to put it online since we were afraid it might affect the reviews of the ICML re-submission. I can see that the tone of the writing reflects the disappointment I felt at the time. I hope it doesn’t increase your stress levels.

We will present our work at the ICML 2020 virtual conference tomorrow: my first first-author conference paper. I am happy… but I was not three months ago, when the ICLR results came out. I would like to share why: RigL @ ICLR.

Our paper is called Rigging the Lottery: Making All Tickets Winners, or RigL for short. Erich came up with the title, and sometimes when you have the title, it is clear what you need to do. We took on the problem of training sparse neural networks, which was observed by many and highlighted by Jonathan in the Lottery Ticket Hypothesis. Building on top of recent work on dynamic sparse training (SET, SNFS, DSR), we showed that it is possible to train sparse neural networks from scratch, without the dense parameterization, and match or exceed the accuracy of dense-to-sparse methods like pruning. Great experimental results, a relatively simple algorithm, and fully open-sourced code. We got 3, 6, 6: borderline.
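For readers unfamiliar with dynamic sparse training, here is a rough, illustrative sketch (in NumPy, with made-up function and parameter names, not the paper’s actual implementation) of the kind of drop/grow update these methods perform: periodically remove the lowest-magnitude active weights and activate new connections where the dense gradient is largest, so the number of active parameters stays constant throughout training.

```python
import numpy as np

def rigl_style_update(weights, mask, dense_grad, update_fraction=0.3):
    """Illustrative drop/grow step on a single layer (not the paper's exact code).

    weights         : np.ndarray of parameter values
    mask            : binary np.ndarray, 1 where a connection is active
    dense_grad      : gradient w.r.t. all (including masked) weights
    update_fraction : fraction of active connections to replace
    """
    n_active = int(mask.sum())
    k = int(update_fraction * n_active)
    if k == 0:
        return weights, mask

    # Drop: among active weights, deactivate the k with smallest magnitude.
    active_magnitude = np.where(mask == 1, np.abs(weights), np.inf)
    drop_idx = np.argsort(active_magnitude, axis=None)[:k]
    new_mask = mask.copy().ravel()
    new_mask[drop_idx] = 0

    # Grow: among inactive positions, activate the k with largest gradient magnitude.
    grow_score = np.where(new_mask.reshape(mask.shape) == 0,
                          np.abs(dense_grad), -np.inf)
    grow_idx = np.argsort(grow_score, axis=None)[-k:]
    new_mask[grow_idx] = 1

    # Newly grown connections start at zero; dropped connections are zeroed out.
    new_weights = weights.ravel() * mask.ravel() * new_mask
    return new_weights.reshape(weights.shape), new_mask.reshape(mask.shape)
```

The full method has more moving parts (the update fraction is decayed over training and the mask update is applied only every few hundred steps, layer by layer), but the sketch conveys the key point: the network stays sparse, and the same size, from start to finish.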

Yes, reviews are a bit random, but I would like to say a bit more than that. It is a great thing that all reviews are open, and I encourage you to read them if you have some spare time. In short, we addressed the concerns of reviewers 2 and 3 after the initial review (both had given a score of 6, weak accept); however, they never responded to our rebuttal and kept their scores as they were (unfortunate, since you need reviewers who want your paper in). We had a long discussion with reviewer 1, which resulted in brand new experiments with 2 new datasets. Later it became clear that the main reason for the low score was the following:

The idea of starting from a small, sparse network and expanding it is not novel. DEN [Yoon et al. 18] proposed the same bottom-up approach with sparsely initialized networks, while they allowed to increase the number of neurons at each layer and focused more on continual learning. The authors should compare the two methods both conceptually and experimentally.

I was puzzled when I read the DEN paper, and I am still puzzled. DEN is a method for continual learning. It uses sparsity to pick a subset of the existing network when a new task is given, and the algorithm adds new neurons (i.e., the number of parameters increases). Continual learning assumes that the training data arrives at different times and that we should learn from the new data/task when it becomes available, while not forgetting what we have already learned. The paper also claims that, using continual learning, they achieve better results at classifying images than the common batch supervised setting. In other words, the computer vision field was doing it all wrong by training on all classes at once and not using their method. This is a big claim, and I couldn’t verify their results since the code provided by the authors only covered MNIST training, where we couldn’t see any improvement from DEN over regular multi-class training.

RigL is a method for training sparse networks, and the size of the network is constant throughout training. We responded to the review highlighting that our method does not grow neurons and is not a continual learning algorithm (which should be obvious), but we got the following response:

I believe that you could compare against DEN-Finetune with multilayer perceptron as the base network, using the codes provided at the git repository you linked. Timestamped inference should not be an issue with the final finetuned DEN (DEN-Finetune) at all.

Since I do not find the idea of starting with a sparse network and growing it up in a bottom-up manner as novel, as it is already done in DEN, without experimental comparison against it I do not believe that the paper has a sufficient novelty or advantage over it.

Thus I will stick to my original rating of weak reject.

Yes, again the reviewer implied that our method RigL grows networks and indicated that they would be keeping their score. Well, either we failed to convey our message or it was a lost cause from the beginning: I don’t know. What do you do when you get a message like this halfway through the rebuttal? When the fate of your paper depends on a comparison that seems (and probably is) irrelevant? In hindsight, I wish we had stopped at this point, but as you may guess, we spent the day and the night doing the comparison to which the fate of our paper was (supposedly) tied.

The code available for DEN covered only the MNIST fully-connected network, which is by all means irrelevant to contemporary computer vision and to our goal of training modern networks (ResNet, MobileNet) on large datasets (ImageNet-2012). I ran the experiments and crafted a response with my collaborators. Our results showed that:

(A) DEN does not achieve meaningful sparsity -- the networks obtained are only 10% sparse -- far too low to be of any practical benefit.

(B) Using DEN as a pre-training step and then fine-tuning the resulting network is not as efficient as training it from scratch. This confirms the results of Liu et al. (https://arxiv.org/abs/1810.05270).

(C) RigL requires ~100x fewer FLOPs than DEN and gets higher accuracy than DEN-Finetune.

So, the network DEN trains is only 10% sparse (only 10% of the weights are missing), does not improve performance over fine-tuning, and doesn’t come anywhere near RigL in terms of efficiency, which is somewhat expected since continual learning is a more difficult setting than regular multi-class classification. It’s possible that different results could be obtained with non-public hyper-parameter settings, in which case I would be more than happy to go back and update my statement.

We didn’t hear from reviewer 1 after our response. Such a waste of time on our end. We hoped the Area Chair (AC) would note this absurd comparison request and (possibly) the unfair rating. The decision was reject (possibly the AC’s decision got overturned at a higher level), and the message from the program chairs was:

A somewhat new approach to growing sparse networks. Experimental validation is good, focusing on ImageNet and CIFAR-10, plus experiments on language modeling. Though efficient in computation and storage size, the approach does not have a theoretical foundation. That does not agree with the intended scope of ICLR. I strongly suggest the authors submit elsewhere.

which I believe is a very unfortunate thing to say, given that the majority of work published at ICLR, or in deep learning in general, doesn’t have a theoretical foundation, and this is obviously not a requirement.

So why did we get this response and this result? I think it is all a game of numbers when the stakes are high and time is limited. 6-6-3 are not great scores, and I am not happy about that. We could have done better at motivating our method and done a better job during the rebuttal. We should be more positive and less irritated in our responses, even when a review doesn’t make sense to us. However, none of this justifies the disappointing experience we had.

Why did reviewer 1 insist on a comparison with DEN? Is it possible that the reviewer is related to that paper? I don’t know and I don’t want to speculate. If yes, this is very bad ethics. If not, this is very bad quality. Honestly, I liked the idea of the DEN paper and its algorithm. I would have cited it and included a comparison if I were writing a paper on continual learning. But one shouldn’t condition a good rating on a comparison that is demonstrably unrelated.

I am pretty sure RigL is not the only paper that has had an experience like this. Problems with bad reviewing practices have been pointed out by others, too (link1, link2). I understand that the reviewing process is inherently noisy: reviewers have limited time and different criteria. That is why you get multiple reviewers, so the results should be less noisy, right?

No, this is not true if qualified reviewers are in the minority. I keep hearing from the researchers I interact with that overall review quality has decreased over the years, possibly due to the increased number of submissions caused by the inflated value of publishing in top conferences. I think this is very important to understand. There is a problem, and the situation is getting worse. What if bad-quality reviews are like a virus that spreads to other researchers when there is no incentive to stop it?

This might not seem important to many people, and might even seem inevitable to some. I think it is a telling sign of the quality of the research we are doing. And quality is important. We are all here for a short time, and if the goal is to push the limits of human knowledge, we should be careful how we spend our time doing research. I don’t think I spent my time well during the rebuttal, and this is partly due to review quality: missing most of our team offsite to work on the rebuttal, not preparing for an important on-site interview (which I failed), and wasting my time and energy hoping (perhaps naively) that our score would increase if I could complete the comparison on time. I believe all of us have had similar experiences in our research careers, and I don’t think it needs to be this way. Here are my 2 Kuruş (Turkish for “cents”):

(1) We should incentivize good reviews, rewarding them as we reward publishing papers.

(2) We should have clear guidelines about promoting one’s own research while doing reviews, and consequences when they are not followed.

To finish this post, I would like to invite you to share your own stories; I would be more than happy to reference them. Every story is important, and sharing them is probably the first step towards a solution.

I would like to thank Erich, Pablo, Laura, Erin and Linda for their feedback.