I A review of provable neural training
In this paper we prove results about the trainability of certain depth-2 neural nets under more general settings than previously analyzed. Specialized to neural nets, learning theory seeks to solve the following function optimization/risk minimization problem,
$$\min_{f \in \mathcal{F}} \; \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}\left[\ell\left(f(\mathbf{x}), y\right)\right] \qquad (1)$$
where $\ell$ is some lower-bounded nonnegative function, members of $\mathcal{F}$ are continuous piecewise linear functions representable by some chosen neural net architecture, and we only have sample access to the distribution $\mathcal{D}$. This reduces to the empirical risk minimization question when this $\mathcal{D}$ is a uniform distribution on a finite set of points. To the best of our knowledge about the state-of-the-art in deep learning, either of these two optimization problems is typically solvable in either of the following two mutually exclusive scenarios:
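The specialization of (1) to empirical risk can be illustrated with a small numerical sketch; the squared loss and the single ReLU gate used here are illustrative choices for concreteness, not the only setting considered in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single ReLU gate x -> max(0, <w, x>) as the predictor class.
def relu_gate(w, X):
    return np.maximum(0.0, X @ w)

# Empirical risk: the population risk of (1) with D taken to be the
# uniform distribution on a finite sample {(x_i, y_i)}.
def empirical_risk(w, X, y):
    return np.mean((relu_gate(w, X) - y) ** 2)

w_star = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((100, 3))
y = relu_gate(w_star, X)                      # realizable labels

assert empirical_risk(w_star, X, y) == 0.0    # the true filter has zero risk
```

In the realizable setting the minimizer of the empirical risk coincides with the true filter, which is exactly the regime studied in Section III.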
Scenario 1: Semi-Realizable Data, i.e., the data comes as tuples $(\mathbf{x}, y)$ with $y$ being the noise-corrupted output of a net (of known architecture) when given $\mathbf{x}$ as an input. Scenario 2: Semi-Agnostic Data, i.e., the data comes as tuples $(\mathbf{x}, y)$ with no obvious functional relationship between $\mathbf{x}$ and $y$, but there could be geometrical or statistical assumptions about $\mathbf{x}$ and $y$. We note that it is not very interesting to work in the fully agnostic setting, since in that case training even a single ReLU gate can be NP-hard, as shown in [goel2016reliably].

On the other hand, the simplifications that happen for infinitely large networks have been discussed since [neal1996priors], and this theme has had a recent resurgence in works like [chizat2018global, jacot2018neural]. Eventually this led to an explosion of literature obtaining linear-time training of various kinds of neural nets when their width is a high-degree polynomial in the training set size, inverse accuracy and inverse confidence parameters (a somewhat unrealistic regime), [lee2017deep, wu2019global, du2018gradient, su2019learning, kawaguchi2019gradient, huang2019dynamics, allen2019convergenceDNN, allen2019learning, allen2019convergenceRNN, du2018power, zou2018stochastic, zou2019improved, arora2019exact, arora2019harnessing, li2019enhanced, arora2019fine]. The essential proximity of this regime to kernel methods has been examined separately in works like [allen2019can, wei2019regularization].
Even in the wake of this progress, it remains unclear how any of this can help establish rigorous guarantees about smaller neural networks, or more pertinently about constant-size neural nets, which is a regime closer to what is implemented in the real world. Thus motivated, we can summarize what is open about training depth-2 nets into the following questions,

Question 1: Can any algorithm train a ReLU gate to $\epsilon$ accuracy in $\mathrm{poly}(1/\epsilon)$ time using neither symmetry nor compact-support assumptions on the distribution?

Question 2: Can a single ReLU gate be trained using (S)GD with (a) random/arbitrary initialization and (b) a weakly constrained data distribution, at least allowing it to be non-Gaussian and preferably non-compactly supported?


Question 3: Can a neural training algorithm work with the following naturally desirable properties being simultaneously true?

Nets of depth 2 with a constant/small number of gates.

The training data instances (and maybe also the noise) would have non-Gaussian, non-compactly supported distributions.

Fewer structural assumptions on the weight matrices than being of the single-filter convolutional type.

$\epsilon$-approximate answers obtainable in at most $\mathrm{poly}(1/\epsilon)$ time.

II A summary of our results
We make progress on some of the above fronts by drawing inspiration from two distinct streams of literature, often generalizing and blending techniques from them. The first of them comprises the different avatars of the iterative stochastic non-gradient Tron algorithms analyzed in the past, like [rosenblatt1958perceptron, Pal1992MultilayerPF, freund1999large, kakade2011efficient, klivans2017learning, goel2017learning, goel2018learning]. The second kind of historical precedent that motivates us is the set of works which have shown how some of the desired theorems about gradient descent can be proven if designed noise is injected into the algorithm in judicious ways, [raginsky2017non, xu2018global, zhang2017hitting, durmus2019analysis, lee2019online, jin2018local, mou2018generalization, li2019generalization].
We will be working exclusively with the function class of neural nets whose activation is the ReLU, i.e., gates which map $\mathbf{x} \mapsto \max(0, \langle \mathbf{w}, \mathbf{x} \rangle)$ for $\mathbf{w}$ being the weight, or its Leaky ReLU generalization defined below. Hence for such nets the corresponding empirical or population risk is neither convex nor smooth in its dependence on the weights. Thus, to the best of our knowledge, none of the convergence results among the provable noise-assisted algorithms cited above can be directly applied to our neural nets, because those proofs crucially leverage either convexity or very strong smoothness assumptions on the optimization objective. In our work we focus on the following class of depth-2 nets,
Definition 1 (Single Filter Neural Nets of Depth 2).
Given a set of $k$ matrices $\{A_1, \ldots, A_k\}$, each in $\mathbb{R}^{r \times n}$, and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we call the following depth-$2$, width-$k$ neural net the single filter neural net defined by the matrices $\{A_i\}$ and the filter $\mathbf{w} \in \mathbb{R}^r$,
$$f_{\mathbf{w}}(\mathbf{x}) := \frac{1}{k} \sum_{i=1}^{k} \sigma\left(\langle \mathbf{w}, A_i \mathbf{x} \rangle\right),$$
where $\sigma$ is the Leaky ReLU which maps as $\sigma(y) = y \cdot \mathbf{1}_{y \geq 0} + \alpha y \cdot \mathbf{1}_{y < 0}$ for some $\alpha \in [0,1)$.
Note that the above class of nets includes any single Leaky ReLU gate (for width $1$ with $A_1$ the identity), and it also includes any depth-2 convolutional neural net with a single filter, by setting the $A_i$ to be $0/1$ matrices such that each row has exactly one $1$ and each column has at most one $1$.
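The two special cases above can be checked with a small sketch. The averaged form of the net below is our reading of Definition 1, and the patch-selector matrices are illustrative:

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z >= 0, z, alpha * z)

# Single-filter depth-2 net of Definition 1:
# x -> (1/k) * sum_i sigma(<w, A_i x>)   (averaging over the k gates)
def single_filter_net(w, A_list, x, alpha=0.1):
    k = len(A_list)
    return sum(leaky_relu(np.dot(w, A @ x), alpha) for A in A_list) / k

w = np.array([1.0, -1.0])

# Special case 1: width 1 and A_1 = I recovers a single Leaky ReLU gate.
x = np.array([2.0, 3.0])
assert single_filter_net(w, [np.eye(2)], x) == leaky_relu(np.dot(w, x))

# Special case 2: 0/1 matrices with exactly one 1 per row and at most one
# 1 per column act as patch selectors, i.e., a single-filter convolution.
x4 = np.array([1.0, 2.0, 3.0, 4.0])
A1 = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])   # picks (x1, x2)
A2 = np.array([[0., 0., 1., 0.], [0., 0., 0., 1.]])   # picks (x3, x4)
out = single_filter_net(w, [A1, A2], x4)
```

Here the same filter $\mathbf{w}$ is slid over the two disjoint patches of the input, which is what makes this a single-filter convolution.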
Our results in this work can be said to be of two types:
II-A Guarantees without the assumption of parity symmetry on the data distribution
We show three kinds of results in this category. In Section III we first present a very simple iterative stochastic algorithm to recover the underlying parameter $\mathbf{w}^*$ of a ReLU gate when the realizable data, allowed to be sampled online, is of the form $(\mathbf{x}, \max(0, \langle \mathbf{w}^*, \mathbf{x} \rangle))$. The distributional condition is very mild and essentially just captures the intuition that enough of our samples $\mathbf{x}$ are such that $\langle \mathbf{w}^*, \mathbf{x} \rangle > 0$.
To the best of our knowledge, all previous attempts at this problem in the exactly realizable case have solved it only for specific data distributions like the Gaussian distribution [soltanolkotabi2017learning, kalan2019fitting]. And results like [goel2018learning] have contained a solution to this problem as a special case, but only under symmetry assumptions on the distribution. In contrast, our assumptions on the distribution are significantly milder than in all previous attempts. More importantly, by making some extra distributional assumptions, the theorem in Section III also encompasses the case when, during training, the true labels are being additively distorted by the oracle by a bounded perturbation. In this case we show that the accuracy of the algorithm in recovering $\mathbf{w}^*$ is nearly optimal. To the best of our knowledge this is the first guarantee on training a ReLU gate in this adversarial setting.
In Section LABEL:sec:GDNoiseReLU we show a first-of-its-kind analysis of gradient descent on a ReLU gate, albeit when assisted with the injection of certain kinds of noise. We assume that the labels in the data are realizable, but we make no assumptions about the distribution on the domain. We make progress by showing that such noise-assisted GD in this situation has a diffusive behaviour about the global minima, i.e., after $T$ steps of the algorithm starting from anywhere, w.h.p. all the steps of the algorithm would be within a certain distance (parametric in the noise magnitude) of the global minima of the function. The key idea here is that of coupling, which shows that from the iterates of noise-injected gradient descent on the squared loss of a ReLU gate one can create a discrete submartingale.
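The kind of iteration analyzed here can be sketched as plain (sub)gradient descent on the squared loss of a ReLU gate plus injected noise. The Gaussian noise and the step-size/noise-scale values below are illustrative assumptions; the paper's analysis specifies its own noise model:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

# One step of noise-injected full-batch gradient descent on the squared
# loss of a ReLU gate. The injected Gaussian noise here is illustrative.
def noisy_gd_step(w, X, y, eta=0.1, noise_scale=0.01):
    pred = relu(X @ w)
    # (Sub)gradient of 0.5 * mean squared loss w.r.t. w.
    grad = X.T @ ((pred - y) * (X @ w > 0)) / len(y)
    return w - eta * grad + noise_scale * rng.standard_normal(w.shape)

w_star = np.array([1.0, 2.0])
X = rng.standard_normal((200, 2))
y = relu(X @ w_star)              # realizable labels

w = np.zeros(2)
for _ in range(500):
    w = noisy_gd_step(w, X, y)
# w is expected to diffuse in a small neighbourhood of w_star
```

Note that at $\mathbf{w} = 0$ the subgradient vanishes, and it is the injected noise that moves the iterate off this degenerate point, which is one intuition for why the noise assistance helps.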
Remark.
We would like to emphasize to the reader that in such a distribution-free regime as above, there are no algorithms known yet which can provably train. Also note that the result is parametric in the magnitude of the added noise, and hence one can make the algorithm arbitrarily close to being pure gradient descent.
In the short Section LABEL:sec:GLMTron we do a quick re-analysis of a known algorithm called GLMTron under more general conditions than previously, to show how well it can do (empirical) risk minimization on any Lipschitz gate (in particular a ReLU gate) in the noisily realizable setting, while no assumptions are made on the distribution of the noise beyond its boundedness; hence the noise can be adversarial. We also point out how the result can be improved under some assumptions making the noise more benign. Note that, in contrast to the training result in Section III which used a stochastic algorithm, here we are using full-batch iterative updates to gain these extra abilities: dealing with more general gates and having essentially no distributional assumptions on the training data.
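For the reader's convenience, the full-batch GLMTron iteration of [kakade2011efficient] can be sketched as follows; the concrete data and iteration count are illustrative, and our re-analysis in Section LABEL:sec:GLMTron concerns the guarantees, not a different update rule:

```python
import numpy as np

def glmtron(X, y, sigma, T=200):
    """Full-batch GLMTron updates for a gate y ~ sigma(<w, x>), with sigma
    non-decreasing and Lipschitz. Note the update uses the raw residual,
    with no derivative factor for sigma."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(T):
        w = w + (X.T @ (y - sigma(X @ w))) / m
    return w

relu = lambda z: np.maximum(0.0, z)

rng = np.random.default_rng(2)
w_star = np.array([0.5, -1.0, 2.0])
X = rng.standard_normal((500, 3))
y = relu(X @ w_star)              # noiseless realizable labels

w_hat = glmtron(X, y, relu)
```

The absence of the activation's derivative in the update is what distinguishes this Tron-style iteration from gradient descent on the squared loss, and is what makes it amenable to analysis without smoothness assumptions.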
II-B Guarantees with the assumption of parity symmetry on the data distribution
In Section LABEL:sec:NeuroTronD2FBSingle we allow ourselves a symmetry assumption on the data to show a generalization of the GLMTron algorithm (to what we call the NeuroTron) that can provably do approximate parameter recovery for our single filter nets of depth 2 with Leaky ReLU activation (in particular with ReLU activations). This multi-gate form of GLMTron turns out to be significantly more mathematically complicated to analyze, all the more so because we allow for adversarial label corruption. Working in a context that allows for attacks on the labels brings to light the advantages of large width and shows how it helps the training guarantees. We give a data-dependent threshold value of the width such that if we consider nets wider than this constant value, then we can get better accuracies of recovery while also allowing for nontrivial attacks on the labels.
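To convey the flavour of such a multi-gate Tron iteration, here is a speculative sketch of how a GLMTron-style update might generalize to the nets of Definition 1, with the residual pushed back through the average of the maps $A_i$; the actual NeuroTron update analyzed in Section LABEL:sec:NeuroTronD2FBSingle may differ in its exact form:

```python
import numpy as np

rng = np.random.default_rng(4)

def leaky_relu(z, alpha=0.1):
    return np.where(z >= 0, z, alpha * z)

def net(w, A_list, X, alpha=0.1):
    # Single-filter depth-2 net, averaged over the gates (as in Definition 1).
    return np.mean([leaky_relu(X @ A.T @ w, alpha) for A in A_list], axis=0)

def tron_step(w, A_list, X, y, eta=0.1):
    # Speculative multi-gate Tron update: raw residuals, pushed back
    # through the average map (1/k) * sum_i A_i.
    residual = y - net(w, A_list, X)
    M = np.mean(A_list, axis=0)                  # shape (r, n)
    return w + eta * (M @ (X.T @ residual)) / len(y)

# Patch-selector matrices: a single-filter "convolution" on a 4-dim input.
A1 = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])
A2 = np.array([[0., 0., 1., 0.], [0., 0., 0., 1.]])
w_star = np.array([1.5, -0.5])

X = rng.standard_normal((400, 4))                # parity-symmetric data
y = net(w_star, [A1, A2], X)                     # realizable labels

w = np.zeros(2)
risk0 = np.mean((net(w, [A1, A2], X) - y) ** 2)
for _ in range(500):
    w = tron_step(w, [A1, A2], X, y)
risk1 = np.mean((net(w, [A1, A2], X) - y) ** 2)
```

The standard Gaussian inputs used here are one example of a parity-symmetric distribution, i.e., one invariant under $\mathbf{x} \mapsto -\mathbf{x}$.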
We would like to emphasize the following two salient points here. Firstly, it is to be noted that while we leverage parity symmetry of the training data in the above, we are also able to write the proof for finite-width depth-2 nets of the type that we focus on. Along the way, the theorem in Section LABEL:sec:NeuroTronD2FBSingle also shows that the ground truth parameters of the finitely large nets that we consider can be recovered using a finite training set, with only a symmetry condition being imposed on it. We are not aware of this result being implied by any previous result.
Secondly, we would like to emphasize to the reader that [gao2019convergence] is the only previous result we are aware of about provable training of a neural net under any kind of model of adversarial attack. But that result, as it stands, only applies when the net is asymptotically large, i.e., in the NTK regime. In contrast, for the class of nets that we consider, we give an accuracy-width trade-off for all widths above a constant threshold value, showing that once our adversary is fixed, arbitrarily good accuracy of recovery can be achieved by using appropriately wide nets. Also, as opposed to that previous result, the adversarial setups in Sections III, LABEL:sec:GLMTron and LABEL:sec:NeuroTronD2FBSingle give guarantees while training with a data-poisoning adversary [wang2018data, zhang2019online, koh2018stronger]: the attack is allowed to happen on the training data in an online fashion. To the best of our knowledge, this work (and our companion paper [AnirbitRam2020Neuro1]) are the first such guarantees about training a net in the presence of a data-poisoning attack.
III Learning a ReLU gate in the realizable setting and with adversarial attacks on labels
Given a distribution $\mathcal{D}$ on $\mathbb{R}^n$, suppose the corresponding true labels are generated as $y = \max(0, \langle \mathbf{w}^*, \mathbf{x} \rangle)$ for some unknown $\mathbf{w}^* \in \mathbb{R}^n$. We assume that we have sampling access to $\mathbf{x} \sim \mathcal{D}$ and an adversarial oracle that returns labels $y = \max(0, \langle \mathbf{w}^*, \mathbf{x} \rangle) + \xi$ with adversarial noise bounded as $\vert \xi \vert \leq \theta_*$. To learn the true labeling function in the noisily realizable setting we try to solve the following optimization problem,
$$\min_{\mathbf{w} \in \mathbb{R}^n} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\left(\max(0, \langle \mathbf{w}, \mathbf{x} \rangle) - y\right)^2\right].$$
In contrast to previous work, we show that the simple algorithm given below solves this learning problem by leveraging the intuition that if we see enough labels where $y > 0$, then solving the linear regression problem on this subset of samples gives a $\mathbf{w}$ which is close to $\mathbf{w}^*$. In the situation with adversarial corruption (a nonzero noise bound) we show in subsection LABEL:opt:dadush that our recovery guarantee is nearly optimal, and in the realizable case (zero noise), our setup learns the true filter $\mathbf{w}^*$ to arbitrary accuracy using much milder distributional constraints than previous such results that we are aware of.

Theorem III.1.
Case I: Realizable setting (no label corruption).
Suppose (a) the mean and the covariance matrix of $\mathcal{D}$ exist, and (b) $\mathcal{D}$ is such that the matrix $\mathbb{E}\left[\mathbf{x}\mathbf{x}^\top \, \mathbf{1}_{\langle \mathbf{w}^*, \mathbf{x} \rangle > 0}\right]$ is positive definite, and let $\lambda_{\min}$ and $\lambda_{\max}$ denote its smallest and largest eigenvalues. If Algorithm 1 is run with a suitable step size and starting from a suitable initialization $\mathbf{w}_1$, then for any accuracy $\epsilon > 0$, after a number of steps $T$ growing only logarithmically in $1/\epsilon$ we would have w.h.p. that $\Vert \mathbf{w}_{T+1} - \mathbf{w}^* \Vert \leq \epsilon$.
Case II: With bounded adversarial corruption of the true labels.
Suppose additionally that the distribution $\mathcal{D}$ and the true filter $\mathbf{w}^*$ are such that (a) the required moments of $\mathcal{D}$ are finite and nonzero, and (b) the adversarial perturbations of the labels are bounded in magnitude by the noise bound $\theta_* > 0$. Then, for an appropriate choice of the step size in Algorithm 1, there exists a time $T$ such that for all noise bounds $\theta_*$, accuracies $\epsilon$, confidences $\delta$, and distributions $\mathcal{D}$ as above, we would have with probability at least $1 - \delta$,
$$\Vert \mathbf{w}_{T} - \mathbf{w}^* \Vert^2 \leq \mathcal{O}(\theta_*^2) + \epsilon. \qquad (2)$$
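The filtering intuition stated before Theorem III.1 can be sketched in the realizable case as follows. This is a hypothetical minimal implementation of that intuition, not necessarily identical to Algorithm 1: on samples with a positive label we know $y = \langle \mathbf{w}^*, \mathbf{x} \rangle$, so a stochastic linear-regression update on just those samples recovers $\mathbf{w}^*$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch of the stated intuition: keep only samples whose
# label is positive, and run a stochastic linear-regression (LMS) update
# on them. (The actual Algorithm 1 may differ.)
def train_relu_gate(sample, eta=0.05, T=5000, n=3):
    w = np.zeros(n)
    for _ in range(T):
        x, y = sample()
        if y > 0:                           # positive label => y = <w*, x>
            w = w + eta * (y - np.dot(w, x)) * x
    return w

w_star = np.array([1.0, -0.5, 2.0])

def sample():
    x = rng.standard_normal(3)
    return x, max(0.0, np.dot(w_star, x))   # realizable ReLU labels

w_hat = train_relu_gate(sample)
```

The mild condition of the theorem enters here: the updates make progress precisely because the second-moment matrix of the samples surviving the $y > 0$ filter is positive definite.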