Basic Tricks for Neural Networks
L2 Regularization
A full loss function includes regularization over all parameters $\theta$, e.g., L2 regularization added to the data loss $f(x)$:
$$ J(\theta)=f(x)+\lambda \sum_k \theta^2_k $$
Regularization prevents overfitting, so even a “big” model with many parameters can still generalize well.
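As a minimal sketch (assuming a PyTorch setup; the model here is just a placeholder), the penalty can be added explicitly to the loss, or via the optimizer's `weight_decay` argument:

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)        # placeholder model, for illustration only
criterion = nn.CrossEntropyLoss()

# Option 1: add an explicit L2 penalty to the data loss
def loss_with_l2(outputs, targets, lam=1e-4):
    data_loss = criterion(outputs, targets)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + lam * l2

# Option 2: weight decay built into the optimizer
# (adds lam * theta to each gradient, i.e. an L2 penalty of (lam/2) * sum(theta^2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```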
Dropout
- Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
- Test time: halve the model weights (since twice as many inputs are now active); see the sketch after this list
- This prevents feature co-adaptation
- Can be thought of as a form of model bagging (i.e., like an ensemble model)
- Nowadays usually thought of as strong, feature-dependent regularizer
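A minimal NumPy sketch of the scheme described above (the array names are placeholders); note that most libraries instead implement the equivalent “inverted dropout”, which rescales at training time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    """Training time: randomly zero a fraction p_drop of the inputs to each neuron."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def weights_at_test(W, p_drop=0.5):
    """Test time: scale weights by (1 - p_drop) -- halving for p_drop = 0.5 --
    so expected activations match training."""
    return W * (1.0 - p_drop)

# Usage: h = dropout_train(h) inside the training loop.
# Libraries such as torch.nn.Dropout scale by 1/(1 - p_drop) at training time
# instead, so no test-time rescaling is needed.
```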
Vectorization
Always try to use vectors and matrices rather than for loops.
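For example (a sketch using NumPy), multiplying a weight matrix against many input vectors at once is much faster than looping over them one at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((500, 300))
X = rng.standard_normal((300, 10000))  # 10,000 column vectors

# Slow: loop over each column vector
out_loop = np.stack([W @ X[:, i] for i in range(X.shape[1])], axis=1)

# Fast: a single matrix-matrix product
out_vec = W @ X

assert np.allclose(out_loop, out_vec)
```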
Non-linearities, old and new
For building a deep network, the first thing you should try is ReLU — it trains quickly and performs well due to good gradient backflow.
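The intuition, in equations: ReLU passes gradients through unchanged wherever it is active, while saturating units like the sigmoid shrink gradients at every layer.
$$ \text{ReLU}(z)=\max(0,z), \qquad \frac{d}{dz}\,\text{ReLU}(z)=\begin{cases}1, & z>0\\ 0, & z\le 0\end{cases} \qquad \text{vs.} \qquad \sigma'(z)=\sigma(z)\big(1-\sigma(z)\big)\le 0.25 $$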
Parameter Initialization
You normally must initialize weights to small random values to avoid symmetries that prevent learning/specialization.
Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target)
Initialize all other weights ~ Uniform($-r$, $r$), with $r$ chosen so that the numbers get neither too big nor too small
Xavier initialization has variance inversely proportional to fan-in $n_{in}$ (previous layer size) and fan-out $n_{out}$ (next layer size):
$$ Var(W_i)=\frac{2}{n_{in}+n_{out}} $$
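A minimal NumPy sketch of Xavier (Glorot) uniform initialization; layer sizes here are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    """Uniform(-r, r) with r = sqrt(6 / (n_in + n_out)),
    so Var(W) = r^2 / 3 = 2 / (n_in + n_out)."""
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))

W = xavier_uniform(300, 500)   # weight matrix for a 300 -> 500 layer
b = np.zeros(500)              # hidden-layer biases initialized to 0
```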
Optimizers
- Usually, plain SGD will work just fine, but getting good results often requires hand-tuning the learning rate
- More sophisticated “adaptive” optimizers give differential per-parameter learning rates (see the sketch after this list):
- Adagrad
- RMSprop
- Adam: A fairly good, safe place to begin in many cases
- SparseAdam
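A sketch of what this choice looks like in PyTorch (the model is a placeholder):

```python
import torch

model = torch.nn.Linear(300, 5)  # placeholder model

# Plain SGD is often fine if you tune the learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Adam adapts a per-parameter step size and is a safe default
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Standard training step:
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```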
Learning Rates
- You can just use a constant learning rate. It must be the right order of magnitude – try powers of 10
- Too big: model may diverge or not converge
- Too small: your model may not have trained by the assignment deadline
- Better: decrease the learning rate as you train, e.g., by a formula $lr=lr_0 e^{-kt}$ for epoch $t$ (see the sketch below)
- There are fancier methods like cyclic learning rates
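As a sketch (assuming PyTorch; model and constants are placeholders), the exponential decay formula above can be applied by hand or via a built-in scheduler:

```python
import math
import torch

model = torch.nn.Linear(300, 5)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Manual exponential decay: lr = lr0 * exp(-k * t) for epoch t
lr0, k = 0.1, 0.1
for t in range(10):
    for group in optimizer.param_groups:
        group["lr"] = lr0 * math.exp(-k * t)
    # ... run one epoch of training here ...

# Same idea with a built-in scheduler (gamma = exp(-k)); call scheduler.step() once per epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=math.exp(-k))
```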
Language Modeling
Language Modeling is the task of predicting what word comes next. A system that does this is called a Language Model.
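More formally: given a sequence of words $x^1, \dots, x^t$, a Language Model computes the probability distribution of the next word $x^{t+1}$:
$$ P(x^{t+1} \mid x^{t}, \dots, x^{1}) $$
where $x^{t+1}$ can be any word in a fixed vocabulary $V$.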
n-gram Language Models
An n-gram is a chunk of $n$ consecutive words.
- unigrams: “the”, “students”, “opened”, “their”
- bigrams: “the students”, “students opened”, “opened their”
- trigrams: “the students opened”, “students opened their”
- 4-grams: “the students opened their”
We make a Markov assumption: $x^{t+1}$ depends only on the preceding $n-1$ words.
So we can get n-gram and (n-1)-gram probabilities by counting them in a large corpus of text.
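Concretely, the estimate is a ratio of counts:
$$ P(x^{t+1} \mid x^{t}, \dots, x^{t-n+2}) \approx \frac{\mathrm{count}(x^{t-n+2}, \dots, x^{t}, x^{t+1})}{\mathrm{count}(x^{t-n+2}, \dots, x^{t})} $$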
Example
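A minimal sketch of a count-based trigram model (the toy corpus and its whitespace tokenization are placeholders, for illustration only):

```python
from collections import Counter

# Toy corpus for illustration only
tokens = "the students opened their books as the students opened their minds".split()

n = 3  # trigram model: condition on the previous n-1 = 2 words
ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))

def prob(next_word, context):
    """P(next_word | context) = count(context + next_word) / count(context)."""
    c = contexts[tuple(context)]
    return ngrams[tuple(context) + (next_word,)] / c if c else 0.0

print(prob("their", ("students", "opened")))  # -> 1.0 in this toy corpus
```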
Sparsity Problems with n-gram
If an n-gram never occurs in the corpus, its estimated probability is 0 (and if its (n-1)-gram context never occurs, the probability can't be computed at all). Note: increasing $n$ makes these sparsity problems worse. Typically, we can't have $n$ bigger than 5.
Storage Problems with n-gram
The model must store counts for all n-grams seen in the corpus, so its size grows as $n$ (or the corpus) increases.
Recurrent Neural Networks (RNN)
RNN Advantages:
- Can process any length input
- Computation for step $t$ can (in theory) use information from many steps back
- Model size doesn’t increase for longer input context
- Same weights applied on every timestep, so there is symmetry in how inputs are processed (see the sketch at the end of this section)
RNN Disadvantages:
- Recurrent computation is slow: it is inherently sequential, so it can't be parallelized across timesteps
- In practice, difficult to access information from many steps back
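A minimal NumPy sketch of a vanilla RNN step, illustrating that the same weights are reused at every timestep and that the unroll is sequential (dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_input = 64, 50

# One set of weights, shared across all timesteps
W_h = rng.standard_normal((d_hidden, d_hidden)) * 0.01
W_x = rng.standard_normal((d_hidden, d_input)) * 0.01
b = np.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    """h_t = tanh(W_h @ h_prev + W_x @ x_t + b)"""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Unroll over an arbitrary-length input sequence
xs = [rng.standard_normal(d_input) for _ in range(20)]
h = np.zeros(d_hidden)
for x_t in xs:           # sequential: step t needs the hidden state from step t-1
    h = rnn_step(h, x_t)
```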