# Julia Things

### Environment

First things first. Let us set up the environment with the required packages for this notebook. We will also set the desired context (e.g. `KnetArray` for GPU), the number of epochs (`nepochs`), the dropout probability (`pdrop`), and the variable `fast`. When `fast` is `true`, training skips the accuracy check at every epoch.

In [1]:
for p in ("Knet", "Plots", "Plotly.jl")
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using Knet, Plots
gr()

Knet.gpu(0); # set the desired GPU to use
atype   = KnetArray{Float32}; # atype = KnetArray{Float32} for gpu usage, Array{Float32} for cpu. 
nepochs = 10
fast    = false
pdrop   = 0.5
println("OS: ", Sys.KERNEL)
println("Julia: ", VERSION)
println("Knet: ", Pkg.installed("Knet"))
println("GPU: ", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))

OS: Linux
Julia: 0.6.0
Knet: 0.8.5+
GPU: NVS 310
TITAN X (Pascal)



### New Stuff

In this notebook we introduce the following Julia/Knet packages and functions:

* ...

# Dropout regularization 

If you're reading the tutorials in sequence, 
then you might remember from Part 2 
that machine learning models 
can be susceptible to overfitting. 
To recap: in machine learning,
our goal is to discover general patterns.
For example, we might want to learn an association between genetic markers
and the development of dementia in adulthood. 
Our hope would be to uncover a pattern that could be applied successfully to assess risk for the entire population.

However, when we train models, we don't have access to the entire population (of current or potential humans).
Instead, we can access only a small, finite sample.
Even in a large hospital system, we might get hundreds of thousands of medical records. 
Given such a finite sample size, it's possible to uncover spurious associations 
that don't hold up for unseen data.

Let's consider an extreme pathological case. 
Imagine that you want to learn to predict
which people will repay their loans. 
A lender hires you as a data scientist 
to investigate the case and gives you complete files on 100 applicants,
of which 5 defaulted on their loans within 3 years. 
The files might include hundreds of features, 
including income, occupation, credit score, length of employment, et cetera.
Imagine that they additionally give you video footage of their interview with a lending agent. 
That might seem like a lot of data! 

Now suppose that after generating an enormous set of features,
you discover that of the 5 applicants who defaulted, 
all 5 were wearing blue shirts during their interviews,
while only 40% of the general population wore blue shirts. 
There's a good chance that any model you train would pick up on this signal 
and use it as an important part of its learned pattern.

Even if defaulters are no more likely to wear blue shirts, 
there's about a 1% chance ($0.4^5 \approx 0.01$) that we'll observe all five defaulters wearing blue shirts.
And when the sample size is small while we have hundreds or thousands of features,
we may observe a large number of spurious correlations. 
Given trillions of training examples, these false associations might disappear. 
But we seldom have that luxury.
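
The arithmetic behind the blue-shirt example can be checked in a few lines of plain Julia. The second half is a hypothetical extension, not something from the lender scenario itself: it assumes `m` independent binary features that each share the same 40% base rate, and asks how likely it is that at least one of them is shared by all five defaulters purely by chance.

```julia
# Probability that all five defaulters wear blue, if shirt colour is unrelated to default
p_blue       = 0.4                      # share of the general population wearing blue (from the text)
n_defaulters = 5
p_all_blue   = p_blue^n_defaulters
println(p_all_blue)                     # about 0.0102, i.e. roughly 1%

# Assumption for illustration: m independent features, each with the same 40% base rate
m     = 1000
p_any = 1 - (1 - p_all_blue)^m          # chance that at least one feature lines up by accident
println(p_any)                          # very close to 1
```

Under these assumptions, a coincidence that is unlikely for any single feature becomes almost certain once there are enough features to search through.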

The phenomenon of fitting our training distribution more closely than the real distribution
is called *overfitting*, and the techniques used to combat overfitting are called *regularization*.
In the previous chapter, we introduced one classical approach to regularize statistical models. 
We penalized the size (the $\ell^2$ norm) of the weights, coercing them to take smaller values.
In probabilistic terms we might say this imposes a Gaussian prior on the value of the weights. 
But in more intuitive, functional terms, we can say this encourages the model to spread out its weights among many features and not to depend too much on a small number of potentially spurious associations. 
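
The Gaussian-prior remark can be made concrete with the standard correspondence below. The symbols are assumptions of this sketch rather than notation taken from the previous chapter: $L(w)$ is the unregularized training loss, $\lambda$ the penalty strength, and $\sigma^2$ the variance of the prior.

$$
\min_w \; L(w) + \frac{\lambda}{2}\,\|w\|_2^2
$$

If $L(w) = -\log p(\text{data} \mid w)$ and the prior is $w \sim \mathcal{N}(0, \sigma^2 I)$, then

$$
-\log p(w \mid \text{data}) = L(w) + \frac{1}{2\sigma^2}\,\|w\|_2^2 + \text{const},
$$

so minimizing the penalized loss is the same as finding the most probable weights under that prior, with $\lambda = 1/\sigma^2$. A tighter prior (smaller $\sigma^2$) corresponds to a stronger penalty.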
    

## With great flexibility comes overfitting liability

Given many more features than examples, linear models can overfit. 
But when there are many more examples than features, 
linear models can usually be counted on not to overfit.
Unfortunately this propensity to generalize well comes at a cost. 
For every feature, a linear model has to assign it either positive or negative weight.
Linear models can't take into account nuanced interactions between features.
In more formal texts, you'll see this phenomenon discussed as the bias-variance tradeoff.
Linear models have high bias (they can only represent a small class of functions),
but low variance (they give similar results across different random samples of the data).
[**point to more formal discussion of generalization when chapter exists**]

Deep neural networks, however, occupy the opposite end of the bias-variance spectrum.
Neural networks are so flexible because they aren't confined to looking at each feature individually.
Instead, they can learn complex interactions among groups of features. 
For example, they might infer that "Nigeria" and "Western Union" 
appearing together in an email indicates spam 
but that "Nigeria" without "Western Union" does not connote spam. 

Even for a small number of features, deep neural networks are capable of overfitting.
As one demonstration of the incredible flexibility of neural networks,
researchers showed that [neural networks perfectly classify randomly labeled data](https://arxiv.org/abs/1611.03530).
Let's think about what that means. 
If the labels are assigned uniformly at random, and there are 10 classes, 
then no classifier can be expected to do better than 10% accuracy on holdout data. 
Yet even in these situations, when there is no true pattern to be learned, 
neural networks can perfectly fit the training labels. 

## Dropping out activations

In 2012, Professor Geoffrey Hinton and his students including Nitish Srivastava 
introduced a new idea for how to regularize neural network models. 
The intuition goes something like this. 
When a neural network overfits badly to training data,
each layer depends too heavily on the exact configuration
of features in the previous layer. 

To prevent the neural network from depending too much on any exact activation pathway,
Hinton and Srivastava proposed randomly *dropping out* (i.e. setting to $0$) 
the hidden nodes in every layer with probability $.5$.
Given a network with $n$ nodes we are sampling uniformly at random from the $2^n$ 
networks in which a subset of the nodes are turned off. 

![](../img/dropout.png)

One intuition here is that because the nodes to drop out are chosen randomly on every pass,
the representations in each layer can't depend on the exact values taken by nodes in the previous layer. 
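
As a minimal sketch of the idea, the following plain-Julia function zeroes each activation independently with probability `p`. The name `drop_nodes` is hypothetical and is not Knet's `dropout`; it only illustrates the random masking described above, without any rescaling.

```julia
# Hypothetical helper (not part of Knet): drop each activation with probability p
function drop_nodes(h, p)
    mask = rand(size(h)...) .>= p    # true = keep the node, false = drop it
    return h .* mask                 # dropped nodes become 0, kept nodes are unchanged
end

h       = Float32[0.5, 1.0, 1.5, 2.0]   # made-up activations of one hidden layer
dropped = drop_nodes(h, 0.5)
println(dropped)                         # each entry is either its original value or 0
```

Calling `drop_nodes` again draws a fresh mask, which is why a layer cannot rely on any particular node of the previous layer being present.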

## Making predictions with dropout models

However, when it comes time to make predictions, 
we want to use the full representational power of our model. 
In other words, we don't want to drop out activations at test time.
One principled way to justify the use of all nodes simultaneously,
despite not training in this fashion,
is that it's a form of model averaging. 
At each layer we average the representations of all of the $2^n$ dropout networks.
Because each node has a $.5$ probability of being on during training, 
its vote is scaled by $.5$ when we use all nodes at prediction time.
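
The sketch below compares two ways of keeping training and prediction consistent for a single layer's activations. Convention A is the one described above: drop at training time, scale by $1 - p$ at prediction time. Convention B, often called inverted dropout, rescales the surviving nodes by $1/(1-p)$ during training and leaves prediction untouched. As far as we can tell from Knet's documentation, `dropout(x, p)` follows convention B, which would explain why `predict` below needs no extra scaling when `p=0`; treat that as an assumption to check against your installed version. All function names here are hypothetical, and averaging one layer's activations is only an approximation of averaging the outputs of whole nonlinear networks.

```julia
# Convention A: plain masking in training, scale by (1 - p) at prediction time
train_a(h, mask) = h .* mask
test_a(h, p)     = (1 - p) .* h

# Convention B (inverted dropout): rescale kept nodes in training, use h as-is at prediction time
train_b(h, mask, p) = h .* mask ./ (1 - p)
test_b(h)           = h

p = 0.5
h = [1.0, 2.0, 3.0, 4.0]                                 # made-up activations, n = 4 nodes

# With p = 0.5 all 2^4 masks are equally likely, so we can enumerate them exactly
masks = [[(m >> k) & 1 for k in 0:3] for m in 0:15]

avg_a = sum(train_a(h, m) for m in masks) ./ length(masks)
avg_b = sum(train_b(h, m, p) for m in masks) ./ length(masks)

println(avg_a == test_a(h, p))   # average over dropout masks matches the scaled test-time layer
println(avg_b == test_b(h))      # average over rescaled masks matches the unscaled layer
```

Either way, the activations seen at prediction time match the average of what the next layer saw during training.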




## Data 

In [19]:
include(Knet.dir("data","mnist.jl")) # loads the mnist() helper used on the next line
xtrn, ytrn, xtst, ytst = mnist()

dtrn = minibatch(xtrn, ytrn, 100, xtype=atype);
dtst = minibatch(xtst, ytst, 100, xtype=atype);



## Model

In [3]:
function initweights(d, scale=0.01; hidden=[2], atype=Array{Float32})
    model = Vector{Any}(2 * length(hidden))
    X = d
    for k = 1:length(hidden)
        H = hidden[k]
        model[2k - 1] = scale * randn(H, X) 
        model[2k]     = scale * randn(H, 1)
        X = H
    end
    return map(atype, model)
end

initweights (generic function with 2 methods)

We can now define the function `initmodel` with the desired parameters. The keyword `hidden` contains the output sizes for each of the layers (the last entry, 10, is the number of classes), and the first argument `d` is the size of the input variable `x` (in this case $x\in\mathbb{R}^{784}$). 

In [4]:
initmodel(atype) = initweights(784, hidden=[256, 128, 10]; atype=atype);
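
To see what `initweights` produces for this configuration, the sketch below repeats its loop but records only the array shapes. `layer_shapes` is a hypothetical helper written for this illustration; it uses plain Julia and allocates no weights.

```julia
# Hypothetical helper: mirror the loop in initweights, collecting shapes instead of arrays
function layer_shapes(d, hidden)
    shapes = Any[]
    X = d
    for H in hidden
        push!(shapes, (H, X))   # weight matrix maps X inputs to H outputs
        push!(shapes, (H, 1))   # bias column, broadcast across the minibatch
        X = H
    end
    return shapes
end

shapes  = layer_shapes(784, [256, 128, 10])
nparams = sum(s[1] * s[2] for s in shapes)
println(shapes)    # alternating weight and bias shapes for the three layers
println(nparams)   # total number of trainable parameters
```

The odd entries are the weight matrices and the even entries the biases, which is the layout `predict` relies on when it steps through `w` two entries at a time.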

In [5]:
function predict(w, x; p=0)
    x = mat(x)
    for i=1:2:length(w) - 2
        x = relu.(w[i] * x .+ w[i+1])
        x = dropout(x, p)
    end
    return w[end - 1]*x .+ w[end]
end

predict (generic function with 1 method)

Let's test the predict function to make sure everything works fine:

In [6]:
for (x, y) in dtrn
    display(predict(initmodel(atype), x; p=pdrop))
    break
end

10×100 Knet.KnetArray{Float32,2}:
  0.0102376     0.0110553     0.00971472  …   0.0105167     0.0115427 
  0.00719412    0.00809406    0.00717845      0.00660533    0.00637   
 -0.000235557   0.000569347   0.00010308      0.000691225  -0.00109118
 -0.0159943    -0.0154395    -0.0147362      -0.0158345    -0.0157569 
 -0.0134848    -0.0143681    -0.0139184      -0.0139141    -0.0141434 
  0.00447695    0.00421254    0.00494863  …   0.0045765     0.00520394
 -0.00737216   -0.00750307   -0.00789365     -0.00717817   -0.00586199
  0.00905698    0.00879994    0.00865319      0.0089237     0.00926897
 -0.00781964   -0.00838048   -0.00774621     -0.00739538   -0.00805289
  0.00610901    0.00778764    0.00727687      0.00711902    0.00629103

## Loss Function

In [7]:
loss(w, x, ygold, predict; o...) = nll(predict(w, x; o...), ygold);
lossgradient = grad(loss);

In [8]:
for (x, y) in dtrn
    display(loss(initmodel(atype), x, y, predict; p=pdrop))
    break
end

2.301888f0
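
The value above is a useful sanity check: with weights initialized at scale 0.01, the untrained network's scores are all close to zero, so it assigns roughly equal probability to each of the 10 classes and the loss should be close to $\log 10 \approx 2.3026$. The sketch below reproduces that with a hand-written negative log-likelihood. `nll_sketch` is a hypothetical stand-in, not Knet's `nll`, and it assumes scores laid out as classes × examples with integer labels starting at 1.

```julia
# Hypothetical helper: mean negative log-likelihood of the correct class under a softmax
function nll_sketch(scores, labels)
    total = 0.0
    for j in 1:size(scores, 2)
        col  = scores[:, j]
        m    = maximum(col)
        logz = m + log(sum(exp.(col .- m)))   # numerically stable log-sum-exp
        total += logz - col[labels[j]]        # -log softmax probability of the true class
    end
    return total / size(scores, 2)
end

scores = zeros(10, 3)          # an untrained model that scores every class equally
labels = [1, 5, 10]            # made-up labels for three examples
println(nll_sketch(scores, labels))
println(log(10))               # the two printed values agree up to rounding
```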

## Train Function

In [9]:
function train(w, dtrn, optim, predict; epochs=10, o...)
    for epoch = 1:epochs
        for (x, y) in dtrn
            g = lossgradient(w, x, y, predict; o...)
            update!(w, g, optim)
        end
    end
end

train (generic function with 1 method)

## Optimizer

In [10]:
optim(w; lr=0.01) = optimizers(w, Sgd;  lr=lr);
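
With `Sgd`, each call to `update!` in the training loop amounts to the plain gradient step $w \leftarrow w - \text{lr} \cdot g$ for every weight array. The sketch below spells that out in plain Julia; `sgd_step!` is a hypothetical helper for illustration, not Knet's implementation, and the numbers are made up.

```julia
# Hypothetical helper: one stochastic gradient descent step over a list of weight arrays
function sgd_step!(w, g, lr)
    for i in 1:length(w)
        w[i] = w[i] .- lr .* g[i]   # move each array against its gradient
    end
    return w
end

w = Any[[1.0, 2.0], [0.5]]          # made-up weights: one vector and one bias
g = Any[[0.1, -0.2], [1.0]]         # made-up gradients with matching shapes
sgd_step!(w, g, 0.1)
println(w)                          # weights with positive gradient shrink, negative gradient grow
```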

## Helpers

In [11]:
function report(epoch, w, dtrn, dtst, predict)
    println((:epoch, epoch, :trn, accuracy(w, dtrn, predict), :tst, accuracy(w, dtst, predict)))
end

report (generic function with 1 method)
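
The `accuracy` call in `report` measures the fraction of examples whose highest-scoring class matches the label. A minimal plain-Julia version of that computation, for a single matrix of scores, might look like the sketch below. `accuracy_sketch` is a hypothetical helper that assumes scores laid out as classes × examples and integer labels starting at 1; Knet's own function additionally iterates over minibatches and calls `predict`.

```julia
# Hypothetical helper: fraction of columns whose largest score sits at the labelled row
function accuracy_sketch(scores, labels)
    correct = 0
    for j in 1:size(scores, 2)
        best = 1
        for i in 2:size(scores, 1)
            if scores[i, j] > scores[best, j]
                best = i                      # remember the row with the largest score
            end
        end
        correct += (best == labels[j]) ? 1 : 0
    end
    return correct / size(scores, 2)
end

scores = [0.1 0.7 0.2;
          0.8 0.2 0.3;
          0.1 0.1 0.5]                        # made-up scores: 3 classes, 3 examples
labels = [2, 1, 1]                            # the third label disagrees with the top score
println(accuracy_sketch(scores, labels))      # two of the three predictions are correct
```

Note that `report` calls `predict` without `p`, so dropout is switched off while accuracy is measured.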

## Train the Model

In [12]:
w   = initmodel(atype);
opt = optim(w, lr=1e-1);

if fast
    train(w, dtrn, opt, predict; epochs=nepochs, p=pdrop)
else
    for epoch = 1:nepochs
        train(w, dtrn, opt, predict; epochs=1, p=pdrop)
        report(epoch, w, dtrn, dtst, predict)
    end
end

(:epoch, 1, :trn, 0.84845, :tst, 0.8542)
(:epoch, 2, :trn, 0.9113666666666667, :tst, 0.9118)
(:epoch, 3, :trn, 0.9418, :tst, 0.94)
(:epoch, 4, :trn, 0.9537333333333333, :tst, 0.9513)
(:epoch, 5, :trn, 0.9601666666666666, :tst, 0.9574)
(:epoch, 6, :trn, 0.9650666666666666, :tst, 0.9607)
(:epoch, 7, :trn, 0.96965, :tst, 0.9651)
(:epoch, 8, :trn, 0.9738666666666667, :tst, 0.9676)
(:epoch, 9, :trn, 0.9758, :tst, 0.9697)
(:epoch, 10, :trn, 0.9786166666666667, :tst, 0.9718)


## Conclusion

Nice. With just two hidden layers containing 256 and 128 hidden nodes, respectively, and dropout with probability 0.5, we reach about 97% test accuracy after 10 epochs on this task. 

## Next
[dropout](section3-dropout.ipynb)

For whinges or inquiries, [open an issue on  GitHub.](https://github.com/moralesq/Knet-the-Julia-dope)