# Julia Things

### Environment

First things first. Let us set up the environment with the required packages for this notebook. We will also set the desired context (e.g. `KnetArray` for GPU), the number of epochs (`nepochs`), and the variable `fast`. When `fast` is `true`, training skips the accuracy check after every epoch.

In [86]:
for p in ("Knet", "Plots", "Plotly.jl")
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using Knet, Plots
gr()

Knet.gpu(0); # set the desired GPU to use
atype   = KnetArray{Float32}; # atype = KnetArray{Float32} for gpu usage, Array{Float32} for cpu. 
nepochs = 10
fast    = false
println("OS: ", Sys.KERNEL)
println("Julia: ", VERSION)
println("Knet: ", Pkg.installed("Knet"))
println("GPU: ", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))

OS: Linux
Julia: 0.6.0
Knet: 0.8.5+
GPU: NVS 310
TITAN X (Pascal)



### New Stuff

In this notebook we introduce the following Julia/Knet packages and functions:

* ...

# Multilayer perceptrons from scratch

In the previous chapters we showed how you could implement multiclass logistic regression
(also called *softmax regression*)
for classifying images of handwritten digits into the 10 possible categories.
This is where things start to get fun.
We understand how to wrangle data,
how to coerce our outputs into a valid probability distribution,
how to apply an appropriate loss function,
and how to optimize over our parameters.
Now that we've covered these preliminaries, 
we can extend our toolbox to include deep neural networks.

Recall that before, we mapped our inputs directly onto our outputs through a single linear transformation.
$$\hat{y} = \mbox{softmax}(W \boldsymbol{x} + b)$$

Graphically, we could depict the model like this:
![](https://github.com/zackchase/mxnet-the-straight-dope/blob/master/img/simple-softmax-net.png?raw=true)
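
As a reminder of what that single linear transformation computes, here is a minimal sketch in plain Julia, without Knet. The name `softmax_sketch` and the toy sizes (4 inputs, 3 classes) are assumptions made for illustration only; they do not come from the earlier chapters.

```julia
# Numerically stable softmax over a vector of scores.
function softmax_sketch(z)
    e = exp.(z .- maximum(z))   # subtracting the max keeps exp from overflowing
    return e ./ sum(e)
end

# Toy parameters, chosen only for illustration: 4 input features, 3 classes.
W = [ 0.1 -0.2  0.3  0.0;
      0.0  0.5 -0.1  0.2;
     -0.3  0.1  0.0  0.4]
b = [0.0, 0.1, -0.1]
x = [1.0, 2.0, 3.0, 4.0]

yhat = softmax_sketch(W * x .+ b)   # one probability per class
println(yhat, " sums to ", sum(yhat))
```

Every entry of `yhat` is positive and the entries sum to one, which is what makes the output a valid probability distribution.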

If our labels really were related to our input data by an approximately linear function,
then this approach might be adequate.
*But linearity is a strong assumption*.
Linearity means that, for any given output of interest and for each input,
increasing the input's value must either always drive the output up
or always drive it down,
irrespective of the values of the other inputs.

Imagine the case of classifying cats and dogs based on black and white images.
That's like saying that for each pixel,
increasing its value either increases the probability that the image depicts a dog or decreases it.
That's not reasonable. After all, the world contains both black dogs and black cats, and both white dogs and white cats. 
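
The "irrespective of the other inputs" part can be checked numerically. The sketch below uses a hypothetical two-input linear score with made-up weights; it is only meant to show that raising the first input changes the score by the same amount no matter what the second input is.

```julia
# Hypothetical linear score with two inputs; the weights are made up.
w = [2.0, -1.0]
b = 0.5
score(x) = sum(w .* x) + b

delta = 1.0   # how much we raise the first input
# Change in the score for three very different values of the second input.
changes = [score([1.0 + delta, other]) - score([1.0, other])
           for other in (-10.0, 0.0, 10.0)]
println(changes)   # every entry equals w[1] * delta
```

A pixel whose effect should flip depending on its neighbours cannot be captured by a score of this form.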

Teasing out what is depicted in an image generally requires allowing more complex relationships between
our inputs and outputs, considering the possibility that our pattern might be characterized by interactions among the many features. 
In these cases, linear models will have low accuracy. 
We can model a more general class of functions by incorporating one or more *hidden layers*.
The easiest way to do this is to stack layers of neurons on top of each other.
Each layer feeds into the layer above it, until we generate an output.
This architecture is commonly called a *multilayer perceptron* (MLP).
With $n$ hidden layers, an MLP computes:

$$ \boldsymbol{h}_1 = \phi(W_1\boldsymbol{x} + b_1) $$
$$ \boldsymbol{h}_2 = \phi(W_2\boldsymbol{h}_1 + b_2) $$
$$ \vdots $$
$$ \boldsymbol{h}_n = \phi(W_n\boldsymbol{h}_{n-1} + b_n) $$

Note that each layer requires its own set of parameters.
For each hidden layer, we calculate its value by first applying a linear function 
to the activations of the layer below, and then applying an element-wise
nonlinear activation function. 
Here, we've denoted the activation function for the hidden layers as $\phi$.
Finally, given the topmost hidden layer, we'll generate an output.
Because we're still focusing on multiclass classification, we'll stick with the softmax activation in the output layer.

$$ \hat{y} = \mbox{softmax}(W_y \boldsymbol{h}_n + b_y)$$

Graphically, a multilayer perceptron could be depicted like this:

![](https://github.com/zackchase/mxnet-the-straight-dope/blob/master/img/multilayer-perceptron.png?raw=true)
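
The layer equations above translate almost line for line into code. The following is a rough sketch in plain Julia, taking $\phi$ to be the ReLU (the same choice `predict` makes later in this notebook). The names `relu_sketch`, `softmax_sketch`, and `mlp_forward`, and the tiny layer sizes, are hypothetical and chosen only for illustration.

```julia
relu_sketch(z) = max.(z, 0)          # element-wise nonlinearity, standing in for phi

function softmax_sketch(z)
    e = exp.(z .- maximum(z))        # shift by the max for numerical stability
    return e ./ sum(e)
end

# hidden_layers is a list of (W, b) pairs, one pair per hidden layer.
function mlp_forward(x, hidden_layers, Wy, by)
    h = x
    for (W, b) in hidden_layers
        h = relu_sketch(W * h .+ b)  # h_k = phi(W_k * h_{k-1} + b_k)
    end
    return softmax_sketch(Wy * h .+ by)
end

# Toy network: 4 inputs, two hidden layers of 5 units, 3 output classes.
hidden_layers = [(0.1 * randn(5, 4), zeros(5)),
                 (0.1 * randn(5, 5), zeros(5))]
Wy, by = 0.1 * randn(3, 5), zeros(3)

yhat = mlp_forward([1.0, 2.0, 3.0, 4.0], hidden_layers, Wy, by)
println(yhat)
```

Because the weights are random, the printed probabilities differ from run to run, but they always sum to one.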

Multilayer perceptrons can account for complex interactions in the inputs because 
the hidden neurons depend on the values of each of the inputs. 
It's easy to design hidden nodes that perform arbitrary computation,
such as logical operations.
It is also well known that multilayer perceptrons are universal approximators.
That means that even a neural network with a single hidden layer,
given enough nodes and the right set of weights, can approximate any continuous function on a bounded input domain as closely as we like!
Actually learning that function is the hard part.
And it turns out that we can often approximate functions much more compactly if we use deeper (vs wider) neural networks.
We'll get more into the maths in subsequent chapters. But for now, let's actually build an MLP.
In this example, we'll implement a multilayer perceptron with two hidden layers and one output layer.
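
To make the remark about logical operations concrete, here is a small sketch of a network with one hidden layer of two ReLU units that computes XOR, a function no single linear layer can represent. The weights are set by hand rather than learned, and the names `W1`, `b1`, `w2`, and `xor_net` are hypothetical.

```julia
# Hand-picked weights, not learned ones.
W1 = [1.0 1.0;
      1.0 1.0]
b1 = [0.0, -1.0]     # second hidden unit only fires when both inputs are on
w2 = [1.0, -2.0]     # output = h1 - 2*h2

xor_net(x) = sum(w2 .* max.(W1 * x .+ b1, 0))

for x in ([0, 0], [0, 1], [1, 0], [1, 1])
    println(x, " -> ", xor_net(x))   # outputs follow the XOR truth table: 0, 1, 1, 0
end
```

The second hidden unit detects the case where both inputs are on and cancels the contribution of the first, which is exactly the kind of interaction between inputs that a linear model cannot express.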

## Data 

In [87]:
include(Knet.dir("data","mnist.jl"))
xtrn, ytrn, xtst, ytst = mnist()
dtrn = minibatch(xtrn, ytrn, 100, xtype=atype);
dtst = minibatch(xtst, ytst, 100, xtype=atype);



## Model

In [88]:
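function initweights(d, scale=0.01; hidden=[2], atype=Array{Float32})
    # One weight matrix and one bias vector per entry of `hidden`.
    model = Vector{Any}(2 * length(hidden))
    X = d                                    # input size of the current layer
    for k = 1:length(hidden)
        H = hidden[k]                        # output size of the current layer
        model[2k - 1] = scale * randn(H, X)  # weights: small Gaussian initialization
        model[2k]     = scale * randn(H, 1)  # bias, stored as an H×1 column
        X = H                                # this layer's output feeds the next layer
    end
    return map(atype, model)                 # convert to the requested array type
end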

initweights (generic function with 2 methods)

Next we define the function `initmodel` with the desired parameters. The `hidden` keyword of `initweights` lists the output size of every layer, here two hidden layers of `num_hidden` units followed by the output layer of `num_outputs` units, and `num_inputs` is the size of the input variable `x` (in this case $x\in\mathbb{R}^{784}$).

In [89]:
function initmodel(atype;num_inputs=784,num_hidden=256,num_outputs=10)
    return initweights(num_inputs,hidden=[num_hidden,num_hidden,num_outputs]; atype=atype);
end

initmodel (generic function with 4 methods)
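
As a quick sanity check on the model size, the arithmetic below counts the parameters implied by the default arguments of `initmodel`, given that `initweights` creates an `H × X` weight matrix and an `H × 1` bias for each layer. It is plain Julia and does not call Knet; the variable names `sizes` and `nparams` are introduced here for illustration.

```julia
num_inputs, num_hidden, num_outputs = 784, 256, 10

# Layer sizes from input to output: 784 -> 256 -> 256 -> 10.
sizes = [num_inputs, num_hidden, num_hidden, num_outputs]

# Each layer contributes (outputs * inputs) weights plus (outputs) biases.
nparams = sum(sizes[k+1] * sizes[k] + sizes[k+1] for k in 1:length(sizes)-1)
println(nparams)   # 200960 + 65792 + 2570 = 269322
```

Almost three quarters of the parameters sit in the first layer, because the 784-dimensional input is by far the widest interface in the network.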

In [90]:
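function predict(w, x)
    x = mat(x)                       # flatten each image into one column
    for i=1:2:length(w) - 2          # hidden layers: (weight, bias) pairs
        x = relu.(w[i] * x .+ w[i+1])
    end
    return w[end - 1]*x .+ w[end]    # output layer: unnormalized scores (no softmax applied here)
end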

predict (generic function with 1 method)

Let's test the predict function to make sure everything works fine:

In [91]:
for (x, y) in dtrn
    display(predict(initmodel(atype), x))
    break
end

10×100 Knet.KnetArray{Float32,2}:
  0.0227497     0.022383      0.0233163    …   0.023055      0.023645   
 -0.0011035    -0.00171302   -0.000232517     -0.0005282    -0.000880577
 -0.0114875    -0.0115701    -0.0104334       -0.00967458   -0.010482   
 -0.000505474   0.000431799  -8.67485f-5      -0.000775862  -0.00099238 
 -0.000354351   0.000189104  -0.00214156      -0.00062014   -0.00154479 
  0.00512402    0.00518112    0.00480477   …   0.00500566    0.00480705 
 -0.0223992    -0.022344     -0.0207275       -0.0205362    -0.0206613  
 -0.00505681   -0.00434136   -0.00457164      -0.00460923   -0.00487216 
 -0.0160057    -0.0165691    -0.0150407       -0.016062     -0.0167797  
  0.0114391     0.0104472     0.0122909        0.011399      0.0120771  

## Loss Function

In [92]:
loss(w, x, ygold, predict) = nll(predict(w, x), ygold);
lossgradient = grad(loss);
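
For intuition about what the loss measures, here is a plain-Julia sketch of an average negative log-likelihood computed from unnormalized scores. It assumes one column of scores per example and integer class labels starting at 1; `nll_sketch` is a hypothetical name, and this is not Knet's implementation of `nll`.

```julia
# scores: classes × examples matrix of unnormalized scores.
# labels: one integer class index per example.
function nll_sketch(scores, labels)
    total = 0.0
    for j in 1:size(scores, 2)
        z = scores[:, j]
        m = maximum(z)
        logZ = m + log(sum(exp.(z .- m)))   # stable log-sum-exp
        total += logZ - z[labels[j]]        # -log softmax(z)[label]
    end
    return total / size(scores, 2)
end

scores = [5.0 0.0;
          0.0 5.0;
          0.0 0.0]
labels = [1, 2]
println(nll_sketch(scores, labels))            # small: the correct classes score highest
println(nll_sketch(zeros(10, 4), [1, 2, 3, 4]))  # equals log(10) for uniform scores
```

Under this definition, a freshly initialized 10-class model whose scores are all close to zero starts with a loss near `log(10)`, roughly 2.30.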

## Train Function

In [93]:
function train(w, dtrn, optim, predict; epochs=10)
    for epoch = 1:epochs
        for (x, y) in dtrn
            g = lossgradient(w, x, y, predict)
            update!(w, g, optim)
        end
    end
end

train (generic function with 2 methods)

## Optimizer

In [94]:
optim(w; lr=0.01) = optimizers(w, Sgd;  lr=lr);
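
Plain stochastic gradient descent moves every parameter array a small step against its gradient. The sketch below spells out that rule in plain Julia; `sgd_step!` is a hypothetical helper written for illustration, not the code behind Knet's `Sgd` or `update!`.

```julia
# w and g are lists of arrays with matching shapes; lr is the learning rate.
function sgd_step!(w, g, lr)
    for k in 1:length(w)
        w[k] = w[k] - lr * g[k]   # step against the gradient
    end
    return w
end

w = Any[[1.0, 2.0]]               # a single toy parameter array
g = Any[[0.5, -1.0]]              # its (made-up) gradient
sgd_step!(w, g, 0.1)
println(w[1])                     # approximately [0.95, 2.1]
```

A larger `lr` takes bigger steps, which can speed up training but can also overshoot.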

## Helpers

In [95]:
function report(epoch, w, dtrn, dtst, predict)
    println((:epoch, epoch, :trn, accuracy(w, dtrn, predict), :tst, accuracy(w, dtst, predict)))
end

report (generic function with 1 method)
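
The accuracy reported here is the fraction of examples whose highest-scoring class matches the label. A minimal sketch of that computation on a matrix of scores, in plain Julia, might look as follows; `accuracy_sketch` is a hypothetical name and takes scores and labels directly, unlike the Knet `accuracy` call used in `report`.

```julia
# scores: classes × examples; labels: one integer class index per example.
function accuracy_sketch(scores, labels)
    correct = 0
    for j in 1:size(scores, 2)
        guess = findmax(scores[:, j])[2]   # index of the highest score
        correct += (guess == labels[j])
    end
    return correct / size(scores, 2)
end

scores = [0.9 0.1 0.2;
          0.1 0.8 0.3;
          0.0 0.1 0.5]
labels = [1, 2, 1]
println(accuracy_sketch(scores, labels))   # 2 of 3 predictions are correct
```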

## Train the Model

In [98]:
w   = initmodel(atype);
opt = optim(w, lr=1e-1);

if fast
    train(w, dtrn, opt, predict; epochs=nepochs)
else
    for epoch = 1:nepochs
        train(w, dtrn, opt, predict, epochs=1)
        report(epoch, w, dtrn, dtst, predict)
    end
end

(:epoch, 1, :trn, 0.8536166666666667, :tst, 0.856)
(:epoch, 2, :trn, 0.9123333333333333, :tst, 0.9125)
(:epoch, 3, :trn, 0.9393166666666667, :tst, 0.938)
(:epoch, 4, :trn, 0.9544, :tst, 0.9521)
(:epoch, 5, :trn, 0.9643833333333334, :tst, 0.9603)
(:epoch, 6, :trn, 0.9713, :tst, 0.9667)
(:epoch, 7, :trn, 0.97625, :tst, 0.9688)
(:epoch, 8, :trn, 0.9797833333333333, :tst, 0.9713)
(:epoch, 9, :trn, 0.9826, :tst, 0.9726)
(:epoch, 10, :trn, 0.9842833333333333, :tst, 0.9735)


## Conclusion

Nice! With just two hidden layers of 256 hidden nodes each, we reach over 97% test accuracy on this task.

## Next
[dropout](section3-dropout.ipynb)

For whinges or inquiries, [open an issue on GitHub](https://github.com/moralesq/Knet-the-Julia-dope).