# Julia Things

### Environment

First things first. Let us set up the environment with the required packages for this notebook:

In [1]:
for p in ("Knet", "Plots", "Plotly.jl")
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using Knet, Plots
gr()

Knet.gpu(0); # set the desired GPU to use
atype = KnetArray{Float32}; # atype = KnetArray{Float32} for gpu usage, Array{Float32} for cpu. 
srand(1)

println("OS: ", Sys.KERNEL)
println("Julia: ", VERSION)
println("Knet: ", Pkg.installed("Knet"))
println("GPU: ", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))

OS: Linux
Julia: 0.6.0
Knet: 0.8.5+
GPU: NVS 310
TITAN X (Pascal)



### New Stuff

In this notebook we introduce the following Julia/Knet packages and functions:

* ...

# Generative Adversarial Networks (GANs)

Most applications of GANs involve images. Since training such models takes too much time in a Jupyter notebook on a laptop, we're going to fit a much simpler distribution instead: we will illustrate what happens if we use GANs to build the world's most inefficient estimator of the parameters of a Gaussian. Let's get started. Since this is going to be the world's lamest example, we simply generate data drawn from a Gaussian. The device and array type (`atype`) were already chosen in the environment cell above.

In [2]:
xtrn = randn(2, 1000);
ytrn = ones(UInt8, 1, 1000);
w    = [[1 2; -0.1 0.5]', [1, 2]];
xtrn = w[1] * xtrn .+ w[2];

In [3]:
batch_size = 4;
dtrn = minibatch(xtrn, ytrn, batch_size, xtype=atype, shuffle=true);

Let's see what we got. This should be a Gaussian shifted in some rather arbitrary way, with mean $b$ = `w[2]` and covariance matrix $A^\top A$, where $A^\top$ = `w[1]`.
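To spell out where this mean and covariance come from, write $W$ for `w[1]` and $b$ for `w[2]` (the symbol $W$ is introduced here only for the derivation). Each sample is $x = Wz + b$ with $z\sim\mathcal{N}(0, I)$, so

$$\mathbb{E}[x] = W\,\mathbb{E}[z] + b = b, \qquad \operatorname{Cov}[x] = W\,\mathbb{E}\big[zz^\top\big]\,W^\top = WW^\top = A^\top A \quad\text{with}\,\,A = W^\top.$$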

In [4]:
print("The covariance matrix is:\n")
A = w[1]'    # x = A' * z + b, so Cov[x] = A' * A
A' * A

The covariance matrix is:


2×2 Array{Float64,2}:
 1.01  1.95
 1.95  4.25

In [5]:
scatter(xtrn[1, :], xtrn[2, :], legend=false)
xticks!([-2, 1, 4]); yticks!([-4, 2, 8])

## Define the networks

Given $N$ samples of real data $x_i\in\mathbb{R}^d$, we need to define a model that synthesizes fake data $\hat{x}_i\in\mathbb{R}^d$ from noise $z_i\in\mathbb{R}^d$. Our generator $G$ will be the simplest network possible, a single-layer linear model

$$ \hat{x} = G(w_g, z) = w_g^1\cdot z + w_g^2,$$

where $w_g$ are the parameters of $G$, $w_g^1$ is a $d\times d$ matrix of weights, and $w_g^2\in\mathbb{R}^d$ are the biases. Our real data, fake data, and noise are thus matrices of size $d\times n$, where $n$ is the batch size. We will be driving this linear network with Gaussian noise, and a linear map of Gaussian noise is again Gaussian. Hence, the generator literally only needs to learn the right parameters to fake things perfectly. The discriminator, in turn, distinguishes $K=2$ labels: real data ($k=1$) and fake data ($k=2$) (recall that Knet's function `nll` uses labels [1 2] instead of [0 1]).

For the discriminator $D$ we will be a bit more discriminating: we will use an MLP with $L=3$ layers to make things a bit more interesting:

$$ \hat{x}_i = \sigma_i\big( w_{d}^{2i-1}\cdot \hat{x}_{i-1} + w_d^{2i} \big)\hspace{1cm}\text{for}\,\,i=1,\dots,L$$

where $\hat{x}_0$ is the input, $\sigma_i=\tanh$ for $i<L$ and $\sigma_L(x)=x$, $w_d^{2i-1}$ is an $H_i\times H_{i-1}$ matrix of weights, $w_d^{2i}\in\mathbb{R}^{H_i}$ are the biases, and $\hat{x}_i\in\mathbb{R}^{H_i}$ is the output of layer $i$, with $H_0 = d$ and $H_L=K$. The final output $\hat{x}_L\in\mathbb{R}^{K}$ holds one unnormalized score per class (real or fake), which `nll` turns into the probabilities $p(y|x)$ for the given input. The cool thing here is that we have *two* different networks, each of them with their own gradients, optimizers, losses, etc. that we can optimize as we please. 

The function `initweights(d, hidden)` initializes a model of arbitrary depth, where `d` is as defined above and `hidden` is a list of length $L$ holding the output dimension of each layer, i.e. `hidden[i]` $= H_i$:

In [6]:
function initweights(d, hidden)
    model = Vector{Any}(2 * length(hidden))
    X = d
    for k = 1:length(hidden)
        H = hidden[k]
        model[2k - 1] = 0.02 * randn(H, X)
        model[2k]     = zeros(H, 1)
        X = H
    end
    return model
end

initweights (generic function with 1 method)
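As a quick check of the layout, the sketch below restates `initweights` in Julia 1.x syntax (an assumption: `Vector{Any}(undef, n)` replaces the 0.6-style constructor used in the cell above) and collects the shape of every array for the layer sizes `[5, 3, 2]` that the discriminator uses.

```julia
# Julia 1.x restatement of initweights; intended to behave like the cell above.
function initweights(d, hidden)
    model = Vector{Any}(undef, 2 * length(hidden))
    X = d
    for k = 1:length(hidden)
        H = hidden[k]
        model[2k - 1] = 0.02 * randn(H, X)   # weights of layer k: H×X
        model[2k]     = zeros(H, 1)          # biases of layer k:  H×1
        X = H                                # next layer's input size
    end
    return model
end

shapes = map(size, initweights(2, [5, 3, 2]))
println(shapes)
```

Odd entries are the weight matrices (5×2, 3×5, 2×3) and even entries are the matching bias columns (5×1, 3×1, 2×1).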

Notice that we're scaling the normal distribution by `0.02`. With this notation we can define functions to initialize each network:

In [7]:
generator_init(d, atype)     = map(atype, initweights(d, [2]));
discriminator_init(d, atype) = map(atype, initweights(d, [5, 3, 2]));

We can also create a function that generates the noise and the fake labels, which keeps the training loop tidy:

In [8]:
driver(x, y, atype) = (atype(randn(size(x))), Array{UInt8}(2ones(size(y))))

driver (generic function with 1 method)

Note here that we include the `atype` to convert the arrays to `KnetArray` if necessary. We can also create a single `predict` function for both models:

In [9]:
function predict(w, x)
    x = mat(x)
    for i=1:2:length(w) - 2
        x = tanh.(w[i] * x .+ w[i+1])
    end
    return w[end - 1]*x .+ w[end]
end

predict (generic function with 1 method)
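The sketch below runs the same forward pass on plain CPU arrays to show the output shapes for both networks. It assumes Julia 1.x and leaves out Knet's `mat` call because the inputs are already matrices; the weight values are random placeholders, not trained parameters.

```julia
# Forward pass as in the cell above, minus Knet's `mat` (inputs are already 2-D).
function predict(w, x)
    for i = 1:2:length(w) - 2
        x = tanh.(w[i] * x .+ w[i + 1])     # hidden layers with tanh
    end
    return w[end - 1] * x .+ w[end]         # linear output layer
end

z  = randn(2, 4)                            # batch of 4 noise vectors, one per column
wg = Any[randn(2, 2), zeros(2, 1)]          # generator: a single linear layer
wd = Any[randn(5, 2), zeros(5, 1),          # discriminator: 2 -> 5 -> 3 -> 2
         randn(3, 5), zeros(3, 1),
         randn(2, 3), zeros(2, 1)]

xfake  = predict(wg, z)       # the range 1:2:0 is empty, so this is wg[1] * z .+ wg[2]
scores = predict(wd, xfake)   # one column of K = 2 class scores per sample

println(size(xfake), " ", size(scores))
```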

For $G(z)$ notice that, since a single layer is used, we simply skip the loop entirely. So far everything has been defined as usual (with the exception of defining two separate sets of model weights). Here comes the fun part: *the loss function*. 

When we train $D$ we will have input data $[x\,\,\hat{x}]$ with labels $[y\,\,\hat{y}]$ (for real and fake, respectively). This is a standard classification problem, so we can use the negative log-likelihood loss `nll`. On the other hand, when we train $G$ we need to know how well we perform with respect to $D$. That is, it's $D$ that provides the predicted labels `ypred` in `nll(ypred, ygold)`. For example, if $p(y|\hat{x})=D(G(z))$ shows high probability for the data to be fake (i.e. $y=2$), our generator network did a poor job. Thus, our gold label in this case should be $y=1$ (the real label). With 

$$\hat{x}=\text{ pred}(w_g, z)  \equiv G(z)$$

$$p(y|\hat{x}) = \text{ pred}(w_d, \hat{x}) \equiv D(\hat{x}) = D\big(G(z)\big)$$

We can easily implement this idea with Knet:

In [10]:
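function loss(w, x, y; wd=nothing)
    if wd === nothing
        # Discriminator step: w holds D's weights, x is a batch of real and fake data
        L = nll(predict(w, x), y)
    else
        # Generator step: w holds G's weights, x is the generator input,
        # and the fixed discriminator wd judges the generated samples
        L = nll(predict(wd, predict(w, x)), y)
    end
    
    return L
end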

lossgradient  = grad(loss)

(::gradfun) (generic function with 1 method)
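For readers who have not used `nll` before, here is a plain-Julia sketch of what it is understood to compute: a column-wise log-softmax over the class scores, followed by the average negative log-probability of the gold labels. `nll_sketch` is a hypothetical name used for illustration, not Knet's implementation, and it assumes Julia 1.x, scores laid out as $K\times n$, and labels in $1,\dots,K$.

```julia
# Hypothetical stand-in for nll: scores is K×n, labels holds integers in 1:K.
function nll_sketch(scores::AbstractMatrix, labels)
    # Subtract the column maximum before exponentiating for numerical stability.
    shifted = scores .- maximum(scores, dims=1)
    logp = shifted .- log.(sum(exp.(shifted), dims=1))   # column-wise log-softmax
    n = size(scores, 2)
    # Pick the log-probability of the gold label in each column and average.
    return -sum(logp[Int(labels[j]), j] for j in 1:n) / n
end

scores = [2.0 0.0;    # row 1: score for "real" (label 1)
          0.0 2.0]    # row 2: score for "fake" (label 2)

loss_right = nll_sketch(scores, UInt8[1 2])   # gold labels agree with the scores
loss_wrong = nll_sketch(scores, UInt8[2 1])   # gold labels disagree with the scores
println(loss_right, " ", loss_wrong)
```

In the generator step the gold label is 1, so the loss is small only when the discriminator scores the fake sample as real.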

We can create a `train` function that we can use for both networks:

In [11]:
function train(w, x, y, optim; o...)
    g = lossgradient(w, x, y; o...)
    update!(w, g, optim)
    return w
end

train (generic function with 1 method)

OK! Time to train us some networks!

In [27]:
wg = generator_init(2, atype);
wd = discriminator_init(2, atype);
optimg = optimizers(wg, Adam;  lr=0.005)
optimd = optimizers(wd, Adam;  lr=0.025)

for epoch = 1:10
    
    for (x, y) in dtrn
        z, ŷ = driver(x, y, atype);
        x̂    = predict(wg, z);
        
        wd = train(wd, hcat(x, x̂), hcat(y, ŷ), optimd)  
        wg = train(wg, z, y, optimg; wd=wd)   # G is driven by the noise z; gold label y = 1 (real)
    end  
    
    if epoch % 1 == 0
        xfake = Array(predict(wg, atype(randn(2, 100))))
        scatter(xtrn[1, :], xtrn[2, :], label=:true_data)
        display(scatter!(xfake[1, :], xfake[2, :],  label=:synthetic_data, size=(400,300)))
    end
end
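Written out, the two updates in the inner loop correspond to the objectives below. This is a sketch of the objectives as the text describes them, with the noise $z_j$ as the generator input; $p_{w_d}(y\mid x)$ denotes the softmax of `predict(wd, x)` and $n$ is the batch size, both notations introduced here for illustration.

$$\mathcal{L}_D(w_d) = -\frac{1}{2n}\sum_{j=1}^{n}\Big[\log p_{w_d}(y=1\mid x_j) + \log p_{w_d}(y=2\mid \hat{x}_j)\Big], \hspace{1cm} \hat{x}_j = G(w_g, z_j)$$

$$\mathcal{L}_G(w_g) = -\frac{1}{n}\sum_{j=1}^{n}\log p_{w_d}\big(y=1\mid G(w_g, z_j)\big)$$

In the first objective $\hat{x}_j$ is treated as constant data, so only $w_d$ changes; in the second $w_d$ is held fixed and only $w_g$ changes. The generator objective is the commonly used "non-saturating" form: rather than minimizing the probability of being labeled fake, $G$ maximizes the log-probability of being labeled real.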

A word of caution here: to get this to converge properly, we needed to adjust the learning rates very carefully. And for Gaussians, the result is rather mediocre; a simple mean and covariance estimator would have worked much better. However, whenever we don't have a really good idea of what the distribution should be, this is a very good way of faking it to the best of our abilities. Note that a lot depends on the power of the discriminating network. If it is weak, the fake can be very different from the truth. E.g. in our case it had trouble picking up anything along the axis of reduced variance. In summary, this isn't exactly easy to set and forget. One nice resource for dirty practitioner's knowledge is [Soumith Chintala's handy list of tricks](https://github.com/soumith/ganhacks) for how to babysit GANs.
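For comparison, the simple mean and covariance estimator mentioned above fits in a few lines. This sketch assumes Julia 1.x with the standard libraries `Random`, `Statistics`, and `LinearAlgebra`; it regenerates data the same way as the data cell and then samples from the fitted Gaussian through a Cholesky factor. The names `mu_hat`, `sigma_hat`, `C`, and `xfit` are illustrative and do not appear elsewhere in the notebook.

```julia
using Random, Statistics, LinearAlgebra

Random.seed!(1)

# Regenerate training data as in the data cell: x = W z + b, one sample per column.
W = [1.0 -0.1; 2.0 0.5]
b = [1.0, 2.0]
xtrn = W * randn(2, 1000) .+ b

# Closed-form estimates of the Gaussian parameters.
mu_hat    = vec(mean(xtrn, dims=2))
sigma_hat = cov(xtrn, dims=2)

# Sample from the fitted Gaussian: if C * C' == sigma_hat,
# then C * z .+ mu_hat has covariance sigma_hat for standard normal z.
C = cholesky(Symmetric(sigma_hat)).L
xfit = C * randn(2, 100) .+ mu_hat

println("estimated mean: ", mu_hat)
println("estimated covariance: ", sigma_hat)
```

No learning rates, discriminator, or training loop are involved, which is the point of the comparison; the GAN only earns its keep when no such closed form is available.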