In the past couple of months, I have taken some time to try out the new *probabilistic programming* library Edward. Probabilistic programming is all about building probabilistic models and performing inference on them. These models are ideal for describing phenomena that contain some amount of inherent randomness, say, the daily flow of customers in your local Apple Store.

So what makes the newcomer Edward different from other similar libraries such as Stan or PyMC3? From my point of view, the interesting feature is that it’s built on top of TensorFlow, which allows efficient computation using optimised dataflow graphs. Heavy duty TensorFlow models can be trained efficiently using distributed GPU clusters in the Google Cloud.

## Training a neural network

As my first exercise, I set to train a *Bayesian neural network* for a regression task. In such a task we aim to predict a numerical target by building a model and training it on some data. A standard “deterministic” network, whose weights are simply real numbers, is trained by minimising a loss function such as mean squared error.

The weights of a Bayesian network, on the other hand, are probability distributions over the reals, and their training (or, rather, inference) is a more complicated task. For instance, one can perform variational (approximate) inference by minimising a quantity known as Kullback—Leibler divergence, or alternatively, one can opt for fully Bayesian inference with algorithms such as Hamiltonian Monte Carlo.

## Probabilistic model brings improved accuracy

Technicalities aside, the results are what interest us! On my more or less randomly chosen regression dataset (on concrete compressive strength(!)), the Bayesian neural network certainly outperformed in prediction accuracy both the standard linear model as well as the deterministic neural network. What might be the cause of this? Does the prediction accuracy improve on other datasets as well? This certainly inspires us to research further.

Interestingly, when comparing the resulting network weights (or means of their distributions) from deterministic and Bayesian training, it turns out that they are adjusted to completely different values. Finally, upon reflecting on the way in which variational inference was performed, it becomes clear that a different method, which takes into account the dependence between weights, would be desirable. Such a refined approach will, however, have to wait until the next exercise!

I saved all the numbers and figures to my Jupyter notebook on GitHub, so go check it out to learn more!