## Introduction

Artificial neural networks commonly consist of an ordered set of layers, where consecutive layers are densely connected. For example, in MLP networks, every neuron in one layer is connected to all the neurons in the preceding layer, and in convolutional neural networks, a filter in a layer is typically applied to all the feature maps of the previous layer.

A neural architecture, that is, the structure and connectivity of the network, is typically either hand-crafted or found by a search that optimizes a specific objective criterion (e.g., classification accuracy). Since the space of all neural architectures is huge, search methods are usually heuristic and do not guarantee finding the optimal architecture with respect to the objective criterion. In addition, these search methods might require a large number of supervised training iterations and substantial computational resources, rendering them infeasible for many applications. Moreover, optimizing for a specific criterion might result in a model that is suboptimal for other useful criteria, such as model size, representation of uncertainty, and robustness to adversarial attacks. Thus, the architectures produced by most strategies used today, whether hand-crafting or heuristic search, are densely connected networks that are not an optimal solution for the objective they were created to achieve, let alone for other objectives.

Looking at this phenomenon, we asked: Is there a sparse neural structure that can outperform dense connectivity under a given context? Can this structure be generalized to multiple tasks under the same context? How can a neural structure be learned from observed data?

*Hebb’s postulate: “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.”*

In biological systems, which inspired artificial neural networks, the formation of neural structures has been a topic of research for the last century. In 1949, Donald Hebb, in his book “The Organization of Behavior”, postulated that neurons with causally related activations are wired together. This postulate became the basis of parameter update rules in artificial neural networks; however, learning the connectivity itself is commonly ignored by the research community.

Inspired by this quality of the biological brain, we present a principled, theoretically grounded approach for learning complex, sparse neural architectures. Neural structures are learned in an unsupervised manner by identifying hierarchical causal relations in the observed data. Such learning results in small networks that generalize well to multiple tasks on the same observed data, encode uncertainty, and are inherently robust to adversarial attacks.

*Illustration of stochastic forward passes in the proposed BRAINet architecture*

## Unsupervised Learning of Neural Structures

In our approach, the structure of a generative model, which “explains” the observed training data, is learned by the algorithm. Learning the structure of a generative model, rather than of a discriminative one, is an important differentiator between this method for learning neural architectures and other commonly used methods. For common discriminative tasks, such as classification and regression, this approach is useful only if the generative model structure can be converted into a discriminative one, so that the discriminative structure can mimic the generative one while preserving important virtues (e.g., a compact network having high expressive power).

In our structure learning algorithm (B2N), we cast the problem of learning a generative neural structure as a problem of learning the structure of a deep belief network (a directed acyclic graph) in which the observed variables are represented by leaf nodes. Conditional independence (CI) is tested between pairs of observed variables, conditioned on a set of other observed variables (the condition set). The number of variables in the condition set is called the order of the CI test. A deep belief network is constructed recursively, starting with the deepest layers using low-order CI tests, and concluding with input-gather layers when no higher-order CI tests can be performed.
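To make the notion of a CI test and its order concrete, here is a minimal sketch that estimates conditional mutual information for discrete variables and compares it to a threshold. This is not necessarily the test used by B2N; the function names, the threshold value, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter

def conditional_mutual_information(data, x, y, cond=()):
    """Estimate I(X; Y | Z) in nats from rows of discrete values.

    data: list of tuples, one tuple per sample
    x, y: column indices of the two variables under test
    cond: column indices of the condition set Z; len(cond) is the test order
    """
    n = len(data)
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for row in data:
        z = tuple(row[i] for i in cond)
        n_xyz[(row[x], row[y], z)] += 1
        n_xz[(row[x], z)] += 1
        n_yz[(row[y], z)] += 1
        n_z[z] += 1
    cmi = 0.0
    for (vx, vy, z), c in n_xyz.items():
        # p(x,y,z) * log( p(x,y|z) / (p(x|z) * p(y|z)) ), written with counts
        cmi += (c / n) * math.log(c * n_z[z] / (n_xz[(vx, z)] * n_yz[(vy, z)]))
    return cmi

def is_independent(data, x, y, cond=(), threshold=0.01):
    """Declare independence when the estimate falls below an assumed threshold."""
    return conditional_mutual_information(data, x, y, cond) < threshold

# Toy data with columns (X, Y, Z): Z is a common cause of X and Y.
# Within each value of Z, X and Y are independent by construction.
data = (
    [(0, 0, 0)] * 9 + [(0, 1, 0)] * 3 + [(1, 0, 0)] * 3 + [(1, 1, 0)] * 1
    + [(1, 1, 1)] * 9 + [(1, 0, 1)] * 3 + [(0, 1, 1)] * 3 + [(0, 0, 1)] * 1
)

order0 = is_independent(data, 0, 1)             # order-0 test: empty condition set
order1 = is_independent(data, 0, 1, cond=(2,))  # order-1 test: condition on Z
print(order0, order1)
```

In this toy case the order-0 test reports dependence between X and Y, while the order-1 test, which conditions on Z, reports independence. A real implementation would typically use a statistical test with a significance level rather than a fixed threshold.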

Under the hood, causal relations among the observed variables are identified from the results of conditional independence tests. These relations are represented by a completed partially directed acyclic graph (CPDAG), whose nodes correspond to the observed variables. Causal relations are identified using the Recursive Autonomy Identification (RAI) algorithm, and our algorithm uses intermediate results from the recursive steps of RAI to construct a deep generative structure.

Once a generative structure is learned, it can be converted into a discriminative one simply by reversing the direction of the edges and connecting a class node, as a child, to the nodes of the deepest layers. This simple procedure yields a discriminative structure that preserves all the conditional dependencies, thereby mimicking the generative structure.
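The conversion can be sketched in a few lines. The example below is only an illustration: the edge-set representation, the node names, and the choice to treat parentless nodes of the generative graph as the "deepest" nodes are assumptions, not details taken from the B2N implementation.

```python
def to_discriminative(gen_edges, class_node="Y"):
    """Reverse every (parent, child) edge and attach a class node.

    gen_edges: set of (parent, child) pairs of the generative DAG.
    Assumption: the deepest nodes are those with no parents in the
    generative graph; the class node becomes their child.
    """
    nodes = {n for edge in gen_edges for n in edge}
    has_parent = {child for _, child in gen_edges}
    deepest = nodes - has_parent
    reversed_edges = {(child, parent) for parent, child in gen_edges}
    class_edges = {(node, class_node) for node in deepest}
    return reversed_edges | class_edges

# Hypothetical generative structure: latent H2 -> H1a, H1b -> observed X1..X3
generative = {
    ("H2", "H1a"), ("H2", "H1b"),
    ("H1a", "X1"), ("H1a", "X2"),
    ("H1b", "X2"), ("H1b", "X3"),
}
discriminative = to_discriminative(generative)
for edge in sorted(discriminative):
    print(edge)
```

After the conversion, information flows from the observed variables `X1`..`X3` through the latent nodes to the class node `Y`, which is the direction a feed-forward classifier needs.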

*Learning an architecture with B2N from an input of 5 observed variables*

*Number of parameters in a B2N-learned architecture vs. the original architecture for various topologies (see the paper for full details of the experiments).*

## Modelling Uncertainty

In real-world scenarios, there is uncertainty when estimating causal relations among variables. This uncertainty may be the result of insufficient information in the observed data (epistemic uncertainty), e.g., due to small datasets, or the result of true underlying ambiguity (aleatoric uncertainty). In a simple case, for example, one may estimate that variable A is the cause of variable B with probability 0.8. However, in real-world cases, the uncertainty in the causal relation between one pair of variables depends on the uncertainty in the relation between another pair of variables. We developed a novel algorithm, which we call B-RAI, for learning a connected ensemble of causal diagrams (assuming no latent variables and no selection bias) from which the uncertainty of causal relations can be estimated. We then convert these B-RAI causal diagrams into a neural network architecture, which we call the B-RAI network (BRAINet).
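The dependence between the uncertainties of different causal relations can be seen in a small worked example. The sketch below assumes a hypothetical weighted ensemble of three causal diagrams over variables A, B, C, D; the weights and edges are invented for illustration and are not output of B-RAI.

```python
def edge_probability(ensemble, *edges):
    """Weighted fraction of graphs that contain all the given directed edges.

    ensemble: list of (weight, set_of_edges) pairs; weights play the role
    of (unnormalized) posterior probabilities of each causal diagram.
    """
    total = sum(weight for weight, _ in ensemble)
    hits = sum(w for w, graph in ensemble if all(e in graph for e in edges))
    return hits / total

# Hypothetical ensemble of three causal diagrams with posterior weights.
ensemble = [
    (0.5, {("A", "B"), ("C", "D")}),
    (0.3, {("A", "B"), ("D", "C")}),
    (0.2, {("B", "A"), ("D", "C")}),
]

p_ab = edge_probability(ensemble, ("A", "B"))
p_cd = edge_probability(ensemble, ("C", "D"))
p_joint = edge_probability(ensemble, ("A", "B"), ("C", "D"))
print(p_ab, p_cd, p_joint, p_ab * p_cd)
```

Here A causes B with probability of about 0.8 and C causes D with probability of about 0.5, but the probability that both hold is about 0.5 rather than the product 0.4, so the two relations cannot be treated as independently uncertain. This is why an ensemble of whole diagrams carries more information than per-edge probabilities alone.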

*Learning a BRAINet architecture from an input of 5 observed variables*

*Number of unique structures embedded in a single BRAINet as a function of the training set size for the MNIST handwritten digit recognition dataset*

The BRAINet architecture is a connected ensemble of multiple neural networks, which are coupled into a hierarchy. From this hierarchy, neural network structures can be sampled recursively with respect to their posterior probability. Sampled networks share parts of their structure with each other, where deeper layers have a higher probability of being shared. During training, networks are sampled independently, but the parameters of their shared parts are learned jointly.
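The recursive sampling described above can be illustrated with a toy hierarchy. In this sketch each node holds a shared component plus a list of weighted alternatives, and one alternative is drawn in proportion to its weight before recursing. The dictionary layout, the component names, and the weights are all assumptions for illustration; the actual BRAINet data structure may differ.

```python
import random

def leaf(name):
    """A hierarchy node with no further alternatives."""
    return {"name": name, "alternatives": []}

def sample_structure(node, rng):
    """Return the component names of one sampled network.

    node["alternatives"] is a list of (weight, [child nodes]) pairs; one
    pair is drawn with probability proportional to its weight, and the
    sampling recurses into each child of the chosen pair.
    """
    picked = [node["name"]]
    if node["alternatives"]:
        weights = [weight for weight, _ in node["alternatives"]]
        _, children = rng.choices(node["alternatives"], weights=weights, k=1)[0]
        for child in children:
            picked.extend(sample_structure(child, rng))
    return picked

# Hypothetical hierarchy: a deep component shared by every sampled network,
# followed by two alternative sets of shallower components.
hierarchy = {
    "name": "deep_shared",
    "alternatives": [
        (0.7, [leaf("shallow_a1"), leaf("shallow_a2")]),
        (0.3, [leaf("shallow_b1")]),
    ],
}

rng = random.Random(0)
samples = [sample_structure(hierarchy, rng) for _ in range(5)]
for s in samples:
    print(s)
```

Every sample contains `deep_shared`, while the shallower components vary from sample to sample, which mirrors the statement that deeper layers have a higher probability of being shared.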

This algorithm is computationally efficient and converges within a few hours on a single desktop CPU. Interestingly, a small number of observations (a small training set) results in high epistemic uncertainty, and in that case the structural parts that are not shared among sampled networks will be dissimilar. This translates to a broader prior over the BRAINet parameters. On the other hand, for a large number of observations, resulting in lower epistemic uncertainty, those structural parts will tend to be similar. Any remaining dissimilarity will be the result of true underlying ambiguity (aleatoric uncertainty).

## Stochastic Feed-Forward

A BRAINet model consists of multiple networks coupled together into a hierarchy. Thus, when learning the parameters, only a small subset is trained in each training step (using a batch of data in SGD). It is important to note that parameters in the deeper layers are shared across networks, while those in the first layer are not shared. The first layers are expected to produce different “views” of the input by processing it using distinct sets of parameters. In contrast, the shared parameters in the deeper layers gradually aggregate the different information arriving from the distinct neural routes.

BRAINet provides multiple inference results by sampling multiple networks, where a sampled network corresponds to a set of connections in BRAINet. Typically, less than 15% of the BRAINet structure is used at each inference step. Aggregating multiple inference results provides high accuracy, as well as an uncertainty estimate.

*Predictive uncertainty estimation using BRAINet as measured by mutual information, visualized on the latent space of a VAE on MNIST. Brighter areas correspond to a higher uncertainty.*
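One common way to turn several stochastic forward passes into a prediction and a mutual-information uncertainty score is sketched below. It is a generic formulation (entropy of the mean prediction minus the mean entropy of the individual predictions), not code from BRAINet, and the softmax outputs are made-up numbers.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def predictive_uncertainty(probs):
    """Aggregate class probabilities from several stochastic passes.

    probs: list of per-pass probability vectors for a single input.
    Returns (mean prediction, mutual information between the prediction
    and the sampled network).
    """
    t = len(probs)
    mean = [sum(column) / t for column in zip(*probs)]
    expected_entropy = sum(entropy(p) for p in probs) / t
    return mean, entropy(mean) - expected_entropy

# Hypothetical outputs of three sampled networks for two different inputs.
agreeing = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagreeing = [[1.0, 0.0], [0.0, 1.0]]

_, mi_low = predictive_uncertainty(agreeing)
mean_high, mi_high = predictive_uncertainty(disagreeing)
print(mi_low, mi_high)
```

When the sampled networks agree, the mutual information is close to zero even if each prediction is itself uncertain; when they disagree, it is large. Because the estimate can be recomputed after every additional pass, it fits the anytime setting where inference may stop after any number of passes.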

One more important property of BRAINet is anytime predictive uncertainty estimation. Each inference step that involves a stochastic feed-forward pass provides a reliable inference result. Aggregating multiple results yields better inference quality and a better predictive uncertainty estimate. This anytime uncertainty estimation property is unique to ensemble methods, such as MC-dropout and BRAINet. Nevertheless, BRAINet demonstrates significantly better uncertainty estimates, especially for a small number of inference steps. It is important to note that an inference step in BRAINet involves less than 15% of the edges, whereas other ensemble methods use the entire network (deep ensembles) or a large portion of it (over 50% in MC-dropout).

*Area under ROC and precision-recall curves as a function of the number of stochastic forward passes. BRAINet achieves high AUC even for a small number of forward passes, compared to MC-dropout (Architecture: ResNet20, in-distribution: CIFAR-10, out-of-distribution: SVHN)*
