Using Deep Learning for Avatar Aesthetics In Virtual Reality

As part of their course, TMCS students at Oxford spend a couple of days developing their programming skills in a hackathon. This year, we challenged the students to apply machine learning to various problems relevant to our research.

We have recently developed a multi-user virtual reality environment for molecular dynamics simulations, using the Nano Simbox. Within this environment, users can see each other's headsets and controllers, and interact with the same simulation. A testament to the quality of the tracking provided by the HTC Vive is that users can confidently assume that the position of the head and controllers in virtual space matches their position in the real world – we often get new users to reach out and touch the head of another user in VR to get them used to the idea. While this is already great, can we render more than just the headset and controllers? For our multi-user setup, some sort of full-body representation of each user would be extremely beneficial: complex collaborative tasks become much easier when users can reason about where each other's limbs are, not just the head and hands. More broadly, the frequent lack of an avatar for oneself can be disconcerting for novice users of VR, and many video games and VR applications could benefit from a full-body avatar.

The Vive Trackers (or “pucks”) recently released by HTC are an obvious solution to this problem. HTC have already released code which produces full avatars for a user wearing several of these trackers on their body. However, for multi-user VR this isn’t very practical: we would need an inordinate number of trackers, and it would be cumbersome to put them on every time you stepped into VR.

We do already have a lot of information about the user: the positions and orientations of their head, left hand and right hand from the devices being tracked. Can we use that information to construct an avatar? We decided that this problem would be an interesting challenge to tackle in a two-day hackathon: predict the full body positions by training an artificial neural network on example poses. Neural networks – or “deep learning” – have become something of a buzzword in machine learning, due to their unparalleled success in several difficult tasks, including image recognition and speech recognition.

The idea is to produce a training set with the labelled positions of the head, left controller and right controller as the features, and the positions of other points on the body as the targets to be predicted. Since at the time of writing the pucks had not yet been released, we commandeered our multi-user VR setup to track additional controllers carefully placed (with a lot of duct tape) on representative points on a user's body. We chose the elbows, the top of the back, the belly button and the knees for these representative points. The images show how the controllers were placed on the body, and what this looked like in VR through a simple render of the controllers.
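To make the feature/target split concrete, here is a minimal sketch of how one training example might be assembled. The names and the exact vector layout are assumptions for illustration, not the layout used in the actual hackathon repository:

```python
import numpy as np

# Tracked devices provide the features; the taped-on controllers
# provide the targets (elbows, upper back, belly button, knees).
FEATURES = ["head", "left_hand", "right_hand"]
TARGETS = ["left_elbow", "right_elbow", "upper_back",
           "belly", "left_knee", "right_knee"]

def make_example(positions):
    """positions: dict mapping each name above to an (x, y, z) tuple.
    Returns flat (feature, target) vectors suitable for a regressor."""
    x = np.concatenate([np.asarray(positions[n], dtype=float) for n in FEATURES])
    y = np.concatenate([np.asarray(positions[n], dtype=float) for n in TARGETS])
    return x, y
```

Stacking many such rows gives the feature matrix and target matrix that the rest of the pipeline consumes.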

We had seven volunteers from the group and the wider Centre for Computational Chemistry at Bristol get duct-taped up and perform various representative tasks in the Nano Simbox, such as tying knots in peptides and making chemical reactions happen, as well as performing more general movements. The video below shows what the avatar looked like with just the controllers being rendered on the various positions on the body. While crude, the representation does add a more physical nature to the user’s representation compared to simply a floating head and controllers.

The data collection resulted in 36,000 example poses from a variety of people of different shapes and sizes. We gave this data to four TMCS students – Laszlo Berencei, Callum Bungey, Thomas Fay and Jonathan Milward – who had spent a couple of days learning about Python and scikit-learn, a set of Python tools for machine learning, and tasked them with setting up a pipeline for training a neural network to predict the body positions given just the headset and controller positions.

There were several tasks to complete:

  • Preprocessing and standardising the data.
  • Removing outliers.
  • Setting up the scikit-learn pipeline for training the machine learning algorithm.
  • Creating a renderer so we could compare the predicted avatar positions to the training set.

Since we only had two short days to do this, the aim was to get a skeleton pipeline up and running – from the raw data through to rendering the predicted positions – to see whether the idea had any chance of working.

In preprocessing the data, we centred all the poses and rotated them about the Y axis to best match a reference frame. This kind of standardisation is very common in molecular simulation, where one seeks to superpose a structure over some reference (for example a protein crystal structure). It was necessary here because the data was recorded in a 5m x 5m space, but for the purposes of predicting body positions only the relative distances and orientations between the controllers and the head are important.
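This standardisation can be sketched in a few lines of NumPy. This is a minimal illustration, not the hackathon code: it assumes each pose is an (n_points, 3) array in Y-up coordinates, and aligns the left-hand-to-right-hand vector with the X axis as the reference orientation:

```python
import numpy as np

def standardise_pose(pose, left_idx=1, right_idx=2):
    """Centre a pose on its centroid, then rotate about the Y (up)
    axis so the left-hand -> right-hand vector lies along +X.

    pose: (n_points, 3) array of x, y, z positions in metres.
    """
    centred = pose - pose.mean(axis=0)
    # Horizontal (XZ-plane) component of the vector between the hands.
    v = centred[right_idx] - centred[left_idx]
    angle = np.arctan2(v[2], v[0])  # angle away from +X in the XZ plane
    c, s = np.cos(angle), np.sin(angle)
    # Rotating by -angle about Y brings the hand vector onto +X.
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return centred @ rot_y.T
```

After this step, two poses recorded in different corners of the room, facing different directions, become directly comparable.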

We had to remove a lot of outliers in the data, where occlusion resulted in a systematic drift or “freezing” of the controllers during recording. For this, we used the Isolation Forest method, which worked well in cases where the controllers drifted to highly unrealistic positions.

For the machine learning, we opted for the Multi-Layer Perceptron Regressor, more commonly referred to as a neural network regressor. As a starting point, we used two hidden layers, and used a grid search to start tuning the hyperparameters of the network, varying the number of neurons in the two hidden layers and the regularization term alpha.

To render the data, the students used pyglet to create simple 2D projections of the avatar positions and the predicted positions, so we could visually evaluate the performance of the regressor.

By the end of the two days, we’d hacked together all these components and trained the regressor on a small subset of the data. This subset consisted of 1148 training frames and 688 test frames, all from one continuous session with one person, so the results are extremely preliminary! Our R² score – the proportion of the variance in the targets explained by the model, and hence a measure of how well it will predict future samples – was a reasonable 0.67, leaving plenty of room for improvement. The plot below shows the distribution of error for each target in the test set: the median error for each target varies between 15 and 20 cm, with the knees being the least well predicted (unsurprisingly).
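A per-target error like the one plotted can be computed by reshaping the flat prediction vectors back into (x, y, z) triplets. A small sketch, assuming six body targets stored as flattened coordinates (the layout is an assumption):

```python
import numpy as np
from sklearn.metrics import r2_score

def per_target_median_error(y_true, y_pred, n_targets=6):
    """Median Euclidean error for each predicted body target.

    y_true, y_pred: (n_frames, n_targets * 3) arrays of flattened
    x, y, z positions in metres.
    """
    t = y_true.reshape(len(y_true), n_targets, 3)
    p = y_pred.reshape(len(y_pred), n_targets, 3)
    dist = np.linalg.norm(t - p, axis=2)  # (n_frames, n_targets)
    return np.median(dist, axis=0)

# Overall fit quality is the usual scikit-learn score:
# r2_score(y_test, regressor.predict(X_test))
```

Breaking the error out per target is what reveals that the knees, furthest from any tracked device, are the hardest points to predict.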


When we rendered the predicted positions in comparison to the true values, we found that the neural network predictions are already qualitatively reasonable for aesthetic purposes, as shown in the video below (you will want to slow the playback down). The white square is the headset position, the red squares are the positions of the controllers (the hands), the blue squares are the true positions of the other body parts, and the pink squares are the predicted positions.

To take this initial exploration forward, we will want to train the neural network on all of the data and perform more sophisticated tuning of the hyperparameters. To do this, we’d move over to TensorFlow, a GPU-accelerated library for neural networks. The results so far are very exciting, and we hope that a practical solution to producing virtual avatars will emerge from this work. The repository used for the hackathon, which contains the data as well as the scripts we’ve written so far for processing and analysis, is publicly available here.


Exploration of molecular systems with Virtual Reality

Over the last few months I’ve been developing an interactive molecular dynamics platform that supports Virtual Reality (VR). Using the Nano Simbox framework, I can run a research-grade GPU-accelerated molecular dynamics simulation (OpenMM) and visualise it in VR.

Molecular simulations are incredibly complex, as every atom can interact with every other atom in 3D. For example, many drug design problems are akin to a sort of “3D Tetris”, where you try to find a drug with the right shape to fit snugly into an enzyme’s active site. Virtual reality is a natural environment for exploring these systems, as the inherently 3D nature of VR interaction means we can at last manipulate them in an intuitive way.


Simulation of penicillin binding to beta-lactamase, an enzyme instrumental in antibiotic resistance.

We’ve experimented with a variety of VR solutions, and have found the HTC Vive to be the most robust and enjoyable to use. The fact that you can freely walk around the space, and that the controllers are tracked extremely well, enables powerful interaction with a simulation. Pulling the triggers on the controllers results in a “force probe” being applied to the selected atoms, meaning you can influence the simulation in a physically meaningful way.
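Conceptually, a force probe of this kind can be pictured as a spring between the controller and the selected atoms. The following is a minimal sketch of that idea only: the harmonic form, the spring constant and the function name are assumptions for illustration, not the Nano Simbox implementation:

```python
import numpy as np

def probe_force(selected_positions, controller_position, k=500.0):
    """Harmonic 'force probe': pulls each selected atom towards the
    controller with a spring-like force F = -k * (r_atom - r_probe).

    selected_positions: (n_atoms, 3) array of atom positions.
    controller_position: (3,) controller position in simulation space.
    Returns an (n_atoms, 3) array of forces to add to the MD forces.
    """
    return -k * (selected_positions - np.asarray(controller_position))
```

Because the extra force is added to the physical forces from the force field rather than replacing them, the simulation responds realistically to being pulled about.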

The visualisation and interaction tool we’ve created opens up some exciting prospects. Simply exploring the molecular structure in 3D and observing how the system responds to interaction can be a powerful way of gaining insight into its mechanisms, but I believe we can take this further.

One of the biggest problems in molecular simulation is the so-called ‘rare event problem’: interesting events that take a system from one molecular configuration to another (e.g. protein folding, chemical reactions) may occur on the order of milliseconds or longer, while our simulations are typically restricted to the order of nanoseconds due to the computational cost of calculating the interactions between all the atoms. In order to compute metrics that can give insight into the system and be compared against experiment, the event has to be sampled many times to converge the statistics. This has led to a proliferation of methods that attempt to accelerate the occurrence of rare events, so that many short simulations can be used to capture the rare event. In previous work, I made some improvements to the Boxed Molecular Dynamics (BXD) algorithm, one example of these methods.

The problem with many of these methods is that they usually require the researcher to set up, in advance, a set of variables – called collective variables or reaction coordinates – that govern the event of interest. For example, in a simulation of a drug binding to an enzyme, one obvious variable governing the binding is the distance of the drug from the active site: it clearly needs to be minimised. However, there may be other, more subtle variables as well, such as the angle at which the drug approaches the protein, or the position of a particular side-chain of the protein. Determining what these collective variables are requires a mixture of chemical intuition and a large amount of trial and error on the part of the researcher, and limits the ability to automate molecular simulations. For simulations of large biomolecules such as proteins, identifying these collective variables can be extremely difficult, as the concerted motions between the atoms are incredibly complex. For example, 1% of proteins in the Protein Data Bank are knotted, but it is not clear why or how they end up in this state.
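To make "collective variable" concrete, here is a sketch of the simplest one mentioned above: the distance between a drug and an active site, reduced to a single number per frame. Which atoms define the "drug" and the "active site" is itself a modelling choice, and this centroid-distance form is just one common option:

```python
import numpy as np

def binding_distance(drug_xyz, site_xyz):
    """A simple collective variable for a binding event: the distance
    between the centroid of the drug atoms and the centroid of the
    active-site atoms.

    drug_xyz, site_xyz: (n_atoms, 3) coordinate arrays for one frame.
    """
    return float(np.linalg.norm(drug_xyz.mean(axis=0) - site_xyz.mean(axis=0)))
```

Evaluating such a function along a trajectory collapses thousands of atomic coordinates into a one-dimensional progress variable; the hard part is knowing which handful of such functions actually capture the event.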


Recording the path a methane molecule takes being pulled through a carbon nanotube.

There are methods that attempt to automatically identify the important collective variables for a particular system, but they typically require an initial path between the states of the system. How do you find this initial path if you don’t know what the important collective variables are? Finding these paths is exactly what interactive molecular dynamics could be useful for.

In the coming months, I’ll be seeing if we can use interactive molecular dynamics with virtual reality to enable researchers to find paths in molecular simulations, which can then be passed on to path-refining and collective variable analysis methods. Combining human intuition with automated methods in this way could lead to a workflow that provides enhanced insight into chemical problems more quickly.





Babun – A terminal emulator for Windows

As a scientific programmer, I spend most of my time working on Unix systems, and have grown accustomed to the range of features in various shells. I also develop on Windows occasionally, as in my opinion the Visual Studio IDE is excellent for large projects, and some of my projects currently require it.
I mostly used Git Bash, bundled with the Windows installation of Git, for my command-line needs, as it provides just enough to get by. However, I’ve recently been working exclusively on my Windows machine and needed a setup slightly more reminiscent of my meticulously crafted oh-my-zsh setup in iTerm2 for OS X. The obvious choice is to configure Cygwin, but the effort required is non-trivial.

I stumbled upon Babun, which has done all the hard work for me. It comes with Cygwin, oh-my-zsh, a Mintty console and a whole load of other stuff. Right out of the box, it comes close enough to my usual setup on a Unix system to be practical, and doesn’t look like something from the punch-card era. Install tmux with the following command, and nobody will ever know you’re using Windows:

pact install tmux