Using Deep Learning for Avatar Aesthetics In Virtual Reality

As part of their course, TMCS students at Oxford spend a couple of days developing their programming skills in a hackathon. This year, we challenged the students to apply machine learning to various problems relevant to our research.

We have recently developed a multi-user virtual reality environment for molecular dynamics simulations, using the Nano Simbox. Within this environment, users can see each others headsets and controllers, and interact with the same simulation. A testament to the quality of the tracking provided by the HTC Vive is that users can confidently assume that the position of the head and controllers in virtual space matches that in VR – we often get new users to reach out and touch the head of another user in VR to get them used to the idea. While this is already great, can we render more than just the headset and controllers? For us, with our multi-user setup, it would be extremely beneficial to have some sort of full body representation of the users, as it would make it much easier for users to perform complex tasks together if they can reason about where each others limbs are as well as the head and hands. More broadly, for many novice users of VR the frequent lack of an avatar for oneself can be disconcerting, and many video games and VR applications could benefit from a full body avatar.

The Vive Trackers (or “pucks”) recently released by HTC are an obvious solution to this problem. They have already released code which produces full avatars for a user wearing several of these trackers on their body. However, for multi-user VR this isn’t very practical as we would need an inordinate number of trackers, and it would be cumbersome to put them on every time you stepped into VR.

We do already have a lot of information about the user: the positions and orientations of their head, left hand and right hand from the devices being tracked. Can we use that information to construct an avatar? We decided that this problem would be an interesting challenge to tackle in a 2-day hackathon: try to predict the full body positions through training an artificial neural network on example poses. Neural networks – or “deep learning” – have become something of a buzz word in the field of machine learning, due to their unparalleled success in several difficult tasks including image recognition and speech recognition.

The idea is to produce a training set with labelled positions of the head, left controller and right controller as the features, and positions of other points of the body as the targets to be predicted. Since at the time of writing the pucks had not yet been released, we commandeered our multi-user VR to track additional controllers carefully placed (with a lot of duct tape) on representative points on a users body. We chose the elbows, top of the back, the belly button and the knees for these representative points. The images show how the controllers were placed on the body, and what this looked like in VR through a simple render of the controllers.

We had 7 volunteers from the group and the wider Centre for Computational Chemistry at Bristol get duct taped up and perform various representative tasks in the Nano Simbox, such as tying knots in peptides and making chemical reactions happen, as well as performing more general movements. The video below shows what the avatar looked like with just the controllers being rendered on the various positions on the body. While crude, the representation does add a more physical nature to the user’s representation compared to simply a floating head and controllers.

The data collection resulted in 36000 example poses from a variety of people of different shapes and sizes. We gave this data to 4 TMCS students – Laszlo Berencei, Callum Bungey, Thomas Fay and Jonathan Milward –  who had spent a couple of days learning about python and scikit-learn, a set of python tools for performing machine learning, and tasked them with setting up a pipeline for training a neural network to predict the body positions given just the headset and controller positions.

There were several tasks necessary to complete:

  • Preprocessing and standardising the data.
  • Removing outliers.
  • Setting up the scikit-learn pipeline for training the machine learning algorithm.
  • Creating a renderer so we could compare the predicted avatar positions to the training set.

Since we only had two short days to do this, the aim was to get a skeleton pipeline from the raw data through to rendering the predicted positions up and running to see if the idea had any chance of working.

In preprocessing the data, we centred all the poses and rotated them about the Y axis to best match a reference frame. This kind of standardisation is very common in molecular simulation, where one seeks to superpose a structure over some reference (for example a protein crystal structure). This was necessary because the data was recorded in a 5m x 5m space, but for the purposes of predicting body positions only the relative distances  and orientations between the controllers and the head are important.

We had to remove a lot of outliers in the data, where occlusion resulted in a systematic drift or “freezing” of the controllers during recording. For this, we used the Isolation Forest method, which worked well in cases where the positions drifted to highly unrealistic positions.

For the machine learning, we opted for the Multi-Layer Perceptron Regressor, more commonly referred to as a neural network regressor. For a starting point, we used 2 hidden layers, and used a Grid Search to start tuning the hyperparameters of the network, varying the number of neurons in the two hidden layers and the regularization term alpha.

To render the data, the students used pyglet to create simple 2D projections of the avatar positions and the predicted positions, so we could visually evaluate the performance of the regressor.

By the end of the two days, we’d hacked together all these components and trained the regressor on a small subset of the data. This subset consisted of 1148 training frames and 688 test frames all from one continuous session with one person, so the results are extremely preliminary! Our R2 score – a measure of how well our model will predict future samples – was a reasonable 67%, leaving plenty of room for improvement. The plot below shows the distribution of error for each target in the test set. The plots show that the median error for the targets vary between 15 and 20cm, with the knees being the least well predicted values (unsurprisingly).


When we rendered the predicted positions in comparison to the true values, we found that the neural network predicted values are already qualitatively reasonable for aesthetic purposes, as shown in the video below (you will want to slow the playback down). The white square is the headset position, the red squares the position of the controllers (the hands), the blue squares are the true positions of the other body parts and the pink squares are the predicted positions.

To take this initial exploration forward, we will want to train the neural network on all of the data, and perform more sophisticated tuning of the hyperparameters. To do this, we’d move over to using TensorFlow, a GPU accelerated neural network library. The results so far are very exciting and we hope that a practical solution to producing virtual avatars will emerge from this work. The repository used for the hackathon, which contains the data as well as the scripts we’ve written so far for processing and analysing is publicly available here.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s