The challenges to achieving a quality VR video experience
We call it “virtual reality,” but the quality of the video we watch in VR headsets still doesn’t look real. Why is quality important? The feeling of really being there, called “spatial immersion,” is undermined by poor-quality video. Today, a great deal of VR video struggles to achieve immersion, because the poor quality distracts the viewer from the feeling of realism.
A true feeling of immersion comes from a combination of narrative and spatial immersion. Narrative immersion occurs when the viewer becomes emotionally invested in the story. Think about reading a novel or watching your favorite movie. You get so caught up in the story that you lose track of time. Spatial immersion occurs when the viewer becomes convinced by their senses that they are in a new space. When we are experiencing the sights and sounds of a new space and engaged with a compelling story, we can be powerfully moved. When we get these two ingredients right, we feel like we are truly in virtual reality.
There are no hard and fast rules for achieving narrative immersion. But we can define how to make a great-looking video that can fool your senses to create spatial immersion.
Many things that are called VR today fall far short of this goal. Services such as Google’s YouTube 360° and Facebook 360° have started to regularly offer 360-degree video content that simulates VR by displaying spherical video in a standard video player. 360-degree video is usually delivered through a web browser or on a simple device like Google Cardboard. But moving a mouse or holding up a cardboard box to your face requires hand-eye coordination that makes your brain take on tasks that interfere with immersion. Holding a smartphone up to your head is fine for a few minutes, but who wants to do that for a two-hour movie? A better experience comes from a head-mounted display that accurately and naturally tracks your movement.
A mobile phone-based headset like the popular Samsung Gear VR has good motion sensors and fast response times, and is capable of creating an immersive experience. There are many new headsets coming to the market in 2016. The Oculus Rift and HTC Vive are both tethered by cables to a PC, making use of the PC’s GPU capabilities to render 3D data quickly, and using additional sensors to track user motion in more directions than the Gear VR. Both the mobile and PC-tethered headsets have high-resolution displays with plenty of pixels, but when those displays are used to play video, the video quality tends to be poor.
Even with good display hardware in a headset, low video quality can break the sense of true immersion. The tricky part is delivering a great looking image to the device at the right moment. Since the Gear VR is the most widely available set today, let’s look at its technical specs and the challenges of using it to achieve a quality VR video experience.
The challenges of VR video content
While all the headsets have cutting-edge high-resolution displays, the shortfall comes from content delivered at low resolution.
File size. With Gear VR, many of the video-based experiences in applications from VRSE and others require an initial download. The downloads are very large (more than 1 GB in size) and slow to load onto the device. That is the first ding against the experience.
Video quality. The second and bigger issue is the quality of the videos. Even professionally produced VR media is alarmingly soft. Viewing it feels like jumping back in time to the 1990s when computers struggled to play anything more than standard-definition video. The VR videos that seem to suffer the most are in stereo format.
Certainly the video content must have been captured at high quality, but in the headset it appears very mushy. Why? The answer is resolution.
The Gear VR can support Ultra High Definition (UHD)-size playback. In pixel terms, that UHD video has a frame size that is 3840 pixels wide x 1920 pixels tall. The video often plays back at 30 frames per second (higher frame rates can be supported at lower resolution). Clearly the Galaxy S6 phone is a little powerhouse that is able to easily play 4K video. But even this frame size turns out to be too low a resolution to make a good-looking image.
To understand why, we need to answer two different questions about resolution. What is the display resolution that we are looking at on the Galaxy’s screen? Let’s call this screen resolution. The second question is, how much of that big UHD size image do I see inside the headset? Let’s call this the field-of-view resolution.
The screen resolution on the Galaxy S6 is 2560 x 1440 pixels. So each eye gets 1280 x 1440 pixels to view through one lens of the Gear VR.
[A short digression: Gear VR lenses distort the image quite a bit, so the practical resolution is really 1280 x 1280. Each eye gets to look at a 1280 x 1280 image. Not bad, but not exactly “Full HD.” But at over 500 pixels per inch, this should be plenty to make the image look sharp.]
Let’s return to the field-of-view resolution. Recall that each frame of UHD video is 3840 x 1920 pixels, but this video frame has to fill 360° of the horizontal view and 180° of the vertical view. On the Gear VR, the field of view is 96°—a bit over one-fourth of the full 360° view. When we view the image, we are only looking at a little square section of the whole frame. As we turn our head, the area of the video we are pointed at has to update, and the software shows just the small section we need to see from moment to moment as we turn our heads. That little square is defined as the field of view.
Simple math shows that the video frame provides 10.6667 pixels for each degree of view (1920 pixels / 180 degrees = 10.6667 pixels/degree). Multiply 10.6667 by the 96° field of view and you get 1024 pixels.
So the image that we are shown when looking through the headset is 1024 x 1024 pixels (at most), but the size of the display is 1280 x 1280 pixels. There is our first clue to why the video looks soft. The little piece of video that the software cuts out of the frame is SMALLER than the display resolution. So the software has to stretch the video to map the field-of-view crop to the display resolution. This is done with a relatively low-quality scaler, since the Galaxy phone also has to decompress and warp the image at the same time to project it on the screen. The result is a little bit of softening. But a 25% upscale (1024 pixels stretched to 1280) shouldn’t be that big of a problem.
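The arithmetic above can be written as a small Python sketch. The function and constant names are illustrative only; the numbers (a 3840 x 1920 frame, a 96° field of view, a 1280-pixel practical display size per eye) are the ones quoted in this article.

```python
def fov_pixels(frame_px: int, frame_degrees: float, fov_degrees: float = 96.0) -> float:
    """Source pixels that fall inside the headset's field of view along one axis."""
    pixels_per_degree = frame_px / frame_degrees
    return pixels_per_degree * fov_degrees


FRAME_W, FRAME_H = 3840, 1920   # UHD 360-degree frame from the article
DISPLAY_PX = 1280               # practical per-eye display size from the article

fov_w = fov_pixels(FRAME_W, 360.0)   # horizontal axis spans 360 degrees
fov_h = fov_pixels(FRAME_H, 180.0)   # vertical axis spans 180 degrees
upscale = DISPLAY_PX / fov_h         # how far the crop must be stretched

print(f"field-of-view crop: {fov_w:.0f} x {fov_h:.0f} pixels")
print(f"upscale to display: {upscale:.2f}x")
```

Run as written, this reports a 1024 x 1024 crop and a 1.25x upscale to reach the 1280-pixel display.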
Stereo makes it worse
All of the display issues discussed so far are only related to “mono” video—where each eye sees the same image. To increase immersion, stereo display is preferred, where each eye sees an offset image to trick your brain into perceiving depth. On the Gear VR, stereo display can really boost the sense of realism. A stereo image just looks more lifelike because the offset provides depth cues that help to separate objects in a scene. Some 3D games on the Gear VR can show the virtual environment of the game in stereo, adding to the realism.
Each eye should ideally get a full-resolution 360° view. That would require that we double the resolution in one dimension, for example vertically, to have separate images for each eye.
If regular UHD video is 3840 x 1920, a frame that is twice as tall would be a giant 3840 x 3840, which is too large for the hardware to support. Instead, we must pack this super-tall video into the smaller 3840 x 1920 frame. The left and right eye share the frame, and each eye gets a 3840 x 960 pixel image that covers the full spherical view horizontally, but must be stretched vertically to cover the full view.
The result of packing two image views into a single frame is a much smaller resolution for each eye. With only 960 pixels covering 180° vertically, the 96° field of view spans just 512 pixels (960 / 180 x 96 = 512). Scaling 25% is not great. But scaling 150% (512 pixels vertically stretched to 1280 display pixels) is going to result in a big perceived loss of resolution. Whether the format is packed top/bottom or left/right in the frame, you lose half the resolution in one dimension.
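To make the packing trade-off concrete, here is a rough Python sketch that computes the per-eye field-of-view crop for both packing layouts. The function name and the layout labels are assumptions made for illustration; the frame size, field of view, and display size come from the article.

```python
FOV_DEGREES = 96.0
DISPLAY_PX = 1280


def per_eye_fov(frame_w: int, frame_h: int, packing: str) -> tuple[float, float]:
    """Return the (width, height) in pixels of one eye's field-of-view crop."""
    if packing == "top-bottom":
        eye_w, eye_h = frame_w, frame_h // 2    # each eye gets half the rows
    elif packing == "left-right":
        eye_w, eye_h = frame_w // 2, frame_h    # each eye gets half the columns
    else:
        raise ValueError(f"unknown packing: {packing}")
    # Each eye's image still has to cover 360 x 180 degrees.
    return eye_w / 360.0 * FOV_DEGREES, eye_h / 180.0 * FOV_DEGREES


for packing in ("top-bottom", "left-right"):
    w, h = per_eye_fov(3840, 1920, packing)
    stretch = DISPLAY_PX / min(w, h)            # worst-case axis
    print(f"{packing}: {w:.0f} x {h:.0f} crop, worst axis stretched {stretch:.1f}x")
```

Either layout leaves one axis of the crop at 512 pixels, which must be stretched 2.5x to fill 1280 display pixels. The only difference is whether the vertical or the horizontal axis takes the hit.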
Smaller and smaller streams
When we look at streaming video, the situation gets much worse. Services like YouTube 360° and others stream video in a lower resolution than 4K because few viewers have the bandwidth to accommodate UHD video. Because the video is being played from a server on the internet and not locally on the device, the services reduce the resolution to maintain a smooth streaming experience. Typically the image is streamed in a 2048 x 1024 frame size. Using the same math as above (1024 pixels / 180 degrees x 96 degrees), the field-of-view resolution works out to about 546 x 546 pixels for a YouTube-size frame. Streaming HD-resolution 360° video really does result in a resolution comparable to old television (720 x 480 or 720 x 576), and it must be stretched to more than 230% of its size to fit the display. That is why streaming VR video, and stereo video in particular, looks so soft.
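The same pixels-per-degree calculation can be run across frame sizes to compare a local UHD file with a typical stream. This is only a sketch; the labels and the helper name are made up for illustration, and the frame sizes are the two quoted in this article.

```python
FOV_DEGREES = 96.0
DISPLAY_PX = 1280


def fov_crop(frame_h: int) -> float:
    """Field-of-view crop size in pixels for a 360 x 180 degree frame of this height."""
    return frame_h / 180.0 * FOV_DEGREES


frames = {
    "local UHD (3840 x 1920)": 1920,
    "typical stream (2048 x 1024)": 1024,
}

for label, height in frames.items():
    crop = fov_crop(height)
    scale = DISPLAY_PX / crop
    print(f"{label}: {crop:.0f}-pixel crop, scaled {scale:.2f}x to fit the display")
```

For the 2048 x 1024 stream, the crop comes out at roughly 546 pixels and has to be scaled about 2.34x, compared with 1.25x for the local UHD file.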
Much bigger is better
For current displays like the Galaxy S6 (or even the forthcoming Oculus Rift), a field-of-view resolution that is slightly larger than the display resolution would help. Scaling down a slightly larger image will make the image look smoother and will reduce distracting artifacts such as aliasing and noise.
A resolution of 1536 x 1536 pixels would be a good solution. This is 20% bigger than the display (1536 / 1280 = 1.2), and in our tests it shows a significant improvement in the experience. Using the math from earlier, if we want to cover a 96° field of view with 1536 pixels, we need a resolution of 16 pixels/degree. Multiply that by 360° horizontally and 180° vertically, and you get a resolution of 5760 x 2880. So it turns out we need images that are nearly 6000 pixels across to get better coverage for mono video.
What about stereo? Remember that for stereo, we ideally want a full resolution image for each eye (or a clever way to pack left and right together). So a rough measure would mean sending video files that are at least 5760 pixels wide by 5760 pixels tall. That works out to 33 megapixels per frame. Even with next-generation gigabit or 5G wireless networks and better compression, streaming 33-megapixel videos to a headset is going to be a big challenge.
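Working backward from a target field-of-view resolution to a full frame size is the inverse of the earlier calculation. The sketch below assumes stereo is handled by stacking two full-resolution views top/bottom, which is just one of the possible packings; the function name is illustrative.

```python
def required_frame(target_fov_px: int, fov_degrees: float = 96.0,
                   stereo: bool = False) -> tuple[int, int]:
    """Frame size needed so the field of view is covered by target_fov_px pixels."""
    pixels_per_degree = target_fov_px / fov_degrees
    width = round(pixels_per_degree * 360)
    height = round(pixels_per_degree * 180)
    if stereo:
        height *= 2   # assumption: full-resolution left and right views stacked
    return width, height


for stereo in (False, True):
    w, h = required_frame(1536, stereo=stereo)
    megapixels = w * h / 1_000_000
    kind = "stereo" if stereo else "mono"
    print(f"{kind}: {w} x {h} = {megapixels:.1f} megapixels per frame")
```

With a 1536-pixel target this gives 5760 x 2880 (about 16.6 megapixels) for mono and 5760 x 5760 (about 33.2 megapixels) for stereo.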
The Pixvana System
At Pixvana, we think there are other clever ways to solve these resolution and streaming problems. It’s possible to create streams that are closer to HD resolution to allow the viewer to experience VR video at an ideal resolution. Facebook is testing an approach on its own platform that aims to minimize bandwidth and increase quality by concentrating the quality on the front view with Dynamic Streaming. You can read a bit more on their coding blog.
We are investigating methods for covering the field of view with greater resolution. One strategy is to cover the current field of view with higher-resolution imagery and add extra padding to account for head motion. Switching between multiple streams as the viewer turns could generate perfect coverage for headsets like the Gear VR (or the PC-tethered systems—HTC Vive, or the Oculus Rift), or even future headsets that will offer much higher display resolution. A better approach might be to have multiple ways to pack and encode the data based on content. Video shot in a closed studio environment with a fixed camera could have vastly different encoding from a VR video shot from a moving vehicle, for example.
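As a purely hypothetical illustration of the stream-switching idea (not a description of the Pixvana system, whose design is not given here), the sketch below pre-cuts the sphere into a handful of horizontal views, each covering the field of view plus some padding, and picks the view nearest the viewer's head yaw. The stream count (8), the padding (24°), and every name in the code are assumptions chosen for the example.

```python
from dataclasses import dataclass

FOV_DEGREES = 96.0


@dataclass(frozen=True)
class ViewStream:
    name: str
    center_yaw: float   # degrees, direction this stream is centered on
    coverage: float     # degrees of horizontal view carried at high resolution


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two yaw angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def make_streams(count: int, padding: float) -> list[ViewStream]:
    """Evenly spaced streams, each covering the field of view plus padding per side."""
    step = 360.0 / count
    return [ViewStream(f"view-{i}", i * step, FOV_DEGREES + 2 * padding)
            for i in range(count)]


def pick_stream(streams: list[ViewStream], head_yaw: float) -> ViewStream:
    """Choose the stream whose center is closest to where the viewer is looking."""
    return min(streams, key=lambda s: angular_distance(head_yaw, s.center_yaw))


def covers_view(stream: ViewStream, head_yaw: float) -> bool:
    """True if the whole field of view at head_yaw lies inside the stream's coverage."""
    offset = angular_distance(head_yaw, stream.center_yaw)
    return offset + FOV_DEGREES / 2 <= stream.coverage / 2


streams = make_streams(count=8, padding=24.0)   # hypothetical values
for yaw in (0.0, 100.0, 350.0):
    chosen = pick_stream(streams, yaw)
    print(f"yaw {yaw:5.1f} -> {chosen.name}, fully covered: {covers_view(chosen, yaw)}")
```

With these example values, the nearest stream always contains the full field of view, because the padding (24°) exceeds half the spacing between stream centers (22.5°). A real system would also have to weigh head velocity, switching latency, and bandwidth, none of which this sketch models.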
Pixvana intends to build an open system that gives content creators all the tools they need for creating beautiful high-resolution VR video. Such a system would be responsive to changes in headset position, changes in bandwidth, and headset display characteristics, and still deliver the right image for the viewer at any moment. The Pixvana system will work on a variety of platforms, from mobile to PC. With our system, future video streamed to the Gear VR will be significantly better than the video we view now, and VR video will achieve the immersion we desire.