UVQ: Measuring YouTube’s Perceptual Video Quality

Online video sharing platforms, like YouTube, need to understand perceptual video quality (i.e., a user’s subjective perception of video quality) in order to better optimize and improve user experience. Video quality assessment (VQA) attempts to build a bridge between video signals and perceptual quality by using objective mathematical models to approximate the subjective opinions of users. Traditional video quality metrics, like peak signal-to-noise ratio (PSNR) and Video Multi-Method Assessment Fusion (VMAF), are reference-based and focus on the relative difference between the target and reference videos. Such metrics, which work best on professionally generated content (e.g., movies), assume the reference video is of pristine quality and that one can induce the target video’s absolute quality from the relative difference.

However, the majority of the videos that are uploaded on YouTube are user-generated content (UGC), which bring new challenges due to their remarkably high variability in video content and original quality. Most UGC uploads are non-pristine and the same amount of relative difference could imply very different perceptual quality impacts. For example, people tend to be less sensitive to the distortions of poor quality uploads than of high quality uploads. Thus, reference-based quality scores become inaccurate and inconsistent when used for UGC cases. Additionally, despite the high volume of UGC, there are currently limited UGC video quality assessment (UGC-VQA) datasets with quality labels. Existing UGC-VQA datasets are either small in size (e.g., LIVE-Qualcomm has 208 samples captured from 54 unique scenes), compared with datasets with millions of samples for classification and recognition (e.g., ImageNet and YouTube-8M), or don’t have enough content variability (sampling without considering content information, like LIVE-VQC and KoNViD-1k).

In “Rich Features for Perceptual Quality Assessment of UGC Videos“, published at CVPR 2021, we describe how we attempt to solve the UGC quality assessment problem by building a Universal Video Quality (UVQ) model that resembles a subjective quality assessment. The UVQ model uses subnetworks to analyze UGC quality from high-level semantic information to low-level pixel distortions, and provides a reliable quality score with rationale (leveraging comprehensive and interpretable quality labels). Moreover, to advance UGC-VQA and compression research, we enhance the open-sourced YouTube-UGC dataset, which contains 1.5K representative UGC samples from millions of UGC videos (distributed under the Creative Commons license) on YouTube. The updated dataset contains ground-truth labels for both original videos and corresponding transcoded versions, enabling us to better understand the relationship between video content and its perceptual quality.

Subjective Video Quality Assessment
To understand perceptual video quality, we leverage an internal crowd-sourcing platform to collect mean opinion scores (MOS) with a scale of 1–5, where 1 is the lowest quality and 5 is the highest quality, for no-reference use cases. We collect ground-truth labels from the YouTube-UGC dataset and categorize UGC factors that affect quality perception into three high-level categories: (1) content, (2) distortions, and (3) compression. For example, a video with no meaningful content won’t receive a high quality MOS. Also, distortions introduced during the video production phase and video compression artifacts introduced by third-party platforms, e.g., transcoding or transmission, will degrade the overall quality.

MOS= 2.052 MOS= 4.457
Left: A video with no meaningful content won’t receive a high quality MOS. Right: A video displaying intense sports shows a higher MOS.
MOS= 1.242 MOS= 4.522
Left: A blurry gaming video gets a very low quality MOS. Right: A video with professional rendering (high contrast and sharp edges, usually introduced in the video production phase) shows a high quality MOS.
MOS= 2.372 MOS= 4.646
Left: A heavily compressed video receives a low quality MOS. Right: a video without compression artifacts shows a high quality MOS.

We demonstrate that the left gaming video in the second row of the figure above has the lowest MOS (1.2), even lower than the video with no meaningful content. A possible explanation is that viewers may have higher video quality expectations for videos that have a clear narrative structure, like gaming videos, and the blur artifacts significantly reduce the perceptual quality of the video.

UVQ Model Framework
A common method for evaluating video quality is to design sophisticated features, and then map these features to a MOS. However, designing useful handcrafted features is difficult and time-consuming, even for domain experts. Also, the most useful existing handcrafted features were summarized from limited samples, which may not perform well on broader UGC cases. In contrast, machine learning is becoming more prominent in UGC-VQA because it can automatically learn features from large-scale samples.

A straightforward approach is to train a model from scratch on existing UGC quality datasets. However, this may not be feasible as there are limited quality UGC datasets. To overcome this limitation, we apply a self-supervised learning step to the UVQ model during training. This self-supervised step enables us to learn comprehensive quality-related features, without ground-truth MOS, from millions of raw videos.

Following the quality-related categories summarized from the subjective VQA, we develop the UVQ model with four novel subnetworks. The first three subnetworks, which we call ContentNet, DistortionNet and CompressionNet, are used to extract quality features (i.e., content, distortion and compression), and the fourth subnetwork, called AggregationNet, maps the extracted features to generate a single quality score. ContentNet is trained in a supervised learning fashion with UGC-specific content labels that are generated by the YouTube-8M model. DistortionNet is trained to detect common distortions, e.g., Gaussian blur and white noise of the original frame. CompressionNet focuses on video compression artifacts, whose training data are videos compressed with different bitrates. CompressionNet is trained using two compressed variants of the same content that are fed into the model to predict corresponding compression levels (with a higher score for more noticeable compression artifacts), with the implicit assumption that the higher bitrate version has a lower compression level.

The ContentNet, DistortionNet and CompressionNet subnetworks are trained on large-scale samples without ground-truth quality scores. Since video resolution is also an important quality factor, the resolution-sensitive subnetworks (CompressionNet and DistortionNet) are patch-based (i.e., each input frame is divided into multiple disjointed patches that are processed separately), which makes it possible to capture all detail on native resolution without downscaling. The three subnetworks extract quality features that are then concatenated by the fourth subnetwork, AggregationNet, to predict quality scores with domain ground-truth MOS from YouTube-UGC.

The UVQ training framework.

Analyzing Video Quality with UVQ
After building the UVQ model, we use it to analyze the video quality of samples pulled from YouTube-UGC and demonstrate that its subnetworks can provide a single quality score along with high-level quality indicators that can help us understand quality issues. For example, DistortionNet detects multiple visual artifacts, e.g., jitter and lens blur, for the middle video below, and CompressionNet detects that the bottom video has been heavily compressed.

ContentNet assigns content labels with corresponding probabilities in parentheses, i.e., car (0.58), vehicle (0.42), sports car (0.32), motorsports (0.18), racing (0.11).
DistortionNet detects and categorizes multiple visual distortions with corresponding probabilities in parentheses, i.e., jitter (0.112), color quantization (0.111), lens blur (0.108), denoise (0.107).
CompressionNet detects a high compression level of 0.892 for the video above.

Additionally, UVQ can provide patch-based feedback to locate quality issues. Below, UVQ reports that the quality of the first patch (patch at time t = 1) is good with a low compression level. However, the model identifies heavy compression artifacts in the next patch (patch at time t = 2).

Patch at time t = 1 Patch at time t = 2
Compression level = 0.000 Compression level = 0.904
UVQ detects a sudden quality degradation (high compression level) for a local patch.

In practice, UVQ can generate a video diagnostic report that includes a content description (e.g., strategy video game), distortion analysis (e.g., the video is blurry or pixelated) and compression level (e.g., low or high compression). Below, UVQ reports that the content quality, looking at individual features, is good, but the compression and distortion quality is low. When combining all three features, the overall quality is medium-low. We see that these findings are close to the rationale summarized by internal user experts, demonstrating that UVQ can reason through quality assessments, while providing a single quality score.

UVQ diagnostic report. ContentNet (CT): Video game, strategy video game, World of Warcraft, etc. DistortionNet (DT): multiplicative noise, Gaussian blur, color saturation, pixelate, etc. CompressionNet (CP): 0.559 (medium-high compression). Predicted quality score in [1, 5]: (CT, DT, CP) = (3.901, 3.216, 3.151), (CT+DT+CP) = 3.149 (medium-low quality).

Conclusion
We present the UVQ model, which generates a report with quality scores and insights that can be used to interpret UGC video perceptual quality. UVQ learns comprehensive quality related features from millions of UGC videos and provides a consistent view of quality interpretation for both no-reference and reference cases. To learn more, read our paper or visit our website to see YT-UGC videos and their subjective quality data. We also hope that the enhanced YouTube-UGC dataset enables more research in this space.

Acknowledgements
This work was possible through a collaboration spanning several Google teams. Key contributors include: Balu Adsumilli, Neil Birkbeck, Joong Gon Yim from YouTube and Junjie Ke, Hossein Talebi, Peyman Milanfar from Google Research. Thanks to Ross Wolf, Jayaprasanna Jayaraman, Carena Church, and Jessie Lin for their contributions.

Read More