We all experience video streaming issues, probably on a daily basis. But now that people aren’t just streaming for entertainment, but in many cases relying on it to do their jobs, the stakes for addressing these issues and improving the experience are higher. Deep Learning, according to a recent paper presented at the IEEE International Conference on Visual Communications and Image Processing (VCIP), may be the key to solving many of these performance issues.
Researchers Christian Timmerer, Ekrem Cetinkaya, Hadi Amirpour, and Mohammed Ghanbari from The University of Klagenfurt’s Athena Christian Doppler Pilot Laboratory and streaming technology company Bitmovin assert that Deep Learning may offer a solution to strained bandwidth and other performance issues that currently create problems for viewers and streaming companies.
A brief discussion of CNNs
Convolutional Neural Networks (CNNs) are a class of Deep Learning models commonly used in image recognition and analysis. Like other neural network-based models, CNNs loosely mimic the human brain’s ability to learn by stacking multiple layers of connected ‘neurons’. Each neuron combines features from the layer below it, so successive layers learn increasingly abstract patterns that are predictive of certain outcomes or categories; the weights behind those combinations are tuned over a series of epochs, or full passes through the training data. Unlike shallower machine learning models, Deep Learning relies on multiple hidden layers to perform its magic, which can often result in very precise predictions and copes well with data that doesn’t fit simple parametric assumptions.
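To make the idea concrete, here is a minimal sketch of a small image-classification CNN in PyTorch. The layer sizes, the 64x64 input, and the ten-class output are illustrative assumptions, not anything from the paper; the point is simply how convolutional layers feed pooled features into fully connected layers.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative CNN: convolutions -> pooling -> fully connected classifier."""
    def __init__(self, num_classes: int = 10):  # 10 classes is an arbitrary example
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn 16 local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                               # downsample 2x
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # deeper layer, more abstract features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),          # assumes 64x64 RGB input
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One training "epoch" is a full pass over the dataset; weights are updated batch by batch.
model = TinyCNN()
logits = model(torch.randn(1, 3, 64, 64))  # dummy 64x64 RGB image
print(logits.shape)  # torch.Size([1, 10])
```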
Setting the speed limit according to the slowest driver on the road
Videos for streaming are stored on servers in different versions–otherwise known as ‘multiple representations’–at various resolutions and quality levels. When the client/player requests the video, it selects the representation best suited to current network conditions. According to the paper, the current standard method for streaming video over the web–HTTP Adaptive Streaming (HAS)–is limited in its ability to handle such multiple-representation encoding efficiently.
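As a rough illustration of how a HAS client picks among representations, here is a small sketch. The bitrate ladder and the safety margin are made-up values for illustration; real DASH or HLS players use considerably more sophisticated adaptation logic.

```python
# Hypothetical bitrate ladder (kbps) for one video, lowest to highest quality.
REPRESENTATIONS_KBPS = [400, 1200, 2500, 5000, 8000]

def pick_representation(measured_throughput_kbps: float, safety_margin: float = 0.8) -> int:
    """Pick the highest-quality representation the current network can sustain.

    A simple throughput-based heuristic: keep some headroom (safety_margin)
    so small bandwidth dips don't immediately cause rebuffering.
    """
    budget = measured_throughput_kbps * safety_margin
    affordable = [r for r in REPRESENTATIONS_KBPS if r <= budget]
    return affordable[-1] if affordable else REPRESENTATIONS_KBPS[0]

# Example: on a ~3.5 Mbps connection with 20% headroom, the player requests the 2500 kbps version.
print(pick_representation(3500))  # 2500
```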
Parallel encoding relies on a ‘reference encoding’: one representation is encoded first, and the decisions made during that encoding are reused to speed up the others. Timmerer et al. assert that most existing methods cannot accelerate the encoding process effectively because they use the highest quality representation as the reference. By nature, the highest quality representation is also the one with the highest time-complexity. Thus the process is delayed until the representation with the highest time-complexity is completed, which creates many of the streaming bottlenecks we experience, and consequently a poor user experience and a big headache for video streaming developers.
In plain terms, it’s kind of like setting the speed limit according to the slowest driver on the road (instead of just telling them to stay in the slow lane). So how do you ‘raise the speed limit’? The key, according to Timmerer et al., is setting the speed limit based on the fastest driver on the highway, not the slowest. You can do this, they say, with CNNs, through a technique termed ‘FaME-ML’.
How FaME-ML works
The authors of “FaME-ML: Fast Multirate Encoding for HTTP Adaptive Streaming Using Machine Learning” describe how they used CNNs to optimize the encoding of multiple representations of a video. FaME-ML uses CNNs to predict the split decisions for CTUs–Coding Tree Units, the square blocks into which each frame is subdivided–for multirate encoding. The representation with the lowest time-complexity is then selected as the reference encoding (vs. current techniques, which use the representation with the highest time-complexity–i.e. the ‘slowest driver’ on the road–as the reference encoding).
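To see why reusing the reference encoding’s decisions saves work, here is a loose sketch of the general multirate idea: the split depth chosen for a CTU in the reference representation can be used to bound the partitioning search in the other representations. The depth limits and the plus/minus-one rule below are illustrative assumptions, not the paper’s exact scheme.

```python
# Hypothetical CTU split-depth reuse (illustrative, not the paper's exact rules).
# In HEVC, each 64x64 CTU can be split recursively; depths 0..3 roughly correspond
# to 64x64, 32x32, 16x16, and 8x8 blocks.

def depth_search_range(reference_depth: int) -> range:
    """Limit the depths a dependent encoding has to evaluate for this CTU.

    Instead of trying every depth (0..3), search only around the depth the
    reference encoding already chose, e.g. reference depth plus/minus one.
    """
    lo = max(0, reference_depth - 1)
    hi = min(3, reference_depth + 1)
    return range(lo, hi + 1)

# Example: the reference encoding picked depth 1 for a CTU, so a dependent
# encoding evaluates only depths 0..2 instead of the full 0..3.
print(list(depth_search_range(1)))  # [0, 1, 2]
```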
FaME-ML, as described in the paper, consists of the following steps (a rough sketch of the network follows the list):
- Use the HEVC reference software to encode the lowest quality representation, and store the data needed to speed up encoding of the remaining representations
- Use the CNN to process Y, U, and V channels of the raw video and obtain the output of the intermediate softmax layer
- Append an additional feature vector to the output and pass it through fully connected layers to obtain the final split decision
- Use the decision of the CNN to speed up the encoding process of the two most complex representations
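The paper describes the network at a high level rather than as code, so the following is only a loose PyTorch sketch of that shape: a convolutional trunk over the Y/U/V channels of a CTU, an intermediate softmax output, an appended feature vector, and fully connected layers producing the split decision. All layer sizes, the number of extra features, and the way the channels are stacked are my assumptions for illustration.

```python
import torch
import torch.nn as nn

class SplitDecisionNet(nn.Module):
    """Sketch of a FaME-ML-style split predictor (sizes are illustrative assumptions)."""

    def __init__(self, extra_features: int = 4, ctu_size: int = 64):
        super().__init__()
        # Convolutional trunk over the Y, U and V channels of a CTU, stacked as 3 planes.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        conv_out = 32 * (ctu_size // 4) ** 2
        # Intermediate softmax output over the convolutional features.
        self.intermediate = nn.Sequential(nn.Linear(conv_out, 64), nn.Softmax(dim=1))
        # Fully connected head over [intermediate output + appended feature vector].
        self.head = nn.Sequential(
            nn.Linear(64 + extra_features, 32), nn.ReLU(),
            nn.Linear(32, 1),  # probability that this CTU should be split
                               # (the paper's actual decision targets may differ)
        )

    def forward(self, yuv_ctu: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        z = self.intermediate(self.conv(yuv_ctu))
        return torch.sigmoid(self.head(torch.cat([z, extra], dim=1)))

# Dummy usage: one 64x64 CTU with Y/U/V stacked as 3 channels, plus 4 extra
# features (e.g. information stored from the lowest-quality reference encoding).
net = SplitDecisionNet()
p_split = net(torch.randn(1, 3, 64, 64), torch.randn(1, 4))
print(p_split.shape)  # torch.Size([1, 1])
```

In the system the paper describes, this prediction is made using data from the lowest-quality reference encoding and then reused to shortcut the partitioning search when encoding the two most complex representations.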
Implications of the study
The research could lay the groundwork for much more efficient encoding. The paper reports ROC-AUC scores of 0.79, 0.81, and 0.77 for the CNN’s classifications, and roughly a 41% reduction in overall time-complexity for parallel encoding.
With the number of people streaming and the amount of data required for video content increasing at an incredible rate, new approaches such as this may help to alleviate many of the serious challenges faced by the video streaming industry.