How to reproduce the Netflix per-title encoding results

In 2016 Netflix introduced the concept of per-title encoding. In a blog post and a detailed paper they described their findings and how to interpret the results. In this post I describe my understanding of per-title encoding and show how to reproduce the Netflix results using ffmpeg.

What is per-title encoding?

Per-title encoding is based on the fact that different types of video content require different bitrates and encoding settings to achieve a certain quality. Compared to classic approaches, in which the same predefined encoding ladder is applied to all types of content, per-title encoding has the potential to significantly decrease the storage and delivery costs of video streams. Easy-to-encode videos can be delivered at much lower bitrates while improving the perceptual quality. In addition, movies or sports streams that contain a lot of motion can be streamed at lower resolutions to avoid a bad quality of experience for the viewer.
To illustrate these findings consider the figure below:

PSNR values of different 1080p movie trailers depending on the encoding bitrate.

In this case different movie trailers have been encoded with the same Constant Rate Factor (CRF) values. The chart shows the resulting PSNR and bitrate values at a resolution of 1080p. While some of the assets achieve a “good” PSNR value of 40 to 45 dB at bitrates of 1 to 2 Mbit/s, other input files require around 3 to 6 Mbit/s. Consequently, some of the movie trailers are easier to encode and less complex than others.

How to implement per-title encoding

So how can we actually determine the complexity of an asset and come up with individual encoding settings? For that purpose, Netflix uses more or less a brute-force approach. Let's go through it step by step:

1. Perform multiple test encodes

To start off we need to perform multiple test encodes. These test encodes will help us later to determine the optimal bitrate/resolution pairs. In my tests I used 1080p videos as an input. Seven target resolutions with 12 different CRF values lead to a total of 84 test encodes:

Codec: H.264
Resolutions: 1920×1080, 1280×720, 720×480, 640×480, 512×384, 382×288, 320×240
CRF values: 18, 19, 20, 22, 25, 27, 30, 35, 40, 45, 50, 55

The ffmpeg command for this type of encode is pretty straightforward:

ffmpeg -i input.mp4 -y -vcodec libx264 -filter:v scale=w=1280:h=720 -crf 30 output.mp4

For each of the 84 different settings, we tell ffmpeg to encode in H.264 at our target CRF value and scale the output to our target resolution.
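As a rough sketch of how the 84 commands could be generated, the snippet below builds one ffmpeg argument list per resolution/CRF pair from the table above. The output file naming scheme and the function name are my own assumptions; only the ffmpeg options are taken from the command shown earlier.

```python
from itertools import product

# Resolutions and CRF values as listed in the table of test encodes.
RESOLUTIONS = [(1920, 1080), (1280, 720), (720, 480), (640, 480),
               (512, 384), (382, 288), (320, 240)]
CRF_VALUES = [18, 19, 20, 22, 25, 27, 30, 35, 40, 45, 50, 55]


def build_encode_commands(source="input.mp4"):
    """Return one ffmpeg argument list per resolution/CRF combination."""
    commands = []
    for (width, height), crf in product(RESOLUTIONS, CRF_VALUES):
        # Hypothetical naming scheme so that every encode gets its own file.
        output = f"out_{width}x{height}_crf{crf}.mp4"
        commands.append([
            "ffmpeg", "-i", source, "-y",
            "-vcodec", "libx264",
            "-filter:v", f"scale=w={width}:h={height}",
            "-crf", str(crf),
            output,
        ])
    return commands


commands = build_encode_commands()
print(len(commands))          # number of test encodes
print(" ".join(commands[0]))  # first command line
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` to actually run the encode.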

2. Determine the quality

Now that we have finished our 84 test encodes, we want to analyze the output and determine its quality compared to our input video. For that purpose we will use the peak signal-to-noise ratio (PSNR). At this point, some might argue that PSNR is not a good indicator of the perceived quality of a video. I agree, but the following principles also apply to other metrics such as SSIM or VMAF. The main benefit of PSNR is that the calculation is comparatively simple and does not take much time.
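To make the metric a bit more concrete, here is a minimal sketch of the textbook PSNR definition, 10 · log10(MAX² / MSE), applied to two short lists of 8-bit sample values. It only illustrates the formula; ffmpeg computes the value per plane and per frame, so this is not a reimplementation of its filter. The sample values are made up.

```python
import math


def psnr(reference, distorted, max_value=255):
    """PSNR in dB between two equally long sequences of 8-bit samples."""
    if len(reference) != len(distorted):
        raise ValueError("both inputs must have the same number of samples")
    # Mean squared error between the reference and the distorted samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


# Made-up sample values, for illustration only.
reference = [52, 55, 61, 59, 79, 61, 76, 61]
distorted = [50, 56, 60, 60, 80, 60, 75, 62]
print(round(psnr(reference, distorted), 2))
```

The requirement that both inputs have the same number of samples is the reason the outputs have to be brought to the source resolution before they can be compared.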

2.1 Upscaling to the source resolution

By definition, PSNR and other quality metrics only work on videos having the same resolution. Since most of our output videos have a smaller resolution than our input we have two options: We could either downscale our 1080p input video to the respective output resolutions, or upscale our outputs to 1080p.
Because basically all TV sets support at least 1080p and Netflix is mainly consumed on such devices, upscaling our outputs is the way to go. Still, how can we upscale our videos without re-encoding them? At this point I struggled and asked Jan Ozer for advice. He told me to output raw files. That way the upscaled frames are stored uncompressed, so no second lossy encode distorts the measurement. This leaves us with the following ffmpeg command:

ffmpeg -i output.mp4 -y -pix_fmt yuv420p -vsync 0 -s 1920x1080 -sws_flags lanczos output.y4m

In one of their latest blog posts, Netflix recommends using bicubic upsampling, which is also the ffmpeg default. If you want to do it that way, you would replace “lanczos” with “bicubic”.

2.2 PSNR calculation

Now that our input and output assets have the same resolution we can do the actual PSNR calculation:

ffmpeg -i output.y4m -i input.mp4 -y -filter_complex psnr -f null -

I recommend running steps 2.1 and 2.2 sequentially for each output and deleting the raw file before moving on to the next one. As the name indicates, raw files tend to become very large. Unless you have terabytes of space on your hard disk, you might end up with only a few kilobytes left (trust me, I have been there).
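One way to run both steps back to back and keep disk usage under control is sketched below. The function names and the temporary file name are assumptions of mine, and the parser relies on the summary line that the psnr filter writes to ffmpeg's log (a line containing `average:<value>`), which is what I have seen ffmpeg print but which may differ between versions.

```python
import os
import re
import subprocess

# Matches the "average:" field of the summary line logged by the psnr filter.
AVERAGE_PSNR = re.compile(r"average:(inf|[0-9.]+)")


def parse_average_psnr(ffmpeg_log):
    """Return the last average PSNR value found in ffmpeg's log output."""
    matches = AVERAGE_PSNR.findall(ffmpeg_log)
    if not matches:
        raise ValueError("no PSNR summary found in the ffmpeg output")
    return float(matches[-1])


def measure_psnr(encoded, source="input.mp4", raw="upscaled.y4m"):
    """Upscale one encode to 1080p, compare it to the source, drop the raw file."""
    try:
        # Step 2.1: upscale to the source resolution into a raw Y4M file.
        subprocess.run(
            ["ffmpeg", "-i", encoded, "-y", "-pix_fmt", "yuv420p", "-vsync", "0",
             "-s", "1920x1080", "-sws_flags", "lanczos", raw],
            check=True, capture_output=True)
        # Step 2.2: compute the PSNR against the source video.
        result = subprocess.run(
            ["ffmpeg", "-i", raw, "-i", source, "-y",
             "-filter_complex", "psnr", "-f", "null", "-"],
            check=True, capture_output=True, text=True)
        return parse_average_psnr(result.stderr)  # ffmpeg logs to stderr
    finally:
        if os.path.exists(raw):
            os.remove(raw)  # free the disk space before the next encode


# Illustrative log line in the format described above; the values are made up.
sample = ("[Parsed_psnr_0 @ 0x55d] PSNR y:41.20 u:45.80 v:46.10 "
          "average:42.35 min:38.90 max:47.60")
print(parse_average_psnr(sample))
```

Calling `measure_psnr("output.mp4")` for each test encode in turn means only one raw file exists at any time.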

3. Analyzing the results

At this point we have the resulting PSNR and bitrate values for each of our 84 test encodes. So what do we do with them? It is always a good idea to plot the values and see what happens:
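The post does not spell out how the bitrate of each encode is obtained. One simple option, sketched here as an assumption, is to derive the average bitrate from the file size and the duration, and then group the measurements into one curve per resolution so they are ready for plotting. All names and numbers below are illustrative.

```python
from collections import defaultdict


def average_bitrate_kbps(size_bytes, duration_seconds):
    """Average bitrate in kbit/s derived from file size and duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_seconds / 1000


def group_by_resolution(results):
    """Group (resolution, bitrate, psnr) rows into one sorted curve per resolution."""
    curves = defaultdict(list)
    for resolution, bitrate, psnr in results:
        curves[resolution].append((bitrate, psnr))
    # Sort each curve by bitrate so it can be drawn as a line.
    return {res: sorted(points) for res, points in curves.items()}


# Made-up measurements: (resolution, bitrate in kbit/s, PSNR in dB).
results = [
    ("720p", 1000.0, 37.0),
    ("480p", 500.0, 34.0),
    ("720p", 500.0, 33.0),
    ("480p", 250.0, 31.0),
]
curves = group_by_resolution(results)
print(average_bitrate_kbps(7_500_000, 120))  # a 7.5 MB file lasting 120 s
print(curves["720p"])
```

Note that a bitrate computed from the file size includes container overhead and any audio track, so it is only an approximation of the video bitrate.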

The resulting bitrate/PSNR values for the different resolutions

This looks very promising and similar to the graphs we can see in the Netflix blog posts and paper. For now, ignore the convex hull description on the right. What we can observe is that lower resolutions outperform higher resolutions at low bitrates. For instance, at around 500 kbit/s the 480p and 720p resolutions deliver better PSNR values than the 1080p resolution. Even though the content is upscaled on the playback device, the quality is still better.
Moreover, just like in the Netflix examples, the bitrate/PSNR curves start flattening out at some point. Beyond that point, increasing the bitrate at the same resolution yields hardly any gain in quality.

4. Convex hull calculation

Netflix states that the optimal bitrate/resolution pairs are the points located closest to the convex hull. So let's take a look at our chart once we calculate the convex hull:

The convex hull 

Pretty much what we expected. The convex hull curve covers the best bitrate/PSNR pairs of the seven resolutions. Depending on our needs as a streaming provider we can derive the target encoding ladder from the test encodes and the resulting PSNR and convex hull values.
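For readers who want to compute the hull themselves, the following sketch derives the upper convex hull of a set of (bitrate, PSNR) points with the monotone chain technique. This is my own illustration, not necessarily how Netflix or the chart above computed it, and the data points are made up.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_convex_hull(points):
    """Upper convex hull of (bitrate, psnr, label) points, left to right."""
    hull = []
    for point in sorted(points, key=lambda p: (p[0], p[1])):
        # Drop points that lie on or below the line to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()
        hull.append(point)
    # Keep only the rising part: more bitrate should never mean less quality.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:peak + 1]


# Made-up measurements: (bitrate in kbit/s, PSNR in dB, resolution).
points = [
    (250, 31.0, "480p"), (500, 34.0, "480p"), (1000, 35.5, "480p"),
    (500, 33.0, "720p"), (1000, 37.0, "720p"), (2000, 39.0, "720p"),
    (1000, 35.0, "1080p"), (2000, 40.0, "1080p"), (4000, 43.0, "1080p"),
]
for bitrate, psnr, resolution in upper_convex_hull(points):
    print(bitrate, psnr, resolution)
```

In this toy data set the hull switches from 480p to 720p to 1080p as the bitrate grows, which mirrors the behaviour visible in the chart.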

5. Determine the encoding ladder

Streaming providers are aware of the bandwidth available on the client side. They might have certain fixed bitrates which they want to provide, because many of their viewers' internet connections are limited to these bitrates. Therefore, the most straightforward way to use our data is to identify the optimal resolution for a given bitrate. Compared to classic one-size-fits-all encoding ladders this offers the following advantages:

  • For easy-to-encode content: A higher resolution and a better quality at the same bitrates
  • For hard-to-encode content: The potential to stay below 1080p and avoid artifacts and blurriness in the video.
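A minimal sketch of that selection step could look as follows: for every bitrate the provider wants to offer, pick the test encode with the highest PSNR that does not exceed it. The function name, the target bitrates and the measurements are all made up for illustration.

```python
def pick_ladder(encodes, target_bitrates):
    """For each target bitrate pick the best-quality encode that fits into it.

    encodes: iterable of (bitrate in kbit/s, PSNR in dB, resolution) tuples.
    Returns a list of (target bitrate, resolution, PSNR) tuples.
    """
    ladder = []
    for target in sorted(target_bitrates):
        candidates = [e for e in encodes if e[0] <= target]
        if not candidates:
            continue  # no test encode fits into this bitrate
        best = max(candidates, key=lambda e: e[1])
        ladder.append((target, best[2], best[1]))
    return ladder


# Made-up measurements: (bitrate in kbit/s, PSNR in dB, resolution).
encodes = [
    (250, 31.0, "480p"), (500, 34.0, "480p"), (1000, 35.5, "480p"),
    (500, 33.0, "720p"), (1000, 37.0, "720p"), (2000, 39.0, "720p"),
    (1000, 35.0, "1080p"), (2000, 40.0, "1080p"), (4000, 43.0, "1080p"),
]
for rung in pick_ladder(encodes, [500, 1000, 2000, 4000]):
    print(rung)
```

Since CRF encodes rarely land exactly on a target bitrate, the selected resolution would typically be re-encoded afterwards with a bitrate-based rate control at the target bitrate.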

Conclusion

This concludes our small dive into the field of per-title encoding. We did not touch topics like scene-based encoding and optimizing for high-complexity parts of the source video. For anyone interested in that, the concepts are also explained in the Netflix paper referenced at the start of this post.