Video Compression for Mere Mortals

4914 words, 18 minutes to read

landscape, spring, onsen, highly detailed, colorful, cherry blossoms, watercolor, no humans - Pastel-mix

I'm not just an internet streamer, I'm a VTuber. I end up streaming every other weekend on Twitch, where I work on a variety of things and attempt to explain my process as I go. I started doing this as a way to get more comfortable with public speaking, and it has been absolutely catalytic for getting over that fear.

This website is intended to be my professional portfolio, and I haven't really had a good way to record things like the events I've attended or the streams I have done. As I progress deeper and deeper into the craft of developer relations, the need for something like this is becoming more and more obvious.

Cadey is coffee
<Cadey>

To be fair, I do have the ability to annotate individual posts with links to the relevant stream recordings, but that is a very different thing than a list of streams that have been done and an outline of what happened in them. Those links are also at the bottom of the page, and I suspect that nobody ends up reading those or clicking through to them.

Just making the section a pile of YouTube embeds would work, but I don't feel that's in the true spirit of the Xe Iaso brand. If I have a section for streams on my website, I want people to be able to watch those streams on my website. I don't want to be limited by how YouTube or Twitch compresses videos or limits the use of picture-in-picture. I want to self-host the videos.

Unfortunately, there are a few problems with this, chief among them that stream recordings are huge and that self-hosting them means I'm the one paying for the storage and bandwidth.

So, I'm gonna need to compress the video a bit better. Before I set out on this journey, my best efforts got me about 1 gigabyte per hour of video. After doing more back-of-the-napkin bandwidth and storage math, I was okay with that. I would have to scale things up a bit sooner than I had hoped, but that is the natural consequence of making things more elaborate. Thanks to help from a friend, I figured out how to compress my video a lot: down to 330 MB for a 3.5 hour video. Given that I do about 2-4 hours of streaming every two weeks, this means I wouldn't need to think about the storage demands of XeDN until about June. This is acceptable for me. I'm keeping an eye on the disk usage of the XeDN volumes and have an alert set up in case there's a huge spike in disk usage somehow, but it should be fine until then.
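
Here's that napkin math written out, using only the numbers above:

330 MB / 3.5 hours ≈ 94 MB per hour of video
94 MB/hour × 4 hours × 2 streams/month ≈ 754 MB of new storage per month, worst case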

Aoi is wut
<Aoi>

How do you have so many napkins?

Cadey is enby
<Cadey>

I get the bulk packs from Costco. Still working through the second of four packs of 1.3 thousand napkins. I'm sure I'll get there soon!

Video compression

In case you've never been down this rabbit hole, let's take a moment to cover the basic problems that video compression solves. At a high level, video is a bunch of pictures that are played back fast enough to trick your brain into thinking there is motion.

Mara is hacker
<Mara>

Fun fact: what we call video used to be called "moving pictures" when it was first introduced. Over time people sanded away at the words, which led to the term "movies" as we know it today.

A naïve way to handle this would be to just take each frame of video and store it on the disk with a little bit of compression. This is how video is stored in many professional formats like RED raw and Apple's ProRes video, but that unmatched level of quality comes with an unmatched level of disk space usage. One minute of 4K ProRes video can easily eat up six whole gigabytes of space. This level of quality is absolutely required for professional video editing because of the flexibility it gives you, but for most other uses (such as making stream recordings of me being very bad at using CSS) it's vastly overkill.

Mara is hacker
<Mara>

The main reason why professionals want the flexibility of raw video formats is that every time you encode video into a codec like h.264 (the main video codec used on the internet, usually in .mp4 files), you lose some of the quality permanently. For an example of this in action, see deepfriedmemes.com, which uses generational JPEG compression loss as an artform for creating sarcastic versions of internet posts. I've included an example of this below:

Before generational loss / After generational loss
Mara is happy
<Mara>

You can understand why this would be a bad thing to have to deal with in a professional environment! This particular pair is admittedly a bad example, because both source images were stored as JPEG files before they were ingested by XeDN; if you look closely you can already see faint signs of generational loss around the borders of the image on the left. The image on the left was generated using CDI Diffusion.
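
If you want to reproduce this effect at home, a minimal sketch looks like the loop below. The filename input.jpg is a placeholder, and deepfriedmemes.com may layer on more than plain re-encoding, but repeated lossy encodes are the core of the effect:

cp input.jpg gen0.jpg
# Re-encode the JPEG onto itself 50 times at a low quality setting.
# (-q:v ranges 1-31 for ffmpeg's JPEG encoder; higher numbers are worse.)
for i in $(seq 1 50); do
  ffmpeg -loglevel error -y -i "gen$((i-1)).jpg" -q:v 10 "gen$i.jpg"
done

By gen50.jpg the blocking artifacts should be impossible to miss.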

There is a simple way to compress all of this though, and for that we're going to have to look at animation like Disney did it back when they were still animating everything by hand.

When Disney movies were drawn by hand, there were two tiers of artists: key frame artists and in-between artists. The key frame artists drew entire base images (or, more realistically, individual layers that were manually composited together at filming time) for key parts of single scenes, and then the rest of the work was done by the in-between artists. People again sanded away at the terms: the fully composited images became known as key frames and the in-between drawings became known as tween frames.

Mara is hacker
<Mara>

A good example of how expensive it can be to have each frame painstakingly made by experts is the anime movie REDLINE. It is an absolutely fantastic creation, but took over 7 years to produce and was a commercial failure. This almost took out the studio producing it, but it is quite literally one of the most beautiful anime movies ever made. It's up there with Miyazaki movies in how finely crafted it is. It's well worth a watch.

With modern video compression techniques, we can actually take advantage of the fact that each frame in a video file is almost identical to the frame before it. So, instead of storing the entire frame of video, we can record only the differences between that frame and the previous one. This is the key discovery that led to online video streaming being viable before high-speed Internet was deployed to and used by the masses. It's why we have Netflix.
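
You can see this keyframe/delta split for yourself: ffprobe can dump each frame's picture type, where I means a keyframe and P or B means a delta frame that only stores differences. A quick sketch, pointed at the stream recording used later in this post:

# I = keyframe, P/B = delta frames; the vast majority should be P or B.
ffprobe -v error -select_streams v:0 \
  -show_entries frame=pict_type -of csv=p=0 \
  ./emacs-hacking.mkv | head -n 300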

Mara is hacker
<Mara>

There's an artform called datamoshing that involves using the tween frames from one video without the corresponding keyframes as an artistic effect. Consider this meme:

[Embedded post from Xe, 2023-02-05: a datamoshed example video.]

VTubing and compression

With this understanding of how video compression works (tl;dr: most frames are stored as differences from the previous frame, save the keyframes that have an entire image associated with them), let's take a look at how my streams in particular work. Here's a random still frame from one of my recent streams:

The core elements are very simple and many of them do not change:

- The overlay with the stream branding
- The terminal window where the actual work happens
- The facecam with my VTuber avatar

The two things that update the most often are the terminal window and the facecam. At the time of writing this post, the overlay is static (though I've been considering making something dynamic with Godot that would let me embed the Twitch chat, but concerns with privacy and user-generated content have left me still considering the best way to do this). This means that only about 5-10 percent of the viewport actually changes from frame to frame, so the bitrate can be spent on the things that do change.

My VTuber software VSeeFace is a bit feature-incomplete, which means there isn't much that changes per frame there either: the facecam mostly switches between keyframes for facsimiles of common speech sounds. This adds even more bitrate savings.

Consider these two random sequential frames from one of my streams:

Frame 1 / Frame 2

Did you notice the difference? It's my VTuber avatar doing the animation for the /s/ sound. This means that about 0.05% of the viewport actually changed. This allows me to crank the bitrate way down without losing as much quality as I would in something like Beat Saber. Consider these two random sequential frames of me playing Synth Riders:

Frame 1 / Frame 2

There is way more of a visible difference between these two frames, and as such it requires a much higher bitrate in order to encode this video in a way that isn't painful to watch.
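
If you want to make comparisons like this yourself, pulling adjacent frames out of a recording is a one-liner. The timestamp here is arbitrary, not the exact one used for the screenshots above:

# Grab two back-to-back frames at the 20 minute mark as PNGs:
ffmpeg -ss 00:20:00 -i ./emacs-hacking.mkv -frames:v 2 frame-%d.png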

The encode-ening

As an example, let's take a look at my most recent stream where I mostly hacked up my Nix-based Emacs config so that I could use it instead of my Spacemacs config. In my video folder, the 1 hour, 54 minute video file takes up about 4.82 GB of space. When I pass it to ffprobe (the ffmpeg tool that tells you information about a media file), I get this metadata back:

Input #0, matroska,webm, from 'emacs-hacking.mkv':
  Metadata:
    ENCODER         : Lavf58.76.100
  Duration: 01:53:47.14, start: 0.000000, bitrate: 6066 kb/s
  Stream #0:0: Video: h264 (High), yuv420p(tv, bt709/reserved/reserved, progressive), 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 1k tbn (default)
    Metadata:
      DURATION        : 01:53:47.133000000
  Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : simple_aac
      DURATION        : 01:53:47.136000000
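
For the record, that dump comes from pointing ffprobe at the file with no other arguments (-hide_banner just trims the build info at the top):

ffprobe -hide_banner ./emacs-hacking.mkv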

Here are the key things we can take away from this:

- The recording is 1 hour and 53 minutes long with an overall bitrate of 6066 kb/s, which works out to that 4.82 GB on disk.
- The video stream is h.264 High profile at 1280x720 and 30 frames per second.
- The audio stream is stereo AAC-LC at 48 kHz.

Cadey is coffee
<Cadey>

For the sake of napkin-math, we're going to assume this is a two hour video, mostly because it lets you divide the size by two for the megabytes per-hour comparison.

So, let's get a-crunching! In my initial HLS post I used this command:

ffmpeg \
  -i ./source.mp4 \
  -level 3.0 \
  -start_number 0 \
  -hls_time 10 \
  -hls_list_size 0 \
  -f hls \
  index.m3u8

If I let this run I get a 470MB folder with HLS chunks in it. Each HLS chunk is about 10 seconds of video, and on average each chunk is about 1.45MB. This gives us a total cost per hour of 235 MB. This is way better than you'd expect from fairly uncomplicated settings, and this is what I've used on my blog. It's not the best, not the worst, but it gets the job done and that's good enough for most uses.

Running a random chunk through ffprobe gives me this metadata:

Input #0, mpegts, from 'baseline/index5.ts':
  Duration: 00:00:16.67, start: 51.421333, bitrate: 520 kb/s
  Program 1
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
  Stream #0:0[0x100]: Video: h264 (Constrained Baseline) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/reserved/reserved, progressive), 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 90k tbn
  Stream #0:1[0x101]: Audio: aac (LC) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 144 kb/s

And it looks great:

Frame 1 / Frame 2

However, we can go deeper. The source video was about 6000 kb/sec and ffmpeg already crunched it down to 520 kb/sec. However, we're not even getting into the real fun power of ffmpeg: multi-pass encoding.

Multi-pass drifting encoding

The encoding I just did is known as a single-pass encode: ffmpeg started at the beginning of the video file and encoded everything as it got it. There was no real foreknowledge of what was going on; it was sight-reading the video and encoding on the fly. A multi-pass encode has ffmpeg scan over the entire video once and then use that information to encode things more optimally. In order to encode this with multi-pass encoding, we need two ffmpeg commands:

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 550k \
  -level 3.0 \
  -f mp4 \
  -pass 1 \
  -y \
  /dev/null

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 550k \
  -level 3.0 \
  -start_number 0 \
  -hls_time 10 \
  -hls_list_size 0 \
  -f hls \
  -pass 2 \
  ./two-pass/index.m3u8
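
One detail these commands gloss over: the two passes talk to each other through a stats file that the first pass writes into the current working directory. If you're curious where the analysis data lives, look for the passlog files after pass 1 finishes:

# x264's first-pass analysis lands next to wherever you ran ffmpeg:
ls ffmpeg2pass*
# ffmpeg2pass-0.log  ffmpeg2pass-0.log.mbtree
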
Mara is hacker
<Mara>

Multi-pass encoding isn't compatible with automatically figuring out the bitrate, so I arbitrarily chose a bitrate target of 550 kilobits per second.

This gets us a slightly larger result (mostly because of that arbitrarily chosen bitrate): a 600 MB output, or about 300 MB of video per hour. This is a bit too big for what I want, but it's a good starting point for going even deeper.
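
Sanity-checking that against the 6827-second runtime from ffprobe:

(550 kbit/s video + ~130 kbit/s AAC audio) × 6827 s ÷ 8 ≈ 580,000 kB ≈ 580 MB

which is right in line with the 600 MB folder once you account for container overhead.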

ffprobe reports this about a random HLS chunk:

Input #0, mpegts, from './two-pass/index277.ts':
  Duration: 00:00:08.33, start: 2776.421333, bitrate: 561 kb/s
  Program 1
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
  Stream #0:0[0x100]: Video: h264 (Constrained Baseline) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/reserved/reserved, progressive), 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 90k tbn
  Stream #0:1[0x101]: Audio: aac (LC) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 131 kb/s

And here are two sequential frames:

Frame 1 / Frame 2

This is passable, and I wouldn't be ashamed to have it visible on my portfolio for my mostly text-based streams.

We don't need no stinkin' bitrate!

In video compression, the bitrate is the number of bits of data per second of video, keyframes included. This is the part of the process where you start playing with the bitrate numbers and crunch things down even more. In my testing, I found that 150kb/sec was the best balance between readability of text and file size. This means you can use an ffmpeg command that looks like this:

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 150k \
  -profile:v high \
  -preset slow \
  -tune animation \
  -pix_fmt yuv420p \
  -f mp4 \
  -pass 1 \
  -y \
  /dev/null

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 150k \
  -profile:v high \
  -preset slow \
  -tune animation \
  -pix_fmt yuv420p \
  -start_number 0 \
  -hls_playlist_type vod \
  -hls_allow_cache 1 \
  -hls_time 0 \
  -hls_list_size 0 \
  -f hls \
  -pass 2 \
  ./two-pass-150/index.m3u8

There's one other major difference with how this file in particular is encoded: I changed the hls_time from 10 seconds to 0 seconds. HLS is built on chunking video into a billionty small chunks, and the -hls_time flag tells ffmpeg to cut the video at the next keyframe after about that many seconds. This means that each HLS chunk we made before had about 10 seconds plus the gap to the next keyframe of video, so about 10-15 seconds on average. When we push the bitrate down this low, we also need to be more aggressive about moving each keyframe into its own HLS chunk, or things will look slightly off in a way that people won't quite be able to put their finger on.
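
You can watch this chunking behavior directly in the playlist ffmpeg writes: every #EXTINF line records one chunk's duration in seconds, so with -hls_time 0 the durations collapse down to the keyframe interval:

grep EXTINF ./two-pass-150/index.m3u8 | head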

Mara is hacker
<Mara>

The -tune animation flag tells ffmpeg to put more deblocking data into the resulting files so that you notice compression artifacts less easily. Internally, each group of pixels is compressed as a macroblock, and most of the time you can't see the differences between the pixel groups. You can see them very easily with anime and video recordings of computer UIs, though, especially when the video changes a lot very quickly. If you've ever seen things get blocky when glitter is flying around in YouTube videos, this is why.

The other encoding settings are something I got off of an anime piracy forum that had suggestions for how to make anime look good at lower filesizes. I guess you can consider my vtubing stuff to be anime if you squint enough, so let's throw it into the pot and see what sticks.

Cadey is coffee
<Cadey>

One of the real annoying parts of this video conversion process is that you can't use hardware-accelerated encoding to speed this up because most of what we're doing here is beyond the scope of the encoding hardware in consumer GPUs.

After this command, I ended up with a total filesize of 267 megabytes. This is a lot closer to what I want!
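
As a gut check against the same 6827-second runtime: 150 kbit/s of video plus roughly 140 kbit/s of audio predicts

(150 + 140) kbit/s × 6827 s ÷ 8 ≈ 247,000 kB ≈ 247 MB

so 267 MB on disk is in the right ballpark, with the difference presumably being mostly MPEG-TS packaging overhead.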

Here's the ffprobe output on a random chunk:

Input #0, mpegts, from './two-pass-150/index341.ts':
  Duration: 00:00:08.39, start: 2768.080167, bitrate: 258 kb/s
  Program 1
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
  Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/reserved/reserved, progressive), 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 90k tbn
  Stream #0:1[0x101]: Audio: aac (LC) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 138 kb/s
Aoi is wut
<Aoi>

Huh, the audio is bigger than the video?

Yep! The video stream is now so optimized that the audio is the bigger concern. Here are two more sequential frames:

Frame 1 / Frame 2

Now this is how you know you've found a good balance between bitrate and visual quality. That random HLS chunk I copied was 265 kiloBYTES for about 8 seconds of video, which aligns with the bitrate estimate of 258 kb/sec that ffprobe gave us. This is also small enough that it's very efficient for bandwidth and for the way that XeDN's storage architecture works. That's just about twice as much as the jpg form of this post's cover image.

If you try to go with a lower bitrate, you'll start to find halos around text characters, and overall it just starts to look like YouTube dropping to 240p when your Comcast connection suddenly has a ping spike. It ends up looking like if you had one dollar for every pixel in the image, you'd have one dollar.

Cadey is coffee
<Cadey>

Ask me how I know.

Even deeper

At this point, we have hit the practical wall for how well the video can be encoded, but we still have a lot of room in the audio department. The video now takes up about 150 kilobits per second of data, but the audio still takes up about 128 kilobits per second. It would be nice if I could use something like Opus for the audio and AV1 for the video, but I can't use those yet. I could probably get away with making multiple encodings, one AV1 and one h.264, but that's a bit too complicated for my needs. Not to mention it would not help with the space-saving issue.

So, let's compress the audio. On the advice of a friend, I tried the libfdk_aac codec to get the audio down to 32 kilobits per second, but in a way that doesn't sound like stuffing a concert symphony through the phone lines. I looked at the documentation for how to use libfdk_aac and ran into a slight snag:

The license of libfdk_aac is not compatible with the GPL, so the GPL does not permit distribution of binaries containing incompatible code when GPL-licensed code is also included. Therefore this encoder has been designated as "non-free", and you cannot download a pre-built ffmpeg that supports it. This can be resolved by compiling ffmpeg yourself.

I checked, and the version of ffmpeg in my package manager was not built with libfdk_aac.

Aoi is sus
<Aoi>

Isn't compiling a custom version of ffmpeg a huge pain needing you to track down a billion dependencies?

Mara is happy
<Mara>

Normally, yes. But, we're using Nix!

One of the fundamental concepts of Nix is that package builds are functions: they take in an attribute set containing the build instructions, which gets passed to a builder function that compiles all that metadata down to a bash script run in a sandbox. Nix handles grabbing any dependencies and putting them in the $PATH of the build environment. It all Just Works™️.

There are two ways you can override how package builds work:

- .override, which changes the arguments passed to the package's function (useful for toggling feature flags that the package definition exposes)
- .overrideAttrs, which changes the attributes handed to the underlying derivation (useful for changing configure flags and build dependencies)

In this case, we need to change the configuration flags and build dependencies of ffmpeg, so we're going to opt for overriding the attributes of the build with the .overrideAttrs method that every package has available. Here's the override we need:

pkgs.ffmpeg_5-full.overrideAttrs (old: rec {
    configureFlags = old.configureFlags ++ [ "--enable-libfdk_aac" "--enable-nonfree" ];
    buildInputs = old.buildInputs ++ [ pkgs.fdk_aac ];
})

This overrides the buildInputs (the libraries and other dependencies the build needs, for things like openssl) and the configureFlags (the arguments passed to ./configure, the entry point for most C/C++ build systems) to add support for libfdk_aac. We can then slap that in a flake.nix file and make a devShell with that prepopulated in the environment:

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
    utils.url = "github:numtide/flake-utils";
  };

  outputs = { self, nixpkgs, utils }:
    utils.lib.eachDefaultSystem (system:
      let
        pkgs = import nixpkgs {
          inherit system;
          overlays = [ ];
        };

        ffmpeg = pkgs.ffmpeg_5-full.overrideAttrs (old: rec {
          configureFlags = old.configureFlags ++ [ "--enable-libfdk_aac" "--enable-nonfree" ];
          buildInputs = old.buildInputs ++ [ pkgs.fdk_aac ];
        });
      in rec { devShell = with pkgs; mkShell { buildInputs = [ ffmpeg ]; }; });
}
Mara is hacker
<Mara>

If you want an exercise to help you get more up to speed with Nix flake hacking, try changing this flake.nix file so that it exposes the custom ffmpeg build as a package.
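
Before kicking off a multi-hour encode, it's also worth confirming that the rebuilt ffmpeg in your nix develop shell actually has the encoder compiled in:

# Should print a line for the Fraunhofer encoder if the override worked:
ffmpeg -hide_banner -encoders | grep fdk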

Now the ffmpeg commands can look like this:

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 150k \
  -profile:v high \
  -preset slow \
  -tune animation \
  -pix_fmt yuv420p \
  -f mp4 \
  -pass 1 \
  -an \
  -y \
  /dev/null

ffmpeg \
  -i ./emacs-hacking.mkv \
  -b:v 150k \
  -profile:v high \
  -preset slow \
  -tune animation \
  -pix_fmt yuv420p \
  -start_number 0 \
  -hls_playlist_type vod \
  -hls_allow_cache 1 \
  -hls_time 0 \
  -hls_list_size 0 \
  -f hls \
  -c:a libfdk_aac \
  -profile:a aac_he_v2 \
  -b:a 32k \
  -pass 2 \
  ./two-pass-fdk/index.m3u8

This will let us end up with a 178 megabyte folder. ffprobe will let us confirm that we've won:

Input #0, mpegts, from './two-pass-fdk/index341.ts':
  Duration: 00:00:08.39, start: 2768.163000, bitrate: 146 kb/s
  Program 1
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
  Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/reserved/reserved, progressive), 1280x720 [SAR 1:1 DAR 16:9], 30 fps, 30 tbr, 90k tbn
  Stream #0:1[0x101]: Audio: aac (HE-AACv2) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 33 kb/s

This is probably the floor for getting reasonable-sounding audio. The end result is 89 megabytes of data per hour of stream footage. It should also be maximally compatible with devices and allow me to save money on bandwidth and storage. The chunks are about the size of 1-1.5 JPEG images too, which means they are very efficient for XeDN's storage/retrieval architecture.

Caveats

The only problem is that this is very optimized for my stream VODs in particular; if I wanted to encode Synth Riders gameplay, I'd need to vastly bump up the video and audio quality or it would look terrible. As an example, here's the Synth Riders gameplay that I recorded for the earlier frame comparisons:

Want to watch this in your video player of choice? Take this:
https://cdn.xeiaso.net/file/christine-static/blog/video-compression/synth-riders/index.m3u8
Cadey is coffee
<Cadey>

The note approach speed in that Synth Riders clip may look a bit fast at first glance, but it's actually a bit slow for me. It's a lot easier to get into flow with a faster approach speed, if only because it makes reading upcoming notes easier.

This allows me to crunch a 248 MB file down to 68 MB. I had to back off on the audio compression because the AAC codec profile I used earlier is tuned for human speech, an understandable thing to want for my streams of mostly human speech, but a poor fit for music. For that video in particular, I used these flags:

-c:a libfdk_aac -b:a 64k

This tells ffmpeg to use libfdk_aac as the audio codec and sets a bitrate of 64 kilobits per second. I don't specify a codec profile here because I want the default AAC-LC ("low complexity") profile. This stack of codecs gets you roughly the same audio experience that you get from Spotify or Apple Music (unless you enable lossless or Dolby Atmos audio), so it should be good enough.

If I had blindly reused my stream VOD settings instead, I'd end up with a 9 MB video that looks and sounds like this:

This one is a bit interesting, because before I did the compression I didn't think the audio would survive as well as it did at 64k. In the 9 MB version you can really hear the high ends of the song get absolutely destroyed (the codec profile I use for VTubing videos is optimized for human speech, and human speech doesn't have many high-end sounds), especially when compared to the previous recording.


Overall, I'm quite happy with the results. I have an off-ramp from YouTube and I can host all the video on my own infrastructure. For most users there's no difference between this solution and something hosted on YouTube (save the lack of a view count, comments, and other anti-features like the like/dislike buttons). I'm pretty sure this will meet my needs and allow me to self-host the things I create.


Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.

Tags: ffmpeg, scalability, xedn, nix