How I VTuberRead time in minutes: 15
If you've watched tech talks I've done and any of my Twitch streams recently, you probably have noticed that I don't use a webcam for any of them. Well, technically I do, but that webcam view shows an anime looking character. This is because I am a VTuber. I use software that combines 3d animation and motion capture technology instead of a webcam. This allows me to have a unique presentation experience and helps me stand out from all the other people that create technical content.
This also makes it so much easier to edit videos because of the fact that the face on the avatar I use isn't too expressive. This allows me to do multiple takes of a single paragraph in the same recording because I can reset the face to neutral and you will not be able to see the edit happen unless you look really closely at my head position.
Version 1.x: Dabbling in Experiments
Some of the best things in life start as the worst mistakes imaginable and the people responsible could never really see them coming. This all traces back to my boss buying everyone an Oculus Quest 2 last year.
This got me to play around with things and see what I could do. I found out that I could use it with my PC using Virtual Desktop. This opened a whole new world of software to me. The Quest 2 is no slouch, but it's not entirely a supercomputer either. However, my gaming PC is better than the Quest 2 at GPU muscle.
One of the main things I started playing with was VRChat. VRChat is the IMVU for VR. You pick an avatar, you go into a world with some friends and you hang out. This was a godsend as the world was locked down all throughout 2020. I hadn't really gotten to talk with my friends very much, and VRChat allowed us to have an experience doing it more than a giant Zoom call or group chat in Discord.
One of the big features of VRChat is the in-game camera. The in-game camera functions like an actual physical camera and lets you also enable a mode where that camera controls the view that the VRChat desktop window renders. This mode became the focus of my research and experimentation for the next few weeks.
With this and OBS' Webcam Emulation Support, I could make the world in VRChat render out to a webcam which could then be picked up by Google Meet.
The only major problem with this was the avatar I was using. I didn't really have a good avatar then. I was drifting between freely available models. Then I found the one that I used as a base to get my way to the one I am using now.
Version 1.x was only ever really used experimentally and never used anywhere publicly.
Version 2.x: VRChat and Wireless VR
I mentioned above that I did VR wirelessly but didn't go into much detail about how much of an excruciating, mind-numbing pain it was. It was an excruciating, mind-numbingly painful thing to set up. At the time my only real options for this were ALVR and Virtual Desktop. A friend was working on ALVR so that's what I decided to use first.
ALVR isn't on the Oculus store, so I had to use SideQuest to sideload the ALVR application on my headset. I did this by creating a developer account on the Oculus store to unlock developer mode on my headset (if you do this in the future, you will need to have bought something from the store in order to activate developer mode) and then flashed the apk onto the headset.
I set up the PC software and fired up VRChat. The most shocking thing to me at the time was that it all worked. I was able to play VRChat without having to be wired up to the PC.
Then I realized how bad the latency was. A lot of this can be traced down to how Wi-Fi as a protocol works. Wi-Fi (and by extension all other wireless protocols) are built on shouting. Wi-Fi devices shout out everywhere and hope that the access point can hear it. The access point shouts back and hopes that the Wi-Fi devices can hear it. The advantage of this is that you can have your phone anywhere within shouting range and you'll be able to get a victory royale in Fortnite, or whatever it is people do with phones these days.
The downside of Wi-Fi being based on shouting is that only one device can shout at a time, and latency is critical for VR to avoid motion sickness. Even though these packets are pretty small, the overhead for them is not zero, so lots of significant Wi-Fi traffic on the same network (or even interference from your neighbors that have like a billion Wi-Fi hotspots named almost identical things even though it's an apartment and doing that makes no sense but here we are) can totally tank your latency.
However it does work...mostly.
It was good enough to get me started. I was able to use it in work calls and one of my first experiences with it was my first 1:1 with a poor intern that had a difficult to describe kind of flabbergasted expression on his face once the call connected.
By now I had found an avatar model and was getting it customized to look a bit more business casual. I chose a model based on a jRPG character and have been customizing it to meet my needs (and as I learn how to desperately glue together things in Unity).
During this process I was able to get a Valve Index second-hand off a friend in IRC. The headset was like new (I just now remembered that I bought it used as I was writing this article) and it allowed me to experience low-latency PC VR in its true form. I had used my husband's Vive a bit, but this was the first time that it really stuck for me.
It also ruined me horribly and now going back to wireless VR via Wi-Fi is difficult because I can't help but notice the latency. I am ruined.
Version 3.x: VRM and VSeeFace
Doing all this with a VR headset works, but it really does get uncomfortable and warm after a while. Strapping a display to your head makes your head get surprisingly warm after a while. It can also be slightly claustrophobic at times. Not to mention the fact that VR eats up all the system resources trying to render things into your face at 120 frames per second as consistently as possible.
Other VTubers on Twitch and YouTube don't always use VR headsets for their streams though. They use software that attempts to pick out their face from a webcam and then attempts to map changes in that face to a 2d/3d model. After looking over several options, I arbitrarily chose VSeeFace. When I have it all set up with my VRM model that I converted from VRChat, the VSeeFace UI looks something like this:
The green point cloud you see on the left of this is the data that VSeeFace is inferring from the webcam data. It uses that to pick out a small set of animations for my avatar to do. This only really tracks a few sound animations (the sounds of vowels "A", "I", "U", "E", "O") and some emotions ("fun", "angry", "joy", "sorrow", "surprised").
This is enough to create a reasonable facsimile of speech. It's not perfect. It really could be a lot better, but it is very cheap to calculate and leaves a lot of CPU headroom for games and other things.
It works though. It's enough for Twitch, my coworkers and more to appreciate it. I'm gonna make it better in the future, but I'm very, very happy with the progress I've made so far with this.
Especially seeing as I have no idea what I am doing with Unity, Blender and other such programs.
Right now my VTubing setup doesn't have a way for me to track my hands. I tend to emote with my hands when I am explaining things. When I am doing that on stream with the VTubing setup, I feel like an idiot. VMagicMirror would let me do hand tracking with my webcam, but I may end up getting a Leap Motion to do hand tracking with VSeeFace. Most of the other VTubing scene seems to have Leap Motions for hand tracking, so I may follow along there.
I want to use this for a conference talk directly related to my employer. I have gotten executive signoff for doing this, so it shouldn't be that hard assuming I can find a decent subject to talk about.
I officially double dare you— apenwarr (@apenwarr) December 30, 2021
I also want to make the model a bit more expressive than it currently is. I am limited by the software I use, so I may have to make my own, but for something that is largely a hackjob I'm really happy with this experience.
Right now my avatar is very, very unoptimized. I want to figure out how to make it a lot more optimized so that I can further reduce GPU load on my machine rendering it. Less GPU for the avatar means more GPU for games.
I also want to create a conference talk stage thing that I can use to give talks on and record the results more easily in higher resolution and detail. I'm very much in early research stages for it, but I'm calling it "Bigstage". If you see me talking about that online, that's what I'm referring to.
starting to draw out the design for Bigstage (a VR based conference stage for me to prerecord talk videos on pic.twitter.com/n8osEv9BQI— Xe Iaso (@theprincessxena) December 14, 2021
I hope this was an amusing trip through all of the things I use to make my VTubing work. Or at least pretend to work. I'm doing my best to make sure that I document things I learn in forms that are not badly organized YouTube tutorials. I have a few things in the pipeline and will stream writing them on Twitch when they are ready to be fully written out.
This post was written live on Twitch. You can catch the VOD on Twitch here. If the Twitch link 404's, you can catch the VOD on YouTube here. The YouTube link will not be live immediately when this post is, but when it is up on Saturday January 15th, you should be able to watch it there to your heart's content.
My favorite chat message from the stream was this:
kouhaidev: I guess all of the cool people are using nix