Anything can be a message queue if you use it wrongly enough
4993 words, 19 minutes to read
Hi, readers! This post is satire. Don't treat it as something that is viable for production workloads. By reading this post you agree to never implement or use this accursed abomination. This article is released to the public for educational reasons. Please do not attempt to recreate any of the absurd acts referenced here.
You may think that the world is in a state of relative peace. Things look like they are somewhat stable, but reality couldn't be farther from the truth. There is an enemy out there that transcends time, space, logic, reason, and lemon-scented moist towelettes. That enemy is a scourge of cloud costs that is likely the single reason why startups die from their cloud bills when they are so young.
The enemy is Managed NAT Gateway. It is a service that lets you egress traffic from a VPC to the public internet at $0.07 per gigabyte. This is something that is probably literally free for them to run, but it ends up eating a huge chunk of their customers' cloud spend. Customers don't even look too deeply into this because they just shrug it off as the cost of doing business.
This one service has allowed companies like The Duckbill Group to make millions by showing companies how to not spend as much on the cloud.
However, I think I can do one better. What if there was a better way? What if you could reduce that cost for your own services by up to 700%? What if you could bypass those pesky network egress costs yet still contact your machines over normal IP packets?
Really, if you are trying to avoid Managed NAT Gateway in production for egress-heavy workloads (such as webhooks that need to come from a common IP address), you should be using a Tailscale exit node with a public IPv4/IPv6 address attached to it. If you also attach this node to the same VPC as your webhook egress nodes, you can basically recreate Managed NAT Gateway at home. You also get the added benefit of encrypting your traffic further on the wire. This is the only thing in this article that you can safely copy into your production workloads.
Base facts
Before I go into more detail about how this genius creation works, here are some things to consider:
When AWS launched originally, it had three services:
- S3 - Object storage for cloud-native applications
- SQS - A message queue
- EC2 - A way to run Linux virtual machines somewhere
Of those foundational services, I'm going to focus the most on S3: the Simple Storage Service. In essence, S3 is malloc() for the cloud.
The C programming language
When using the C programming language, you are normally working with memory on the stack. This memory is almost always semi-ephemeral, and all of the contents of the stack are no longer reachable (and presumably overwritten) when you exit the current function. You can do many things with this, but it turns out that this isn't very useful in practice. To work around this (and reliably pass mutable values between functions), you need to use the malloc() function. malloc() takes the number of bytes you want to allocate and returns a pointer to the region of memory that was allocated.
Huh? That seems a bit easy for C. Can't allocating memory fail when there's no more free memory to allocate? How do you handle that?
When there's no free memory left to allocate, malloc() returns NULL. What your program does from there is up to it, which in practice is implementation-defined.
Isn't "implementation-defined" code for "it'll probably crash"?
In many cases: yes, most of the time it will crash. Hard. Some applications are smart enough to handle this more gracefully (i.e., try to free memory or run a garbage collection pass), but in many cases it doesn't really make sense to do anything but crash the program.
Oh. Good. Just what I wanted to hear.
When you get a pointer back from malloc(), you can store anything in there as long as it's the same length as you passed or less.
Fun fact: if you overwrite the bounds you passed to malloc() and anything involved in the memory you are writing is user input, congratulations: you just created a way for a user to either corrupt internal application state or gain arbitrary code execution. A similar technique is used in The Legend of Zelda: Ocarina of Time speedruns in order to get arbitrary code execution via Stale Reference Manipulation.
Oh, also anything stored in that pointer to memory you got back from malloc() lives in an area of RAM called "the heap", which is moderately slower to access than the stack.
S3 in a nutshell
Much in the same way, S3 lets you allocate space for and submit arbitrary bytes to the cloud, then fetch them back with an address. It's a lot like the malloc() function for the cloud. You can put bytes there and then refer to them between cloud functions.
The bytes are stored in the cloud, which is slightly slower to read from than it would be to read data out of the heap.
And these arbitrary bytes can be anything. S3 is usually used for hosting static assets (like all of the conversation snippet avatars that a certain website with an orange background hates), but nothing is stopping you from using it to host literally anything you want. Logging things into S3 is so common it's literally a core product offering from Amazon. Your billing history goes into S3. If you download your tax returns from WealthSimple, it's probably downloading the PDF files from S3. VRChat avatar uploads and downloads are done via S3.
It's like an FTP server but you don't have to care about running out of disk space on the FTP server!
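To make the malloc() analogy concrete, here's roughly what the allocate/dereference cycle looks like with the AWS SDK for Go v2. This is a sketch of the primitive, not Hoshino's actual code; the bucket name and object key are placeholders.

// A minimal sketch of "S3 is malloc() for the cloud" using the AWS SDK for
// Go v2. The bucket name and key are placeholders for illustration.
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// "Allocate" some bytes in the cloud.
	payload := []byte("arbitrary bytes, could be a cat photo or an IPv6 packet")
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("my-cursed-bucket"), // placeholder bucket name
		Key:    aws.String("scratch/hello"),    // the "pointer" you get back
		Body:   bytes.NewReader(payload),
	})
	if err != nil {
		log.Fatal(err)
	}

	// "Dereference" them from anywhere else that has credentials.
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("my-cursed-bucket"),
		Key:    aws.String("scratch/hello"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Body.Close()

	data, err := io.ReadAll(out.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d bytes back from the cloud heap\n", len(data))
}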
IPv6
You know what else is bytes? IPv6 packets. When you send an IPv6 packet to a destination on the internet, the kernel will prepare and pack a bunch of bytes together to let the destination and intermediate hops (such as network routers) know where the packet comes from and where it is destined to go.
Normally, IPv6 packets are handled by the kernel and submitted to a queue for a hardware device to send out over some link to the Internet. This works for the majority of networks because they deal with hardware dedicated for slinging bytes around, or in some cases shouting them through the air (such as when you use Wi-Fi or a mobile phone's networking card).
Wait, did you just say that Wi-Fi is powered by your devices shouting at each other?
Yep! Wi-Fi signal strength is measured in decibels even!
Wrong. Wi-Fi is more accurately light, not sound. It is much more accurate to say that the devices are shining at each other. Wi-Fi is the product of radio waves, which are the same thing as light (just at such a low frequency that you can't see it). Boom. Roasted.
The core Unix philosophy: everything is a file
There is a way to bypass this and have software control how network links work, and for that we need to think about Unix conceptually for a second. In the hardcore Unix philosophical view: everything is a file. Hard drives and storage devices are files. Process information is viewable as files. Serial devices are files. This core philosophy is rooted at the heart of just about everything in Unix and Linux systems, which makes it a lot easier for applications to be programmed. The same API can be used for writing to files, tape drives, serial ports, and network sockets. This makes everything a lot conceptually simpler and reusing software for new purposes trivial.
As an example of this, consider the tar command. The name tar stands for "Tape ARchive". It was a format created for writing backups to actual magnetic tape drives. Most commonly, it's used to download source code from GitHub or as an interchange format for downloading software packages (or other things that need to put multiple files in one distributable unit).
In Linux, you can create a TUN/TAP device to let applications control how network or datagram links work. In essence, it lets you create a file descriptor that you can read packets from and write packets to. As long as you get the packets to their intended destination somehow and get any other packets that come back to the same file descriptor, the implementation isn't relevant. This is how OpenVPN, ZeroTier, FreeLAN, Tinc, Hamachi, WireGuard and Tailscale work: they read packets from the kernel, encrypt them, send them to the destination, decrypt incoming packets, and then write them back into the kernel.
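Here's a minimal sketch of what that file descriptor looks like from Go. The post doesn't say which TUN library Hoshino uses, so this assumes the github.com/songgao/water package; the point is just that packets arrive as plain byte slices.

// A sketch of reading raw packets from a TUN device in Go, assuming the
// github.com/songgao/water package. Every packet is just a byte slice, and
// whatever you do with it (write it to a socket, a file, or an S3 bucket)
// is up to you, as long as replies eventually get written back.
package main

import (
	"log"

	"github.com/songgao/water"
)

func main() {
	// Create a TUN device. The kernel will hand us whole IP packets as bytes.
	ifce, err := water.New(water.Config{DeviceType: water.TUN})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created TUN device %s", ifce.Name())

	buf := make([]byte, 65535) // big enough for the largest possible IP packet
	for {
		n, err := ifce.Read(buf)
		if err != nil {
			log.Fatal(err)
		}

		pkt := buf[:n]
		log.Printf("read a %d byte packet from the kernel", len(pkt))

		// A normal VPN would encrypt pkt and send it over UDP here.
		// Writing a reply back into the kernel is the mirror image:
		//   ifce.Write(replyPacket)
		_ = pkt
	}
}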
In essence
So, putting this all together:
- S3 is malloc() for the cloud, allowing you to share arbitrary sequences of bytes between consumers.
- IPv6 packets are just bytes like anything else.
- TUN devices let you have arbitrary application code control how packets get to network destinations.
In theory, all you'd need to do to save money on your network bills would be to read packets from the kernel, write them to S3, and then have another loop read packets from S3 and write those packets back into the kernel. All you'd need to do is wire things up in the right way.
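Structurally, that wiring is just two loops. The four helpers in this sketch are stubs standing in for the TUN and S3 plumbing sketched elsewhere in this post; only the shape is the point.

// A structural sketch of the wiring: one goroutine shovels outgoing packets
// from the kernel into S3, another polls S3 and writes incoming packets back
// into the kernel. The helpers are stubs; the real readFromTun would block
// until a packet arrives.
package main

import (
	"context"
	"time"
)

// Stubs for illustration only; real implementations would wrap a TUN device
// and an S3 client.
func readFromTun() []byte                             { return nil }
func writeToTun(pkt []byte)                           {}
func putPacketToS3(ctx context.Context, pkt []byte)   {}
func pollPacketsFromS3(ctx context.Context) [][]byte  { return nil }

func main() {
	ctx := context.Background()

	go func() { // egress loop: kernel -> S3
		for {
			pkt := readFromTun()
			putPacketToS3(ctx, pkt)
		}
	}()

	go func() { // ingress loop: S3 -> kernel
		for {
			for _, pkt := range pollPacketsFromS3(ctx) {
				writeToTun(pkt)
			}
			time.Sleep(500 * time.Millisecond) // fixed poll interval for the sketch; the real thing adapts this
		}
	}()

	select {} // block forever
}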
So I did just that.
Here's some of my friends' reactions to that list of facts:
- I feel like you've just told me how to build a bomb. I can't believe this actually works but also I don't see how it wouldn't. This is evil.
- It's like using a warehouse like a container ship. You've put a warehouse on wheels.
- I don't know what you even mean by that. That's a storage method. Are you using an extremely generous definition of "tunnel"?
- sto psto pstop stopstops
- We play with hypervisors and net traffic often enough that we know that this is something someone wouldn't have thought of.
- Wait are you planning to actually implement and use ipv6 over s3?
- We're paying good money for these shitposts :)
- Is routinely coming up with cursed ideas a requirement for working at tailscale?
- That is horrifying. Please stop torturing the packets. This is a violation of the Geneva Convention.
- Please seek professional help.
Before any of you ask, yes, this was the result of a drunken conversation with Corey Quinn.
Hoshino
Hoshino is a system for putting outgoing IPv6 packets into S3 and then reading incoming IPv6 packets out of S3 in order to avoid the absolute dreaded scourge of Managed NAT Gateway. It is a travesty of a tool that does work, if only barely.
The name is a reference to the main character of the anime Oshi no Ko, Hoshino Ai. Hoshino is an absolute genius who works as a pop idol for the group B-Komachi.
Hoshino is a shockingly simple program. It creates a TUN device, configures the OS networking stack so that programs can use it, and then starts up two threads to handle reading packets from the kernel and writing packets into the kernel.
When it starts up, it creates a new TUN device named either hoshino0 or an administrator-defined name given via a command-line flag. This interface is only intended to forward IPv6 traffic.
Each node derives its IPv6 address from the machine-id of the system it's running on. This means that you can somewhat reliably guarantee that every node on the network has a unique address that you can easily guess (the provided ULA /64 and then the first half of the machine-id in hex). Future improvements may include publishing these addresses into DNS via Route 53.
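A sketch of that derivation, following the description above (the ULA /64 prefix plus the first half of machine-id as the interface identifier). The prefix here is the one from the example network; the exact parsing in Hoshino may differ in detail.

// Derive a node address from /etc/machine-id: take a ULA /64 prefix and
// append the first 8 bytes of the machine-id as the interface identifier.
package main

import (
	"encoding/hex"
	"fmt"
	"log"
	"net/netip"
	"os"
	"strings"
)

func main() {
	raw, err := os.ReadFile("/etc/machine-id")
	if err != nil {
		log.Fatal(err)
	}

	// machine-id is 32 hex characters (16 bytes); the first half becomes the
	// interface identifier.
	id, err := hex.DecodeString(strings.TrimSpace(string(raw)))
	if err != nil || len(id) < 8 {
		log.Fatal("machine-id doesn't look like 16 bytes of hex")
	}

	prefix := netip.MustParsePrefix("fd5e:59b8:f71d:9a3e::/64") // the network's ULA /64
	addr := prefix.Addr().As16()
	copy(addr[8:], id[:8]) // bottom 64 bits come from the first half of machine-id

	fmt.Println(netip.AddrFrom16(addr))
}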
When it configures the OS networking stack with that address, it uses a netlink socket. Netlink is a Linux-specific socket family that allows userspace applications to configure the network stack, communicate with the kernel, and communicate between processes. Netlink sockets cannot leave the host they are created on, and unlike Unix sockets (which are addressed by filesystem paths), Netlink sockets are addressed by process ID numbers.
In order to configure the hoshino0 device with Netlink, Hoshino does the following things (see the sketch after this list):
- Adds the node's IPv6 address to the hoshino0 interface
- Enables the hoshino0 interface so it can be used by the kernel
- Adds a route to the IPv6 subnet via the hoshino0 interface
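Here's what those three steps look like as a sketch. The post doesn't name the Go netlink library, so this assumes github.com/vishvananda/netlink, with placeholder addresses.

// A sketch of the three Netlink steps above, assuming the
// github.com/vishvananda/netlink package. Addresses are placeholders.
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	link, err := netlink.LinkByName("hoshino0")
	if err != nil {
		log.Fatal(err)
	}

	// 1. Add the node's IPv6 address to the hoshino0 interface.
	addr, err := netlink.ParseAddr("fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f/64")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.AddrAdd(link, addr); err != nil {
		log.Fatal(err)
	}

	// 2. Bring the interface up so the kernel will actually use it.
	if err := netlink.LinkSetUp(link); err != nil {
		log.Fatal(err)
	}

	// 3. Route the ULA /64 via hoshino0.
	_, dst, err := net.ParseCIDR("fd5e:59b8:f71d:9a3e::/64")
	if err != nil {
		log.Fatal(err)
	}
	route := &netlink.Route{LinkIndex: link.Attrs().Index, Dst: dst}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatal(err)
	}
}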
Then it configures the AWS API client and kicks off both of the main loops that handle reading packets from and writing packets to the kernel.
When uploading packets to S3, the key for each packet is derived from the destination IPv6 address (parsed from outgoing packets using the handy library gopacket) and the packet's unique ID (a ULID to ensure that packets are lexicographically sortable, which will be important to ensure in-order delivery in the other loop).
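As a sketch, that key derivation might look something like this. The exact key layout isn't spelled out in this post, so the "destination/ULID" format and the helper name are assumptions for illustration.

// Derive an S3 key from an outgoing packet: parse the IPv6 destination with
// gopacket, then append a ULID so keys for the same destination sort in
// roughly the order they were created. The key layout is an assumption.
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/oklog/ulid/v2"
)

func keyForPacket(raw []byte) (string, error) {
	pkt := gopacket.NewPacket(raw, layers.LayerTypeIPv6, gopacket.Default)
	ip6Layer := pkt.Layer(layers.LayerTypeIPv6)
	if ip6Layer == nil {
		return "", fmt.Errorf("not an IPv6 packet")
	}
	ip6 := ip6Layer.(*layers.IPv6)

	// ULIDs embed a timestamp and are lexicographically sortable, which is
	// what gives the receiving side a shot at in-order delivery.
	id := ulid.Make()

	return fmt.Sprintf("%s/%s", ip6.DstIP, id), nil
}

func main() {
	// A raw outgoing packet would normally come from the TUN device; left
	// empty here, so this just exercises the error path.
	var raw []byte
	key, err := keyForPacket(raw)
	if err != nil {
		log.Println("skipping packet:", err)
		return
	}
	log.Println("would PutObject under key:", key)
}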
When packets are processed, they are added to a bundle for later processing by the kernel. This is relatively boring code and understanding it is mostly an exercise for the reader. bundler is based on the Google package bundler, but modified to use generic types because the original implementation of bundler predates them.
cardio
However, the last major part of understanding the genius at play here is the use of cardio. cardio is a Go utility that gives you a "heartbeat" for events that should happen every so often, while also letting you influence the rate based on need. This lets you speed up the rate when there is more work to be done (such as when packets are found in S3) and reduce the rate when there is no more work to be done (such as when no packets are found in S3).
Okay, this is also probably something that you can use outside of this post, but I promise there won't be any more of these!
When using cardio, you create the heartbeat channel and signals like this:
heartbeat, slower, faster := cardio.Heartbeat(ctx, time.Minute, time.Millisecond)
The first argument to cardio.Heartbeat is a context that lets you cancel the heartbeat loop. Additionally, if your application uses ln's opname facility, an expvar gauge will be created and named after that operation name.
The next two arguments are the minimum and maximum heart rate. In this example, the heartbeat would range between once per minute and once per millisecond.
When you signal the heart rate to speed up, it will double the rate. When you trigger the heart rate to slow down, it will halve the rate. This will enable applications to spike up and gradually slow down as demand changes, much like how the human heart will speed up with exercise and gradually slow down as you stop exercising.
When the heart rate is too high for the amount of work needed to be done (such as when the heartbeat is too fast, much like tachycardia in the human heart), it will automatically back off and signal the heart rate to slow down (much like I wish would happen to me sometimes).
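As a rough usage sketch, a poll loop driving cardio might look like the following. I'm assuming the heartbeat is a channel you range over, that slower and faster are plain functions, and that the package lives at within.website/x/cardio; treat the types and import path as assumptions rather than documented API. checkForWork is a hypothetical stand-in for the S3 list call.

// A hedged sketch of a cardio-driven poll loop, based on the description
// above. The return types of cardio.Heartbeat are assumed here.
package main

import (
	"context"
	"log"
	"time"

	"within.website/x/cardio" // assumed import path
)

func checkForWork(ctx context.Context) bool { return false } // hypothetical: list S3 and see if packets showed up

func main() {
	ctx := context.Background()

	// Heart rate ranges from once a minute (idle) to once a millisecond (busy).
	heartbeat, slower, faster := cardio.Heartbeat(ctx, time.Minute, time.Millisecond)

	for range heartbeat {
		if checkForWork(ctx) {
			faster() // packets are showing up, poll more aggressively
		} else {
			slower() // nothing to do, back off and save on S3 list calls
			log.Println("no packets waiting")
		}
	}
}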
This is a package that I always wanted to have exist, but never found the need to write for myself until now.
Terraform
Like any good recovering SRE, I used Terraform to automate creating IAM users and security policies for each of the nodes on the Hoshino network. This also was used to create the S3 bucket. Most of the configuration is fairly boring, but I did run into an issue while creating the policy documents that I feel is worth pointing out here.
I made the "create a user account and policies for that account" logic into a Terraform module because that's how you get functions in Terraform. It looked like this:
data "aws_iam_policy_document" "policy" {
statement {
actions = [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
]
effect = "Allow"
resources = [
var.bucket_arn,
]
}
statement {
actions = ["s3:ListAllMyBuckets"]
effect = "Allow"
resources = ["*"]
}
}
When I tried to use it, things didn't work. I had given it permission to write to and read from the bucket, but I was being told that I didn't have permission to do either operation. The reason this happened is that my statement allowed actions on the bucket itself, but not on any path INSIDE the bucket: object-level actions like GetObject and PutObject are checked against object ARNs, not the bucket ARN. In order to fix this, I needed to make my policy statement look like this:
statement {
  actions = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket",
  ]
  effect = "Allow"
  resources = [
    var.bucket_arn,
    "${var.bucket_arn}/*", # allow every file in the bucket
  ]
}
This does let you do a few cool things though: you can use it to create per-node credentials in IAM that can only write logs to their own part of the bucket in particular. I can easily see how this gives you infinite flexibility in what you want to do, but good lord was it inconvenient to find this out the hard way.
Terraform also configured the lifecycle policy for objects in the bucket to delete them after a day.
resource "aws_s3_bucket_lifecycle_configuration" "hoshino" {
bucket = aws_s3_bucket.hoshino.id
rule {
id = "auto-expire"
filter {}
expiration {
days = 1
}
status = "Enabled"
}
}
If I could, I would set this to a few hours at most, but the minimum granularity for S3 lifecycle enforcement is in days. In a loving world, this should be a sign that I am horribly misusing the product and should stop. I did not stop.
The horrifying realization that it works
Once everything was implemented and I fixed the last bugs related to the efforts to make Tailscale faster than kernel wireguard, I tried to ping something. I set up two virtual machines with waifud and installed Hoshino. I configured their AWS credentials and then started it up. Both machines got IPv6 addresses and they started their loops. Nervously, I ran a ping command:
xe@river-woods:~$ ping fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f
PING fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f(fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f) 56 data bytes
64 bytes from fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f: icmp_seq=1 ttl=64 time=2640 ms
64 bytes from fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f: icmp_seq=2 ttl=64 time=3630 ms
64 bytes from fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f: icmp_seq=3 ttl=64 time=2606 ms
It worked. I successfully managed to send ping packets over Amazon S3. At the time, I was in an airport dealing with the aftermath of Air Canada's IT system falling the heck over and the sheer feeling of relief I felt was better than drugs.
Sometimes I wonder if I'm an adrenaline junkie for the unique feeling that you get when your code finally works.
Then I tested TCP. Logically, if ping packets work, then TCP should too. It would be slow, but nothing in theory would stop it. I decided to test my luck and tried to open the other node's metrics page:
$ curl http://[fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f]:8081
# skipping expvar "cmdline" (Go type expvar.Func returning []string) with undeclared Prometheus type
go_version{version="go1.20.4"} 1
# TYPE goroutines gauge
goroutines 208
# TYPE heartbeat_hoshino.s3QueueLoop gauge
heartbeat_hoshino.s3QueueLoop 500000000
# TYPE hoshino_bytes_egressed gauge
hoshino_bytes_egressed 3648
# TYPE hoshino_bytes_ingressed gauge
hoshino_bytes_ingressed 3894
# TYPE hoshino_dropped_packets gauge
hoshino_dropped_packets 0
# TYPE hoshino_ignored_packets gauge
hoshino_ignored_packets 98
# TYPE hoshino_packets_egressed gauge
hoshino_packets_egressed 36
# TYPE hoshino_packets_ingressed gauge
hoshino_packets_ingressed 38
# TYPE hoshino_s3_read_operations gauge
hoshino_s3_read_operations 46
# TYPE hoshino_s3_write_operations gauge
hoshino_s3_write_operations 36
# HELP memstats_heap_alloc current bytes of allocated heap objects (up/down smoothly)
# TYPE memstats_heap_alloc gauge
memstats_heap_alloc 14916320
# HELP memstats_total_alloc cumulative bytes allocated for heap objects
# TYPE memstats_total_alloc counter
memstats_total_alloc 216747096
# HELP memstats_sys total bytes of memory obtained from the OS
# TYPE memstats_sys gauge
memstats_sys 57625662
# HELP memstats_mallocs cumulative count of heap objects allocated
# TYPE memstats_mallocs counter
memstats_mallocs 207903
# HELP memstats_frees cumulative count of heap objects freed
# TYPE memstats_frees counter
memstats_frees 176183
# HELP memstats_num_gc number of completed GC cycles
# TYPE memstats_num_gc counter
memstats_num_gc 12
process_start_unix_time 1685807899
# TYPE uptime_sec counter
uptime_sec 27
version{version="1.42.0-dev20230603-t367c29559-dirty"} 1
I was floored. It works. The packets were sitting there in S3, and I was able to pluck out the TCP response, open it with xxd, and confirm the source and destination addresses:
00000000: 6007 0404 0711 0640
00000008: fd5e 59b8 f71d 9a3e
00000010: c05f 7f48 de53 428f
00000018: fd5e 59b8 f71d 9a3e
00000020: 59e5 5085 744d 4a66
It was fd5e:59b8:f71d:9a3e:c05f:7f48:de53:428f replying to fd5e:59b8:f71d:9a3e:59e5:5085:744d:4a66.
Wait, if this is just putting stuff into S3, can't you do deep packet inspection with Lambda by using the workflow for automatically generating thumbnails?
Yep! This would let you do it fairly trivially, even. I'm not sure how you would prevent things from getting through, but you could have your lambda handler funge a TCP packet to either side of the connection with the RST flag set (RFC 793: Transmission Control Protocol, the RFC that defines TCP, page 36, section "Reset Generation"). That could let you kill connections that meet unwanted criteria, at the cost of having to invoke a lambda handler. I'm pretty sure this is RFC-compliant, but I'm a shitposter, not the network police.
Oh. I see.
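For the morbidly curious, here's a purely hypothetical sketch of that Lambda handler: an S3 object-created event fires, the handler fetches the packet, and if it's TCP and matches some placeholder "unwanted" criteria it forges an RST (in the spirit of RFC 793's Reset Generation rules) and drops it back into the bucket for the other side to pick up. The bucket layout, key naming, and policy check are all made up for illustration.

// A hypothetical "deep packet inspection" Lambda: triggered by S3 object
// creation, it reads the packet, and if it's an unwanted TCP-over-IPv6
// connection it forges an RST and writes it back into the bucket.
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/oklog/ulid/v2"
)

func isUnwanted(ip6 *layers.IPv6, tcp *layers.TCP) bool {
	// Placeholder policy: kill anything headed to port 8081.
	return tcp.DstPort == 8081
}

func handler(ctx context.Context, evt events.S3Event) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := s3.NewFromConfig(cfg)

	for _, rec := range evt.Records {
		bucket, key := rec.S3.Bucket.Name, rec.S3.Object.Key

		obj, err := client.GetObject(ctx, &s3.GetObjectInput{
			Bucket: aws.String(bucket), Key: aws.String(key),
		})
		if err != nil {
			return err
		}
		raw, err := io.ReadAll(obj.Body)
		obj.Body.Close()
		if err != nil {
			return err
		}

		pkt := gopacket.NewPacket(raw, layers.LayerTypeIPv6, gopacket.Default)
		ip6Layer, tcpLayer := pkt.Layer(layers.LayerTypeIPv6), pkt.Layer(layers.LayerTypeTCP)
		if ip6Layer == nil || tcpLayer == nil {
			continue // not a TCP-over-IPv6 packet, nothing to do
		}
		ip6, tcp := ip6Layer.(*layers.IPv6), tcpLayer.(*layers.TCP)
		if !isUnwanted(ip6, tcp) {
			continue
		}

		// Forge an RST back at the sender: swap addresses and ports, set RST.
		rstIP := &layers.IPv6{
			Version: 6, NextHeader: layers.IPProtocolTCP, HopLimit: 64,
			SrcIP: ip6.DstIP, DstIP: ip6.SrcIP,
		}
		rstTCP := &layers.TCP{
			SrcPort: tcp.DstPort, DstPort: tcp.SrcPort,
			Seq: tcp.Ack, Ack: tcp.Seq + 1, RST: true, ACK: true,
		}
		rstTCP.SetNetworkLayerForChecksum(rstIP)

		buf := gopacket.NewSerializeBuffer()
		opts := gopacket.SerializeOptions{FixLengths: true, ComputeChecksums: true}
		if err := gopacket.SerializeLayers(buf, opts, rstIP, rstTCP); err != nil {
			return err
		}

		// Drop the RST into the bucket keyed by the original sender's address.
		rstKey := fmt.Sprintf("%s/%s", ip6.SrcIP, ulid.Make())
		_, err = client.PutObject(ctx, &s3.PutObjectInput{
			Bucket: aws.String(bucket), Key: aws.String(rstKey),
			Body: bytes.NewReader(buf.Bytes()),
		})
		if err != nil {
			return err
		}
		log.Printf("killed a connection with prejudice: %s", key)
	}
	return nil
}

func main() { lambda.Start(handler) }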
Wait, how did you have 1.8 kilobytes of data in that packet? Aren't packets usually smaller than that?
When dealing with networking hardware, you can sometimes get frames (the networking hardware equivalent of a packet) to be up to 9000 bytes with jumbo frames, but if your hardware does support jumbo frames then you can usually get away with 9216 bytes at max.
It's over nine-
Yes dear, it's over 9000. Do keep in mind that we aren't dealing with physical network equipment here, so realistically our packets can be up to the limit of the IPv6 packet header format: the oddly specific number of 65535 bytes. This is configured by the Maximum Transmission Unit at the OS level (though usually this defines the limit for network frames and not IP packets). Regardless, Hoshino defaults to an MTU of 53049, which should allow you to transfer a bunch of data in a single S3 object.
Cost analysis
When you count only network traffic costs, the architecture has many obvious advantages. Access to S3 is zero-rated in many cases, but the real advantage comes when you are using this cross-region. This lets you have a worker in us-east-1 communicate with another worker in us-west-1 without having to incur the high per-gigabyte bandwidth cost of Managed NAT Gateway.
However, when you count all of the S3 operations (up to one every millisecond), Hoshino is hilariously more expensive because of simple math you can do on your own napkin at home.
For the sake of argument, consider the case where an idle node is sitting there and polling S3 for packets. This will happen at the minimum poll rate of once every 500 milliseconds. There are 24 hours in a day. There are 60 minutes in an hour. There are 60 seconds in a minute. There are 1000 milliseconds in a second. This means that each node will be making 172,800 calls to S3 per day, which at roughly $0.005 per 1,000 list requests works out to about $0.86 per node per day. And that's what happens with no traffic. When traffic happens, that's at least one additional PUT-GET call pair per packet.
Depending on how big your packets are, this can cause you to easily triple that number, making you end up with 518,400 calls to S3 per day ($2.59 per node per day). Not to mention TCP overhead from the three-way handshake and acknowledgement packets.
This is hilariously unviable and makes the effective cost of transmitting a gigabyte of data over HTTP through such a contraption vastly more than $0.07 per gigabyte.
Other notes
This architecture does have a strange advantage to it though: assuming a perfectly spherical cow, adequate network latency, and sheer luck, this does make UDP a bit more reliable than it should otherwise be.
With appropriate timeouts and retries at the application level, it may end up being more reliable than IP transit over the public internet.
Good lord is this an accursed abomination.
I guess you could optimize this by replacing the S3 read loop with some kind of AWS lambda handler that remotely wakes the target machine, but at that point it may actually be better to have that lambda POST the contents of the packet to the remote machine. This would let you bypass the S3 polling costs, but you'd still have to pay for the egress traffic from lambda and the posting to S3 bit.
Before you comment about how I could make it better by doing x, y, or z; please consider that I need to leave room for a part 2. I've already thought about nearly anything you could have already thought about, including using SQS, bundling multiple packets into a single S3 object, and other things that I haven't mentioned here for brevity's sake.
Shitposting so hard you create an IP conflict
Something amusing about this is that it technically steps into the realm of things that my employer does. This creates a unique kind of conflict where I can't easily retain the intellectual property (IP) for this without getting it approved by my employer. It is a bit of the worst of both worlds where I'm doing it on my own time with my own equipment to create something that will be ultimately owned by my employer. This was a bit of a sour grape at first and I almost didn't implement this until the whole Air Canada debacle happened and I was very bored.
However, I am choosing to think about it this way: I have successfully shitposted so hard that it's a legal consideration and that I am going to be absolved of the networking sins I have committed by instead outsourcing those sins to my employer.
I was told that under these circumstances I could release the source code and binaries for this atrocity (provided that I release them with the correct license, which I have rigged to be included in both the source code and the binary of Hoshino), but I am going to elect to not let this code see the light of day outside of my homelab. Maybe I'll change my mind in the future, but honestly this entire situation is so cursed that I think it's better for me to not for the safety of humankind's minds and wallets.
I may try to use the basic technique of Hoshino as a replacement for DERP, but that sounds like a lot of effort after I have proven that this is so hilariously unviable. It would work though!
Stay tuned. I have plans.
Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.
Tags: aws, cursed, tuntap, satire