Last September, at the šØš­ Swiss Cloud Native Day, Florian Forster, co-founder & CEO of ZITADEL, talked about why they switched to serverless containers. ZITADEL has a really interesting workload that is both CPU intensive and latency sensitive. On top of this, their users are global, and traffic is bursty. Florian talks about how they evaluated AWS, GCP & Azure before they settled on the platform that met their requirements.
Sponsors
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Fly.io – The home of Changelog.com – Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
The Changelog – Conversations with the hackers, leaders, and innovators of the software world
Notes & Links
- š¬ Why We Switched to Serverless Containers - Swiss Cloud Native Day, September 2022
- š¬ What is the fuzz around serverless (containers)? - Open Source @ Siemens, May 2022
- š¬ Self-host ZITADEL on Knative
- š github.com/zitadel/zitadel
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
--- | --- | --- | --- |
1 | 00:00 | Intro | 00:52 |
2 | 01:14 | Florian Forster | 03:48 |
3 | 05:02 | The Kubernetes dance | 02:48 |
4 | 07:49 | "Person for everything no one else takes care of" | 01:22 |
5 | 09:11 | Zitadel | 03:32 |
6 | 12:43 | Double-dip traffic | 02:44 |
7 | 15:27 | Zitadel's tech stack | 02:37 |
8 | 18:04 | Terraform | 00:58 |
9 | 19:18 | Sponsor: The Changelog | 01:45 |
10 | 21:03 | For the cloud service? | 07:36 |
11 | 28:39 | Why Cloud Run? | 04:35 |
12 | 33:14 | Pops and regions | 05:35 |
13 | 38:49 | What's the CDN like? | 01:43 |
14 | 40:31 | Zitadel's price model | 03:08 |
15 | 43:40 | Zitadel Cloud | 04:49 |
16 | 48:28 | The "Oh f**k" moments | 03:34 |
17 | 52:02 | How to know there's a problem | 02:25 |
18 | 54:27 | Florian's 2023 | 04:04 |
19 | 58:31 | Upcoming for the cloud | 03:37 |
20 | 1:02:08 | Florian's one key takeaway | 04:58 |
21 | 1:07:06 | Wrap up | 00:21 |
22 | 1:07:29 | Outro | 00:30 |
Transcript
Play the audio to listen along while you enjoy the transcript. š§
One of the talks that I didn't have time to watch in person... I think it was while I was giving my talk - this was Cloud Native Day in 2022 - was Florian's talk, "Why we switched to serverless containers." I have re-watched it on YouTube, twice. Not because I was preparing for this, but because I really liked it, you know? So Florian, welcome to Ship It.
Well, thank you for having me.
What is the story behind the talk? What made you want to give it?
It basically boils down to we have been huge fans of Kubernetes so far. And I still really like the ongoing effort behind Kubernetes. But while running a cloud service that should be kind of scalable, we hit some limits around the quickness of scaling in Kubernetes, and we rethought our stack to better adhere to that, and to kind of get better scaling behavior and better cost profiles as well. Because that's also a side of it. So the economical side was also kind of a driver for that.
Yeah, yeah. So when you say serverless containers, what do you mean by that?
Well, we labeled it that way because we thought, "Okay, Citadel is being built like a Go binary, cram it into a container, and you can run it basically everywhere." And that's what our customers do. They normally use Docker Compose, or Kubernetes. And we wanted to keep that availability around, so that we actually eat our own dogfood, and not just create something new for our internal purposes, but instead rely on the same things we tell our customers to use. And so serverless containers is kind of the definition - in our head, it's basically a platform where you can run some kind of OCI image. And you could call it Knative, you can call it AWS Fargate, you can call it Google Cloud Run, whatever fits your poison. I mean, it could even be like fly.io, or something like that. Basically, plug in a container and it should be run scalably. That's kind of the definition we made right there.
Okay. Okay. So the container part - that is very clear. But why serverless?
Because we don't want to handle and tackle the effort of running the underlying infrastructure in general. Because, yeah, if you're coming from Kubernetes, you are well aware that you need to handle nodes to some degree; even with things like GKE Autopilot, there is still an abstract concept of compute nodes behind it. And the whole, let's say, undertaking with Cloud Run and AWS Fargate and all of those - you basically plug in your container and tell it "Please run it from 0 to 1000", and you don't need to care whether it's one or 100,000 servers beneath. That's why we call it serverless to some degree, even though the word is absolutely - it's wrong. [laughs] I don't like the word serverless, to be specific.
Okay. So if you could use a different word, what would you use, instead of serverless? What do you think describes it better?
You could call it container as a service, to some degree, I think... Although that's kind of plagued by the HashiCorp Nomads, and Kubernetes, and Docker Swarms... Because people tend to think that a managed Kubernetes offering is container as a service. And I don't like that nuance, but that's a personal opinion... Because I still, to some degree, see the underlying infrastructure. And yeah, serverless kind of reflects that well - yeah, there is no server to be managed... But still, you can use your funny old OCI image and just throw it in, and it will keep it working.
Yeah, yeah. So just going by this very brief conversation, my impression is that you've been doing this Kubernetes dance for quite some time. Right?
Yes... [laughs]
You're not in this for a few months and decided you don't like it... Or you've tried it and you said, "You know what? This is not for us." You've been using it for a while, because you have a very deep knowledge about the ins and outs of how this works. So tell us a little bit about that.
Well, I mean, the first contact points with Kubernetes were around when the kube-up script was still there - like kube-up.sh.
Wow... Okay. I don't remember that thing... So that's a really long time. Okay.
[05:43] Yeah. It was way pre-Rancher, pre-K3s... Yeah, even OpenShift was like the classic OpenShift at that point in time. Yeah, at the company I worked at at that time we had a need for an orchestration system to run some things in a scalable fashion, and we started poking into the Kubernetes ecosystem, because everybody was kind of hyped... "Yeah, yeah, Google is behind it. Borg is being replaced by Kubernetes." Everybody was talking in that way, and I thought it was worth looking into. And I liked some of the thoughts, but still, to this day, it keeps getting more enterprisy, and that's a great thing if you are an enterprise. But if you're a startup, it's too much abstraction, too many things to care for. It doesn't feel like a hands-off operation. I mean, even just running observability in Kubernetes is like, "Yeah, pick your poison." There are like 20 different ways of doing things, and why should I even need to care about that too much if I just want to run one container? Please run that container for me. And so yeah, that's where my personal change from Kubernetes to something else started. But still, I totally like what they do on their end, even though a lot of complexity is involved.
So I think we finally did it. We've found the person with 20 years of Kubernetes experience, right? Everyone thinks they don't exist. We've just found him. [laughter]
I definitely aged like 20 years around Kubernetes...
Because of it? I see. Okay, okay.
Yeah, definitely. I mean, at one point, funnily enough, we decided to switch to Tectonic, from CoreOS; it was at that point still around... With their self-hosted Kubernetes control plane, like running the control plane in containers... It was a great thing to do. And etcd as well. But yeah, it had some nifty tweaks and problems... And you aged quite a bit in that case.
Okay, okay. So what is it that you do now? What is your role?
I'm not sure what the translation into English is, but let's call it person for everything that nobody else takes care of...
I see. Janitor? No. Handyman? No... [laughter]
I label myself as CEO, though still I feel more like a CTO-ish, head of DevOps-ish kind of guy, but...
Whatever. Yeah. Whatever needs to happen, you know, you're there.
Exactly. I do a wide range of things on the business side, the overall vision of how we want to shape Citadel, and also the things - how can we ease some stress in the operations part of the game... So I'm normally well-opinionated about many things, even though I'm not always able to go into the full depth.
Because time, right? And time constraints. You can't be everywhere, doing everything all the time, at 120%, right? You have to pick your battles.
Exactly. I mean, if you asked me whether I like Go generics or something like that, I'd need to resort to the answer "I have not experienced them yet, because I haven't had time to look into it."
Okay...
My engineers will tell you a different story, but I'm not opinionated on that end.
So what does Citadel do? And maybe I'm mispronouncing it, but I'm sure that you will give us the official pronunciation.
I mean, we call it Citadel as well, so it's totally fine. And the logo kind of tries to reflect the word origin, because it comes from the French way of building fortresses, with the star design... The place where you commonly see this is in Copenhagen, for example; there's still like the fortress with the star design. That would be called Citadel, but in French, but nobody cares. So Citadel in English.
[09:45] What we basically try to do is we want to bring the greatness of Auth0, like a classic closed-source proprietary cloud service, together with the greatness of Keycloak's run-it-yourself capabilities, and we want to combine them into one nice, tidy package, so that basically everybody who has at least a heart for engineering can solve some of the problems around the identity game in general. So that includes things like "I want to have a login site, I want to have authorization, I want to have single sign-on with different IdPs, I want to have tokens that I can send through the world..." And everything like that. So basically, you could call it a turnkey solution to solve user management and authentication in general.
Okay. Okay. So there's one word that comes to mind when you see that. Actually two. CPU and hashing. [laughter] So from all that description, all I'm thinking is "I hope your CPU is fast, and your hashes are even faster." But don't skimp on the cycles. Right? Like, make sure you do enough iterations. Okay? No cheating, please. Okay.
Yeah, I mean, it comes down to hashing. We rely on bcrypt normally, and yes, we need many CPU cycles... But there's also a second thought: signing tokens - that's also quite exhausting for CPUs nowadays. And so yes, we use a lot of CPU to make that happen.
But actually, the stress will be somewhat alleviated in the future, because the passkey concept and [unintelligible 00:11:30.03] in general rely on public/private key cryptography, which reduces quite a bit the amount of CPU we need to use, because RSA signature verification normally takes around one millisecond. So it's not that much of a stress. But hashing a password with bcrypt at a cost factor of 10 to 12 might be more in the range of 800 milliseconds to 1,000 milliseconds, around that [unintelligible 00:11:57.27], if you run like two to four CPUs.
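To get a concrete feel for the numbers Florian mentions, here is a minimal Go sketch (not ZITADEL's code) that times bcrypt at different cost factors using golang.org/x/crypto/bcrypt; the exact durations depend entirely on your CPU:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("correct horse battery staple")

	// Each +1 on the cost factor roughly doubles the work, which is why
	// cost 10-12 can land in the hundreds of milliseconds on ordinary CPUs.
	for cost := 10; cost <= 12; cost++ {
		start := time.Now()
		hash, err := bcrypt.GenerateFromPassword(password, cost)
		if err != nil {
			panic(err)
		}
		hashDur := time.Since(start)

		start = time.Now()
		_ = bcrypt.CompareHashAndPassword(hash, password)
		verifyDur := time.Since(start)

		fmt.Printf("cost %d: hash %v, verify %v\n", cost, hashDur, verifyDur)
	}
}
```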
Yeah, yeah. Okay. Do you use GPUs for any of this?
No. At the moment, no. I mean, it comes down to a certain degree to the fact that our cloud provider does not really allow for that... And on-premise environments are also oftentimes really restricted in the kind of GPUs they have around... And so we try to avoid too much integration and depth there for Citadel. So we still rely on CPUs. But there might be a time where we want to retrain some analytics model with machine learning, and so we are looking into that space. But it's not there yet, to be honest.
So most of the workloads that you run, I imagine they're CPU-bound, especially because of the hashing. But I also think network is involved. So you need to have a good network. I'm gonna say good network - not high throughput, but low-latency network. Can you predict these workloads? I mean, people sign in whenever they need to sign in; they don't sign in all the time. Right? So can you tell us a little bit about the workloads specifically that you run?
Yeah, we have a common pattern called double-dip traffic. So we often see traffic starting at like seven o'clock, until twelve, and then from one to around seven in the evening. So that's the major traffic phase. But if you spread that across the globe, you see certain regions ingest traffic during certain times, but globally, overarching, it's basically a big white noise line. It's more flat if you watch all the regions, but if you look at a single region, you really see those double dips happening all the time.
[13:45] And yeah, having a network that can scale easily and fast, and really provides low latency to our customers - because they use our API, our login page - is kind of important to the service quality. And so we internally set our goal to "Okay, let's try and keep latency below 250 milliseconds for 80% of the calls." That's kind of the SLO we're aiming at, but that's not always possible. I mean, down under - yeah, you get bad latency. You get strange latency if you go for South America... So yeah. Normally it comes down not to Citadel being the problem - you can easily run it at the edge - but rather to how you can move data along the journey, and that's kind of the thing that holds you back the most.
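As an aside, an SLO like "80% of calls under 250 ms" is straightforward to check against whatever latency data you already collect; here is a rough, self-contained Go sketch (hypothetical numbers, not ZITADEL's tooling) that computes the 80th percentile and compares it to the target:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the given latencies.
func percentile(latencies []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100)
	return sorted[idx]
}

func main() {
	// Imagine these came from access logs for one region.
	observed := []time.Duration{
		120 * time.Millisecond, 95 * time.Millisecond, 210 * time.Millisecond,
		480 * time.Millisecond, 180 * time.Millisecond, 260 * time.Millisecond,
		140 * time.Millisecond, 90 * time.Millisecond, 330 * time.Millisecond,
		160 * time.Millisecond,
	}

	const target = 250 * time.Millisecond
	p80 := percentile(observed, 80)
	fmt.Printf("p80 latency: %v (target: %v, SLO met: %t)\n", p80, target, p80 <= target)
}
```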
Yeah, that's right. Especially if you're doing like SAML, or something like that, where you yourself need to talk to another provider, which itself may have variable latency at specific periods of time... And then you can only guarantee what you get, basically. You can't make (I don't know) GitHub login, or Twitter login, or whatever you're using, faster than it is.
Yes. It's bound to external services. And in our case, for the storage layer we mainly rely on CockroachDB, since they can move around and shift data closer to the data centers where it's actually being used. But still, the data is the heaviest thing right there if you want to reduce latency.
Okay. What does your tech stack look like? Like, what do you have? What is Citadel running on?
Basically, we internally use Cloud Run for our offering - so Google Cloud Run - because we can easily scale during traffic peaks as well as CPU peaks, because Google will scale Cloud Run containers based on CPU usage as well. That fits our narrative quite well. And also, the ability to scale to zero in infrequently used regions is a thing we really like.
Below that, we use CockroachDB Dedicated, their managed service offering of Cockroach. In the past, we actually did run Cockroach on our own, but we figured the amount of money and time we'd need to invest is basically covered by Cockroach Labs if they run it; it's even more economical, because they include their license quite neatly in their cloud offering, so... Basically, you have multiple reasons to do that.
And up front, we use Google CDN and the Google Global Load Balancer with Cloud Armor to mitigate some of the risks around DDoS, and rate limits, and stuff, because we see malicious behavior all the time... And yeah, the stack is really lean in that regard... Because the only thing we can append to that from an operational perspective is that we use Datadog to some degree for observability purposes, and also Google's suite... And we're kind of torn in between, because - yeah, why should I send logfiles to Datadog if they already have - I already have them in Google's cloud offering. Why should I pay twice for it? But there are some catches in Cockroach's offering, in that you can only use Datadog to monitor the database, and so we're kind of bound to Datadog. So it's a not-so-funny in-between state there...
Interesting.
...because we try to reduce our amount of third-party processes whenever possible on that end.
Okay.
That's kind of the stack we use.
I like that. It's simple, but it sounds modern, and it's complicated in the areas where it's sometimes by design; as you mentioned, there's this coupling between CockroachDB - the dedicated one, which is managed for you, so you're consuming it as a service - and Datadog. And I'm sure that choice was made for whatever reasons...
Yes... [laughs]
Okay. So there's Google Cloud, there's CockroachDB, there's Datadog. Anything else?
[18:10] We use Terraform to provision infrastructure, and we use GitHub Actions mainly to do that. We still have some stuff in Terraform Cloud, but we're constantly migrating into GitHub Actions and private repositories, because it fits better with our flow. 80% of the company knows how to deal with Git, and so we - we did a lot of classic GitOps in the past, with Argo and Flux and all of those tools... We feel comfortable with Git, and so we try to shift as many things as we can into that space. So from a source control perspective - yeah, it's GitHub with us. But that's about it on that end. There are more chat tools and things around, auxiliary services, but the core operations really revolve around that stuff.
How much of this do you use for the actual Citadel software as a service, where it's all managed for you? I think you're calling it the cloud now?
Yup.
[unintelligible 00:21:11.10] versus building the binaries that you make available to users? Is it like most of this is for the cloud service, or what does the combination look like?
I think 80% is for the cloud service, because the whole CI side of the story comes down to GitHub Actions, some Dockerfiles, and some custom-built shell scripts, because protoc and gRPC tend to be a little bit finicky when it comes to building clients. So that's basically what it comes down to from a CI perspective. And we can basically run all the components we have - we have a management GUI in Angular, we have a cloud GUI in Next.js, we have the backend of Citadel, which is written in Go... And we can cram everything into GitHub and GitHub Actions, into a Dockerfile, basically. And that's a process that is constantly evolving to reduce drag for additional committers... Because if you don't make that understandable, it's a high entry bar to commit things.
Yeah, of course.
[22:23] Yeah, and you need to reduce the drag. I think an honorable mention on that end is Cypress, for the e2e test stuff... Because we test our management GUI with Cypress, which basically passes through to our API, so we can do that end to end from our perspective... Yeah, and then there are automated things, like Dependabot, and CodeQL, and static analysis tools to get the security right... But that's about it, because we think about the tool drag - I often call it this tool drag... I don't like having too many tools around to do the same job, because you lack focus if you do that.
Of course. Of course, that just makes sense. Yeah, that cognitive overload of having ten ways of doing the same thing, based on the part of the company that you're working in... Okay, okay. So how many engineers are there, for us to better understand the scale of people that touch this code, and work on this code, and help maintain it? So there's the company, and then the community, because that's also an important element.
Yeah. Currently, it's like eight dedicated engineers - software engineers on our end - working on Citadel's code. Some of them also work on things like the Helm chart, and stuff, because it's mainly our engineering staff who does that... And then we have like 20, 25-ish external contributors across the field who do a variety of tasks. Because we also have some separated packages from Citadel. The OpenID Connect library, for example, is not in Citadel; we use it on our end, but we needed to create, for example, a Go library for OpenID Connect, and we wanted to have it certified, so we needed to do an extra leg on that end. And we have separate maintainers on that end, apart from Citadel, because many people are using [unintelligible 00:24:17.17] without our knowledge, so to say.
So it's multiple projects, multiple contributors, but the stack beneath is kind of the same. And with us it's basically the eight engineers who work on that, even though the company now has roughly 15 employees, since we started to work more and more on technical writing, API documentation... Because otherwise, it's not nice to work with...
Oh yes, the docs. Oh, yeah. Tell me about it. Yeah. Okay, okay. [laughter] Oh, yeah.
Docs are the conflicting major pain point all the time. And I don't say that lightly, but I think you can apply that to any project - as soon as it's kind of open source, it's hard to get good-quality, consistently readable docs... But that's a topic we want to invest in quite heavily over the coming few months, because we really feel that if you want to engage with developers and engineers, you need to have proper documentation. Otherwise, the impediment feels too high.
Yeah, for sure. And I can't think of a better way to show the users that you care about your product. I mean, very few would dig into the code and say, "Wow, this is amazing code. I'm going to use this product, because it's amazing." How often does that happen?
[laughs]
But put a nice docs site out there, easy to search, easy to understand, with good flows, and people go "Wow, they really spent some time on this", even though the code may not be that good; but the docs are, and then the perception is "This is amazing."
[26:02] Yeah. I mean, I have recently heard the term "Content is oxygen for the community", and I think that's also applicable to the documentation side of things, because it's not only outreach content, but also, and rather importantly, the documentation side of things. Because even if you write the best blog post about something, at one point you will link to your documentation, and if that link is not nicely and tidily done - yeah, it breaks the experience. And so if I need to point out one specific thing we need to improve over the coming months, it's really docs. They need to have a clear flow. "Where should I start? From where can I go to what?" So they need to reflect the user journey, basically, and that's kind of the biggest rework we will do - restructure the docs to better reflect that.
Yeah. Okay, that makes a lot of sense. Yeah, for sure. For sure. I'm just taken aback by how simply you've put that. It's a very hard problem, right? And it all boils down to this. So if you don't get this, then forget everything else, like making them easy to understand. I mean, that's important, but what are the flows? Where do you enter? What are the entry points? Where do you drop off? What happens next? What is the follow-up? What is the story that you get when you go to the docs? And if you just get reference docs - by the way, there are so many types of docs, actually... Four, as far as I can remember; there's guides, there's references, there's...
Examples, oftentimes...
Examples, exactly.
API documentation...
Oh, yeah.
In our case, we split out the whole self-hosting part, because in our cloud it's not applicable. But if you want to run it on your own, you need production checklists, examples to deploy it to x, y, z, how to configure all the nifty details, how to configure a CDN, how to configure TLS... I mean, there's a whole array of topics just for the self-hosting stuff. And so you kind of need to figure out that flow, and it will take you a lot of time to do that. But if you figure it out, it will be beautiful. The only thing you can still break at that point is basically the style in which you write content, and in what kind of language - and that's especially difficult if you have non-native speakers, and engineers; they tend to write documentation differently than dev rels, or content marketing guys or gals, because it's just a different way of thinking about it... That's the second thing to get right there.
For sure. For sure. Okay, so I'd like us to come back now to your talk, because one thing which I really liked in that video, and in your presentation, is that you talked about why you haven't chosen Microsoft, and why you haven't chosen AWS. It's not like, you know, "We haven't even looked there." You did try them, you did consider them, but there were certain things which wouldn't work for you. So tell us a little bit about that, and how you ended up with Cloud Run, which I don't think was your first choice, but basically, you ended up there because of your requirements.
Yeah, the first and most prominent thing that struck us was having end-to-end HTTP/2 support. Because we provide gRPC APIs to our customers, and they need HTTP/2. And while verifying that with all the different offerings, it was kind of hard to either get proper documentation on whether they support it or not, or they oftentimes only supported it from their CDN to the customer's side, but not upstream... And so that was really bogging us down on that end.
[29:51] I mean, we could have chosen the route of saying "Okay, we do not offer gRPC, but only gRPC-Web and REST", because we supply those as well. But we really wanted to have the HTTP/2 capabilities, because we think at one point there is a unique opportunity to be taken to use streams and stuff for identity-related things. So if something changes, we can notify you immediately, which can be an interesting way of thinking about it. And it reduces latency quite a lot. I mean, our own measurements state that it reduces seven milliseconds each call, purely down to JSON serialization... Which is not like a big thing, but it's seven milliseconds.
Yeah. It adds up, right? Seven milliseconds here, three milliseconds there... Before you know it, it's a minute.
No, really, if you have microservice architectures - I mean, if you have like five services cascaded, and every time they call our introspect endpoint to validate some tokens, it adds up. It's 50 milliseconds of serialization alone at that point. But that was just the decision they made there, and really, the major breaking point was the HTTP/2 stuff. It was just too confusing, not clearly stated. We started poking around, and we saw that eventually you can make it work, to some degree. For example, Azure - what's it called now? Azure Container Apps, I think... They now use Envoy as a proxy, so you now get HTTP/2... But as soon as you want to use the web application firewall - yeah, well, you're in a bad spot, because that thing only supports HTTP/2 to the front, and not to the back. And the CDN as well. So it always came down to friction on that end. And so yeah, we chose Google Cloud Run, exactly. That's one of the major reasons we chose Cloud Run, even though we don't like some of the limitations with Cloud Run. There are some which we don't like.
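A quick way to see what this hop-by-hop problem looks like from the client side: Go's standard HTTP client negotiates HTTP/2 over TLS automatically, so checking the negotiated protocol against a hypothetical endpoint is a one-liner - with the caveat spelled out in the comments.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Go's default transport negotiates HTTP/2 over TLS via ALPN when the
	// server supports it, so resp.Proto reports what the edge agreed to.
	resp, err := http.Get("https://api.example.com/healthz") // hypothetical endpoint
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("negotiated protocol:", resp.Proto) // "HTTP/2.0" or "HTTP/1.1"

	// Caveat: this only proves the client-to-edge hop. A CDN or load balancer
	// can speak HTTP/2 to the client and still downgrade to HTTP/1.1 towards
	// the container - exactly the gap described above. The only real proof of
	// end-to-end support is a gRPC (or other h2) call through the full stack.
}
```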
Tell us about them.
I mean, I still to this day don't fully understand why Google Cloud Run needs to use their internal artifact registry... Like, you need to push your images into Google's registry, and from there on out you can fetch them in Cloud Run. I don't know why that decision was made. There might be a technical reason for it. I mean, you could argue availability; that would be a reason. But I don't like that fact. And the other thing that really is kind of a bad thing is that if you want to use VPCs, you need to use the VPC connector, which now can be edited - I guess they released that like one month back, or something like that - but you basically need to have VMs in the background that handle connectivity from your Cloud Run to your VPCs. And since we use Cockroach, that traffic passes through a VPC, and we use like [unintelligible 00:32:47.18] gateways, so we pass traffic through that anyway, because we want to have control over what traffic leaves our side, and stuff. And that VPC connector thing is always - so, yeah. It's there, I don't like it, because it doesn't scale down. It only scales up, and then I have like 10 VMs running in the background, doing nothing. But yeah, it's kind of a thing I don't like. But other than that, it's a great tool.
That's right. One thing that maybe we haven't done a good job of conveying is that your service is global. So when you're saying VPC connector, you don't mean just in one region or in one zone; you mean across the whole world. So how many PoPs do you have worldwide, where Citadel runs?
We have like a core set of PoP regions. It's like three regions we run constantly; that comes down to the fact that we run our storage there as well. And sometimes, if we see different traffic profiles, we start regions without storage attached to them, just to get some of the business logic closer to customers. So that can range from three to nine regions during normal operations. But since Google's internal network is quite efficient from our perspective, and their connectivity is really great, we normally don't need to spread it to more regions than that.
[34:13] We did some internal experiments - we built a small Terraform function where you can basically throw in a list of regions you want to deploy to, and it will deploy to 26 regions in like one to two minutes...
Wow.
That works really well. But you get strange problems if you do that, because sometimes you want to have hot regions, because your application is running anyway, so it can serve traffic quite easily... If you have to cold-start a region, it always takes a few moments to do that. And it's not a big thing, but it can influence customers' view of your service. Because if you hit the login page and it takes like two seconds to get everything spinning, and database connectivity set up, and everything, it gives you some drag. And so we're trying to keep hot regions, as we call them.
Yeah, that's crazy. Like, you say that two seconds is slow for a whole region to come up. It's like, "What?!" [laughter] Like, try booting something; it will take more than two seconds. Anything, really. Wow, okay...
I feel it's like engineering ethos that you might at some point over-engineer certain things... But still, it feels right to do that, because it's easier to just scale up an existing region and throw some more traffic at it. You can easily steer that around. The thing you'll miss out on most of the time is basically 50 milliseconds. There are edge cases with different regions... So if you live in Australia - yes, we don't have an immediate region in your vicinity... But yeah, that's a matter of - if you have enough traffic, you will actually open a region at one point, and then you try to keep it hot as long as you can. And I always think the classic engineering decision this comes down to is the same thing that Cloudflare and Fastly are constantly arguing about... I mean, the last time I checked, Cloudflare was still "Let's build small regions across many places", and Fastly was like, "No, let's build huge data centers, with huge compute in them." It's a matter of what your service needs to decide that, and we decided that "Yeah, two seconds feels bad."
Yeah. But it's interesting... Sometimes routing to a region which is further away - sending a request to that region can be faster, based on your workload. And even though your workloads are super-optimized, for something to be up and ready in two seconds, that's just crazy. Try doing that with Java... [laughter] Right? Or something which is slow to start. Not picking on Java, but it's known for slow starts, which is why you wouldn't stop it. And that would be a non-starter; like, you can't even consider that for what you're doing. Maybe GraalVM does things better. But the JVM is slow by design, because it runs optimizations... I mean, a lot of things need to happen. Again, nothing wrong with that, but not suitable for this workload.
I mean, there are even small things involved in getting fast startup latencies... One big driver - we hypothesize; we don't really have evidence, but we hypothesize - is that the image sizes of your containers influence that quite a lot. Even though I think Google does quite a lot of magic in caching things in Cloud Run to scale it quickly. But nonetheless, we see that bigger images take more time. So for example, we have a documentation page built with Docusaurus, and we normally use Netlify to deploy those things, and we are currently testing moving that to Cloud Run as well. And that container is approximately 500 megs in size, because of the images and stuff, and it takes more time to start. So it takes like three to four seconds just to get the Node server started, even though everything is pre-built; it's not like we are compiling stuff on the fly. It's really just starting the Node server with static assets.
[38:14] Okay, okay. Yeah, you're right, you're right; that can make a big difference as well. So a few seconds is not bad, right? Especially if you have a blue/green style of deployment, where - and I know that Cloud Run supports that. So you're not taking the live version down, and that's okay... Something that needs to serve dynamic requests - for that it's a bit more difficult, right? Because it needs to serve it, it needs to keep the connection open... There are a couple of things happening, rather than just static ones, where you can just cache them, use a CDN. So you said that you are using a CDN, right?
Yup.
Okay. Okay. What is that like? How are you finding the Google CDN? Because I haven't used it in anger; I mean, only small projects... How does that work in practice?
I actually quite like it. One of the things we like the most is that you can cache assets across multiple domains. So for example, each of our customers has their own domain name, and our management GUI is built with Angular, so we have a lot of static assets for it... And if one customer accesses that data in one region, we can cache it for basically every customer. And that's nice, because you basically can ignore the host name, and instead just cache the files. And that's a feature I have not easily found in Cloudflare or Fastly's offering. I mean, you can always make it work to some degree, but that was basically - yeah, just put a validation rule into the Google CDN and it will take care of all that stuff.
And the overall strategy with Google's pricing is more beneficial for us, because we basically only pay for usage, and we are not feature-locked. And with the Cloudflares, Fastlys, and everybody, you basically are always feature-locked until you get to their enterprise offering. And at that point - yeah, the cost to be paid is quite steep on that end... Even though they have great offerings, it feels wrong to spend so much money on a CDN, when it's just - it's basically caching some static assets. I mean, it's not doing the heavy lifting; it's more a quality-of-life improvement, I would call it.
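For context, the application's side of that setup is mostly about making assets cacheable at all; sharing one cached copy across every customer domain is then CDN configuration (in Cloud CDN, a custom cache key that leaves out the host). A hedged Go sketch of the app side, with hypothetical paths:

```go
package main

import "net/http"

func main() {
	fs := http.FileServer(http.Dir("./dist")) // fingerprinted SPA build output (hypothetical path)

	http.Handle("/assets/", http.StripPrefix("/assets/",
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// "public" lets the CDN cache the response; a long max-age plus
			// "immutable" is safe when filenames are fingerprinted, because
			// every release ships new file names anyway.
			w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
			fs.ServeHTTP(w, r)
		})))

	// The cross-domain part lives in the CDN config, not here: with a cache
	// key that excludes the Host header, /assets/app.js maps to the same
	// cache entry no matter which customer domain requested it.
	http.ListenAndServe(":8080", nil)
}
```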
I was reading something about this as well - enterprise features, feature locking, things like that. It was your blog, where you said you charge for requests, right?
[laughs] Yes.
See, I have done a bit of research; not too much, but I did notice that, where you mentioned that... And to me, that is very reasonable. You don't have to upgrade to higher price tiers just to unlock certain features. I mean, why? Does it cost you more? I mean, in development time, sure, but you don't get the full experience of the service. Per-seat pricing - again, that can work in some cases. So what influenced your decision to do that? I thought that was very interesting, for all the right reasons.
We thought long and hard about our pricing, so many times in the past... We even had a feature-locked model, close to what Cloudflare does... And what does not reflect well in the security area is - if you want to provide your customers with a security service, you should give them the means to have a secure application, and not tell them, "No, if you want to have 2-factor, you need to pay extra", because that kind of defeats the purpose of having an external, specialized system handling the security in the first place...
And the second thing there is that if you price by identities, customers will stop creating identities at one point, if they can choose. And we wanted to remove that sensation by telling them, "Hey, store as many things as you like, do as many things as you like. The only thing we want from you in return is that we need to be able to finance our infrastructure to some degree, and so it comes down to paying us for the usage."
[42:20] That's really what it boils down to... Because it feels like a nice trade-off, even though - and I can be honest about that, especially during sales meetings - it can sometimes be a problem or an impediment, because people still think in users. "I want to have a million users. I want to have a price for a million users." And I mean, we have a lot of data, we can do cost estimations for that; it's not a big problem. We even do deals where we say "Okay, we'll give you 10 million requests a month for price XYZ, you will use it for 12 months, and we will check after five months how it's working", because we want to take that friction out of the equation. But it's just a matter of a different strategy, and we are committed to that end, because it feels like the right thing to do, even though it has some challenges.
Yeah, for sure. For sure. Yeah, I mean, to me, that sounds like a more sensible approach, a more honest approach, a more open approach. Everything is out there for you to use. There's like one requirement - that we are able to support all your requests, and we are able to give you the quality of service that we know you want... And these are the SLOs, and that's what it will cost. Okay. How long have you been running the cloud service for - your cloud service? Six months, 12 months?
Now, looking at the date, it's like seven.
Seven months. Okay.
Yeah. The thing we call Citadel Cloud now is seven months old... But we had a service we called Citadel v1, with kind of a different sensation to it, with the old pricing I just mentioned... And that was started in mid-2021. But we learned so many things across that journey that we needed to reconsider some of the things - like pricing, deployment strategy, locations where we deploy, because customers actually do sometimes care about that... The API needed to be reshaped to a degree... And so yeah, it comes down to an evolution of Citadel, from version one to version two, and our cloud service changed as well. And so the new service is basically seven months old right now.
Okay. What are the things that worked well for the cloud service in the last seven months? Good decisions, that proved to be good in practice?
I think it's not only directly the cloud service, but the overall change in our messaging - what we actually want to sell, and why we recommend that you use our cloud service. That message has been picked up better over the last seven months. So that's a thing we constantly improved. So many people now use the free offering we have, because it already provides a lot of value, and we are even considering increasing the amount of things we give you - the amount of requests as well - to give developers an easygoing free tier so that they can actually start building software. Because nobody really likes to run things. And I think that's the most - let's call it the biggest change I've experienced so far in behavioral things... Because everybody's always shouting, "I want to run system XYZ on my own hardware", but in the end everybody turns to some kind of free hosted offering, because everybody just knows "Oh no, I don't want to take care of backups. Oh no, it just runs. Oh no, I need to start it again." So -
"Upgrades? Again?! Oh, my goodness..." When did you last upgrade your phone? Serious question.
My phone... I'm quite pedantic, so I will catch up on releases in one to two days. [laughter]
Okay, that's a great answer... But for me, the updates just happen, right? I mean, unless it's a major update, your applications on your phone - they just update. It's not a problem that people think about, or should think about. So if you run it yourself, guess what? You have to think about that. And then you say, "Oh, dammit, I want this free auto-update feature. Why don't you give it to me?" Okay, well, it's not that easy. I mean, the easiest auto-update - like, delete it, and then deploy it again. And then you're good. But people don't want that. So the point is, you want the service, because this stuff should be seamless, and someone needs to put in the effort for it to be seamless every single time.
It's hands-off. We really call it hands-off. I mean, we take care of the TLS stuff, we take care of updates, of backups, of rate limits, of malicious traffic... Everything is just handled for you, and that's a value I think goes over great with the community... Even though with Citadel's open source version - like 99% of the things that we have in our cloud service are in the open source code, and you can run it on your own. You can even get an enterprise license with us to get support, and stuff. So we really encourage you to do that. But the main reason we still encourage doing that is that we see many customers having special requirements on data residency, or data access... And we always tell them "We don't want to do a managed service offering for you guys, because it feels wrong. Because we still have access." And if your reason is "Nobody else should have access", well, then you need to run the system on your own.
Yeah. Yeah, that makes sense. I forgot - you're a security company, right? [laughter] And security has this very important requirement: "No, sorry, I have to run this myself." I mean, I understand... I can pay you to run it for me, but it has to be in these specific locations, with these restrictions, and... Yeah, that makes sense.
Yeah. So that's the thing I think went great with the cloud service. So the free tier is definitely a thing we will iterate on, even though you could run it on your own. I mean, it's not like there is a feature gap, or something like that.
Okay. What about the things that you wish you knew, before we [48:29]? [laughter] The "Oh, f**k!" moments. And that will be bleeped, but...
[laughs] Yeah, there are many. Honestly, there are so many. I'm going to choose to focus on the operations side of things for the first moment. It's really like - don't try and build many funny things, even though there are great open source tools around and everything is ready. Just try and relax a little, and use ready-made services in the beginning. Because we thought "Okay, let's run our own Kubernetes, on GCP", and stuff. "Let's get more control of it." Yeah, it was going great, but the added cost and the added slowness you have while maintaining your own Kubernetes and stuff - it's not worth it. So that's really it - just use turnkey services to begin with. And at one point, be ready to make decisions to change that stuff towards the more enterprisy side of things... Because [unintelligible 00:49:36.04] application to Netlify, and Vercel, and call it a day... But I think that's only worth it for the start. At one point you want to get more control, more flexibility... You want to create rate limits, you want private IPs, you want to have the enterprisy things... And you get that way easier if you start focusing on using Google Cloud, AWS, Azure, whatever is your poison, basically... Just use an infrastructure provider for that.
[50:10] While reflecting back, I feel like that would have helped us decrease some of the drag around operational efforts. And as well - don't run Cockroach on your own; just use the cloud service. Why not? You must really have valid and specific requirements that don't allow for it before you make that decision. So that's really a big thing on that end.
Okay.
Other things? Yeah, our startup lesson is "Don't assume things. Always validate things. Talk to your customers, talk to potential customers, about whether a feature is really needed." We built, for example, too many features into Citadel. We assumed too many things in the past, and so we're now stripping out some of the things that nobody actually needs. And you need to have data to make that judgment, but other than that, it really comes down to "Check first whether somebody needs something" - and not only one person, but multiple people - and then build it, and not the other way around... Because you get a lot of dead code that you need to maintain, and it can have bugs... And yeah, so that's really a lesson worth mentioning.
Yeah. It's a good thing that you are pruning some of this stuff. Because what usually happens - you never touch it. You add it, but then you never touch it again. And that too often ends up in some very big messes, that no one wants to touch, ever... And then things just die like that, you know?
I mean, as soon as you see somebody raising a concern, or a bug about something, and you think, "Is that feature even being used?", and you can't really validate to yourself "Yes, that's actively being used" - you should start a discussion about whether you'd rather remove it, because just one person is using it. It's just maintenance effort.
I'm wondering, on the operational side of your cloud service, how do you know when something doesn't work as expected? How do you know that there is a problem, basically, is what I'm asking?
Let's say I am a huge fan of using existing data to get a sense of whether something's healthy or not... By which I mean we try to avoid active health checks; we don't ping Citadel from the outside world to figure out whether a service is available or not. I'd rather use the logfiles and throw them into some kind of analytics engine to figure out "Okay, how many status codes do we have, in what variety? How many calls? What's the latency?" Because that gives you a broad understanding of the health of your service across regions. Because if an error rate is growing, and it's tied to one region, you'll normally know it's more of an infrastructure-related problem. If the error rate is growing across the globe, it might be more on the storage side of things, because that's a global thing... And so you quite quickly get a broad sense of how things are.
And the other hugely important thing to us is having traces - like OpenTracing. We use Google Cloud Tracing for that stuff, to get a sense of whether releases change things, or smallish changes, or A/B tests also... Because sometimes we create a new branch in our open source repository and refactor some of the code - for example, get-user-by-ID being reworked with a new SQL statement - and then we deploy it to production to get some traffic to it, and see how latency shifts around.
So it really comes down to observability based on the data you have around, but not checking actively, because that gives you a wrong sensation. Because you're always checking happy paths there. I mean -
[54:03] Yeah, true. Yeah, that's a good point.
Yeah, you can check a login page. I mean, filling in fields, pressing a button. It will work. But what happens if it's not working because the user has strange extensions in the browser, because they have strange proxies in their environment, because mobile connections reset... So it does not reflect the real world.
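The region-versus-global signal Florian describes is easy to picture with a toy aggregation; here's a small, self-contained Go sketch (made-up log entries, not their actual pipeline) that groups 5xx rates by region:

```go
package main

import "fmt"

// LogEntry is a simplified access-log record; in practice these would come
// from Cloud Run request logs exported to whatever analytics engine you use.
type LogEntry struct {
	Region string
	Status int
}

// errorRates returns the fraction of 5xx responses per region.
func errorRates(entries []LogEntry) map[string]float64 {
	total := map[string]int{}
	errors := map[string]int{}
	for _, e := range entries {
		total[e.Region]++
		if e.Status >= 500 {
			errors[e.Region]++
		}
	}
	rates := map[string]float64{}
	for region, n := range total {
		rates[region] = float64(errors[region]) / float64(n)
	}
	return rates
}

func main() {
	entries := []LogEntry{
		{"europe-west1", 200}, {"europe-west1", 200}, {"europe-west1", 503},
		{"us-east1", 200}, {"us-east1", 200}, {"us-east1", 200},
	}
	// A spike confined to one region points at infrastructure there; a spike
	// across all regions points more towards the shared storage layer.
	for region, rate := range errorRates(entries) {
		fmt.Printf("%s: %.1f%% 5xx\n", region, rate*100)
	}
}
```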
Yeah, that's a good point. Okay. So we're still at the beginning of 2023... What are the things that you're looking forward to? ...for Citadel, for your cloud service, for what you do, for the space, for the wider security space. You can answer this whichever way you want. That's why I'm giving you a wide berth. Pick whatever resonates with you the most.
We have identified some needs from our customers who want to have more control over their general user experience... Because currently, people use the hosted login page from Citadel, which can be white-labeled, and you can basically change everything and customize it... But still, it's like the [unintelligible 00:55:11.09] for you. Even though it's a good approach, because we can include all the security policies in it, and verifications, and stuff... But there is a huge demand in the developer space for having authentication-only APIs. So they can basically send us a username and password, and we will respond with true or false. So people want to create their own login experiences, as well as their own registration experiences. And that's a thing we will tackle in the coming months. We want to extend our APIs so that people can build their own login pages.
And during that phase, we will also change our login, because right now it's built with Go and Go HTML templates, and that's not as beautiful as it could be. There might be a point where we change that to Next.js, to get a more SDK-y approach, so that we kind of build our login with Next.js, and we will provide you the template that you can clone and create your own login without our intervention.
So Citadel might become more of a headless identity system in one place, where we just provide you the means to have different components that you can deploy - or if you don't want them, you can get rid of them. So that's kind of a natural evolution path we see there.
And the other big thing we will change in 2023 is that we will extend our actions concept more. Basically, actions are - you can invoke customized JavaScript code in Citadel at certain points. Like, if a user is created, you can call a CRM system with some JavaScript code, you can fetch some information from there... And the whole action engine will be reworked so that we can allow for more flexibility. So you can basically think of it like a GitHub Action; you can subscribe to events, and then execute something. And that's a thing we encourage quite heavily, even though it has a steep cost to be paid in regard to runtime security. I mean, running foreign code in a system is always not so funny to do...
Oh, yeah...
We do a lot of pen testing, and testing in general... And that's the reason why you can't really run it everywhere in Citadel currently - because we want to reduce the threat surface, to get our bearings on whether the engine works, and everything... So that will be subject to change in 2023. So those are the two biggest, most prominent points I think you will see from us on that end.
There is more underlying stuff, especially around machine learning and things, because we - I mean, Citadel is built on an event sourcing concept, so we have a lot of data available... And we want to give our customers the option to train threat-prevention models with their own data, to compensate for signal-based risks... I mean, that's a little bit on the academic side of things, and we are working with research partners on that end. But bringing value from the data is a huge thing that we want to provide our customers.
[58:28] Yeah, okay. Okay. What about the cloud? Anything for the cloud that you have planned?
Some things... [laughs]
Some things. Great. That's a great answer. We can move on. All good.
I'm not sure how much I should already disclose. [laughs]
No, no, all good. Let's move on. Not a problem. All good. [laughs]
Let's see... A thing I can definitely disclose right there is - we are strongly considering opening additional regions, because we now see where traffic is originating from, and we are considering expanding our footprint on that end... And also, a thing that will land in our cloud eventually is a feature I did not disclose yet. It's basically an event API. It's a simple API; you can basically fetch everything that changed in Citadel, with the whole body attached - like, first name changed to whatever - because that gives developers a great way of doing backpressure-friendly processing of things that changed, so they get a proper change track of everything.
And that thing - I mean, the event API will land at one point in our cloud service, but that needs some rigorous testing to be sure that all the inner fencing of our cloud service works out well, so that customers don't see the wrong data, and stuff. So that's a thing. And we are experimenting with different CPU profiles in our cloud service to reduce some of the latency we see, especially during hashing operations... it's one small [unintelligible 00:59:55.17] But there is a limitation in Cloud Run's offering; you can't have really high-CPU containers without high memory, and we don't need so much memory. Normally, Citadel uses 500 megs of memory; we don't use more. The rest is being handled by the storage layer and caching.
So yeah, that's a thing that needs untangling... So either we can use resources better, or we can somehow manage to get more CPUs in there, to get better latency. Yeah, that's an ongoing experiment. We always try to wiggle out some things.
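Since the event API Florian mentions isn't public yet, the shape below is entirely made up - but it sketches the consumer-driven, cursor-based pattern he's describing, where the client controls the pace (the "backpressure" part) by only asking for what comes after the last sequence it has durably processed:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Event is a hypothetical shape for illustration only; the real API may differ.
type Event struct {
	Sequence int64           `json:"sequence"`
	Type     string          `json:"type"` // e.g. "user.firstname.changed"
	Payload  json.RawMessage `json:"payload"`
}

// fetchEvents pulls everything after the given cursor. Because the consumer
// owns the cursor, it controls the pace: process a batch, persist the new
// cursor, then ask for more.
func fetchEvents(baseURL string, after int64) ([]Event, error) {
	resp, err := http.Get(fmt.Sprintf("%s/events?after=%d", baseURL, after))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var events []Event
	if err := json.NewDecoder(resp.Body).Decode(&events); err != nil {
		return nil, err
	}
	return events, nil
}

func main() {
	var cursor int64 // last sequence we have fully processed
	events, err := fetchEvents("https://api.example.com", cursor)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	for _, e := range events {
		fmt.Printf("event %d: %s\n", e.Sequence, e.Type)
		cursor = e.Sequence // persist this somewhere durable in real code
	}
}
```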
Yeah. Okay. Well, that sounds like a very good plan for 2023. Let's see how much of that actually happens... As we know, plans are the best.
[laughs] Yeah, yeah.
Everything's gonna be amazing, and then reality hits, and then you realize that half the stuff you thought would work will be impossible. So that's my favorite - like, get it out there, see what sticks, see what doesn't, keep making those changes, those improvements, drop what doesn't make sense... Whatever.
It's really about testing things. And there are also some discussions around a reiteration of the pricing. For example, a thing we are currently testing, as well as deciding on, is whether we want to give away a domain for free. Because currently, if you want to have a production domain with Citadel Cloud, you need to pay 25 bucks a month. But we are strongly considering whether we want to provide that domain name for free to developers, because it reduces drag along their journey. I mean, they want to start poking around, they want to use it... And if the cost attached to that is not so high for us, it's no real problem. And since Google changed some of their offering in the TLS space, you can get customer certificates quite easily, without a huge cost profile. So a certificate would cost you only like 10 cents per month, per customer... So it feels like the right way to do it, but it's not yet a done deal.
[01:02:06.03] Okay, that sounds exciting. Not as exciting as your takeaway, because I'm sure that you had the time to think, as we were discussing this, about the one key takeaway that you want to leave with the listeners that got as far as this in the conversation with us. So honestly, this was very eye-opening for me, to see how maturely you're thinking about some very hard problems, like distributing code globally, latency, and the scale that you're thinking at... You're saying a few seconds is too slow. And to me, it's like "Whoa, what?" Sometimes requests take longer than a few seconds, because it happens, because maybe you're on a phone, or you're on a watch, or whatever the case may be. And then your CPU isn't as fast, or the cellular doesn't work as well as you may think.
So to me, from what I'm hearing, operationally you're very advanced. And you've tried a couple of things, and you've seen a bunch of things that didn't work out very well in practice, even though the promise is there, and the marketing is working well for certain things... So you have a lot of - I think street-wise is what it's called; you're street-wise. You've been out there, you've tried a few things, and you know what works and what doesn't for you. So, again, for the listeners that made it this far with us - and thank you all for sticking around - what would you like them to take away from our conversation?
I think the most important thought - in my space, to be specific - is don't think of the authentication system as two input fields and a button. Because that's plainly underestimating the amount of effort and depth that goes into such topics. And so I would encourage every listener to always think thoroughly through the reasoning of whether you want to just use a framework in your application, or just create your own login thing... Because there is a huge cost attached to that in regard to operational security. Because you need to maintain it, you need to pen test it, you need to prove the security of the processes, and all that - let's call it dirty plumbing work - to make it happen.
And so I really encourage everybody - please use some kind of turnkey solution that's battle-tested... Even if you use a framework from your language-specific ecosystem, don't build it on your own; you will hurt yourself at one point. And you don't need to take my word for it, but a general agreement in the industry is that it's better to have somebody deeply committed to the topic of authentication and authorization working on that, and not you. Just use something, and build great features and great products. So that's really the one thing I want to get across to everybody.
It just shows that your head is where your heart is, right? ...which is authentication, authorization... I mean, if you build a company around it, you really have to believe in that thing... So you're committed through and through. And that continues to be top of mind for you, which is important.
Yeah, I mean, it's such a huge - the authentication and identity space in general has a huge depth to it. And it always feels like you can easily do it. But as soon as you start poking into the space, you will see there is a huge amount of time flowing into it. I mean, the OAuth threat model is like 60 pages. I recently did something for the Swiss government, which will come out in a few months... It's like 120 pages on just the things you should consider when building something like that... And it starts with things like XSS, and CSPs, and... You need to care for all of that. It's just - the depth is the problem, basically.
Yeah. Okay. So it may seem simple when you consume it as a service, but there's a lot that goes into it. And if you think you can do it - sure, but there are dragons there... So at least just be aware of them. Just don't get eaten without knowing what you're getting yourself into.
Yes. I mean, you can always poke... If you want to play funny games, just go to a login page where you have a user, throw in your username, throw in your password, and if the response comes back faster than 500 milliseconds, I can tell you something is broken already, because no solid password hashing algorithm will return a result that fast. Otherwise, they run huge CPUs - I mean, the Xeons and stuff. Otherwise, you can't get that hashing through so fast. But yeah, that's just a feeling I have. That's a thing I always do - I poke around login pages. [laughs]
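Florian's poke-the-login-page test is easy to reproduce; here's a minimal Go sketch against a hypothetical endpoint and form fields (only try this against systems you're allowed to test) that times a password login and flags suspiciously fast responses:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

func main() {
	// Hypothetical login endpoint and field names - adjust to the page under test.
	form := url.Values{}
	form.Set("username", "someone@example.com")
	form.Set("password", "definitely-not-the-password")

	start := time.Now()
	resp, err := http.Post("https://login.example.com/password",
		"application/x-www-form-urlencoded", strings.NewReader(form.Encode()))
	elapsed := time.Since(start)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()

	fmt.Printf("status %d in %v\n", resp.StatusCode, elapsed)
	if elapsed < 500*time.Millisecond {
		// Note: elapsed includes the network round trip, so this is a rough
		// heuristic - but a properly tuned bcrypt/argon2 verification rarely
		// comes back this fast on commodity hardware.
		fmt.Println("suspiciously fast - is the password really being hashed?")
	}
}
```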
Well, Florian, thank you very much for joining us today. It was an absolute pleasure hosting you. Thank you. It was a great conversation, and I look forward to the next one.
Thank you. Likewise. I really liked your questions. And you see, I'm still laughing...
Yeah, exactly. So I've done something right. Thank you very much for that. Until next time. See you.
Our transcripts are open source on GitHub. Improvements are welcome. š