If you’re a software engineer working anywhere near backend systems, the term “service mesh” has probably infiltrated your consciousness some time over the past few years. Thanks to a strange confluence of events, this phrase has been rolling around the industry like a giant Katamari ball, glomming on successively bigger pieces of marketing and hype and showing no signs of stopping any time soon.
The service mesh was born in the murky, trend-infested waters of the cloud native ecosystem, which unfortunately means that a huge amount of service mesh content ranges from “low-calorie fluff” to—to use a technical term—“basically bullshit”. But there’s some real, concrete, and important value to the service mesh, if you can cut through all the noise.
In this guide I’m going to attempt just that: to provide an honest, deep, engineer-focused guide to the service mesh. I’m going to cover not just the what but also the why and the why now. Finally, I’m going to attempt to describe why I think this particular technology has attracted such a crazy level of hype, which is an interesting story in and of itself.
Who am I?
Hi there. I’m William Morgan. I am one of the creators of Linkerd, the very first service mesh project and the project that gave birth to the term service mesh itself. (Sorry!) I’m also the CEO of Buoyant, a startup that builds cool service mesh stuff like Linkerd and Dive.
As you might imagine, I am very biased and have some strong opinions on this topic. That said, I’m going to do my best to keep the editorializing to a minimum (except for one section, “Why do people talk so much about this?”, where I’ll unveil some opinions) and to write this guide in a way that is as objective as possible. When I need concrete examples I’ll primarily rely on Linkerd, but when there are differences I know about with other mesh implementations I’ll call them out.
Ok. On to the good stuff!
What is a service mesh?
For all the hype, the service mesh is architecturally pretty straightforward. It’s nothing more than a bunch of userspace proxies, stuck “next” to your services (we’ll talk about what “next” means in a bit), plus a set of management processes. The proxies are referred to as the service mesh’s data plane, and the management processes as its control plane. The data plane intercepts calls between services and “does stuff” with these calls; the control plane coordinates the behavior of the proxies, and provides an API for you, the operator, to manipulate and measure the mesh as a whole.
What are these proxies? They’re Layer 7-aware TCP proxies, just like haproxy and NGINX. The choice of proxy varies; Linkerd uses a Rust proxy simply called linkerd-proxy that we built specifically for the service mesh. Other meshes use different proxies; Envoy is a common choice. But the choice of proxy is an implementation detail.
What do these proxies do? They proxy calls to and from the services, of course. (Strictly speaking, they act as both “proxies” and “reverse proxies”, handling both incoming and outgoing calls.) And they implement a featureset that focuses on the calls between services. This focus on traffic between services is what differentiates service mesh proxies from, say, API gateways or ingress proxies, which focus on calls from the outside world into the cluster as a whole.
So that’s the data plane. The control plane is simpler: it’s a set of components that provide whatever machinery the data plane needs to act in a coordinated fashion, including service discovery, TLS certificate issuing, metrics aggregation, and so on. The data plane calls the control plane to inform its behavior; the control plane in turn provides an API to allow the user to modify and inspect the behavior of the data plane as a whole.
Here’s a diagram of Linkerd’s control plane and data plane. You can see that the control plane has several different components, including a small Prometheus instance that aggregates metrics data from the proxies, as well as components such as destination (service discovery) and public-api (web and CLI endpoints). The data plane, by contrast, is just a single linkerd-proxy next to an application instance. This is just the logical diagram; when deployed, you may end up with three replicas of each control plane component but hundreds or thousands of data plane proxies.
(The blue boxes in this diagram represent Kubernetes pod boundaries. You can see that the linkerd-proxy containers actually run in the same pod as the application containers. This pattern is known as a sidecar container.)
The architecture of the service mesh has a couple big implications. For one, since the proxy featureset is designed for service-to-service calls, the service mesh really only makes sense if your application is built as services. You could use it with a monolith, but it would be a whole lot of machinery to run a single proxy, and the featureset wouldn’t be a great fit.
Another consequence is that the service mesh is going to require lots and lots of proxies. In fact, Linkerd adds one linkerd-proxy per instance of every service. (Some other mesh implementations add one proxy per node / host / VM. It’s a lot either way.) This heavy use of proxies itself has a couple implications:
- Whatever these data plane proxies are, they’d better be fast. You’re adding two proxy hops to every call, one on the client side and one on the server side.
- Also, the proxies need to be small and light. Each one will consume memory and CPU, and this consumption will scale linearly with your application.
- You’re going to need a system for deploying and updating lots of proxies. You don’t want to have to do this by hand.
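To make the first two points concrete, here’s a back-of-the-envelope sketch of the sidecar cost model. The function name and every number you feed it (instance count, memory per proxy, latency per hop) are hypothetical inputs for illustration, not measurements of Linkerd or any other mesh.

```python
def estimate_sidecar_cost(service_instances, memory_mb_per_proxy, latency_ms_per_hop):
    """Rough overhead of a one-proxy-per-instance (sidecar) deployment."""
    # One proxy per service instance, so the proxy count tracks the app's size.
    proxies = service_instances
    # Memory consumption scales linearly with the number of proxies.
    total_memory_mb = proxies * memory_mb_per_proxy
    # Every call crosses two proxies: one on the client side, one on the server side.
    added_latency_ms_per_call = 2 * latency_ms_per_hop
    return {
        "proxies": proxies,
        "total_memory_mb": total_memory_mb,
        "added_latency_ms_per_call": added_latency_ms_per_call,
    }
```

Doubling the number of service instances doubles the proxy count and the memory bill, while the per-call latency cost stays fixed at two hops. That’s why the per-proxy footprint and per-hop latency matter so much.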
But, at least at the 10,000ft level, that’s really all there is to the service mesh: you deploy a ton of userspace proxies to “do stuff” to internal, service-to-service traffic, and you use the control plane to change their behavior and to query the data they generate.
Now let’s move on to the why.
Why does the service mesh make sense?
If you’re encountering the idea of service mesh for the first time, you can be forgiven if your first reaction is mild horror. The design of the service mesh means that not only does it add latency to your application, it also consumes resources and introduces a whole bunch of machinery. One minute you’re installing a service mesh, the next you’re suddenly on the hook for operating hundreds or thousands of proxies. Why would anyone want to do this?
There are two parts to the answer. The first is that the operational cost of deploying these proxies can be greatly reduced, thanks to some other changes that are happening in the ecosystem. Lots more on that later.
The more important answer is that this design is actually a great way to introduce additional logic into the system. That’s not only because there are a ton of features you can add right there, but also because you can add them without changing the ecosystem. In fact, the entire service mesh model is predicated on this very insight: that, in a multi-service system, regardless of what individual services actually do, the traffic between them is an ideal insertion point for functionality.
For example, Linkerd, like most meshes, has a Layer 7 feature set focused primarily on HTTP calls, including HTTP/2 and gRPC.1 The feature set is broad, but can be divided into three classes:
- Reliability features. Request retries, timeouts, canaries (traffic splitting/shifting), etc.
- Observability features. Aggregation of success rates, latencies, and request volumes for each service, or individual routes; drawing of service topology maps; etc.
- Security features. Mutual TLS, access control, etc.
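As a rough illustration of the observability class, here’s a sketch of the kind of per-route aggregation a proxy could do for every call it sees. The class and method names are hypothetical, and treating any 5xx response as a failure is an assumption. A real proxy would use histograms rather than keep every sample in memory.

```python
import math
from collections import defaultdict


class RouteMetrics:
    """Aggregates request volume, success rate, and latency per (service, route)."""

    def __init__(self):
        self._latencies_ms = defaultdict(list)
        self._successes = defaultdict(int)

    def record(self, service, route, status_code, latency_ms):
        key = (service, route)
        self._latencies_ms[key].append(latency_ms)
        # Assumption: any 5xx status counts as a failure, everything else as success.
        if status_code < 500:
            self._successes[key] += 1

    def request_count(self, service, route):
        return len(self._latencies_ms.get((service, route), []))

    def success_rate(self, service, route):
        total = self.request_count(service, route)
        if total == 0:
            return None  # no traffic observed yet
        return self._successes.get((service, route), 0) / total

    def latency_percentile(self, service, route, pct):
        """Nearest-rank percentile of observed latencies; pct is in (0, 100]."""
        samples = sorted(self._latencies_ms.get((service, route), []))
        if not samples:
            return None
        rank = max(1, math.ceil(pct / 100 * len(samples)))
        return samples[rank - 1]
```

Note that nothing here knows what the service actually does. It only needs the route, the response code, and the timing of each call.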
Many of these features operate at the request level (hence the “L7 proxy”). For example, if service Foo makes an HTTP call to service Bar, the linkerd-proxy on Foo’s side can load balance that call intelligently across all the instances of Bar based on the observed latency of each one; it can retry the request if it fails and if it’s idempotent; it can record the response code and latency; and so on. Similarly, the linkerd-proxy on Bar’s side can reject the call if it’s not allowed, or is over the rate limit; it can record latency from its perspective; and so on.
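Here’s one way latency-aware load balancing could look on Foo’s side. This is a sketch of a common technique (an exponentially weighted moving average of latency per instance, combined with picking the better of two random candidates), not a description of what linkerd-proxy actually implements. The class name, the `alpha` smoothing factor, and the endpoint names are all hypothetical.

```python
import random


class LatencyAwareBalancer:
    """Picks an endpoint by comparing the latency estimates of two random candidates."""

    def __init__(self, endpoints, alpha=0.3, rng=None):
        if not endpoints:
            raise ValueError("need at least one endpoint")
        self._alpha = alpha
        self._rng = rng if rng is not None else random.Random()
        # Moving-average latency per endpoint; None until the first observation.
        self._ewma_ms = {endpoint: None for endpoint in endpoints}

    def observe(self, endpoint, latency_ms):
        """Fold a newly observed latency into the endpoint's moving average."""
        previous = self._ewma_ms[endpoint]
        if previous is None:
            self._ewma_ms[endpoint] = latency_ms
        else:
            self._ewma_ms[endpoint] = (
                self._alpha * latency_ms + (1 - self._alpha) * previous
            )

    def latency_estimate(self, endpoint):
        return self._ewma_ms[endpoint]

    def _score(self, endpoint):
        estimate = self._ewma_ms[endpoint]
        # Endpoints with no data score 0 so that they get tried at least once.
        return 0.0 if estimate is None else estimate

    def pick(self):
        endpoints = list(self._ewma_ms)
        if len(endpoints) == 1:
            return endpoints[0]
        # Sample two distinct candidates and keep the one with the lower estimate.
        first, second = self._rng.sample(endpoints, 2)
        return first if self._score(first) <= self._score(second) else second
```

With this scheme the slowest instance of Bar never wins a comparison, so traffic drifts away from it until its observed latency recovers.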
The proxies can “do stuff” at the connection level too. For example, Foo’s linkerd-proxy can initiate a TLS connection and Bar’s linkerd-proxy can terminate it, and both sides can validate the other’s TLS certificate.2 This provides not just encryption between services, but a cryptographically secure form of service identity—Foo and Bar can “prove” they are who they say they are.
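To show what “both sides validate” means in terms of TLS settings, here’s a minimal sketch using Python’s standard `ssl` module. It only illustrates the verification configuration. linkerd-proxy is written in Rust and gets its certificates from the control plane, so none of this is Linkerd’s actual code, and the function names and file-path parameters are hypothetical.

```python
import ssl


def server_context(cert_file=None, key_file=None, ca_file=None):
    """TLS settings for the receiving side (Bar): demand a valid client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # This is the "mutual" part: a plain TLS server leaves this at CERT_NONE.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # Trust only the CA that issues the workload certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


def client_context(cert_file=None, key_file=None, ca_file=None):
    """TLS settings for the calling side (Foo): verify the server, present our own cert."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The only difference from “regular” TLS is on the server side: it refuses the connection unless the client also presents a certificate signed by a trusted CA.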
Whether they’re at the request or at the connection level, one important thing to note is that the features of the service mesh are all operational in nature. There isn’t anything in Linkerd about transforming the semantics of the request payload, e.g. adding fields to a JSON blob or transforming a protobuf. This is an important distinction that I’ll touch on again when we talk about ESBs and middleware.
So that’s the set of features that the service mesh can provide. But why not just implement them directly in the application? Why bother with the proxies at all?
Why is the service mesh a good idea?
While the featureset is interesting, the core value of the service mesh is not actually in the features. After all, we could implement these features directly in the applications themselves. (In fact, we’ll see later that this was the genesis of the service mesh.) If I had to put it into a single sentence, the value of the service mesh comes down to this: The service mesh gives you features that are critical for running modern server-side software in a way that’s uniform across your stack and decoupled from application code.
Let’s take that one bit at a time.
Features that are critical for running modern server-side software. If you are building a transactional, server side application that is connected to the public Internet and takes requests from the outside world and responds to them within some short timeframe—think web apps, API servers, and the bulk of modern server-side software—and if you are building this system as a collection of services which talk to each other in a synchronous fashion, and if you are continually modifying this software to add more functionality, and if you are tasked with keeping this system running even while you’re modifying it—then congratulations, you are building modern server-side software. And all those glorious features listed above actually turn out to be critical for you. The application must be reliable; it must be secure; and you must be able to observe what it’s doing. And that’s exactly what the service mesh helps with.
(Ok, I snuck an opinion in there: that this one approach is the modern way to build server-side software. There are people in the world today who are building monoliths or “reactive microservices” and other things that don’t fit into the definition above, who might have a different opinion. In turn, my opinion is that their opinion is “wrong”—but either way the service mesh is not very useful for them.)
Uniform across your stack. The features provided by the service mesh aren’t just critical, they apply to every service in your application, regardless of what language the service is written in, what framework it uses, who wrote it, how it was deployed, or any other detail of development or deployment.
Decoupled from application code. Finally, the service mesh doesn’t just provide features uniformly across your stack, it does so in a way that requires no application changes. The fundamental ownership of the service mesh functionality—including the operational ownership of configuration, updates, operation, maintenance, etc—lies purely at the platform level, independent of the application. The application can change without the service mesh being involved, and the service mesh can change without the application being involved.
In short: not only does the service mesh provide vital features, it does so in a way that’s global, uniform, and independent of the application. And so while yes, the features of the service mesh could be implemented in the service code (even as a library that was linked into every service), this approach would not provide the decoupling and uniformity that’s at the heart of the service mesh value prop.
And all you have to do is add a lot of proxies! I promise that we’re going to talk about the operational cost of adding all these proxies very soon. But first, we need a pit stop to examine this idea of decoupling from the perspective of people.
Who does the service mesh help?
As inconvenient as it may be, it turns out that in order for technology to actually have an impact, it must be adopted by human beings. So who adopts the service mesh? Who benefits from it?
If you’re building what I’ve described above as modern server-side software, you can roughly think of your team as divided into service owners, who are in the business of building the business logic, and platform owners, who are building the internal platform on which these services run. In small organizations, these may be the same people, but as the organization gets larger these roles typically get more defined and even further subdivided. (There’s a lot more to be said here about the changing nature of devops, the organizational impact of microservices, etc. But for now let’s take these descriptions as a given.)
Seen through this lens, the immediate beneficiary of the service mesh is the platform owners. The goal of the platform team, after all, is to build the internal platform on which the service owners can run their business logic, and to do so in a way that keeps the service owners as independent as possible from the gory details of operationalization. The service mesh not only provides features that are critical for accomplishing this, it does so in a way that doesn’t, in turn, incur a dependency on service owners.
The service owners also benefit, albeit in a more indirect way. The goal of the service owner is to be as productive as possible in building the logic of the business, and the fewer operational mechanics they have to worry about, the easier that is. Rather than being on the hook for implementing e.g. retry policies or TLS, they can focus purely on business logic concerns and trust that the platform will take care of the rest. That’s a big plus for them as well.
The organizational value of the decoupling between platform and service owners can’t be overstated. In fact, I think it might be the key reason why the service mesh is valuable.
We learned this lesson when one of our earliest Linkerd adopters told us just why they were adopting a service mesh: because it allowed them to “not have to talk to people”. This was a platform team at a large company that was migrating to Kubernetes. Because their app handled sensitive information, they wanted to encrypt all communication on the clusters. There were hundreds of services and hundreds of developer teams, and they were not looking forward to convincing each dev team to add TLS to their roadmap. By installing Linkerd, they shifted ownership of the feature out of the hands of developers, for whom it was an imposition, and into the hands of the platform team, for whom it was a top-level priority. Linkerd didn’t solve a technical problem for them so much as it solved an organizational problem.
In short, the service mesh is less a solution to a technical problem than it is a solution to a socio-technical problem.3
Does the service mesh solve all my problems?
Yes. Er, no!
If you look at the three classes of features outlined above—reliability, security, and observability—it should be clear that the service mesh is not a complete solution for any of these domains. While Linkerd can retry requests when it knows that they are idempotent, it can’t make decisions about what to return to the user if a service is entirely down—the application must make these decisions. While Linkerd can report success rates, etc, it can’t look inside a service and report internal metrics—the application must have instrumentation. And while Linkerd can do things like mutual TLS “for free”, there’s a lot more to a security solution than just that.
The subset of features in those domains that the service mesh provides are the ones that are platform features. By this I mean features that are:
- Independent of business logic. The way that traffic latency histograms are computed for calls between Foo and Bar is totally independent of why Foo is calling Bar in the first place.
- Difficult to implement correctly. Linkerd’s retries are parameterized with sophisticated things like retry budgets because the naive approach to retries is a sure path to “retry storms” and other distributed system failure modes.
- Most effective when implemented uniformly. The mechanics of mutual TLS only really make sense when everyone is doing them.
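To illustrate why retries fall in the “difficult to implement correctly” bucket, here’s a simplified sketch of a retry budget. Retries are only allowed up to a fixed fraction of recent original requests, so a failing dependency can’t be buried under a flood of retries. The class, its parameter names, and the sliding-window bookkeeping are assumptions made for this example and don’t mirror Linkerd’s actual configuration or implementation.

```python
import time
from collections import deque


class RetryBudget:
    """Allows retries only up to a fraction of recent original requests."""

    def __init__(self, retry_ratio=0.2, min_retries=10, window_s=10.0,
                 clock=time.monotonic):
        self._retry_ratio = retry_ratio  # extra load allowed, e.g. 0.2 = +20%
        self._min_retries = min_retries  # floor so low-traffic clients can still retry
        self._window_s = window_s        # how far back "recent" reaches
        self._clock = clock
        self._requests = deque()         # timestamps of original requests
        self._retries = deque()          # timestamps of retries already granted

    def _expire(self, now):
        cutoff = now - self._window_s
        for timestamps in (self._requests, self._retries):
            while timestamps and timestamps[0] <= cutoff:
                timestamps.popleft()

    def record_request(self):
        """Call once for every original (non-retry) request."""
        now = self._clock()
        self._expire(now)
        self._requests.append(now)

    def try_retry(self):
        """Return True, and spend budget, if a retry is currently allowed."""
        now = self._clock()
        self._expire(now)
        allowed = self._min_retries + self._retry_ratio * len(self._requests)
        if len(self._retries) < allowed:
            self._retries.append(now)
            return True
        return False
```

Compare this with the naive “retry every failure up to three times” policy: when a service is struggling, that policy can multiply its load by up to four, whereas a budget caps the extra load at a known ratio.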
Because these features are implemented at the proxy layer, rather than at the application layer, the service mesh provides them at the platform, not application, level. It doesn’t matter what language the services are written in, or what framework they use, or who wrote them, or how they got there. The proxies function independent of all that, and the ownership of this functionality—including the operational ownership of configuration, updates, operation, maintenance, etc—lies purely at the platform level.
Example features of the service mesh

| Layer | Observability | Reliability | Security |
|---|---|---|---|
| Service Mesh | Service success rates | Request retries | Mutual TLS between all services |
| Platform (non-service mesh) | Log aggregation | Multiple replicas of dataset | Encryption of data at rest |
| Application | Instrumentation of internal feature usage | Handling of failure when an entire component is down | Ensuring users only have access to their own data |
To summarize: the service mesh is not a complete solution to reliability, or to observability, or to security. The broader ownership of those domains necessarily involves service owners, ops and SRE teams, and other parts of the organization. The service mesh can only provide a platform-layer “slice” of each domain.
Why does the service mesh make sense now?
At this point you may be saying to yourself: ok, if this service mesh thing is so great, why weren’t we rolling millions of proxies in our stack ten years ago?
There’s a shallow answer to this, which is that ten years ago everyone was building monoliths, and so no one needed a service mesh. Which is true, but I think misses the point. Even ten years ago, the concept of “microservices” as a feasible way of building high-scale systems was widely discussed, and was publicly being put into practice at companies like Twitter, Facebook, Google, and Netflix. The general sentiment, at least in the parts of the industry I was exposed to, was that microservices were the “right way” to build high-scale systems, even if gosh they were really painful to do.
Of course, while there were companies operating microservices ten years ago, they were by and large not installing proxies everywhere to form a service mesh. If you looked closely, though, they were doing something related: many of these organizations mandated the use of a specific internal library for network communication (sometimes called a “fat client” library). Netflix had Hystrix, Google had the Stubby libraries, and Twitter had Finagle. Finagle, for example, was mandatory for every new service at Twitter, handled both client and server sides of the connection, and implemented retries, and request routing, and load balancing, and instrumentation. It provided a consistent layer of reliability and observability across the entire Twitter stack, independent of what the service itself actually did. Sure, it only worked for JVM languages, and it had a programming model that you had to build your whole app around, but the operational features it provided were almost exactly those of the service mesh.4
So ten years ago, not only did we have microservices, we had proto-service-mesh libraries that solved many of the same problems that the service mesh solves today. But we didn’t have the service mesh. Something else needed to change first.
And that’s where the deeper answer lies, buried in another difference that’s happened over the past ten years: there’s been a dramatic reduction of the cost of deploying microservices. The companies I’ve listed above who were publicly using microservices a decade ago—Twitter, Netflix, Facebook, Google—were companies of immense scale and immense resources. They had not just the need but the talent to build, deploy, and operate significant microservice applications. The sheer amount of engineering time and energy that went into Twitter’s migration from monolith to microservices boggles the imagination,5 and this sort of infrastructural maneuver was essentially impossible for smaller companies.
Contrast that to today, where you might encounter startups with a 5:1 or even 10:1 ratio of microservices to developers—and what’s more, they are equipped to handle it. If running 50 microservices is a plausible approach for a 5-person startup, then clearly something has reduced the cost of adopting microservices.
The dramatic reduction in the cost of operating microservices is a result of one thing: the rise in the adoption of containers and container orchestrators. And this is where the deeper answer to the question of what change has enabled the service mesh lies. What’s made the service mesh operationally viable is the same thing that’s making microservices operationally viable: Kubernetes and Docker.
Why? Well, Docker solves one big thing: the packaging problem. By allowing you to package your app and its (non-network) runtime dependencies into a container, your app is now a fungible unit that can be thrown around and run anywhere. By the same token, Docker makes it exponentially easier to run a polyglot stack: because the container is an atomic unit of execution, for deploy and operational purposes it doesn’t really matter what’s inside the container, and whether it’s a JVM app or a Node app or Go or Python or Ruby. You just run it.
Kubernetes solves the next step: now that I have a bunch of “executable things”, and I also have a bunch of “things that can execute these executable things” (aka machines), I need a mapping between them. In a broad sense, you give Kubernetes a bunch of containers and a bunch of machines, and it figures out this mapping. (Which of course is a dynamic and ever-shifting thing, as new containers roll through the system, machines come in and out of operation, and so on. But Kubernetes figures it out.)
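As a toy illustration of that mapping problem, here’s a greedy placement function. It is emphatically not how the Kubernetes scheduler works (the real one weighs many more constraints and reacts continuously as things change). The function name and the memory-only resource model are assumptions made to keep the example small.

```python
def schedule(containers, machines):
    """Map containers onto machines by memory alone.

    containers: {container_name: memory_mb_needed}
    machines:   {machine_name: memory_mb_free}
    Returns (placement, unschedulable), where placement is {container: machine}.
    """
    free = dict(machines)  # work on a copy so the caller's dict is untouched
    placement = {}
    unschedulable = []
    # Place the biggest containers first; ties are broken by name for determinism.
    for name, need in sorted(containers.items(), key=lambda kv: (-kv[1], kv[0])):
        candidates = [m for m, capacity in free.items() if capacity >= need]
        if not candidates:
            unschedulable.append(name)
            continue
        # Prefer the machine with the most free memory, then the lowest name.
        target = min(candidates, key=lambda m: (-free[m], m))
        placement[name] = target
        free[target] -= need
    return placement, unschedulable
```

The point is the shape of the interface: you hand over a list of things to run and a list of places to run them, and you get back an assignment without deciding any of it by hand.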
Once you have Kubernetes going, the deploy-time cost of running one service is not that much different from running ten services, and in fact not that different from 100 services. Combine that with the container as packaging mechanism that encourages polyglot implementations, and the result is a ton of new applications that are implemented as microservices written in a variety of languages—exactly the environment the service mesh is most suited for.
And so finally we come to why the service mesh is feasible now: the very same uniformity that Kubernetes provides for services is directly applicable to the operational challenges of the service mesh. You package the proxies into containers, you tell Kubernetes to stick ‘em everywhere, and voila! You got yourself a service mesh, with all the deploy-time mechanics handled for you by Kubernetes.6
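Here’s a sketch of what “stick ‘em everywhere” amounts to at the level of a single pod: take the pod’s container list and add a proxy container to it. The container name and image below are hypothetical placeholders, not Linkerd’s real ones, and the pod spec is reduced to a plain dictionary. A real injector also has to arrange for the pod’s traffic to be routed through the proxy, which is left out here.

```python
import copy

# Hypothetical values for illustration; not a real proxy name or image.
PROXY_CONTAINER = {"name": "mesh-proxy", "image": "example.com/mesh-proxy:latest"}


def inject_sidecar(pod_spec, proxy=PROXY_CONTAINER):
    """Return a copy of pod_spec with the proxy container added.

    Injection is idempotent: a pod that already has the proxy is left as it is.
    """
    patched = copy.deepcopy(pod_spec)
    containers = patched.setdefault("containers", [])
    already_injected = any(c.get("name") == proxy["name"] for c in containers)
    if not already_injected:
        containers.append(copy.deepcopy(proxy))
    return patched
```

Because the transformation is the same for every pod regardless of what the application container holds, it can be applied mechanically across the whole cluster.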
To summarize: the reason why the service mesh makes sense now, as opposed to 10 years ago, is that the rise of Kubernetes and Docker has not only dramatically increased the need to run a service mesh, by making it easy to build your application as a polyglot microservices architecture, it has also dramatically reduced the cost of running a service mesh, by providing mechanisms for deploying and maintaining fleets of sidecar proxies.
Why do people talk SO MUCH about the service mesh?
Content warning: In this section, I resort to speculation, conjecture, inside baseball, and opinion.
One need only search for “service mesh” to encounter a Kafka-esque fever dream of a landscape, full of confusing projects, low-calorie recycled content, and general echo chamber distortion. All shiny new tech has a certain level of this, but the service mesh seems to have a particularly bad case. Why is that?
Well, partly it’s my fault. I’ve done my best to talk up Linkerd and the service mesh at every opportunity, over countless blog posts and podcasts and articles like this one. But I’m not that powerful. To really answer this question, I have to talk about the service mesh landscape. And it’s impossible to talk about the landscape without talking about one project in particular: Istio, an open source service mesh that’s billed as a collaboration between Google, IBM, and Lyft.7
What’s remarkable about Istio is two things. First, the sheer amount of marketing effort that Google, in particular, is placing behind it. In my estimation, the majority of people who know about the service mesh today were introduced to it through Istio. The second remarkable thing is just how poorly Istio has been received. Obviously I have a horse in this race, but trying to be as objective as I can, it seems to me that Istio has developed a pretty public backlash in a way that’s uncommon (though not unheard of8) for an open source project.9
Leaving aside my personal theories as to why that’s happening, I believe it’s Google’s involvement here that is really the reason that the service mesh space is so hype-y. Specifically, the combination of a) Istio being promoted so heavily by Google; b) its corresponding lackluster reception; and c) the recent meteoric rise of Kubernetes still fresh on everyone’s minds have all combined to form a kind of heady, oxygen-free environment where capacity for rational thought is extinguished and only a weird kind of cloud-native tulip mania remains.
From the Linkerd perspective, of course, this is… I guess I would describe it as a mixed blessing. I mean, it’s great that the service mesh is a “thing” now—this was not the case in 2016 when Linkerd first got off the ground, and it was really hard to get anyone to pay attention. We don’t have that problem any more! But it sucks that the service mesh landscape is so confusing and it’s so hard to understand even which projects are service meshes, never mind which one fits your use case the best. That does everyone a disservice. (And there are certainly situations where Istio or another project would be the right choice over Linkerd—it’s far from a one-size-fits-all solution.)
On the Linkerd side, our strategy has been to ignore the noise, continue focusing on solving real problems for our community, and basically wait for the whole thing to blow over. The level of hype will eventually subside and we can all get on with our lives.
In the meantime, though, we’re all going to have to suffer through this together.
So… should I, a humble software engineer, care about the service mesh?
If you’re a software engineer, here’s my basic rubric for whether you should care about the service mesh.
If you are in a pure business-logic-implementin’ developer role: No, you don’t really need to care about the service mesh. I mean, you’re certainly welcome to care, but ideally the service mesh won’t directly affect anything in your life. Keep building that sweet, sweet business logic that gets everyone around you paid.
If you are in a platform role in an org that is using Kubernetes: Yes, you 100% should care. Unless you are adopting K8s purely to run a monolith or to do batch processing (in which case, I would seriously ask the question of why K8s), you’re going to end up in a situation where you have lots of microservices, all written by other people, all talking to each other, all tied together into one unholy bundle of runtime dependencies, and you’re going to need a way to deal with that. Since you’re on Kubernetes, you will have several service mesh options, and you should have an informed opinion about which ones or even whether you want any of them at all. (Start with Linkerd.)
If you are in a platform role in an org that is NOT using Kubernetes, but IS “doing microservices”: Yes, you should care, but it’s going to be complicated. Sure, you could get the value of the service mesh by deploying lots of proxies everywhere, but the nice part of Kubernetes is the deployment model, and your ROI equation is going to look very different if you have to manage these proxies yourself.
If you are in a platform role in an org that is “doing monoliths”: No, you probably don’t need to care. If you are operating a monolith, or even a “collection of monoliths” that have well-defined and infrequently-changing communication patterns, then the service mesh will not add very much and you can probably just ignore it and hope it goes away.
The service mesh probably doesn’t actually hold the title of “the World’s Most Over-Hyped Technology”–that dubious distinction probably goes to Bitcoin or AI. Maybe it’s merely in the top 5. But if you can cut through the layers of noise, there’s some real value to be had for anyone who’s building applications on Kubernetes.
Finally, I’d love for you to try Linkerd—it should take about 60 seconds to install on a Kubernetes cluster, even just a Minikube on your laptop—and you can see for yourself exactly what I’m talking about.
If I just ignore this whole service mesh thing will it just go away?
Sadly, the service mesh is here to stay.
But I don’t WANT to use a service mesh.
Then don’t. But see my guide above as to whether you need to understand it.
Isn’t this just ESB / middleware all over again?
The service mesh focuses on operational logic, not business logic. That was the downfall of the enterprise service bus. Keeping that separation is critical for the service mesh avoiding the same fate.
How is this different from API gateways?
There are a million articles about this. Just google it.
Is Envoy a service mesh?
No, it’s not a service mesh. Envoy is a proxy. It can be used to make a service mesh (and many other things; it’s a general-purpose proxy). But it’s not a service mesh by itself.
Is Network Service Mesh a service mesh?
No. Despite the name, it’s not a service mesh. (Marketing is fun, right?)
Will the service mesh help my reactive, asynchronous message queue-based system?
No, the service mesh won’t help you.
Which service mesh should I use?
I think this article sucks / I think you suck.
Please share this link with all your friends so that they can see just how much it sucks / I suck.
Thanks and credits
As you might’ve guessed from the title, this article was inspired by Jay Kreps’s fantastic treatise on logs, The Log: What every software engineer should know about real-time data’s unifying abstraction. I met Jay when I interviewed at LinkedIn almost a decade ago and he’s been an inspiration ever since.
While I like to call myself a Linkerd maintainer, the reality is that I am mostly “maintainer of Linkerd’s README.md”. Linkerd today is the work of many many many many people, and would not be possible without the amazing community of contributors and adopters.
Finally, a special shoutout to the creator of Linkerd, Oliver Gould (primus inter pares), who took the plunge with me on this whole service mesh thing many years ago.
1. From Linkerd’s perspective, gRPC is basically the same as HTTP/2, you just happen to be using protobuf in the payload. From the developer’s perspective, of course, it’s quite different.
2. “Mutual” means that the client’s certificate is also validated. This is as opposed to “regular” TLS, e.g. between a web browser and a web server, which typically only validates the server’s certificate.
3. Thanks to Cindy Sridharan for introducing me to this term.
4. In fact, the first version of Linkerd was simply Finagle wrapped up in proxy form.
5. As does, frankly, the fact that it succeeded.
6. At least, at the 10,000-ft level. There’s a lot more to it than this, of course.
7. These three companies play very different roles: Lyft’s involvement seems to be in name only; they were the originator of Envoy but don’t appear to use Istio or even contribute to it. IBM contributes to Istio and also uses it. Google contributes heavily but as far as I can tell doesn’t actually use Istio.
8. Systemd comes to mind. The comparison has been made, several times.
9. In practice, Istio appears to have issues not just with complexity and UX but with performance. During our third-party Linkerd benchmark evaluation, for example, evaluators were able to find situations where Istio’s tail latency was 100x that of Linkerd, as well as low-resource environments where Linkerd happily chugged along but Istio completely stopped functioning.