In this episode, José Quaresma speaks with Iris Dyrmishi, Senior Observability Engineer at Miro, about building an observability platform that hundreds of engineers actually trust and use. Iris explains how her team treats observability as an internal product, walks through Miro's tracing migration from Jaeger and Zipkin to OpenTelemetry with zero disruption, and shares how teams now use traces proactively to find bottlenecks before they become outages. The conversation also covers the honest downsides — alert noise, dashboard sprawl, and the cost of observability — including a recent example using eBPF and Grafana Beyla to uncover hidden networking expenses that transformed Miro's cloud bill.
Iris Dyrmishi is a Senior Observability Engineer at Miro, where she builds and maintains the company's observability platform. She started as a backend engineer before moving into SRE roles at Worten Portugal and Farfetch, where she developed her specialty in tracing and drove OpenTelemetry migrations across large engineering organisations without disrupting existing workflows. A CNCF Ambassador, co-organiser of Kubernetes Community Days Porto, and active voice in the observability community, she writes extensively about practical adoption challenges and has spoken at KubeCon EU and on the o11ycast podcast. Her guiding philosophy: observability is a team sport.
Episode transcript (auto-generated):
José
Hello and welcome to the Smooth Scaling Podcast, where we speak with industry experts to uncover how to design, build, and run scalable and resilient systems. I'm your host, José Quaresma, and today we had the pleasure of talking to Iris Dyrmishi, Senior Observability Engineer at Miro. We went all in on observability and tracing, focusing on how to use it both in a reactive way and a proactive way. We also discussed her experience treating the observability platform as a product and aligning it with the needs of the engineering teams. If you like this episode, please subscribe and leave a review. It really helps the podcast. Enjoy.
Hi Iris, welcome to the podcast.
Iris
Thank you so much. I'm happy to be here.
José
I would like to start straight in and ask you about your role and your team at Miro. You're working in the observability team. Can you tell us what the team does?
Iris
Absolutely. So technically I'm an SRE focusing on the observability part, an observability engineer, and we're a team of around eight people now. Day to day, we build and maintain the observability platform: constantly improving it, modernizing it, and making sure it stays up to date with everything happening in the community. On top of that, we also do advocacy for observability, creating processes and guidelines for the teams to follow. So it's not only technical; it's also a little bit of advocating and a little bit of product selling to our customers, who are our engineers. It goes beyond the technical.
José
That actually aligns quite well with some of my previous experience — just because you build something that is awesome and really works doesn't necessarily mean that people will start using it. So even internally within Miro, I guess you would have to do some internal selling to get people to use your platform and the services that you're providing, right?
Iris
Absolutely. Because observability is such an important topic for the day-to-day work of all engineers, we can't just expect someone to know what observability is and everything that we offer. So we have to make sure that it is properly documented, properly explained to everyone, also showing what benefits it could bring to the team so it gets adopted. Some things get collected by default, so the teams don't need to do much. But correlation, setting up the proper alerting — these things need to be communicated. We have so many engineers in Miro currently, and in my previous experiences as well, you cannot expect everyone to know everything. They have their own roadmaps, their own work to take care of. So it's our job to teach them and to show them how to use our tools.
José
And you mentioned that you see the engineering teams as your customers. How does that impact how you build the platform?
Iris
They have a very important role in what we're bringing to the table. Of course, the whole team has been working in observability for a long time. We're experts in our field. We know what needs to be done and what needs to be brought to the table. But most of us have never been back-end or front-end engineers. We don't know their specific needs, what the teams actually need, or what will make everything clear for them. So we have this cycle of constantly requesting features from them. We have a very open conversation and an open culture, so they can come to us any time with requests. And every time we release something new, we go back for a feedback loop so they can tell us how it was received, how it is being used, and whether it was worth continuing. So instead of just having a pure technical roadmap of things that we think are going to be best, we usually go and ask them what they need and how we can help them, and integrate that with our own roadmap.
José
I understand that you were also quite involved in a migration to OpenTelemetry. Can you walk us through that?
Iris
We actually still have not migrated fully from Prometheus to OpenTelemetry. The migration I was leading was tracing: from Jaeger and Zipkin to OpenTelemetry. Something good about OpenTelemetry, and I've been very vocal about this because I like to share my experience as an end user, is that it is backwards compatible with almost everything. In the case of tracing, we had several types of spans coming in every day: some Zipkin, some Jaeger, several formats. The point was that we could not continue with the previous architecture, because we were constantly dropping spans; it could not accept all of them. And then there was a matter of correlation, because some applications were sending things differently. It was a mess in general. So what we wanted to do was go to a tool that could unify them all, like the One Ring in Lord of the Rings, which was OpenTelemetry. The good thing was that it was backwards compatible with everything. We ran two stacks in parallel for a period of adjustment, sending data to both: the previous one, so the teams would not notice that something had changed, and OpenTelemetry. In the end, the teams were getting so much more data, and the data was better, because there are so many transformations you can do in OpenTelemetry, that we could safely remove Jaeger. The teams did not notice their experience getting worse; it got better, because they were getting more information faster. And of course, afterwards there was a different part, which is instrumentation. Because you can keep everything a mess and migrate it, but eventually you will have to standardise. So after we migrated the infrastructure and the tooling and everything was in better shape, we started advocating with teams to use OpenTelemetry instrumentation. Today we've reached a very good place. Tracing is actually being used to fix issues and find performance bottlenecks. Teams have moved from using only logging and metrics to adding tracing as well. So it's a great achievement.
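The dual-stack approach Iris describes, sending every span to both the legacy backend and the new OpenTelemetry pipeline during an adjustment window, can be sketched roughly as follows. This is a simplified plain-Python illustration, not Miro's actual setup: the backend names and the `export` interface are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """A minimal span: a name plus arbitrary attributes."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


class FanOutExporter:
    """Sends every incoming span to all configured backends.

    During a migration window, both the legacy backend (e.g. Jaeger)
    and the new OpenTelemetry pipeline receive identical data, so
    teams notice no gap. The legacy sink is removed once parity is
    confirmed, with no change needed on the client side.
    """

    def __init__(self) -> None:
        self._backends: Dict[str, List[Span]] = {}

    def add_backend(self, name: str) -> None:
        self._backends[name] = []

    def remove_backend(self, name: str) -> None:
        self._backends.pop(name, None)

    def export(self, span: Span) -> None:
        # Fan out: every backend gets a copy of every span.
        for sink in self._backends.values():
            sink.append(span)

    def received(self, name: str) -> List[Span]:
        return self._backends[name]


# Migration window: dual-write to the legacy and OTel pipelines.
exporter = FanOutExporter()
exporter.add_backend("legacy-jaeger")
exporter.add_backend("otel-collector")
exporter.export(Span("checkout", {"http.status_code": "200"}))

assert len(exporter.received("legacy-jaeger")) == 1
assert len(exporter.received("otel-collector")) == 1

# After parity is confirmed, drop the legacy sink; clients are untouched.
exporter.remove_backend("legacy-jaeger")
```

In the real migration this fan-out would live in a collector layer rather than application code, which is what lets the legacy backend disappear without teams noticing.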
José
Were there any other learnings from that migration, or things that were unexpected?
Iris
The learning, at least for me: I started this migration very early when I joined Miro. And I was always the technical person. Okay, this can be done like that, architecture, we have the designs, let's do it. So the biggest learning was how important it is to know the people you're building these features for. And that's where this product mindset actually came in for the whole team. We were making this big migration, but we had to justify it. You can't just say, okay, this is the coolest thing in the industry right now, OpenTelemetry, let's do it. You also have to justify why you need another tool, why you need this many engineering hours, and how it is going to benefit the end user. That's something I learned from this whole migration: you don't need to think only about how to build those architecture diagrams, which are pretty cool, but you also have to think of the stakeholders. You need to show a good case to upper management, especially for a big change like this, and you need to think of your end users: what you're offering them, how it can improve their experience, and how you follow up with them to improve it even more.
José
That's a very good learning. It reminds me of how sometimes with engineering, we go after a new technology because it's a great technology, and then it's like, okay, I want this technology, now I just need to find a use case for it. That's maybe not always the best approach. It's more useful to think about what are the things that we need to solve.
Iris
We see this a lot. If you speak to other observability engineers, you see a lot of use cases like this — they want to have this new tool and implement it, but they can't get the proper support for it. And they're always asking other engineers who have managed to get it through, how did you do it? And usually, you find the use case, you show the value that it provides. Once you show value, people's opinions change, even if they were completely against it. Like, okay, there is something here, maybe we should pursue it. That's how you start.
José
Being in the observability world, what do you see holding companies back from increasing their maturity in this area?
Iris
I think companies are not investing enough in observability. They don't see observability as a first-class citizen, but just as something that happens — okay, we have everything ready, now let's implement observability. But that's not the right mindset, because observability needs to be part of all the processes before you're actually building your product. And it's not too late after everything is done, of course, but you're not going to get the best results. I see it happen a lot. Instead of investing in a few individuals that could be SREs with observability experience, which can be gained over time, companies choose to buy some vendor tool, add an agent, and get all the data out there. At the beginning it works great because it's small and you don't have to put any work in. But as it grows, it gets messier and it's very difficult to reach a maturity level of good observability once everything is all over the place.
José
So if there's anyone out there listening who is trying to push for more observability in their company — what I heard before is to focus on the business case. Do you have any thoughts on where they should start looking for that business case?
Iris
There could be two ways. The first is to find some observability champions in the engineering teams. You look at who likes observability, approach them, and get some use cases of how good observability is going to make their life easier, help them build their product faster, and make everything more reliable. That's an excellent place to start, to set the roots of investing more in observability. The second would be focused on incidents. After all, companies are out there to make money, and if an incident is happening, most companies are losing money, especially if it's, for example, a retailer making millions of euros every ten minutes from people buying. So focus on those incidents, get the data, see how the lack of observability made the whole thing last longer, and how good observability would have made it shorter and the losses smaller. Those are two things I would advise, but of course it depends on the industry you're in.
José
That sounds like a very good starting point. Thank you. You mentioned the incident part, the reactive incident response and how observability can help with that. But I understand that at Miro you've gone from that to more of using observability for proactive performance improvement. Can you tell us about that?
Iris
I'm proud to say that at Miro we have pretty good maturity when it comes to observability. It's not perfect, of course. But we have the reactive part covered, alerting and dashboards, so if something happens, our engineers know where to go and what data to get. But observability has also become part of the day-to-day work. Every day we're seeing more teams wanting to invest more. And we see PRs, for example, and that makes me very happy as a tracing person, where a team that has been struggling with a performance issue for a very long time opens a PR saying that thanks to spans and tracing in general, they managed to find the bottleneck. It's happening a lot. Because we have the three pillars in good shape, the teams are getting more interested and they feel more involved in the observability world. They're actually using it not just to react and fix an issue, but to proactively find bottlenecks and issues to fix before something breaks. It's a great thing to see, and I highly recommend that everyone try to go in that direction.
José
You mentioned the three pillars. Can you tell us about them?
Iris
It's the three classical pillars: metrics, logs, and tracing. Before, we were relying a lot on metrics and logs, the classics. Logs are where you see everything that is happening. And now that we have tracing, it's an extra layer. Because of the whole OpenTelemetry work and making things more standard, our engineers are able to correlate between logs and traces, which makes their whole life easier. And of course, there are more pillars, for example profiling, where we're not there yet. But at least in those three, we're pretty mature.
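The log-trace correlation Iris mentions usually works by stamping every log line with the ID of the trace that produced it, so an engineer can jump from one error log straight to the full request. A toy illustration (field names and messages here are made up for the sketch):

```python
import uuid
from typing import Dict, List, Tuple


def new_trace_id() -> str:
    """Generate a random trace ID, analogous to a W3C trace ID."""
    return uuid.uuid4().hex


# Each log line carries the trace ID of the request that produced it.
request_trace = new_trace_id()
logs: List[Dict[str, str]] = [
    {"trace_id": request_trace, "level": "ERROR", "msg": "board save failed"},
    {"trace_id": new_trace_id(), "level": "INFO", "msg": "healthcheck ok"},
]

# Spans from the tracing backend are keyed by the same ID.
spans: List[Dict[str, str]] = [
    {"trace_id": request_trace, "name": "db.write", "status": "error"},
]


def correlate(
    trace_id: str, logs: List[Dict[str, str]], spans: List[Dict[str, str]]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Jump from a log line to every log and span of the same request."""
    return (
        [entry for entry in logs if entry["trace_id"] == trace_id],
        [span for span in spans if span["trace_id"] == trace_id],
    )


matched_logs, matched_spans = correlate(request_trace, logs, spans)
assert matched_logs[0]["msg"] == "board save failed"
assert matched_spans[0]["name"] == "db.write"
```

The standardisation step matters precisely here: correlation only works when every service propagates and logs the same ID format, which is what the OpenTelemetry instrumentation push buys you.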
José
I don't want to assume that everyone listening has deep knowledge in this area — and also for myself — can you give an example of how it would look if you only have logs, versus if you actually have tracing correctly set up? How does that look as an engineer trying to understand how something is working?
Iris
Let's imagine you have only logging available. You go to the platform, and there are 10,000 log events, error, error, error, and they all say the same thing. Yes, this application is throwing an error, but why? It's very difficult to pinpoint where it is. Meanwhile, with tracing (for somebody who doesn't know what it is), every step of the call is recorded from the beginning to the end. So if you go to the trace, you will see your whole call from beginning to end. You will see that some parts have higher latency: maybe the issue is going to be there. Even better, if an error was actually thrown along the way, you will see that it was thrown at this particular part of the whole journey. That points you directly to where the issue is, instead of you having to assume: okay, error, but where could it be? You have it all there and it's telling you exactly, this is your error, this is what you need to fix.
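The contrast Iris draws can be made concrete with a small sketch. This is plain Python with invented service and step names, purely for illustration: a flat log stream answers only "something failed", while a trace records each step's position and latency, so the failing or slow step can be read off directly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str              # which step of the call this is
    parent: Optional[str]  # the step that invoked it (None for the root)
    duration_ms: float
    error: bool = False


# Flat logging: thousands of identical lines, no position information.
logs = ["ERROR: request failed"] * 10_000

# Tracing: the same request recorded step by step, end to end.
trace: List[Span] = [
    Span("http GET /board", None, 480.0),
    Span("auth.check", "http GET /board", 12.0),
    Span("db.query boards", "http GET /board", 430.0, error=True),
    Span("render", "http GET /board", 25.0),
]


def find_error_spans(spans: List[Span]) -> List[str]:
    """Return the names of the exact steps where an error was thrown."""
    return [s.name for s in spans if s.error]


def slowest_span(spans: List[Span]) -> str:
    """Return the child step with the highest latency."""
    children = [s for s in spans if s.parent is not None]
    return max(children, key=lambda s: s.duration_ms).name


# The trace answers "where?" directly; the flat logs cannot.
assert find_error_spans(trace) == ["db.query boards"]
assert slowest_span(trace) == "db.query boards"
```

In a real system the spans come from instrumentation rather than being built by hand, but the payoff is the same: the question "where, along the whole journey, did this fail?" becomes a lookup instead of a two-hour reconstruction.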
José
I think, in my experience, if you only have logs and you're investigating an issue, it kind of feels like you still need to get to the tracing, except you have to do it yourself while looking at the logs. So you have an implicit tracing where you're like, okay, I think I know the application, so first there was this call, and then I can see there was that one. But what we're discussing is that you can have it all instrumented and set up in a way that you get the tracing. I don't want to say automatically, though I guess it is: if you do things right, you'll be able to see it straight away.
Iris
Right. Instead of taking you two hours to try to trace everything back — because you will find it eventually, especially if you're experienced and you've been working with a product — you'll find it eventually. But it could take you two hours or two seconds.
José
Getting back to the business case, if it's an incident that is bringing some of your production system down, then two hours or two seconds, or even two minutes versus two seconds, that is a big difference. And are you seeing that also applied on the performance testing part, that it helps incredibly when you're evaluating the performance of your application — that you can find the bottlenecks much easier when you're making changes and developing?
Iris
Absolutely. We've seen it in so many use cases. I mentioned the PRs because it's the one that came to mind first and it made me very proud. But it's being used for all kinds of use cases. It just makes your life generally easier.
José
Can you share with us how, with observability in general and tracing specifically, do you think that gives the teams freedom or enables them to push further? I guess it would at least give them some confidence to do so.
Iris
Absolutely. That's why I like it when I see the teams trying to pursue better observability. It gives them a lot of freedom. At Miro, and I've noticed this is the standard in the community, the teams are actually the owners of their telemetry data, their dashboards, their alerting. So they have the freedom to do everything they want. We are not the ones building a dashboard for them, or building an alert and then investigating when it fires. No, that is all the team. So it gives them the freedom to learn about their applications better and to protect their applications better, because who wants their product, their baby, to have an issue that could cause an outage? But it also pushes them: because they're getting to know their data so well, it pushes them to actually improve as well. Not just to protect their product, but to improve it and constantly make observability part of their decision-making. And actually, I wanted to mention here, since it's part of the conversation: now with LLMs and MCP servers, it's getting easier and easier to make observability part of your decision-making when creating a PR or making a change. You can consider the observability implications, because you already have some data on what a change could mean: is it a good idea or not? So it generally gives them a lot of freedom and helps them improve their product without needing to consult someone every time an issue happens or every time they want to improve.
José
Do you see any other impact from AI in this field? It could be positive or negative — there's probably a bit of both, as with many things in life. But what have you seen in practice changing?
Iris
I choose to see the positive sides. I'm a big fan of AI, and in our company we're using AI a lot. We're actually pushed to use it as a tool to help us. So I've seen very positive things. For us as engineers, it helps in our day-to-day work. It's not as practical as it is for a back-end or front-end engineer, because for us in infrastructure it mostly requires knowledge that the LLMs are still not there with. But when it comes to the perspective of an engineer using observability, it helps them a lot. I was actually in their shoes yesterday, because I wanted to troubleshoot some changes I had made and I had no idea where to start on the engineering side. I didn't want to reach out to someone for an experiment. So using an LLM, I actually managed to find all the data I needed without knowing where the application is. I feel like it's so positive that even someone who has just joined the company, or hasn't had an interest in knowing how observability works for their team, can start getting an idea of how everything works, what their necessary data is, and how to improve without spending hours and hours. It's very positive in general. The negative, I guess, is the costs. It is expensive. If you're relying only on agents and LLMs to get your observability knowledge, it could get very expensive for the company. But other than that, I find very positive things about AI.
José
Very good. Thank you. Before joining Miro, you worked at Farfetch as well. We'd love to hear your thoughts from an observability perspective. Is it pretty much the same, or are there big differences being in two different areas?
Iris
That's an interesting thing for me. When I moved from one company to the other, the transition was so easy because it was the same. I would say it's a bit easier on the Farfetch side, maybe — not easier, it's just that the way the product works, it's a website where you're buying things, so it is easier to collect all the data you need. At Miro, because it's a collaboration platform with so many users at the same time, it can get a bit more complicated. But in terms of tooling and observability culture and what you need to do to bring it to the engineers and improve their lives, it was the same. So it was a very easy transition.
José
From an observability perspective, I guess you touched on it a little bit — being able to more easily find the bottlenecks will help being more scalable. Is there any other way that you've seen observability and scalability playing together?
Iris
The first thing is, to have a system that is scalable, you need to have the metrics for it. That's the first part where observability is very important. But beyond that first part, to actually make the architecture decisions about scalability, not just in terms of HPA, VPA, or KEDA, I've seen people actually building their architectures with observability in mind, knowing that the data they have can make their applications more scalable in general. They're using all this knowledge and injecting it into their architectures and how they're building their applications. I've seen that a lot. And of course, it makes me very happy. As you can tell, I'm a big fan of observability and of making it the centre of the world. It's a very positive sign when I see teams actually taking all this data into consideration before they build.
José
Are there any downsides to observability? Is there anything that comes to mind where you'd think, we shouldn't be using observability here?
Iris
I wouldn't say that's the case. There are negative sides, but not in the sense that we shouldn't be using it. It should be used everywhere because you need to have visibility of what you're doing — not just build blindly. Especially if you're working on a production product, you can't just let it work and not know what's happening with it. The negative sides of observability — and it's funny because I'm going to talk about this at Observability Day happening soon in Amsterdam together with my co-speaker — it can be a burden. For example, you have so much data, so much freedom, you create 100 alerts. And while you're working, you have alerts firing left and right, false positives, and it becomes noise. Instead of helping you, it actually stops you from doing your job. There are 10,000 dashboards in your team's name, some of them created five years ago by somebody that isn't even in the company anymore. So that's the negative side of observability — you also have to follow some processes and keep everything clean, which happens with help from the observability team. But sometimes it can feel like, okay, I don't need this, this is not important — until an incident happens and you understand that actually it was important, but because there was so much noise before, it's like the boy who cried wolf. That's the only negative, but it's something that can be fixed and improved.
José
I would also think that cost comes into perspective. Are you often in discussions of, we should add this data point, but should we, because it will come with a high cost?
Iris
Observability cost is a big problem. The issue is that, first of all, observability cost is not really the cost of the observability team; it's a cost that is shared between many teams. And that's a common problem. When people see those sheets with all the costs, it's like, oh my god, observability is killing us. But it's not us killing you with the cost; it's shared. The conversations I have about cost are that generally reliability comes first and cost is second. If we see that something is important and needs to be observed, which is everything in my opinion, it will always get observed, even if it increases the cost. The best we can do is optimise the cost, and there are tools and processes to do that. But reliability and good visibility come first, before cost, always.
José
And I guess it ties back to the business case as well. Maybe adding this component would increase the cost, but it would also help us have a more performant application, help during incidents with getting faster to the root cause. So that's back to our business case.
Iris
Absolutely. As I said, it's like the centre of the universe, the tech universe. People just need to listen to me more and understand that.
José
And I guess there's also the case that you could use it to guide your architectural decisions and potentially reduce infrastructure costs in other areas because you had the observability and were able to get the overall infrastructure more tailored to your needs.
Iris
I can give you an example right now, which is also something we're going to be talking about at KubeCon in another talk. We started using eBPF observability recently, which is not used a lot by many companies but is hot in the market right now. We decided to use Grafana Beyla, which has been donated to OpenTelemetry and is fully open source. And we were able to see the networking costs of observability. They are very high, because usually you do not consider networking costs from cloud services. We noticed they were very high, and it helped us make some reductions that completely changed our observability networking bill. And it's also helping other teams, because they now know that this is expensive and could be optimised, and they're running their own initiatives to improve it. So it can really be used for good use cases like this.
José
Very good. Thank you for sharing. I think we're getting closer to the end. We always like to wrap up with a few rapid-fire questions. What advice would you give to someone who is either starting in this career or trying to push their company to be more mature within observability?
Iris
Get involved in the observability community, because the amount of learning possible there is amazing. You get to see so many professionals sharing their use cases, you get to learn about the tooling, and you can bring amazing knowledge back to your company that could make good business cases for the changes that you want to implement. The community is amazing in observability.
José
I think that links into my second question. Is there any resource, thought leader, blog, book, podcast, or a specific conference that you would recommend for someone within this space?
Iris
I would say everything that is cloud native. The conferences are mostly places to go to meet different professionals who are doing the same thing you're doing and to share experiences with them. There are some observability newsletters that come out every week; you get the best of everything there, what is new, what has changed. And if I were to recommend one thought leader I love following: Adriana Villela. She is one of the best-known figures in observability right now. Every time I need some inspiration from the community, I go to her blog posts or her LinkedIn. But generally, the cloud native space is where you should go. KubeCon is an amazing resource. Sometimes people can't go because it's a bit pricey, but there are also scholarships, and companies are paying for tickets for people to go because the learning and the connections are just so worth it.
José
Thank you. Really great examples to share. Final question: to you, scalability is?
Iris
I have a lot of things on my mind right now. It's one of the core principles in software engineering, I would say.
José
Thank you for sharing. We very much agree. Thank you so much for taking the time and for joining us here.
Iris
Thank you for the invitation. It was great being here.
José
And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed, consider subscribing and perhaps share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling@queue-it.com. This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo, and brought to you by Queue-it, your virtual waiting room partner. I'm your host, José Quaresma. Until next time, keep it smooth, keep it scalable.
[This transcript was auto-generated and may contain errors.]