Links:

https://twitter.com/NYCDubliner
https://docs.dev.getstanza.dev/

As a heads-up, the transcription below is AI-generated and somewhat rough. It may not align with what was actually said in the episode!

Transcription

00:00.00
James
Thanks for listening to the James from Montana podcast, a podcast in which I interview experts in the tech industry with the goal of slowly uploading the collective consciousness of tech into the cloud. For more information on today's guest, topic, or how to be a guest yourself, visit jamesfrommontana.com/podcast. I have with me Tiarnan de Burca, engineer at Stanza, a company doing neat things in the SRE space. Tiarnan is an itinerant dancer with a technology habit. He's worked on 4 continents for hyperscalers and nonprofits, from Google and Squarespace to the Special Olympics World Summer Games and Ebola response in West Africa. Tiarnan has brought enthusiasm for solving problems and a passion for solving those problems with computers. How are you, Tiarnan?

00:51.39
Tiarnan
I'm doing pretty good. I've never heard my bio read out. So that bio is from the website, and I struggled to achieve American levels of positivity when writing it. And so now, having an American read it to me, like, I got the voice right? I got that correct, like, energy. Yeah, I feel good. No, I think you did really well. I'm just glad I got my voice right. You know, that kind of, whenever we would do

01:06.23
James
Um, ah, did I nail the American accent, or?

01:17.75
Tiarnan
performance reviews at, like, big American companies, and you're running the Irish department, you have to give a special speech to people who are non-Americans, which is like: don't write in your performance review "I didn't do too badly, I suppose, on balance." Say "I am a legend," and work everything in terms of "I am a legend." And a neat hack is actually to get two people to write the performance review for the person beside them. Like, get people to write your friend's performance review and then sign it, because you're much more positive about your friend than you are about yourself. And you need to do that kind of

01:49.83
James
Um, that's amazing. Yeah.

01:51.88
Tiarnan
positive reframing to get to whatever the American baseline level of confidence is when you're not an American. Anyway, we're off the rails already. This is, this is fine.

01:57.48
James
This is great. We're off the rails already. Yeah, I mean, it brings up the fact that the podcast has only had US guests so far, and I feel like you have this unique opportunity, being of Irish residency, and you're in tech in Ireland, right? So

02:13.27
Tiarnan
Yeah, yeah.

02:15.23
James
I want to know a little bit more about what tech is like in Ireland. I mean, obviously there's a slight distinction even in just speaking.

02:22.43
Tiarnan
Sure. I mean, so Ireland has a big tech tradition. So early '80s you have DEC, you have Apple, and they both started building computers here really early on. And when I started, which would have been mid-'90s, we'd moved up the value chain as far as kind of tech support. So if you bought a Dell in the mid-'90s and you called somebody to find out why you couldn't get on the internet, the person answering the call was probably Irish. And then that moved up the value chain again and you got into a lot of infrastructure and operations stuff. We now obviously have a pretty vibrant startup scene and we have lots of folks doing lots of different work. But the rump, or a lot of the work that's being done still in technology in Ireland, is in the infrastructure space. So you've got very, very large teams from Google and Amazon and Microsoft and Twitter and everybody else who are the kind of, ah, global hyperscalers basically all have teams here. And so you have a very strong concentration of expertise from folks who've built data centers, who've built infrastructure, that are based in Dublin. And so, I mean, the company I work for is Stanza Systems. Our CEO is

03:34.77
Tiarnan
Niall Murphy, who was the co-author of the SRE book from Google and was, I think, in charge of SRE for Azure for a while. And so there's a lot of, like, very heavy hitters in that space. That's very common in Dublin. Like, there's a reason that the EMEA SREcon conference alternates between Dublin and somewhere else. It's basically its migratory pattern.

03:54.15
James
So I did want to go into your, you have a background in Google around 2005. They do have this somewhat famous, infamous document for post-mortem culture from the SRE space. I mean,

04:01.38
Tiarnan
Is it.

04:11.30
James
I've had it organically brought up to me many times. I feel like you were in that period of time in which it was written and.

04:19.48
Tiarnan
So, I mean, I think what you mean there is that there is this idea of a postmortem document, right? Or an after-action report that breaks down what's going on and introduces the blamelessness. So Google weren't the first people to approach a kind of blameless culture around operations, but it was my first experience with it. And so a key part, again and again, was that it's completely safe to be part of a response to an incident, right? Like, one of the great ideas was that this might be the most expensive piece of training that the company has ever done for you. Why would we fire you now? Urs was the global head at the time of infrastructure, and for a long time he owned the largest outage that Google had ever had: he deleted the Google.com domain. And so it was a very convenient thing when he was being interviewed to say, well, do you think I should have been fired for causing the biggest outage? Well, if not, then why would I fire somebody who's made a mistake? And so that kind of permeated a lot of how Google went about responding to incidents, both from an expectation of reliability, right? They really want to focus on highly reliable, highly available services. It's pretty easy, if you look at their business model, to understand why they cared so deeply about this. It turns out that if you open a web page and there's an ad on it, you might click on the ad.

05:45.15
Tiarnan
If there's no ad, you will never click on that ad. You might come back and look at that web page again and see another ad, but, like, they're a run-rate business: any ad they don't serve you is lost revenue. And so it becomes very easy for it to permeate the entire culture of the company around them. So the best way to get the highest, best reliability was to

05:53.59
James
Um, yeah.

06:02.94
Tiarnan
build great engineers and support them where they needed it. They invented their own stuff, and that led to a whole lot of other innovations, but the core of the original idea that you were talking about is templated incident response: talk about what you did afterwards, understand why you're making the changes you're making, and make sure that you are more ready for that class of incident again. It used to be that you used to say that this incident could never happen again. The kind of current thinking is that the same incident doesn't ever happen twice, and that you're better off thinking in terms of: what can we do to be more resilient to this class of failure? Like, what took us out isn't as important as, if this thing fails again, like we lose a data center or we lose a service, what does graceful failure look like? What does resilience look like? How quickly do we recover? Does that make some sense?

06:55.39
James
Yeah, absolutely. We're already deep into SRE topics, but I think for the listeners we should back out just a little bit. Can you tell me a little bit more about what an SRE is, and what site reliability engineering is about?

07:11.68
Tiarnan
Sure. So SRE, as you said, stands for site reliability engineering, and that is the approach to building reliable systems that Google initiated, and then the ideas from that propagated kind of across the network of companies that Google alumni went to. And that has mutated further, but the kind of core idea originally was: what if you took software engineering practices and applied them to infrastructure reliability? This is, I think, an incomplete telling of history. I think there were a lot of folks on the systems side who were already quite keen on programmatic approaches, and the DevOps community was coming up at the same time, but it's the meme that has really stuck, and it's allowed us to carve out a niche to be able to think about what's the best way to support folks who have to support your systems, right? So it's important that we build a culture where you're not burning somebody out, where the person who happens to catch the page when the site catches fire is not the person who's catching hell afterwards. That, um, being able to prioritize and respond to outages by getting changes in the way software works onto the to-do list of the product side and software engineers, or do the PRs themselves to change the behavior of the underlying software. It, in a

08:43.58
Tiarnan
real sense, gave you the opportunity to kind of even the playing field between the infrastructure and operations teams and the product teams, right? To be able to say this is a collaboration and a partnership in running software. At Google, kind of one of the

08:59.56
Tiarnan
foundational myths was that in theory you could hand the pager back, right? Somebody can run their own software if they're not able to collaborate with an SRE team. In practice that would happen very, very infrequently. Like, folks like having a team that are really, really good at this doing that work. And that leads to a point where you can rely on the incremental improvement in how software is run. And a great way of thinking about this is, for example, you should have a max of one incident per shift. If you're on duty for 8 hours, you can only respond with urgency a certain number of times. Consequently, you should try and ensure that whatever the level of service that you're planning to deliver with a piece of software and a network and all of the infrastructure can support that level of reliability. Because it turns out, if the pager is going off every 3 minutes, all you're doing is pressing the button that says don't page me anymore. You have no time for deep thought, that is: oh, I wonder what caused this particular interaction? What's this intermittent failure, right?

10:00.95
James
And.

10:09.39
Tiarnan
Maybe I'll go do a flame graph of its memory, or this is the interaction between this particular long-running query and that, right? So SRE is about this kind of discipline: the tools of software engineering, the discipline of scaling. I think about it as the industrialization of systems administration, right? Like, it's no longer this idea that the sysadmin is the guy who sits in the corner and makes sure the email server and the one server you have is up. It's the team of folks who move from a boiler in the basement to citywide infrastructure that heats the city, right? You have this kind of upscaling of capacity and capability. And that's what SRE is about, to me anyway.

10:48.26
James
Oh, that rule of one incident per shift, which I feel like is in deep contrast to at least the companies that I've worked for and the amount of, like, alarm bells that go off in an average shift.

11:05.50
Tiarnan
So there's a term, which is feeding blood into the machine, right? And feeding blood into the machine is this idea that, sure, the site is up, but the site is up because of a person, and the person who's keeping the site up won't work here in six months. And you can't really reasonably attempt to do the things that Google was trying to do, and that hopefully many companies with a sustainable long-term approach can do, by doing that, right? You can't expect someone to be able to live on that adrenaline edge for a very long time. And in particular, that person can't contribute to improving the system if all they're doing is rebooting things that are broken. They're not making the system any better. And it turns out that those folks who are at the coalface, who really understand all the interactions, who really understand what code you're running, are folks that are in a great position to understand how to make the system better. And so, obviously, we could probably tilt into, I moved into a management role before I was in my current role, and part of the value of management in SRE and infrastructure is to explain why to care about these people beyond a human sense, right? There's the "this is a human being and I should be nice to them," but in, like, a base utilitarian sense, these are the folks who really understand how your system works. If your users are going to have a great time, these folks need to be able to contribute to making your site or your service or your

12:32.20
Tiarnan
whatever you're constructing better. And if you set yourself goals like one page per shift, if you set yourself goals like don't be on call more than once a month, you can put people in a mental mindset to be able to do that. Another great rule of thumb that comes out of the Google system is this idea of toil, and toil is the work you do that doesn't make the system any better, right? If we're going to throw back to the mid-'90s: changing tapes. Changing tapes needs to be done or you don't have a backup, right? I think I've just dated myself as older than anybody listening to this podcast. But there's work that needs to be done that isn't going to make your service any better; it's just letting you kind of tread water. And then there's work which improves your service: maybe moving to

13:05.50
James
Yeah, ah yeah.

13:20.97
Tiarnan
a better load balancing algorithm, ensuring that everything you've built is in infrastructure as code, being able to war-game problems, being able to bring up sites closer to your customers, being able to improve your instrumentation. These are all things that make the day-to-day life of both your users and the software engineers who are shipping that software better off. Google's rule was no more than 50% of your day-to-day work should be toil. So 50% of your work can be keeping the thing on the road, but 50% of your work should be making it better. Again, there's, like, the benevolent "oh my god, Google was so forward thinking," and then there's the more utilitarian: Google was doubling in size every 20 minutes, and so if it wasn't getting better, it was getting worse. If you're in a growth environment and you're not making every single thing that you can put your finger on a little bit smoother and a little bit faster, you're getting worse, right? If the traffic gets 50% more and you haven't gotten any better, you're 50% more in a hole. Like, there's no kindness in that math. And so you can't run to a standstill in a growth environment. And so when Google was in the, like, quintessential peak of its growth, because there was just so much going on, being able to think forwardly around, okay, we don't want folks to burn out, because they're the people in the world who understand how we do these things, and we want them to be able to build the systems that will mean that the next system

14:44.75
Tiarnan
happens, I think is hugely important. Probably my favorite example of extreme growth at Google: I was on a team where one of the teams was the ads database team, and this was early on, and they were migrating to a new sharded database. It's all very clever, kind of classic sharding stuff. And the manager brought them into an office one morning and said: okay, this half of the team, you're staying on the sharded database. This half of the team, you're building the database we need after this sharded database, because we looked at the growth rate this morning, and the thing we thought bought us two years buys us six months. And so the system that you've killed yourself building for the last six months, you have to ship in the next ninety days with half of the staff. Team B, that gives you ninety days plus the run rate of the system you've just built to come up with the next system, and we recommend that be arbitrarily scalable, because we can't do this again. And that was the type of kind of comedy growth that Google was experiencing at the time.

15:43.10
James
That's fantastic.

15:47.70
James
Yeah, it's true hypergrowth, right? What else about early Google do you have to share? I mean, there's got to be a plethora of stories that you have. You know, you started back in 2005 or so.

15:49.74
Tiarnan
Yeah, oh.

16:01.77
Tiarnan
Yeah, I mean, I want to be careful about not kind of overstating my importance, right? I was a cog in a machine that was rapidly growing. I was part of the SRE team in Dublin, I was in Belo Horizonte in Brazil for six months, and I was in New York for a while. So I got to experience kind of hilarious periods of growth, answering a phone in Dublin and somebody asking, because one of the projects I was part of was, I think, called office in a box. And the premise of office in a box was that we didn't have time to think about how the infrastructure in Milan would be different from the infrastructure in São Paulo, and so the only thing we wanted to know about an office was how many people worked in sales and how many worked in engineering. And we had a little graph, and if you said you had 50 people in sales and 200 people in engineering, you got a medium; if you had, whatever, five hundred and three hundred, you got a large. And there was a team of people, some of which were based in the Dublin data center, who were building these offices in advance. And so you would build what was a blank office in a box, and then the phone call would come in saying we've opened the Tel Aviv office. How big is it going to be? It's going to be this big. Great, ship those three flight cases to Tel Aviv, plug in power, plug in data, press a button in Mountain View, and 4 hours later you had an office. And it was an amazing, amazing piece of automation.

17:23.24
James
Wow.

17:27.87
Tiarnan
Obviously there were speed bumps, and there were many kind of blood-on-the-floor type problems that happened around that. But I remember we had bought a company in Munich called Demark, I think it was, and I flew out on the Friday with a bunch of folks, and I was part of the team where we rolled, I think it was 3 racks in, plugged them in, and on Monday morning everybody's on the Google domain and they have Google creds and everything else, and you've done all the automation. And, you know, you give back laptop A and you take laptop B, and that's how it works. And there's some amazing kind of, that type of thing. These days, because the growth, it hasn't slowed as such, but, like, they were already as big as they need to be in a lot of cases, and they've migrated anything that was in the office, obviously, is now part of Google Apps for your domain. This is when everybody still had local mailboxes and things. And so it's less important. But they still have that type of ability to show up and onboard teams of people, but they now use it for acquisitions. So if you're acquired by Google, like, the morning you're acquired by Google your account works and you can print, and you've got a laptop that is signed into the domain and everything else. Like, they do amazing work in that space. I suppose the kind of projects that I worked on when I was there were SOX. I think lots of folks who are in the startup space kind of begin to have their compliance story right around the time they have a

18:47.94
Tiarnan
growth story. And they feel kind of bitter about it: why are you doing this to me? And as an infrastructure person, I've always really liked compliance, because compliance is like me saying "do your job," only it's the law, right? It's a beautiful thing.

18:48.36
James
So yeah, 100%

19:06.23
Tiarnan
And so when we were at Google they passed Sarbanes-Oxley, and the real challenge was that, like, nobody knew what it meant, and, like, the text was changing onto your router. But it was all this idea of separation of, like, who could see data and when you could see it, all the things that far as off to go. And trying to retrofit that to a system that was growing at the rate that it was growing was, again, kind of a hilarious set of challenges. And you ended up hiving off parts of SRE so that you didn't leave all of SRE with access to everything, because you had to be able to kind of meet these things, because there were just too many people. And the other thing I was part of was the response to the Chinese security attack, a long time ago now, where Google was attacked by a state actor and had to, functionally speaking, rebuild itself. I don't want to go into too much detail because I can't remember what bits of those details have already been committed. But there's been plenty of talks about it. But it was kind of pretty hilarious to watch as various folks inside the organization got a tap on the shoulder to disappear into a room.

20:00.94
James
Are.

20:16.54
Tiarnan
And the room's virtual. But you're like, that person's not doing any work on their project, and then 72 hours later you get a tap on the shoulder and you're like, oh, I'm not doing any work on my project either. Like, everything was just sucked into this maw to make sure that Google had been resecured. Aurora, right, is the name of the event that happened, and if anybody does a search for "Aurora China Google security incident" they will get all of the articles that exist in the world. But that was a project I ended up being part of that was exciting and interesting. But I can't remember what bits of it are public.

20:47.89
James
You got, like, the Men in Black flash when you left the project.

20:55.39
Tiarnan
I mean, there's lots of it out there, and people have done a great job of articulating the narrative, and I don't want to ruin it for people, basically, because some really great infrastructure work was done there and I know I don't a half as ten years later, you know, well, fifteen years later on those years

21:04.99
James
So let's back up even further. We chatted the other day and you mentioned that you had got your start going for a physics major. This isn't the first time that I've heard this on the podcast.

21:18.73
Tiarnan
Yeah, I mean, let's face it, I wasn't a physics person for very long. So I failed first-year physics, and at the time you were allowed to compensate a single, like, you were allowed to compensate.

21:24.96
James
So okay.

21:36.63
Tiarnan
If you failed an exam but you got really high in another exam, they're like, we'll let you into second year, it'll be okay. And I had gotten, I think, very close to a perfect score in the computing exam, and I had done incredibly poorly in the physics exams, and they were like, okay, we'll let you into second year.

21:40.28
James
Yeah, yeah.

21:54.66
Tiarnan
Don't ever do that again. In second year I completely failed to do anything at all. Yeah, I used the university purely as a way of having access to computing infrastructure, and I went off and worked in Intel, initially in their deskside support, and then server support and fab support and production line support and stuff like that. That's part of how I grew. But I think one of the things that we didn't realize at the time, or I didn't realize: so I started school in '95 in physics in Dublin City University and thought all of the grown-ups were idiots, because how could you not know any more than we did about the internet? And what we didn't realize was that the commercial internet had hit, like, that year. Like, '95 is the first year of the commercial internet; you were able to pay for an internet connection. And so it was this really early, only three or four years before had anybody had an opportunity to do anything on this, and all these people were grown-ups with real jobs, and we weren't. We were students who were in the process of failing out of a physics degree. And so a bunch of us learned how to do all types of cool stuff. One of the interesting things that came out of Dublin universities, or Irish universities, at the time was this idea of a networking society. I was in Redbrick, which is still there, which is a great society that's been there for going on 25 years, and it

23:13.26
Tiarnan
gave us email addresses and chat and everything else, very similar to the original concept for Facebook without the surveillance state aspects. And what it was very interesting for was there was a bunch of folks who had gotten interested in, and were given, kind of

23:31.26
Tiarnan
shell access on Unix machines through this. And the university at the time was giving you an email address which was your student number, assigned at random, at the host name of the university. So my email address when I started was 9 5 4 o 5 4 7 oashtlkadot dcu I e, and so the bit where we would give you any string at redbrick.dcu.ie meant that we were the much better mail infrastructure. And so you ended up with these 19-, 20-year-olds who were running email and chat for thousands of people. They were the largest university society. And this was an incredible training ground for infrastructure and reliability folks as they then went into industry. And you have folks who have gone on to be in Google and Amazon and Reddit, in Twitch, in Twitter, in Facebook. Like, on my first day when I started at Google there were 6 alums from DCU, who I'd known through the networking society, already on the systems team. To kind of walk in on the first day and have 6 people say hello to you and then turn to the recruiter and ask the recruiter, "is there anybody I should be introducing you to?" is a very strange way to start your first day in the office. Like, that was part of it. So I'm afraid physics provided me with almost no background except

24:56.69
Tiarnan
very useful accounts on a set of computers. And the idea was at the time, because when I was starting it was pre-internet, was that I wanted to work out how you use computers. But if you did an IT degree or a computing degree, there was a reasonable chance you were going to go do computing for a company, and the company was going to have a few hundred staff, and that's the biggest computers you were ever going to need. And so the universities and physics departments had reasons to have really big iron, and that was one of the reasons I went into physics. But eighteen months later, two years later, you're at a point where Hotmail exists and Gmail exists and search exists, in Google and AltaVista, and all of a sudden you're providing services to the entire world. That's bigger than a university or all of the universities in a country. And so you have this real disconnect, that now you can be into computers for computers' sake and still have a great reason to really understand infrastructure at scale. That would not have been true, like, maybe in the US you've got a notion and enough national laboratories and that, that you'd be able to get into it from a purely operations perspective. But for the rest of the world, not so much. Particularly for a rock off the edge of Europe.

26:07.40
James
So there's no direct line connection between physics and software engineering. It was more just the fact that you got into using larger-scale computing hardware.

26:20.90
Tiarnan
I mean, yeah, for me it was originally a case that physics was going to be a pathway, because you had to do modeling and stuff, and so that way I would have to learn how to use computers to do a thing rather than just learning how computers worked. And as it turned out, 17-year-old me was completely wrong, and the thing I was interested in was how computers worked. And it was useful, mostly, because the account you get given on the first day in the university gives you access to more computing power than you've ever had in your life, because I didn't grow up with a PC in my house or any of those things, right?

26:53.53
James
Yeah, so fast forwarding in time past your time at Google, I'm guessing, there's a note, as mentioned in your bio, about the Ebola response in West Africa. Did you actually go to West Africa for a while?

27:08.91
Tiarnan
I was in Sierra Leone for a month during the crisis. Which was a lot, was my summary. And

27:18.88
Tiarnan
so just to give you some background, what we did was a thing that up until two years ago I had to explain, but I don't have to do that anymore. We did a thing called contact tracing. It turns out that everybody knows what contact tracing is. So Sierra Leone, which was, I think, the second country to be hit, I think it was originally Nigeria was the beginning, was one of 4 or 5 countries that were hit in West Africa by Ebola in, I think, 2015. I moved to Berlin to be part of a great team run by a group called eHealth Africa. And we proceeded to deploy, using AWS and offline-first web apps, a bunch of contact tracing software across these places, and the idea was that

28:08.29
Tiarnan
if there was a chance you had been in contact with somebody who had Ebola: go home, we'll send you food, stay in your house until we're sure you don't have Ebola. And the diagnostics for Ebola are really straightforward: if you're still alive a week later, you didn't have Ebola. But

28:24.99
Tiarnan
not that complicated. It's horrible, but it's not complicated. And so it was a very easy job to get up for in the morning, but it was very stressful. Like, the alarm goes off and you're like, got to go to work, people are actually dying. It's really easy to go to the office. It's never been easier. But you're in a position where you have challenges that you've just never thought of. Like, it was my first job after Google, and I got there and we were running one of the country's infrastructure off a single AWS EC2 instance, backed up, just used CouchDB, backed up to another host. Very robust, very reliable, just worked, very good. And I was horrified, right? At Google you were used to being able to lose, like, you can lose half of any of your data centers and

29:06.78
James
I.

29:15.50
Tiarnan
another data center and still continue without anybody noticing, was the bar you had been set, right? So you were used to deploying, like, sixteen times the hardware, and it was bulletproof, and that was the mindset that they had set. And I was having a moment on, like, the first day or second day, I think it was, and the guy I was working with turned to me and said: look, that EC2 instance is the most reliable thing in our stack. In order for someone to do a report, they have to have gone out to visit somebody who might have Ebola, with a mobile phone that was charged, get there, usually on a bike or some local transport, get back, have the internet locally be working, have the power locally be working, have had enough power to have charged their phone, and then successfully backhaul it out of the country, at which point it goes on the EC2 instance. If it's on the EC2 instance, like, it's locked in. It's the most reliable thing we've ever seen, right? And that was such a mindset shift. It was incredibly positive and interesting and good to do, but it was an insane mindset shift. So, like, one of the first things I did, this was 2015, I installed Nagios on a set of hosts, right? And I can actually see James googling the phrase Nagios, because it's not something anybody in the last fifteen years has needed to think about. And Nagios was

30:31.15
James
So you're right I did.

30:34.92
Tiarnan
an on-premises alerting system, right? Pre your Datadogs and pre-Honeycomb and pre all these things. You want to monitor hosts and have the hosts tell you how much memory they have and so on. And because we couldn't reliably get internet connectivity, and the internet connectivity we could get was not very reliable from remotest Sierra Leone to Freetown, and even less reliable from Freetown out of the country, the best thing we could do initially was go back ten or fifteen years and deploy old ideas, so that we could understand which parts of our network were up at any given time, and which of our hosts were up at any given time. Because it turns out power was more reliable in the places we were deployed than in the rest of the country, because we had generators and we had good money and we were able to do those things. But they still weren't perfect. And the internet was much less reliable. My favorite story from this time is that we had one office that would lose connectivity every Sunday and would get it back every Monday. And I could not work out what was going on. I had centralized logging, and I kind of remoted into this site, and the local router is up and the hosts are up. So it's not a case of the last person out on Saturday turns it off and the first person in on Monday turns it on.

32:01.55
Tiarnan
I've interviewed everybody in the building; nobody's touching it. I can't work it out, and the uptime says that it's working. I just lose connectivity to it for like an 18 to 22 hour period. And eventually, by interviewing folks and talking to people on the ground, I work out that this office is in the most remote part of Sierra Leone that we help, and it's not that our office is losing power. It's that our network connection, our leased line, which was actually a microwave link, was going via the cellular links. One of the cellular towers that we relied on was so remote

32:33.60
Tiarnan
that it wasn't worth paying someone to go out on a Sunday to put gas in the genny, because there's so few people with any money in that bit of the country that connecting phone calls in that bit of the country isn't worth doing on a Sunday. And so every Sunday, whatever number of hours after they put gas in the genny, in the generator, on Saturday, it would die, and on Monday morning, when Bob, the guy who puts gas in the genny, showed up and put gas in the genny, you'd get internet again. In between, one of your intermediate steps just wouldn't have any power. It's a funny story. But if you try debugging that from a chair in Berlin, you're like: Agatha Christie has nothing on me. It's hilarious. There was another insane...

33:14.32
James
Um, yeah, no joke. I was gonna say, I had the opportunity to go to Kenya, and if people think that rolling power outages are bad in, like, California or wherever we have them in the US, in the rest of the world it's fairly bad. In Africa, we're talking the power may go out for like 10 hours and you may get it back, or it might be another 24 hours, and there's no indication when it will come back.

33:48.83
Tiarnan
Sure. I mean, to speak out here: I mentioned this to you before, I went back and did a master's degree in international relations. There is a big difference between the colonized country losing power for 10 hours, because everything of value has been taken from that country, and the colonizer country losing power for X number of hours, right? But California should have power; somebody write that down. But the other story I was going to tell is about being able to test on the ground and how it's different. We were using a piece of software called PouchDB, and PouchDB is very similar to CouchDB, Couchbase, only it's a local implementation. So it allows you to do an offline-first web app that talks to a local instance of a JSON database and then periodically syncs those documents up into the cloud. We were using that for a vaccine trial that we were part of in 2015, and we tested it. We tested it in Berlin and it worked fabulously well. And we were clever: we used the Chrome connection tools to drop the perceived data speed, so it would be like 20K or 10K, or 5K a second or something, rather than the gig or whatever we have in the office. It was a little slower but it worked. We were like, okay, that's fine. And then we sent out some people with that app to Freetown and we almost caused an actual riot.

35:17.56
Tiarnan
So it turns out there was a bug in that particular part of PouchDB where, instead of syncing only the changes it had not already synced, it synced everything it had seen so far. And this has this fabulous pathological case where the first one works, the second one works, the third one works, and it only stops working when you exceed the bandwidth available. And so in a vaccination center in Sierra Leone, in a pretty remote place, they've got like two phones, twenty-five, thirty people on each, and all of a sudden vaccine registration stops working. And to get across the premise here: we were paying people to vaccinate them, because it turns out if you are trying to vaccinate folks in the developing world against diseases that are prevalent in the developing world, you need to pay them for their time, right? This is not "we can show up at a factory" or "I'll get permission from my boss to be part of the vaccine trial and it won't affect me anyway". People have been traveling for half a day, a day, to get there. It's a huge undertaking on their behalf. You have to pay them for it, and our payment infrastructure is tied to this bug.
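
The failure mode described here can be illustrated with a minimal sketch. It is not PouchDB's replication code; `LocalStore`, `sync_everything` and `sync_incremental` are hypothetical names, and the "checkpoint" is simply an assumed count of documents the server has already acknowledged. The point is only that a resend-everything sync grows with every registration, while a checkpointed sync stays flat.

```python
from dataclasses import dataclass, field


@dataclass
class LocalStore:
    """Hypothetical on-device document log; each write gets a sequence number."""
    docs: list = field(default_factory=list)

    def put(self, doc: dict) -> int:
        self.docs.append(doc)
        return len(self.docs)  # sequence number of this write

    def changes_since(self, seq: int) -> list:
        return self.docs[seq:]


def sync_everything(store: LocalStore) -> list:
    """The pathological behaviour: the payload grows with every document ever seen."""
    return store.changes_since(0)


def sync_incremental(store: LocalStore, checkpoint: int) -> tuple:
    """Send only documents written after the last acknowledged checkpoint."""
    batch = store.changes_since(checkpoint)
    return batch, checkpoint + len(batch)


store = LocalStore()
checkpoint = 0
full_sizes, incremental_sizes = [], []
for n in range(1, 6):
    store.put({"registration": n})
    full_sizes.append(len(sync_everything(store)))
    batch, checkpoint = sync_incremental(store, checkpoint)
    incremental_sizes.append(len(batch))

print(full_sizes)         # [1, 2, 3, 4, 5]
print(incremental_sizes)  # [1, 1, 1, 1, 1]
```

On a fast office link both versions look fine for the first few documents, which is consistent with the bug only surfacing once the payload exceeded the bandwidth available in the field.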

36:21.52
James
Oh no.

36:22.48
Tiarnan
And no matter what we do, we can't replicate it in Europe. So eventually, with lots of logs and various things, we work out what the problem is. And of course the thing that is the problem, that the bandwidth isn't enough to get the app, is identical to the problem that we can't ship them a fix, because we can't update the website and then get them to get a new piece of software. So we literally have to send somebody from headquarters with two phones with the fresh fork of the code out across Freetown to get there, while we're on the phone in Berlin with increasingly panicked people going: we're not giving you the money because we can't register this phone. Because all the money was being done electronically. It was all being done as whatever their local e-cash was. And that is probably

37:09.74
James
Yeah.

37:14.74
Tiarnan
one of the scarier software bugs I've ever been part of, because you've got folks who are reliant on your software and some angry locals who are not happy about what you've just done to them. And from your perspective outside, you're like: you should be angry, that's a totally legitimate thing to be angry about, and I have no idea how to help you. Everything was fine. We got them the replacement code. It was an amazing case.

37:29.50
James
Are.

37:33.41
Tiarnan
It was one of the first times I worked with a great product manager who really understood his product, and so being able to understand exactly what was going wrong was easier from an infrastructure perspective because I was working with him. Fantastic moment of growth afterwards; in the moment it scared the hell out of me.

37:51.97
James
Man, pay bugs are no joke. I come from a background in payroll software, and one thing I've learned over the years is that the worst software bug, the worst problem you can introduce, the worst thing you can do, is to mess with people's pay.

38:08.40
Tiarnan
And so, as someone who has an accent in his first name, a space in his second name and an apostrophe in his address, I am personally a stress test of any payroll system I'm added to.

38:21.14
James
Ah, that's so good. Yeah, so back a few years ago I actually worked on a project called HospitalRun, and it had a very similar offline-first approach for basically on-call physicians in third world countries. They would go out with tablets, they would collect a bunch of information, they would do an exam, etc., and then when they got back to the office they would upload back to their central HospitalRun database.

38:41.73
Tiarnan
So interesting.

38:57.96
James
It's really interesting to hear about that technology years later. I'm sure maybe it just happened organically, but it sounds extremely similar to what you were deploying earlier.

39:05.61
Tiarnan
I mean, rural middle America at the time would have had very similar challenges to rural Sierra Leone, right? If you have no backhaul data, then that's the... And I want to give a shout out to Hoodie, who were the crowd, the Neighbourhoodie group, who were the developers that we were working with at eHealth. Lena Reinhardt I think was the CEO at the time, and Jan, @janl on Twitter, was I think CTO, and they did some really great work bootstrapping that team. It was an amazing crowd to work with.

39:38.17
James
So tell me a little bit about what you're doing at Stanza.

39:44.30
Tiarnan
Oh, thanks for asking. So Stanza is... there's a chance that I'm the junior person on the Stanza team. We've got a lot of folks who've been working on large-scale infrastructure for hyperscalers and for very large companies for a very long time. What we want to do is to make it easier to build reliable systems, reliable services, and the reliability should be seen at the point of consumption. So it's not as much that we are chasing nines for any given service, but that the user experience is reactive. In the way that we currently have reactive services that react well to moving to tablet or moving to mobile phone, this is reactive to individual services not being reliable. So what we do is we facilitate you adding circuit breaking and back pressure at the application tier rather than the network tier. At this point I'm used to being able to fall back on a demo that's very visual, so on a podcast this will be a bit of an exciting challenge for me. But if you think about going to, say, Netflix, and there's a recommendation engine that says

40:50.12
Tiarnan
"number one in your hometown" or "number one for Tiarnán", and then you have a search and you have a playback, and then you have the ability to spend money, to register to pay for something. In Netflix's internal accounting, being able to pay them, being able to play a video and being able to search for a video are all more important to them in a moment than making sure that the recommendations for you are top shelf right now. And so if they have some type of computing budget constraint or an API constraint or a services constraint, and they don't want to run as much stuff, they'll be able to prioritize A over B or over C. And so you can do things like: well, actually, let's use the cached version for the country that person's in, or the city that person's in, rather than theirs. An example that we have is: let's imagine you've got exactly the same setup, where you've got

41:39.39
Tiarnan
a recommendations carousel, a search and a credit card checkout, and they're all backed off the Stripe API, right? So Stripe has control of what you've got in stock and all the SKUs you've got in stock, and also all of your credit card transactions and all of your search. Again, the individual recommendations are not as important as being able to spend money, which is more important than being able to search. And what you can do is choose to have those individual apps either gray out or perform a different configuration during stress. So rather than hit the back end every time, use a cached result; rather than hit the back end in this moment, say, render a spinner and try again in a minute. What you want to be able to do is provide a graceful degradation, so the user is always able to do the thing that you, as a designer of the software, think is the most important thing for that software. So we've got an SDK. We're currently searching for partners. If you go to stanza.systems, if you're interested in what we're talking about, we've got full documentation and a bunch of breakdowns there, including a video demo that probably makes more sense than me explaining this on the fly on a podcast. Appreciate it.
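
The prioritization described here can be sketched roughly as follows. This is not the Stanza SDK; `Priority`, `plan_request` and the load thresholds are all hypothetical, chosen only to show checkout staying live while search and recommendations degrade first.

```python
from enum import IntEnum


class Priority(IntEnum):
    CHECKOUT = 0         # most important: taking payment
    SEARCH = 1
    RECOMMENDATIONS = 2  # first to degrade


def plan_request(feature: Priority, load: float) -> str:
    """Decide how a feature should behave at a given backend load (0.0 to 1.0).

    The thresholds are illustrative assumptions, not values from any real system.
    """
    # Lower-priority features give up live backend calls at lower load.
    live_limit = {
        Priority.CHECKOUT: 1.0,
        Priority.SEARCH: 0.9,
        Priority.RECOMMENDATIONS: 0.7,
    }
    if load <= live_limit[feature]:
        return "live"
    if feature is Priority.RECOMMENDATIONS:
        return "cached"   # e.g. serve the per-country list instead of a personal one
    return "retry_later"  # e.g. render a spinner and try again in a minute


for load in (0.5, 0.8, 0.95):
    print(load, {f.name: plan_request(f, load) for f in Priority})
```

The design choice worth noting is that the degraded behaviour is chosen per feature, so the most important action stays available for as long as the backend can serve anything at all.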

42:52.48
James
I completely got it. So I feel like you're sort of on the bleeding edge of SRE, in that we're going to more like graceful degradation instead of, you know, seeing that things are broken and making patches. So looking ahead, what do you envision the future of SRE is like? Are there any trends that you're seeing, any predictions you can make about

43:28.58
James
How that landscape is changing in the coming years.

43:31.84
Tiarnan
So what I find really interesting is that serverless kind of showed up and is really exciting, and then folks get big enough on it and then they seem to desperately want to migrate off, right? Because it absolutely is enough to build you an amazing proof of concept, and then it begins to hit some limits. So I'm interested to see where... and that's amazing, right? I don't want to come across as saying something negative. The ability to get to that scale before you need to hire your first SRE, your first person who really cares about infrastructure, is huge. I think that leads people to a sudden desperate need to understand how their service works much later than it happened previously, right? And that's always been true, right? Folks these days don't need to understand Linux internals, and that's a good thing. Instead of needing to understand how a switch works, or how Linux internals work, or how the file system's operating, we're now at a point where: oh, my deployments take longer than 30 seconds, and so my in-flight HTTP requests show 500 spikes, so I need to understand how to do graceful rolling restarts or any of those capabilities. What I find really interesting is that that stuff is only going to get more mature, and so the point at which you need to do really serious infrastructure work, I think, migrates out further.
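
The "in-flight requests show 500 spikes during a deploy" problem is usually addressed by draining: stop accepting new requests, let the ones already running finish, and only then restart. A toy sketch of that idea, assuming a hypothetical `DrainingServer` tracker rather than any particular web framework or orchestrator:

```python
import threading


class DrainingServer:
    """Toy request tracker: refuses new work while draining, waits for in-flight work."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Register a new request; returns False once draining has started."""
        with self._cond:
            if self._draining:
                return False  # a load balancer would route this request elsewhere
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one in-flight request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop accepting requests; True if all in-flight work finished in time."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


server = DrainingServer()
accepted = server.try_begin()                # one request is in flight
finisher = threading.Timer(0.05, server.end) # it completes shortly afterwards
finisher.start()
drained = server.drain(timeout=2.0)          # the restart waits for it
finisher.join()
rejected_after_drain = not server.try_begin()
print(accepted, drained, rejected_after_drain)
```

In a rolling restart the same sequence would run on one instance at a time, so the remaining instances keep serving traffic while each one drains.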

44:53.89
Tiarnan
And there's a really interesting one: one of the lessons that was taken from the Accelerate book and the DORA metrics that came out a couple of years ago was that you have this great intersection, where shipping code quickly

45:10.40
Tiarnan
makes developers happy and makes your site more reliable. And so, as metrics go, they have become the real watchword for folks who are thinking about reliability in a serious way, at kind of all the SREcon conferences and everything else. And so you want to optimize your teams to be able to make the smallest meaningful change as fast as possible, and then to be able to undo that change at least as quickly, because that means that your developers can operate with confidence. They can operate at speed. They can turn an idea into a vision into an execution really quickly. But in the event that they make a mistake, the rollback is much less painful for your users than it would otherwise be. Obviously, for folks who've graduated beyond that stage, things like Stanza I think are hugely important: being able to do graceful degradation of your services, being able to do things like prioritize one request over another, being able to delegate the authority (this is something that Stanza does that I haven't really seen in too many other places), delegate the authority to the person writing the feature to decide what happens when the service that that feature depends on isn't present, right? So either you're dependent on information from a SaaS or you're dependent on an internal database.

46:34.33
Tiarnan
If we can give you a signal, which is "you don't get to use that data source today", you get to make a decision, like: do I not render the pane at all? Do I render cached data? Do I render data that's geographically localized but not localized to the user, right? That's always going to be a decision that the person who wrote the feature is able to make better than me, as the person running the internal service. From my perspective, I serve a 429, which is like "go away for now", and, I mean, it saves my life and my service is still up. That's good. But I want to be able to do the next level up from that, which is "in the event of this happening, please do these other things". That defense in depth, and defense with intelligence, I think is something that we're really going to get more maturity in over time.
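
The decision being delegated to the feature author can be sketched as a fallback chain. This is an illustration only, not Stanza's API: `Throttled` stands in for an HTTP 429 ("Too Many Requests") response, and `render_pane` with its ordering (live, then cached, then regional, then hidden) is a hypothetical policy.

```python
from typing import Callable, Optional


class Throttled(Exception):
    """Stands in for an HTTP 429 ("Too Many Requests") from a backend."""


def render_pane(
    fetch_live: Callable[[], list],
    cached: Optional[list],
    regional: Optional[list],
) -> dict:
    """Feature-owner policy: live data, then the user's cached data,
    then regional data, otherwise do not render the pane at all."""
    try:
        return {"source": "live", "items": fetch_live()}
    except Throttled:
        if cached is not None:
            return {"source": "cache", "items": cached}
        if regional is not None:
            return {"source": "regional", "items": regional}
        return {"source": "hidden", "items": []}


def throttled_backend() -> list:
    """A backend that is currently telling callers to go away for now."""
    raise Throttled()


print(render_pane(lambda: ["a"], None, None)["source"])                # live
print(render_pane(throttled_backend, ["old"], ["top-10"])["source"])   # cache
print(render_pane(throttled_backend, None, ["top-10"])["source"])      # regional
print(render_pane(throttled_backend, None, None)["source"])            # hidden
```

The service owner only has to emit the "go away for now" signal; what the user sees in that case stays with whoever wrote the feature.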

47:20.84
James
And I personally think it's really cool. I mean, I've worked on many systems where I've had to do many, like, try/catch retries and fallback logic. Anything to make that process easier, more bulletproof, is definitely welcome. The bigger end impressivees that be.

47:37.58
Tiarnan
I enthusiastically point anybody with a similar experience to James, and in fact James himself, to our documentation, because being able to construct your features and prioritize your features and make the policy decisions based on which feature is more reliable than the other is, I think, a really transformative thing, particularly before I need to have a large-scale distributed system.

48:03.16
James
Tiarnán, so you've been in the SRE world for a long time. How do people get into SRE if they're on the job market right now, if they're thinking about a career change, they like the aspects of SRE and

48:17.73
Tiarnan
Sure. So an SRE is not a wizard. They're somebody who cares about how this stuff works, right? And so in any given team of people, there's the person who's willing to say: the build is too slow, the deployment isn't quick, we can't roll back, we don't have a good story around being able to do data migrations. Being that person on an existing team is amazing prep for "I'm going to want to do this longer term", and it will make everybody on your team more productive now. There is a thing to be careful of there, which is you don't want to become someone who is the only person who has that knowledge. Which leads to the other thing that is very good to do as an SRE, which is document, and ensure that anything that is of any real importance doesn't just live between your ears. So for me, I think it's all about understanding the tools: being able to not just be a customer of a thing, but to understand what the ramifications of using a tool are. And if someone's at home and they're a software engineer, there's plenty of amazing speakers and writers. SREcon has always been open access. USENIX is a fabulous organization that has been going for years, and so all of the SREcon talks are up there and out there. "SRE as she was spoke", by James somebody, the opening keynote from last year's

49:40.99
Tiarnan
SREcon EMEA in Amsterdam, was a great overview of the history of how thinking about it in this programmatic way has grown up over time, and I think would be a great framing for someone to think about these things. So it's all about being willing to get your hands dirty, being willing to think from a systemic perspective about how to make how we operate computing better. It's not about making a single loop run more effectively. It's about us being able to schedule that and understand why it will or would not work, or being resilient to a schedule not working, and being resilient to somebody having to take time off and not being able to answer a page, right? We have to understand the interaction between computers and the people who run them, and I think that's what you do. Yeah, there are great communities out there. There are lots and lots of local meetups in most major cities. Spoiler: almost anybody speaking at a conference is probably recruiting. If you wander around and say "I'm interested in being part of your reliable infrastructure", amazing things can happen. Those are the types of things I would say. Obviously, the more you know about this stuff, the easier it will be to get hired, but these days a lot of the interviews between a traditional SWE and an infrastructure SWE are very, very similar, right? You're being asked very similar questions. You'll have a greater focus on

51:05.95
Tiarnan
networking and a greater focus on things like Kubernetes in the infrastructure one. But if you've done the reading around whatever those words were in the job spec, and you're honest about what your expectations are and what your willingness to grow is, I think you can do a good interview in a lot of these cases.

51:24.32
James
That's awesome. I think that's going to be extremely helpful for anybody interested in that world. So as we wrap up, Tiarnán... oh, go ahead.

51:29.79
Tiarnan
If anybody wants to talk to me, I'm @NYCDubliner on Twitter, and my DMs are open. So if I can be more specifically useful, if anybody's looking to join SRE or get involved in those types of things, give me a shout. SREcon US currently has an open call for papers. If you want to submit, it's open; give me a shout, I'll help you put things in the right shape.

51:55.84
James
That's awesome. All right, give Tiarnán a shout if you're interested in the space. I'll have his links to socials in the podcast description. So Tiarnán, as we wrap up: if you could give your younger self one piece of advice, if you were starting your journey into the world of engineering, what would it be?

52:19.50
Tiarnan
Um, so "computers are just computers" is an incredibly pers... I think, today, but what we're doing is very similar to what we were doing thirty years ago, and so be less worried about, um,

52:36.17
Tiarnan
holding on to knowledge. So if you've become an expert in a thing and you have an opportunity to go work in another thing, do the work in the other thing. Don't worry about not being an expert anymore. If you've an opportunity to go be a manager and you're excited by that prospect, go be a manager. The same enthusiasm and drive that got you to become an expert in the technology the first time will serve you well again two years later, right? I mean, even when it's gone from being Samba to REST to gRPC, or whatever the evolution is in your particular part of the stack, you'll show up 10 years later and it'll be: oh, that's exactly the same, but we've changed all the terms and it's 30% faster.

53:12.79
James
And.

53:14.78
Tiarnan
Great, and then you'll go back to work, right? And so I think there's a big fear about not having your finger on the pulse that's not worth having in a lot of cases. And the other thing I would say to a younger me is that I was really, really afraid of

53:31.63
Tiarnan
programming, and that programming was a magic trick. I spent most of my time being completely comfortable with the idea that I could learn all about operating systems and all about networking and all about all these other arcane things that people were afraid of, and for some reason I thought writing code was clearly the remit of special wizards. And I could have done with learning that. That lasted about 15 years. So thank you very much.

53:52.26
James
Sound advice. All right, one last question for you, Tiarnán, and I hope you're ready. It's an Irish-themed question and I hope I don't offend you with it. Let me know if I do. If you owned a castle and subsequently, and unfortunately, died, what gift would you give to people for kissing a stone on your castle, and where would you hide this magical stone?

54:19.78
Tiarnan
I see, I see. So, subtitles: Tiarnán is trying to come up with a polite response. So, my heritage: de Burca. The de Burca castle was Cashel. The Rock of Cashel is beautiful. That bit of the family is not my bit of the family. My bit of the family is terraces, and very urban, and people working in factories in the city is who my people are. So I think I would bequeath infinite broadband and a big button to press at the beginning, because I get very confused whenever I get outside the ring road of the city. I'm afraid I'm not as stereotypically rural as would be useful to answer this question.

55:05.83
James
Yeah, anything we haven't touched on anything you want to mention shout out.

55:08.55
Tiarnan
I think I've probably annoyed more people than you have, so we should get out of this call pretty soon. Thanks for your time, James.

55:20.84
Tiarnan
Um, no, I think it's traditional to only remember them after I've hung up.

55:25.64
James
All right, thank you Tiarnán for joining me, and thanks so much for listening to the James from Montana podcast. If you want to support this production and see more content like this, visit jamesfrommontana.com and consider signing up. Thanks again!

Podcast Episode 5: Site Reliability Engineering ~ Tiarnán de Burca @ Stanza

In this fantastic Irish-themed episode of the podcast I sit down with Tiarnán de Burca, an ex-Googler, ex-Squarespace engineer with tons of hands-on experience in the SRE world. In this episode, I also offend all of Ireland!