Episode 414 – When the Cloud Falls: Understanding the AWS and Azure Outages of October 2025

November 06, 2025

Episode Transcript

Welcome to Episode 414 of the Microsoft Cloud IT Pro Podcast. This episode covers the major cloud service disruptions that impacted both AWS and Azure in October 2025. Even the biggest cloud providers face operational challenges. Learn what happened, how it was resolved, and what IT pros should keep in mind for future resilience planning.

Transcript is autogenerated by AI

Welcome to episode 414 of the Microsoft Cloud IT Pro Podcast, recorded live on 10/31/2025. This is a show about Microsoft 365 and Azure from the perspective of IT pros and end users, where we discuss a topic or recent news and how it relates to you. Fortunately, when we went to record this, the Internet was back online after an AWS and an Azure outage, both related to DNS and both within the last couple of weeks. So what better to discuss today than what happened, how it was resolved, and what IT pros should keep in mind for future resilience planning when it comes to your cloud infrastructure.
So, Scott, I saw this funny meme the other day. I'm gonna read it to you. I intentionally did not read this to you earlier. Somebody sent it to me. I have to go find it. Oh, I know where it is. I should have pulled it up earlier. And this can tie into another topic as well. Where is that message? Wow. Okay. Here you go, Scott.
"After getting fired from ungrateful AWS, after an outage where my job was to vibe code all the DNS entries to IPv6, happy to announce that it's my first day at Azure. Azure recognizes the value of vibe coding IPv6 DNS, and I just force pushed my first 1,000,000 entries. Now off to grab some coffee."
Yes. I've seen this one. The Internet. Somebody has been following at wrecked on X.
Yes. Actually, someone sent it. I do not follow them, but somebody sent that because DNS is apparently hard, as evidenced by this last week of both AWS and Azure. I guess it wasn't quite within a week. AWS was October 20. Azure was October 29. Nine days. There was a little bit of a spread in between, but it does happen.
It's always a good reminder when the cloud goes down that it really is somebody else's data center someplace else. It's just not your data center. These things tend to be far reaching.
I'm always amazed when Herndon goes down, so US East, Virginia, for AWS, and 50% of the Internet just goes offline. Because so many of the modern-day SaaS services, the things that you would depend on, like, hey, I listen to music on Spotify, I stream my podcast from here, I do my banking with this, all these different things are homed out of that region. So when bad things happen to Herndon, particularly in AWS land, bad things tend to happen on the Internet for the rest of us, or at least, I think, the parts of the Internet that folks who listen to this podcast would go for. So for me, like I said, that's things like Spotify going down, that is Reddit suddenly disappearing and going, nope. Yep. There went the body of knowledge that was pulling all these things out.
And then in this new world of LLMs and everything else that are doing both ingestion plus real-time searches of these systems, all that stuff starts to show its cracks along the way as well. So the AWS one, interestingly, manifested, I think, as a little bit of, oh, this all sounds like a lot of DNS. My understanding was it was actually a problem with DynamoDB and kinda like load balancing with Dynamo and the way that they push configuration and things like that into it. But I could be a little bit off there. I didn't have a ton of time to dive into theirs, especially, like you said, with the Azure outage coming on October 29, just nine days later, and that one being certainly more DNS related, or at least, I think, in the spirit of it, being that it was Azure Front Door and kinda some of the global load balancing capabilities of Front Door that got out of whack due to a configuration update.
And in both cases, in both systems, these were configuration updates that kind of went a little bit sideways, and things got a little bit squirrely. It's hard. Stuff at that scale is very complicated, but it always amazes me how one of those configuration changes can take down everything so quickly. We've seen it multiple times from multiple different cloud vendors, where you would think they would have figured out by this time how they could do small configuration changes that don't have the snowball effect, but yet we continue to see these. And, yeah, both were DNS. I was reading some on the AWS one too, and it sounds like it was DynamoDB, but something that's used to update DNS. Two different services were trying to update the same DNS records tied to DynamoDB. It's like trying to update the same line in a file multiple times and SharePoint complaining that you have version mismatches?
It's definitely possible to encounter these race conditions. Even small changes do have big impacts, so I think it's a little off, or maybe not the right color, to say, oh, it's surprising when a little configuration change, or a bigger configuration change, goes out. All these things go out, whether it's Amazon, whether it's Microsoft, whether it's Google. Everybody has their own deployment practices for safe deployments, for making sure that things get flighted through multiple rings and follow a general progression. You see the same thing when a feature rolls out in SharePoint, for example. Right? We all know about the different deployment rings and things like that. So it's the best of intentions. The interesting thing for me in the AWS RCA was they got into some of the nitty gritty around how complex these things are, with all these microservices that are running, talking to each other.
So things are starting to manifest where we've built these really awesome machines, right, to go and manage this all for us, with all this underlying logic and all these other things in them. But when these little subtle race conditions come through and stuff gets out of whack, in the case of the Dynamo thing, these workers between these various microservices becoming desynchronized, bad things happen. Right? So in the AWS one, just pulling up their RCA real quick, they've got a couple components. They've got this Planner and these Enactor workers within DynamoDB that help with some distribution of traffic and other things via DNS, but it's basically a bunch of internal components. I'd encourage somebody to go read about this. If you're interested in distributed computing, hyperscalers, all these things, it's always interesting to see how these things are designed.
But, you know, apparently, you had this one service, which is the DNS Enactor, which, when it fires up, verifies plan freshness: what it's supposed to do, what it's supposed to process, what endpoints it's supposed to update, all those things. Turns out, the DNS Enactor within Dynamo does a very sane thing in that it verifies the freshness of what it needs to do anytime that process starts, or at the start of processing. But it's not doing state management as it goes. It's always assuming that, hey, I spun up, this is current state, let me go make some changes and then check again kind of thing. So you had these multiple Enactors that are talking to each other, and like you said, it's a contention issue. So by the time one spins up and it says, okay, here's the plan, here's what I'm gonna go do, and it goes and does it, well, it turns out that another one was spinning up with a potentially different plan, because they had both been in flight.
And all of a sudden that check that had been performed, that was fresh, was now stale, and it's applying a stale configuration and overriding what was already there, and that leads to a series of cascading failures. And services like Dynamo are so integral to the fabric of AWS. There's a bunch of other services that depend on DynamoDB. So if you're doing compute and you're using virtual machines with EC2, you're doing functions with Lambda, even things like RBAC and IAM, they ultimately tie back to these database systems like Dynamo, and they have these just really bad, no good, horrible days. Closer to home for me, on my side, I've seen when we've had outages in storage, and it's a very similar thing. You'd be amazed at the number of services that depend on storage for something. Right? They publish some kind of state in there. Maybe they're not even using unstructured storage.
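A minimal sketch of the race described above, assuming a toy record that carries a plan version number. The names DnsRecord, apply_unchecked and apply_if_newer are hypothetical and are not taken from the AWS RCA; the sketch only contrasts a freshness check done once at start-up with a check done at write time.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    """Toy stand-in for a DNS record set; not the real AWS data model."""
    plan_version: int = 0
    targets: tuple = ()


def apply_unchecked(record, plan_version, targets):
    """Apply a plan with no re-check at write time (the racy behavior)."""
    record.plan_version = plan_version
    record.targets = targets


def apply_if_newer(record, plan_version, targets):
    """Apply a plan only if it is newer than what is already in place."""
    if plan_version <= record.plan_version:
        return False  # stale plan: refuse to overwrite newer state
    record.plan_version = plan_version
    record.targets = targets
    return True


def simulate(apply):
    """Replay the interleaving: the delayed worker writes last."""
    record = DnsRecord()
    # Worker A validated plan 1 at start-up and was then delayed.
    # Worker B started later, validated plan 2, and applies it first.
    apply(record, 2, ("10.0.0.2",))
    # Worker A finally writes, still trusting its start-up check.
    apply(record, 1, ("10.0.0.1",))
    return record.plan_version


racy_result = simulate(apply_unchecked)    # stale plan 1 overwrites plan 2
guarded_result = simulate(apply_if_newer)  # newer plan 2 survives
print(racy_result, guarded_result)
```

In a real concurrent system the comparison and the write would also need to be atomic (for example, a conditional write), which this single-threaded sketch does not show.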
It's not like they're storing logs or something, but maybe they're using NoSQL tables or they're using queues or things like that along the way. So there's just a bunch of moving pieces. There's a bunch of dependencies, and those dependencies just tend to bleed their way out. And I think what we're seeing a lot more with these outages, at least these two most recent ones, and I think if we look back a couple months as well, is that the impacts are just so far reaching because so many customers today are dependent on the cloud. I saw a lot of chatter after this one, like, oh, AWS went down, and then, oh, Azure went down, and, oh, we should all be multi cloud, and all these things. Right? Sure. Absolutely. We should, if we had infinite money, infinite time, infinite skilling, all those kinds of things that are out there, but that's ultimately not the reality for a lot of us. So I fall back to, are these things bad? Yes. Do we learn from them? Also, yes.
Like this particular race condition in the case of AWS, the thing that happened in Azure, they happened. They should not happen again, because we learn from them, we implement those changes, and we go forward. And as bad as it is to have half the Internet go down, well, half the Internet was down. It wasn't just you. It was everybody else. And the fix also wasn't on you. The fix was on somebody else. Right? So while all those servers were catching fire, while everything's spinning back up and there's just this big retry storm going on and network links are getting overloaded and CPU and memory and all these things are going down, as bad as it sounds to say it, it was somebody else's problem to fix. It wasn't our problem to fix. So I'm still reminded of that part, and very mindful that when these things do happen, yes, they're bad. Clearly, they can be very severe and have some crazy kind of impact.
But at the same time, while you're maybe up all night trying to inform your customers, or you're running around trying to figure out what's going on, ultimately, that responsibility sits with somebody else to make sure that it is where it needs to be and that it's back up and running. And I think, like I said, these things happen. We're talking about these massive distributed systems. They're built by the best engineers that are out there, and they still have these issues even with testing and things like that, but they will get hardened. These are just battles in the war. They make these systems more resilient at the end of the day. Everybody learns from these. The AWS outage, I can guarantee you folks in Azure learn from. The Azure outage, I can guarantee you folks at AWS and Google and other competitors are also learning from, as we're all publishing these RCAs and getting things out there and talking about what broke, what we're doing to make it better, how we're fixing it.
Yeah.
And even the whole multi cloud thing doesn't always work. I was looking at the AWS and the Azure one, and under both of them, Starbucks went down. So in that case, multi cloud didn't even help. Starbucks crashed with AWS. They crashed with Azure. It is what it is. And the Azure one too, you mentioned the network storm, and I think that's some of it. We talked about how a small change can trigger a widespread effect. Looking at the Azure outage, that one was a little bit more that way, where there was a configuration change that was applied to Front Door, and it caused a few of the Front Door nodes to fail. And then everything starts failing over to working ones, but the working ones can't handle all the failovers, and then they start failing, and it just snowballs from there. To your point, people didn't just go apply everything to all the Front Doors at once, but one cascaded to another.
Do you feel overwhelmed by trying to manage your Office 365 environment?
Are you facing unexpected issues that disrupt your company's productivity? Intelligink is here to help. Much like you take your car to the mechanic that has specialized knowledge on how to best keep your car running, Intelligink helps you with your Microsoft cloud environment because that's their expertise. Intelligink keeps up with the latest updates in the Microsoft cloud to help keep your business running smoothly and ahead of the curve. Whether you are a small organization with just a few users or an organization of several thousand employees, they want to partner with you to implement and administer your Microsoft cloud technology. Visit them at intelligink.com/podcast. That's intelligink.com/podcast, for more information or to schedule a thirty-minute call to get started with them today. Remember, Intelligink focuses on the Microsoft cloud so you can focus on your business.
It wasn't our problem to fix, but I also caused a cascading failure this week, Scott. Unless you wanna talk more about AWS and Azure failures.
We should talk about the Front Door one really quick.
Alright.
And I think I just wanna take this opportunity, maybe as someone who's a little bit closer to the lingo that's used internally around these things, to clarify some things.
Yep.
I saw a thread on Reddit that was diving into the Front Door outage. And if you go read the RCA that came out, I'll go with this first sentence in the "what went wrong and why": "An inadvertent tenant configuration change within Azure Front Door triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on Azure Front Door for global content delivery." And I'm gonna go back to the very first part of that. "An inadvertent tenant configuration change within Azure Front Door triggered a widespread service disruption."
There were folks on Reddit who were reading that, and they were taking that terminology of a tenant configuration change to mean a customer tenant. Like, maybe you have a Front Door profile and I have a Front Door profile, and you would have the ability to push a configuration change to your Front Door profile that would take down the whole system.
That cascaded to everything?
Yeah.
I could've told you that from externally. Right? But I can see where that language, tenant, is used so broadly.
So broadly.
Yeah, someone not familiar with it could take it that way.
So I just wanted to maybe provide a little bit of clarification there. When we say tenant in this respect, really what we're saying is service tenant, or maybe the tenant that the service itself is hosted on. So maybe another word for tenant here would be scale unit. Like, what are the scale units that host Front Door versus what are the actual customer tenants and things that are out there?
And I think the confusion for this one was maybe a little bit further born out of the fact that the Front Door team has currently blocked all Front Door configuration changes.
Oh, interesting.
If you have a Front Door profile and I have a Front Door profile, we are blocked from making changes to those profiles right now. And I think this kinda perpetuates that thinking that, oh, you and I are blocked from making changes, and that's because I could make a change that's gonna impact you. And I don't think that's the case with this one. I think this is more like scale units, internal service things, all of that again. So there was a configuration change internally. That configuration change introduced an invalid state, very similar to those race conditions that we were talking about with Dynamo on the other side.
That inconsistent state caused a whole bunch of AFD tenants, or AFD nodes, AFD scale units, whatever we wanna call them, to crash, and on that crash, to subsequently not be able to load properly. So Azure Front Door is kind of a global load balancer and a DNS load balancer. All of a sudden, you started seeing all this weird stuff, increased latencies, timeouts, connection errors, for every sort of downstream service that exists out there. So, in storage land, you ever provisioned a ZRS storage account? With a ZRS storage account, your DNS endpoint, your public endpoint, is a DNS CNAME that is part of a Front Door profile and points to a Front Door profile. So, not good. Right? All of a sudden your ZRS, zone-resilient thing could be having some trouble due to lack of DNS resolution.
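The failover snowball described earlier for Front Door (failed nodes shed load onto healthy ones, which then fail too) can be sketched with a toy model. Everything here is an assumption for illustration: equal load sharing, a fixed per-node capacity, and the made-up numbers. It does not describe how Azure Front Door actually balances traffic.

```python
def simulate_cascade(node_count, capacity, total_load, initial_failures):
    """Return how many nodes are still healthy once the load settles.

    Hypothetical model: every healthy node takes an equal share of
    total_load. While that share exceeds capacity, one more node fails
    and the same load is spread over the remaining nodes.
    """
    healthy = node_count - initial_failures
    while healthy > 0 and total_load / healthy > capacity:
        healthy -= 1  # an overloaded node drops out, raising everyone's share
    return healthy


# 10 nodes, each able to carry 100 units, serving 800 units in total.
survivors_after_two = simulate_cascade(10, 100, 800, 2)    # 8 nodes at 100 each: holds
survivors_after_three = simulate_cascade(10, 100, 800, 3)  # 7 nodes at ~114 each: collapses
print(survivors_after_two, survivors_after_three)
```

The point of the sketch is the tipping point: one extra initial failure is the difference between a fleet that absorbs the failover and one that loses every node.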
466 00:18:01,815 --> 00:18:04,534 The other one that happens in Azure land 467 00:18:04,534 --> 00:18:05,034 is 468 00:18:05,494 --> 00:18:08,474 so much of the tooling talks to API 469 00:18:08,534 --> 00:18:11,654 endpoints that are available via Front Door or 470 00:18:11,654 --> 00:18:14,119 fronted via Front Door. So you think about, 471 00:18:14,119 --> 00:18:16,440 like, management.azure.com, 472 00:18:16,440 --> 00:18:19,799 which is the RESTful API surface for all 473 00:18:19,799 --> 00:18:22,519 of Azure Resource Manager. That's behind Front Door. 474 00:18:22,519 --> 00:18:24,440 Lots of folks notice it when the portal 475 00:18:24,440 --> 00:18:26,835 goes down because, say, 476 00:18:26,835 --> 00:18:29,554 you're a public Azure customer, it doesn't 477 00:18:29,554 --> 00:18:32,034 matter if you're in the United Kingdom or 478 00:18:32,034 --> 00:18:34,034 the United States. We all just go to 479 00:18:34,034 --> 00:18:35,815 portal.azure.com, 480 00:18:35,875 --> 00:18:37,734 and we get redirected 481 00:18:38,115 --> 00:18:40,054 to the closest portal instance 482 00:18:40,509 --> 00:18:43,390 via DNS load balancing via Traffic Manager. So 483 00:18:43,390 --> 00:18:45,630 there's actually, like, regional endpoints for the portal, 484 00:18:45,630 --> 00:18:47,390 but they're all masked out because they're part 485 00:18:47,390 --> 00:18:50,190 of this resolution chain on the DNS side 486 00:18:50,190 --> 00:18:52,049 that can go a little sideways 487 00:18:52,750 --> 00:18:54,130 in the case of 488 00:18:54,644 --> 00:18:57,045 Front Door clearing out and getting to where 489 00:18:57,045 --> 00:18:58,644 it needs to be. So, 490 00:18:58,644 --> 00:19:01,065 yeah, definitely not a good look for either 491 00:19:01,845 --> 00:19:04,404 Azure or AWS on this one. 
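To make the "resolution chain on the DNS side" point concrete, here is a minimal Python sketch of how a client follows CNAME records until it reaches an address, and why one missing link in the middle makes the endpoint unreachable even when the regional instance is healthy. Every hostname and address in it is invented for illustration; it does not reflect the real records behind portal.azure.com or any storage endpoint, and it uses a hard-coded table rather than real DNS queries.

```python
# Hypothetical record table: all names and the address are made up for
# illustration and are NOT the real Azure DNS records.
RECORDS = {
    "portal.example.com": ("CNAME", "portal.global-lb.example.net"),
    "portal.global-lb.example.net": ("CNAME", "portal-uksouth.example.net"),
    "portal-uksouth.example.net": ("A", "203.0.113.10"),
}


def resolve(name, records, max_hops=8):
    """Follow CNAME records until an A record is found.

    Returns (address, chain), where chain lists every name visited.
    Raises LookupError if any link in the chain is missing.
    """
    chain = [name]
    for _ in range(max_hops):
        record = records.get(name)
        if record is None:
            raise LookupError(f"no record for {name}; chain so far: {chain}")
        record_type, value = record
        if record_type == "A":
            return value, chain
        # CNAME: keep walking the chain from the target name.
        name = value
        chain.append(name)
    raise LookupError(f"CNAME chain too long: {chain}")


if __name__ == "__main__":
    print(resolve("portal.example.com", RECORDS))

    # Simulate the load-balancing layer in the middle dropping out.
    broken = {k: v for k, v in RECORDS.items()
              if k != "portal.global-lb.example.net"}
    try:
        resolve("portal.example.com", broken)
    except LookupError as err:
        print("resolution failed:", err)
```

In the broken case the regional record is still present in the table, but nothing can reach it through the public name, which is the failure shape described above.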
I'm very 492 00:19:04,404 --> 00:19:07,285 mindful of, like, the customer pain that's felt 493 00:19:07,285 --> 00:19:09,205 on these and the friction that comes with 494 00:19:09,205 --> 00:19:10,750 it. I think the consolation 495 00:19:11,450 --> 00:19:11,950 is, 496 00:19:12,409 --> 00:19:13,549 one, as 497 00:19:14,329 --> 00:19:15,230 folks who 498 00:19:15,769 --> 00:19:18,009 curate and look after these environments that are 499 00:19:18,009 --> 00:19:20,190 hosted in Azure and AWS, 500 00:19:20,809 --> 00:19:22,329 as much as we own the message to 501 00:19:22,329 --> 00:19:24,169 our users that, yeah, it's broken and it's 502 00:19:24,169 --> 00:19:26,105 down, at least we don't have to own 503 00:19:26,105 --> 00:19:28,264 the fix for it, which is a double-edged sword. 504 00:19:28,264 --> 00:19:29,464 I don't think many of us could 505 00:19:29,464 --> 00:19:31,625 fix it faster than the folks who built 506 00:19:31,625 --> 00:19:32,284 these things 507 00:19:32,585 --> 00:19:35,065 could anyway all along the way. But it 508 00:19:35,065 --> 00:19:36,984 does give us some stuff to go 509 00:19:36,984 --> 00:19:38,524 out and think about 510 00:19:38,960 --> 00:19:40,320 and see if we can do a little 511 00:19:40,320 --> 00:19:42,480 bit differently next time. Yeah. And while they 512 00:19:42,480 --> 00:19:43,700 do go down, 513 00:19:44,160 --> 00:19:45,539 I would say lately, 514 00:19:46,160 --> 00:19:48,339 and this was kinda the case with AWS 515 00:19:48,400 --> 00:19:50,799 and Azure, I feel 516 00:19:50,799 --> 00:19:51,299 like 517 00:19:51,759 --> 00:19:53,460 response times and fix times 518 00:19:54,015 --> 00:19:57,154 for Azure and AWS have gotten quicker. 
519 00:19:57,295 --> 00:19:58,035 Like, the 520 00:19:58,414 --> 00:20:00,414 time from when they first go down to 521 00:20:00,414 --> 00:20:01,634 when they come back online 522 00:20:02,095 --> 00:20:04,515 used to... and I guess I think back to 523 00:20:04,575 --> 00:20:06,815 several years ago where you'd see outages that 524 00:20:06,815 --> 00:20:08,894 would be, like, day long outages, whether it 525 00:20:08,894 --> 00:20:10,099 was eight, ten, 526 00:20:10,400 --> 00:20:12,799 twelve, twenty four hours. There have been outages 527 00:20:12,799 --> 00:20:13,859 in Azure, AWS, 528 00:20:14,559 --> 00:20:16,960 Microsoft 365, all of those. I 529 00:20:16,960 --> 00:20:19,599 feel like the recovery time, like, 530 00:20:19,599 --> 00:20:22,694 from catching the issue starting to happen to 531 00:20:22,694 --> 00:20:24,694 where it's starting to resolve, maybe it's not 532 00:20:24,694 --> 00:20:25,674 completely resolved, 533 00:20:26,134 --> 00:20:28,774 but you're not hard down for, like, eight, 534 00:20:28,774 --> 00:20:31,255 ten hours. Companies have gotten better at that, 535 00:20:31,255 --> 00:20:34,214 catching it, mitigating it, and getting things back 536 00:20:34,214 --> 00:20:36,559 up quickly or at least starting to get 537 00:20:36,559 --> 00:20:38,320 them back up quickly. That seems to have 538 00:20:38,320 --> 00:20:40,160 gotten a lot better, I would say, 539 00:20:40,160 --> 00:20:41,680 in the last few years. It goes both 540 00:20:41,680 --> 00:20:43,840 ways. When the entire Internet is down, it 541 00:20:43,840 --> 00:20:44,820 feels like forever. 542 00:20:45,279 --> 00:20:47,200 And it's not just when the entire Internet's 543 00:20:47,200 --> 00:20:48,500 down. I think there's 544 00:20:48,994 --> 00:20:51,875 economic loss that's associated with these things. 
So 545 00:20:51,875 --> 00:20:53,955 I saw some estimates talking about, like, the 546 00:20:53,955 --> 00:20:56,914 AWS outage even for the, quote, unquote, brief 547 00:20:56,914 --> 00:20:58,835 period of time that it was being as 548 00:20:58,835 --> 00:21:01,315 high as, like, $500 to $600 million 549 00:21:01,315 --> 00:21:03,029 in lost revenue. Yeah. I saw some of 550 00:21:03,029 --> 00:21:05,269 those numbers too. For the companies that 551 00:21:05,269 --> 00:21:07,610 are hosted on top of it. I think 552 00:21:07,830 --> 00:21:08,970 like any 553 00:21:09,350 --> 00:21:11,430 dark cloud, like, you gotta look for the 554 00:21:11,430 --> 00:21:14,070 silver linings. It can't always be glass half 555 00:21:14,070 --> 00:21:16,970 empty kind of thing. So I will say 556 00:21:17,134 --> 00:21:20,174 a couple of maybe, like, positive things that 557 00:21:20,174 --> 00:21:22,335 happened in both of these outages, both the 558 00:21:22,335 --> 00:21:25,134 AWS one and the Azure one. I'm seeing 559 00:21:25,134 --> 00:21:28,414 that communication's getting better. So while folks are 560 00:21:28,414 --> 00:21:30,835 still complaining that, like, oh, the status pages 561 00:21:30,894 --> 00:21:33,500 aren't updating, things like that, I do think 562 00:21:33,500 --> 00:21:35,440 the kind of proactive communication, 563 00:21:35,740 --> 00:21:37,679 like, we're finding a better balance between 564 00:21:38,059 --> 00:21:40,139 how many engineers do we put on fixing 565 00:21:40,139 --> 00:21:42,619 the problem, which I generally, I would say 566 00:21:42,619 --> 00:21:44,700 let's index towards putting everybody on it. 
But 567 00:21:44,700 --> 00:21:46,139 if we put everybody on it, that's at 568 00:21:46,139 --> 00:21:47,899 the expense of being able to communicate to 569 00:21:47,899 --> 00:21:49,525 customers, because we might even be taking the 570 00:21:49,525 --> 00:21:51,525 person who can take that message and 571 00:21:51,525 --> 00:21:52,404 figure out how to get it to where 572 00:21:52,404 --> 00:21:53,924 it needs to be. So I think the 573 00:21:53,924 --> 00:21:56,884 transparent communication is getting way better. I've been really 574 00:21:56,884 --> 00:21:58,585 impressed by the 575 00:21:59,205 --> 00:22:00,265 post incident 576 00:22:01,009 --> 00:22:03,269 reviews that have come out from both Amazon 577 00:22:03,569 --> 00:22:05,190 and Azure recently. 578 00:22:05,809 --> 00:22:07,730 They're kinda going above and beyond in the 579 00:22:07,730 --> 00:22:10,309 things that they talk about and expose. Like, 580 00:22:10,369 --> 00:22:12,369 you as a regular customer, me as a regular 581 00:22:12,369 --> 00:22:14,625 customer, we should never need to know the 582 00:22:14,625 --> 00:22:15,845 names of 583 00:22:16,304 --> 00:22:16,964 the internal 584 00:22:17,424 --> 00:22:17,924 microservices 585 00:22:18,544 --> 00:22:19,924 that are part of DynamoDB. 586 00:22:20,625 --> 00:22:22,625 And, like, we should never need to know 587 00:22:22,625 --> 00:22:23,125 about 588 00:22:23,585 --> 00:22:27,424 these things like AWS's internal planner and enactor 589 00:22:27,424 --> 00:22:27,924 workers. 590 00:22:28,230 --> 00:22:30,950 Like, alright. Great. Like, let's not worry about 591 00:22:30,950 --> 00:22:32,390 that kind of thing. 
So I think you 592 00:22:32,390 --> 00:22:34,250 are seeing, like, a level of transparency 593 00:22:34,630 --> 00:22:35,929 from the hyperscalers 594 00:22:36,630 --> 00:22:37,690 that run these things 595 00:22:38,150 --> 00:22:40,329 and good transparent communication 596 00:22:40,630 --> 00:22:42,170 that's happening during the outages. 597 00:22:42,575 --> 00:22:44,095 The other thing I'll call out, like, these 598 00:22:44,095 --> 00:22:46,515 took some time in both cases to fix, 599 00:22:46,734 --> 00:22:48,835 but all those rollback procedures 600 00:22:49,454 --> 00:22:51,634 and stopping the bleeding and all that stuff, 601 00:22:51,694 --> 00:22:53,855 it worked. We're sitting here a week later 602 00:22:53,855 --> 00:22:55,855 and people are still banging their heads against 603 00:22:55,855 --> 00:22:57,295 the wall going, we don't know what the 604 00:22:57,295 --> 00:22:58,890 problem is. We don't know how to fix 605 00:22:58,890 --> 00:23:01,369 it. We don't know what changed. We don't 606 00:23:01,369 --> 00:23:03,549 know what happened. That's not the case here. 607 00:23:03,610 --> 00:23:05,690 Like, these things happened. There were point in 608 00:23:05,690 --> 00:23:08,410 time, tons of friction, tons of pain, horrible, 609 00:23:08,410 --> 00:23:11,049 yes. But they got fixed. They got fixed 610 00:23:11,049 --> 00:23:12,830 by somebody else, and they were fixed successfully. 611 00:23:13,315 --> 00:23:14,994 And then for whatever these failure modes are, 612 00:23:14,994 --> 00:23:17,154 like I said, you can be pretty confident 613 00:23:17,154 --> 00:23:18,595 that they're not gonna happen in the future. 614 00:23:18,595 --> 00:23:20,595 Are other things gonna happen? Yes. They haven't 615 00:23:20,595 --> 00:23:22,835 been discovered yet. 
But as they are, it 616 00:23:22,835 --> 00:23:25,154 all leads to more resiliency and it lends 617 00:23:25,154 --> 00:23:26,375 itself to more resiliency 618 00:23:27,289 --> 00:23:28,509 for these services. 619 00:23:29,130 --> 00:23:32,509 In some cases, I think there were mitigations 620 00:23:32,730 --> 00:23:34,490 put in place in a timely manner. So 621 00:23:34,490 --> 00:23:36,029 in the case of the Front Door outage, 622 00:23:36,329 --> 00:23:37,390 I saw that 623 00:23:37,769 --> 00:23:39,230 they actually pulled 624 00:23:39,529 --> 00:23:40,190 the portal 625 00:23:40,575 --> 00:23:42,815 out from behind Front Door. Like, they went 626 00:23:42,815 --> 00:23:45,454 and manipulated some DNS records to be able 627 00:23:45,454 --> 00:23:47,454 to give customers relief so that they could 628 00:23:47,454 --> 00:23:49,694 reach the portal without having to go through 629 00:23:49,694 --> 00:23:50,194 AFD 630 00:23:50,494 --> 00:23:51,794 and the load balancing 631 00:23:52,174 --> 00:23:52,674 mechanics 632 00:23:53,410 --> 00:23:55,170 that it brings along the way. I think 633 00:23:55,170 --> 00:23:57,490 the tooling's getting better. You're getting the ability 634 00:23:57,490 --> 00:24:00,150 in the tooling to target specific API surfaces, 635 00:24:00,369 --> 00:24:03,109 have other workarounds there, so that's all good. 636 00:24:03,569 --> 00:24:06,345 And, yeah, in general, like, sucks that it 637 00:24:06,345 --> 00:24:08,345 happened, but I'm actually, like, really happy with 638 00:24:08,345 --> 00:24:10,265 the responses here and the way they came 639 00:24:10,265 --> 00:24:12,664 out. Stuff could always go quicker. But that 640 00:24:12,664 --> 00:24:14,904 said, I think for what happened and the 641 00:24:14,904 --> 00:24:16,605 scale of both of these outages, 642 00:24:16,904 --> 00:24:20,684 stuff actually happened in a very timely way. 
643 00:24:20,700 --> 00:24:21,440 And, ultimately, 644 00:24:22,380 --> 00:24:23,819 not much that I would have wanted to 645 00:24:23,819 --> 00:24:25,659 do as a customer anyway. Like, if I 646 00:24:25,659 --> 00:24:27,819 was already a multi-cloud customer and I'm 647 00:24:27,819 --> 00:24:28,319 hosting 648 00:24:29,419 --> 00:24:30,639 in AWS and Azure, 649 00:24:30,940 --> 00:24:32,299 it's not like I'm gonna go out and 650 00:24:32,299 --> 00:24:34,220 bang on the door and say, well, let's 651 00:24:34,220 --> 00:24:36,154 go put ourselves into Oracle or Google and 652 00:24:36,154 --> 00:24:38,315 get yet another cloud here. Like, that's not 653 00:24:38,315 --> 00:24:39,615 necessarily the answer 654 00:24:40,075 --> 00:24:42,474 or the thing that's going to save you. 655 00:24:42,474 --> 00:24:45,295 You're only as resilient as your least resilient 656 00:24:45,355 --> 00:24:47,355 service kind of thing still at the end 657 00:24:47,355 --> 00:24:48,015 of the day. 658 00:24:48,369 --> 00:24:49,409 I think there is a little bit of 659 00:24:49,409 --> 00:24:51,890 an opportunity for customers to go through. Maybe 660 00:24:51,890 --> 00:24:53,809 you do wanna audit your dependencies a little 661 00:24:53,809 --> 00:24:55,250 bit, like, hey. Do I have to take 662 00:24:55,250 --> 00:24:56,609 a dependency on this thing? Or if I 663 00:24:56,609 --> 00:24:59,429 do, is there an alternative or a fallback 664 00:25:00,130 --> 00:25:02,849 service for me? Along the way, review your 665 00:25:02,849 --> 00:25:05,575 DR plans. So while you're not responsible, like 666 00:25:05,575 --> 00:25:07,734 I said, for fixing the servers and 667 00:25:07,734 --> 00:25:10,295 the underlying microservices that power these things, I 668 00:25:10,295 --> 00:25:11,815 think you still wanna have good ways to 669 00:25:11,815 --> 00:25:14,315 communicate to your users about what's going on. 
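The "least resilient service" point and the dependency audit can be put into rough numbers. The Python sketch below multiplies the availability of every dependency a request needs at the same time and converts the result into hours of downtime per year. The dependency names and availability figures are hypothetical placeholders, not published SLAs, and the model assumes the dependencies fail independently, which real outages often violate.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

# Hypothetical dependency audit: names and figures are placeholders,
# not real SLA numbers for any provider.
DEPENDENCIES = {
    "dns": 0.9999,
    "edge_load_balancer": 0.9999,
    "compute": 0.9995,
    "storage": 0.999,
}


def composite_availability(availabilities):
    """Availability when every dependency must be up at once.

    Assumes independent failures, so the probabilities multiply.
    """
    return prod(availabilities)


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    total = composite_availability(DEPENDENCIES.values())
    weakest = min(DEPENDENCIES.values())
    print(f"weakest single dependency: {weakest:.4%}")
    print(f"end-to-end availability:   {total:.4%}")
    print(f"expected downtime:         {annual_downtime_hours(total):.1f} h/year")
```

Under these assumptions the end-to-end figure comes out lower than the weakest single dependency, so each dependency that can be removed or given a fallback improves the overall number.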
670 00:25:14,535 --> 00:25:17,095 So if you're a company that works with 671 00:25:17,095 --> 00:25:17,595 Azure 672 00:25:18,809 --> 00:25:21,130 and you have admins who are maybe more 673 00:25:21,130 --> 00:25:22,970 ClickOps and they're dependent on the Azure 674 00:25:22,970 --> 00:25:24,730 portal, you wanna make sure that you have, 675 00:25:24,730 --> 00:25:25,789 like, good documentation 676 00:25:26,410 --> 00:25:28,490 for your employees about what happens when the 677 00:25:28,490 --> 00:25:29,710 Azure portal is unavailable. 678 00:25:30,184 --> 00:25:31,944 What happens when the M365 679 00:25:31,944 --> 00:25:33,865 portal is unavailable? What happens when this 680 00:25:33,865 --> 00:25:35,785 service is unavailable? Just so they know what 681 00:25:35,785 --> 00:25:38,505 to do, and they've got that kinda measured 682 00:25:38,505 --> 00:25:41,244 comfort food. You also need to think about 683 00:25:41,545 --> 00:25:42,845 kinda documenting 684 00:25:43,464 --> 00:25:44,924 recovery plans and expectations 685 00:25:45,640 --> 00:25:46,619 in terms of timing. 686 00:25:47,000 --> 00:25:49,000 So what happens if my cloud provider is 687 00:25:49,000 --> 00:25:51,319 down for ten seconds? What happens if my 688 00:25:51,319 --> 00:25:53,960 cloud provider is down for ten hours? Those 689 00:25:53,960 --> 00:25:55,179 are very different scenarios. 690 00:25:55,480 --> 00:25:57,000 And the way we react, the way we 691 00:25:57,000 --> 00:25:58,619 communicate with our user bases, 692 00:25:59,005 --> 00:26:00,625 all those things are 693 00:26:01,244 --> 00:26:02,304 going to be impacted. 694 00:26:02,765 --> 00:26:04,365 I think you also do have to think, 695 00:26:04,365 --> 00:26:06,304 like, I mentioned status pages. 696 00:26:06,605 --> 00:26:10,065 Both AWS and Azure, like, the status pages 697 00:26:10,125 --> 00:26:12,224 are not the greatest things at getting updated. 
698 00:26:12,284 --> 00:26:14,365 So, like, are there alternative systems that you 699 00:26:14,365 --> 00:26:16,819 wanna look at? I see still lots of 700 00:26:16,819 --> 00:26:19,380 customers using things like Downdetector and things 701 00:26:19,380 --> 00:26:22,179 like that to see when these things are 702 00:26:22,179 --> 00:26:24,899 occurring or if they have broader impact within 703 00:26:24,899 --> 00:26:26,359 geo, outside of geo, 704 00:26:26,914 --> 00:26:28,914 things like that. I think those are all 705 00:26:28,914 --> 00:26:31,394 good to stand up. And then the last 706 00:26:31,394 --> 00:26:33,954 thing I would think about is as you're 707 00:26:33,954 --> 00:26:35,554 going through and you're figuring out maybe some 708 00:26:35,554 --> 00:26:36,694 of these things around 709 00:26:37,154 --> 00:26:39,839 recovery plans, things like that, is making sure 710 00:26:39,839 --> 00:26:41,920 that you're not only setting the expectations with 711 00:26:41,920 --> 00:26:44,400 users, but also setting the expectations with your 712 00:26:44,400 --> 00:26:44,900 leadership. 713 00:26:45,359 --> 00:26:47,039 So, like, if you work for a company 714 00:26:47,039 --> 00:26:49,940 that's single cloud, multi-cloud, does your leadership 715 00:26:50,000 --> 00:26:51,140 have the right expectations 716 00:26:51,680 --> 00:26:52,180 around 717 00:26:52,720 --> 00:26:55,575 your company's dependency on the cloud? Has that 718 00:26:55,575 --> 00:26:57,815 been communicated in the right way? Does your 719 00:26:57,815 --> 00:26:58,875 leadership understand 720 00:26:59,494 --> 00:27:00,875 what they've bought into? 721 00:27:01,414 --> 00:27:03,335 Because there's the dream of the cloud: oh, 722 00:27:03,335 --> 00:27:05,494 it's somebody else's cloud, it's somebody else's problem, 723 00:27:05,494 --> 00:27:08,154 it's 100% available. 
And then there's the reality 724 00:27:08,214 --> 00:27:10,809 of the cloud, which we know so far, 725 00:27:10,809 --> 00:27:13,630 no system out there is truly 100%. 726 00:27:13,690 --> 00:27:15,690 So making sure that those things are ready 727 00:27:15,690 --> 00:27:17,929 to go so that your LT can weigh 728 00:27:17,929 --> 00:27:19,929 out all those options they need to, like 729 00:27:19,929 --> 00:27:20,829 multi-cloud 730 00:27:21,289 --> 00:27:22,429 strategy options, 731 00:27:22,934 --> 00:27:26,075 ultimately understanding that whole, like, risk reward scenario 732 00:27:26,934 --> 00:27:29,434 or maybe risk versus cost 733 00:27:29,734 --> 00:27:31,514 for things like additional resiliency 734 00:27:31,974 --> 00:27:32,634 and redundancy 735 00:27:33,494 --> 00:27:35,494 and where that all falls out for you. 736 00:27:35,494 --> 00:27:37,619 Sounds good. Well, with that, Scott, I actually 737 00:27:37,619 --> 00:27:39,640 have family waiting for me to go do 738 00:27:39,779 --> 00:27:42,100 Halloween-y stuff. So Halloween-y stuff. You 739 00:27:42,100 --> 00:27:43,700 can... It is the day for it. At 740 00:27:43,700 --> 00:27:45,720 least the weather is nice here in Jacksonville. 741 00:27:46,100 --> 00:27:48,279 Nice and cool out there. It's a balmy 742 00:27:48,434 --> 00:27:49,955 68. Yep. I think this is the first 743 00:27:49,955 --> 00:27:52,515 year it's under, like, 80 degrees Fahrenheit for 744 00:27:52,515 --> 00:27:54,355 Halloween in a while. It's been a while 745 00:27:54,355 --> 00:27:57,075 since it's been this cool. So yes. Well, 746 00:27:57,075 --> 00:27:59,174 thanks for that. Hopefully, no more DNS 747 00:27:59,634 --> 00:28:01,259 cloud outages here for a while. Hopefully 748 00:28:02,779 --> 00:28:04,940 Yes. It's something that nobody wants to happen. 749 00:28:04,940 --> 00:28:07,099 Nope. So go enjoy your weekend. 
Enjoy the 750 00:28:07,099 --> 00:28:09,740 rest of your Friday, and we'll be back 751 00:28:09,740 --> 00:28:12,619 again in a couple of weeks. Alright. Sounds 752 00:28:12,619 --> 00:28:14,295 good. Thanks, Ben. Alright. Thanks, 753 00:28:16,295 --> 00:28:18,775 Scott. If you enjoyed the podcast, go leave 754 00:28:18,775 --> 00:28:20,934 us a five star rating in iTunes. It 755 00:28:20,934 --> 00:28:22,615 helps to get the word out so more 756 00:28:22,615 --> 00:28:24,934 IT pros can learn about Office 365 757 00:28:24,934 --> 00:28:25,755 and Azure. 758 00:28:26,295 --> 00:28:27,894 If you have any questions you want us 759 00:28:27,894 --> 00:28:30,160 to address on the show or feedback about 760 00:28:30,160 --> 00:28:32,480 the show, feel free to reach out via 761 00:28:32,480 --> 00:28:34,660 our website, Twitter, or Facebook. 762 00:28:34,960 --> 00:28:36,880 Thanks again for listening, and have a great 763 00:28:36,880 --> 00:28:37,380 day.

Microsoft Cloud IT Pro Podcast

Ben Stegink, Scott Hoag

On the MS Cloud IT Pro Podcast, Scott and Ben discuss the Microsoft Cloud with a focus on IT Pros. They'll discuss the latest in Microsoft 365 and Office 365 News, Azure news and talk about their experiences with managing the Microsoft Cloud as well as interview industry experts on various cloud technology. They'll cover things such as SharePoint, Exchange, Microsoft Teams, PowerShell, Azure, Azure AD, Security, Networking, Storage, and the many other technologies and products that have made their way into the Microsoft 365 suite and Azure. To stay up-to-date on the latest in Microsoft Cloud news and gain some valuable knowledge as you deploy it within your own organization, make sure to tune in every week! Find out more at msclouditpropodcast.com.
