1
00:00:03,520 --> 00:00:05,919
Welcome to episode 414
2
00:00:05,919 --> 00:00:08,960
of the Microsoft Cloud IT Pro podcast recorded
3
00:00:08,960 --> 00:00:11,859
live on 10/31/2025.
4
00:00:12,160 --> 00:00:14,894
This is a show about Microsoft 365 and
5
00:00:14,894 --> 00:00:17,214
Azure from the perspective of IT pros and
6
00:00:17,214 --> 00:00:19,535
end users, where we discuss a topic or
7
00:00:19,535 --> 00:00:21,875
recent news and how it relates to you.
8
00:00:22,015 --> 00:00:22,515
Fortunately,
9
00:00:22,815 --> 00:00:24,815
when we went to record this, the Internet
10
00:00:24,815 --> 00:00:27,074
was back online after an AWS
11
00:00:27,454 --> 00:00:30,570
and Azure outage, both related to DNS and
12
00:00:30,570 --> 00:00:32,429
both within the last couple of weeks.
13
00:00:32,810 --> 00:00:35,289
So what better to discuss today than what
14
00:00:35,289 --> 00:00:37,710
happened, how it was resolved, and what IT
15
00:00:37,770 --> 00:00:40,685
pros should keep in mind for future resilience
16
00:00:40,744 --> 00:00:43,325
planning when it comes to your cloud infrastructure.
17
00:00:45,945 --> 00:00:48,265
So, Scott, I saw this funny meme the
18
00:00:48,265 --> 00:00:50,024
other day. I'm gonna read it to you.
19
00:00:50,024 --> 00:00:52,284
I intentionally did not read this to you
20
00:00:52,504 --> 00:00:53,004
earlier.
21
00:00:53,384 --> 00:00:55,679
So I saw this. Somebody sent it to
22
00:00:55,679 --> 00:00:57,120
me. I have to go oh, I know
23
00:00:57,120 --> 00:00:58,240
where it is. I have to go find
24
00:00:58,240 --> 00:00:59,780
it. I should have pulled it up earlier.
25
00:01:00,320 --> 00:01:02,240
And this can tie into another topic as
26
00:01:02,240 --> 00:01:04,340
well. Where is that
27
00:01:04,719 --> 00:01:05,219
message?
28
00:01:06,319 --> 00:01:08,834
Wow. Okay. Here you go, Scott. After getting
29
00:01:08,834 --> 00:01:10,375
fired from ungrateful
30
00:01:10,754 --> 00:01:11,254
AWS,
31
00:01:12,114 --> 00:01:14,194
after an outage where my job was to
32
00:01:14,194 --> 00:01:16,694
vibe code all the DNS entries to IPv6,
33
00:01:16,754 --> 00:01:19,234
happy to announce that it's my first
34
00:01:19,234 --> 00:01:21,795
day at Azure. Azure recognizes the value of
35
00:01:21,795 --> 00:01:24,390
vibe coding IPv6 DNS, and I just
36
00:01:24,390 --> 00:01:27,109
force-pushed my first million entries. Now off
37
00:01:27,109 --> 00:01:28,170
to grab some coffee.
38
00:01:29,510 --> 00:01:32,150
Yes. I've seen this one. The Internet. Somebody
39
00:01:32,150 --> 00:01:33,049
has been following
40
00:01:33,590 --> 00:01:35,290
at wrecked on X.
41
00:01:35,990 --> 00:01:37,750
Yes. Actually, someone sent it. I do not
42
00:01:37,750 --> 00:01:39,965
follow them, but somebody sent that because
43
00:01:40,265 --> 00:01:43,384
DNS is apparently hard as evidenced by this
44
00:01:43,384 --> 00:01:45,965
last week of both AWS and Azure.
45
00:01:46,265 --> 00:01:47,864
I guess it wasn't quite within a week.
46
00:01:47,864 --> 00:01:50,825
AWS was October 20. Azure was October 29.
47
00:01:50,825 --> 00:01:52,650
Nine days. There was a little bit of
48
00:01:52,650 --> 00:01:53,709
a spread in between,
49
00:01:54,730 --> 00:01:55,870
but it does happen.
50
00:01:56,170 --> 00:01:58,109
It's always a good reminder when
51
00:01:59,209 --> 00:02:00,750
the cloud goes down
52
00:02:01,130 --> 00:02:02,670
that it really
53
00:02:03,049 --> 00:02:05,954
is somebody else's data center someplace else. It's
54
00:02:05,954 --> 00:02:06,694
just not
55
00:02:06,995 --> 00:02:09,555
it's not your data center. These things tend
56
00:02:09,555 --> 00:02:12,375
to be far reaching. I'm always
57
00:02:12,915 --> 00:02:13,415
amazed
58
00:02:13,955 --> 00:02:14,455
when
59
00:02:15,074 --> 00:02:18,055
Herndon goes down, so like US East Virginia
60
00:02:18,115 --> 00:02:19,014
for AWS,
61
00:02:19,715 --> 00:02:20,215
and
62
00:02:20,790 --> 00:02:23,349
50% of the Internet just goes offline. Because
63
00:02:23,349 --> 00:02:24,969
there are so many
64
00:02:25,430 --> 00:02:26,250
of the
65
00:02:26,870 --> 00:02:27,849
modern day
66
00:02:28,229 --> 00:02:29,129
SaaS services,
67
00:02:29,430 --> 00:02:31,270
like the things that you would depend on,
68
00:02:31,270 --> 00:02:33,689
like, hey, I listen to music on Spotify,
69
00:02:33,830 --> 00:02:36,294
I stream my podcast from here, I do
70
00:02:36,294 --> 00:02:38,875
my banking with like, all these different things
71
00:02:39,175 --> 00:02:41,194
are all homed out of that region.
72
00:02:41,655 --> 00:02:44,474
So when bad things happen to Herndon,
73
00:02:44,935 --> 00:02:46,715
particularly in AWS land,
74
00:02:47,495 --> 00:02:49,995
bad things tend to happen on the Internet
75
00:02:50,069 --> 00:02:51,990
for the rest of us or at least
76
00:02:51,990 --> 00:02:54,230
I think the parts of the Internet that
77
00:02:54,230 --> 00:02:57,110
folks who listen to this podcast would go
78
00:02:57,110 --> 00:02:58,790
for. So for me, like I said, that's
79
00:02:58,790 --> 00:03:00,650
things like Spotify going down,
80
00:03:01,110 --> 00:03:02,250
that is
81
00:03:03,094 --> 00:03:06,134
Reddit suddenly disappearing and going no. Yep. There
82
00:03:06,134 --> 00:03:07,354
went the body of knowledge
83
00:03:07,735 --> 00:03:09,655
that was pulling all these things out. And
84
00:03:09,655 --> 00:03:11,194
then in this new world
85
00:03:11,574 --> 00:03:14,294
of LLMs and everything else that are doing
86
00:03:14,294 --> 00:03:17,400
both ingestion plus real-time searches of these
87
00:03:17,400 --> 00:03:17,900
systems,
88
00:03:18,199 --> 00:03:21,000
like, all that stuff starts to show its
89
00:03:21,000 --> 00:03:21,500
cracks
90
00:03:21,879 --> 00:03:22,939
along the way
91
00:03:24,120 --> 00:03:27,740
as well. So the AWS one, interestingly,
92
00:03:28,040 --> 00:03:29,960
like, manifests, I think, is a little bit
93
00:03:29,960 --> 00:03:32,294
of, like, oh, this all sounds like a
94
00:03:32,294 --> 00:03:34,775
lot of DNS. My understanding was it was
95
00:03:34,775 --> 00:03:36,955
actually a problem with DynamoDB
96
00:03:37,335 --> 00:03:40,395
and kinda like load balancing with Dynamo and
97
00:03:40,534 --> 00:03:41,995
the way that they push
98
00:03:42,455 --> 00:03:42,955
configuration
99
00:03:43,495 --> 00:03:45,520
and things like that into it. But I
100
00:03:45,520 --> 00:03:46,639
could be a little bit off there. I
101
00:03:46,639 --> 00:03:48,400
didn't have a ton of time to dive
102
00:03:48,400 --> 00:03:49,060
into theirs,
103
00:03:49,360 --> 00:03:51,759
especially, like you said, with the Azure outage
104
00:03:51,759 --> 00:03:54,639
coming on October 29, just nine days later,
105
00:03:54,639 --> 00:03:55,939
and that one being
106
00:03:56,240 --> 00:03:59,574
certainly more DNS related or at least like
107
00:03:59,574 --> 00:04:01,735
a I think to the spirit of it
108
00:04:01,735 --> 00:04:03,814
being that it was Azure Front Door and
109
00:04:03,814 --> 00:04:04,314
kinda
110
00:04:04,615 --> 00:04:06,694
some of the global load balancing capabilities of
111
00:04:06,694 --> 00:04:08,855
Front Door that got out of whack due
112
00:04:08,855 --> 00:04:09,515
to a
113
00:04:10,375 --> 00:04:13,094
configuration update. And in both cases, in both
114
00:04:13,094 --> 00:04:15,490
systems, these were configuration updates
115
00:04:15,950 --> 00:04:18,449
that kind of went a little bit sideways,
116
00:04:18,750 --> 00:04:20,830
and things got a little bit squirrely. It's
117
00:04:20,830 --> 00:04:24,689
hard. Stuff at that scale is very complicated,
118
00:04:24,830 --> 00:04:25,649
but it always
119
00:04:26,029 --> 00:04:27,889
amazes me how
120
00:04:28,865 --> 00:04:31,444
one of those configuration changes
121
00:04:31,904 --> 00:04:33,285
can take down everything
122
00:04:33,824 --> 00:04:36,384
so quickly that because we've seen it multiple
123
00:04:36,384 --> 00:04:39,425
times from multiple different cloud vendors where you
124
00:04:39,425 --> 00:04:40,865
would think they would have figured out by
125
00:04:40,865 --> 00:04:42,779
this time how they could do, like, small
126
00:04:42,779 --> 00:04:45,680
configuration changes that don't have the snowball effect,
127
00:04:45,899 --> 00:04:47,599
but yet we continue to
128
00:04:47,899 --> 00:04:50,459
see these. And, yeah, both were DNS. I
129
00:04:50,459 --> 00:04:52,620
was reading some on the AWS one too,
130
00:04:52,620 --> 00:04:54,459
and it sounds like it was it was
131
00:04:54,459 --> 00:04:54,959
Dynamo
132
00:04:55,535 --> 00:04:56,035
DB,
133
00:04:56,495 --> 00:04:57,394
but updating
134
00:04:57,774 --> 00:04:59,774
that's used to update DNS. And it was
135
00:04:59,774 --> 00:05:02,014
like two different services were trying to update
136
00:05:02,014 --> 00:05:04,415
the same DNS records tied to DynamoDB
137
00:05:04,415 --> 00:05:06,495
and, oh, and two things are trying to
138
00:05:06,495 --> 00:05:08,654
update the same DNS record. It's like trying
139
00:05:08,654 --> 00:05:10,959
to update the same line in a file
140
00:05:11,019 --> 00:05:13,259
multiple times and SharePoint complaining that you have
141
00:05:13,259 --> 00:05:16,319
version mismatches? It's definitely possible to
142
00:05:17,100 --> 00:05:18,639
encounter these race conditions.
143
00:05:19,339 --> 00:05:22,399
Even small changes do have big impacts, so
144
00:05:22,459 --> 00:05:24,459
I think it's a little it's a little
145
00:05:24,459 --> 00:05:26,754
off or maybe, like, not the right color
146
00:05:26,754 --> 00:05:28,754
to say, like, oh, it's surprising when a
147
00:05:28,754 --> 00:05:30,134
little configuration change
148
00:05:30,514 --> 00:05:32,595
or, like, that a bigger configuration change goes
149
00:05:32,595 --> 00:05:34,915
out. Like, all these things go out, whether
150
00:05:34,915 --> 00:05:38,055
it's Amazon, whether it's Microsoft, whether it's Google.
151
00:05:38,459 --> 00:05:40,879
Everybody has their own deployment practices
152
00:05:41,259 --> 00:05:42,639
for safe deployments,
153
00:05:43,100 --> 00:05:45,339
for making sure that things get flighted through,
154
00:05:45,339 --> 00:05:48,139
like, multiple rings and they follow a general
155
00:05:48,139 --> 00:05:49,980
progression. You see the same thing, like, when
156
00:05:49,980 --> 00:05:51,740
a feature rolls out in SharePoint, for example.
157
00:05:51,740 --> 00:05:53,339
Right? We all know about the different rings
158
00:05:53,339 --> 00:05:55,634
that go in there with deployment rings and
159
00:05:55,634 --> 00:05:58,115
things like that. So it's the best of
160
00:05:58,115 --> 00:05:58,615
intentions.
161
00:05:59,394 --> 00:06:02,034
The interesting thing for me in the
162
00:06:02,194 --> 00:06:03,254
AWS RCA
163
00:06:03,634 --> 00:06:05,394
was they got into some of the nitty
164
00:06:05,394 --> 00:06:07,094
gritty around how
165
00:06:07,474 --> 00:06:10,370
complex these things are with all these microservices
166
00:06:10,830 --> 00:06:13,710
that are running, talking to each other. So
167
00:06:13,710 --> 00:06:16,110
you'd like things are starting to manifest where
168
00:06:16,110 --> 00:06:17,970
we've built these really awesome
169
00:06:18,350 --> 00:06:20,590
machines, right, to go and manage this all
170
00:06:20,590 --> 00:06:22,430
for us and have all this underlying logic
171
00:06:22,430 --> 00:06:24,694
and all these other things into them. But
172
00:06:24,995 --> 00:06:27,875
when these, like, little subtle race conditions are
173
00:06:27,875 --> 00:06:30,055
coming through or other things are coming out
174
00:06:30,115 --> 00:06:32,834
and stuff gets out of whack, in
175
00:06:32,834 --> 00:06:35,669
the case of the Dynamo thing, these workers
176
00:06:35,729 --> 00:06:37,029
between these various microservices
177
00:06:37,410 --> 00:06:38,229
becoming desynchronized,
178
00:06:39,410 --> 00:06:40,310
bad things
179
00:06:40,769 --> 00:06:41,269
happen.
180
00:06:41,970 --> 00:06:42,470
Right?
181
00:06:42,849 --> 00:06:45,329
So I think in the AWS one just
182
00:06:45,329 --> 00:06:46,930
pulling up their RCA real quick. So they've
183
00:06:46,930 --> 00:06:49,024
got a couple components. They've got this planner
184
00:06:49,024 --> 00:06:50,805
and these enactor workers
185
00:06:51,185 --> 00:06:52,644
within DynamoDB
186
00:06:53,425 --> 00:06:57,204
that help with some distribution of traffic
187
00:06:57,345 --> 00:06:59,425
and other things via DNS, but it's a
188
00:06:59,425 --> 00:07:02,160
bunch of basically, like, internal components. I'd encourage
189
00:07:02,160 --> 00:07:03,600
somebody to go read about this. Like, if
190
00:07:03,600 --> 00:07:06,879
you're interested in, like, distributed computing, hyperscalers, all
191
00:07:06,879 --> 00:07:09,680
these things, like, it's always interesting to see
192
00:07:09,680 --> 00:07:12,639
how these things are designed. But, you know,
193
00:07:12,639 --> 00:07:15,199
apparently, you had this one service, which is
194
00:07:15,199 --> 00:07:16,660
the DNS Enactor,
195
00:07:17,394 --> 00:07:20,055
which when it fires up, it verifies
196
00:07:20,435 --> 00:07:23,074
plan freshness, what it's supposed to do, what
197
00:07:23,074 --> 00:07:24,134
it's supposed to process,
198
00:07:24,595 --> 00:07:27,634
what updates it's or endpoints it's supposed to
199
00:07:27,634 --> 00:07:29,175
update, all those things.
200
00:07:29,620 --> 00:07:32,839
Turns out, the DNS Enactor within
201
00:07:33,139 --> 00:07:36,579
Dynamo does a very, like, sane thing in
202
00:07:36,579 --> 00:07:38,740
that it verifies the freshness of what it
203
00:07:38,740 --> 00:07:39,560
needs to do
204
00:07:39,939 --> 00:07:42,824
anytime that process starts or at the start
205
00:07:42,824 --> 00:07:45,544
of processing. But it's not doing, like, state
206
00:07:45,544 --> 00:07:47,785
management as it goes. It's always assuming that,
207
00:07:47,785 --> 00:07:49,944
hey. I spun up. This is current state.
208
00:07:49,944 --> 00:07:51,865
Let me go make some changes and then
209
00:07:51,865 --> 00:07:54,584
check again kind of thing. So you had
210
00:07:54,584 --> 00:07:57,144
these multiple Enactors that are talking to each
211
00:07:57,144 --> 00:07:57,644
other,
212
00:07:58,079 --> 00:08:00,240
and like you said, it's a contention issue.
213
00:08:00,240 --> 00:08:02,319
So by the time one spins up and
214
00:08:02,319 --> 00:08:03,919
it says, okay. Here's the plan. Here's what
215
00:08:03,919 --> 00:08:04,980
I'm gonna go do,
216
00:08:05,360 --> 00:08:07,759
and it goes and does it, well, it
217
00:08:07,759 --> 00:08:10,000
turns out that another one was spinning up
218
00:08:10,000 --> 00:08:12,000
with a potentially different plan because that is
219
00:08:12,079 --> 00:08:14,475
they haven't been in flight. And all of
220
00:08:14,475 --> 00:08:16,314
a sudden that check that had been performed
221
00:08:16,314 --> 00:08:18,714
that was fresh was now stale, and it's
222
00:08:18,714 --> 00:08:20,495
applying a stale configuration
223
00:08:21,115 --> 00:08:23,214
and overriding what was already there,
224
00:08:23,514 --> 00:08:25,935
and that leads to a series
225
00:08:26,235 --> 00:08:27,375
of cascading
226
00:08:27,834 --> 00:08:28,334
failures.
227
00:08:29,000 --> 00:08:31,180
And for services like Dynamo,
228
00:08:31,560 --> 00:08:35,740
they're so integral to the fabric of AWS.
229
00:08:36,200 --> 00:08:38,440
So there's a bunch of other services that
230
00:08:38,440 --> 00:08:41,019
are depending on DynamoDB. So if you're
231
00:08:41,325 --> 00:08:43,804
doing compute and you're using virtual machines with
232
00:08:43,804 --> 00:08:44,304
EC2,
233
00:08:44,605 --> 00:08:46,785
you're doing functions with Lambda,
234
00:08:47,245 --> 00:08:51,404
even things like RBAC and IAM ultimately tie
235
00:08:51,404 --> 00:08:54,785
back to these database systems like Dynamo, and
236
00:08:54,925 --> 00:08:57,929
they have these, like, just really bad, no
237
00:08:57,929 --> 00:08:59,070
good, horrible days.
238
00:08:59,929 --> 00:09:01,610
The closer to home for me on my
239
00:09:01,610 --> 00:09:04,190
side, I've seen when we've had outages
240
00:09:04,809 --> 00:09:05,710
in storage
241
00:09:06,409 --> 00:09:09,924
and very similar thing, like, you'd be amazed
242
00:09:09,924 --> 00:09:12,325
at the number of services that depend on
243
00:09:12,325 --> 00:09:15,284
storage for something. Right? They publish some kind
244
00:09:15,284 --> 00:09:16,264
of state in there.
245
00:09:16,644 --> 00:09:19,044
Maybe they're not even using, like, unstructured storage.
246
00:09:19,044 --> 00:09:20,725
It's not like they're storing logs or something,
247
00:09:20,725 --> 00:09:22,529
but maybe they're using, like, NoSQL
248
00:09:22,830 --> 00:09:24,690
tables or they're using queues
249
00:09:25,070 --> 00:09:27,389
or things like that along the way. So
250
00:09:27,389 --> 00:09:29,389
there there's just a bunch of moving pieces.
251
00:09:29,389 --> 00:09:30,850
There's a bunch of dependencies,
252
00:09:31,950 --> 00:09:33,409
and those dependencies
253
00:09:34,110 --> 00:09:35,169
just tend to
254
00:09:35,710 --> 00:09:37,964
bleed their way out. And I think what
255
00:09:37,964 --> 00:09:39,565
we were seeing a lot more is with
256
00:09:39,565 --> 00:09:42,044
these outages, at least these last couple, these
257
00:09:42,044 --> 00:09:43,644
two most recent ones, and I think if
258
00:09:43,644 --> 00:09:45,504
we look back a couple months as well,
259
00:09:45,565 --> 00:09:48,284
the impacts are just so far reaching because
260
00:09:48,284 --> 00:09:49,904
so many customers today
261
00:09:50,389 --> 00:09:52,950
are dependent on the cloud. Like, I saw
262
00:09:52,950 --> 00:09:54,389
a lot of chatter after this one, like,
263
00:09:54,389 --> 00:09:56,549
oh, AWS went down, and then, oh, Azure
264
00:09:56,549 --> 00:09:57,830
went down, and, oh, we should all be
265
00:09:57,830 --> 00:09:59,590
multi-cloud, and we should all be
266
00:09:59,590 --> 00:10:02,809
and all these things. Right? Like, sure. Absolutely.
267
00:10:03,269 --> 00:10:05,934
We should. If we had infinite money, infinite
268
00:10:05,934 --> 00:10:08,815
time, infinite skilling, all those kinds of things
269
00:10:08,815 --> 00:10:11,375
that are out there, but that's ultimately not
270
00:10:11,375 --> 00:10:13,134
the reality for a lot of us. So
271
00:10:13,134 --> 00:10:15,554
I fall back to, are these things bad?
272
00:10:15,695 --> 00:10:18,240
Yes. Do we learn from them? Also, yes.
273
00:10:18,240 --> 00:10:20,240
Like this particular race condition in the
274
00:10:20,240 --> 00:10:21,139
case of AWS,
275
00:10:21,759 --> 00:10:24,100
the thing that happened in Azure, they happened.
276
00:10:24,480 --> 00:10:26,960
They should not happen again because we learn
277
00:10:26,960 --> 00:10:28,720
from them, we implement those changes, and we
278
00:10:28,720 --> 00:10:30,945
go forward. And as bad as it is
279
00:10:30,945 --> 00:10:33,825
to have half the Internet go down, well,
280
00:10:33,825 --> 00:10:35,825
half the Internet was down. It wasn't just
281
00:10:35,825 --> 00:10:38,004
you. It was everybody else. And
282
00:10:38,705 --> 00:10:42,225
the fix also wasn't on you. The fix
283
00:10:42,225 --> 00:10:44,420
was on somebody else. Right? So while all
284
00:10:44,420 --> 00:10:46,980
those servers were catching fire, while everything's spinning
285
00:10:46,980 --> 00:10:49,460
back up and there's just this big retry
286
00:10:49,460 --> 00:10:51,700
storm going on and network links are getting
287
00:10:51,700 --> 00:10:53,860
overloaded and CPU and memory and all these
288
00:10:53,860 --> 00:10:55,735
things are going down, like, as bad as
289
00:10:55,735 --> 00:10:57,415
it sounds to say it, it was somebody
290
00:10:57,415 --> 00:10:58,634
else's problem to fix.
291
00:10:59,495 --> 00:11:02,215
It wasn't our problem to fix. So I'm
292
00:11:02,215 --> 00:11:04,774
still reminded of that part, like and very
293
00:11:04,774 --> 00:11:07,195
mindful that, like, when these things do happen,
294
00:11:08,000 --> 00:11:08,980
yes, they're bad.
295
00:11:09,360 --> 00:11:11,840
Clearly, they can be very severe and go
296
00:11:11,840 --> 00:11:14,159
out there and have some crazy kind
297
00:11:14,159 --> 00:11:16,100
of impact. But at the same time,
298
00:11:16,639 --> 00:11:18,960
while you're maybe up all night trying to
299
00:11:18,960 --> 00:11:20,899
inform your customers or
300
00:11:21,324 --> 00:11:22,684
you're kind of running around trying to figure
301
00:11:22,684 --> 00:11:25,485
out what's going on, ultimately, that responsibility sits
302
00:11:25,485 --> 00:11:26,464
with somebody else
303
00:11:27,164 --> 00:11:29,245
to make sure that it is ultimately where
304
00:11:29,245 --> 00:11:30,924
it needs to be and that it's back
305
00:11:30,924 --> 00:11:33,565
up and it's running. And I think, like
306
00:11:33,565 --> 00:11:35,504
I said, like, these things happen.
307
00:11:36,044 --> 00:11:37,105
We're talking like
308
00:11:37,589 --> 00:11:39,450
these massive distributed systems.
309
00:11:39,909 --> 00:11:42,709
They're built by the best engineers that are
310
00:11:42,709 --> 00:11:44,169
out there, and
311
00:11:44,549 --> 00:11:46,629
they still have these issues even with testing,
312
00:11:46,629 --> 00:11:48,730
things like that, but they will get hardened.
313
00:11:49,029 --> 00:11:51,190
These are just battles in the war. They
314
00:11:51,190 --> 00:11:52,089
make these systems
315
00:11:52,504 --> 00:11:54,605
more resilient at the end of the day.
316
00:11:54,825 --> 00:11:57,725
Everybody learns from these. Like the AWS outage,
317
00:11:57,865 --> 00:11:59,784
I can guarantee you folks in Azure learn
318
00:11:59,784 --> 00:12:02,105
from. The Azure outage, I can guarantee you
319
00:12:02,105 --> 00:12:04,904
folks at AWS and Google and competitors are
320
00:12:04,904 --> 00:12:07,049
also learning from as well as we're all
321
00:12:07,049 --> 00:12:10,090
publishing these RCAs and getting things out there
322
00:12:10,090 --> 00:12:12,809
and kinda talking about what broke, what we're
323
00:12:12,809 --> 00:12:14,410
doing to make it better, how we're fixing
324
00:12:14,410 --> 00:12:15,929
it. Yeah. And even the whole multi-cloud
325
00:12:15,929 --> 00:12:17,769
thing doesn't always work. Like, I was looking
326
00:12:17,769 --> 00:12:19,529
at the AWS and the Azure one, and
327
00:12:19,529 --> 00:12:21,514
under both of them, Starbucks went down. So
328
00:12:21,514 --> 00:12:23,674
it's like Yes. In that case, multi-cloud
329
00:12:23,674 --> 00:12:26,235
didn't even help. Like, Starbucks crashed with AWS.
330
00:12:26,235 --> 00:12:28,394
They crashed with Azure. It is what it
331
00:12:28,394 --> 00:12:30,634
is. And the Azure one too, like, you
332
00:12:30,634 --> 00:12:32,735
mentioned the network storm, and I think that's
333
00:12:32,794 --> 00:12:34,154
some of it. We talked about how a
334
00:12:34,154 --> 00:12:36,829
small change can trigger a widespread effect.
335
00:12:37,370 --> 00:12:39,230
Looking at the Azure outage,
336
00:12:39,690 --> 00:12:41,850
that one was a little bit more that
337
00:12:41,850 --> 00:12:43,529
way where there was a change that was
338
00:12:43,529 --> 00:12:47,309
applied to Front Door, a configuration change, and
339
00:12:47,690 --> 00:12:49,735
it caused a few of the Front Door
340
00:12:49,735 --> 00:12:52,295
nodes to fail. And then everything starts failing
341
00:12:52,295 --> 00:12:54,774
over to working ones, but the working ones
342
00:12:54,774 --> 00:12:57,894
don't handle all the failovers, and then they
343
00:12:57,894 --> 00:12:59,115
start failing, and
344
00:12:59,495 --> 00:13:01,860
it just snowballs from there where it wasn't
345
00:13:01,860 --> 00:13:03,940
like to your point, people didn't just go
346
00:13:03,940 --> 00:13:05,779
apply everything to all the Front Doors at
347
00:13:05,779 --> 00:13:08,440
once, but one cascaded to another.
348
00:13:12,339 --> 00:13:14,475
Do you feel overwhelmed by trying to manage
349
00:13:14,475 --> 00:13:16,794
your Office 365 environment? Are you
350
00:13:16,794 --> 00:13:20,095
facing unexpected issues that disrupt your company's productivity?
351
00:13:20,315 --> 00:13:22,315
Intelligink is here to help. Much like you
352
00:13:22,315 --> 00:13:24,154
take your car to the mechanic that has
353
00:13:24,154 --> 00:13:26,315
specialized knowledge on how to best keep your
354
00:13:26,315 --> 00:13:29,350
car running, Intelligink helps you with your Microsoft
355
00:13:29,350 --> 00:13:31,690
cloud environment because that's their expertise.
356
00:13:31,990 --> 00:13:34,310
Intelligink keeps up with the latest updates in
357
00:13:34,310 --> 00:13:36,470
the Microsoft cloud to help keep your business
358
00:13:36,470 --> 00:13:38,710
running smoothly and ahead of the curve. Whether
359
00:13:38,710 --> 00:13:40,790
you are a small organization with just a
360
00:13:40,790 --> 00:13:43,184
few users up to an organization of several
361
00:13:43,184 --> 00:13:45,985
thousand employees, they want to partner with you
362
00:13:45,985 --> 00:13:49,365
to implement and administer your Microsoft cloud technology.
363
00:13:49,985 --> 00:13:53,605
Visit them at intelligink.com/podcast.
364
00:13:53,904 --> 00:14:00,529
That's intelligink.com/podcast
365
00:14:00,910 --> 00:14:03,070
for more information or to schedule a thirty
366
00:14:03,070 --> 00:14:05,169
minute call to get started with them today.
367
00:14:05,389 --> 00:14:08,750
Remember, Intelligink focuses on the Microsoft cloud so
368
00:14:08,750 --> 00:14:10,485
you can focus on your business.
369
00:14:12,644 --> 00:14:14,964
It wasn't our problem to fix, but I also
370
00:14:14,964 --> 00:14:17,284
caused a cascading failure this week, Scott. Unless
371
00:14:17,284 --> 00:14:18,964
you wanna talk more about AWS and Azure
372
00:14:18,964 --> 00:14:21,204
failures. We should talk about the Front Door
373
00:14:21,204 --> 00:14:24,004
one really quick. Alright. And I think I
374
00:14:24,004 --> 00:14:27,540
just wanna take this opportunity maybe as someone
375
00:14:27,540 --> 00:14:29,700
who's a little bit closer to the lingo
376
00:14:29,700 --> 00:14:32,420
that's used internally around these things to clarify
377
00:14:32,420 --> 00:14:34,420
some things. Yep. I saw a thread on
378
00:14:34,420 --> 00:14:38,580
Reddit that was diving into the Front Door
379
00:14:38,580 --> 00:14:41,384
outage. And if you go read the RCA
380
00:14:42,084 --> 00:14:44,084
that comes out, like, I'll go with this
381
00:14:44,084 --> 00:14:45,924
first sentence in the what went wrong and
382
00:14:45,924 --> 00:14:51,044
why. An inadvertent tenant configuration change within Azure
383
00:14:51,044 --> 00:14:52,745
Front Door triggered a widespread
384
00:14:53,110 --> 00:14:56,009
service disruption affecting both Microsoft services
385
00:14:56,470 --> 00:14:57,769
and customer applications
386
00:14:58,230 --> 00:15:01,269
dependent on Azure Front Door for global content
387
00:15:01,269 --> 00:15:02,710
delivery. And I'm gonna go back to the
388
00:15:02,710 --> 00:15:04,570
very first part of that. An inadvertent
389
00:15:05,110 --> 00:15:07,049
tenant configuration change
390
00:15:07,455 --> 00:15:10,254
within Azure Front Door triggered a widespread service
391
00:15:10,254 --> 00:15:10,754
disruption.
392
00:15:11,055 --> 00:15:13,375
There were folks on Reddit who were reading
393
00:15:13,375 --> 00:15:15,634
that, and they were taking that terminology
394
00:15:16,175 --> 00:15:18,355
of a tenant configuration change
395
00:15:18,894 --> 00:15:23,039
to mean that a customer tenant, like you,
396
00:15:23,039 --> 00:15:24,799
maybe you have a Front Door profile and
397
00:15:24,799 --> 00:15:26,100
I have a Front Door profile,
398
00:15:26,559 --> 00:15:28,879
that you would have the ability to push
399
00:15:28,879 --> 00:15:29,539
a configuration
400
00:15:29,840 --> 00:15:32,639
change to your Front Door profile that would
401
00:15:32,639 --> 00:15:34,799
take down the whole system. That cascaded to
402
00:15:34,799 --> 00:15:37,504
everything? Yeah. I coulda told you that from
403
00:15:37,504 --> 00:15:39,825
externally. Right? But I can see where that
404
00:15:39,825 --> 00:15:43,684
language tenant is used so broadly. So broadly.
405
00:15:44,785 --> 00:15:46,785
Yeah. Familiar with it could take that. So
406
00:15:46,785 --> 00:15:48,384
I just wanted to maybe provide a little
407
00:15:48,384 --> 00:15:50,544
bit of clarification there. So when we say
408
00:15:50,544 --> 00:15:51,830
tenant in
409
00:15:52,610 --> 00:15:53,670
this respect,
410
00:15:54,290 --> 00:15:57,190
really what we're saying is service tenant
411
00:15:57,490 --> 00:16:00,610
or maybe tenant that the service itself is
412
00:16:00,610 --> 00:16:02,850
hosted on. So maybe another word for
413
00:16:02,850 --> 00:16:05,144
tenant here would be scale unit. Like, what
414
00:16:05,144 --> 00:16:07,485
are the scale units that host Front Door
415
00:16:07,544 --> 00:16:08,044
versus
416
00:16:08,504 --> 00:16:09,644
what are the actual
417
00:16:10,184 --> 00:16:11,945
customer tenants and things that are out there?
418
00:16:11,945 --> 00:16:14,365
And I think the confusion for this one
419
00:16:14,584 --> 00:16:17,464
was maybe a little bit further borne out
420
00:16:17,464 --> 00:16:20,360
of the fact that the Front Door team
421
00:16:20,819 --> 00:16:22,120
has currently blocked
422
00:16:22,500 --> 00:16:25,940
all Front Door configuration changes. Oh, interesting. If
423
00:16:25,940 --> 00:16:27,459
you have a Front Door profile and I
424
00:16:27,459 --> 00:16:28,679
have a Front Door profile,
425
00:16:29,059 --> 00:16:31,779
we are blocked from making changes to those
426
00:16:31,779 --> 00:16:34,105
profiles right now. And I think this kinda
427
00:16:34,105 --> 00:16:35,644
perpetuates that thinking
428
00:16:36,105 --> 00:16:38,345
that, oh, you and I are blocked from
429
00:16:38,345 --> 00:16:40,825
making changes, and that's because I could make
430
00:16:40,825 --> 00:16:43,485
a change that's gonna impact you. And
431
00:16:43,865 --> 00:16:45,945
I don't think that's the case with this
432
00:16:45,945 --> 00:16:47,644
one. I think this is more like
433
00:16:47,970 --> 00:16:50,389
scale units, internal service things,
434
00:16:50,929 --> 00:16:53,490
all of that again. So there was a
435
00:16:53,490 --> 00:16:55,350
configuration change internally.
436
00:16:56,129 --> 00:16:58,629
That configuration change introduced
437
00:16:59,009 --> 00:17:02,414
an invalid state, very similar to those race
438
00:17:02,414 --> 00:17:04,755
conditions that we were talking about with Dynamo
439
00:17:04,815 --> 00:17:06,115
on the other side.
440
00:17:06,494 --> 00:17:07,474
That inconsistent
441
00:17:07,855 --> 00:17:08,355
state
442
00:17:08,894 --> 00:17:12,815
caused a whole bunch of AFD tenants or
443
00:17:12,815 --> 00:17:16,174
AFD nodes, AFD scale units, whatever we wanna
444
00:17:16,174 --> 00:17:17,660
call them, to crash,
445
00:17:17,960 --> 00:17:20,360
and on that crash, to subsequently not be
446
00:17:20,360 --> 00:17:22,059
able to load properly.
447
00:17:22,519 --> 00:17:24,440
So Azure Front Door is kind of a
448
00:17:24,440 --> 00:17:25,660
global load balancer
449
00:17:25,960 --> 00:17:28,440
and a DNS load balancer. All of a
450
00:17:28,440 --> 00:17:30,565
sudden, you started seeing all this weird stuff,
451
00:17:30,565 --> 00:17:31,465
increased latencies,
452
00:17:31,845 --> 00:17:33,545
timeouts, connection errors
453
00:17:33,845 --> 00:17:34,345
for
454
00:17:34,725 --> 00:17:37,605
every sort of downstream service that exists out
455
00:17:37,605 --> 00:17:40,725
there. So, like, in storage land, you ever
456
00:17:40,725 --> 00:17:43,619
provisioned a ZRS storage account? A ZRS storage
457
00:17:43,619 --> 00:17:46,200
account, your DNS endpoint, your public endpoint
458
00:17:46,820 --> 00:17:49,539
is a DNS CNAME that is part of
459
00:17:49,539 --> 00:17:50,680
a Front Door profile
460
00:17:51,140 --> 00:17:53,460
and points to a Front Door profile. So,
461
00:17:53,779 --> 00:17:55,140
not good. Right? Like, all of a sudden
462
00:17:55,140 --> 00:17:56,279
your ZRS
463
00:17:56,660 --> 00:17:58,855
zone-resilient thing, like, could be
464
00:17:58,855 --> 00:18:00,774
having some trouble due to lack of DNS
465
00:18:00,774 --> 00:18:01,274
resolution.
466
00:18:01,815 --> 00:18:04,534
The other one that happens in Azure land
467
00:18:04,534 --> 00:18:05,034
is
468
00:18:05,494 --> 00:18:08,474
so much of the tooling talks to API
469
00:18:08,534 --> 00:18:11,654
endpoints that are available via Front Door or
470
00:18:11,654 --> 00:18:14,119
fronted via Front Door. So you think about,
471
00:18:14,119 --> 00:18:16,440
like, management.azure.com,
472
00:18:16,440 --> 00:18:19,799
which is the RESTful API surface for all
473
00:18:19,799 --> 00:18:22,519
of Azure Resource Manager. That's behind Front Door.
474
00:18:22,519 --> 00:18:24,440
Lots of folks notice it when the portal
475
00:18:24,440 --> 00:18:26,835
goes down because you just go to, say,
476
00:18:26,835 --> 00:18:29,554
you're a public Azure customer, it doesn't
477
00:18:29,554 --> 00:18:32,034
matter if you're in the United Kingdom or
478
00:18:32,034 --> 00:18:34,034
the United States. We all just go to
479
00:18:34,034 --> 00:18:35,815
portal.azure.com,
480
00:18:35,875 --> 00:18:37,734
and we get redirected
481
00:18:38,115 --> 00:18:40,054
to the closest portal instance
482
00:18:40,509 --> 00:18:43,390
via DNS load balancing via Traffic Manager. So
483
00:18:43,390 --> 00:18:45,630
there's actually, like, regional endpoints for the portal,
484
00:18:45,630 --> 00:18:47,390
but they're all masked out because they're part
485
00:18:47,390 --> 00:18:50,190
of this resolution chain on the DNS side
486
00:18:50,190 --> 00:18:52,049
that can go a little sideways
487
00:18:52,750 --> 00:18:54,130
in the case of
488
00:18:54,644 --> 00:18:57,045
Front Door clearing out and getting to where
489
00:18:57,045 --> 00:18:58,644
it needs to be. So,
490
00:18:58,644 --> 00:19:01,065
yeah, definitely not a good look for either
491
00:19:01,845 --> 00:19:04,404
Azure or AWS on this one. I'm very
492
00:19:04,404 --> 00:19:07,285
mindful of, like, the customer pain that's felt
493
00:19:07,285 --> 00:19:09,205
on these and the friction that comes with
494
00:19:09,205 --> 00:19:10,750
it. I think the consolation
495
00:19:11,450 --> 00:19:11,950
is,
496
00:19:12,409 --> 00:19:13,549
one, as
497
00:19:14,329 --> 00:19:15,230
folks who
498
00:19:15,769 --> 00:19:18,009
curate and look after these environments that are
499
00:19:18,009 --> 00:19:20,190
hosted in Azure and AWS,
500
00:19:20,809 --> 00:19:22,329
as much as we own the message to
501
00:19:22,329 --> 00:19:24,169
our users that, yeah, it's broken and it's
502
00:19:24,169 --> 00:19:26,105
down, at least we don't have to own
503
00:19:26,105 --> 00:19:28,264
the fix for it, which, double-edged sword.
504
00:19:28,264 --> 00:19:29,464
I don't think many of us could
505
00:19:29,464 --> 00:19:31,625
fix it faster than the folks who built
506
00:19:31,625 --> 00:19:32,284
these things
507
00:19:32,585 --> 00:19:35,065
could anyway all along the way. But it
508
00:19:35,065 --> 00:19:36,984
does give us some stuff to go
509
00:19:36,984 --> 00:19:38,524
out and think about
510
00:19:38,960 --> 00:19:40,320
and see if we can do a little
511
00:19:40,320 --> 00:19:42,480
bit differently next time. Yeah. And while they
512
00:19:42,480 --> 00:19:43,700
do go down,
513
00:19:44,160 --> 00:19:45,539
I would say lately,
514
00:19:46,160 --> 00:19:48,339
and this was kinda the case of AWS
515
00:19:48,400 --> 00:19:50,799
and Azure, I would say, is I feel
516
00:19:50,799 --> 00:19:51,299
like
517
00:19:51,759 --> 00:19:53,460
response times and fix times
518
00:19:54,015 --> 00:19:57,154
for Azure and AWS have gotten quicker.
519
00:19:57,295 --> 00:19:58,035
Like, the
520
00:19:58,414 --> 00:20:00,414
time from when they first go down to
521
00:20:00,414 --> 00:20:01,634
when they come back online
522
00:20:02,095 --> 00:20:04,515
used to and I guess I think to
523
00:20:04,575 --> 00:20:06,815
several years ago where you'd see outages that
524
00:20:06,815 --> 00:20:08,894
would be, like, day long outages, whether it
525
00:20:08,894 --> 00:20:10,099
was eight, ten,
526
00:20:10,400 --> 00:20:12,799
twelve, twenty four hours. There have been outages
527
00:20:12,799 --> 00:20:13,859
in Azure, AWS,
528
00:20:14,559 --> 00:20:16,960
Microsoft 365, all of those. I
529
00:20:16,960 --> 00:20:19,599
feel like the recovery time when everything like,
530
00:20:19,599 --> 00:20:22,694
to catch the issue starting to happen to
531
00:20:22,694 --> 00:20:24,694
where it's starting to resolve. Maybe it's not
532
00:20:24,694 --> 00:20:25,674
completely resolved,
533
00:20:26,134 --> 00:20:28,774
but you're not hard down for, like, eight,
534
00:20:28,774 --> 00:20:31,255
ten hours. Companies have gotten better at that,
535
00:20:31,255 --> 00:20:34,214
catching it, mitigating it, and getting things back
536
00:20:34,214 --> 00:20:36,559
up quickly or at least starting to get
537
00:20:36,559 --> 00:20:38,320
them back up quickly. That seems to have
538
00:20:38,320 --> 00:20:40,160
gotten a lot better, I would say,
539
00:20:40,160 --> 00:20:41,680
in the last few years. It goes both
540
00:20:41,680 --> 00:20:43,840
ways. When the entire Internet is down, it
541
00:20:43,840 --> 00:20:44,820
feels like forever.
542
00:20:45,279 --> 00:20:47,200
And it's not just when the entire Internet's
543
00:20:47,200 --> 00:20:48,500
down. I think there's
544
00:20:48,994 --> 00:20:51,875
economic loss that's associated with these things. So
545
00:20:51,875 --> 00:20:53,955
I saw some estimates talking about, like, the
546
00:20:53,955 --> 00:20:56,914
AWS outage even for the, quote, unquote, brief
547
00:20:56,914 --> 00:20:58,835
period of time that it was being as
548
00:20:58,835 --> 00:21:01,315
high as, like, $500 to $600 million
549
00:21:01,315 --> 00:21:03,029
in lost revenue. Yeah. I saw some of
550
00:21:03,029 --> 00:21:05,269
those numbers too. For the companies that
551
00:21:05,269 --> 00:21:07,610
are hosted on top of it. I think
552
00:21:07,830 --> 00:21:08,970
like any
553
00:21:09,350 --> 00:21:11,430
dark cloud, like, you gotta look for the
554
00:21:11,430 --> 00:21:14,070
silver linings. It can't always be glass half
555
00:21:14,070 --> 00:21:16,970
empty kind of thing. So I will say
556
00:21:17,134 --> 00:21:20,174
a couple of maybe, like, positive things that
557
00:21:20,174 --> 00:21:22,335
happen in both of these outages, both the
558
00:21:22,335 --> 00:21:25,134
AWS one and the Azure one. I'm seeing
559
00:21:25,134 --> 00:21:28,414
that communication's getting better. So while folks are
560
00:21:28,414 --> 00:21:30,835
still complaining that, like, oh, the status pages
561
00:21:30,894 --> 00:21:33,500
aren't updating, things like that, I do think
562
00:21:33,500 --> 00:21:35,440
the kind of proactive communication,
563
00:21:35,740 --> 00:21:37,679
like, we're finding a better balance between
564
00:21:38,059 --> 00:21:40,139
how many engineers do we put on fixing
565
00:21:40,139 --> 00:21:42,619
the problem, which I generally, I would say
566
00:21:42,619 --> 00:21:44,700
let's index towards putting everybody on it. But
567
00:21:44,700 --> 00:21:46,139
if we put everybody on it, that's at
568
00:21:46,139 --> 00:21:47,899
the expense of being able to communicate to
569
00:21:47,899 --> 00:21:49,525
customers, because we might even be taking the
570
00:21:49,525 --> 00:21:51,525
person who can take that message and
571
00:21:51,525 --> 00:21:52,404
figure out how to get it to where
572
00:21:52,404 --> 00:21:53,924
you need to be. So I think the
573
00:21:53,924 --> 00:21:56,884
transparent communication is getting way better. I've been really
574
00:21:56,884 --> 00:21:58,585
impressed by the
575
00:21:59,205 --> 00:22:00,265
post incident
576
00:22:01,009 --> 00:22:03,269
reviews that have come out from both Amazon
577
00:22:03,569 --> 00:22:05,190
and Azure recently.
578
00:22:05,809 --> 00:22:07,730
They're kinda going above and beyond in the
579
00:22:07,730 --> 00:22:10,309
things that they talk about and expose. Like,
580
00:22:10,369 --> 00:22:12,369
you as a regular customer, me as a regular
581
00:22:12,369 --> 00:22:14,625
customer, we should never need to know the
582
00:22:14,625 --> 00:22:15,845
names of
583
00:22:16,304 --> 00:22:16,964
the internal
584
00:22:17,424 --> 00:22:17,924
microservices
585
00:22:18,544 --> 00:22:19,924
that are part of DynamoDB.
586
00:22:20,625 --> 00:22:22,625
And, like, we should never need to know
587
00:22:22,625 --> 00:22:23,125
about
588
00:22:23,585 --> 00:22:27,424
these things like AWS's internal planner and enactor
589
00:22:27,424 --> 00:22:27,924
workers.
590
00:22:28,230 --> 00:22:30,950
Like, alright. Great. Like, let's not worry about
591
00:22:30,950 --> 00:22:32,390
that kind of thing. So I think you
592
00:22:32,390 --> 00:22:34,250
are seeing, like, a level of transparency
593
00:22:34,630 --> 00:22:35,929
from the hyperscalers
594
00:22:36,630 --> 00:22:37,690
that run these things
595
00:22:38,150 --> 00:22:40,329
and good transparent communication
596
00:22:40,630 --> 00:22:42,170
that's happening during the outages.
597
00:22:42,575 --> 00:22:44,095
The other thing I'll call out, like, these
598
00:22:44,095 --> 00:22:46,515
took some time in both cases to fix,
599
00:22:46,734 --> 00:22:48,835
but all those rollback procedures
600
00:22:49,454 --> 00:22:51,634
and stopping the bleeding and all that stuff,
601
00:22:51,694 --> 00:22:53,855
it worked. We're not sitting here a week later
602
00:22:53,855 --> 00:22:55,855
and people are still banging their heads against
603
00:22:55,855 --> 00:22:57,295
the wall going, we don't know what the
604
00:22:57,295 --> 00:22:58,890
problem is. We don't know how to fix
605
00:22:58,890 --> 00:23:01,369
it. We don't know what changed. We don't
606
00:23:01,369 --> 00:23:03,549
know what happened. That's not the case here.
607
00:23:03,610 --> 00:23:05,690
Like, these things happened. They were point in
608
00:23:05,690 --> 00:23:08,410
time, tons of friction, tons of pain, horrible,
609
00:23:08,410 --> 00:23:11,049
yes. But they got fixed. They got fixed
610
00:23:11,049 --> 00:23:12,830
by somebody else, and they were fixed successfully.
611
00:23:13,315 --> 00:23:14,994
And then for whatever these failure modes are,
612
00:23:14,994 --> 00:23:17,154
like I said, you can be pretty confident
613
00:23:17,154 --> 00:23:18,595
that they're not gonna happen in the future.
614
00:23:18,595 --> 00:23:20,595
Are other things gonna happen? Yes. They haven't
615
00:23:20,595 --> 00:23:22,835
been discovered yet. But as they are, it
616
00:23:22,835 --> 00:23:25,154
all leads to more resiliency and it lends
617
00:23:25,154 --> 00:23:26,375
itself to more resiliency
618
00:23:27,289 --> 00:23:28,509
for these services.
619
00:23:29,130 --> 00:23:32,509
In some cases, I think there were mitigations
620
00:23:32,730 --> 00:23:34,490
put in place in a timely manner. So
621
00:23:34,490 --> 00:23:36,029
in the case of the Front Door outage,
622
00:23:36,329 --> 00:23:37,390
I saw that
623
00:23:37,769 --> 00:23:39,230
they actually pulled
624
00:23:39,529 --> 00:23:40,190
the portal
625
00:23:40,575 --> 00:23:42,815
out from behind Front Door. Like, they went
626
00:23:42,815 --> 00:23:45,454
and manipulated some DNS records to be able
627
00:23:45,454 --> 00:23:47,454
to give customers relief so that they could
628
00:23:47,454 --> 00:23:49,694
reach the portal without having to go through
629
00:23:49,694 --> 00:23:50,194
AFD
630
00:23:50,494 --> 00:23:51,794
and the load balancing
631
00:23:52,174 --> 00:23:52,674
mechanics
632
00:23:53,410 --> 00:23:55,170
that it brings along the way. I think
633
00:23:55,170 --> 00:23:57,490
the tooling's getting better. You're getting the ability
634
00:23:57,490 --> 00:24:00,150
in the tooling to target specific API surfaces,
635
00:24:00,369 --> 00:24:03,109
have other workarounds there, so that's all good.
636
00:24:03,569 --> 00:24:06,345
And, yeah, in general, like, sucks that it
637
00:24:06,345 --> 00:24:08,345
happened, but I'm actually, like, really happy with
638
00:24:08,345 --> 00:24:10,265
the responses here and the way they came
639
00:24:10,265 --> 00:24:12,664
out. Stuff could always go quicker. But that
640
00:24:12,664 --> 00:24:14,904
said, I think for what happened and the
641
00:24:14,904 --> 00:24:16,605
scale of both of these outages,
642
00:24:16,904 --> 00:24:20,684
stuff actually happened in a very timely way.
643
00:24:20,700 --> 00:24:21,440
And, ultimately,
644
00:24:22,380 --> 00:24:23,819
not much that I would have wanted to
645
00:24:23,819 --> 00:24:25,659
do as a customer anyway. Like, if I
646
00:24:25,659 --> 00:24:27,819
was already a multi cloud customer and I'm
647
00:24:27,819 --> 00:24:28,319
hosting
648
00:24:29,419 --> 00:24:30,639
in AWS and Azure,
649
00:24:30,940 --> 00:24:32,299
it's not like I'm gonna go out and
650
00:24:32,299 --> 00:24:34,220
bang on the door and say, well, let's
651
00:24:34,220 --> 00:24:36,154
go put ourselves into Oracle or Google and
652
00:24:36,154 --> 00:24:38,315
get yet another cloud here. Like, that's not
653
00:24:38,315 --> 00:24:39,615
necessarily the answer
654
00:24:40,075 --> 00:24:42,474
or the thing that's going to save you.
655
00:24:42,474 --> 00:24:45,295
You're only as resilient as your least resilient
656
00:24:45,355 --> 00:24:47,355
service kind of thing still at the end
657
00:24:47,355 --> 00:24:48,015
of the day.
658
00:24:48,369 --> 00:24:49,409
I think there is a little bit of
659
00:24:49,409 --> 00:24:51,890
an opportunity for customers to go through. Maybe
660
00:24:51,890 --> 00:24:53,809
you do wanna audit your dependencies a little
661
00:24:53,809 --> 00:24:55,250
bit, like, hey. Do I have to take
662
00:24:55,250 --> 00:24:56,609
a dependency on this thing? Or if I
663
00:24:56,609 --> 00:24:59,429
do, is there an alternative or a fallback
664
00:25:00,130 --> 00:25:02,849
service for me? Along the way, review your
665
00:25:02,849 --> 00:25:05,575
DR plans. So while you're not responsible, like
666
00:25:05,575 --> 00:25:07,734
I said, for fixing the servers and and
667
00:25:07,734 --> 00:25:10,295
the underlying microservices that power these things, I
668
00:25:10,295 --> 00:25:11,815
think you still wanna have good ways to
669
00:25:11,815 --> 00:25:14,315
communicate to your users about what's going on.
670
00:25:14,535 --> 00:25:17,095
So if you're a company that works with
671
00:25:17,095 --> 00:25:17,595
Azure
672
00:25:18,809 --> 00:25:21,130
and you have admins who are maybe more
673
00:25:21,130 --> 00:25:22,970
click ops and they're dependent on the Azure
674
00:25:22,970 --> 00:25:24,730
portal, you wanna make sure that you have,
675
00:25:24,730 --> 00:25:25,789
like, good documentation
676
00:25:26,410 --> 00:25:28,490
for your employees about what happens when the
677
00:25:28,490 --> 00:25:29,710
Azure portal is unavailable,
678
00:25:30,184 --> 00:25:31,944
what happens when the Microsoft 365
679
00:25:31,944 --> 00:25:33,865
portal is unavailable? What happens when this
680
00:25:33,865 --> 00:25:35,785
service is unavailable? Just so they know what
681
00:25:35,785 --> 00:25:38,505
to do, and they've got that kinda measured
682
00:25:38,505 --> 00:25:41,244
comfort food. You also need to think about
683
00:25:41,545 --> 00:25:42,845
kinda documenting
684
00:25:43,464 --> 00:25:44,924
recovery plans and expectations
685
00:25:45,640 --> 00:25:46,619
in terms of timing.
686
00:25:47,000 --> 00:25:49,000
So what happens if my cloud provider is
687
00:25:49,000 --> 00:25:51,319
down for ten seconds? What happens if my
688
00:25:51,319 --> 00:25:53,960
cloud provider is down for ten hours? Those
689
00:25:53,960 --> 00:25:55,179
are very different scenarios.
690
00:25:55,480 --> 00:25:57,000
And the way we react, the way we
691
00:25:57,000 --> 00:25:58,619
communicate with our user bases,
692
00:25:59,005 --> 00:26:00,625
all those things are
693
00:26:01,244 --> 00:26:02,304
going to be impacted.
694
00:26:02,765 --> 00:26:04,365
I think you also do have to think,
695
00:26:04,365 --> 00:26:06,304
like, I mentioned status pages.
696
00:26:06,605 --> 00:26:10,065
Both AWS and Azure, like, the status pages
697
00:26:10,125 --> 00:26:12,224
are not the greatest things at getting updated.
698
00:26:12,284 --> 00:26:14,365
So, like, are there alternative systems that you
699
00:26:14,365 --> 00:26:16,819
wanna look at? I see still lots of
700
00:26:16,819 --> 00:26:19,380
customers using things like Downdetector and things
701
00:26:19,380 --> 00:26:22,179
like that to see when these things are
702
00:26:22,179 --> 00:26:24,899
occurring or if they have broader impact within
703
00:26:24,899 --> 00:26:26,359
geo, outside of geo,
704
00:26:26,914 --> 00:26:28,914
things like that. I think those are all
705
00:26:28,914 --> 00:26:31,394
good to stand up. And then the last
706
00:26:31,394 --> 00:26:33,954
thing I would think about is as you're
707
00:26:33,954 --> 00:26:35,554
going through and you're figuring out maybe some
708
00:26:35,554 --> 00:26:36,694
of these things around
709
00:26:37,154 --> 00:26:39,839
recovery plans, things like that, is making sure
710
00:26:39,839 --> 00:26:41,920
that you're not only setting the expectations with
711
00:26:41,920 --> 00:26:44,400
users, but also setting the expectations with your
712
00:26:44,400 --> 00:26:44,900
leadership.
713
00:26:45,359 --> 00:26:47,039
So, like, if you work for a company
714
00:26:47,039 --> 00:26:49,940
that's single cloud, multi cloud, does your leadership
715
00:26:50,000 --> 00:26:51,140
have the right expectations
716
00:26:51,680 --> 00:26:52,180
around
717
00:26:52,720 --> 00:26:55,575
your company's dependency on the cloud? Has that
718
00:26:55,575 --> 00:26:57,815
been communicated in the right way? Does your
719
00:26:57,815 --> 00:26:58,875
leadership understand
720
00:26:59,494 --> 00:27:00,875
what they've bought into?
721
00:27:01,414 --> 00:27:03,335
Because there's the dream of the cloud, Oh,
722
00:27:03,335 --> 00:27:05,494
it's somebody else's cloud, it's somebody else's problem,
723
00:27:05,494 --> 00:27:08,154
it's 100% available. And then there's the reality
724
00:27:08,214 --> 00:27:10,809
of the cloud, which we know so far,
725
00:27:10,809 --> 00:27:13,630
no system out there is truly 100%.
726
00:27:13,690 --> 00:27:15,690
So making sure that those things are ready
727
00:27:15,690 --> 00:27:17,929
to go so that your LT can weigh
728
00:27:17,929 --> 00:27:19,929
out all those options they need to, like
729
00:27:19,929 --> 00:27:20,829
multi cloud
730
00:27:21,289 --> 00:27:22,429
strategy options,
731
00:27:22,934 --> 00:27:26,075
ultimately understanding that whole, like, risk reward scenario
732
00:27:26,934 --> 00:27:29,434
or maybe risk versus cost
733
00:27:29,734 --> 00:27:31,514
for things like additional resiliency
734
00:27:31,974 --> 00:27:32,634
and redundancy
735
00:27:33,494 --> 00:27:35,494
and where that all falls out for you.
736
00:27:35,494 --> 00:27:37,619
Sounds good. Well, with that, Scott, I actually
737
00:27:37,619 --> 00:27:39,640
have family waiting for me to go do
738
00:27:39,779 --> 00:27:42,100
Halloween-y stuff. So Halloween-y stuff. You
739
00:27:42,100 --> 00:27:43,700
can It is the day for it. At
740
00:27:43,700 --> 00:27:45,720
least the weather is nice here in Jacksonville.
741
00:27:46,100 --> 00:27:48,279
Nice and cool out there. It's a balmy
742
00:27:48,434 --> 00:27:49,955
68. Yep. I think this is the first
743
00:27:49,955 --> 00:27:52,515
year it's under, like, 80 degrees Fahrenheit for
744
00:27:52,515 --> 00:27:54,355
Halloween in a while. It's been a while
745
00:27:54,355 --> 00:27:57,075
since it's been this cool. So yes. Well,
746
00:27:57,075 --> 00:27:59,174
thanks for that. Hopefully, no more DNS
747
00:27:59,634 --> 00:28:01,259
cloud outages here for a while. Hopefully
748
00:28:02,779 --> 00:28:04,940
Yes. It's something that nobody wants to happen.
749
00:28:04,940 --> 00:28:07,099
Nope. So go enjoy your weekend. Enjoy the
750
00:28:07,099 --> 00:28:09,740
rest of your Friday, and we'll be back
751
00:28:09,740 --> 00:28:12,619
again in a couple of weeks. Alright. Sounds
752
00:28:12,619 --> 00:28:14,295
good. Thanks, Ben. Alright. Thanks,
753
00:28:16,295 --> 00:28:18,775
Scott. If you enjoyed the podcast, go leave
754
00:28:18,775 --> 00:28:20,934
us a five star rating in iTunes. It
755
00:28:20,934 --> 00:28:22,615
helps to get the word out so more
756
00:28:22,615 --> 00:28:24,934
IT pros can learn about Office 365
757
00:28:24,934 --> 00:28:25,755
and Azure.
758
00:28:26,295 --> 00:28:27,894
If you have any questions you want us
759
00:28:27,894 --> 00:28:30,160
to address on the show or feedback about
760
00:28:30,160 --> 00:28:32,480
the show, feel free to reach out via
761
00:28:32,480 --> 00:28:34,660
our website, Twitter, or Facebook.
762
00:28:34,960 --> 00:28:36,880
Thanks again for listening, and have a great
763
00:28:36,880 --> 00:28:37,380
day.