>> Okay. Let's get started.
So my name is Jason Hand.
I'm one of the Senior Cloud Advocates at Microsoft.
I've been with the team now for about six months or so.
Previous to that, I was working for a company called VictorOps.
They were On Call Incident Management.
I've been in the SRE space for a little while,
monitoring, operations, those types of things.
Really excited to show you some of the new things that
we have obviously through Microsoft and Azure,
but we're going to kind of keep a little
high level here at the beginning.
How many of you were in the SRE 20 talk that
I just gave? Like, wow, okay.
So, welcome back,
and you-all can zone out for like the next five, 10 minutes,
because I need to give everyone an overview of the SLIs and
the SLOs that we talked about in the previous talk, okay?
So, forgive me for repeating a little bit of content but we
have to assume that most people haven't seen all of the talks.
That being said, I believe
that the reason why most of you are here is,
you want to learn more about Azure,
maybe you've moved to Azure,
maybe you're thinking about it,
maybe you're starting to piece some things together,
and one by one move things over.
Anyway, it's a whole new world, right?
It's a new frontier,
we're discovering what the "Cloud" is like.
A big thing that scares people away immediately,
is when things go wrong,
and they will go wrong, we already know this part,
we've been doing IT for way too long to have
some naive idea that failure isn't going to happen.
It's going to happen. But we're used
to dealing with failures in our own systems,
maybe On-premise or in a Data Center,
we can physically go to them and kind of like shake them and say,
what's wrong with you and try to fix it or replace it or whatever.
The Cloud brings a whole new element where it's kind of scary.
There's a lot of abstraction,
there's things that are removed from us that
at the end of the day makes our lives easier
but there still is discomfort.
We can't actually go in and dig
into different aspects of our system to see what's going on, okay.
So there will be
quite a lot of things that I'll show you today,
and we'll just kind of keep it very high level,
but there's a lot of ways you can go in and
start understanding what your systems are doing.
Hopefully we'll ease some of those concerns,
and it'll be a little less scary.
But you know we're all here to understand how
to understand our systems when something goes wrong, okay.
So this may seem like a pretty common architecture,
the way your systems are sort of laid out.
We get the Clients going through maybe a firewall,
into your web tier,
different queues, messaging aspects,
middle tier, parts of your system.
There's going to be some caching,
of course there's the database,
there's a lot of things that go into our systems, okay.
Undeniably, our systems have become really complex.
They've always been complex,
but now we're adding more abstraction,
and we're making systems talk to other systems,
and their APIs are calling APIs,
and there's Microservices, and there's
Containers, and there's Functions,
and all these new things that are very new to us and
kind of foreign but they all work
kind of together to create a service,
to create value for something our users are trying to accomplish.
So everything's getting much more complex,
and it might seem if you're using Azure that yeah, okay,
I'm familiar with Azure Web Application Firewall,
I've started to play around with Azure Web Service or App Service,
I'm starting to understand more about Storage Queues,
Logic Apps, Functions, our caching for Redis,
and maybe you've started to explore Cosmos.
But needless to say,
even if you're not using Azure,
there's a lot of components that go into our systems,
and things are going to go wrong.
If you can stick around until SRE 50,
Emily Frame is going to be talking about
the entire life cycle of an Incident which is
much more than just
the first three things that most of us think about.
We've detected an issue,
we've got the team involved,
we've got some people looking into it,
and then hopefully they'll solve it.
Those are the first three steps of an incident,
but then there's two more that are even probably just
as important that usually aren't followed through with.
The first one being the analysis,
which a lot of people would call a postmortem.
I don't like that term,
I've moved away from it almost entirely because it
insinuates that we're looking for
just the cause of death or the failure,
when actually we want to know all the good things too.
When something goes wrong,
there's people that get involved
and a lot of those people have
done really good things to try to recover the service.
So we don't just talk about the failures during an incident.
We don't want to look at just the technical components.
There's no root cause in a complex system,
so why even bother trying to
find it when it's not there in the first place.
Don't get me started on language like root causes,
it just doesn't make sense to me.
There's always going to be a lot of things going on in
a system and we want to understand it
from the various aspects and various opinions and
perspectives of not only the system itself, but the people.
So the analysis is a super important phase,
if you can stick around the SRE 50,
I highly recommend that.
Of course, analysis sets us up to, what are we going to do now?
How are we going to make our system more
resilient and more reliable?
What things have we learned from this incident or
the series of incidents that we can do to improve
our system and make us more ready?
The system includes people.
We're building sociotechnical systems.
These aren't technical systems wholly, there are people involved.
We're the ones who make this stuff, right?
We don't have robots making software yet, it's just us.
So we have to be humane about how we
do this stuff and think about how we can be
prepared and ready for when problems happen instead of being
reactionary, running around like
our heads are cut off, not knowing what to do.
We're smart enough that we should be able to
prepare and be ready and make things better for our systems.
So that's the entire life cycle of an Incident,
more of that in SRE 50,
if you can hang out till then.
But our focus today here is really more about what's going on
as we go from the Detection phase into the Response phase.
So the Detection phase is what I talked about in SRE 20,
where we're going to come up with new methods to let
us know what it's really like for the users.
Of course we still pay attention to
the things like CPU, and memory,
and all those things that have historically been part
of the things that tell us if our systems are healthy.
We're going to monitor that,
we're going to keep an eye on that,
but we're going to put some automation in place so that it
can go and just take care of those things
in the middle of the night for us.
We don't have to scramble to figure
out what do we need to do to fix
these problems when it's really just a resource thing.
We can go and grab them and we can even let go of them if we want.
So maybe we have some kind of big surge,
we need to spin up some new infrastructure,
some new containers that have all of our Apps,
but it's only for maybe 24 hours
and then that traffic has died down.
We can do that now,
we can dynamically, automatically go grab resources.
There's no reason for us to
page out people on those types of things.
There's lots of other things that we need people involved.
So we're going to be talking about,
how do we detect when something's wrong?
But the important thing or the thing we really have to discuss is,
how do you know if it's wrong, okay?
How do you know something is wrong and it's worth responding to?
Worth getting out of bed?
Worth flipping open your laptop and logging into the VPN?
How do you know it's really worth doing that?
What's wrong, okay?
How do we define it? How do we define the word, the term, "wrong"?
Well, for those of you that were in
the SRE 20 talk just a little bit ago,
you know that we talked about objectives,
and so, service level objectives.
What are our users,
our customers attempting to do?
That's their objective, that's
the reason they sign up for their service,
or for building things internally for
other parts of our business. Those are still customers.
They are internal customers and they're not
paying for it necessarily with their own money,
but they have objectives.
They have their jobs that they're trying to
complete and tasks that they're trying to do,
and we have to provide the value to that.
So we have to understand those objectives and
make sure we're aiming in the right direction.
We also look at it from their perspective.
It's not just about what we perceive their objectives to be,
it's what they experience.
So are we looking at the system in the right way that's going to
tell us what it's like for them?
If we could sit right next to them, that would be awesome.
If we could put ourselves in their body and
experience through their eyes what they're seeing and doing,
that would be the ultimate thing,
but we can't, okay.
But we still have to find ways to move ourselves as
close to that possibility as we can get.
So we're always looking at it from the customer's perspective.
Then when we start talking about reliability,
we put them into kind of two different terms here.
We've got our indicators, our SLIs,
which is just over a period of time,
where does our service sort of fall in this moment?
We can measure that over 30 days,
30 minutes, five minutes,
30 seconds, whatever you want, it's up to you.
Hopefully I got that point across in SRE 20,
is that, I can't stand up here and
tell you exactly how to do this.
You-all are building systems that are unique to you,
in the business that you're in and the industry that you're in.
So there isn't a silver bullet.
There is no answer, it just depends.
I hate that that's the answer but that really is the answer.
But SLIs we talked about that in the previous talk,
how to determine what those are,
as well as objectives,
but for those of you who weren't in that talk,
which I know that there's still quite a few of them in here,
let me real quick just kind of recap
what we covered in that previous talk.
SLIs are nothing more than just a basic ratio, just a proportion.
All we're going to do is take one value,
divided by a total value.
For example, we're going to look at
the total number of successful HTTP calls,
over the total number of calls at all.
That's going to give us a ratio of something over something,
and then we'll multiply that by 100 and we can get our 99.99999,
whatever percent of availability or reliability.
That's how you get to those numbers.
They're just simple ratios.
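That ratio is simple enough to sketch in a few lines. Here's a minimal Python illustration; the request counts and the function name are made up for the example, not part of any Azure API:

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Compute an SLI as (good / total) * 100,
    e.g. successful HTTP calls over all HTTP calls."""
    if total_events == 0:
        # No traffic in the window; treat the objective as met.
        return 100.0
    return (good_events / total_events) * 100

# Example: 99,994 successful HTTP responses out of 100,000 total requests.
availability = sli_percent(99_994, 100_000)
print(f"{availability:.3f}%")  # 99.994%
```

The only real decision in this sketch is the one from the talk: which events count as "good", and where you collect them from (the server, the load balancer, the application itself).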
But, you also have to make sure that you're
super clear when you start
talking about SLIs and
SLOs, about where you're measuring these things.
Because I can say to you, hey I want to get all,
I want to measure all of the HTTP requests.
But if I'm going to measure those
from my server, that's where I'm going to collect it,
I might not have all of the requests because
maybe some of them didn't even get to
the server in the first place.
Maybe they got directed off somewhere by some other problem.
Maybe they hit the load balancer but
didn't make it past the load balancer.
So maybe I need to move my monitoring to
the load balancer and collect the HTTP requests from there.
That might not be the right answer for you, but it might be.
So that's something you have to determine but either way,
you always have to decide where are we going to get
these numbers that we're going to start
doing these calculations on,
and everybody has to be on the same page
because when someone starts to question,
"Hey, I don't know if I believe this SLO
or why are we measuring it here?"
There should always be
an understanding of why we're measuring it from here.
Here's why we determined that we're going to collect
the number of records processed from the application itself.
Here's our logic. Here's our thinking behind that,
and everyone on the engineering teams
should be on board with that,
should be a conversation that we all agree on.
So that's a quick SLO,
excuse me, SLI, let's talk about an SLO.
Our SLO is essentially our threshold.
We see our SLI dancing around
all the time if we're looking at some visual aid,
but our SLO is sort of our threshold.
What I like to say, it's the line in the sand.
You've crossed the line and it's time for a person to get
involved because something isn't right.
Our indicators have determined that it's dropped below
a threshold that we established as this is our comfort zone,
and now we need to page out to a human.
A person has to get involved,
we've got to respond, we've got
to triage, we've got to see what's going on.
So how do we create that SLO?
How do we determine what number we should be aiming for?
Well, first of all, we have to take the thing that
we're measuring.
In the previous example that I've been using,
it's just HTTP requests.
So that's the first part of the SLO.
Then we have the SLI proportion.
Okay. The number, the 99 point whatever,
or maybe it's 90 in this example.
Maybe it's 85 in the previous example, in the previous talk.
But you decide what your SLI proportion is.
Then there's always going to be a time statement.
So you choose these numbers,
you choose your SLI what thing you're going to measure,
but then you also have to decide over what window of time.
Maybe it's 30 days, maybe it's 30 minutes, maybe it's 30 seconds.
It depends on the service, it depends on
what you need those things to do,
and how quickly you need to measure.
But we always have to have that time element,
that time statement there.
In this example, we're looking at 90 percent of
HTTP requests as reported by the load balancer,
and make sure we're all on the same page about that.
They've succeeded in the last 30-day window.
So that is our SLO.
Everybody understands that
if our SLI gets below 90 percent,
that's when it's time to get some people involved.
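As a sketch, an SLO is exactly those three pieces: the thing measured, the target proportion, and the time statement. The Python below is a hypothetical illustration of that structure (the class, field names, and numbers are mine, not anything from Azure):

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """An SLO: what we measure, the line in the sand, and the time window."""
    indicator: str         # e.g. "HTTP requests, as reported by the load balancer"
    target_percent: float  # the threshold, e.g. 90.0
    window: str            # the time statement, e.g. "last 30 days"

def should_page(slo: SLO, measured_sli_percent: float) -> bool:
    """A human gets involved only when the SLI drops below the objective."""
    return measured_sli_percent < slo.target_percent

slo = SLO("HTTP requests (load balancer)", 90.0, "last 30 days")
print(should_page(slo, 96.2))  # False: inside the comfort zone, no page
print(should_page(slo, 88.7))  # True: below the line, page a person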
So that's a real quick like crash course about SLIs and SLOs.
When it comes to detection of problems,
there's a couple of ways that this often happens.
The first one is that sometimes
people will start complaining on social media
especially for big companies.
Microsoft when we see issues,
you can guarantee people are on Twitter letting us know,
giving us an earful about the problems that they're experiencing.
Now, that's not a great way for us to know that there's an issue.
Hopefully, we've got some monitoring in place,
we've got some SLOs in place that determines,
yeah, we do have an issue,
we've got engineers involved.
It's not some person on social media
who's telling us for the first time
that we know that there's a problem.
The same goes for tickets.
You don't really want your customers to be
the first people to notice you have a problem.
So if you've got your main route
of detecting issues coming through customer support,
you've got some real gaps in your monitoring system.
You need to put in some effort to instrument your applications,
instrument your infrastructure, and set some SLIs,
so that it isn't the customer who finds these issues first.
You don't want phone calls, and e-mails,
and all these different things coming from anyone other than
your engineering team that something isn't right.
So don't do it that way, please.
In fact, most of SRE 20,
the previous talk I was giving, was showing you:
what's the better way?
What's the newer, more modern site reliability method?
What we showed off is just going in and using logic, I'm sorry,
Log Analytics to just establish SLIs,
then establish our SLOs and make sure
we're setting up good actionable alerts.
So please don't, please don't accept number one
as the best possible way for detection.
There are many many options.
Of course, there's Log Analytics as part
of Application Insights and Azure Monitor in general,
and there's a lot of great stuff that
Azure Monitor provides right out of the box,
but it might be that you need to use
some other tools to really create a more observable system.
That was something I mentioned in the previous talk.
We're trying to create a system where I can ask
it any question, any question,
and I have almost absolute faith that the response I get back,
the answer I get back is the truth.
It is the reality of our system.
If I say, how many people are in our system right now?
I should know, and I had absolute confidence in that.
I should be able to ask anything.
What's our CPU performance?
Even if I'm going to like monitor that
for automatic resource grabs,
I should be able to ask and know.
If I'm not able to answer a question,
I don't really have an observable system,
I need to add some different things.
The next part, then:
after you've determined that there's a problem,
we've detected the issue,
we've got to get people involved.
It's time to respond.
When I say respond, soak that in, respond, not react.
I think for many of us,
we go into this reactionary phase because we're just not prepared.
We weren't really expecting something to break.
Well, in the new world,
and it's not just really Cloud world,
I mean that's the way we're designing our systems these days,
there's just a lot of complexity.
So we have to like I mentioned a little bit ago,
we have to accept that failure is part of a system.
It is normal for systems to fail in small ways,
large ways, spectacular ways,
ways that customers notice, ways that they don't.
There's always going to be failure.
So we have to understand that this is part of the job now.
If it's part of the job, we should be prepared.
We shouldn't be reactionary,
we should respond, and there's a huge difference between the two.
Another thing that I talked about in
the previous talk is I've seen a lot
of bad behavior that we've
carried forward from the way we used to do things.
We used to send things to an e-mail endpoint,
or an e-mail inbox, or maybe a mail distribution,
where we were sending things that today
makes more sense to really go into a log.
If it's just a notification to
inform us about something, that's not an alert.
That's not something you send out that
pages the on-call responder.
So make sure that when you're starting to redesign the way
that you respond to incidents and
your monitoring and all that kind of thing,
that you're looking for things that tell
you that the systems are no longer normal.
A heartbeat is not the right approach.
A heartbeat tells you in some regular time series,
it tells you that, yeah, I'm alive,
yeah, I'm alive, yeah I'm alive.
I don't want that, that's noise.
I want you to tell me when you're not alive,
or when there's some other thing that has been breached.
So these are all kind of bad behaviors that I've been guilty of
myself as I've moved forward, but that's my advice to you:
Look for things that you previously were probably alerting and
paging people on and really ask, is this actionable?
Do they need to physically stop what they're
doing and do something about this alert?
Because if not, it's not an alert.
It's something that should go to some other place
for them to consume at their leisure,
but they should not have to context
switch, drop what they're doing,
wake up in the middle of the night,
stop playing with their kids on their birthday,
like you don't want that kind of thing.
We're looking at ways to have a much more humane on-call approach.
Because at the end of the day,
these are socio-technical systems.
They are people plus technology,
and if you're responding or even
reacting or however it feels to you,
if you're doing it all the time,
that's going to burn you out,
and you're going to give up, and you're going to hate life.
You're going to hate your career, you're going to
move on and open a bookstore,
something simple where you're just like, yes, this is it.
But we don't want that,
unless you really want to run a bookstore.
So be mindful of the ways that you do that in terms of alerting.
We also want to make sure that
any alert is really something that
a person has to get involved with.
If it's not something a person has to get involved with,
because we could automate a fix,
then let's automate a fix.
Obviously, it's going to take maybe a couple of
times before you realize, hey, you know what?
The way we fix that is,
we can put that into a script,
we should just do that.
It might take you a few times to realize that,
but let's move to that point rather than always expecting
one person to just run the same restart script every single time.
Do you really have to wake up in the middle of night to do that?
No. So we're looking for ways that humans are
involved if there's not a way to automate it.
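That idea, known mechanical failures get a script while everything else gets a human, can be sketched in a few lines. The failure-mode names and runbook names here are entirely hypothetical:

```python
# Hypothetical mapping of known, mechanical failure modes to automated fixes.
# Anything not in the map still pages a human.
AUTOMATED_RUNBOOKS = {
    "worker-queue-stuck": "restart_worker_service",
    "disk-almost-full": "rotate_and_compress_logs",
}

def handle_alert(failure_mode: str) -> str:
    """Route a detected failure to automation when we've seen it before."""
    if failure_mode in AUTOMATED_RUNBOOKS:
        # We've fixed this the same way every time, so run the script
        # instead of waking someone up.
        return f"auto-remediate: {AUTOMATED_RUNBOOKS[failure_mode]}"
    return "page on-call responder"

print(handle_alert("disk-almost-full"))      # auto-remediate: rotate_and_compress_logs
print(handle_alert("novel-database-error"))  # page on-call responder
```

The map grows over time, which is the point of the talk: each time a responder notices they've run the same restart script again, that failure mode moves out of the "page a person" bucket.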
When you get these alerts, we want to make sure
that they have context, right?
They should tell that first responder where this alert is coming from.
As we build out our systems,
they become much more complex.
There's lots of different areas of things going on,
so where's this alert even coming from?
If we're talking about the Tailwind Traders,
if you've noticed some of the way those applications are designed,
there are some microservices in there.
Some of them are Node.js,
one of them is .NET Core.
Where are these alerts coming from?
What's actually happening?
What expectations, or shall we say SLOs, have been violated?
Last, why is this an issue?
Why should I get out of bed?
Why should I flip open my laptop and do anything?
What's the customer experiencing?
If you don't give me that kind of information,
it's hard for me to dig in and understand,
and really have empathy for what our users are experiencing.
So like I mentioned, we're just creating context.
When I get paged in the middle of the night,
I want to know as much as I can,
as much as possible, about what's actually going on.
Another thing, the last thing on this list, that becomes
very helpful is to actually give them maybe the first couple of steps.
I know for me, I get woken up in the middle of the night,
I'm sleepy-eyed, I'm getting eye boogers out of my face,
and I don't know what's happening.
If you can tell me what's happening
and also the first thing to do,
"Hey, go check this log," or "Run this query."
Or maybe it'll be in the troubleshooting guide.
We're going to talk about the troubleshooting
guide towards the end of the talk.
But give them context.
Tell them as much as you can in that moment,
so they can move on and solve things.
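One way to picture that context is as a structured payload attached to the page. Everything in this sketch is hypothetical, the field names, the service, and the steps, just to show the shape of what a first responder should receive:

```python
# A hypothetical alert payload carrying the context described above:
# where it came from, which SLO was violated, the customer impact,
# and the first couple of steps to take.
alert = {
    "source": "Tailwind Traders / checkout microservice (.NET Core)",
    "violated_slo": "90% of HTTP requests succeed over the last 30 days",
    "customer_impact": "checkout requests failing; users cannot complete purchases",
    "first_steps": [
        "Check the application log in Log Analytics",
        "Run the saved failed-requests query",
        "See the troubleshooting guide for the checkout service",
    ],
}

def render_page(alert: dict) -> str:
    """Format the context so the responder sees it all in one message."""
    steps = "; ".join(alert["first_steps"])
    return (f"[{alert['source']}] SLO violated: {alert['violated_slo']}. "
            f"Impact: {alert['customer_impact']}. First steps: {steps}")

print(render_page(alert))
```

However the payload is actually delivered, the test is the one from the talk: a sleepy responder should be able to read it and know where to look first without any digging.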
What we're trying to do is build operational awareness.
What that means is this, I want to have
as much awareness about the systems as possible.
You're never going to know it all, okay?
But if you can build an observable system that knows as much
as it can in that moment, while you continue to add
more and fill in the blind spots, that's what we're doing.
We're trying to create operational awareness.
My last piece of advice before we move into some examples
here is don't wait until you've had your big outage.
You're going to have a big outage.
I promise. I've had multiple. It's going to happen.
Don't wait until it
happens before you start doing some of these things.
I know we've got full plates,
we got tech debt that goes out the back door,
but we have to start preparing, okay?
Get out of that reactionary phase.
Here's my favorite PowerPoint effect
just for you guys. There's another one.
Okay. So what we're going to do moving
forward here is I'm going to show you
a little bit of all of these.
There's so many things I wanted to try to squeeze in into
the session that I couldn't, obviously.
But we're going to try to tackle
at least these five things to show you, real basic,
how you can use these in
that scenario where you're the person who's been paged.
You've got to come in, hopefully with a little bit of
context, and maybe you don't have any. Where do you go?
What tools do you have available to
start diagnosing and troubleshooting what's happening?
So we'll go through Azure Monitor
in general, where all of this is found,
but we're specifically going to take
a look at service health alerts.
I'll go into App Insights and show you
the application map, which is kind of cool.
Diagnostic logs, so we'll fire up the live stream logs.
We'll actually go in and tail a log,
so we can see what's going on there.
Then there's Log Analytics, our query tool,
which is the thing where I
spent the majority of the time in the previous talk.
Excuse me. So we'll go into log analytics as well.
Now, the first thing I want to show you that we should talk
about is when something goes wrong,
our first instinct in many cases is, "Is it just me?"
Especially when you're in the Cloud, right?
So if you're using Azure,
"Is it me or is it Azure?"
I'm here to tell you, Azure is going to have problems.
You probably remember late last year,
we had a big incident in
our San Antonio Data Center where there
was lightning storms going on,
they had fires, and then rain,
and then the Data Center was heating up,
and we had to turn things off.
It was not a good day for anybody but those things happen.
So as many promises as we can make,
and different ways that we can add redundancy,
and all these kinds of things, make no mistake.
All Cloud providers are going to
have little problems every once in a while.
It might impact your service.
So there's ways that you can design around that,
so it doesn't impact you as much.
But at the end of the day, it's good to ask,
"Hey, is it just me?
Or is it my provider or providers
that are actually indicating some sort of a problem?"
So the first thing I want to show you
is Azure Resource Health and how to set up some health alerts.
So let's head over to the good old portal.
So I've got a nice little dashboard
here just to give me some shortcuts.
There's multiple ways to get to it,
but we're going to go into service health.
I'll just click on this, bring this in here.
From here, I'm going to go down
to "Resource health", and I'll zoom in
so you can see it a little bit better.
So what this does is,
it's a way for me to just real
quickly look at any of the resources and
see, are they healthy, and what can you tell me about them?
So our first option here is to choose
a subscription where all of your different resources exist.
I'm of course on the "Ignite the tour" one.
Now you decide what resource types
you are interested in understanding.
I'm just going to choose "App Service plan" for this one.
You can see it's loading here. We'll give it a few seconds.
But it's going to go out and look at
all of the app service plans that are
currently under the subscription "Ignite the Tour".
You can see this one here,
which is the one from the talk I just gave.
There's some sort of problem with App Services for
the SRE 20 production site that I had running for the previous talk.
So there's definitely something wrong there.
It looks like some of these other ones are
still maybe collecting some information on.
But we can also see we've got green arrows on
lots of other different app services.
If I scroll down far enough,
I'm going to zoom this out, so I don't make
everybody sick with my scrolling.
Let's go down into any of these.
We can see, we've even got historical information of the health.
So I can come in and see on the 21st,
it looked okay, 20th was okay.
But if I go back to the 17th,
there's definitely something a little bit off here.
So it's got history that you can go back and say,
"Hey, we experienced a little bit
of a problem. I wonder if that was even us.
It might have been something
going on with App Services or something."
So that's an easy way for you to go in and see if you can
figure out what's going on with
just the resources in their health.
Now if I back up, let's go back into "Service Health" again,
and we were looking at resource health.
Now, let's take a look at "Health alerts" because yeah,
it's nice to go in and proactively or
manually look at the health of different systems,
but we're all a little bit too busy to sit around
staring at that screen just looking for problems,
so I need to be alerted, okay?
Again, this goes back to our actionable alerting.
Do you only want to be alerted when something needs to be done?
Yes. Usually, that's the case.
We don't want a lot of context switching,
but you have to decide for
yourself if there's a service health issue,
"That might be something I want to be alerted on."
but maybe not, and the reason why maybe
not is it could just be something like scheduled maintenance.
Maybe you already knew about this,
and finally, they've started doing their scheduled maintenance.
Maybe you don't page out to
the first responders on that
because you've already been made aware of that.
Regardless, we are going to want to
create a new service health alert.
So I'm going to come here and click on
the button to create new service health alert.
What it gives me is
the opportunity to again, choose my subscription.
We're in the correct one here,
and then I've got over
148 different services running under that subscription.
Now, I don't want to create
an alert that encompasses all of those.
I'd want to create alerts for
certain things that I actually care about.
So I'm going to uncheck "Select all" and in this case,
I'm still mostly using
app services for the things that I'm building,
so I'm just going to go ahead and turn on all of
the app services, and then likewise,
I might not necessarily need all of
the services and all different resources to be alerted on.
I don't necessarily need to hear about problems in all regions, either.
So maybe I've only got applications running in
West Europe and also Central US or something like that.
So only tell me about
the app services that fall into those two regions.
So you can start to filter
down on what are the things that are important to you,
and then here's where we were talking
about under the "Event type".
If we look at our options there,
we've got service issues,
planned maintenance, and health advisories.
So you probably don't want to necessarily alert
the first on-call responder to things like planned maintenance,
unless you weren't aware of it or you're
not somehow being told about these problems.
Maybe you do. But what I would suggest here
is to set up different alerts for each one of these,
rather than just being alerted for
each time that it could be one of these three.
That's a little bit too abstract.
You don't really know specifically what's going on.
It might create a lot of noise.
So I would just pick one. We're going to go,
in this case, "Service issue".
Once I've picked out
my target and exactly what I want to be alerted on,
I have to decide, "Well,
how am I going to be alerted?
What's the method of delivering this alert?"
This is the exact same thing we've covered in
SRE20 in terms of setting up actionable alerts.
But I'm going to say "SRE30",
we'll just say "Alert", and give it a short name,
pick your subscription, mine is "Ignite the Tour",
and then I could come down here and find one;
I'll just leave it on that one at the moment,
because it'll take too long to populate our dropdown.
But I'm only choosing the resource group,
and then what do you want to happen?
What's the very first thing that you
want to happen when one of these health advisories,
or service maintenance, or whatever it
is you choose, what do you want to happen?
Well in this case, it's a service issue.
So I want somebody to be paged,
I want to alert the first responder.
We talked about it in the previous talk too,
that I don't know about you, but I don't get
woken up in the middle of night to e-mails.
I hardly notice e-mail at all anymore.
It's just too much noise.
There's a lot of it. E-mail is for correspondence.
It's for sending long-tail conversations back and forth.
It is not for urgency,
and we want urgency.
So e-mail is not the best place for these types of things.
I actually prefer to be made aware of
these issues from a phone call because I almost
always will notice my phone ringing,
especially if it's nearby or in my pocket.
But I also like to create a space where we can
all come together and just talk about this one issue,
whatever the problem might be this one incident.
So in SRE50, another thing we're going to
show off is how you can create, on the fly,
a new channel within Teams
that can give you some of
the context about what's actually going on.
So that when I get that phone call,
I'll be like, "Okay, yeah.
I know there's a problem now.
Hit the number to acknowledge it.",
put my phone down, hop into Teams,
and there's already a channel that's been
established just for conversation about this issue.
So you can do all that automatically but for
the purposes of this demo here,
I'm just going to go ahead and put in a fake number.
Please don't call that. It's not even a real number.
So what's going to happen now,
is I've got a service issue on one of
my App Service Plans in either West Europe or Central US,
because those are the only things I really care
about for this alert,
and what's going to happen is it's going to send
me a phone call if something is to go wrong with that.
So that's how you set up your basic service health alerts.
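For reference, roughly the same alert can be scripted with the Azure CLI. This is a sketch, not the exact demo setup; the resource group, alert, and receiver names, and the obviously fake phone number, are all placeholders:

```shell
# Sketch: a Service Health alert scoped to "service issue" events only,
# wired to an action group that phones the on-call responder.
# All names and the phone number below are placeholders.

# 1) An action group with a voice-call receiver (country code, then number).
az monitor action-group create \
  --name oncall-phone \
  --resource-group ignite-rg \
  --action voice oncall 1 5555550100

# 2) An activity-log alert filtered to ServiceHealth incidents.
#    incidentType=Incident keeps planned maintenance and health
#    advisories out of this alert, per the advice above.
az monitor activity-log alert create \
  --name sre30-service-issue \
  --resource-group ignite-rg \
  --condition "category=ServiceHealth and properties.incidentType=Incident" \
  --action-group oncall-phone
```

Repeating the second command with `incidentType=Maintenance` and `incidentType=Informational` gives you the separate, specific alerts recommended here.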
So how are we going to create the perfect Service Health Alert?
People ask that all the time, and I've already told you this:
it's my favorite/least favorite answer, but it depends.
It just depends on the services that you're building
and how important they are.
You have to determine these things for yourself.
But here's some of the guidance that I've
been given and can pass on to you.
The first one, I don't know,
doesn't even feel like guidance,
it feels like common sense:
don't have too many, but also don't have too few.
Now, that seems like
the worst possible advice, but if you think about it,
you don't want just a bunch of noise,
you don't want a bunch of alerts.
You only want the things that are actionable,
but you want to cover as many of your bases,
or as many of your main concerns, as you can.
So try to find that Goldilocks zone, that sweet spot,
for how many service alerts you set up and how.
Another thing to watch for: if you've
got multiple alerts that cover the same problem,
that can get really confusing as well.
So make sure your alerts are set up so
that they're not actually covering
the same thing, because somebody might get
two alerts at the same time or not too far apart,
and in our minds that's going to feel like
two different problems until someone tells us otherwise,
so be mindful of that.
Another thing too is,
we've got all these different environments, right?
We've got our own development environment,
Engineers have their own environment on their machines,
we've got staging, we've got multiple environments.
It's not that we don't
care about problems in the other environments, because we do.
We want those environments to be as
close as possible to what it's like in production.
You'll never get it perfect to production,
but you want to get them as close as you can,
but we don't need to be alerted
necessarily on things that aren't production issues.
Then the next thing is that we want to make sure
again that there are going to be people responding to this,
so let's be nice to ourselves.
Future Jason would love to know this when he's paged,
so how can I give him those things?
So keep that in mind, that there are going to
be people that are getting involved.
Then we've already kind of mentioned this too,
but we want separate alerts for the different things.
Let's not just alert on one,
it could be one of three things because
that's not specific, it's not actionable.
If I get an alert that just says, "Hey,
there's a problem," and we have to play
a guessing game first, wondering which problem it could be,
that's just not the right approach.
So make sure that you've got separate alerts
for all your issues and planned maintenance and advisories.
Okay, so now the next thing I wanted to show
you is the Application Map.
Hop into my Application Map here.
An Application Map is really pretty cool.
Zoom in a little bit. Well, let
me explain a little bit before I zoom in,
I know that's difficult to see.
But what it does is illustrate
the components of your system, or at least the ones it knows about,
and how they all interact with each other.
It can tell you different things
about how the system is performing,
but it also gives you things like latency
between calls into a database,
how many calls there were,
and I can drill into
any aspect of my system and find out more information.
So I don't know about you but red usually means bad for me,
so I'm curious about
what might be happening on this instance here.
It's our inventory system that's talking to a back-end SQL.
If I were to click on this,
a new blade pops open on the right,
that gives me a little bit of detail,
as soon as the Wi-Fi catches up with us,
and then also gives me a few options to dig even deeper.
So right away it's telling me
that there's been some failing requests, in fact,
there's been 14 of them in the last seven days,
if you notice our little filter up here.
They're all just a robot,
a spider trying to crawl through and collect things.
So if we wanted to, if that really felt like a problem,
we could go into "Investigate failures",
but if you notice also, down here we see some other things.
We can see which of the requests
that are being made are actually the ones that are slow,
and it's not all of them.
Okay, I mean, they're not all great here,
but certainly that top one,
at 384.2 milliseconds,
that's a little bit too much latency for me.
I can dig even further,
I can go into,
zoom out just a little bit,
I can go into
deeper screens to understand what's actually going on.
I can see down here it's
this Increment Async post
that's actually got the longest duration for latency,
and then I can dig in even further from there.
So you can see there's like
a cascading tiered approach to understanding things,
we start from the high level Application Map,
and then dig further and further in to see
if we can diagnose what the heck's going on.
Move back over to the Application Map.
One other thing that's kind of cool I like to show off here,
is there's multiple ways of viewing this data too.
If we change this here,
I kind of like this view a little bit.
But this obviously, this is a very simple system,
your systems probably have maps that look like
the Milky Way galaxy like there's stuff going
in all directions and talking to different things,
and that's the nature of our systems, they are very complex.
But this is a great way for you to
go in and see what's talking to what.
Where is there slowness? Where are there problems
starting to surface?
And where's at least a beginning place to dig in?
So that's Application Map.
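Once you're comfortable with the queries later in this talk, a similar drill-down can be done in Kusto. This is a hypothetical sketch against the standard Application Insights `dependencies` table (the `success` column is assumed boolean):

```kusto
// Sketch: latency and failure counts per dependency call over the
// last 7 days, approximating the Application Map drill-down.
dependencies
| where timestamp > ago(7d)
| summarize avgDurationMs = avg(duration),
            calls = count(),
            failures = countif(success == false)
  by name, target
| sort by avgDurationMs desc
```

The top rows are your slowest dependency calls, the same ones the map colors red.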
The next thing I want to show you is
the Diagnostic Logs, so Application Insights,
Azure Monitor in general has the ability to
really collect a lot of information for you,
a lot of Diagnostic Logs,
and there's a couple of ways you can leverage those.
We can obviously go into Log Analytics
and write your own queries and dig into the data there.
But if you wanted to, you can also export that out,
just those logs and crunch on the numbers
and the data in a different tool or however you want.
So we'll go into that Diagnostic Logs,
and then we're going to show you a little bit
of the live debugging
which is to me the more fun part.
That's the way I remember doing a lot of things:
let's just tail the logs,
try to run the application a little bit,
let's see what the logs tell us.
Then Log Analytics will be the last part here.
But if we talk about Diagnostic Logs, how do we turn that on?
I'm not even going to waste time to go back over to the portal
because all it is, if you can see it on the far right,
is just a matter of going into the portal and just turning it on,
and then suddenly you're getting all these Diagnostic Logs.
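For what it's worth, that portal toggle has a CLI equivalent for an App Service. A sketch, with placeholder resource names:

```shell
# Sketch: turn on App Service diagnostic logging from the CLI.
# "ignite-rg" and "tailwind-inventory" are placeholder names.
az webapp log config \
  --resource-group ignite-rg \
  --name tailwind-inventory \
  --application-logging filesystem \
  --web-server-logging filesystem \
  --detailed-error-messages true \
  --failed-request-tracing true
```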
Then, like I said, you can either
look at them through Log Analytics
or what a lot of people like to do if they
haven't quite learned KQL,
the Kusto Query Language or they are
not quite familiar with Log Analytics
to really be able to
write the right queries and ask the right questions,
you can just download these things.
If you're using the Azure CLI,
the az package that you can get for your command line,
you can just run this command here,
"az webapp log download",
give it your resource group name and
your webapp name, and immediately download the logs
from Cloud Shell, or from wherever you're working on these things,
right down to your machine.
Then from there you can do whatever you want with the data.
There's more information on how that works,
how you do all that stuff by going to that link right there.
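Spelled out, the command from the slide looks roughly like this; the resource group and app names are placeholders:

```shell
# Sketch: download the diagnostic logs as a zip for offline analysis.
# "ignite-rg" and "tailwind-inventory" are placeholder names.
az webapp log download \
  --resource-group ignite-rg \
  --name tailwind-inventory \
  --log-file webapp_logs.zip
```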
So I know I'm skimming over that,
I've got so many things to cover for you all.
But I think that's a pretty neat way to get
your logs especially if you'd
like to use a separate tool outside of Microsoft.
Okay, but like I mentioned, diving right into the log stream,
tailing a log as things are happening,
sometimes that's actually the best way
to see if you can figure out what's going on.
So, I think we're going to pop back over now and
show you how to do that within the portal.
So, one thing that I'm going to show you first,
I know probably a lot of you have already seen this,
but this is our application, right?
This is the Tailwind Traders inventory application.
It's how we go in and we see,
we've got Intelligent Soft Pants,
which I'm not sure exactly what those are.
But this is how we go in and say, no we don't have 51 of those,
we've actually got 50.
Okay? Or maybe it's 49,
but we can go in and adjust our inventory.
Well, the application is working just fine right now,
but let's say there is a problem.
First of all, I had to create
the problem so we can properly show how this works.
Now, how many have ever used
this Cloud shell that is part of Azure?
Awesome. This is actually really cool.
The ability to, from within a browser,
have a session that you can go in and do
all these types of things, for me anyway,
is a huge time-saver and it's
the way I prefer working now rather than from my own local CLI.
But let me get into, I've got a couple of
scripts here that I'm going to run to help us,
sort of, illustrate how to look at the logs.
Let me find those for you real quick.
All right, yeah, so here's all my fancy scripts, right?
You guys have a break-database script, right?
Everybody's got one of those, hopefully.
So anyway, what I'm going to do here,
my app was working perfectly fine,
so real quick I'm just going to run this script, magic script.
It's going to go through and cause some problem.
This is chaos engineering live on stage for you.
Okay, script's done.
Now hopefully, sometimes this doesn't always work,
but hopefully the demo gods are shining bright today.
If I come in here and I try to make an adjustment to my inventory,
it's not working, okay?
So, break the database script works just fine.
But, I would like
to know what in the world was actually happening.
Let's say I didn't know that that was the case,
if I go into Application Insights and all the way,
or, I'm sorry, not Application Insights, it's the wrong one.
If I go into my app
and all the way at the very bottom of this menu bar here,
this blade, is this option to do "Log Stream."
This is really cool, because what it does,
is it goes out and creates a container,
basically creates your sort of
environment for your application and
then as things are happening you'll see them pop up in here.
So it's now connecting,
and if I take this curl command I had,
all this is going to do is just send some traffic over to my app.
If everything goes according to plan,
we should see some things happening.
Yeah, I know that's probably very difficult to see,
so let me see if I can make this a little bit easier.
At the top here, all it's doing,
you can see it's creating its Docker container,
and we've got some really neat ASCII art here just
to remind you that we're still living in the '90s.
If you come down just a little bit further here,
I know for a lot of us,
we see this stuff all the time and it makes perfect sense,
but for a lot of us it doesn't.
But if you notice here on this line here,
this is the one that jumps out to me immediately,
but right here there's a "login failed" for user adminSRE30.
Now, the reason I use this in this demo to illustrate is,
when I was creating this
environment and setting all this stuff up,
I was running into problems where the app just wasn't working.
This is the exact tool I used to try to figure out
what the problem was and solve it
so that I could actually give this demo.
What it was is that our application was
having trouble logging in. Something had happened.
We had dropped the credentials in the setup
somewhere and it just couldn't log in and that was all it was.
So this is the exact way that we discovered what was going on.
So we now know that,
it's a very basic example obviously,
but we know just by simply looking at the logs,
in real time that we're experiencing
some sort of login problem into our database.
Well, obviously that's going to be a problem.
So I could come back over here and run my fix-database script,
which you all have one of those too, right?
We'll run that, and
pay no mind to the password that's up on the screen there.
You won't be able to do anything with that in the next 20 minutes.
But what it's doing is just going out and re-establishing
the correct credentials and
then it's going to restart the application.
Oops! Give it a second to do its thing.
Okay, so it's completed itself.
Let me just grab the URL here and
open it up in a new fresh browser.
Let that go out and get its numbers.
As long as our fix script worked here, yeah, there we go,
I should be able to come into
our Intelligent Soft Pants and then add this back up.
So, real quick, real easy,
obviously that was fairly staged,
but you can see that it's not that hard for you to just
grab a quick screen to
see what's going on with your logs, instantly.
It's already there. It's already plugged in.
It's already ready to go. So that's
live troubleshooting using the log streams.
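If you'd rather tail from a terminal than from the portal blade, the CLI has an equivalent; placeholder names again:

```shell
# Sketch: stream the same live log output to your terminal
# instead of the portal's Log Stream blade.
# "ignite-rg" and "tailwind-inventory" are placeholder names.
az webapp log tail \
  --resource-group ignite-rg \
  --name tailwind-inventory
```

Pair it with the curl traffic from the demo and you'll see the same failures scroll by.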
Now, when that's not quite telling you enough,
you've looked at all those different things in the logs
and you've passed through
that and you still don't understand what's going on,
now it's probably time to go a little bit deeper,
and I think currently
Log Analytics is the best place to do that.
Now, of course, you need to become much more
familiar with Kusto, the query language,
but it really is your door,
your window, whatever you want to say, to your system.
You just have to know what to ask, because Application Insights,
if you're using that, is grabbing
all kinds of valuable information for you.
Whether you need it or not it's there,
you just have to know how to ask for it.
And Log Analytics is currently the best way to do that.
So Log Analytics, just a quick
overview of some of the things that it's grabbing for us:
it's got "Event Logs", it's got our "Syslog".
Those both go into their own tables.
It's got agents that are sending things
like "Heartbeats" into its own place.
You can create "Custom Logs."
We've got "Metrics" that are going in there.
Then of course all the things that Application Insights
gets for us out of the box,
it's getting "Requests" and "Traces" and "pageViews."
Tells us where users are coming in and exiting.
Kind of gives us their flow through our system.
So like I mentioned Log Analytics
and Application Insights in general,
are just collecting a lot of useful stuff.
We just have to know how to ask the questions, and we do
that using KQL, the Kusto Query Language.
And Kusto, if you aren't aware,
is just a nod to Jacques Cousteau,
who was always exploring the great blue deep sea.
Which, to us, is Azure, right?
Like that's our deep sea of data.
So the Kusto Query Language is what you use.
So let's just start with the basics.
Within Kusto or within Log Analytics,
you're going to have these tables and we kind of pointed out what
those tables are and what's being sucked in.
Then you can use a variety of different commands.
If you're familiar with SQL,
if you're familiar with writing SQL queries,
you're probably going to do okay jumping into Kusto.
There are some similarities, but it's not quite the same.
You still have the ability to filter and sort,
and specify a time range.
You can create custom fields,
this is what we did in the previous talk.
You can aggregate things together.
I'm not going to go into a whole lot
because we only have about 10 more minutes,
but there's a lot you can do with
Log Analytics that I'm not even aware of,
because it's just such a big area to learn.
But I think this is one of the better links,
if you go to this one here you can get
a little bit more detailed sort of high level of Kusto.
But I don't want to leave you with nothing,
so let me show you just a few of
the basic things that I think will help
you at least be prepared to move on to the next thing.
This is just an example query,
and what it does is just go out
and give me 10 records,
just any 10, out of the requests table.
So that's just requests, take 10.
You can do something a little bit sort
of adjacent to that, I guess you'd say,
where I want 10, but I want to search within those
10 where requests have the value of "GET."
We could also do some sorting,
where the first one we're just
sorting based on the time generated descending.
Whereas the second one it's very similar,
but we're getting the top 10 sorted by the time generated.
We could do some filtering, where maybe we're
looking at the time generated out of
the requests that are more than 30 minutes ago or maybe where,
or in addition to that,
it's also looking for the result code that's a 404.
If you notice the "toint" in there,
what that's doing is changing
that value from text to an integer.
So you could actually do something
with it that way, if you needed to.
You can do some aggregation.
Here we're looking in the performance table,
where the time generated is more than an hour
ago and then we're going to summarize that by the object name.
I'm going to get how many there were,
just roll that up into a total count;
give me a summary of all those. All right.
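Written out, the slide examples look roughly like this. These are sketches: column names differ between the Application Insights schema (`timestamp`) and the Log Analytics schema (`TimeGenerated`), and I'm using the Application Insights style except for the `Perf` table:

```kusto
// Any 10 records out of the requests table.
requests
| take 10

// Take 10, then search within those for the value "GET".
requests
| take 10
| search "GET"

// Sorting: everything newest-first, versus the top 10 by time.
requests
| sort by timestamp desc

requests
| top 10 by timestamp desc

// Filtering by a time range plus a 404 result code.
// toint() converts the text resultCode column into an integer.
requests
| where timestamp > ago(30m) and toint(resultCode) == 404

// Aggregation: the last hour of Perf records, counted per object.
Perf
| where TimeGenerated > ago(1h)
| summarize total = count() by ObjectName
```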
So, I know a lot of you that
were in the other talk you've already seen
some of this, and there's definitely going to be,
that was way more advanced than what I'm going to show you now,
but let me just pop into
the Log Analytics Query Editor so
that those of you who haven't seen it yet,
you know what it looks like.
So I'll go to Application Insights, right up here
at the top: "Analytics".
That's Log Analytics.
Here's our query editor. Here's where you can
make stuff from scratch.
We can come over here to the left
and as soon as this populates our tables,
we'll see all of the data
that's available to us and we can just do
some pointing and clicking and build
out our query if we wanted to.
So we could come into "Traces",
and there's what we'd want, right at the top there too.
But you can come and start just clicking on things to create
your queries or you can type it in or do like me,
my favorite is copy and paste.
It's every developer's way of developing I think.
So here, we're just going to do requests:
give me 10 of them.
If I hit run, we'll just get
10 random rows out of
the "requests" table, and that's all that is.
Okay, so that's very basic.
What other ones can we show
you? I know what I was going to show you.
So you can write your own queries,
that's the way most of us are going to get to at some point.
But another really cool thing,
if you click on "Sample Queries".
We've done a little bit of work
around up-time and reliability and things like that,
building highly distributed complex systems.
We've come up with some queries that we think are helpful for us,
and so we thought we'd just make them available to you.
So if you come down into these
"Get Started With Sample Queries", first of all,
the history section is going to
show you all the queries that
you've run in the past.
Then we've got "Requests' Performance" where these are
canned queries that we already have baked for you.
So I could just click on any of these and it's going to populate
the query editor above and then go ahead and run it for me,
and we can see what that looks like.
So you can keep coming back into all these different ones.
We may not have data on all of these,
but for the most part, many of them.
You just poke around and see what these things are telling you,
but then also come back up into the query and
see if you can make sense of how it got to those numbers.
How did it calculate that?
How did it display that on the screen?
It's all done through just writing simple queries.
So I think for me that's helped a lot to really go in
and spend some time just looking
through some of the things that are already in here.
I don't think any of the browser data is going
to give us anything because we're not collecting that.
But for your systems,
you come in here and you can learn
all kinds of things that you didn't even think to ask,
simply by just clicking on some of
the sample queries that we already have in here.
Now, one thing I would say also that's super helpful is,
once you've established some queries that really give
you a good answer to whatever it is you're trying to find out,
keep those queries handy.
You can save them up in the right-hand corner,
next to where we created alerts and clicked on "Sample Queries";
there's the option to save all those queries.
So it's a good idea keep some of those queries handy.
So you're not always trying to remember, what was the syntax?
How do I get that question answered?
I don't always remember especially
under duress or in the moment of an incident.
I just lose memory on
how to do most of the things I'm supposed to do.
So having those queries handy to
just pick out of a list and run is super helpful.
So good advice there is to keep things handy.
But not just to keep them on the screen there;
also because of the next thing,
which we'll spend the remaining time
talking about: the notion
of workbooks and troubleshooting guides.
What these are wonderful for: at the companies
I worked with previously,
I would have called these runbooks.
All they are is just, here's what you should
do when this type of problem comes up.
It's the stuff we were talking about in terms of the context.
Give me the first couple of things to go do,
help me jog my memory or maybe I'm a junior developer,
I'm new to the team.
One of the best things you can do for
junior developers is put them on-call.
Almost immediately, put them on-call.
If you can't do that then work towards that.
You don't want only senior engineers
to be the people who are on-call, and you never want it to
be a NOC or some tiered offshore support people.
These are your engineers maintaining your system.
They're the ones who know it the most intimately.
When there's an urgent problem,
there's no time to fill out tickets for somebody who doesn't
even really work for you to dig into the problem.
We need the people who built it to dig into the problem.
But you can put juniors on-call and as long as
you've got workbooks and troubleshooting guides
and all the right queries,
man, it's such a great way to on-board your engineers.
But they come in super handy for these troubleshooting guides.
Another thing about the troubleshooting guides is that,
it makes it so that the queries that I
like to use are now available to everybody.
They're also editable by other people too.
So we can collaborate on what is
the most useful thing that the engineer,
maybe a junior engineer would need to know in that moment?
We can all work on creating
the perfect set of workbooks or troubleshooting guides.
Then the last thing I want to mention is that they're
really valuable in that moment.
How many of you are currently on-call for your systems?
Maybe not at the moment,
but that is part of your role is to be on-call?
Okay, so a fair amount of you.
So you know what it's like to get paged,
to be alerted and you're like,
I don't know what to do.
I have no place. I don't even know where to go to start.
These workbooks, these troubleshooting guides
are immensely helpful,
especially in that late night or that
early morning moment when you're losing your memory,
you're frustrated because you just don't even want to go.
These things are good. So enough talking about it,
let's actually go and show you what those things look like.
Get back to my dashboard. I scrolled past it.
Right here, so our troubleshooting guide.
Now, we've got a couple of templates in here to get you started.
We can also do a custom one to just really
drop in all the information that you think is useful.
But for the sake of time and just to show
you what's out of the box ready,
let's take a look at the one that's already in here for
us for the troubleshooting request failures.
Now, if I scroll through here,
I can see I've got something that looks very similar to
what I was looking at over in Log Analytics.
It's just a displayed little chart
here of what's going on for the failure trends.
We could also go a little bit deeper and start to see which
specific files were causing these failures,
especially with this one here that's showing the 404.
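A failure-trend chart like that one can be driven by a query along these lines; a sketch against the standard Application Insights `requests` schema:

```kusto
// Sketch: the kind of query behind a workbook failure-trend chart.
// Buckets failed requests per hour, by result code, over 7 days.
requests
| where timestamp > ago(7d) and success == false
| summarize failedCount = count() by resultCode, bin(timestamp, 1h)
| sort by timestamp asc
```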
Then maybe down here it's like a checklist,
do this, then do this, then panic.
But give people the steps of what they should do in the moment.
What's really cool is I can come in here and if I decided that
this query isn't quite right,
all I have to do is hit "Edit".
Then over here on the right-hand side, when I hit "Edit",
I now have access right into
the query editor and I can fix this query or I can
put it in a different query that I've decided is actually going to
be a lot more helpful in that moment.
So these workbooks are completely
customizable, and they can be delivered
with the alert that says,
"Hey Jason, we have a huge problem.
Here's a workbook. Let us know if you need help."
As somebody who's been on-call,
probably a majority of their career,
I didn't have stuff like this.
I mean, I eventually developed a few things that helped me,
but it was still just files in a repo.
I'd see an alert and maybe remember
a document being helpful, but maybe not;
it was chaotic, it was ad hoc for sure.
But this is a way to add some real rigor
and some real framework around
your on-call scenario and making sure that you
are able to make a difference in
recovering services as quickly as possible.
The more information you can give your responders,
the more contexts you can give them the better.
So I really love workbooks, I think they're great.
You come in here, you can add as much information as you want.
You can add text.
Like I said, you can do these little checklists down here.
If you're familiar and comfortable with Markdown,
this is going to be a very easy exercise and
a fun exercise to come in and start creating some of these.
So I'd highly recommend you go check those out too.
They're one of the best-kept secrets in my opinion.
Okay. We've got one minute exactly left.
So what's going on in the future?
We've wrapped our heads around this,
but what's long-term thinking?
Well, you've heard me say observability a lot.
Observability to me is you've got a system that
is able to tell you the reality of the world as far as it knows.
What is the reality of my system,
both internals with the technical components but also our users?
How are they using it? What's happening?
Where are they coming from? What are they experiencing?
That's an observable system.
That helps us make data-driven decisions.
Helps the business make data-driven decisions
towards better business value.
It helps us determine where we need to
spend money or where we need to hire resources.
It just makes us be able to
answer the right questions at the right time,
and that's what we're after.
So the more we can do to get us
to a more observable system, that's the goal.
There's a lot of things you can be doing out there that
are outside the bounds of what Azure Monitor is doing.
There's a really great tool.
There are several great tools like
OpenCensus that'll help us go out
and really understand the requests as they go through a system.
You can do tracing and really
understand when somebody comes
in and then they go do this and they go do this,
simply by just clicking on one button,
you can explode out exactly what
happened and that gives even more observability.
So that's one area.
Anomaly detection is another one that we
talked about previously in the other talk as I
was always looking for ways to detect
what we didn't know was even a possibility.
The system has known unknowns,
it has unknown unknowns.
There are things that we didn't even think about.
We couldn't have possibly known that this was how things worked.
Anomaly detection is great for that as well.
So with that, I'll say,
there's a couple more sessions in the SRE track.
We still have SRE40 and SRE50,
which are the next continuations of the story.
What do you do next? Ending with
responding to failure and doing post incident reviews.
How do you actually go in and discuss what went wrong,
but also what went right?
What were our engineers thinking and
doing and saying to each other
in the moment as they're trying to respond
and recover from some service disruption?
So I would encourage you to go check out those two talks.
Then lastly, I would love your feedback.
Tell me what I can do to make
this talk even better the next time I give it.
How can I provide as much information as I
can in 60 minutes, but also more useful information?
So whatever feedback you can give to me that
makes me make this talk better,
I would really appreciate it.
With that said, I've appreciated your time. Thank you so much.
Amsterdam is one of my favorite cities in
the world to come to, and I love being here.
So thank you and let's hang out
maybe down at the hub if you have questions.