So continuing on from the last lesson, in this lesson we are going to pick up and talk a little bit more about partitions, and, specifically, talk about what happens when things go wrong.

What we're going to be looking at, specifically, is job recovery from failure. So when Azure Stream Analytics has a failure, what happens and how do we recover? And then we're going to talk about something called the replay catch-up time, and we're going to look at how we determine what that is.

And to quote the famous Robert Burns, "the best-laid schemes o' mice an' men" sometimes go awry. So it's a good thing to make some plans.

So what does this look like in Azure Stream Analytics? Well, we talked about dividing our data into subsets based upon partitions and a partition key in the last lesson. And so what we have is a job that is started in Azure Stream Analytics. Now, as part of that parallel processing, Azure Stream Analytics is going to break up your work across nodes automatically, and it's going to start working on all of your query in parallel.

Now, sometimes a node can fail. Basically, something bad happens, a failure occurs in a node, so what you need to understand is what Azure Stream Analytics does next. Well, it is going to have an automatic recovery for you, which is a great thing. It's going to restore from your last available checkpoint, and it does this through something known as stateful query logic. Basically what happens is, when you do a windowed aggregate or a temporal join or a function, Azure Stream Analytics keeps that state information while the job is running, and it stores up to 7 days' worth of data. Now, that's a pretty long time window, but if your windows reach back further than that, you need to keep that limit in mind: it stores 7 days' worth of data. And so what it's going to look at is your last available checkpoint. I'll sketch that checkpoint-and-replay pattern in code in just a moment.
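To make the checkpoint-and-replay idea concrete, here is a minimal Python sketch of the pattern. Everything in it is hypothetical illustration: the names, the checkpoint interval, and the simple count aggregate are all invented. Azure Stream Analytics does all of this for you internally, so treat this as the shape of the idea, not the service's actual implementation.

```python
# Illustrative only: checkpoint-and-replay recovery in miniature.
# Azure Stream Analytics handles this automatically; all names are made up.

CHECKPOINT_INTERVAL = 100  # checkpoint every 100 events (arbitrary choice)

def run_node(events):
    """A 'node' applying stateful query logic (here, a running count)."""
    state = {"count": 0}
    checkpoint = {"state": dict(state), "offset": -1}
    for offset, _event in enumerate(events):
        state["count"] += 1  # apply the event to the running aggregate
        if (offset + 1) % CHECKPOINT_INTERVAL == 0:
            # Persist the state plus the input offset it corresponds to,
            # so a replacement node knows exactly where to resume from.
            checkpoint = {"state": dict(state), "offset": offset}
    return state, checkpoint

def recover(events, checkpoint):
    """What a new, healthy node does: restore the last available
    checkpoint, then replay the events that arrived after it."""
    state = dict(checkpoint["state"])
    for _event in events[checkpoint["offset"] + 1:]:  # the replay catch-up work
        state["count"] += 1
    return state

# Simulate a node failing after 450 of 1,000 events, then recovering.
events = list(range(1000))
_, last_checkpoint = run_node(events[:450])   # last checkpoint was at offset 399
recovered = recover(events, last_checkpoint)  # replays everything after it
print(recovered["count"])                     # 1000, so no events were lost
```

The key design point mirrored here is that a checkpoint saves two things together: the query state and the input position it corresponds to. That pairing is what lets a fresh node pick up exactly where the failed one left off instead of reprocessing the whole stream.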
In short, the service is going to create a new, healthy node, and it's going to resume the work from that last available checkpoint.

So, how do we make that replay process faster? Because you have some say in how long the replay process takes, which is that restart and catch-up from our last available checkpoint. Well, in order to do that, we are going to estimate the replay catch-up time with this formula: take your input event rate, multiply it by the gap length, and divide that by the number of processing partitions. Now, this gap length, which is your replay length, the time it takes to catch up to where it was, is typically going to be a couple of minutes.

So now that you've seen the formula, let's take a look at how we calculate that replay catch-up time in practice. We're going to look at our window size and figure out how big that is. Then we're going to run sample data through Event Hub. Once you click Start, you're going to measure how long it takes until that output first appears. This is going to be our catch-up time; measuring it this way is essentially working through that same formula empirically.

Now, to make it run faster if it's not running fast enough, well, it comes down to streaming units. If you spend more money and you have more streaming units available, it's going to process faster, which means you're going to have a smaller gap length. Also, you want to take a look at partitions, because if you remember from that formula, the denominator was the number of processing partitions. So you can also affect the restoration time by adjusting the number of partitions that you have. Those are the 2 things that you have control over to make your replay process faster in the event of a failure. Let's put some example numbers on that estimate before we wrap up.
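Here is what that estimate can look like with concrete numbers. Everything below is invented purely for illustration, and the per-partition replay rate in particular is an assumption I'm adding: the formula's numerator (input event rate times gap length) gives you the backlog of events that piled up during the gap, dividing by the partition count spreads that backlog across the parallel workers, and to express the result as seconds you also need how quickly each partition can work through its share.

```python
# Hypothetical worked example of the replay catch-up estimate.
# All numbers are invented; none come from a real Stream Analytics job.

input_event_rate = 10_000   # events/sec arriving while the node was down
gap_length_sec = 120        # gap length: a 2-minute window to replay
processing_partitions = 6   # partitions sharing the catch-up work

# The lesson's formula: backlog spread across the processing partitions.
backlog_per_partition = (input_event_rate * gap_length_sec) / processing_partitions

# Assumed sustained replay throughput per partition. More streaming units
# means higher throughput here, which is exactly why adding them shrinks
# the catch-up time.
replay_rate_per_partition = 25_000  # events/sec, assumed

catchup_seconds = backlog_per_partition / replay_rate_per_partition
print(f"Backlog per partition: {backlog_per_partition:,.0f} events")  # 200,000
print(f"Estimated catch-up time: {catchup_seconds:.1f} seconds")      # 8.0
```

Notice how both levers from the lesson show up: more streaming units raise the replay rate, and more partitions shrink the per-partition backlog, and either one brings the catch-up time down.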
So to summarize all this: basically, if something happens in Azure Stream Analytics and a node fails, the failure is going to be noticed automatically by Azure, a new node is going to be created, and then it's going to replay from your last available checkpoint, which holds up to 7 days' worth of streaming data.

To decrease your downtime, you're going to take a look at properly allocating your streaming units, which might mean increasing them, and taking a look at your partitions and how many you have. If you've done those 2 things, you should be good to go.

And with that, we've finished our lesson, and I'll see you in the next one.