So continuing on from the last lesson, in this lesson we are going to pick up and talk a little bit more about partitions, and, specifically, talk about what happens when things go wrong.

What we're going to be looking at, specifically, is job recovery from failure. So when Azure Stream Analytics has a failure, what happens and how do we recover? And then we're going to talk about something called the replay catch-up time, and we're going to look at how we determine what that is.

And to quote the famous Robert Burns, "the best-laid schemes o' mice an' men" sometimes go awry. So it's a good thing to make some plans.

So what does this look like in Azure Stream Analytics? Well, we talked about dividing our data into subsets based upon partitions and a partition key in the last lesson. And so what we have is a job that is started in Azure Stream Analytics. Now, as part of that parallel processing, Azure Stream Analytics is going to break up your work across nodes automatically, and it's going to start working on all of your query in parallel.

Now, sometimes a node can fail. Basically, something bad happens, a failure occurs in a node, so what you need to understand is what Azure Stream Analytics does next. Well, it is going to have an automatic recovery for you, which is a great thing. It's going to restore from your last available checkpoint, and it does this through something known as stateful query logic. Basically what happens is, when you do a windowed aggregate or a temporal join or a function, Azure Stream Analytics keeps that state information while the job is running, and it stores up to 7 days' worth of data. Now, that's a pretty long time window, but if your windows reach back further than that, you need to keep that limit in mind: it stores 7 days' worth of data. And so what it's going to look at is your last available checkpoint. I'll sketch that checkpoint-and-replay pattern in code in just a moment.
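To make the checkpoint-and-replay idea concrete, here is a minimal Python sketch of the pattern. Everything in it is hypothetical illustration: the names, the checkpoint interval, and the simple count aggregate are all invented. Azure Stream Analytics does all of this for you internally, so treat this as the shape of the idea, not the service's actual implementation.

```python
# Illustrative only: checkpoint-and-replay recovery in miniature.
# Azure Stream Analytics handles this automatically; all names are made up.

CHECKPOINT_INTERVAL = 100  # checkpoint every 100 events (arbitrary choice)

def run_node(events):
    """A 'node' applying stateful query logic (here, a running count)."""
    state = {"count": 0}
    checkpoint = {"state": dict(state), "offset": -1}
    for offset, _event in enumerate(events):
        state["count"] += 1  # apply the event to the running aggregate
        if (offset + 1) % CHECKPOINT_INTERVAL == 0:
            # Persist the state plus the input offset it corresponds to,
            # so a replacement node knows exactly where to resume from.
            checkpoint = {"state": dict(state), "offset": offset}
    return state, checkpoint

def recover(events, checkpoint):
    """What a new, healthy node does: restore the last available
    checkpoint, then replay the events that arrived after it."""
    state = dict(checkpoint["state"])
    for _event in events[checkpoint["offset"] + 1:]:  # the replay catch-up work
        state["count"] += 1
    return state

# Simulate a node failing after 450 of 1,000 events, then recovering.
events = list(range(1000))
_, last_checkpoint = run_node(events[:450])   # last checkpoint was at offset 399
recovered = recover(events, last_checkpoint)  # replays everything after it
print(recovered["count"])                     # 1000, so no events were lost
```

The key design point mirrored here is that a checkpoint saves two things together: the query state and the input position it corresponds to. That pairing is what lets a fresh node pick up exactly where the failed one left off instead of reprocessing the whole stream.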
In short, the service is going to create a new, healthy node, and it's going to resume the work from that last available checkpoint.

So, how do we make that replay process faster? Because you have some say in how long the replay process takes, which is that restart and catch-up from our last available checkpoint. Well, in order to do that, we are going to estimate the replay catch-up time with this formula: take your input event rate, multiply it by the gap length, and divide that by the number of processing partitions. Now, this gap length, which is your replay length, the time it takes to catch up to where it was, is typically going to be a couple of minutes.

So now that you've seen the formula, let's take a look at how we calculate that replay catch-up time in practice. We're going to look at our window size and figure out how big that is. Then we're going to run sample data through Event Hub. Once you click Start, you're going to measure how long it takes until that output first appears. This is going to be our catch-up time; measuring it this way is essentially working through that same formula empirically.

Now, to make it run faster if it's not running fast enough, well, it comes down to streaming units. If you spend more money and you have more streaming units available, it's going to process faster, which means you're going to have a smaller gap length. Also, you want to take a look at partitions, because if you remember from that formula, the denominator was the number of processing partitions. So you can also affect the restoration time by adjusting the number of partitions that you have. Those are the 2 things that you have control over to make your replay process faster in the event of a failure. Let's put some example numbers on that estimate before we wrap up.
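Here is what that estimate can look like with concrete numbers. Everything below is invented purely for illustration, and the per-partition replay rate in particular is an assumption I'm adding: the formula's numerator (input event rate times gap length) gives you the backlog of events that piled up during the gap, dividing by the partition count spreads that backlog across the parallel workers, and to express the result as seconds you also need how quickly each partition can work through its share.

```python
# Hypothetical worked example of the replay catch-up estimate.
# All numbers are invented; none come from a real Stream Analytics job.

input_event_rate = 10_000   # events/sec arriving while the node was down
gap_length_sec = 120        # gap length: a 2-minute window to replay
processing_partitions = 6   # partitions sharing the catch-up work

# The lesson's formula: backlog spread across the processing partitions.
backlog_per_partition = (input_event_rate * gap_length_sec) / processing_partitions

# Assumed sustained replay throughput per partition. More streaming units
# means higher throughput here, which is exactly why adding them shrinks
# the catch-up time.
replay_rate_per_partition = 25_000  # events/sec, assumed

catchup_seconds = backlog_per_partition / replay_rate_per_partition
print(f"Backlog per partition: {backlog_per_partition:,.0f} events")  # 200,000
print(f"Estimated catch-up time: {catchup_seconds:.1f} seconds")      # 8.0
```

Notice how both levers from the lesson show up: more streaming units raise the replay rate, and more partitions shrink the per-partition backlog, and either one brings the catch-up time down.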
So to summarize all this: basically, if something happens in Azure Stream Analytics and a node fails, the failure is going to be noticed automatically by Azure, a new node is going to be created, and then it's going to replay from your last available checkpoint, which holds up to 7 days' worth of streaming data.

To decrease your downtime, you're going to take a look at properly allocating your streaming units, which might mean increasing them, and taking a look at your partitions and how many you have. If you've done those 2 things, you should be good to go.

And with that, we've finished our lesson, and I'll see you in the next one.