So in this lesson, we are going to talk about a very important concept in data engineering: how we process one or more partitions, which is known as repartitioning. Specifically, we are going to talk about how to understand partitions and repartitioning, and then we're also going to talk about how to use repartitioning in Azure Stream Analytics.

And the first little tidbit is, when you think of partitions, think of buckets, and, there you go, cute dog in a bucket. (Brian chuckles) So as you think of partitions, think about building buckets and putting all your stuff into various buckets.

The basics of partitioning are simple: we divide our data into subsets based on a partition key. We're going to divide our data into those different buckets based on some sort of metric that we define. And let's talk about why that matters. It matters because those subsets, the data in each bucket, make searching faster.

I'll give you a little side story here. My son loves LEGOs, and one day he took all of his LEGOs and broke them into pieces, and it looked kind of like this: everything mixed up together. So I helped him put all of them back together and build all of his sets again. In order to do that, I partitioned the LEGOs into various colors (red, yellow, blue), and that really helped, because it made searching for the different pieces a lot faster. The same thing is true with data.

We use partition keys to help determine which buckets the data goes into. So think about this for partition keys. Let's say we have a database that has customer name, order number, order date, order country, and total sale. What would be a good key for this data? Well, the first thing we want to look for is that it's static. We don't want something that changes; it's not going to help us if we have something that changes frequently.
Another way to think about static, using data engineering terminology, would be a reasonably consistent and bounded number of records. So we're thinking about something that is bounded, meaning it has a beginning and an ending, and it's going to be reasonably consistent in that it's not changing.

The other thing we want to look for is high cardinality, and when we talk about high cardinality, this just means a big range. We want something that's going to have a lot of different values. For instance, if I chose order country as my partition key, but everyone purchases within the United States, that's a terrible key, because we're going to have one giant bucket with all the data in it, and our partitioning was useless.

So when we look at our partition keys and we look at this data here, there are a couple of different choices, and the partition key you choose would be based upon how you want to divide up the data. For instance, you might look at order date. If you have a reasonably smooth, even flow of orders per day, you could choose order date and use that as a partition key. Or, and I know I mentioned before that you might not want to choose order country, but if you are in global sales and your orders come from across the world, order country might be a good solution for you, provided there's a reasonable distribution where it's pretty even and there are a lot of different countries; again, that high cardinality.

So when we look at partition keys, it's important to think about how we're going to make those buckets work, and about what we want to use to help find the data we're looking for within those buckets, i.e., our partition key. And keep in mind, this is the basics of partitioning. We could talk for several sections just on partitioning and the different ways you can do it, but I want to give you an overview of some things you can do and how it works at a high level.
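To make that concrete, here's a minimal sketch written in the Azure Stream Analytics query language we'll be looking at in a moment. The input, output, and column names are all hypothetical; it simply buckets the orders by order date, our static, reasonably high-cardinality key:

    SELECT
        OrderDate,
        COUNT(*) AS OrderCount,
        SUM(TotalSale) AS TotalSales
    INTO
        [orders-by-day]                        -- hypothetical output
    FROM
        [orders-input] PARTITION BY OrderDate  -- OrderDate as the partition key
    GROUP BY
        OrderDate,
        TumblingWindow(day, 1)

Every order placed on the same date lands in the same bucket, so a search scoped to one day only has to touch one bucket.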
So when we talk about Stream Analytics, we're going to talk about embarrassingly parallel jobs. Yes, that is a real term. Microsoft defines embarrassingly parallel jobs as the most scalable scenarios in Azure Stream Analytics, i.e., the easy button over here on the right. An embarrassingly parallel job connects one partition of the input to one instance of the output. That is the most scalable scenario for Azure Stream Analytics, and you do need to understand it: embarrassingly parallel jobs, 1 input to 1 output.

So here's our key: we have our Azure Stream Analytics query, and we have an output. If we had multiple partitions, we would just have multiple sets of this.

When we look at the script for this (there's a sketch of it just below), you can see that we have our SELECT COUNT(*) AS Count, just selecting all of our data from our input, and then comes the part that's important for partitioning: we partition by PartitionId, which is whatever our partition key is. Below that, you can see that we simply tack on the standard hopping window, or whatever our window is going to be; that's our GROUP BY statement. This is pretty typical: we have our SELECT, then our FROM, then our PARTITION BY with the partition key, and then our GROUP BY statement. And that is the code we would use to run an Azure Stream Analytics query utilizing partitions.

So what happens when a job is not embarrassingly parallel, and what would that even be? If it was not embarrassingly parallel, we would possibly have a mismatched partition count; in other words, our input partition count doesn't match our output partition count. Another scenario would be if we are moving into something that doesn't have partitions, so our input has partitions but our output does not. That would not be embarrassingly parallel, because it's not 1 input to 1 output with partitions.
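Before we look at how to fix that, here's a minimal sketch of the embarrassingly parallel query described above (the input and output names and the window sizes are hypothetical):

    SELECT
        COUNT(*) AS Count
    INTO
        [output]                          -- an output with a matching partition count
    FROM
        [input] PARTITION BY PartitionId  -- PartitionId is our partition key
    GROUP BY
        PartitionId,
        HoppingWindow(minute, 10, 5)      -- or whatever window fits your scenario

Note that the partition key shows up in both the PARTITION BY clause and the GROUP BY clause, which is what lets each input partition flow straight through to its own output instance.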
So if that's the case, what you would do is run additional query steps, and you would utilize something known as shuffling. For that, let's go ahead and jump over and take a look at shuffling.

So this is repartitioning. We've now transitioned from partitioning into repartitioning, or reshuffling. This is for those scenarios that aren't embarrassingly parallel.

What we can do is process all of the partitions independently. To do repartitioning, we have basically 2 choices. The first choice is that we create a separate Stream Analytics job that does all the repartitioning. The second is that we create multiple steps in one job, do all of the repartitioning, the shuffling, first, and then do everything else we actually wanted to do. Those are the 2 different choices you have.

So now let's take a look at a script so we can see what this looks like. We have 2 different steps here: we're going to repartition our data, and then we're going to do something with it. Basically, we create a repartitioned input, select everything, and partition by device ID in this case. And that PARTITION BY could be whatever you want; it could be a customer number, it could be a device ID, whatever you want.

Once that first step is done, we have repartitioned our data. Then we select from our partition and perform some simple averages. So we do an average, and then we move that into our output, using the repartitioned input we just built, the one that reshuffled and created those partitions, as our source data.
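Here's a minimal sketch of that 2-step script, assuming a hypothetical IoT-style input with DeviceId and Reading columns (the input and output names and the window sizes are made up as well):

    WITH RepartitionedInput AS
    (
        -- Step 1: the shuffle; repartition the stream on our chosen key.
        -- You can also ask for an explicit partition count here,
        -- e.g., PARTITION BY DeviceId INTO 10.
        SELECT *
        FROM [input] PARTITION BY DeviceId
    )
    -- Step 2: the work we actually wanted to do, using the
    -- repartitioned stream as the source data
    SELECT
        DeviceId,
        AVG(Reading) AS AvgReading
    INTO
        [output]
    FROM
        RepartitionedInput
    GROUP BY
        DeviceId,
        HoppingWindow(minute, 10, 5)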
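And looking ahead for a moment, the same 2-step pattern extends to joining 2 different input streams, which we'll walk through next. Here's a hedged sketch, assuming two hypothetical inputs that share a DeviceId column:

    WITH
    step1 AS
    (
        SELECT * FROM [input1] PARTITION BY DeviceId  -- reshuffle stream 1
    ),
    step2 AS
    (
        SELECT * FROM [input2] PARTITION BY DeviceId  -- reshuffle stream 2
    )
    SELECT
        step1.DeviceId,
        step1.Reading AS Reading1,
        step2.Reading AS Reading2
    INTO
        [output]                                      -- 1 combined output
    FROM
        step1
    JOIN
        step2
        ON step1.DeviceId = step2.DeviceId
        AND DATEDIFF(minute, step1, step2) BETWEEN 0 AND 10  -- Stream Analytics joins need a time bound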
At the very bottom of that first script, you can see that we've added in a hopping window, so that Azure Stream Analytics has a good idea of how it should look at the data as it comes through the stream. It's a pretty simple script, but it gives you a good idea of what that 2-step process of reshuffling and then doing something looks like.

You can also use this to join multiple streams if you need to, as in the second sketch. In that case, we have our step1 that we're creating, and we're also creating a step2, and those come from 2 different input sources. Then we can select from step1 and step2, those 2 different repartitioned streams we've created, and join them together into 1 output. That is what you see there. So, just a couple of simple scripts to give you an idea of what repartitioning looks like in Stream Analytics.

All right, so in review. First, you need to know what an embarrassingly parallel job is. Basically, the input and output partitions match; those numbers match. Repartitioning is just reshuffling; that's what we do with jobs that aren't embarrassingly parallel, and you can see how we do that with those sample scripts. And you need to know that this affects the query section of Azure Stream Analytics; that is where all of this comes into play.

All right, so that is it for this lesson. I'll see you in the next one.