So in this lesson, we are going to talk about a very important concept in data engineering: how we process one or more partitions, which is known as repartitioning. Specifically, we are going to talk about how to understand partitions and repartitioning, and then we're also going to talk about how to use repartitioning in Azure Stream Analytics.

And the first little tidbit is, when you think of partitions, think of buckets, and, there you go, cute dog in a bucket. (Brian chuckles) So as you think of partitions, think about building buckets and putting all your stuff into various buckets.

The basics of partitioning are simple: we divide our data into subsets based on a partition key. We're going to divide our data into those different buckets based on some sort of metric that we define. And let's talk about why that matters. It matters because those subsets, the data in each bucket, make searching faster.

I'll give you a little side story here. My son loves LEGOs, and one day he took all of his LEGOs and broke them into pieces, and it looked kind of like this: everything mixed up together. So I helped him put all of them back together and build all of his sets again. In order to do that, I partitioned the LEGOs into various colors (red, yellow, blue), and that really helped, because it made searching for the different pieces a lot faster. The same thing is true with data.

We use partition keys to help determine which buckets the data goes into. So think about this for partition keys. Let's say we have a database that has customer name, order number, order date, order country, and total sale. What would be a good key for this data? Well, the first thing we want to look for is that it's static. We don't want something that changes; it's not going to help us if we have something that changes frequently.
Another way to think about static, using data engineering terminology, would be a reasonably consistent and bounded number of records. So we're thinking about something that is bounded, meaning it has a beginning and an ending, and it's going to be reasonably consistent in that it's not changing.

The other thing we want to look for is high cardinality, and when we talk about high cardinality, this just means a big range. We want something that's going to have a lot of different values. For instance, if I chose order country as my partition key, but everyone purchases within the United States, that's a terrible key, because we're going to have one giant bucket with all the data in it, and our partitioning was useless.

So when we look at our partition keys and we look at this data here, there are a couple of different choices, and the partition key you choose would be based upon how you want to divide up the data. For instance, you might look at order date. If you have a reasonably smooth, even flow of orders per day, you could choose order date and use that as a partition key. Or, and I know I mentioned before that you might not want to choose order country, but if you are in global sales and your orders come from across the world, order country might be a good solution for you, provided there's a reasonable distribution where it's pretty even and there are a lot of different countries; again, that high cardinality.

So when we look at partition keys, it's important to think about how we're going to make those buckets work, and about what we want to use to help find the data we're looking for within those buckets, i.e., our partition key. And keep in mind, this is the basics of partitioning. We could talk for several sections just on partitioning and the different ways you can do it, but I want to give you an overview of some things you can do and how it works at a high level.
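To make that concrete, here's a minimal sketch written in the Azure Stream Analytics query language we'll be looking at in a moment. The input, output, and column names are all hypothetical; it simply buckets the orders by order date, our static, reasonably high-cardinality key:

    SELECT
        OrderDate,
        COUNT(*) AS OrderCount,
        SUM(TotalSale) AS TotalSales
    INTO
        [orders-by-day]                        -- hypothetical output
    FROM
        [orders-input] PARTITION BY OrderDate  -- OrderDate as the partition key
    GROUP BY
        OrderDate,
        TumblingWindow(day, 1)

Every order placed on the same date lands in the same bucket, so a search scoped to one day only has to touch one bucket.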
So when we talk about Stream Analytics, we're going to talk about embarrassingly parallel jobs. Yes, that is a real term. Microsoft defines embarrassingly parallel jobs as the most scalable scenarios in Azure Stream Analytics, i.e., the easy button over here on the right. An embarrassingly parallel job connects one partition of the input to one instance of the output. That is the most scalable scenario for Azure Stream Analytics, and you do need to understand it: embarrassingly parallel jobs, 1 input to 1 output.

So here's our key: we have our Azure Stream Analytics query, and we have an output. If we had multiple partitions, we would just have multiple sets of this.

When we look at the script for this (there's a sketch of it just below), you can see that we have our SELECT COUNT(*) AS Count, just selecting all of our data from our input, and then comes the part that's important for partitioning: we partition by PartitionId, which is whatever our partition key is. Below that, you can see that we simply tack on the standard hopping window, or whatever our window is going to be; that's our GROUP BY statement. This is pretty typical: we have our SELECT, then our FROM, then our PARTITION BY with the partition key, and then our GROUP BY statement. And that is the code we would use to run an Azure Stream Analytics query utilizing partitions.

So what happens when a job is not embarrassingly parallel, and what would that even be? If it was not embarrassingly parallel, we would possibly have a mismatched partition count; in other words, our input partition count doesn't match our output partition count. Another scenario would be if we are moving into something that doesn't have partitions, so our input has partitions but our output does not. That would not be embarrassingly parallel, because it's not 1 input to 1 output with partitions.
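Before we look at how to fix that, here's a minimal sketch of the embarrassingly parallel query described above (the input and output names and the window sizes are hypothetical):

    SELECT
        COUNT(*) AS Count
    INTO
        [output]                          -- an output with a matching partition count
    FROM
        [input] PARTITION BY PartitionId  -- PartitionId is our partition key
    GROUP BY
        PartitionId,
        HoppingWindow(minute, 10, 5)      -- or whatever window fits your scenario

Note that the partition key shows up in both the PARTITION BY clause and the GROUP BY clause, which is what lets each input partition flow straight through to its own output instance.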
So if that's the case, what you would do is run additional query steps, and you would utilize something known as shuffling. For that, let's go ahead and jump over and take a look at shuffling.

So this is repartitioning. We've now transitioned from partitioning into repartitioning, or reshuffling. This is for those scenarios that aren't embarrassingly parallel.

What we can do is process all of the partitions independently. To do repartitioning, we have basically 2 choices. The first choice is that we create a separate Stream Analytics job that does all the repartitioning. The second is that we create multiple steps in one job, do all of the repartitioning, the shuffling, first, and then do everything else we actually wanted to do. Those are the 2 different choices you have.

So now let's take a look at a script so we can see what this looks like. We have 2 different steps here: we're going to repartition our data, and then we're going to do something with it. Basically, we create a repartitioned input, select everything, and partition by device ID in this case. And that PARTITION BY could be whatever you want; it could be a customer number, it could be a device ID, whatever you want.

Once that first step is done, we have repartitioned our data. Then we select from our partition and perform some simple averages. So we do an average, and then we move that into our output, using the repartitioned input we just built, the one that reshuffled and created those partitions, as our source data.
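Here's a minimal sketch of that 2-step script, assuming a hypothetical IoT-style input with DeviceId and Reading columns (the input and output names and the window sizes are made up as well):

    WITH RepartitionedInput AS
    (
        -- Step 1: the shuffle; repartition the stream on our chosen key.
        -- You can also ask for an explicit partition count here,
        -- e.g., PARTITION BY DeviceId INTO 10.
        SELECT *
        FROM [input] PARTITION BY DeviceId
    )
    -- Step 2: the work we actually wanted to do, using the
    -- repartitioned stream as the source data
    SELECT
        DeviceId,
        AVG(Reading) AS AvgReading
    INTO
        [output]
    FROM
        RepartitionedInput
    GROUP BY
        DeviceId,
        HoppingWindow(minute, 10, 5)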
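And looking ahead for a moment, the same 2-step pattern extends to joining 2 different input streams, which we'll walk through next. Here's a hedged sketch, assuming two hypothetical inputs that share a DeviceId column:

    WITH
    step1 AS
    (
        SELECT * FROM [input1] PARTITION BY DeviceId  -- reshuffle stream 1
    ),
    step2 AS
    (
        SELECT * FROM [input2] PARTITION BY DeviceId  -- reshuffle stream 2
    )
    SELECT
        step1.DeviceId,
        step1.Reading AS Reading1,
        step2.Reading AS Reading2
    INTO
        [output]                                      -- 1 combined output
    FROM
        step1
    JOIN
        step2
        ON step1.DeviceId = step2.DeviceId
        AND DATEDIFF(minute, step1, step2) BETWEEN 0 AND 10  -- Stream Analytics joins need a time bound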
At the very bottom of that first script, you can see that we've added in a hopping window, so that Azure Stream Analytics has a good idea of how it should look at the data as it comes through the stream. It's a pretty simple script, but it gives you a good idea of what that 2-step process of reshuffling and then doing something looks like.

You can also use this to join multiple streams if you need to, as in the second sketch. In that case, we have our step1 that we're creating, and we're also creating a step2, and those come from 2 different input sources. Then we can select from step1 and step2, those 2 different repartitioned streams we've created, and join them together into 1 output. That is what you see there. So, just a couple of simple scripts to give you an idea of what repartitioning looks like in Stream Analytics.

All right, so in review. First, you need to know what an embarrassingly parallel job is. Basically, the input and output partitions match; those numbers match. Repartitioning is just reshuffling; that's what we do with jobs that aren't embarrassingly parallel, and you can see how we do that with those sample scripts. And you need to know that this affects the query section of Azure Stream Analytics; that is where all of this comes into play.

All right, so that is it for this lesson. I'll see you in the next one.