1 00:00:00,930 --> 00:00:01,930 Hey, what's up, Gurus? 2 00:00:01,930 --> 00:00:04,670 In this lesson, we are going to talk about processing data 3 00:00:04,670 --> 00:00:07,420 by using Spark Structured Streaming. 4 00:00:07,420 --> 00:00:09,010 Specifically, we're going to take a look 5 00:00:09,010 --> 00:00:10,580 at a high-level overview 6 00:00:10,580 --> 00:00:13,180 of Apache Spark Structured Streaming. 7 00:00:13,180 --> 00:00:16,300 And when I say high level, I mean high level. 8 00:00:16,300 --> 00:00:19,500 We could have an entire course on Spark, 9 00:00:19,500 --> 00:00:21,170 but we don't have that much time 10 00:00:21,170 --> 00:00:22,970 nor do we need to go to that much depth. 11 00:00:22,970 --> 00:00:24,610 So we're going to cover just the basics 12 00:00:24,610 --> 00:00:27,070 so that you understand how it functions. 13 00:00:27,070 --> 00:00:29,370 That includes input streams and tables, 14 00:00:29,370 --> 00:00:31,650 and a little bit about how that works. 15 00:00:31,650 --> 00:00:32,930 And then, of course, we're going to review 16 00:00:32,930 --> 00:00:35,780 some simple code scripts to process data 17 00:00:35,780 --> 00:00:38,940 so that you understand how the Apache Spark module 18 00:00:38,940 --> 00:00:40,990 in Data Factory works. 19 00:00:40,990 --> 00:00:41,900 All right, with that, 20 00:00:41,900 --> 00:00:44,990 let's get started with the high-level overview. 21 00:00:44,990 --> 00:00:47,370 So here we have an overview 22 00:00:47,370 --> 00:00:50,590 of the Spark Structured Streaming process. 23 00:00:50,590 --> 00:00:51,760 And we're going to break this down 24 00:00:51,760 --> 00:00:53,350 kind of section by section 25 00:00:53,350 --> 00:00:55,413 so that you understand what's happening. 26 00:00:56,400 --> 00:00:59,700 So our first section here is our input. 
27 00:00:59,700 --> 00:01:02,520 We have our TCP socket at the top there 28 00:01:02,520 --> 00:01:04,760 that you could use as a debugging source, 29 00:01:04,760 --> 00:01:06,500 or you have your main options, 30 00:01:06,500 --> 00:01:08,230 which are your message source, 31 00:01:08,230 --> 00:01:11,430 which would be like streaming through Kafka or Event Hubs. 32 00:01:11,430 --> 00:01:14,390 Or you could have a watcher set out 33 00:01:14,390 --> 00:01:16,230 looking at your data lake 34 00:01:16,230 --> 00:01:18,280 or a storage blob or something like that, 35 00:01:18,280 --> 00:01:20,230 and it would be watching for new files. 36 00:01:20,230 --> 00:01:23,170 As those files appear, it would input them 37 00:01:23,170 --> 00:01:26,940 into your Spark Structured Streaming. 38 00:01:26,940 --> 00:01:30,980 Now the streaming process is where the windowing occurs 39 00:01:30,980 --> 00:01:33,180 and normal transformations. 40 00:01:33,180 --> 00:01:36,210 So when we talk about Stream Analytics, 41 00:01:36,210 --> 00:01:37,750 it's very similar to that. 42 00:01:37,750 --> 00:01:39,740 You're going to have your processing 43 00:01:39,740 --> 00:01:42,320 where you set up a window of time that you're looking at, 44 00:01:42,320 --> 00:01:44,330 and as that data flows through the window, 45 00:01:44,330 --> 00:01:48,010 you're doing transformations on that data. 46 00:01:48,010 --> 00:01:51,180 Once you finish that, you're just going to have a sink 47 00:01:51,180 --> 00:01:53,920 and a sink is just an output location for the data. 48 00:01:53,920 --> 00:01:57,610 So you're going to send your data once it's been processed 49 00:01:57,610 --> 00:02:00,220 to a Power BI dashboard or a database 50 00:02:00,220 --> 00:02:03,950 or the data lake store, or something like that. 
51 00:02:03,950 --> 00:02:06,590 And then we also, just like we had a debug source, 52 00:02:06,590 --> 00:02:09,040 or an input, we also have a debug sink 53 00:02:09,040 --> 00:02:12,780 that you can use as well if you need to do debugging. 54 00:02:12,780 --> 00:02:16,060 So that's the basics of how structured streaming works. 55 00:02:16,060 --> 00:02:19,690 Essentially, it's just input, query, output. 56 00:02:19,690 --> 00:02:21,680 So keep that in mind. 57 00:02:21,680 --> 00:02:23,640 Now let's talk a little bit about tables 58 00:02:23,640 --> 00:02:27,160 and how the data actually gets processed. 59 00:02:27,160 --> 00:02:30,270 So we actually have 2 different tables. 60 00:02:30,270 --> 00:02:33,330 We have an input table, we run queries, 61 00:02:33,330 --> 00:02:37,100 and then we have our output or our results table. 62 00:02:37,100 --> 00:02:39,330 So our data stream here on the left 63 00:02:39,330 --> 00:02:41,520 would represent our input table. 64 00:02:41,520 --> 00:02:43,480 And we're going to take that data 65 00:02:43,480 --> 00:02:46,410 and we are going to do processing and send that through 66 00:02:46,410 --> 00:02:49,163 to an unbounded, or results, table. 67 00:02:50,570 --> 00:02:52,500 So now just a couple of things to keep in mind 68 00:02:52,500 --> 00:02:53,810 as we move forward with this. 69 00:02:53,810 --> 00:02:57,950 The first is taking a look at this unbounded table concept. 70 00:02:57,950 --> 00:03:01,620 So an unbounded table is one that doesn't have a boundary, 71 00:03:01,620 --> 00:03:04,660 which means it's going to continuously grow. 72 00:03:04,660 --> 00:03:07,080 And so this is an appending table, 73 00:03:07,080 --> 00:03:11,200 or it's being continuously appended or added to. 74 00:03:11,200 --> 00:03:14,000 So each one of these purple lines coming down 75 00:03:14,000 --> 00:03:18,100 is adding to this unbounded table. 
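To make that unbounded-table idea concrete, here's a tiny plain-Python toy (not actual Spark code — the sensor names and values are made up) showing an input stream being appended, micro-batch by micro-batch, to a results table that only ever grows:

```python
# Toy model of Spark's unbounded input table: each arriving
# micro-batch of rows (the "purple lines") is appended,
# so the table grows continuously and never shrinks.
unbounded_table = []

def append_micro_batch(table, batch):
    """Append one micro-batch of rows to the unbounded table."""
    table.extend(batch)
    return table

# Three triggers, each delivering a small batch of sensor rows
micro_batches = [
    [("sensor-1", 21.5)],
    [("sensor-1", 21.7), ("sensor-2", 19.9)],
    [("sensor-2", 20.1)],
]
for batch in micro_batches:
    append_micro_batch(unbounded_table, batch)

print(len(unbounded_table))  # 4 rows and counting
```

Each trigger only ever adds rows; nothing is removed, which is exactly why memory becomes a concern later in the lesson.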
76 00:03:18,100 --> 00:03:19,480 The next thing to keep in mind 77 00:03:19,480 --> 00:03:23,550 is this concept of streaming and batch. 78 00:03:23,550 --> 00:03:25,870 So Spark Structured Streaming 79 00:03:25,870 --> 00:03:28,680 is definitely a streaming service. 80 00:03:28,680 --> 00:03:32,120 However, you can see the lines coming off of this arrow. 81 00:03:32,120 --> 00:03:34,430 We're going to take these individual packets, 82 00:03:34,430 --> 00:03:36,420 and we're going to process them 83 00:03:36,420 --> 00:03:39,060 almost like we would a batch request. 84 00:03:39,060 --> 00:03:40,320 So it's kind of a mixture 85 00:03:40,320 --> 00:03:43,313 of streaming and batch, if you will. 86 00:03:44,820 --> 00:03:47,030 And then finally, incremental processing. 87 00:03:47,030 --> 00:03:50,290 We definitely need to understand incremental processing. 88 00:03:50,290 --> 00:03:52,990 And so this kind of goes to that concept 89 00:03:52,990 --> 00:03:55,210 of this batch processing. 90 00:03:55,210 --> 00:03:57,890 So we're going to take these incremental pieces 91 00:03:57,890 --> 00:04:01,010 and we're going to process those in parallel. 92 00:04:01,010 --> 00:04:02,880 And I'll show you this in the next slide, 93 00:04:02,880 --> 00:04:05,420 but keep in mind those 3 concepts 94 00:04:05,420 --> 00:04:07,590 as we talk about this unbounded table 95 00:04:07,590 --> 00:04:09,900 that continuously grows. 96 00:04:09,900 --> 00:04:11,420 All right, so as we look at this, 97 00:04:11,420 --> 00:04:14,100 you need to understand that there's 3 different modes 98 00:04:14,100 --> 00:04:16,180 for outputting our table, 99 00:04:16,180 --> 00:04:17,300 and keep in mind the concept, 100 00:04:17,300 --> 00:04:19,700 we have the input, we're running processing, 101 00:04:19,700 --> 00:04:21,080 and then we're going to output it 102 00:04:21,080 --> 00:04:24,900 so you can see your input, result, and output, same thing. 
103 00:04:24,900 --> 00:04:27,090 There's 3 different choices, though. 104 00:04:27,090 --> 00:04:30,330 We have a complete mode or a complete rewrite. 105 00:04:30,330 --> 00:04:31,260 And so in this mode, 106 00:04:31,260 --> 00:04:35,530 we are going to completely rewrite the result table. 107 00:04:35,530 --> 00:04:37,610 Second option is append mode. 108 00:04:37,610 --> 00:04:39,930 So like we've talked about before, 109 00:04:39,930 --> 00:04:43,500 append mode is only going to update the new 110 00:04:43,500 --> 00:04:45,220 or the appended result table. 111 00:04:45,220 --> 00:04:46,760 It's not going to rewrite everything. 112 00:04:46,760 --> 00:04:49,640 It's just going to write since the last trigger, 113 00:04:49,640 --> 00:04:51,433 so it's just going to continue to add. 114 00:04:52,300 --> 00:04:55,160 And then we also have an update mode. 115 00:04:55,160 --> 00:04:57,360 So this is pretty much the exact same thing 116 00:04:57,360 --> 00:05:02,130 as the append mode, except for the concept of aggregations. 117 00:05:02,130 --> 00:05:04,570 So it's only going to update the rows 118 00:05:04,570 --> 00:05:07,500 that have changed since the last trigger, 119 00:05:07,500 --> 00:05:10,230 whereas the append mode is just going to add new rows. 120 00:05:10,230 --> 00:05:13,180 It's not looking at changing or not changing. 121 00:05:13,180 --> 00:05:17,170 So if you had an update mode that didn't have any changes 122 00:05:17,170 --> 00:05:19,933 or aggregations, it would be the exact same as append. 123 00:05:21,250 --> 00:05:23,680 All right, those are the 3 different types 124 00:05:23,680 --> 00:05:25,970 of table outputs. 125 00:05:25,970 --> 00:05:29,260 And so now let's move over here and look at this chart. 126 00:05:29,260 --> 00:05:32,540 So you can see we have time moving from left to right, 127 00:05:32,540 --> 00:05:35,240 and there's a trigger every second. 
128 00:05:35,240 --> 00:05:36,380 So we have our window moving, 129 00:05:36,380 --> 00:05:38,800 and every second we're going to have a trigger. 130 00:05:38,800 --> 00:05:40,620 So it's going to look at that data, 131 00:05:40,620 --> 00:05:43,440 batch that up as a one-second trigger, 132 00:05:43,440 --> 00:05:45,220 and it's going to take that, 133 00:05:45,220 --> 00:05:47,270 and it's going to pull all the input. 134 00:05:47,270 --> 00:05:51,600 So everything for data up to one second, it's going to pull. 135 00:05:51,600 --> 00:05:54,070 It's then going to run our query. 136 00:05:54,070 --> 00:05:57,200 So whatever query language we have, it's going to run that. 137 00:05:57,200 --> 00:06:01,740 Once it's done that, then it's going to give us our result. 138 00:06:01,740 --> 00:06:04,290 So for that, we have our output. 139 00:06:04,290 --> 00:06:07,410 So we get our result, then we push it through in the output. 140 00:06:07,410 --> 00:06:11,920 And in this chart, we're looking at a complete mode table. 141 00:06:11,920 --> 00:06:13,290 So what's going to happen then 142 00:06:13,290 --> 00:06:16,030 is we're going to look at the 2nd second. 143 00:06:16,030 --> 00:06:20,060 And what it's going to do is query data 144 00:06:20,060 --> 00:06:21,860 up to 1 and 2 seconds. 145 00:06:21,860 --> 00:06:24,640 So the input doubled in size 146 00:06:24,640 --> 00:06:28,040 because we're completely rewriting that table. 147 00:06:28,040 --> 00:06:29,830 Then we're going to run our query. 148 00:06:29,830 --> 00:06:31,520 We're going to get our result for those 2, 149 00:06:31,520 --> 00:06:34,120 and then we're going to spit out our output. 150 00:06:34,120 --> 00:06:36,210 Trigger 3 seconds, same thing. 151 00:06:36,210 --> 00:06:38,360 We're going to pull all the data up to 3 seconds. 152 00:06:38,360 --> 00:06:40,450 Then we're going to run our query, get our result, 153 00:06:40,450 --> 00:06:43,080 and then spit out our output. 
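That trigger-by-trigger behavior, and how the 3 output modes differ, can be sketched in plain Python (a toy, not Spark — the cat/dog keys are just stand-ins for whatever you'd be counting):

```python
# Toy comparison of the 3 output modes, using a count-by-key
# aggregation over micro-batches (plain Python, not Spark).
from collections import Counter

def run_triggers(batches):
    totals = Counter()                  # the result table (key -> count)
    complete, append, update = [], [], []
    for batch in batches:
        changed = set(batch)            # keys touched this trigger
        new = changed - set(totals)     # keys seen for the first time
        totals.update(batch)
        complete.append(dict(totals))                   # rewrite everything
        append.append({k: totals[k] for k in new})      # only brand-new rows
        update.append({k: totals[k] for k in changed})  # only changed rows
    return complete, append, update

complete, append, update = run_triggers([["cat"], ["cat", "dog"], ["dog"]])
print(complete[-1])  # {'cat': 2, 'dog': 2} -- full table at every trigger
print(append[1])     # {'dog': 1}           -- only the row added this trigger
print(update[1])     # {'cat': 2, 'dog': 1} -- every row that changed this trigger
```

One caveat worth knowing: in real Structured Streaming, append mode combined with aggregations requires a watermark, so a row is only emitted once it can no longer change.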
154 00:06:43,080 --> 00:06:45,300 So when you look at Spark Structured Streaming, 155 00:06:45,300 --> 00:06:46,410 that's what's happening. 156 00:06:46,410 --> 00:06:49,790 But keep in mind that this table right here 157 00:06:49,790 --> 00:06:51,040 is what that looks like. 158 00:06:51,040 --> 00:06:53,210 So each one of those little arrows 159 00:06:53,210 --> 00:06:56,220 is actually one of those blocks that we're looking at 160 00:06:56,220 --> 00:06:58,013 as we move forward in time. 161 00:06:59,450 --> 00:07:00,880 So to wrap this section up, 162 00:07:00,880 --> 00:07:03,660 let's talk about what's happening. 163 00:07:03,660 --> 00:07:06,870 So we have our input data stream. Data comes in, 164 00:07:06,870 --> 00:07:09,270 and it moves into an input table. 165 00:07:09,270 --> 00:07:10,910 Once it hits that input table, 166 00:07:10,910 --> 00:07:13,060 it's then going to be pulled out, 167 00:07:13,060 --> 00:07:16,530 and a query is going to be run on that data. 168 00:07:16,530 --> 00:07:20,090 It's then going to be filtered into our results table. 169 00:07:20,090 --> 00:07:21,840 And there's really a couple of different ways. 170 00:07:21,840 --> 00:07:23,280 And there's 3, but we're only going to talk 171 00:07:23,280 --> 00:07:25,040 about the complete and append, 172 00:07:25,040 --> 00:07:27,670 because the update mode is very similar. 173 00:07:27,670 --> 00:07:32,200 So in append mode, let's say that we have an IoT sensor, 174 00:07:32,200 --> 00:07:35,090 which is a very common application for Spark. 175 00:07:35,090 --> 00:07:40,090 And with this IoT sensor, we are looking at temperature. 176 00:07:40,270 --> 00:07:41,890 Let's say we have a manufacturing line, 177 00:07:41,890 --> 00:07:44,130 and we have an IoT sensor 178 00:07:44,130 --> 00:07:46,130 that is going to detect the temperature of the machine, 179 00:07:46,130 --> 00:07:49,050 and so it's going to help to spot any abnormalities. 
180 00:07:49,050 --> 00:07:50,510 With that temperature sensor, 181 00:07:50,510 --> 00:07:55,190 we can just simply append that data into a results table. 182 00:07:55,190 --> 00:07:58,230 So that's a fantastic use of append. 183 00:07:58,230 --> 00:08:00,980 Now complete, one question you may be thinking 184 00:08:00,980 --> 00:08:03,060 in your mind is, well, what happens 185 00:08:03,060 --> 00:08:06,710 if my data goes from 1 to 10,000? 186 00:08:06,710 --> 00:08:09,130 Am I eventually going to run out of memory? 187 00:08:09,130 --> 00:08:13,410 And the answer is, yeah, you definitely will. 188 00:08:13,410 --> 00:08:18,410 If you are simply continuing to do a complete mode, 189 00:08:18,670 --> 00:08:22,150 over time as that grows bigger and bigger and bigger, 190 00:08:22,150 --> 00:08:23,840 you're eventually going to run out of memory. 191 00:08:23,840 --> 00:08:26,960 So a better use of complete mode is to look at windows. 192 00:08:26,960 --> 00:08:28,670 So let's say we have a window, 193 00:08:28,670 --> 00:08:31,610 and we are going to sum up all of our data. 194 00:08:31,610 --> 00:08:32,900 And you'll see this a little in the script 195 00:08:32,900 --> 00:08:34,080 that we do in a second, 196 00:08:34,080 --> 00:08:35,980 but we're going to sum up all of our data 197 00:08:35,980 --> 00:08:37,750 for a 5-second window, 198 00:08:37,750 --> 00:08:40,770 and then we're going to write all of that. 199 00:08:40,770 --> 00:08:43,610 Well, so that would be a good use of a complete mode. 200 00:08:43,610 --> 00:08:46,340 It's really designed to be used with aggregations, 201 00:08:46,340 --> 00:08:51,330 not just to continue to add up over and over and over 202 00:08:51,330 --> 00:08:52,330 more and more data. 203 00:08:52,330 --> 00:08:55,610 So keep that in mind for your complete mode, 204 00:08:55,610 --> 00:08:57,180 generally with aggregations, 205 00:08:57,180 --> 00:08:59,730 and keep in mind what you use the append mode for. 
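Here's what that 5-second-window aggregation looks like as a plain-Python toy (the timestamps and temperatures are made up — this just shows the bucketing, not Spark itself):

```python
# Toy version of the 5-second-window aggregation: bucket each
# (timestamp, temperature) reading into its tumbling window,
# then sum per window -- the kind of bounded result that
# complete mode is actually suited for.
from collections import defaultdict

readings = [(0, 20.0), (3, 21.0), (5, 19.0), (9, 22.0), (11, 18.0)]

window_sums = defaultdict(float)
for ts, temp in readings:
    window_start = (ts // 5) * 5          # start of the 5-second window
    window_sums[window_start] += temp

print(dict(window_sums))  # {0: 41.0, 5: 41.0, 10: 18.0}
```

The result table now has one row per window rather than one row per reading, so rewriting it completely on every trigger stays cheap.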
206 00:09:00,570 --> 00:09:02,540 To kind of finish out our discussion, 207 00:09:02,540 --> 00:09:04,490 I want to take a look at a script 208 00:09:04,490 --> 00:09:06,270 that we could use in Data Factory 209 00:09:06,270 --> 00:09:08,520 and kind of break down what's happening here. 210 00:09:09,360 --> 00:09:12,360 So what we have is we have our input, 211 00:09:12,360 --> 00:09:14,500 which is these first 2 blocks. 212 00:09:14,500 --> 00:09:16,010 We're going to import the pieces 213 00:09:16,010 --> 00:09:18,610 that we need to make this script function. 214 00:09:18,610 --> 00:09:23,100 And then we are going to provide the input data. 215 00:09:23,100 --> 00:09:24,670 So that's the input path. 216 00:09:24,670 --> 00:09:26,450 That's the schema that we're going to use, 217 00:09:26,450 --> 00:09:30,690 and then that is our streaming information as well. 218 00:09:30,690 --> 00:09:34,050 So we have all of that defined as our input. 219 00:09:34,050 --> 00:09:36,230 And so then we do work, and you can see that, 220 00:09:36,230 --> 00:09:39,710 in this case, we are aggregating our data together 221 00:09:39,710 --> 00:09:42,040 based upon time and temperature, 222 00:09:42,040 --> 00:09:44,000 so this is probably like an IoT sensor 223 00:09:44,000 --> 00:09:45,790 or something like that. 224 00:09:45,790 --> 00:09:48,690 And then finally, we have our output here. 225 00:09:48,690 --> 00:09:52,490 And so this is defining our output sink. 226 00:09:52,490 --> 00:09:54,510 And then finally we are going to get going, 227 00:09:54,510 --> 00:09:56,560 which means we're going to actually start the query. 228 00:09:56,560 --> 00:09:58,560 So this is the command that you would use 229 00:09:58,560 --> 00:10:00,033 to actually start the query. 230 00:10:01,070 --> 00:10:03,410 So this is the Spark Activity 231 00:10:03,410 --> 00:10:05,870 in Data Factory at a high level. 232 00:10:05,870 --> 00:10:09,620 Now for the DP-203, what do you actually need to know? 
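The slide's script itself isn't reproduced in this transcript, but a typical Structured Streaming job in PySpark follows that same input / query / output shape. Treat the sketch below as pseudocode: the path, schema fields, window size, and sink options are hypothetical stand-ins, not the slide's actual values.

```
# -- input: imports, schema, and a file-watcher source --
from pyspark.sql.functions import window, avg
from pyspark.sql.types import StructType

inputPath = "/mnt/datalake/sensors/"          # hypothetical path
schema = (StructType()
          .add("time", "timestamp")
          .add("temperature", "double"))
inputDF = (spark.readStream
           .schema(schema)
           .json(inputPath))                  # watch for new files appearing

# -- query: aggregate temperature over time windows --
resultDF = (inputDF
            .groupBy(window(inputDF.time, "5 seconds"))
            .agg(avg("temperature")))

# -- output: define the sink, then actually start the query --
query = (resultDF.writeStream
         .outputMode("complete")
         .format("memory")                    # debug sink for testing
         .queryName("temps")
         .start())
```

The three comment blocks line up with the three blocks on the slide: define the input, define the work, define where it goes and start it.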
233 00:10:09,620 --> 00:10:12,800 Well, not a whole lot, as far as the code is concerned. 234 00:10:12,800 --> 00:10:15,280 Basically, if you understand the blocks, 235 00:10:15,280 --> 00:10:18,060 the input, the query, the output block, 236 00:10:18,060 --> 00:10:20,480 and you could recognize kind of those blocks 237 00:10:20,480 --> 00:10:22,430 and where they sit, then you're just fine. 238 00:10:22,430 --> 00:10:23,830 I think it's wildly unlikely 239 00:10:23,830 --> 00:10:25,870 that you're going to need to know individual functions 240 00:10:25,870 --> 00:10:27,040 or things like that. 241 00:10:27,040 --> 00:10:28,640 So really just being able to glance 242 00:10:28,640 --> 00:10:31,150 and understand kind of what's happening at a high level 243 00:10:31,150 --> 00:10:34,033 should be just fine for this specific slide. 244 00:10:34,880 --> 00:10:36,290 So we've talked about quite a bit. 245 00:10:36,290 --> 00:10:40,070 Let's sort of sum up what you need to know for the DP-203. 246 00:10:40,070 --> 00:10:43,960 The first is Spark is for big data 247 00:10:43,960 --> 00:10:46,290 or specialized applications. 248 00:10:46,290 --> 00:10:49,803 That's really where you're going to see Spark. 249 00:10:50,900 --> 00:10:53,660 Next, you're querying off of a table. 250 00:10:53,660 --> 00:10:56,750 Don't forget about that unbounded table 251 00:10:56,750 --> 00:11:00,250 where we are pulling specific pieces of time 252 00:11:00,250 --> 00:11:02,340 in almost like a batch process, 253 00:11:02,340 --> 00:11:05,390 and we're going to pull out those data stream pieces, 254 00:11:05,390 --> 00:11:09,190 run processing, and append them to that unbounded table. 255 00:11:09,190 --> 00:11:11,210 So don't forget about that. 256 00:11:11,210 --> 00:11:13,130 And then finally, it really comes down 257 00:11:13,130 --> 00:11:16,010 to--still--input, query, output. 258 00:11:16,010 --> 00:11:17,600 We define our input sources. 
259 00:11:17,600 --> 00:11:20,520 We define the work that we want to do in Spark, 260 00:11:20,520 --> 00:11:22,760 and then we define where it goes. 261 00:11:22,760 --> 00:11:24,760 So if you keep these 3 concepts in mind, 262 00:11:24,760 --> 00:11:27,580 you should be good to go for the DP-203. 263 00:11:27,580 --> 00:11:30,070 Again, don't spend a lot of time focusing on this 264 00:11:30,070 --> 00:11:34,220 because it is not a major component of the DP-203, 265 00:11:34,220 --> 00:11:35,330 but you do need to understand 266 00:11:35,330 --> 00:11:38,010 the highlights of what Spark is. 267 00:11:38,010 --> 00:11:39,510 Hope this has been helpful. 268 00:11:39,510 --> 00:11:41,333 Let's jump on to the next lesson.