1 00:00:00,930 --> 00:00:01,930 Hey, what's up, Gurus? 2 00:00:01,930 --> 00:00:04,670 In this lesson, we are going to talk about processing data 3 00:00:04,670 --> 00:00:07,420 by using Spark Structured Streaming. 4 00:00:07,420 --> 00:00:09,010 Specifically, we're going to take a look 5 00:00:09,010 --> 00:00:10,580 at a high-level overview 6 00:00:10,580 --> 00:00:13,180 of Apache Spark Structured Streaming. 7 00:00:13,180 --> 00:00:16,300 And when I say high level, I mean high level. 8 00:00:16,300 --> 00:00:19,500 We could have an entire course on Spark, 9 00:00:19,500 --> 00:00:21,170 but we don't have that much time 10 00:00:21,170 --> 00:00:22,970 nor do we need to go to that much depth. 11 00:00:22,970 --> 00:00:24,610 So we're going to cover just the basics 12 00:00:24,610 --> 00:00:27,070 so that you understand how it functions. 13 00:00:27,070 --> 00:00:29,370 That includes input streams and tables, 14 00:00:29,370 --> 00:00:31,650 and a little bit about how that works. 15 00:00:31,650 --> 00:00:32,930 And then, of course, we're going to review 16 00:00:32,930 --> 00:00:35,780 some simple code scripts to process data 17 00:00:35,780 --> 00:00:38,940 so that you understand how the Apache Spark module 18 00:00:38,940 --> 00:00:40,990 in Data Factory works. 19 00:00:40,990 --> 00:00:41,900 All right, with that, 20 00:00:41,900 --> 00:00:44,990 let's get started with the high-level overview. 21 00:00:44,990 --> 00:00:47,370 So here we have an overview 22 00:00:47,370 --> 00:00:50,590 of the Spark Structured Streaming process. 23 00:00:50,590 --> 00:00:51,760 And we're going to break this down 24 00:00:51,760 --> 00:00:53,350 kind of section by section 25 00:00:53,350 --> 00:00:55,413 so that you understand what's happening. 26 00:00:56,400 --> 00:00:59,700 So our first section here is our input. 
27 00:00:59,700 --> 00:01:02,520 We have our TCP socket at the top there 28 00:01:02,520 --> 00:01:04,760 that you could use as a debugging source, 29 00:01:04,760 --> 00:01:06,500 or you have your main options, 30 00:01:06,500 --> 00:01:08,230 which are your message source, 31 00:01:08,230 --> 00:01:11,430 which would be like streaming through Kafka or Event Hubs. 32 00:01:11,430 --> 00:01:14,390 Or you could have a watcher set out 33 00:01:14,390 --> 00:01:16,230 looking at your data lake 34 00:01:16,230 --> 00:01:18,280 or a storage blob or something like that, 35 00:01:18,280 --> 00:01:20,230 and it would be watching for new files. 36 00:01:20,230 --> 00:01:23,170 As those files appear, it would input them 37 00:01:23,170 --> 00:01:26,940 into your Spark Structured Streaming. 38 00:01:26,940 --> 00:01:30,980 Now the streaming process is where the windowing occurs 39 00:01:30,980 --> 00:01:33,180 and normal transformations. 40 00:01:33,180 --> 00:01:36,210 So when we talk about Stream Analytics, 41 00:01:36,210 --> 00:01:37,750 it's very similar to that. 42 00:01:37,750 --> 00:01:39,740 You're going to have your processing 43 00:01:39,740 --> 00:01:42,320 where you set up a window of time that you're looking at, 44 00:01:42,320 --> 00:01:44,330 and as that data flows through the window, 45 00:01:44,330 --> 00:01:48,010 you're doing transformations on that data. 46 00:01:48,010 --> 00:01:51,180 Once you finish that, you're just going to have a sink 47 00:01:51,180 --> 00:01:53,920 and a sink is just an output location for the data. 48 00:01:53,920 --> 00:01:57,610 So you're going to send your data once it's been processed 49 00:01:57,610 --> 00:02:00,220 to a Power BI dashboard or a database 50 00:02:00,220 --> 00:02:03,950 or the data lake store, or something like that. 
51 00:02:03,950 --> 00:02:06,590 And then we also, just like we had a debug source, 52 00:02:06,590 --> 00:02:09,040 or an input, we also have a debug sink 53 00:02:09,040 --> 00:02:12,780 that you can use as well if you need to do debugging. 54 00:02:12,780 --> 00:02:16,060 So that's the basics of how structured streaming works. 55 00:02:16,060 --> 00:02:19,690 Essentially, it's just input, query, output. 56 00:02:19,690 --> 00:02:21,680 So keep that in mind. 57 00:02:21,680 --> 00:02:23,640 Now let's talk a little bit about tables 58 00:02:23,640 --> 00:02:27,160 and how the data actually gets processed. 59 00:02:27,160 --> 00:02:30,270 So we actually have 2 different tables. 60 00:02:30,270 --> 00:02:33,330 We have an input table, we run queries, 61 00:02:33,330 --> 00:02:37,100 and then we have our output or our results table. 62 00:02:37,100 --> 00:02:39,330 So our data stream here on the left 63 00:02:39,330 --> 00:02:41,520 would represent our input table. 64 00:02:41,520 --> 00:02:43,480 And we're going to take that data 65 00:02:43,480 --> 00:02:46,410 and we are going to do processing and send that through 66 00:02:46,410 --> 00:02:49,163 to an unbounded, or results, table. 67 00:02:50,570 --> 00:02:52,500 So now just a couple of things to keep in mind 68 00:02:52,500 --> 00:02:53,810 as we move forward with this. 69 00:02:53,810 --> 00:02:57,950 The first is taking a look at this unbounded table concept. 70 00:02:57,950 --> 00:03:01,620 So an unbounded table is one that doesn't have a boundary, 71 00:03:01,620 --> 00:03:04,660 which means it's going to continuously grow. 72 00:03:04,660 --> 00:03:07,080 And so this is an appending table, 73 00:03:07,080 --> 00:03:11,200 or it's being continuously appended or added to. 74 00:03:11,200 --> 00:03:14,000 So each one of these purple lines coming down 75 00:03:14,000 --> 00:03:18,100 is adding to this unbounded table. 
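To make that unbounded-table idea concrete, here's a tiny plain-Python toy (not actual Spark code — the sensor names and values are made up) showing an input stream being appended, micro-batch by micro-batch, to a results table that only ever grows:

```python
# Toy model of Spark's unbounded input table: each arriving
# micro-batch of rows (the "purple lines") is appended,
# so the table grows continuously and never shrinks.
unbounded_table = []

def append_micro_batch(table, batch):
    """Append one micro-batch of rows to the unbounded table."""
    table.extend(batch)
    return table

# Three triggers, each delivering a small batch of sensor rows
micro_batches = [
    [("sensor-1", 21.5)],
    [("sensor-1", 21.7), ("sensor-2", 19.9)],
    [("sensor-2", 20.1)],
]
for batch in micro_batches:
    append_micro_batch(unbounded_table, batch)

print(len(unbounded_table))  # 4 rows and counting
```

Each trigger only ever adds rows; nothing is removed, which is exactly why memory becomes a concern later in the lesson.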
76 00:03:18,100 --> 00:03:19,480 The next thing to keep in mind 77 00:03:19,480 --> 00:03:23,550 is this concept of streaming and batch. 78 00:03:23,550 --> 00:03:25,870 So Spark Structured Streaming 79 00:03:25,870 --> 00:03:28,680 is definitely a streaming service. 80 00:03:28,680 --> 00:03:32,120 However, you can see the lines coming off of this arrow. 81 00:03:32,120 --> 00:03:34,430 We're going to take these individual packets, 82 00:03:34,430 --> 00:03:36,420 and we're going to process them 83 00:03:36,420 --> 00:03:39,060 almost like we would a batch request. 84 00:03:39,060 --> 00:03:40,320 So it's kind of a mixture 85 00:03:40,320 --> 00:03:43,313 of streaming and batch, if you will. 86 00:03:44,820 --> 00:03:47,030 And then finally, incremental processing. 87 00:03:47,030 --> 00:03:50,290 We definitely need to understand incremental processing. 88 00:03:50,290 --> 00:03:52,990 And so this kind of goes to that concept 89 00:03:52,990 --> 00:03:55,210 of this batch processing. 90 00:03:55,210 --> 00:03:57,890 So we're going to take these incremental pieces 91 00:03:57,890 --> 00:04:01,010 and we're going to process those in parallel. 92 00:04:01,010 --> 00:04:02,880 And I'll show you this in the next slide, 93 00:04:02,880 --> 00:04:05,420 but keep in mind those 3 concepts 94 00:04:05,420 --> 00:04:07,590 as we talk about this unbounded table 95 00:04:07,590 --> 00:04:09,900 that continuously grows. 96 00:04:09,900 --> 00:04:11,420 All right, so as we look at this, 97 00:04:11,420 --> 00:04:14,100 you need to understand that there's 3 different modes 98 00:04:14,100 --> 00:04:16,180 for outputting our table, 99 00:04:16,180 --> 00:04:17,300 and keep in mind the concept, 100 00:04:17,300 --> 00:04:19,700 we have the input, we're running processing, 101 00:04:19,700 --> 00:04:21,080 and then we're going to output it 102 00:04:21,080 --> 00:04:24,900 so you can see your input, result, and output, same thing. 
103 00:04:24,900 --> 00:04:27,090 There's 3 different choices, though. 104 00:04:27,090 --> 00:04:30,330 We have a complete mode or a complete rewrite. 105 00:04:30,330 --> 00:04:31,260 And so in this mode, 106 00:04:31,260 --> 00:04:35,530 we are going to completely rewrite the result table. 107 00:04:35,530 --> 00:04:37,610 Second option is append mode. 108 00:04:37,610 --> 00:04:39,930 So like we've talked about before, 109 00:04:39,930 --> 00:04:43,500 append mode is only going to update the new 110 00:04:43,500 --> 00:04:45,220 or the appended result table. 111 00:04:45,220 --> 00:04:46,760 It's not going to rewrite everything. 112 00:04:46,760 --> 00:04:49,640 It's just going to write since the last trigger, 113 00:04:49,640 --> 00:04:51,433 so it's just going to continue to add. 114 00:04:52,300 --> 00:04:55,160 And then we also have an update mode. 115 00:04:55,160 --> 00:04:57,360 So this is pretty much the exact same thing 116 00:04:57,360 --> 00:05:02,130 as the append mode, except for the concept of aggregations. 117 00:05:02,130 --> 00:05:04,570 So it's only going to update the rows 118 00:05:04,570 --> 00:05:07,500 that have changed since the last trigger, 119 00:05:07,500 --> 00:05:10,230 whereas the append mode is just going to add new rows. 120 00:05:10,230 --> 00:05:13,180 It's not looking at changing or not changing. 121 00:05:13,180 --> 00:05:17,170 So if you had an update mode that didn't have any changes 122 00:05:17,170 --> 00:05:19,933 or aggregations, it would be the exact same as append. 123 00:05:21,250 --> 00:05:23,680 All right, those are the 3 different types 124 00:05:23,680 --> 00:05:25,970 of table outputs. 125 00:05:25,970 --> 00:05:29,260 And so now let's move over here and look at this chart. 126 00:05:29,260 --> 00:05:32,540 So you can see we have time moving from left to right, 127 00:05:32,540 --> 00:05:35,240 and there's a trigger every second. 
128 00:05:35,240 --> 00:05:36,380 So we have our window moving, 129 00:05:36,380 --> 00:05:38,800 and every second we're going to have a trigger. 130 00:05:38,800 --> 00:05:40,620 So it's going to look at that data, 131 00:05:40,620 --> 00:05:43,440 batch that up as a one-second trigger, 132 00:05:43,440 --> 00:05:45,220 and it's going to take that, 133 00:05:45,220 --> 00:05:47,270 and it's going to pull all the input. 134 00:05:47,270 --> 00:05:51,600 So everything for data up to one second, it's going to pull. 135 00:05:51,600 --> 00:05:54,070 It's then going to run our query. 136 00:05:54,070 --> 00:05:57,200 So whatever query language we have, it's going to run that. 137 00:05:57,200 --> 00:06:01,740 Once it's done that, then it's going to give us our result. 138 00:06:01,740 --> 00:06:04,290 So for that, we have our output. 139 00:06:04,290 --> 00:06:07,410 So we get our result, then we push it through in the output. 140 00:06:07,410 --> 00:06:11,920 And in this chart, we're looking at a complete mode table. 141 00:06:11,920 --> 00:06:13,290 So what's going to happen then 142 00:06:13,290 --> 00:06:16,030 is we're going to look at the 2nd second. 143 00:06:16,030 --> 00:06:20,060 And what it's going to do is query data 144 00:06:20,060 --> 00:06:21,860 up to 1 and 2 seconds. 145 00:06:21,860 --> 00:06:24,640 So the input doubled in size 146 00:06:24,640 --> 00:06:28,040 because we're completely rewriting that table. 147 00:06:28,040 --> 00:06:29,830 Then we're going to run our query. 148 00:06:29,830 --> 00:06:31,520 We're going to get our result for those 2, 149 00:06:31,520 --> 00:06:34,120 and then we're going to spit out our output. 150 00:06:34,120 --> 00:06:36,210 Trigger 3 seconds, same thing. 151 00:06:36,210 --> 00:06:38,360 We're going to pull all the data up to 3 seconds. 152 00:06:38,360 --> 00:06:40,450 Then we're going to run our query, get our result, 153 00:06:40,450 --> 00:06:43,080 and then spit out our output. 
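That trigger-by-trigger behavior, and how the 3 output modes differ, can be sketched in plain Python (a toy, not Spark — the cat/dog keys are just stand-ins for whatever you'd be counting):

```python
# Toy comparison of the 3 output modes, using a count-by-key
# aggregation over micro-batches (plain Python, not Spark).
from collections import Counter

def run_triggers(batches):
    totals = Counter()                  # the result table (key -> count)
    complete, append, update = [], [], []
    for batch in batches:
        changed = set(batch)            # keys touched this trigger
        new = changed - set(totals)     # keys seen for the first time
        totals.update(batch)
        complete.append(dict(totals))                   # rewrite everything
        append.append({k: totals[k] for k in new})      # only brand-new rows
        update.append({k: totals[k] for k in changed})  # only changed rows
    return complete, append, update

complete, append, update = run_triggers([["cat"], ["cat", "dog"], ["dog"]])
print(complete[-1])  # {'cat': 2, 'dog': 2} -- full table at every trigger
print(append[1])     # {'dog': 1}           -- only the row added this trigger
print(update[1])     # {'cat': 2, 'dog': 1} -- every row that changed this trigger
```

One caveat worth knowing: in real Structured Streaming, append mode combined with aggregations requires a watermark, so a row is only emitted once it can no longer change.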
154 00:06:43,080 --> 00:06:45,300 So when you look at Spark Structured Streaming, 155 00:06:45,300 --> 00:06:46,410 that's what's happening. 156 00:06:46,410 --> 00:06:49,790 But keep in mind that this table right here 157 00:06:49,790 --> 00:06:51,040 is what that looks like. 158 00:06:51,040 --> 00:06:53,210 So each one of those little arrows 159 00:06:53,210 --> 00:06:56,220 is actually one of those blocks that we're looking at 160 00:06:56,220 --> 00:06:58,013 as we move forward in time. 161 00:06:59,450 --> 00:07:00,880 So to wrap this section up, 162 00:07:00,880 --> 00:07:03,660 let's talk about what's happening. 163 00:07:03,660 --> 00:07:06,870 So we have our input data stream. Data comes in, 164 00:07:06,870 --> 00:07:09,270 and it moves into an input table. 165 00:07:09,270 --> 00:07:10,910 Once it hits that input table, 166 00:07:10,910 --> 00:07:13,060 it's then going to be pulled out, 167 00:07:13,060 --> 00:07:16,530 and a query is going to be run on that data. 168 00:07:16,530 --> 00:07:20,090 It's then going to be filtered into our results table. 169 00:07:20,090 --> 00:07:21,840 And there's really a couple of different ways. 170 00:07:21,840 --> 00:07:23,280 And there's 3, but we're only going to talk 171 00:07:23,280 --> 00:07:25,040 about the complete and append, 172 00:07:25,040 --> 00:07:27,670 because the update mode is very similar. 173 00:07:27,670 --> 00:07:32,200 So in append mode, let's say that we have an IoT sensor, 174 00:07:32,200 --> 00:07:35,090 which is a very common application for Spark. 175 00:07:35,090 --> 00:07:40,090 And with this IoT sensor, we are looking at temperature. 176 00:07:40,270 --> 00:07:41,890 Let's say we have a manufacturing line, 177 00:07:41,890 --> 00:07:44,130 and we have an IoT sensor 178 00:07:44,130 --> 00:07:46,130 that is going to detect the temperature of the machine, 179 00:07:46,130 --> 00:07:49,050 and so it's going to help to spot any abnormalities. 
180 00:07:49,050 --> 00:07:50,510 With that temperature sensor, 181 00:07:50,510 --> 00:07:55,190 we can just simply append that data into a results table. 182 00:07:55,190 --> 00:07:58,230 So that's a fantastic use of append. 183 00:07:58,230 --> 00:08:00,980 Now complete, one question you may be thinking 184 00:08:00,980 --> 00:08:03,060 in your mind is, well, what happens 185 00:08:03,060 --> 00:08:06,710 if my data goes from 1 to 10,000? 186 00:08:06,710 --> 00:08:09,130 Am I eventually going to run out of memory? 187 00:08:09,130 --> 00:08:13,410 And the answer is, yeah, you definitely will. 188 00:08:13,410 --> 00:08:18,410 If you are simply continuing to do a complete mode, 189 00:08:18,670 --> 00:08:22,150 over time as that grows bigger and bigger and bigger, 190 00:08:22,150 --> 00:08:23,840 you're eventually going to run out of memory. 191 00:08:23,840 --> 00:08:26,960 So a better use of complete mode is to look at windows. 192 00:08:26,960 --> 00:08:28,670 So let's say we have a window, 193 00:08:28,670 --> 00:08:31,610 and we are going to sum up all of our data. 194 00:08:31,610 --> 00:08:32,900 And you'll see this a little in the script 195 00:08:32,900 --> 00:08:34,080 that we do in a second, 196 00:08:34,080 --> 00:08:35,980 but we're going to sum up all of our data 197 00:08:35,980 --> 00:08:37,750 for a 5-second window, 198 00:08:37,750 --> 00:08:40,770 and then we're going to write all of that. 199 00:08:40,770 --> 00:08:43,610 Well, so that would be a good use of a complete mode. 200 00:08:43,610 --> 00:08:46,340 It's really designed to be used with aggregations, 201 00:08:46,340 --> 00:08:51,330 not just to continue to add up over and over and over 202 00:08:51,330 --> 00:08:52,330 more and more data. 203 00:08:52,330 --> 00:08:55,610 So keep that in mind for your complete mode, 204 00:08:55,610 --> 00:08:57,180 generally with aggregations, 205 00:08:57,180 --> 00:08:59,730 and keep in mind what you use the append mode for. 
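Here's what that 5-second-window aggregation looks like as a plain-Python toy (the timestamps and temperatures are made up — this just shows the bucketing, not Spark itself):

```python
# Toy version of the 5-second-window aggregation: bucket each
# (timestamp, temperature) reading into its tumbling window,
# then sum per window -- the kind of bounded result that
# complete mode is actually suited for.
from collections import defaultdict

readings = [(0, 20.0), (3, 21.0), (5, 19.0), (9, 22.0), (11, 18.0)]

window_sums = defaultdict(float)
for ts, temp in readings:
    window_start = (ts // 5) * 5          # start of the 5-second window
    window_sums[window_start] += temp

print(dict(window_sums))  # {0: 41.0, 5: 41.0, 10: 18.0}
```

The result table now has one row per window rather than one row per reading, so rewriting it completely on every trigger stays cheap.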
206 00:09:00,570 --> 00:09:02,540 To kind of finish out our discussion, 207 00:09:02,540 --> 00:09:04,490 I want to take a look at a script 208 00:09:04,490 --> 00:09:06,270 that we could use in Data Factory 209 00:09:06,270 --> 00:09:08,520 and kind of break down what's happening here. 210 00:09:09,360 --> 00:09:12,360 So what we have is we have our input, 211 00:09:12,360 --> 00:09:14,500 which is these first 2 blocks. 212 00:09:14,500 --> 00:09:16,010 We're going to import the pieces 213 00:09:16,010 --> 00:09:18,610 that we need to make this script function. 214 00:09:18,610 --> 00:09:23,100 And then we are going to provide the input data. 215 00:09:23,100 --> 00:09:24,670 So that's the input path. 216 00:09:24,670 --> 00:09:26,450 That's the schema that we're going to use, 217 00:09:26,450 --> 00:09:30,690 and then that is our streaming information as well. 218 00:09:30,690 --> 00:09:34,050 So we have all of that defined as our input. 219 00:09:34,050 --> 00:09:36,230 And so then we do work, and you can see that, 220 00:09:36,230 --> 00:09:39,710 in this case, we are aggregating our data together 221 00:09:39,710 --> 00:09:42,040 based upon time and temperature, 222 00:09:42,040 --> 00:09:44,000 so this is probably like an IoT sensor 223 00:09:44,000 --> 00:09:45,790 or something like that. 224 00:09:45,790 --> 00:09:48,690 And then finally, we have our output here. 225 00:09:48,690 --> 00:09:52,490 And so this is defining our output sink. 226 00:09:52,490 --> 00:09:54,510 And then finally we are going to get going, 227 00:09:54,510 --> 00:09:56,560 which means we're going to actually start the query. 228 00:09:56,560 --> 00:09:58,560 So this is the command that you would use 229 00:09:58,560 --> 00:10:00,033 to actually start the query. 230 00:10:01,070 --> 00:10:03,410 So this is the Spark Activity 231 00:10:03,410 --> 00:10:05,870 in Data Factory at a high level. 232 00:10:05,870 --> 00:10:09,620 Now for the DP-203, what do you actually need to know? 
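The slide's script itself isn't reproduced in this transcript, but a typical Structured Streaming job in PySpark follows that same input / query / output shape. Treat the sketch below as pseudocode: the path, schema fields, window size, and sink options are hypothetical stand-ins, not the slide's actual values.

```
# -- input: imports, schema, and a file-watcher source --
from pyspark.sql.functions import window, avg
from pyspark.sql.types import StructType

inputPath = "/mnt/datalake/sensors/"          # hypothetical path
schema = (StructType()
          .add("time", "timestamp")
          .add("temperature", "double"))
inputDF = (spark.readStream
           .schema(schema)
           .json(inputPath))                  # watch for new files appearing

# -- query: aggregate temperature over time windows --
resultDF = (inputDF
            .groupBy(window(inputDF.time, "5 seconds"))
            .agg(avg("temperature")))

# -- output: define the sink, then actually start the query --
query = (resultDF.writeStream
         .outputMode("complete")
         .format("memory")                    # debug sink for testing
         .queryName("temps")
         .start())
```

The three comment blocks line up with the three blocks on the slide: define the input, define the work, define where it goes and start it.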
233 00:10:09,620 --> 00:10:12,800 Well, not a whole lot, as far as the code is concerned. 234 00:10:12,800 --> 00:10:15,280 Basically, if you understand the blocks, 235 00:10:15,280 --> 00:10:18,060 the input, the query, the output block, 236 00:10:18,060 --> 00:10:20,480 and you could recognize kind of those blocks 237 00:10:20,480 --> 00:10:22,430 and where they sit, then you're just fine. 238 00:10:22,430 --> 00:10:23,830 I think it's wildly unlikely 239 00:10:23,830 --> 00:10:25,870 that you're going to need to know individual functions 240 00:10:25,870 --> 00:10:27,040 or things like that. 241 00:10:27,040 --> 00:10:28,640 So really just being able to glance 242 00:10:28,640 --> 00:10:31,150 and understand kind of what's happening at a high level 243 00:10:31,150 --> 00:10:34,033 should be just fine for this specific slide. 244 00:10:34,880 --> 00:10:36,290 So we've talked about quite a bit. 245 00:10:36,290 --> 00:10:40,070 Let's sort of sum up what you need to know for the DP-203. 246 00:10:40,070 --> 00:10:43,960 The first is Spark is for big data 247 00:10:43,960 --> 00:10:46,290 or specialized applications. 248 00:10:46,290 --> 00:10:49,803 That's really where you're going to see Spark. 249 00:10:50,900 --> 00:10:53,660 Next, you're querying off of a table. 250 00:10:53,660 --> 00:10:56,750 Don't forget about that unbounded table 251 00:10:56,750 --> 00:11:00,250 where we are pulling specific pieces of time 252 00:11:00,250 --> 00:11:02,340 in almost like a batch process, 253 00:11:02,340 --> 00:11:05,390 and we're going to pull out those data stream pieces, 254 00:11:05,390 --> 00:11:09,190 run processing, and append them to that unbounded table. 255 00:11:09,190 --> 00:11:11,210 So don't forget about that. 256 00:11:11,210 --> 00:11:13,130 And then finally, it really comes down 257 00:11:13,130 --> 00:11:16,010 to--still--input, query, output. 258 00:11:16,010 --> 00:11:17,600 We define our input sources. 
259 00:11:17,600 --> 00:11:20,520 We define the work that we want to do in Spark, 260 00:11:20,520 --> 00:11:22,760 and then we define where it goes. 261 00:11:22,760 --> 00:11:24,760 So if you keep these 3 concepts in mind, 262 00:11:24,760 --> 00:11:27,580 you should be good to go for the DP-203. 263 00:11:27,580 --> 00:11:30,070 Again, don't spend a lot of time focusing on this 264 00:11:30,070 --> 00:11:34,220 because it is not a major component of the DP-203, 265 00:11:34,220 --> 00:11:35,330 but you do need to understand 266 00:11:35,330 --> 00:11:38,010 the highlights of what Spark is. 267 00:11:38,010 --> 00:11:39,510 Hope this has been helpful. 268 00:11:39,510 --> 00:11:41,333 Let's jump on to the next lesson.