So this lesson is going to be a little bit different, in that there's not a whole lot of Spark on the DP-203. In this lesson, I'm going to talk to you about how we manage Spark jobs in a pipeline. So since it's on here, I'm going to show you how we implement Spark activities in Data Factory, and then I'm going to show you in the Azure portal how you do that as well, but we're not going to spend a lot of time here because it's not a main component for the DP-203.

Having said that, I'm going to show you 2 different ways to do this. The first is using JSON. When we execute Spark activities using an HDInsight cluster, we're going to use JSON that looks like this over here on the right, and I'm going to take just a couple of minutes to walk you through what that all means.

So let's start off with the top section here and talk about the basics. We're going to name our Spark activity. We're going to give it a description, whatever we want. The type, of course, is going to be HDInsightSpark.
And then our linkedServiceName is simply where our Spark program is going to run, so our compute. Below that, we then define the type properties. We define our sparkJobLinkedService, and this is where our Spark file is located. So in Azure Storage, where's our Spark job file located? And there are only 2 choices, and this is actually important: Blob Storage and ADLS Gen2. So Data Lake and Blob Storage are currently the only 2 options where you can store your Spark jobs.

Next up, we do just the basics and describe where that file lives. We can also include some config information if we have specific values for Spark properties that we want to set. And then finally, we have our debug info (getDebugInfo) that we can include. The default is Failure, but we have 3 options: None, Always, or Failure. We're simply specifying where those Spark log files are going to be copied and when they're going to be copied.

So that's it for JSON. Now, let me jump over into the portal and let's take a look at this in Data Factory.
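To tie those pieces together, a complete HDInsightSpark activity definition might look something like this. This is only a sketch: the activity name, linked service names, paths, and config values here are placeholders, not values from the lesson.

```json
{
    "name": "MySparkActivity",
    "description": "Runs a Spark job on an HDInsight cluster",
    "type": "HDInsightSpark",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "sparkJobLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "rootPath": "sparkjobs",
        "entryFilePath": "scripts/wordcount.py",
        "sparkConfig": {
            "spark.executor.memory": "2g"
        },
        "getDebugInfo": "Failure"
    }
}
```

Here sparkJobLinkedService points at the Blob Storage or ADLS Gen2 account that holds the job file, rootPath and entryFilePath describe where the file lives within it, and getDebugInfo controls when the Spark log files get copied back to that storage.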
So here we are in our Data Factory, and I've opened up a pipeline, and you can see that we also have a choice for an activity under HDInsight. So you would just click on the arrow, scroll down, grab your Spark activity, drag that onto the canvas, and then you can see here, we have very similar information. We can name and describe our Spark activity. We do have an option to retry if we want to. We can define our HDInsight cluster here. And when we define that cluster, you can see the cluster, and then below that we have our storage link for either Blob or ADLS Gen2. So we're going to define that connection there.

Then we're going to define our script here, so we can include where our file is located. And after that, we can include some user properties if we so choose. And that's it. You just simply build this into your pipeline. And as far as this lesson goes, that's really all you need to know for the DP-203. Pretty short. We just need to know how we implement an activity and what it looks like.
If you've got that, hey, you're good to go on to the next lesson. I'll see you there.