So this lesson is going to be a little bit different, in that there's not a whole lot of Spark on the DP-203. In this lesson, I'm going to talk to you about how we manage Spark jobs in a pipeline. So since it's on here, I'm going to show you how we implement Spark activities in Data Factory, and then I'm going to show you in the Azure portal how you do that as well, but we're not going to spend a lot of time here because it's not a main component for the DP-203.

Having said that, I'm going to show you 2 different ways to do this. The first is using JSON. When we execute Spark activities using an HDInsight cluster, we're going to use JSON that looks like this over here on the right, and I'm going to take just a couple of minutes to walk you through what that all means.

So let's start off with the top section here and talk about the basics. We're going to name our Spark activity. We're going to give it a description, whatever we want. The type, of course, is going to be HDInsightSpark.
And then our linkedServiceName is simply where our Spark program is going to run, so our compute. Below that, we then define the type properties. We define our sparkJobLinkedService, and this is where our Spark file is located. So in Azure Storage, where's our Spark job file located? And there are only 2 choices, and this is actually important: Blob Storage and ADLS Gen2. So Data Lake and Blob Storage are currently the only 2 options where you can store your Spark jobs.

Next up, we do just the basics and describe where that file lives. We can also include some config information if we have specific values for Spark properties that we want to set. And then finally, we have our debug info (getDebugInfo) that we can include. The default is Failure, but we have 3 options: None, Always, or Failure. We're simply specifying where those Spark log files are going to be copied and when they're going to be copied.

So that's it for JSON. Now, let me jump over into the portal and let's take a look at this in Data Factory.
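To tie those pieces together, a complete HDInsightSpark activity definition might look something like this. This is only a sketch: the activity name, linked service names, paths, and config values here are placeholders, not values from the lesson.

```json
{
    "name": "MySparkActivity",
    "description": "Runs a Spark job on an HDInsight cluster",
    "type": "HDInsightSpark",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "sparkJobLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "rootPath": "sparkjobs",
        "entryFilePath": "scripts/wordcount.py",
        "sparkConfig": {
            "spark.executor.memory": "2g"
        },
        "getDebugInfo": "Failure"
    }
}
```

Here sparkJobLinkedService points at the Blob Storage or ADLS Gen2 account that holds the job file, rootPath and entryFilePath describe where the file lives within it, and getDebugInfo controls when the Spark log files get copied back to that storage.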
So here we are in our Data Factory, and I've opened up a pipeline, and you can see that we also have a choice for an activity under HDInsight. So you would just click on the arrow, scroll down, grab your Spark activity, drag that onto the canvas, and then you can see here, we have very similar information. We can name and describe our Spark activity. We do have an option to retry if we want to. We can define our HDInsight cluster here. And when we define that cluster, you can see the cluster, and then below that we have our storage link for either Blob or ADLS Gen2. So we're going to define that connection there.

Then we're going to define our script here, so we can include where our file is located. And after that, we can include some user properties if we so choose. And that's it. You just simply build this into your pipeline. And as far as this lesson goes, that's really all you need to know for the DP-203. Pretty short. We just need to know how we implement an activity and what it looks like.
If you've got that, hey, you're good to go on to the next lesson. I'll see you there.