Hey, Cloud Gurus, welcome to our last video on meeting the tools of the trade, finishing up with Apache Spark.

In this lesson, we'll have a bit of an overview, then go straight to a demo, and wrap everything up with a review. This is a super high-level look at using Apache Spark in Azure Data Factory, mostly focused on getting you familiar with how to access the Apache Spark transformations and be ready for exam questions around them.

To begin with, let's talk a bit about Apache Spark in Azure. This activity executes a Spark program on either your own or an on-demand HDInsight cluster. If you remember a few videos back, when we were looking at Azure Data Factory, you could see the HDInsight Spark activity listed there. One of its advantages is that Spark jobs are more extensible, allowing you to provide multiple files, such as Python scripts and jar packages.

The activity expects a certain folder structure in the Blob storage referenced by the HDInsight linked service, so once you have all of that set up, Data Factory will be able to use those files at runtime, provided they're properly placed.
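As a rough sketch of the kind of layout the activity looks for, here's one possible arrangement written as Python data. The container and file names ("sparkdemo", "wordcount.py", and so on) are hypothetical examples, and the subfolder names follow the pattern described in the Data Factory documentation, so verify them against your own setup:

```python
# Hypothetical layout of the Blob container referenced by the
# HDInsight linked service. "sparkdemo" and all file names are
# made-up examples; the subfolders hold the supporting scripts
# and jar packages mentioned in the lesson.
expected_layout = [
    "sparkdemo/wordcount.py",          # entry script for the Spark job
    "sparkdemo/pyFiles/helpers.py",    # additional Python scripts
    "sparkdemo/jars/support-lib.jar",  # supporting jar packages
    "sparkdemo/files/lookup.csv",      # other files the job reads
]

# The top-level Spark folder ("sparkdemo" here) is what the
# activity's root path setting would point at.
root = expected_layout[0].split("/")[0]
```

This is only an illustration of the directory convention, not something you run; in practice you'd upload these files to the storage account before triggering the pipeline.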
This really allows you to have very customized transformations, using one or more scripts, and one or more jar files to support them.

With that brief intro, let's jump straight into the action and see it in the portal.

Here we are back in Azure Data Factory Studio, in the same data factory we've been working with in other videos. I have a basic, blank pipeline opened up here, and as you can see, we have our HDInsight category. Opening that up, we have our Spark activity. Within it, we can specify how we want to handle the HDInsight cluster. If we have a linked service specified, we can certainly choose that, but we can also create a new one, and we have the option of bringing our own HDInsight cluster, if we've already got one set up, or using an on-demand one.

As you can see, we specify our Azure Storage linked service, and this is where the needed directory structure will be, starting with the top-level Spark folder, and then having your script and jar package folders under that.
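Everything configured in the portal here ultimately lands in the pipeline's JSON definition. As a hedged sketch, here's roughly what an HDInsightSpark activity definition looks like, written as a Python dict. The linked service names are placeholders I've invented, and while the property names follow the documented activity schema, treat this as an illustration to check against the current docs rather than a definitive definition:

```python
# Sketch of an HDInsightSpark activity definition as Python data.
# "MyHDInsightLinkedService" and "MyStorageLinkedService" are
# hypothetical names for the cluster and storage linked services.
spark_activity = {
    "name": "RunSparkJob",
    "type": "HDInsightSpark",
    "linkedServiceName": {                # the HDInsight cluster (own or on-demand)
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "rootPath": "sparkdemo",          # top-level Spark folder in Blob storage
        "entryFilePath": "wordcount.py",  # main script, relative to the root folder
        "sparkJobLinkedService": {        # storage account holding the job files
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference",
        },
    },
}
```

The point of the sketch is the shape: the cluster linked service and the storage linked service are configured separately, which is exactly the split you see in the portal UI.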
I won't take you through all of the options, since they'll be very specific to each situation, but just be aware that this is where you specify which cluster type you want, and the associated storage account.

I'll cancel out and discard those changes.

The other thing I wanted to show you is here under the Script/Jar heading. Here we give the specifics of our script's name and location. The linked service will default to the one for the cluster, unless you want to specify a different one, and of course you provide the file path, and any additional scripts.

Once you have everything set up, validated, and published, you can trigger this just like any other pipeline activity, and the output will be stored in that same Azure Blob storage account, within the same Spark directory that you created earlier.

By way of review, the Spark activity in Azure Data Factory allows you to execute Apache Spark programs on either your own or an on-demand HDInsight cluster. Spark job flexibility means you can provide multiple dependencies, such as jar packages, Python scripts, and other files.
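To make the review concrete, here's a sketch of the kind of entry script such an activity might submit: a simple word count. The counting logic is kept as a pure-Python helper so it can be sanity-checked without a cluster; the driver portion, the app name, and the storage paths are all hypothetical, and `main()` is only meant to run when the file is actually submitted to the cluster:

```python
def count_words(lines):
    """Pure-Python word count, kept separate from the Spark driver
    so the logic can be checked without a cluster."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def main():
    # Driver code that would run on the HDInsight cluster when
    # Data Factory submits this file. The wasbs:// paths below are
    # hypothetical placeholders for locations in the linked
    # Blob storage account.
    from pyspark.sql import SparkSession  # available on the cluster

    spark = SparkSession.builder.appName("AdfWordCount").getOrCreate()
    lines = spark.sparkContext.textFile(
        "wasbs://data@myaccount.blob.core.windows.net/input")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(
        "wasbs://data@myaccount.blob.core.windows.net/output")
    spark.stop()
```

A script like this would be uploaded as the entry file in the Spark folder, with any supporting modules going into the accompanying script folders.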
I hope that you have enjoyed this brief look at using Apache Spark transformations. This concludes our introduction to the tools of the trade. You now have several options that you can use in your arsenal for data ingestion and transformation. From here, we'll move on in the next video to creating data pipelines. When you're ready, I'll see you there.