Hey, Cloud Gurus, welcome to our last video on meeting the tools of the trade, finishing up with Apache Spark.

In this lesson, we'll have a bit of an overview, then go straight to a demo, and wrap everything up with a review. This is a super high-level look at using Apache Spark in Azure Data Factory, mostly focused on getting you familiar with how to access the Apache Spark transformations and be ready for exam questions around them.

To begin with, let's talk a bit about Apache Spark in Azure. This activity executes a Spark program on either your own or an on-demand HDInsight cluster. If you remember a few videos back, when we were looking at Azure Data Factory, you could see the HDInsight Spark activity listed there. One of its advantages is that Spark jobs are more extensible, allowing you to provide multiple files, such as Python scripts and jar packages.

The activity expects a certain folder structure in the Blob storage referenced by the HDInsight linked service, so once you have all of that set up, Data Factory will be able to use those files at runtime, provided they're properly placed.
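As a rough sketch of the kind of layout the activity looks for, here's one possible arrangement written as Python data. The container and file names ("sparkdemo", "wordcount.py", and so on) are hypothetical examples, and the subfolder names follow the pattern described in the Data Factory documentation, so verify them against your own setup:

```python
# Hypothetical layout of the Blob container referenced by the
# HDInsight linked service. "sparkdemo" and all file names are
# made-up examples; the subfolders hold the supporting scripts
# and jar packages mentioned in the lesson.
expected_layout = [
    "sparkdemo/wordcount.py",          # entry script for the Spark job
    "sparkdemo/pyFiles/helpers.py",    # additional Python scripts
    "sparkdemo/jars/support-lib.jar",  # supporting jar packages
    "sparkdemo/files/lookup.csv",      # other files the job reads
]

# The top-level Spark folder ("sparkdemo" here) is what the
# activity's root path setting would point at.
root = expected_layout[0].split("/")[0]
```

This is only an illustration of the directory convention, not something you run; in practice you'd upload these files to the storage account before triggering the pipeline.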
This really allows you to have very customized transformations, using one or more scripts, and one or more jar files to support them.

With that brief intro, let's jump straight into the action and see it in the portal.

Here we are back in Azure Data Factory Studio, in the same data factory we've been working with in other videos. I have a basic, blank pipeline opened up here, and as you can see, we have our HDInsight category. Opening that up, we have our Spark activity. Within it, we can specify how we want to handle the HDInsight cluster. If we have a linked service specified, we can certainly choose that, but we can also create a new one, and we have the option of bringing our own HDInsight cluster, if we've already got one set up, or using an on-demand one.

As you can see, we specify our Azure Storage linked service, and this is where the needed directory structure will be, starting with the top-level Spark folder, and then having your script and jar package folders under that.
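Everything configured in the portal here ultimately lands in the pipeline's JSON definition. As a hedged sketch, here's roughly what an HDInsightSpark activity definition looks like, written as a Python dict. The linked service names are placeholders I've invented, and while the property names follow the documented activity schema, treat this as an illustration to check against the current docs rather than a definitive definition:

```python
# Sketch of an HDInsightSpark activity definition as Python data.
# "MyHDInsightLinkedService" and "MyStorageLinkedService" are
# hypothetical names for the cluster and storage linked services.
spark_activity = {
    "name": "RunSparkJob",
    "type": "HDInsightSpark",
    "linkedServiceName": {                # the HDInsight cluster (own or on-demand)
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "rootPath": "sparkdemo",          # top-level Spark folder in Blob storage
        "entryFilePath": "wordcount.py",  # main script, relative to the root folder
        "sparkJobLinkedService": {        # storage account holding the job files
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference",
        },
    },
}
```

The point of the sketch is the shape: the cluster linked service and the storage linked service are configured separately, which is exactly the split you see in the portal UI.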
I won't take you through all of the options, since they'll be very specific to each situation, but just be aware that this is where you specify which cluster type you want, and the associated storage account.

I'll cancel out and discard those changes.

The other thing I wanted to show you is here under the Script/Jar heading. Here we give the specifics of our script's name and location. The linked service will default to the one for the cluster, unless you want to specify a different one, and of course you provide the file path, and any additional scripts.

Once you have everything set up, validated, and published, you can trigger this just like any other pipeline activity, and the output will be stored in that same Azure Blob storage account, within the same Spark directory that you created earlier.

By way of review, the Spark activity in Azure Data Factory allows you to execute Apache Spark programs on either your own or an on-demand HDInsight cluster. Spark job flexibility means you can provide multiple dependencies, such as jar packages, Python scripts, and other files.
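To make the review concrete, here's a sketch of the kind of entry script such an activity might submit: a simple word count. The counting logic is kept as a pure-Python helper so it can be sanity-checked without a cluster; the driver portion, the app name, and the storage paths are all hypothetical, and `main()` is only meant to run when the file is actually submitted to the cluster:

```python
def count_words(lines):
    """Pure-Python word count, kept separate from the Spark driver
    so the logic can be checked without a cluster."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def main():
    # Driver code that would run on the HDInsight cluster when
    # Data Factory submits this file. The wasbs:// paths below are
    # hypothetical placeholders for locations in the linked
    # Blob storage account.
    from pyspark.sql import SparkSession  # available on the cluster

    spark = SparkSession.builder.appName("AdfWordCount").getOrCreate()
    lines = spark.sparkContext.textFile(
        "wasbs://data@myaccount.blob.core.windows.net/input")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(
        "wasbs://data@myaccount.blob.core.windows.net/output")
    spark.stop()
```

A script like this would be uploaded as the entry file in the Spark folder, with any supporting modules going into the accompanying script folders.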
I hope that you have enjoyed this brief look at using Apache Spark transformations. This concludes our introduction to the tools of the trade. You now have several options that you can use in your arsenal for data ingestion and transformation. From here, we'll move on in the next video to creating data pipelines. When you're ready, I'll see you there.