Congratulations! You have almost finished this section. In this lesson, I'm going to talk to you about interpreting a Spark directed acyclic graph, hereafter known as a DAG.

We're going to start with what a DAG actually is. Then I'm going to jump in and talk about how we use a DAG in Spark, and follow that up with how we execute a job in Spark.

So, the basics of a DAG. Let's start with a vertex. This is just a point; it's also called a node. Then we have an edge. So we have two vertices, the two nodes (the two blue circles), and the edge that connects them. Now, if we take a bunch of these and put them in order like this, you can see that we have a graph, and that graph has a direction: it's moving down. Lastly, keep in mind that these graphs are acyclic, meaning we go from start to finish and never cycle back around to the beginning. So those are the basics of what a directed acyclic graph is.

All right, moving on, let's talk about what this looks like in Spark. To start with, we need to talk just a tiny bit about resilient distributed datasets, or RDDs. These are the main logical data units in Spark. A good way to think about an RDD is as a read-only, partitioned collection of records that we can perform work on in Spark.

So when we talk about the DAG, here's an example of what this looks like in practice. We have our collection of records, and we're going to perform some work on these resilient distributed datasets. We start off in stage 0 by pulling in a text file. Then we split each line of text into words. Once we've done that, we map each word into a word pair, and finally we sum the counts for each word.
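To make that concrete, here is a minimal PySpark sketch of that word count. The file path, app name, and the take(5) at the end are placeholders I've added for illustration; they aren't from the lesson.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                     # stage 0: read the text file into an RDD (path is a placeholder)
words = lines.flatMap(lambda line: line.split(" "))  # split each line into words
pairs = words.map(lambda word: (word, 1))            # map each word into a (word, 1) pair
counts = pairs.reduceByKey(lambda a, b: a + b)       # sum the counts for each word (the shuffle starts a new stage)

# The lineage printed below is the chain of RDDs that Spark turns into the DAG
# you see in the Spark UI's DAG visualization.
print(counts.toDebugString().decode("utf-8"))

print(counts.take(5))                                # the action that actually triggers the job
```

Notice that nothing runs until the action at the end; the transformations before it only describe the graph of work.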
What we're doing here is developing a DAG visualization. Basically, we take all of the different operations we're going to perform on those datasets and put them in order in a directed, acyclic fashion. That is what a simple DAG looks like as it executes in Spark.

Finally, let's take a closer look at that execution, and I'll show you what's happening behind the scenes. We start off with our driver. The driver takes the request it has been given and interacts with the cluster manager, asking for worker nodes to be spun up. The cluster manager allocates resources and spins up some workers. These workers are simply nodes that perform parts of the request in parallel. Inside each worker there is an executor, along with the individual tasks assigned to that worker. The other piece of this is that Spark creates an operator graph through its DAG scheduler: it takes all of those operations, breaks them down into tasks, and puts them into a DAG. The tasks in that DAG are then handed out to the workers, so each worker knows which parts of the job to perform. That is what's actually happening behind the scenes as we execute a job in Spark.
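As a rough illustration of those pieces (not something shown in the lesson), here is how an application might ask the cluster manager for executor resources when it builds its session. The master URL and the resource values are made-up placeholders, and the exact settings vary between standalone, YARN, and Kubernetes cluster managers.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dag-execution-demo")
    .master("spark://cluster-manager-host:7077")  # placeholder: where the driver finds the cluster manager
    .config("spark.executor.memory", "2g")        # memory for each executor on a worker node
    .config("spark.executor.cores", "2")          # cores each executor can use for its tasks
    .getOrCreate()
)

# From here, any action on an RDD or DataFrame is broken into stages and tasks
# by the DAG scheduler, and those tasks run on the executors in parallel.
```

The driver runs this code, the cluster manager decides where the executors live, and the DAG scheduler decides what they run.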
All right, we're going to stop here and start the wrap-up. First, this is a small topic, so don't go overboard. We did not go very deep into the DAG. There is a whole lot more information out there about Spark, how resilient distributed datasets work, and how the DAG works, but that is not in the scope of the DP-203. What we really need is a high-level understanding of what's happening with the DAG: understand how it connects with Spark, and understand a little bit about what's going on behind the scenes that makes the DAG scheduler work. We have covered that. You don't really need to go much beyond it, so don't waste a ton of time diving deeper just for the sake of the DP-203. What you really need to understand is what a DAG is and where it's used. If you understand those two things, you should be good to go for the DP-203 requirements. You can dive in further if you really interact with Spark a lot, or if you need to know more. All right, with that, we have wrapped up this lesson. I'll see you in the next one.