Congratulations! You have almost finished this section. In this lesson, I'm going to talk to you about interpreting a Spark directed acyclic graph, hereafter known as a DAG.

We're going to start with what a DAG actually is. Then I'm going to jump in and talk about how we use a DAG in Spark, and follow that up with how we execute a job in Spark.

So, the basics of a DAG. Let's start with a vertex. This is just a point; it's also called a node. Then we have an edge. So we have two vertices, the two nodes (the two blue circles), and the edge that connects them. Now, if we take a bunch of these and put them in order like this, you can see that we have a graph, and that graph has a direction: it's moving down. Lastly, keep in mind that these graphs are acyclic, meaning we go from start to finish and never cycle back around to the beginning. So those are the basics of what a directed acyclic graph is.

All right, moving on, let's talk about what this looks like in Spark. To start with, we need to talk just a tiny bit about resilient distributed datasets, or RDDs. These are the main logical data units in Spark. A good way to think about an RDD is as a read-only, partitioned collection of records that we can perform work on in Spark.

So when we talk about the DAG, here's an example of what this looks like in practice. We have our collection of records, and we're going to perform some work on these resilient distributed datasets. We start off in stage 0 by pulling in a text file. Then we split each line of text into words. Once we've done that, we map each word into a word pair, and finally we sum the counts for each word.
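To make that concrete, here is a minimal PySpark sketch of that word count. The file path, app name, and the take(5) at the end are placeholders I've added for illustration; they aren't from the lesson.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                     # stage 0: read the text file into an RDD (path is a placeholder)
words = lines.flatMap(lambda line: line.split(" "))  # split each line into words
pairs = words.map(lambda word: (word, 1))            # map each word into a (word, 1) pair
counts = pairs.reduceByKey(lambda a, b: a + b)       # sum the counts for each word (the shuffle starts a new stage)

# The lineage printed below is the chain of RDDs that Spark turns into the DAG
# you see in the Spark UI's DAG visualization.
print(counts.toDebugString().decode("utf-8"))

print(counts.take(5))                                # the action that actually triggers the job
```

Notice that nothing runs until the action at the end; the transformations before it only describe the graph of work.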
What we're doing here is developing a DAG visualization. Basically, we take all of the different operations we're going to perform on those datasets and put them in order in a directed, acyclic fashion. That is what a simple DAG looks like as it executes in Spark.

Finally, let's take a closer look at that execution, and I'll show you what's happening behind the scenes. We start off with our driver. The driver takes the request it has been given and interacts with the cluster manager, asking for worker nodes to be spun up. The cluster manager allocates resources and spins up some workers. These workers are simply nodes that perform parts of the request in parallel. Inside each worker there is an executor, along with the individual tasks assigned to that worker. The other piece of this is that Spark creates an operator graph through its DAG scheduler: it takes all of those operations, breaks them down into tasks, and puts them into a DAG. The tasks in that DAG are then handed out to the workers, so each worker knows which parts of the job to perform. That is what's actually happening behind the scenes as we execute a job in Spark.
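As a rough illustration of those pieces (not something shown in the lesson), here is how an application might ask the cluster manager for executor resources when it builds its session. The master URL and the resource values are made-up placeholders, and the exact settings vary between standalone, YARN, and Kubernetes cluster managers.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dag-execution-demo")
    .master("spark://cluster-manager-host:7077")  # placeholder: where the driver finds the cluster manager
    .config("spark.executor.memory", "2g")        # memory for each executor on a worker node
    .config("spark.executor.cores", "2")          # cores each executor can use for its tasks
    .getOrCreate()
)

# From here, any action on an RDD or DataFrame is broken into stages and tasks
# by the DAG scheduler, and those tasks run on the executors in parallel.
```

The driver runs this code, the cluster manager decides where the executors live, and the DAG scheduler decides what they run.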
All right, we're going to stop here and start the wrap-up. First, this is a small topic, so don't go overboard. We did not go very deep into the DAG. There is a whole lot more information out there about Spark, how resilient distributed datasets work, and how the DAG works, but that is not in the scope of the DP-203. What we really need is a high-level understanding of what's happening with the DAG: understand how it connects with Spark, and understand a little bit about what's going on behind the scenes that makes the DAG scheduler work. We have covered that. You don't really need to go much beyond it, so don't waste a ton of time diving deeper just for the sake of the DP-203. What you really need to understand is what a DAG is and where it's used. If you understand those two things, you should be good to go for the DP-203 requirements. You can dive in further if you really interact with Spark a lot, or if you need to know more. All right, with that, we have wrapped up this lesson. I'll see you in the next one.