1
00:00:00,550 --> 00:00:01,430
In this lesson,
2
00:00:01,430 --> 00:00:04,680
we are going to talk about "Tuning Shuffle Partitions".
3
00:00:04,680 --> 00:00:06,240
And I'm going to tell you right off the bat,
4
00:00:06,240 --> 00:00:08,180
I'm having a little bit of a struggle,
5
00:00:08,180 --> 00:00:11,010
because when I think of tuning shuffle partitions,
6
00:00:11,010 --> 00:00:13,930
I'm kind of torn between shuffling cards
7
00:00:13,930 --> 00:00:16,010
and shuffling as a dance move.
8
00:00:16,010 --> 00:00:17,100
So, I don't know.
9
00:00:17,100 --> 00:00:18,270
I threw them both on there.
10
00:00:18,270 --> 00:00:19,840
I'll let you figure it out.
11
00:00:19,840 --> 00:00:20,810
In this lesson, though,
12
00:00:20,810 --> 00:00:24,050
we're actually going to talk about Spark shuffle.
13
00:00:24,050 --> 00:00:25,550
And I'm going to start by talking to you about
14
00:00:25,550 --> 00:00:28,390
what Spark shuffle is at a high level,
15
00:00:28,390 --> 00:00:31,740
because again, I just want to focus on the DP-203 here,
16
00:00:31,740 --> 00:00:34,500
and not get lost in the weeds.
17
00:00:34,500 --> 00:00:36,040
Then I'm going to talk to you about
18
00:00:36,040 --> 00:00:39,370
how we set the shuffle partition size.
19
00:00:39,370 --> 00:00:41,160
Going to be a relatively short lesson,
20
00:00:41,160 --> 00:00:43,570
but let's go ahead and jump on in.
21
00:00:43,570 --> 00:00:45,660
So what is a Spark shuffle?
22
00:00:45,660 --> 00:00:50,430
Well, at its core, it's a mechanism for repartitioning data.
23
00:00:50,430 --> 00:00:53,010
And we've talked about repartitioning several times
24
00:00:53,010 --> 00:00:54,660
in this course.
25
00:00:54,660 --> 00:00:57,470
When we talk about Spark shuffling,
26
00:00:57,470 --> 00:00:59,960
it looks like this over here on the right.
27
00:00:59,960 --> 00:01:03,500
So, we've talked about the driver, cluster manager,
28
00:01:03,500 --> 00:01:05,030
and then the executors.
29
00:01:05,030 --> 00:01:06,560
And then within those executors,
30
00:01:06,560 --> 00:01:09,160
we have individual tasks, right?
31
00:01:09,160 --> 00:01:13,010
That is the standard setup for Spark.
32
00:01:13,010 --> 00:01:16,910
Now, what if we wanted to copy or move data?
33
00:01:16,910 --> 00:01:18,380
What does that look like?
34
00:01:18,380 --> 00:01:21,870
Well, we introduce a shuffle step.
35
00:01:21,870 --> 00:01:23,580
And basically what this is going to do
36
00:01:23,580 --> 00:01:26,700
is it's going to take the executor data from those tasks,
37
00:01:26,700 --> 00:01:28,920
(and those are going to be in different partitions)
38
00:01:28,920 --> 00:01:31,010
and it's going to pull all of those together
39
00:01:31,010 --> 00:01:35,990
into a shuffle step, or maybe multiple shuffle steps.
40
00:01:35,990 --> 00:01:39,430
And then it's going to reduce that data down.
41
00:01:39,430 --> 00:01:41,250
So essentially, we're moving it here.
42
00:01:41,250 --> 00:01:44,170
And we would go from maybe 5 partitions
43
00:01:44,170 --> 00:01:46,150
down to 2 partitions.
44
00:01:46,150 --> 00:01:50,610
And that happens in that shuffle and reducer step.
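To make that concrete, here is a minimal PySpark sketch of that idea, using hypothetical data: it starts with 5 partitions and shuffles down to 2 via repartition().

```python
# Minimal sketch: going from 5 partitions down to 2 (hypothetical data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A simple DataFrame spread across 5 partitions.
df = spark.range(1_000_000).repartition(5)
print(df.rdd.getNumPartitions())  # 5

# repartition() performs a full shuffle: data moves across the network
# into the new 2-partition layout.
df_small = df.repartition(2)
print(df_small.rdd.getNumPartitions())  # 2
```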
45
00:01:50,610 --> 00:01:53,080
So, that's what's happening behind the scenes
46
00:01:53,080 --> 00:01:55,160
when we do a Spark shuffle.
47
00:01:55,160 --> 00:01:56,420
And what we're doing is,
48
00:01:56,420 --> 00:01:58,780
essentially, we're grouping data differently
49
00:01:58,780 --> 00:02:00,883
across partitions, okay?
50
00:02:01,920 --> 00:02:05,620
This can be beneficial or it can be harmful.
51
00:02:05,620 --> 00:02:07,530
Now, it can be beneficial because,
52
00:02:07,530 --> 00:02:10,615
hey, we're reducing things down. When you reduce things
53
00:02:10,615 --> 00:02:12,650
down, generally you have less data to store,
54
00:02:12,650 --> 00:02:16,930
and it's generally easier and faster to run queries.
55
00:02:16,930 --> 00:02:19,270
It can also be harmful, however,
56
00:02:19,270 --> 00:02:23,630
because this shuffle step can be a very costly step
57
00:02:23,630 --> 00:02:25,483
if you don't do it correctly.
58
00:02:26,440 --> 00:02:29,280
And so, when we run shuffling,
59
00:02:29,280 --> 00:02:32,370
anytime you do a wide transformation in Spark
60
00:02:32,370 --> 00:02:36,760
(so a groupBy, a join, a distinct, that kind of stuff),
61
00:02:36,760 --> 00:02:40,290
you are going to initiate a shuffle behind the scenes.
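As a quick illustration (hypothetical DataFrame and column names, and assuming the `spark` session that Databricks notebooks provide automatically), a groupBy puts an Exchange operator into the physical plan, and that Exchange is the shuffle:

```python
# Sketch: a wide transformation (groupBy) initiating a shuffle.
# Assumes an existing `spark` session, as in a Databricks notebook.
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("US", 10.0), ("US", 20.0), ("EU", 5.0)],
    ["region", "amount"],
)

totals = orders.groupBy("region").agg(F.sum("amount").alias("total"))

# Look for an Exchange operator in the plan output: that is the shuffle step.
totals.explain()
```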
62
00:02:40,290 --> 00:02:44,430
And like I said before, shuffling is an expensive operation,
63
00:02:44,430 --> 00:02:47,960
so it's important that we set it correctly.
64
00:02:47,960 --> 00:02:51,460
And what makes it expensive is it's requiring
65
00:02:51,460 --> 00:02:54,610
disk I/O and network I/O when we do this shuffle.
66
00:02:54,610 --> 00:02:56,980
Because what it's doing is it's storing data
67
00:02:56,980 --> 00:02:58,420
and it's moving data.
68
00:02:58,420 --> 00:03:00,990
And so that can be a very expensive operation
69
00:03:00,990 --> 00:03:02,350
when we talk about Databricks,
70
00:03:02,350 --> 00:03:06,473
which generally runs on massive datasets.
71
00:03:07,940 --> 00:03:11,220
And it also involves data serialization
72
00:03:11,220 --> 00:03:13,610
and deserialization.
73
00:03:13,610 --> 00:03:15,140
And what that means is,
74
00:03:15,140 --> 00:03:18,740
we are just converting data from its in-memory format
75
00:03:18,740 --> 00:03:19,870
into a series of bytes,
76
00:03:19,870 --> 00:03:21,500
and then pulling it back out of bytes.
77
00:03:21,500 --> 00:03:25,110
And so basically, it's involving a lot of transformation
78
00:03:25,110 --> 00:03:27,690
and it's involving a lot of movement and storage.
79
00:03:27,690 --> 00:03:29,633
All of that is expensive.
80
00:03:32,190 --> 00:03:33,750
All right. So, how do we set
81
00:03:33,750 --> 00:03:35,910
the shuffle partition size then?
82
00:03:35,910 --> 00:03:37,750
Well, the challenge with this is
83
00:03:37,750 --> 00:03:40,610
finding the right shuffle partition number.
84
00:03:40,610 --> 00:03:42,870
And it actually gets even a little more complicated,
85
00:03:42,870 --> 00:03:46,800
because we have multiple steps, sometimes, of shuffle.
86
00:03:46,800 --> 00:03:47,900
And when we run a query,
87
00:03:47,900 --> 00:03:51,520
and that query has multiple steps,
88
00:03:51,520 --> 00:03:54,510
the ideal shuffle partition size may change
89
00:03:54,510 --> 00:03:57,890
from step to step, and may need adjusting to stay optimized.
90
00:03:57,890 --> 00:04:00,873
The answer to this is using AQE,
91
00:04:00,873 --> 00:04:03,360
or adaptive query execution.
92
00:04:03,360 --> 00:04:06,450
This is something that you can do within Databricks.
93
00:04:06,450 --> 00:04:10,120
And basically, you set the shuffle partition number,
94
00:04:10,120 --> 00:04:12,110
and then as you move forward,
95
00:04:12,110 --> 00:04:14,850
AQE is going to automatically adjust
96
00:04:14,850 --> 00:04:18,200
your shuffle partition number at each stage of your query.
97
00:04:18,200 --> 00:04:19,300
And this is going to be based
98
00:04:19,300 --> 00:04:21,870
on the size of your shuffle output.
99
00:04:21,870 --> 00:04:24,110
So basically, as your data grows or shrinks
100
00:04:24,110 --> 00:04:25,920
over those different stages,
101
00:04:25,920 --> 00:04:28,860
your AQE is going to be adjusting
102
00:04:28,860 --> 00:04:30,790
your shuffle partition number
103
00:04:30,790 --> 00:04:33,960
to make sure that you have the right task size.
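For reference, these are the Spark 3.x settings behind that behavior. AQE is enabled by default on recent Databricks runtimes, so this is mostly about being explicit:

```python
# Turn on adaptive query execution (already on by default in recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE coalesce small shuffle partitions between stages,
# based on the actual shuffle output size.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```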
104
00:04:33,960 --> 00:04:36,380
Now, a critical part to this is
105
00:04:36,380 --> 00:04:39,890
you need to set the initial shuffle partition number.
106
00:04:39,890 --> 00:04:42,760
And you do that with a script just like this one here.
107
00:04:42,760 --> 00:04:46,100
So basically, you are setting your shuffle partitions.
108
00:04:46,100 --> 00:04:48,080
And in this case, we'd be setting it to 100.
109
00:04:48,080 --> 00:04:50,480
So you can see that in that very top line there.
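The slide itself isn't reproduced here, but a minimal version of that kind of script looks like this, with the first line setting the initial shuffle partition number to 100 (the table name in the query is hypothetical):

```python
# First line: set the initial shuffle partition number to 100.
spark.conf.set("spark.sql.shuffle.partitions", 100)

# Any wide transformation after this starts from 100 shuffle partitions,
# which AQE can then adjust per stage (hypothetical `sales` table).
spark.sql("SELECT region, COUNT(*) AS cnt FROM sales GROUP BY region").show()
```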
110
00:04:53,160 --> 00:04:54,960
Now, that's it for this lesson.
111
00:04:54,960 --> 00:04:58,300
Let me give you just a few parting key points to remember.
112
00:04:58,300 --> 00:05:00,190
First, you need to remember that Spark shuffle
113
00:05:00,190 --> 00:05:04,080
can be very costly if you don't optimize it correctly.
114
00:05:04,080 --> 00:05:05,930
AQE can assist with this.
115
00:05:05,930 --> 00:05:09,103
You need to use adaptive query execution.
116
00:05:10,380 --> 00:05:13,930
You also need to set that initial shuffle number.
117
00:05:13,930 --> 00:05:15,990
And don't forget, we set that shuffle partition number
118
00:05:15,990 --> 00:05:18,260
with a script just like this.
119
00:05:18,260 --> 00:05:22,470
And so, for the DP-203, this is as far as we need to go.
120
00:05:22,470 --> 00:05:25,260
You need to remember that Spark shuffle exists.
121
00:05:25,260 --> 00:05:27,360
You need to remember that it can be very costly
122
00:05:27,360 --> 00:05:29,100
if you don't set it correctly.
123
00:05:29,100 --> 00:05:33,100
And AQE, or adaptive query execution, can assist with that
124
00:05:33,100 --> 00:05:35,490
when you're looking at Databricks.
125
00:05:35,490 --> 00:05:38,620
So for the DP-203, that's all you need to know.
126
00:05:38,620 --> 00:05:40,360
So I'm actually going to stop here,
127
00:05:40,360 --> 00:05:42,960
and we're going to jump on to the next lesson.
128
00:05:42,960 --> 00:05:44,460
All right. I'll see you there.