1
00:00:00,550 --> 00:00:01,430
In this lesson,
2
00:00:01,430 --> 00:00:04,680
we are going to talk about "Tuning Shuffle Partitions".
3
00:00:04,680 --> 00:00:06,240
And I'm going to tell you right off the bat,
4
00:00:06,240 --> 00:00:08,180
I'm having a little bit of a struggle,
5
00:00:08,180 --> 00:00:11,010
because when I think of tuning shuffle partitions,
6
00:00:11,010 --> 00:00:13,930
I'm kind of torn between shuffling cards
7
00:00:13,930 --> 00:00:16,010
and shuffling as a dance move.
8
00:00:16,010 --> 00:00:17,100
So, I don't know.
9
00:00:17,100 --> 00:00:18,270
I threw them both on there.
10
00:00:18,270 --> 00:00:19,840
I'll let you figure it out.
11
00:00:19,840 --> 00:00:20,810
In this lesson, though,
12
00:00:20,810 --> 00:00:24,050
we're actually going to talk about Spark shuffle.
13
00:00:24,050 --> 00:00:25,550
And I'm going to start by talking to you about
14
00:00:25,550 --> 00:00:28,390
what Spark shuffle is at a high level,
15
00:00:28,390 --> 00:00:31,740
because again, I just want to focus on the DP-203 here,
16
00:00:31,740 --> 00:00:34,500
and not get lost in the weeds.
17
00:00:34,500 --> 00:00:36,040
Then I'm going to talk to you about
18
00:00:36,040 --> 00:00:39,370
how we set the shuffle partition size.
19
00:00:39,370 --> 00:00:41,160
Going to be a relatively short lesson,
20
00:00:41,160 --> 00:00:43,570
but let's go ahead and jump on in.
21
00:00:43,570 --> 00:00:45,660
So what is a Spark shuffle?
22
00:00:45,660 --> 00:00:50,430
Well, at its core, it's a mechanism for repartitioning data.
23
00:00:50,430 --> 00:00:53,010
And we've talked about repartitioning several times
24
00:00:53,010 --> 00:00:54,660
in this course.
25
00:00:54,660 --> 00:00:57,470
When we talk about Spark shuffling,
26
00:00:57,470 --> 00:00:59,960
it looks like this over here on the right.
27
00:00:59,960 --> 00:01:03,500
So, we've talked about the driver, cluster manager,
28
00:01:03,500 --> 00:01:05,030
and then the executors.
29
00:01:05,030 --> 00:01:06,560
And then within those executors,
30
00:01:06,560 --> 00:01:09,160
we have individual tasks, right?
31
00:01:09,160 --> 00:01:13,010
That is the standard setup for Spark.
32
00:01:13,010 --> 00:01:16,910
Now, what if we wanted to copy or move data?
33
00:01:16,910 --> 00:01:18,380
What does that look like?
34
00:01:18,380 --> 00:01:21,870
Well, we introduce a shuffle step.
35
00:01:21,870 --> 00:01:23,580
And basically what this is going to do
36
00:01:23,580 --> 00:01:26,700
is it's going to take the executor data from those tasks,
37
00:01:26,700 --> 00:01:28,920
(and those are going to be in different partitions)
38
00:01:28,920 --> 00:01:31,010
and it's going to pull all of those together
39
00:01:31,010 --> 00:01:35,990
into a shuffle step, or maybe multiple shuffle steps.
40
00:01:35,990 --> 00:01:39,430
And then it's going to reduce that data down.
41
00:01:39,430 --> 00:01:41,250
So essentially, we're moving it here.
42
00:01:41,250 --> 00:01:44,170
And we would go from maybe 5 partitions
43
00:01:44,170 --> 00:01:46,150
down to 2 partitions.
44
00:01:46,150 --> 00:01:50,610
And that happens in that shuffle and reducer step.
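To make that concrete, here is a minimal PySpark sketch of that idea, using hypothetical data: it starts with 5 partitions and shuffles down to 2 via repartition().

```python
# Minimal sketch: going from 5 partitions down to 2 (hypothetical data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A simple DataFrame spread across 5 partitions.
df = spark.range(1_000_000).repartition(5)
print(df.rdd.getNumPartitions())  # 5

# repartition() performs a full shuffle: data moves across the network
# into the new 2-partition layout.
df_small = df.repartition(2)
print(df_small.rdd.getNumPartitions())  # 2
```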
45
00:01:50,610 --> 00:01:53,080
So, that's what's happening behind the scenes
46
00:01:53,080 --> 00:01:55,160
when we do a Spark shuffle.
47
00:01:55,160 --> 00:01:56,420
And what we're doing is,
48
00:01:56,420 --> 00:01:58,780
essentially, we're grouping data differently
49
00:01:58,780 --> 00:02:00,883
across partitions, okay?
50
00:02:01,920 --> 00:02:05,620
This can be beneficial or it can be harmful.
51
00:02:05,620 --> 00:02:07,530
Now, it can be beneficial because,
52
00:02:07,530 --> 00:02:10,615
hey, we're reducing things down. When you reduce things
53
00:02:10,615 --> 00:02:12,650
down, generally you have less data to store,
54
00:02:12,650 --> 00:02:16,930
and it's generally easier and faster to run queries.
55
00:02:16,930 --> 00:02:19,270
It can also be harmful, however,
56
00:02:19,270 --> 00:02:23,630
because this shuffle step can be a very costly step
57
00:02:23,630 --> 00:02:25,483
if you don't do it correctly.
58
00:02:26,440 --> 00:02:29,280
And so, when we run shuffling,
59
00:02:29,280 --> 00:02:32,370
anytime you do a wide transformation in Spark
60
00:02:32,370 --> 00:02:36,760
(so a groupBy, a join, a distinct, that kind of stuff),
61
00:02:36,760 --> 00:02:40,290
you are going to initiate a shuffle behind the scenes.
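As a quick illustration (hypothetical DataFrame and column names, and assuming the `spark` session that Databricks notebooks provide automatically), a groupBy puts an Exchange operator into the physical plan, and that Exchange is the shuffle:

```python
# Sketch: a wide transformation (groupBy) initiating a shuffle.
# Assumes an existing `spark` session, as in a Databricks notebook.
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [("US", 10.0), ("US", 20.0), ("EU", 5.0)],
    ["region", "amount"],
)

totals = orders.groupBy("region").agg(F.sum("amount").alias("total"))

# Look for an Exchange operator in the plan output: that is the shuffle step.
totals.explain()
```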
62
00:02:40,290 --> 00:02:44,430
And like I said before, shuffling is an expensive operation,
63
00:02:44,430 --> 00:02:47,960
so it's important that we set it correctly.
64
00:02:47,960 --> 00:02:51,460
And what makes it expensive is it's requiring
65
00:02:51,460 --> 00:02:54,610
disk I/O and network I/O when we do this shuffle.
66
00:02:54,610 --> 00:02:56,980
Because what it's doing is it's storing data
67
00:02:56,980 --> 00:02:58,420
and it's moving data.
68
00:02:58,420 --> 00:03:00,990
And so that can be a very expensive operation
69
00:03:00,990 --> 00:03:02,350
when we talk about Databricks,
70
00:03:02,350 --> 00:03:06,473
which generally runs on massive datasets.
71
00:03:07,940 --> 00:03:11,220
And it also involves data serialization
72
00:03:11,220 --> 00:03:13,610
and deserialization.
73
00:03:13,610 --> 00:03:15,140
And what that means is,
74
00:03:15,140 --> 00:03:18,740
we are just converting data from its in-memory format
75
00:03:18,740 --> 00:03:19,870
into a series of bytes,
76
00:03:19,870 --> 00:03:21,500
and then pulling it back out of bytes.
77
00:03:21,500 --> 00:03:25,110
And so basically, it's involving a lot of transformation
78
00:03:25,110 --> 00:03:27,690
and it's involving a lot of movement and storage.
79
00:03:27,690 --> 00:03:29,633
All of that is expensive.
80
00:03:32,190 --> 00:03:33,750
All right. So, how do we set
81
00:03:33,750 --> 00:03:35,910
the shuffle partition size then?
82
00:03:35,910 --> 00:03:37,750
Well, the challenge with this is
83
00:03:37,750 --> 00:03:40,610
finding the right shuffle partition number.
84
00:03:40,610 --> 00:03:42,870
And it actually gets even a little more complicated,
85
00:03:42,870 --> 00:03:46,800
because we have multiple steps, sometimes, of shuffle.
86
00:03:46,800 --> 00:03:47,900
And when we run a query,
87
00:03:47,900 --> 00:03:51,520
and that query has multiple steps,
88
00:03:51,520 --> 00:03:54,510
the ideal shuffle partition size may change
89
00:03:54,510 --> 00:03:57,890
from step to step, and may need adjusting to stay optimized.
90
00:03:57,890 --> 00:04:00,873
The answer to this is using AQE,
91
00:04:00,873 --> 00:04:03,360
or adaptive query execution.
92
00:04:03,360 --> 00:04:06,450
This is something that you can do within Databricks.
93
00:04:06,450 --> 00:04:10,120
And basically, you set the shuffle partition number,
94
00:04:10,120 --> 00:04:12,110
and then as you move forward,
95
00:04:12,110 --> 00:04:14,850
AQE is going to automatically adjust
96
00:04:14,850 --> 00:04:18,200
your shuffle partition number at each stage of your query.
97
00:04:18,200 --> 00:04:19,300
And this is going to be based
98
00:04:19,300 --> 00:04:21,870
on the size of your shuffle output.
99
00:04:21,870 --> 00:04:24,110
So basically, as your data grows or shrinks
100
00:04:24,110 --> 00:04:25,920
over those different stages,
101
00:04:25,920 --> 00:04:28,860
your AQE is going to be adjusting
102
00:04:28,860 --> 00:04:30,790
your shuffle partition number
103
00:04:30,790 --> 00:04:33,960
to make sure that you have the right task size.
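For reference, these are the Spark 3.x settings behind that behavior. AQE is enabled by default on recent Databricks runtimes, so this is mostly about being explicit:

```python
# Turn on adaptive query execution (already on by default in recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE coalesce small shuffle partitions between stages,
# based on the actual shuffle output size.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```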
104
00:04:33,960 --> 00:04:36,380
Now, a critical part to this is
105
00:04:36,380 --> 00:04:39,890
you need to set the initial shuffle partition number.
106
00:04:39,890 --> 00:04:42,760
And you do that with a script just like this one here.
107
00:04:42,760 --> 00:04:46,100
So basically, you are setting your shuffle partitions.
108
00:04:46,100 --> 00:04:48,080
And in this case, we'd be setting it to 100.
109
00:04:48,080 --> 00:04:50,480
So you can see that in that very top line there.
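The slide itself isn't reproduced here, but a minimal version of that kind of script looks like this, with the first line setting the initial shuffle partition number to 100 (the table name in the query is hypothetical):

```python
# First line: set the initial shuffle partition number to 100.
spark.conf.set("spark.sql.shuffle.partitions", 100)

# Any wide transformation after this starts from 100 shuffle partitions,
# which AQE can then adjust per stage (hypothetical `sales` table).
spark.sql("SELECT region, COUNT(*) AS cnt FROM sales GROUP BY region").show()
```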
110
00:04:53,160 --> 00:04:54,960
Now, that's it for this lesson.
111
00:04:54,960 --> 00:04:58,300
Let me give you just a few parting key points to remember.
112
00:04:58,300 --> 00:05:00,190
First, you need to remember that Spark shuffle
113
00:05:00,190 --> 00:05:04,080
can be very costly if you don't optimize it correctly.
114
00:05:04,080 --> 00:05:05,930
AQE can assist with this.
115
00:05:05,930 --> 00:05:09,103
You need to use adaptive query execution.
116
00:05:10,380 --> 00:05:13,930
You also need to set that initial shuffle number.
117
00:05:13,930 --> 00:05:15,990
And don't forget, we set that shuffle partition number
118
00:05:15,990 --> 00:05:18,260
with a script just like this.
119
00:05:18,260 --> 00:05:22,470
And so, for the DP-203, this is as far as we need to go.
120
00:05:22,470 --> 00:05:25,260
You need to remember that Spark shuffle exists.
121
00:05:25,260 --> 00:05:27,360
You need to remember that it can be very costly
122
00:05:27,360 --> 00:05:29,100
if you don't set it correctly.
123
00:05:29,100 --> 00:05:33,100
And AQE, or adaptive query execution, can assist with that
124
00:05:33,100 --> 00:05:35,490
when you're looking at Databricks.
125
00:05:35,490 --> 00:05:38,620
So for the DP-203, that's all you need to know.
126
00:05:38,620 --> 00:05:40,360
So I'm actually going to stop here,
127
00:05:40,360 --> 00:05:42,960
and we're going to jump on to the next lesson.
128
00:05:42,960 --> 00:05:44,460
All right. I'll see you there.