In this lesson, we are going to talk about tuning shuffle partitions. And I'm going to tell you right off the bat, I'm having a little bit of a struggle, because when I think of tuning shuffle partitions, I'm kind of torn between shuffling cards and shuffling as a dance move. So, I don't know. I threw them both on there. I'll let you figure it out.

In this lesson, though, we're actually going to talk about Spark shuffle. I'm going to start by telling you what Spark shuffle is at a high level, because again, I just want to focus on the DP-203 here and not get lost in the weeds. Then I'm going to talk to you about how we set the shuffle partition size. It's going to be a relatively short lesson, but let's go ahead and jump on in.

So what is a Spark shuffle? Well, at its core, it's a mechanism for repartitioning data, and we've talked about repartitioning several times in this course. When we talk about Spark shuffling, it looks like this over here on the right. We've talked about the driver, the cluster manager, and then the executors. And within those executors, we have individual tasks, right? That is the standard setup for Spark.

Now, what if we wanted to copy or move data? What does that look like? Well, we introduce a shuffle step. Basically, it takes the executor data from those tasks (and those are going to be in different partitions), pulls all of it together into a shuffle step, or maybe multiple shuffle steps, and then reduces that data down. So essentially, we're moving the data, and we might go from, say, 5 partitions down to 2 partitions. That happens in that shuffle and reducer step. So that's what's happening behind the scenes when we do a Spark shuffle: essentially, we're grouping data differently across partitions, okay?
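To make that concrete, here's a minimal PySpark sketch (the dataset size and app name are just illustrative, not from the lesson) that regroups data from 5 partitions down to 2. The repartition() call is what triggers the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Start with a DataFrame spread across 5 partitions.
df = spark.range(0, 1_000_000, numPartitions=5)
print(df.rdd.getNumPartitions())  # 5

# repartition() triggers a shuffle: rows are pulled out of the
# existing partitions, moved between executors, and regrouped
# into the new set of partitions.
df2 = df.repartition(2)
print(df2.rdd.getNumPartitions())  # 2
```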
This can be beneficial or it can be harmful. It can be beneficial because, hey, we're reducing things down, and when you reduce things down, you generally have less storage and it's generally easier and faster to run queries. It can also be harmful, however, because the shuffle step can be a very costly step if you don't do it correctly.

Anytime you run a wide transformation in Spark (a groupBy, a join, that kind of stuff), you are going to initiate a shuffle behind the scenes. And like I said before, shuffling is an expensive operation, so it's important that we configure it correctly. What makes it expensive is that it requires disk I/O and network I/O, because it's storing data and it's moving data. That can be very expensive when we talk about Databricks, where we're generally working on massive datasets.

A shuffle also involves data serialization and deserialization, which means we're transforming the data and its format into a series of bytes and then pulling it back out of bytes. So basically, it involves a lot of transformation, a lot of movement, and a lot of storage. All of that is expensive.

All right. So how do we set the shuffle partition size, then? Well, the challenge is finding the right shuffle partition number. And it actually gets a little more complicated, because sometimes we have multiple shuffle steps. When a query runs in multiple stages, the optimal shuffle partition size may change, or may need to change, from stage to stage. The answer to this is AQE, or adaptive query execution. This is something you can use within Databricks. Basically, you set the initial shuffle partition number, and then as the query moves forward, AQE automatically adjusts your shuffle partition number at each stage of your query, as you can see in the script below.
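Here's what that looks like in practice. This is a minimal sketch, assuming a Databricks notebook where the spark session is already available (in a standalone PySpark script you'd build one first); the value 100 mirrors the slide example, and the right number depends on your data:

```python
# Set the initial shuffle partition number (here 100, as in
# the slide example). This is the starting point AQE works from.
spark.conf.set("spark.sql.shuffle.partitions", 100)

# Enable adaptive query execution so Spark can re-size shuffle
# partitions at each stage based on the actual shuffle output.
# (AQE is on by default in recent Spark and Databricks runtimes.)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# A wide transformation like groupBy triggers a shuffle behind
# the scenes; the query plan shows it as an "Exchange" step.
df = spark.range(0, 1_000_000)
df.groupBy((df.id % 10).alias("bucket")).count().explain()
```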
That adjustment is based on the size of your shuffle output. So basically, as your data grows or shrinks over those different stages, AQE adjusts your shuffle partition number to make sure you have the right task size. Now, a critical part of this is that you need to set the initial shuffle partition number, and you do that with a script just like the one above. Basically, you are setting your shuffle partitions, and in this case we'd be setting them to 100. You can see that right at the top of the script.

Now, that's it for this lesson. Let me give you just a few parting key points to remember. First, remember that Spark shuffle can be very costly if you don't optimize it correctly. AQE can assist with this, so you need to use adaptive query execution. You also need to set that initial shuffle partition number, and don't forget, we set that number with a script just like the one we looked at.

For the DP-203, this is as far as we need to go. You need to remember that Spark shuffle exists, that it can be very costly if you don't configure it correctly, and that AQE, or adaptive query execution, can assist with that when you're looking at Databricks. So for the DP-203, that's all you need to know. I'm actually going to stop here, and we're going to jump on to the next lesson. All right. I'll see you there.