1 00:00:00,680 --> 00:00:01,740 Welcome back. 2 00:00:01,740 --> 00:00:02,630 So in this lesson, 3 00:00:02,630 --> 00:00:05,900 I'm going to talk to you about "Compacting Small Files". 4 00:00:05,900 --> 00:00:09,270 And we do that with a tool known as auto optimization. 5 00:00:09,270 --> 00:00:10,420 So we're going to talk a little bit 6 00:00:10,420 --> 00:00:13,600 about auto optimization, and what that actually means. 7 00:00:13,600 --> 00:00:14,958 And then I'm going to show you 8 00:00:14,958 --> 00:00:16,327 just a couple of quick scripts, 9 00:00:16,327 --> 00:00:20,450 so that you know how to use auto optimization. 10 00:00:20,450 --> 00:00:23,220 So with that, let's dive in and get started. 11 00:00:23,220 --> 00:00:27,420 First off, auto optimization is an Azure Databricks tool. 12 00:00:27,420 --> 00:00:30,640 So keep that in mind for the DP-203. 13 00:00:30,640 --> 00:00:33,023 Auto optimization, Databricks. 14 00:00:33,880 --> 00:00:35,180 So how does it work? 15 00:00:35,180 --> 00:00:40,180 Well, it allows you to automatically compact small files. 16 00:00:40,790 --> 00:00:43,120 And what it's going to do is it's going to write those 17 00:00:43,120 --> 00:00:45,370 into a Delta table. 18 00:00:45,370 --> 00:00:48,440 So let's take a look at a traditional write process 19 00:00:48,440 --> 00:00:50,030 over here on the right. 20 00:00:50,030 --> 00:00:51,760 So in a traditional write process, 21 00:00:51,760 --> 00:00:53,843 we're going to have our large blue box, 22 00:00:53,843 --> 00:00:55,810 which is our executor, 23 00:00:55,810 --> 00:00:58,250 and then we're going to have our data pieces. 24 00:00:58,250 --> 00:00:59,470 And in a traditional write, 25 00:00:59,470 --> 00:01:01,090 we're just going to take those data pieces 26 00:01:01,090 --> 00:01:02,560 and we are going to move them 27 00:01:02,560 --> 00:01:04,810 into Delta tables and partitions. 
28 00:01:04,810 --> 00:01:06,950 So what can happen is you can wind up with a large number 29 00:01:06,950 --> 00:01:09,530 of small files in a partition. 30 00:01:09,530 --> 00:01:11,540 Optimization can fix this. 31 00:01:11,540 --> 00:01:13,660 So with optimization, what you're going to do 32 00:01:13,660 --> 00:01:16,940 is you're going to go through and look at the files, 33 00:01:16,940 --> 00:01:20,870 and you're going to combine all of those files 34 00:01:20,870 --> 00:01:24,430 into files of about 128 megabytes. 35 00:01:24,430 --> 00:01:25,860 Now, with that, 36 00:01:25,860 --> 00:01:27,980 you're going to get fewer files in the partitions, 37 00:01:27,980 --> 00:01:29,188 which means you're going to be able 38 00:01:29,188 --> 00:01:30,529 to query your data faster. 39 00:01:31,956 --> 00:01:34,560 Now, with auto optimization, 40 00:01:34,560 --> 00:01:36,950 we're just simply turning that process on 41 00:01:36,950 --> 00:01:39,690 so that Azure Databricks, as it runs, 42 00:01:39,690 --> 00:01:43,990 is going to automatically find and compact those files 43 00:01:43,990 --> 00:01:46,183 into that larger file size. 44 00:01:47,220 --> 00:01:49,563 Now, how do we turn that on? 45 00:01:50,450 --> 00:01:52,610 With an auto optimization script. 46 00:01:52,610 --> 00:01:56,390 So you have to enable Optimized Writes and Auto Compaction. 47 00:01:56,390 --> 00:01:58,510 It is not on by default. 48 00:01:58,510 --> 00:02:03,140 You do that by, first, writing something like this script, 49 00:02:03,140 --> 00:02:06,010 and this piece in orange here that I've highlighted, 50 00:02:06,010 --> 00:02:09,910 this is the actual piece that turns on Auto Optimize. 51 00:02:09,910 --> 00:02:11,630 So you can see that we're turning on 52 00:02:11,630 --> 00:02:15,100 autoOptimize.optimizeWrite = true, 53 00:02:15,100 --> 00:02:18,040 and autoOptimize.autoCompact = true. 54 00:02:18,040 --> 00:02:20,150 So this is where we're turning those on.
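The two settings just named are Delta table properties, set in a TBLPROPERTIES clause, and their full names carry a `delta.` prefix. Here is a minimal sketch of what a create-table script and an alter-table script with those properties could look like. The helper functions, the table name `student`, and its columns are assumptions made up for illustration; only the two property names come from the lesson. The helpers just build SQL text and do not talk to a cluster.

```python
# Hypothetical helpers for illustration: they only build SQL strings.
# The two property names are the Auto Optimize settings from the lesson,
# written with their full "delta." prefix.
AUTO_OPTIMIZE_PROPERTIES = {
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
}


def tblproperties_clause(properties):
    """Render a dict of table properties as a TBLPROPERTIES (...) clause."""
    pairs = ", ".join(f"{key} = {value}" for key, value in properties.items())
    return f"TBLPROPERTIES ({pairs})"


def create_table_sql(table_name, columns):
    """Build a CREATE TABLE statement with Auto Optimize turned on.

    `columns` is a list of (name, type) pairs; the example names are assumed.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table_name} ({column_list}) USING DELTA "
        f"{tblproperties_clause(AUTO_OPTIMIZE_PROPERTIES)}"
    )


def alter_table_sql(table_name):
    """Build an ALTER TABLE statement that turns Auto Optimize on for an existing table."""
    return (
        f"ALTER TABLE {table_name} SET "
        f"{tblproperties_clause(AUTO_OPTIMIZE_PROPERTIES)}"
    )


if __name__ == "__main__":
    # "student" and its columns are placeholder names, not from the lesson.
    print(create_table_sql("student", [("id", "INT"), ("name", "STRING")]))
    print(alter_table_sql("student"))
```

In a Databricks notebook, each generated string could then be passed to `spark.sql(...)` to run it against the workspace.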
55 00:02:20,150 --> 00:02:23,323 So you need to turn on Auto Optimize in Databricks. 56 00:02:24,530 --> 00:02:28,190 And then, if you're going to alter a table, same process. 57 00:02:28,190 --> 00:02:30,150 So the top one is for creating a table, 58 00:02:30,150 --> 00:02:32,530 and the bottom one is for altering a table. 59 00:02:32,530 --> 00:02:34,470 But again, that piece in orange, 60 00:02:34,470 --> 00:02:36,630 that's the piece that we actually care about, 61 00:02:36,630 --> 00:02:39,543 at least as far as auto optimization is concerned. 62 00:02:41,030 --> 00:02:43,950 Should I always auto compact? 63 00:02:43,950 --> 00:02:46,490 The answer is: not necessarily. 64 00:02:46,490 --> 00:02:48,560 It depends on latency first. 65 00:02:48,560 --> 00:02:52,750 So if we have a ton of files being written very quickly, 66 00:02:52,750 --> 00:02:55,480 we might care about latency and how long it takes us 67 00:02:55,480 --> 00:02:58,640 to actually get those files into the Delta tables. 68 00:02:58,640 --> 00:03:00,730 Obviously, if we're going to optimize, 69 00:03:00,730 --> 00:03:02,660 that's going to add an additional step, 70 00:03:02,660 --> 00:03:06,560 which is going to slow the process down just a little bit. 71 00:03:06,560 --> 00:03:09,640 It also depends on your concurrent operations. 72 00:03:09,640 --> 00:03:13,260 So if you are deleting or merging data or things like that, 73 00:03:13,260 --> 00:03:15,660 it can conflict with the auto optimization process, 74 00:03:15,660 --> 00:03:17,460 and so you might not want to use it. 75 00:03:18,740 --> 00:03:21,360 Some key points to remember as we wrap up this lesson. 76 00:03:21,360 --> 00:03:23,270 First, don't get lost in the weeds. 77 00:03:23,270 --> 00:03:25,380 This is not about Databricks. 78 00:03:25,380 --> 00:03:28,663 This is about auto optimization and Auto Compaction.
79 00:03:28,663 --> 00:03:31,660 You need to know that it exists, and you need to know 80 00:03:31,660 --> 00:03:34,403 that you do need to turn it on in Databricks. 81 00:03:35,400 --> 00:03:38,610 It can be extremely helpful when query speed is important 82 00:03:38,610 --> 00:03:39,980 and write latency isn't. 83 00:03:39,980 --> 00:03:43,900 This is when we're going to use this Auto Compaction. 84 00:03:43,900 --> 00:03:45,540 So keep that in mind. 85 00:03:45,540 --> 00:03:47,730 And with that, we've finished this lesson. 86 00:03:47,730 --> 00:03:48,660 On to the next. 87 00:03:48,660 --> 00:03:49,610 I'll see you there.