1 00:00:00,620 --> 00:00:01,560 Hey, Cloud Gurus. 2 00:00:01,560 --> 00:00:04,963 Welcome to Partitioning Best Practices: Part 1. 3 00:00:06,720 --> 00:00:07,553 In this lesson, 4 00:00:07,553 --> 00:00:11,150 we're going to discuss partitioning in Azure Data Lake. 5 00:00:11,150 --> 00:00:14,110 We're also going to look at performance considerations, 6 00:00:14,110 --> 00:00:18,810 scalability considerations, and availability considerations. 7 00:00:18,810 --> 00:00:22,110 Much of the content within those concepts will apply to both 8 00:00:22,110 --> 00:00:24,850 Azure data lakes, as well as Azure Synapse. 9 00:00:24,850 --> 00:00:26,250 But we're going to be focused mostly 10 00:00:26,250 --> 00:00:28,640 in this lesson on Azure Data Lake. 11 00:00:28,640 --> 00:00:29,690 And then, in part 2, 12 00:00:29,690 --> 00:00:32,693 we'll extend that into some specifics about Azure Synapse. 13 00:00:33,810 --> 00:00:35,880 At the end of this lesson, as usual, 14 00:00:35,880 --> 00:00:37,503 we will wrap it up in a review. 15 00:00:40,770 --> 00:00:43,130 Of course, before we can talk about best practices 16 00:00:43,130 --> 00:00:45,870 for Azure Data Lake, we need to kind of understand 17 00:00:45,870 --> 00:00:49,440 how the partitioning is working behind the scenes. 18 00:00:49,440 --> 00:00:53,500 The partition key for Blob and Data Lake storage is actually 19 00:00:53,500 --> 00:00:57,830 comprised of the account, plus the container, plus the blob. 20 00:00:57,830 --> 00:01:01,130 So, in essence, the full blob name. 21 00:01:01,130 --> 00:01:03,910 For example, if we had our Data Lake account, 22 00:01:03,910 --> 00:01:05,830 named acdatalake, 23 00:01:05,830 --> 00:01:08,690 you can see it represented in orange here. 24 00:01:08,690 --> 00:01:12,560 And then, our container for our landing zone in purple, 25 00:01:12,560 --> 00:01:15,420 and finally, the blob itself, the actual file, 26 00:01:15,420 --> 00:01:18,290 our Parquet file in this case. 
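As a quick illustration of how those three pieces fit together, here is a minimal Python sketch. Only the account name `acdatalake` comes from the lesson. The container name `landing`, the blob path, and the `/` separator are assumptions made for illustration, since the lesson does not spell out the exact internal key format.

```python
def partition_key(account: str, container: str, blob: str) -> str:
    """Join account, container, and blob name into one illustrative key string.

    The "/" separator is an assumption for readability, not a documented format.
    """
    return f"{account}/{container}/{blob.lstrip('/')}"


# Hypothetical container and blob names; only "acdatalake" appears in the lesson.
key = partition_key("acdatalake", "landing", "sales/orders.parquet")
print(key)
```

The point is simply that every part of the name, not just the file name, contributes to the key.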
27 00:01:18,290 --> 00:01:20,150 And that gives you the full partition key 28 00:01:20,150 --> 00:01:22,053 when working with Azure Data Lake. 29 00:01:24,630 --> 00:01:27,730 Behind the scenes, it uses range-based partitioning. 30 00:01:27,730 --> 00:01:29,720 So the data is split into ranges, 31 00:01:29,720 --> 00:01:33,280 which are load balanced across the storage system. 32 00:01:33,280 --> 00:01:35,414 With this type of approach, 33 00:01:35,414 --> 00:01:37,354 if you're using a naming convention 34 00:01:37,354 --> 00:01:39,071 that has lexical ordering, 35 00:01:39,071 --> 00:01:41,130 that means your partitions are more likely to be located 36 00:01:41,130 --> 00:01:43,110 on the same partition server. 37 00:01:43,110 --> 00:01:45,740 And then, once load increases to a certain point, 38 00:01:45,740 --> 00:01:47,763 they are split into smaller ranges. 39 00:01:48,660 --> 00:01:51,060 Let's get a visual idea of what that looks like. 40 00:01:52,460 --> 00:01:55,930 If our convention for Awesome Company is prefixed with AC, 41 00:01:55,930 --> 00:01:58,193 we have our AC engineering department, 42 00:01:59,185 --> 00:02:03,020 AC HR, AC IT, and AC sales. 43 00:02:03,020 --> 00:02:04,910 And to start with, they're all in order, 44 00:02:04,910 --> 00:02:08,320 and they're all served by the same partition server. 45 00:02:08,320 --> 00:02:10,760 But over time, as the load increases, 46 00:02:10,760 --> 00:02:13,330 it will be split out into smaller ranges, 47 00:02:13,330 --> 00:02:15,623 each served by their own partition server. 48 00:02:17,020 --> 00:02:19,890 Having these 2 key pieces of information is important 49 00:02:19,890 --> 00:02:21,830 when considering your partitioning strategy 50 00:02:21,830 --> 00:02:23,590 with Azure Data Lake. 51 00:02:23,590 --> 00:02:26,450 Be aware that your naming at the account, container, 52 00:02:26,450 --> 00:02:29,860 and blob level affects the partition key. 
53 00:02:29,860 --> 00:02:31,240 And that, in the background, 54 00:02:31,240 --> 00:02:34,223 it's using that to perform this range-based partitioning. 55 00:02:37,410 --> 00:02:40,560 Let's dive into some performance considerations. 56 00:02:40,560 --> 00:02:42,770 First, based on what we just learned, 57 00:02:42,770 --> 00:02:45,060 it's important to name with care. 58 00:02:45,060 --> 00:02:48,470 Create a naming convention that sets you up for success. 59 00:02:48,470 --> 00:02:51,399 Remember that when we say naming convention, 60 00:02:51,399 --> 00:02:52,480 that includes accounts, containers, 61 00:02:52,480 --> 00:02:55,010 as well as the blobs themselves. 62 00:02:55,010 --> 00:02:57,210 Also, be careful if you're using numeric 63 00:02:57,210 --> 00:02:59,530 or timestamp identifiers. 64 00:02:59,530 --> 00:03:01,438 If you're not careful, 65 00:03:01,438 --> 00:03:03,120 you could dump everything into 1 bucket. 66 00:03:03,120 --> 00:03:04,547 For instance, 67 00:03:04,547 --> 00:03:07,370 if you're using today's date, or you're using a specific ID, 68 00:03:07,370 --> 00:03:09,190 you could dump a large amount of records 69 00:03:09,190 --> 00:03:10,980 into just that bucket. 70 00:03:10,980 --> 00:03:13,330 So consider using a hashing function 71 00:03:13,330 --> 00:03:15,693 to prefix names with a 3-digit hash. 72 00:03:17,130 --> 00:03:19,920 Remember that one of the primary goals in these systems 73 00:03:19,920 --> 00:03:21,980 is parallel operations. 74 00:03:21,980 --> 00:03:24,500 The more reads and writes that can happen in parallel, 75 00:03:24,500 --> 00:03:26,040 the better. 76 00:03:26,040 --> 00:03:27,483 Think of a highway system. 77 00:03:28,460 --> 00:03:30,440 If you have a 6-lane highway, you're crippling yourself 78 00:03:30,440 --> 00:03:33,160 if you're only using 2 of those lanes. 
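To make the earlier hashing suggestion concrete, here is a minimal Python sketch of prefixing a name with a 3-digit hash. The function name `hashed_name`, the choice of MD5, and the `-` separator are assumptions for illustration. The lesson only says to use a hashing function and a 3-digit prefix.

```python
import hashlib


def hashed_name(name: str, buckets: int = 1000) -> str:
    """Prefix a blob name with a stable 3-digit hash (000 to 999).

    MD5 is used here only to spread names evenly, not for security.
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % buckets
    return f"{prefix:03d}-{name}"


# Hypothetical timestamp-style names that would otherwise sort next to each other.
for original in ["20240101-orders.parquet", "20240102-orders.parquet"]:
    print(hashed_name(original))
```

One trade-off to keep in mind: a hashed prefix deliberately breaks lexical ordering, so listing blobs by a date range means checking every prefix instead of one contiguous range.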
79 00:03:33,160 --> 00:03:36,140 Design your partitioning so that operations can happen 80 00:03:36,140 --> 00:03:38,563 in parallel, using all 6 lanes. 81 00:03:40,010 --> 00:03:41,630 Embrace pruning. 82 00:03:41,630 --> 00:03:44,020 Pruning data means that your application has to look 83 00:03:44,020 --> 00:03:47,980 in fewer places for smaller amounts of information. 84 00:03:47,980 --> 00:03:50,840 Usually this is going to be based on dates, 85 00:03:50,840 --> 00:03:53,340 but consider carefully when you can prune data 86 00:03:53,340 --> 00:03:54,290 out of your system. 87 00:03:56,470 --> 00:03:58,280 Less is more. 88 00:03:58,280 --> 00:04:00,290 Keep your partition sizes smaller, 89 00:04:00,290 --> 00:04:03,820 and that will help keep your query response times lower. 90 00:04:03,820 --> 00:04:05,860 Think of yourself when you're traveling through the airport 91 00:04:05,860 --> 00:04:06,950 with baggage. 92 00:04:06,950 --> 00:04:09,400 If you took everything and the kitchen sink with you, 93 00:04:09,400 --> 00:04:12,070 you're going to be loaded down and not very mobile 94 00:04:12,070 --> 00:04:13,760 as you're trying to get to your plane. 95 00:04:13,760 --> 00:04:14,840 Keep it lean and mean, 96 00:04:14,840 --> 00:04:16,890 and that will enable you to move quickly. 97 00:04:19,600 --> 00:04:22,920 For scalability, plan for the future. 98 00:04:22,920 --> 00:04:25,860 Estimate the size and workload of each partition 99 00:04:25,860 --> 00:04:27,253 with an aim toward balance. 100 00:04:28,120 --> 00:04:31,240 And so, you really need to analyze your application's needs 101 00:04:31,240 --> 00:04:33,543 to project those scalability requirements. 102 00:04:35,700 --> 00:04:38,200 Stay within the limits of the system. 103 00:04:38,200 --> 00:04:40,530 Be aware of Azure infrastructure limits 104 00:04:40,530 --> 00:04:44,510 on a single partition store, and be sure not to exceed them. 
105 00:04:44,510 --> 00:04:46,570 If it looks like you're going to exceed them, 106 00:04:46,570 --> 00:04:49,053 plan for more and smaller partitions. 107 00:04:50,950 --> 00:04:52,230 And then follow up. 108 00:04:52,230 --> 00:04:54,810 Don't just assume that the partition scheme you've created 109 00:04:54,810 --> 00:04:56,740 is going to work as expected. 110 00:04:56,740 --> 00:04:58,630 Monitor it to make sure the distribution 111 00:04:58,630 --> 00:05:00,290 is working optimally. 112 00:05:00,290 --> 00:05:03,160 Reality doesn't always match our predictions, 113 00:05:03,160 --> 00:05:04,080 and that's okay. 114 00:05:04,080 --> 00:05:05,900 Everybody's wrong sometimes. 115 00:05:05,900 --> 00:05:08,060 Hope for the best, but plan for the worst, 116 00:05:08,060 --> 00:05:10,810 and follow up afterwards to make sure you got it right. 117 00:05:13,200 --> 00:05:16,333 Lastly, a couple of availability considerations. 118 00:05:17,400 --> 00:05:19,640 Prioritize by partition. 119 00:05:19,640 --> 00:05:21,760 Apply availability and backup plans 120 00:05:21,760 --> 00:05:25,300 according to a partition's level of criticalness. 121 00:05:25,300 --> 00:05:27,570 You know your data, or you should. 122 00:05:27,570 --> 00:05:29,050 And so, you know what is critical 123 00:05:29,050 --> 00:05:31,450 and what weighs more heavily against other data. 124 00:05:32,470 --> 00:05:34,880 Also be mindful of time. 125 00:05:34,880 --> 00:05:37,510 Know the best times for partitions to be taken offline 126 00:05:37,510 --> 00:05:39,450 for maintenance, and keep them small 127 00:05:39,450 --> 00:05:41,170 to make sure that maintenance completes 128 00:05:41,170 --> 00:05:42,520 within your planned window. 129 00:05:43,603 --> 00:05:46,060 Again, this all comes back to knowing your data 130 00:05:46,060 --> 00:05:48,550 and knowing when something can be taken offline 131 00:05:48,550 --> 00:05:51,103 and how long operations against it will take. 
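The "follow up and monitor" advice above can be sketched as a simple distribution check. This is a minimal Python example under stated assumptions: the blob names are a hard-coded hypothetical list (in practice they would come from listing your storage), and `prefix_distribution` and `skew_ratio` are illustrative helper names, not part of any Azure tooling.

```python
from collections import Counter
from typing import Iterable


def prefix_distribution(blob_names: Iterable[str], prefix_len: int) -> Counter:
    """Count how many blob names share each leading prefix."""
    return Counter(name[:prefix_len] for name in blob_names)


def skew_ratio(distribution: Counter) -> float:
    """Return the largest bucket size divided by the mean bucket size.

    A value near 1.0 suggests an even spread; larger values suggest a hot bucket.
    """
    counts = list(distribution.values())
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical names: three blobs share one date prefix, one blob has another.
names = [
    "20240101-a.parquet",
    "20240101-b.parquet",
    "20240101-c.parquet",
    "20240102-a.parquet",
]
dist = prefix_distribution(names, prefix_len=8)
print(dist.most_common())
print(skew_ratio(dist))  # 1.5 for this sample
```

What counts as too much skew depends on your workload, so treat any threshold as something to tune rather than a fixed rule.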
132 00:05:53,590 --> 00:05:54,743 By way of review, 133 00:05:55,800 --> 00:05:58,340 the naming convention you use will greatly affect 134 00:05:58,340 --> 00:06:00,250 how data is partitioned. 135 00:06:00,250 --> 00:06:02,160 And I can't stress that enough. 136 00:06:02,160 --> 00:06:04,600 Be mindful of how you plan your account, 137 00:06:04,600 --> 00:06:06,670 container, and blob names. 138 00:06:06,670 --> 00:06:09,540 This is going to be your partition key, and therefore, 139 00:06:09,540 --> 00:06:12,683 what affects your partitioning in that range-based system. 140 00:06:14,630 --> 00:06:17,440 For best performance and fewer administrative headaches, 141 00:06:17,440 --> 00:06:20,780 keep partition sizes smaller and prune. 142 00:06:20,780 --> 00:06:22,790 I know the tendency is always to hang on 143 00:06:22,790 --> 00:06:24,340 to everything possible, 144 00:06:24,340 --> 00:06:26,430 but really, that's only going to hurt you in the end, 145 00:06:26,430 --> 00:06:27,293 not help you. 146 00:06:28,330 --> 00:06:30,730 And lastly, I know you hear me say it a lot, 147 00:06:30,730 --> 00:06:32,450 but know your data well 148 00:06:32,450 --> 00:06:34,680 in order to plan your partitions well. 149 00:06:34,680 --> 00:06:37,723 And then, monitor afterwards to verify your success. 150 00:06:38,770 --> 00:06:40,190 That's it for this lesson. 151 00:06:40,190 --> 00:06:42,714 I hope this helps you understand better 152 00:06:42,714 --> 00:06:44,544 good partitioning practices, 153 00:06:44,544 --> 00:06:47,210 especially when applied to Azure Data Lake. 154 00:06:47,210 --> 00:06:48,943 In our next lesson, 155 00:06:48,943 --> 00:06:50,881 we're going to continue discussing 156 00:06:50,881 --> 00:06:52,200 how to have a great partitioning strategy 157 00:06:52,200 --> 00:06:55,120 and also focus a little more on Azure Synapse. 158 00:06:55,120 --> 00:06:56,870 I look forward to seeing you there.