1 00:00:00,770 --> 00:00:02,910 Hey, Cloud Gurus. Welcome back. 2 00:00:02,910 --> 00:00:05,730 Now that you know about creating your Azure data lake 3 00:00:05,730 --> 00:00:07,970 and some basics about container creation, 4 00:00:07,970 --> 00:00:11,100 let's talk about getting its folder structure right. 5 00:00:11,100 --> 00:00:13,760 Not every data lake will be structured the same way, 6 00:00:13,760 --> 00:00:15,300 but there are some common practices 7 00:00:15,300 --> 00:00:17,000 that I'd like to go over with you. 8 00:00:17,940 --> 00:00:20,320 In this lesson, we'll start by talking about 9 00:00:20,320 --> 00:00:23,280 data lake zones, and then we'll talk about 10 00:00:23,280 --> 00:00:26,720 some folder structure strategy within those zones. 11 00:00:26,720 --> 00:00:28,000 From that general strategy, 12 00:00:28,000 --> 00:00:30,440 we'll construct some examples for you 13 00:00:30,440 --> 00:00:32,763 and then we'll wrap everything up in a review. 14 00:00:35,130 --> 00:00:37,450 And so let's start with data zones. 15 00:00:37,450 --> 00:00:39,220 And zone sounds really intimidating, 16 00:00:39,220 --> 00:00:41,590 but really all this means is a folder 17 00:00:41,590 --> 00:00:43,260 such as the ones you saw me create 18 00:00:43,260 --> 00:00:45,950 in the last lesson's demo. 19 00:00:45,950 --> 00:00:47,630 There are usually 3 to 5, 20 00:00:47,630 --> 00:00:49,610 but everybody does it differently. 21 00:00:49,610 --> 00:00:52,760 Some are optional, but we'll get to that in a moment. 22 00:00:52,760 --> 00:00:55,420 Starting out, we have our landing zone, 23 00:00:55,420 --> 00:00:59,840 and this is where our raw data lands in its original state. 24 00:00:59,840 --> 00:01:03,600 It's always immutable and not usually fit for consumption. 25 00:01:03,600 --> 00:01:06,460 It can be organized with folders per source. 
26 00:01:06,460 --> 00:01:07,430 And as you can imagine, 27 00:01:07,430 --> 00:01:09,340 this can become a very expensive zone 28 00:01:09,340 --> 00:01:11,270 because you're keeping the full set of data 29 00:01:11,270 --> 00:01:13,090 from a lot of different places. 30 00:01:13,090 --> 00:01:15,340 And so you might want to programmatically move it 31 00:01:15,340 --> 00:01:18,743 to a cool access tier at an interval of your choosing. 32 00:01:19,610 --> 00:01:21,150 From there, we have staging. 33 00:01:21,150 --> 00:01:23,400 And this is the first step to refinement 34 00:01:23,400 --> 00:01:25,510 by adding basic structure. 35 00:01:25,510 --> 00:01:29,160 There's usually an automatic process to bring raw data in 36 00:01:29,160 --> 00:01:31,200 and provide a better structure for it 37 00:01:31,200 --> 00:01:33,320 to prepare it for being curated. 38 00:01:33,320 --> 00:01:36,060 But even at this stage, it provides an increased value 39 00:01:36,060 --> 00:01:38,170 over the raw data. 40 00:01:38,170 --> 00:01:40,660 From there, we have our more curated zone, 41 00:01:40,660 --> 00:01:43,740 where it's transformed into consumable datasets. 42 00:01:43,740 --> 00:01:46,160 This could be files or tables. 43 00:01:46,160 --> 00:01:48,860 This zone can be used to feed a data warehouse, 44 00:01:48,860 --> 00:01:51,450 but it's not a replacement for a data warehouse. 45 00:01:51,450 --> 00:01:55,320 The speeds aren't suited to end user dashboards or reports. 46 00:01:55,320 --> 00:01:58,640 Instead, it's focused more on larger internal analytics 47 00:01:58,640 --> 00:02:02,810 that don't have time requirements such as ad hoc queries. 48 00:02:02,810 --> 00:02:05,260 But it differentiates itself from the staging 49 00:02:05,260 --> 00:02:07,520 in that it contains only consumer data 50 00:02:07,520 --> 00:02:09,130 that has been quality checked 51 00:02:09,130 --> 00:02:11,840 and combined with like sources. 
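Going back to the landing zone for a moment: one way to programmatically move raw data to a cool access tier at an interval of your choosing is a storage account lifecycle management policy. Here is a minimal sketch that just builds the policy document in Python. The container name `datalake`, the `raw/` folder, the rule name, and the 30-day interval are all assumptions for illustration, and the JSON shape should be checked against the current Azure documentation before you apply it.

```python
import json


def build_cool_tier_policy(prefix: str, days: int) -> dict:
    """Build a lifecycle policy document that tiers block blobs under
    `prefix` to cool once they have gone `days` days without modification."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": "raw-to-cool",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # The prefix starts with the container name.
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": days
                            }
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # "datalake/raw/" and 30 days are illustrative values only.
    print(json.dumps(build_cool_tier_policy("datalake/raw/", 30), indent=2))
```

Scoping the rule to the raw zone's prefix means the staging and curated folders keep whatever tier they already have.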
52 00:02:11,840 --> 00:02:15,770 We use that curated data to feed our production zone. 53 00:02:15,770 --> 00:02:19,060 And this provides an easy access point for consumers 54 00:02:19,060 --> 00:02:21,490 that includes business logic. 55 00:02:21,490 --> 00:02:24,610 And that business logic may be something like surrogate keys 56 00:02:24,610 --> 00:02:27,690 or other application-specific needs. 57 00:02:27,690 --> 00:02:30,350 We can also use this zone to feed our 58 00:02:30,350 --> 00:02:32,430 machine learning models. 59 00:02:32,430 --> 00:02:35,340 But overall, it's beneficial because it provides data 60 00:02:35,340 --> 00:02:37,523 to consumers in a friendly format. 61 00:02:38,630 --> 00:02:41,300 Lastly, we may have an experimental zone. 62 00:02:41,300 --> 00:02:44,330 And this is more of a sandbox where our data scientists 63 00:02:44,330 --> 00:02:47,070 can combine multiple datasets 64 00:02:47,070 --> 00:02:50,640 both from within the data lake and from outside of it. 65 00:02:50,640 --> 00:02:52,590 And they can mix and match and experiment 66 00:02:52,590 --> 00:02:53,913 to their heart's content. 67 00:02:55,130 --> 00:02:58,930 As I said earlier, not every organization is going to have 68 00:02:58,930 --> 00:03:01,470 these exact zones, and even if they do, 69 00:03:01,470 --> 00:03:02,900 they may name them differently. 70 00:03:02,900 --> 00:03:04,590 That's completely okay. 71 00:03:04,590 --> 00:03:07,140 What's important is understanding the general philosophy 72 00:03:07,140 --> 00:03:08,780 behind each. 73 00:03:08,780 --> 00:03:12,170 And as I said, not everywhere will have all of them. 74 00:03:12,170 --> 00:03:14,610 The most important and commonly shared among them 75 00:03:14,610 --> 00:03:19,610 is our landing zone, curated, and production. 
76 00:03:19,660 --> 00:03:22,120 Almost every analytic solution is going to have 77 00:03:22,120 --> 00:03:25,530 at least those, and everything else just kind of adds 78 00:03:25,530 --> 00:03:26,943 niceties around them. 79 00:03:28,410 --> 00:03:30,710 But once you have your zone structure, 80 00:03:30,710 --> 00:03:34,130 within there you begin to develop your folder structure. 81 00:03:34,130 --> 00:03:36,203 So let's go over some strategy for that. 82 00:03:37,710 --> 00:03:39,690 First, you want to keep it simple. 83 00:03:39,690 --> 00:03:42,160 Your naming convention should be human readable, 84 00:03:42,160 --> 00:03:45,300 easily understood, and self-documenting. 85 00:03:45,300 --> 00:03:47,550 It's always tempting to over-engineer things 86 00:03:47,550 --> 00:03:50,970 and make it super complex, but don't fall into that trap. 87 00:03:50,970 --> 00:03:52,560 The simpler you make it, 88 00:03:52,560 --> 00:03:54,943 the easier it is to scale and to manage. 89 00:03:56,190 --> 00:03:59,890 Speaking of management, it should be appropriately granular. 90 00:03:59,890 --> 00:04:02,230 Design for effective permissions 91 00:04:02,230 --> 00:04:06,210 without generating unnecessary maintenance overhead. 92 00:04:06,210 --> 00:04:07,930 You can name it in such a way 93 00:04:07,930 --> 00:04:09,980 that you're going to create problems for yourself 94 00:04:09,980 --> 00:04:11,750 by having too many subdirectories 95 00:04:11,750 --> 00:04:14,200 on which you have to manage too many permissions. 96 00:04:15,580 --> 00:04:17,580 Be thoughtful in your partitioning. 97 00:04:17,580 --> 00:04:20,300 Align it with your partition strategy for the purpose 98 00:04:20,300 --> 00:04:22,160 of the zone you're working in. 99 00:04:22,160 --> 00:04:26,410 For example, aim for optimal retrieval on the curated zone. 100 00:04:26,410 --> 00:04:28,830 And don't feel you have to have the same strategy 101 00:04:28,830 --> 00:04:30,780 across all of the zones. 
102 00:04:30,780 --> 00:04:32,760 Design it intentionally. 103 00:04:32,760 --> 00:04:35,360 We'll get to more on partitioning in a later lesson. 104 00:04:36,440 --> 00:04:38,630 Also, group similar items. 105 00:04:38,630 --> 00:04:40,840 In general, folders should have files 106 00:04:40,840 --> 00:04:43,090 of the same schema and format. 107 00:04:43,090 --> 00:04:44,660 It just makes it easier to work with them 108 00:04:44,660 --> 00:04:45,863 in a consistent way. 109 00:04:47,790 --> 00:04:49,140 Now that you have some strategy, 110 00:04:49,140 --> 00:04:51,500 let's go over some concrete examples. 111 00:04:51,500 --> 00:04:55,230 We can create structures by source system, by departments, 112 00:04:55,230 --> 00:04:58,050 by projects, and really whatever makes sense 113 00:04:58,050 --> 00:04:59,263 for your business. 114 00:05:01,170 --> 00:05:04,070 Some examples of these types of filtering include 115 00:05:04,070 --> 00:05:08,290 for our source system, maybe we have one in our raw zone 116 00:05:08,290 --> 00:05:10,310 divided by data source. 117 00:05:10,310 --> 00:05:13,520 Like I mentioned earlier, you can have a folder 118 00:05:13,520 --> 00:05:16,650 within your raw zone for each of your data sources. 119 00:05:16,650 --> 00:05:18,820 That allows write permissions to be granted 120 00:05:18,820 --> 00:05:21,200 to source systems at the data source level 121 00:05:21,200 --> 00:05:23,010 with default ACLs. 122 00:05:23,010 --> 00:05:24,500 And then permissions will be inherited 123 00:05:24,500 --> 00:05:25,850 to new folders under there. 124 00:05:27,400 --> 00:05:29,440 For department, perhaps you have 125 00:05:29,440 --> 00:05:33,200 within your production zone a sales organization 126 00:05:33,200 --> 00:05:36,180 or a marketing or an internal IT one. 127 00:05:36,180 --> 00:05:39,150 All of those can be divided up by department. 
128 00:05:39,150 --> 00:05:41,360 And then you can point various systems 129 00:05:41,360 --> 00:05:42,930 with different application logic 130 00:05:42,930 --> 00:05:44,913 at those different folder structures. 131 00:05:45,830 --> 00:05:48,330 A common thing to do is to use dates. 132 00:05:48,330 --> 00:05:50,720 So maybe we have things down a few levels 133 00:05:50,720 --> 00:05:53,860 divided into years, month, and day, 134 00:05:53,860 --> 00:05:56,260 with the day being our final folder 135 00:05:56,260 --> 00:05:58,600 that holds our actual files. 136 00:05:58,600 --> 00:05:59,610 If you put that all together, 137 00:05:59,610 --> 00:06:01,500 it might look something like this: 138 00:06:01,500 --> 00:06:04,575 our raw zone with our data source folder, 139 00:06:04,575 --> 00:06:06,370 whatever entity is applicable, 140 00:06:06,370 --> 00:06:09,530 and then our year, month, day structure. 141 00:06:09,530 --> 00:06:11,700 If you put that year, month, and day at the front 142 00:06:11,700 --> 00:06:12,880 instead of the data source, 143 00:06:12,880 --> 00:06:14,500 you could cause problems for yourself 144 00:06:14,500 --> 00:06:17,180 because it would be more difficult to inherit permissions, 145 00:06:17,180 --> 00:06:19,240 and therefore to maintain them. 146 00:06:19,240 --> 00:06:21,800 It's much easier to set it at the data source level 147 00:06:21,800 --> 00:06:24,010 and let that trickle down to all of your 148 00:06:24,010 --> 00:06:25,990 years, months, and days. 149 00:06:25,990 --> 00:06:29,470 Having the dates also makes it easy to prune later 150 00:06:29,470 --> 00:06:32,893 if you need to go back and delete certain ranges of dates. 151 00:06:35,030 --> 00:06:36,240 Aside from this filtering, 152 00:06:36,240 --> 00:06:39,290 we can also use sensitivity levels. 153 00:06:39,290 --> 00:06:42,130 Perhaps in your raw zone, you can divide it into 154 00:06:42,130 --> 00:06:44,663 general before splitting it into data sources. 
155 00:06:45,840 --> 00:06:49,150 And for your sensitive data in your staging zone, 156 00:06:49,150 --> 00:06:52,630 again, you may have a separate sensitive folder. 157 00:06:52,630 --> 00:06:55,220 But this allows us to add different security controls 158 00:06:55,220 --> 00:06:56,883 on different folder structures. 159 00:06:59,400 --> 00:07:02,640 By way of review, zone and folder structure 160 00:07:02,640 --> 00:07:06,400 help represent various levels of data transformation. 161 00:07:06,400 --> 00:07:08,540 And really that's all you're trying to do here. 162 00:07:08,540 --> 00:07:11,760 You're trying to create a structure that is both beneficial 163 00:07:11,760 --> 00:07:14,890 for your automated workflow and for the humans 164 00:07:14,890 --> 00:07:16,040 trying to work with it. 165 00:07:18,270 --> 00:07:20,010 Folder structure can affect everything 166 00:07:20,010 --> 00:07:23,120 from query performance to security, 167 00:07:23,120 --> 00:07:26,610 data pruning, and administrative maintenance. 168 00:07:26,610 --> 00:07:29,040 And so making smart choices up front 169 00:07:29,040 --> 00:07:31,633 will help you down the road in each of those areas. 170 00:07:33,170 --> 00:07:35,550 And lastly, each solution has to be crafted 171 00:07:35,550 --> 00:07:39,290 according to individual business requirements and data type. 172 00:07:39,290 --> 00:07:41,990 No 2 businesses are going to have the same 173 00:07:41,990 --> 00:07:43,640 analytical needs. 174 00:07:43,640 --> 00:07:47,500 And so don't feel like you have to use the exact same zone 175 00:07:47,500 --> 00:07:49,720 or folder structure as someone else out there 176 00:07:49,720 --> 00:07:50,710 on the internet. 177 00:07:50,710 --> 00:07:52,590 Take advice with a grain of salt 178 00:07:52,590 --> 00:07:56,180 and design it in a way that makes sense for your business. 179 00:07:56,180 --> 00:07:58,760 That's it for this lesson. Thank you for joining me. 
180 00:07:58,760 --> 00:08:00,810 I look forward to seeing you in the next.
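If it helps to see the folder layout from this lesson written down, here is a rough sketch of the zone, data source, entity, year, month, day structure, with an optional sensitivity folder placed before the data source. The function names and the example source and entity names (`crm`, `customers`) are hypothetical and not part of any Azure SDK. This only builds and filters path strings; it does not touch a real data lake.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Iterable, List, Optional


def partition_path(zone: str, source: str, entity: str, day: date,
                   sensitivity: Optional[str] = None) -> PurePosixPath:
    """Build zone/[sensitivity]/source/entity/yyyy/mm/dd.

    The data source sits above the dates so permissions set on the
    source folder can trickle down to every year, month, and day.
    """
    parts = [zone]
    if sensitivity:
        parts.append(sensitivity)  # e.g. "general" or "sensitive"
    parts += [source, entity, f"{day:%Y}", f"{day:%m}", f"{day:%d}"]
    return PurePosixPath(*parts)


def day_of(path: PurePosixPath) -> date:
    """Read the date back from the last three folder names."""
    year, month, day = path.parts[-3:]
    return date(int(year), int(month), int(day))


def folders_to_prune(paths: Iterable[PurePosixPath],
                     start: date, end: date) -> List[PurePosixPath]:
    """Return the day folders whose date falls within [start, end]."""
    return [p for p in paths if start <= day_of(p) <= end]


if __name__ == "__main__":
    days = [date(2024, 3, d) for d in range(1, 6)]
    paths = [partition_path("raw", "crm", "customers", d) for d in days]
    print(paths[0])  # raw/crm/customers/2024/03/01
    for p in folders_to_prune(paths, date(2024, 3, 2), date(2024, 3, 3)):
        print("prune:", p)
```

Because the day is the final folder, picking out a range of dates to delete is just a filter over folder names, which is the pruning benefit mentioned in the lesson.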