1
00:00:00,620 --> 00:00:01,560
Hey, Cloud Gurus.

2
00:00:01,560 --> 00:00:04,963
Welcome to Partitioning Best Practices: Part 1.

3
00:00:06,720 --> 00:00:07,553
In this lesson,

4
00:00:07,553 --> 00:00:11,150
we're going to discuss partitioning in Azure Data Lake.

5
00:00:11,150 --> 00:00:14,110
We're also going to look at performance considerations,

6
00:00:14,110 --> 00:00:18,810
scalability considerations, and availability considerations.

7
00:00:18,810 --> 00:00:22,110
Much of the content within those concepts will apply to both

8
00:00:22,110 --> 00:00:24,850
Azure Data Lake and Azure Synapse.

9
00:00:24,850 --> 00:00:26,250
But we're going to be focused mostly

10
00:00:26,250 --> 00:00:28,640
in this lesson on Azure Data Lake.

11
00:00:28,640 --> 00:00:29,690
And then, in part 2,

12
00:00:29,690 --> 00:00:32,693
we'll extend that into some specifics about Azure Synapse.

13
00:00:33,810 --> 00:00:35,880
At the end of this lesson, as usual,

14
00:00:35,880 --> 00:00:37,503
we will wrap it up in a review.

15
00:00:40,770 --> 00:00:43,130
Of course, before we can talk about best practices

16
00:00:43,130 --> 00:00:45,870
for Azure Data Lake, we need to kind of understand

17
00:00:45,870 --> 00:00:49,440
how the partitioning is working behind the scenes.

18
00:00:49,440 --> 00:00:53,500
The partition key for Blob and Data Lake storage is actually

19
00:00:53,500 --> 00:00:57,830
composed of the account, plus the container, plus the blob.

20
00:00:57,830 --> 00:01:01,130
So, in essence, the full blob name.

21
00:01:01,130 --> 00:01:03,910
For example, if we had our Data Lake account,

22
00:01:03,910 --> 00:01:05,830
named acdatalake,

23
00:01:05,830 --> 00:01:08,690
you can see it represented in orange here.

24
00:01:08,690 --> 00:01:12,560
And then, our container for our landing zone in purple,

25
00:01:12,560 --> 00:01:15,420
and finally, the blob itself, the actual file,

26
00:01:15,420 --> 00:01:18,290
our Parquet file in this case.

27
00:01:18,290 --> 00:01:20,150
And that gives you the full partition key

28
00:01:20,150 --> 00:01:22,053
when working with Azure Data Lake.
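To make the idea concrete, here is a minimal sketch of how the three names combine into one key. The account name acdatalake comes from the lesson; the container name "landing", the file name "sales.parquet", and the "/" separator are assumptions chosen only for illustration, not the format Azure uses internally.

```python
def partition_key(account: str, container: str, blob: str) -> str:
    """Join account, container, and blob name into one illustrative key."""
    # The "/" separator is an assumption made for readability.
    return f"{account}/{container}/{blob}"


# "landing" and "sales.parquet" are hypothetical example names.
key = partition_key("acdatalake", "landing", "sales.parquet")
print(key)
```

The point is that changing any one of the three names changes the key, and so can change where the data lands.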

29
00:01:24,630 --> 00:01:27,730
Behind the scenes, it uses range-based partitioning.

30
00:01:27,730 --> 00:01:29,720
So the data is split into ranges,

31
00:01:29,720 --> 00:01:33,280
which are load balanced across the storage system.

32
00:01:33,280 --> 00:01:35,414
With this type of approach,

33
00:01:35,414 --> 00:01:37,354
if you're using a naming convention

34
00:01:37,354 --> 00:01:39,071
that has lexical ordering,

35
00:01:39,071 --> 00:01:41,130
that means your partitions are more likely to be located

36
00:01:41,130 --> 00:01:43,110
on the same partition server.

37
00:01:43,110 --> 00:01:45,740
And then, once load increases to a certain point,

38
00:01:45,740 --> 00:01:47,763
they are split into smaller ranges.

39
00:01:48,660 --> 00:01:51,060
Let's get a visual idea of what that looks like.

40
00:01:52,460 --> 00:01:55,930
If our convention for Awesome Company is prefixed with AC,

41
00:01:55,930 --> 00:01:58,193
we have our AC engineering department,

42
00:01:59,185 --> 00:02:03,020
AC HR, AC IT, and AC sales.

43
00:02:03,020 --> 00:02:04,910
And to start with, they're all in order,

44
00:02:04,910 --> 00:02:08,320
and they're all served by the same partition server.

45
00:02:08,320 --> 00:02:10,760
But over time, as the load increases,

46
00:02:10,760 --> 00:02:13,330
it will be split out into smaller ranges,

47
00:02:13,330 --> 00:02:15,623
each served by their own partition server.
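The split described here can be sketched in a few lines. This is a toy model and not how the storage service is implemented: the department names, the split points, and the assign_ranges helper are all hypothetical, and each range index stands in for a partition server.

```python
from bisect import bisect_right


def assign_ranges(names, split_points):
    """Map each name to the index of the key range it falls in.

    split_points holds the sorted lower bounds of every range after the first.
    """
    return {name: bisect_right(split_points, name) for name in names}


names = ["ac-engineering", "ac-hr", "ac-it", "ac-sales"]

# No split points yet: every name falls in range 0 (one server).
before = assign_ranges(names, [])

# After load grows, the single range is split into four smaller ranges.
after = assign_ranges(names, ["ac-hr", "ac-it", "ac-sales"])

print(before)
print(after)
```

Because the names sort next to each other, they start out in the same range, and only a split spreads them out.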

48
00:02:17,020 --> 00:02:19,890
Having these 2 key pieces of information is important

49
00:02:19,890 --> 00:02:21,830
when considering your partitioning strategy

50
00:02:21,830 --> 00:02:23,590
with Azure Data Lake.

51
00:02:23,590 --> 00:02:26,450
Be aware that your naming at the account, container,

52
00:02:26,450 --> 00:02:29,860
and blob level affects the partition key.

53
00:02:29,860 --> 00:02:31,240
And that, in the background,

54
00:02:31,240 --> 00:02:34,223
it's using that to perform this range-based partitioning.

55
00:02:37,410 --> 00:02:40,560
Let's dive into some performance considerations.

56
00:02:40,560 --> 00:02:42,770
First, based on what we just learned,

57
00:02:42,770 --> 00:02:45,060
it's important to name with care.

58
00:02:45,060 --> 00:02:48,470
Create a naming convention that sets you up for success.

59
00:02:48,470 --> 00:02:51,399
Remember that when we say naming convention,

60
00:02:51,399 --> 00:02:52,480
that includes accounts, containers,

61
00:02:52,480 --> 00:02:55,010
as well as the blobs themselves.

62
00:02:55,010 --> 00:02:57,210
Also, be careful if you're using numeric

63
00:02:57,210 --> 00:02:59,530
or timestamp identifiers.

64
00:02:59,530 --> 00:03:01,438
If you're not careful,

65
00:03:01,438 --> 00:03:03,120
you could dump everything into 1 bucket.

66
00:03:03,120 --> 00:03:04,547
For instance,

67
00:03:04,547 --> 00:03:07,370
if you're using today's date, or you're using a specific ID,

68
00:03:07,370 --> 00:03:09,190
you could dump a large amount of records

69
00:03:09,190 --> 00:03:10,980
into just that bucket.

70
00:03:10,980 --> 00:03:13,330
So consider using a hashing function

71
00:03:13,330 --> 00:03:15,693
to prefix names with a 3-digit hash.
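One possible way to build such a prefix is sketched below. The lesson does not name a hash function, so the use of SHA-256, the modulo 1000 step, the "-" separator, and the hashed_name helper are all assumptions made for illustration.

```python
import hashlib


def hashed_name(name: str) -> str:
    """Prefix a name with a stable 3-digit hash in the range 000-999."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % 1000  # reduce the digest to three digits
    return f"{prefix:03d}-{name}"


# Hypothetical date-stamped file names that would otherwise sort together.
for day in ("2024-01-15", "2024-01-16", "2024-01-17"):
    print(hashed_name(f"{day}-orders.parquet"))
```

The prefix is derived from the name itself, so the same name always gets the same prefix, while names that differ only by date will usually get unrelated prefixes and no longer sort next to each other.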

72
00:03:17,130 --> 00:03:19,920
Remember that one of the primary goals in these systems

73
00:03:19,920 --> 00:03:21,980
is parallel operations.

74
00:03:21,980 --> 00:03:24,500
The more reads and writes that can happen in parallel,

75
00:03:24,500 --> 00:03:26,040
the better.

76
00:03:26,040 --> 00:03:27,483
Think of a highway system.

77
00:03:28,460 --> 00:03:30,440
If you have a 6-lane highway, you're crippling yourself

78
00:03:30,440 --> 00:03:33,160
if you're only using 2 of those lanes.

79
00:03:33,160 --> 00:03:36,140
Design your partitioning so that operations can happen

80
00:03:36,140 --> 00:03:38,563
in parallel, using all 6 lanes.

81
00:03:40,010 --> 00:03:41,630
Embrace pruning.

82
00:03:41,630 --> 00:03:44,020
Pruning data means that your application has to look

83
00:03:44,020 --> 00:03:47,980
in fewer places for smaller amounts of information.

84
00:03:47,980 --> 00:03:50,840
Usually this is going to be based on dates,

85
00:03:50,840 --> 00:03:53,340
but consider carefully when you can prune data

86
00:03:53,340 --> 00:03:54,290
out of your system.

87
00:03:56,470 --> 00:03:58,280
Less is more.

88
00:03:58,280 --> 00:04:00,290
Keep your partition sizes smaller,

89
00:04:00,290 --> 00:04:03,820
and that will help keep your query response times lower.

90
00:04:03,820 --> 00:04:05,860
Think of yourself when you're traveling through the airport

91
00:04:05,860 --> 00:04:06,950
with baggage.

92
00:04:06,950 --> 00:04:09,400
If you took everything and the kitchen sink with you,

93
00:04:09,400 --> 00:04:12,070
you're going to be loaded down and not very mobile

94
00:04:12,070 --> 00:04:13,760
as you're trying to get to your plane.

95
00:04:13,760 --> 00:04:14,840
Keep it lean and mean,

96
00:04:14,840 --> 00:04:16,890
and that will enable you to move quickly.

97
00:04:19,600 --> 00:04:22,920
For scalability, plan for the future.

98
00:04:22,920 --> 00:04:25,860
Estimate the size and workload of each partition

99
00:04:25,860 --> 00:04:27,253
with an aim toward balance.

100
00:04:28,120 --> 00:04:31,240
And so, you really need to analyze your application's needs

101
00:04:31,240 --> 00:04:33,543
to project those scalability requirements.

102
00:04:35,700 --> 00:04:38,200
Stay within the limits of the system.

103
00:04:38,200 --> 00:04:40,530
Be aware of Azure infrastructure limits

104
00:04:40,530 --> 00:04:44,510
on a single partition store, and be sure not to exceed them.

105
00:04:44,510 --> 00:04:46,570
If it looks like you're going to exceed them,

106
00:04:46,570 --> 00:04:49,053
plan for more and smaller partitions.

107
00:04:50,950 --> 00:04:52,230
And then follow up.

108
00:04:52,230 --> 00:04:54,810
Don't just assume that the partition scheme you've created

109
00:04:54,810 --> 00:04:56,740
is going to work as expected.

110
00:04:56,740 --> 00:04:58,630
Monitor it to make sure the distribution

111
00:04:58,630 --> 00:05:00,290
is working optimally.

112
00:05:00,290 --> 00:05:03,160
Reality doesn't always match our predictions,

113
00:05:03,160 --> 00:05:04,080
and that's okay.

114
00:05:04,080 --> 00:05:05,900
Everybody's wrong sometimes.

115
00:05:05,900 --> 00:05:08,060
Hope for the best, but plan for the worst,

116
00:05:08,060 --> 00:05:10,810
and follow up afterwards to make sure you got it right.

117
00:05:13,200 --> 00:05:16,333
Lastly, a couple of availability considerations.

118
00:05:17,400 --> 00:05:19,640
Prioritize by partition.

119
00:05:19,640 --> 00:05:21,760
Apply availability and backup plans

120
00:05:21,760 --> 00:05:25,300
according to a partition's level of criticality.

121
00:05:25,300 --> 00:05:27,570
You know your data, or you should.

122
00:05:27,570 --> 00:05:29,050
And so, you know what is critical

123
00:05:29,050 --> 00:05:31,450
and what weighs more heavily against other data.

124
00:05:32,470 --> 00:05:34,880
Also be mindful of time.

125
00:05:34,880 --> 00:05:37,510
Know the best times for partitions to be taken offline

126
00:05:37,510 --> 00:05:39,450
for maintenance, and keep them small

127
00:05:39,450 --> 00:05:41,170
to make sure that maintenance completes

128
00:05:41,170 --> 00:05:42,520
within your planned window.

129
00:05:43,603 --> 00:05:46,060
Again, this all comes back to knowing your data

130
00:05:46,060 --> 00:05:48,550
and knowing when something can be taken offline

131
00:05:48,550 --> 00:05:51,103
and how long operations against it will take.

132
00:05:53,590 --> 00:05:54,743
By way of review,

133
00:05:55,800 --> 00:05:58,340
the naming convention you use will greatly affect

134
00:05:58,340 --> 00:06:00,250
how data is partitioned.

135
00:06:00,250 --> 00:06:02,160
And I can't stress that enough.

136
00:06:02,160 --> 00:06:04,600
Be mindful of how you plan your account,

137
00:06:04,600 --> 00:06:06,670
container, and blob names.

138
00:06:06,670 --> 00:06:09,540
This is going to be your partition key, and therefore,

139
00:06:09,540 --> 00:06:12,683
what affects your partitioning in that range-based system.

140
00:06:14,630 --> 00:06:17,440
For best performance and fewer administrative headaches,

141
00:06:17,440 --> 00:06:20,780
keep partition sizes smaller and prune.

142
00:06:20,780 --> 00:06:22,790
I know the tendency is always to hang on

143
00:06:22,790 --> 00:06:24,340
to everything possible,

144
00:06:24,340 --> 00:06:26,430
but really, that's only going to hurt you in the end,

145
00:06:26,430 --> 00:06:27,293
not help you.

146
00:06:28,330 --> 00:06:30,730
And lastly, I know you hear me say it a lot,

147
00:06:30,730 --> 00:06:32,450
but know your data well

148
00:06:32,450 --> 00:06:34,680
in order to plan your partitions well.

149
00:06:34,680 --> 00:06:37,723
And then, monitor afterwards to verify your success.

150
00:06:38,770 --> 00:06:40,190
That's it for this lesson.

151
00:06:40,190 --> 00:06:42,714
I hope this helps you better understand

152
00:06:42,714 --> 00:06:44,544
good partitioning practices,

153
00:06:44,544 --> 00:06:47,210
especially when applied to Azure Data Lake.

154
00:06:47,210 --> 00:06:48,943
In our next lesson,

155
00:06:48,943 --> 00:06:50,881
we're going to continue discussing

156
00:06:50,881 --> 00:06:52,200
how to have a great partitioning strategy

157
00:06:52,200 --> 00:06:55,120
and also focus a little more on Azure Synapse.

158
00:06:55,120 --> 00:06:56,870
I look forward to seeing you there.