1
00:00:00,690 --> 00:00:01,523
In this lesson,

2
00:00:01,523 --> 00:00:02,960
we are going to take some of the concepts

3
00:00:02,960 --> 00:00:04,980
that we learned about data flow,

4
00:00:04,980 --> 00:00:06,790
and we're going to dive just a little bit deeper

5
00:00:06,790 --> 00:00:09,760
to talk about how we handle schema drift.

6
00:00:09,760 --> 00:00:11,970
Specifically, we are going to talk about

7
00:00:11,970 --> 00:00:14,490
what schema drift actually is.

8
00:00:14,490 --> 00:00:15,860
Then we're going to talk about

9
00:00:15,860 --> 00:00:18,190
how we handle schema drift in data flow

10
00:00:18,190 --> 00:00:21,370
by talking about the what, when, where, and why.

11
00:00:21,370 --> 00:00:22,410
And then we're going to finish up

12
00:00:22,410 --> 00:00:24,430
with a live look in the Azure portal

13
00:00:24,430 --> 00:00:29,293
so you can see how we would set up for schema drift.

14
00:00:30,250 --> 00:00:33,230
Alright, so what is schema drift?

15
00:00:33,230 --> 00:00:37,880
Well, you can have changes that are fast or slow,

16
00:00:37,880 --> 00:00:40,590
and so take a look at this boat over here on the right.

17
00:00:40,590 --> 00:00:42,050
Think about currents.

18
00:00:42,050 --> 00:00:45,550
So over time, that boat is going to slowly move

19
00:00:45,550 --> 00:00:48,180
if you don't have it anchored to the dock.

20
00:00:48,180 --> 00:00:49,900
Same thing is true for that race car.

21
00:00:49,900 --> 00:00:51,320
It just happens really fast.

22
00:00:51,320 --> 00:00:52,530
As it goes around the corner,

23
00:00:52,530 --> 00:00:53,680
it's going to drift,

24
00:00:53,680 --> 00:00:57,300
or move off of the line that it intends to go on.

25
00:00:57,300 --> 00:01:01,810
And again, when we talk about data or schemas,

26
00:01:01,810 --> 00:01:06,680
this drift is going to lead to a breakdown of your pipelines.

27
00:01:06,680 --> 00:01:08,500
So in data engineering,

28
00:01:08,500 --> 00:01:10,640
when we talk about schema drift,

29
00:01:10,640 --> 00:01:13,270
we're thinking about changes to our fields

30
00:01:13,270 --> 00:01:16,160
or our columns or our data types.

31
00:01:16,160 --> 00:01:18,970
That is what schema drift looks like.

32
00:01:18,970 --> 00:01:20,350
And again, that leads to,

33
00:01:20,350 --> 00:01:21,890
when we talk about Data Factory,

34
00:01:21,890 --> 00:01:26,330
a breakdown of pipelines if we don't correctly account

35
00:01:26,330 --> 00:01:27,763
for that schema drift.

36
00:01:28,810 --> 00:01:31,570
So make sure that you think about the stability

37
00:01:31,570 --> 00:01:34,670
of your schema as you are building pipelines

38
00:01:34,670 --> 00:01:36,860
and as you are using data flow.

39
00:01:36,860 --> 00:01:37,957
You need to be considering,

40
00:01:37,957 --> 00:01:41,970
"Well, is my schema likely to change over time?"

41
00:01:41,970 --> 00:01:44,730
And if your schema is likely to change over time,

42
00:01:44,730 --> 00:01:46,370
you need to think about that stability

43
00:01:46,370 --> 00:01:48,850
and make sure that you have accounted for that

44
00:01:48,850 --> 00:01:52,693
in your data flow and in your Data Factory pipelines.

45
00:01:54,970 --> 00:01:57,710
So what about the what, when, where, and why?

46
00:01:57,710 --> 00:01:58,960
Well, summarizing,

47
00:01:58,960 --> 00:02:02,850
the problem is that schema drift causes broken pipelines,

48
00:02:02,850 --> 00:02:05,370
and that can lead to bad data.

49
00:02:05,370 --> 00:02:10,110
This issue occurs at our incoming data sources,

50
00:02:10,110 --> 00:02:12,380
and you need to take care of this

51
00:02:12,380 --> 00:02:14,190
as you're building the data flow,

52
00:02:14,190 --> 00:02:16,453
not down the road in production.

53
00:02:17,380 --> 00:02:18,690
And you need to take care of it,

54
00:02:18,690 --> 00:02:19,800
because if you don't,

55
00:02:19,800 --> 00:02:23,920
it can lead to a complete breakdown of your pipelines.

56
00:02:23,920 --> 00:02:25,870
So now that we have that out of the way,

57
00:02:25,870 --> 00:02:27,350
let's take just a couple minutes

58
00:02:27,350 --> 00:02:31,560
and look at how we set that up in Data Factory.

59
00:02:31,560 --> 00:02:33,570
So let's jump over to the portal

60
00:02:33,570 --> 00:02:36,220
and take a look at a data flow.

61
00:02:36,220 --> 00:02:39,890
You can see here that we have our incoming source,

62
00:02:39,890 --> 00:02:41,730
and then we have our middle step,

63
00:02:41,730 --> 00:02:44,460
which is whatever transformations we need;

64
00:02:44,460 --> 00:02:45,680
in this case it's an aggregate,

65
00:02:45,680 --> 00:02:49,250
but it could be a whole bunch of transformation steps here,

66
00:02:49,250 --> 00:02:51,660
and then we finish with our sink.

67
00:02:51,660 --> 00:02:56,260
The sink is simply the finished product, the output.

68
00:02:56,260 --> 00:02:58,700
So we define our source, the input,

69
00:02:58,700 --> 00:03:00,973
and we define our sink, the output.

70
00:03:01,970 --> 00:03:04,660
When we look at setting up for schema drift,

71
00:03:04,660 --> 00:03:07,300
we start off in the input source.

72
00:03:07,300 --> 00:03:09,630
We're going to come down to our configuration panel

73
00:03:09,630 --> 00:03:11,170
under Source Settings,

74
00:03:11,170 --> 00:03:12,170
and we're going to make sure

75
00:03:12,170 --> 00:03:17,170
that the Allow schema drift option is selected.

76
00:03:17,180 --> 00:03:20,090
Then we're going to come over here to our sink,

77
00:03:20,090 --> 00:03:21,490
and we are going to make sure

78
00:03:21,490 --> 00:03:24,890
that we have schema drift enabled here,

79
00:03:24,890 --> 00:03:29,380
and also that, under Mapping, Auto mapping is turned on.

80
00:03:29,380 --> 00:03:31,680
So what's going to happen is,

81
00:03:31,680 --> 00:03:34,190
if we have selected this,

82
00:03:34,190 --> 00:03:36,470
your data flow is going to look

83
00:03:36,470 --> 00:03:39,020
at all of the incoming fields,

84
00:03:39,020 --> 00:03:42,370
and then it's going to look at all of the outgoing fields,

85
00:03:42,370 --> 00:03:46,810
and it's going to attempt to automap through these steps.

86
00:03:46,810 --> 00:03:47,910
So that's what we're doing,

87
00:03:47,910 --> 00:03:50,580
we're attempting to automap every single time

88
00:03:50,580 --> 00:03:55,580
we run that data flow and not relying on a fixed schema.

89
00:03:56,120 --> 00:03:58,170
So if our schema changes from the input

90
00:03:58,170 --> 00:04:00,370
or changes as it's going into the output,

91
00:04:00,370 --> 00:04:02,900
we should be able to account for that.

92
00:04:02,900 --> 00:04:05,550
So at its core, it's that simple.

93
00:04:05,550 --> 00:04:07,820
Basically, what we're doing is,

94
00:04:07,820 --> 00:04:10,620
we're looking at changes in our metadata,

95
00:04:10,620 --> 00:04:13,258
which again is the fields, the columns.

96
00:04:13,258 --> 00:04:16,430
That is the schema drift that we're looking at.

97
00:04:16,430 --> 00:04:18,130
And as those change over time,

98
00:04:18,130 --> 00:04:21,200
we account for that by using data flow

99
00:04:21,200 --> 00:04:24,140
to allow schema drift,

100
00:04:24,140 --> 00:04:25,970
which allows for late binding.

101
00:04:25,970 --> 00:04:28,400
So we're not going to have a fixed schema.

102
00:04:28,400 --> 00:04:30,060
We're going to actually run the data flow

103
00:04:30,060 --> 00:04:33,760
and attempt to bind those input and output columns

104
00:04:33,760 --> 00:04:35,403
as the data flow runs.

105
00:04:36,330 --> 00:04:37,570
And this is important

106
00:04:37,570 --> 00:04:41,180
pretty much anytime you have frequently changing sources.

107
00:04:41,180 --> 00:04:44,640
So as you build your data flow and your data pipelines,

108
00:04:44,640 --> 00:04:46,620
you need to be considering,

109
00:04:46,620 --> 00:04:49,350
are those sources likely to change over time?

110
00:04:49,350 --> 00:04:52,320
And if so, are you accounting for that change?

111
00:04:52,320 --> 00:04:54,230
If you are, awesome.

112
00:04:54,230 --> 00:04:55,800
That's it for this lesson,

113
00:04:55,800 --> 00:04:57,250
and I'll see you in the next.