1
00:00:00,690 --> 00:00:01,523
In this lesson,

2
00:00:01,523 --> 00:00:02,960
we are going to take some of the concepts

3
00:00:02,960 --> 00:00:04,980
that we learned about data flow,

4
00:00:04,980 --> 00:00:06,790
and we're going to dive just a little bit deeper

5
00:00:06,790 --> 00:00:09,760
to talk about how we handle schema drift.

6
00:00:09,760 --> 00:00:11,970
Specifically, we are going to talk about

7
00:00:11,970 --> 00:00:14,490
what schema drift actually is.

8
00:00:14,490 --> 00:00:15,860
Then we're going to talk about

9
00:00:15,860 --> 00:00:18,190
how we handle schema drift in data flow

10
00:00:18,190 --> 00:00:21,370
by talking about the what, when, where, and why.

11
00:00:21,370 --> 00:00:22,410
And then we're going to finish up

12
00:00:22,410 --> 00:00:24,430
with a live look in the Azure portal

13
00:00:24,430 --> 00:00:29,293
so you can see how we would set up for schema drift.

14
00:00:30,250 --> 00:00:33,230
Alright, so what is schema drift?

15
00:00:33,230 --> 00:00:37,880
Well, you can have changes that are fast or slow,

16
00:00:37,880 --> 00:00:40,590
and so take a look at this boat over here on the right.

17
00:00:40,590 --> 00:00:42,050
Think about currents.

18
00:00:42,050 --> 00:00:45,550
So over time, that boat is going to slowly move

19
00:00:45,550 --> 00:00:48,180
if you don't have it anchored to the dock.

20
00:00:48,180 --> 00:00:49,900
Same thing is true for that race car.

21
00:00:49,900 --> 00:00:51,320
It's just really fast.

22
00:00:51,320 --> 00:00:52,530
As it goes around the corner,

23
00:00:52,530 --> 00:00:53,680
it's going to drift,

24
00:00:53,680 --> 00:00:57,300
or move off of the line that it intends to go on.

25
00:00:57,300 --> 00:01:01,810
And again, this drift, when we talk about data or schemas,

26
00:01:01,810 --> 00:01:06,680
it's going to lead to a breakdown of your pipelines.

27
00:01:06,680 --> 00:01:08,500
So in data engineering,

28
00:01:08,500 --> 00:01:10,640
what we talk about when we look at schema drift,

29
00:01:10,640 --> 00:01:13,270
we're thinking about changes to our fields

30
00:01:13,270 --> 00:01:16,160
or our columns or our data types.

31
00:01:16,160 --> 00:01:18,970
That is what schema drift looks like.

32
00:01:18,970 --> 00:01:20,350
And again, that leads to,

33
00:01:20,350 --> 00:01:21,890
when we talk about Data Factory,

34
00:01:21,890 --> 00:01:26,330
a breakdown of pipelines if we don't correctly account

35
00:01:26,330 --> 00:01:27,763
for that schema drift.

36
00:01:28,810 --> 00:01:31,570
So make sure that you think about the stability

37
00:01:31,570 --> 00:01:34,670
of your schema as you are building pipelines

38
00:01:34,670 --> 00:01:36,860
and as you are using data flow.

39
00:01:36,860 --> 00:01:37,957
You need to be considering,

40
00:01:37,957 --> 00:01:41,970
"Well, is my schema likely to change over time?"

41
00:01:41,970 --> 00:01:44,730
And if your schema is likely to change over time,

42
00:01:44,730 --> 00:01:46,370
you need to think about that stability

43
00:01:46,370 --> 00:01:48,850
and make sure that you have accounted for that

44
00:01:48,850 --> 00:01:52,693
in your data flow and in your Data Factory pipelines.

45
00:01:54,970 --> 00:01:57,710
So what about the what, when, where, and why?

46
00:01:57,710 --> 00:01:58,960
Well, summarizing,

47
00:01:58,960 --> 00:02:02,850
the problem is schema drift causes broken pipelines,

48
00:02:02,850 --> 00:02:05,370
and that can lead to bad data.

49
00:02:05,370 --> 00:02:10,110
This issue occurs when we look at incoming data sources,

50
00:02:10,110 --> 00:02:12,380
and you need to take care of this

51
00:02:12,380 --> 00:02:14,190
as you're building the data flow,

52
00:02:14,190 --> 00:02:16,453
not down the road in production.

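To make the definition concrete, here is a minimal sketch in plain Python of what "changes to our fields or our columns or our data types" can look like when an incoming record is compared against the schema a pipeline was built for. This is only an illustration and is not how Data Factory detects drift; the names `infer_schema`, `detect_drift`, `expected_schema`, and `incoming_row` are hypothetical.

```python
from typing import Any, Dict, List


def infer_schema(row: Dict[str, Any]) -> Dict[str, str]:
    """Map each column name to the name of its Python type."""
    return {name: type(value).__name__ for name, value in row.items()}


def detect_drift(
    expected: Dict[str, str], incoming: Dict[str, str]
) -> Dict[str, List[str]]:
    """Report columns that were added, removed, or changed type."""
    added = sorted(set(incoming) - set(expected))
    removed = sorted(set(expected) - set(incoming))
    retyped = sorted(
        name
        for name in set(expected) & set(incoming)
        if expected[name] != incoming[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schema the pipeline was designed against.
expected_schema = {"id": "int", "name": "str", "amount": "float"}

# Hypothetical incoming record: "amount" arrives as text and "region" is new.
incoming_row = {"id": 7, "name": "widget", "amount": "19.99", "region": "EU"}

drift = detect_drift(expected_schema, infer_schema(incoming_row))
print(drift)
```

In this sketch a new column and a changed data type both count as drift, which matches the two kinds of change the lesson names.
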
53
00:02:17,380 --> 00:02:18,690
And you need to take care of it,

54
00:02:18,690 --> 00:02:19,800
because if you don't,

55
00:02:19,800 --> 00:02:23,920
it can lead to a complete breakdown of your pipelines.

56
00:02:23,920 --> 00:02:25,870
So now that we have that out of the way,

57
00:02:25,870 --> 00:02:27,350
let's take just a couple minutes

58
00:02:27,350 --> 00:02:31,560
and look at how we set that in Data Factory.

59
00:02:31,560 --> 00:02:33,570
So let's jump over to the portal

60
00:02:33,570 --> 00:02:36,220
and take a look at a data flow.

61
00:02:36,220 --> 00:02:39,890
You can see here that we have our incoming source,

62
00:02:39,890 --> 00:02:41,730
and then we have our middle step,

63
00:02:41,730 --> 00:02:44,460
which is whatever transformations,

64
00:02:44,460 --> 00:02:45,680
in this case it's an aggregate,

65
00:02:45,680 --> 00:02:49,250
but it could be a whole bunch of transformation steps here,

66
00:02:49,250 --> 00:02:51,660
and then we finish with our sink.

67
00:02:51,660 --> 00:02:56,260
The sink is just simply the finishing product, the output.

68
00:02:56,260 --> 00:02:58,700
So we define our source, the input,

69
00:02:58,700 --> 00:03:00,973
and we define our sink, the output.

70
00:03:01,970 --> 00:03:04,660
When we look at setting up for schema drift,

71
00:03:04,660 --> 00:03:07,300
we start off in the input source.

72
00:03:07,300 --> 00:03:09,630
We're going to come down to our configuration panel

73
00:03:09,630 --> 00:03:11,170
under Source Settings,

74
00:03:11,170 --> 00:03:12,170
and we're going to make sure

75
00:03:12,170 --> 00:03:17,170
that we have allowed for schema drift.

76
00:03:17,180 --> 00:03:20,090
Then we're going to come over here to our sink,

77
00:03:20,090 --> 00:03:21,490
and we are going to make sure

78
00:03:21,490 --> 00:03:24,890
that we have schema drift enabled here,

79
00:03:24,890 --> 00:03:29,380
and also, that under Mapping, we are Auto Mapping.

80
00:03:29,380 --> 00:03:31,680
So what's going to happen is,

81
00:03:31,680 --> 00:03:34,190
if we have selected for this,

82
00:03:34,190 --> 00:03:36,470
your data flow is going to look

83
00:03:36,470 --> 00:03:39,020
at all of the incoming fields,

84
00:03:39,020 --> 00:03:42,370
and then it's going to look at all of the outgoing fields,

85
00:03:42,370 --> 00:03:46,810
and it's going to attempt to automap through these steps.

86
00:03:46,810 --> 00:03:47,910
So that's what we're doing,

87
00:03:47,910 --> 00:03:50,580
we're attempting to automap every single time

88
00:03:50,580 --> 00:03:55,580
we run that data flow and not relying on a fixed schema.

89
00:03:56,120 --> 00:03:58,170
So if our schema changes from the input

90
00:03:58,170 --> 00:04:00,370
or changes as it's going into the output,

91
00:04:00,370 --> 00:04:02,900
we should be able to account for that.

92
00:04:02,900 --> 00:04:05,550
So at its core, it's that simple.

93
00:04:05,550 --> 00:04:07,820
Basically, what we're doing is,

94
00:04:07,820 --> 00:04:10,620
as we look at the changing of our metadata,

95
00:04:10,620 --> 00:04:13,258
which again, that's the fields, the columns,

96
00:04:13,258 --> 00:04:16,430
that is the schema drift that we're looking at.

97
00:04:16,430 --> 00:04:18,130
And as those change over time,

98
00:04:18,130 --> 00:04:21,200
we account for that by using data flow

99
00:04:21,200 --> 00:04:24,140
to enable our schema drift,

100
00:04:24,140 --> 00:04:25,970
which allows for late binding.

101
00:04:25,970 --> 00:04:28,400
So we're not going to have a fixed schema.

102
00:04:28,400 --> 00:04:30,060
We're going to actually run the data flow

103
00:04:30,060 --> 00:04:33,760
and attempt to bind those input and output columns

104
00:04:33,760 --> 00:04:35,403
as data flow runs.

105
00:04:36,330 --> 00:04:37,570
And this is important

106
00:04:37,570 --> 00:04:41,180
pretty much anytime you have frequently changing sources.

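The difference between a fixed schema and late binding can be sketched in a few lines of plain Python. This is an illustration of the idea only, under the assumption that rows are simple dictionaries; it is not Data Factory's implementation, and `FIXED_COLUMNS`, `map_fixed`, and `map_late_bound` are hypothetical names.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical column list decided once, at design time.
FIXED_COLUMNS = ["id", "name"]


def map_fixed(rows: List[Row]) -> List[Row]:
    """Fixed schema: only design-time columns are read.

    New columns are silently dropped, and a missing column raises KeyError.
    """
    return [{col: row[col] for col in FIXED_COLUMNS} for row in rows]


def map_late_bound(rows: List[Row]) -> List[Row]:
    """Late binding: the column list is discovered from the data on each run."""
    columns: List[str] = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)
    # Every output row carries every discovered column; absent values are None.
    return [{col: row.get(col) for col in columns} for row in rows]


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "region": "EU"},  # drifted: a new column appears
]

fixed_out = map_fixed(rows)
late_out = map_late_bound(rows)
print(fixed_out)
print(late_out)
```

With the fixed mapping the new `region` column never reaches the output, while the late-bound mapping picks it up on the run where it first appears.
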
107
00:04:41,180 --> 00:04:44,640
So as you build your data flow and your data pipelines,

108
00:04:44,640 --> 00:04:46,620
you need to be considering,

109
00:04:46,620 --> 00:04:49,350
are those sources likely to change over time?

110
00:04:49,350 --> 00:04:52,320
And if so, are you accounting for that change?

111
00:04:52,320 --> 00:04:54,230
If you have, awesome.

112
00:04:54,230 --> 00:04:55,800
That's it for this lesson,

113
00:04:55,800 --> 00:04:57,250
and I'll see you in the next.