Congratulations, Gurus, on finishing this section. In this lesson, we're going to take a look in the rear-view mirror and review some of the highlights.

In this lesson, we're going to do two things. First, we're going to say, "Hey, thanks, Landon. We greatly appreciate all the work you've done through this section; it's been extremely helpful." And second, we're going to jump in and take a look at the review process.

Having looked back, I kind of want to go skiing now. However, we can't do that, because we need to jump in and complete this section. So here we go.

First, we talked about Data Factory and some of its core concepts. We talked about pipelines. Remember, those are a logical grouping of activities, and the activities are the processing steps in a pipeline. There are three main types of activities: data movement, data transformation, and control. If you remember, we really focused on data movement and data transformation; we'll jump in later on in the course and talk a little bit about control, but that distinction is important for activities.

We talked about datasets, which describe the data you need as the inputs or outputs of those activities and where that data lives, and then linked services, which are the connection strings that allow us to connect to that data.

We also talked about T-SQL. If you remember, that is Transact-SQL, and it's a powerful language that lets us perform data movement and data transformation activities. T-SQL can be used in machine learning as well as data engineering, and if you remember, there were a couple of uses over here on the right. We can use it to create tables for results or to save datasets, we can use it for custom transformations on a whole bunch of different things, or we can use it to filter and alter data and return those query results as data tables. So don't forget about T-SQL; it is a very important concept for data engineers.
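To make that T-SQL recap a little more concrete, here is a minimal sketch of running T-SQL from Python with pyodbc. The server, database, credentials, table, and column names are hypothetical placeholders, not values from the lesson; the point is simply what the T-SQL itself is doing.

```python
import pyodbc

# Hypothetical connection details -- substitute your own Azure SQL server and database.
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()

    # Filter and alter data, returning the query results as a data table (rows).
    cursor.execute(
        "SELECT CustomerId, UPPER(City) AS City "
        "FROM dbo.Customers WHERE Country = ?",
        "US",
    )
    rows = cursor.fetchall()

    # Create a table to hold results / save a dataset, using SELECT ... INTO
    # (this creates a new table, so it will fail if dbo.UsCustomers already exists).
    cursor.execute(
        "SELECT CustomerId, City INTO dbo.UsCustomers "
        "FROM dbo.Customers WHERE Country = 'US'"
    )
    conn.commit()
```

The same statements could just as easily be run from a query editor or a pipeline activity; how you invoke T-SQL matters less than recognizing these three uses.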
We also talked about Azure Data Factory and how it compares to Synapse. If you remember, we looked through and saw that Synapse is missing some of the features that ADF has, but it also adds other features of its own. We also talked about Azure Data Factory being used primarily for migration work, or ETL. However, if you need analytics projects or more complex solutions involving databases, Synapse is a fantastic choice.

Next up, we talked about Scala. Scala is a programming language that is used heavily in data engineering and in machine learning. We covered just the basics, and for this course those basics are going to be just fine. You don't need to know a whole lot about Scala to pass the DP-203, so don't dive too far into that.

We also talked about Apache Spark. If you remember, we talked about Data Factory using Apache Spark as an on-demand service: it creates Spark clusters for just-in-time processing, processes the data that you need, and then deletes the cluster once it's done. That just-in-time processing is very important. We also talked about Spark jobs being extensible, which means you can provide multiple files, such as Python scripts or JAR packages. So it's a very flexible and helpful concept for you to understand. We'll talk more about Apache Spark as we go through the course, but for now, just having that overview in mind will be very helpful.

We also talked about notebooks. If you remember, we talked about Jupyter notebooks being those open-source web tools we can use to write scripts, and a notebook is really a conglomeration of everything: it lets us combine instructions, code, visual output, and a whole bunch of other things, and we can use Python and Scala code in Databricks and machine learning to help us with data transformations.
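To tie the Spark and notebook recap together, here is a minimal sketch of the kind of transformation you might run in a Databricks or Synapse notebook cell. The storage paths and column names are hypothetical placeholders; the example assumes PySpark is available (inside those notebooks a SparkSession called spark is already created for you).

```python
from pyspark.sql import SparkSession, functions as F

# Outside a notebook you create the session yourself; inside one, `spark` already exists.
spark = SparkSession.builder.getOrCreate()

# Hypothetical raw JSON landing zone in a data lake.
raw = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/trips/")

# A simple cleanup/transformation step: drop obviously bad rows and derive a date column.
cleaned = (
    raw.filter(F.col("fare_amount") > 0)
       .withColumn("trip_date", F.to_date("pickup_datetime"))
)

# Write the result back to a curated zone in a columnar format.
cleaned.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/trips/"
)
```

The same few lines are also the shape of a Spark job that Data Factory would spin a cluster up for, run, and then tear down.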
We also talked about testing. If you remember, we built a really nice workflow to write tests, publish pipelines, count activities, and then check our row counts. That's something that's helpful for you to understand. Again, the DP-203 is not going to be heavy on testing, but understanding the basics of what's going on will be helpful for you.

And then we talked about tools. If you remember, there were a few different tools: we talked about Data Quality Services, the Clean Missing Data module, and mapping data flows. It's helpful to understand what is available and what you could use to go about cleaning your data, so don't forget about those three different options.

Next up, we talked about conditional splits. A conditional split is basically making new branches: we're routing our data rows to particular streams based upon conditions that we specify, and this is very similar to the case statement in traditional programming.

We also talked about our code example; when you see that code snippet, you should immediately be thinking about shredding JSON.

Next up, we talked about continue on error. Errors happen, but it's what we do about them that counts. The default is normally to fail on the first error, but we can enable continue on error and get some new options, such as transaction commit, output rejected data, or success on error. Understanding those different options, and what we can do to avoid that fail-on-first-error behavior, can really help you in your pipeline development, and it helps pipelines run to completion even if there are errors along the way.
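In ADF, a conditional split is configured visually in a mapping data flow rather than written in code, but to make the "route rows to streams based on conditions, like a case statement" idea concrete, here is a small PySpark sketch under made-up data and a made-up threshold.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny made-up dataset standing in for the rows flowing through a data flow.
orders = spark.createDataFrame(
    [(1, 250.0), (2, 1800.0), (3, 90.0)],
    ["order_id", "amount"],
)

# Route rows into separate streams based on conditions we specify,
# like the named branch and default branch of a conditional split.
high_value = orders.filter(F.col("amount") >= 1000)
default_stream = orders.filter(F.col("amount") < 1000)

# The same idea expressed as a CASE-style derived column.
tagged = orders.withColumn(
    "stream",
    F.when(F.col("amount") >= 1000, "HighValue").otherwise("Default"),
)
tagged.show()
```

In a real mapping data flow you would write these conditions in the data flow expression language instead of Python, but the routing logic is the same.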
We also talked about exploratory data analysis, or EDA. If you remember, this is a process for performing initial investigations on data, where we apply summary statistics and graphical representations to test hypotheses. The important bits here are the initial investigations, the summary statistics, and the graphical representations; that's what you need to remember about EDA. And of course, don't forget that it is used to make sense of the data we have before we go too deep with it.

In summary, this is the foundation; we're going to continue to build upon these concepts as we move through the course. It's also focused on the DP-203. There's a lot we could be covering, but we're focusing on the DP-203 only, so we might gloss over a few concepts that are important for data engineering but don't need that much depth in order to get you to pass the exam.

Finally, don't forget to look at those Microsoft exam requirements as well. We talked about those in the very first section, so if you want to go back and see what they are, you can jump back to the first section and review them. And of course, don't forget about the labs; you're going to need those for your exam and your career.

Lastly, Landon and I would both appreciate it if, when you're enjoying this course, you do us a huge favor and smash that thumbs up button on those lessons. And with that, congratulations on passing this section. I'll see you in the next one, where we'll talk a little bit about batch processing. I'll see you there.