1 00:00:00,580 --> 00:00:01,900 Wait, what? 2 00:00:01,900 --> 00:00:04,030 Section 10 recap. 3 00:00:04,030 --> 00:00:06,200 We have actually reached the end 4 00:00:06,200 --> 00:00:10,040 of all of the content sections for this course. 5 00:00:10,040 --> 00:00:13,610 Congratulations, this is a milestone. 6 00:00:13,610 --> 00:00:15,580 Before we finish section 10 though, 7 00:00:15,580 --> 00:00:18,270 let's go in and do a brief recap. 8 00:00:18,270 --> 00:00:21,270 In this section review, as in the others, 9 00:00:21,270 --> 00:00:24,540 we are going to be reviewing all the topics in section 10. 10 00:00:24,540 --> 00:00:26,100 If something doesn't look familiar, 11 00:00:26,100 --> 00:00:28,320 jump back in, watch the lessons. 12 00:00:28,320 --> 00:00:30,090 It's going to be focused on the 203. 13 00:00:30,090 --> 00:00:31,830 And since we're focused on the 203, 14 00:00:31,830 --> 00:00:34,020 I've tried to pull out pieces of information 15 00:00:34,020 --> 00:00:35,910 that I thought were the most important 16 00:00:35,910 --> 00:00:38,543 for the certification in this lesson. 17 00:00:39,420 --> 00:00:41,610 Finally, if you don't know something, review. 18 00:00:41,610 --> 00:00:44,150 We have a lot of blended concepts here at this point. 19 00:00:44,150 --> 00:00:45,400 So you might need to jump back, 20 00:00:45,400 --> 00:00:49,500 even into a previous section, if something is unfamiliar. 21 00:00:49,500 --> 00:00:51,193 All right, let's dive in. 22 00:00:52,940 --> 00:00:54,470 Auto optimization review. 23 00:00:54,470 --> 00:00:57,140 This was an Azure Databricks tool that we talked about 24 00:00:57,140 --> 00:00:59,840 that lets us write to a Delta table. 25 00:00:59,840 --> 00:01:03,230 What it does is it auto-compacts small files together. 26 00:01:03,230 --> 00:01:05,610 If you remember, that's going to be very helpful 27 00:01:05,610 --> 00:01:07,800 if you have lots and lots of small files. 
28 00:01:07,800 --> 00:01:09,510 This is the traditional write, 29 00:01:09,510 --> 00:01:11,500 writing down into the partitions. 30 00:01:11,500 --> 00:01:13,870 And if you remember, the auto optimization tool 31 00:01:13,870 --> 00:01:17,430 puts in a separate step that compacts those files together, 32 00:01:17,430 --> 00:01:21,083 so we have fewer files being written to our Delta tables. 33 00:01:22,000 --> 00:01:24,340 We also talked about hash distribution. 34 00:01:24,340 --> 00:01:26,810 We talked about distribution quite a bit, actually. 35 00:01:26,810 --> 00:01:30,990 The hash function lets us use a map, essentially, 36 00:01:30,990 --> 00:01:33,120 to map the data that's coming in 37 00:01:33,120 --> 00:01:35,680 and put it into the individual partitions. 38 00:01:35,680 --> 00:01:37,830 Essentially, it's a fancy math algorithm. 39 00:01:37,830 --> 00:01:40,620 And if you remember, it kind of looks like this. 40 00:01:40,620 --> 00:01:44,730 We've got our input data, it goes into our hash function, 41 00:01:44,730 --> 00:01:48,560 and then it gets distributed into those different buckets 42 00:01:48,560 --> 00:01:50,350 over there on the right. 43 00:01:50,350 --> 00:01:52,800 Round-robin distribution, on the other hand, 44 00:01:52,800 --> 00:01:55,550 is first random, then it's sequential. 45 00:01:55,550 --> 00:01:57,810 It gets rid of that hash function, 46 00:01:57,810 --> 00:02:00,330 that purple box over there on the right, 47 00:02:00,330 --> 00:02:03,650 and allows us to move data into the system much faster. 48 00:02:03,650 --> 00:02:05,100 However, it's not mapped, 49 00:02:05,100 --> 00:02:07,490 so it's going to be a little harder to get it back out. 50 00:02:07,490 --> 00:02:10,773 So round-robin: quick to load, slow to query. 51 00:02:11,840 --> 00:02:14,680 We talked about setting shuffle partition sizes.
52 00:02:14,680 --> 00:02:16,470 The challenge was finding that right shuffle 53 00:02:16,470 --> 00:02:20,180 partition number, especially as that number might change 54 00:02:20,180 --> 00:02:23,820 as we move through stages in a query. 55 00:02:23,820 --> 00:02:28,120 And so we can use our adaptive query execution, or AQE, 56 00:02:28,120 --> 00:02:29,970 to help solve for this problem. 57 00:02:29,970 --> 00:02:32,410 We set the initial shuffle partition number, 58 00:02:32,410 --> 00:02:34,710 AQE then can help to adjust that 59 00:02:34,710 --> 00:02:36,960 as we move through the query. 60 00:02:36,960 --> 00:02:39,790 So this is what you need to do in order to set 61 00:02:39,790 --> 00:02:41,540 your initial shuffle partition number. 62 00:02:41,540 --> 00:02:43,110 That was the syntax for it. 63 00:02:43,110 --> 00:02:45,240 Then we talked about some optimization steps 64 00:02:45,240 --> 00:02:47,520 that you can use for Synapse. 65 00:02:47,520 --> 00:02:49,210 First was the query plan. 66 00:02:49,210 --> 00:02:51,120 Don't forget to turn on statistics. 67 00:02:51,120 --> 00:02:51,970 You need that on. 68 00:02:51,970 --> 00:02:55,300 That's going to help you as you optimize your queries. 69 00:02:55,300 --> 00:02:57,730 We also looked at reducers and combiners 70 00:02:57,730 --> 00:03:00,520 as a solution for optimization. 71 00:03:00,520 --> 00:03:01,650 The recursive reducer, 72 00:03:01,650 --> 00:03:03,140 the short of that, that you need to remember 73 00:03:03,140 --> 00:03:04,900 is parallel performance. 74 00:03:04,900 --> 00:03:06,150 You do need to turn it on. 75 00:03:06,150 --> 00:03:07,900 That's what the code looks like. 76 00:03:07,900 --> 00:03:10,130 And you also have row-level combiners, 77 00:03:10,130 --> 00:03:13,100 the short of that again, parallel performance. 78 00:03:13,100 --> 00:03:15,623 This is the code for that, over there on the right. 
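The slide code for the initial shuffle partition number is not captured in this transcript. As a rough stand-in, the sketch below lists the Spark 3.x configuration keys that are commonly used to enable AQE and set an initial shuffle partition count. The value `200`, the `AQE_SETTINGS` name, and the `apply_settings` helper are assumptions made for illustration; they are not taken from the lesson.

```python
# Spark SQL configuration keys for adaptive query execution (Spark 3.x).
# The value "200" is an illustrative assumption, not a recommendation
# from the lesson.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.initialPartitionNum": "200",
}


def apply_settings(conf, settings):
    """Apply each key/value pair through conf.set (hypothetical helper).

    `conf` is any object with a set(key, value) method, such as the
    `spark.conf` object of a live SparkSession.
    """
    for key, value in settings.items():
        conf.set(key, value)
    return conf


# In a notebook with a live SparkSession named `spark`, this would be:
#   apply_settings(spark.conf, AQE_SETTINGS)
```

With these settings, AQE starts each shuffle at the initial partition number and may coalesce partitions as it learns the actual data sizes at each stage, which is the adjustment behavior described in this recap.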
79 00:03:18,380 --> 00:03:20,420 Next, we talked about optimization tips 80 00:03:20,420 --> 00:03:21,630 for your environment. 81 00:03:21,630 --> 00:03:23,700 Remember, we talked about ARM templates, 82 00:03:23,700 --> 00:03:25,490 used for replication and control. 83 00:03:25,490 --> 00:03:27,270 We talked about not putting everything 84 00:03:27,270 --> 00:03:29,872 in the same subscription. That's going to be helpful 85 00:03:29,872 --> 00:03:32,410 for our defense in depth concepts. 86 00:03:32,410 --> 00:03:36,690 It's also going to be helpful for compliance and management. 87 00:03:36,690 --> 00:03:38,770 And we talked about Azure Advisor. 88 00:03:38,770 --> 00:03:41,210 I actually jumped in and showed you where that was. 89 00:03:41,210 --> 00:03:42,560 That's something that's going to be helpful 90 00:03:42,560 --> 00:03:45,140 for you just in everyday work. 91 00:03:45,140 --> 00:03:46,710 Jump in, make sure you're looking 92 00:03:46,710 --> 00:03:49,300 at that Azure Advisor periodically. 93 00:03:49,300 --> 00:03:51,293 It is a must-use. 94 00:03:52,890 --> 00:03:55,200 We talked about the basics of result caching. 95 00:03:55,200 --> 00:03:58,160 So, SQL pool auto-caches our query results, 96 00:03:58,160 --> 00:04:01,090 and we can use that for repeated queries. 97 00:04:01,090 --> 00:04:04,270 If you remember, this is a persisted cache 98 00:04:04,270 --> 00:04:07,000 that helps us with our query performance, 99 00:04:07,000 --> 00:04:09,500 and improved query performance means we're using 100 00:04:09,500 --> 00:04:12,020 less compute and less cost. 101 00:04:12,020 --> 00:04:13,700 But we got to turn it on, 102 00:04:13,700 --> 00:04:17,550 and we can do that with a user database or with a session. 103 00:04:17,550 --> 00:04:21,040 And remember, we don't cache UDFs, user-defined functions. 104 00:04:21,040 --> 00:04:23,610 We don't cache row/column security.
105 00:04:23,610 --> 00:04:26,550 We don't cache rows larger than 64 kilobytes 106 00:04:26,550 --> 00:04:29,900 or result sets over 10 gigabytes. 107 00:04:29,900 --> 00:04:32,670 And we don't cache built-in functions 108 00:04:32,670 --> 00:04:34,910 or runtime expressions that aren't deterministic. 109 00:04:34,910 --> 00:04:36,670 So we talked about all of that. 110 00:04:36,670 --> 00:04:38,640 That's actually a pretty important concept 111 00:04:38,640 --> 00:04:40,423 that you need to remember as well. 112 00:04:41,400 --> 00:04:45,010 We talked about OLTP and OLAP. 113 00:04:45,010 --> 00:04:46,600 Quick question for you: 114 00:04:46,600 --> 00:04:49,933 which one of these is on the DP-203? 115 00:04:51,460 --> 00:04:53,280 You should have said OLAP. 116 00:04:53,280 --> 00:04:56,840 Online analytical processing is a part of Synapse, 117 00:04:56,840 --> 00:05:00,120 and that is on the DP-203. 118 00:05:00,120 --> 00:05:02,710 OLTP, online transaction processing, 119 00:05:02,710 --> 00:05:04,610 is a part of SQL database, 120 00:05:04,610 --> 00:05:08,350 or at least is used most frequently in SQL database, 121 00:05:08,350 --> 00:05:10,320 which is not on the DP-203. 122 00:05:10,320 --> 00:05:12,290 So I would assume that any questions you get 123 00:05:12,290 --> 00:05:14,270 are probably going to be around OLAP, 124 00:05:14,270 --> 00:05:16,610 not transaction processing. 125 00:05:16,610 --> 00:05:20,023 But the difference is OLAP is for complex queries. 126 00:05:21,130 --> 00:05:24,070 Slower loading than OLTP. 127 00:05:24,070 --> 00:05:26,240 And remember, that's because of that schema. 128 00:05:26,240 --> 00:05:29,100 We also talked about OLAP being used for business decisions. 129 00:05:29,100 --> 00:05:31,440 We talked quite a bit about being able to look 130 00:05:31,440 --> 00:05:34,033 at, like, customer profiles or things like that.
131 00:05:34,880 --> 00:05:36,000 And we talked about OLAP 132 00:05:36,000 --> 00:05:39,030 being more than 1 terabyte in size. 133 00:05:39,030 --> 00:05:41,663 So keep those basics in mind for OLAP. 134 00:05:44,040 --> 00:05:46,240 We talked about Ambari UI. 135 00:05:46,240 --> 00:05:47,730 This is a good way to troubleshoot 136 00:05:47,730 --> 00:05:50,150 when we're looking at HDInsight clusters. 137 00:05:50,150 --> 00:05:51,740 We look at configuration settings, 138 00:05:51,740 --> 00:05:53,780 cluster health, stack and version. 139 00:05:53,780 --> 00:05:56,270 We talked about the log files that you need to look at. 140 00:05:56,270 --> 00:05:58,010 We talked about configuration settings 141 00:05:58,010 --> 00:06:01,290 that you may or may not have set, but are worth a look. 142 00:06:01,290 --> 00:06:02,930 And then we talked about reproducing 143 00:06:02,930 --> 00:06:04,960 the error on a new cluster. 144 00:06:04,960 --> 00:06:07,660 In addition, we talked about taking slow changes 145 00:06:07,660 --> 00:06:10,050 as you move through, and starting simple 146 00:06:10,050 --> 00:06:12,453 and moving more complex as you go. 147 00:06:14,640 --> 00:06:16,470 And we talked about tracking applications 148 00:06:16,470 --> 00:06:19,790 in the Spark UI using DAG visualizations. 149 00:06:19,790 --> 00:06:21,450 So we can track jobs. 150 00:06:21,450 --> 00:06:24,010 We can pull detailed information on our submitted jobs. 151 00:06:24,010 --> 00:06:25,770 We can track executors. 152 00:06:25,770 --> 00:06:27,780 We can break that down by ID, 153 00:06:27,780 --> 00:06:30,020 get task information on what's going on, 154 00:06:30,020 --> 00:06:32,290 and we can look at memory and shuffle usage. 155 00:06:32,290 --> 00:06:33,670 And then we can look at those stages, 156 00:06:33,670 --> 00:06:36,320 which is what you're seeing over there on the right.
157 00:06:36,320 --> 00:06:38,150 So we can look at our shuffle read/write, 158 00:06:38,150 --> 00:06:40,610 we can look at the duration and I/O, 159 00:06:40,610 --> 00:06:42,620 and then we can see the DAG visualization 160 00:06:42,620 --> 00:06:44,930 of each different stage. 161 00:06:44,930 --> 00:06:46,410 What's DAG stand for? 162 00:06:46,410 --> 00:06:48,070 Do you remember? 163 00:06:48,070 --> 00:06:50,023 Directed acyclic graph. 164 00:06:52,520 --> 00:06:56,380 In summary, know shuffling and distribution. 165 00:06:56,380 --> 00:07:00,200 Review the round-robin and hash distribution specifically, 166 00:07:00,200 --> 00:07:02,540 but you do need to understand shuffling and distribution. 167 00:07:02,540 --> 00:07:06,770 It appears in quite a few places in the exam requirements. 168 00:07:06,770 --> 00:07:08,600 You should review the code from this section. 169 00:07:08,600 --> 00:07:10,420 I think that's actually very helpful. 170 00:07:10,420 --> 00:07:11,880 Again, don't memorize. 171 00:07:11,880 --> 00:07:13,540 But you should glance through the section 172 00:07:13,540 --> 00:07:16,557 and be able to pick out the pieces that we're talking about 173 00:07:16,557 --> 00:07:18,683 and the concepts that are important. 174 00:07:19,540 --> 00:07:21,000 Don't forget the labs. 175 00:07:21,000 --> 00:07:22,570 I've said that a couple of times. 176 00:07:22,570 --> 00:07:24,950 You'll need those for the exam and the career. 177 00:07:24,950 --> 00:07:28,270 With that, hey, congratulations again. 178 00:07:28,270 --> 00:07:30,340 You have finished section 10. 179 00:07:30,340 --> 00:07:33,400 On to the last couple of videos, where I give you some tips 180 00:07:33,400 --> 00:07:36,630 and tricks for preparing for your exam 181 00:07:36,630 --> 00:07:39,230 and your new certification. 182 00:07:39,230 --> 00:07:41,560 All right, congratulations again. 183 00:07:41,560 --> 00:07:43,210 I'll see you in the next section.
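Since the summary singles out round-robin and hash distribution for review, here is a small study sketch of the difference. It is a toy model in plain Python, not how Synapse implements distribution: the bucket count of 4, the CRC32 hash, the sample rows, and all function names are assumptions chosen for illustration. The recap describes round-robin as starting at a random bucket; this sketch starts at bucket 0 to keep the output reproducible.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative bucket count, not a real distribution count


def hash_bucket(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket deterministically (CRC32 stands in for the real hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def distribute_hash(rows, key_field, num_buckets=NUM_BUCKETS):
    """Place each row in the bucket chosen by hashing its key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash_bucket(row[key_field], num_buckets)].append(row)
    return buckets


def distribute_round_robin(rows, num_buckets=NUM_BUCKETS):
    """Place rows in buckets in turn, ignoring their content."""
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        buckets[index % num_buckets].append(row)
    return buckets


def buckets_holding(buckets, key_field, key_value):
    """Return the set of bucket ids that contain at least one matching row."""
    return {
        bucket_id
        for bucket_id, bucket_rows in buckets.items()
        if any(row[key_field] == key_value for row in bucket_rows)
    }


rows = [
    {"customer_id": cid, "amount": amount}
    for cid, amount in [(1, 10), (2, 20), (1, 30), (3, 40), (2, 50), (1, 60)]
]

hashed = distribute_hash(rows, "customer_id")
round_robin = distribute_round_robin(rows)

# Hash: every row for customer 1 lands in one bucket, so a lookup by key
# touches a single bucket.
print(len(buckets_holding(hashed, "customer_id", 1)))

# Round-robin: the same customer's rows are spread across buckets, so a
# lookup by key has to scan several of them.
print(len(buckets_holding(round_robin, "customer_id", 1)))
```

The round-robin loader does no work per row beyond counting, which matches "quick to load", while finding one customer's rows means checking multiple buckets, which matches "slow to query".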