1 00:00:00,540 --> 00:00:01,420 In this lesson, 2 00:00:01,420 --> 00:00:03,840 we are going to talk about "Optimizing Pipelines 3 00:00:03,840 --> 00:00:07,410 "for Analytical or Transactional Purposes". 4 00:00:07,410 --> 00:00:08,792 Now, before we get started, 5 00:00:08,792 --> 00:00:10,610 just like a couple of lessons ago, 6 00:00:10,610 --> 00:00:14,240 this is going to be a little bit of a weird lesson. 7 00:00:14,240 --> 00:00:15,760 And the reason it's going to be a weird lesson 8 00:00:15,760 --> 00:00:19,030 is because there's not a whole lot of practical application. 9 00:00:19,030 --> 00:00:21,180 We're talking a lot more theory here. 10 00:00:21,180 --> 00:00:23,250 But it is on the DP-203, 11 00:00:23,250 --> 00:00:24,230 so we need to make sure 12 00:00:24,230 --> 00:00:27,150 that we cover it, at least at a high level. 13 00:00:27,150 --> 00:00:28,470 So with that little caveat, 14 00:00:28,470 --> 00:00:30,490 let's go ahead and get started. 15 00:00:30,490 --> 00:00:32,280 In this lesson, we're going to talk about 16 00:00:32,280 --> 00:00:34,160 online transaction processing 17 00:00:34,160 --> 00:00:39,160 versus online analytical processing, OLTP versus OLAP. 18 00:00:39,420 --> 00:00:41,470 Now, this should be familiar to you, 19 00:00:41,470 --> 00:00:44,960 and this should be a little bit of just a basic refresh. 20 00:00:44,960 --> 00:00:47,440 We are also going to do a slightly deeper dive 21 00:00:47,440 --> 00:00:50,330 on solution choice between those 2. 22 00:00:50,330 --> 00:00:52,550 In addition, I'm going to give you a quick recap 23 00:00:52,550 --> 00:00:55,698 of basic optimization techniques, including Azure services, 24 00:00:55,698 --> 00:01:00,150 schema selection, data cleanliness, and automation. 25 00:01:00,150 --> 00:01:01,670 With that, let's start talking 26 00:01:01,670 --> 00:01:03,710 about online transaction processing 27 00:01:03,710 --> 00:01:06,720 versus online analytical processing. 28 00:01:06,720 --> 00:01:09,830 So, the first thing to look at is queries. 29 00:01:09,830 --> 00:01:13,581 Simple queries generally fall into transaction processing, 30 00:01:13,581 --> 00:01:15,927 whereas complex queries generally fall 31 00:01:15,927 --> 00:01:18,310 into analytical processing. 32 00:01:18,310 --> 00:01:21,154 So if I'm just looking at sales by a product, 33 00:01:21,154 --> 00:01:24,060 that's probably going to be transaction processing. 34 00:01:24,060 --> 00:01:26,301 If I want to look at sales trends 35 00:01:26,301 --> 00:01:28,960 over a range of products in stores, 36 00:01:28,960 --> 00:01:30,903 that's probably analytical processing. 37 00:01:32,450 --> 00:01:35,147 Transaction processing, generally faster to load. 38 00:01:35,147 --> 00:01:37,950 Analytical processing, generally slower to load. 39 00:01:37,950 --> 00:01:42,070 And this is due to your schema differences. 40 00:01:42,070 --> 00:01:45,360 In addition, transaction processing, generally day-to-day, 41 00:01:45,360 --> 00:01:48,757 versus business decisions are analytical processing. 42 00:01:48,757 --> 00:01:49,860 And that kind of goes back 43 00:01:49,860 --> 00:01:52,800 to that simple queries versus complex queries. 44 00:01:52,800 --> 00:01:54,891 If I'm just doing day-to-day transactions, 45 00:01:54,891 --> 00:01:57,310 that's going to be OLTP. 46 00:01:57,310 --> 00:02:00,160 If I want to get business analysis, 47 00:02:00,160 --> 00:02:02,834 that's going to be more complex queries, 48 00:02:02,834 --> 00:02:04,284 analytical processing. 49 00:02:05,700 --> 00:02:09,820 Transaction processing, typically less than 10 gigabytes. 50 00:02:09,820 --> 00:02:13,340 Analytical processing, typically more than 1 terabyte. 51 00:02:13,340 --> 00:02:17,620 And again, this is a guide, not a rule. 52 00:02:17,620 --> 00:02:18,900 So this is going to help you to understand 53 00:02:18,900 --> 00:02:23,020 the differences between OLTP and OLAP. 54 00:02:23,020 --> 00:02:24,810 With that, let's go a little bit further 55 00:02:24,810 --> 00:02:27,610 into analytical processing. 56 00:02:27,610 --> 00:02:30,170 First, let's talk about a couple of scenarios. 57 00:02:30,170 --> 00:02:32,480 The first reason I would use analytical processing 58 00:02:32,480 --> 00:02:34,800 is read-heavy scenarios. 59 00:02:34,800 --> 00:02:37,870 So again, this falls into your analytics. 60 00:02:37,870 --> 00:02:40,010 If I'm putting data in, and then I'm pulling 61 00:02:40,010 --> 00:02:42,270 that data out a lot in queries 62 00:02:42,270 --> 00:02:43,960 to figure out different ways to look 63 00:02:43,960 --> 00:02:45,780 at that data or slice it, 64 00:02:45,780 --> 00:02:48,260 that's going to be more analytical processing. 65 00:02:48,260 --> 00:02:50,440 In addition, analytical processing helps me 66 00:02:50,440 --> 00:02:53,640 to look at patterns, trends, data exploration. 67 00:02:53,640 --> 00:02:55,180 So for instance, if I'm a trader, 68 00:02:55,180 --> 00:02:57,280 and I'm looking at technical analysis, 69 00:02:57,280 --> 00:02:59,383 I'm going to be looking at lots of different data points 70 00:02:59,383 --> 00:03:00,881 and trying to pull meaning 71 00:03:00,881 --> 00:03:04,480 by building patterns and trends out of that data. 72 00:03:04,480 --> 00:03:08,140 Very analytical. Same thing's true with customers. 73 00:03:08,140 --> 00:03:11,340 If I just want to take a look at Sally, for instance, 74 00:03:11,340 --> 00:03:14,070 and figure out that she bought something last week, 75 00:03:14,070 --> 00:03:16,130 well, that's transaction processing. 76 00:03:16,130 --> 00:03:18,369 If on the other hand, I want to understand 77 00:03:18,369 --> 00:03:20,900 Sally as a customer profile, 78 00:03:20,900 --> 00:03:23,240 and I want to look at multiple metrics to figure out 79 00:03:23,240 --> 00:03:26,090 if Sally's spending is going up or down, 80 00:03:26,090 --> 00:03:28,053 that's very analytical processing. 81 00:03:29,390 --> 00:03:31,170 Synapse is generally the good fit 82 00:03:31,170 --> 00:03:33,830 for analytical processing in Azure. 83 00:03:33,830 --> 00:03:35,687 So when we talk about the DP-203, 84 00:03:35,687 --> 00:03:37,090 and we talk about OLAP, 85 00:03:37,090 --> 00:03:39,510 we're probably talking about Synapse. 86 00:03:39,510 --> 00:03:41,807 This is because it has a columnar data format. 87 00:03:41,807 --> 00:03:46,063 It also is very good at handling large data sources. 88 00:03:46,920 --> 00:03:50,220 It has integration with Databricks and Power BI. 89 00:03:50,220 --> 00:03:52,800 So again, from that analytical processing, 90 00:03:52,800 --> 00:03:55,380 we can pull data in, we can transform it, 91 00:03:55,380 --> 00:03:57,046 and we can sync in with Power BI 92 00:03:57,046 --> 00:04:00,500 very easily using Synapse. 93 00:04:00,500 --> 00:04:02,210 Finally, performance is very important, 94 00:04:02,210 --> 00:04:04,520 because we're running complex queries. 95 00:04:04,520 --> 00:04:08,380 Synapse is great because it has parallel processing here. 96 00:04:08,380 --> 00:04:10,110 So that's going to help you when you're looking 97 00:04:10,110 --> 00:04:12,143 at getting that peak performance. 98 00:04:13,685 --> 00:04:17,370 When we talk about online transactional processing, 99 00:04:17,370 --> 00:04:20,110 this is not likely to be on the exam. 100 00:04:20,110 --> 00:04:24,220 This is because OLTP typically runs SQL database, 101 00:04:24,220 --> 00:04:27,650 which, SQL database is not on the DP-203. 102 00:04:27,650 --> 00:04:32,650 So OLAP is very much in the wheelhouse of the DP-203. 103 00:04:32,840 --> 00:04:34,963 OLTP, not so much. 104 00:04:37,100 --> 00:04:40,320 So let's talk about just some basic optimization techniques 105 00:04:40,320 --> 00:04:42,630 and kind of recap that. 106 00:04:42,630 --> 00:04:45,751 Azure services. Choosing the appropriate pipeline steps 107 00:04:45,751 --> 00:04:50,450 is absolutely critical in choosing your Azure service. 108 00:04:50,450 --> 00:04:51,770 This is true whether we're talking 109 00:04:51,770 --> 00:04:54,780 about doing light transformations just in Power BI, 110 00:04:54,780 --> 00:04:56,060 or whether we're doing much more 111 00:04:56,060 --> 00:04:59,410 complex transformations in Databricks, 112 00:04:59,410 --> 00:05:00,710 whether we can run everything 113 00:05:00,710 --> 00:05:04,020 through a Data Factory or a Stream Analytics, 114 00:05:04,020 --> 00:05:08,180 or whether we need to build this in our Azure Synapse, 115 00:05:08,180 --> 00:05:10,440 and we need to sync several services there. 116 00:05:10,440 --> 00:05:12,370 Choosing the appropriate pipeline steps 117 00:05:12,370 --> 00:05:14,083 is very important to sit and think 118 00:05:14,083 --> 00:05:17,083 through the different options that you could choose. 119 00:05:18,090 --> 00:05:19,910 Next, schema selection. 120 00:05:19,910 --> 00:05:22,940 So snowflake, star, are we doing something else? 121 00:05:22,940 --> 00:05:24,860 Are we doing a galaxy schema? 122 00:05:24,860 --> 00:05:26,400 Understanding your schema selection 123 00:05:26,400 --> 00:05:28,000 is also going to be helpful, 124 00:05:28,000 --> 00:05:29,895 and a lot of the schema selection comes down to: 125 00:05:29,895 --> 00:05:32,369 how is the data going to be coming in? 126 00:05:32,369 --> 00:05:35,290 How many times, if at all, 127 00:05:35,290 --> 00:05:38,070 do you think the schema is going to change? 128 00:05:38,070 --> 00:05:39,640 What types of queries do you think 129 00:05:39,640 --> 00:05:41,540 you're going to be running off of that schema? 130 00:05:41,540 --> 00:05:43,823 And how important is data size? 131 00:05:45,890 --> 00:05:47,460 Third, data cleanliness. 132 00:05:47,460 --> 00:05:49,890 And this goes over here to the picture on the right. 133 00:05:49,890 --> 00:05:51,470 Don't make a data dump. 134 00:05:51,470 --> 00:05:54,220 You need to make sure that you have proper storage. 135 00:05:54,220 --> 00:05:56,750 You need to make sure that, as the data comes in, 136 00:05:56,750 --> 00:05:58,930 that you are storing it in good places, 137 00:05:58,930 --> 00:06:00,300 you don't have duplicate data, 138 00:06:00,300 --> 00:06:02,157 you don't have errors in your data. 139 00:06:02,157 --> 00:06:05,390 That you are transforming if you need to, 140 00:06:05,390 --> 00:06:08,580 so this is the ETL versus ELT discussion. 141 00:06:08,580 --> 00:06:10,560 So when we look at transformation, 142 00:06:10,560 --> 00:06:13,360 are we going to transform after loading the data 143 00:06:13,360 --> 00:06:15,010 or before loading the data? 144 00:06:15,010 --> 00:06:17,420 How are we going to use the data, right? 145 00:06:17,420 --> 00:06:20,460 So think about data cleanliness. 146 00:06:20,460 --> 00:06:22,480 And then finally, automation. 147 00:06:22,480 --> 00:06:24,530 You need to make sure that you have a test 148 00:06:24,530 --> 00:06:27,380 and dev environment, or development environment. 149 00:06:27,380 --> 00:06:29,170 And when you look at automation, 150 00:06:29,170 --> 00:06:31,817 you want to make sure that you are testing your automation, 151 00:06:31,817 --> 00:06:33,279 and that you're using automation 152 00:06:33,279 --> 00:06:36,052 in order to optimize the environment itself 153 00:06:36,052 --> 00:06:38,702 and the movement of data throughout your environment. 154 00:06:40,780 --> 00:06:43,130 Some key points to remember from this lesson. 155 00:06:43,130 --> 00:06:44,810 Take a holistic approach. 156 00:06:44,810 --> 00:06:48,010 Pipelines are a combination of multiple activities. 157 00:06:48,010 --> 00:06:49,700 You need to be thinking about security, 158 00:06:49,700 --> 00:06:51,490 you need to be thinking about ingestion, 159 00:06:51,490 --> 00:06:53,710 you need to be thinking about transformation, 160 00:06:53,710 --> 00:06:56,400 data movement, automation, right? 161 00:06:56,400 --> 00:07:00,560 All of those things fall into your pipeline. 162 00:07:00,560 --> 00:07:01,680 So make sure that you're taking 163 00:07:01,680 --> 00:07:04,630 a holistic approach to building pipelines. 164 00:07:04,630 --> 00:07:08,314 Remember, SQL database is not on the DP-203. 165 00:07:08,314 --> 00:07:10,950 OLAP is likely Synapse. 166 00:07:10,950 --> 00:07:12,800 So just go for the basics here, 167 00:07:12,800 --> 00:07:15,960 you don't need to dive into the depths of OLAP. 168 00:07:15,960 --> 00:07:17,435 You should have a basic knowledge 169 00:07:17,435 --> 00:07:20,150 and know that that generally ties to Synapse. 170 00:07:20,150 --> 00:07:22,600 If you've done that, you're probably pretty good. 171 00:07:22,600 --> 00:07:24,457 All right. That is it for this lesson. 172 00:07:24,457 --> 00:07:26,323 I'll see you in the next.