1 00:00:00,380 --> 00:00:01,360 Hey, Cloud Gurus. 2 00:00:01,360 --> 00:00:03,690 Welcome to the first of a series of lessons 3 00:00:03,690 --> 00:00:05,710 on meeting the tools of the trade. 4 00:00:05,710 --> 00:00:07,690 Getting familiar with what you'll be using 5 00:00:07,690 --> 00:00:11,160 to ingest and transform data within Azure. 6 00:00:11,160 --> 00:00:13,183 First up is Azure Data Factory. 7 00:00:14,340 --> 00:00:16,960 So in this lesson, we're going to start off with a recap 8 00:00:16,960 --> 00:00:19,360 on Azure Data Factory. 9 00:00:19,360 --> 00:00:22,230 We're then going to briefly look at native data flows 10 00:00:22,230 --> 00:00:24,600 and external transformations, 11 00:00:24,600 --> 00:00:26,970 just to get you familiar with what they are. 12 00:00:26,970 --> 00:00:29,470 Most of our time is going to be spent in the demo, 13 00:00:29,470 --> 00:00:32,050 taking a hands-on look at transformations 14 00:00:32,050 --> 00:00:33,790 in Azure Data Factory. 15 00:00:33,790 --> 00:00:36,190 And then we'll wrap everything up with a review. 16 00:00:37,880 --> 00:00:40,420 Brian did a fantastic job in section two, 17 00:00:40,420 --> 00:00:42,840 introducing Azure Data Factory. 18 00:00:42,840 --> 00:00:45,710 If you remember this slide from his presentation, 19 00:00:45,710 --> 00:00:48,770 within ADF we have structures called pipelines; 20 00:00:48,770 --> 00:00:51,670 they are logical groupings of activities. 21 00:00:51,670 --> 00:00:53,820 Our activities can be data movement, 22 00:00:53,820 --> 00:00:56,420 data transformation, and control. 23 00:00:56,420 --> 00:00:57,760 Datasets are what allow us 24 00:00:57,760 --> 00:00:59,760 to actually interact with the data 25 00:00:59,760 --> 00:01:03,190 and linked services are how we connect to the systems 26 00:01:03,190 --> 00:01:05,100 holding that data. 27 00:01:05,100 --> 00:01:08,990 In this lesson, we're focusing in on activities. 
28 00:01:08,990 --> 00:01:10,420 And more specifically, 29 00:01:10,420 --> 00:01:13,493 data movement and data transformation activities. 30 00:01:15,040 --> 00:01:17,713 Let's start by looking at our native data flows. 31 00:01:18,620 --> 00:01:21,460 First of all, we have the Mapping Data Flow, 32 00:01:21,460 --> 00:01:24,690 and this allows us to visually design data transformations 33 00:01:24,690 --> 00:01:28,710 without writing a single bit of code. Pretty awesome, right? 34 00:01:28,710 --> 00:01:29,690 As you'll see in a moment, 35 00:01:29,690 --> 00:01:31,730 you can link different items together 36 00:01:31,730 --> 00:01:34,870 and direct your data into the shape you want it to be. 37 00:01:34,870 --> 00:01:37,700 Think of it kind of like rivers of data coming together 38 00:01:37,700 --> 00:01:39,210 and changing along the way 39 00:01:39,210 --> 00:01:41,713 to produce one combined stream in the end. 40 00:01:43,030 --> 00:01:44,230 Another code-free way 41 00:01:44,230 --> 00:01:47,850 of accomplishing data transformation is data wrangling. 42 00:01:47,850 --> 00:01:51,090 This utilizes Power Query to accomplish, again, 43 00:01:51,090 --> 00:01:54,580 code-free data preparation at cloud scale. 44 00:01:54,580 --> 00:01:56,270 This utilizes Power Query 45 00:01:56,270 --> 00:01:59,603 to accomplish code-free data preparation at cloud scale. 46 00:02:00,490 --> 00:02:05,260 This activity was previously called a wrangling data flow. 47 00:02:05,260 --> 00:02:07,830 It integrates with Power Query Online 48 00:02:07,830 --> 00:02:10,623 and makes Power Query M functions available. 49 00:02:11,530 --> 00:02:15,530 Both of these are executed as activities within the pipeline 50 00:02:15,530 --> 00:02:18,200 because the transformation is really just an activity 51 00:02:18,200 --> 00:02:20,130 that executes in a computing environment 52 00:02:20,130 --> 00:02:23,073 such as Azure Databricks or Azure HDInsight. 
53 00:02:24,880 --> 00:02:26,920 Moving past our native data flows, 54 00:02:26,920 --> 00:02:30,290 we have a number of external transformations 55 00:02:30,290 --> 00:02:33,100 and these allow you to bring your own compute, if you want-- 56 00:02:33,100 --> 00:02:34,913 such as an HDInsight Cluster. 57 00:02:36,300 --> 00:02:39,200 We have the HDInsight Hive Activity, 58 00:02:39,200 --> 00:02:42,810 which lets you execute Hive queries on your cluster. 59 00:02:42,810 --> 00:02:45,090 The HDInsight Pig Activity, 60 00:02:45,090 --> 00:02:47,113 which lets you execute Pig queries. 61 00:02:48,140 --> 00:02:50,620 The HDInsight MapReduce Activity, 62 00:02:50,620 --> 00:02:53,380 which executes MapReduce programs. 63 00:02:53,380 --> 00:02:54,810 And those are just a few examples. 64 00:02:54,810 --> 00:02:59,680 We have many more, such as: HDInsight Streaming Activities, 65 00:02:59,680 --> 00:03:01,570 HDInsight Spark Activities, 66 00:03:01,570 --> 00:03:03,580 Machine Learning Studio Activities, 67 00:03:03,580 --> 00:03:05,630 Stored Procedure Activities, 68 00:03:05,630 --> 00:03:09,680 Data Lake Analytics with U-SQL, Azure Synapse Notebooks, 69 00:03:09,680 --> 00:03:13,990 Databricks Notebooks, Databricks Jar, 70 00:03:13,990 --> 00:03:17,210 and Databricks Python Activity. 71 00:03:17,210 --> 00:03:19,190 And if all of that still isn't enough, 72 00:03:19,190 --> 00:03:22,330 you can create a custom .NET activity. 73 00:03:22,330 --> 00:03:25,063 And manually create the transformations that you want. 74 00:03:26,120 --> 00:03:29,600 All of that to say, there is a huge amount of flexibility. 75 00:03:29,600 --> 00:03:32,210 And so whatever your transformation need is, 76 00:03:32,210 --> 00:03:34,560 Azure Data Factory pretty much has you covered. 77 00:03:35,810 --> 00:03:37,230 With that overview in mind, 78 00:03:37,230 --> 00:03:38,850 let's jump into the Azure portal 79 00:03:38,850 --> 00:03:40,583 and take a look at this in action. 
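[Editor's sketch] To make "bring your own compute" a little more concrete, here is a rough picture of what an HDInsight Hive activity could look like when written out as pipeline JSON, built here as a plain Python dict. This is a sketch and not part of the lesson: the names "MyHDInsightCluster", "MyStorage", and the script path are hypothetical placeholders, and the helper function `hive_activity` is invented purely for illustration. The point is only that the activity names a linked service for the cluster it runs on.

```python
import json


def hive_activity(name, cluster_linked_service, script_linked_service, script_path):
    """Build the approximate JSON shape of an HDInsight Hive activity as a dict.

    All argument values are supplied by the caller; nothing here talks to Azure.
    """
    return {
        "name": name,
        "type": "HDInsightHive",
        # The compute you bring yourself: a linked service pointing at your cluster.
        "linkedServiceName": {
            "referenceName": cluster_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            # Where the Hive script lives (hypothetical path and storage name).
            "scriptPath": script_path,
            "scriptLinkedService": {
                "referenceName": script_linked_service,
                "type": "LinkedServiceReference",
            },
        },
    }


# Placeholder values, assumed for this example only.
activity = hive_activity(
    name="RunHiveQuery",
    cluster_linked_service="MyHDInsightCluster",
    script_linked_service="MyStorage",
    script_path="scripts/transform.hql",
)

print(json.dumps(activity, indent=2))
```

The other external activities mentioned above follow the same general idea: the activity describes what to run, and a linked service describes where it runs.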
80 00:03:42,620 --> 00:03:44,350 Here we are in the Azure portal 81 00:03:44,350 --> 00:03:46,500 and I've opened up Azure Data Factory Studio 82 00:03:46,500 --> 00:03:48,423 within my Azure Data Factory. 83 00:03:49,460 --> 00:03:53,320 If we come over to the pencil icon that stands for author, 84 00:03:53,320 --> 00:03:56,100 we can begin to work with our pipelines. 85 00:03:56,100 --> 00:03:58,140 We have categories here for our pipelines, 86 00:03:58,140 --> 00:04:00,190 datasets, data flows, 87 00:04:00,190 --> 00:04:04,070 and even its own separate category for Power Query. 88 00:04:04,070 --> 00:04:06,370 That is a change in recent years. 89 00:04:06,370 --> 00:04:07,620 And as you know with Azure, 90 00:04:07,620 --> 00:04:09,670 you'll see the interface change along the way 91 00:04:09,670 --> 00:04:11,300 so don't be alarmed by that. 92 00:04:11,300 --> 00:04:14,500 It's just part of the cloud constantly improving. 93 00:04:14,500 --> 00:04:17,490 But to get started, we can click on the + sign 94 00:04:17,490 --> 00:04:19,453 and let's create a new pipeline. 95 00:04:20,930 --> 00:04:24,130 I'll leave it as pipeline1, just for fun. 96 00:04:24,130 --> 00:04:26,530 Really what we want to do here during this lesson 97 00:04:26,530 --> 00:04:28,380 is take a look at the various activities 98 00:04:28,380 --> 00:04:30,390 we were just discussing. 99 00:04:30,390 --> 00:04:34,730 So under Move & Transform, we have different options here. 100 00:04:34,730 --> 00:04:37,820 If you want to simply copy your data and not transform it, 101 00:04:37,820 --> 00:04:39,523 there is an activity for that. 102 00:04:40,586 --> 00:04:43,460 And you can see, we can easily specify our source, 103 00:04:44,460 --> 00:04:46,680 such as our Dev environment and 104 00:04:46,680 --> 00:04:51,370 our Sink--where we're going to--like our Prod environment 105 00:04:51,370 --> 00:04:52,323 or vice versa. 
106 00:04:53,900 --> 00:04:56,400 We can change mappings if they're not identical 107 00:04:56,400 --> 00:04:59,230 and a few other general properties, 108 00:04:59,230 --> 00:05:01,900 but that's not going to accomplish a lot for you 109 00:05:01,900 --> 00:05:03,773 if you have a complex scenario. 110 00:05:05,260 --> 00:05:08,120 For that, we have the Data flow activity 111 00:05:08,120 --> 00:05:09,730 that I was just telling you about, 112 00:05:09,730 --> 00:05:11,473 one of our native data flows. 113 00:05:12,400 --> 00:05:15,850 Within here, we can accomplish some more complex tasks. 114 00:05:15,850 --> 00:05:19,410 If we go to Settings and start a new Data flow, 115 00:05:19,410 --> 00:05:22,450 it'll give you lots of helpful hints along the way 116 00:05:22,450 --> 00:05:24,800 to help you learn how to set up your data flow. 117 00:05:26,400 --> 00:05:27,883 We can add a source. 118 00:05:29,310 --> 00:05:32,650 And this is very similar to our copy activity 119 00:05:32,650 --> 00:05:34,903 where we're defining a dataset to pull from. 120 00:05:36,070 --> 00:05:38,260 But here's where things really start to get fun. 121 00:05:38,260 --> 00:05:40,630 If you click on our little + icon, 122 00:05:40,630 --> 00:05:42,390 you can start to see the different directions 123 00:05:42,390 --> 00:05:43,860 you can guide your river in, 124 00:05:43,860 --> 00:05:46,183 to bring about that clean stream at the end. 125 00:05:47,100 --> 00:05:49,140 We can join the multiple datasets. 126 00:05:49,140 --> 00:05:51,640 Say we had a couple of different sources. 127 00:05:51,640 --> 00:05:53,920 We can split it into 2 if we only have one 128 00:05:53,920 --> 00:05:55,910 based on certain values, 129 00:05:55,910 --> 00:05:59,303 and a number of other powerful transformations. 130 00:06:01,710 --> 00:06:04,960 As you start to add these together, 131 00:06:04,960 --> 00:06:07,760 they can branch into more and more. 
132 00:06:07,760 --> 00:06:11,890 And eventually, come out to the sink, or target, 133 00:06:13,480 --> 00:06:16,153 with the state that you want it to be in. 134 00:06:17,280 --> 00:06:21,430 And remember, this is just one element within our pipeline. 135 00:06:21,430 --> 00:06:23,450 Once this result is achieved, 136 00:06:23,450 --> 00:06:27,220 we can still add to that in a chain of other activities 137 00:06:27,220 --> 00:06:28,743 for our complete pipeline. 138 00:06:30,000 --> 00:06:32,110 If Power Query is what you're looking for instead, 139 00:06:32,110 --> 00:06:33,770 to do more data wrangling 140 00:06:33,770 --> 00:06:36,680 that has its own category now down at the bottom, 141 00:06:36,680 --> 00:06:39,223 we have Power Query currently in preview. 142 00:06:40,790 --> 00:06:44,313 If we click on Settings, we can do a new Power Query. 143 00:06:45,750 --> 00:06:50,063 I'll select the same dataset, add that in there. 144 00:06:51,650 --> 00:06:54,870 And you can begin forming your Power Query. 145 00:06:54,870 --> 00:06:57,660 You see some similar elements here, such as transformations, 146 00:06:57,660 --> 00:07:00,430 combinations and the like, 147 00:07:00,430 --> 00:07:02,390 but this gives you another option 148 00:07:02,390 --> 00:07:05,330 for how to interact with and transform your data. 149 00:07:05,330 --> 00:07:08,823 Again, still just one activity within this pipeline. 150 00:07:09,720 --> 00:07:11,690 We could even combine them together 151 00:07:11,690 --> 00:07:13,453 and go from one to the other. 152 00:07:15,000 --> 00:07:16,960 But we'll talk more on putting pipelines together 153 00:07:16,960 --> 00:07:18,003 in a later lesson. 154 00:07:19,220 --> 00:07:21,910 Before we pop out of our demo, I wanted to show you 155 00:07:21,910 --> 00:07:25,350 where our other external transformations are. 156 00:07:25,350 --> 00:07:28,710 If you remember, many started with the word HDInsight. 
157 00:07:28,710 --> 00:07:32,010 And so those are here under this category, our Hive, 158 00:07:32,010 --> 00:07:36,063 MapReduce, Pig, Spark and Streaming Activities. 159 00:07:36,960 --> 00:07:40,103 There is also U-SQL, under Data Lake Analytics. 160 00:07:41,400 --> 00:07:43,520 And so I hope you can begin to imagine 161 00:07:43,520 --> 00:07:45,580 how you can mix and match these together 162 00:07:45,580 --> 00:07:47,540 to accomplish very simple 163 00:07:47,540 --> 00:07:50,363 or very complex data transformations. 164 00:07:52,230 --> 00:07:56,980 By way of review, mapping data flows and data wrangling, 165 00:07:56,980 --> 00:07:59,470 or, as it's now called, Power Query, 166 00:07:59,470 --> 00:08:01,690 allow you to build code-free transformations 167 00:08:01,690 --> 00:08:03,640 at cloud scale. 168 00:08:03,640 --> 00:08:06,270 They are natively integrated into Spark Clusters 169 00:08:06,270 --> 00:08:07,653 that automatically scale. 170 00:08:09,378 --> 00:08:12,400 There are a variety of external transformations as well 171 00:08:12,400 --> 00:08:15,630 that enable you to accomplish almost any task. 172 00:08:15,630 --> 00:08:16,980 With these, you have the option 173 00:08:16,980 --> 00:08:19,240 of connecting to your own compute instance, 174 00:08:19,240 --> 00:08:22,910 such as an HDInsight Cluster that you already have set up. 175 00:08:22,910 --> 00:08:24,810 And they really fill out the flexibility 176 00:08:24,810 --> 00:08:26,223 of Azure Data Factory. 177 00:08:27,750 --> 00:08:30,070 Remember that all of these execute as activities 178 00:08:30,070 --> 00:08:33,030 within a pipeline and so they can be standalone 179 00:08:33,030 --> 00:08:35,150 or chained together as we saw 180 00:08:35,150 --> 00:08:37,323 to produce your desired end result. 
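[Editor's sketch] The idea that activities can be "standalone or chained together" can be pictured with a small example. The snippet below builds a two-activity pipeline as a plain Python dict, loosely following the shape of a pipeline's JSON definition, where a Data flow activity waits on a Copy activity through a `dependsOn` entry. Treat it as an illustration under assumptions: the activity names, dataset names, and data flow name are hypothetical, and `run_order` is a helper written for this example, not something Azure Data Factory provides.

```python
def depends_on(activity_name):
    """Dependency entry meaning 'run only after activity_name succeeds'."""
    return [{"activity": activity_name, "dependencyConditions": ["Succeeded"]}]


# A Copy activity: moves data from a source dataset to a sink dataset.
# Dataset names are placeholders assumed for this sketch.
copy_step = {
    "name": "CopyRawData",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "StagingDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "DelimitedTextSink"},
    },
}

# A Data flow activity chained after the copy via dependsOn.
flow_step = {
    "name": "ShapeData",
    "type": "ExecuteDataFlow",
    "dependsOn": depends_on("CopyRawData"),
    "typeProperties": {
        "dataflow": {"referenceName": "MyMappingDataFlow", "type": "DataFlowReference"}
    },
}

# Listed out of order on purpose: the dependency, not the list order, decides.
pipeline = {"name": "pipeline1", "properties": {"activities": [flow_step, copy_step]}}


def run_order(pipeline):
    """Return activity names in an order that respects every dependsOn entry."""
    activities = pipeline["properties"]["activities"]
    remaining = {
        a["name"]: {d["activity"] for d in a.get("dependsOn", [])} for a in activities
    }
    ordered = []
    while remaining:
        # An activity is ready once everything it depends on has been scheduled.
        ready = [name for name, deps in remaining.items() if deps <= set(ordered)]
        if not ready:
            raise ValueError("circular or missing dependency")
        for name in ready:
            ordered.append(name)
            del remaining[name]
    return ordered


print(run_order(pipeline))
```

An activity with no `dependsOn` entry is the standalone case; adding one is what turns separate activities into a chain.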
181 00:08:38,500 --> 00:08:40,780 That's it for this lesson, thank you for joining me 182 00:08:40,780 --> 00:08:43,300 as we talk about how to use Azure Data Factory 183 00:08:43,300 --> 00:08:45,970 for data ingestion and transformation. 184 00:08:45,970 --> 00:08:48,300 Next up, we'll talk about another tool of the trade 185 00:08:48,300 --> 00:08:51,460 that you can have in your arsenal as a data engineer. 186 00:08:51,460 --> 00:08:53,360 When you're ready, I'll see you there.