1 00:00:00,220 --> 00:00:01,110 All right. 2 00:00:01,230 --> 00:00:04,860 Let's fill some of those missing values, shall we? 3 00:00:04,890 --> 00:00:06,530 So make a little heading here. 4 00:00:06,780 --> 00:00:08,490 Fill missing values. 5 00:00:08,490 --> 00:00:09,720 Beautiful. 6 00:00:09,960 --> 00:00:14,140 And we've got a fair few missing values here. 7 00:00:14,190 --> 00:00:16,120 So what might we do? 8 00:00:16,410 --> 00:00:19,640 I think we'll do the numeric missing values first. 9 00:00:19,680 --> 00:00:21,050 So let's put it here. 10 00:00:21,150 --> 00:00:24,170 So another little heading and we'll go... 11 00:00:24,570 --> 00:00:26,730 Let's change this to number two. 12 00:00:27,030 --> 00:00:34,740 Just because I like the aesthetic. I'll go: fill numeric missing values first. 13 00:00:34,760 --> 00:00:35,210 Wonderful. 14 00:00:35,210 --> 00:00:45,930 Now how do we check which columns are numeric and which columns aren't numeric? So what we might do, remember, 15 00:00:45,930 --> 00:00:48,560 there is is_string_dtype. 16 00:00:48,560 --> 00:00:52,870 We did that before when we looped through our data frame right up here. 17 00:00:52,900 --> 00:00:57,710 You can press Shift+Tab in there. Where is it? 18 00:00:57,790 --> 00:00:59,530 So, is_string_dtype. 19 00:01:00,100 --> 00:01:01,850 So there's another one, as you might have guessed. 20 00:01:01,860 --> 00:01:05,240 There's also is_numeric_dtype. Is it up here? 21 00:01:05,250 --> 00:01:06,580 No, it might be... 22 00:01:06,580 --> 00:01:07,830 Here we go. 23 00:01:07,840 --> 00:01:11,340 Here, is_numeric_dtype. 24 00:01:11,590 --> 00:01:17,410 Check whether the provided array or dtype is of a numeric dtype. Because we want to fill numeric data, 25 00:01:18,750 --> 00:01:21,510 we want to find out which columns are numeric.
26 00:01:21,510 --> 00:01:28,410 Let's loop through: for label, content in df_tmp.items(). So we're going to loop through the column names 27 00:01:28,440 --> 00:01:31,410 as well as what each column contains. 28 00:01:31,710 --> 00:01:43,860 If pd.api.types.is_numeric_dtype... that makes sense. What do we need to put in here? 29 00:01:43,920 --> 00:01:49,810 The content, yes. If the content in the column is numeric, we want to print out the label. That's going to 30 00:01:49,830 --> 00:01:50,670 be the name of the column. 31 00:01:51,730 --> 00:01:52,540 Wonderful. 32 00:01:52,540 --> 00:01:54,340 So let's check out one of these. 33 00:01:54,340 --> 00:02:01,630 We made these ones down here so we know that they're numeric, but let's check out, just for fun, Model 34 00:02:01,670 --> 00:02:02,050 ID. 35 00:02:06,620 --> 00:02:13,080 There's going to be a ModelID, or actually we can check what that is. ModelID: identifier for a unique 36 00:02:13,080 --> 00:02:16,110 machine model, i.e. fiModelDesc. 37 00:02:16,110 --> 00:02:16,780 Beautiful. 38 00:02:17,180 --> 00:02:17,450 OK. 39 00:02:17,490 --> 00:02:19,310 So these are our numeric types. 40 00:02:19,410 --> 00:02:24,720 And now what we might do is find out which ones of these have missing values. 41 00:02:24,780 --> 00:02:32,340 What we're more focused on is that we want to fill the missing numeric columns with some sort of value. Check 42 00:02:32,850 --> 00:02:37,980 for which numeric columns have null values. 43 00:02:38,190 --> 00:02:40,920 Now, how might we do this? 44 00:02:40,920 --> 00:02:47,970 Well, first of all let's start out with for label, content, or just write exactly what we did before, in 45 00:02:47,970 --> 00:02:49,750 df_tmp.items(). 46 00:02:50,250 --> 00:03:02,580 And now we want to go if pd.api.types.is_numeric... I tried to do Tab but there's too many options... 47 00:03:02,640 --> 00:03:06,140 is_numeric_dtype(content). 48 00:03:06,270 --> 00:03:07,380 Wonderful.
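The loop just described can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact cell: the small `df_tmp` here is a made-up stand-in for the course data frame, its values are invented, and the `numeric_columns` list is a hypothetical helper added only so the result can be inspected.

```python
import pandas as pd

# Toy stand-in for the lesson's data frame (values are invented for illustration).
df_tmp = pd.DataFrame({
    "SalePrice": [9500.0, 14000.0, 50000.0],
    "ModelID": [4605, 1830, 3538],
    "state": ["Texas", "Florida", "Ohio"],
})

# Hypothetical helper list so the result can be checked afterwards.
numeric_columns = []

# items() yields (column name, column Series) pairs.
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        numeric_columns.append(label)
        print(label)
```

On this toy frame only `SalePrice` and `ModelID` are printed, since `state` holds strings.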
49 00:03:07,380 --> 00:03:13,730 And then what we might need to do is check... there's a function in pandas called isnull. 50 00:03:14,160 --> 00:03:16,170 So if we go pd.isnull... 51 00:03:19,950 --> 00:03:21,090 pandas isnull. 52 00:03:21,090 --> 00:03:21,840 Here we go. 53 00:03:22,940 --> 00:03:31,850 So this detects missing values for an array-like object, so we can do this by pd.isnull(content). 54 00:03:31,850 --> 00:03:33,410 That's what we want. 55 00:03:33,470 --> 00:03:34,880 So we'll do another statement here. 56 00:03:34,880 --> 00:03:37,200 So pd.isnull, so this is going to go: 57 00:03:37,280 --> 00:03:42,410 are there null values in the content, a.k.a. the column? In the whole column, 58 00:03:42,410 --> 00:03:43,690 are there null values? 59 00:03:43,760 --> 00:03:46,240 But we need to make sure there's a sum, so the total. 60 00:03:46,250 --> 00:03:54,160 So if there's any null values, so if this is higher than zero, we want to print the label. 61 00:03:54,280 --> 00:03:58,270 This should tell us which columns have null values. 62 00:03:58,270 --> 00:03:58,780 All right. 63 00:03:58,790 --> 00:04:06,040 Now that we know which columns, or which numeric columns, have null values, we can fill them with something. 64 00:04:06,110 --> 00:04:13,140 So what we might do is fill the numeric rows with the median. So how might we do that? We can just take 65 00:04:13,140 --> 00:04:16,170 this, we'll copy this down here. 66 00:04:17,380 --> 00:04:17,840 Right. 67 00:04:17,920 --> 00:04:21,450 We're going to go: fill numeric rows 68 00:04:21,720 --> 00:04:23,510 with the median. 69 00:04:23,860 --> 00:04:24,340 Wonderful. 70 00:04:24,340 --> 00:04:28,330 But there's one other thing that we have to actually do here. 71 00:04:28,330 --> 00:04:30,640 Well, you don't have to do it, but we're going to do it anyway.
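As a rough sketch of the null check walked through above, assuming a toy `df_tmp` with invented values (the `numeric_with_nulls` list is a hypothetical addition for inspecting the result):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lesson's data frame (values are invented for illustration).
df_tmp = pd.DataFrame({
    "SalePrice": [9500.0, 14000.0, 50000.0, 16000.0],
    "auctioneerID": [3.0, np.nan, 1.0, np.nan],
    "state": ["Texas", None, "Ohio", "Texas"],
})

# Hypothetical helper list so the result can be checked afterwards.
numeric_with_nulls = []

for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        # pd.isnull gives a True/False mask; summing it counts the missing values.
        if pd.isnull(content).sum() > 0:
            numeric_with_nulls.append(label)
            print(label)
```

The `state` column also has a missing value here, but it is skipped because it is not numeric; non-numeric columns are left for a later step.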
72 00:04:30,640 --> 00:04:38,050 And I'm going to write it out first, because we're going to add a binary column which tells us if the 73 00:04:38,050 --> 00:04:40,910 data was missing or not. 74 00:04:41,050 --> 00:04:42,700 That comment is a bit long, but that doesn't matter. 75 00:04:43,090 --> 00:04:44,780 Why do you think we might do this? 76 00:04:44,790 --> 00:04:47,270 I'll just let you have a think about it while we do it. 77 00:04:47,680 --> 00:04:56,710 We'll go here: _is_missing equals pd.isnull(content). 78 00:04:56,710 --> 00:04:57,520 Wonderful. 79 00:04:58,150 --> 00:05:10,150 And then what we're going to do here is fill missing numeric values with the median: df_tmp[label] equals 80 00:05:10,150 --> 00:05:11,460 content 81 00:05:11,500 --> 00:05:22,210 .fillna(content.median()). So we've only built upon what we've done before, but we've added two new steps 82 00:05:22,210 --> 00:05:24,740 in here. Before we hit Shift and Enter on this, 83 00:05:24,880 --> 00:05:29,110 we're going against our rule of "if in doubt, run the code". I want you to just think why we might put this column 84 00:05:29,110 --> 00:05:29,800 here. 85 00:05:30,040 --> 00:05:32,680 You might be asking as well, why are we filling it with the median? 86 00:05:32,680 --> 00:05:40,810 Why not the mean? Well, the reason why we're adding this column here is because if we just fill up all 87 00:05:40,810 --> 00:05:48,790 of our missing data with the median, then there might be a reason for that missing value that was there. 88 00:05:48,940 --> 00:05:55,990 We're keeping that information, the fact that that variable was missing, by adding a column which is True 89 00:05:55,990 --> 00:05:56,490 or False 90 00:05:56,500 --> 00:05:58,110 if it was missing or not. 91 00:05:58,210 --> 00:06:04,420 So that means if we find a row that has a missing auctioneerID
value and then we fill it with the 92 00:06:04,420 --> 00:06:12,100 median of the auctioneerID column, there's going to be a column on the very end of our data frame which 93 00:06:12,100 --> 00:06:20,500 says True, which is basically 0 or 1, which tells us that hey, we filled up this auctioneerID row, but 94 00:06:20,770 --> 00:06:25,440 originally that column, or this row, was missing this value. 95 00:06:25,570 --> 00:06:29,860 We're keeping that data, we're keeping that information, the fact that that particular sample had 96 00:06:29,860 --> 00:06:32,760 a missing auctioneerID in the dataset. 97 00:06:32,770 --> 00:06:36,940 So we're keeping that information there, but then we're filling it up with the median. And you might be 98 00:06:36,940 --> 00:06:39,240 asking, why the median, right? 99 00:06:39,240 --> 00:06:40,670 Why not the mean? 100 00:06:40,690 --> 00:06:46,330 Well, the reason being is because the median is more robust than the mean, and you might be 101 00:06:46,330 --> 00:06:49,770 wondering, what do you mean the median is more robust than the mean? 102 00:06:49,780 --> 00:06:55,660 Well, the thing is, when you use the mean, when we have a lot of different values, so we have four hundred 103 00:06:55,660 --> 00:07:02,420 twelve thousand values, the mean of a lot of different values can be very sensitive to outliers. 104 00:07:02,440 --> 00:07:05,590 So let me demonstrate it with kind of a rough example. 105 00:07:05,770 --> 00:07:14,310 So: demonstrate how median is more robust than mean. 106 00:07:14,440 --> 00:07:17,040 And we've just run this cell, so we'll check out what happened with that.
107 00:07:17,050 --> 00:07:21,370 In a second. I just want to give you an example of why we might use the median. You might go and search 108 00:07:21,370 --> 00:07:29,050 for this, go "why use median over mean". That could be some sort of extension, you could do some research 109 00:07:29,050 --> 00:07:35,200 there. But we go here, we go tens equals np.full. What this is going to do: 110 00:07:35,260 --> 00:07:40,300 creating an array a thousand long with a value of one hundred. 111 00:07:40,300 --> 00:07:43,860 So say there's a thousand people... why are we calling this tens? 112 00:07:43,870 --> 00:07:49,110 So, hundreds. There's a thousand people, each with a hundred dollars. 113 00:07:49,410 --> 00:07:49,880 OK. 114 00:07:49,940 --> 00:07:56,130 And then we'll make one, hundreds_billion. Let's say out of those thousand people, 115 00:07:56,180 --> 00:07:59,860 Bill Gates just walked in and he's got a billion dollars in his pocket. 116 00:08:00,320 --> 00:08:01,670 So we'll go... 117 00:08:01,810 --> 00:08:03,890 Not tens, hundreds. 118 00:08:03,890 --> 00:08:05,000 So there's a thousand people. 119 00:08:05,000 --> 00:08:05,620 One hundred dollars. 120 00:08:05,630 --> 00:08:08,210 But now Bill Gates has walked in and he has a billion. 121 00:08:08,420 --> 00:08:11,440 That's nine zeros. 122 00:08:11,480 --> 00:08:12,200 My goodness. 123 00:08:12,200 --> 00:08:13,130 That's a lot of zeros. 124 00:08:13,130 --> 00:08:14,030 One two three. 125 00:08:14,030 --> 00:08:15,320 One two three, one two three. 126 00:08:15,320 --> 00:08:18,000 Well done. OK. 127 00:08:18,230 --> 00:08:19,930 So what we're going to do here 128 00:08:20,050 --> 00:08:28,940 is go np.mean(hundreds), and then we're going to go np.mean(hundreds_billion), and then we'll 129 00:08:28,940 --> 00:08:38,360 go np.median(hundreds), and then we're going to go np.median(hundreds_billion).
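The demonstration being typed out here can be sketched roughly like this. The array names follow the ones spoken in the lesson; the four result variables are hypothetical names added so the values can be compared, and the use of `np.append` to add the outlier is an assumption about how the extra value was attached.

```python
import numpy as np

# A thousand people, each with one hundred dollars.
hundreds = np.full((1000,), 100)

# The same thousand people plus one person with a billion dollars.
# (Assumption: the outlier is attached with np.append.)
hundreds_billion = np.append(hundreds, 1_000_000_000)

# Hypothetical variable names, added so the four results can be compared.
mean_before = np.mean(hundreds)
mean_after = np.mean(hundreds_billion)
median_before = np.median(hundreds)
median_after = np.median(hundreds_billion)

print(mean_before, mean_after, median_before, median_after)
```

A single extreme value drags the mean from 100 to roughly 999,100, while the median stays at 100, which is the sense in which the median is called robust to outliers.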
130 00:08:38,360 --> 00:08:42,890 So what we're doing is we're taking the mean of each of these and the median of 131 00:08:42,890 --> 00:08:46,200 each of these. Let's see. Whoa. 132 00:08:46,950 --> 00:08:53,190 So you see what I mean here. In our array hundreds, let's have a look at it. 133 00:08:53,710 --> 00:08:57,700 This is just a thousand examples of the number one hundred. 134 00:08:57,800 --> 00:08:59,400 The mean is one hundred. 135 00:08:59,400 --> 00:09:09,070 That makes sense, and the median is also one hundred. But when we add one billion to the end, it should 136 00:09:09,070 --> 00:09:10,430 be right down here somewhere. 137 00:09:10,510 --> 00:09:12,230 Well, maybe it's not. 138 00:09:12,230 --> 00:09:13,810 It's in hundreds_billion. 139 00:09:14,450 --> 00:09:18,550 Now we have one example of a billion right at the end. 140 00:09:18,580 --> 00:09:21,180 Look how much that influences our mean. 141 00:09:21,310 --> 00:09:23,020 And now this is an extreme example, right? 142 00:09:23,020 --> 00:09:24,700 We're using a billion versus one hundred. 143 00:09:24,730 --> 00:09:27,890 So we've got a couple of orders of magnitude here. 144 00:09:28,030 --> 00:09:31,330 Look how different it is to the median. In its purest form, 145 00:09:31,480 --> 00:09:37,720 this is why we're using the median. And so you might come across that in your research of why use the median 146 00:09:37,720 --> 00:09:44,750 over the mean, but the main reason is because the median is more robust to outliers. 147 00:09:44,880 --> 00:09:49,250 But with that being covered, let's have a look at what we just did. 148 00:09:49,410 --> 00:09:50,580 What did we just do with this? 149 00:09:50,580 --> 00:10:01,500 So to figure that out, what we can do now is check if there are any null numeric values. And how did we do 150 00:10:01,500 --> 00:10:02,160 that before? 151 00:10:02,310 --> 00:10:06,300 Well, we did for label, content in df_tmp.items().
152 00:10:06,300 --> 00:10:07,270 That's what we're after. 153 00:10:07,270 --> 00:10:07,650 Yep. 154 00:10:07,860 --> 00:10:19,620 If pd.api.types.is... getting ahead of myself here... is_numeric_dtype(content). Wonderful. 155 00:10:20,070 --> 00:10:30,210 If pd.isnull(content).sum(), print label. So this should tell us if there are still any numeric 156 00:10:30,390 --> 00:10:37,180 values or numeric columns that have null values, and it should print out those column names. And it prints 157 00:10:37,180 --> 00:10:38,400 out nothing. 158 00:10:38,410 --> 00:10:46,940 The reason being is because we just filled the content, the ones which were missing, with the median. 159 00:10:47,350 --> 00:10:50,540 Now let's check to see what our binary column did. 160 00:10:50,930 --> 00:10:52,010 OK. 161 00:10:52,270 --> 00:11:00,230 Check to see if this worked. We should have... Let's go up here. 162 00:11:00,240 --> 00:11:02,430 How many auctioneerID values were missing? 163 00:11:02,430 --> 00:11:07,370 Can we find that out from our previous one? auctioneerID. 164 00:11:07,370 --> 00:11:07,960 OK. 165 00:11:08,150 --> 00:11:17,210 So we should have... False should be the total number of rows minus this, so 166 00:11:17,210 --> 00:11:19,900 True should be this value in our binary column. 167 00:11:19,940 --> 00:11:22,430 Right, because it wasn't missing. 168 00:11:22,430 --> 00:11:28,910 So let's go here... True should be this number, because that's how many missing values there were. 169 00:11:28,910 --> 00:11:29,420 See there? 170 00:11:29,780 --> 00:11:33,630 So what we've done here, before we filled our values, we've checked how many missing values there 171 00:11:33,630 --> 00:11:34,200 are. 172 00:11:34,280 --> 00:11:39,650 So our binary column should be True for twenty thousand one hundred thirty-six examples, because that's 173 00:11:39,650 --> 00:11:41,550 how many examples we've filled.
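The fill step and the re-check described in this part of the lesson can be sketched as follows. This is an illustration on a made-up `df_tmp`, not the notebook's exact cell: the `_is_missing` suffix follows the "label plus is missing" naming mentioned in the lesson, while the `list(...)` snapshot around `items()` and the `remaining` list are cautious additions of this sketch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lesson's data frame (values are invented for illustration).
df_tmp = pd.DataFrame({
    "auctioneerID": [1.0, np.nan, 2.0, np.nan, 99.0],
    "SalePrice": [9500.0, 14000.0, 50000.0, 16000.0, 22000.0],
})

# Iterate over a snapshot of the columns, since new columns are added inside the loop.
for label, content in list(df_tmp.items()):
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum() > 0:
            # Add a binary column which records whether the value was missing.
            df_tmp[label + "_is_missing"] = pd.isnull(content)
            # Fill the missing numeric values with the column median.
            df_tmp[label] = content.fillna(content.median())

# Re-run the earlier check: no numeric column should report nulls any more.
remaining = [
    label
    for label, content in df_tmp.items()
    if pd.api.types.is_numeric_dtype(content) and pd.isnull(content).sum() > 0
]
print(remaining)
```

Only `auctioneerID` gets an indicator column here, because `SalePrice` had nothing missing; the filled rows take the median of the observed values, which is 2.0 in this toy data.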
174 00:11:41,600 --> 00:11:43,030 So let's check that out. 175 00:11:43,140 --> 00:11:52,800 So df_tmp dot... because we've created a column, auctioneerID_is_missing. 176 00:11:52,800 --> 00:11:56,090 See here, we've got label plus _is_missing. 177 00:11:56,090 --> 00:12:02,630 Now this should be True and False... beautiful. 178 00:12:02,670 --> 00:12:09,390 What that means is that we filled up twenty thousand one hundred and thirty-six values in the auctioneerID 179 00:12:09,390 --> 00:12:10,830 column, 180 00:12:11,040 --> 00:12:18,210 not the _is_missing column, because we just made this, with the median of the original auctioneerID 181 00:12:18,240 --> 00:12:21,450 column. Now, 182 00:12:21,460 --> 00:12:26,460 that was pretty full on, but we've still got some missing values in our data frame. Let's check them out. 183 00:12:26,460 --> 00:12:28,440 What else do we have? 184 00:12:28,450 --> 00:12:33,930 We filled up all the numeric missing values, but we've still got some of the... we've still got a fair 185 00:12:33,930 --> 00:12:35,470 few. All these ones are zero. 186 00:12:35,520 --> 00:12:37,230 They don't have any missing values, so that's good. 187 00:12:37,240 --> 00:12:41,790 They're getting ready to use with our machine learning model, but we still have to fill all of these 188 00:12:41,790 --> 00:12:45,360 before we can use our data with a machine learning model. 189 00:12:45,360 --> 00:12:47,550 So let's do that in the next video.
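The sanity check on the binary column can be sketched like this, on invented data. Counting the True and False values with `value_counts()` is an assumption of this sketch (the lesson only says the output should show True and False), and `missing_before` and `true_count` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lesson's data frame (values are invented for illustration).
df_tmp = pd.DataFrame({"auctioneerID": [1.0, np.nan, 2.0, np.nan, 3.0, np.nan]})

# Record how many values are missing before anything is filled.
missing_before = int(df_tmp["auctioneerID"].isna().sum())

# Same two steps as in the lesson: indicator column first, then the median fill.
df_tmp["auctioneerID_is_missing"] = pd.isnull(df_tmp["auctioneerID"])
df_tmp["auctioneerID"] = df_tmp["auctioneerID"].fillna(df_tmp["auctioneerID"].median())

# Assumption: value_counts() is one way to see the True/False totals.
print(df_tmp["auctioneerID_is_missing"].value_counts())

# The number of True flags should equal the number of values that were missing.
true_count = int(df_tmp["auctioneerID_is_missing"].sum())
print(true_count == missing_before)
```

On the real dataset the same comparison would be expected to show 20,136 True values for `auctioneerID_is_missing`, matching the missing count quoted in the lesson.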