All righty. So we've seen how to choose an estimator for a regression problem. How can we do the same for a classification problem?

Well, first of all, let's get a dataset that is a classification problem. The heart disease dataset, back in our midst. So we want data. We've already imported this above, but we're just going to do it again to make sure it's the same: heart_disease, then heart_disease.head(). Wonderful.

So again, we've seen this dataset before. Each row is a patient, so each sample is a patient; each column, or each feature, is a health attribute of that particular patient; and the target, 1 or 0, is whether that patient has heart disease or not. So our problem is classification: predicting whether something is one thing or another, heart disease or not. That's what we need.

So what we're going to do is use the little link we left ourselves here to go visit the map, and we're going to pay attention to it. We know we want something in the classification realm, but what we're going to do is start from the top and follow it through.

So, start: do we have above 50 samples? Well, let's check. Each row is a sample, so we want len(heart_disease): 303. Yes, we do. Are we predicting a category? Well, heart disease or not heart disease sounds like a category to me, so yes. Do we have labeled data? Yes. Do we have under 100K samples? Yes. And what's this? We've reached a little green box. Beautiful. So remember, the green boxes are estimators, a.k.a. machine learning algorithms. Let's have a look. This is telling us our problem is classification, and the first one we come to is Linear SVC.

All right: "SVC and LinearSVC are classes capable of performing multi-class classification on a dataset." Well, the map particularly said Linear SVC, so let's have a look at that one. "LinearSVC: Linear Support Vector Classification. Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm." That's a lot of terms, and most of them I don't really understand. But what's important here is that if we go down to the code example, we can see how it's actually used.
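For reference, the docs example looks something like this (a minimal sketch in the spirit of the scikit-learn documentation, using a toy dataset from make_classification rather than our heart disease data):

```python
# Sketch of the scikit-learn docs example for LinearSVC,
# on a made-up toy dataset (not our heart disease data).
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_features=4, random_state=0)  # toy X and y
clf = LinearSVC(random_state=0)  # the classifier, abbreviated clf
clf.fit(X, y)  # learn patterns between X and y
```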
From sklearn.svm they import LinearSVC, they've used X and y, and then clf (classifier) equals LinearSVC. All right, this is kind of similar to what we've been doing. Well, let's not take the documentation's word for it, let's try it for ourselves.

Actually, we'll leave ourselves a note first: consulting the map, it says to try LinearSVC. So that will be our first port of call: import the LinearSVC estimator class. We go from sklearn.svm, and really what I'm doing here is basically just rewriting that docs snippet for our problem. So: from sklearn.svm import LinearSVC. Wonderful. Set up a random seed, np.random.seed(), and we'll use our faithful 42.

And then we're going to make the data. We saw this in the previous section, section 1, get your data ready: X = heart_disease.drop("target", axis=1), because we want to get rid of the target column, and y = heart_disease["target"], because y is actually the target column. Beautiful.

And then we're going to split the data: X_test... no, train comes first. I may have done this about 100 times now and I'm still getting it wrong. See, this is what it takes. It takes a little bit of practice, or really a lot of practice, to become a machine learning practitioner: knowing how to deal with data, knowing how to split data, knowing how to model data. It's all part of the practice. test_size=0.2... yes, that is correct. That's what we want to do.

Instantiate LinearSVC. We'll call it clf, short for classifier: clf = LinearSVC(). That'll do. Actually, next is clf.fit: we want X_train and y_train. And then we're going to evaluate the LinearSVC: clf.score(X_test, y_test). All right, beautiful, let's see this in action.

Wonderful... ConvergenceWarning: Liblinear failed to converge. Increase the number of iterations. Well, this is one of those things where, if you ran into an error like this, what I might do is google the warning message and see what comes up. Because I've had a little bit of experience with it, I might do something like max_iter=1000... and it still happens. Right. So, increase the number of iterations: we've done that, we've gone max_iter=1000, but that's actually where it starts, max_iter defaults to 1,000. Let's try 10,000... still there.
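Put together, the cell at this point looks something like this (a sketch of what we've typed so far; the CSV filename is an assumption, adjust it to wherever your copy of the data lives):

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Set up a random seed (our faithful 42) so the split is reproducible
np.random.seed(42)

# Assumed filename for the dataset we imported above
heart_disease = pd.read_csv("heart-disease.csv")

# Make the data: features X (everything but target) and labels y
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data (train comes first!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit LinearSVC, bumping max_iter up from its default of 1,000
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# Evaluate: mean accuracy on the test data and labels
clf.score(X_test, y_test)
```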
Well, let's bypass that warning for the time being, because see here: the score is below 0.5. Now, why is that a trigger in our heads? Well, because if we look at our data, we want heart_disease["target"].value_counts(), and we see there are only two classes, 1 and 0: does someone have heart disease or not? Right, so it's a binary classification problem, binary meaning one or the other, and our model is returning a score of 0.47. score() returns the mean accuracy on the given test data and labels; as I said, we'll look into evaluation metrics in a future section, but what you can imagine is that our model is only operating at 47 percent accuracy. And why does that trigger us? Well, because there are only two classes. That means if we were just guessing whether someone had heart disease or not, not even looking at their health attributes, just guessing yes or no, we would get about 50 percent. It's basically a coin toss. So that's triggering something in our heads, saying: hey, without fixing this warning, or without improving our model by choosing better hyperparameters, our LinearSVC model might not be finding patterns between X and y.
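That coin-toss check is quick to run (the exact counts depend on your copy of the data, but the two classes here are roughly balanced):

```python
# Only two classes, 1 (heart disease) and 0 (no heart disease),
# so a model that just guessed would land around 50% accuracy.
heart_disease["target"].value_counts()
```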
So what can we do? Well, when in doubt, come back to the graphic. Not working? We just tried LinearSVC, so we take the "not working" path. Text data? No. It's going to say KNeighborsClassifier, but I'm going to skip this one and go straight to ensemble classifiers. So let's click on this. Now, why this one? Because we've seen it before: ensemble methods, forests of randomized trees, sklearn.ensemble. From sklearn.ensemble import RandomForestClassifier.

Now, if you remember back up in our regression problem, when we switched to using a RandomForestRegressor, we saw a bump in the score. Well, because RandomForestRegressor did so well, the good news for us is that it's got a dance partner called RandomForestClassifier. We've actually seen this one before too, but what we're going to do is compare it to LinearSVC. So, to save time, I'm going to copy this code and bring it down here. We want to change this to RandomForestClassifier: from sklearn.ensemble import RandomForestClassifier. Wonderful. We can keep the rest the same, except we need to sub in RandomForestClassifier where LinearSVC was. We'll do the same with the note here, just to tidy things up and keep our code communicating what it does. Wonderful.

And now, what do you think will happen here? 3, 2, 1... let's see. Error: fit() missing one required argument, 'y'. Oh, there we go, that's what we've forgotten: we forgot to add the little brackets on the end of RandomForestClassifier(). Wonderful. Now, to get rid of this warning, we need to change n_estimators. As we can see here, the default value of n_estimators will change from 10 in version 0.22 to 100. So we'll just get rid of it by setting n_estimators=100. Delicious. Wonderful.

So what's happened here? We've used a different model. LinearSVC scored 47 percent accuracy, and our RandomForestClassifier has scored 85 percent accuracy, so nearly double, just by using another model. Right? So you can see why I kind of jumped straight to using random forests.
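Side by side, the random forest version of the cell looks something like this (a sketch reusing X and y from above; your exact scores will vary a little with the split, but here it came out around 0.85 versus 0.47 for LinearSVC):

```python
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Same split as before, different estimator
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Note the brackets: assigning the class without them
# (clf = RandomForestClassifier) is what caused the fit()
# "missing argument" error above. n_estimators=100 silences the
# warning about the default changing from 10 to 100 in version 0.22.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # ~0.85 here vs ~0.47 for LinearSVC
```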
If we go back, all I've done is gone to start and figured out what kind of problem we're trying to solve. You can do this as well. For our regression problem, we were trying to predict a number; for our classification problem, we're trying to predict a label. Do we have labeled data? Yes. Are we predicting a category? Yes. And when our LinearSVC model wasn't working, we went straight to ensemble classifiers.

Now, you might be thinking: what about all the other models? Right, we've kind of skipped all these ones here, and we've only really touched the surface; we've only used this one, ensemble classifiers. And that's a great question. Well, the first reason is mainly time: we could go through and try all of these. And the second reason is that there's a little tidbit in machine learning: if you have structured data, a.k.a. tables or dataframes, use ensemble methods such as random forests. Why? Because they'll perform pretty well if there are patterns to be found. This is a tidbit worth writing down, so I'll put it in the resources section as well.

So, the tidbit is: one, if you have structured data, use ensemble methods; and two, if you have unstructured data, use deep learning or transfer learning. Now, what's an example of structured data? Well, stuff in a table like this. And unstructured data is things like images, audio or text, where you'd use deep learning or transfer learning. Now, we haven't covered deep learning and transfer learning yet, but since we have structured data, a.k.a. things in a dataframe, the tidbit says use ensemble methods. Hence why we've kind of gone: you know what, I'm not going to try any of these, I'm going straight to ensemble classifiers, straight to the random forest, because the random forest is known for its robustness and its ability to find patterns.

Phew, we've covered a lot. We've had a look at this machine learning map, and if you're still looking at it going, holy gosh, what's going on, don't worry. It took me a while to figure it out too. But it really clicked once I realized: hold on, the first step is to just get the data and figure out the main problem we're trying to solve, usually regression or classification, the two most common ones you'll come across. Then start answering the questions, going through the map, and use a little framework, something as simple as this, to start running a machine learning estimator, a.k.a. a machine learning model, on your data and getting some feedback from it quickly. That's the most important part: being a data scientist or machine learning engineer is about reducing the time between your experiments.

So now that we've seen how to choose a model, we're going to get into the next section, which is fitting a model to the data and then using it to make predictions. Because that's really the essence of machine learning, right? Finding patterns in data, and then using those patterns to make predictions on future data a model hasn't seen before. Take a quick break, and I'll see you in the next video.