1 00:00:00,540 --> 00:00:02,760 Now, when we are building a machine learning model, 2 00:00:03,890 --> 00:00:05,260 there are seven steps to it. 3 00:00:06,780 --> 00:00:08,580 It starts with problem formulation. 4 00:00:09,820 --> 00:00:17,050 Then we do data tidying, then data preprocessing, then we split the data between test data and training 5 00:00:17,050 --> 00:00:20,710 data, then we build the model and train it. 6 00:00:21,840 --> 00:00:25,810 Once we get the results, we validate the results and check the model's accuracy. 7 00:00:26,640 --> 00:00:29,850 And lastly, we use the model for prediction and other purposes. 8 00:00:31,170 --> 00:00:32,570 Let's look at them one by one. 9 00:00:35,100 --> 00:00:37,320 So the first thing is problem formulation. 10 00:00:38,910 --> 00:00:44,730 It basically means that when you have a business problem with you, you have to convert that business problem 11 00:00:44,730 --> 00:00:46,260 into a statistical problem. 12 00:00:47,360 --> 00:00:52,520 Your business problem can be as simple as increasing your business's revenue. 13 00:00:53,540 --> 00:01:01,130 When you go deeper into it, you may identify that the problem that your business is facing is one of 14 00:01:01,130 --> 00:01:08,150 attrition, that your customers are not renewing their membership or are stopping the use of your product. 15 00:01:09,020 --> 00:01:11,840 And the second is, you're not able to find new prospects. 16 00:01:12,260 --> 00:01:13,620 You don't have new customers. 17 00:01:15,110 --> 00:01:21,530 So this business problem will then be converted into two smaller problems, one of which is decreasing 18 00:01:21,530 --> 00:01:26,120 attrition, and the second one is getting better customer prospects. 19 00:01:27,470 --> 00:01:34,580 So when we have attrition and we want to minimize it, we first need to identify the variables which 20 00:01:34,580 --> 00:01:35,300 impact it.
21 00:01:36,410 --> 00:01:43,370 So attrition will be the dependent variable and we want to identify the independent variables which 22 00:01:43,370 --> 00:01:45,120 increase or decrease attrition. 23 00:01:45,440 --> 00:01:51,710 These variables can be anything, such as technical problems with the product or the customer service 24 00:01:51,710 --> 00:01:56,060 that we provide to our customers or any other business variable. 25 00:01:56,990 --> 00:02:03,620 Once we identify these variables, we need to collect data for all these variables and attrition. 26 00:02:04,920 --> 00:02:11,970 So this step of identifying dependent and independent variables from the business problem is problem 27 00:02:11,970 --> 00:02:12,630 formulation. 28 00:02:16,690 --> 00:02:22,600 Once we have identified the problem and have gathered the data for that problem, we need to tidy it 29 00:02:22,600 --> 00:02:22,900 up. 30 00:02:22,910 --> 00:02:28,700 We need to clean that data so that it is usable for data analysis. 31 00:02:28,930 --> 00:02:34,900 For data analysis, the data should be available in a clear table format with rows and columns. 32 00:02:35,560 --> 00:02:40,910 And in each column, the values should clearly represent different variables. 33 00:02:41,320 --> 00:02:50,440 For example, in this column, if you have M014 representing a male whose age is between zero and 14, 34 00:02:50,680 --> 00:02:52,270 it should not be kept like this. 35 00:02:53,230 --> 00:02:56,650 This data should be converted into two columns, 36 00:02:56,810 --> 00:03:04,870 where one column has the variable of customer sex, which contains male or female, and the other which 37 00:03:04,870 --> 00:03:09,610 has age categories such as zero to 14, 15 to 24 and so on.
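The tidy-data step described above can be sketched in code. Here is a minimal Python sketch, assuming the combined codes look like "M014" or "F1524" (the code format, the helper name, and the output labels are illustrative assumptions, not taken from the actual data):

```python
# Sketch: splitting a combined sex/age code such as "M014"
# (male, age 0 to 14) into two separate tidy variables.
# The code format is an assumption for illustration.

def split_code(code):
    """Split a combined code into a (sex, age_group) pair."""
    sex = "male" if code[0] == "M" else "female"
    digits = code[1:]
    # Assume the digits encode the lower and upper bound of the
    # age band, e.g. "014" -> "0-14", "1524" -> "15-24".
    mid = len(digits) // 2
    low, high = digits[:mid], digits[mid:]
    return sex, f"{int(low)}-{int(high)}"

# One untidy column becomes two tidy columns.
raw_column = ["M014", "F1524", "M2534"]
tidy_rows = [split_code(c) for c in raw_column]
```

Each tuple in `tidy_rows` now represents one row of the two clean columns (sex and age category) described in the lecture.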
38 00:03:11,910 --> 00:03:19,080 Once you have clean data, you need to pre-process your data before putting it into the model. Preprocessing 39 00:03:19,080 --> 00:03:24,870 can include filtering data, which means you are removing some particular type of data from all the 40 00:03:24,870 --> 00:03:25,680 data that you have. 41 00:03:27,910 --> 00:03:35,210 It can include aggregating values, that is, assigning aggregate values wherever necessary, and missing value treatment: 42 00:03:35,320 --> 00:03:38,430 a lot of times some variables are missing some values. 43 00:03:38,860 --> 00:03:44,920 We need to treat those missing values as per the choice of analysis that we have decided. 44 00:03:46,940 --> 00:03:52,010 We also need to treat the outliers as they can adversely impact the analysis. 45 00:03:53,840 --> 00:03:58,430 Similarly, we may want to transform some variables or reduce the total number of variables. 46 00:03:59,640 --> 00:04:06,780 All this constitutes data preprocessing and we will be covering all this in a separate section, because 47 00:04:06,780 --> 00:04:10,560 this is one of the most important parts before starting any analysis. 48 00:04:11,280 --> 00:04:17,490 And when you are in the data analysis business, almost 80 percent of your time will always go into data 49 00:04:17,490 --> 00:04:19,310 tidying and data preprocessing. 50 00:04:19,530 --> 00:04:25,320 So it is very important to know how to get data tidied up and into the right format. 51 00:04:27,780 --> 00:04:37,320 The fourth step is to split the data that you have, that is the past data, into two. First is the training data that 52 00:04:37,320 --> 00:04:39,600 we will use to train our algorithm. 53 00:04:41,010 --> 00:04:48,270 This data will include both input data and output data, and using this data the algorithm will learn 54 00:04:48,270 --> 00:04:51,360 the relationship between input and output variables.
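The preprocessing steps mentioned above, missing value treatment and outlier treatment, can be sketched in a few lines. This is a minimal illustration: mean imputation and clipping to fixed bounds are just two of many possible treatments, and the sample values and bounds are made up:

```python
# Sketch of two common preprocessing treatments.
# Mean imputation and fixed-bound clipping are illustrative
# choices; the right treatment depends on the chosen analysis.

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def clip_outliers(values, low, high):
    """Cap values outside [low, high] at the boundaries."""
    return [min(max(v, low), high) for v in values]
```

For example, `fill_missing([1, None, 3])` fills the gap with the mean 2.0, and `clip_outliers([-5, 10, 200], 0, 100)` pulls the extreme values back to the chosen bounds.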
55 00:04:52,860 --> 00:05:01,650 And the algorithm will be trying to minimize the error between the expected output and the actual value 56 00:05:01,860 --> 00:05:04,320 that we have given in the data. 57 00:05:06,710 --> 00:05:13,910 If you remember the example that I showed you earlier, if you have five images of bananas and apples, 58 00:05:14,300 --> 00:05:22,190 we will use four to train the model and we will keep one aside so that once the model is trained, we 59 00:05:22,190 --> 00:05:28,000 will use that one image to test whether our model is predicting correctly or not. 60 00:05:29,330 --> 00:05:32,780 So that one image that we separated is called testing data. 61 00:05:33,800 --> 00:05:36,410 Testing data will have input data. 62 00:05:37,250 --> 00:05:42,560 We will give this input data to our model and the model will predict an expected output. 63 00:05:43,700 --> 00:05:50,220 We will compare this expected output of the model with the actual output that we have with us. 64 00:05:50,540 --> 00:05:57,260 This will help us assess the accuracy of the model that we have created and whether the predictive function 65 00:05:57,260 --> 00:06:00,720 that we have generated using the model is working correctly or not. 66 00:06:02,060 --> 00:06:07,910 Usually when you have a lot of data, around 80 percent of the available data is used as training data 67 00:06:07,910 --> 00:06:10,760 and 20 percent is used as testing data. 68 00:06:15,830 --> 00:06:21,920 The fifth part is training the model. You have the value of the dependent variable and you have the values 69 00:06:21,920 --> 00:06:25,600 of the independent variables, that is, you have output and input variables. 70 00:06:26,630 --> 00:06:30,490 You want to estimate this function f(x) based on 71 00:06:30,500 --> 00:06:35,410 the previous data; your model will run and estimate the value of this function.
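The 80/20 split described above can be sketched as a small Python helper. The fraction and the fixed seed are illustrative choices; in practice you would typically use a library routine such as scikit-learn's `train_test_split`:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into training and testing parts.

    Shuffling first ensures both parts are representative of the
    whole dataset; the seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original stays intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# With 100 records and the default 20 percent test fraction,
# we get 80 training records and 20 testing records.
train, test = train_test_split(list(range(100)))
```

The training part (with both inputs and outputs) goes to the algorithm; the testing part is held aside to check the trained model's predictions, exactly as with the one banana/apple image kept back in the example.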
72 00:06:39,350 --> 00:06:47,440 So once your model is trained and you have that function, which is the relationship between your output 73 00:06:47,440 --> 00:06:51,430 variable and input variable, you need to check its performance. 74 00:06:53,390 --> 00:06:56,540 The model that we have created will have two types of errors. 75 00:06:57,800 --> 00:07:00,770 These are called in-sample and out-of-sample errors. 76 00:07:02,350 --> 00:07:08,310 In-sample error means we provided some training data to our model to train itself. 77 00:07:10,170 --> 00:07:16,640 After training, it may still not be able to predict all the values in the training data correctly. 78 00:07:18,630 --> 00:07:22,020 So it may be giving errors while predicting the values of the training data. 79 00:07:23,040 --> 00:07:24,600 That error is called in-sample error. 80 00:07:27,390 --> 00:07:35,770 So suppose we give four images of apples and bananas to our model to train, and out of those four images, 81 00:07:36,390 --> 00:07:39,530 the model is predicting three of them correctly. 82 00:07:40,170 --> 00:07:45,810 Then we have 75 percent accuracy and there is twenty five percent in-sample error. 83 00:07:47,070 --> 00:07:55,440 Whereas for the new model that you have created, if you apply it to a new sample of four new images 84 00:07:55,800 --> 00:08:03,210 and out of those four new images it is predicting only two values correctly, then those two wrongly 85 00:08:03,210 --> 00:08:06,030 predicted values are the out-of-sample errors. 86 00:08:07,670 --> 00:08:10,700 The last step is to use this prediction model. 87 00:08:12,120 --> 00:08:18,210 Once you have this prediction model, you should set up a pipeline so that your input data constantly 88 00:08:18,210 --> 00:08:19,360 flows into this model. 89 00:08:19,650 --> 00:08:22,710 You get the output in the form of dashboards and tables.
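The accuracy calculation behind the in-sample and out-of-sample error example above can be sketched as follows; the label lists are made up to reproduce the 3-of-4 and 2-of-4 counts from the lecture:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# In-sample: 3 of 4 training images predicted correctly -> 75% accuracy,
# i.e. 25% in-sample error.
train_actual = ["apple", "apple", "banana", "banana"]
train_pred   = ["apple", "apple", "banana", "apple"]

# Out-of-sample: only 2 of 4 new images predicted correctly -> 50% accuracy;
# the two wrong predictions are the out-of-sample errors.
test_actual = ["apple", "banana", "apple", "banana"]
test_pred   = ["apple", "banana", "banana", "apple"]
```

A gap between in-sample and out-of-sample accuracy like this one is the signal to keep validating the model rather than trusting its training performance alone.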
90 00:08:24,020 --> 00:08:31,160 You should also keep on monitoring the output of your model and improve it by adding more training data 91 00:08:31,160 --> 00:08:36,080 to it or incorporating any new variables that you identify. 92 00:08:37,970 --> 00:08:41,750 Also, if your business requires, you can automate this scoring process.