1 00:00:00,540 --> 00:00:02,760 Now, when we are building a machine learning model, 2 00:00:03,890 --> 00:00:05,260 there are seven steps to it. 3 00:00:06,780 --> 00:00:08,580 It starts with problem formulation. 4 00:00:09,820 --> 00:00:17,050 Then we do data tidying, then data preprocessing, then we split the data between test data and training 5 00:00:17,050 --> 00:00:20,710 data, then we build the model and train it. 6 00:00:21,840 --> 00:00:25,810 Once we get the results, we validate the results and check the model's accuracy. 7 00:00:26,640 --> 00:00:29,850 And lastly, we use the model for prediction and other purposes. 8 00:00:31,170 --> 00:00:32,570 Let's look at them one by one. 9 00:00:35,100 --> 00:00:37,320 So the first thing is problem formulation. 10 00:00:38,910 --> 00:00:44,730 It basically means that when you have a business problem with you, you have to convert that business problem 11 00:00:44,730 --> 00:00:46,260 into a statistical problem. 12 00:00:47,360 --> 00:00:52,520 Your business problem can be as simple as increasing your business's revenue. 13 00:00:53,540 --> 00:01:01,130 When you go deeper into it, you may identify that the problem that your business is facing is one of 14 00:01:01,130 --> 00:01:08,150 attrition, that your customers are not renewing their membership or are stopping the use of your product. 15 00:01:09,020 --> 00:01:11,840 And the second is, you're not able to find new prospects. 16 00:01:12,260 --> 00:01:13,620 You don't have new customers. 17 00:01:15,110 --> 00:01:21,530 So this business problem will then be converted into two smaller problems, one of which is decreasing 18 00:01:21,530 --> 00:01:26,120 attrition, and the second one is getting better customer prospects. 19 00:01:27,470 --> 00:01:34,580 So when we have attrition and we want to minimize it, we first need to identify the variables which 20 00:01:34,580 --> 00:01:35,300 impact it.
21 00:01:36,410 --> 00:01:43,370 So attrition will be the dependent variable and we want to identify the independent variables which 22 00:01:43,370 --> 00:01:45,120 increase or decrease attrition. 23 00:01:45,440 --> 00:01:51,710 These variables can be anything, such as technical problems with the product or the customer service 24 00:01:51,710 --> 00:01:56,060 that we provide to our customers or any other business variable. 25 00:01:56,990 --> 00:02:03,620 Once we identify these variables, we need to collect data for all these variables and attrition. 26 00:02:04,920 --> 00:02:11,970 So this step of identifying dependent and independent variables from the business problem is problem 27 00:02:11,970 --> 00:02:12,630 formulation. 28 00:02:16,690 --> 00:02:22,600 Once we have identified the problem and have gathered the data for that problem, we need to tidy it 29 00:02:22,600 --> 00:02:22,900 up. 30 00:02:22,910 --> 00:02:28,700 We need to clean that data so that it is usable for data analysis. 31 00:02:28,930 --> 00:02:34,900 For data analysis, the data should be available in a clear table format with rows and columns. 32 00:02:35,560 --> 00:02:40,910 And in each column, the values should clearly represent different variables. 33 00:02:41,320 --> 00:02:50,440 For example, in this column, if you have M014 representing a male whose age is between zero and 14, 34 00:02:50,680 --> 00:02:52,270 it should not be kept like this. 35 00:02:53,230 --> 00:02:56,650 This data should be converted into two columns, 36 00:02:56,810 --> 00:03:04,870 where one column has the variable of customer sex, which contains male or female, and the other which 37 00:03:04,870 --> 00:03:09,610 has age categories such as zero to 14, 15 to 24 and so on.
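The tidy-data step described above can be sketched in code. Here is a minimal Python sketch, assuming the combined codes look like "M014" or "F1524" (the code format, the helper name, and the output labels are illustrative assumptions, not taken from the actual data):

```python
# Sketch: splitting a combined sex/age code such as "M014"
# (male, age 0 to 14) into two separate tidy variables.
# The code format is an assumption for illustration.

def split_code(code):
    """Split a combined code into a (sex, age_group) pair."""
    sex = "male" if code[0] == "M" else "female"
    digits = code[1:]
    # Assume the digits encode the lower and upper bound of the
    # age band, e.g. "014" -> "0-14", "1524" -> "15-24".
    mid = len(digits) // 2
    low, high = digits[:mid], digits[mid:]
    return sex, f"{int(low)}-{int(high)}"

# One untidy column becomes two tidy columns.
raw_column = ["M014", "F1524", "M2534"]
tidy_rows = [split_code(c) for c in raw_column]
```

Each tuple in `tidy_rows` now represents one row of the two clean columns (sex and age category) described in the lecture.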
38 00:03:11,910 --> 00:03:19,080 Once you have clean data, you need to pre-process your data before putting it into the model. Preprocessing 39 00:03:19,080 --> 00:03:24,870 can include filtering data, which means you are removing some particular type of data from all the 40 00:03:24,870 --> 00:03:25,680 data that you have. 41 00:03:27,910 --> 00:03:35,210 It can include aggregating values, that is, assigning aggregate values wherever necessary, and missing value treatment: 42 00:03:35,320 --> 00:03:38,430 a lot of times some variables are missing some values. 43 00:03:38,860 --> 00:03:44,920 We need to treat those missing values as per the choice of analysis that we have decided. 44 00:03:46,940 --> 00:03:52,010 We also need to treat the outliers as they can adversely impact the analysis. 45 00:03:53,840 --> 00:03:58,430 Similarly, we may want to transform some variables or reduce the total number of variables. 46 00:03:59,640 --> 00:04:06,780 All this constitutes data preprocessing and we will be covering all this in a separate section, because 47 00:04:06,780 --> 00:04:10,560 this is one of the most important parts before starting any analysis. 48 00:04:11,280 --> 00:04:17,490 And when you are in the data analysis business, almost 80 percent of your time will always go into data 49 00:04:17,490 --> 00:04:19,310 tidying and data preprocessing. 50 00:04:19,530 --> 00:04:25,320 So it is very important to know how to get data tidied up and into the right format. 51 00:04:27,780 --> 00:04:37,320 The fourth step is to split the data that you have, that is the past data, into two. First is the training data that 52 00:04:37,320 --> 00:04:39,600 we will use to train our algorithm. 53 00:04:41,010 --> 00:04:48,270 This data will include both input data and output data, and using this data the algorithm will learn 54 00:04:48,270 --> 00:04:51,360 the relationship between input and output variables.
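The preprocessing steps mentioned above, missing value treatment and outlier treatment, can be sketched in a few lines. This is a minimal illustration: mean imputation and clipping to fixed bounds are just two of many possible treatments, and the sample values and bounds are made up:

```python
# Sketch of two common preprocessing treatments.
# Mean imputation and fixed-bound clipping are illustrative
# choices; the right treatment depends on the chosen analysis.

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def clip_outliers(values, low, high):
    """Cap values outside [low, high] at the boundaries."""
    return [min(max(v, low), high) for v in values]
```

For example, `fill_missing([1, None, 3])` fills the gap with the mean 2.0, and `clip_outliers([-5, 10, 200], 0, 100)` pulls the extreme values back to the chosen bounds.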
55 00:04:52,860 --> 00:05:01,650 And the algorithm will be trying to minimize the error between the expected output and the actual value 56 00:05:01,860 --> 00:05:04,320 that we have given in the data. 57 00:05:06,710 --> 00:05:13,910 If you remember the example that I showed you earlier, if you have five images of bananas and apples, 58 00:05:14,300 --> 00:05:22,190 we will use four to train the model and we will keep one aside so that once the model is trained, we 59 00:05:22,190 --> 00:05:28,000 will use that one image to test whether our model is predicting correctly or not. 60 00:05:29,330 --> 00:05:32,780 So that one image that we separated is called testing data. 61 00:05:33,800 --> 00:05:36,410 Testing data will have input data. 62 00:05:37,250 --> 00:05:42,560 We will give this input data to our model and the model will predict an expected output. 63 00:05:43,700 --> 00:05:50,220 We will compare this expected output of the model with the actual output that we have with us. 64 00:05:50,540 --> 00:05:57,260 This will help us assess the accuracy of the model that we have created and whether the predictive function 65 00:05:57,260 --> 00:06:00,720 that we have generated using the model is working correctly or not. 66 00:06:02,060 --> 00:06:07,910 Usually when you have a lot of data, around 80 percent of the available data is used as training data 67 00:06:07,910 --> 00:06:10,760 and 20 percent is used as testing data. 68 00:06:15,830 --> 00:06:21,920 The fifth part is training the model. You have the value of the dependent variable and you have the values 69 00:06:21,920 --> 00:06:25,600 of the independent variables, that is, you have output and input variables. 70 00:06:26,630 --> 00:06:30,490 You want to estimate this function f(x) based on 71 00:06:30,500 --> 00:06:35,410 the previous data; your model will run and estimate the value of this function.
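The 80/20 split described above can be sketched as a small Python helper. The fraction and the fixed seed are illustrative choices; in practice you would typically use a library routine such as scikit-learn's `train_test_split`:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into training and testing parts.

    Shuffling first ensures both parts are representative of the
    whole dataset; the seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original stays intact
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# With 100 records and the default 20 percent test fraction,
# we get 80 training records and 20 testing records.
train, test = train_test_split(list(range(100)))
```

The training part (with both inputs and outputs) goes to the algorithm; the testing part is held aside to check the trained model's predictions, exactly as with the one banana/apple image kept back in the example.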
72 00:06:39,350 --> 00:06:47,440 So once your model is trained and you have that function, which is the relationship between your output 73 00:06:47,440 --> 00:06:51,430 variable and input variable, you need to check its performance. 74 00:06:53,390 --> 00:06:56,540 The model that we have created will have two types of errors. 75 00:06:57,800 --> 00:07:00,770 These are called in-sample and out-of-sample errors. 76 00:07:02,350 --> 00:07:08,310 In-sample error means we provided some training data to our model to train itself. 77 00:07:10,170 --> 00:07:16,640 After training, it may still not be able to predict all the values in the training data correctly. 78 00:07:18,630 --> 00:07:22,020 So it may be giving errors while predicting the values of the training data. 79 00:07:23,040 --> 00:07:24,600 That error is called in-sample error. 80 00:07:27,390 --> 00:07:35,770 So suppose we give four images of apples and bananas to our model to train, and out of those four images, 81 00:07:36,390 --> 00:07:39,530 the model is predicting three of them correctly. 82 00:07:40,170 --> 00:07:45,810 Then we have 75 percent accuracy and there is twenty five percent in-sample error. 83 00:07:47,070 --> 00:07:55,440 Whereas for the new model that you have created, if you apply it to a new sample of four new images 84 00:07:55,800 --> 00:08:03,210 and out of those four new images it is predicting only two values correctly, then those two wrongly 85 00:08:03,210 --> 00:08:06,030 predicted values are the out-of-sample errors. 86 00:08:07,670 --> 00:08:10,700 The last step is to use this prediction model. 87 00:08:12,120 --> 00:08:18,210 Once you have this prediction model, you should set up a pipeline so that your input data constantly 88 00:08:18,210 --> 00:08:19,360 flows into this model. 89 00:08:19,650 --> 00:08:22,710 You get the output in the form of dashboards and tables.
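The accuracy calculation behind the in-sample and out-of-sample error example above can be sketched as follows; the label lists are made up to reproduce the 3-of-4 and 2-of-4 counts from the lecture:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# In-sample: 3 of 4 training images predicted correctly -> 75% accuracy,
# i.e. 25% in-sample error.
train_actual = ["apple", "apple", "banana", "banana"]
train_pred   = ["apple", "apple", "banana", "apple"]

# Out-of-sample: only 2 of 4 new images predicted correctly -> 50% accuracy;
# the two wrong predictions are the out-of-sample errors.
test_actual = ["apple", "banana", "apple", "banana"]
test_pred   = ["apple", "banana", "banana", "apple"]
```

A gap between in-sample and out-of-sample accuracy like this one is the signal to keep validating the model rather than trusting its training performance alone.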
90 00:08:24,020 --> 00:08:31,160 You should also keep on monitoring the output of your model and improve it by adding more training data 91 00:08:31,160 --> 00:08:36,080 to it or incorporating any new variables that you identify. 92 00:08:37,970 --> 00:08:41,750 Also, if your business requires, you can automate this scoring process.