Please note that this content is targeted at SysOps administrators. If you're a Solutions Architect or a developer, you may want to skip over this one.

Welcome back to BackSpace Academy. In this hands-on lecture I'm just going to run through some of the techniques that are available to us for troubleshooting our EC2 instances. I'm starting off here in the EC2 dashboard. On the left-hand side we can see we have Limits. If we click on that, it provides us with all of the limits that apply to this account. We can see here that on this account I can run five c1.medium instances but only one c3.4xlarge instance. If I try to run more than that, it's not going to let me, and I would have to go through and request a limit increase.

What I'll do now is try to launch two of these instances. I don't recommend that you do this yourself, because they're very expensive instances, and if you forget to terminate them you're going to end up with a pretty nasty bill at the end of the month. So we go back to the EC2 dashboard, Launch Instance, and I'm just going to grab a Linux AMI, which will be fine. Let's scroll down to a c4 extra large.
Okay, c4.4xlarge. I'm just going to review and launch that.

So it's allowed me to launch that instance. It's quite a big one, and I'm just going to try to launch another one of those. Again we'll just use the Linux AMI and a c4.4xlarge, and we'll review and launch, select our key pair and launch. Okay, so the launch has failed: we've exceeded our limit, and it's stopped us from launching that second instance. I'll just cancel that, and before I get a big bill I'll terminate the one that I've created there.

Okay, so what I'll do now is launch a smaller instance, and we'll see what tools are available for us to troubleshoot that instance. So we launch another instance. Again, Amazon Linux will be fine and a t2.micro will be fine, and I'm just going to review and launch. We'll wait for that to finish launching and finish all of its status checks.

Okay, so after a certain amount of time we have our instance up and running and its status checks have passed. Now if we select that instance and click on the Status Checks tab, we can see that yes, it has passed its system status checks and its instance status checks. If you want to understand the difference between the two, just hover over the information icon there. The system status check
verifies that your instance is reachable, whereas the instance status check verifies that your instance's operating system is accepting traffic. It also tells you what to do if you fail that check. You can see there that if you fail an instance status check, that is, the operating system is not operating correctly, then you are best off rebooting that instance. Whereas if something goes wrong with the system status check, then it's at a system level: there is a problem with the underlying host that is running that instance, and so the best thing to do is to stop and start it, which normally moves it onto a different host, or else replace that instance.

The next thing we can look at is the system log for the instance. We go to Instance Settings and then Get System Log. What that is is an output of the console of our Linux operating system; if we were connected to this Linux instance, this is what we would see from the console, and that gives you a lot of information as to what is going on. We can scroll down here to the latest, and we can see it's at the login stage, waiting for a login to the instance. So just close that. What we can also do is go again into Instance Settings and get a screenshot of the console of that instance
as it is right now, and so there we can see it's waiting for a login. We can refresh that whenever we want as well and get a later one. These are quite good because they allow you to see exactly what has gone on from a historical perspective through the boot stage, and also what the console is telling you right at this point in time. I'll just close out of that.

Now if we find that we have a problem with an instance status check, what we can do is implement a CloudWatch alarm with an action. So we just go into CloudWatch Monitoring, Add/Edit Alarms, and Create Alarm. We obviously can send a notification, as we can with most CloudWatch alarms, but we can also take an action on that instance. If we click on Take Action, we can actually recover that instance, so if it has failed a status check we can automatically repair the instance in the event of that. You can only recover certain instance types, not all EC2 instance types, but you can recover t2 and most of the common types of EC2 instance.

You can also take an action to stop the instance. Now if you want to stop the instance, as opposed to recover it, the CloudWatch service needs to have an IAM role to do that, so you need to select Create IAM Role here. And you can only stop an instance if it's an EBS volume
backed instance. You can't stop an instance that is instance store backed; it would just need to be terminated. Then we've got Terminate Instance as well, and we can also reboot the instance, but again we need to have an IAM role created for that.

So if we would like to have different stages that kick in, we can first try to recover the instance if it has failed for a number of consecutive periods, and create an alarm for that. Then we can look at rebooting the instance, and we can do that after, say, ten consecutive periods. And then we can create another alarm to terminate that instance, which would be very handy if it's in an Auto Scaling group: you can terminate it, and another instance can be created quite quickly for you. I'll just get out of that now.

Back here in our Status Checks tab we can see there is also a link to create a status check alarm, and that is the same alarm that we were talking about before, so we can take an action to recover the instance. Now if we want to terminate an instance, one thing I didn't mention before is that we need to make sure that termination protection on that instance is disabled, otherwise CloudWatch won't be able to terminate it. So that's about all I need to tell you about troubleshooting of EC2.
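
For anyone who would rather script the staged alarms than click through the console, here is a rough sketch in Python. It only builds the arguments you would pass to CloudWatch's `put_metric_alarm` call through boto3. The instance ID, the region, the alarm names, the 60-second period and the period counts other than ten are made-up values for illustration. Pairing the stop and terminate actions with the combined `StatusCheckFailed` metric is also an assumption, not something shown in the lecture.

```python
# Sketch only: builds keyword arguments for CloudWatch's PutMetricAlarm call.
# The helper names below are hypothetical, not part of boto3.

# Which status check metric each EC2 alarm action watches. Recover uses the
# system check; reboot uses the instance check. Using the combined metric for
# stop and terminate is an assumption made for this example.
ACTION_METRICS = {
    "recover": "StatusCheckFailed_System",
    "reboot": "StatusCheckFailed_Instance",
    "stop": "StatusCheckFailed",
    "terminate": "StatusCheckFailed",
}


def status_check_alarm_params(instance_id, region, action, evaluation_periods):
    """Return put_metric_alarm arguments for one status check alarm."""
    if action not in ACTION_METRICS:
        raise ValueError(f"unsupported EC2 alarm action: {action}")
    return {
        "AlarmName": f"{instance_id}-status-{action}",  # hypothetical naming
        "Namespace": "AWS/EC2",
        "MetricName": ACTION_METRICS[action],
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per evaluation period (assumed)
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


def create_alarm(region, params):
    """Send the alarm to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the sketch runs without boto3 installed

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


# Staged escalation: recover first, then reboot, then terminate.
# Only the "ten periods" figure comes from the lecture; 2 and 15 are invented.
stages = [
    status_check_alarm_params("i-0123456789abcdef0", "us-east-1", action, periods)
    for action, periods in [("recover", 2), ("reboot", 10), ("terminate", 15)]
]
for params in stages:
    print(params["AlarmName"], params["MetricName"], params["AlarmActions"][0])
```

Nothing is sent to AWS unless `create_alarm` is called. Remember that stop, terminate and reboot actions still need the IAM role, and terminate only works with termination protection disabled.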
Coming up next we'll have the same for troubleshooting RDS. I'll see you in the next lesson.
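
As a recap of the two status checks covered in this lecture, here is a small sketch in Python that turns a status check result into the suggested fix. It assumes the result has the shape returned by the EC2 `describe_instance_status` call in boto3. The function name and the sample data are invented for illustration, and no AWS call is made.

```python
# Sketch only: maps one entry of a describe_instance_status response to the
# remediation suggested in the lecture. The function name is hypothetical.


def suggest_remediation(entry):
    """Return a suggested action for one InstanceStatuses entry."""
    system = entry["SystemStatus"]["Status"]
    instance = entry["InstanceStatus"]["Status"]
    if system == "impaired":
        # Problem with the underlying host.
        return "stop and start the instance, or replace it"
    if instance == "impaired":
        # Problem inside the operating system.
        return "reboot the instance"
    if system == "ok" and instance == "ok":
        return "no action needed"
    # For example "initializing" or "insufficient-data".
    return "wait and check again"


# Invented sample data in the shape of a describe_instance_status response.
# With boto3 the real data would come from something like:
#   boto3.client("ec2").describe_instance_status(InstanceIds=["i-..."])
sample_response = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "impaired"},
        }
    ]
}

for entry in sample_response["InstanceStatuses"]:
    print(entry["InstanceId"], "->", suggest_remediation(entry))
```

The system check is tested first because a host-level problem will not be fixed by a reboot.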