1 00:00:00,670 --> 00:00:07,300 For me, the best part of digging into data isn't a prerequisite of gathering, loading or cleaning 2 00:00:07,330 --> 00:00:10,850 the data, but when I actually get to interview the data. 3 00:00:10,880 --> 00:00:11,290 Right. 4 00:00:11,290 --> 00:00:16,120 So these are the moments when I discover whether the data is clean or dirty. 5 00:00:16,120 --> 00:00:22,860 So whether it's a complete and most of all, when story and what story the data can tell. 6 00:00:22,870 --> 00:00:29,500 So think of interviewing the data as a process, asking to interviewing a person applying for a job. 7 00:00:29,500 --> 00:00:29,860 Right? 8 00:00:29,860 --> 00:00:38,200 So you want to ask questions that reveal whether the reality of their expertise matches their resume. 9 00:00:38,200 --> 00:00:38,650 Right? 10 00:00:38,650 --> 00:00:42,760 So interviewing the data is exciting because you discover truth. 11 00:00:43,150 --> 00:00:50,770 So, for example, you might find that half the respondents forgot to fill out the email. 12 00:00:51,310 --> 00:00:58,600 Uh, for example here, let's actually let's open this with the query here, click right click on the 13 00:00:58,600 --> 00:00:59,080 teachers. 14 00:00:59,080 --> 00:01:02,150 And after that we waited data all rows here. 15 00:01:02,150 --> 00:01:08,810 So for example, you might find that half of the respondents forgot to fill out the email, for example, 16 00:01:08,810 --> 00:01:10,700 filled in the questionnaire. 17 00:01:11,730 --> 00:01:16,190 Our mayor hasn't paid property taxes for the past five years. 18 00:01:16,200 --> 00:01:19,500 Or you might learn that your data is dirty. 19 00:01:19,530 --> 00:01:27,570 Names are spelled inconsistently, dates are incorrect or numbers doesn't jibe with your expectations, 20 00:01:27,570 --> 00:01:32,370 so your findings become the part of the story. 21 00:01:36,430 --> 00:01:41,080 In SQL interviewing data starts with the select keyword. 22 00:01:42,730 --> 00:01:43,540 So. 23 00:01:44,860 --> 00:01:51,220 Um, which retrieves the rows and columns from one or more of the tables in a database. 24 00:01:51,220 --> 00:01:57,820 So a select statement can be a very simple as you can see, we can't delete this because we use the 25 00:01:57,820 --> 00:02:03,010 select here by right clicking onto our teachers and clicking related data. 26 00:02:03,010 --> 00:02:07,900 And here first 100 rows, if we click on that, you will see limit 100. 27 00:02:07,900 --> 00:02:10,900 So we will use that select a lot. 28 00:02:10,900 --> 00:02:17,980 So now we're going to firstly go to here tables and now we will click on the query tool. 29 00:02:17,980 --> 00:02:25,930 And as you can see, we are using on a database of PostgreSQL, which is the database we want to make 30 00:02:25,930 --> 00:02:27,640 the SQL statements on. 31 00:02:27,640 --> 00:02:29,890 So here is the basic. 32 00:02:31,370 --> 00:02:32,210 SQL syntax here. 33 00:02:32,210 --> 00:02:36,410 So select this select from. 34 00:02:37,640 --> 00:02:41,370 And here we will use the teachers. 35 00:02:41,460 --> 00:02:42,380 Teachers. 36 00:02:43,110 --> 00:02:45,510 And here, let's click run. 37 00:02:45,510 --> 00:02:52,980 And here, as you can see, we we are seeing some of the here outputs here Caitlin Rush, Janet Smith, 38 00:02:53,010 --> 00:03:01,350 Lee Reynolds, Samuel Cole, Samantha Bush, and here these are the salary of numeric and so on. 39 00:03:01,740 --> 00:03:10,980 So this single line of code here shows the most basic form of SQL query and the asterisk following with 40 00:03:10,980 --> 00:03:13,950 the actual let me get that pen. 41 00:03:13,950 --> 00:03:16,500 So the pen or. 42 00:03:17,440 --> 00:03:19,600 Ink right here. 43 00:03:19,840 --> 00:03:24,190 So now we will we will draw some things on the screen. 44 00:03:25,100 --> 00:03:25,550 Here. 45 00:03:26,300 --> 00:03:27,170 So now. 46 00:03:29,150 --> 00:03:29,720 Okay. 47 00:03:29,720 --> 00:03:33,860 So here the asterisk on the screen. 48 00:03:35,200 --> 00:03:35,740 Yeah. 49 00:03:35,980 --> 00:03:42,550 So here, this wild card, which is like standing for value, it doesn't represent anything in particular 50 00:03:42,550 --> 00:03:47,330 and instead represents everything the value could possibly be. 51 00:03:47,350 --> 00:03:51,130 So here it shorthand for select all columns. 52 00:03:51,130 --> 00:03:57,100 So if you had given a column name instead of the wild card here, this command would select a value. 53 00:03:57,100 --> 00:04:05,200 So in that column here, so the from keyword here this from keyword indicates you want the query to 54 00:04:05,200 --> 00:04:07,300 return data from a particular table. 55 00:04:07,300 --> 00:04:14,740 So the semicolon after the table name tells the PostgreSQL it's the end of the query statement. 56 00:04:14,740 --> 00:04:22,210 So let's use the select statement with the asterisk wild card on the teachers table you created in previous 57 00:04:22,210 --> 00:04:30,310 lectures, and once again you open the Pgadmin, select the analysis, our Postgres database here, 58 00:04:30,430 --> 00:04:36,370 and after that, open the query tool here and then execute the statement you are seeing here. 59 00:04:36,370 --> 00:04:41,290 So remember, as an alternative to typing these statements into the query tool, you can also run the 60 00:04:41,290 --> 00:04:44,950 code by clicking open file and navigating the place. 61 00:04:44,950 --> 00:04:49,960 You have saved the code you downloaded from our attachment sections. 62 00:04:49,960 --> 00:04:50,410 Right? 63 00:04:50,420 --> 00:04:55,330 Always do this if you see the code is truncated with here. 64 00:04:55,330 --> 00:04:59,230 So now we will once execute the query. 65 00:04:59,920 --> 00:05:07,510 The result here in the query tools output page contains all the rows and columns you inserted into the 66 00:05:07,510 --> 00:05:10,390 teachers table in previous lectures. 67 00:05:10,390 --> 00:05:14,560 So these rows may not always appear in this order, but that's okay. 68 00:05:14,650 --> 00:05:24,190 So note that the ID column here of here Big integer is automatically filled with sequential integers, 69 00:05:24,220 --> 00:05:26,170 even though you didn't explicitly insert them. 70 00:05:26,170 --> 00:05:26,380 Right? 71 00:05:26,380 --> 00:05:28,030 So it's very handy. 72 00:05:28,030 --> 00:05:36,560 This is auto incrementing integer acts as a unique identifier or key that not only ensures how how each 73 00:05:36,560 --> 00:05:44,090 row in the table is unique, but also later gives us a way to connect this table to other tables in 74 00:05:44,090 --> 00:05:45,020 the database. 75 00:05:45,020 --> 00:05:52,760 So before you move on, note that you have two other ways to view all of the rows in a table using Pgadmin, 76 00:05:52,760 --> 00:06:02,420 you can right click on the teachers table in the object tree and choose the view with ID and our rows. 77 00:06:02,690 --> 00:06:05,300 And here you can see the. 78 00:06:06,320 --> 00:06:09,110 At the same result here. 79 00:06:09,320 --> 00:06:10,250 So. 80 00:06:11,680 --> 00:06:12,640 As you can see here. 81 00:06:14,930 --> 00:06:19,040 We can also, let's say, query a subset of columns here. 82 00:06:19,220 --> 00:06:24,860 So often it's more practical to limit the columns, the query retrieves, and especially with a large 83 00:06:24,860 --> 00:06:29,930 database so you don't have to write root access information. 84 00:06:29,930 --> 00:06:34,820 So you can do this by naming columns separated by commas right after the select keyword here. 85 00:06:34,820 --> 00:06:36,650 So we don't need that for now. 86 00:06:36,650 --> 00:06:42,440 We will go to our here teachers that we used and executed our queries. 87 00:06:42,530 --> 00:06:43,610 SQL queries here. 88 00:06:43,610 --> 00:06:46,130 So now we will select. 89 00:06:47,330 --> 00:06:56,180 So firstly, let's I want to introduce how this select works, how how we can, how we can query a subset 90 00:06:56,180 --> 00:06:57,200 of columns here. 91 00:06:57,200 --> 00:07:00,890 So now here, my first. 92 00:07:02,410 --> 00:07:02,860 Alarm. 93 00:07:03,890 --> 00:07:04,850 Another. 94 00:07:06,150 --> 00:07:06,780 Alarm. 95 00:07:08,070 --> 00:07:10,170 Super duper. 96 00:07:11,660 --> 00:07:12,200 Fulham. 97 00:07:13,710 --> 00:07:16,320 And from our table name. 98 00:07:16,470 --> 00:07:17,190 In this case. 99 00:07:17,190 --> 00:07:19,410 In this case, this is just an example here. 100 00:07:19,560 --> 00:07:23,220 My super table name. 101 00:07:23,400 --> 00:07:24,270 That's it. 102 00:07:24,300 --> 00:07:30,390 With that syntax, the query will retrieve all rows from just those three columns. 103 00:07:30,390 --> 00:07:32,100 So let's apply. 104 00:07:32,880 --> 00:07:35,640 Let's apply this to the teachers table. 105 00:07:35,640 --> 00:07:38,250 Perhaps in your in your. 106 00:07:40,000 --> 00:07:40,870 Postgres. 107 00:07:42,550 --> 00:07:48,640 You want to focus on teachers names, first names and salaries. 108 00:07:48,640 --> 00:07:49,240 Right? 109 00:07:50,230 --> 00:07:54,160 So in that case, you would select the relevant columns. 110 00:07:54,700 --> 00:07:59,440 So we're going to use let's actually use also use the. 111 00:08:00,460 --> 00:08:01,390 Names. 112 00:08:03,170 --> 00:08:04,400 Last names. 113 00:08:08,620 --> 00:08:10,870 And hire dates and salaries. 114 00:08:10,870 --> 00:08:11,170 Right. 115 00:08:11,170 --> 00:08:12,970 Let's use the for here. 116 00:08:13,330 --> 00:08:19,840 So instead of doing this, we will use the select select for the last name. 117 00:08:21,350 --> 00:08:23,660 Last name and then first name. 118 00:08:24,710 --> 00:08:26,510 And then salary. 119 00:08:27,710 --> 00:08:31,760 So actually before seller, let's print the higher date. 120 00:08:32,180 --> 00:08:35,540 Higher date from teachers. 121 00:08:36,800 --> 00:08:37,580 That's it. 122 00:08:37,610 --> 00:08:38,840 Now let's run. 123 00:08:39,570 --> 00:08:41,270 And here, as you can see, we have last name. 124 00:08:41,270 --> 00:08:42,290 First name, higher date. 125 00:08:42,320 --> 00:08:44,150 Oops, We forgot the salaries. 126 00:08:44,150 --> 00:08:44,660 Right? 127 00:08:44,840 --> 00:08:46,610 Salary. 128 00:08:48,340 --> 00:08:49,390 Office insert. 129 00:08:49,540 --> 00:08:50,030 Okay. 130 00:08:50,050 --> 00:08:51,970 Salary and teachers. 131 00:08:54,810 --> 00:08:56,040 Uh, from teachers. 132 00:08:58,490 --> 00:08:59,330 From teachers. 133 00:08:59,330 --> 00:09:00,020 That's it. 134 00:09:00,440 --> 00:09:04,040 As you can see, last name, first name, hire, date and salary here. 135 00:09:04,160 --> 00:09:10,070 So although these examples are basic, they illustrate a good strategy for beginning your interview 136 00:09:10,070 --> 00:09:10,480 with data. 137 00:09:10,730 --> 00:09:18,230 So generally it's wise to start your analysis by checking whether your data is present and in the format 138 00:09:18,230 --> 00:09:18,830 you expect. 139 00:09:18,830 --> 00:09:19,220 Right? 140 00:09:19,220 --> 00:09:26,810 So which is a task well-suited to this select keyword our data in a proper format, complete with a 141 00:09:26,810 --> 00:09:35,270 month, date and year or they entered as once regularly observed as text with the month and year old. 142 00:09:35,270 --> 00:09:43,700 So does every row have values in all these columns and there mysteriously no last names starting with 143 00:09:44,000 --> 00:09:46,070 letters beyond M Right. 144 00:09:46,070 --> 00:09:54,440 So all these issues indicate potential hazards ranging from missing data to other records keeping somewhere 145 00:09:54,440 --> 00:09:55,420 in the workflow. 146 00:09:55,430 --> 00:09:59,550 So we are only working with a table of six rows. 147 00:10:00,370 --> 00:10:06,460 But when you're facing a table of thousands or even millions of rows, it's essential to get a quick 148 00:10:06,460 --> 00:10:11,570 read on your data quality and range of values it contains. 149 00:10:11,590 --> 00:10:18,100 To do this, let's dig deeper and add several SQL keywords.