So now let's talk about Amazon Redshift, which is a database, but also an analytics engine. Redshift is based on PostgreSQL technology, but unlike Postgres, it's not used for online transaction processing (OLTP). It's an OLAP type of database, which means online analytical processing, and it's used for analytics and data warehousing. It has 10x better performance than other data warehouses out there, and it scales to petabytes and petabytes of data. The idea is that you load all your data into Redshift, and then you can analyze it very quickly from within Redshift.

Redshift gets its performance improvements because it uses columnar storage of data instead of row-based storage, and it has a parallel query engine. You pay as you go for the instances you provision in your Redshift cluster, and to perform your queries, you can directly use SQL statements. Business intelligence tools such as Amazon QuickSight, or other ones such as Tableau, integrate with Redshift.

Now, if you had to compare Redshift and Athena: with Redshift, you first have to load the data, for example from Amazon S3, into Redshift. But once it's loaded, Redshift is going to have much faster queries than Athena, as well as much faster joins and aggregations, because Redshift has something that Athena does not have: Redshift has indexes, and it builds these indexes to get very high performance as a data warehouse. So if it's just an ad hoc query on Amazon S3, then Athena is going to be a great use case. But if it's intense data warehousing with many complicated queries, with joins, aggregations and so on, then Redshift is going to be a better candidate.
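To make the SQL querying part concrete, here is a minimal sketch of submitting an analytical query to a Redshift cluster from Python. It assumes the open-source redshift_connector driver; the cluster endpoint, credentials, and the sales/users tables are all placeholders for illustration, not anything from this lecture:

    import redshift_connector  # Amazon's open-source Python driver for Redshift

    # Connect to the cluster endpoint; the leader node receives the query.
    # Host, database, and credentials below are placeholders.
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="dev",
        port=5439,
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # A typical OLAP-style query: a join plus an aggregation,
    # against hypothetical sales and users tables.
    cursor.execute("""
        SELECT u.country, SUM(s.amount) AS total_sales
        FROM sales s
        JOIN users u ON s.user_id = u.id
        GROUP BY u.country
        ORDER BY total_sales DESC
        LIMIT 10;
    """)
    for country, total in cursor.fetchall():
        print(country, total)
    conn.close()

This join-plus-aggregation workload is exactly the kind of query where Redshift is a better fit than Athena.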
So your Redshift cluster has two kinds of nodes. It has a leader node, which does query planning and results aggregation, and it has compute nodes, which actually perform the queries and send the results back to the leader. And because it's a provisioned Redshift cluster, you have to choose the node size in advance, and if you want to save on cost, you can use reserved instances. So your Redshift cluster has a leader node and some compute nodes, you submit a query in SQL form to the leader node, and the query is executed in the backend. So this is it for an overview of Redshift.

Now let's talk about snapshots and disaster recovery in Redshift. Redshift has no multi-AZ mode, okay? Your whole cluster is in one availability zone. So if you want a disaster recovery strategy for Redshift, you need to take snapshots. Snapshots are point-in-time backups of a cluster, and they are stored internally in Amazon S3. They are incremental: only what has changed is saved, and that obviously saves you a lot of space. You can restore a snapshot into a new Redshift cluster, and you have two modes for snapshots: you can take them manually, or you can take them in an automated way. Automated snapshots happen every eight hours, every 5 GB of data change, or on a schedule you define, and you can set a retention period for them. If you take a manual snapshot, it is retained until you delete it yourself.

And a really, really cool feature of Redshift is that you can configure it to automatically copy snapshots of the cluster, whether they're automated or manual, into another AWS region, hence giving you a disaster recovery strategy. So take this Redshift cluster, your original cluster, and then we have another region: we take snapshots, they are automatically copied into the other region, and from there you can restore a new Redshift cluster from the copied snapshots.
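As a sketch of these snapshot features, here is how you might take a manual snapshot and enable cross-region snapshot copy using boto3; the cluster name, snapshot name, and regions are placeholders:

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Manual snapshot: retained until you delete it yourself.
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="my-cluster-manual-snapshot",  # placeholder name
        ClusterIdentifier="my-redshift-cluster",          # placeholder cluster
    )

    # Automatically copy snapshots of the cluster into another region,
    # giving you the disaster recovery setup described above.
    redshift.enable_snapshot_copy(
        ClusterIdentifier="my-redshift-cluster",
        DestinationRegion="us-west-2",
        RetentionPeriod=7,  # days to keep copied automated snapshots
    )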
Now let's talk about how we ingest data into Redshift. There are three ways I'm going to describe to you.

The first one is to use Amazon Kinesis Data Firehose. Firehose receives data from different sources and sends it into Redshift. To do so, it first writes the data into an Amazon S3 bucket, and then it automatically issues a COPY command to load that data from S3 into Redshift.

So how does this COPY command work? We can also use it manually. You load data into S3, and then you issue a COPY command directly in Redshift to copy the data from the S3 bucket into your Amazon Redshift cluster, using an IAM role. And there are two network paths for doing so. The traffic might go through the internet, because your S3 bucket is reachable through the internet (the bucket isn't public, but it is connected to the internet), so the data flows through the internet back into your Redshift cluster; this is without enhanced VPC routing. But if you want all your network traffic to remain private inside your virtual private cloud, then you can enable enhanced VPC routing to have all the data flow entirely through the VPC.

And finally, if you want to insert data into the Redshift cluster using a JDBC driver, you can do so as well. For example, if you have an application on an EC2 instance that needs to write data into your Redshift cluster, you would use this method. In that case, it is much better to write large batches of data into Amazon Redshift instead of one row at a time, which would be truly inefficient for this type of database.
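To make the COPY command concrete, here is a minimal sketch issued through the same Python driver as before; the bucket, table, and IAM role ARN are placeholders, and the files are assumed to be CSV:

    import redshift_connector

    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        database="dev",
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # COPY loads the files from S3 into a Redshift table in parallel,
    # authenticating through an IAM role attached to the cluster.
    cursor.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV;
    """)
    conn.commit()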
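And as a sketch of the "large batches, not single rows" advice for the driver-based path, here is a batched insert; the table and columns are again hypothetical:

    import redshift_connector

    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        database="dev",
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # One batched call instead of thousands of single-row INSERTs,
    # which would be truly inefficient on a columnar warehouse.
    rows = [(1, "alice", 120.0), (2, "bob", 80.5), (3, "carol", 42.0)]
    cursor.executemany(
        "INSERT INTO sales (user_id, user_name, amount) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()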
Now, a cool feature of Redshift is Redshift Spectrum. The idea is that you have data in Amazon S3 and you want to analyze it using Redshift, but you don't want to load it into Redshift first, and on top of that, you want to use a lot more processing power. So you use Redshift Spectrum. You must have a Redshift cluster already available to start the query, and once you start the query, it is submitted to thousands of Redshift Spectrum nodes that will perform the query on your data in S3.

So let's go through an example. You have your Redshift cluster with a leader node and a bunch of compute nodes, as we've seen, and the data you want to analyze is in Amazon S3. We run a query on our Redshift cluster, and the table we want to query lives in S3, so the FROM clause references an S3-backed external table: something like "FROM s3.sales", and then whatever you want. In that case, Spectrum launches automatically, and the query is submitted to thousands of Redshift Spectrum nodes, which read the data from Amazon S3 and perform the aggregations. When they're done, they send the results back to your own Amazon Redshift cluster, and from there the results get back to whoever initiated the query. With this feature, we can leverage a lot more processing power than what we have provisioned in our cluster, without having to load the data from Amazon S3 into Redshift in the first place.

So that's it for Amazon Redshift. I hope you liked it, and I will see you in the next lecture.