So now let's talk about Amazon Redshift, which is a database, but also an analytics engine. Redshift is based on PostgreSQL technology, but unlike Postgres, it's not used for online transaction processing (OLTP). It's an OLAP type of database, which means online analytical processing, and it's used for analytics and data warehousing. It has 10x better performance than other data warehouses out there, and it scales to petabytes and petabytes of data. The idea is that you load all your data into Redshift, and then you can analyze it very quickly from within Redshift.

Redshift gets its performance improvements because it uses columnar storage of data instead of row-based storage, and it has a parallel query engine. You pay as you go for the instances you provision in your Redshift cluster, and to perform your queries, you can directly use SQL statements. Business intelligence tools such as Amazon QuickSight, or other ones such as Tableau, integrate with Redshift.

Now, if you had to compare Redshift and Athena: with Redshift, you first have to load the data, for example from Amazon S3, into Redshift. But once it's loaded, Redshift is going to have much faster queries than Athena, as well as much faster joins and aggregations, because Redshift has something that Athena does not have: Redshift has indexes, and it builds these indexes to get very high performance as a data warehouse. So if it's just an ad hoc query on Amazon S3, then Athena is going to be a great use case. But if it's intense data warehousing with many complicated queries, with joins, aggregations and so on, then Redshift is going to be a better candidate.
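To make the SQL querying part concrete, here is a minimal sketch of submitting an analytical query to a Redshift cluster from Python. It assumes the open-source redshift_connector driver; the cluster endpoint, credentials, and the sales/users tables are all placeholders for illustration, not anything from this lecture:

    import redshift_connector  # Amazon's open-source Python driver for Redshift

    # Connect to the cluster endpoint; the leader node receives the query.
    # Host, database, and credentials below are placeholders.
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="dev",
        port=5439,
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # A typical OLAP-style query: a join plus an aggregation,
    # against hypothetical sales and users tables.
    cursor.execute("""
        SELECT u.country, SUM(s.amount) AS total_sales
        FROM sales s
        JOIN users u ON s.user_id = u.id
        GROUP BY u.country
        ORDER BY total_sales DESC
        LIMIT 10;
    """)
    for country, total in cursor.fetchall():
        print(country, total)
    conn.close()

This join-plus-aggregation workload is exactly the kind of query where Redshift is a better fit than Athena.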
So your Redshift cluster has two kinds of nodes. It has a leader node, which does query planning and results aggregation, and it has compute nodes, which actually perform the queries and send the results back to the leader. And because it's a provisioned Redshift cluster, you have to choose the node size in advance, and if you want to save on cost, you can use reserved instances. So your Redshift cluster has a leader node and some compute nodes, you submit a query in SQL form to the leader node, and the query is executed in the backend. So this is it for an overview of Redshift.

Now let's talk about snapshots and disaster recovery in Redshift. Redshift has no multi-AZ mode, okay? Your whole cluster is in one availability zone. So if you want a disaster recovery strategy for Redshift, you need to take snapshots. Snapshots are point-in-time backups of a cluster, and they are stored internally in Amazon S3. They are incremental: only what has changed is saved, and that obviously saves you a lot of space. You can restore a snapshot into a new Redshift cluster, and you have two modes for snapshots: you can take them manually, or you can take them in an automated way. Automated snapshots happen every eight hours, every 5 GB of data change, or on a schedule you define, and you can set a retention period for them. If you take a manual snapshot, it is retained until you delete it yourself.

And a really, really cool feature of Redshift is that you can configure it to automatically copy snapshots of the cluster, whether they're automated or manual, into another AWS region, hence giving you a disaster recovery strategy. So take this Redshift cluster, your original cluster, and then we have another region: we take snapshots, they are automatically copied into the other region, and from there you can restore a new Redshift cluster from the copied snapshots.
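As a sketch of these snapshot features, here is how you might take a manual snapshot and enable cross-region snapshot copy using boto3; the cluster name, snapshot name, and regions are placeholders:

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Manual snapshot: retained until you delete it yourself.
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="my-cluster-manual-snapshot",  # placeholder name
        ClusterIdentifier="my-redshift-cluster",          # placeholder cluster
    )

    # Automatically copy snapshots of the cluster into another region,
    # giving you the disaster recovery setup described above.
    redshift.enable_snapshot_copy(
        ClusterIdentifier="my-redshift-cluster",
        DestinationRegion="us-west-2",
        RetentionPeriod=7,  # days to keep copied automated snapshots
    )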
Now let's talk about how we ingest data into Redshift. There are three ways I'm going to describe to you.

The first one is to use Amazon Kinesis Data Firehose. Firehose receives data from different sources and sends it into Redshift. To do so, it first writes the data into an Amazon S3 bucket, and then it automatically issues a COPY command to load that data from S3 into Redshift.

So how does this COPY command work? We can also use it manually. You load data into S3, and then you issue a COPY command directly in Redshift to copy the data from the S3 bucket into your Amazon Redshift cluster, using an IAM role. And there are two network paths for doing so. The traffic might go through the internet, because your S3 bucket is reachable through the internet (the bucket isn't public, but it is connected to the internet), so the data flows through the internet back into your Redshift cluster; this is without enhanced VPC routing. But if you want all your network traffic to remain private inside your virtual private cloud, then you can enable enhanced VPC routing to have all the data flow entirely through the VPC.

And finally, if you want to insert data into the Redshift cluster using a JDBC driver, you can do so as well. For example, if you have an application on an EC2 instance that needs to write data into your Redshift cluster, you would use this method. In that case, it is much better to write large batches of data into Amazon Redshift instead of one row at a time, which would be truly inefficient for this type of database.
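To make the COPY command concrete, here is a minimal sketch issued through the same Python driver as before; the bucket, table, and IAM role ARN are placeholders, and the files are assumed to be CSV:

    import redshift_connector

    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        database="dev",
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # COPY loads the files from S3 into a Redshift table in parallel,
    # authenticating through an IAM role attached to the cluster.
    cursor.execute("""
        COPY sales
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV;
    """)
    conn.commit()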
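And as a sketch of the "large batches, not single rows" advice for the driver-based path, here is a batched insert; the table and columns are again hypothetical:

    import redshift_connector

    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        database="dev",
        user="awsuser",
        password="my_password",
    )
    cursor = conn.cursor()

    # One batched call instead of thousands of single-row INSERTs,
    # which would be truly inefficient on a columnar warehouse.
    rows = [(1, "alice", 120.0), (2, "bob", 80.5), (3, "carol", 42.0)]
    cursor.executemany(
        "INSERT INTO sales (user_id, user_name, amount) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()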
Now, a cool feature of Redshift is Redshift Spectrum. The idea is that you have data in Amazon S3 and you want to analyze it using Redshift, but you don't want to load it into Redshift first, and on top of that, you want to use a lot more processing power. So you use Redshift Spectrum. You must have a Redshift cluster already available to start the query, and once you start the query, it is submitted to thousands of Redshift Spectrum nodes that will perform the query on your data in S3.

So let's go through an example. You have your Redshift cluster with a leader node and a bunch of compute nodes, as we've seen, and the data you want to analyze is in Amazon S3. We run a query on our Redshift cluster, and the table we want to query lives in S3, so the FROM clause references an S3-backed external table: something like "FROM s3.sales", and then whatever you want. In that case, Spectrum launches automatically, and the query is submitted to thousands of Redshift Spectrum nodes, which read the data from Amazon S3 and perform the aggregations. When they're done, they send the results back to your own Amazon Redshift cluster, and from there the results get back to whoever initiated the query. With this feature, we can leverage a lot more processing power than what we have provisioned in our cluster, without having to load the data from Amazon S3 into Redshift in the first place.

So that's it for Amazon Redshift. I hope you liked it, and I will see you in the next lecture.