1 00:00:02,340 --> 00:00:04,262 So now let's talk about Amazon Athena. 2 00:00:04,262 --> 00:00:06,729 So Athena is a serverless query service 3 00:00:06,729 --> 00:00:08,490 to help you analyze the data 4 00:00:08,490 --> 00:00:10,980 stored in Amazon S3 buckets. 5 00:00:10,980 --> 00:00:12,248 And to analyze this data 6 00:00:12,248 --> 00:00:15,180 you're going to use the standard SQL language 7 00:00:15,180 --> 00:00:16,216 to query the files. 8 00:00:16,216 --> 00:00:19,320 Behind the scenes, Athena is built on the Presto engine, 9 00:00:19,320 --> 00:00:21,030 which uses the SQL language. 10 00:00:21,030 --> 00:00:21,863 So the idea is that users 11 00:00:21,863 --> 00:00:24,600 are going to load data into your S3 buckets, 12 00:00:24,600 --> 00:00:27,011 or you are going to load data into your S3 buckets, 13 00:00:27,011 --> 00:00:29,550 and then you would use the Athena service to query 14 00:00:29,550 --> 00:00:33,660 and analyze this data in Amazon S3 without moving it. 15 00:00:33,660 --> 00:00:34,920 So Athena is serverless, 16 00:00:34,920 --> 00:00:36,990 and it directly analyzes your data 17 00:00:36,990 --> 00:00:38,790 living in your S3 buckets. 18 00:00:38,790 --> 00:00:40,350 It supports different formats, 19 00:00:40,350 --> 00:00:42,384 such as CSV, JSON, ORC, 20 00:00:42,384 --> 00:00:43,217 Avro, 21 00:00:43,217 --> 00:00:44,310 and Parquet, 22 00:00:44,310 --> 00:00:45,660 and possibly others. 23 00:00:45,660 --> 00:00:47,520 And the pricing is very simple. 24 00:00:47,520 --> 00:00:49,680 You're just going to pay a fixed amount 25 00:00:49,680 --> 00:00:52,050 per terabyte of data scanned. 26 00:00:52,050 --> 00:00:53,757 You don't need to provision any database, 27 00:00:53,757 --> 00:00:56,580 again, because the whole service is serverless. 28 00:00:56,580 --> 00:00:59,640 So Athena is commonly used with another tool 29 00:00:59,640 --> 00:01:02,670 called Amazon QuickSight to create reports 30 00:01:02,670 --> 00:01:03,630 and dashboards.
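The pay-per-scan pricing described above can be sketched as a tiny cost model. The price per terabyte is a made-up placeholder for illustration, not a figure from the lecture; check the current AWS pricing page for your region.

```python
# Toy Athena cost model: you pay a flat price per terabyte of data scanned.
# PRICE_PER_TB is a hypothetical figure, not an official AWS price.
PRICE_PER_TB = 5.00

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of a query that scanned `bytes_scanned` bytes of S3 data."""
    return bytes_scanned / 10**12 * price_per_tb

print(query_cost(500 * 10**9))  # scanning half a terabyte -> 2.5
```

Note the cost depends only on bytes scanned, which is why the performance tips later in this lecture (columnar formats, compression, partitioning) are also cost-saving tips.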
31 00:01:03,630 --> 00:01:06,180 We'll see QuickSight later in depth, 32 00:01:06,180 --> 00:01:08,670 but Amazon QuickSight connects to Athena, 33 00:01:08,670 --> 00:01:10,535 which connects to your S3 buckets. 34 00:01:10,535 --> 00:01:13,440 Now the use cases for Amazon Athena 35 00:01:13,440 --> 00:01:15,939 are to do ad hoc queries, business intelligence, 36 00:01:15,939 --> 00:01:17,880 analytics, reporting, 37 00:01:17,880 --> 00:01:21,090 and to analyze and query any kind of logs 38 00:01:21,090 --> 00:01:24,199 that originate from your AWS services. 39 00:01:24,199 --> 00:01:26,370 So it could be your VPC Flow Logs, 40 00:01:26,370 --> 00:01:27,690 your load balancer logs, 41 00:01:27,690 --> 00:01:28,860 your CloudTrail trails, 42 00:01:28,860 --> 00:01:30,120 and so on. 43 00:01:30,120 --> 00:01:33,480 So as an exam tip, anytime you need to analyze data 44 00:01:33,480 --> 00:01:37,260 in Amazon S3 using a serverless SQL engine, 45 00:01:37,260 --> 00:01:38,823 you can think about Athena. 46 00:01:39,690 --> 00:01:41,940 Now I've talked about performance improvements, 47 00:01:41,940 --> 00:01:44,226 and you can actually improve Athena performance, 48 00:01:44,226 --> 00:01:47,365 and the exam will test you on this as well. 49 00:01:47,365 --> 00:01:50,640 So, first of all, because you pay for the amount 50 00:01:50,640 --> 00:01:53,206 of data scanned per terabyte, 51 00:01:53,206 --> 00:01:56,070 you want to use a data format 52 00:01:56,070 --> 00:01:58,290 where you're going to scan less data. 53 00:01:58,290 --> 00:02:01,290 And for this, you can use a columnar data format 54 00:02:01,290 --> 00:02:04,382 for cost savings, because you only scan the columns you need. 55 00:02:04,382 --> 00:02:08,741 So therefore, the recommended formats for Amazon Athena 56 00:02:08,741 --> 00:02:10,832 are going to be Apache Parquet 57 00:02:10,832 --> 00:02:11,820 and 58 00:02:11,820 --> 00:02:12,713 ORC.
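To make the columnar-format point concrete, here is a toy sketch of why scanning only the columns you need saves so much. The column names and byte sizes are made up for illustration; real Parquet and ORC internals (encoding, compression, row groups) are more involved.

```python
# Hypothetical per-row byte sizes for each column of a table.
row_size = {"id": 8, "name": 32, "payload": 1024, "year": 4}

def bytes_scanned_row_format(n_rows):
    # A row-oriented format (e.g. CSV) must read every column of every row.
    return n_rows * sum(row_size.values())

def bytes_scanned_columnar(n_rows, columns):
    # A columnar format only reads the columns the query touches.
    return n_rows * sum(row_size[c] for c in columns)

n = 1_000_000
print(bytes_scanned_row_format(n))          # 1068000000 bytes (~1 GB)
print(bytes_scanned_columnar(n, ["year"]))  # 4000000 bytes (~4 MB)
```

A query filtering only on `year` scans hundreds of times less data in the columnar layout, which directly lowers the per-terabyte-scanned bill.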
59 00:02:12,713 --> 00:02:15,780 And it's going to give you a huge performance improvement, 60 00:02:15,780 --> 00:02:18,900 and to get your files into the Apache Parquet 61 00:02:18,900 --> 00:02:21,261 or ORC formats, you can use a service 62 00:02:21,261 --> 00:02:22,954 that we'll see as well in this section. 63 00:02:22,954 --> 00:02:25,980 For example, Glue. Glue can be very helpful 64 00:02:25,980 --> 00:02:29,490 to convert your data as an ETL job between, 65 00:02:29,490 --> 00:02:30,544 for example, CSV 66 00:02:30,544 --> 00:02:32,670 and Parquet. 67 00:02:32,670 --> 00:02:35,321 Now also, because we want to scan less data, 68 00:02:35,321 --> 00:02:38,160 we need to compress data for smaller retrievals. 69 00:02:38,160 --> 00:02:39,946 So there are different compression mechanisms 70 00:02:39,946 --> 00:02:43,680 you can use, which I have listed right here. 71 00:02:43,680 --> 00:02:46,230 Next, if you know you're going to query 72 00:02:46,230 --> 00:02:48,563 all the time on some specific columns, 73 00:02:48,563 --> 00:02:50,940 you can partition your data sets, 74 00:02:50,940 --> 00:02:53,862 and partitioning data sets means that in your S3 buckets 75 00:02:53,862 --> 00:02:56,771 you're going to have the full path with slashes, 76 00:02:56,771 --> 00:02:59,840 and each slash will be a different column name 77 00:02:59,840 --> 00:03:01,480 with a specific value. 78 00:03:01,480 --> 00:03:02,741 And so you're organizing, 79 00:03:02,741 --> 00:03:05,059 you're partitioning your data in Amazon S3, 80 00:03:05,059 --> 00:03:07,410 so that when you query it you know exactly 81 00:03:07,410 --> 00:03:09,780 in which folder, at which path in Amazon S3, 82 00:03:09,780 --> 00:03:11,400 you need to scan for data. 83 00:03:11,400 --> 00:03:13,740 So here's an example of data partitions. 84 00:03:13,740 --> 00:03:16,382 So we have flight data in Parquet format, 85 00:03:16,382 --> 00:03:20,070 and then we do slash year equals 1991.
86 00:03:20,070 --> 00:03:21,274 So we partition by year, 87 00:03:21,274 --> 00:03:24,000 and we'll have one folder for each year. 88 00:03:24,000 --> 00:03:26,370 Then within each year, we'll have months. 89 00:03:26,370 --> 00:03:27,480 So month equals one, 90 00:03:27,480 --> 00:03:29,370 and within each month, we'll have days. 91 00:03:29,370 --> 00:03:30,210 Day equals one. 92 00:03:30,210 --> 00:03:33,090 And so when I do a query on Athena 93 00:03:33,090 --> 00:03:35,250 and I filter for a specific year, 94 00:03:35,250 --> 00:03:36,300 a specific month, 95 00:03:36,300 --> 00:03:39,510 and a specific day, then it will know exactly 96 00:03:39,510 --> 00:03:42,570 which folder in Amazon S3 to get the data from, 97 00:03:42,570 --> 00:03:45,227 and therefore it will only retrieve a subset of the data, 98 00:03:45,227 --> 00:03:48,660 so we'll get really, really good performance thanks to partitioning. 99 00:03:48,660 --> 00:03:50,154 Finally, the last performance improvement 100 00:03:50,154 --> 00:03:53,520 you can do is to use bigger files 101 00:03:53,520 --> 00:03:54,600 to minimize the overhead. 102 00:03:54,600 --> 00:03:58,230 So if you have many, many, many small files in Amazon S3, 103 00:03:58,230 --> 00:04:00,150 Athena is not going to be as performant 104 00:04:00,150 --> 00:04:03,690 as if you had larger files, for example, 128 megabytes 105 00:04:03,690 --> 00:04:04,590 and over, 106 00:04:04,590 --> 00:04:06,229 because larger files are easier to scan 107 00:04:06,229 --> 00:04:08,000 and easier to retrieve. 108 00:04:08,000 --> 00:04:09,800 Okay, that's it for this lecture. 109 00:04:09,800 --> 00:04:10,742 I hope you liked it, 110 00:04:10,742 --> 00:04:12,873 and I will see you in the next lecture.
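The slash-per-column layout walked through above (year, then month, then day) is Hive-style partitioning, and a small sketch can show what those S3 paths look like. The bucket and prefix names here are hypothetical, chosen only to mirror the flight-data example:

```python
# Build a Hive-style partitioned S3 path: each segment after the prefix
# is column=value, matching the layout described in the lecture.
def partition_path(bucket, prefix, **partitions):
    segments = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"s3://{bucket}/{prefix}/{segments}/"

# Hypothetical bucket and prefix, mirroring the flight-data example:
print(partition_path("my-bucket", "flight-data-parquet",
                     year=1991, month=1, day=1))
# s3://my-bucket/flight-data-parquet/year=1991/month=1/day=1/
```

Because the partition columns and values are encoded in the path itself, a query filtering on `year`, `month`, and `day` lets Athena skip every folder that cannot match, so only that subset of objects is scanned.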