1 00:00:02,340 --> 00:00:04,262 So now let's talk about Amazon Athena. 2 00:00:04,262 --> 00:00:06,729 So Athena is a serverless query service 3 00:00:06,729 --> 00:00:08,490 to help you analyze the data 4 00:00:08,490 --> 00:00:10,980 stored in Amazon S3 buckets. 5 00:00:10,980 --> 00:00:12,248 And to analyze this data 6 00:00:12,248 --> 00:00:15,180 you're going to use the standard SQL language 7 00:00:15,180 --> 00:00:16,216 to query the files. 8 00:00:16,216 --> 00:00:19,320 Behind the scenes, Athena is built on the Presto engine, 9 00:00:19,320 --> 00:00:21,030 which uses the SQL language. 10 00:00:21,030 --> 00:00:21,863 So the idea is that users 11 00:00:21,863 --> 00:00:24,600 are going to load data into your S3 buckets, 12 00:00:24,600 --> 00:00:27,011 or you are going to load data into your S3 buckets, 13 00:00:27,011 --> 00:00:29,550 and then you would use the Athena service to query 14 00:00:29,550 --> 00:00:33,660 and analyze this data in Amazon S3 without moving it. 15 00:00:33,660 --> 00:00:34,920 So Athena is serverless, 16 00:00:34,920 --> 00:00:36,990 and it directly analyzes your data 17 00:00:36,990 --> 00:00:38,790 living in your S3 buckets. 18 00:00:38,790 --> 00:00:40,350 It supports different formats, 19 00:00:40,350 --> 00:00:42,384 such as CSV, JSON, ORC, 20 00:00:42,384 --> 00:00:43,217 Avro, 21 00:00:43,217 --> 00:00:44,310 and Parquet, 22 00:00:44,310 --> 00:00:45,660 and possibly others. 23 00:00:45,660 --> 00:00:47,520 And the pricing is very simple. 24 00:00:47,520 --> 00:00:49,680 You're just going to pay a fixed amount 25 00:00:49,680 --> 00:00:52,050 per terabyte of data scanned. 26 00:00:52,050 --> 00:00:53,757 You don't need to provision any database, 27 00:00:53,757 --> 00:00:56,580 again, because the whole service is serverless. 28 00:00:56,580 --> 00:00:59,640 So Athena is commonly used with another tool 29 00:00:59,640 --> 00:01:02,670 called Amazon QuickSight to create reports 30 00:01:02,670 --> 00:01:03,630 and dashboards.
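The pay-per-scan pricing described above can be sketched as a tiny cost model. The price per terabyte is a made-up placeholder for illustration, not a figure from the lecture; check the current AWS pricing page for your region.

```python
# Toy Athena cost model: you pay a flat price per terabyte of data scanned.
# PRICE_PER_TB is a hypothetical figure, not an official AWS price.
PRICE_PER_TB = 5.00

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of a query that scanned `bytes_scanned` bytes of S3 data."""
    return bytes_scanned / 10**12 * price_per_tb

print(query_cost(500 * 10**9))  # scanning half a terabyte -> 2.5
```

Note the cost depends only on bytes scanned, which is why the performance tips later in this lecture (columnar formats, compression, partitioning) are also cost-saving tips.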
31 00:01:03,630 --> 00:01:06,180 We'll see QuickSight later in depth, 32 00:01:06,180 --> 00:01:08,670 but Amazon QuickSight connects to Athena, 33 00:01:08,670 --> 00:01:10,535 which connects to your S3 buckets. 34 00:01:10,535 --> 00:01:13,440 Now the use cases for Amazon Athena 35 00:01:13,440 --> 00:01:15,939 are to do ad hoc queries, business intelligence, 36 00:01:15,939 --> 00:01:17,880 analytics, reporting, 37 00:01:17,880 --> 00:01:21,090 and to analyze and query any kind of logs 38 00:01:21,090 --> 00:01:24,199 that originate from your AWS services. 39 00:01:24,199 --> 00:01:26,370 So it could be your VPC Flow Logs, 40 00:01:26,370 --> 00:01:27,690 your load balancer logs, 41 00:01:27,690 --> 00:01:28,860 your CloudTrail trails, 42 00:01:28,860 --> 00:01:30,120 and so on. 43 00:01:30,120 --> 00:01:33,480 So as an exam tip, anytime you need to analyze data 44 00:01:33,480 --> 00:01:37,260 in Amazon S3 using a serverless SQL engine, 45 00:01:37,260 --> 00:01:38,823 you can think about Athena. 46 00:01:39,690 --> 00:01:41,940 Now I've talked about performance improvements, 47 00:01:41,940 --> 00:01:44,226 and you can actually improve Athena performance, 48 00:01:44,226 --> 00:01:47,365 and the exam will test you on this as well. 49 00:01:47,365 --> 00:01:50,640 So, first of all, because you pay for the amount 50 00:01:50,640 --> 00:01:53,206 of data scanned per terabyte, 51 00:01:53,206 --> 00:01:56,070 you want to use a data format 52 00:01:56,070 --> 00:01:58,290 where you're going to scan less data. 53 00:01:58,290 --> 00:02:01,290 And for this, you can use a columnar data format 54 00:02:01,290 --> 00:02:04,382 for cost savings, because you only scan the columns you need. 55 00:02:04,382 --> 00:02:08,741 So therefore, the recommended formats for Amazon Athena 56 00:02:08,741 --> 00:02:10,832 are going to be Apache Parquet 57 00:02:10,832 --> 00:02:11,820 and 58 00:02:11,820 --> 00:02:12,713 ORC.
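To make the columnar-format point concrete, here is a toy sketch of why scanning only the columns you need saves so much. The column names and byte sizes are made up for illustration; real Parquet and ORC internals (encoding, compression, row groups) are more involved.

```python
# Hypothetical per-row byte sizes for each column of a table.
row_size = {"id": 8, "name": 32, "payload": 1024, "year": 4}

def bytes_scanned_row_format(n_rows):
    # A row-oriented format (e.g. CSV) must read every column of every row.
    return n_rows * sum(row_size.values())

def bytes_scanned_columnar(n_rows, columns):
    # A columnar format only reads the columns the query touches.
    return n_rows * sum(row_size[c] for c in columns)

n = 1_000_000
print(bytes_scanned_row_format(n))          # 1068000000 bytes (~1 GB)
print(bytes_scanned_columnar(n, ["year"]))  # 4000000 bytes (~4 MB)
```

A query filtering only on `year` scans hundreds of times less data in the columnar layout, which directly lowers the per-terabyte-scanned bill.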
59 00:02:12,713 --> 00:02:15,780 And it's going to give you a huge performance improvement, 60 00:02:15,780 --> 00:02:18,900 and to get your files into the Apache Parquet 61 00:02:18,900 --> 00:02:21,261 or ORC formats, you can use a service 62 00:02:21,261 --> 00:02:22,954 that we'll see as well in this section. 63 00:02:22,954 --> 00:02:25,980 For example, Glue. Glue can be very helpful 64 00:02:25,980 --> 00:02:29,490 to convert your data as an ETL job between, 65 00:02:29,490 --> 00:02:30,544 for example, CSV 66 00:02:30,544 --> 00:02:32,670 and Parquet. 67 00:02:32,670 --> 00:02:35,321 Now also, because we want to scan less data, 68 00:02:35,321 --> 00:02:38,160 we need to compress data for smaller retrievals. 69 00:02:38,160 --> 00:02:39,946 So there are different compression mechanisms 70 00:02:39,946 --> 00:02:43,680 you can use, which I have listed right here. 71 00:02:43,680 --> 00:02:46,230 Next, if you know you're going to query 72 00:02:46,230 --> 00:02:48,563 all the time on some specific columns, 73 00:02:48,563 --> 00:02:50,940 you can partition your data sets, 74 00:02:50,940 --> 00:02:53,862 and partitioning data sets means that in your S3 buckets 75 00:02:53,862 --> 00:02:56,771 you're going to have the full path with slashes, 76 00:02:56,771 --> 00:02:59,840 and each slash will be a different column name 77 00:02:59,840 --> 00:03:01,480 with a specific value. 78 00:03:01,480 --> 00:03:02,741 And so you're organizing, 79 00:03:02,741 --> 00:03:05,059 you're partitioning your data in Amazon S3, 80 00:03:05,059 --> 00:03:07,410 so that when you query it you know exactly 81 00:03:07,410 --> 00:03:09,780 in which folder, at which path in Amazon S3, 82 00:03:09,780 --> 00:03:11,400 you need to scan for data. 83 00:03:11,400 --> 00:03:13,740 So here's an example of data partitions. 84 00:03:13,740 --> 00:03:16,382 So we have flight data in Parquet format, 85 00:03:16,382 --> 00:03:20,070 and then we do slash year equals 1991.
86 00:03:20,070 --> 00:03:21,274 So we partition by year, 87 00:03:21,274 --> 00:03:24,000 and we'll have one folder for each year. 88 00:03:24,000 --> 00:03:26,370 Then within each year, we'll have months. 89 00:03:26,370 --> 00:03:27,480 So month equals one, 90 00:03:27,480 --> 00:03:29,370 and within each month, we'll have days. 91 00:03:29,370 --> 00:03:30,210 Day equals one. 92 00:03:30,210 --> 00:03:33,090 And so when I do a query on Athena 93 00:03:33,090 --> 00:03:35,250 and I filter for a specific year, 94 00:03:35,250 --> 00:03:36,300 a specific month, 95 00:03:36,300 --> 00:03:39,510 and a specific day, then it will know exactly 96 00:03:39,510 --> 00:03:42,570 which folder in Amazon S3 to get the data from, 97 00:03:42,570 --> 00:03:45,227 and therefore it will only retrieve a subset of the data, 98 00:03:45,227 --> 00:03:48,660 so we'll get really, really good performance thanks to partitioning. 99 00:03:48,660 --> 00:03:50,154 Finally, the last performance improvement 100 00:03:50,154 --> 00:03:53,520 you can do is to use bigger files 101 00:03:53,520 --> 00:03:54,600 to minimize the overhead. 102 00:03:54,600 --> 00:03:58,230 So if you have many, many, many small files in Amazon S3, 103 00:03:58,230 --> 00:04:00,150 Athena is not going to be as performant 104 00:04:00,150 --> 00:04:03,690 as if you had larger files, for example, 128 megabytes 105 00:04:03,690 --> 00:04:04,590 and over, 106 00:04:04,590 --> 00:04:06,229 because larger files are easier to scan 107 00:04:06,229 --> 00:04:08,000 and easier to retrieve. 108 00:04:08,000 --> 00:04:09,800 Okay, that's it for this lecture. 109 00:04:09,800 --> 00:04:10,742 I hope you liked it, 110 00:04:10,742 --> 00:04:12,873 and I will see you in the next lecture.
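The slash-per-column layout walked through above (year, then month, then day) is Hive-style partitioning, and a small sketch can show what those S3 paths look like. The bucket and prefix names here are hypothetical, chosen only to mirror the flight-data example:

```python
# Build a Hive-style partitioned S3 path: each segment after the prefix
# is column=value, matching the layout described in the lecture.
def partition_path(bucket, prefix, **partitions):
    segments = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"s3://{bucket}/{prefix}/{segments}/"

# Hypothetical bucket and prefix, mirroring the flight-data example:
print(partition_path("my-bucket", "flight-data-parquet",
                     year=1991, month=1, day=1))
# s3://my-bucket/flight-data-parquet/year=1991/month=1/day=1/
```

Because the partition columns and values are encoded in the path itself, a query filtering on `year`, `month`, and `day` lets Athena skip every folder that cannot match, so only that subset of objects is scanned.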