Welcome back to Backspace Academy. One area of AWS that has grown rapidly over the last couple of years is that of big data. In this lecture I'll go through the big data services that are available with AWS. I'll talk about Amazon Redshift, the data warehouse service; Amazon Elastic MapReduce, the Hadoop service; the Elasticsearch Service, which provides search engine capabilities; Amazon QuickSight, which is a relatively new service for business intelligence reporting of big data; Amazon Machine Learning for predictive analysis of data; and Amazon Kinesis for analysis of real-time streams.

The big data services available to us in database and storage are Redshift, DynamoDB, Amazon S3 and, of course, RDS. In analysis we have Elastic MapReduce, the Hadoop framework as a service; the Elasticsearch Service, a search engine as a service; QuickSight for business intelligence reporting; Amazon Machine Learning for predictive analysis of data; and AWS Lambda for compute capacity. For analysis of real-time data we have Kinesis streams, and we also have third-party applications that can be launched on EC2 instances.

Amazon Redshift is a petabyte-scale data warehousing service. It is based on the PostgreSQL database engine, so it is accessible by standard business intelligence reporting tools: whatever you are already using to collect data from a PostgreSQL database, such as Tableau or BIRT, you can still use with Redshift. All data is replicated between nodes in a cluster, and it is continuously backed up to Amazon S3 with snapshots; those snapshots can be held for 1 to 35 days, as is the case with RDS. User-initiated DB snapshots are retained upon cluster deletion, and Redshift also allows for quick recovery from those snapshots.

Now, it's one thing to have a big data warehouse, but we need to be able to get the big data we've already got into that warehouse, so we need to think carefully about the migration services that are available for us to use. First off, we've got database snapshots to S3: we could create an RDS snapshot of our database and then use that snapshot to create a replica of that database. We have the AWS Database Migration Service, which allows us to transfer data from a source database through to a target database over a connection. We also have AWS Data Pipeline, which again allows us to create a pipeline job that can transfer and, if needed, convert data from one data source through to a target database.
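Just to make that point about Redshift being PostgreSQL-compatible a bit more concrete, here is a minimal sketch of querying a cluster with a standard PostgreSQL driver. The endpoint, credentials, database and table names are placeholders I've made up for illustration, not values from this course; the only Redshift-specific detail is its default port, 5439.

```python
# Minimal sketch: querying Redshift through a standard PostgreSQL driver.
# Endpoint, database, credentials and table are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,                # Redshift's default port
    dbname="analytics",
    user="awsuser",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # Any PostgreSQL-compatible client or BI tool can issue SQL like this.
    cur.execute("SELECT event_date, COUNT(*) FROM page_views GROUP BY event_date LIMIT 10;")
    for row in cur.fetchall():
        print(row)

conn.close()
```

This is exactly why tools like Tableau and BIRT work unchanged: they talk to Redshift over the same PostgreSQL/JDBC/ODBC path.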
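And as a rough illustration of how the Database Migration Service is driven, here's a hedged boto3 sketch that creates and starts a full-load replication task between a source and a target endpoint. All of the ARNs and the table-mapping rule are assumptions: they would come from endpoints and a replication instance you had already set up in DMS.

```python
# Hedged sketch: kicking off an AWS DMS replication task with boto3.
# The ARNs below are placeholders for resources created beforehand
# (source endpoint, target endpoint, replication instance).
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

task = dms.create_replication_task(
    ReplicationTaskIdentifier="example-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",          # one-off copy; "cdc" adds ongoing replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```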
And finally, we have really the best solution for migrating very large amounts of data, and that is to use AWS Snowball and import to Amazon S3 from the Snowball device. As we explained in the introduction to this course, you can have an AWS Snowball device sent out to you on premises, upload your large amounts of data to that device (or devices, if you've got more than one Snowball can handle), and then send it back to AWS, and they will upload it for you into an S3 bucket. All you need to worry about then is getting the data from your S3 bucket into your target database, and that is where AWS Data Pipeline comes in. You cannot do that with the Database Migration Service, which can't go from S3 to a database, but you can create a job in Data Pipeline to import from S3 into your database.

The reason we would want to use Snowball is shown in the table: even on a gigabit connection with high network utilization, transferring one hundred terabytes of data across to AWS would take around 12 days (100 TB is 8 x 10^14 bits; at roughly 80% utilization of a 1 Gbps link, that is about 10^6 seconds, or a little under 12 days). So we would want to consider AWS Snowball when we have 60 terabytes or more of data to transfer. When we start looking at lower-quality connections, which would take around a year to upload that much data, we certainly want to look at using Snowball for 2 terabytes or more of data.

Elastic MapReduce is AWS's fully managed Hadoop service. Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of compute instances. The AWS EMR service provides clusters of EC2 instances that analyze our data, and those clusters can be automatically terminated upon task completion, so when the job's done we can have those clusters shut down, which is what we want. It's not suitable for small data sets, and it's not suitable for ACID transaction requirements. The data processing frameworks available are Hadoop MapReduce and Apache Spark, which simplifies the process of creating MapReduce functions. The storage options available to us are the Hadoop Distributed File System (HDFS), which is ephemeral storage; the EMR File System, or EMRFS, which allows you to access Amazon S3 as a file system for Elastic MapReduce; and the local file system of the EC2 instances in the cluster.
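Here's a minimal, hedged boto3 sketch of that auto-terminating behaviour: a transient EMR cluster that runs a single Spark step and shuts itself down when the step completes. The release label, instance types, roles, and S3 path to the job script are assumptions for illustration only.

```python
# Hedged sketch: a transient EMR cluster that terminates itself when its step finishes.
# Release label, instance sizes, roles and S3 paths are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-spark-job",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False = transient cluster: EMR terminates it once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/analyze.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster started:", response["JobFlowId"])
```

Because the cluster is transient, you only pay for the EC2 instances while the step is actually running, which is the behaviour described above.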
The AWS Elasticsearch Service is a fully managed implementation of the Elasticsearch framework. Elasticsearch is a real-time distributed search and analytics engine, and it is the most popular enterprise search engine; it is used by Facebook, GitHub, Stack Exchange and Quora, to name a few, so it is a very popular analytics engine as well as a popular search engine. You can analyze data from Amazon S3, Amazon Kinesis streams, DynamoDB Streams, your CloudWatch Logs and CloudTrail. It's suitable for querying and searching large amounts of data, as is any good search engine, but it's not suitable for online transaction processing or for petabyte-scale storage. So if you're looking for petabyte-scale storage in an Elasticsearch framework, you should look at managing that yourself by using an AMI and launching it on EC2 instances yourself.

Amazon QuickSight is a relatively new business intelligence reporting service from AWS, and a very good one. It provides very good BI software at one tenth of the cost of traditional software such as Tableau, and it provides a super-fast parallel in-memory calculation engine, or SPICE. Although it is advertised at one tenth of the cost of traditional BI software, there are of course open-source BI tools, such as BIRT, that will cost you virtually nothing.

Amazon Machine Learning is for predictive analytics and machine learning, simplified with visualization tools and wizards. It has some very complex machine learning algorithms behind it, and it makes them very easy to use because of the wizards and the tools that are available for visualizing what you are trying to analyze. The data sources for input to Amazon ML include Amazon S3, Redshift, and MySQL on RDS. It's very suitable for flagging suspicious transactions, forecasting product demand, personalizing application content on a website, predicting user activity, and analyzing social media streams. It's not suitable for very large data sets, and it's not suitable for unsupported learning tasks: there is a library of learning tasks, but if yours isn't supported by Amazon ML, it cannot do it. If you need something with more learning tasks, consider using Elastic MapReduce to run Spark and its machine learning library instead.
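As a rough illustration of how Amazon ML predictions are consumed, here's a hedged boto3 sketch of a real-time prediction call, the kind of thing you might use to score a transaction as suspicious. The model ID and the record's attribute names are hypothetical, and the sketch assumes the model and its real-time endpoint were already created through the console wizards.

```python
# Hedged sketch: real-time prediction against a pre-built Amazon ML model.
# The model ID and the record's attribute names are hypothetical placeholders;
# the model and its real-time endpoint are assumed to already exist.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

model_id = "ml-ExampleModelId"

# The real-time endpoint URL is attached to the model once an endpoint is created.
endpoint = ml.get_ml_model(MLModelId=model_id)["EndpointInfo"]["EndpointUrl"]

# Score a single transaction, e.g. to flag it as suspicious or not.
result = ml.predict(
    MLModelId=model_id,
    Record={"amount": "1599.00", "country": "NZ", "channel": "web"},
    PredictEndpoint=endpoint,
)

print(result["Prediction"])   # predicted label/value and scores
```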
Amazon Kinesis is very good for analyzing real-time data in the form of a stream. Data can be put into a Kinesis stream using API calls with one of the software development kits, which have Kinesis functions built in; with the Amazon Kinesis Producer Library, or KPL, a C++ application that we can use to channel our data into Kinesis; and also with the Kinesis agent, a Java application (there's a minimal SDK sketch at the end of this lecture). Once we have a Kinesis stream established, we can use the Kinesis Client Library, or KCL, to process the data in that stream; that's available for Java, Node.js, .NET and Python, and, from what I understand, is currently being developed for Ruby. Kinesis Firehose can capture, transform and load streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift and also the Amazon Elasticsearch Service, so that layer allows for real-time analysis of our stream data through Kinesis Firehose. Kinesis is obviously suitable for real-time data analytics, for log and data feed intake and processing, and for real-time metrics and reporting. It's not suitable for small-scale consistent throughput, and it's not suitable for long-term data storage: it handles real-time data as a stream, not as an archive. That concludes our lecture on big data, and I'll see you in the next lecture.
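As a footnote to the Kinesis producer discussion above, here is a minimal, hedged sketch of putting records onto a stream with the SDK rather than the KPL or the agent. The stream name and the payload fields are placeholders for illustration, and the stream is assumed to exist already.

```python
# Hedged sketch: writing records to an existing Kinesis stream with the AWS SDK.
# The stream name and the payload fields are illustrative placeholders.
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(5):
    event = {"sensor_id": "sensor-42", "reading": 20.5 + i, "ts": time.time()}
    kinesis.put_record(
        StreamName="example-clickstream",
        Data=json.dumps(event).encode("utf-8"),   # payload must be bytes
        PartitionKey=event["sensor_id"],          # determines which shard receives it
    )
```

On the consuming side, a KCL application (or a Lambda function subscribed to the stream) would then read and process these records in near real time.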