Welcome back to Backspace Academy. One area of AWS that has grown rapidly over the last couple of years is that of big data. In this lecture I'll go through the big data services that are available with AWS. I'll talk about Amazon Redshift, the data warehouse service; Amazon Elastic MapReduce, the Hadoop service; the Elasticsearch Service, which provides search engine capabilities; Amazon QuickSight, which is a relatively new service for business intelligence reporting of big data; Amazon Machine Learning for predictive analysis of data; and Amazon Kinesis for analysis of real-time streams.

The big data services available to us in database and storage are Redshift, DynamoDB, Amazon S3 and, of course, RDS. In analysis we have Elastic MapReduce, the Hadoop framework as a service; the Elasticsearch Service, a search engine as a service; QuickSight for business intelligence reporting; Amazon Machine Learning for predictive analysis of data; and AWS Lambda for compute capacity. For analysis of real-time data we have Kinesis streams, and we also have third-party applications that can be launched on EC2 instances.

Amazon Redshift is a petabyte-scale data warehousing service. It is based on the PostgreSQL database engine, so it is accessible by standard business intelligence reporting tools: whatever you are already using to collect data from a PostgreSQL database, such as Tableau or BIRT, you can still use with Redshift. All data is replicated between nodes in a cluster, and it is continuously backed up to Amazon S3 with snapshots; those snapshots can be held for 1 to 35 days, as is the case with RDS. User-initiated DB snapshots are retained upon cluster deletion, and Redshift also allows for quick recovery from those snapshots.

Now, it's one thing to have a big data warehouse, but we need to be able to get the big data we've already got into that warehouse, so we need to think carefully about the migration services that are available for us to use. First off, we've got database snapshots to S3: we could create an RDS snapshot of our database and then use that snapshot to create a replica of that database. We have the AWS Database Migration Service, which allows us to transfer data from a source database through to a target database over a connection. We also have AWS Data Pipeline, which again allows us to create a pipeline job that can transfer and, if needed, convert data from one data source through to a target database.
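Just to make that point about Redshift being PostgreSQL-compatible a bit more concrete, here is a minimal sketch of querying a cluster with a standard PostgreSQL driver. The endpoint, credentials, database and table names are placeholders I've made up for illustration, not values from this course; the only Redshift-specific detail is its default port, 5439.

```python
# Minimal sketch: querying Redshift through a standard PostgreSQL driver.
# Endpoint, database, credentials and table are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,                # Redshift's default port
    dbname="analytics",
    user="awsuser",
    password="example-password",
)

with conn, conn.cursor() as cur:
    # Any PostgreSQL-compatible client or BI tool can issue SQL like this.
    cur.execute("SELECT event_date, COUNT(*) FROM page_views GROUP BY event_date LIMIT 10;")
    for row in cur.fetchall():
        print(row)

conn.close()
```

This is exactly why tools like Tableau and BIRT work unchanged: they talk to Redshift over the same PostgreSQL/JDBC/ODBC path.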
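And as a rough illustration of how the Database Migration Service is driven, here's a hedged boto3 sketch that creates and starts a full-load replication task between a source and a target endpoint. All of the ARNs and the table-mapping rule are assumptions: they would come from endpoints and a replication instance you had already set up in DMS.

```python
# Hedged sketch: kicking off an AWS DMS replication task with boto3.
# The ARNs below are placeholders for resources created beforehand
# (source endpoint, target endpoint, replication instance).
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

task = dms.create_replication_task(
    ReplicationTaskIdentifier="example-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",          # one-off copy; "cdc" adds ongoing replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```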
And finally, we have really the best solution for migrating very large amounts of data, and that is to use AWS Snowball and import to Amazon S3 from the Snowball device. As we explained in the introduction to this course, you can have an AWS Snowball device sent out to you on premises, upload your large amounts of data to that device (or devices, if you've got more than one Snowball can handle), and then send it back to AWS, and they will upload it for you into an S3 bucket. All you need to worry about then is getting the data from your S3 bucket into your target database, and that is where AWS Data Pipeline comes in. You cannot do that with the Database Migration Service, which can't go from S3 to a database, but you can create a job in Data Pipeline to import from S3 into your database.

The reason we would want to use Snowball is shown in the table: even on a gigabit connection with high network utilization, transferring one hundred terabytes of data across to AWS would take around 12 days (100 TB is 8 x 10^14 bits; at roughly 80% utilization of a 1 Gbps link, that is about 10^6 seconds, or a little under 12 days). So we would want to consider AWS Snowball when we have 60 terabytes or more of data to transfer. When we start looking at lower-quality connections, which would take around a year to upload that much data, we certainly want to look at using Snowball for 2 terabytes or more of data.

Elastic MapReduce is AWS's fully managed Hadoop service. Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of compute instances. The AWS EMR service provides clusters of EC2 instances that analyze our data, and those clusters can be automatically terminated upon task completion, so when the job's done we can have those clusters shut down, which is what we want. It's not suitable for small data sets, and it's not suitable for ACID transaction requirements. The data processing frameworks available are Hadoop MapReduce and Apache Spark, which simplifies the process of creating MapReduce functions. The storage options available to us are the Hadoop Distributed File System (HDFS), which is ephemeral storage; the EMR File System, or EMRFS, which allows you to access Amazon S3 as a file system for Elastic MapReduce; and the local file system of the EC2 instances in the cluster.
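Here's a minimal, hedged boto3 sketch of that auto-terminating behaviour: a transient EMR cluster that runs a single Spark step and shuts itself down when the step completes. The release label, instance types, roles, and S3 path to the job script are assumptions for illustration only.

```python
# Hedged sketch: a transient EMR cluster that terminates itself when its step finishes.
# Release label, instance sizes, roles and S3 paths are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-spark-job",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False = transient cluster: EMR terminates it once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/analyze.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster started:", response["JobFlowId"])
```

Because the cluster is transient, you only pay for the EC2 instances while the step is actually running, which is the behaviour described above.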
The AWS Elasticsearch Service is a fully managed implementation of the Elasticsearch framework. Elasticsearch is a real-time distributed search and analytics engine, and it is the most popular enterprise search engine; it is used by Facebook, GitHub, Stack Exchange and Quora, to name a few, so it is a very popular analytics engine as well as a popular search engine. You can analyze data from Amazon S3, Amazon Kinesis streams, DynamoDB Streams, your CloudWatch Logs and CloudTrail. It's suitable for querying and searching large amounts of data, as is any good search engine, but it's not suitable for online transaction processing or for petabyte-scale storage. So if you're looking for petabyte-scale storage in an Elasticsearch framework, you should look at managing that yourself by using an AMI and launching it on EC2 instances yourself.

Amazon QuickSight is a relatively new business intelligence reporting service from AWS, and a very good one. It provides very good BI software at one tenth of the cost of traditional software such as Tableau, and it provides a super-fast parallel in-memory calculation engine, or SPICE. Although it is advertised at one tenth of the cost of traditional BI software, there are of course open-source BI tools, such as BIRT, that will cost you virtually nothing.

Amazon Machine Learning is for predictive analytics and machine learning, simplified with visualization tools and wizards. It has some very complex machine learning algorithms behind it, and it makes them very easy to use because of the wizards and the tools that are available for visualizing what you are trying to analyze. The data sources for input to Amazon ML include Amazon S3, Redshift, and MySQL on RDS. It's very suitable for flagging suspicious transactions, forecasting product demand, personalizing application content on a website, predicting user activity, and analyzing social media streams. It's not suitable for very large data sets, and it's not suitable for unsupported learning tasks: there is a library of learning tasks, but if yours isn't supported by Amazon ML, it cannot do it. If you need something with more learning tasks, consider using Elastic MapReduce to run Spark and its machine learning library instead.
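As a rough illustration of how Amazon ML predictions are consumed, here's a hedged boto3 sketch of a real-time prediction call, the kind of thing you might use to score a transaction as suspicious. The model ID and the record's attribute names are hypothetical, and the sketch assumes the model and its real-time endpoint were already created through the console wizards.

```python
# Hedged sketch: real-time prediction against a pre-built Amazon ML model.
# The model ID and the record's attribute names are hypothetical placeholders;
# the model and its real-time endpoint are assumed to already exist.
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

model_id = "ml-ExampleModelId"

# The real-time endpoint URL is attached to the model once an endpoint is created.
endpoint = ml.get_ml_model(MLModelId=model_id)["EndpointInfo"]["EndpointUrl"]

# Score a single transaction, e.g. to flag it as suspicious or not.
result = ml.predict(
    MLModelId=model_id,
    Record={"amount": "1599.00", "country": "NZ", "channel": "web"},
    PredictEndpoint=endpoint,
)

print(result["Prediction"])   # predicted label/value and scores
```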
Amazon Kinesis is very good for analyzing real-time data in the form of a stream. Data can be put into a Kinesis stream using API calls with one of the software development kits, which have Kinesis functions built in; with the Amazon Kinesis Producer Library, or KPL, a C++ application that we can use to channel our data into Kinesis; and also with the Kinesis agent, a Java application (there's a minimal SDK sketch at the end of this lecture). Once we have a Kinesis stream established, we can use the Kinesis Client Library, or KCL, to process the data in that stream; that's available for Java, Node.js, .NET and Python, and, from what I understand, is currently being developed for Ruby. Kinesis Firehose can capture, transform and load streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift and also the Amazon Elasticsearch Service, so that layer allows for real-time analysis of our stream data through Kinesis Firehose. Kinesis is obviously suitable for real-time data analytics, for log and data feed intake and processing, and for real-time metrics and reporting. It's not suitable for small-scale consistent throughput, and it's not suitable for long-term data storage: it handles real-time data as a stream, not as an archive. That concludes our lecture on big data, and I'll see you in the next lecture.
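As a footnote to the Kinesis producer discussion above, here is a minimal, hedged sketch of putting records onto a stream with the SDK rather than the KPL or the agent. The stream name and the payload fields are placeholders for illustration, and the stream is assumed to exist already.

```python
# Hedged sketch: writing records to an existing Kinesis stream with the AWS SDK.
# The stream name and the payload fields are illustrative placeholders.
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(5):
    event = {"sensor_id": "sensor-42", "reading": 20.5 + i, "ts": time.time()}
    kinesis.put_record(
        StreamName="example-clickstream",
        Data=json.dumps(event).encode("utf-8"),   # payload must be bytes
        PartitionKey=event["sensor_id"],          # determines which shard receives it
    )
```

On the consuming side, a KCL application (or a Lambda function subscribed to the stream) would then read and process these records in near real time.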