0
1
00:00:10,280 --> 00:00:16,670
Welcome, everyone, to another video of the expert malware analysis and reverse engineering course. In this video we are
1

2
00:00:16,670 --> 00:00:19,600
going to analyze malicious PDF files.
2

3
00:00:19,880 --> 00:00:27,220
So before we dive into analyzing a malicious PDF file, like we did for Office files,
3

4
00:00:27,290 --> 00:00:30,570
we will first understand the PDF file structure.
4

5
00:00:30,590 --> 00:00:36,920
This will help us in building a stronger base in understanding how exactly the PDF file format looks
5

6
00:00:36,920 --> 00:00:37,710
like.
6

7
00:00:37,760 --> 00:00:42,940
How exactly it is structured, and how to understand its different components.
7

8
00:00:43,070 --> 00:00:49,430
PDF stands for Portable Document Format and it is a file format to basically represent documents.
8

9
00:00:49,640 --> 00:00:51,410
It's a very simple file format.
9

10
00:00:51,410 --> 00:00:58,510
It can basically be called a text format where everything is stored in the form of dictionaries and
10

11
00:00:58,510 --> 00:01:05,520
streams. The dictionaries contain information about the file, and the streams hold the data of
11

12
00:01:05,620 --> 00:01:06,100
the PDF.
12

13
00:01:06,100 --> 00:01:12,710
For example, if you have an image or if you have text, all those things will go inside the streams of the PDF.
13

14
00:01:14,120 --> 00:01:17,700
It encapsulates various elements of documents as objects.
14

15
00:01:17,930 --> 00:01:23,630
Just like we mentioned right now, it will have different objects and all those objects will contain properties
15

16
00:01:23,630 --> 00:01:30,100
of different elements inside the PDF, as well as the stream of data that is present in the PDF.
16

17
00:01:30,100 --> 00:01:36,040
It supports rich and interactive elements; for example, it supports images and graphs, and you can draw shapes
17

18
00:01:36,070 --> 00:01:38,550
inside it. It has scripting capability as well.
18

19
00:01:38,560 --> 00:01:44,170
So PDF files are very much capable of supporting scripting languages like ActionScript and
19

20
00:01:44,170 --> 00:01:45,030
JavaScript.
20

21
00:01:45,220 --> 00:01:53,200
You can create dynamic elements and dynamic actions like filling out forms, loading information
21

22
00:01:53,200 --> 00:01:55,250
dynamically and things like that.
22

23
00:01:56,530 --> 00:02:03,270
In a PDF file, there are four major sections: header, body, cross-reference table and trailer.
23

24
00:02:03,770 --> 00:02:06,950
So let us try to understand what exactly these sections mean.
24

25
00:02:08,360 --> 00:02:16,340
So on the right hand side you can see that there is the structure of a PDF file. In order to view the
25

26
00:02:16,340 --> 00:02:18,190
structure of any PDF file,
26

27
00:02:18,380 --> 00:02:23,980
you can simply drag and drop the file into a text editor and view a structure
27

28
00:02:23,990 --> 00:02:28,280
something similar to the one that I have here on the right hand side.
28

29
00:02:28,670 --> 00:02:35,330
So the first section of the PDF file is its header. So the header is nothing but the magic number which denotes
29

30
00:02:35,330 --> 00:02:40,270
that this is a PDF file and it contains the version of the PDF file.
30
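As a rough illustration of the header check just described, here is a minimal Python sketch. The function name read_pdf_header and the sample bytes are made up for this example; real-world files can be messier than this.

```python
import re


def read_pdf_header(data: bytes):
    """Return the version string from a PDF header, or None if it is missing."""
    # The header is the magic "%PDF-" followed by a major.minor version.
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


# Hypothetical sample: the first bytes of a small PDF file.
sample = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
version = read_pdf_header(sample)
print(version)
```

A file that does not start with this magic number is not treated as a PDF by this sketch, so the function returns None for it.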

31
00:02:40,550 --> 00:02:47,970
You see here at the top, it says that it's a PDF file version 1.7. So following downwards you'd see a
31

32
00:02:47,960 --> 00:02:51,470
bunch of objects inside the PDF file.
32

33
00:02:51,480 --> 00:02:56,610
These objects basically constitute the different elements of the PDF file.
33

34
00:02:56,610 --> 00:03:04,110
So '1' basically denotes its object number, '0' is its generation number and 'obj' denotes the start of the object.
34

35
00:03:04,110 --> 00:03:05,570
The opening '<<' and closing '>>' signs
35

36
00:03:05,580 --> 00:03:07,710
basically represent a dictionary.
36

37
00:03:07,800 --> 00:03:12,940
And those dictionaries can contain the different elements of the PDF file.
37

38
00:03:13,140 --> 00:03:19,050
Then we have 'endobj' which basically denotes the end of object and then there will be another object and
38

39
00:03:19,050 --> 00:03:19,790
so on.
39
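The object layout described above can be sketched in a few lines of Python. This is only an assumption-laden illustration: the regular expression, the function name list_objects and the sample bytes are invented here, and a simple pattern like this will not handle every real PDF (for example, streams that happen to contain the word endobj).

```python
import re

# "<number> <generation> obj ... endobj", matched lazily up to the first endobj.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def list_objects(data: bytes):
    """Return (object number, generation number, raw body) for each object."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(data)
    ]


# Hypothetical sample with two objects, each holding one dictionary.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
objects = list_objects(sample)
for number, generation, body in objects:
    print(number, generation, body)
```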

40
00:03:21,410 --> 00:03:24,280
So this was the first part, which is the header.
40

41
00:03:24,380 --> 00:03:27,050
The second part is the body.
41

42
00:03:27,170 --> 00:03:30,300
The third part is the cross-reference table and the fourth part is the trailer.
42

43
00:03:30,440 --> 00:03:37,700
So the body basically consists of all these objects that are there inside the PDF.
43

44
00:03:37,760 --> 00:03:43,200
Let us talk about the cross-reference table here which is our third important element of the PDF file
44

45
00:03:43,490 --> 00:03:50,390
and probably the most complicated one. The cross-reference table is nothing but a table that allows the
45

46
00:03:50,390 --> 00:03:54,960
PDF parser to quickly access every object inside the body.
46

47
00:03:55,340 --> 00:04:03,740
So it begins with 'xref', which denotes the cross-reference table; then '0' denotes the number of the first object
47

48
00:04:04,070 --> 00:04:09,700
and '4' denotes the number of entries that are there inside the table.
48

49
00:04:10,160 --> 00:04:13,850
So if you see here we have one, two, three objects.
49

50
00:04:14,000 --> 00:04:17,080
So the question is, why do we have four here?
50

51
00:04:17,090 --> 00:04:20,650
The answer is the special entry for object 0, the first row of the table.
51

52
00:04:20,870 --> 00:04:29,690
So the body always contains one object less than the total count, because the cross-reference table also includes
52

53
00:04:29,780 --> 00:04:32,360
this entry for object 0 as one.
53

54
00:04:32,420 --> 00:04:39,400
So the total count becomes 4. Now the cross-reference table
54

55
00:04:39,400 --> 00:04:41,220
has a standard format.
55

56
00:04:41,740 --> 00:04:46,900
So this denotes the offset from where the first object is starting, and this denotes the offset from where
56

57
00:04:46,900 --> 00:04:50,130
the second object is starting. Then third, fourth.
57

58
00:04:50,690 --> 00:04:54,820
65535 is basically a five digit generation number.
58

59
00:04:54,820 --> 00:05:02,350
It's something that the PDF tool generates for you, and here 'f' and 'n' have two different meanings.
59

60
00:05:02,350 --> 00:05:08,610
They tell whether that object is a free object ('f') or whether that object is in use ('n').
60
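To make the table format concrete, here is a small Python sketch that reads a cross-reference section like the one on the slide. The function name parse_xref_section and the offsets in the sample are hypothetical values chosen for illustration, not the ones from the video.

```python
def parse_xref_section(section: str):
    """Parse an 'xref' section made of one subsection into a list of entries."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("section does not start with the 'xref' keyword")
    # Subsection header: number of the first object, then the entry count.
    first_object, count = map(int, lines[1].split())
    entries = []
    for number, line in enumerate(lines[2:2 + count], start=first_object):
        # Each entry: 10-digit byte offset, 5-digit generation, 'f' or 'n'.
        offset, generation, kind = line.split()
        entries.append({
            "object": number,
            "offset": int(offset),
            "generation": int(generation),
            "in_use": kind == "n",
        })
    return entries


# Hypothetical table: the free entry for object 0 plus three objects in use.
sample = """xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
"""
entries = parse_xref_section(sample)
for entry in entries:
    print(entry)
```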

61
00:05:08,680 --> 00:05:15,740
The last part is the trailer, an important part of the PDF file. So the trailer contains the
61

62
00:05:15,820 --> 00:05:19,260
overall information about the PDF.
62

63
00:05:19,300 --> 00:05:28,430
So if you look here, it contains the keyword 'startxref'. Then it contains an integer.
63

64
00:05:28,450 --> 00:05:34,430
So this integer basically denotes the byte offset of the beginning of the cross-reference table.
64

65
00:05:35,680 --> 00:05:38,500
So it basically points to this location over here.
65

66
00:05:38,560 --> 00:05:41,710
So if it says 249, it means it's here.
66

67
00:05:41,710 --> 00:05:46,510
So there are four characters, which means zero is at location 253.
67

68
00:05:46,510 --> 00:05:49,350
So that's how the table is denoted.
68

69
00:05:49,390 --> 00:05:57,150
And then finally we have the end-of-file marker ('%%EOF'), which tells that the PDF file has ended.
69
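The 'startxref' lookup can be sketched in Python as follows. The function name find_startxref, the 1024-byte tail size and the sample bytes are assumptions made for this example only.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    # Only the end of the file is searched; 1024 bytes is an arbitrary choice.
    tail = data[-1024:]
    position = tail.rfind(b"startxref")
    if position == -1:
        raise ValueError("no 'startxref' keyword found near the end of the file")
    # The first token after the keyword is the offset of the xref table.
    return int(tail[position + len(b"startxref"):].split()[0])


# Hypothetical end of a PDF file.
sample = b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n249\n%%EOF\n"
xref_offset = find_startxref(sample)
print(xref_offset)
```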

70
00:05:57,220 --> 00:06:04,570
So what exactly happens here? Once you load a PDF file into a PDF editor or into PDF reader software
70

71
00:06:04,570 --> 00:06:12,670
like Adobe Reader, the PDF tool starts reading the file from the bottom. It first goes to 'startxref'.
71

72
00:06:12,850 --> 00:06:18,440
It figures out the starting byte of the cross-reference table.
72

73
00:06:18,460 --> 00:06:25,030
It comes here. From there it knows that there are four objects inside it. It picks up the location of the first
73

74
00:06:25,030 --> 00:06:27,420
object which is pointed here.
74

75
00:06:27,550 --> 00:06:34,240
Then it comes to the root object, where it says that it wants to go to object number one, and
75

76
00:06:34,240 --> 00:06:37,890
the entire flow then jumps all the way to the top to object number one.
76

77
00:06:38,170 --> 00:06:43,780
Then it reads the content of object number one, and it goes on to two, three and so on.
77

78
00:06:43,780 --> 00:06:49,030
So this is how the parsing of a PDF file happens inside a PDF parsing tool.
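
The bottom-up flow described above can be sketched end to end in Python. Everything here is a simplified assumption: build_sample_pdf and walk_pdf are hypothetical helpers, the sample file is generated in memory, and a real reader such as Adobe Reader handles many cases this sketch ignores (multiple xref sections, compressed streams, damaged files).

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny, hypothetical PDF-like file with correct byte offsets."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R >>",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0, always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def walk_pdf(data: bytes):
    """Follow startxref -> xref table -> /Root and return the root object."""
    # 1. Start at the bottom: read the offset that follows 'startxref'.
    position = data.rfind(b"startxref")
    xref_offset = int(data[position + len(b"startxref"):].split()[0])
    # 2. Jump to the cross-reference table and collect the in-use offsets.
    lines = data[xref_offset:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first_object, count = map(int, lines[1].split())
    offsets = {}
    for number, line in enumerate(lines[2:2 + count], start=first_object):
        offset, _generation, kind = line.split()
        if kind == b"n":
            offsets[number] = int(offset)
    # 3. The trailer names the root object as '/Root <number> <generation> R'.
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", data[xref_offset:])
    root_number = int(root.group(1))
    # 4. Jump to that object through its offset and read up to 'endobj'.
    start = offsets[root_number]
    end = data.index(b"endobj", start)
    return root_number, data[start:end].decode("ascii").strip()


root_number, root_text = walk_pdf(build_sample_pdf())
print(root_number)
print(root_text)
```

The sample is built in code rather than typed by hand so that the offsets in the table always match the actual byte positions of the objects.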