0 1 00:00:10,280 --> 00:00:16,670 Welcome everyone to another video of the Expert Malware Analysis and Reverse Engineering course. In this video we are 1 2 00:00:16,670 --> 00:00:19,600 going to analyze malicious PDF files. 2 3 00:00:19,880 --> 00:00:27,220 So before we dive into analyzing a malicious PDF file, like we did for Office files, 3 4 00:00:27,290 --> 00:00:30,570 we will first understand the PDF file structure. 4 5 00:00:30,590 --> 00:00:36,920 This will help us build a stronger base in understanding what exactly the PDF file format looks 5 6 00:00:36,920 --> 00:00:37,710 like, 6 7 00:00:37,760 --> 00:00:42,940 how exactly it is structured, and how to understand its different components. 7 8 00:00:43,070 --> 00:00:49,430 PDF stands for Portable Document Format, and it is a file format used to represent documents. 8 9 00:00:49,640 --> 00:00:51,410 It's a very simple file format. 9 10 00:00:51,410 --> 00:00:58,510 It can basically be called a text format where everything is stored in the form of dictionaries and 10 11 00:00:58,510 --> 00:01:05,520 streams: the dictionaries contain information about the file, and the streams hold the data of 11 12 00:01:05,620 --> 00:01:06,100 the PDF. 12 13 00:01:06,100 --> 00:01:12,710 For example, if you have an image or if you have text, all those things will go inside the streams of the PDF. 13 14 00:01:14,120 --> 00:01:17,700 It encapsulates the various elements of a document as objects. 14 15 00:01:17,930 --> 00:01:23,630 Just as we mentioned right now, it will have different objects, and all those objects will contain the properties 15 16 00:01:23,630 --> 00:01:30,100 of the different elements inside the PDF, as well as the streams of data that are present in the PDF. 16 17 00:01:30,100 --> 00:01:36,040 It supports rich and interactive elements: for example, it supports images, it supports graphs, and you can draw shapes 17 18 00:01:36,070 --> 00:01:38,550 inside it. It has scripting capability as well.
18 19 00:01:38,560 --> 00:01:44,170 So the PDF file format is very much capable of supporting scripting languages like ActionScript and 19 20 00:01:44,170 --> 00:01:45,030 JavaScript. 20 21 00:01:45,220 --> 00:01:53,200 You can create dynamic elements and dynamic actions like filling out forms, loading information 21 22 00:01:53,200 --> 00:01:55,250 dynamically, and things like that. 22 23 00:01:56,530 --> 00:02:03,270 In a PDF file, there are four major sections: header, body, cross-reference table, and trailer. 23 24 00:02:03,770 --> 00:02:06,950 So let us try to understand what exactly these sections mean. 24 25 00:02:08,360 --> 00:02:16,340 So on the right-hand side you can see the structure of a PDF file. In order to view the 25 26 00:02:16,340 --> 00:02:18,190 structure of any PDF file, 26 27 00:02:18,380 --> 00:02:23,980 you can simply drag and drop the file into a text editor and you will see a structure 27 28 00:02:23,990 --> 00:02:28,280 similar to the one that I have here on the right-hand side. 28 29 00:02:28,670 --> 00:02:35,330 So the first section of the PDF file is its header. The header is nothing but the magic number, which denotes 29 30 00:02:35,330 --> 00:02:40,270 that this is a PDF file, and it contains the version of the PDF format. 30 31 00:02:40,550 --> 00:02:47,970 You see here at the top, it says that it's a PDF file, version 1.7. Following downwards, you see a 31 32 00:02:47,960 --> 00:02:51,470 bunch of objects inside the PDF file. 32 33 00:02:51,480 --> 00:02:56,610 These objects constitute the different elements of the PDF file. 33 34 00:02:56,610 --> 00:03:04,110 So '1' denotes the object number, '0' is its generation number, and 'obj' denotes the beginning of the object. 34 35 00:03:04,110 --> 00:03:05,570 These opening and closing signs ('<<' and '>>') 35 36 00:03:05,580 --> 00:03:07,710 represent a dictionary.
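As a rough illustration of the header and object markers described so far, here is a minimal Python sketch. The sample bytes and the helper names `parse_header` and `find_objects` are made up for this example and are not the file shown in the video; a real parser would also have to handle streams, comments, and malformed input.

```python
import re

# Hypothetical minimal sample, not the file shown on screen.
SAMPLE = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"


def parse_header(data: bytes) -> str:
    """Return the version from the '%PDF-x.y' magic number at the start of the file."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF- magic number")
    return match.group(1).decode("ascii")


def find_objects(data: bytes):
    """Return (object number, generation number, body) for each 'N G obj ... endobj'."""
    pattern = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in pattern.finditer(data)
    ]


if __name__ == "__main__":
    print("version:", parse_header(SAMPLE))
    for number, generation, body in find_objects(SAMPLE):
        print(number, generation, body)
```

The regular expression is only a simplification: it is enough to see the 'N G obj ... endobj' framing in a small text-only sample, but it can be fooled by binary stream data that happens to contain the same keywords.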
36 37 00:03:07,800 --> 00:03:12,940 And within those dictionaries you can find the different elements of the PDF file. 37 38 00:03:13,140 --> 00:03:19,050 Then we have 'endobj', which denotes the end of the object, and then there will be another object, and 38 39 00:03:19,050 --> 00:03:19,790 so on. 39 40 00:03:21,410 --> 00:03:24,280 So this was the first part, which is the header. 40 41 00:03:24,380 --> 00:03:27,050 The second part is the body. 41 42 00:03:27,170 --> 00:03:30,300 The third part is the cross-reference table and the fourth part is the trailer. 42 43 00:03:30,440 --> 00:03:37,700 So the body consists of all these objects that are there inside the PDF. 43 44 00:03:37,760 --> 00:03:43,200 Let us talk about the cross-reference table here, which is the third important element of the PDF file 44 45 00:03:43,490 --> 00:03:50,390 and probably the most complicated one. The cross-reference table is nothing but a table that allows the 45 46 00:03:50,390 --> 00:03:54,960 PDF parser to quickly access every object inside the body. 46 47 00:03:55,340 --> 00:04:03,740 So it begins with 'xref', which denotes the cross-reference table; then '0' denotes the number of the first object in the table, 47 48 00:04:04,070 --> 00:04:09,700 and '4' denotes the number of entries in the table. 48 49 00:04:10,160 --> 00:04:13,850 So if you see here, we have one, two, and three objects. 49 50 00:04:14,000 --> 00:04:17,080 So the question is, why do we have four here? 50 51 00:04:17,090 --> 00:04:20,650 The answer is this entry for object zero over here. 51 52 00:04:20,870 --> 00:04:29,690 So the body will always contain one object fewer than the count in the table, because the table also includes 52 53 00:04:29,780 --> 00:04:32,360 this special object zero as one entry. 53 54 00:04:32,420 --> 00:04:39,400 So the total count of entries in the table becomes 4. Now the cross-reference table 54 55 00:04:39,400 --> 00:04:41,220 has a standard format.
55 56 00:04:41,740 --> 00:04:46,900 So this denotes the offset where the first object starts, and this denotes the offset where 56 57 00:04:46,900 --> 00:04:50,130 the second object starts. Then the third, the fourth. 57 58 00:04:50,690 --> 00:04:54,820 65535 is a five-digit generation number. 58 59 00:04:54,820 --> 00:05:02,350 It's something that the PDF tool generates for you, and here 'f' and 'n' have two different meanings: 59 60 00:05:02,350 --> 00:05:08,610 'f' means that the object is a free object and 'n' means that the object is in use. 60 61 00:05:08,680 --> 00:05:15,740 The last part is the trailer, an important part of the PDF file. The trailer contains the 61 62 00:05:15,820 --> 00:05:19,260 overall information about the PDF. 62 63 00:05:19,300 --> 00:05:28,430 So if you look here, you see the keyword 'startxref', followed by an integer. 63 64 00:05:28,450 --> 00:05:34,430 This integer denotes the byte offset of the beginning of the cross-reference table. 64 65 00:05:35,680 --> 00:05:38,500 So it points to this location over here. 65 66 00:05:38,560 --> 00:05:41,710 So if it says 249, it means it's here. 66 67 00:05:41,710 --> 00:05:46,510 The keyword 'xref' is four characters long, which means the zero is at around location 253. 67 68 00:05:46,510 --> 00:05:49,350 So that's how the table is located. 68 69 00:05:49,390 --> 00:05:57,150 And then finally we have the end-of-file marker ('%%EOF'), which tells the reader that the PDF file has ended. 69 70 00:05:57,220 --> 00:06:04,570 So what exactly happens here? Once you load a PDF file into a PDF editor or into PDF reader software 70 71 00:06:04,570 --> 00:06:12,670 like Adobe Reader, the PDF tool starts reading the file from the bottom. It first goes to 'startxref'. 71 72 00:06:12,850 --> 00:06:18,440 It figures out the starting byte of the cross-reference table. 72 73 00:06:18,460 --> 00:06:25,030 It comes here. From there it knows that there are four entries in it.
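The cross-reference layout described above (the 'xref' keyword, a line with the first object number and the entry count, then one entry per object with an offset, a generation number, and an 'f' or 'n' flag) could be read with a short Python sketch like the one below. The sample table and its offsets are invented for illustration, and `parse_xref` is a hypothetical helper name, not part of any PDF library.

```python
# Hypothetical cross-reference table; the offsets are made up for illustration.
XREF_SAMPLE = (
    b"xref\n"
    b"0 4\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"0000000115 00000 n \n"
)


def parse_xref(section: bytes) -> dict:
    """Map object number -> (byte offset, generation number, in_use flag)."""
    lines = section.decode("ascii").splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("section does not start with the 'xref' keyword")
    # Second line: number of the first object, then the number of entries.
    first, count = (int(value) for value in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        # 'n' marks an object that is in use, 'f' marks a free entry.
        entries[first + index] = (int(offset), int(generation), flag == "n")
    return entries


if __name__ == "__main__":
    for number, entry in parse_xref(XREF_SAMPLE).items():
        print(number, entry)
```

In this sample, entry 0 is the free entry with generation number 65535, which is why the table lists four entries while the body holds only three objects.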
It picks up the location of the first 73 74 00:06:25,030 --> 00:06:27,420 object, which is pointed to here. 74 75 00:06:27,550 --> 00:06:34,240 Then it comes to the root object, where it says that it wants to go to object number one, and 75 76 00:06:34,240 --> 00:06:37,890 the entire flow then jumps all the way to the top, to object number one. 76 77 00:06:38,170 --> 00:06:43,780 Then it reads the content of object number one, and it goes on to two, to three, and so on. 77 78 00:06:43,780 --> 00:06:49,030 So this is how the parsing of a PDF file happens inside a PDF parsing tool.
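The bottom-up reading order described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_sample_pdf` assembles a tiny, hypothetical three-object file (not the one from the video) so that the offsets are guaranteed to be consistent, and `read_from_tail` is a made-up helper that assumes a single classic cross-reference table and a trailer with a '/Root' entry. Real readers handle many more cases.

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny, hypothetical PDF-like file and compute its real byte offsets."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
        b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n",
    ]
    out = b"%PDF-1.7\n"
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset where this object starts
        out += obj
    xref_offset = len(out)
    out += b"xref\n0 4\n0000000000 65535 f \n"
    for offset in offsets:
        out += ("%010d 00000 n \n" % offset).encode("ascii")
    out += b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
    out += b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
    return out


def read_from_tail(data: bytes):
    """Follow startxref -> xref table -> trailer /Root -> root object."""
    # 1. Start at the bottom: find 'startxref' and the integer after it.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    xref_offset = int(data[pos + len(b"startxref"):].split()[0])
    # 2. Jump to that offset, where the 'xref' keyword is expected.
    if data[xref_offset:xref_offset + 4] != b"xref":
        raise ValueError("startxref does not point at an xref table")
    lines = data[xref_offset:].decode("latin-1").splitlines()
    first, count = (int(value) for value in lines[1].split())
    offsets = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, _generation, flag = line.split()
        if flag == "n":  # keep only objects that are in use
            offsets[first + index] = int(offset)
    # 3. The trailer names the root object to start from.
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[xref_offset:])
    if root is None:
        raise ValueError("trailer has no /Root entry")
    root_number = int(root.group(1))
    # 4. Jump back up to that object using its offset from the table.
    start = offsets[root_number]
    end = data.index(b"endobj", start) + len(b"endobj")
    return xref_offset, offsets, root_number, data[start:end]


if __name__ == "__main__":
    xref_offset, offsets, root_number, root_object = read_from_tail(build_sample_pdf())
    print("xref table at byte", xref_offset)
    print("in-use object offsets:", offsets)
    print("root object number:", root_number)
    print(root_object.decode("ascii"))
```

The point of the sketch is the order of the jumps: the reader never scans the body from the top, it goes from 'startxref' to the table, from the table and trailer to the root object, and from there to the remaining objects.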