0 1 00:00:10,920 --> 00:00:15,960 Welcome everyone to another video of the Malware Analysis and Reverse Engineering course. 1 2 00:00:15,960 --> 00:00:21,990 So in the previous video we understood how we can analyze email headers to understand whether a given 2 3 00:00:22,020 --> 00:00:24,230 email is spam or not. 3 4 00:00:24,420 --> 00:00:29,760 We also noticed that those emails contained malicious attachments as well. 4 5 00:00:29,760 --> 00:00:36,300 So starting from this video we're going to dive deeper into understanding those attached file formats 5 6 00:00:36,630 --> 00:00:43,520 and we'll try and see how we can uncover the malicious characteristics of those file attachments. 6 7 00:00:44,040 --> 00:00:49,530 We want to start with understanding Microsoft Office file formats because they are among the most popular 7 8 00:00:49,530 --> 00:00:54,420 file formats in use. By Microsoft Office file formats, 8 9 00:00:54,420 --> 00:01:01,170 I'm referring to common files like Microsoft Word, Microsoft PowerPoint, and Microsoft Excel. 9 10 00:01:01,290 --> 00:01:06,750 These are among the most common everyday tools used in enterprises. 10 11 00:01:07,170 --> 00:01:11,800 So Microsoft Office has two different file formats. 11 12 00:01:11,850 --> 00:01:15,830 One is the OLE compound file and the second one is the Office Open 12 13 00:01:15,870 --> 00:01:21,500 XML, or simply Open XML, format. 13 14 00:01:21,530 --> 00:01:29,180 So first let us look into the OLE compound file. OLE stands for Object Linking and Embedding. In 14 15 00:01:29,180 --> 00:01:36,920 this file format a document consists of different objects which are linked with each other, thus 15 16 00:01:36,920 --> 00:01:40,950 forming a complete container for the document. 16 17 00:01:41,000 --> 00:01:48,050 It is a hierarchical collection of stream and storage objects. 
It consists of a root storage object 17 18 00:01:48,320 --> 00:01:54,540 with optional storage objects and stream objects in a nested hierarchy. 18 19 00:01:54,590 --> 00:02:01,550 It is used by the older versions, ranging from Microsoft Office 97 to 2003. 19 20 00:02:01,790 --> 00:02:07,520 The fourth point lists a bunch of hexadecimal characters. I want to leave this as an exercise for you 20 21 00:02:07,520 --> 00:02:10,660 to go and figure out what exactly those hex characters mean. 21 22 00:02:10,910 --> 00:02:15,710 You can definitely find the answer to this question in the discussion section as well but I would highly 22 23 00:02:15,710 --> 00:02:21,590 encourage you to go and figure out what exactly these hex characters mean. 23 24 00:02:21,640 --> 00:02:27,100 So here is a pictorial representation of the OLE compound file format and this will make things a 24 25 00:02:27,100 --> 00:02:28,670 lot clearer to you. 25 26 00:02:29,020 --> 00:02:37,330 So as we saw in the previous slide, there is a root storage, and the root storage consists of stream and storage 26 27 00:02:37,420 --> 00:02:39,330 objects in a hierarchy. 27 28 00:02:39,370 --> 00:02:46,450 Now the storage can be further subdivided into either storage or stream objects. Now a stream 28 29 00:02:46,450 --> 00:02:52,840 is basically a collection of data that is present inside the document, whereas a storage can further contain 29 30 00:02:52,930 --> 00:02:56,050 either stream or storage objects. 30 31 00:02:56,050 --> 00:03:02,500 So if you look at the diagram on the right hand side there is a root storage and within that storage 31 32 00:03:02,890 --> 00:03:06,220 there is another storage called Macros. Within Macros, 32 33 00:03:06,220 --> 00:03:13,510 there is another storage called VBA and there are a bunch of streams like 1Table, CompObj, 33 34 00:03:13,510 --> 00:03:16,480 SummaryInformation, WordDocument. 
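To make the storage and stream hierarchy from the diagram a bit more concrete, here is a minimal Python sketch that models it as a tree. This is only an illustration of the idea and not a parser for real OLE files. The class names `Storage` and `Stream` and the `walk` helper are made up for this example, the root name "Root Entry" is an assumption, and the stream placed under VBA ("ThisDocument") is a placeholder, since the video does not name the streams inside VBA.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """A leaf object: a named collection of bytes."""
    name: str
    data: bytes = b""


@dataclass
class Storage:
    """A container object: holds streams and nested storages."""
    name: str
    children: list = field(default_factory=list)

    def walk(self, prefix=""):
        """Yield (path, kind) for this storage and everything below it."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path, "storage"
        for child in self.children:
            if isinstance(child, Storage):
                yield from child.walk(path)
            else:
                yield f"{path}/{child.name}", "stream"


# Tree mirroring the diagram: a root storage, a Macros storage holding a
# VBA storage, plus the document streams at the top level.
root = Storage("Root Entry", [  # root name is an assumption for illustration
    Storage("Macros", [
        Storage("VBA", [
            Stream("ThisDocument"),  # placeholder stream name, not from the video
        ]),
    ]),
    Stream("1Table"),
    Stream("CompObj"),
    Stream("SummaryInformation"),
    Stream("WordDocument"),
])

for path, kind in root.walk():
    print(f"{kind:8} {path}")
```

The point of the sketch is that a storage behaves like a folder and a stream behaves like a file, which is why a macro-enabled document shows a Macros/VBA branch next to the ordinary document streams.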
34 35 00:03:16,480 --> 00:03:23,650 So if you create a Word file, there will be a root storage at the top and if that Word file contains 35 36 00:03:23,770 --> 00:03:28,300 macros, there will be a storage called Macros. 36 37 00:03:28,360 --> 00:03:36,870 Now whatever text you type inside that Word document gets recorded into the stream that 37 38 00:03:36,870 --> 00:03:39,190 is called WordDocument. 38 39 00:03:39,190 --> 00:03:41,190 The last one here. 39 40 00:03:41,290 --> 00:03:47,050 So these streams basically contain the data of the particular document. 40 41 00:03:47,050 --> 00:03:53,350 Now if your document also contains macros there will be a storage folder called Macros. 41 42 00:03:53,560 --> 00:03:59,260 Within that there will be another storage folder called VBA and within that VBA there will be a bunch 42 43 00:03:59,260 --> 00:04:02,060 of streams that will contain the data. 43 44 00:04:02,290 --> 00:04:07,500 And by data, what I mean is that they will contain the actual VBA code. 44 45 00:04:07,990 --> 00:04:17,340 So this was an overview of the OLE compound file format. Moving on to the Office Open XML format. This is for 45 46 00:04:17,340 --> 00:04:22,210 the newer versions of Microsoft Office, from 2007 onwards. 46 47 00:04:22,500 --> 00:04:30,720 It's much simpler than the OLE compound file because it's simply a zip file format where you basically 47 48 00:04:30,780 --> 00:04:38,580 collect a bunch of linked files and you zip them to create a top-level file for the 48 49 00:04:38,580 --> 00:04:46,290 particular document. It represents the document in the form of markup, which is characteristic of 49 50 00:04:46,300 --> 00:04:48,540 XML. 50 51 00:04:48,540 --> 00:04:51,360 So how do you analyze a DOCX file? 
51 52 00:04:51,810 --> 00:04:58,230 It's very simple: once you have a DOCX file, all you need to do is remove the .docx extension 52 53 00:04:58,230 --> 00:05:00,750 from the end and replace it with .zip. 53 54 00:05:01,020 --> 00:05:07,170 And you can just right click and unzip and it will give you the entire content of that 54 55 00:05:07,170 --> 00:05:08,320 DOCX file. 55 56 00:05:08,670 --> 00:05:15,450 So I want to specify that the newer versions of Microsoft Office use extensions like .docx, 56 57 00:05:15,450 --> 00:05:18,440 .xlsx, .pptx. 57 58 00:05:18,510 --> 00:05:24,720 These are the formats of the newer versions of Microsoft Office. The older versions, the OLE compound file versions, 58 59 00:05:24,900 --> 00:05:31,200 have file extensions such as .doc, .ppt, .xls, etc. 59 60 00:05:32,320 --> 00:05:37,020 So coming back to the DOCX file package: let's say you right click and unzip the entire package, you'll 60 61 00:05:37,090 --> 00:05:39,510 see a bunch of different file names. 61 62 00:05:39,700 --> 00:05:43,480 You'll have _rels, docProps, word. Inside word 62 63 00:05:43,480 --> 00:05:45,300 you will again have _rels. 63 64 00:05:45,520 --> 00:05:52,150 So the relationships are nothing but a lookup for each item that is referenced inside the document. 64 65 00:05:52,150 --> 00:05:54,970 You'll have another file called document.xml. 65 66 00:05:54,970 --> 00:06:00,640 This is basically the text of the document; the entire text that is within the document will 66 67 00:06:00,670 --> 00:06:01,500 be present here. 67 68 00:06:01,660 --> 00:06:06,010 Then you have header and footer that will contain the information of the header and footer that you 68 69 00:06:06,010 --> 00:06:08,040 create inside your document. 69 70 00:06:08,080 --> 00:06:11,550 You have styles, media, themes, chart and so on. 70 71 00:06:11,770 --> 00:06:18,510 So this is how the newer DOCX files are structured by Microsoft Office. 
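The rename-and-unzip step can also be done from code. Below is a minimal sketch using Python's standard `zipfile` module. Since there is no real document to work with here, it first builds a tiny stand-in package in memory. That stand-in is not a valid Word document, and its part names and contents are assumptions chosen to resemble the layout described above. The helper names `build_sample_package`, `list_package_parts` and `read_part` are made up for this example.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (NOT a valid Word document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Part names below are assumptions that imitate a typical layout.
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("docProps/core.xml", "<coreProperties/>")
        package.writestr("word/document.xml", "<document>hello</document>")
        package.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buffer.getvalue()


def list_package_parts(package_bytes: bytes) -> list[str]:
    """Return the names of all members inside the zip package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def read_part(package_bytes: bytes, part_name: str) -> bytes:
    """Return the raw bytes of one member, e.g. word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(part_name)


sample = build_sample_package()
parts = list_package_parts(sample)
for name in parts:
    print(name)
print(read_part(sample, "word/document.xml").decode("utf-8"))
```

With a real file on disk, `zipfile.ZipFile` can be given the path to the DOCX file directly, because the module looks at the file content and not at the extension, so the rename is only needed for the right-click approach.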
71 72 00:06:18,640 --> 00:06:24,880 So that was all about the two different file formats that Microsoft Office supports. 72 73 00:06:25,030 --> 00:06:32,560 Starting from the next video, we'll start doing a deep dive into actual files to understand how we can uncover 73 74 00:06:32,620 --> 00:06:34,850 the malicious traits in them. 74 75 00:06:34,870 --> 00:06:36,190 That's all for this video. 75 76 00:06:36,220 --> 00:06:41,890 If you have any questions or if you want to understand more, please do write in the discussion section 76 77 00:06:41,950 --> 00:06:44,640 and we can discuss more about these file formats. 77 78 00:06:45,010 --> 00:06:46,280 Thanks a lot for watching. 78 79 00:06:46,290 --> 00:06:46,600 Bye-bye.