FINAL PROJECT 2_Paper Analysis

Paper list

1 : Detection of Malicious PDF based on Document Structure Features and Stream Objects

https://www.koreascience.or.kr/article/JAKO201809355933293.pdf

2 : Data Mining Based Strategy for Detecting Malicious PDF Files

https://ieeexplore.ieee.org/document/8455965

3 : Hidost: a static machine-learning-based detector of malicious files

https://link.springer.com/content/pdf/10.1186/s13635-016-0045-0.pdf

4 : Detection of malicious PDF files and directions for enhancements: A state-of-the art survey

https://www.sciencedirect.com/science/article/pii/S0167404814001606?casa_token=ewyvSRQBFmkAAAAA:DpxW4KmaPbM0oe8n6z2oObI7eIzUsVuwhGz_gy8gSqLstniQhuwShjqFT-je9Ol7T7AY_MZ4VXA

5 : A structural and content-based approach for a precise and robust detection of malicious PDF files

https://ieeexplore.ieee.org/abstract/document/7509925?casa_token=pCxlt1XsOoEAAAAA:3_QC0TeTuFg49lV46evto3db1HCUMqTcYczFHYCX-3bQmo_6XPdI7_YVUJPEu1CZxeynzQTt974

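Paper 1. Summary

Document-based malware is mainly distributed through document files attached to emails or websites. As soon as the user clicks the file to open it, the malicious code is dropped or downloaded without the user's knowledge.

This study presents a more refined detection model by extracting text keywords from the embedded malicious scripts, in addition to using the document structure.

There are several types of objects. Among them, a stream object is a sequence of bytes (binary data) with no length limit, so it is used for large image files or for the objects that make up a page.

Malicious PDFs usually insert their malicious code into a stream object. The other object types have length limits, which is why stream objects are used.

Benign documents consist of many objects, whereas malicious files contain few objects.

The most important variable is size: malicious documents tend to be smaller than benign ones.

Analysis of the long byte sequences shows that most of them contain JavaScript. Because plainly visible JavaScript is likely to be detected, it turned out to be encoded in various forms.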
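
The observation that JavaScript hides inside encoded streams can be sketched in Python as follows. This is a minimal illustration and not the paper's method: it assumes streams are delimited by the `stream`/`endstream` keywords, it only tries zlib (FlateDecode) inflation, and the `JS_HINTS` list and the function names are hypothetical choices made for this example.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

# Hypothetical indicator strings, chosen for illustration only.
JS_HINTS = (b"eval(", b"unescape(", b"String.fromCharCode", b"app.")


def decode_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return every stream body, zlib-inflated when that is possible."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            decoded.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            decoded.append(body)  # not Flate-compressed: keep the raw bytes
    return decoded


def streams_with_js_hints(pdf_bytes: bytes) -> list[int]:
    """Indexes of streams whose decoded body contains a JavaScript hint."""
    return [
        index
        for index, body in enumerate(decode_streams(pdf_bytes))
        if any(hint in body for hint in JS_HINTS)
    ]
```

Real samples often chain several filters (ASCIIHex, ASCII85, LZW and so on), so a single zlib pass like this one would only uncover the simplest cases.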

Paper 1. Ideas

1. Look into libraries that inspect a file's internal structure instead of executing it the way Acrobat PDF Reader does

https://stackoverflow.com/questions/29342542/how-can-i-extract-a-javascript-from-a-pdf-file-with-a-command-line-tool

How can I extract a JavaScript from a PDF file with a command line tool?
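
As a rough idea of what such a static inspection could look like without any third-party library, the sketch below counts a few PDF name keywords directly in the raw bytes. It is an assumption-laden example: the `SUSPICIOUS_NAMES` list and the function names are hypothetical, and it only handles `#xx` hex escapes in names, not compressed object streams.

```python
import re
from collections import Counter

# Hypothetical watch list, chosen for illustration only.
SUSPICIOUS_NAMES = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")

# A PDF name starts with "/" and runs until whitespace or a delimiter.
_NAME = re.compile(rb"/[^\s/<>\[\]()%{}]+")
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def unescape_name(token: bytes) -> str:
    """Undo #xx escapes so that /J#61vaScript reads as /JavaScript."""
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), token)
    return raw.decode("latin-1")


def count_names(pdf_bytes: bytes) -> dict[str, int]:
    """Count exact occurrences of each watched name in the file bytes."""
    counts = Counter(unescape_name(m.group(0)) for m in _NAME.finditer(pdf_bytes))
    return {name: counts[name] for name in SUSPICIOUS_NAMES}
```

The file is never opened in a reader, only read as bytes, which is the property idea 1 is after.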


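2. A PDF carrying new malicious code that falls outside the feature values hand-picked by the authors is unlikely to be classified correctly.

As with the other papers, PDF malware classification itself can only respond reactively (a limitation).

How could this be overcome?

(1) Very fast follow-up: when new malicious code outside the existing database is found, update quickly and reflect it in the model

(2) What about a proactive response?

As it happens, Lim Cha-sung (임차성), CEO of SecuLetter (시큐레터), gave a talk on proactive response (reverse engineering)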

https://www.boannews.com/media/view.asp?idx=92106&kind=

[ISEC 2020] Proactive response to malware through reverse engineering


http://m.elec4.co.kr/article/articleView.asp?idx=27520

[Startup] SecuLetter: “We protect your information and assets from malware contained in emails”


Paper 2. Summary


In this paper, a new algorithm based on data mining techniques is developed for the detection of malicious PDF files.

The feature selection stage is used to select the optimum number of features extracted from the PDF file to achieve a high detection rate and a low false positive rate with small computational overhead. Accordingly, the proposed algorithm can achieve a high detection rate and accuracy, with a low false positive rate.

System model

Improved binary gravitational search algorithm (IBGSA) is used to select the best features that maximize classification accuracy and minimize false positive rate.

+ Random Forest

(In Paper 1 the features were selected manually, whereas in Paper 2 the features themselves are chosen by the IBGSA algorithm before modeling.)
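
To make the idea of algorithmic feature selection concrete, here is a small Python sketch of a wrapper-style search over binary feature masks. It is not IBGSA and not the paper's pipeline: a plain bit-flip hill climb stands in for IBGSA, a nearest-centroid rule stands in for Random Forest, the fitness is scored on the training data itself, and `fp_weight` plus all function names are assumptions made for this example. What it does share with the description above is the objective: reward accuracy and penalize false positives.

```python
import random


def fitness(mask, X, y, fp_weight=0.5):
    """Accuracy minus weighted false-positive rate, using only the
    features where mask[i] == 1 (labels: 0 = benign, 1 = malicious)."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")  # an empty feature set is never acceptable

    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[c] for r in rows) / len(rows) for c in cols]

    c0, c1 = centroid(0), centroid(1)
    correct = false_pos = 0
    for x, t in zip(X, y):
        d0 = sum((x[c] - m) ** 2 for c, m in zip(cols, c0))
        d1 = sum((x[c] - m) ** 2 for c, m in zip(cols, c1))
        pred = 1 if d1 < d0 else 0
        correct += pred == t
        false_pos += pred == 1 and t == 0
    benign = sum(1 for t in y if t == 0)
    return correct / len(y) - fp_weight * false_pos / benign


def select_features(X, y, iterations=200, seed=0):
    """Bit-flip hill climb over feature masks (a stand-in for IBGSA)."""
    rng = random.Random(seed)
    n = len(X[0])
    best = [1] * n
    best_score = fitness(best, X, y)
    for _ in range(iterations):
        cand = best[:]
        cand[rng.randrange(n)] ^= 1  # toggle one feature on or off
        score = fitness(cand, X, y)
        # Accept improvements, or ties that use fewer features.
        if score > best_score or (score == best_score and sum(cand) < sum(best)):
            best, best_score = cand, score
    return best, best_score
```

A hill climb like this can get stuck in local optima, which is one reason population-based searches such as IBGSA are attractive for this step.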

Paper 2. Ideas

Uses a specific algorithm (IBGSA) to address the limitation of Paper 1 (manual feature selection)

In a new project, yet another algorithm could be used for feature selection before modeling

Paper 5. Summary


Section 1: Summary

In this paper, we present a novel machine learning-based approach to the detection of malicious PDF files that leverages on information extracted both from the "structure" and the "content" of the PDF file.
On one hand, we represent the information about the structure by analyzing: a) general properties of the PDF file structure and b) structural properties of the PDF objects in terms of keywords.
On the other hand, we analyze content-based information such as: a) malformed objects, streams and codes, b) known vulnerabilities in Javascript code and c) embedded contents such as other PDF files.

Section 2: Introduction to the PDF structure

Section 3: Prior work and its limitations
Because of the evident weaknesses of structural systems, research focused again on detecting malicious Javascript code.
PDFs have been analyzed in various ways,
but most research so far is said to have focused on inspecting JavaScript code

Section 4
Machine learning has been applied in prior work, but with limitations
So the authors split the features into three groups that are not limited to structure, giving the machine learning model a wider range of inputs

[[[The story so far: malware classification that leans only on JavaScript, or only on structure, is not perfect
We add content as features on top of that and build a better-performing malware classification model]]]

4-1
(1) General structure
8 features
i) The size of the file;
ii) The number of versions of the file;
iii) The number of indirect objects;
iv) The number of streams;
v) The number of compressed objects;
vi) The number of object streams;
vii) The number of X-ref streams;
viii) The number of objects containing Javascript.
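
Most of the general-structure features listed above can be approximated from the raw bytes. The Python sketch below is only an approximation under stated assumptions: versions are estimated by counting `%%EOF` markers (one per incremental update), objects are found with a simple `N G obj ... endobj` pattern, and feature v) (compressed objects) is left out because it requires decoding object streams. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# "<num> <gen> obj ... endobj" blocks; a simplification of real PDF parsing.
_OBJ = re.compile(rb"\b\d+\s+\d+\s+obj\b(.*?)\bendobj\b", re.DOTALL)
_OBJSTM = re.compile(rb"/Type\s*/ObjStm\b")
_XREF = re.compile(rb"/Type\s*/XRef\b")


@dataclass
class GeneralStructure:
    size: int                # i)   file size in bytes
    versions: int            # ii)  approximated by the number of %%EOF markers
    indirect_objects: int    # iii)
    streams: int             # iv)
    object_streams: int      # vi)
    xref_streams: int        # vii)
    javascript_objects: int  # viii)


def general_structure(pdf_bytes: bytes) -> GeneralStructure:
    """Approximate the general-structure features from raw PDF bytes."""
    bodies = [m.group(1) for m in _OBJ.finditer(pdf_bytes)]
    return GeneralStructure(
        size=len(pdf_bytes),
        versions=pdf_bytes.count(b"%%EOF"),
        indirect_objects=len(bodies),
        streams=sum(b"endstream" in b for b in bodies),
        object_streams=sum(_OBJSTM.search(b) is not None for b in bodies),
        xref_streams=sum(_XREF.search(b) is not None for b in bodies),
        javascript_objects=sum(b"/JS" in b or b"/JavaScript" in b for b in bodies),
    )
```

Calling `dataclasses.asdict(general_structure(data))` would turn the result into a plain feature dictionary for a classifier.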

(2) Object structure
The reason for using values obtained by k-clustering the keyword counts is that
the purpose of a file can be inferred from how often specific keywords occur
The reason why we considered characteristic keywords, along with their occurrence, is that their presence is often associated with specific actions performed by the file.
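
One possible reading of the k-clustering step is sketched below in Python: keyword occurrence counts are split into a "frequent" and an "infrequent" group with a one-dimensional 2-means, and the frequent group is treated as the characteristic keywords. This interpretation, the choice of k = 2, the function names, and the sample counts are all assumptions of this sketch and are not taken from the paper.

```python
def two_means_1d(values, max_iter=100):
    """1-D k-means with k=2, initialised at the min and max value.
    Returns the (low, high) cluster centres."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(max_iter):
        high = [v for v in values if abs(v - hi) < abs(v - lo)]
        low = [v for v in values if abs(v - hi) >= abs(v - lo)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments no longer change
        lo, hi = new_lo, new_hi
    return lo, hi


def characteristic_keywords(counts):
    """Keywords whose occurrence count falls in the high-frequency cluster."""
    lo, hi = two_means_1d(list(counts.values()))
    return sorted(k for k, v in counts.items() if abs(v - hi) < abs(v - lo))


# Hypothetical occurrence counts over a training set.
example_counts = {"/Page": 120, "/Font": 95, "/JS": 3, "/OpenAction": 2, "/Launch": 1}
```

With the hypothetical counts above, `characteristic_keywords(example_counts)` keeps `/Font` and `/Page`; the per-file occurrences of the selected keywords would then form the object-structure part of the feature vector.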

(3) Content-based structure
If PeePDF or Origami rejects the file, it should be treated as suspicious
The two tools perform a non-forced scan.
If one of these tools rejects the file, it means that there might be suspicious elements such as the execution of code, malformed or incorrect x-ref tables, corrupted headers, etc.
There are also 5 features that represent information about malformed:
a) objects (for example, when scripting code is entirely injected inside a PDF dictionary),
b) streams,
c) actions (using keywords not proper of the PDF language),
d) code (for example, using functions related to known vulnerabilities),
e) compression filters (the way in which data like images or code are compressed in the file to reduce the file size).

4-2 Classification
Decision tree classifier + AdaBoost
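
For intuition about how boosting combines weak tree learners, here is a compact pure-Python AdaBoost sketch. It uses decision stumps (depth-1 trees) as the weak learner, which is a simplification of the decision trees named above, and it is not the paper's implementation. Labels are assumed to be +1 (malicious) and -1 (benign), and all names are hypothetical.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """A depth-1 decision tree: one threshold test on one feature."""
    return polarity if row[feature] >= threshold else -polarity


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for wi, row, t in zip(w, X, y)
                          if stump_predict(row, f, thr, pol) != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost(X, y, rounds=10):
    """Train `rounds` weighted stumps; y must contain +1 / -1 labels."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(row, f, thr, pol))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, f, thr, pol) for a, f, thr, pol in model)
    return 1 if score >= 0 else -1
```

In practice a library implementation with deeper trees and a held-out test set would be the more sensible choice; the sketch only shows the reweighting loop.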

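5 Experiments

6 Conclusion
In this work, we presented a new approach that leveraged on both structural and content-based information to provide a very accurate detection of PDF malware.

Paper 5. Ideas

Unlike the earlier papers, the prediction model does not rely on JavaScript alone as features. It also uses the content inside the PDF as key features, which yields a more effective model.

The key point, in my judgment: building an effective model from existing data matters, but how to build a model that can also respond proactively matters more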

Paper 3. Summary

