[Analysis Methodology] Ensemble Learning(4) - Random Forests

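This post summarizes the lecture video 04-4: Ensemble Learning - Random Forests from [Korea University] Business Analytics (Graduate, IME654), taught by Professor Pilsung Kang of the School of Industrial and Management Engineering at Korea University.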

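1. Random Forests Overview

- A special form of bagging

- The base learner is a decision tree

- To secure the diversity of the ensemble, it uses bagging and also selects the predictor variables at random

* The random variable selection is what distinguishes it from plain bagging with decision trees as the base learner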

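2. Random Forests Algorithm

- B decision trees in total

- For each decision tree, draw a bootstrap sample and randomly select m of the p variables as split candidates

- The m variables are drawn anew every time a node is split into daughter nodes (child nodes)

- Restricting the variables can make the average performance of the individual learners lower than in plain bagging, but because the learners are trained more diversely than in plain bagging, the overall performance can be better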
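The steps above can be sketched in plain Python. This is a minimal illustration rather than the lecture's implementation: the function names (`grow_tree`, `fit_forest`, `predict_forest`), the Gini impurity split criterion, majority voting, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most frequent label, used as the prediction of a leaf."""
    return Counter(labels).most_common(1)[0][0]

def best_split(X, y, feature_ids):
    """Best (feature, threshold) among feature_ids, or None if nothing improves."""
    n = len(y)
    best, best_score = None, gini(y)
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(X, y, m, rng):
    """Grow an unpruned tree; a node is (feature, threshold, left, right) or a label."""
    if len(set(y)) == 1:
        return y[0]
    # Draw a fresh random subset of m out of p variables at every split.
    feature_ids = rng.sample(range(len(X[0])), m)
    split = best_split(X, y, feature_ids)
    if split is None:
        return majority(y)
    j, t = split
    left = [i for i in range(len(y)) if X[i][j] <= t]
    right = [i for i in range(len(y)) if X[i][j] > t]
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, rng))

def predict_tree(node, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node

def fit_forest(X, y, B, m, seed=0):
    """Train B trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by majority vote."""
    return majority([predict_tree(tree, row) for tree in forest])

# Toy data (made up for illustration): p = 3 variables, 6 observations.
X = [[0.1, 1.0, 5.0], [0.2, 0.9, 4.0], [0.3, 1.1, 6.0],
     [0.9, 1.0, 5.5], [1.0, 0.8, 4.5], [1.1, 1.2, 5.2]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, B=25, m=2, seed=0)
```

Setting `m` equal to the number of variables would turn this sketch back into plain bagging of decision trees, since every split would then see all p variables.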

3. Generalization Error

- Each tree is grown without pruning, so it tends to over-fit its training data

- Assuming the number of trees is sufficiently large, an upper bound on the generalization error of the random forest can be computed: generalization error $\le \frac{\bar{\rho}(1 - s^2)}{s^2}$

- $\bar{\rho}$ : the mean of the correlation coefficients between individual trees

+ For a pair of trees, compute the correlation coefficient from the probabilities they output for the same label

- $s^2$ : the square of $s$, the mean value of the individual trees' margin function (the difference between the probabilities output for the correct class and for an incorrect class)

- The more diverse the models, the smaller $\bar{\rho}$ becomes, and the more accurate the individual models, the larger $s^2$ becomes; both make the generalization error smaller.
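To make the effect of the two quantities concrete, here is a small worked sketch. It assumes the bound takes the form commonly quoted from Breiman's Random Forests paper, $\bar{\rho}(1 - s^2)/s^2$, and it follows the simplified description above in which $\bar{\rho}$ is the mean pairwise correlation of the trees' predicted probabilities. The function names and the probability values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def mean_correlation(tree_probs):
    """Average correlation over all pairs of trees (rho-bar)."""
    return mean(pearson(a, b) for a, b in combinations(tree_probs, 2))

def generalization_bound(rho_bar, s):
    """Upper bound rho_bar * (1 - s^2) / s^2, assumed form of the bound."""
    if not 0 < s <= 1:
        raise ValueError("s must lie in (0, 1]")
    return rho_bar * (1 - s ** 2) / s ** 2

# Hypothetical probabilities that three trees assign to the same label
# on four observations.
tree_probs = [
    [0.9, 0.2, 0.7, 0.4],
    [0.8, 0.3, 0.6, 0.5],
    [0.6, 0.4, 0.9, 0.1],
]
rho_bar = mean_correlation(tree_probs)

# Less correlated trees (smaller rho_bar) or stronger trees (larger s)
# both tighten the bound.
loose = generalization_bound(0.5, 0.5)      # 0.5 * 0.75 / 0.25, about 1.5
diverse = generalization_bound(0.2, 0.5)    # 0.2 * 0.75 / 0.25, about 0.6
accurate = generalization_bound(0.2, 0.8)   # 0.2 * 0.36 / 0.64, about 0.1125
```

A value above 1, as in the first call, simply means the bound says nothing useful for that combination of correlation and strength.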

4. Variable Importance

- Random Forests can also produce variable importance, which is one reason the algorithm is used so often in practice

- Steps for computing importance

1) Set aside the OOB (out-of-bag) data of each tree

2) Compute the error of each trained tree on its OOB data ( $e_i$ )

3) Randomly permute the values of variable $x_i$ in the OOB data and compute the error again ( $p_i$ )

4) Compute the mean and standard deviation of $p_i - e_i$ over all trees to derive the variable importance

- If the error rises sharply after the values are shuffled, the variable was used in training; if it was not used at all, the shuffle has no effect and $p_i$ and $e_i$ are equal
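Steps 2) to 4) can be sketched as follows. To keep the example short, the "trees" here are hypothetical hand-written rules standing in for fitted trees, and the OOB index lists are made up rather than derived from real bootstrap samples. The function names and the use of the misclassification rate as the error are assumptions as well.

```python
import random
from statistics import mean, pstdev

def error_rate(predict, X, y):
    """Fraction of misclassified rows."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(trees, X, y, feature, rng):
    """Mean and standard deviation of p_i - e_i over all trees.

    trees: list of (predict_fn, oob_indices) pairs.
    """
    diffs = []
    for predict, oob in trees:
        X_oob = [X[i] for i in oob]
        y_oob = [y[i] for i in oob]
        e_i = error_rate(predict, X_oob, y_oob)        # step 2: OOB error
        shuffled = [row[feature] for row in X_oob]
        rng.shuffle(shuffled)                          # step 3: permute one variable
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X_oob, shuffled)]
        p_i = error_rate(predict, X_perm, y_oob)
        diffs.append(p_i - e_i)
    return mean(diffs), pstdev(diffs)                  # step 4

# Toy data: the label depends on variable 0 only; variable 1 is noise.
X = [[0.1, 7.0], [0.2, 3.0], [0.8, 5.0], [0.9, 1.0], [0.3, 2.0], [0.7, 9.0]]
y = [0, 0, 1, 1, 0, 1]

# Hypothetical "trees": each one splits on variable 0 and never reads variable 1.
trees = [
    (lambda row: int(row[0] > 0.5), [0, 2, 3, 5]),
    (lambda row: int(row[0] > 0.4), [1, 2, 4, 5]),
]

rng = random.Random(0)
mean_0, std_0 = permutation_importance(trees, X, y, feature=0, rng=rng)
mean_1, std_1 = permutation_importance(trees, X, y, feature=1, rng=rng)
```

Because neither rule reads variable 1, shuffling it cannot change any prediction, so `mean_1` and `std_1` are both zero, which matches the last bullet above. For variable 0 the permuted error can only stay the same or grow here, since both rules classify their OOB rows perfectly before the shuffle.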

References

- Lecture by Professor Pilsung Kang, School of Industrial and Management Engineering, Korea University
