[5-2] Decision Tree + Random Forest

해영이의 성장일기 2026. 4. 1. 20:17

What is Classification?

Supervised vs. Unsupervised Learning

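Supervised learning: learns from data that comes with the correct answers, then predicts answers for new data.
Unsupervised learning: analyzes data that has no answers in order to find hidden patterns.

In supervised learning (classification), the training data carries class labels, and new data is classified based on those labels.
In unsupervised learning (clustering), the training data has no labels, and the data is grouped so that similar items end up together.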

Issues in Classification: Data Preparation

Data cleaning : Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection) : Remove the irrelevant (index, ID, etc…) or redundant attributes (yearly salary and monthly salary, etc…)
Data transformation : Generalize and/or normalize data
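
A minimal sketch of the three preparation steps on a toy table, in plain Python. The column names (id, age, monthly_salary, yearly_salary) and the choice of mean imputation and min-max normalization are assumptions made for illustration; the notes do not prescribe a particular technique.

```python
# Toy records; None marks a missing value. All column names are hypothetical.
rows = [
    {"id": 1, "age": 25, "monthly_salary": 3000, "yearly_salary": 36000},
    {"id": 2, "age": None, "monthly_salary": 4000, "yearly_salary": 48000},
    {"id": 3, "age": 35, "monthly_salary": 5000, "yearly_salary": 60000},
]

def fill_missing_with_mean(rows, column):
    """Data cleaning: replace missing values in `column` with the column mean."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

def drop_columns(rows, columns):
    """Relevance analysis: drop irrelevant or redundant attributes."""
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]

def min_max_normalize(rows, column):
    """Data transformation: rescale `column` to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [{**r, column: (r[column] - lo) / span} for r in rows]

cleaned = fill_missing_with_mean(rows, "age")
selected = drop_columns(cleaned, {"id", "yearly_salary"})  # ID is irrelevant, yearly salary is redundant
prepared = min_max_normalize(min_max_normalize(selected, "age"), "monthly_salary")
print(prepared)
```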

Issues in Classification: Evaluation Points

What is a Decision Tree?

❑ A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions

❑ Each branch node represents a choice between alternatives, and each leaf node represents a decision

A branch node is a test condition; a leaf node is the resulting decision.

So how do we build the tree?
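
To make the branch/leaf distinction concrete, here is a small hand-written tree as nested Python dictionaries. The feature names and values (outlook, humidity) are made up for illustration, and the dictionary layout is just one possible representation.

```python
# A branch node is a dict with a "feature" to test and one child per value.
# A leaf node is simply the decision, stored as a class label string.
tree = {
    "feature": "outlook",
    "children": {
        "sunny": {
            "feature": "humidity",
            "children": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": "no",
    },
}

def predict(node, sample):
    """Follow branch nodes until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        node = node["children"][sample[node["feature"]]]
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```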

Decision Tree Induction

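The algorithms differ only in how they select the feature to split on.

ID3: uses information gain to find the best split condition.
C4.5: uses gain ratio, which normalizes information gain by the split information, to select the splitting feature.
CART: uses the Gini index and splits so as to minimize impurity.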
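
As a rough illustration of the CART criterion: the Gini index of a set of labels is 1 minus the sum of squared class proportions, and a split can be scored by the size-weighted Gini index of its partitions. The labels below are made up, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(partitions):
    """Size-weighted average Gini index of the partitions of a split."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                   # 0.5, evenly mixed
print(gini_of_split([["yes", "yes"], ["no", "no"]]))  # 0.0, both partitions pure
```

A lower value after the split means less impurity, so CART prefers the split with the smallest weighted Gini index.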

Feature Selection

Let's look at the three methods above in more detail.

The greater the uncertainty, the higher the entropy; the smaller the uncertainty, the lower the entropy. In other words, entropy measures how mixed the data is.
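
A short sketch of entropy and the information gain built on it (the ID3 criterion), assuming the usual Shannon entropy in bits. The label lists are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of -p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    n = len(parent)
    after = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - after

mixed = ["yes", "no", "yes", "no"]
pure = ["yes", "yes", "yes", "yes"]
print(entropy(mixed))  # 1.0, maximally mixed two-class set
print(entropy(pure))   # 0.0, no uncertainty
print(information_gain(mixed, [["yes", "yes"], ["no", "no"]]))  # 1.0
```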

Gain ratio exists because of a limitation of ID3: when a feature splits the data into many partitions, each partition is small and contains very little mixed data, so information gain tends to favor features with many values. To reduce this bias, the penalty grows as a feature creates more partitions.
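
The effect of the penalty can be seen in a small sketch. It assumes the usual C4.5 definition, gain ratio = information gain / split information, where split information is the entropy of the partition sizes. The data is made up: one feature splits cleanly in two, the other behaves like an ID column with one partition per row.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

def split_info(parent, partitions):
    """Entropy of the partition sizes; it grows with the number of partitions."""
    n = len(parent)
    return sum(-(len(p) / n) * math.log2(len(p) / n) for p in partitions)

def gain_ratio(parent, partitions):
    si = split_info(parent, partitions)
    return information_gain(parent, partitions) / si if si > 0 else 0.0

parent = ["yes", "yes", "no", "no"]
two_way = [["yes", "yes"], ["no", "no"]]      # a useful binary feature
id_like = [["yes"], ["yes"], ["no"], ["no"]]  # one partition per row, like an ID

# Information gain cannot tell the two apart, gain ratio can.
print(information_gain(parent, two_way), information_gain(parent, id_like))  # 1.0 1.0
print(gain_ratio(parent, two_way), gain_ratio(parent, id_like))              # 1.0 0.5
```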

As in machine learning in general, what we really want is to classify future data well.

Fitting the tree only to the training data makes the model too complex (overfitting).

So we need a way to deal with that.

Pre-pruning: the problem is how to measure the goodness of a split and where to set the threshold. Both depend on the dataset.
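
One way to picture pre-pruning is as a stopping test evaluated before each split. The thresholds max_depth, min_samples, and min_gain below are hypothetical knobs with arbitrary defaults; as noted above, good values depend on the dataset.

```python
def should_stop(labels, depth, best_gain, max_depth=3, min_samples=5, min_gain=0.01):
    """Return True if the node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:      # node is already pure
        return True
    if depth >= max_depth:         # tree is deep enough
        return True
    if len(labels) < min_samples:  # too few samples to split reliably
        return True
    if best_gain < min_gain:       # the best available split is not good enough
        return True
    return False

labels = ["yes"] * 4 + ["no"] * 4
print(should_stop(labels, depth=1, best_gain=0.30))   # False, keep splitting
print(should_stop(labels, depth=1, best_gain=0.001))  # True, goodness below threshold
```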

Post-pruning

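Check performance on a validation set.
If cutting a subtree does not reduce performance, remove it.
Repeat until performance starts to drop.

Post-pruning is the better-known and more widely used of the two methods.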
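
The post-pruning steps can be sketched as a bottom-up pass over the tree. This is one common variant, often called reduced-error pruning; the notes do not say which variant the lecture used. The tree layout, the "majority" field (the most common training label at a node), and the validation samples are all assumptions made for illustration.

```python
def predict(node, sample):
    while isinstance(node, dict):
        node = node["children"][sample[node["feature"]]]
    return node

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(node, validation):
    """Replace a subtree with a leaf wherever that does not hurt validation accuracy.

    `validation` holds only the (sample, label) pairs that reach this node.
    """
    if not isinstance(node, dict) or not validation:
        return node  # already a leaf, or no validation evidence to decide with
    feature = node["feature"]
    children = {}
    for value, child in node["children"].items():  # prune the children first
        subset = [(x, y) for x, y in validation if x[feature] == value]
        children[value] = prune(child, subset)
    candidate = {**node, "children": children}
    subtree_correct = sum(predict(candidate, x) == y for x, y in validation)
    leaf_correct = sum(node["majority"] == y for _, y in validation)
    return node["majority"] if leaf_correct >= subtree_correct else candidate

tree = {
    "feature": "outlook", "majority": "yes",
    "children": {
        "sunny": {"feature": "windy", "majority": "no",
                  "children": {"true": "no", "false": "yes"}},
        "rainy": "yes",
    },
}
validation = [
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "sunny", "windy": "false"}, "no"),  # the windy split does not generalize
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"}, "yes"),
]
pruned = prune(tree, validation)
print(accuracy(tree, validation), accuracy(pruned, validation))  # 0.75 1.0
```

Here the windy subtree is collapsed into the leaf "no", while the root split on outlook is kept because removing it would lower validation accuracy.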

Draw random samples from the data with replacement (bootstrap sampling), then train a separate tree on each of the resulting datasets.
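
A minimal sketch of the sampling step only, using the standard library. Each resulting dataset would then be used to train its own tree, which is omitted here. The seed, the data, and the number of datasets are arbitrary.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(10))
datasets = [bootstrap_sample(data, rng) for _ in range(3)]
for d in datasets:
    print(sorted(d))  # some items repeat, others are missing
```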

At every split, randomly select only a subset of all the features and pick the best split among them. As a result, each tree also looks at different features.
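
A sketch of the per-split feature sampling. The feature names and their scores are made up, and using the square root of the number of features as the subset size is a common default that is assumed here, not something stated in the notes.

```python
import math
import random

def candidate_features(features, rng, k=None):
    """Pick a random subset of the features to consider at one split."""
    if k is None:
        k = max(1, int(math.sqrt(len(features))))  # assumed default subset size
    return rng.sample(features, k)

def best_split(features, score, rng, k=None):
    """Choose the best-scoring feature among the random candidates only."""
    return max(candidate_features(features, rng, k), key=score)

features = ["age", "income", "student", "credit_rating"]
gains = {"age": 0.25, "income": 0.03, "student": 0.15, "credit_rating": 0.05}  # hypothetical scores

rng = random.Random(0)
for tree_id in range(3):
    # Different trees may end up splitting on different features.
    print(tree_id, best_split(features, gains.get, rng))
```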

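Each model has a different reliability (accuracy), so instead of simply counting votes, multiply each vote by the model's weight and sum the results.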
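
A small sketch of weighted voting. Using each model's accuracy directly as its weight is an assumption made for illustration; the predictions and accuracies are made up.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Add up the weights behind each label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

predictions = ["no", "no", "no", "yes", "yes"]
accuracies = [0.55, 0.55, 0.60, 0.90, 0.95]  # hypothetical per-model accuracies

print(weighted_vote(predictions, [1] * 5))    # no, simple counting: 3 votes vs 2
print(weighted_vote(predictions, accuracies)) # yes, weighted: 1.70 vs 1.85
```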
