## 📊 DATA PREPROCESSING
### Descriptive Statistics
**Mean:** `x̄ = Σxi / n`
Example: [4,5,6] → 15/3 = **5**
**Median:** The middle value after sorting (the average of the two middle values when n is even)
Example: [4,5,6] → **5** | [4,5,6,7] → (5+6)/2 = **5.5**
**Mode:** The most frequent value
Example: [4,5,5,6] → **5**
**Standard Deviation (population):** `σ = √[Σxi²/n - x̄²]`
Example: [4,7,10] → Σxi²=165, n=3, x̄=7
σ = √[165/3 - 7²] = √[55-49] = √6 = **2.45**
Example: [2,4,6] → Σxi²=56, x̄=4, n=3 → √[56/3 - 16] = √2.67 = **1.63**
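
The formulas above can be checked with a short Python sketch that uses only the standard library. The helper name `population_std` is an illustrative choice, not part of the original sheet; it implements the computational form `√[Σxi²/n - x̄²]`.

```python
import math
import statistics


def population_std(values):
    """Population standard deviation via sqrt(sum(x^2)/n - mean^2)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(x * x for x in values) / n
    # Clamp at zero: rounding can make the difference slightly negative.
    return math.sqrt(max(mean_of_squares - mean ** 2, 0.0))


mean_value = statistics.mean([4, 5, 6])          # 5
median_odd = statistics.median([4, 5, 6])        # 5
median_even = statistics.median([4, 5, 6, 7])    # 5.5
mode_value = statistics.mode([4, 5, 5, 6])       # 5
sigma_a = population_std([4, 7, 10])             # about 2.45
sigma_b = population_std([2, 4, 6])              # about 1.63
```

`statistics.pstdev` computes the same population value from the definitional form, so it can serve as a cross-check.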
---
### Boxplot
**Quartiles:**
```
Q1 = Median(lower half)
Q2 = Median
Q3 = Median(upper half)
IQR = Q3 - Q1
```
Example: [1,2,3,4,5,6,7,8,9] → Q1=2.5, Q2=5, Q3=7.5, IQR=5 (the median is excluded from both halves when n is odd)
**Outliers:**
```
Lower fence = Q1 - 1.5×IQR
Upper fence = Q3 + 1.5×IQR
```
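
A minimal sketch of the quartile and fence rules above. It assumes the median-of-halves convention in which the overall median is left out of both halves when the number of values is odd, which reproduces the example; other quartile conventions give slightly different numbers. The function names are hypothetical.

```python
import statistics


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Assumption: with an odd count, the overall median is excluded from
    both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


def iqr_fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = quartiles_median_of_halves(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = quartiles_median_of_halves(data)   # 2.5, 5, 7.5
lower_fence, upper_fence = iqr_fences(data)     # -5.0, 15.0
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```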
---
### Normalization
**1. Decimal Scaling:** `x' = x / 10^j`, where j is the smallest integer such that max|x'| < 1
Example: 91 → 91/100 = **0.91** (j=2)
**2. Min-Max:** `x' = (x-min)/(max-min) × (b-a) + a`
Example: [4,7,10] scaled to [0,1]: 7 → (7-4)/6 = **0.5**
**3. Z-Score:** `x' = (x-μ)/σ` with `σ = √[Σxi²/n - x̄²]`
Example: [4,7,10] → x̄=7, σ=2.45: 7 → (7-7)/2.45 = **0**
**4. Modified Z-Score:** `x' = 0.6745×(x-median)/MAD`
MAD = `median(|xi-median|)`
Example: [1,2,3,4,100] → Median=3, MAD=1: 100 → 0.6745×97/1 = **65.43**
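
The four normalization methods can be sketched in Python as follows. The function names are illustrative, and the sketch assumes non-degenerate input (max ≠ min, σ ≠ 0, MAD ≠ 0); it does not guard against division by zero.

```python
import statistics


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |x'| < 1."""
    max_abs = max(abs(x) for x in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in values]


def min_max(values, a=0.0, b=1.0):
    """Rescale linearly so min maps to a and max maps to b."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) * (b - a) + a for x in values]


def z_scores(values):
    """Standardize with the population mean and standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def modified_z_scores(values):
    """0.6745 * (x - median) / MAD, with MAD = median(|x - median|)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return [0.6745 * (x - med) / mad for x in values]


scaled = decimal_scaling([91])                   # [0.91]
rescaled = min_max([4, 7, 10])                   # [0.0, 0.5, 1.0]
standardized = z_scores([4, 7, 10])              # middle value is 0.0
robust = modified_z_scores([1, 2, 3, 4, 100])    # last value is about 65.43
```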
---
### Binning
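**Equal-Width:** `Width = (Max-Min) / number of bins`
Example: range [1,9] → 3 bins → Width = 8/3 = 2.67
**Smoothing:**
- **Means:** Replace each value with the bin mean. Example: [1,2,3] → [2,2,2]
- **Medians:** Replace each value with the bin median. Example: [1,2,3] → [2,2,2]
- **Boundaries:** Replace each value with the nearest bin boundary (min or max). Example: [1,2,3] → [1,1,3] (2 is equidistant; here the tie goes to the lower boundary)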
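
A small sketch of equal-width bin edges and the three smoothing rules. The function names are hypothetical, and sending ties to the lower boundary is an assumption chosen to match the [1,2,3] → [1,1,3] example.

```python
import statistics


def equal_width_edges(values, n_bins):
    """Return the n_bins + 1 edges of equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def smooth_by_means(bin_values):
    """Replace every value in the bin with the bin mean."""
    return [statistics.fmean(bin_values)] * len(bin_values)


def smooth_by_medians(bin_values):
    """Replace every value in the bin with the bin median."""
    return [statistics.median(bin_values)] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value with the closer of the bin's min and max.

    Assumption: a value equidistant from both goes to the lower boundary.
    """
    lo, hi = min(bin_values), max(bin_values)
    return [lo if x - lo <= hi - x else hi for x in bin_values]


edges = equal_width_edges(range(1, 10), 3)   # about [1, 3.67, 6.33, 9]
by_means = smooth_by_means([1, 2, 3])        # [2.0, 2.0, 2.0]
by_medians = smooth_by_medians([1, 2, 3])    # [2, 2, 2]
by_bounds = smooth_by_boundaries([1, 2, 3])  # [1, 1, 3]
```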
---
### Correlation
**Computational (Recommended):**
```
Cov(X,Y) = Σ(xi·yi)/n - x̄·ȳ
σx = √[Σxi²/n - x̄²]
σy = √[Σyi²/n - ȳ²]
r = Cov(X,Y) / (σx × σy)
```
Example: X=[1,2,3], Y=[2,4,6]
Σ(xi·yi)=28, Σxi²=14, Σyi²=56, n=3, x̄=2, ȳ=4
→ Cov=1.33, σx=0.82, σy=1.63 → r=**1.0**
**Definitional:**
```
r = Σ[(xi-x̄)(yi-ȳ)] / √[Σ(xi-x̄)² × Σ(yi-ȳ)²]
```
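
Both forms of the correlation coefficient can be compared with a short sketch. The function names are illustrative; the two should agree up to floating-point rounding, and neither handles the zero-variance case.

```python
import math


def pearson_computational(xs, ys):
    """r = Cov / (sigma_x * sigma_y), all from raw sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sigma_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sigma_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov / (sigma_x * sigma_y)


def pearson_definitional(xs, ys):
    """r from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)


r_comp = pearson_computational([1, 2, 3], [2, 4, 6])   # about 1.0
r_def = pearson_definitional([1, 2, 3], [2, 4, 6])     # about 1.0
r_neg = pearson_definitional([1, 2, 3], [3, 2, 1])     # about -1.0
```

The computational form needs only one pass over the data, but it subtracts two nearly equal numbers when the variance is small relative to the mean, so the definitional form is the safer choice for such inputs.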
---
## 🔍 ALGORITHMS
### Apriori
**Support:** `Count(X) / Total`
Example: {A,B} appears in 3 of 10 transactions → **30%**
**Confidence:** `Support(XY) / Support(X)`
Example: Sup({A,B})=30%, Sup({A})=50% → 30/50 = **60%**
**Lift:** `Confidence(X→Y) / Support(Y)`
Example: Conf=60%, Sup(B)=40% → 0.6/0.4 = **1.5**
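
The three measures can be sketched over a list of transactions. The ten transactions below are hypothetical data, constructed only so that the counts match the figures above ({A,B} in 3, A in 5, B in 4 of 10); the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x, y, transactions):
    """Support(X and Y together) / Support(X) for the rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Confidence(X -> Y) / Support(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)


# Hypothetical transactions: 3 with A and B, 2 with A only,
# 1 with B only, 4 with neither.
transactions = (
    [{"A", "B"}] * 3 + [{"A"}] * 2 + [{"B"}] + [{"C"}] * 4
)

sup_ab = support({"A", "B"}, transactions)      # 0.3
conf_a_b = confidence({"A"}, {"B"}, transactions)  # 0.6
lift_a_b = lift({"A"}, {"B"}, transactions)     # about 1.5
```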
---
### ID3
**Entropy:** `E(S) = -Σ pi × log₂(pi)`
Example: 9 Yes, 5 No → p₁=9/14=0.643, p₂=5/14=0.357
E = -(0.643×log₂0.643 + 0.357×log₂0.357) = **0.940**
**Gain:** `Gain(S,A) = E(S) - Σ(|Sv|/|S|)×E(Sv)`
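Example (14 samples split into Sunny=5, Overcast=4, Rain=5): E(S)=0.940, E(Sunny)=0.971, E(Overcast)=0, E(Rain)=0.971
Gain = 0.940 - [(5/14)×0.971 + (4/14)×0 + (5/14)×0.971] = 0.940 - 0.694 = **0.246**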
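
A minimal sketch of entropy and information gain computed from class counts. The per-subset counts (Sunny 2 Yes / 3 No, Overcast 4 / 0, Rain 3 / 2) are an assumption taken from the classic play-tennis weather data, which is consistent with the 9 Yes / 5 No totals and the 0.971 entropy quoted above; the function names are illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing (0 * log 0 is taken as 0).
    return -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )


def information_gain(parent_counts, subsets):
    """E(S) minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted


e_s = entropy([9, 5])                      # about 0.940
e_sunny = entropy([2, 3])                  # about 0.971
# Assumed [Yes, No] counts per subset: Sunny, Overcast, Rain.
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

At full floating-point precision the gain comes out near 0.247; rounding the intermediate entropies to three decimals first gives 0.246.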
---
### K-means
**Euclidean Distance:** `d(p,q) = √[Σ(pi-qi)²]`
Example: p(2,3), q(5,7) → √[(2-5)²+(3-7)²] = √25 = **5**
**SSE:** `Σ distance²(point, centroid)`
Example: Centroid(3,3), A(2,3), B(4,5) → SSE = 1+5 = **6**
**Centroid:** `(mean of xi, mean of yi)`
Example: (2,3), (4,5), (6,7) → **(4,5)**
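
The building blocks of one K-means iteration can be sketched as below. This is not a full K-means loop; the function names and the two centers used in the assignment example are hypothetical.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(squared_distance(p, q))


def sse(points, center):
    """Sum of squared distances from each point to the center."""
    return sum(squared_distance(p, center) for p in points)


def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


dist = euclidean((2, 3), (5, 7))                       # 5.0
error = sse([(2, 3), (4, 5)], (3, 3))                  # 6
center = centroid([(2, 3), (4, 5), (6, 7)])            # (4.0, 5.0)
# Hypothetical centers, used only to illustrate the assignment step.
labels = assign([(2, 3), (4, 5), (6, 6)], [(3, 3), (6, 7)])  # [0, 0, 1]
```

A full run alternates `assign` and `centroid` until the labels stop changing.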
---
## 🔑 KEY FORMULAS
1. **Mean:** x̄ = Σxi/n
2. **Median:** The middle value
3. **σ (Std Dev):** √[Σxi²/n - x̄²]
4. **Min-Max:** (x-min)/(max-min)×(b-a)+a
5. **Z-Score:** (x-μ)/σ
6. **Modified Z-Score:** 0.6745×(x-median)/MAD
7. **Cov:** Σ(xi·yi)/n - x̄·ȳ
8. **r:** Cov(X,Y)/(σx×σy)
9. **Support:** Count/Total
10. **Confidence:** Sup(XY)/Sup(X)
11. **Entropy:** -Σ pi×log₂(pi)
12. **Distance:** √[Σ(pi-qi)²]
---
## 📊 QUICK REFERENCE TABLES
### Normalization
| Method | Formula | Use |
|--------|---------|-----|
| Decimal | x/10^j | Fast |
| Min-Max | (x-min)/(max-min) | Scale to [0,1] |
| Z-Score | (x-μ)/σ | Normally distributed data |
| Modified Z | 0.6745×(x-med)/MAD | Data with outliers |
### Correlation
| \|r\| | Strength |
|------|--------|
| <0.3 | Weak |
| 0.3-0.7 | Moderate |
| >0.7 | Strong |
### Outliers
| Method | Threshold |
|--------|-----------|
| IQR | x < Q1-1.5×IQR or x > Q3+1.5×IQR |
| Z-Score | \|z\| > 3 |
| Modified Z | \|z'\| > 3.5 |
---
## 💡 REMEMBER
**Computational vs Definitional:**
- Computational = Fast (uses Σxi², Σ(xi·yi))
- Definitional = Easy to understand (uses (xi-x̄))
**Binning:**
- Equal-Width = Bins of equal width
- Equal-Frequency = Bins with an equal number of elements
**Interval Notation:**
- [a,b] = includes a AND b
- [a,b) = includes a, NOT b
**ID3:** Choose the attribute with the **highest Gain**
**K-means:** Assign each point to the **nearest cluster** (closest centroid)
**Apriori:** Lift>1 → X and Y are positively associated (Lift=1: independent; Lift<1: negatively associated)