1. Ce este clasificarea in Machine Learning?
Dataset: [(x1,y1), ..., (xn,yn)] xi = vectorul de trasaturi (features) yi = eticheta de clasa Model = functie f(xi) -> clasa Algoritmi cls. XII: - k-NN -- bazat pe distanta euclidiana - Arbore de decizie -- bazat pe reguli if/else
2. Algoritmul k-NN — pas cu pas
1. Calculeaza distanta de la q la fiecare exemplu 2. Sorteaza dupa distanta (crescator) 3. Selecteaza primii k vecini 4. Numara voturile fiecarei clase 5. Returneaza clasa cu cele mai multe voturi
dist(p,q) = sqrt((px-qx)^2 + (py-qy)^2) Exemplu: p=(1,1), q=(2,3) => sqrt(1+4) = 2.2361
import math
from collections import Counter
dataset = [(1.0,1.0,"A"),(1.5,2.0,"A"),(3.0,4.0,"B"),
(5.0,7.0,"B"),(3.5,5.0,"B"),(4.5,5.0,"B"),(3.5,4.5,"B")]
query = (2.0, 3.0); k = 3
def euclidean(p1, p2):
return math.sqrt((p1[0]-p2[0])**2+(p1[1]-p2[1])**2)
distances = [(euclidean(query,(r[0],r[1])),r[2]) for r in dataset]
distances.sort(); neighbors = distances[:k]
votes = Counter(lbl for _,lbl in neighbors)
print(f"Primii {k} vecini: {neighbors}")
print(f"Voturi: {dict(votes)}")
print(f"Clasa prezisa: {votes.most_common(1)[0][0]}")Primii 3 vecini: [(1.1180339887498949, "A"), (1.4142135623730951, "B"), (2.1213203435596424, "B")]
Voturi: {"A": 1, "B": 2}
Clasa prezisa: B3. Arborele de decizie — Gini impurity
Gini(S) = 1 - sum(pi^2), pi = proportia clasei i
Exemplu: S = {3 Da, 3 Nu}
p_Da = 0.5, p_Nu = 0.5
Gini = 1 - (0.25 + 0.25) = 0.5 (dezordine maxima)
Nod pur (o singura clasa): Gini = 0 (optim)Dataset: (30,Nu),(28,Nu),(20,Da),(22,Da),(24,Da),(26,Nu) Gini total: 0.5000 Split temperatura < 25: Stanga [Da,Da,Da]: Gini = 0.0 Dreapta [Nu,Nu,Nu]: Gini = 0.0 Gini ponderat = 0.0000 Information Gain = 0.5000 (split perfect)
4. Implementare cu sklearn
scikit-learn este standardul industrial pentru ML in Python. Interfata uniforma: fit(), predict(), score().from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris(); X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(f"k-NN acuratete: {knn.score(X_test, y_test):.4f}")from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
dt.fit(X_train, y_train)
print(f"Arbore acuratete: {dt.score(X_test, y_test):.4f}")
print(f"Feature importances: {dt.feature_importances_.round(3)}")k-NN acuratete: 0.9778 Arbore acuratete: 0.9556 Feature importances: [0. 0.026 0.056 0.918]
5. Comparatie k-NN vs Arbore + Normalizare
Criteriu | k-NN | Arbore de decizie -------------------|-------------------|------------------ Antrenament | Instantaneu | Construit recursiv Predictie | O(n*d) per ex. | O(adancime) Interpretabilitate | Scazuta | Ridicata (reguli) Sensibil la scala | DA (norm. req.) | Nu Control overfitting| k mai mare | max_depth mai mic
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
knn_s = KNeighborsClassifier(n_neighbors=3)
knn_s.fit(X_train_s, y_train)
print(f"k-NN cu normalizare: {knn_s.score(X_test_s, y_test):.4f}")6. Evaluarea modelelor — Matrice de confuzie
Acuratete = (TP + TN) / Total Precizie = TP / (TP + FP) Recall = TP / (TP + FN) F1-score = 2 * Precizie * Recall / (Precizie + Recall)
from sklearn.metrics import confusion_matrix, classification_report y_pred = knn.predict(X_test) print(confusion_matrix(y_test, y_pred)) print(classification_report(y_test, y_pred, target_names=iris.target_names))
[[19 0 0]
[ 0 12 1]
[ 0 0 13]]
precision recall f1-score support
setosa 1.00 1.00 1.00 19
versicolor 1.00 0.92 0.96 13
virginica 0.93 1.00 0.96 13
accuracy 0.98 45