Analysis: Academic Stress Index (Random Forest)

Data Exploration

The dataset has 280 records and 9 columns (Timestamp, AcademicStage, PeerPressure, HomePressure, StudyEnv, Strategy, BadHabits, AcademicComp, Stress).

Main distributions and statistics:

Academic stage (AcademicStage): undergraduate 63.57%, high school 22.14%, post-graduate 14.29%.
Peer pressure (PeerPressure): mean 3.01 (scale 1–5).
Academic pressure from family (HomePressure): mean 3.09 (scale 1–5).
Study environment (StudyEnv): Peaceful 47.14%, Noisy 26.43%, disrupted 26.07%, and 0.36% missing (1 null value).
Coping strategy (Strategy): Analyze the situation… 54.29%, Emotional breakdown 26.79%, Social support 18.93%.
Bad habits (BadHabits): No 65.71%, Yes 18.93%, prefer not to say 15.36%.
Academic competition (AcademicComp): mean 3.27 (scale 1–5).
Stress index, the target (Stress): balanced, with 20.0% (1), 20.0% (2), 20.0% (3), 20.0% (4), 20.0% (5); mean 3.0.

AcademicStage	PeerPressure	HomePressure	StudyEnv	Strategy	BadHabits	AcademicComp	Stress
undergraduate	4	5	Noisy	Analyze the situation and handle it with intellect	No	3	5
undergraduate	3	4	Peaceful	Analyze the situation and handle it with intellect	No	3	3
undergraduate	1	1	Peaceful	Social support (friends, family)	No	2	4
undergraduate	3	2	Peaceful	Analyze the situation and handle it with intellect	No	4	3
undergraduate	3	3	Peaceful	Analyze the situation and handle it with intellect	No	4	5
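
The shares and means listed above are the kind of summary that `value_counts(normalize=True)` and `mean()` produce. The sketch below is only an illustration: it runs on the five sample rows shown in the table, not on the full 280-record base, and the helper name `summarize` is hypothetical.

```python
import pandas as pd

# Toy frame: only the five sample rows shown above, not the full 280-record base.
sample = pd.DataFrame(
    {
        "AcademicStage": ["undergraduate"] * 5,
        "PeerPressure": [4, 3, 1, 3, 3],
        "HomePressure": [5, 4, 1, 2, 3],
        "StudyEnv": ["Noisy", "Peaceful", "Peaceful", "Peaceful", "Peaceful"],
        "Strategy": [
            "Analyze the situation and handle it with intellect",
            "Analyze the situation and handle it with intellect",
            "Social support (friends, family)",
            "Analyze the situation and handle it with intellect",
            "Analyze the situation and handle it with intellect",
        ],
        "BadHabits": ["No"] * 5,
        "AcademicComp": [3, 3, 2, 4, 4],
        "Stress": [5, 3, 4, 3, 5],
    }
)


def summarize(df: pd.DataFrame) -> dict:
    """Return percentage shares for text columns and means for numeric ones."""
    summary = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            summary[col] = round(float(df[col].mean()), 2)
        else:
            # dropna=False keeps a missing value as its own share
            shares = df[col].value_counts(normalize=True, dropna=False) * 100
            summary[col] = shares.round(2).to_dict()
    return summary


stats = summarize(sample)
print(sample.shape)
print(stats["StudyEnv"])
print(stats["PeerPressure"])
```

On the real base, the same function would be called on the frame returned by `pd.read_csv`; the numbers printed here describe the five sample rows only.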

Preprocessing

Removal of irrelevant columns

The Timestamp column was dropped because it adds no information for the prediction.

Target variable

The target variable is Stress (academic stress index, 1 to 5).

Missing-value handling

The dataset has 1 null value in StudyEnv. It is filled with the mode of StudyEnv (Peaceful), computed on the training data only.
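
As an illustration of why the mode is taken from the training split only, here is a minimal sketch using scikit-learn's `SimpleImputer`. This is a swapped-in alternative to a manual `fillna`, and the two toy frames are hypothetical: the test frame is built so that its own mode (Noisy) differs from the training mode (Peaceful).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits; the real base has a single null in StudyEnv.
train = pd.DataFrame({"StudyEnv": ["Peaceful", "Noisy", "Peaceful", "disrupted"]}, dtype=object)
test = pd.DataFrame({"StudyEnv": ["Noisy", np.nan, "Noisy"]}, dtype=object)

# Fit on the training split only, so the test split never influences the fill value.
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train[["StudyEnv"]])

train_filled = imputer.transform(train[["StudyEnv"]]).ravel()
test_filled = imputer.transform(test[["StudyEnv"]]).ravel()

print(imputer.statistics_)  # fill value learned from the training split
print(test_filled)          # the null is filled with the training mode
```

The null in the test frame receives the training mode even though Noisy is more frequent in the test frame itself, which is the behavior that avoids leaking test information.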

Categorical variable encoding

The columns AcademicStage, StudyEnv, Strategy and BadHabits are categorical. The evaluation pipeline uses Label Encoding, which is acceptable for tree-based models; the shorter training snippet uses one-hot encoding (pd.get_dummies) instead.
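
To make the difference between integer codes and one-hot columns concrete, the sketch below encodes a toy StudyEnv column both ways. It uses `OrdinalEncoder`, a swapped-in feature-oriented counterpart of `LabelEncoder` that can map unseen categories to a sentinel; the value "Outdoors" is invented purely to show that case.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with the three StudyEnv categories from the base.
train = pd.DataFrame({"StudyEnv": ["Peaceful", "Noisy", "disrupted", "Peaceful"]})
test = pd.DataFrame({"StudyEnv": ["Noisy", "Outdoors"]})  # "Outdoors" is an invented unseen value

# Integer codes: a single column, with unseen values mapped to -1.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train[["StudyEnv"]])
codes = ordinal.transform(test[["StudyEnv"]]).ravel()

# One-hot: one indicator column per category present in the frame.
one_hot = pd.get_dummies(train, columns=["StudyEnv"])

print(list(ordinal.categories_[0]))
print(codes)
print(list(one_hot.columns))
```

Integer codes impose an arbitrary order on the categories. Tree splits can usually work around that, which is why this kind of encoding is commonly tolerated for Random Forests, while one-hot columns avoid the artificial order at the cost of more features.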

Features and target

features (X): AcademicStage, PeerPressure, HomePressure, StudyEnv, Strategy, BadHabits, AcademicComp
target (y): Stress (1 to 5)

Data Split

80% of the records were set aside for training and 20% for testing, stratified by the target (stratify=y in the code).
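
A quick way to see what the 80/20 split does to a balanced target is to run it on synthetic labels. The sketch below assumes only what the exploration section states (280 records, 20% per stress level); the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 280 labels, 56 per stress level (1..5).
y = np.repeat(np.arange(1, 6), 56)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

levels, test_counts = np.unique(y_test, return_counts=True)
print(len(y_train), len(y_test))
print(dict(zip(levels.tolist(), test_counts.tolist())))
```

With stratification, 56 test records cannot be divided evenly across five levels, so four levels receive 11 records and one receives 12. Which level gets the extra record depends on the random state.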

Binarization

I reused the setup of the KNN model built earlier and initially kept its binarized target here (stress levels less than or equal to 3 labeled "low", levels above 3 labeled "high"). After running the model, the result was 97% accuracy, which I took as overfitting, so the binarization was reverted.
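
The binarization rule can be sketched as follows. The labels are synthetic and assume only the balanced distribution reported in the exploration section; the column name `StressBin` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the balanced target: 56 records per stress level.
stress = pd.Series(np.repeat(np.arange(1, 6), 56), name="Stress")

# Same rule as in the text: levels <= 3 are "low", levels > 3 are "high".
stress_bin = pd.Series(np.where(stress <= 3, "low", "high"), name="StressBin")

shares = stress_bin.value_counts(normalize=True)
majority_baseline = shares.max()  # accuracy of always predicting the larger class

print(shares.to_dict())
print(f"Majority-class baseline: {majority_baseline:.2f}")
```

One consequence worth keeping in mind: the balanced five-level target becomes a 60/40 binary target, so always predicting "low" already scores 0.60, against 0.20 for a constant guess on five balanced levels. Accuracies from the two setups are therefore not directly comparable.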

Model Training

Random Forest code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

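# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/tigasparzin/Machine-Learning/refs/heads/main/data/StressExp.csv")

# Target: stress level, 1 to 5 (no binarization)
y = df["Stress"]

# Fixed feature set: every column except Timestamp and the target
X = df.drop(columns=["Timestamp", "Stress"])
# One-hot encode the categorical columns (a null StudyEnv yields all-zero dummies)
X = pd.get_dummies(
    X,
    columns=["AcademicStage", "StudyEnv", "Strategy", "BadHabits"],
    drop_first=False
)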

# Stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    max_features='sqrt',
    random_state=42
)
rf.fit(X_train, y_train)

# Evaluation on the held-out test split
pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, pred):.2f}")
print()

# Feature importances (raw vector, in the order of X.columns)
print("Feature Importances:", rf.feature_importances_)
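
Because the snippet above one-hot encodes the categorical columns, the importance vector has one entry per dummy column rather than one per original feature. A possible way to read it is to sum the dummy importances back into their source column. The sketch below does that on random synthetic data (not the StressExp base), and the `"__"` separator is an assumption chosen so that it cannot collide with a column name.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (random values, NOT the StressExp base).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "PeerPressure": rng.integers(1, 6, n),
    "AcademicComp": rng.integers(1, 6, n),
    "StudyEnv": rng.choice(["Peaceful", "Noisy", "disrupted"], n),
    "BadHabits": rng.choice(["No", "Yes", "prefer not to say"], n),
})
y_toy = rng.integers(1, 6, n)

cat_cols = ["StudyEnv", "BadHabits"]
# A custom separator that cannot appear in a column name keeps the split unambiguous.
X_toy = pd.get_dummies(toy, columns=cat_cols, prefix_sep="__")

rf_toy = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
rf_toy.fit(X_toy, y_toy)

# Sum the importance of every dummy column back into its original column.
per_dummy = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
per_feature = per_dummy.groupby(lambda name: name.split("__")[0]).sum()
per_feature = per_feature.sort_values(ascending=False)

print(per_feature.round(3))
```

The grouped values still sum to 1, so they can be compared on the same scale as importances obtained from an integer-encoded pipeline.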

Model Evaluation

The Random Forest model reached an accuracy of about 0.62 (0.625 on the 56-record test set).

Model evaluation output:

Accuracy: 0.62

              precision    recall  f1-score   support

           1      1.000     0.727     0.842        11
           2      0.818     0.818     0.818        11
           3      0.438     0.636     0.519        11
           4      0.571     0.333     0.421        12
           5      0.500     0.636     0.560        11

    accuracy                          0.625        56
   macro avg      0.665     0.630     0.632        56
weighted avg      0.664     0.625     0.628        56

Top 10 features:
AcademicComp: 0.232
HomePressure: 0.179
BadHabits: 0.177
PeerPressure: 0.149
StudyEnv: 0.092
AcademicStage: 0.089
Strategy: 0.081

Model evaluation code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('https://raw.githubusercontent.com/tigasparzin/Machine-Learning/refs/heads/main/data/StressExp.csv')

if 'Timestamp' in df.columns:
    df = df.drop(columns=['Timestamp'])

X = df[['AcademicStage','PeerPressure','HomePressure','StudyEnv','Strategy','BadHabits','AcademicComp']].copy()
y = df['Stress'].astype(int)  # no binarization (levels 1..5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fill the null StudyEnv with the mode computed on the training split only
if X_train['StudyEnv'].isna().any() or X_test['StudyEnv'].isna().any():
    mode_env = X_train['StudyEnv'].mode().iloc[0]
    X_train['StudyEnv'] = X_train['StudyEnv'].fillna(mode_env)
    X_test['StudyEnv']  = X_test['StudyEnv'].fillna(mode_env)

cat_cols = ['AcademicStage','StudyEnv','Strategy','BadHabits']
encoders = {}
for col in cat_cols:
    # Fit each encoder on the training split only
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col].astype(str))
    # Categories unseen in training fall back to the first known class so transform() cannot fail
    test_vals = X_test[col].astype(str)
    X_test[col] = le.transform(test_vals.where(test_vals.isin(le.classes_), le.classes_[0]))
    encoders[col] = le

clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=5,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f"Accuracy: {(y_pred == y_test).mean():.2f}\n")
print(classification_report(y_test, y_pred, digits=3))

labels = np.sort(df['Stress'].unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
fig, ax = plt.subplots(figsize=(7, 5))
ConfusionMatrixDisplay(cm, display_labels=labels).plot(
    ax=ax, cmap=plt.cm.Blues, values_format="d", colorbar=False
)
ax.set_title("Confusion Matrix - RF (5 classes)")
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
# Write the figure as SVG text and print it to stdout
buf = StringIO(); plt.savefig(buf, format="svg", transparent=True, bbox_inches="tight")
print(buf.getvalue()); plt.close(fig)

feat_names = X_train.columns.to_numpy()
importances = clf.feature_importances_
top_idx = np.argsort(importances)[::-1][:10]
print("Top 10 features:")
for i in top_idx:
    print(f"{feat_names[i]}: {importances[i]:.3f}")
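
The summary rows of the classification report above can be checked by hand from its per-class rows. The sketch below copies the reported recall, F1 and support values and recomputes accuracy, macro F1 and weighted F1; it uses no data beyond what the report prints.

```python
# Per-class values copied from the classification report above.
support = {1: 11, 2: 11, 3: 11, 4: 12, 5: 11}
recall = {1: 0.727, 2: 0.818, 3: 0.636, 4: 0.333, 5: 0.636}
f1 = {1: 0.842, 2: 0.818, 3: 0.519, 4: 0.421, 5: 0.560}

total = sum(support.values())

# recall * support recovers the number of correctly classified records per level
correct = {k: round(recall[k] * support[k]) for k in support}
accuracy = sum(correct.values()) / total

macro_f1 = sum(f1.values()) / len(f1)                           # unweighted mean over levels
weighted_f1 = sum(f1[k] * support[k] for k in support) / total  # mean weighted by support

print(correct)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

The recomputed values (35 of 56 correct, so accuracy 0.625; macro F1 0.632; weighted F1 0.628) agree with the report, and the per-level counts show that only 4 of the 12 level-4 records were classified correctly.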

Conclusion

Looking at the confusion matrix, the model shows middling performance (Accuracy = 0.625; Macro-F1 ≈ 0.63), with strong confusion between the high levels (4↔5). It does well on levels 1 and 2 and poorly on level 4 (recall 0.333).
This helps explain why the earlier binarized test (low: levels 1, 2 and 3; high: levels 4 and 5) reached 97% accuracy, which I treated as overfitting: grouping 4 and 5 into one class hides the confusion the model makes between the higher levels.
Top features: AcademicComp (0.232) > HomePressure (0.179) > BadHabits (0.177) > PeerPressure (0.149) > StudyEnv (0.092) > AcademicStage (0.089) > Strategy (0.081). In other words, the model suggests that the main drivers of stress among students are competitiveness, pressure from family, bad habits (drinking/smoking) and pressure from friends, with study environment and the strategy for coping with stress coming last.
As for ideas to make the model more balanced, I think the best option would be to create 3 levels: high (4 and 5), medium (3) and low (1 and 2), or else to set the class weights manually so that errors on level 4 are penalized more.
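
Both ideas can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the name `to_three_levels` is hypothetical, and the weight of 2.0 for level 4 is an arbitrary example value that has not been tuned or evaluated on the StressExp base.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Option 1: collapse the five levels into three groups.
to_three_levels = {1: "low", 2: "low", 3: "medium", 4: "high", 5: "high"}
stress = pd.Series([1, 2, 3, 4, 5, 4])
three_levels = stress.map(to_three_levels)
print(three_levels.tolist())

# Option 2: keep five classes but weight level 4 more heavily.
# The weight 2.0 is an arbitrary illustrative value, not a tuned one.
class_weight = {1: 1.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 1.0}

# Synthetic stand-in data, only to show that the model accepts the weights.
rng = np.random.default_rng(42)
X_toy = rng.integers(1, 6, size=(200, 3))
y_toy = np.repeat(np.arange(1, 6), 40)

clf_weighted = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42, class_weight=class_weight
)
clf_weighted.fit(X_toy, y_toy)
print(clf_weighted.classes_)
```

Note that the three-level grouping turns the balanced target into a 40/20/40 split (low/medium/high), so the medium class would become the minority and might itself need a higher weight.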