to handle the train/test split, vectorization, model training, and
evaluation.
result <- pipeline(
  # --- Define the vectorization method ---
  # Options: "bow" (raw counts), "tf" (term frequency), "tfidf", "binary"
  vect_method = "tf",
  # --- Define the model to train ---
  # Options: "logit", "rf", "xgb", "nb"
  model_name = "rf",
  # --- Specify the data and column names ---
  text_vector = tweets$cleaned_text,   # The column with our preprocessed text
  sentiment_vector = tweets$sentiment, # The column with the target variable
  # --- Set vectorization options ---
  # Use n_gram = 2 for unigrams + bigrams, or 1 for just unigrams
  n_gram = 1,
  parallel = cores # Number of CPU cores to use for parallel processing
)
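To make the `vect_method = "tf"` choice concrete, here is a minimal sketch of what term-frequency vectorization produces: each document's raw word counts divided by its length. This is a hypothetical illustration in base R, not the pipeline's internal implementation; the `docs` vector is made up for the example.

```r
# Two toy "documents" (hypothetical data, for illustration only)
docs <- c("good good movie", "bad movie")

# Tokenize on whitespace and build a shared vocabulary
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Term frequency: count of each vocabulary word / document length
tf <- t(sapply(tokens, function(tok) {
  counts <- table(factor(tok, levels = vocab))
  as.numeric(counts) / length(tok)
}))
colnames(tf) <- vocab
tf
##   bad      good     movie
## 1 0.0000000 0.6666667 0.3333333
## 2 0.5000000 0.0000000 0.5000000
```

Swapping in raw counts (drop the division) gives "bow", and `counts > 0` gives "binary"; "tfidf" additionally down-weights words that appear in many documents.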
## --- Running Pipeline: TERM_FREQUENCY + RANDOM_FOREST ---
## Data split: 944 training elements, 237 test elements.
## Vectorizing with TERM_FREQUENCY (ngram=1)...
## - Fitting BoW model (term_frequency) on training data...
## - Applying BoW transformation (term_frequency) to new data...
##
## --- Training Random Forest Model (ranger) ---
## --- Random Forest complete. Returning results. ---
##
## ======================================================
## PIPELINE COMPLETE: TERM_FREQUENCY + RANDOM_FOREST
## Model AUC: 0.690
## Recommended ROC Threshold: 0.279
## ======================================================
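The recommended threshold of 0.279 (rather than the default 0.5) means the model should flag a tweet as the positive class at a much lower predicted probability, which typically happens when the classes are imbalanced. A minimal sketch of applying such a threshold, assuming the pipeline exposes per-observation probabilities (the `probs` vector below is hypothetical):

```r
# Hypothetical predicted probabilities for four test tweets;
# the pipeline's actual return structure may differ.
probs <- c(0.10, 0.35, 0.28, 0.60)

# Classify using the ROC-recommended threshold instead of 0.5
pred_class <- ifelse(probs >= 0.279, "positive", "negative")
pred_class
## "negative" "positive" "positive" "positive"
```

Note that with the default 0.5 cutoff, only the last tweet would be labeled positive; the lower threshold trades some false positives for higher recall on the minority class.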