Validated Data

Wk 6 R Pipeline

Multi-model analytical workflow — glm, rpart, nnet, and tm — across structured and text data in one reproducible pipeline.

Problem

Evaluate multiple analytical tasks in one reproducible workflow — subscription renewal prediction, insurance risk classification, credit risk scoring, and speech text pattern analysis.

Approach

Logistic regression with glm for renewal propensity
Decision tree modeling with rpart for insurance category logic
Neural network classification with nnet on normalized features
Text mining with tm and document-term matrices for term prevalence insights
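
As a rough illustration of the text mining step listed above, the sketch below builds a document-term matrix with tm from a few made-up sentences. The sample documents, the clean_corpus helper, and the particular cleaning steps are assumptions for illustration, not the coursework's actual speech data or script.

```r
library(tm)

# Hypothetical stand-in documents; the coursework used speech text instead.
docs <- c(
  "The economy is growing and jobs are growing.",
  "Jobs, jobs, jobs: the economy needs more jobs!",
  "Health care costs are rising across the economy."
)

# One cleaning function, applied identically to every corpus,
# so term counts stay comparable between runs.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

corpus <- clean_corpus(VCorpus(VectorSource(docs)))
dtm <- DocumentTermMatrix(corpus)

# Total occurrences of each term across all documents, most frequent first.
term_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(term_counts, 5))
```

Keeping the cleaning steps in a single function is one way to make sure every document passes through the same transformations in the same order.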

Solution

A scored, script-driven pipeline combining logistic regression, decision trees, neural networks, and text mining with consistent preprocessing and validation across heterogeneous datasets.

Outcome

100/100 on the Week 6 summative — consistent workflow across structured and semi-structured data with interpretable, reusable scripts.

Graduate-level data mining coursework (DSC 550) demonstrating end-to-end analytical engineering: heterogeneous inputs, multiple model families, and defensible validation — not a single-algorithm notebook exercise.

The neural network scoring output below is a representative sample from Exercise 3 (credit risk classification). Twenty-three applicants scored DO NOT LEND in the full run; all had credit scores below 500.

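# Simplified reproducible pipeline skeleton
library(rpart)  # decision trees
library(nnet)   # single-hidden-layer neural networks

set.seed(42)  # nnet starts from random weights; fix the seed so reruns match
model_glm  <- glm(renewal ~ ., data = train_df, family = binomial())
model_tree <- rpart(risk_class ~ ., data = train_df, method = "class")
model_nnet <- nnet(target ~ ., data = normalized_train, size = 5)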
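
The normalized_train frame in the skeleton implies a scaling step before nnet is called. A minimal sketch of one common option, min-max normalization, follows. The column names, the toy values, and the fit_minmax / apply_minmax helpers are hypothetical, and the coursework may have scaled its features differently.

```r
# Made-up applicant features for illustration only.
train_raw <- data.frame(credit_score = c(480, 620, 710, 550, 800),
                        income       = c(28000, 54000, 91000, 40000, 120000))
test_raw  <- data.frame(credit_score = c(500, 760),
                        income       = c(35000, 99000))

# Learn each column's min and max from the training rows only.
fit_minmax <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

# Rescale every column relative to the stored training ranges.
apply_minmax <- function(df, params) {
  scaled <- Map(function(col, lo, hi) (col - lo) / (hi - lo),
                df, params$min, params$max)
  as.data.frame(scaled)
}

params           <- fit_minmax(train_raw)
normalized_train <- apply_minmax(train_raw, params)
normalized_test  <- apply_minmax(test_raw, params)

print(round(normalized_train, 3))
```

Reusing the training ranges for new rows keeps both sets on the same scale. A column with a single constant value would divide by zero here, so a real script would need to guard against that case.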

What I Learned

— Feature normalization is critical for stable neural network training
— Root split and variable importance in trees are related but not identical concepts
— Text mining pipelines need strict preprocessing consistency to make output defensible
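
On the second point above, the difference between a tree's root split and its variable importance can be inspected directly on a fitted rpart object. The sketch below uses the built-in iris data as a stand-in, since the insurance dataset is not shown here; that substitution is an assumption for illustration only.

```r
library(rpart)

# iris ships with R and stands in for the insurance data here.
fit <- rpart(Species ~ ., data = iris, method = "class")

# The root split is the first row of the fitted tree's frame.
root_var <- as.character(fit$frame$var[1])

# Variable importance sums each variable's split improvement across the
# whole tree, including credit for splits where it acts as a surrogate.
importance <- fit$variable.importance

cat("Root split variable:", root_var, "\n")
print(round(importance, 1))
```

Because importance also credits surrogate splits, the top-ranked variable does not have to be the one chosen at the root.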
