Week 6 R Pipeline
Multi-model analytical workflow using glm, rpart, nnet, and tm across structured and text data in one reproducible pipeline.
Evaluates four analytical tasks in one reproducible workflow: subscription renewal prediction, insurance risk classification, credit risk scoring, and speech text pattern analysis.
- Logistic regression with glm for renewal propensity
- Decision tree modeling with rpart for insurance category logic
- Neural network classification with nnet on normalized features
- Text mining with tm and document-term matrices for term prevalence insights
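The normalization step behind the nnet bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the coursework script: the toy data frame, its column names (`credit_score`, `income`), the labeling rule, and the 70/30 split are all hypothetical. The point it shows is that the min and range are learned from the training rows only and then reused unchanged on the test rows.

```r
library(nnet)  # recommended package bundled with standard R installs

set.seed(42)

# Hypothetical toy data; columns and labeling rule are made up for illustration.
n <- 200
toy <- data.frame(
  credit_score = round(runif(n, 300, 850)),
  income       = round(runif(n, 20000, 150000))
)
toy$target <- factor(ifelse(toy$credit_score < 550, "high_risk", "low_risk"))

# Hold out 30% of rows for evaluation.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

num_cols <- c("credit_score", "income")

# Learn the scaling parameters from the training rows only.
mins   <- sapply(train[num_cols], min)
ranges <- sapply(train[num_cols], function(x) max(x) - min(x))

# Apply the same training-derived parameters to any data frame.
normalize <- function(df) {
  for (col in num_cols) {
    df[[col]] <- (df[[col]] - mins[[col]]) / ranges[[col]]
  }
  df
}

normalized_train <- normalize(train)
normalized_test  <- normalize(test)

# Fit a small single-hidden-layer network on the scaled features.
model_nnet <- nnet(target ~ ., data = normalized_train, size = 5, trace = FALSE)

pred <- predict(model_nnet, normalized_test, type = "class")
accuracy <- mean(pred == normalized_test$target)
print(accuracy)
```

Test values can fall slightly outside [0, 1] because the test rows are scaled with the training minimum and range. That is expected and keeps information about the test set out of preprocessing.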
A scored, script-driven pipeline combining logistic regression, decision trees, neural networks, and text mining with consistent preprocessing and validation across heterogeneous datasets.
Scored 100/100 on the Week 6 summative, with a consistent workflow across structured and semi-structured data and interpretable, reusable scripts.
Graduate-level data mining coursework (DSC 550) demonstrating end-to-end analytical engineering: heterogeneous inputs, multiple model families, and defensible validation — not a single-algorithm notebook exercise.
The neural network scoring output below is a representative sample from Exercise 3 (credit risk classification). Twenty-three applicants scored DO NOT LEND in the full run; all had credit scores below 500.
# Simplified reproducible pipeline skeleton
library(rpart)  # decision trees
library(nnet)   # single-hidden-layer neural networks

# train_df and normalized_train are placeholders for each exercise's own training set
model_glm <- glm(renewal ~ ., data = train_df, family = binomial())
model_tree <- rpart(risk_class ~ ., data = train_df)
model_nnet <- nnet(target ~ ., data = normalized_train, size = 5)

- Feature normalization is critical for stable neural network training
- Root split and variable importance in trees are related but not identical concepts
- Text mining pipelines need strict preprocessing consistency to make output defensible
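The difference between the root split and variable importance can be inspected directly on a fitted rpart object. The sketch below uses the built-in `iris` data as a stand-in (an assumption for illustration; it is not the insurance dataset from the coursework). The root split is the one variable chosen at the first node, while `variable.importance` sums each variable's contribution over every split where it was the primary variable, plus weighted credit where it served as a surrogate. The two can therefore name different variables.

```r
library(rpart)  # recommended package bundled with standard R installs

# Stand-in data: iris is used here only for illustration.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Root split: the variable used at node 1, stored in the first row of fit$frame.
root_var <- as.character(fit$frame$var[1])

# Variable importance: a named numeric vector sorted from most to least
# important, aggregating primary and surrogate split contributions.
importance <- fit$variable.importance
top_var <- names(importance)[1]

cat("Root split variable:      ", root_var, "\n")
cat("Top variable by importance:", top_var, "\n")
print(round(importance, 2))
```

When reporting tree results, it helps to state which of the two is being quoted, since a variable that never appears as a primary split can still rank highly through its surrogate splits.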