Validated Data

Wk 6 R Pipeline

Multi-model analytical workflow — glm, rpart, nnet, and tm — across structured and text data in one reproducible pipeline.

Problem

Evaluate multiple analytical tasks in one reproducible workflow — subscription renewal prediction, insurance risk classification, credit risk scoring, and speech text pattern analysis.

Approach
  • Logistic regression with glm for renewal propensity
  • Decision tree modeling with rpart for insurance category logic
  • Neural network classification with nnet on normalized features
  • Text mining with tm and document-term matrices for term prevalence insights
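
The text mining step in the list above can be sketched roughly as follows. This is a minimal illustration with tm, not the coursework script: the three sentences are made-up placeholders for the speech text, and the cleaning steps and their order are assumptions.

```r
library(tm)

# Toy documents; placeholders for the real speech text
docs <- c("We build the future together.",
          "The future belongs to those who build.",
          "Together we stand.")

# Apply the same cleaning steps, in the same order, to every document
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Rows are documents, columns are terms, cells are term counts
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)

# Term prevalence: number of documents in which each term appears
prevalence <- sort(colSums(m > 0), decreasing = TRUE)
```

Counting documents rather than raw occurrences keeps one long document from dominating the ranking; `colSums(m)` would give raw term frequency instead.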

Solution

A scored, script-driven pipeline combining logistic regression, decision trees, neural networks, and text mining with consistent preprocessing and validation across heterogeneous datasets.
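
One way to keep preprocessing consistent is to learn scaling parameters from the training data only and reuse them on every other split. The sketch below uses base R min-max scaling; the helper names `fit_minmax` and `apply_minmax`, the column names, and the toy values are all hypothetical, and the actual coursework may have normalized differently.

```r
# Learn min-max parameters from the training data only
fit_minmax <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

# Apply stored parameters to any data frame with the same numeric columns
apply_minmax <- function(df, params) {
  rng <- params$max - params$min
  rng[rng == 0] <- 1  # guard against constant columns
  as.data.frame(scale(df, center = params$min, scale = rng))
}

# Toy numeric features; column names are hypothetical
train_x <- data.frame(income = c(20, 40, 60), credit_score = c(450, 600, 750))
test_x  <- data.frame(income = c(30, 80),     credit_score = c(500, 700))

params <- fit_minmax(train_x)
normalized_train <- apply_minmax(train_x, params)
normalized_test  <- apply_minmax(test_x, params)
```

Because the test rows are scaled with training parameters, they can fall outside the 0 to 1 range. That is expected and avoids leaking test-set information into the model.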

Outcome

100/100 on the Week 6 summative — consistent workflow across structured and semi-structured data with interpretable, reusable scripts.

Wk 6 R Pipeline proof artifact

Graduate-level data mining coursework (DSC 550) demonstrating end-to-end analytical engineering: heterogeneous inputs, multiple model families, and defensible validation — not a single-algorithm notebook exercise.

The neural network scoring output below is a representative sample from Exercise 3 (credit risk classification). Twenty-three applicants scored DO NOT LEND in the full run; all had credit scores below 500.
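
A claim like the one above can be verified mechanically against the scored output. The sketch below assumes a data frame with hypothetical columns `credit_score` and `decision`; the five rows are toy values for illustration, not the coursework results, and the `LEND` label is an assumed counterpart to `DO NOT LEND`.

```r
# Toy scored output; values and column names are illustrative only
scored <- data.frame(
  applicant_id = 1:5,
  credit_score = c(480, 620, 455, 710, 499),
  decision     = c("DO NOT LEND", "LEND", "DO NOT LEND", "LEND", "DO NOT LEND")
)

# Keep only the declined applicants
declined <- subset(scored, decision == "DO NOT LEND")

# How many were declined, and did every one of them score below 500?
n_declined <- nrow(declined)
all_below_500 <- all(declined$credit_score < 500)
```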

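# Simplified reproducible pipeline skeleton
# (train_df and normalized_train stand for each task's own prepared training data)
library(rpart)  # decision trees
library(nnet)   # single-hidden-layer neural networks
model_glm <- glm(renewal ~ ., data = train_df, family = binomial())
model_tree <- rpart(risk_class ~ ., data = train_df, method = "class")
model_nnet <- nnet(target ~ ., data = normalized_train, size = 5)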
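
To show how a model from the skeleton might be validated, the sketch below fits the logistic regression on simulated data and scores a held-out split. The simulated columns (`tenure`, `usage`), the 70/30 split, the seed, and the 0.5 cutoff are all assumptions for illustration, not details of the coursework.

```r
set.seed(1)  # arbitrary seed so the simulation and split are repeatable

# Simulated stand-in for the renewal data; column names are hypothetical
n <- 200
sim <- data.frame(tenure = runif(n, 0, 10), usage = rnorm(n))
sim$renewal <- rbinom(n, 1, plogis(-1 + 0.3 * sim$tenure + 0.8 * sim$usage))

# 70/30 train/test split
idx <- sample(n, size = 0.7 * n)
train_df <- sim[idx, ]
test_df  <- sim[-idx, ]

model_glm <- glm(renewal ~ ., data = train_df, family = binomial())

# Predicted renewal probabilities on held-out rows, thresholded at 0.5
prob <- predict(model_glm, newdata = test_df, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix and overall accuracy on the held-out rows
confusion <- table(actual = test_df$renewal, predicted = pred)
accuracy <- mean(pred == test_df$renewal)
```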

What I Learned
  • Feature normalization is critical for stable neural network training
  • Root split and variable importance in trees are related but not identical concepts
  • Text mining pipelines need strict preprocessing consistency to make output defensible
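
The second point can be checked directly on a fitted rpart tree. The sketch below uses the built-in iris data as a stand-in for the insurance data set (an assumption for illustration) and reads the root split variable and the importance ranking from the fitted object separately, since the two need not name the same variable.

```r
library(rpart)

# Built-in iris data stands in for the insurance data set
fit <- rpart(Species ~ ., data = iris, method = "class")

# Variable used at the root split (first row of the fitted tree's frame)
root_var <- as.character(fit$frame$var[1])

# Importance sums each variable's contribution over all splits,
# including its use as a surrogate, so it reflects more than the root
importance <- sort(fit$variable.importance, decreasing = TRUE)
top_var <- names(importance)[1]

# TRUE only when the root split variable is also the top-ranked one
same_variable <- identical(root_var, top_var)
```

Printing `root_var` next to `importance` makes the distinction visible: the root split is a single greedy choice, while importance accumulates across the whole tree.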