Mastering Scikit-learn: The Swiss Army Knife of Machine Learning in Python
If you’ve ventured into the world of machine learning with Python, you’ve almost certainly encountered Scikit-learn. This powerful library has become the cornerstone of the Python machine learning ecosystem, and for good reason. With its consistent API, comprehensive algorithm collection, and excellent documentation, Scikit-learn has democratized machine learning, making it accessible to developers, data scientists, and researchers alike.
In this comprehensive guide, I’ll take you through everything you need to know about Scikit-learn, from basic concepts to advanced techniques, with plenty of practical code examples along the way.
What is Scikit-learn?
Scikit-learn (often abbreviated as sklearn) is an open-source Python library for machine learning that provides simple and efficient tools for data analysis and modeling. First released in 2007 as a Google Summer of Code project by David Cournapeau, it has since grown into one of the most popular machine learning libraries in existence.
Built on NumPy, SciPy, and Matplotlib, Scikit-learn focuses on classical machine learning algorithms rather than deep learning (which is better served by libraries like TensorFlow and PyTorch). Its name comes from the idea that it’s a “SciKit” (SciPy Toolkit) focused on machine learning.
Why Use Scikit-learn?
Scikit-learn has become the go-to library for machine learning in Python for several compelling reasons:
- Consistent API: All algorithms follow a simple fit/predict interface
- Comprehensive: Covers most machine learning needs in one package
- Well-documented: Excellent documentation with examples
- Production-ready: Efficient, reliable, and extensively tested code
- Interoperability: Works seamlessly with NumPy, Pandas, and other data science tools
- Active community: Continuously improved and updated
Let’s dive into the practical aspects of using Scikit-learn for various machine learning tasks.
Getting Started with Scikit-learn
Installation
Installing Scikit-learn is straightforward:
pip install scikit-learn
Or with conda:
conda install scikit-learn
Let’s verify the installation:
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
The Machine Learning Workflow
Scikit-learn is designed around a typical machine learning workflow:
- Data preparation: Loading, cleaning, and transforming data
- Model selection: Choosing the right algorithm for your problem
- Model training: Fitting the model to your data
- Model evaluation: Assessing how well your model performs
- Prediction: Using your model on new data
Let’s see how Scikit-learn supports each step of this process.
Data Preparation
Loading and Exploring Data
Scikit-learn includes many built-in datasets for practice:
from sklearn import datasets
import pandas as pd
import numpy as np
# Load the Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Convert to DataFrame for easier exploration
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['target'] = y
iris_df['target_name'] = [iris.target_names[i] for i in y]
# Basic exploration
print(f"Dataset shape: {iris_df.shape}")
print(f"First few rows:\n{iris_df.head()}")
print(f"Feature statistics:\n{iris_df.describe()}")
print(f"Class distribution:\n{iris_df['target_name'].value_counts()}")
Splitting Data
It’s essential to split your data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Training class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
Note the stratify=y parameter, which ensures that the class distribution is preserved in both training and test sets.
Feature Scaling
Many algorithms perform better when features are on a similar scale:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform, don't fit
# Alternative: Min-Max scaling (range [0, 1])
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)
# Compare original vs scaled data
print("Original features (first sample):")
print(X_train[0])
print("\nStandardized features (first sample):")
print(X_train_scaled[0])
print("\nMin-Max scaled features (first sample):")
print(X_train_minmax[0])
Handling Missing Values
Real-world data often contains missing values:
from sklearn.impute import SimpleImputer, KNNImputer
import numpy as np
# Create data with missing values
X_missing = X_train.copy()
X_missing[0, 0] = np.nan # Make first value missing
X_missing[1, 2] = np.nan # Make another value missing
# Simple imputation (mean, median, most_frequent, or constant)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_missing)
# More advanced: KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_knn_imputed = knn_imputer.fit_transform(X_missing)
print("Original data with missing values:")
print(X_missing[:2])
print("\nAfter mean imputation:")
print(X_imputed[:2])
print("\nAfter KNN imputation:")
print(X_knn_imputed[:2])
Encoding Categorical Features
For categorical data, you need encoding:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
# Sample categorical data
cat_features = np.array([['Male', 'New York', 'High School'],
['Female', 'California', 'Bachelor'],
['Male', 'California', 'Master'],
['Female', 'New York', 'PhD']])
# One-hot encoding (creates binary columns for each category)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(cat_features)
print("One-hot encoded features:")
print(onehot_encoded)
print(f"Feature names: {onehot_encoder.get_feature_names_out()}")
# Ordinal encoding (for ordered categories)
ordinal_encoder = OrdinalEncoder()
ordinal_encoded = ordinal_encoder.fit_transform(cat_features[:, 2].reshape(-1, 1))
print("\nOrdinal encoded education:")
print(ordinal_encoded)
# Label encoding (for target variables)
label_encoder = LabelEncoder()
labels = np.array(['cat', 'dog', 'cat', 'bird', 'dog'])
encoded_labels = label_encoder.fit_transform(labels)
print("\nLabel encoded targets:")
print(encoded_labels)
Supervised Learning Models
Scikit-learn provides implementations of numerous supervised learning algorithms. Let’s explore some of the most popular ones.
Classification
Logistic Regression
A simple but powerful linear classifier:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Create and train the model
logreg = LogisticRegression(random_state=42, max_iter=200)
logreg.fit(X_train_scaled, y_train)
# Make predictions
y_pred = logreg.predict(X_test_scaled)
y_prob = logreg.predict_proba(X_test_scaled) # Probability estimates
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")
print(f"\nClassification Report:\n{class_report}")
# Model coefficients (feature importance for linear models)
coef_df = pd.DataFrame({
'Feature': iris.feature_names,
'Coefficient (Class 0 vs Rest)': logreg.coef_[0],
'Coefficient (Class 1 vs Rest)': logreg.coef_[1],
'Coefficient (Class 2 vs Rest)': logreg.coef_[2]
})
print("\nModel Coefficients:")
print(coef_df)
Decision Trees
Tree-based models that recursively split the data:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Create and train the model
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_clf.fit(X_train, y_train) # No need for scaling with tree models
# Make predictions
y_pred = dt_clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")
print(f"\nClassification Report:\n{class_report}")
# Feature importance
feature_importance = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': dt_clf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
# Visualize the tree (optional)
plt.figure(figsize=(15, 10))
tree.plot_tree(dt_clf, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True)
plt.title("Decision Tree for Iris Classification")
plt.savefig('decision_tree.png') # Save the image
plt.close()
Random Forest
An ensemble of decision trees:
from sklearn.ensemble import RandomForestClassifier
# Create and train the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
# Make predictions
y_pred = rf_clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")
print(f"\nClassification Report:\n{class_report}")
# Feature importance
feature_importance = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': rf_clf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
Support Vector Machines
Models that find the optimal hyperplane to separate classes:
from sklearn.svm import SVC
# Create and train the model
svc = SVC(kernel='rbf', probability=True, random_state=42)
svc.fit(X_train_scaled, y_train) # SVM works better with scaled features
# Make predictions
y_pred = svc.predict(X_test_scaled)
y_prob = svc.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")
print(f"\nClassification Report:\n{class_report}")
K-Nearest Neighbors
A simple but effective instance-based learning algorithm:
from sklearn.neighbors import KNeighborsClassifier
# Create and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Make predictions
y_pred = knn.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")
print(f"\nClassification Report:\n{class_report}")
# Finding the optimal k
k_range = range(1, 31)
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
k_scores.append(knn.score(X_test_scaled, y_test))
# Plot accuracy vs k
plt.figure(figsize=(10, 6))
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.title('Accuracy vs K Value')
plt.grid(True)
plt.savefig('knn_accuracy.png')
plt.close()
Regression
For predicting continuous values, let’s use a different dataset:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load the California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Function to evaluate regression models
def evaluate_regression(y_true, y_pred, model_name):
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"{model_name} Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"R² Score: {r2:.4f}\n")
return mse, rmse, mae, r2
# Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
lr_metrics = evaluate_regression(y_test, y_pred_lr, "Linear Regression")
# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
ridge_metrics = evaluate_regression(y_test, y_pred_ridge, "Ridge Regression")
# Lasso Regression (L1 regularization)
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
lasso_metrics = evaluate_regression(y_test, y_pred_lasso, "Lasso Regression")
# ElasticNet (L1 + L2 regularization)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic.fit(X_train_scaled, y_train)
y_pred_elastic = elastic.predict(X_test_scaled)
elastic_metrics = evaluate_regression(y_test, y_pred_elastic, "ElasticNet Regression")
# Feature coefficients comparison
coef_df = pd.DataFrame({
'Feature': housing.feature_names,
'Linear Regression': lr.coef_,
'Ridge': ridge.coef_,
'Lasso': lasso.coef_,
'ElasticNet': elastic.coef_
})
print("Feature Coefficients Comparison:")
print(coef_df)
Decision Tree Regression
Tree-based regression models:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_reg.fit(X_train, y_train) # No scaling needed
y_pred_dt = dt_reg.predict(X_test)
dt_metrics = evaluate_regression(y_test, y_pred_dt, "Decision Tree Regressor")
# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
rf_metrics = evaluate_regression(y_test, y_pred_rf, "Random Forest Regressor")
# Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_reg.fit(X_train, y_train)
y_pred_gb = gb_reg.predict(X_test)
gb_metrics = evaluate_regression(y_test, y_pred_gb, "Gradient Boosting Regressor")
# Feature importance comparison
importance_df = pd.DataFrame({
'Feature': housing.feature_names,
'Decision Tree': dt_reg.feature_importances_,
'Random Forest': rf_reg.feature_importances_,
'Gradient Boosting': gb_reg.feature_importances_
}).melt(id_vars='Feature', var_name='Model', value_name='Importance')
# Plot feature importance
plt.figure(figsize=(12, 8))
import seaborn as sns
sns.barplot(x='Importance', y='Feature', hue='Model', data=importance_df)
plt.title('Feature Importance Comparison')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.close()
Unsupervised Learning
Unsupervised learning finds patterns in unlabeled data.
Clustering
Grouping similar data points together:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate synthetic data for clustering
X_blobs, y_blobs = make_blobs(n_samples=300, centers=4,
cluster_std=0.60, random_state=42)
# K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans_labels = kmeans.fit_predict(X_blobs)
kmeans_silhouette = silhouette_score(X_blobs, kmeans_labels)
# DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=10)
dbscan_labels = dbscan.fit_predict(X_blobs)
# Only compute silhouette if more than one cluster and no noise points (-1)
if len(set(dbscan_labels)) > 1 and -1 not in dbscan_labels:
dbscan_silhouette = silhouette_score(X_blobs, dbscan_labels)
else:
dbscan_silhouette = "N/A (contains noise points)"
# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
agg_labels = agg_clustering.fit_predict(X_blobs)
agg_silhouette = silhouette_score(X_blobs, agg_labels)
# Plot results
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=kmeans_labels, cmap='viridis')
plt.title(f'K-Means Clustering\nSilhouette: {kmeans_silhouette:.3f}')
plt.subplot(1, 3, 2)
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=dbscan_labels, cmap='viridis')
plt.title(f'DBSCAN Clustering\nSilhouette: {dbscan_silhouette}')
plt.subplot(1, 3, 3)
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=agg_labels, cmap='viridis')
plt.title(f'Hierarchical Clustering\nSilhouette: {agg_silhouette:.3f}')
plt.tight_layout()
plt.savefig('clustering_comparison.png')
plt.close()
# Finding optimal K for K-Means
silhouette_scores = []
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X_blobs)
silhouette_scores.append(silhouette_score(X_blobs, labels))
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs. Number of Clusters')
plt.grid(True)
plt.savefig('kmeans_silhouette.png')
plt.close()
Dimensionality Reduction
Reducing the number of features while preserving important information:
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
# Load digits dataset for visualization
digits = load_digits()
X_digits = digits.data
y_digits = digits.target
# Scale the data
X_digits_scaled = StandardScaler().fit_transform(X_digits)
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_digits_scaled)
explained_variance = pca.explained_variance_ratio_
print(f"PCA explained variance: {explained_variance}")
print(f"Total variance explained: {sum(explained_variance):.4f}")
# t-SNE (better for visualization but slower)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_digits_scaled)
# Truncated SVD (works with sparse matrices too)
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_digits_scaled)
print(f"SVD explained variance: {svd.explained_variance_ratio_}")
print(f"Total variance explained: {sum(svd.explained_variance_ratio_):.4f}")
# NMF (Non-negative Matrix Factorization)
nmf = NMF(n_components=2, random_state=42)
X_nmf = nmf.fit_transform(X_digits_scaled - X_digits_scaled.min()) # Make non-negative
# Plot the results
plt.figure(figsize=(20, 5))
plt.subplot(1, 4, 1)
for i in range(10):
plt.scatter(X_pca[y_digits == i, 0], X_pca[y_digits == i, 1], label=str(i), alpha=0.7)
plt.title('PCA')
plt.legend()
plt.subplot(1, 4, 2)
for i in range(10):
plt.scatter(X_tsne[y_digits == i, 0], X_tsne[y_digits == i, 1], label=str(i), alpha=0.7)
plt.title('t-SNE')
plt.subplot(1, 4, 3)
for i in range(10):
plt.scatter(X_svd[y_digits == i, 0], X_svd[y_digits == i, 1], label=str(i), alpha=0.7)
plt.title('Truncated SVD')
plt.subplot(1, 4, 4)
for i in range(10):
plt.scatter(X_nmf[y_digits == i, 0], X_nmf[y_digits == i, 1], label=str(i), alpha=0.7)
plt.title('NMF')
plt.tight_layout()
plt.savefig('dimensionality_reduction.png')
plt.close()
Model Evaluation and Improvement
Cross-Validation
Cross-validation provides a more robust estimate of model performance:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# Create a model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Basic cross-validation (5-fold)
cv_scores = cross_val_score(rf, X, y, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")
# K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(rf, X, y, cv=kf)
print(f"\nK-Fold CV scores: {kf_scores}")
print(f"Mean K-Fold CV score: {kf_scores.mean():.4f}")
# Stratified K-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(rf, X, y, cv=skf)
print(f"\nStratified K-Fold CV scores: {skf_scores}")
print(f"Mean Stratified K-Fold CV score: {skf_scores.mean():.4f}")
Learning Curves
Learning curves help diagnose overfitting and underfitting:
from sklearn.model_selection import learning_curve
# Function to plot learning curves
def plot_learning_curve(estimator, X, y, cv=5, n_jobs=-1):
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy')
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='r', label='Training score')
plt.plot(train_sizes, test_mean, 'o-', color='g', label='Cross-validation score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='r')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='g')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.title('Learning Curve')
plt.legend(loc='best')
plt.grid(True)
return plt
# Compare learning curves for models with different complexity
simple_model = DecisionTreeClassifier(max_depth=3, random_state=42)
complex_model = DecisionTreeClassifier(max_depth=None, random_state=42)
simple_lc_plot = plot_learning_curve(simple_model, X, y)
simple_lc_plot.savefig('simple_model_learning_curve.png')
plt.close()
complex_lc_plot = plot_learning_curve(complex_model, X, y)
complex_lc_plot.savefig('complex_model_learning_curve.png')
plt.close()
ROC Curves and AUC
For binary classification performance:
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
# Convert iris to binary classification (first class vs. rest)
y_binary = (y == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
X, y_binary, test_size=0.3, random_state=42)
# Train classifiers
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42),
'SVM': SVC(probability=True, random_state=42)
}
plt.figure(figsize=(10, 8))
for name, model in models.items():
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.3f})')
# Random classifier for reference
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc='lower right')
plt.grid(True)
plt.savefig('roc_curves.png')
plt.close()
Hyperparameter Tuning
Finding the optimal hyperparameters for your model:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Create a model
rf = RandomForestClassifier(random_state=42)
# Define the parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Grid search (exhaustive)
grid_search = GridSearchCV(
rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Grid Search Results:")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# Get detailed results
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values('rank_test_score')
print("\nTop 5 parameter combinations:")
print(results[['params', 'mean_test_score', 'std_test_score']].head())
# RandomizedSearchCV (more efficient for large parameter spaces)
from scipy.stats import randint, uniform
# Define parameter distributions
param_dist = {
'n_estimators': randint(10, 300),
'max_depth': [None] + list(randint(3, 50).rvs(10)),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', None] + list(uniform(0.1, 0.9).rvs(5))
}
# Randomized search
random_search = RandomizedSearchCV(
rf, param_dist, n_iter=50, cv=5, scoring='accuracy',
n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print("\nRandomized Search Results:")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")
# Evaluate the best model from the search
best_rf = random_search.best_estimator_
y_pred = best_rf.predict(X_test)
print(f"\nTest set accuracy with tuned model: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
Pipelines: Streamlining Your Workflow
Scikit-learn’s Pipeline feature allows you to chain multiple steps together:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Create a pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=2)),
('classifier', SVC(random_state=42))
])
# Alternatively, use make_pipeline for simpler syntax
# pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), SVC(random_state=42))
# Train and evaluate the pipeline
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Pipeline accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Cross-validation with the pipeline
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Pipeline CV scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f}")
# Hyperparameter tuning with pipelines
param_grid = {
'pca__n_components': [2, 3, 4],
'classifier__C': [0.1, 1, 10, 100],
'classifier__kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)
print("\nBest pipeline parameters:")
print(grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.4f}")
Feature Selection in Pipelines
# Pipeline with feature selection
feature_selection_pipeline = Pipeline([
('scaler', StandardScaler()),
('select', SelectKBest(f_classif, k=2)),
('classifier', LogisticRegression(random_state=42))
])
# Train and evaluate
feature_selection_pipeline.fit(X_train, y_train)
y_pred = feature_selection_pipeline.predict(X_test)
print(f"Feature selection pipeline accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Get selected features
selected_features_mask = feature_selection_pipeline.named_steps['select'].get_support()
selected_features = [iris.feature_names[i] for i, selected in enumerate(selected_features_mask) if selected]
print(f"Selected features: {selected_features}")
Advanced Topics
Text Processing with Scikit-learn
Scikit-learn has powerful tools for text processing:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Sample text data
texts = [
"This is a great movie, I loved it!",
"The film was fantastic and engaging",
"Terrible movie, waste of time",
"I hated this film, so boring",
"Amazing plot and great acting",
"One of the worst movies I've seen"
]
labels = [1, 1, 0, 0, 1, 0] # 1 for positive, 0 for negative
# Split data
X_train, X_test, y_train, y_test = train_test_split(
texts, labels, test_size=0.3, random_state=42)
# Create a pipeline for text classification
text_clf = Pipeline([
('tfidf', TfidfVectorizer(stop_words='english')),
('clf', MultinomialNB())
])
# Train the pipeline
text_clf.fit(X_train, y_train)
# Make predictions
y_pred = text_clf.predict(X_test)
print("Text Classification Results:")
print(classification_report(y_test, y_pred))
# Compare different vectorizers
vectorizers = {
'CountVectorizer': CountVectorizer(stop_words='english'),
'TF-IDF Vectorizer': TfidfVectorizer(stop_words='english')
}
for name, vectorizer in vectorizers.items():
# Transform text to feature vectors
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train classifier
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
# Predict
y_pred = clf.predict(X_test_vec)
# Print feature names and counts
feature_names = vectorizer.get_feature_names_out()
print(f"\n{name}:")
print(f"Number of features: {len(feature_names)}")
print(f"First 10 features: {feature_names[:10]}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Handling Imbalanced Data
Real-world datasets often have class imbalance:
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# Create imbalanced dataset
X_imb, y_imb = make_classification(
n_samples=1000, n_classes=2, weights=[0.9, 0.1],
n_features=20, random_state=42)
print(f"Class distribution: {np.bincount(y_imb)}")
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb)
# Train without addressing imbalance
clf_no_balance = RandomForestClassifier(random_state=42)
clf_no_balance.fit(X_train, y_train)
y_pred_no_balance = clf_no_balance.predict(X_test)
print("Without addressing imbalance:")
print(classification_report(y_test, y_pred_no_balance))
# Using SMOTE for oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print(f"Resampled class distribution: {np.bincount(y_train_resampled)}")
# Train with balanced data
clf_balanced = RandomForestClassifier(random_state=42)
clf_balanced.fit(X_train_resampled, y_train_resampled)
y_pred_balanced = clf_balanced.predict(X_test)
print("After SMOTE oversampling:")
print(classification_report(y_test, y_pred_balanced))
# Using imbalanced-learn pipeline
imb_pipeline = ImbPipeline([
('sampling', SMOTE(random_state=42)),
('classifier', RandomForestClassifier(random_state=42))
])
imb_pipeline.fit(X_train, y_train)
y_pred_pipeline = imb_pipeline.predict(X_test)
print("Using imbalanced-learn pipeline:")
print(classification_report(y_test, y_pred_pipeline))
Model Persistence
Save and load models for future use:
import joblib
import pickle
# Train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Save the model with joblib
joblib.dump(clf, 'random_forest_model.joblib')
# Later, load the model
loaded_model = joblib.load('random_forest_model.joblib')
print(f"Loaded model accuracy: {loaded_model.score(X_test, y_test):.4f}")
# Alternative: using pickle
with open('model.pkl', 'wb') as file:
pickle.dump(clf, file)
with open('model.pkl', 'rb') as file:
loaded_model = pickle.load(file)
print(f"Pickled model accuracy: {loaded_model.score(X_test, y_test):.4f}")
Practical Machine Learning Workflow: Putting It All Together
Let’s put everything together in a comprehensive example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
from sklearn.datasets import fetch_openml
# Load a more complex dataset: Titanic
titanic = fetch_openml('titanic', version=1, as_frame=True)
X, y = titanic.data, titanic.target
# Data exploration
print(f"Dataset shape: {X.shape}")
print(f"Feature names: {X.columns.tolist()}")
print(f"Missing values per column:\n{X.isnull().sum()}")
print(f"Target distribution:\n{y.value_counts()}")
# Data preprocessing pipeline
# Define numerical and categorical features
numerical_features = ['age', 'fare']
categorical_features = ['sex', 'embarked', 'pclass']
# Preprocessor for numerical features
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Preprocessor for categorical features
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine preprocessors
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
# Create model pipelines
models = {
'Logistic Regression': Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression(random_state=42))
]),
'Random Forest': Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
]),
'Gradient Boosting': Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', GradientBoostingClassifier(random_state=42))
])
}
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
# Train and evaluate models
results = {}
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
results[name] = {
'accuracy': accuracy,
'report': classification_report(y_test, y_pred),
'confusion_matrix': confusion_matrix(y_test, y_pred)
}
print(f"\n{name} Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Classification Report:\n{results[name]['report']}")
# Hyperparameter tuning for best model
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [None, 5, 10, 15],
'classifier__min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
models['Random Forest'], param_grid, cv=5,
scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("\nBest Parameters:")
print(grid_search.best_params_)
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
# Final model evaluation
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred_best)
final_report = classification_report(y_test, y_pred_best)
print("\nFinal Model Performance:")
print(f"Accuracy: {final_accuracy:.4f}")
print(f"Classification Report:\n{final_report}")
# Feature importance (for Random Forest)
if hasattr(best_model.named_steps['classifier'], 'feature_importances_'):
# Get feature names after preprocessing
cat_features_after_onehot = best_model.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numerical_features, cat_features_after_onehot])
# Get feature importances
importances = best_model.named_steps['classifier'].feature_importances_
indices = np.argsort(importances)[::-1]
# Plot feature importances
plt.figure(figsize=(12, 8))
plt.title('Feature Importances')
plt.bar(range(len(indices)), importances[indices], align='center')
plt.xticks(range(len(indices)), [feature_names[i] for i in indices], rotation=90)
plt.tight_layout()
plt.savefig('titanic_feature_importance.png')
plt.close()
Best Practices and Tips
Data Leakage Prevention
Avoid data leakage by properly separating training and testing data:
# Wrong approach (leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fitting on all data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Correct approach
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit only on training
X_test_scaled = scaler.transform(X_test) # Apply to test
Reproducibility
Ensure reproducible results by setting random seeds:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Set random seed
np.random.seed(42)
# Split data with fixed random state
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42)
# Set random state in the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
Memory Efficiency for Large Datasets
For large datasets, consider these approaches:
from sklearn.linear_model import SGDClassifier
# Use algorithms that support partial_fit
sgd = SGDClassifier(random_state=42)
# Simulate batch processing
batch_size = 100
n_batches = len(X_train) // batch_size
for i in range(n_batches):
start = i * batch_size
end = (i + 1) * batch_size
sgd.partial_fit(
X_train[start:end],
y_train[start:end],
classes=np.unique(y) # Required for first call
)
print(f"SGD Classifier accuracy: {sgd.score(X_test, y_test):.4f}")
Performance Monitoring
In production, monitor your model’s performance over time:
# Simulating performance monitoring over time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
import matplotlib.pyplot as plt
# Assume we have model predictions for different time periods
time_periods = ['Week 1', 'Week 2', 'Week 3', 'Week 4']
y_true_periods = [y_test.copy() for _ in range(4)] # In reality, these would be different
# Simulate performance drift
y_pred_periods = []
drift_factors = [0, 0.05, 0.1, 0.2] # Increasing drift
for factor in drift_factors:
# Simulate prediction drift by flipping some predictions
y_pred = best_model.predict(X_test).copy()
flip_indices = np.random.choice(
len(y_pred),
size=int(len(y_pred) * factor),
replace=False
)
y_pred[flip_indices] = 1 - y_pred[flip_indices] # Flip binary predictions
y_pred_periods.append(y_pred)
# Calculate metrics for each period
metrics = []
for period, y_true, y_pred in zip(time_periods, y_true_periods, y_pred_periods):
metrics.append({
'Period': period,
'Accuracy': accuracy_score(y_true, y_pred),
'Precision': precision_score(y_true, y_pred, pos_label='1'),
'Recall': recall_score(y_true, y_pred, pos_label='1'),
'F1 Score': f1_score(y_true, y_pred, pos_label='1')
})
# Create DataFrame for visualization
metrics_df = pd.DataFrame(metrics)
print(metrics_df)
# Plot metrics over time
plt.figure(figsize=(12, 6))
for metric in ['Accuracy', 'Precision', 'Recall', 'F1 Score']:
plt.plot(metrics_df['Period'], metrics_df[metric], marker='o', label=metric)
plt.title('Model Performance Over Time')
plt.xlabel('Time Period')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.savefig('performance_monitoring.png')
plt.close()
Resources for Learning More
To deepen your Scikit-learn knowledge:
- Official Documentation: Scikit-learn Documentation
- Books:
- “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron
- “Introduction to Machine Learning with Python” by Andreas Müller and Sarah Guido
- Online Courses:
- Scikit-learn Course by Inria
- “Machine Learning with Python” on Coursera or edX
- Tutorials:
Conclusion
Scikit-learn has democratized machine learning by making powerful algorithms accessible through a consistent, well-documented API. Its comprehensive collection of tools covers the entire machine learning workflow, from data preprocessing to model deployment.
In this guide, we’ve explored the key components of Scikit-learn and demonstrated how to use them effectively. We’ve covered:
- Data preparation and preprocessing
- Supervised learning for classification and regression
- Unsupervised learning for clustering and dimensionality reduction
- Model evaluation and hyperparameter tuning
- Pipelines for streamlining workflows
- Advanced topics like text processing and handling imbalanced data
Whether you’re a beginner just starting your machine learning journey or an experienced practitioner looking to refine your skills, Scikit-learn provides the tools you need to build effective models efficiently.
In future posts, I’ll explore more advanced topics, such as custom estimators, ensemble methods, and integration with deep learning frameworks. What aspects of Scikit-learn would you like to learn more about? Let me know in the comments!
“The goal of machine learning is to make humans more productive, not to replace them.” — Victor Muthomi, ML Developer