API Anomaly Detection

This notebook implements an anomaly detection system for API access behavior. It covers data preprocessing, feature engineering, and the training of several machine learning models: Random Forest (with and without SMOTE), XGBoost, LightGBM, and Isolation Forest. The analysis identifies the behavioral features that best separate normal from outlier traffic and flags 9,860 samples that the existing algorithmic classification labeled normal as potential anomalies.

=== DATA QUALITY ANALYSIS ===

Supervised Dataset:
Missing values: 8
Data types:
Unnamed: 0                          int64
_id                                object
inter_api_access_duration(sec)    float64
api_access_uniqueness             float64
sequence_length(count)            float64
vsession_duration(min)              int64
ip_type                            object
num_sessions                      float64
num_users                         float64
num_unique_apis                   float64
source                             object
classification                     object
dtype: object

Remaining Dataset:
Missing values: 2
Data types:
Unnamed: 0                          int64
_id                                object
inter_api_access_duration(sec)    float64
api_access_uniqueness             float64
sequence_length(count)            float64
vsession_duration(min)              int64
ip_type                            object
behavior                           object
behavior_type                      object
num_sessions                      float64
num_users                         float64
num_unique_apis                   float64
source                             object
dtype: object

=== CLASS DISTRIBUTION ===

Supervised Dataset - Classification:
classification
normal     1106
outlier     593
Name: count, dtype: int64

Percentage distribution:
classification
normal     65.097116
outlier    34.902884
Name: count, dtype: float64

Remaining Dataset - Behavior:
behavior
outlier                  24146
Normal                    8946
Googlebot/2.1               95
BingPreview/1.0b            94
Googlebot/2.1;unknown       93
...
AdsBot-Google;Chrome;Chrome Mobile;Firefox;Mobile Firefox;Mobile Safari    1
SQL Injection    1
AdsBot-Google;Apple Mail;Chrome Mobile;Mobile Safari;Safari    1
AdsBot-Google;Android browser;Chrome;Chrome Mobile;Mobile Safari;Silk    1
Chrome;Chrome Mobile;Googlebot/2.1;Mobile Safari    1
Name: count, Length: 75, dtype: int64

Percentage distribution:
behavior
outlier                  70.144961
Normal                   25.988438
Googlebot/2.1             0.275978
BingPreview/1.0b          0.273073
Googlebot/2.1;unknown     0.270168
...
AdsBot-Google;Chrome;Chrome Mobile;Firefox;Mobile Firefox;Mobile Safari    0.002905
SQL Injection    0.002905
AdsBot-Google;Apple Mail;Chrome Mobile;Mobile Safari;Safari    0.002905
AdsBot-Google;Android browser;Chrome;Chrome Mobile;Mobile Safari;Silk    0.002905
Chrome;Chrome Mobile;Googlebot/2.1;Mobile Safari    0.002905
Name: count, Length: 75, dtype: float64

Remaining Dataset - Behavior Type:
behavior_type
outlier    24146
normal      8946
bot         1309
attack        22
Name: count, dtype: int64

Percentage distribution:
behavior_type
outlier    70.144961
normal     25.988438
bot         3.802690
attack      0.063911
Name: count, dtype: float64
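A report like the one above can be produced with a few pandas calls. The following is a minimal sketch, not the notebook's own code: the helper name `quality_report` and the demo rows are hypothetical, and percentages are computed over non-missing labels.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    """Summarize missing values, dtypes, and label distribution for one frame."""
    counts = df[label_col].value_counts()
    return {
        # Total number of missing cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "class_counts": counts.to_dict(),
        "class_percent": (counts / counts.sum() * 100).round(2).to_dict(),
    }


# Tiny illustrative frame (made-up rows, not the notebook's data).
demo = pd.DataFrame({
    "num_users": [1.0, 250.0, None, 1.0],
    "classification": ["outlier", "normal", "normal", "outlier"],
})
report = quality_report(demo, "classification")
print(report["missing_values"])
print(report["class_percent"])
```

Running the same helper on both datasets with their respective label columns (`classification`, `behavior`, `behavior_type`) would give the counts and percentages in one pass.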

=== BEHAVIORAL FEATURES ANALYSIS ===

Supervised Dataset - Numerical Features Summary:
       inter_api_access_duration(sec)  api_access_uniqueness  \
count                     1695.000000            1695.000000
mean                         1.501123               0.173226
std                         21.697558               0.283641
min                          0.000003               0.001200
25%                          0.000707               0.009199
50%                          0.002574               0.018717
75%                          0.024579               0.230769
max                        852.929250               1.000000

       sequence_length(count)  vsession_duration(min)  num_sessions  \
count             1699.000000            1.699000e+03   1699.000000
mean                61.648982            6.028341e+03    564.726898
std                205.803273            4.665042e+04   1179.931200
min                  0.000000            1.000000e+00      2.000000
25%                  9.984756            6.300000e+01      5.000000
50%                 17.095238            1.950000e+02    164.000000
75%                 41.349478            3.711500e+03    446.500000
max               3303.000000            1.352948e+06   9299.000000

         num_users  num_unique_apis
count  1699.000000      1699.000000
mean    406.263685        67.246616
std     960.718580        82.189214
min       1.000000         0.000000
25%       1.000000        14.000000
50%     141.000000        37.000000
75%     308.500000        90.000000
max    8447.000000       524.000000

Remaining Dataset - Numerical Features Summary:
       inter_api_access_duration(sec)  api_access_uniqueness  \
count                    34422.000000           34422.000000
mean                        20.548672               0.444437
std                         77.845323               0.302303
min                          0.000000               0.000714
25%                          0.381705               0.190476
50%                          2.205273               0.384615
75%                          9.495604               0.666667
max                       2333.627333               1.000000

       sequence_length(count)  vsession_duration(min)  num_sessions  \
count            34423.000000            3.442300e+04  34423.000000
mean                65.140736            2.960553e+04      9.880022
std                150.605273            7.987217e+04     65.043625
min                  0.000000            0.000000e+00      1.000000
25%                  6.666667            5.430000e+02      1.000000
50%                 15.000000            5.972000e+03      1.000000
75%                 58.000000            2.505650e+04      3.000000
max               2800.000000            2.787530e+06   1462.000000

          num_users  num_unique_apis
count  34423.000000     34423.000000
mean       3.615432        15.491851
std       10.520912        15.356068
min        1.000000         0.000000
25%        1.000000         5.000000
50%        1.000000        10.000000
75%        2.000000        21.000000
max      219.000000       178.000000

=== CATEGORICAL FEATURES ===

IP Type distribution in Supervised Dataset:
ip_type
default       1542
datacenter     157
Name: count, dtype: int64

IP Type distribution in Remaining Dataset:
ip_type
default       32900
private_ip     1384
datacenter      138
google_bot        1
Name: count, dtype: int64

Source distribution in Supervised Dataset:
source
E    1499
F     200
Name: count, dtype: int64

Source distribution in Remaining Dataset:
source
E    17789
F    16634
Name: count, dtype: int64


=== KEY INSIGHTS ===
Supervised Dataset: 1699 samples
Remaining Dataset: 34423 samples
Class Imbalance Ratio (Normal:Outlier): 1106:593
Outlier percentage in supervised data: 34.9%
Outlier percentage in remaining data: 70.1%

=== STATISTICAL ANALYSIS: NORMAL vs OUTLIER BEHAVIORS ===
Sample sizes: Normal = 1106, Outlier = 593

Feature Statistics Comparison:
                          Feature  Normal_Mean  Outlier_Mean  Normal_Median  \
0  inter_api_access_duration(sec)     0.006752      4.307192       0.000992
1           api_access_uniqueness     0.014329      0.471597       0.011557
2          sequence_length(count)    24.271808    131.360877      16.492023
3          vsession_duration(min)  2144.150090  13272.716695     191.500000
4                    num_sessions   862.426763      9.489039     282.000000
5                       num_users   623.552441      1.000000     220.500000
6                 num_unique_apis    94.440325     16.527825      68.000000

   Outlier_Median   P_Value   Cohens_D  Significant
0        0.100778  0.000000  -0.199030         True
1        0.409091  0.000000  -2.515906         True
2       22.000000  0.000119  -0.536983         True
3      229.000000  0.000934  -0.240040         True
4        3.000000  0.000000   0.769829         True
5        1.000000  0.000000   0.681144         True
6       11.000000  0.000000   1.062402         True

=== MOST DISCRIMINATIVE FEATURES ===
Features ranked by effect size (|Cohen's D|):
api_access_uniqueness          | Cohen's D: -2.516 | p-value: 4.64e-249 ***
num_unique_apis                | Cohen's D:  1.062 | p-value: 6.22e-157 ***
num_sessions                   | Cohen's D:  0.770 | p-value: 1.96e-250 ***
num_users                      | Cohen's D:  0.681 | p-value: 7.48e-265 ***
sequence_length(count)         | Cohen's D: -0.537 | p-value: 1.19e-04 ***
vsession_duration(min)         | Cohen's D: -0.240 | p-value: 9.34e-04 ***
inter_api_access_duration(sec) | Cohen's D: -0.199 | p-value: 1.64e-155 ***
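The effect sizes above can be reproduced in form with a short function. This sketch assumes the common pooled-standard-deviation definition of Cohen's d and the sign convention (normal minus outlier) implied by the table, where features that are larger for outliers have negative values. The output does not state which definition or which significance test the notebook used, so both are assumptions, and the sample arrays are made up.

```python
import numpy as np


def cohens_d(group_a, group_b) -> float:
    """Cohen's d with a pooled standard deviation (assumed definition)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Ignore missing values, since the supervised data has a few NaNs.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = a.size, b.size
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


# Made-up values shaped like api_access_uniqueness: low for normal, high for outliers.
normal = [0.01, 0.02, 0.01, 0.02]
outlier = [0.40, 0.50, 0.40, 0.50]
d = cohens_d(normal, outlier)
print(f"Cohen's d (normal - outlier): {d:.3f}")
```

A negative value means the outlier group has the larger mean, which matches the direction reported for `api_access_uniqueness`.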

=== FEATURE ENGINEERING & MODELING RECOMMENDATIONS ===

1. KEY BEHAVIORAL PATTERNS DISCOVERED:
   • API Access Uniqueness: Outliers show MUCH higher uniqueness (0.47 vs 0.01)
   • User Behavior: Outliers typically involve single users vs. normal with 620+ users
   • Session Patterns: Outliers have fewer but much longer sessions
   • API Diversity: Normal behavior accesses ~94 unique APIs vs ~17 for outliers
   • Sequence Complexity: Outliers have 5x longer API call sequences

2. DATA QUALITY INSIGHTS:
   • Supervised dataset: 1,699 samples (34.9% outliers)
   • Remaining dataset: 34,423 samples (70.1% outliers)
   • Very low missing data (< 1%)
   • Significant class imbalance in both datasets

3. FEATURE ENGINEERING RECOMMENDATIONS:
   • Log transform: 'inter_api_access_duration', 'vsession_duration', 'sequence_length'
   • Ratio features: 'apis_per_session', 'users_per_session', 'uniqueness_per_api'
   • Binning: Create categorical versions of highly skewed continuous features
   • Normalization: Apply robust scaling due to extreme outliers

4. MODELING APPROACH RECOMMENDATIONS:
   • Algorithm: Random Forest or XGBoost (handle class imbalance well)
   • Sampling: Use SMOTE or class weights to address imbalance
   • Validation: Stratified k-fold cross-validation
   • Metrics: Focus on Precision, Recall, F1-score, and AUC-ROC
   • Feature Selection: Use top discriminative features identified

5. TOP DISCRIMINATIVE FEATURES (by effect size):
   1. api_access_uniqueness (Cohen's D: -2.516 - Large effect)
   2. num_unique_apis (Cohen's D: 1.062 - Large effect)
   3. num_sessions (Cohen's D: 0.770 - Medium effect)
   4. num_users (Cohen's D: 0.681 - Medium effect)
   5. sequence_length(count) (Cohen's D: -0.537 - Medium effect)

6. SECURITY INSIGHTS:
   • Attackers show highly unique API access patterns (low repetition)
   • Single-user, long-duration sessions are suspicious
   • Normal users have diverse API usage with shorter, repeated sessions
   • Outlier detection should focus on API uniqueness and session patterns

================================================================================
DATASET READY FOR MACHINE LEARNING MODELING
================================================================================
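The log-transform and ratio-feature recommendations in item 3 can be sketched as follows. The feature names come from the output, but their exact formulas are not shown anywhere in it, so every formula below (including the use of `log1p` and the `+ 1` in the denominator) is an assumption made for illustration, and `engineer_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

LOG_COLUMNS = [
    "inter_api_access_duration(sec)",
    "vsession_duration(min)",
    "sequence_length(count)",
]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add log-transformed and ratio features (assumed definitions)."""
    out = df.copy()
    for col in LOG_COLUMNS:
        # log1p keeps zero values finite, which matters because the minimum is 0.
        out[f"log_{col}"] = np.log1p(out[col])
    sessions = out["num_sessions"].clip(lower=1)  # guard against division by zero
    out["apis_per_session"] = out["num_unique_apis"] / sessions
    out["users_per_session"] = out["num_users"] / sessions
    # +1 avoids dividing by zero when num_unique_apis is 0.
    out["uniqueness_per_api"] = out["api_access_uniqueness"] / (out["num_unique_apis"] + 1)
    return out


# One made-up row for illustration.
demo = pd.DataFrame({
    "inter_api_access_duration(sec)": [0.0],
    "vsession_duration(min)": [63],
    "sequence_length(count)": [9.0],
    "num_sessions": [5.0],
    "num_users": [1.0],
    "num_unique_apis": [14.0],
    "api_access_uniqueness": [0.5],
})
feats = engineer_features(demo)
print(feats[["apis_per_session", "users_per_session", "uniqueness_per_api"]])
```

Binning and robust scaling are left out to keep the sketch short.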

=== APPLYING FEATURE ENGINEERING ===
Supervised dataset shape after processing: (1695, 18)
Remaining dataset shape after processing: (34422, 20)

Supervised dataset features:
['inter_api_access_duration(sec)', 'api_access_uniqueness', 'sequence_length(count)',
 'vsession_duration(min)', 'num_sessions', 'num_users', 'num_unique_apis',
 'log_inter_api_access_duration(sec)', 'log_vsession_duration(min)', 'log_sequence_length(count)',
 'apis_per_session', 'users_per_session', 'uniqueness_per_api', 'sequence_rate', 'api_diversity',
 'ip_type_default', 'source_F', 'target']

Class distribution in supervised data:
target
0    1106
1     589
Name: count, dtype: int64

Class distribution in remaining data:
target
1    24145
0    10277
Name: count, dtype: int64
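The processed shapes and column names above are consistent with dropping rows that contain missing values (1,699 to 1,695 rows), mapping the label to a binary `target`, and one-hot encoding `ip_type` and `source` with the first category dropped. The sketch below illustrates that reading. It is an assumption about the preprocessing, not the notebook's code, and `prepare_supervised` is a hypothetical name.

```python
import pandas as pd


def prepare_supervised(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, build a 0/1 target, and one-hot encode categoricals."""
    out = df.drop(columns=["Unnamed: 0", "_id"], errors="ignore")
    out = out.dropna().copy()
    # 1 = outlier, 0 = normal, matching the class counts printed above.
    out["target"] = (out["classification"] == "outlier").astype(int)
    out = out.drop(columns=["classification"])
    # drop_first removes the alphabetically first level of each column
    # ('datacenter' and 'E'), leaving 'ip_type_default' and 'source_F'.
    return pd.get_dummies(out, columns=["ip_type", "source"], drop_first=True, dtype=int)


# Made-up rows; the third is dropped because of the missing value.
demo = pd.DataFrame({
    "num_users": [1.0, 200.0, None],
    "ip_type": ["default", "datacenter", "default"],
    "source": ["E", "F", "E"],
    "classification": ["outlier", "normal", "normal"],
})
prepared = prepare_supervised(demo)
print(prepared)
```

Applied to the remaining dataset, the same encoding would also produce `ip_type_private_ip` and `ip_type_google_bot`, since those IP types occur only there.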

=== DATA PREPARATION COMPLETE ===
Training set shape: (1356, 17)
Test set shape: (339, 17)

Training class distribution:
target
0    885
1    471
Name: count, dtype: int64

Test class distribution:
target
0    221
1    118
Name: count, dtype: int64

=== TRAINING RANDOM FOREST (Baseline) ===
Random Forest Results: Precision=1.000, Recall=1.000, F1=1.000, AUC=1.000

=== TRAINING RANDOM FOREST WITH SMOTE ===
Random Forest + SMOTE Results: Precision=1.000, Recall=1.000, F1=1.000, AUC=1.000
SMOTE increased training samples from 1356 to 1770

Class distribution after SMOTE:
target
0    885
1    885
Name: count, dtype: int64
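The jump from 1,356 to 1,770 training samples follows from SMOTE adding 885 - 471 = 414 synthetic minority rows. The output does not show how SMOTE was invoked, so the block below is a NumPy-only illustration of the interpolation idea for the binary case, not the library routine. The function name, `k=5`, and the random features are assumptions.

```python
import numpy as np


def smote_like_oversample(X, y, minority=1, k=5, seed=0):
    """Balance a binary dataset by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority]
    n_needed = int((y != minority).sum() - (y == minority).sum())
    if n_needed <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances among minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # For each synthetic row: pick a base point and one of its k neighbors,
    # then place the new point at a random position on the segment between them.
    base = rng.integers(0, len(X_min), size=n_needed)
    choice = rng.integers(0, k, size=n_needed)
    picked = neighbors[base, choice]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority, dtype=y.dtype)])
    return X_out, y_out


# Random features with the same class counts as the training split above.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1356, 3))
y_train = np.array([0] * 885 + [1] * 471)
X_res, y_res = smote_like_oversample(X_train, y_train)
print(len(y_res), int((y_res == 0).sum()), int((y_res == 1).sum()))
```

Oversampling is applied to the training split only, so the test set keeps its original class ratio.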

=== TRAINING XGBOOST ===
XGBoost Results: Precision=1.000, Recall=1.000, F1=1.000, AUC=1.000

=== TRAINING LIGHTGBM ===
LightGBM Results: Precision=1.000, Recall=1.000, F1=1.000, AUC=1.000

=== TRAINING ISOLATION FOREST (Unsupervised) ===
Isolation Forest Results: Precision=0.605, Recall=1.000, F1=0.754, AUC=0.998
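The Isolation Forest figures can be checked by hand against the test set, which holds 118 outliers. A recall of 1.000 means all 118 were flagged, and a precision of 0.605 is consistent with 77 normal samples also being flagged. The false-positive count is inferred from the reported precision rather than printed by the notebook.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# tp=118 and fn=0 follow from recall 1.000 on 118 test outliers;
# fp=77 is the inferred count that reproduces precision 0.605.
p, r, f1 = precision_recall_f1(tp=118, fp=77, fn=0)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.605 1.0 0.754
```

Under that reading, roughly a third of the 221 normal test samples were flagged as outliers, which is the cost of the unsupervised model's perfect recall.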

=== CROSS-VALIDATION ANALYSIS ===
Cross-validating Random Forest (Baseline)...
Cross-validating XGBoost...
Cross-validating LightGBM...

=== CROSS-VALIDATION RESULTS ===
Model                      Precision        Recall           F1-Score         AUC-ROC
-------------------------------------------------------------------------------------
Random Forest (Baseline)   1.000 ± 0.000    1.000 ± 0.000    1.000 ± 0.000    1.000 ± 0.000
XGBoost                    0.998 ± 0.003    1.000 ± 0.000    0.999 ± 0.002    1.000 ± 0.001
LightGBM                   1.000 ± 0.000    1.000 ± 0.000    1.000 ± 0.000    1.000 ± 0.000

=== FEATURE IMPORTANCE ANALYSIS ===

Top 10 Most Important Features (Random Forest):
 1. num_users                            0.2657
 2. api_access_uniqueness                0.1643
 3. num_sessions                         0.1599
 4. api_diversity                        0.1242
 5. uniqueness_per_api                   0.0879
 6. users_per_session                    0.0721
 7. apis_per_session                     0.0566
 8. num_unique_apis                      0.0191
 9. log_inter_api_access_duration(sec)   0.0142
10. inter_api_access_duration(sec)       0.0126

Top 10 Most Important Features (XGBoost):
 1. num_users                            0.9446
 2. num_sessions                         0.0339
 3. users_per_session                    0.0084
 4. api_access_uniqueness                0.0040
 5. sequence_length(count)               0.0022
 6. sequence_rate                        0.0018
 7. uniqueness_per_api                   0.0017
 8. apis_per_session                     0.0017
 9. api_diversity                        0.0016
10. num_unique_apis                      0.0001

=== SEMI-SUPERVISED LEARNING APPROACH ===
Common features between datasets: 17
Missing in remaining: set()
Extra in remaining: {'ip_type_private_ip', 'ip_type_google_bot'}
Aligned supervised data shape: (1695, 17)
Aligned remaining data shape: (34422, 17)

=== FINAL MODEL: LightGBM ===
Training completed on full supervised dataset

=== PREDICTING ON REMAINING DATASET ===
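Predicting on the remaining dataset requires the same 17 feature columns, in the same order, that the model was trained on. The alignment reported above can be sketched with `DataFrame.reindex`. The helper name and the three-column demo are hypothetical, and filling absent columns with 0 is an assumption.

```python
import pandas as pd


def align_to_training(remaining: pd.DataFrame, train_columns: list):
    """Reorder and trim columns to match the training feature set."""
    extra = sorted(set(remaining.columns) - set(train_columns))
    missing = sorted(set(train_columns) - set(remaining.columns))
    # reindex drops extra columns and adds any missing ones filled with 0.
    aligned = remaining.reindex(columns=list(train_columns), fill_value=0)
    return aligned, missing, extra


# Made-up frame carrying the two extra dummy columns named in the output.
train_columns = ["num_users", "ip_type_default", "source_F"]
remaining_demo = pd.DataFrame({
    "num_users": [1.0, 3.0],
    "ip_type_default": [0, 1],
    "ip_type_private_ip": [1, 0],
    "ip_type_google_bot": [0, 0],
    "source_F": [1, 0],
})
aligned, missing, extra = align_to_training(remaining_demo, train_columns)
print(list(aligned.columns), missing, extra)
```

One consequence of dropping the extra columns is that `private_ip` and `google_bot` rows end up with `ip_type_default = 0`, the same encoding the training data uses for `datacenter`.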

Total samples in remaining dataset: 34,422

Original algorithmic classification:
- Outliers: 24,145 (70.1%)
- Normal: 10,277 (29.9%)

Our model predictions:
- Outliers: 34,005 (98.8%)
- Normal: 417 (1.2%)

Agreement with algorithmic classification:
- Agreement: 24,562 (71.4%)
- Disagreement: 9,860 (28.6%)

=== HIGH-CONFIDENCE DISAGREEMENTS ===
High-confidence predictions (>0.8): 34,005
High-confidence disagreements: 9,860
These represent potential anomalies missed by the algorithmic approach
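The agreement and disagreement counts above come from comparing the model's predictions with the original labels. The sketch below shows one way to compute them. It assumes that "high-confidence" means a predicted outlier probability above 0.8, which is a guess based on the reported count matching the number of predicted outliers. The function name and the five demo rows are hypothetical.

```python
import numpy as np


def compare_labels(original, proba_outlier, threshold=0.5, high_conf=0.8):
    """Count agreements and both disagreement types between two labelings."""
    original = np.asarray(original)
    proba = np.asarray(proba_outlier, dtype=float)
    predicted = (proba >= threshold).astype(int)
    agree = predicted == original
    confident = proba > high_conf  # assumed meaning of "high-confidence"
    return {
        "agreement": int(agree.sum()),
        "disagreement": int((~agree).sum()),
        # Type 1: labeled normal originally, predicted outlier by the model.
        "normal_to_outlier": int(((original == 0) & (predicted == 1)).sum()),
        # Type 2: labeled outlier originally, predicted normal by the model.
        "outlier_to_normal": int(((original == 1) & (predicted == 0)).sum()),
        "high_conf_disagreement": int((~agree & confident).sum()),
    }


# Made-up labels and probabilities (1 = outlier, 0 = normal).
original = [1, 1, 0, 0, 0]
proba_outlier = [0.99, 0.95, 0.97, 0.10, 0.60]
result = compare_labels(original, proba_outlier)
print(result)
```

The reported totals are internally consistent: 24,145 original outliers plus 417 agreed normals give 24,562 agreements, and 34,005 predicted outliers minus 24,145 give the 9,860 disagreements.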


=== DETAILED DISAGREEMENT ANALYSIS ===
Type 1: Original=Normal, Predicted=Outlier: 9,860 cases
Type 2: Original=Outlier, Predicted=Normal: 0 cases
High-confidence new anomalies (Type 1): 9,860
These represent potential attack patterns missed by the algorithmic approach

=== SUMMARY OF FINDINGS ===
✓ Trained 3 different anomaly detection models
✓ Best model: LightGBM with F1-score: 1.000
✓ Identified 9,860 potential new anomalies with high confidence
✓ Model shows 71.4% agreement with algorithmic classification
✓ Discovered behavioral patterns in 10 key features