PyTorch TensorFlow Reproduction Gap in DermMNIST Models

PyTorch vs TensorFlow: Why 2026 Reproductions Fall 4% Short on DermMNIST

A machine learning researcher reports a persistent gap of roughly 3 to 4 percentage points between a TensorFlow-based paper and its PyTorch reproduction on the DermMNIST dataset, which raises questions about cross-framework implementation differences. The original study, "A Lightweight Hybrid Gabor Deep Learning Approach" by Ahmed et al., reports 77.01% test accuracy using a hybrid Gabor-CNN architecture. The PyTorch version consistently reaches only 73–74%, despite careful attempts to replicate kernel parameters, random seeds, and Gabor filter configurations.

Gabor Filter Implementation Differences

TensorFlow's padding behavior can differ subtly from PyTorch's, which matters for fixed, non-trainable filters such as Gabor kernels. TensorFlow's "SAME" padding is computed from the input size and stride and places any odd padding pixel on the bottom and right, whereas an integer padding argument to PyTorch's F.conv2d is always symmetric. For stride-1 convolutions with odd kernel sizes the two agree; with strides greater than 1 or even kernel sizes they can shift feature maps by a pixel. L2 normalization of the Gabor kernels may also be applied inconsistently, for example over a different axis or at a different point in kernel construction, so it should be checked numerically in both frameworks rather than assumed equivalent. Differences in dtype handling and graph-level optimization may add further small numerical discrepancies.
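
The padding arithmetic can be checked by hand. The sketch below is a minimal pure-Python illustration of the "SAME" rule next to symmetric integer padding. The 28-pixel input, 3-wide kernel, and stride 2 are assumed values chosen for illustration, not taken from the paper, and the function names are hypothetical.

```python
import math


def tf_same_padding(size, kernel, stride=1, dilation=1):
    """Return (pad_before, pad_after) along one axis under TensorFlow's "SAME" rule."""
    effective_kernel = dilation * (kernel - 1) + 1
    out_size = math.ceil(size / stride)
    total = max((out_size - 1) * stride + effective_kernel - size, 0)
    pad_before = total // 2  # any odd pixel goes to the end (bottom/right)
    return pad_before, total - pad_before


def conv_output_size(size, kernel, stride, pad_before, pad_after):
    """Output length of a convolution along one axis for explicit padding."""
    return (size + pad_before + pad_after - kernel) // stride + 1


# Assumed example: 28-pixel axis, 3-wide kernel, stride 2.
same = tf_same_padding(28, 3, stride=2)  # asymmetric padding
symmetric = (1, 1)                       # what an integer padding=1 does in PyTorch

print("SAME padding:     ", same, "->", conv_output_size(28, 3, 2, *same))
print("symmetric padding:", symmetric, "->", conv_output_size(28, 3, 2, *symmetric))
```

Both settings produce the same output length here, but the sliding windows start one pixel apart, so a fixed Gabor filter sees shifted inputs. If this turns out to be the mismatch, one option in PyTorch is to pad explicitly with F.pad and then convolve with padding=0.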

Weight Initialization Discrepancies

Even when architectures appear identical, weight initialization can diverge. Keras layers default to Glorot (Xavier) uniform initialization, while PyTorch's nn.Conv2d and nn.Linear default to a Kaiming-uniform variant, so the SE block and residual branches start from different weight scales unless the initializers are set explicitly to match the original code. A mismatch in layer order, e.g., ReLU before or after batch norm, can also disrupt residual learning and reduce generalization. Note that identical seed values do not produce identical weights across frameworks, because the random number generators differ. Compare per-layer weight statistics (mean, standard deviation, range) before training begins, or load the same weights into both models.
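
To make the scale difference concrete, the following sketch computes the sampling bounds of the two default schemes for a single convolution layer. It assumes a 3×3 convolution with 32 input and 64 output channels, which is an illustrative shape and not one taken from the paper; the helper names are hypothetical.

```python
import math


def conv_fans(in_channels, out_channels, kernel_h, kernel_w):
    """Fan-in and fan-out of a 2D convolution kernel."""
    receptive_field = kernel_h * kernel_w
    return in_channels * receptive_field, out_channels * receptive_field


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of U(-limit, limit) for Glorot/Xavier uniform (the Keras layer default)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def torch_default_limit(fan_in):
    """Bound used by nn.Conv2d / nn.Linear by default: kaiming_uniform_ with a=sqrt(5)."""
    a = math.sqrt(5.0)
    gain = math.sqrt(2.0 / (1.0 + a * a))
    return gain * math.sqrt(3.0 / fan_in)  # simplifies to 1 / sqrt(fan_in)


# Assumed layer shape, for illustration only.
fan_in, fan_out = conv_fans(32, 64, 3, 3)
glorot = glorot_uniform_limit(fan_in, fan_out)
torch_default = torch_default_limit(fan_in)

print(f"Glorot uniform limit:  {glorot:.6f}")
print(f"PyTorch default limit: {torch_default:.6f}")
print(f"ratio:                 {glorot / torch_default:.3f}")
```

For this shape the Keras default draws from a noticeably wider range than the PyTorch default. Biases differ as well: Keras initializes them to zero, while PyTorch samples them uniformly within 1/sqrt(fan_in) of zero.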

Data Preprocessing Variations

Grayscale conversion with the RGB coefficients (0.299, 0.587, 0.114) must match the original pipeline exactly, including whether the result is rounded or truncated to 8 bits and whether the conversion happens before or after scaling to [0, 1]. Small differences at this stage shift the input distribution that the fixed Gabor filters see. Also check for accidental data augmentation during evaluation: if random transforms are applied to the test set in PyTorch but not in the original TensorFlow code, the reproduction carries an artificial accuracy penalty. Always disable augmentation during validation and testing.
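
A quick way to see how much the 8-bit conversion step matters is to count the pixels on which two plausible conversions disagree. This is a pure-Python sketch using the coefficients quoted above; the two rounding modes and the coarse color grid are assumptions for illustration, not a description of what either codebase actually does.

```python
import math

LUMA = (0.299, 0.587, 0.114)  # coefficients quoted in the text


def gray_value(r, g, b, mode):
    """Convert one 8-bit RGB pixel to an 8-bit gray level."""
    value = LUMA[0] * r + LUMA[1] * g + LUMA[2] * b
    if mode == "round":
        return math.floor(value + 0.5)  # round half up
    if mode == "truncate":
        return int(value)               # drop the fractional part
    raise ValueError(f"unknown mode: {mode}")


def count_mismatches(step=17):
    """Count grid colors where rounding and truncation give different gray levels."""
    levels = range(0, 256, step)
    total = differing = 0
    for r in levels:
        for g in levels:
            for b in levels:
                total += 1
                if gray_value(r, g, b, "round") != gray_value(r, g, b, "truncate"):
                    differing += 1
    return differing, total


differing, total = count_mismatches()
print(f"{differing} of {total} sampled colors differ by one gray level")
```

A one-level difference per pixel is small, but it is systematic, so it is worth confirming on the actual DermMNIST images that both pipelines emit byte-identical grayscale arrays.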

Optimizer and Scheduler Misalignment

Both implementations use Adam, but the Keras default epsilon is 1e-7 while PyTorch's is 1e-8, which can change update sizes when gradients are very small. The ReduceLROnPlateau scheduler's patience and cooldown parameters were not specified in the reproduction. If patience is too low, the learning rate may drop prematurely and stall training before the model reaches its best performance. Log every optimizer and scheduler parameter and compare them across the two frameworks.
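
The effect of epsilon is easiest to see on a single update. The sketch below uses the textbook Adam formulas with bias correction on the first step; it is an illustration of the sensitivity, not a copy of either framework's implementation, and the frameworks may differ further in exactly where epsilon enters the update. The gradient magnitudes are assumed values.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first textbook Adam update for a scalar gradient."""
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2)  # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)


for grad in (1e-2, 1e-5, 1e-7):  # assumed gradient magnitudes
    step_keras_eps = adam_first_step(grad, eps=1e-7)
    step_torch_eps = adam_first_step(grad, eps=1e-8)
    ratio = step_torch_eps / step_keras_eps
    print(f"grad={grad:.0e}  eps=1e-7: {step_keras_eps:.3e}  "
          f"eps=1e-8: {step_torch_eps:.3e}  ratio={ratio:.3f}")
```

For ordinary gradient magnitudes the two settings are nearly indistinguishable, while for gradients close to epsilon the update sizes diverge substantially. Passing eps=1e-7 to torch.optim.Adam removes this particular variable from the comparison.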

Gradient Computation and Framework Bias

According to a 2026 arXiv study, even identical architectures can produce different gradient flows because of framework-specific computational graphs. Combining fixed Gabor filters with trainable layers may amplify this effect. To isolate the issue, start both models from the same weights, freeze every layer except the final fully connected layer, and compare accuracy. If the gap vanishes, the architecture is sound and the culprit is the training pipeline.
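
The freezing experiment might look like the following in PyTorch. This is a minimal sketch: TinyNet is a hypothetical stand-in for the reproduced model, the head is assumed to be an attribute named fc, and the 7-class output is an assumption about the DermMNIST label set.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the reproduced network."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.features(x))


def freeze_all_but_head(model, head_name="fc"):
    """Disable gradients for every parameter outside the submodule `head_name`."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name + ".")
        if param.requires_grad:
            trainable.append(name)
    return trainable


model = TinyNet()
trainable = freeze_all_but_head(model)
print("trainable parameters:", trainable)

# Only the head's parameters are handed to the optimizer.
head_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(head_params, lr=1e-3, eps=1e-7)

# Frozen BatchNorm layers still update running statistics in train mode,
# so the backbone is switched to eval mode for this experiment.
model.features.eval()
```

Calling model.train() later puts the backbone back into training mode, so the eval() call has to be repeated after it for the frozen statistics to stay fixed.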

To close the gap, validate every preprocessing step using identical input tensors from the same DermMNIST sample. Compare intermediate feature maps between frameworks. Perform direct weight transfer from TensorFlow to PyTorch after aligning tensor dimensions. This reveals whether the model itself is viable or if the error lies in execution.
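
Aligning tensor dimensions is the error-prone part of a direct weight transfer. The sketch below shows the axis permutations with NumPy arrays standing in for exported weights and activations. It assumes the usual layouts, Keras convolution kernels as (height, width, in, out) and PyTorch as (out, in, height, width); the function names and the tolerance are hypothetical.

```python
import numpy as np


def keras_conv_to_torch(kernel_hwio):
    """Reorder a Keras Conv2D kernel (H, W, in, out) to PyTorch layout (out, in, H, W)."""
    return np.transpose(kernel_hwio, (3, 2, 0, 1))


def keras_dense_to_torch(kernel_io):
    """Reorder a Keras Dense kernel (in, out) to PyTorch nn.Linear layout (out, in)."""
    return np.transpose(kernel_io, (1, 0))


def nhwc_to_nchw(feature_map):
    """Reorder a TensorFlow activation (N, H, W, C) to PyTorch layout (N, C, H, W)."""
    return np.transpose(feature_map, (0, 3, 1, 2))


def max_abs_diff(tf_feature_map, torch_feature_map):
    """Largest absolute difference between two activations after layout alignment."""
    return float(np.max(np.abs(nhwc_to_nchw(tf_feature_map) - torch_feature_map)))


# Stand-in arrays; in practice these come from the two trained or initialized models.
kernel = np.arange(3 * 3 * 1 * 4, dtype=np.float32).reshape(3, 3, 1, 4)
converted = keras_conv_to_torch(kernel)
print("converted kernel shape:", converted.shape)

tf_act = np.random.default_rng(0).normal(size=(2, 28, 28, 4)).astype(np.float32)
torch_act = nhwc_to_nchw(tf_act) + 1e-7  # simulated tiny numerical drift
print("max abs diff:", max_abs_diff(tf_act, torch_act))
```

Walking through the network layer by layer with a check like max_abs_diff shows the first layer at which the two frameworks disagree. One further caveat: the first dense layer after a flatten operation needs its input axis reordered as well, because flattening (H, W, C) and (C, H, W) activations yields different element orders.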

Ultimately, the 4-point gap is not merely a coding error; it is a symptom of deeper framework divergence. Reproducing state-of-the-art results requires not just algorithmic fidelity but careful attention to low-level implementation details. The PyTorch versus TensorFlow reproduction gap on DermMNIST remains a cautionary tale for reproducibility in deep learning.

AI-Powered Content

Sources: medium.com • arxiv.org
