In the previous post, we trained an MNIST network with the quadratic (MSE) cost and a per-sample backprop loop.
In this post, we switch to the Cross-Entropy (CE) cost and a mini-batch (vectorized) update.

With sigmoid + CE (binary) or softmax + CE (multiclass), the output error is

δ^(L) = a^(L) − y

There is no extra σ′(z) factor multiplying the error at the output layer.
This avoids the severe gradient shrinkage you often get with MSE + sigmoid, where
δ^(L) = (a^(L) − y) ⊙ σ′(z^(L)) can be near zero when the output neuron saturates.
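To see why this matters, here is a minimal sketch (my own example, not from the original post) comparing the two output errors for a single saturated sigmoid neuron; the numbers are only illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# One badly saturated output neuron: z is large, a is close to 1, but the target is 0.
z = np.array([8.0])
a = sigmoid(z)                              # approx. 0.9997
y = np.array([0.0])

delta_mse = (a - y) * sigmoid_prime(z)      # approx. 3.4e-4 : the sigma'(z) factor kills the signal
delta_ce  = (a - y)                         # approx. 1.0    : the full error signal survives
print(delta_mse, delta_ce)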

Gradients (Mini-Batch, Vectorized)
Setup
Let the mini-batch size be m samples.
For each layer l:
- A^(l) ∈ R^(n_l × m) : activations
- Z^(l) ∈ R^(n_l × m) : pre-activations
- W^(l) ∈ R^(n_l × n_(l-1)) : weights
- b^(l) ∈ R^(n_l × 1) : biases
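As a quick sanity check on these shapes, the following sketch uses example sizes (a 784-30-10 network and m = 32, which are placeholders, not settings from this post) to show the first layer's computation and the column-wise broadcast of the bias.

import numpy as np

n0, n1, m = 784, 30, 32                 # example sizes only

A0 = np.random.randn(n0, m)             # A^(0): one input column per sample, (784, 32)
W1 = np.random.randn(n1, n0)            # W^(1): (30, 784)
b1 = np.random.randn(n1, 1)             # b^(1): (30, 1), broadcast over the m columns
Z1 = W1 @ A0 + b1

print(Z1.shape)                         # (30, 32) = (n_1, m)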
1. Forward pass
For l = 1 to L:
Z[l] = W[l] @ A[l-1] + b[l] # b[l] is broadcast over columns
A[l] = sigmoid(Z[l]) # for hidden layers
A[L] = softmax(Z[L]) # for output layer with multiclass CE
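Note that the full listing further down keeps sigmoid outputs for every layer; if you want the softmax output mentioned here instead, a minimal numerically stable sketch (my addition, not part of the original code) looks like this:

import numpy as np

def softmax(Z):
    # Subtracting the column-wise max does not change the result (softmax is shift-invariant)
    # but prevents overflow in np.exp for large pre-activations.
    Z_shift = Z - np.max(Z, axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / np.sum(expZ, axis=0, keepdims=True)   # each column sums to 1

Applied to Z[L] of shape (n_L, m), it returns one column of class probabilities per sample.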
2. Output error
For sigmoid + CE (binary) or softmax + CE (multiclass):
delta[L] = A[L] - Y
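If you want to convince yourself that the σ′(z) factor really cancels here, a small finite-difference check (illustrative only, not part of the original code) compares delta = A - Y against a numerical gradient of the binary cross-entropy cost with respect to the pre-activations:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(z, y):
    a = sigmoid(z)
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([0.3, -1.2, 2.0])
y = np.array([1.0, 0.0, 1.0])

delta = sigmoid(z) - y                                  # analytic output error

eps = 1e-6                                              # central difference in each z_j
num = np.array([(bce(z + eps * e, y) - bce(z - eps * e, y)) / (2 * eps)
                for e in np.eye(len(z))])

print(np.allclose(delta, num, atol=1e-5))               # True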
3. Hidden layer errors
For l = L - 1 down to 1:
delta[l] = (W[l+1].T @ delta[l+1]) * sigmoid_prime(Z[l])
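Shape-wise, W[l+1].T maps the (n_(l+1), m) error of the next layer back to (n_l, m), and the elementwise σ′(Z[l]) keeps that shape. A tiny sketch with made-up sizes (not from the original post):

import numpy as np

def sigmoid_prime(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

m, n_l, n_next = 32, 30, 10                # example sizes only
W_next = np.random.randn(n_next, n_l)      # W[l+1]: (10, 30)
delta_next = np.random.randn(n_next, m)    # delta[l+1]: (10, 32)
Z_l = np.random.randn(n_l, m)              # Z[l]: (30, 32)

delta_l = (W_next.T @ delta_next) * sigmoid_prime(Z_l)
print(delta_l.shape)                       # (30, 32): one error column per sample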
4. Gradients (Averaged over Mini-Batch)
For each layer l:
grad_W[l] = (1/m) * delta[l] @ A[l-1].T
grad_b[l] = (1/m) * np.sum(delta[l], axis=1, keepdims=True)
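The matrix product delta[l] @ A[l-1].T is exactly the sum of the per-sample outer products, so dividing by m gives the batch average. A short sketch of that equivalence (random values, purely illustrative):

import numpy as np

m, n_out, n_in = 4, 3, 5
delta = np.random.randn(n_out, m)          # errors, one column per sample
A_prev = np.random.randn(n_in, m)          # previous-layer activations

grad_W = (delta @ A_prev.T) / m            # vectorized form used above

# The same thing as an explicit loop over samples: average of outer products
grad_W_loop = sum(np.outer(delta[:, i], A_prev[:, i]) for i in range(m)) / m

print(np.allclose(grad_W, grad_W_loop))    # True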
5. Parameter Update
With learning rate eta:
W[l] = W[l] - eta * grad_W[l]
b[l] = b[l] - eta * grad_b[l]
Full code:
#%%
import numpy as np
import random


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)


class Network2:
    def __init__(self, sizes):
        self.sizes = sizes
        self.num_layers = len(sizes)
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, mini_batch):
        # Stack the batch list into one matrix: one column per sample.
        X = np.hstack([x for x, _ in mini_batch])   # shape: (features, m)
        A = X
        for W, b in zip(self.weights, self.biases):
            Z = W @ A + b                           # b is broadcast over the m columns
            A = sigmoid(Z)
        return A                                    # shape: (output_features, m)

    def SGD(self, training_data, mini_batch_size, epochs, eta, test_data=None):
        if test_data:
            n_test = len(test_data)
        n = len(training_data)
        for epoch in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k + mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta, "ce")
            if test_data:
                print(f"Epoch {epoch} : {self.evaluate(test_data, mini_batch_size)} | {n_test}")
            else:
                print(f"Epoch {epoch} complete.")

    def update_mini_batch(self, mini_batch, eta=0.1, loss="ce"):
        X = np.hstack([x for x, _ in mini_batch])   # (features, m)
        Y = np.hstack([y for _, y in mini_batch])   # (output_features, m)
        m = X.shape[1]

        # Forward pass, keeping every Z and A for backprop.
        As = [X]
        Zs = []
        A = X
        for W, b in zip(self.weights, self.biases):
            Z = W @ A + b
            A = sigmoid(Z)                          # (layer-l neurons, m)
            Zs.append(Z)
            As.append(A)

        # Output error.
        if loss == "ce":
            delta = As[-1] - Y                      # no sigma'(z) factor at the output
        elif loss == "mse":
            delta = (As[-1] - Y) * sigmoid_prime(Zs[-1])
        else:
            raise ValueError("loss must be 'ce' (cross-entropy) or 'mse'")

        # Gradients for the last layer, averaged over the mini-batch.
        nabla_bs = [None] * len(self.biases)
        nabla_ws = [None] * len(self.weights)
        nabla_bs[-1] = np.sum(delta, axis=1, keepdims=True) / m
        nabla_ws[-1] = (delta @ As[-2].T) / m

        # Backpropagate through the hidden layers.
        for l in range(2, self.num_layers):
            Z = Zs[-l]
            sp = sigmoid_prime(Z)
            delta = (self.weights[-l + 1].T @ delta) * sp   # (neurons in layer -l, m)
            nabla_bs[-l] = np.sum(delta, axis=1, keepdims=True) / m
            nabla_ws[-l] = (delta @ As[-l - 1].T) / m

        # Gradient-descent step.
        for i in range(len(self.weights)):
            self.weights[i] -= eta * nabla_ws[i]
            self.biases[i] -= eta * nabla_bs[i]

    def evaluate(self, test_data, mini_batch_size=128):
        n = len(test_data)
        correct = 0
        for k in range(0, n, mini_batch_size):
            batch = test_data[k:k + mini_batch_size]
            Y_hat = self.feedforward(batch)
            Y_true = np.hstack([y for _, y in batch])
            pred = np.argmax(Y_hat, axis=0)         # predicted class per column
            true = np.argmax(Y_true, axis=0)        # true class from the one-hot label
            correct += np.sum(pred == true)
        return correct
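For completeness, here is a minimal smoke test of the class above. It uses synthetic data shaped like the (x, y) column-vector pairs the code expects (x of shape (784, 1), y one-hot of shape (10, 1)); the layer sizes and hyperparameters are placeholders, not the settings from the original post.

#%%
rng = np.random.default_rng(0)

def fake_pair():
    # x looks like a flattened 28x28 image, y is a one-hot label over 10 classes.
    x = rng.standard_normal((784, 1))
    y = np.zeros((10, 1))
    y[rng.integers(10), 0] = 1.0
    return (x, y)

training_data = [fake_pair() for _ in range(1000)]
test_data = [fake_pair() for _ in range(200)]

net = Network2([784, 30, 10])
net.SGD(training_data, mini_batch_size=32, epochs=3, eta=0.5, test_data=test_data)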
Source: http://neuralnetworksanddeeplearning.com/chap3.html (Neural Networks and Deep Learning, chapter 3)