Naive Bayes classifiers have a general assumption that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}}$
In the context of Naive Bayes classification, we often work with log probabilities to avoid numerical underflow:
$\log P(A|B) = \log P(B|A) + \log P(A) - \log P(B)$
The denominator $P(B)$ is omitted since it’s constant across classes and doesn’t affect the $\text{argmax}$:
$\log P(A|B) = \log P(B|A) + \log P(A)$
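As a quick illustration with made-up numbers (the class names and probabilities below are hypothetical, not from the source): the raw likelihoods are tiny values of the sort that underflow when multiplied across many features, but the log-space score $\log P(B|A) + \log P(A)$ is perfectly stable, and the class with the highest score is the prediction.

```python
import math

# Hypothetical two-class setup: priors and the likelihood of the
# observed evidence B under each class.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 1e-12, "ham": 1e-14}  # tiny values prone to underflow

# Log-space score: log P(B|A) + log P(A); log P(B) is dropped
# because it is the same for every class.
scores = {c: math.log(likelihoods[c]) + math.log(priors[c]) for c in priors}
best = max(scores, key=scores.get)  # "spam" wins here
```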
Additional Details
- **Prior Probability $P(A)$:** Initial belief about the probability of each class before seeing evidence.
- **Likelihood Probability $P(B|A)$:** Probability of observing the features given a particular class.
- **Marginal Probability $P(B)$:** Probability of observing the features regardless of class (normalizing constant).
- **Posterior Probability $P(A|B)$:** Final probability of a class given the observed features.
Gaussian PDF: $P(x|c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}$
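A minimal plain-Python sketch of the density above; evaluating it at the class mean with unit variance should give the well-known peak value $1/\sqrt{2\pi} \approx 0.3989$, which serves as a sanity check.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Density peaks at the class mean; with sigma2 = 1 the peak is 1/sqrt(2*pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)
```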
Advantages
- Simple and fast to train and predict
- Works well with small datasets
- Can handle both continuous and discrete features
- Less prone to overfitting
Disadvantages
- Assumes features are independent (naive assumption)
- May not perform well when features are correlated
- Gaussian Naive Bayes assumes features are normally distributed within each class, which may not hold in practice
- Sensitive to feature scaling
Laplacian Smoothing
Laplacian smoothing handles the zero-probability problem: a feature value never observed with a class would otherwise force the entire product of likelihoods to zero. For a feature value $x$ and class $c$:

$P(x|c) = \frac{\text{count}(x,c) + \alpha}{\text{count}(c) + \alpha|V|}$

where $\alpha$ is the smoothing parameter (typically 1) and $|V|$ is the vocabulary size.
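A small sketch of the formula above on toy counts (the words, counts, and vocabulary size are hypothetical): the word "meeting" never appears with this class, yet smoothing gives it a nonzero probability instead of zeroing out the whole product.

```python
from collections import Counter

# Toy word counts for one class, from a hypothetical training corpus.
word_counts = Counter({"offer": 3, "free": 2, "meeting": 0})
vocab_size = 5                       # assumed |V|
alpha = 1                            # smoothing parameter
total = sum(word_counts.values())    # count(c) = 5

def smoothed_prob(word):
    # (count(x, c) + alpha) / (count(c) + alpha * |V|)
    return (word_counts[word] + alpha) / (total + alpha * vocab_size)

p_meeting = smoothed_prob("meeting")  # (0 + 1) / (5 + 5) = 0.1, not zero
```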
Questions
Why is Naive Bayes called naive? Because of the simplifying assumption that all features are conditionally independent given the class:

$P(y|x_1,\dots,x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1,\dots,x_n)}$
Code
class NaiveBayes:
def fit(self, X, y):
n_samples, n_features = X.shape
self._classes = np.unique(y)
n_classes = len(self._classes)
self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
self._var = np.zeros((n_classes, n_features), dtype=np.float64)
self._priors = np.zeros(n_classes, dtype=np.float64)
for idx, c in enumerate(self._classes):
X_c = X[y == c]
self._mean[idx, :] = X_c.mean(axis=0)
self._var[idx, :] = X_c.var(axis=0)
self._priors[idx] = X_c.shape[0] / float(n_samples)
def predict(self, X):
y_pred = [self._predict(x) for x in X]
return np.array(y_pred)
def _predict(self, x):
posteriors = []
for idx, c in enumerate(self._classes):
prior = np.log(self._priors[idx])
posterior = np.sum(np.log(self._pdf(idx, x)))
posterior = prior + posterior
posteriors.append(posterior)
return self._classes[np.argmax(posteriors)]
def _pdf(self, class_idx, x):
mean = self._mean[class_idx]
var = self._var[class_idx]
numerator = np.exp(-((x - mean) ** 2) / (2 * var))
denominator = np.sqrt(2 * np.pi * var)
return numerator / denominator
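To see all the pieces together, here is a self-contained pure-Python prediction for a single feature and two classes. The per-class statistics are toy numbers standing in for what `fit` would compute; the prediction mirrors `_predict`: argmax over log prior plus log Gaussian likelihood.

```python
import math

# Hypothetical per-class statistics, as if produced by fit().
stats = {
    "a": {"prior": 0.5, "mean": 0.0, "var": 1.0},
    "b": {"prior": 0.5, "mean": 4.0, "var": 1.0},
}

def log_gaussian(x, mean, var):
    # log of the Gaussian PDF, computed directly to avoid underflow
    return -((x - mean) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def predict(x):
    # argmax over log P(c) + log P(x|c)
    return max(
        stats,
        key=lambda c: math.log(stats[c]["prior"])
        + log_gaussian(x, stats[c]["mean"], stats[c]["var"]),
    )

label = predict(0.5)  # 0.5 is closer to class "a"'s mean, so "a" wins
```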