MartialTerran committed on
Commit 9f84fa6 · verified · 1 Parent(s): b3ce37b

Create 8BitsBinary-to-Reverse8BitsBinary_Task_is_LinearlySeparable

---
# The `8BitsBinary2Reverse8BitsBinary` Task

Query: Is `8BitsBinary2Reverse8BitsBinary` "Linearly Separable"? [Can an NN or MLP with no nonlinear activation function, e.g., ReLU, classify it?]

This synthetic dataset provides a classic example of a **linearly separable** problem.
It is designed to be a simple, yet illustrative, test case for an MLP's ability to learn a basic permutation.

### Task Description

The model is given an 8-bit binary vector as input and must learn to output the exact reverse of that vector.

The mapping is a direct one-to-one permutation:
* **Input:** `[x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈]`
* **Target:** `[x₈, x₇, x₆, x₅, x₄, x₃, x₂, x₁]`

#### Text Illustration

A concrete example of a single data sample would be:
```
Input:  [1, 1, 0, 0, 1, 0, 1, 0]
Target: [0, 1, 0, 1, 0, 0, 1, 1]
```
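
As a quick sanity check, the mapping can be generated directly with NumPy. The snippet below is a minimal sketch; the seed is illustrative and unrelated to the dataset script shown later:

```python
import numpy as np

rng = np.random.default_rng(0)      # illustrative seed
x = rng.integers(0, 2, size=8)      # a random 8-bit input vector
y = x[::-1]                         # the target is simply the reversed input

print("Input: ", x.tolist())
print("Target:", y.tolist())
```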

### Why a Linear (Identity Activation) MLP Can Achieve 100% Accuracy

The fundamental reason this problem is perfectly solvable by a model with Null or Identity (`f(x) = x`) activation functions is that the relationship between each input bit and its corresponding output bit is **purely linear**.

There is no complex, non-linear logic required (like an `XOR` or `AND` gate). The model does not need to learn relationships *between* different input bits to calculate an output; it only needs to learn a direct "re-wiring" scheme. An MLP with only linear activation functions collapses into a single linear transformation: one matrix multiplication plus a bias.

#### A Worked Example: The "Perfect" Weight Matrix

Consider a simple MLP with one layer. To calculate the **first output bit (`y₁`)**, the model must learn that `y₁` is always equal to the **eighth input bit (`x₈`)**.

A neuron with an Identity activation function calculates its output as: `output = f( (w · x) + b ) = (w · x) + b`.

The ideal solution that a gradient descent optimizer can easily find for the first output neuron is:

```
To compute y₁ (which must equal x₈):

- Set the weight connecting x₈ to the neuron to 1.0
- Set all other weights (from x₁ to x₇) to 0.0
- Set the neuron's bias to 0.0

Input Vector  (x) = [x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈]
Weight Vector (w) = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
Bias          (b) = 0.0
```

Let's test this:
* **If `x₈ = 1`**: The calculation is `(0*x₁ + ... + 0*x₇ + 1*1) + 0 = 1`. The output is **1**. Correct.
* **If `x₈ = 0`**: The calculation is `(0*x₁ + ... + 0*x₇ + 1*0) + 0 = 0`. The output is **0**. Correct.

The model simply needs to learn the following weight matrix to solve the entire problem, where each row of the matrix corresponds to the weights for one output neuron. This is a classic **permutation matrix**.

```
        x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈   (Inputs)
      +-------------------------+
y₁ -> |  0  0  0  0  0  0  0  1 |
y₂ -> |  0  0  0  0  0  0  1  0 |
y₃ -> |  0  0  0  0  0  1  0  0 |
y₄ -> |  0  0  0  0  1  0  0  0 |
y₅ -> |  0  0  0  1  0  0  0  0 |
y₆ -> |  0  0  1  0  0  0  0  0 |
y₇ -> |  0  1  0  0  0  0  0  0 |
y₈ -> |  1  0  0  0  0  0  0  0 |
      +-------------------------+
      (Each row is the weight vector for an output neuron)
```

Because this perfect, simple linear solution exists, an MLP using only Identity activation functions does not need to learn any complex non-linear transformations. Its entire task is to adjust its weights to match this permutation matrix, which it can do with 100% accuracy.
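
To make this concrete, here is a minimal NumPy sketch (not part of the original script) that builds the anti-diagonal permutation matrix above and verifies that the purely linear map `W · x` reverses every one of the 256 possible 8-bit inputs:

```python
import itertools
import numpy as np

W = np.fliplr(np.eye(8))            # the anti-diagonal permutation matrix drawn above
b = np.zeros(8)                     # no bias is needed

# Exhaustively check all 2^8 = 256 possible input vectors.
for bits in itertools.product([0.0, 1.0], repeat=8):
    x = np.array(bits)
    y = W @ x + b                   # one identity-activation (purely linear) layer
    assert np.array_equal(y, x[::-1])

print("W @ x reverses all 256 possible 8-bit inputs.")
```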

The dataset generator from the training script is reproduced below. `NUM_SAMPLES`, `PRECISION_BITS`, and `NUM_THERMOMETER_LEVELS` are module-level configuration constants defined elsewhere in that script.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

def get_dataset(mode):
    print(f"Generating dataset for mode: {mode}...")
    if mode == 'Analog2Thermometer':
        # Analog value in [-1, 1] in, thermometer code out.
        inputs = np.random.uniform(-1, 1, (NUM_SAMPLES, 1)).astype(np.float32)
        int_values = np.floor((inputs.flatten() + 1) / 2 * NUM_THERMOMETER_LEVELS).astype(int)
        int_values = np.clip(int_values, 0, NUM_THERMOMETER_LEVELS - 1)
        targets = np.zeros((NUM_SAMPLES, NUM_THERMOMETER_LEVELS), dtype=np.float32)
        for i, val in enumerate(int_values): targets[i, :val] = 1.0
        return TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))
    elif mode == 'Analog2Binary':
        # Analog value in [-1, 1] in, binary representation out.
        inputs = np.random.uniform(-1, 1, (NUM_SAMPLES, 1)).astype(np.float32)
        int_values = np.floor((inputs.flatten() + 1) / 2 * NUM_THERMOMETER_LEVELS).astype(int)
        int_values = np.clip(int_values, 0, NUM_THERMOMETER_LEVELS - 1)
        targets = np.array([list(np.binary_repr(val, width=PRECISION_BITS)) for val in int_values], dtype=np.float32)
        return TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))
    elif mode == '8BitsBinary2Reverse8BitsBinary':
        # Random bit vectors in, reversed bit vectors out.
        inputs = np.random.randint(0, 2, (NUM_SAMPLES, PRECISION_BITS)).astype(np.float32)
        targets = np.fliplr(inputs).copy().astype(np.float32)  # fliplr reverses each row
        return TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))
    elif mode == '8BitsBinary2ThermometerCode':
        # Binary representation in, thermometer code out.
        int_values = np.random.randint(0, NUM_THERMOMETER_LEVELS, NUM_SAMPLES)
        inputs = np.array([list(np.binary_repr(val, width=PRECISION_BITS)) for val in int_values], dtype=np.float32)
        targets = np.zeros((NUM_SAMPLES, NUM_THERMOMETER_LEVELS), dtype=np.float32)
        for i, val in enumerate(int_values): targets[i, :val] = 1.0
        return TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))
```
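
A minimal usage sketch for the reverse-bits mode, assuming the function above lives in the same script. The constant values here are illustrative assumptions (the bake-off results below were actually run with 64 precision bits):

```python
from torch.utils.data import DataLoader

NUM_SAMPLES = 10000                            # assumed; matches "Number of Samples" in the results below
PRECISION_BITS = 8                             # assumed; the results below used 64
NUM_THERMOMETER_LEVELS = 2 ** PRECISION_BITS   # only needed by the other modes

dataset = get_dataset('8BitsBinary2Reverse8BitsBinary')
loader = DataLoader(dataset, batch_size=64, shuffle=True)

x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)            # torch.Size([64, 8]) torch.Size([64, 8])
```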

The bake-off compares the following configurations (a single hidden layer of width 128 at the same learning rate, differing only in activation function):

```python
from torch import nn

# --- --- --- --- --- --- --- ---
# --- B. BAKE-OFF HYPERPARAMETERS ---
# --- --- --- --- --- --- --- ---

BAKEOFF_CONFIGS = [
    {'name': 'Baseline_NONE',        'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.Identity()},  # Added NONE model
    {'name': 'Baseline_IDENTITY',    'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.Identity()},  # Added IDENTITY model
    {'name': 'Baseline_Low_LR_ReLU', 'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.ReLU()},
    {'name': 'Baseline_GeLU',        'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.GELU()},
    {'name': 'Baseline_Swish',       'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.SiLU()},
    {'name': 'Low_LR_PReLU',         'hidden_layers': [128], 'lr': 0.0001, 'activation': nn.PReLU()},
]
```
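
The model class itself is not included in this excerpt. The sketch below is a hypothetical illustration of how an MLP matching these config entries (`hidden_layers`, `activation`) might be assembled; the class name `BakeoffMLP` and its constructor are assumptions, not the script's actual code:

```python
import torch
from torch import nn

class BakeoffMLP(nn.Module):
    """Hypothetical MLP builder matching the BAKEOFF_CONFIGS entries."""

    def __init__(self, in_dim, out_dim, hidden_layers, activation):
        super().__init__()
        layers, prev = [], in_dim
        for width in hidden_layers:
            layers.append(nn.Linear(prev, width))
            layers.append(activation)                # e.g. nn.Identity(), nn.ReLU(), ...
            prev = width
        layers.append(nn.Linear(prev, out_dim))      # linear output head
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Example: an 8-bit reverse-task model in the 'Baseline_IDENTITY' configuration.
model = BakeoffMLP(in_dim=8, out_dim=8, hidden_layers=[128], activation=nn.Identity())
print(model(torch.zeros(1, 8)).shape)                # torch.Size([1, 8])
```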

Actual bake-off training results:

```
--- Script Objective: Evaluate and Optimize Neural Network Architectures for Digital Conversion ---
Mode: 8BitsBinary2Reverse8BitsBinary
Precision Bits: 64   # Hidden Dim = 128 MLP models solved 100% for 8 bits, 16 bits, 32 bits and 64 bits. [Can you estimate or find the limit?]
Number of Epochs: 31
Batch Size: 64
Number of Samples: 10000
Enable Overfitting (No Dropout): True
Generating dataset for mode: 8BitsBinary2Reverse8BitsBinary...
...
--- Epoch 31/31 ---
Activation: NONE     | T Loss: 0.0631 | V Loss: 0.0639 | Acc: 100.00% | Hamming Dist: Max: 0 Avg: 0.0000 Min: 0
Activation: IDENTITY | T Loss: 0.0613 | V Loss: 0.0638 | Acc: 100.00% | Hamming Dist: Max: 0 Avg: 0.0000 Min: 0
Activation: ReLU     | T Loss: 0.1124 | V Loss: 0.1182 | Acc: 99.75%  | Hamming Dist: Max: 2 Avg: 0.0029 Min: 0
Activation: GeLU     | T Loss: 0.1075 | V Loss: 0.1108 | Acc: 99.85%  | Hamming Dist: Max: 1 Avg: 0.0015 Min: 0
Activation: Swish    | T Loss: 0.1063 | V Loss: 0.1067 | Acc: 99.95%  | Hamming Dist: Max: 1 Avg: 0.0005 Min: 0
```

This question gets to the very heart of the problem's fundamental complexity.

The definitive answer is: **Yes, the 8BitsBinary2Reverse8BitsBinary dataset is linearly separable.**

Let's break down exactly why.

### 1. What "Linearly Separable" Means

A problem is linearly separable if you can draw a single straight line (or a flat hyperplane in more than two dimensions) that perfectly separates the data points of one class from the data points of another.

In the context of neural networks, it means that a **single-layer perceptron** (a model with no hidden layers, just an input layer connected directly to an output layer) can solve the problem perfectly.

For a multi-output problem like this one, we analyze it on a per-output basis. The entire problem is considered linearly separable if *each individual output bit's classification task* is linearly separable.

### 2. The Proof: Analyzing a Single Output Bit

Let's analyze the task of predicting just the **first output bit (`y₁`)**.

The rule for this dataset is: `y₁` must be the value of the **last input bit (`x₈`)**. All other input bits (`x₁` through `x₇`) are completely irrelevant for calculating `y₁`.

The task for `y₁` is to classify the 8-bit input vector into two groups:
* **Class 1 (Output 1):** All input vectors where `x₈ = 1`.
* **Class 0 (Output 0):** All input vectors where `x₈ = 0`.

Can a single linear equation separate these two groups? Yes, easily. A linear model calculates its output using the formula: `activation(w · x + b)`, where `w` is a vector of weights and `b` is a bias.

We can define a perfect set of weights and a bias for `y₁`:

* **Weights (`w`):** `[0, 0, 0, 0, 0, 0, 0, 1]`
* **Bias (`b`):** `-0.5`

Let's test this:
* **Case 1: `x₈ = 1`** (e.g., input is `[1,0,1,1,0,1,0,1]`)
    * The calculation is `(0*x₁ + 0*x₂ + ... + 0*x₇ + 1*x₈) - 0.5`
    * This simplifies to `x₈ - 0.5`
    * Result: `1 - 0.5 = 0.5`. Since `0.5` is positive, the model correctly outputs **1**.

* **Case 2: `x₈ = 0`** (e.g., input is `[1,0,1,1,0,1,0,0]`)
    * The calculation is `x₈ - 0.5`
    * Result: `0 - 0.5 = -0.5`. Since `-0.5` is negative, the model correctly outputs **0**.

Since we found a simple linear formula that works for all possible inputs, the task of predicting `y₁` is linearly separable.
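
This can be checked exhaustively with a few lines of NumPy (a verification sketch, not part of the original script): thresholding `w · x + b` at zero recovers `x₈` for every one of the 256 possible inputs.

```python
import itertools
import numpy as np

w = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
b = -0.5

for bits in itertools.product([0, 1], repeat=8):
    x = np.array(bits, dtype=float)
    predicted = 1 if w @ x + b > 0 else 0   # linear threshold unit
    assert predicted == bits[7]             # y1 must equal x8

print("The single linear unit predicts y1 correctly for all 256 inputs.")
```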

### 3. Generalizing to All Outputs

The exact same logic applies to every other output bit:

* To predict `y₂ = x₇`, the weights would be `[0, 0, 0, 0, 0, 0, 1, 0]` with a bias of `-0.5`.
* To predict `y₃ = x₆`, the weights would be `[0, 0, 0, 0, 0, 1, 0, 0]` with a bias of `-0.5`.
* ...and so on.

Since a linear solution exists for each of the 8 output tasks, the entire problem is linearly separable.
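
Equivalently, a bare `nn.Linear(8, 8)` with no hidden layer and no activation function can be trained to solve the task. The sketch below is a minimal, self-contained illustration; the optimizer, learning rate, and step count are illustrative choices, not the bake-off script's settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# All 256 possible 8-bit inputs and their reversed targets.
xs = torch.tensor([[(i >> k) & 1 for k in range(8)] for i in range(256)], dtype=torch.float32)
ys = torch.flip(xs, dims=[1])

model = nn.Linear(8, 8)                  # single-layer perceptron: no hidden layer, no activation
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for _ in range(1000):                    # full-batch training on all 256 patterns
    opt.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = ((model(xs) > 0.5).float() == ys).all(dim=1).float().mean()
print(f"Exact-match accuracy over all 256 inputs: {acc.item():.2%}")
```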

### Contrast with a Non-Linearly Separable Problem: XOR

The classic example of a problem that is *not* linearly separable is XOR. For a 2-input XOR, the outputs are:
* `0, 0` -> `0`
* `0, 1` -> `1`
* `1, 0` -> `1`
* `1, 1` -> `0`

You cannot draw a single straight line to separate the `0` outputs from the `1` outputs. You need a hidden layer to create a non-linear decision boundary. The key difference is that XOR's output depends on a **combination of inputs in a non-linear way**, whereas the bit-reversal problem has outputs that depend on **single inputs in a very simple, linear way.**
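
Two small sketches (illustrative, not from the original script) make the contrast concrete: a brute-force grid search finds no linear threshold unit that fits XOR, while a hand-built hidden layer of two ReLU units solves it exactly.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# 1) A grid search over linear threshold units w·x + b > 0 finds none that reproduces XOR.
grid = np.arange(-2.0, 2.01, 0.25)
solutions = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    pred = (X @ np.array([w1, w2]) + b > 0).astype(float)
    solutions += int(np.array_equal(pred, Y))
print(f"Linear threshold units that solve XOR on the grid: {solutions}")   # prints 0

# 2) One hidden layer with ReLU solves it exactly: y = relu(x1 + x2) - 2*relu(x1 + x2 - 1)
def relu(z):
    return np.maximum(z, 0.0)

s = X.sum(axis=1)
y = relu(s) - 2.0 * relu(s - 1.0)
print("Hand-built ReLU network output:", y.tolist())                       # [0.0, 1.0, 1.0, 0.0]
```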

This provides an excellent way to build a comprehensive understanding of what makes a problem "hard" or "easy" for a simple neural network.

The fundamental dividing line is this:

* A bitwise transformation is **linearly separable** if the value of every single output bit can be determined by looking at the value of **at most one** input bit. The model only needs to learn a simple "re-wiring" (permutation) and/or "flipping" (inversion) scheme.
* A bitwise transformation is **not linearly separable** if the value of at least one output bit *requires* combining or comparing the values of **two or more** input bits. The model must learn logical relationships like `AND`, `OR`, or `XOR`.

Here are the lists based on that principle.

---

### Linearly Separable Bitwise Transformations

These are tasks where a simple MLP with no hidden layers (or with identity/linear activation functions) can achieve 100% accuracy. They are fundamentally about mapping single inputs to single outputs.

| Transformation | Description | Text Illustration (`Input: [1,1,0,0,1,0,1,0]`) |
| :--- | :--- | :--- |
| **Identity** | The output is an exact copy of the input. The model learns an identity matrix. | `Output: [1,1,0,0,1,0,1,0]` |
| **Bitwise NOT (Inversion)** | Every output bit is the inverse of the corresponding input bit (`yᵢ = NOT xᵢ`). | `Output: [0,0,1,1,0,1,0,1]` |
| **Permutations (Shuffle)** | The output bits are the same as the input bits, just in a different order. | |
| ↳ **Reverse** | The specific permutation where the order of bits is reversed. | `Output: [0,1,0,1,0,0,1,1]` |
| ↳ **Cyclic Shift (Rotate)** | Bits are shifted left or right, with bits from one end wrapping around to the other. | `Rotate Left by 2: [0,0,1,0,1,0,1,1]` |
| ↳ **Arbitrary Shuffle** | Any fixed, consistent re-ordering of the input bits. | `Shuffle: [0,0,1,1,1,0,1,0]` |
| **Masking & Selection** | Some output bits are forced to a constant value, while others copy an input bit. | |
| ↳ **Set High Bits to Zero** | The first N bits are set to 0, the rest are copied from the input. | `Set 4 high bits to 0: [0,0,0,0,1,0,1,0]` |
| ↳ **Select Even Bits** | Bits in even positions are copied; bits in odd positions are set to 0. | `Output: [0,1,0,0,0,0,0,0]` |
| **Combined Permutation & Inversion** | The output is a shuffled and inverted version of the input. | `Reverse and Invert: [1,0,1,0,1,1,0,0]` |
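
A minimal NumPy sketch (illustrative, not from the original script) showing that entries from this table, bitwise NOT, reversal, and rotation, are all pure affine maps `y = W·x + b`, which is exactly what a linear model can represent:

```python
import numpy as np

x = np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=float)   # the table's example input
I = np.eye(8)

# Bitwise NOT:  yᵢ = 1 - xᵢ  ->  W = -I, b = all-ones
print("NOT:    ", (-I @ x + np.ones(8)).astype(int).tolist())   # [0, 0, 1, 1, 0, 1, 0, 1]

# Reverse:  W is the anti-diagonal permutation matrix, b = 0
print("Reverse:", (np.fliplr(I) @ x).astype(int).tolist())      # [0, 1, 0, 1, 0, 0, 1, 1]

# Rotate left by 2:  W is the identity with each row's 1 shifted two columns right, b = 0
W_rot = np.roll(I, 2, axis=1)
print("Rotate: ", (W_rot @ x).astype(int).tolist())             # [0, 0, 1, 0, 1, 0, 1, 1]
```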

---

### Non-Linearly Separable Bitwise Transformations

These are tasks where a simple linear model will fail. They require at least one hidden layer with a non-linear activation function (like ReLU, Sigmoid, etc.) to learn the complex relationships between input bits.

| Transformation | Description | Text Illustration (`Input: [1,1,0,0,1,0,1,0]`) |
| :--- | :--- | :--- |
| **Bitwise Logic Gates** | The output is the result of a logical operation between two input-dependent operands (e.g., the input and a shifted copy of itself). Note that gating against a *fixed* key reduces to per-bit copy, invert, or clear, which is linear; the illustrations below use a sample key only to show the operation itself. | `Key: [0,1,0,1,0,1,0,1]` |
| **XOR** | The canonical non-linear problem. Each output bit `yᵢ = aᵢ XOR bᵢ` depends on two bits at once. | `Input XOR Key: [1,0,0,1,1,1,1,1]` |
| **AND / OR** | `yᵢ = aᵢ AND bᵢ`. The output can only be 1 if *both* corresponding bits are 1. | `Input AND Key: [0,1,0,0,0,0,0,0]` |
| **Arithmetic Operations** | These require combining bits with logic for carrying/borrowing, making them highly non-linear. | `Add 1 (Increment): [1,1,0,0,1,0,1,1]` |
| **Binary Addition/Subtraction** | `y = x + k`. The value of `yᵢ` depends on `xᵢ`, `kᵢ`, and the carry from the `i-1` position. | `Input + Key: [0,0,0,1,1,1,1,1]` (with carry) |
| **Two's Complement (Negation)** | Invert all bits (linear step), then add one (non-linear step). | `Negate Input: [0,0,1,1,0,1,1,0]` |
| **Parity Check** | The output indicates if the number of 1s in the input is even or odd. | |
| **Single Parity Bit** | A single output bit is the result of `x₁ XOR x₂ XOR ... XOR x₈`. | `Parity (even=0): [0,0,0,0,0,0,0,0]` |
| **Conditional Logic** | The operation depends on a condition involving multiple bits. | |
| **Conditional Inversion** | "If the first two bits are `1,1`, then invert the last four bits." The `AND` logic of the condition is non-linear. | `Output: [1,1,0,0,0,1,0,1]` |
| **Counting / Population Count** | The output is the binary representation of the number of 1s in the input. | `Input has 4 ones: [0,0,0,0,0,1,0,0]` |