MartialTerran committed Commit 582b308 · verified · 1 Parent(s): b93df3b

Update README.md

Files changed (1):
  1. README.md +17 -2
README.md CHANGED
@@ -13,7 +13,15 @@ Large language models, especially those based on transformer architectures, have
 
 The Invention:
 
-This invention provides a method and apparatus for dynamically reducing logit computation by selectively deactivating tokens at the output layer. This deactivation effectively reduces the dimensions of the unembedding matrix and bias vector used in logit computation, leading to significant computational savings. This is accomplished during the forward pass of the model, allowing for context-dependent token deactivation.
+This invention provides a technique to make LLMs faster and more efficient by intelligently reducing the number of logit calculations during inference. It achieves this by dynamically identifying relevant words based on context and employing optimization strategies such as pruning, hierarchical computation, approximation, or specialized hardware.
+This invention provides a method and apparatus for dynamically reducing logit computation by selectively deactivating tokens at the output layer. This deactivation effectively reduces the dimensions of the unembedding matrix and bias vector used in logit computation, leading to significant computational savings. This is accomplished during the forward pass of the model, allowing for context-dependent token deactivation.
+In one embodiment, the method performs Dynamic Pruning/Filtering, for example:
+
+Identify irrelevant words: The system analyzes the context of the input text and dynamically identifies words, or selects pregrouped sets of words from the vocabulary, that are unlikely or unwanted as the next word in the sequence.
+
+Skip logit calculation: Logits are computed only for the subset of the vocabulary deemed relevant or wanted, significantly reducing computation.
+
+Use contextual clues: Signals such as the preceding words, sentence structure, topic modeling, or even sentiment analysis can be used to determine relevance and drive the selection.
 
 Detailed Description of the Method:
 
@@ -38,8 +46,15 @@ Probability Distribution and Token Selection: A probability distribution is gene
 Iteration: Steps 1 through 5 are repeated until the desired length of text is generated.
 
 Detailed Description of the Apparatus:
+The "apparatus" mentioned may be implemented to include one or more of:
+
+Software modules: Implemented as part of an LLM inference engine.
+
+Hardware accelerators: Specialized processors (e.g., GPUs, TPUs) optimized for LLM computations.
+
+Combined software-hardware systems: Leveraging both specialized hardware and optimized software algorithms.
 
-The apparatus comprises:
+An exemplary embodiment of the apparatus comprises:
 
 Processing Unit: A processing unit (e.g., CPU, GPU, TPU) configured to execute the steps of the method described above.
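
The reduced logit computation this commit describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the committed implementation: the uniform stride used to build `active_ids` is a stand-in for the context-dependent selector (previous words, topic, etc.), which the README leaves unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 16, 1000
W_u = rng.normal(size=(vocab_size, d_model))  # unembedding matrix (one row per token)
b = rng.normal(size=vocab_size)               # output bias vector
h = rng.normal(size=d_model)                  # final hidden state from the forward pass

# Context-dependent deactivation (hypothetical selector): keep only the
# token ids deemed relevant for the current context.
active_ids = np.arange(0, vocab_size, 10)     # 100 of 1000 tokens stay active

# Full computation touches every row of W_u: V dot-products of length d_model.
full_logits = W_u @ h + b

# Reduced computation uses only the active rows of W_u and entries of b,
# i.e. an effectively smaller unembedding matrix and bias vector.
reduced_logits = W_u[active_ids] @ h + b[active_ids]

# The subset logits match the corresponding entries of the full computation.
assert np.allclose(reduced_logits, full_logits[active_ids])

# Softmax over the active subset yields the next-token distribution.
probs = np.exp(reduced_logits - reduced_logits.max())
probs /= probs.sum()
next_token = int(active_ids[np.argmax(probs)])
```

With this shape of computation the output layer performs |active| × d_model multiply-adds instead of V × d_model, which is where the claimed savings come from; the quality of the result then hinges entirely on how well the selector avoids deactivating the true next token.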