Updated the documentation and added more test cases.
README.md
CHANGED
@@ -25,49 +25,65 @@ summary with the reference overlap summary. It evaluates the semantic overlap su

computes precision, recall and F1 scores.

## How to Use

Sem-F1 takes 2 mandatory arguments:

- `predictions`: List of predictions. Format varies based on the `tokenize_sentences` and `multi_references` flags.
- `references`: List of references. Format varies based on the `tokenize_sentences` and `multi_references` flags.

```python
from evaluate import load

predictions = [
    ["I go to School.", "You are stupid."],
    ["I love adventure sports."],
]
references = [
    ["I go to School.", "You are stupid."],
    ["I love outdoor sports."],
]
metric = load("semf1")
results = metric.compute(predictions=predictions, references=references)
for score in results:
    print(f"Precision: {score.precision}, Recall: {score.recall}, F1: {score.f1}")
```
Sem-F1 also accepts multiple optional arguments (see the sketch after this list):

- `model_type (str)`: Model to use for encoding sentences. Options: `['pv1', 'stsb', 'use']`
  - `pv1` - [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1)
  - `stsb` - [stsb-roberta-large](https://huggingface.co/sentence-transformers/stsb-roberta-large)
  - `use` - [Universal Sentence Encoder](https://huggingface.co/sentence-transformers/use-cmlm-multilingual) (Default)
- `tokenize_sentences (bool)`: Flag to indicate whether to tokenize the sentences in the input documents. Default: True.
- `multi_references (bool)`: Flag to indicate whether multiple references are provided. Default: False.
- `gpu (Union[bool, str, int, List[Union[str, int]]])`: Whether to use GPU, CPU, or multiple processes for computation.
- `batch_size (int)`: Batch size for encoding. Default: 32.
- `verbose (bool)`: Flag to indicate verbose output. Default: False.
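
A minimal sketch combining the optional flags. The values are illustrative; `model_type`, `tokenize_sentences`, and `multi_references` behave as described in the list above:

```python
from evaluate import load

metric = load("semf1")

# Two untokenized predictions, each paired with two whole-document references.
predictions = [
    "I go to School. You are stupid.",
    "I love adventure sports.",
]
references = [
    ["I go to School. You are stupid.", "You are not smart."],
    ["I love outdoor sports.", "Adventure sports are my passion."],
]

results = metric.compute(
    predictions=predictions,
    references=references,
    model_type="stsb",        # any of 'pv1', 'stsb', 'use'
    tokenize_sentences=True,  # inputs are whole documents; let Sem-F1 split them
    multi_references=True,    # each example carries two references
    batch_size=16,
)
for score in results:
    print(f"Precision: {score.precision:.3f}, Recall: {score.recall}, F1: {score.f1:.3f}")
```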

Refer to the inputs description for more detailed usage:

```python
import evaluate

metric = evaluate.load("semf1")
print(metric.inputs_description)
```

[//]: # (*List all input arguments in the format below*)

[//]: # (- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*)

### Output Values

A list of `Scores` dataclasses, one per sample (see the usage sketch below):

- `precision: float`: Precision score, which ranges from 0.0 to 1.0.
- `recall: List[float]`: Recall scores corresponding to each reference, each ranging from 0.0 to 1.0.
- `f1: float`: F1 score (computed between precision and the average recall).
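
A short sketch of consuming the returned list. The field semantics follow the list above; the corpus-level averaging is our own illustration, not a convention the metric prescribes:

```python
# Continues the usage snippet above: `results` is a list of Scores, one per sample.
results = metric.compute(predictions=predictions, references=references)

mean_f1 = sum(score.f1 for score in results) / len(results)
print(f"Corpus-level Sem-F1: {mean_f1:.4f}")

# With multiple references, `recall` holds one value per reference.
for idx, score in enumerate(results):
    print(f"Sample {idx}: precision={score.precision:.3f}, best recall={max(score.recall):.3f}")
```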

## Future Extensions

Currently, we have only implemented the 3 encoders* that we experimented with in our
[paper](https://aclanthology.org/2022.emnlp-main.49/). However, it can easily be extended to more models by simply
extending the `Encoder` base class, as sketched below (refer to the `encoder_models.py` file).

`*` *In our paper, we used the TensorFlow [version](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder)
of the USE model; however, in our current implementation, we use the [PyTorch version](https://huggingface.co/sentence-transformers/use-cmlm-multilingual).*
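
A rough sketch of what such an extension could look like. The `Encoder` base class lives in `encoder_models.py`, which this diff does not show, so the class name and the `encode` signature below are assumptions modeled on how `SBertEncoder` is exercised in `tests.py`:

```python
# Hypothetical sketch: the real base class is in encoder_models.py and its
# exact interface may differ from what is assumed here.
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

from encoder_models import Encoder  # assumed name of the base class


class MyCustomEncoder(Encoder):
    """Wraps any sentence-transformers checkpoint behind the assumed Encoder interface."""

    def __init__(self, model_name: str, device, batch_size: int, verbose: bool):
        self.model = SentenceTransformer(model_name)
        self.device = device
        self.batch_size = batch_size
        self.verbose = verbose

    def encode(self, sentences: List[str]) -> np.ndarray:
        # SentenceTransformer.encode returns one embedding per input sentence.
        return self.model.encode(
            sentences,
            device=self.device,
            batch_size=self.batch_size,
            show_progress_bar=self.verbose,
        )
```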

[//]: # (*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*)
semf1.py
CHANGED
@@ -14,7 +14,6 @@

# TODO: Add test cases, Remove tokenize_sentences flag since it can be determined from the input itself.
"""Sem-F1 metric"""

from typing import List, Optional, Tuple

import datasets

(removed: `from functools import partial`)
@@ -56,69 +55,93 @@ sentence level and computes precision, recall and F1 scores.

"""

_KWARGS_DESCRIPTION = """
Sem-F1 compares the system-generated summaries (predictions) with ground-truth reference summaries (references)
using precision, recall, and F1 score based on sentence embeddings.

Args:
    predictions (list): List of predictions. Format varies based on `tokenize_sentences` and `multi_references` flags.
    references (list): List of references. Format varies based on `tokenize_sentences` and `multi_references` flags.
    model_type (str): Model to use for encoding sentences. Options: ['pv1', 'stsb', 'use']
        pv1 - paraphrase-distilroberta-base-v1 (Default)
        stsb - stsb-roberta-large
        use - Universal Sentence Encoder
    tokenize_sentences (bool): Flag to indicate whether to tokenize the sentences in the input documents. Default: True.
    multi_references (bool): Flag to indicate whether multiple references are provided. Default: False.
    gpu (Union[bool, str, int, List[Union[str, int]]]): Whether to use GPU or CPU for computation.
        bool -
            False - CPU (Default)
            True - GPU (device 0) if a GPU is available, else CPU
        int -
            n - GPU, device index n
        str -
            'cuda', 'gpu', 'cpu'
        List[Union[str, int]] - multiple GPUs/CPUs, i.e. use multiple processes when computing embeddings
    batch_size (int): Batch size for encoding. Default: 32.
    verbose (bool): Flag to indicate verbose output. Default: False.

Returns:
    List of Scores dataclass with attributes as follows -
        precision: float - precision score
        recall: List[float] - list of recall scores corresponding to single/multiple references
        f1: float - F1 score (computed between precision and the average recall)

Examples of input formats:

    Case 1: multi_references = False, tokenize_sentences = False
        predictions: List[List[str]] - list of predictions where each prediction is a list of sentences.
        references: List[List[str]] - list of references where each reference is a list of sentences.
        Example:
            predictions = [["This is a prediction sentence 1.", "This is a prediction sentence 2."]]
            references = [["This is a reference sentence 1.", "This is a reference sentence 2."]]

    Case 2: multi_references = False, tokenize_sentences = True
        predictions: List[str] - list of predictions where each prediction is a document.
        references: List[str] - list of references where each reference is a document.
        Example:
            predictions = ["This is a prediction sentence 1. This is a prediction sentence 2."]
            references = ["This is a reference sentence 1. This is a reference sentence 2."]

    Case 3: multi_references = True, tokenize_sentences = False
        predictions: List[List[str]] - list of predictions where each prediction is a list of sentences.
        references: List[List[List[str]]] - list of references where each example has multiple references
            (List[r1, r2, ...]) and each ri is a list of sentences.
        Example:
            predictions = [["Prediction sentence 1.", "Prediction sentence 2."]]
            references = [
                [
                    ["Reference sentence 1.", "Reference sentence 2."],        # Reference 1
                    ["Alternative reference 1.", "Alternative reference 2."],  # Reference 2
                ]
            ]

    Case 4: multi_references = True, tokenize_sentences = True
        predictions: List[str] - list of predictions where each prediction is a document.
        references: List[List[str]] - list of references where each example has multiple references
            (List[r1, r2, ...]) and each ri is a document.
        Example:
            predictions = ["Prediction sentence 1. Prediction sentence 2."]
            references = [
                [
                    "Reference sentence 1. Reference sentence 2.",        # Reference 1
                    "Alternative reference 1. Alternative reference 2.",  # Reference 2
                ]
            ]

Examples:

    >>> import evaluate
    >>> predictions = [
    ...     ["I go to School. You are stupid."],
    ...     ["I love adventure sports."],
    ... ]
    >>> references = [
    ...     ["I go to School. You are stupid."],
    ...     ["I love outdoor sports."],
    ... ]
    >>> metric = evaluate.load("semf1")
    >>> results = metric.compute(predictions=predictions, references=references)
    >>> for score in results:
    ...     print(f"Precision: {score.precision}, Recall: {score.recall}, F1: {score.f1}")
"""
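
The `gpu` argument accepts several shapes. A quick sketch of plausible values, assuming the semantics documented in the docstring above (device resolution itself presumably happens in `get_gpu` from `utils.py`, which this diff does not show):

```python
# Illustrative values for the `gpu` argument, following the docstring above.
# `preds` and `refs` stand in for any valid predictions/references pair.
metric.compute(predictions=preds, references=refs, gpu=False)   # CPU (default)
metric.compute(predictions=preds, references=refs, gpu=True)    # GPU 0 if available, else CPU
metric.compute(predictions=preds, references=refs, gpu=3)       # GPU with device index 3
metric.compute(predictions=preds, references=refs, gpu="cuda")  # 'cuda', 'gpu', or 'cpu'
metric.compute(predictions=preds, references=refs, gpu=[0, 1])  # multiple processes across devices 0 and 1
```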

@@ -194,7 +217,12 @@ def _validate_input_format(

    - `PREDICTION_TYPE` and `REFERENCE_TYPE` are defined at the top of the file
    """
    if len(predictions) != len(references):
        raise ValueError("Predictions and references must have the same length.")

    def is_list_of_strings_at_depth(lst_obj, depth: int):
        return is_nested_list_of_type(lst_obj, element_type=str, depth=depth)

    if tokenize_sentences and multi_references:
        condition = is_list_of_strings_at_depth(predictions, 1) and is_list_of_strings_at_depth(references, 2)
    elif not tokenize_sentences and multi_references:
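
`is_nested_list_of_type` lives in `utils.py`, which this diff does not show; the behavior sketched below is inferred from how the helper is used above and is an assumption, not documented behavior:

```python
# Inferred behavior of the depth helper (assumption: utils.is_nested_list_of_type
# checks that every element at the given nesting depth is a str).
def is_list_of_strings_at_depth(lst_obj, depth: int):
    if depth == 0:
        return isinstance(lst_obj, str)
    return isinstance(lst_obj, list) and all(
        is_list_of_strings_at_depth(item, depth - 1) for item in lst_obj
    )

assert is_list_of_strings_at_depth(["a", "b"], depth=1)           # List[str]
assert is_list_of_strings_at_depth([["a"], ["b", "c"]], depth=2)  # List[List[str]]
assert not is_list_of_strings_at_depth(["a", "b"], depth=2)
```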

@@ -225,7 +253,7 @@ class SemF1(evaluate.Metric):

            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=[
                # F0: Multi References: False, Tokenize_Sentences = False
                datasets.Features(
                    {
                        # predictions: List[List[str]] - List of predictions where prediction is a list of sentences

@@ -234,7 +262,7 @@

                        "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
                    }
                ),
                # F1: Multi References: False, Tokenize_Sentences = True
                datasets.Features(
                    {
                        # predictions: List[str] - List of predictions

@@ -243,7 +271,7 @@

                        "references": datasets.Value("string", id="sequence"),
                    }
                ),
                # F2: Multi References: True, Tokenize_Sentences = False
                datasets.Features(
                    {
                        # predictions: List[List[str]] - List of predictions where prediction is a list of sentences

@@ -255,7 +283,7 @@

                            datasets.Sequence(datasets.Value("string", id="sequence"), id="ref"), id="references"),
                    }
                ),
                # F3: Multi References: True, Tokenize_Sentences = True
                datasets.Features(
                    {
                        # predictions: List[str] - List of predictions
@@ -319,6 +347,12 @@ class SemF1(evaluate.Metric):

        :return: List of Scores dataclass with precision, recall, and F1 scores.
        """

        # Note: I have to specifically handle this case because the library considers the feature
        # corresponding to this case (F2) as the feature for the other case (F0), i.e. it can't make
        # any distinction between List[str] and List[List[str]]
        if not tokenize_sentences and multi_references:
            references = [[eval(ref) for ref in mul_ref_ex] for mul_ref_ex in references]

        # Validate inputs corresponding to flags
        _validate_input_format(tokenize_sentences, multi_references, predictions, references)
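
What that `eval` round-trip does in practice: each tokenized multi-reference example apparently reaches `compute` with its inner lists stringified by the feature coercion, and `eval` turns them back into lists. A sketch using `ast.literal_eval`, the safer standard-library equivalent for plain literals (the shipped code uses `eval` itself):

```python
import ast

# A tokenized multi-reference example as it apparently arrives after feature
# coercion: the inner List[str] has been stringified.
mul_ref_ex = ["['Reference sentence 1.', 'Reference sentence 2.']"]

restored = [ast.literal_eval(ref) for ref in mul_ref_ex]
assert restored == [['Reference sentence 1.', 'Reference sentence 2.']]
```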

@@ -363,10 +397,11 @@

            # Precision: Concatenate all the sentences in all the references
            concat_refs = np.concatenate(refs, axis=0)
            precision, _ = _compute_cosine_similarity(preds, concat_refs)
            precision = np.clip(precision, a_min=0.0, a_max=1.0).item()

            # Recall: Compute individually for each reference
            recall_scores = [_compute_cosine_similarity(r_embeds, preds) for r_embeds in refs]
            recall_scores = [np.clip(r_scores, 0.0, 1.0).item() for (r_scores, _) in recall_scores]

            results.append(Scores(precision, recall_scores))
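
For intuition, the precision/recall quantities used above can be reproduced with scikit-learn exactly as the new tests below do: precision is the mean best-match similarity of each prediction sentence against the references, and recall swaps the roles.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Mirrors the expected-value computation in the new TestCosineSimilarity tests.
pred_embeds = np.random.rand(5, 3)  # 5 prediction-sentence embeddings
ref_embeds = np.random.rand(3, 3)   # 3 reference-sentence embeddings

cosine_scores = cosine_similarity(pred_embeds, ref_embeds)
precision = np.mean(np.max(cosine_scores, axis=-1)).item()  # best reference match per prediction sentence
recall = np.mean(np.max(cosine_scores, axis=0)).item()      # best prediction match per reference sentence
```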
tests.py
CHANGED
@@ -3,9 +3,12 @@ import unittest

import numpy as np
import torch
from numpy.testing import assert_almost_equal
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from encoder_models import SBertEncoder, get_encoder
from semf1 import SemF1, _compute_cosine_similarity, _validate_input_format
from utils import get_gpu, slice_embeddings, is_nested_list_of_type, flatten_list, compute_f1, Scores
@@ -178,5 +181,321 @@ class TestGetEncoder(unittest.TestCase):

        # self.assertEqual(encoder.verbose, verbose)


class TestSemF1(unittest.TestCase):
    def setUp(self):
        self.semf1_metric = SemF1()

        # Example cases, #Samples = 1
        self.untokenized_single_reference_predictions = [
            "This is a prediction sentence 1. This is a prediction sentence 2."]
        self.untokenized_single_reference_references = [
            "This is a reference sentence 1. This is a reference sentence 2."]

        self.tokenized_single_reference_predictions = [
            ["This is a prediction sentence 1.", "This is a prediction sentence 2."],
        ]
        self.tokenized_single_reference_references = [
            ["This is a reference sentence 1.", "This is a reference sentence 2."],
        ]

        self.untokenized_multi_reference_predictions = [
            "Prediction sentence 1. Prediction sentence 2."
        ]
        self.untokenized_multi_reference_references = [
            ["Reference sentence 1. Reference sentence 2.", "Alternative reference 1. Alternative reference 2."],
        ]

        self.tokenized_multi_reference_predictions = [
            ["Prediction sentence 1.", "Prediction sentence 2."],
        ]
        self.tokenized_multi_reference_references = [
            [
                ["Reference sentence 1.", "Reference sentence 2."],
                ["Alternative reference 1.", "Alternative reference 2."]
            ],
        ]

    def test_untokenized_single_reference(self):
        scores = self.semf1_metric.compute(
            predictions=self.untokenized_single_reference_predictions,
            references=self.untokenized_single_reference_references,
            tokenize_sentences=True,
            multi_references=False,
            gpu=False,
            batch_size=32,
            verbose=False
        )
        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(self.untokenized_single_reference_predictions))

    def test_tokenized_single_reference(self):
        scores = self.semf1_metric.compute(
            predictions=self.tokenized_single_reference_predictions,
            references=self.tokenized_single_reference_references,
            tokenize_sentences=False,
            multi_references=False,
            gpu=False,
            batch_size=32,
            verbose=False
        )
        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(self.tokenized_single_reference_predictions))

        for score in scores:
            self.assertIsInstance(score, Scores)
            self.assertTrue(0.0 <= score.precision <= 1.0)
            self.assertTrue(all(0.0 <= recall <= 1.0 for recall in score.recall))

    def test_untokenized_multi_reference(self):
        scores = self.semf1_metric.compute(
            predictions=self.untokenized_multi_reference_predictions,
            references=self.untokenized_multi_reference_references,
            tokenize_sentences=True,
            multi_references=True,
            gpu=False,
            batch_size=32,
            verbose=False
        )
        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(self.untokenized_multi_reference_predictions))

    def test_tokenized_multi_reference(self):
        scores = self.semf1_metric.compute(
            predictions=self.tokenized_multi_reference_predictions,
            references=self.tokenized_multi_reference_references,
            tokenize_sentences=False,
            multi_references=True,
            gpu=False,
            batch_size=32,
            verbose=False
        )
        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(self.tokenized_multi_reference_predictions))

        for score in scores:
            self.assertIsInstance(score, Scores)
            self.assertTrue(0.0 <= score.precision <= 1.0)
            self.assertTrue(all(0.0 <= recall <= 1.0 for recall in score.recall))

    def test_same_predictions_and_references(self):
        scores = self.semf1_metric.compute(
            predictions=self.tokenized_single_reference_predictions,
            references=self.tokenized_single_reference_predictions,
            tokenize_sentences=False,
            multi_references=False,
            gpu=False,
            batch_size=32,
            verbose=False
        )

        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(self.tokenized_single_reference_predictions))

        for score in scores:
            self.assertIsInstance(score, Scores)
            self.assertAlmostEqual(score.precision, 1.0, places=6)
            assert_almost_equal(score.recall, 1, decimal=5, err_msg="Not all values are almost equal to 1")

    def test_exact_output_scores(self):
        predictions = [
            ["I go to School.", "You are stupid."],
            ["I love adventure sports."],
        ]
        references = [
            ["I go to playground.", "You are genius.", "You need to be admired."],
            ["I love adventure sports."],
        ]
        scores = self.semf1_metric.compute(
            predictions=predictions,
            references=references,
            tokenize_sentences=False,
            multi_references=False,
            gpu=False,
            batch_size=32,
            verbose=False,
            model_type="use",
        )

        self.assertIsInstance(scores, list)
        self.assertEqual(len(scores), len(predictions))

        score = scores[0]
        self.assertIsInstance(score, Scores)
        self.assertAlmostEqual(score.precision, 0.73, places=2)
        self.assertAlmostEqual(score.recall[0], 0.63, places=2)


class TestCosineSimilarity(unittest.TestCase):

    def setUp(self):
        # Sample embeddings for testing
        self.pred_embeds = np.array([
            [1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]
        ])
        self.ref_embeds = np.array([
            [1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]
        ])

        self.pred_embeds_random = np.random.rand(3, 3)
        self.ref_embeds_random = np.random.rand(3, 3)

    def test_cosine_similarity_perfect_match(self):
        precision, recall = _compute_cosine_similarity(self.pred_embeds, self.ref_embeds)

        # Expected values are 1.0 for both precision and recall since embeddings are identical
        self.assertAlmostEqual(precision, 1.0, places=5)
        self.assertAlmostEqual(recall, 1.0, places=5)

    def _test_cosine_similarity_base(self, pred_embeds, ref_embeds):
        precision, recall = _compute_cosine_similarity(pred_embeds, ref_embeds)

        # Calculate expected precision and recall using sklearn's cosine similarity function
        cosine_scores = cosine_similarity(pred_embeds, ref_embeds)
        expected_precision = np.mean(np.max(cosine_scores, axis=-1)).item()
        expected_recall = np.mean(np.max(cosine_scores, axis=0)).item()

        self.assertAlmostEqual(precision, expected_precision, places=5)
        self.assertAlmostEqual(recall, expected_recall, places=5)

    def test_cosine_similarity_random(self):
        self._test_cosine_similarity_base(self.pred_embeds_random, self.ref_embeds_random)

    def test_cosine_similarity_different_shapes(self):
        pred_embeds_diff = np.random.rand(5, 3)
        ref_embeds_diff = np.random.rand(3, 3)
        self._test_cosine_similarity_base(pred_embeds_diff, ref_embeds_diff)


class TestValidateInputFormat(unittest.TestCase):
    def setUp(self):
        # Sample predictions and references for different scenarios where number of samples = 1
        # Naming convention: "untokenized" pairs with tokenize_sentences = True and vice-versa

        # When tokenize_sentences = True (untokenized input) and multi_references = False
        self.untokenized_single_reference_predictions = [
            "This is a prediction sentence 1. This is a prediction sentence 2."
        ]
        self.untokenized_single_reference_references = [
            "This is a reference sentence 1. This is a reference sentence 2."
        ]

        # When tokenize_sentences = False (tokenized input) and multi_references = False
        self.tokenized_single_reference_predictions = [
            ["This is a prediction sentence 1.", "This is a prediction sentence 2."]
        ]
        self.tokenized_single_reference_references = [
            ["This is a reference sentence 1.", "This is a reference sentence 2."]
        ]

        # When tokenize_sentences = True (untokenized input) and multi_references = True
        self.untokenized_multi_reference_predictions = [
            "This is a prediction sentence 1. This is a prediction sentence 2."
        ]
        self.untokenized_multi_reference_references = [
            [
                "This is a reference sentence 1. This is a reference sentence 2.",
                "Another reference sentence."
            ]
        ]

        # When tokenize_sentences = False (tokenized input) and multi_references = True
        self.tokenized_multi_reference_predictions = [
            ["This is a prediction sentence 1.", "This is a prediction sentence 2."]
        ]
        self.tokenized_multi_reference_references = [
            [
                ["This is a reference sentence 1.", "This is a reference sentence 2."],
                ["Another reference sentence."]
            ]
        ]

    def test_tokenized_sentences_true_multi_references_true(self):
        # Invalid format should raise an error
        with self.assertRaises(ValueError):
            _validate_input_format(
                True,
                True,
                self.tokenized_single_reference_predictions,
                self.tokenized_single_reference_references,
            )

        # Valid format should pass without error
        _validate_input_format(
            True,
            True,
            self.untokenized_multi_reference_predictions,
            self.untokenized_multi_reference_references,
        )

    def test_tokenized_sentences_false_multi_references_true(self):
        # Invalid format should raise an error
        with self.assertRaises(ValueError):
            _validate_input_format(
                False,
                True,
                self.untokenized_single_reference_predictions,
                self.untokenized_multi_reference_references,
            )

        # Valid format should pass without error
        _validate_input_format(
            False,
            True,
            self.tokenized_multi_reference_predictions,
            self.tokenized_multi_reference_references,
        )

    def test_tokenized_sentences_true_multi_references_false(self):
        # Invalid format should raise an error
        with self.assertRaises(ValueError):
            _validate_input_format(
                True,
                False,
                self.tokenized_single_reference_predictions,
                self.tokenized_single_reference_references,
            )

        # Valid format should pass without error
        _validate_input_format(
            True,
            False,
            self.untokenized_single_reference_predictions,
            self.untokenized_single_reference_references,
        )

    def test_tokenized_sentences_false_multi_references_false(self):
        # Invalid format should raise an error
        with self.assertRaises(ValueError):
            _validate_input_format(
                False,
                False,
                self.untokenized_single_reference_predictions,
                self.untokenized_single_reference_references,
            )

        # Valid format should pass without error
        _validate_input_format(
            False,
            False,
            self.tokenized_single_reference_predictions,
            self.tokenized_single_reference_references,
        )

    def test_mismatched_lengths(self):
        # Length mismatch should raise an error
        with self.assertRaises(ValueError):
            _validate_input_format(
                True,
                True,
                self.untokenized_single_reference_predictions,
                [self.untokenized_single_reference_predictions[0], self.untokenized_single_reference_predictions[0]],
            )


if __name__ == '__main__':
    unittest.main(verbosity=2)
    # unittest.main()