Only accept one rule as a solution; we select the first one. Do not allow groundings
Browse files- .gitignore +1 -0
- README.md +7 -4
- VerifiableRewardsForScalableLogicalReasoning.py +30 -1
.gitignore
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
.idea
|
README.md
CHANGED
@@ -18,16 +18,19 @@ description: >-
|
|
18 |
|
19 |
# Metric Card for Symbolic Judge: Verifiable Rewards for Scalable Logical Reasoning
|
20 |
|
21 |
-
|
22 |
-
|
23 |
-
|
24 |
-
|
|
|
25 |
### How it Works
|
26 |
- **Input:** The symbolic judge takes as input a candidate hypothesis (logic rule) and an executable validation program containing background knowledge and examples.
|
27 |
- **Execution:** The candidate rule is executed against the validation program using a Prolog interpreter.
|
28 |
- **Correctness Criteria:** The rule is considered correct if it entails all positive examples and rejects all negative examples.
|
29 |
- **Metrics:** The symbolic judge computes a range of evaluation metrics (detailed below).
|
|
|
30 |
**Note:** A local Prolog interpreter is required to execute validation programs.
|
|
|
31 |
---
|
32 |
|
33 |
### Inputs
|
|
|
18 |
|
19 |
# Metric Card for Symbolic Judge: Verifiable Rewards for Scalable Logical Reasoning
|
20 |
|
21 |
+
This metric is part of the SLR framework (AIML-TUDA/SLR-Bench) and provides rewards for logical reasoning tasks.
|
22 |
+
The reward model is grounded in the ILP (Inductive Logic Programming) paradigm, testing whether a given hypothesis (logic rule) solves a logical reasoning task.
|
23 |
+
To check for entailment, the logic rule is executed against a set of background knowledge and examples, ensuring automatic evaluation that is verifiable, transparent, and reproducible.
|
24 |
+
|
25 |
+
|
26 |
### How it Works
|
27 |
- **Input:** The symbolic judge takes as input a candidate hypothesis (logic rule) and an executable validation program containing background knowledge and examples.
|
28 |
- **Execution:** The candidate rule is executed against the validation program using a Prolog interpreter.
|
29 |
- **Correctness Criteria:** The rule is considered correct if it entails all positive examples and rejects all negative examples.
|
30 |
- **Metrics:** The symbolic judge computes a range of evaluation metrics (detailed below).
|
31 |
+
|
32 |
**Note:** A local Prolog interpreter is required to execute validation programs.
|
33 |
+
|
34 |
---
|
35 |
|
36 |
### Inputs
|
VerifiableRewardsForScalableLogicalReasoning.py
CHANGED
@@ -100,13 +100,41 @@ Returns:
|
|
100 |
"""
|
101 |
|
102 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
103 |
def _evaluate_with_prolog(prediction, validation_program, eval_config, timeout=5):
|
104 |
"""
|
105 |
Evaluates a predicted rule against the validation program using Prolog.
|
106 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
107 |
# Extract configuration
|
108 |
positive_pred = eval_config.get("positive_predicate", "eastbound")
|
109 |
negative_pred = eval_config.get("negative_predicate", "westbound")
|
|
|
|
|
|
|
|
|
|
|
110 |
# extract predicate from rule_to_evaluate
|
111 |
rule_to_evaluate = extract_ilp_from_text_v2(prediction)
|
112 |
if positive_pred not in rule_to_evaluate:
|
@@ -234,6 +262,7 @@ def extract_ilp_from_text_v2(text, target_predicates=None):
|
|
234 |
if not statement.endswith('.'):
|
235 |
statement += '.'
|
236 |
p_code += statement + '\n'
|
|
|
237 |
return p_code.strip() # Ensure no trailing whitespace
|
238 |
|
239 |
|
@@ -315,7 +344,7 @@ class VerifiableRewardsForScalableLogicalReasoning(evaluate.Metric):
|
|
315 |
eval_inputs.append((prediction, validation_program, eval_config))
|
316 |
|
317 |
# if more than 1k predictions, we use multiprocessing to speed up the evaluation
|
318 |
-
if len(eval_inputs) >
|
319 |
# Process evaluations in parallel
|
320 |
num_cpus = max(1, mp.cpu_count() - 1) # Leave one CPU free
|
321 |
with mp.Pool(processes=num_cpus) as pool:
|
|
|
100 |
"""
|
101 |
|
102 |
|
103 |
+
def validate_rule_no_hardcoded_cars(prediction):
|
104 |
+
"""Reject rules that hardcode specific car identifiers"""
|
105 |
+
import re
|
106 |
+
|
107 |
+
# Look for has_car with a constant (lowercase) in second position
|
108 |
+
hardcoded_pattern = r'has_car\([^,]+,\s*([a-z][a-z0-9_]*)\)'
|
109 |
+
matches = re.findall(hardcoded_pattern, prediction)
|
110 |
+
|
111 |
+
if matches:
|
112 |
+
return False, f"Rule contains ground cars: {matches[0]}"
|
113 |
+
|
114 |
+
return True, "Rule is valid"
|
115 |
+
|
116 |
+
|
117 |
def _evaluate_with_prolog(prediction, validation_program, eval_config, timeout=5):
|
118 |
"""
|
119 |
Evaluates a predicted rule against the validation program using Prolog.
|
120 |
"""
|
121 |
+
is_valid, validation_msg = validate_rule_no_hardcoded_cars(prediction)
|
122 |
+
if not is_valid:
|
123 |
+
return {
|
124 |
+
"is_correct": False,
|
125 |
+
"partial_score": 0.0,
|
126 |
+
"syntax_valid": False,
|
127 |
+
"error": f"Rule validation failed: {validation_msg}"
|
128 |
+
}
|
129 |
+
|
130 |
# Extract configuration
|
131 |
positive_pred = eval_config.get("positive_predicate", "eastbound")
|
132 |
negative_pred = eval_config.get("negative_predicate", "westbound")
|
133 |
+
|
134 |
+
|
135 |
+
validation_program = anonymize_entities(validation_program)
|
136 |
+
|
137 |
+
|
138 |
# extract predicate from rule_to_evaluate
|
139 |
rule_to_evaluate = extract_ilp_from_text_v2(prediction)
|
140 |
if positive_pred not in rule_to_evaluate:
|
|
|
262 |
if not statement.endswith('.'):
|
263 |
statement += '.'
|
264 |
p_code += statement + '\n'
|
265 |
+
print(p_code)
|
266 |
return p_code.strip() # Ensure no trailing whitespace
|
267 |
|
268 |
|
|
|
344 |
eval_inputs.append((prediction, validation_program, eval_config))
|
345 |
|
346 |
# if more than 500 predictions, we use multiprocessing to speed up the evaluation
|
347 |
+
if len(eval_inputs) > 500:
|
348 |
# Process evaluations in parallel
|
349 |
num_cpus = max(1, mp.cpu_count() - 1) # Leave one CPU free
|
350 |
with mp.Pool(processes=num_cpus) as pool:
|