nuojohnchen committed on
Commit 2bf602a · verified · 1 Parent(s): e65e11f

Update README.md

Files changed (1):
  1. README.md +5 -13
README.md CHANGED
@@ -30,9 +30,9 @@ client = OpenAI(
 )
 
 paper_content="markdown"
-selected_content="After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix."
+selected_content="Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios."
 
-prompt = "help me redefine cat-score based on the context."
+prompt = "help me modify based on the context."
 
 content = f"""
 Please improve the selected content based on the following. Act as an expert model for improving articles **PAPER_CONTENT**.\n
@@ -84,22 +84,14 @@ model = AutoModelForCausalLM.from_pretrained(
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 paper_content="""
-In this paper, we propose a metric-based probing method, namely, CAT-probing, to quantitatively evaluate how CodePTMs Attention scores relate to distances between AST nodes.
-First, to denoise the input code sequence in the original attention scores matrix, we classify the rows/cols by token types that are pre-defined by compilers,
-and then retain tokens whose types have the highest proportion scores to derive a filtered attention matrix (see Figure 1(b)).
-Meanwhile, inspired by the works (Wang et al., 2020; Zhu et al., 2022), we add edges to improve the connectivity of AST and calculate the distances between nodes corresponding to the selected tokens,
-which generates a distance matrix as shown in Figure 1(c). After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix.
-Specifically, the point-wise elements of the two matrices are matched if both the two conditions are satisfied:
-1) the attention score is larger than a threshold; 2) the distance value is smaller than a threshold. If only one condition is reached, the elements are unmatched.
-We calculate the CAT-score by the ratio of the number of matched elements to the summation of matched and unmatched elements.
-Finally, the CAT-score is used to interpret how CodePTMs attend code structure, where a higher score indicates that the model has learned more structural information.
+The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios. To address this, we introduce \textbf{JudgeLRM}, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79\% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
 """
 
 selected_content="""
-After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix.
+Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios.
 """
 prompt ="""
-help me redefine cat-score based on the context.
+help me modify based on the context.
 """
 
 content = f"""
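The updated snippet assembles a single instruction string from `paper_content`, `selected_content`, and `prompt` before sending it to the model. The full `content` f-string is truncated in this diff, so everything after the first template line below is an assumption; this is a minimal sketch of the assembly step only, with no API call:

```python
# Sketch of how the README snippet builds its prompt string.
# Only the first template line appears in the diff; the field layout
# after it ("Paper content:", etc.) is a hypothetical reconstruction.

def build_content(paper_content: str, selected_content: str, prompt: str) -> str:
    """Combine the three pieces into one instruction string for the model."""
    return (
        "Please improve the selected content based on the following. "
        "Act as an expert model for improving articles **PAPER_CONTENT**.\n"
        f"Paper content: {paper_content}\n"
        f"Selected content: {selected_content}\n"
        f"Instruction: {prompt}\n"
    )

paper_content = "markdown"
selected_content = (
    "Through a detailed analysis of reasoning requirements across evaluation "
    "tasks, we reveal a negative correlation between SFT performance gains "
    "and the proportion of reasoning-demanding samples."
)
prompt = "help me modify based on the context."

content = build_content(paper_content, selected_content, prompt)
print(content.splitlines()[0])
```

The resulting `content` string would then be passed as the user message to the `OpenAI` client (or to the tokenizer in the local-model variant) shown earlier in the diff.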