nuojohnchen committed on
Commit 2bf602a · verified · 1 Parent(s): e65e11f

Update README.md

Files changed (1):
  1. README.md +5 -13
README.md CHANGED
@@ -30,9 +30,9 @@ client = OpenAI(
 )
 
 paper_content="markdown"
-selected_content="After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix."
+selected_content="Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios."
 
-prompt = "help me redefine cat-score based on the context."
+prompt = "help me modify based on the context."
 
 content = f"""
 Please improve the selected content based on the following. Act as an expert model for improving articles **PAPER_CONTENT**.\n
@@ -84,22 +84,14 @@ model = AutoModelForCausalLM.from_pretrained(
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 paper_content="""
-In this paper, we propose a metric-based probing method, namely, CAT-probing, to quantitatively evaluate how CodePTMs Attention scores relate to distances between AST nodes.
-First, to denoise the input code sequence in the original attention scores matrix, we classify the rows/cols by token types that are pre-defined by compilers,
-and then retain tokens whose types have the highest proportion scores to derive a filtered attention matrix (see Figure 1(b)).
-Meanwhile, inspired by the works (Wang et al., 2020; Zhu et al., 2022), we add edges to improve the connectivity of AST and calculate the distances between nodes corresponding to the selected tokens,
-which generates a distance matrix as shown in Figure 1(c). After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix.
-Specifically, the point-wise elements of the two matrices are matched if both the two conditions are satisfied:
-1) the attention score is larger than a threshold; 2) the distance value is smaller than a threshold. If only one condition is reached, the elements are unmatched.
-We calculate the CAT-score by the ratio of the number of matched elements to the summation of matched and unmatched elements.
-Finally, the CAT-score is used to interpret how CodePTMs attend code structure, where a higher score indicates that the model has learned more structural information.
+The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios. To address this, we introduce \textbf{JudgeLRM}, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79\% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
 """
 
 selected_content="""
-After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix.
+Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples—highlighting the limitations of SFT in such scenarios.
 """
 prompt ="""
-help me redefine cat-score based on the context.
+help me modify based on the context.
 """
 
 content = f"""
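The updated snippet assembles a single instruction string from `paper_content`, `selected_content`, and `prompt` before sending it to the model. The full `content` f-string is truncated in this diff, so everything after the first template line below is an assumption; this is a minimal sketch of the assembly step only, with no API call:

```python
# Sketch of how the README snippet builds its prompt string.
# Only the first template line appears in the diff; the field layout
# after it ("Paper content:", etc.) is a hypothetical reconstruction.

def build_content(paper_content: str, selected_content: str, prompt: str) -> str:
    """Combine the three pieces into one instruction string for the model."""
    return (
        "Please improve the selected content based on the following. "
        "Act as an expert model for improving articles **PAPER_CONTENT**.\n"
        f"Paper content: {paper_content}\n"
        f"Selected content: {selected_content}\n"
        f"Instruction: {prompt}\n"
    )

paper_content = "markdown"
selected_content = (
    "Through a detailed analysis of reasoning requirements across evaluation "
    "tasks, we reveal a negative correlation between SFT performance gains "
    "and the proportion of reasoning-demanding samples."
)
prompt = "help me modify based on the context."

content = build_content(paper_content, selected_content, prompt)
print(content.splitlines()[0])
```

The resulting `content` string would then be passed as the user message to the `OpenAI` client (or to the tokenizer in the local-model variant) shown earlier in the diff.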