new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 14

MGTBench: Benchmarking Machine-Generated Text Detection

Nowadays large language models (LLMs) have shown revolutionary power in a variety of natural language processing (NLP) tasks such as text classification, sentiment analysis, language translation, and question-answering. In this way, detecting machine-generated texts (MGTs) is becoming increasingly important as LLMs become more advanced and prevalent. These models can generate human-like language that can be difficult to distinguish from text written by a human, which raises concerns about authenticity, accountability, and potential bias. However, existing detection methods against MGTs are evaluated under different model architectures, datasets, and experimental settings, resulting in a lack of a comprehensive evaluation framework across different methodologies In this paper, we fill this gap by proposing the first benchmark framework for MGT detection, named MGTBench. Extensive evaluations on public datasets with curated answers generated by ChatGPT (the most representative and powerful LLMs thus far) show that most of the current detection methods perform less satisfactorily against MGTs. An exceptional case is ChatGPT Detector, which is trained with ChatGPT-generated texts and shows great performance in detecting MGTs. Nonetheless, we note that only a small fraction of adversarial-crafted perturbations on MGTs can evade the ChatGPT Detector, thus highlighting the need for more robust MGT detection methods. We envision that MGTBench will serve as a benchmark tool to accelerate future investigations involving the evaluation of state-of-the-art MGT detection methods on their respective datasets and the development of more advanced MGT detection methods. Our source code and datasets are available at https://github.com/xinleihe/MGTBench.

Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. Therefore, it is very urgent and important to detect MGTs in many situations. Unfortunately, it is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of LLMs. In this paper, we seek to exploit maximum mean discrepancy (MMD) to address this issue in the sense that MMD can well identify distributional discrepancies. However, directly training a detector with MMD using diverse MGTs will incur a significantly increased variance of MMD since MGTs may contain multiple text populations due to various LLMs. This will severely impair MMD's ability to measure the difference between two samples. To tackle this, we propose a novel multi-population aware optimization method for MMD called MMD-MP, which can avoid variance increases and thus improve the stability to measure the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, \eg, GPT2 and ChatGPT, show superior detection performance of our MMD-MP. The source code is available at https://github.com/ZSHsh98/MMD-MP.

Theoretical Antineutrino Detection, Direction and Ranging at Long Distances

In this paper we introduce the concept of what we call "NUDAR" (NeUtrino Direction and Ranging), making the point that measurements of the observed energy and direction vectors can be employed to passively deduce the exact three-dimensional location and thermal power of geophysical and anthropogenic neutrino sources from even a single detector. We present the most precise background estimates to date, all handled in full three dimensions, as functions of depth and geographical location. For the present calculations, we consider a hypothetical 138 kiloton detector which can be transported to an ocean site and deployed to an operational depth. We present a Bayesian estimation framework to incorporate any a priori knowledge of the reactor that we are trying to detect, as well as the estimated uncertainty in the background and the oscillation parameters. Most importantly, we fully employ the knowledge of the reactor spectrum and the distance-dependent effects of neutrino oscillations on such spectra. The latter, in particular, makes possible determination of range from one location, given adequate signal statistics. Further, we explore the rich potential of improving detection with even modest improvements in individual neutrino direction determination. We conclude that a 300 MWth reactor can indeed be geolocated, and its operating power estimated with one or two detectors in the hundred kiloton class at ranges out to a few hundred kilometers. We note that such detectors would have natural and non-interfering utility for scientific studies of geo-neutrinos, neutrino oscillations, and astrophysical neutrinos. This motivates the development of cost effective methods of constructing and deploying such next generation detectors.

The Zwicky Transient Facility Bright Transient Survey. III. BTSbot: Automated Identification and Follow-up of Bright Transients with Deep Learning

The Bright Transient Survey (BTS) aims to obtain a classification spectrum for all bright (m_peak,leq,18.5,mag) extragalactic transients found in the Zwicky Transient Facility (ZTF) public survey. BTS critically relies on visual inspection ("scanning") to select targets for spectroscopic follow-up, which, while effective, has required a significant time investment over the past sim5 yr of ZTF operations. We present BTSbot, a multi-modal convolutional neural network, which provides a bright transient score to individual ZTF detections using their image data and 25 extracted features. BTSbot is able to eliminate the need for daily human scanning by automatically identifying and requesting spectroscopic follow-up observations of new bright transient candidates. BTSbot recovers all bright transients in our test split and performs on par with scanners in terms of identification speed (on average, sim1 hour quicker than scanners). We also find that BTSbot is not significantly impacted by any data shift by comparing performance across a concealed test split and a sample of very recent BTS candidates. BTSbot has been integrated into Fritz and Kowalski, ZTF's first-party marshal and alert broker, and now sends automatic spectroscopic follow-up requests for the new transients it identifies. During the month of October 2023, BTSbot selected 296 sources in real-time, 93% of which were real extragalactic transients. With BTSbot and other automation tools, the BTS workflow has produced the first fully automatic end-to-end discovery and classification of a transient, representing a significant reduction in the human-time needed to scan. Future development has tremendous potential for creating similar models to identify and request follow-up observations for specific types of transients.

Investigating the Efficacy of Large Language Models for Code Clone Detection

Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. The LLMs are mainly utilized in the prompt-based zero/few-shot paradigm to guide the model in accomplishing the task. GPT-based models are one of the popular ones studied for tasks such as code comment generation or test generation. These tasks are `generative' tasks. However, there is limited research on the usage of LLMs for `non-generative' tasks such as classification using the prompt-based paradigm. In this preliminary exploratory study, we investigated the applicability of LLMs for Code Clone Detection (CCD), a non-generative task. By building a mono-lingual and cross-lingual CCD dataset derived from CodeNet, we first investigated two different prompts using ChatGPT to detect Type-4 code clones in Java-Java and Java-Ruby pairs in a zero-shot setting. We then conducted an analysis to understand the strengths and weaknesses of ChatGPT in CCD. ChatGPT surpasses the baselines in cross-language CCD attaining an F1-score of 0.877 and achieves comparable performance to fully fine-tuned models for mono-lingual CCD, with an F1-score of 0.878. Also, the prompt and the difficulty level of the problems has an impact on the performance of ChatGPT. Finally we provide insights and future directions based on our initial analysis

Monte Carlo Linear Clustering with Single-Point Supervision is Enough for Infrared Small Target Detection

Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds on infrared images. Recently, deep learning based methods have achieved promising performance on SIRST detection, but at the cost of a large amount of training data with expensive pixel-level annotations. To reduce the annotation burden, we propose the first method to achieve SIRST detection with single-point supervision. The core idea of this work is to recover the per-pixel mask of each target from the given single point label by using clustering approaches, which looks simple but is indeed challenging since targets are always insalient and accompanied with background clutters. To handle this issue, we introduce randomness to the clustering process by adding noise to the input images, and then obtain much more reliable pseudo masks by averaging the clustered results. Thanks to this "Monte Carlo" clustering approach, our method can accurately recover pseudo masks and thus turn arbitrary fully supervised SIRST detection networks into weakly supervised ones with only single point annotation. Experiments on four datasets demonstrate that our method can be applied to existing SIRST detection networks to achieve comparable performance with their fully supervised counterparts, which reveals that single-point supervision is strong enough for SIRST detection. Our code will be available at: https://github.com/YeRen123455/SIRST-Single-Point-Supervision.

mGPT: Few-Shot Learners Go Multilingual

Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using the pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; Deepspeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the most optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on the wide spectre of multilingual tasks, including classification, generative, sequence labeling and knowledge probing. The models were evaluated with the zero-shot and few-shot methods. Furthermore, we compared the classification tasks with the state-of-the-art multilingual model XGLM. source code and the mGPT XL model are publicly released.

DetectGPT-SC: Improving Detection of Text Generated by Large Language Models through Self-Consistency with Masked Predictions

General large language models (LLMs) such as ChatGPT have shown remarkable success, but it has also raised concerns among people about the misuse of AI-generated texts. Therefore, an important question is how to detect whether the texts are generated by ChatGPT or by humans. Existing detectors are built on the assumption that there is a distribution gap between human-generated and AI-generated texts. These gaps are typically identified using statistical information or classifiers. In contrast to prior research methods, we find that large language models such as ChatGPT exhibit strong self-consistency in text generation and continuation. Self-consistency capitalizes on the intuition that AI-generated texts can still be reasoned with by large language models using the same logical reasoning when portions of the texts are masked, which differs from human-generated texts. Using this observation, we subsequently proposed a new method for AI-generated texts detection based on self-consistency with masked predictions to determine whether a text is generated by LLMs. This method, which we call DetectGPT-SC. We conducted a series of experiments to evaluate the performance of DetectGPT-SC. In these experiments, we employed various mask scheme, zero-shot, and simple prompt for completing masked texts and self-consistency predictions. The results indicate that DetectGPT-SC outperforms the current state-of-the-art across different tasks.

The Mu3e Experiment: Status and Short-Term Plans

Mu3e is an experiment currently under construction at the Paul Scherrer Institute in Switzerland, designed to search for the Lepton Flavor Violating (LFV) decay mu^+ rightarrow e^+e^-e^+. In extensions of the Standard Model (SM) that account for neutrino masses, this decay is theoretically allowed but occurs only through extremely rare loop processes, with a predicted branching ratio of approximately O(10^{-54}). Such a small probability implies that any observation of this decay would provide clear evidence for physics beyond the SM. The Mu3e experiment aims to probe the mu^+ rightarrow e^+e^-e^+ decay with a sensitivity of approximately O(10^{-15}) in its Phase-1 and plans to achieve a sensitivity of O(10^{-16}) after future upgrades. To reach its Phase-1 ambitious goals, Mu3e is going to use the most intense continuous muon beam in the world, generating 10^{8} muon stops per second in the target placed at the center of the Mu3e. Mu3e will use three main technologies for particle detection. The tracking will done through ultra-thin (50 - 70 mu m) pixel detectors based on MuPix11 sensors. These are high-voltage monolithic active pixel sensors (HV-MAPS) with a sim 23~mum spatial resolution. The timing will be done through scintillating fibres (sim 250 ps) and tiles (sim 40 ps), coupled to silicon photomultipliers and read out by MuTRiG3 ASICs. A triggerless DAQ system based on FPGAs will collect data from the detectors, which will then undergo reconstruction in a GPU filter farm. The assembly of the detectors has started, with a detector commissioning beam time planned for 2025. This document reports on the status of the construction, installation, and data-taking plans for the near future.

Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models

Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/

DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios

Detecting text generated by large language models (LLMs) is of great recent interest. With zero-shot methods like DetectGPT, detection capabilities have reached impressive levels. However, the reliability of existing detectors in real-world applications remains underexplored. In this study, we present a new benchmark, DetectRL, highlighting that even state-of-the-art (SOTA) detection techniques still underperformed in this task. We collected human-written datasets from domains where LLMs are particularly prone to misuse. Using popular LLMs, we generated data that better aligns with real-world applications. Unlike previous studies, we employed heuristic rules to create adversarial LLM-generated text, simulating advanced prompt usages, human revisions like word substitutions, and writing errors. Our development of DetectRL reveals the strengths and limitations of current SOTA detectors. More importantly, we analyzed the potential impact of writing styles, model types, attack methods, the text lengths, and real-world human writing factors on different types of detectors. We believe DetectRL could serve as an effective benchmark for assessing detectors in real-world scenarios, evolving with advanced attack methods, thus providing more stressful evaluation to drive the development of more efficient detectors. Data and code are publicly available at: https://github.com/NLP2CT/DetectRL.

Taking ROCKET on an Efficiency Mission: Multivariate Time Series Classification with LightWaveS

Nowadays, with the rising number of sensors in sectors such as healthcare and industry, the problem of multivariate time series classification (MTSC) is getting increasingly relevant and is a prime target for machine and deep learning approaches. Their expanding adoption in real-world environments is causing a shift in focus from the pursuit of ever-higher prediction accuracy with complex models towards practical, deployable solutions that balance accuracy and parameters such as prediction speed. An MTSC model that has attracted attention recently is ROCKET, based on random convolutional kernels, both because of its very fast training process and its state-of-the-art accuracy. However, the large number of features it utilizes may be detrimental to inference time. Examining its theoretical background and limitations enables us to address potential drawbacks and present LightWaveS: a framework for accurate MTSC, which is fast both during training and inference. Specifically, utilizing wavelet scattering transformation and distributed feature selection, we manage to create a solution that employs just 2.5% of the ROCKET features, while achieving accuracy comparable to recent MTSC models. LightWaveS also scales well across multiple compute nodes and with the number of input channels during training. In addition, it can significantly reduce the input size and provide insight to an MTSC problem by keeping only the most useful channels. We present three versions of our algorithm and their results on distributed training time and scalability, accuracy, and inference speedup. We show that we achieve speedup ranging from 9x to 53x compared to ROCKET during inference on an edge device, on datasets with comparable accuracy.