new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 14

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

Noticing the urgent need to provide tools for fast and user-friendly qualitative analysis of large-scale textual corpora of the modern NLP, we propose to turn to the mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR research can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features further facilitating their integration. Our goal is to give NLP researchers tools that will allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub at https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections. GAIA serves a dual purpose of illustrating the potential of methodologies we discuss but also as a standalone qualitative analysis tool that can be leveraged by NLP researchers aiming to understand datasets prior to using them in training. GAIA is hosted live on Hugging Face Spaces - https://huggingface.co/spaces/spacerini/gaia.

Gaia Data Release 3: Summary of the content and survey properties

We present the third data release of the European Space Agency's Gaia mission, GDR3. The GDR3 catalogue is the outcome of the processing of raw data collected with the Gaia instruments during the first 34 months of the mission by the Gaia Data Processing and Analysis Consortium. The GDR3 catalogue contains the same source list, celestial positions, proper motions, parallaxes, and broad band photometry in the G, G_{BP}, and G_{RP} pass-bands already present in the Early Third Data Release. GDR3 introduces an impressive wealth of new data products. More than 33 million objects in the ranges G_{rvs} < 14 and 3100 <T_{eff} <14500 , have new determinations of their mean radial velocities based on data collected by Gaia. We provide G_{rvs} magnitudes for most sources with radial velocities, and a line broadening parameter is listed for a subset of these. Mean Gaia spectra are made available to the community. The GDR3 catalogue includes about 1 million mean spectra from the radial velocity spectrometer, and about 220 million low-resolution blue and red prism photometer BPRP mean spectra. The results of the analysis of epoch photometry are provided for some 10 million sources across 24 variability types. GDR3 includes astrophysical parameters and source class probabilities for about 470 million and 1500 million sources, respectively, including stars, galaxies, and quasars. Orbital elements and trend parameters are provided for some 800,000 astrometric, spectroscopic and eclipsing binaries. More than 150,000 Solar System objects, including new discoveries, with preliminary orbital solutions and individual epoch observations are part of this release. Reflectance spectra derived from the epoch BPRP spectral data are published for about 60\,000 asteroids. Finally, an additional data set is provided, namely the Gaia Andromeda Photometric Survey (abridged)