new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 13

ClusterNet: A Perception-Based Clustering Model for Scattered Data

Visualizations for scattered data are used to make users understand certain attributes of their data by solving different tasks, e.g. correlation estimation, outlier detection, cluster separation. In this paper, we focus on the later task, and develop a technique that is aligned to human perception, that can be used to understand how human subjects perceive clusterings in scattered data and possibly optimize for better understanding. Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques, such as for instance k-means or DBSCAN. However, as these algorithms are based on non-perceptual metrics, we can show in our experiments, that their output do not reflect human cluster perception. We propose a learning strategy which directly operates on scattered data. To learn perceptual cluster separation on this data, we crowdsourced a large scale dataset, consisting of 7,320 point-wise cluster affiliations for bivariate data, which has been labeled by 384 human crowd workers. Based on this data, we were able to train ClusterNet, a point-based deep learning model, trained to reflect human perception of cluster separability. In order to train ClusterNet on human annotated data, we use a PointNet++ architecture enabling inference on point clouds directly. In this work, we provide details on how we collected our dataset, report statistics of the resulting annotations, and investigate perceptual agreement of cluster separation for real-world data. We further report the training and evaluation protocol of ClusterNet and introduce a novel metric, that measures the accuracy between a clustering technique and a group of human annotators. Finally, we compare our approach against existing state-of-the-art clustering techniques and can show, that ClusterNet is able to generalize to unseen and out of scope data.

ViDi: Descriptive Visual Data Clustering as Radiologist Assistant in COVID-19 Streamline Diagnostic

In the light of the COVID-19 pandemic, deep learning methods have been widely investigated in detecting COVID-19 from chest X-rays. However, a more pragmatic approach to applying AI methods to a medical diagnosis is designing a framework that facilitates human-machine interaction and expert decision making. Studies have shown that categorization can play an essential rule in accelerating real-world decision making. Inspired by descriptive document clustering, we propose a domain-independent explanatory clustering framework to group contextually related instances and support radiologists' decision making. While most descriptive clustering approaches employ domain-specific characteristics to form meaningful clusters, we focus on model-level explanation as a more general-purpose element of every learning process to achieve cluster homogeneity. We employ DeepSHAP to generate homogeneous clusters in terms of disease severity and describe the clusters using favorable and unfavorable saliency maps, which visualize the class discriminating regions of an image. These human-interpretable maps complement radiologist knowledge to investigate the whole cluster at once. Besides, as part of this study, we evaluate a model based on VGG-19, which can identify COVID and pneumonia cases with a positive predictive value of 95% and 97%, respectively, comparable to the recent explainable approaches for COVID diagnosis.

CLAMS: A Cluster Ambiguity Measure for Estimating Perceptual Variability in Visual Clustering

Visual clustering is a common perceptual task in scatterplots that supports diverse analytics tasks (e.g., cluster identification). However, even with the same scatterplot, the ways of perceiving clusters (i.e., conducting visual clustering) can differ due to the differences among individuals and ambiguous cluster boundaries. Although such perceptual variability casts doubt on the reliability of data analysis based on visual clustering, we lack a systematic way to efficiently assess this variability. In this research, we study perceptual variability in conducting visual clustering, which we call Cluster Ambiguity. To this end, we introduce CLAMS, a data-driven visual quality measure for automatically predicting cluster ambiguity in monochrome scatterplots. We first conduct a qualitative study to identify key factors that affect the visual separation of clusters (e.g., proximity or size difference between clusters). Based on study findings, we deploy a regression module that estimates the human-judged separability of two clusters. Then, CLAMS predicts cluster ambiguity by analyzing the aggregated results of all pairwise separability between clusters that are generated by the module. CLAMS outperforms widely-used clustering techniques in predicting ground truth cluster ambiguity. Meanwhile, CLAMS exhibits performance on par with human annotators. We conclude our work by presenting two applications for optimizing and benchmarking data mining techniques using CLAMS. The interactive demo of CLAMS is available at clusterambiguity.dev.

Part123: Part-aware 3D Reconstruction from a Single-view Image

Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure, which is crucial for many downstream applications, of the reconstructed shape. Moreover, the generated meshes usually suffer from large noises, unsmooth surfaces, and blurry textures, making it challenging to obtain satisfactory part segments using 3D segmentation techniques. In this paper, we present Part123, a novel framework for part-aware 3D reconstruction from a single-view image. We first use diffusion models to generate multiview-consistent images from a given image, and then leverage Segment Anything Model (SAM), which demonstrates powerful generalization ability on arbitrary objects, to generate multiview segmentation masks. To effectively incorporate 2D part-based information into 3D reconstruction and handle inconsistency, we introduce contrastive learning into a neural rendering framework to learn a part-aware feature space based on the multiview segmentation masks. A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models. Experiments show that our method can generate 3D models with high-quality segmented parts on various objects. Compared to existing unstructured reconstruction methods, the part-aware 3D models from our method benefit some important applications, including feature-preserving reconstruction, primitive fitting, and 3D shape editing.

Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, imposing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of observed data in each iteration, which however suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxed K-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroids estimation -- a key feature that allows exact recovery under well-separateness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numeric experiments demonstrate that iLA-SDP can achieve lower mis-clustering errors over several widely used clustering methods including K-means, SDP and EM algorithms.

Clustering based Point Cloud Representation Learning for 3D Analysis

Point cloud analysis (such as 3D segmentation and detection) is a challenging task, because of not only the irregular geometries of many millions of unordered points, but also the great variations caused by depth, viewpoint, occlusion, etc. Current studies put much focus on the adaption of neural networks to the complex geometries of point clouds, but are blind to a fundamental question: how to learn an appropriate point embedding space that is aware of both discriminative semantics and challenging variations? As a response, we propose a clustering based supervised learning scheme for point cloud analysis. Unlike current de-facto, scene-wise training paradigm, our algorithm conducts within-class clustering on the point embedding space for automatically discovering subclass patterns which are latent yet representative across scenes. The mined patterns are, in turn, used to repaint the embedding space, so as to respect the underlying distribution of the entire training dataset and improve the robustness to the variations. Our algorithm is principled and readily pluggable to modern point cloud segmentation networks during training, without extra overhead during testing. With various 3D network architectures (i.e., voxel-based, point-based, Transformer-based, automatically searched), our algorithm shows notable improvements on famous point cloud segmentation datasets (i.e.,2.0-2.6% on single-scan and 2.0-2.2% multi-scan of SemanticKITTI, 1.8-1.9% on S3DIS, in terms of mIoU). Our algorithm also demonstrates utility in 3D detection, showing 2.0-3.4% mAP gains on KITTI.

Extending Bootstrap AMG for Clustering of Attributed Graphs

In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.

Probabilistic Partitive Partitioning (PPP)

Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies can be achieved by reducing the input space if a minimal loss of information can be achieved. Clustering algorithms, in general, face two common problems: 1) these converge to different settings with different initial conditions and; 2) the number of clusters has to be arbitrarily decided beforehand. This problem has become critical in the realm of big data. Recently, clustering algorithms have emerged which can speedup computations using parallel processing over the grid but face the aforementioned problems. Goals: Our goals are to find methods to cluster data which: 1) guarantee convergence to the same settings irrespective of the initial conditions; 2) eliminate the need to establish the number of clusters beforehand, and 3) can be applied to cluster large datasets. Methods: We introduce a method that combines probabilistic and combinatorial clustering methods to produce repeatable and compact clusters that are not sensitive to initial conditions. This method harnesses the power of k-means (a combinatorial clustering method) to cluster/partition very large dimensional datasets and uses the Gaussian Mixture Model (a probabilistic clustering method) to validate the k-means partitions. Results: We show that this method produces very compact clusters that are not sensitive to initial conditions. This method can be used to identify the most 'separable' set in a dataset which increases the 'clusterability' of a dataset. This method also eliminates the need to specify the number of clusters in advance.

Selection Function of Clusters in Dark Energy Survey Year 3 Data from Cross-Matching with South Pole Telescope Detections

Galaxy clusters selected based on overdensities of galaxies in photometric surveys provide the largest cluster samples. Yet modeling the selection function of such samples is complicated by non-cluster members projected along the line of sight (projection effects) and the potential detection of unvirialized objects (contamination). We empirically constrain the magnitude of these effects by cross-matching galaxy clusters selected in the Dark Energy survey data with the \rdmpr, algorithm with significant detections in three South Pole Telescope surveys (SZ, pol-ECS, pol-500d). For matched clusters, we augment the \rdmpr,catalog by the SPT detection significance. For unmatched objects we use the SPT detection threshold as an upper limit on the SZe signature. Using a Bayesian population model applied to the collected multi-wavelength data, we explore various physically motivated models to describe the relationship between observed richness and halo mass. Our analysis reveals the limitations of a simple lognormal scatter model in describing the data. We rule out significant contamination by unvirialized objects at the high-richness end of the sample. While dedicated simulations offer a well-fitting calibration of projection effects, our findings suggest the presence of redshift-dependent trends that these simulations may not have captured. Our findings highlight that modeling the selection function of optically detected clusters remains a complicated challenge, requiring a combination of simulation and data-driven approaches.

Shortcut Partitions in Minor-Free Graphs: Steiner Point Removal, Distance Oracles, Tree Covers, and More

The notion of shortcut partition, introduced recently by Chang, Conroy, Le, Milenkovi\'c, Solomon, and Than [CCLMST23], is a new type of graph partition into low-diameter clusters. Roughly speaking, the shortcut partition guarantees that for every two vertices u and v in the graph, there exists a path between u and v that intersects only a few clusters. They proved that any planar graph admits a shortcut partition and gave several applications, including a construction of tree cover for arbitrary planar graphs with stretch 1+varepsilon and O(1) many trees for any fixed varepsilon in (0,1). However, the construction heavily exploits planarity in multiple steps, and is thus inherently limited to planar graphs. In this work, we breach the "planarity barrier" to construct a shortcut partition for K_r-minor-free graphs for any r. To this end, we take a completely different approach -- our key contribution is a novel deterministic variant of the cop decomposition in minor-free graphs [And86, AGG14]. Our shortcut partition for K_r-minor-free graphs yields several direct applications. Most notably, we construct the first optimal distance oracle for K_r-minor-free graphs, with 1+varepsilon stretch, linear space, and constant query time for any fixed varepsilon in (0,1). The previous best distance oracle [AG06] uses O(nlog n) space and O(log n) query time, and its construction relies on Robertson-Seymour structural theorem and other sophisticated tools. We also obtain the first tree cover of O(1) size for minor-free graphs with stretch 1+varepsilon, while the previous best (1+varepsilon)-tree cover has size O(log^2 n) [BFN19].

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model's 3D hierarchical segmentations in real world scenes. The code and dataset are available at https://github.com/hardyho/ultrametric_feature_fields

Transductive Few-Shot Learning: Clustering is All You Need?

We investigate a general formulation for clustering and transductive few-shot learning, which integrates prototype-based objectives, Laplacian regularization and supervision constraints from a few labeled data points. We propose a concave-convex relaxation of the problem, and derive a computationally efficient block-coordinate bound optimizer, with convergence guarantee. At each iteration,our optimizer computes independent (parallel) updates for each point-to-cluster assignment. Therefore, it could be trivially distributed for large-scale clustering and few-shot tasks. Furthermore, we provides a thorough convergence analysis based on point-to-set maps. Were port comprehensive clustering and few-shot learning experiments over various data sets, showing that our method yields competitive performances, in term of accuracy and optimization quality, while scaling up to large problems. Using standard training on the base classes, without resorting to complex meta-learning and episodic-training strategies, our approach outperforms state-of-the-art few-shot methods by significant margins, across various models, settings and data sets. Surprisingly, we found that even standard clustering procedures (e.g., K-means), which correspond to particular, non-regularized cases of our general model, already achieve competitive performances in comparison to the state-of-the-art in few-shot learning. These surprising results point to the limitations of the current few-shot benchmarks, and question the viability of a large body of convoluted few-shot learning techniques in the recent literature.

Deep view of the intracluster light in the Coma cluster of galaxies

Detection and study of the intracluster light in rich clusters of galaxies has been a problem of long standing challenge and interest. Using the lowest surface brightness images of the Coma cluster of galaxies in the g and r bands, from the Halos and Environment of Nearby Galaxies (HERON) Coma Cluster Project, we obtained the most extensive image of intracluster light (ICL) in a single cluster to date, spreading over 1.5 Mpc from the cluster core. The unprecedented wealth of spectroscopic data made publicly available by the Dark Energy Spectroscopic Instrument (DESI) Early Data Release, complemented with a compilation from the NASA/IPAC Extragalactic Database and the literature, enabled the identification of 2,157 galaxy members within Coma, from which 42 distinct groups were identified. The synergy between these high-quality data allowed us to: 1) calculate ICL fractions of 19.9pm0.5\% and 19.6pm0.6\% in the g and r bands, respectively, consistent with a dynamically active cluster, 2) unveil Coma's faintest tidal features, and 3) provide a comprehensive picture of the dynamics and interactions within this complex system. Our findings indicate that the ICL connects several of these groups in a filamentous network, from which we infer the ongoing dynamical processes. In particular, we identified a faint stellar bridge linking the core of Coma with the galaxy NGC 4839, providing compelling evidence that this galaxy has already traversed the central region of the cluster.

From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Three-dimensional (3D) understanding of objects and scenes play a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large scale synthetic and object-centric 3D datasets have shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source for real-world 3D data, but finding diverse yet corresponding views of the same content has shown to be difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large scale 360 videos can address these limitations to provide: scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100,200, we observed a 458x respectively 1191x speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.

Local Graph Clustering with Noisy Labels

The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been made to the development of fast local methods (i.e. without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.

When Does Bottom-up Beat Top-down in Hierarchical Community Detection?

Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive (top-down) algorithms recursively partition the nodes into two communities, until a stopping rule indicates that no further split is needed. In contrast, agglomerative (bottom-up) algorithms first identify the smallest community structure and then repeatedly merge the communities using a linkage method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive compared to those existing for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis.

Unsupervised Manifold Linearizing and Clustering

We consider the problem of simultaneously clustering and learning a linear representation of data lying close to a union of low-dimensional manifolds, a fundamental task in machine learning and computer vision. When the manifolds are assumed to be linear subspaces, this reduces to the classical problem of subspace clustering, which has been studied extensively over the past two decades. Unfortunately, many real-world datasets such as natural images can not be well approximated by linear subspaces. On the other hand, numerous works have attempted to learn an appropriate transformation of the data, such that data is mapped from a union of general non-linear manifolds to a union of linear subspaces (with points from the same manifold being mapped to the same subspace). However, many existing works have limitations such as assuming knowledge of the membership of samples to clusters, requiring high sampling density, or being shown theoretically to learn trivial representations. In this paper, we propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership, inspired by state-of-the-art subspace clustering results. We give a parameterization of such a representation and membership, allowing efficient mini-batching and one-shot initialization. Experiments on CIFAR-10, -20, -100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods, and further learns a latent linear representation of the data.

A Practical Approach to Novel Class Discovery in Tabular Data

The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.

Comparison of Clustering Algorithms for Statistical Features of Vibration Data Sets

Vibration-based condition monitoring systems are receiving increasing attention due to their ability to accurately identify different conditions by capturing dynamic features over a broad frequency range. However, there is little research on clustering approaches in vibration data and the resulting solutions are often optimized for a single data set. In this work, we present an extensive comparison of the clustering algorithms K-means clustering, OPTICS, and Gaussian mixture model clustering (GMM) applied to statistical features extracted from the time and frequency domains of vibration data sets. Furthermore, we investigate the influence of feature combinations, feature selection using principal component analysis (PCA), and the specified number of clusters on the performance of the clustering algorithms. We conducted this comparison in terms of a grid search using three different benchmark data sets. Our work showed that averaging (Mean, Median) and variance-based features (Standard Deviation, Interquartile Range) performed significantly better than shape-based features (Skewness, Kurtosis). In addition, K-means outperformed GMM slightly for these data sets, whereas OPTICS performed significantly worse. We were also able to show that feature combinations as well as PCA feature selection did not result in any significant performance improvements. With an increase in the specified number of clusters, clustering algorithms performed better, although there were some specific algorithmic restrictions.

Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering

Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, allowing incorporating long-range dependencies between slices inside 3D volumes. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.

Adaptive Personlization in Federated Learning for Highly Non-i.i.d. Data

Federated learning (FL) is a distributed learning method that offers medical institutes the prospect of collaboration in a global model while preserving the privacy of their patients. Although most medical centers conduct similar medical imaging tasks, their differences, such as specializations, number of patients, and devices, lead to distinctive data distributions. Data heterogeneity poses a challenge for FL and the personalization of the local models. In this work, we investigate an adaptive hierarchical clustering method for FL to produce intermediate semi-global models, so clients with similar data distribution have the chance of forming a more specialized model. Our method forms several clusters consisting of clients with the most similar data distributions; then, each cluster continues to train separately. Inside the cluster, we use meta-learning to improve the personalization of the participants' models. We compare the clustering approach with classical FedAvg and centralized training by evaluating our proposed methods on the HAM10k dataset for skin lesion classification with extreme heterogeneous data distribution. Our experiments demonstrate significant performance gain in heterogeneous distribution compared to standard FL methods in classification accuracy. Moreover, we show that the models converge faster if applied in clusters and outperform centralized training while using only a small subset of data.

Fast, Expressive SE(n) Equivariant Networks through Weight-Sharing in Position-Orientation Space

Based on the theory of homogeneous spaces we derive geometrically optimal edge attributes to be used within the flexible message-passing framework. We formalize the notion of weight sharing in convolutional networks as the sharing of message functions over point-pairs that should be treated equally. We define equivalence classes of point-pairs that are identical up to a transformation in the group and derive attributes that uniquely identify these classes. Weight sharing is then obtained by conditioning message functions on these attributes. As an application of the theory, we develop an efficient equivariant group convolutional network for processing 3D point clouds. The theory of homogeneous spaces tells us how to do group convolutions with feature maps over the homogeneous space of positions R^3, position and orientations R^3 {times} S^2, and the group SE(3) itself. Among these, R^3 {times} S^2 is an optimal choice due to the ability to represent directional information, which R^3 methods cannot, and it significantly enhances computational efficiency compared to indexing features on the full SE(3) group. We support this claim with state-of-the-art results -- in accuracy and speed -- on five different benchmarks in 2D and 3D, including interatomic potential energy prediction, trajectory forecasting in N-body systems, and generating molecules via equivariant diffusion models.

Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or more complex distances. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm to achieve an O(k)-fold speedup in the second SWAP phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. It can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).

A Framework for Fast and Stable Representations of Multiparameter Persistent Homology Decompositions

Topological data analysis (TDA) is an area of data science that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for geometric data sets such as point clouds. One of the most important such descriptors is {\em persistent homology}, which encodes the change in shape as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to simultaneously vary multiple filtration parameters, for example feature scale and density. While the theoretical properties of single parameter persistent homology are well understood, less is known about the multiparameter case. In particular, a central question is the problem of representing multiparameter persistent homology by elements of a vector space for integration with standard machine learning algorithms. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a new general representation framework that leverages recent results on {\em decompositions} of multiparameter persistent homology. This framework is rich in information, fast to compute, and encompasses previous approaches. Moreover, we establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for analyzing geometric and point cloud data. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

DendroMap: Visual Exploration of Large-Scale Image Datasets for Machine Learning with Treemaps

In this paper, we present DendroMap, a novel approach to interactively exploring large-scale image datasets for machine learning (ML). ML practitioners often explore image datasets by generating a grid of images or projecting high-dimensional representations of images into 2-D using dimensionality reduction techniques (e.g., t-SNE). However, neither approach effectively scales to large datasets because images are ineffectively organized and interactions are insufficiently supported. To address these challenges, we develop DendroMap by adapting Treemaps, a well-known visualization technique. DendroMap effectively organizes images by extracting hierarchical cluster structures from high-dimensional representations of images. It enables users to make sense of the overall distributions of datasets and interactively zoom into specific areas of interests at multiple levels of abstraction. Our case studies with widely-used image datasets for deep learning demonstrate that users can discover insights about datasets and trained models by examining the diversity of images, identifying underperforming subgroups, and analyzing classification errors. We conducted a user study that evaluates the effectiveness of DendroMap in grouping and searching tasks by comparing it with a gridified version of t-SNE and found that participants preferred DendroMap. DendroMap is available at https://div-lab.github.io/dendromap/.

Grape detection, segmentation and tracking using deep neural networks and three-dimensional association

Agricultural applications such as yield prediction, precision agriculture and automated harvesting need systems able to infer the crop state from low-cost sensing devices. Proximal sensing using affordable cameras combined with computer vision has seen a promising alternative, strengthened after the advent of convolutional neural networks (CNNs) as an alternative for challenging pattern recognition problems in natural images. Considering fruit growing monitoring and automation, a fundamental problem is the detection, segmentation and counting of individual fruits in orchards. Here we show that for wine grapes, a crop presenting large variability in shape, color, size and compactness, grape clusters can be successfully detected, segmented and tracked using state-of-the-art CNNs. In a test set containing 408 grape clusters from images taken on a trellis-system based vineyard, we have reached an F 1 -score up to 0.91 for instance segmentation, a fine separation of each cluster from other structures in the image that allows a more accurate assessment of fruit size and shape. We have also shown as clusters can be identified and tracked along video sequences recording orchard rows. We also present a public dataset containing grape clusters properly annotated in 300 images and a novel annotation methodology for segmentation of complex objects in natural images. The presented pipeline for annotation, training, evaluation and tracking of agricultural patterns in images can be replicated for different crops and production systems. It can be employed in the development of sensing components for several agricultural and environmental applications.

MV-Map: Offboard HD-Map Generation with Multi-view Consistency

While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from different viewpoints. This is because BEV perception is typically set up in an 'onboard' manner, which restricts the computation and consequently prevents algorithms from reasoning multiple views simultaneously. This paper overcomes these limitations and advocates a more practical 'offboard' HD-Map generation setup that removes the computation constraints, based on the fact that HD-Maps are commonly reusable infrastructures built offline in data centers. To this end, we propose a novel offboard pipeline called MV-Map that capitalizes multi-view consistency and can handle an arbitrary number of frames with the key design of a 'region-centric' framework. In MV-Map, the target HD-Maps are created by aggregating all the frames of onboard predictions, weighted by the confidence scores assigned by an 'uncertainty network'. To further enhance multi-view consistency, we augment the uncertainty network with the global 3D structure optimized by a voxelized neural radiance field (Voxel-NeRF). Extensive experiments on nuScenes show that our MV-Map significantly improves the quality of HD-Maps, further highlighting the importance of offboard methods for HD-Map generation.

Most discriminative stimuli for functional cell type clustering

Identifying cell types and understanding their functional properties is crucial for unraveling the mechanisms underlying perception and cognition. In the retina, functional types can be identified by carefully selected stimuli, but this requires expert domain knowledge and biases the procedure towards previously known cell types. In the visual cortex, it is still unknown what functional types exist and how to identify them. Thus, for unbiased identification of the functional cell types in retina and visual cortex, new approaches are needed. Here we propose an optimization-based clustering approach using deep predictive models to obtain functional clusters of neurons using Most Discriminative Stimuli (MDS). Our approach alternates between stimulus optimization with cluster reassignment akin to an expectation-maximization algorithm. The algorithm recovers functional clusters in mouse retina, marmoset retina and macaque visual area V4. This demonstrates that our approach can successfully find discriminative stimuli across species, stages of the visual system and recording techniques. The resulting most discriminative stimuli can be used to assign functional cell types fast and on the fly, without the need to train complex predictive models or show a large natural scene dataset, paving the way for experiments that were previously limited by experimental time. Crucially, MDS are interpretable: they visualize the distinctive stimulus patterns that most unambiguously identify a specific type of neuron.

Bridging 3D Gaussian and Mesh for Freeview Video Rendering

This is only a preview version of GauMesh. Recently, primitive-based rendering has been proven to achieve convincing results in solving the problem of modeling and rendering the 3D dynamic scene from 2D images. Despite this, in the context of novel view synthesis, each type of primitive has its inherent defects in terms of representation ability. It is difficult to exploit the mesh to depict the fuzzy geometry. Meanwhile, the point-based splatting (e.g. the 3D Gaussian Splatting) method usually produces artifacts or blurry pixels in the area with smooth geometry and sharp textures. As a result, it is difficult, even not impossible, to represent the complex and dynamic scene with a single type of primitive. To this end, we propose a novel approach, GauMesh, to bridge the 3D Gaussian and Mesh for modeling and rendering the dynamic scenes. Given a sequence of tracked mesh as initialization, our goal is to simultaneously optimize the mesh geometry, color texture, opacity maps, a set of 3D Gaussians, and the deformation field. At a specific time, we perform alpha-blending on the RGB and opacity values based on the merged and re-ordered z-buffers from mesh and 3D Gaussian rasterizations. This produces the final rendering, which is supervised by the ground-truth image. Experiments demonstrate that our approach adapts the appropriate type of primitives to represent the different parts of the dynamic scene and outperforms all the baseline methods in both quantitative and qualitative comparisons without losing render speed.

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications.Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only the high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlemRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.

FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection

The transformation of features from 2D perspective space to 3D space is essential to multi-view 3D object detection. Recent approaches mainly focus on the design of view transformation, either pixel-wisely lifting perspective view features into 3D space with estimated depth or grid-wisely constructing BEV features via 3D projection, treating all pixels or grids equally. However, choosing what to transform is also important but has rarely been discussed before. The pixels of a moving car are more informative than the pixels of the sky. To fully utilize the information contained in images, the view transformation should be able to adapt to different image regions according to their contents. In this paper, we propose a novel framework named FrustumFormer, which pays more attention to the features in instance regions via adaptive instance-aware resampling. Specifically, the model obtains instance frustums on the bird's eye view by leveraging image view object proposals. An adaptive occupancy mask within the instance frustum is learned to refine the instance location. Moreover, the temporal frustum intersection could further reduce the localization uncertainty of objects. Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, and we achieve a new state-of-the-art performance on the benchmark. Codes and models will be made available at https://github.com/Robertwyq/Frustum.

A New Dataset and Comparative Study for Aphid Cluster Detection and Segmentation in Sorghum Fields

Aphid infestations are one of the primary causes of extensive damage to wheat and sorghum fields and are one of the most common vectors for plant viruses, resulting in significant agricultural yield losses. To address this problem, farmers often employ the inefficient use of harmful chemical pesticides that have negative health and environmental impacts. As a result, a large amount of pesticide is wasted on areas without significant pest infestation. This brings to attention the urgent need for an intelligent autonomous system that can locate and spray sufficiently large infestations selectively within the complex crop canopies. We have developed a large multi-scale dataset for aphid cluster detection and segmentation, collected from actual sorghum fields and meticulously annotated to include clusters of aphids. Our dataset comprises a total of 54,742 image patches, showcasing a variety of viewpoints, diverse lighting conditions, and multiple scales, highlighting its effectiveness for real-world applications. In this study, we trained and evaluated four real-time semantic segmentation models and three object detection models specifically for aphid cluster segmentation and detection. Considering the balance between accuracy and efficiency, Fast-SCNN delivered the most effective segmentation results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames per second (FPS). For object detection, RT-DETR exhibited the best overall performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and 72.55 on an NVIDIA V100 GPU. Our experiments further indicate that aphid cluster segmentation is more suitable for assessing aphid infestations than using detection models.

Attention-based Dynamic Subspace Learners for Medical Image Analysis

Learning similarity is a key aspect in medical image analysis, particularly in recommendation systems or in uncovering the interpretation of anatomical data in images. Most existing methods learn such similarities in the embedding space over image sets using a single metric learner. Images, however, have a variety of object attributes such as color, shape, or artifacts. Encoding such attributes using a single metric learner is inadequate and may fail to generalize. Instead, multiple learners could focus on separate aspects of these attributes in subspaces of an overarching embedding. This, however, implies the number of learners to be found empirically for each new dataset. This work, Dynamic Subspace Learners, proposes to dynamically exploit multiple learners by removing the need of knowing apriori the number of learners and aggregating new subspace learners during training. Furthermore, the visual interpretability of such subspace learning is enforced by integrating an attention module into our method. This integrated attention mechanism provides a visual insight of discriminative image features that contribute to the clustering of image sets and a visual explanation of the embedding features. The benefits of our attention-based dynamic subspace learners are evaluated in the application of image clustering, image retrieval, and weakly supervised segmentation. Our method achieves competitive results with the performances of multiple learners baselines and significantly outperforms the classification network in terms of clustering and retrieval scores on three different public benchmark datasets. Moreover, our attention maps offer a proxy-labels, which improves the segmentation accuracy up to 15% in Dice scores when compared to state-of-the-art interpretation techniques.

Unsupervised semantic segmentation of high-resolution UAV imagery for road scene parsing

Two challenges are presented when parsing road scenes in UAV images. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotations to train robust and accurate models. In this paper, an unsupervised road parsing framework that leverages recent advances in vision language models and fundamental computer vision model is introduced.Initially, a vision language model is employed to efficiently process ultra-large resolution UAV images to quickly detect road regions of interest in the images. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. Following that, a self-supervised representation learning network extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm is applied to cluster these feature representations and assign IDs to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the extraordinary flexibility of the proposed method, which even goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.

Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-modal Distillation and Super-Voxel Clustering

Semantic segmentation of point clouds usually requires exhausting efforts of human annotations, hence it attracts wide attention to the challenging topic of learning from unlabeled or weaker forms of annotations. In this paper, we take the first attempt for fully unsupervised semantic segmentation of point clouds, which aims to delineate semantically meaningful objects without any form of annotations. Previous works of unsupervised pipeline on 2D images fails in this task of point clouds, due to: 1) Clustering Ambiguity caused by limited magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity caused by the irregular sparsity of point cloud. Therefore, we propose a novel framework, PointDC, which is comprised of two steps that handle the aforementioned problems respectively: Cross-Modal Distillation (CMD) and Super-Voxel Clustering (SVC). In the first stage of CMD, multi-view visual features are back-projected to the 3D space and aggregated to a unified point feature to distill the training of the point representation. In the second stage of SVC, the point features are aggregated to super-voxels and then fed to the iterative clustering process for excavating semantic classes. PointDC yields a significant improvement over the prior state-of-the-art unsupervised methods, on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic segmentation benchmarks.

Exploring the cloud of feature interaction scores in a Rashomon set

Interactions among features are central to understanding the behavior of machine learning models. Recent research has made significant strides in detecting and quantifying feature interactions in single predictive models. However, we argue that the feature interactions extracted from a single pre-specified model may not be trustworthy since: a well-trained predictive model may not preserve the true feature interactions and there exist multiple well-performing predictive models that differ in feature interaction strengths. Thus, we recommend exploring feature interaction strengths in a model class of approximately equally accurate predictive models. In this work, we introduce the feature interaction score (FIS) in the context of a Rashomon set, representing a collection of models that achieve similar accuracy on a given task. We propose a general and practical algorithm to calculate the FIS in the model class. We demonstrate the properties of the FIS via synthetic data and draw connections to other areas of statistics. Additionally, we introduce a Halo plot for visualizing the feature interaction variance in high-dimensional space and a swarm plot for analyzing FIS in a Rashomon set. Experiments with recidivism prediction and image classification illustrate how feature interactions can vary dramatically in importance for similarly accurate predictive models. Our results suggest that the proposed FIS can provide valuable insights into the nature of feature interactions in machine learning models.

From Posterior Sampling to Meaningful Diversity in Image Restoration

Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion based image restoration methods to generate meaningfully diverse outputs, while incurring only negligent computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling. Code and examples are available at https://noa-cohen.github.io/MeaningfulDiversityInIR.

Binary Opacity Grids: Capturing Fine Geometric Detail for Mesh-Based View Synthesis

While surface-based view synthesis algorithms are appealing due to their low computational requirements, they often struggle to reproduce thin structures. In contrast, more expensive methods that model the scene's geometry as a volumetric density field (e.g. NeRF) excel at reconstructing fine geometric detail. However, density fields often represent geometry in a "fuzzy" manner, which hinders exact localization of the surface. In this work, we modify density fields to encourage them to converge towards surfaces, without compromising their ability to reconstruct thin structures. First, we employ a discrete opacity grid representation instead of a continuous density field, which allows opacity values to discontinuously transition from zero to one at the surface. Second, we anti-alias by casting multiple rays per pixel, which allows occlusion boundaries and subpixel structures to be modelled without using semi-transparent voxels. Third, we minimize the binary entropy of the opacity values, which facilitates the extraction of surface geometry by encouraging opacity values to binarize towards the end of training. Lastly, we develop a fusion-based meshing strategy followed by mesh simplification and appearance model fitting. The compact meshes produced by our model can be rendered in real-time on mobile devices and achieve significantly higher view synthesis quality compared to existing mesh-based approaches.

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibit remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates the above controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms current state-of-the-art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.

DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

With the increasing popularity of autonomous driving based on the powerful and unified bird's-eye-view (BEV) representation, a demand for high-quality and large-scale multi-view video data with accurate annotation is urgently required. However, such large-scale multi-view data is hard to obtain due to expensive collection and annotation costs. To alleviate the problem, we propose a spatial-temporal consistent diffusion framework DrivingDiffusion, to generate realistic multi-view videos controlled by 3D layout. There are three challenges when synthesizing multi-view videos given a 3D layout: How to keep 1) cross-view consistency and 2) cross-frame consistency? 3) How to guarantee the quality of the generated instances? Our DrivingDiffusion solves the problem by cascading the multi-view single-frame image generation step, the single-view video generation step shared by multiple cameras, and post-processing that can handle long video generation. In the multi-view model, the consistency of multi-view images is ensured by information exchange between adjacent cameras. In the temporal model, we mainly query the information that needs attention in subsequent frame generation from the multi-view images of the first frame. We also introduce the local prompt to effectively improve the quality of generated instances. In post-processing, we further enhance the cross-view consistency of subsequent frames and extend the video length by employing temporal sliding window algorithm. Without any extra cost, our model can generate large-scale realistic multi-camera driving videos in complex urban scenes, fueling the downstream driving tasks. The code will be made publicly available.

NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes

Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360{\deg} scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360{\deg} unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: https://zubair-irshad.github.io/projects/neo360.html

The SRG/eROSITA All-Sky Survey: Large-scale view of the Centaurus cluster

Methods. We utilized the combined five SRG/eROSITA All-Sky Survey data (eRASS:5) to perform X-ray imaging and spectral analyses of the Centaurus cluster in various directions to large radii. Surface brightness (SB) profiles out to 2R_{200} were constructed. We acquired gas temperature, metallicity, and normalization per area profiles out to R_{200}. We compared our results with previous Centaurus studies, cluster outskirts measurements, and simulations. Comprehensive sky background analysis was done across the FoV, in particular, to assess the variation of the eROSITA Bubble emission that partially contaminates the field. Results. The processed X-ray images show the known sloshing-induced structures in the core. The core (rleq11~kpc) is better described with a 2T model than a 1T model. Here, we measured lower T from the cooler component (~1.0 keV) and higher Z (sim!1.6Z_odot), signifying an iron bias. In the intermediate radial range, we observed prominent SB and normalization per area excesses in the eastern sector (Cen 45 location), reaching out to R_{500}. Temperature enhancements near the location of Cen 45 imply that the gas is shock-heated due to the interaction with Cen 30, the significant excess behind Cen 45 center might be the tail/ram-pressure-stripped gas. We found good agreement between the outskirt temperatures with the profile from simulations and fit from Suzaku outskirts measurements. We detected significant SB emission to the sky background level out to R_{200} with a 3.5sigma and followed by 2.9sigma at 1.1R_{200}. The metallicity at R_{500}-R_{200} is low but within the ranges of other outskirts studies. Conclusions. We present the first measurement of ICM morphology and properties of Centaurus cluster sampling the whole azimuth beyond 30', increasing the probed volume by a factor of almost 30.

Landscaping Linear Mode Connectivity

The presence of linear paths in parameter space between two different network solutions in certain cases, i.e., linear mode connectivity (LMC), has garnered interest from both theoretical and practical fronts. There has been significant research that either practically designs algorithms catered for connecting networks by adjusting for the permutation symmetries as well as some others that more theoretically construct paths through which networks can be connected. Yet, the core reasons for the occurrence of LMC, when in fact it does occur, in the highly non-convex loss landscapes of neural networks are far from clear. In this work, we take a step towards understanding it by providing a model of how the loss landscape needs to behave topographically for LMC (or the lack thereof) to manifest. Concretely, we present a `mountainside and ridge' perspective that helps to neatly tie together different geometric features that can be spotted in the loss landscape along the training runs. We also complement this perspective by providing a theoretical analysis of the barrier height, for which we provide empirical support, and which additionally extends as a faithful predictor of layer-wise LMC. We close with a toy example that provides further intuition on how barriers arise in the first place, all in all, showcasing the larger aim of the work -- to provide a working model of the landscape and its topography for the occurrence of LMC.

Space-time tradeoffs of lenses and optics via higher category theory

Optics and lenses are abstract categorical gadgets that model systems with bidirectional data flow. In this paper we observe that the denotational definition of optics - identifying two optics as equivalent by observing their behaviour from the outside - is not suitable for operational, software oriented approaches where optics are not merely observed, but built with their internal setups in mind. We identify operational differences between denotationally isomorphic categories of cartesian optics and lenses: their different composition rule and corresponding space-time tradeoffs, positioning them at two opposite ends of a spectrum. With these motivations we lift the existing categorical constructions and their relationships to the 2-categorical level, showing that the relevant operational concerns become visible. We define the 2-category 2-Optic(C) whose 2-cells explicitly track optics' internal configuration. We show that the 1-category Optic(C) arises by locally quotienting out the connected components of this 2-category. We show that the embedding of lenses into cartesian optics gets weakened from a functor to an oplax functor whose oplaxator now detects the different composition rule. We determine the difficulties in showing this functor forms a part of an adjunction in any of the standard 2-categories. We establish a conjecture that the well-known isomorphism between cartesian lenses and optics arises out of the lax 2-adjunction between their double-categorical counterparts. In addition to presenting new research, this paper is also meant to be an accessible introduction to the topic.

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance to freely choose the view directions in immersive environments such as virtual reality. The M\"obius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and aliasing problem. In this paper, we propose a novel deep learning-based approach, called OmniZoomer, to incorporate the M\"obius transformation into the network for movement and zoom on ODIs. By learning various transformed feature maps under different conditions, the network is enhanced to handle the increasing edge curvatures, which alleviates the blurry effect. Moreover, to address the aliasing problem, we propose two key components. Firstly, to compensate for the lack of pixels for describing curves, we enhance the feature maps in the high-resolution (HR) space and calculate the transformed index map with a spatial index generation module. Secondly, considering that ODIs are inherently represented in the spherical space, we propose a spherical resampling module that combines the index map and HR feature maps to transform the feature maps for better spherical correlation. The transformed feature maps are decoded to output a zoomed ODI. Experiments show that our method can produce HR and high-quality ODIs with the flexibility to move and zoom in to the object of interest. Project page is available at http://vlislab22.github.io/OmniZoomer/.

MVD^2: Efficient Multiview 3D Reconstruction for Multiview Diffusion

As a promising 3D generation technique, multiview diffusion (MVD) has received a lot of attention due to its advantages in terms of generalizability, quality, and efficiency. By finetuning pretrained large image diffusion models with 3D data, the MVD methods first generate multiple views of a 3D object based on an image or text prompt and then reconstruct 3D shapes with multiview 3D reconstruction. However, the sparse views and inconsistent details in the generated images make 3D reconstruction challenging. We present MVD^2, an efficient 3D reconstruction method for multiview diffusion (MVD) images. MVD^2 aggregates image features into a 3D feature volume by projection and convolution and then decodes volumetric features into a 3D mesh. We train MVD^2 with 3D shape collections and MVD images prompted by rendered views of 3D shapes. To address the discrepancy between the generated multiview images and ground-truth views of the 3D shapes, we design a simple-yet-efficient view-dependent training scheme. MVD^2 improves the 3D generation quality of MVD and is fast and robust to various MVD methods. After training, it can efficiently decode 3D meshes from multiview images within one second. We train MVD^2 with Zero-123++ and ObjectVerse-LVIS 3D dataset and demonstrate its superior performance in generating 3D models from multiview images generated by different MVD methods, using both synthetic and real images as prompts.

Heterogeneous Graph Contrastive Learning with Meta-path Contexts and Adaptively Weighted Negative Samples

Heterogeneous graph contrastive learning has received wide attention recently. Some existing methods use meta-paths, which are sequences of object types that capture semantic relationships between objects, to construct contrastive views. However, most of them ignore the rich meta-path context information that describes how two objects are connected by meta-paths. Further, they fail to distinguish negative samples, which could adversely affect the model performance. To address the problems, we propose MEOW, which considers both meta-path contexts and weighted negative samples. Specifically, MEOW constructs a coarse view and a fine-grained view for contrast. The former reflects which objects are connected by meta-paths, while the latter uses meta-path contexts and characterizes details on how the objects are connected. Then, we theoretically analyze the InfoNCE loss and recognize its limitations for computing gradients of negative samples. To better distinguish negative samples, we learn hard-valued weights for them based on node clustering and use prototypical contrastive learning to pull close embeddings of nodes in the same cluster. In addition, we propose a variant model AdaMEOW that adaptively learns soft-valued weights of negative samples to further improve node representation. Finally, we conduct extensive experiments to show the superiority of MEOW and AdaMEOW against other state-of-the-art methods.

GaussianObject: Just Taking Four Images to Get A High-Quality 3D Object with Gaussian Splatting

Reconstructing and rendering 3D objects from highly sparse views is of critical importance for promoting applications of 3D vision techniques and improving user experience. However, images from sparse views only contain very limited 3D information, leading to two significant challenges: 1) Difficulty in building multi-view consistency as images for matching are too few; 2) Partially omitted or highly compressed object information as view coverage is insufficient. To tackle these challenges, we propose GaussianObject, a framework to represent and render the 3D object with Gaussian splatting, that achieves high rendering quality with only 4 input images. We first introduce techniques of visual hull and floater elimination which explicitly inject structure priors into the initial optimization process for helping build multi-view consistency, yielding a coarse 3D Gaussian representation. Then we construct a Gaussian repair model based on diffusion models to supplement the omitted object information, where Gaussians are further refined. We design a self-generating strategy to obtain image pairs for training the repair model. Our GaussianObject is evaluated on several challenging datasets, including MipNeRF360, OmniObject3D, and OpenIllumination, achieving strong reconstruction results from only 4 views and significantly outperforming previous state-of-the-art methods.

The FRB20190520B Sightline Intersects Foreground Galaxy Clusters

The repeating fast radio burst FRB20190520B is an anomaly of the FRB population thanks to its high dispersion measure (DM=1205,pc,cm^{-3}) despite its low redshift of z_frb=0.241. This excess has been attributed to a host contribution of {DM_{host}} approx 900,pc,cm^{-3}, far larger than any other known FRB. In this paper, we describe spectroscopic observations of the FRB20190520B field obtained as part of the FLIMFLAM survey on the 2dF/AAOmega facility, which yielded 701 galaxies redshifts in a field of approx 3,deg^2. Applying a friends-of-friends group finder reveals multiple galaxy groups and clusters, for which we then estimated halo masses by comparing their richness with forward-modeled mocks from numerical simulations. We discover two separate M_halo >10^{14},M_odot galaxy clusters, at z=0.1867 and z=0.2170, respectively, that are directly intersected by the FRB sightline within their characteristic radius r_{200}. Subtracting off their estimated DM contributions as well that of the diffuse intergalactic medium, we estimate a host contribution of DM_{host}=467^{+140}_{-230},pc,cm^{-3} or {DM_{host}} = 339^{+122}_{-174},pc,cm^{-3} (observed frame) depending on whether we assume the halo gas extends to r_{200} or 2times r_{200}. This significantly smaller DM_{host} -- no longer the largest known value -- is now consistent with Halpha emission measure estimates of the host galaxy without having to invoke unusually high gas temperatures. We also re-estimate the turbulent fluctuation and geometric amplification factor of the scattering layer to be FG approx 3.9 - 7.5,(pc^2;km)^{-1/3}. This result illustrates the importance of incorporating foreground data for FRB analyses, both for understanding the nature of FRBs and to realize their potential as a cosmological probe.

Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation

Generating dense multiview images from text prompts is crucial for creating high-fidelity 3D assets. Nevertheless, existing methods struggle with space-view correspondences, resulting in sparse and low-quality outputs. In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. For achieving neighbor-view consistency, each viewpoint densely interacts with adjacent viewpoints to perceive the global spatial structure, and aggregates information along motion paths explicitly defined by physical principles to refine details. To further enhance cross-view consistency and alleviate content drift, CoSER rapidly scan all views in spiral bidirectional manner to aware holistic information and then scores each point based on semantic material. Subsequently, we conduct weighted down-sampling along the spatial dimension based on scores, thereby facilitating prominent information fusion across all views with lightweight computation. Technically, the core module is built by integrating the attention mechanism with a selective state space model, exploiting the robust learning capabilities of the former and the low overhead of the latter. Extensive evaluation shows that CoSER is capable of producing dense, high-fidelity, content-consistent multiview images that can be flexibly integrated into various 3D generation models.

Persistent homology of the cosmic web. I: Hierarchical topology in ΛCDM cosmologies

Using a set of LambdaCDM simulations of cosmic structure formation, we study the evolving connectivity and changing topological structure of the cosmic web using state-of-the-art tools of multiscale topological data analysis (TDA). We follow the development of the cosmic web topology in terms of the evolution of Betti number curves and feature persistence diagrams of the three (topological) classes of structural features: matter concentrations, filaments and tunnels, and voids. The Betti curves specify the prominence of features as a function of density level, and their evolution with cosmic epoch reflects the changing network connections between these structural features. The persistence diagrams quantify the longevity and stability of topological features. In this study we establish, for the first time, the link between persistence diagrams, the features they show, and the gravitationally driven cosmic structure formation process. By following the diagrams' development over cosmic time, the link between the multiscale topology of the cosmic web and the hierarchical buildup of cosmic structure is established. The sharp apexes in the diagrams are intimately related to key transitions in the structure formation process. The apex in the matter concentration diagrams coincides with the density level at which, typically, they detach from the Hubble expansion and begin to collapse. At that level many individual islands merge to form the network of the cosmic web and a large number of filaments and tunnels emerge to establish its connecting bridges. The location trends of the apex possess a self-similar character that can be related to the cosmic web's hierarchical buildup. We find that persistence diagrams provide a significantly higher and more profound level of information on the structure formation process than more global summary statistics like Euler characteristic or Betti numbers.

VaLID: Variable-Length Input Diffusion for Novel View Synthesis

Novel View Synthesis (NVS), which tries to produce a realistic image at the target view given source view images and their corresponding poses, is a fundamental problem in 3D Vision. As this task is heavily under-constrained, some recent work, like Zero123, tries to solve this problem with generative modeling, specifically using pre-trained diffusion models. Although this strategy generalizes well to new scenes, compared to neural radiance field-based methods, it offers low levels of flexibility. For example, it can only accept a single-view image as input, despite realistic applications often offering multiple input images. This is because the source-view images and corresponding poses are processed separately and injected into the model at different stages. Thus it is not trivial to generalize the model into multi-view source images, once they are available. To solve this issue, we try to process each pose image pair separately and then fuse them as a unified visual representation which will be injected into the model to guide image synthesis at the target-views. However, inconsistency and computation costs increase as the number of input source-view images increases. To solve these issues, the Multi-view Cross Former module is proposed which maps variable-length input data to fix-size output data. A two-stage training strategy is introduced to further improve the efficiency during training time. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed method against previous approaches. The code will be released according to the acceptance.

Do logarithmic proximity measures outperform plain ones in graph clustering?

We consider a number of graph kernels and proximity measures including commute time kernel, regularized Laplacian kernel, heat kernel, exponential diffusion kernel (also called "communicability"), etc., and the corresponding distances as applied to clustering nodes in random graphs and several well-known datasets. The model of generating random graphs involves edge probabilities for the pairs of nodes that belong to the same class or different predefined classes of nodes. It turns out that in most cases, logarithmic measures (i.e., measures resulting after taking logarithm of the proximities) perform better while distinguishing underlying classes than the "plain" measures. A comparison in terms of reject curves of inter-class and intra-class distances confirms this conclusion. A similar conclusion can be made for several well-known datasets. A possible origin of this effect is that most kernels have a multiplicative nature, while the nature of distances used in cluster algorithms is an additive one (cf. the triangle inequality). The logarithmic transformation is a tool to transform the first nature to the second one. Moreover, some distances corresponding to the logarithmic measures possess a meaningful cutpoint additivity property. In our experiments, the leader is usually the logarithmic Communicability measure. However, we indicate some more complicated cases in which other measures, typically, Communicability and plain Walk, can be the winners.

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

Recently, State Space Models (SSMs), with Mamba as a prime example, have shown great promise for long-range dependency modeling with linear complexity. Then, Vision Mamba and the subsequent architectures are presented successively, and they perform well on visual tasks. The crucial step of applying Mamba to visual tasks is to construct 2D visual features in sequential manners. To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module. The embeddings extracted from the preceding layer are projected into multiple lower-dimensional subspaces. Subsequently, within each subspace, the selective scan is performed along distinct scan routes. The resulting sub-embeddings, obtained from the multi-head scan process, are then integrated and ultimately projected back into the high-dimensional space. Moreover, we incorporate a Scan Route Attention (SRA) mechanism to enhance the module's capability to discern complex structures. To validate the efficacy of our module, we exclusively substitute the 2D-Selective-Scan (SS2D) block in VM-UNet with our proposed module, and we train our models from scratch without using any pre-trained weights. The results indicate a significant improvement in performance while reducing the parameters of the original VM-UNet. The code for this study is publicly available at https://github.com/PixDeep/MHS-VM.

A Survey on Machine Learning Solutions for Graph Pattern Extraction

A subgraph is constructed by using a subset of vertices and edges of a given graph. There exist many graph properties that are hereditary for subgraphs. Hence, researchers from different communities have paid a great deal of attention in studying numerous subgraph problems, on top of the ordinary graph problems. Many algorithms are proposed in studying subgraph problems, where one common approach is by extracting the patterns and structures of a given graph. Due to the complex structures of certain types of graphs and to improve overall performances of the existing frameworks, machine learning techniques have recently been employed in dealing with various subgraph problems. In this article, we present a comprehensive review on five well known subgraph problems that have been tackled by using machine learning methods. They are subgraph isomorphism (both counting and matching), maximum common subgraph, community detection and community search problems. We provide an outline of each proposed method, and examine its designs and performances. We also explore non-learning-based algorithms for each problem and a brief discussion is given. We then suggest some promising research directions in this area, hoping that relevant subgraph problems can be tackled by using a similar strategy. Since there is a huge growth in employing machine learning techniques in recent years, we believe that this survey will serve as a good reference point to relevant research communities.