Upload 5 files
- README.md +1 -44
- app.py +70 -6
- e621FastTextModel010Replacement_small.bin +3 -0
- fluffyrock_3m.csv +0 -0
- requirements.txt +1 -1
README.md
CHANGED
|
@@ -7,49 +7,6 @@ sdk: gradio
|
|
| 7 |
sdk_version: 4.19.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
-
tags:
|
| 11 |
-
- not-for-all-audience
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 15 |
-
## Frequently Asked Questions (FAQs)
|
| 16 |
-
|
| 17 |
-
Technically I am writing this before anyone but me has used the tool, so no one has asked questions yet. But if they did, here are the questions I think they might ask:
|
| 18 |
-
|
| 19 |
-
### Why is this space tagged "not-for-all-audience"
|
| 20 |
-
|
| 21 |
-
The "not-for-all-audience" tag informs users that this tool's text output is derived from e621.net data for tag prediction and completion. This measure underscores a commitment to responsible content sharing.
|
| 22 |
-
|
| 23 |
-
### Does input order matter?
|
| 24 |
-
|
| 25 |
-
No
|
| 26 |
-
|
| 27 |
-
### Should I use underscores in the input tags?
|
| 28 |
-
|
| 29 |
-
It doesn't matter. The application handles tags either way.
|
| 30 |
-
|
| 31 |
-
### Why are some valid tags marked as "unseen", and why don't some artists ever get returned?
|
| 32 |
-
|
| 33 |
-
Some data is excluded from consideration if it did not occur frequently enough in the sample from which the application makes its calculations.
|
| 34 |
-
If an artist or tag is too infrequent, we might not think we have enough data to make predictions about it.
|
| 35 |
-
|
| 36 |
-
### Are there any special tags?
|
| 37 |
-
|
| 38 |
-
Yes. We normalized the favorite counts of each image to a range of 0-9, with 0 being the lowest favcount, and 9 being the highest.
|
| 39 |
-
You can include any of these special tags: "score:0", "score:1", "score:2", "score:3", "score:4", "score:5", "score:6", "score:7", "score:8", "score:9"
|
| 40 |
-
in your list to bias the output toward artists with higher or lower scoring images.
|
| 41 |
-
|
| 42 |
-
### Are there any other special tricks?
|
| 43 |
-
|
| 44 |
-
Yes. If you want to more strongly bias the artist output toward a specific tag, you can just list it multiple times.
|
| 45 |
-
So for example, the query "red fox, red fox, red fox, score:7" will yield a list of artists who are more strongly associated with the tag "red fox"
|
| 46 |
-
than the query "red fox, score:7".
|
| 47 |
-
|
| 48 |
-
### What calculation is this thing actually performing?
|
| 49 |
-
|
| 50 |
-
Each artist is represented by a "pseudo-document" composed of all the tags from their uploaded images, treating these tags similarly to words in a text document.
|
| 51 |
-
Similarly, when you input a set of tags, the system creates a pseudo-document for your query out of all the tags.
|
| 52 |
-
It then uses a technique called cosine similarity to compare your tags against each artist's collection, essentially finding which artist's tags are most "similar" to yours.
|
| 53 |
-
This method helps identify artists whose work is closely aligned with the themes or elements you're interested in.
|
| 54 |
-
For those curious about the underlying mechanics of comparing text-like data, we employ the TF-IDF (Term Frequency-Inverse Document Frequency) method, a standard approach in information retrieval.
|
| 55 |
-
You can read more about TF-IDF on its [Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
|
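The FAQ text removed from the README describes the ranking as TF-IDF pseudo-documents compared by cosine similarity, and notes that repeating a tag biases the result toward it. The sketch below illustrates that idea using only the standard library. It is not the Space's code: the smoothed IDF formula, the raw-count term frequency, the two artist documents, and every function name are assumptions made for illustration, and the real vectorizer settings are not visible in this diff.

```python
import math
from collections import Counter


def tokenize(doc):
    """Split a comma-separated tag string into cleaned tags."""
    return [t.replace('_', ' ').strip() for t in doc.split(',') if t.strip()]


def fit_idf(docs):
    """Smoothed IDF per tag: ln((1 + n) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(tag for doc in docs for tag in set(tokenize(doc)))
    return {tag: math.log((1 + n) / (1 + count)) + 1 for tag, count in df.items()}


def tfidf_vector(doc, idf):
    """Raw tag counts times IDF; tags outside the vocabulary are dropped."""
    counts = Counter(tokenize(doc))
    return {tag: c * idf[tag] for tag, c in counts.items() if tag in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tag, 0.0) for tag, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical artist pseudo-documents: all tags from each artist's images.
artist_docs = {
    "artist_a": "red fox, red fox, forest, solo",
    "artist_b": "wolf, forest, solo, snow",
}
idf = fit_idf(list(artist_docs.values()))
artist_vecs = {name: tfidf_vector(doc, idf) for name, doc in artist_docs.items()}


def rank(query):
    """Return (artist, score) pairs sorted from most to least similar."""
    q = tfidf_vector(query, idf)
    scores = {name: cosine(q, vec) for name, vec in artist_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(rank("red fox, forest, snow"))
print(rank("red fox, red fox, red fox, forest, snow"))
```

In this toy data, the second query repeats "red fox" and so scores higher against artist_a than the first, which is the repetition effect the FAQ describes.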
|
|
|
| 7 |
sdk_version: 4.19.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
+
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
app.py
CHANGED
|
@@ -4,6 +4,11 @@ import numpy as np
|
|
| 4 |
from joblib import load
|
| 5 |
import h5py
|
| 6 |
from io import BytesIO
|
| 7 |
|
| 8 |
|
| 9 |
faq_content="""
|
|
@@ -59,13 +64,71 @@ with h5py.File('complete_artist_data.hdf5', 'r') as f:
|
|
| 59 |
|
| 60 |
# Load artist names and decode to strings
|
| 61 |
artist_names = [name.decode() for name in f['artist_names'][:]]
|
| 62 |
|
| 63 |
def find_similar_artists(new_tags_string, top_n):
|
| 64 |
-
#
|
| 65 |
new_image_tags = [tag.replace('_', ' ').strip() for tag in new_tags_string.split(",")]
|
| 66 |
-
unseen_tags = set(new_image_tags) - set(vectorizer.vocabulary_.keys())
|
| 67 |
-
|
| 68 |
-
|
| 69 |
X_new_image = vectorizer.transform([','.join(new_image_tags)])
|
| 70 |
similarities = cosine_similarity(X_new_image, X_artist)[0]
|
| 71 |
|
|
@@ -75,7 +138,8 @@ def find_similar_artists(new_tags_string, top_n):
|
|
| 75 |
top_artists_str = "\n".join([f"{rank+1}. {artist[3:]} ({score:.4f})" for rank, (artist, score) in enumerate(top_artists)])
|
| 76 |
dynamic_prompts_formatted_artists = "{" + "|".join([artist for artist, _ in top_artists]) + "}"
|
| 77 |
|
| 78 |
-
return
|
|
|
|
| 79 |
|
| 80 |
iface = gr.Interface(
|
| 81 |
fn=find_similar_artists,
|
|
@@ -84,7 +148,7 @@ iface = gr.Interface(
|
|
| 84 |
gr.Slider(minimum=1, maximum=100, value=10, step=1, label="Number of artists")
|
| 85 |
],
|
| 86 |
outputs=[
|
| 87 |
-
gr.
|
| 88 |
gr.Textbox(label="Top Artists", info="These are the artists most strongly associated with your tags. The number in parentheses is a similarity score between 0 and 1, with higher numbers indicating greater similarity."),
|
| 89 |
gr.Textbox(label="Dynamic Prompts Format", info="For if you're using the Automatic1111 webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) with the Dynamic Prompts extension activated (https://github.com/adieyal/sd-dynamic-prompts) and want to try them all individually.")
|
| 90 |
],
|
|
|
|
| 4 |
from joblib import load
|
| 5 |
import h5py
|
| 6 |
from io import BytesIO
|
+import csv
+import re
+import random
+import compress_fasttext
+from collections import OrderedDict
| 12 |
|
| 13 |
|
| 14 |
faq_content="""
|
|
|
|
| 64 |
|
| 65 |
# Load artist names and decode to strings
|
| 66 |
artist_names = [name.decode() for name in f['artist_names'][:]]
|
+
+def clean_tag(tag):
+    return ''.join(char for char in tag if ord(char) < 128)
+
+#Normally returns tag to aliases, but when reverse=True, returns alias to tags
+def build_aliases_dict(filename, reverse=False):
+    aliases_dict = {}
+    with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
+        reader = csv.reader(csvfile)
+        for row in reader:
+            tag = clean_tag(row[0])
+            alias_list = [] if row[3] == "null" else [clean_tag(alias) for alias in row[3].split(',')]
+            if reverse:
+                for alias in alias_list:
+                    aliases_dict.setdefault(alias, []).append(tag)
+            else:
+                aliases_dict[tag] = alias_list
+    return aliases_dict
+
+
+def find_similar_tags(test_tags):
+
+    #Initialize stuff
+    if not hasattr(find_similar_tags, "fasttext_small_model"):
+        find_similar_tags.fasttext_small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('e621FastTextModel010Replacement_small.bin')
+    tag_aliases_file = 'fluffyrock_3m.csv'
+    if not hasattr(find_similar_tags, "tag2aliases"):
+        find_similar_tags.tag2aliases = build_aliases_dict(tag_aliases_file)
+    if not hasattr(find_similar_tags, "alias2tags"):
+        find_similar_tags.alias2tags = build_aliases_dict(tag_aliases_file, reverse=True)
+
+
+    # Find similar tags and prepare data for dataframe.
+    results_data = []
+    for tag in test_tags:
+        similar_words = find_similar_tags.fasttext_small_model.most_similar(tag)
+        result, seen = [], set()
+        if tag in find_similar_tags.tag2aliases:
+            result.append((tag, 1))
+            seen.add(tag)
+        else:
+            for item in similar_words:
+                similar_word, similarity = item
+                if similar_word not in seen:
+                    if similar_word in find_similar_tags.tag2aliases:
+                        result.append((similar_word.replace('_', ' '), round(similarity, 3)))
+                        seen.add(similar_word)
+                    else:
+                        for similar_tag in find_similar_tags.alias2tags.get(similar_word, []):
+                            if similar_tag not in seen:
+                                result.append((similar_tag.replace('_', ' '), round(similarity, 3)))
+                                seen.add(similar_tag)
+        # Append tag and formatted similar tags to results_data
+        for word, sim in result:
+            #if word not in seen:
+            results_data.append([tag, word, sim])
+            #seen.add(word)
+
+    return results_data  # Return list of lists for Dataframe
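The helpers added in this hunk build a tag-to-aliases map and its reverse, then use them to turn nearest-neighbour words into known tags. The following sketch shows that flow on in-memory data so it can run without the model or the CSV file. It is a simplified illustration, not the committed code: the sample rows, the meaning of columns 1 and 2, the function names, and the hard-coded neighbour list standing in for a `most_similar()` result are all assumptions, and it deduplicates on the canonical tag only.

```python
import csv
import io

# Hypothetical rows. The diff only reads column 0 (tag) and column 3 (aliases);
# the middle two columns are placeholders.
SAMPLE_CSV = (
    'red_fox,5,1000,"fox_(red),vulpes_vulpes"\n'
    'canine,5,9000,"canid"\n'
    'solo,0,50000,null\n'
)


def build_alias_maps(csvfile):
    """Return (tag -> aliases, alias -> tags) from one pass over the rows."""
    tag2aliases, alias2tags = {}, {}
    for row in csv.reader(csvfile):
        tag = row[0]
        aliases = [] if row[3] == "null" else row[3].split(',')
        tag2aliases[tag] = aliases
        for alias in aliases:
            alias2tags.setdefault(alias, []).append(tag)
    return tag2aliases, alias2tags


def resolve_neighbors(neighbors, tag2aliases, alias2tags):
    """Map (word, similarity) pairs onto known tags, skipping duplicates."""
    result, seen = [], set()
    for word, similarity in neighbors:
        # A neighbour is either a known tag itself or an alias of known tags.
        candidates = [word] if word in tag2aliases else alias2tags.get(word, [])
        for tag in candidates:
            if tag not in seen:
                result.append((tag.replace('_', ' '), round(similarity, 3)))
                seen.add(tag)
    return result


tag2aliases, alias2tags = build_alias_maps(io.StringIO(SAMPLE_CSV))

# Stand-in for the output of a word-vector most_similar() call.
fake_neighbors = [
    ("vulpes_vulpes", 0.91234),
    ("red_fox", 0.88),
    ("unknown_word", 0.5),
    ("canid", 0.4),
]
resolved = resolve_neighbors(fake_neighbors, tag2aliases, alias2tags)
print(resolved)  # [('red fox', 0.912), ('canine', 0.4)]
```

Building both maps in one pass avoids reading the CSV twice, which the committed `build_aliases_dict` does when called once per direction; whether that matters depends on the file size.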
| 126 |
|
| 127 |
def find_similar_artists(new_tags_string, top_n):
|
|
|
|
| 128 |
new_image_tags = [tag.replace('_', ' ').strip() for tag in new_tags_string.split(",")]
|
| 129 |
+
unseen_tags = [tag for tag in OrderedDict.fromkeys(new_image_tags) if tag not in vectorizer.vocabulary_]
|
| 130 |
+
unseen_tags_data = find_similar_tags(unseen_tags) if unseen_tags else [["No unseen tags", "", ""]]
|
| 131 |
+
|
| 132 |
X_new_image = vectorizer.transform([','.join(new_image_tags)])
|
| 133 |
similarities = cosine_similarity(X_new_image, X_artist)[0]
|
| 134 |
|
|
|
|
| 138 |
top_artists_str = "\n".join([f"{rank+1}. {artist[3:]} ({score:.4f})" for rank, (artist, score) in enumerate(top_artists)])
|
| 139 |
dynamic_prompts_formatted_artists = "{" + "|".join([artist for artist, _ in top_artists]) + "}"
|
| 140 |
|
| 141 |
+
return unseen_tags_data, top_artists_str, dynamic_prompts_formatted_artists
|
| 142 |
+
|
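The new lines in `find_similar_artists` collect the input tags that the vectorizer has never seen and pass them to `find_similar_tags`. A small sketch of that step follows, assuming the goal is to report unseen tags once each, in the order the user typed them. The vocabulary set and the function names here are hypothetical; in the app the vocabulary comes from `vectorizer.vocabulary_`.

```python
def split_tags(tags_string):
    """Normalise a comma-separated tag string the way the app's input is cleaned."""
    return [tag.replace('_', ' ').strip() for tag in tags_string.split(',') if tag.strip()]


def find_unseen(tags, vocabulary):
    """Tags missing from the vocabulary, in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order, so it deduplicates without reordering.
    return [tag for tag in dict.fromkeys(tags) if tag not in vocabulary]


# Hypothetical vocabulary standing in for vectorizer.vocabulary_.
vocabulary = {"red fox", "forest", "score:7"}

tags = split_tags("red_fox, sparkledog, forest, glowing eyes, sparkledog")
unseen = find_unseen(tags, vocabulary)
print(unseen)  # ['sparkledog', 'glowing eyes']
```

Subtracting two plain sets gives the same members but in arbitrary order, so the rows of the "Unseen Tags" table could come out in a different order than the input.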
| 143 |
|
| 144 |
iface = gr.Interface(
|
| 145 |
fn=find_similar_artists,
|
|
|
|
| 148 |
gr.Slider(minimum=1, maximum=100, value=10, step=1, label="Number of artists")
|
| 149 |
],
|
| 150 |
outputs=[
|
| 151 |
+
gr.Dataframe(label="Unseen Tags", headers=["Tag", "Similar Tags", "Similarity"]),
|
| 152 |
gr.Textbox(label="Top Artists", info="These are the artists most strongly associated with your tags. The number in parentheses is a similarity score between 0 and 1, with higher numbers indicating greater similarity."),
|
| 153 |
gr.Textbox(label="Dynamic Prompts Format", info="For if you're using the Automatic1111 webui (https://github.com/AUTOMATIC1111/stable-diffusion-webui) with the Dynamic Prompts extension activated (https://github.com/adieyal/sd-dynamic-prompts) and want to try them all individually.")
|
| 154 |
],
|
e621FastTextModel010Replacement_small.bin
ADDED
|
@@ -0,0 +1,3 @@
|
+version https://git-lfs.github.com/spec/v1
+oid sha256:a9ade94b75665a92776b73d4bb8871deca566b1b24a0866c0b1d2c56fa7ce68e
+size 15782079
fluffyrock_3m.csv
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
requirements.txt
CHANGED
|
@@ -3,4 +3,4 @@ numpy==1.25.1
|
|
| 3 |
scikit-learn==1.2.2
|
| 4 |
h5py==3.8.0
|
| 5 |
joblib==1.2.0
|
| 6 |
-
|
|
|
|
| 3 |
scikit-learn==1.2.2
|
| 4 |
h5py==3.8.0
|
| 5 |
joblib==1.2.0
|
| 6 |
+
compress-fasttext
|