tien314 committed on
Commit c81a0c9 · verified · 1 Parent(s): cd84a49

Update BM25S model

README.md CHANGED
@@ -11,7 +11,7 @@ tags:

  # BM25S Index

- This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.6`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
+ This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.7post1`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

  BM25S Related Links:

@@ -26,10 +26,10 @@ BM25S Related Links:
  You can install the `bm25s` library with `pip`:

  ```bash
- pip install "bm25s==0.2.6"
+ pip install "bm25s==0.2.7post1"

  # Include extra dependencies like stemmer
- pip install "bm25s[full]==0.2.6"
+ pip install "bm25s[full]==0.2.7post1"

  # For huggingface hub usage
  pip install huggingface_hub
@@ -123,9 +123,9 @@ This dataset was created using the following data:

  | Statistic | Value |
  | --- | --- |
- | Number of documents | 831507 |
- | Number of tokens | 8338070 |
- | Average tokens per document | 10.03 |
+ | Number of documents | 750312 |
+ | Number of tokens | 7592215 |
+ | Average tokens per document | 10.12 |

  ## Parameters

corpus.jsonl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:16164bf429cfc58ca04d9782582ab594274462760bc2eb0cc776e07d095cc5d5
- size 85381127
+ oid sha256:2b999015937322493801414a7628aea03373b140623efa2d7e247af03e4eb2b2
+ size 73690346
corpus.mmindex.json CHANGED
The diff for this file is too large to render.

data.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9c698db1df7ff3cb81bc33c0e162561368ebbc3dd40cf3bc4e1ea747d1234407
- size 33352408
+ oid sha256:e30a1c11c7cc998cca6c57896b569da0e32ca94e6c59b92d14648706c2d670aa
+ size 30368988
indices.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f1c9a4c9f1c7423f165ff28fbff534b9cb486281f7aa29246899c9eda84521dc
- size 33352408
+ oid sha256:5b2f427f0d29bab6a4744d4d14cad7f6205617efb8f2381fa0e82664698e1f92
+ size 30368988
indptr.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:03d05dfbd9c6cf91f4b4d3f7047f44c573bd55fadee26ee6a4fbefc25b920fe8
- size 1408676
+ oid sha256:e01faf6b39cfa30b9e13fb00c062117e7487e6a6e8b4539522a37f6b30591a52
+ size 559348
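
Reviewer note: the `.npy` sizes in the three pointers above can be cross-checked against the README statistics. The sketch below is only a plausibility check under two assumptions that the diff does not state: that each `.npy` file carries a 128-byte header, and that every element is 4 bytes wide (consistent with the `float32` / `int32` dtypes in `params.index.json`). The helper name `element_count` is hypothetical.

```python
# Assumption: a 128-byte .npy header, which is typical for small 1-D arrays
# but is not confirmed by the diff itself.
NPY_HEADER_BYTES = 128
# Assumption: 4-byte elements (float32 / int32, as listed in params.index.json).
ITEM_BYTES = 4


def element_count(file_size: int) -> int:
    """Estimate how many 4-byte elements a .npy file of this size holds."""
    payload = file_size - NPY_HEADER_BYTES
    if payload < 0 or payload % ITEM_BYTES:
        raise ValueError(f"size {file_size} does not fit the assumed layout")
    return payload // ITEM_BYTES


# Sizes copied from the LFS pointers in this commit.
old_entries = element_count(33352408)  # data/indices before the update
new_entries = element_count(30368988)  # data/indices after the update
new_indptr_len = element_count(559348)  # indptr after the update

print(old_entries, new_entries, new_indptr_len)
```

Under these assumptions the entry counts for `data.csc.index.npy` and `indices.csc.index.npy` come out equal to the "Number of tokens" values in the README table (8338070 before, 7592215 after), which suggests the sizes and the statistics were produced from the same index.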
params.index.json CHANGED
@@ -6,7 +6,7 @@
  "idf_method": "lucene",
  "dtype": "float32",
  "int_dtype": "int32",
- "num_docs": 831507,
- "version": "0.2.6",
+ "num_docs": 750312,
+ "version": "0.2.7post1",
  "backend": "numpy"
  }
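
Reviewer note: the README statistics and `num_docs` in `params.index.json` should agree with each other. A minimal sketch of that check follows, using only the numbers shown in this commit. The function name `average_tokens_per_doc` is hypothetical, and rounding to two decimals is an assumption based on how the README table is formatted.

```python
def average_tokens_per_doc(num_tokens: int, num_docs: int) -> float:
    """Average tokens per document, rounded to two decimals like the README table."""
    return round(num_tokens / num_docs, 2)


# (num_docs, num_tokens) as reported before and after this commit.
OLD = (831507, 8338070)
NEW = (750312, 7592215)

old_avg = average_tokens_per_doc(OLD[1], OLD[0])
new_avg = average_tokens_per_doc(NEW[1], NEW[0])

# How much the corpus shrank between the two versions.
removed_docs = OLD[0] - NEW[0]
removed_tokens = OLD[1] - NEW[1]

print(old_avg, new_avg, removed_docs, removed_tokens)
```

Both averages reproduce the values in the README table (10.03 and 10.12), and the new `num_docs` in `params.index.json` matches the README document count.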
vocab.index.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:736514f073d877676ddb3bb40b61362d0d9974b4e0826c84ed13ccd47e59b414
- size 6301373
+ oid sha256:d224cd8071210250b768abb4e03969c446bef1a2d382ad711325ad63c54af4c9
+ size 2283889
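
Reviewer note: every large file in this commit is stored as a Git LFS pointer (`version`, `oid`, `size`). After downloading the real files, they can be checked against these pointers. The following is a minimal sketch using only the Python standard library. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and are not part of `bm25s` or `huggingface_hub`.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Return True if the file at `path` has the pointer's size and digest."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Hash in 1 MiB chunks so large index files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# The new pointer for vocab.index.json, copied from this commit.
NEW_VOCAB_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d224cd8071210250b768abb4e03969c446bef1a2d382ad711325ad63c54af4c9
size 2283889
"""
pointer = parse_lfs_pointer(NEW_VOCAB_POINTER)
```

With a local copy of the file, `matches_pointer("vocab.index.json", pointer)` would report whether the download corresponds to this commit. The local path here is an assumption about where the file was saved.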