tien314 committed on
Commit eff795c · verified · 1 Parent(s): 28a816e

Update BM25S model

README.md CHANGED
@@ -11,7 +11,7 @@ tags:
 
  # BM25S Index
 
- This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.3`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
+ This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.6`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
 
  BM25S Related Links:
 
@@ -26,10 +26,10 @@ BM25S Related Links:
  You can install the `bm25s` library with `pip`:
 
  ```bash
- pip install "bm25s==0.2.3"
+ pip install "bm25s==0.2.6"
 
  # Include extra dependencies like stemmer
- pip install "bm25s[full]==0.2.3"
+ pip install "bm25s[full]==0.2.6"
 
  # For huggingface hub usage
  pip install huggingface_hub
@@ -123,9 +123,9 @@ This dataset was created using the following data:
 
  | Statistic | Value |
  | --- | --- |
- | Number of documents | 791616 |
- | Number of tokens | 8818694 |
- | Average tokens per document | 11.14 |
+ | Number of documents | 210801 |
+ | Number of tokens | 1247449 |
+ | Average tokens per document | 5.92 |
 
  ## Parameters
 
corpus.jsonl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:26b3f17468ce27a2a2624bd67aceaaaf9d1363f10cd8482f0aa1e71b414a4e83
- size 87071998
+ oid sha256:34a38e542a64bfd8e7e4f75c31ec7f9cae03b26196701e40904bc580cd971150
+ size 14491479
corpus.mmindex.json CHANGED
The diff for this file is too large to render. See raw diff
 
data.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f124c4c4d9beb8eedc20bf8ea6f452c2ce8472aeab724a46970182e3ffbea3d9
- size 35274904
+ oid sha256:99d46b75c6d39fd6b858be4b194d6385f77073896e33955d707086366a09801c
+ size 4989924
indices.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9e7e82d8f583b6dc5948c9c9faf2898bd9467db795477481ec1e2bc1bc4d1499
- size 35274904
+ oid sha256:6d615201000cc91e6961a8b2a102d79cffbfa87610645fbfe414753bcc592d6e
+ size 4989924
indptr.csc.index.npy CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:80cd1256b07bb59e9b8c2e5dbb706fff27da0daf336b77e4af3c6c4b38a8d6b6
- size 2249916
+ oid sha256:ee295a3f6f37a7988013298b018103c98aef8d5861590e4d87391285d40d266c
+ size 248616
params.index.json CHANGED
@@ -6,7 +6,7 @@
  "idf_method": "lucene",
  "dtype": "float32",
  "int_dtype": "int32",
- "num_docs": 791616,
- "version": "0.2.3",
+ "num_docs": 210801,
+ "version": "0.2.6",
  "backend": "numpy"
  }
vocab.index.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:baa46f4faa9fbd28089112f6903af07248635544d37bf4240ecdcd39b2bf684b
- size 11550214
+ oid sha256:03d5613c6c043f095d746de405aab145523713820d2756469e919e871e02f5e4
+ size 995887
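
As a quick sanity check on the new numbers in this commit, the following is a minimal sketch that recomputes the README's average tokens per document and compares the size of the index arrays against the reported token count. It assumes a 128-byte `.npy` header and 4-byte elements (in line with the `float32` and `int32` dtypes in `params.index.json`); the header size is an assumption about the file layout, not something the diff states.

```python
# Values copied from the "+" side of the diff in this commit.
num_docs = 210801
num_tokens = 1247449
reported_avg = 5.92

data_npy_size = 4989924      # data.csc.index.npy (bytes)
indices_npy_size = 4989924   # indices.csc.index.npy (bytes)

# Assumptions (not stated in the diff): a 128-byte .npy header and
# 4-byte elements for both arrays.
NPY_HEADER_BYTES = 128
ITEM_BYTES = 4

# Recompute the README statistic from the two counts.
avg = round(num_tokens / num_docs, 2)

# Estimate how many elements each array holds from its file size.
data_entries = (data_npy_size - NPY_HEADER_BYTES) // ITEM_BYTES
indices_entries = (indices_npy_size - NPY_HEADER_BYTES) // ITEM_BYTES

print("average matches README:", avg == reported_avg)
print("data entries:", data_entries)
print("indices entries:", indices_entries)
```

Under those assumptions, the estimated entry count for both arrays comes out equal to the README's "Number of tokens" value, which suggests the updated README and the uploaded index files are consistent with each other.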