github-actions[bot] committed on
Commit
e4316f1
·
1 Parent(s): d2e668f

Auto-sync from demo at Mon Oct 20 09:04:45 UTC 2025

graphgen/configs/aggregated_config.yaml CHANGED
@@ -1,5 +1,5 @@
1
  read:
2
- input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
@@ -18,5 +18,5 @@ partition: # graph partition configuration
18
  max_tokens_per_community: 10240 # max tokens per community
19
  unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
20
  generate:
21
- mode: aggregated # atomic, aggregated, multi_hop, cot
22
  data_format: ChatML # Alpaca, Sharegpt, ChatML
 
1
  read:
2
+ input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
 
18
  max_tokens_per_community: 10240 # max tokens per community
19
  unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
20
  generate:
21
+ mode: aggregated # atomic, aggregated, multi_hop, cot, vqa
22
  data_format: ChatML # Alpaca, Sharegpt, ChatML
graphgen/configs/atomic_config.yaml CHANGED
@@ -1,5 +1,5 @@
1
  read:
2
- input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
@@ -15,5 +15,5 @@ partition: # graph partition configuration
15
  method_params:
16
  max_units_per_community: 1 # atomic partition, one node or edge per community
17
  generate:
18
- mode: atomic # atomic, aggregated, multi_hop, cot
19
  data_format: Alpaca # Alpaca, Sharegpt, ChatML
 
1
  read:
2
+ input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv, pdf. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
 
15
  method_params:
16
  max_units_per_community: 1 # atomic partition, one node or edge per community
17
  generate:
18
+ mode: atomic # atomic, aggregated, multi_hop, cot, vqa
19
  data_format: Alpaca # Alpaca, Sharegpt, ChatML
graphgen/configs/cot_config.yaml CHANGED
@@ -1,5 +1,5 @@
1
  read:
2
- input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
@@ -15,5 +15,5 @@ partition: # graph partition configuration
15
  use_lcc: false # whether to use the largest connected component
16
  random_seed: 42 # random seed for partitioning
17
  generate:
18
- mode: cot # atomic, aggregated, multi_hop, cot
19
  data_format: Sharegpt # Alpaca, Sharegpt, ChatML
 
1
  read:
2
+ input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
 
15
  use_lcc: false # whether to use the largest connected component
16
  random_seed: 42 # random seed for partitioning
17
  generate:
18
+ mode: cot # atomic, aggregated, multi_hop, cot, vqa
19
  data_format: Sharegpt # Alpaca, Sharegpt, ChatML
graphgen/configs/multi_hop_config.yaml CHANGED
@@ -1,5 +1,5 @@
1
  read:
2
- input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
@@ -18,5 +18,5 @@ partition: # graph partition configuration
18
  max_tokens_per_community: 10240 # max tokens per community
19
  unit_sampling: random # unit sampling strategy, support: random, max_loss, min_loss
20
  generate:
21
- mode: multi_hop # strategy for generating multi-hop QA pairs
22
  data_format: ChatML # Alpaca, Sharegpt, ChatML
 
1
  read:
2
+ input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
3
  split:
4
  chunk_size: 1024 # chunk size for text splitting
5
  chunk_overlap: 100 # chunk overlap for text splitting
 
18
  max_tokens_per_community: 10240 # max tokens per community
19
  unit_sampling: random # unit sampling strategy, support: random, max_loss, min_loss
20
  generate:
21
+ mode: multi_hop # atomic, aggregated, multi_hop, cot, vqa
22
  data_format: ChatML # Alpaca, Sharegpt, ChatML
graphgen/configs/vqa_config.yaml ADDED
@@ -0,0 +1,22 @@
1
+ read:
2
+ input_file: resources/input_examples/pdf_demo.pdf # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
3
+ split:
4
+ chunk_size: 1024 # chunk size for text splitting
5
+ chunk_overlap: 100 # chunk overlap for text splitting
6
+ search: # web search configuration
7
+ enabled: false # whether to enable web search
8
+ search_types: ["google"] # search engine types, support: google, bing, uniprot, wikipedia
9
+ quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
10
+ enabled: true
11
+ quiz_samples: 2 # number of quiz samples to generate
12
+ re_judge: false # whether to re-judge the existing quiz samples
13
+ partition: # graph partition configuration
14
+ method: ece # ece is a custom partition method based on comprehension loss
15
+ method_params:
16
+ max_units_per_community: 20 # max nodes and edges per community
17
+ min_units_per_community: 5 # min nodes and edges per community
18
+ max_tokens_per_community: 10240 # max tokens per community
19
+ unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
20
+ generate:
21
+ mode: vqa # atomic, aggregated, multi_hop, cot, vqa
22
+ data_format: ChatML # Alpaca, Sharegpt, ChatML
graphgen/generate.py CHANGED
@@ -72,24 +72,11 @@ def main():
72
 
73
  graph_gen.search(search_config=config["search"])
74
 
75
- # Use pipeline according to the output data type
76
- if mode in ["atomic", "aggregated", "multi_hop"]:
77
- logger.info("Generation mode set to '%s'. Start generation.", mode)
78
- if "quiz_and_judge" in config and config["quiz_and_judge"]["enabled"]:
79
- graph_gen.quiz_and_judge(quiz_and_judge_config=config["quiz_and_judge"])
80
- else:
81
- logger.warning(
82
- "Quiz and Judge strategy is disabled. Edge sampling falls back to random."
83
- )
84
- assert (
85
- config["partition"]["method"] == "ece"
86
- and "method_params" in config["partition"]
87
- ), "Only ECE partition with edge sampling is supported."
88
- config["partition"]["method_params"]["edge_sampling"] = "random"
89
- elif mode == "cot":
90
- logger.info("Generation mode set to 'cot'. Start generation.")
91
- else:
92
- raise ValueError(f"Unsupported output data type: {mode}")
93
 
94
  graph_gen.generate(
95
  partition_config=config["partition"],
 
72
 
73
  graph_gen.search(search_config=config["search"])
74
 
75
+ if config.get("quiz_and_judge", {}).get("enabled"):
76
+ graph_gen.quiz_and_judge(quiz_and_judge_config=config["quiz_and_judge"])
77
+
78
+ # TODO: add data filtering step here in the future
79
+ # graph_gen.filter(filter_config=config["filter"])
80
 
81
  graph_gen.generate(
82
  partition_config=config["partition"],
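Note on the new gating in generate.py: the chained `dict.get` calls treat a missing `quiz_and_judge` section the same way as `enabled: false`. The sketch below illustrates that behaviour in isolation; the helper name `quiz_enabled` and the sample config dicts are made up for illustration and are not part of the commit.

```python
def quiz_enabled(config: dict) -> bool:
    """Mirror the check `config.get("quiz_and_judge", {}).get("enabled")`.

    Hypothetical helper: the commit inlines this expression in main().
    """
    return bool(config.get("quiz_and_judge", {}).get("enabled"))


# Section present and enabled: the quiz-and-judge step would run.
print(quiz_enabled({"quiz_and_judge": {"enabled": True}}))   # True
# Section present but disabled: the step is skipped.
print(quiz_enabled({"quiz_and_judge": {"enabled": False}}))  # False
# Section missing entirely: the `{}` default makes this a skip, not a KeyError.
print(quiz_enabled({}))                                      # False
```

One caveat: if the key is present but set to null (for example `{"quiz_and_judge": None}`), the `{}` default is not used and the second `.get` raises `AttributeError`.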
graphgen/graphgen.py CHANGED
@@ -91,7 +91,7 @@ class GraphGen:
91
  insert chunks into the graph
92
  """
93
  # Step 1: Read files
94
- data = read_files(read_config["input_file"])
95
  if len(data) == 0:
96
  logger.warning("No data to process")
97
  return
@@ -105,6 +105,7 @@ class GraphGen:
105
  "content": doc["content"]
106
  }
107
  for doc in data
 
108
  }
109
  _add_doc_keys = await self.full_docs_storage.filter_keys(list(new_docs.keys()))
110
  new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys}
 
91
  insert chunks into the graph
92
  """
93
  # Step 1: Read files
94
+ data = read_files(read_config["input_file"], self.working_dir)
95
  if len(data) == 0:
96
  logger.warning("No data to process")
97
  return
 
105
  "content": doc["content"]
106
  }
107
  for doc in data
108
+ if doc.get("type", "text") == "text"
109
  }
110
  _add_doc_keys = await self.full_docs_storage.filter_keys(list(new_docs.keys()))
111
  new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys}
graphgen/models/__init__.py CHANGED
@@ -4,6 +4,7 @@ from .generator import (
4
  AtomicGenerator,
5
  CoTGenerator,
6
  MultiHopGenerator,
 
7
  )
8
  from .kg_builder import LightRAGKGBuilder
9
  from .llm.openai_client import OpenAIClient
@@ -14,7 +15,7 @@ from .partitioner import (
14
  ECEPartitioner,
15
  LeidenPartitioner,
16
  )
17
- from .reader import CsvReader, JsonlReader, JsonReader, TxtReader
18
  from .search.db.uniprot_search import UniProtSearch
19
  from .search.kg.wiki_search import WikiSearch
20
  from .search.web.bing_search import BingSearch
 
4
  AtomicGenerator,
5
  CoTGenerator,
6
  MultiHopGenerator,
7
+ VQAGenerator,
8
  )
9
  from .kg_builder import LightRAGKGBuilder
10
  from .llm.openai_client import OpenAIClient
 
15
  ECEPartitioner,
16
  LeidenPartitioner,
17
  )
18
+ from .reader import CSVReader, JSONLReader, JSONReader, PDFReader, TXTReader
19
  from .search.db.uniprot_search import UniProtSearch
20
  from .search.kg.wiki_search import WikiSearch
21
  from .search.web.bing_search import BingSearch
graphgen/models/generator/__init__.py CHANGED
@@ -2,3 +2,4 @@ from .aggregated_generator import AggregatedGenerator
2
  from .atomic_generator import AtomicGenerator
3
  from .cot_generator import CoTGenerator
4
  from .multi_hop_generator import MultiHopGenerator
 
 
2
  from .atomic_generator import AtomicGenerator
3
  from .cot_generator import CoTGenerator
4
  from .multi_hop_generator import MultiHopGenerator
5
+ from .vqa_generator import VQAGenerator
graphgen/models/generator/vqa_generator.py ADDED
@@ -0,0 +1,23 @@
1
+ from dataclasses import dataclass
2
+ from typing import Any
3
+
4
+ from graphgen.bases import BaseGenerator
5
+
6
+
7
+ @dataclass
8
+ class VQAGenerator(BaseGenerator):
9
+ @staticmethod
10
+ def build_prompt(
11
+ batch: tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]
12
+ ) -> str:
13
+ raise NotImplementedError(
14
+ "VQAGenerator.build_prompt is not implemented. "
15
+ "Please provide an implementation for VQA prompt construction."
16
+ )
17
+
18
+ @staticmethod
19
+ def parse_response(response: str) -> Any:
20
+ raise NotImplementedError(
21
+ "VQAGenerator.parse_response is not implemented. "
22
+ "Please provide an implementation for VQA response parsing."
23
+ )
graphgen/models/reader/__init__.py CHANGED
@@ -1,4 +1,5 @@
1
- from .csv_reader import CsvReader
2
- from .json_reader import JsonReader
3
- from .jsonl_reader import JsonlReader
4
- from .txt_reader import TxtReader
 
 
1
+ from .csv_reader import CSVReader
2
+ from .json_reader import JSONReader
3
+ from .jsonl_reader import JSONLReader
4
+ from .pdf_reader import PDFReader
5
+ from .txt_reader import TXTReader
graphgen/models/reader/csv_reader.py CHANGED
@@ -5,7 +5,7 @@ import pandas as pd
5
  from graphgen.bases.base_reader import BaseReader
6
 
7
 
8
- class CsvReader(BaseReader):
9
  def read(self, file_path: str) -> List[Dict[str, Any]]:
10
 
11
  df = pd.read_csv(file_path)
 
5
  from graphgen.bases.base_reader import BaseReader
6
 
7
 
8
+ class CSVReader(BaseReader):
9
  def read(self, file_path: str) -> List[Dict[str, Any]]:
10
 
11
  df = pd.read_csv(file_path)
graphgen/models/reader/json_reader.py CHANGED
@@ -4,7 +4,7 @@ from typing import Any, Dict, List
4
  from graphgen.bases.base_reader import BaseReader
5
 
6
 
7
- class JsonReader(BaseReader):
8
  def read(self, file_path: str) -> List[Dict[str, Any]]:
9
  with open(file_path, "r", encoding="utf-8") as f:
10
  data = json.load(f)
 
4
  from graphgen.bases.base_reader import BaseReader
5
 
6
 
7
+ class JSONReader(BaseReader):
8
  def read(self, file_path: str) -> List[Dict[str, Any]]:
9
  with open(file_path, "r", encoding="utf-8") as f:
10
  data = json.load(f)
graphgen/models/reader/jsonl_reader.py CHANGED
@@ -5,7 +5,7 @@ from graphgen.bases.base_reader import BaseReader
5
  from graphgen.utils import logger
6
 
7
 
8
- class JsonlReader(BaseReader):
9
  def read(self, file_path: str) -> List[Dict[str, Any]]:
10
  docs = []
11
  with open(file_path, "r", encoding="utf-8") as f:
 
5
  from graphgen.utils import logger
6
 
7
 
8
+ class JSONLReader(BaseReader):
9
  def read(self, file_path: str) -> List[Dict[str, Any]]:
10
  docs = []
11
  with open(file_path, "r", encoding="utf-8") as f:
graphgen/models/reader/pdf_reader.py ADDED
@@ -0,0 +1,235 @@
1
+ import json
2
+ import os
3
+ import subprocess
4
+ import tempfile
5
+ from pathlib import Path
6
+ from typing import Any, Dict, List, Optional, Union
7
+
8
+ from graphgen.bases.base_reader import BaseReader
9
+ from graphgen.models.reader.txt_reader import TXTReader
10
+ from graphgen.utils import logger, pick_device
11
+
12
+
13
+ class PDFReader(BaseReader):
14
+ """
15
+ PDF files are converted using MinerU, see [MinerU](https://github.com/opendatalab/MinerU).
16
+ After conversion, the resulting markdown file is parsed into text, images, tables, and formulas which can be used
17
+ for multi-modal graph generation.
18
+ """
19
+
20
+ def __init__(
21
+ self,
22
+ *,
23
+ output_dir: Optional[Union[str, Path]] = None,
24
+ method: str = "auto", # auto | txt | ocr
25
+ lang: Optional[str] = None, # ch / en / ja / ...
26
+ backend: Optional[
27
+ str
28
+ ] = None, # pipeline | vlm-transformers | vlm-sglang-engine | vlm-sglang-client
29
+ device: Optional[str] = "auto", # cpu | cuda | cuda:0 | npu | mps | auto
30
+ source: Optional[str] = None, # huggingface | modelscope | local
31
+ vlm_url: Optional[str] = None, # 当 backend=vlm-sglang-client 时必填
32
+ start_page: Optional[int] = None, # 0-based
33
+ end_page: Optional[int] = None, # 0-based, inclusive
34
+ formula: bool = True,
35
+ table: bool = True,
36
+ return_assets: bool = True,
37
+ **other_mineru_kwargs: Any,
38
+ ):
39
+ super().__init__()
40
+ self.output_dir = os.path.join(output_dir, "mineru") if output_dir else None
41
+
42
+ if device == "auto":
43
+ device = pick_device()
44
+
45
+ self._default_kwargs: Dict[str, Any] = {
46
+ "method": method,
47
+ "lang": lang,
48
+ "backend": backend,
49
+ "device": device,
50
+ "source": source,
51
+ "vlm_url": vlm_url,
52
+ "start_page": start_page,
53
+ "end_page": end_page,
54
+ "formula": formula,
55
+ "table": table,
56
+ **other_mineru_kwargs,
57
+ }
58
+ self._default_kwargs = {
59
+ k: v for k, v in self._default_kwargs.items() if v is not None
60
+ }
61
+ self.return_assets = return_assets
62
+ self.parser = MinerUParser()
63
+ self.txt_reader = TXTReader()
64
+
65
+ def read(self, file_path: str, **override) -> List[Dict[str, Any]]:
66
+ """
67
+ file_path
68
+ **override: override MinerU parameters
69
+ """
70
+ pdf_path = Path(file_path).expanduser().resolve()
71
+ if not pdf_path.is_file():
72
+ raise FileNotFoundError(pdf_path)
73
+
74
+ kwargs = {**self._default_kwargs, **override}
75
+
76
+ mineru_result = self._call_mineru(pdf_path, kwargs)
77
+ return mineru_result
78
+
79
+ def _call_mineru(
80
+ self, pdf_path: Path, kwargs: Dict[str, Any]
81
+ ) -> List[Dict[str, Any]]:
82
+ output_dir: Optional[str] = None
83
+ if self.output_dir:
84
+ output_dir = str(self.output_dir)
85
+
86
+ return self.parser.parse_pdf(pdf_path, output_dir=output_dir, **kwargs)
87
+
88
+ def _locate_md(self, pdf_path: Path, kwargs: Dict[str, Any]) -> Optional[Path]:
89
+ out_dir = (
90
+ Path(self.output_dir) if self.output_dir else Path(tempfile.gettempdir())
91
+ )
92
+ method = kwargs.get("method", "auto")
93
+ backend = kwargs.get("backend", "")
94
+ if backend.startswith("vlm-"):
95
+ method = "vlm"
96
+
97
+ candidate = Path(
98
+ os.path.join(out_dir, pdf_path.stem, method, f"{pdf_path.stem}.md")
99
+ )
100
+ if candidate.exists():
101
+ return candidate
102
+ candidate = Path(os.path.join(out_dir, f"{pdf_path.stem}.md"))
103
+ if candidate.exists():
104
+ return candidate
105
+ return None
106
+
107
+
108
+ class MinerUParser:
109
+ def __init__(self) -> None:
110
+ self._check_bin()
111
+
112
+ @staticmethod
113
+ def parse_pdf(
114
+ pdf_path: Union[str, Path],
115
+ output_dir: Optional[Union[str, Path]] = None,
116
+ method: str = "auto",
117
+ device: str = "cpu",
118
+ **kw: Any,
119
+ ) -> List[Dict[str, Any]]:
120
+ pdf = Path(pdf_path).expanduser().resolve()
121
+ if not pdf.is_file():
122
+ raise FileNotFoundError(pdf)
123
+
124
+ out = (
125
+ Path(output_dir) if output_dir else Path(tempfile.mkdtemp(prefix="mineru_"))
126
+ )
127
+ out.mkdir(parents=True, exist_ok=True)
128
+
129
+ cached = MinerUParser._try_load_cached_result(str(out), pdf.stem, method)
130
+ if cached is not None:
131
+ return cached
132
+
133
+ MinerUParser._run_mineru(pdf, out, method, device, **kw)
134
+
135
+ cached = MinerUParser._try_load_cached_result(str(out), pdf.stem, method)
136
+ return cached if cached is not None else []
137
+
138
+ @staticmethod
139
+ def _try_load_cached_result(
140
+ out_dir: str, pdf_stem: str, method: str
141
+ ) -> Optional[List[Dict[str, Any]]]:
142
+ """
143
+ try to load cached json result from MinerU output.
144
+ :param out_dir:
145
+ :param pdf_stem:
146
+ :param method:
147
+ :return:
148
+ """
149
+ json_file = os.path.join(
150
+ out_dir, pdf_stem, method, f"{pdf_stem}_content_list.json"
151
+ )
152
+ if not os.path.exists(json_file):
153
+ return None
154
+
155
+ try:
156
+ with open(json_file, encoding="utf-8") as f:
157
+ data = json.load(f)
158
+ except Exception as exc: # pylint: disable=broad-except
159
+ logger.warning("Failed to load cached MinerU result: %s", exc)
160
+ return None
161
+
162
+ base = os.path.dirname(json_file)
163
+ results = []
164
+ for item in data:
165
+ for key in ("img_path", "table_img_path", "equation_img_path"):
166
+ rel_path = item.get(key)
167
+ if rel_path:
168
+ item[key] = str(Path(base).joinpath(rel_path).resolve())
169
+ if item["type"] == "text":
170
+ item["content"] = item["text"]
171
+ del item["text"]
172
+ for key in ("page_idx", "bbox", "text_level"):
173
+ if item.get(key) is not None:
174
+ del item[key]
175
+ if item["type"] == "text" and not item["content"].strip():
176
+ continue
177
+ results.append(item)
178
+ return results
179
+
180
+ @staticmethod
181
+ def _run_mineru(
182
+ pdf: Path,
183
+ out: Path,
184
+ method: str,
185
+ device: str,
186
+ **kw: Any,
187
+ ) -> None:
188
+ cmd = [
189
+ "mineru",
190
+ "-p",
191
+ str(pdf),
192
+ "-o",
193
+ str(out),
194
+ "-m",
195
+ method,
196
+ "-d",
197
+ device,
198
+ ]
199
+ for k, v in kw.items():
200
+ if v is None:
201
+ continue
202
+ if isinstance(v, bool):
203
+ cmd += [f"--{k}", str(v).lower()]
204
+ else:
205
+ cmd += [f"--{k}", str(v)]
206
+
207
+ logger.info("Parsing PDF with MinerU: %s", pdf)
208
+ logger.debug("Running MinerU command: %s", " ".join(cmd))
209
+
210
+ proc = subprocess.run(
211
+ cmd,
212
+ stdout=subprocess.PIPE,
213
+ stderr=subprocess.PIPE,
214
+ text=True,
215
+ encoding="utf-8",
216
+ errors="ignore",
217
+ check=False,
218
+ )
219
+ if proc.returncode != 0:
220
+ raise RuntimeError(f"MinerU failed: {proc.stderr or proc.stdout}")
221
+
222
+ @staticmethod
223
+ def _check_bin() -> None:
224
+ try:
225
+ subprocess.run(
226
+ ["mineru", "--version"],
227
+ stdout=subprocess.DEVNULL,
228
+ stderr=subprocess.DEVNULL,
229
+ check=True,
230
+ )
231
+ except (subprocess.CalledProcessError, FileNotFoundError) as exc:
232
+ raise RuntimeError(
233
+ "MinerU is not installed or not found in PATH. Please install it from pip: \n"
234
+ "pip install -U 'mineru[core]'"
235
+ ) from exc
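Note on `_try_load_cached_result`: the per-item clean-up can be tried without MinerU installed. The sketch below applies the same steps to an in-memory list. The function name `normalize_items`, the sample items, and the base directory are illustrative assumptions, and unlike the committed code it works on a copy of each item instead of mutating it in place.

```python
from pathlib import Path
from typing import Any, Dict, List


def normalize_items(items: List[Dict[str, Any]], base: str) -> List[Dict[str, Any]]:
    """Apply the same per-item clean-up as the cached-result loader (sketch)."""
    results = []
    for raw in items:
        item = dict(raw)  # shallow copy so the caller's data is left untouched
        # Turn relative asset paths into absolute ones rooted at `base`.
        for key in ("img_path", "table_img_path", "equation_img_path"):
            rel_path = item.get(key)
            if rel_path:
                item[key] = str(Path(base).joinpath(rel_path).resolve())
        # Text items expose their payload under "content" instead of "text".
        if item["type"] == "text":
            item["content"] = item.pop("text")
        # Drop layout metadata that later pipeline stages do not use.
        for key in ("page_idx", "bbox", "text_level"):
            if item.get(key) is not None:
                del item[key]
        # Skip text items that are empty or whitespace only.
        if item["type"] == "text" and not item["content"].strip():
            continue
        results.append(item)
    return results


# Made-up items shaped like entries of a MinerU content list.
sample = [
    {"type": "text", "text": "Hello", "page_idx": 0},
    {"type": "text", "text": "   ", "page_idx": 0},
    {"type": "image", "img_path": "images/a.png", "page_idx": 1},
]
out = normalize_items(sample, "/tmp/mineru/demo/auto")
print(out)
```

Non-text items are kept by this step; they are filtered out later in `GraphGen` by the `doc.get("type", "text") == "text"` condition.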
graphgen/models/reader/txt_reader.py CHANGED
@@ -3,7 +3,7 @@ from typing import Any, Dict, List
3
  from graphgen.bases.base_reader import BaseReader
4
 
5
 
6
- class TxtReader(BaseReader):
7
  def read(self, file_path: str) -> List[Dict[str, Any]]:
8
  docs = []
9
  with open(file_path, "r", encoding="utf-8") as f:
 
3
  from graphgen.bases.base_reader import BaseReader
4
 
5
 
6
+ class TXTReader(BaseReader):
7
  def read(self, file_path: str) -> List[Dict[str, Any]]:
8
  docs = []
9
  with open(file_path, "r", encoding="utf-8") as f:
graphgen/operators/generate/generate_qas.py CHANGED
@@ -6,6 +6,7 @@ from graphgen.models import (
6
  AtomicGenerator,
7
  CoTGenerator,
8
  MultiHopGenerator,
 
9
  )
10
  from graphgen.utils import logger, run_concurrent
11
 
@@ -39,6 +40,8 @@ async def generate_qas(
39
  generator = MultiHopGenerator(llm_client)
40
  elif mode == "cot":
41
  generator = CoTGenerator(llm_client)
 
 
42
  else:
43
  raise ValueError(f"Unsupported generation mode: {mode}")
44
 
 
6
  AtomicGenerator,
7
  CoTGenerator,
8
  MultiHopGenerator,
9
+ VQAGenerator,
10
  )
11
  from graphgen.utils import logger, run_concurrent
12
 
 
40
  generator = MultiHopGenerator(llm_client)
41
  elif mode == "cot":
42
  generator = CoTGenerator(llm_client)
43
+ elif mode == "vqa":
44
+ generator = VQAGenerator(llm_client)
45
  else:
46
  raise ValueError(f"Unsupported generation mode: {mode}")
47
 
graphgen/operators/read/read_files.py CHANGED
@@ -1,16 +1,22 @@
1
- from graphgen.models import CsvReader, JsonlReader, JsonReader, TxtReader
2
 
3
  _MAPPING = {
4
- "jsonl": JsonlReader,
5
- "json": JsonReader,
6
- "txt": TxtReader,
7
- "csv": CsvReader,
 
8
  }
9
 
10
 
11
- def read_files(file_path: str):
12
- suffix = file_path.split(".")[-1]
13
- if suffix in _MAPPING:
 
 
 
 
 
14
  reader = _MAPPING[suffix]()
15
  else:
16
  raise ValueError(
 
1
+ from graphgen.models import CSVReader, JSONLReader, JSONReader, PDFReader, TXTReader
2
 
3
  _MAPPING = {
4
+ "jsonl": JSONLReader,
5
+ "json": JSONReader,
6
+ "txt": TXTReader,
7
+ "csv": CSVReader,
8
+ "pdf": PDFReader,
9
  }
10
 
11
 
12
+ def read_files(file_path: str, cache_dir: str | None = None) -> list[dict]:
13
+ suffix = file_path.split(".")[-1].lower()
14
+ if suffix == "pdf":
15
+ if cache_dir is not None:
16
+ reader = _MAPPING[suffix](output_dir=cache_dir)
17
+ else:
18
+ reader = _MAPPING[suffix]()
19
+ elif suffix in _MAPPING:
20
  reader = _MAPPING[suffix]()
21
  else:
22
  raise ValueError(
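Note on the suffix dispatch in `read_files`: the added `.lower()` makes the lookup case-insensitive, so an upper-case extension resolves to the same reader. A minimal sketch follows, with reader classes replaced by their names as strings so it runs without graphgen. The function name `reader_name_for` and the error message text are assumptions; the hunk above is cut off before the real message.

```python
# Reader names stand in for the real classes so the sketch has no dependencies.
_MAPPING = {
    "jsonl": "JSONLReader",
    "json": "JSONReader",
    "txt": "TXTReader",
    "csv": "CSVReader",
    "pdf": "PDFReader",
}


def reader_name_for(file_path: str) -> str:
    """Return the reader name for a path, matching the suffix case-insensitively."""
    suffix = file_path.split(".")[-1].lower()
    if suffix not in _MAPPING:
        # Hypothetical message; the original text is not visible in the diff.
        raise ValueError(f"Unsupported file format: {suffix}")
    return _MAPPING[suffix]


print(reader_name_for("resources/input_examples/pdf_demo.pdf"))
print(reader_name_for("REPORT.PDF"))
```

A path without any dot falls through to the `ValueError` branch, because `split(".")[-1]` then returns the whole path.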
graphgen/utils/__init__.py CHANGED
@@ -1,5 +1,6 @@
1
  from .calculate_confidence import yes_no_loss_entropy
2
  from .detect_lang import detect_if_chinese, detect_main_language
 
3
  from .format import (
4
  handle_single_entity_extraction,
5
  handle_single_relationship_extraction,
 
1
  from .calculate_confidence import yes_no_loss_entropy
2
  from .detect_lang import detect_if_chinese, detect_main_language
3
+ from .device import pick_device
4
  from .format import (
5
  handle_single_entity_extraction,
6
  handle_single_relationship_extraction,
graphgen/utils/device.py ADDED
@@ -0,0 +1,44 @@
1
+ import shutil
2
+ import subprocess
3
+ import sys
4
+
5
+
6
+ def pick_device() -> str:
7
+ """Return the best available device string for MinerU."""
8
+ # 1. NVIDIA GPU
9
+ if shutil.which("nvidia-smi") is not None:
10
+ try:
11
+ # check if there's any free GPU memory
12
+ out = subprocess.check_output(
13
+ [
14
+ "nvidia-smi",
15
+ "--query-gpu=memory.free",
16
+ "--format=csv,noheader,nounits",
17
+ ],
18
+ text=True,
19
+ )
20
+ if any(int(line) > 0 for line in out.strip().splitlines()):
21
+ return "cuda:0"
22
+ except Exception: # pylint: disable=broad-except
23
+ pass
24
+
25
+ # 2. Apple Silicon
26
+ if sys.platform == "darwin" and shutil.which("sysctl"):
27
+ try:
28
+ brand = subprocess.check_output(
29
+ ["sysctl", "-n", "machdep.cpu.brand_string"], text=True
30
+ )
31
+ if "Apple" in brand:
32
+ return "mps"
33
+ except Exception: # pylint: disable=broad-except
34
+ pass
35
+
36
+ # 3. Ascend NPU
37
+ if shutil.which("npu-smi") is not None:
38
+ try:
39
+ subprocess.check_call(["npu-smi", "info"], stdout=subprocess.DEVNULL)
40
+ return "npu"
41
+ except Exception: # pylint: disable=broad-except
42
+ pass
43
+
44
+ return "cpu"
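Note on the NVIDIA branch of `pick_device`: the decision comes down to parsing one integer per line from the `nvidia-smi` output. The sketch below isolates that parsing step so it can be exercised without a GPU. The helper name `has_free_gpu` and the sample strings are illustrative and not part of the commit.

```python
def has_free_gpu(smi_output: str) -> bool:
    """Return True if any line reports more than 0 MiB of free memory.

    Mirrors `any(int(line) > 0 for line in out.strip().splitlines())`.
    """
    return any(int(line) > 0 for line in smi_output.strip().splitlines())


# Two GPUs, the first with free memory.
print(has_free_gpu("1024\n0\n"))  # True
# Two GPUs, both fully occupied.
print(has_free_gpu("0\n0\n"))     # False
# Empty output: any() over an empty sequence is False.
print(has_free_gpu(""))           # False
```

A line that is not an integer makes `int()` raise `ValueError`. In the committed function the broad `except` swallows that error and the selection falls through to the next device check.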
webui/examples/csv_demo.csv CHANGED
@@ -1,5 +1,5 @@
1
- content
2
- "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"
3
- "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"
4
- "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."
5
- "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."
 
1
+ type,content
2
+ text,云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。
3
+ text,隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。
4
+ text,"Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."
5
+ text,"Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."
webui/examples/json_demo.json CHANGED
@@ -1,6 +1,6 @@
1
  [
2
- {"content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
3
- {"content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
4
- {"content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
5
- {"content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
6
  ]
 
1
  [
2
+ {"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
3
+ {"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
4
+ {"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
5
+ {"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
6
  ]
webui/examples/jsonl_demo.jsonl CHANGED
@@ -1,4 +1,4 @@
1
- {"content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"}
2
- {"content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"}
3
- {"content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."}
4
- {"content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
 
1
+ {"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"}
2
+ {"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"}
3
+ {"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."}
4
+ {"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
webui/examples/vqa_demo.json ADDED
@@ -0,0 +1,6 @@
1
+ [
2
+ {"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
3
+ {"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
4
+ {"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
5
+ {"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
6
+ ]
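The hunks above make the same change to every demo record: a `"type": "text"` key is added ahead of `"content"`. A minimal sketch of how older input files could be brought to that shape follows. The helper names `add_text_type` and `upgrade_jsonl` are hypothetical and not part of this commit, and it is an assumption that records lacking a `"type"` key should be treated as plain text.

```python
import json


def add_text_type(records):
    """Return new records with a "type" key placed first.

    Assumption: a record without "type" is plain text, so it gets "text".
    A record that already has a "type" keeps its value.
    """
    upgraded = []
    for record in records:
        rest = {k: v for k, v in record.items() if k != "type"}
        # "type" goes first to mirror the key order in the updated demo files.
        upgraded.append({"type": record.get("type", "text"), **rest})
    return upgraded


def upgrade_jsonl(text):
    """Upgrade JSONL text (one JSON object per line), skipping blank lines."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # ensure_ascii=False keeps non-ASCII content readable instead of \u escapes.
    return "\n".join(
        json.dumps(record, ensure_ascii=False) for record in add_text_type(records)
    )


if __name__ == "__main__":
    legacy = [{"content": "Grain size is one of the key factors determining grain yield."}]
    print(json.dumps(add_text_type(legacy), ensure_ascii=False, indent=2))
```

The input records are copied rather than modified in place, so the original list can still be compared against the upgraded one.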