github-actions[bot] committed
Commit · e4316f1
1 Parent(s): d2e668f
Auto-sync from demo at Mon Oct 20 09:04:45 UTC 2025
- graphgen/configs/aggregated_config.yaml +2 -2
- graphgen/configs/atomic_config.yaml +2 -2
- graphgen/configs/cot_config.yaml +2 -2
- graphgen/configs/multi_hop_config.yaml +2 -2
- graphgen/configs/vqa_config.yaml +22 -0
- graphgen/generate.py +5 -18
- graphgen/graphgen.py +2 -1
- graphgen/models/__init__.py +2 -1
- graphgen/models/generator/__init__.py +1 -0
- graphgen/models/generator/vqa_generator.py +23 -0
- graphgen/models/reader/__init__.py +5 -4
- graphgen/models/reader/csv_reader.py +1 -1
- graphgen/models/reader/json_reader.py +1 -1
- graphgen/models/reader/jsonl_reader.py +1 -1
- graphgen/models/reader/pdf_reader.py +235 -0
- graphgen/models/reader/txt_reader.py +1 -1
- graphgen/operators/generate/generate_qas.py +3 -0
- graphgen/operators/read/read_files.py +14 -8
- graphgen/utils/__init__.py +1 -0
- graphgen/utils/device.py +44 -0
- webui/examples/csv_demo.csv +5 -5
- webui/examples/json_demo.json +4 -4
- webui/examples/jsonl_demo.jsonl +4 -4
- webui/examples/vqa_demo.json +6 -0
graphgen/configs/aggregated_config.yaml
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
read:
|
| 2 |
-
input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
@@ -18,5 +18,5 @@ partition: # graph partition configuration
|
|
| 18 |
max_tokens_per_community: 10240 # max tokens per community
|
| 19 |
unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
|
| 20 |
generate:
|
| 21 |
-
mode: aggregated # atomic, aggregated, multi_hop, cot
|
| 22 |
data_format: ChatML # Alpaca, Sharegpt, ChatML
|
|
|
|
| 1 |
read:
|
| 2 |
+
input_file: resources/input_examples/jsonl_demo.jsonl # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
|
|
| 18 |
max_tokens_per_community: 10240 # max tokens per community
|
| 19 |
unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
|
| 20 |
generate:
|
| 21 |
+
mode: aggregated # atomic, aggregated, multi_hop, cot, vqa
|
| 22 |
data_format: ChatML # Alpaca, Sharegpt, ChatML
|
graphgen/configs/atomic_config.yaml
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
read:
|
| 2 |
-
input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
@@ -15,5 +15,5 @@ partition: # graph partition configuration
|
|
| 15 |
method_params:
|
| 16 |
max_units_per_community: 1 # atomic partition, one node or edge per community
|
| 17 |
generate:
|
| 18 |
-
mode: atomic # atomic, aggregated, multi_hop, cot
|
| 19 |
data_format: Alpaca # Alpaca, Sharegpt, ChatML
|
|
|
|
| 1 |
read:
|
| 2 |
+
input_file: resources/input_examples/json_demo.json # input file path, support json, jsonl, txt, csv, pdf. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
|
|
| 15 |
method_params:
|
| 16 |
max_units_per_community: 1 # atomic partition, one node or edge per community
|
| 17 |
generate:
|
| 18 |
+
mode: atomic # atomic, aggregated, multi_hop, cot, vqa
|
| 19 |
data_format: Alpaca # Alpaca, Sharegpt, ChatML
|
graphgen/configs/cot_config.yaml
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
read:
|
| 2 |
-
input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
@@ -15,5 +15,5 @@ partition: # graph partition configuration
|
|
| 15 |
use_lcc: false # whether to use the largest connected component
|
| 16 |
random_seed: 42 # random seed for partitioning
|
| 17 |
generate:
|
| 18 |
-
mode: cot # atomic, aggregated, multi_hop, cot
|
| 19 |
data_format: Sharegpt # Alpaca, Sharegpt, ChatML
|
|
|
|
| 1 |
read:
|
| 2 |
+
input_file: resources/input_examples/txt_demo.txt # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
|
|
| 15 |
use_lcc: false # whether to use the largest connected component
|
| 16 |
random_seed: 42 # random seed for partitioning
|
| 17 |
generate:
|
| 18 |
+
mode: cot # atomic, aggregated, multi_hop, cot, vqa
|
| 19 |
data_format: Sharegpt # Alpaca, Sharegpt, ChatML
|
graphgen/configs/multi_hop_config.yaml
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
read:
|
| 2 |
-
input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
@@ -18,5 +18,5 @@ partition: # graph partition configuration
|
|
| 18 |
max_tokens_per_community: 10240 # max tokens per community
|
| 19 |
unit_sampling: random # unit sampling strategy, support: random, max_loss, min_loss
|
| 20 |
generate:
|
| 21 |
-
mode: multi_hop #
|
| 22 |
data_format: ChatML # Alpaca, Sharegpt, ChatML
|
|
|
|
| 1 |
read:
|
| 2 |
+
input_file: resources/input_examples/csv_demo.csv # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
|
| 3 |
split:
|
| 4 |
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
chunk_overlap: 100 # chunk overlap for text splitting
|
|
|
|
| 18 |
max_tokens_per_community: 10240 # max tokens per community
|
| 19 |
unit_sampling: random # unit sampling strategy, support: random, max_loss, min_loss
|
| 20 |
generate:
|
| 21 |
+
mode: multi_hop # atomic, aggregated, multi_hop, cot, vqa
|
| 22 |
data_format: ChatML # Alpaca, Sharegpt, ChatML
|
graphgen/configs/vqa_config.yaml
ADDED
|
@@ -0,0 +1,22 @@
|
|
| 1 |
+
read:
|
| 2 |
+
input_file: resources/input_examples/pdf_demo.pdf # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
|
| 3 |
+
split:
|
| 4 |
+
chunk_size: 1024 # chunk size for text splitting
|
| 5 |
+
chunk_overlap: 100 # chunk overlap for text splitting
|
| 6 |
+
search: # web search configuration
|
| 7 |
+
enabled: false # whether to enable web search
|
| 8 |
+
search_types: ["google"] # search engine types, support: google, bing, uniprot, wikipedia
|
| 9 |
+
quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
|
| 10 |
+
enabled: true
|
| 11 |
+
quiz_samples: 2 # number of quiz samples to generate
|
| 12 |
+
re_judge: false # whether to re-judge the existing quiz samples
|
| 13 |
+
partition: # graph partition configuration
|
| 14 |
+
method: ece # ece is a custom partition method based on comprehension loss
|
| 15 |
+
method_params:
|
| 16 |
+
max_units_per_community: 20 # max nodes and edges per community
|
| 17 |
+
min_units_per_community: 5 # min nodes and edges per community
|
| 18 |
+
max_tokens_per_community: 10240 # max tokens per community
|
| 19 |
+
unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
|
| 20 |
+
generate:
|
| 21 |
+
mode: vqa # atomic, aggregated, multi_hop, cot, vqa
|
| 22 |
+
data_format: ChatML # Alpaca, Sharegpt, ChatML
|
graphgen/generate.py
CHANGED
|
@@ -72,24 +72,11 @@ def main():
|
|
| 72 |
|
| 73 |
graph_gen.search(search_config=config["search"])
|
| 74 |
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
else:
|
| 81 |
-
logger.warning(
|
| 82 |
-
"Quiz and Judge strategy is disabled. Edge sampling falls back to random."
|
| 83 |
-
)
|
| 84 |
-
assert (
|
| 85 |
-
config["partition"]["method"] == "ece"
|
| 86 |
-
and "method_params" in config["partition"]
|
| 87 |
-
), "Only ECE partition with edge sampling is supported."
|
| 88 |
-
config["partition"]["method_params"]["edge_sampling"] = "random"
|
| 89 |
-
elif mode == "cot":
|
| 90 |
-
logger.info("Generation mode set to 'cot'. Start generation.")
|
| 91 |
-
else:
|
| 92 |
-
raise ValueError(f"Unsupported output data type: {mode}")
|
| 93 |
|
| 94 |
graph_gen.generate(
|
| 95 |
partition_config=config["partition"],
|
|
|
|
| 72 |
|
| 73 |
graph_gen.search(search_config=config["search"])
|
| 74 |
|
| 75 |
+
if config.get("quiz_and_judge", {}).get("enabled"):
|
| 76 |
+
graph_gen.quiz_and_judge(quiz_and_judge_config=config["quiz_and_judge"])
|
| 77 |
+
|
| 78 |
+
# TODO: add data filtering step here in the future
|
| 79 |
+
# graph_gen.filter(filter_config=config["filter"])
|
| 80 |
|
| 81 |
graph_gen.generate(
|
| 82 |
partition_config=config["partition"],
|
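Reviewer note: the new guard in generate.py chains two `.get` calls, so a config with no `quiz_and_judge` section skips the step instead of raising `KeyError`. The following is a minimal sketch of that behaviour only; `quiz_enabled` is a hypothetical helper name used for illustration and is not part of this commit.

```python
def quiz_enabled(config: dict) -> bool:
    """Mirror the guard: true only if quiz_and_judge exists and 'enabled' is truthy."""
    return bool(config.get("quiz_and_judge", {}).get("enabled"))


# Section present and enabled, as in vqa_config.yaml.
print(quiz_enabled({"quiz_and_judge": {"enabled": True, "quiz_samples": 2}}))  # True
# Section present but disabled.
print(quiz_enabled({"quiz_and_judge": {"enabled": False}}))  # False
# Section missing entirely: the {} default makes the second .get safe.
print(quiz_enabled({}))  # False
```

One caveat that follows from plain Python semantics rather than from anything stated in the commit: if the key is present but empty in the YAML file, a loader that maps it to `None` (as PyYAML does) would make the second `.get` raise `AttributeError`, because the `{}` default is only used when the key is absent.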
graphgen/graphgen.py
CHANGED
|
@@ -91,7 +91,7 @@ class GraphGen:
|
|
| 91 |
insert chunks into the graph
|
| 92 |
"""
|
| 93 |
# Step 1: Read files
|
| 94 |
-
data = read_files(read_config["input_file"])
|
| 95 |
if len(data) == 0:
|
| 96 |
logger.warning("No data to process")
|
| 97 |
return
|
|
@@ -105,6 +105,7 @@ class GraphGen:
|
|
| 105 |
"content": doc["content"]
|
| 106 |
}
|
| 107 |
for doc in data
|
|
|
|
| 108 |
}
|
| 109 |
_add_doc_keys = await self.full_docs_storage.filter_keys(list(new_docs.keys()))
|
| 110 |
new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys}
|
|
|
|
| 91 |
insert chunks into the graph
|
| 92 |
"""
|
| 93 |
# Step 1: Read files
|
| 94 |
+
data = read_files(read_config["input_file"], self.working_dir)
|
| 95 |
if len(data) == 0:
|
| 96 |
logger.warning("No data to process")
|
| 97 |
return
|
|
|
|
| 105 |
"content": doc["content"]
|
| 106 |
}
|
| 107 |
for doc in data
|
| 108 |
+
if doc.get("type", "text") == "text"
|
| 109 |
}
|
| 110 |
_add_doc_keys = await self.full_docs_storage.filter_keys(list(new_docs.keys()))
|
| 111 |
new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys}
|
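Reviewer note: the added `if doc.get("type", "text") == "text"` clause means that only text records enter `new_docs`, and that records without a `type` field (as the older readers may produce) are treated as text. A small sketch of that filter in isolation, assuming records are plain dicts; `select_text_docs` and the sample records are hypothetical and not taken from the repository.

```python
from typing import Any, Dict, List


def select_text_docs(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep records whose 'type' is 'text'; a missing 'type' counts as text."""
    return [doc for doc in data if doc.get("type", "text") == "text"]


records = [
    {"content": "record without a type field"},
    {"type": "text", "content": "paragraph parsed from a PDF"},
    {"type": "image", "img_path": "/tmp/example.png"},  # hypothetical path
]
kept = select_text_docs(records)
print(len(kept))  # 2
```

Non-text items such as images or tables are dropped at this step, so anything that needs them would have to read them from `data` separately.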
graphgen/models/__init__.py
CHANGED
|
@@ -4,6 +4,7 @@ from .generator import (
|
|
| 4 |
AtomicGenerator,
|
| 5 |
CoTGenerator,
|
| 6 |
MultiHopGenerator,
|
|
|
|
| 7 |
)
|
| 8 |
from .kg_builder import LightRAGKGBuilder
|
| 9 |
from .llm.openai_client import OpenAIClient
|
|
@@ -14,7 +15,7 @@ from .partitioner import (
|
|
| 14 |
ECEPartitioner,
|
| 15 |
LeidenPartitioner,
|
| 16 |
)
|
| 17 |
-
from .reader import
|
| 18 |
from .search.db.uniprot_search import UniProtSearch
|
| 19 |
from .search.kg.wiki_search import WikiSearch
|
| 20 |
from .search.web.bing_search import BingSearch
|
|
|
|
| 4 |
AtomicGenerator,
|
| 5 |
CoTGenerator,
|
| 6 |
MultiHopGenerator,
|
| 7 |
+
VQAGenerator,
|
| 8 |
)
|
| 9 |
from .kg_builder import LightRAGKGBuilder
|
| 10 |
from .llm.openai_client import OpenAIClient
|
|
|
|
| 15 |
ECEPartitioner,
|
| 16 |
LeidenPartitioner,
|
| 17 |
)
|
| 18 |
+
from .reader import CSVReader, JSONLReader, JSONReader, PDFReader, TXTReader
|
| 19 |
from .search.db.uniprot_search import UniProtSearch
|
| 20 |
from .search.kg.wiki_search import WikiSearch
|
| 21 |
from .search.web.bing_search import BingSearch
|
graphgen/models/generator/__init__.py
CHANGED
|
@@ -2,3 +2,4 @@ from .aggregated_generator import AggregatedGenerator
|
|
| 2 |
from .atomic_generator import AtomicGenerator
|
| 3 |
from .cot_generator import CoTGenerator
|
| 4 |
from .multi_hop_generator import MultiHopGenerator
|
|
|
|
|
|
| 2 |
from .atomic_generator import AtomicGenerator
|
| 3 |
from .cot_generator import CoTGenerator
|
| 4 |
from .multi_hop_generator import MultiHopGenerator
|
| 5 |
+
from .vqa_generator import VQAGenerator
|
graphgen/models/generator/vqa_generator.py
ADDED
|
@@ -0,0 +1,23 @@
|
|
| 1 |
+
from dataclasses import dataclass
|
| 2 |
+
from typing import Any
|
| 3 |
+
|
| 4 |
+
from graphgen.bases import BaseGenerator
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
@dataclass
|
| 8 |
+
class VQAGenerator(BaseGenerator):
|
| 9 |
+
@staticmethod
|
| 10 |
+
def build_prompt(
|
| 11 |
+
batch: tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]
|
| 12 |
+
) -> str:
|
| 13 |
+
raise NotImplementedError(
|
| 14 |
+
"VQAGenerator.build_prompt is not implemented. "
|
| 15 |
+
"Please provide an implementation for VQA prompt construction."
|
| 16 |
+
)
|
| 17 |
+
|
| 18 |
+
@staticmethod
|
| 19 |
+
def parse_response(response: str) -> Any:
|
| 20 |
+
raise NotImplementedError(
|
| 21 |
+
"VQAGenerator.parse_response is not implemented. "
|
| 22 |
+
"Please provide an implementation for VQA response parsing."
|
| 23 |
+
)
|
graphgen/models/reader/__init__.py
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
-
from .csv_reader import
|
| 2 |
-
from .json_reader import
|
| 3 |
-
from .jsonl_reader import
|
| 4 |
-
from .
|
|
|
|
|
|
| 1 |
+
from .csv_reader import CSVReader
|
| 2 |
+
from .json_reader import JSONReader
|
| 3 |
+
from .jsonl_reader import JSONLReader
|
| 4 |
+
from .pdf_reader import PDFReader
|
| 5 |
+
from .txt_reader import TXTReader
|
graphgen/models/reader/csv_reader.py
CHANGED
|
@@ -5,7 +5,7 @@ import pandas as pd
|
|
| 5 |
from graphgen.bases.base_reader import BaseReader
|
| 6 |
|
| 7 |
|
| 8 |
-
class
|
| 9 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 10 |
|
| 11 |
df = pd.read_csv(file_path)
|
|
|
|
| 5 |
from graphgen.bases.base_reader import BaseReader
|
| 6 |
|
| 7 |
|
| 8 |
+
class CSVReader(BaseReader):
|
| 9 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 10 |
|
| 11 |
df = pd.read_csv(file_path)
|
graphgen/models/reader/json_reader.py
CHANGED
|
@@ -4,7 +4,7 @@ from typing import Any, Dict, List
|
|
| 4 |
from graphgen.bases.base_reader import BaseReader
|
| 5 |
|
| 6 |
|
| 7 |
-
class
|
| 8 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 9 |
with open(file_path, "r", encoding="utf-8") as f:
|
| 10 |
data = json.load(f)
|
|
|
|
| 4 |
from graphgen.bases.base_reader import BaseReader
|
| 5 |
|
| 6 |
|
| 7 |
+
class JSONReader(BaseReader):
|
| 8 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 9 |
with open(file_path, "r", encoding="utf-8") as f:
|
| 10 |
data = json.load(f)
|
graphgen/models/reader/jsonl_reader.py
CHANGED
|
@@ -5,7 +5,7 @@ from graphgen.bases.base_reader import BaseReader
|
|
| 5 |
from graphgen.utils import logger
|
| 6 |
|
| 7 |
|
| 8 |
-
class
|
| 9 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 10 |
docs = []
|
| 11 |
with open(file_path, "r", encoding="utf-8") as f:
|
|
|
|
| 5 |
from graphgen.utils import logger
|
| 6 |
|
| 7 |
|
| 8 |
+
class JSONLReader(BaseReader):
|
| 9 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 10 |
docs = []
|
| 11 |
with open(file_path, "r", encoding="utf-8") as f:
|
graphgen/models/reader/pdf_reader.py
ADDED
|
@@ -0,0 +1,235 @@
|
|
| 1 |
+
import json
|
| 2 |
+
import os
|
| 3 |
+
import subprocess
|
| 4 |
+
import tempfile
|
| 5 |
+
from pathlib import Path
|
| 6 |
+
from typing import Any, Dict, List, Optional, Union
|
| 7 |
+
|
| 8 |
+
from graphgen.bases.base_reader import BaseReader
|
| 9 |
+
from graphgen.models.reader.txt_reader import TXTReader
|
| 10 |
+
from graphgen.utils import logger, pick_device
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class PDFReader(BaseReader):
|
| 14 |
+
"""
|
| 15 |
+
PDF files are converted using MinerU, see [MinerU](https://github.com/opendatalab/MinerU).
|
| 16 |
+
After conversion, the resulting markdown file is parsed into text, images, tables, and formulas which can be used
|
| 17 |
+
for multi-modal graph generation.
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
def __init__(
|
| 21 |
+
self,
|
| 22 |
+
*,
|
| 23 |
+
output_dir: Optional[Union[str, Path]] = None,
|
| 24 |
+
method: str = "auto", # auto | txt | ocr
|
| 25 |
+
lang: Optional[str] = None, # ch / en / ja / ...
|
| 26 |
+
backend: Optional[
|
| 27 |
+
str
|
| 28 |
+
] = None, # pipeline | vlm-transformers | vlm-sglang-engine | vlm-sglang-client
|
| 29 |
+
device: Optional[str] = "auto", # cpu | cuda | cuda:0 | npu | mps | auto
|
| 30 |
+
source: Optional[str] = None, # huggingface | modelscope | local
|
| 31 |
+
vlm_url: Optional[str] = None, # required when backend=vlm-sglang-client
|
| 32 |
+
start_page: Optional[int] = None, # 0-based
|
| 33 |
+
end_page: Optional[int] = None, # 0-based, inclusive
|
| 34 |
+
formula: bool = True,
|
| 35 |
+
table: bool = True,
|
| 36 |
+
return_assets: bool = True,
|
| 37 |
+
**other_mineru_kwargs: Any,
|
| 38 |
+
):
|
| 39 |
+
super().__init__()
|
| 40 |
+
self.output_dir = os.path.join(output_dir, "mineru") if output_dir else None
|
| 41 |
+
|
| 42 |
+
if device == "auto":
|
| 43 |
+
device = pick_device()
|
| 44 |
+
|
| 45 |
+
self._default_kwargs: Dict[str, Any] = {
|
| 46 |
+
"method": method,
|
| 47 |
+
"lang": lang,
|
| 48 |
+
"backend": backend,
|
| 49 |
+
"device": device,
|
| 50 |
+
"source": source,
|
| 51 |
+
"vlm_url": vlm_url,
|
| 52 |
+
"start_page": start_page,
|
| 53 |
+
"end_page": end_page,
|
| 54 |
+
"formula": formula,
|
| 55 |
+
"table": table,
|
| 56 |
+
**other_mineru_kwargs,
|
| 57 |
+
}
|
| 58 |
+
self._default_kwargs = {
|
| 59 |
+
k: v for k, v in self._default_kwargs.items() if v is not None
|
| 60 |
+
}
|
| 61 |
+
self.return_assets = return_assets
|
| 62 |
+
self.parser = MinerUParser()
|
| 63 |
+
self.txt_reader = TXTReader()
|
| 64 |
+
|
| 65 |
+
def read(self, file_path: str, **override) -> List[Dict[str, Any]]:
|
| 66 |
+
"""
|
| 67 |
+
file_path
|
| 68 |
+
**override: override MinerU parameters
|
| 69 |
+
"""
|
| 70 |
+
pdf_path = Path(file_path).expanduser().resolve()
|
| 71 |
+
if not pdf_path.is_file():
|
| 72 |
+
raise FileNotFoundError(pdf_path)
|
| 73 |
+
|
| 74 |
+
kwargs = {**self._default_kwargs, **override}
|
| 75 |
+
|
| 76 |
+
mineru_result = self._call_mineru(pdf_path, kwargs)
|
| 77 |
+
return mineru_result
|
| 78 |
+
|
| 79 |
+
def _call_mineru(
|
| 80 |
+
self, pdf_path: Path, kwargs: Dict[str, Any]
|
| 81 |
+
) -> List[Dict[str, Any]]:
|
| 82 |
+
output_dir: Optional[str] = None
|
| 83 |
+
if self.output_dir:
|
| 84 |
+
output_dir = str(self.output_dir)
|
| 85 |
+
|
| 86 |
+
return self.parser.parse_pdf(pdf_path, output_dir=output_dir, **kwargs)
|
| 87 |
+
|
| 88 |
+
def _locate_md(self, pdf_path: Path, kwargs: Dict[str, Any]) -> Optional[Path]:
|
| 89 |
+
out_dir = (
|
| 90 |
+
Path(self.output_dir) if self.output_dir else Path(tempfile.gettempdir())
|
| 91 |
+
)
|
| 92 |
+
method = kwargs.get("method", "auto")
|
| 93 |
+
backend = kwargs.get("backend", "")
|
| 94 |
+
if backend.startswith("vlm-"):
|
| 95 |
+
method = "vlm"
|
| 96 |
+
|
| 97 |
+
candidate = Path(
|
| 98 |
+
os.path.join(out_dir, pdf_path.stem, method, f"{pdf_path.stem}.md")
|
| 99 |
+
)
|
| 100 |
+
if candidate.exists():
|
| 101 |
+
return candidate
|
| 102 |
+
candidate = Path(os.path.join(out_dir, f"{pdf_path.stem}.md"))
|
| 103 |
+
if candidate.exists():
|
| 104 |
+
return candidate
|
| 105 |
+
return None
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
class MinerUParser:
|
| 109 |
+
def __init__(self) -> None:
|
| 110 |
+
self._check_bin()
|
| 111 |
+
|
| 112 |
+
@staticmethod
|
| 113 |
+
def parse_pdf(
|
| 114 |
+
pdf_path: Union[str, Path],
|
| 115 |
+
output_dir: Optional[Union[str, Path]] = None,
|
| 116 |
+
method: str = "auto",
|
| 117 |
+
device: str = "cpu",
|
| 118 |
+
**kw: Any,
|
| 119 |
+
) -> List[Dict[str, Any]]:
|
| 120 |
+
pdf = Path(pdf_path).expanduser().resolve()
|
| 121 |
+
if not pdf.is_file():
|
| 122 |
+
raise FileNotFoundError(pdf)
|
| 123 |
+
|
| 124 |
+
out = (
|
| 125 |
+
Path(output_dir) if output_dir else Path(tempfile.mkdtemp(prefix="mineru_"))
|
| 126 |
+
)
|
| 127 |
+
out.mkdir(parents=True, exist_ok=True)
|
| 128 |
+
|
| 129 |
+
cached = MinerUParser._try_load_cached_result(str(out), pdf.stem, method)
|
| 130 |
+
if cached is not None:
|
| 131 |
+
return cached
|
| 132 |
+
|
| 133 |
+
MinerUParser._run_mineru(pdf, out, method, device, **kw)
|
| 134 |
+
|
| 135 |
+
cached = MinerUParser._try_load_cached_result(str(out), pdf.stem, method)
|
| 136 |
+
return cached if cached is not None else []
|
| 137 |
+
|
| 138 |
+
@staticmethod
|
| 139 |
+
def _try_load_cached_result(
|
| 140 |
+
out_dir: str, pdf_stem: str, method: str
|
| 141 |
+
) -> Optional[List[Dict[str, Any]]]:
|
| 142 |
+
"""
|
| 143 |
+
try to load cached json result from MinerU output.
|
| 144 |
+
:param out_dir:
|
| 145 |
+
:param pdf_stem:
|
| 146 |
+
:param method:
|
| 147 |
+
:return:
|
| 148 |
+
"""
|
| 149 |
+
json_file = os.path.join(
|
| 150 |
+
out_dir, pdf_stem, method, f"{pdf_stem}_content_list.json"
|
| 151 |
+
)
|
| 152 |
+
if not os.path.exists(json_file):
|
| 153 |
+
return None
|
| 154 |
+
|
| 155 |
+
try:
|
| 156 |
+
with open(json_file, encoding="utf-8") as f:
|
| 157 |
+
data = json.load(f)
|
| 158 |
+
except Exception as exc: # pylint: disable=broad-except
|
| 159 |
+
logger.warning("Failed to load cached MinerU result: %s", exc)
|
| 160 |
+
return None
|
| 161 |
+
|
| 162 |
+
base = os.path.dirname(json_file)
|
| 163 |
+
results = []
|
| 164 |
+
for item in data:
|
| 165 |
+
for key in ("img_path", "table_img_path", "equation_img_path"):
|
| 166 |
+
rel_path = item.get(key)
|
| 167 |
+
if rel_path:
|
| 168 |
+
item[key] = str(Path(base).joinpath(rel_path).resolve())
|
| 169 |
+
if item["type"] == "text":
|
| 170 |
+
item["content"] = item["text"]
|
| 171 |
+
del item["text"]
|
| 172 |
+
for key in ("page_idx", "bbox", "text_level"):
|
| 173 |
+
if item.get(key) is not None:
|
| 174 |
+
del item[key]
|
| 175 |
+
if item["type"] == "text" and not item["content"].strip():
|
| 176 |
+
continue
|
| 177 |
+
results.append(item)
|
| 178 |
+
return results
|
| 179 |
+
|
| 180 |
+
@staticmethod
|
| 181 |
+
def _run_mineru(
|
| 182 |
+
pdf: Path,
|
| 183 |
+
out: Path,
|
| 184 |
+
method: str,
|
| 185 |
+
device: str,
|
| 186 |
+
**kw: Any,
|
| 187 |
+
) -> None:
|
| 188 |
+
cmd = [
|
| 189 |
+
"mineru",
|
| 190 |
+
"-p",
|
| 191 |
+
str(pdf),
|
| 192 |
+
"-o",
|
| 193 |
+
str(out),
|
| 194 |
+
"-m",
|
| 195 |
+
method,
|
| 196 |
+
"-d",
|
| 197 |
+
device,
|
| 198 |
+
]
|
| 199 |
+
for k, v in kw.items():
|
| 200 |
+
if v is None:
|
| 201 |
+
continue
|
| 202 |
+
if isinstance(v, bool):
|
| 203 |
+
cmd += [f"--{k}", str(v).lower()]
|
| 204 |
+
else:
|
| 205 |
+
cmd += [f"--{k}", str(v)]
|
| 206 |
+
|
| 207 |
+
logger.info("Parsing PDF with MinerU: %s", pdf)
|
| 208 |
+
logger.debug("Running MinerU command: %s", " ".join(cmd))
|
| 209 |
+
|
| 210 |
+
proc = subprocess.run(
|
| 211 |
+
cmd,
|
| 212 |
+
stdout=subprocess.PIPE,
|
| 213 |
+
stderr=subprocess.PIPE,
|
| 214 |
+
text=True,
|
| 215 |
+
encoding="utf-8",
|
| 216 |
+
errors="ignore",
|
| 217 |
+
check=False,
|
| 218 |
+
)
|
| 219 |
+
if proc.returncode != 0:
|
| 220 |
+
raise RuntimeError(f"MinerU failed: {proc.stderr or proc.stdout}")
|
| 221 |
+
|
| 222 |
+
@staticmethod
|
| 223 |
+
def _check_bin() -> None:
|
| 224 |
+
try:
|
| 225 |
+
subprocess.run(
|
| 226 |
+
["mineru", "--version"],
|
| 227 |
+
stdout=subprocess.DEVNULL,
|
| 228 |
+
stderr=subprocess.DEVNULL,
|
| 229 |
+
check=True,
|
| 230 |
+
)
|
| 231 |
+
except (subprocess.CalledProcessError, FileNotFoundError) as exc:
|
| 232 |
+
raise RuntimeError(
|
| 233 |
+
"MinerU is not installed or not found in PATH. Please install it from pip: \n"
|
| 234 |
+
"pip install -U 'mineru[core]'"
|
| 235 |
+
) from exc
|
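Reviewer note: `_run_mineru` turns every non-`None` keyword argument into a `--key value` pair, with booleans lowercased to `true`/`false`. The sketch below reproduces only that argument-building step so the resulting command line can be inspected without invoking MinerU. `build_mineru_cmd` is a hypothetical name, and whether the MinerU CLI accepts each generated flag is not something this sketch checks.

```python
from typing import Any, List


def build_mineru_cmd(
    pdf: str, out: str, method: str = "auto", device: str = "cpu", **kw: Any
) -> List[str]:
    """Build the argument list the same way the diff's _run_mineru does."""
    cmd = ["mineru", "-p", str(pdf), "-o", str(out), "-m", method, "-d", device]
    for key, value in kw.items():
        if value is None:
            continue  # unset options are skipped
        if isinstance(value, bool):
            cmd += [f"--{key}", str(value).lower()]  # True -> "true"
        else:
            cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_mineru_cmd("paper.pdf", "cache/mineru", formula=True, lang="en", vlm_url=None)
print(" ".join(cmd))
# mineru -p paper.pdf -o cache/mineru -m auto -d cpu --formula true --lang en
```

Keyword names are passed through verbatim, so a key such as `start_page` becomes `--start_page` with the underscore kept.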
graphgen/models/reader/txt_reader.py
CHANGED
|
@@ -3,7 +3,7 @@ from typing import Any, Dict, List
|
|
| 3 |
from graphgen.bases.base_reader import BaseReader
|
| 4 |
|
| 5 |
|
| 6 |
-
class
|
| 7 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 8 |
docs = []
|
| 9 |
with open(file_path, "r", encoding="utf-8") as f:
|
|
|
|
| 3 |
from graphgen.bases.base_reader import BaseReader
|
| 4 |
|
| 5 |
|
| 6 |
+
class TXTReader(BaseReader):
|
| 7 |
def read(self, file_path: str) -> List[Dict[str, Any]]:
|
| 8 |
docs = []
|
| 9 |
with open(file_path, "r", encoding="utf-8") as f:
|
graphgen/operators/generate/generate_qas.py
CHANGED
|
@@ -6,6 +6,7 @@ from graphgen.models import (
|
|
| 6 |
AtomicGenerator,
|
| 7 |
CoTGenerator,
|
| 8 |
MultiHopGenerator,
|
|
|
|
| 9 |
)
|
| 10 |
from graphgen.utils import logger, run_concurrent
|
| 11 |
|
|
@@ -39,6 +40,8 @@ async def generate_qas(
|
|
| 39 |
generator = MultiHopGenerator(llm_client)
|
| 40 |
elif mode == "cot":
|
| 41 |
generator = CoTGenerator(llm_client)
|
|
|
|
|
|
|
| 42 |
else:
|
| 43 |
raise ValueError(f"Unsupported generation mode: {mode}")
|
| 44 |
|
|
|
|
| 6 |
AtomicGenerator,
|
| 7 |
CoTGenerator,
|
| 8 |
MultiHopGenerator,
|
| 9 |
+
VQAGenerator,
|
| 10 |
)
|
| 11 |
from graphgen.utils import logger, run_concurrent
|
| 12 |
|
|
|
|
| 40 |
generator = MultiHopGenerator(llm_client)
|
| 41 |
elif mode == "cot":
|
| 42 |
generator = CoTGenerator(llm_client)
|
| 43 |
+
elif mode == "vqa":
|
| 44 |
+
generator = VQAGenerator(llm_client)
|
| 45 |
else:
|
| 46 |
raise ValueError(f"Unsupported generation mode: {mode}")
|
| 47 |
|
graphgen/operators/read/read_files.py
CHANGED
|
@@ -1,16 +1,22 @@
|
|
| 1 |
-
from graphgen.models import
|
| 2 |
|
| 3 |
_MAPPING = {
|
| 4 |
-
"jsonl":
|
| 5 |
-
"json":
|
| 6 |
-
"txt":
|
| 7 |
-
"csv":
|
|
|
|
| 8 |
}
|
| 9 |
|
| 10 |
|
| 11 |
-
def read_files(file_path: str):
|
| 12 |
-
suffix = file_path.split(".")[-1]
|
| 13 |
-
if suffix
|
|
| 14 |
reader = _MAPPING[suffix]()
|
| 15 |
else:
|
| 16 |
raise ValueError(
|
|
|
|
| 1 |
+
from graphgen.models import CSVReader, JSONLReader, JSONReader, PDFReader, TXTReader
|
| 2 |
|
| 3 |
_MAPPING = {
|
| 4 |
+
"jsonl": JSONLReader,
|
| 5 |
+
"json": JSONReader,
|
| 6 |
+
"txt": TXTReader,
|
| 7 |
+
"csv": CSVReader,
|
| 8 |
+
"pdf": PDFReader,
|
| 9 |
}
|
| 10 |
|
| 11 |
|
| 12 |
+
def read_files(file_path: str, cache_dir: str | None = None) -> list[dict]:
|
| 13 |
+
suffix = file_path.split(".")[-1].lower()
|
| 14 |
+
if suffix == "pdf":
|
| 15 |
+
if cache_dir is not None:
|
| 16 |
+
reader = _MAPPING[suffix](output_dir=cache_dir)
|
| 17 |
+
else:
|
| 18 |
+
reader = _MAPPING[suffix]()
|
| 19 |
+
elif suffix in _MAPPING:
|
| 20 |
reader = _MAPPING[suffix]()
|
| 21 |
else:
|
| 22 |
raise ValueError(
|
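Reviewer note: `read_files` now lowercases the suffix and gives the PDF reader a cache directory when one is supplied, while the other readers are still constructed without arguments. The dispatch can be sketched with stand-in classes as follows. `StubReader`, `StubPDFReader` and `pick_reader` are hypothetical names, and the error message is an assumption because the original `raise ValueError(` is cut off in this view.

```python
from typing import Optional


class StubReader:
    """Hypothetical stand-in for the CSV/JSON/JSONL/TXT readers."""

    def __init__(self, output_dir: Optional[str] = None):
        self.output_dir = output_dir


class StubPDFReader(StubReader):
    """Hypothetical stand-in for PDFReader, which accepts output_dir."""


_MAPPING = {
    "jsonl": StubReader,
    "json": StubReader,
    "txt": StubReader,
    "csv": StubReader,
    "pdf": StubPDFReader,
}


def pick_reader(file_path: str, cache_dir: Optional[str] = None) -> StubReader:
    suffix = file_path.split(".")[-1].lower()  # "PDF" and "pdf" are treated alike
    if suffix == "pdf":
        if cache_dir is not None:
            return _MAPPING[suffix](output_dir=cache_dir)
        return _MAPPING[suffix]()
    if suffix in _MAPPING:
        return _MAPPING[suffix]()  # cache_dir is ignored for non-PDF inputs
    raise ValueError(f"Unsupported file format: {suffix}")  # message is assumed


print(type(pick_reader("paper.PDF", "cache")).__name__)  # StubPDFReader
print(pick_reader("notes.txt", "cache").output_dir)  # None
```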
graphgen/utils/__init__.py
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
from .calculate_confidence import yes_no_loss_entropy
|
| 2 |
from .detect_lang import detect_if_chinese, detect_main_language
|
|
|
|
| 3 |
from .format import (
|
| 4 |
handle_single_entity_extraction,
|
| 5 |
handle_single_relationship_extraction,
|
|
|
|
| 1 |
from .calculate_confidence import yes_no_loss_entropy
|
| 2 |
from .detect_lang import detect_if_chinese, detect_main_language
|
| 3 |
+
from .device import pick_device
|
| 4 |
from .format import (
|
| 5 |
handle_single_entity_extraction,
|
| 6 |
handle_single_relationship_extraction,
|
graphgen/utils/device.py
ADDED
|
@@ -0,0 +1,44 @@
|
|
| 1 |
+
import shutil
|
| 2 |
+
import subprocess
|
| 3 |
+
import sys
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
def pick_device() -> str:
|
| 7 |
+
"""Return the best available device string for MinerU."""
|
| 8 |
+
# 1. NVIDIA GPU
|
| 9 |
+
if shutil.which("nvidia-smi") is not None:
|
| 10 |
+
try:
|
| 11 |
+
# check if there's any free GPU memory
|
| 12 |
+
out = subprocess.check_output(
|
| 13 |
+
[
|
| 14 |
+
"nvidia-smi",
|
| 15 |
+
"--query-gpu=memory.free",
|
| 16 |
+
"--format=csv,noheader,nounits",
|
| 17 |
+
],
|
| 18 |
+
text=True,
|
| 19 |
+
)
|
| 20 |
+
if any(int(line) > 0 for line in out.strip().splitlines()):
|
| 21 |
+
return "cuda:0"
|
| 22 |
+
except Exception: # pylint: disable=broad-except
|
| 23 |
+
pass
|
| 24 |
+
|
| 25 |
+
# 2. Apple Silicon
|
| 26 |
+
if sys.platform == "darwin" and shutil.which("sysctl"):
|
| 27 |
+
try:
|
| 28 |
+
brand = subprocess.check_output(
|
| 29 |
+
["sysctl", "-n", "machdep.cpu.brand_string"], text=True
|
| 30 |
+
)
|
| 31 |
+
if "Apple" in brand:
|
| 32 |
+
return "mps"
|
| 33 |
+
except Exception: # pylint: disable=broad-except
|
| 34 |
+
pass
|
| 35 |
+
|
| 36 |
+
# 3. Ascend NPU
|
| 37 |
+
if shutil.which("npu-smi") is not None:
|
| 38 |
+
try:
|
| 39 |
+
subprocess.check_call(["npu-smi", "info"], stdout=subprocess.DEVNULL)
|
| 40 |
+
return "npu"
|
| 41 |
+
except Exception: # pylint: disable=broad-except
|
| 42 |
+
pass
|
| 43 |
+
|
| 44 |
+
return "cpu"
|
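Reviewer note: the NVIDIA branch of `pick_device` decides on `cuda:0` by parsing the `memory.free` column that `nvidia-smi` prints, one integer per GPU. The parsing step can be exercised on its own with canned output, as in this sketch. `has_free_gpu_memory` is a hypothetical helper name and the sample strings are made up for illustration.

```python
def has_free_gpu_memory(nvidia_smi_output: str) -> bool:
    """True if any line of the memory.free listing parses to a positive integer."""
    return any(int(line) > 0 for line in nvidia_smi_output.strip().splitlines())


print(has_free_gpu_memory("1024\n0\n"))  # True: the first GPU reports free memory
print(has_free_gpu_memory("0\n0\n"))  # False: no GPU reports free memory
print(has_free_gpu_memory(""))  # False: no lines, so any() sees nothing
```

A non-numeric line would make `int()` raise `ValueError`. In the committed function that case is absorbed by the broad `except`, and detection falls through to the later checks.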
webui/examples/csv_demo.csv
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
-
content
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
"Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."
|
| 5 |
-
"Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."
|
|
|
|
| 1 |
+
type,content
|
| 2 |
+
text,云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。
|
| 3 |
+
text,隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。
|
| 4 |
+
text,"Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."
|
| 5 |
+
text,"Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."
|
webui/examples/json_demo.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
[
|
| 2 |
-
{"content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
|
| 3 |
-
{"content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
|
| 4 |
-
{"content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
|
| 5 |
-
{"content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
|
| 6 |
]
|
|
|
|
| 1 |
[
|
| 2 |
+
{"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
|
| 3 |
+
{"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
|
| 4 |
+
{"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
|
| 5 |
+
{"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
|
| 6 |
]
|
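Note on the hunk above: each record in the demo file gains a `"type"` field next to `"content"`. A minimal sketch of how a reader might normalize both the old and the new record shape follows. The function name `load_records` and the choice to default a missing `"type"` to `"text"` are assumptions made for illustration; the commit's actual reader code is not shown in this hunk.

```python
import json
from typing import Dict, List


def load_records(raw: str, default_type: str = "text") -> List[Dict[str, str]]:
    """Parse a JSON array of records and make sure each has 'type' and 'content'."""
    normalized = []
    for index, record in enumerate(json.loads(raw)):
        if "content" not in record:
            raise ValueError(f"record {index} has no 'content' field")
        normalized.append(
            {
                # Assumption: records written before this change are plain text.
                "type": record.get("type", default_type),
                "content": record["content"],
            }
        )
    return normalized


if __name__ == "__main__":
    old_style = '[{"content": "Grain size is one of the key factors."}]'
    new_style = '[{"type": "text", "content": "Grain size is one of the key factors."}]'
    # Both shapes normalize to the same record under the stated assumption.
    print(load_records(old_style) == load_records(new_style))
```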
webui/examples/jsonl_demo.jsonl
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
{"content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"}
|
| 2 |
-
{"content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"}
|
| 3 |
-
{"content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."}
|
| 4 |
-
{"content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
|
|
|
|
| 1 |
+
{"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"}
|
| 2 |
+
{"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"}
|
| 3 |
+
{"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."}
|
| 4 |
+
{"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
|
webui/examples/vqa_demo.json
ADDED
|
@@ -0,0 +1,6 @@
|
|
| 1 |
+
[
|
| 2 |
+
{"type": "text", "content": "云南省农业科学院粮食作物研究所于2005年育成早熟品种云粳26号,该品种外观特点为: 颖尖无色、无芒,谷壳黄色,落粒性适中,米粒大,有香味,食味品质好,高抗稻瘟病,适宜在云南中海拔 1 500∼1 800 m 稻区种植。2012年被农业部列为西南稻区农业推广主导品种。"},
|
| 3 |
+
{"type": "text", "content": "隆两优1212 于2017 年引入福建省龙岩市长汀县试种,在长汀县圣丰家庭农场(河田镇南塘村)种植,土壤肥力中等、排灌方便[2],试种面积 0.14 hm^2 ,作烟后稻种植,6 月15 日机播,7月5 日机插,10 月21 日成熟,产量 8.78 t/hm^2 。2018 和2019 年分别在长汀润丰优质稻专业合作社(濯田镇永巫村)和长汀县绿丰优质稻专业合作社(河田镇中街村)作烟后稻进一步扩大示范种植,均采用机播机插机收。2018 年示范面积 4.00 hm^2 ,平均产量 8.72 t/hm^2 ;2019 年示范面积 13.50 hm^2 ,平均产量 8.74 t/hm^2 。经3 a 试种、示范,隆两优1212 表现出分蘖力强、抗性好、抽穗整齐、后期转色好、生育期适中、产量高、适应性好等特点,可作为烟后稻在长汀县推广种植。"},
|
| 4 |
+
{"type": "text", "content": "Grain size is one of the key factors determining grain yield. However, it remains largely unknown how grain size is regulated by developmental signals. Here, we report the identification and characterization of a dominant mutant big grain1 (Bg1-D) that shows an extra-large grain phenotype from our rice T-DNA insertion population. Overexpression of BG1 leads to significantly increased grain size, and the severe lines exhibit obviously perturbed gravitropism. In addition, the mutant has increased sensitivities to both auxin and N-1-naphthylphthalamic acid, an auxin transport inhibitor, whereas knockdown of BG1 results in decreased sensitivities and smaller grains. Moreover, BG1 is specifically induced by auxin treatment, preferentially expresses in the vascular tissue of culms and young panicles, and encodes a novel membrane-localized protein, strongly suggesting its role in regulating auxin transport. Consistent with this finding, the mutant has increased auxin basipetal transport and altered auxin distribution, whereas the knockdown plants have decreased auxin transport. Manipulation of BG1 in both rice and Arabidopsis can enhance plant biomass, seed weight, and yield. Taking these data together, we identify a novel positive regulator of auxin response and transport in a crop plant and demonstrate its role in regulating grain size, thus illuminating a new strategy to improve plant productivity."},
|
| 5 |
+
{"type": "text", "content": "Tiller angle, an important component of plant architecture, greatly influences the grain yield of rice (Oryza sativa L.). Here, we identified Tiller Angle Control 4 (TAC4) as a novel regulator of rice tiller angle. TAC4 encodes a plant-specific, highly conserved nuclear protein. The loss of TAC4 function leads to a significant increase in the tiller angle. TAC4 can regulate rice shoot\n\ngravitropism by increasing the indole acetic acid content and affecting the auxin distribution. A sequence analysis revealed that TAC4 has undergone a bottleneck and become fixed in indica cultivars during domestication and improvement. Our findings facilitate an increased understanding of the regulatory mechanisms of tiller angle and also provide a potential gene resource for the improvement of rice plant architecture."}
|
| 6 |
+
]
|