## Dataset Processing
### Our Benchmark (processed OIE2016)
First, download our benchmark tailored for compact extractions, provided [`here`](https://zenodo.org/record/7014032#.YwQQ0OzMJb8), and place it under [`data/OIE2016(processed)`](https://github.com/FarimaFatahi/CompactIE/tree/master/data/OIE2016(processed)).
Second, split out the train, development, and test sets for the constituent extraction model by running:
```bash
cd "OIE2016(processed)/constituent_model"
python process_constituent_data.py
```
Lastly, split out the train, development, and test sets for the constituent linking model by running:
```bash
cd "OIE2016(processed)/relation_model"
python process_linking_data.py
```
Note that the data folders used for training each model are set to the ones produced above.
### Evaluation Benchmarks
Three evaluation benchmarks (**BenchIE**, **CaRB**, and **Wire57**) are used to evaluate CompactIE's performance. Note that since these datasets do not target compact triples, we exclude triples that contain at least one clause within a constituent.
To get the final data (JSON format) for these benchmarks, run:
```bash
./process_test_data.sh
```
### Other files
Since the schema design of the table-filling model does not support conjunctions inside constituents, we use the conjunction module developed by [`OpenIE6`](https://github.com/dair-iitd/openie6) to break sentences into smaller, conjunction-free sentences before passing them to the system.
Therefore, for a new test file (`source_file.txt`), first produce the conjunction file (`conjunctions.txt`) and then run:
```bash
python process.py --source_file source_file.txt --target_file output.json --conjunctions_file conjunctions.txt
```
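
To make the idea concrete, here is a toy, made-up example of what conjunction decomposition produces; it does not invoke the OpenIE6 module, and the sentence is illustrative only.
```python
# Toy illustration only (hard-coded, made-up example; not the OpenIE6 module):
# a sentence with a coordinated verb phrase is rewritten into
# conjunction-free sentences that the system then processes independently.
coordinated_sentence = "The company designs and manufactures electric vehicles."
conjunction_free_sentences = [
    "The company designs electric vehicles.",
    "The company manufactures electric vehicles.",
]

for simple_sentence in conjunction_free_sentences:
    # Each simple sentence can now yield a compact triple without a
    # conjunction inside any constituent.
    print(simple_sentence)
```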
### Compactness measurement
To measure the compactness metrics mentioned in the paper (AL, NCC, RPA), set the `INPUT_FILE` variable inside the following script to the test file path and run it as follows:
```bash
python compactness_measurements.py
```
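
The authoritative definitions of AL, NCC, and RPA are those in the script and the paper; as one hedged illustration, if AL is read as the average number of words per extraction (an assumption made here for the sketch, along with the line-delimited JSON layout and the `arg1`/`rel`/`arg2` field names), it could be computed roughly as follows:
```python
# Hedged sketch only: an average-length style statistic over extractions.
# The file layout (one JSON object per line) and the field names are
# assumptions for illustration, not the schema used by compactness_measurements.py.
import json

def average_extraction_length(path: str) -> float:
    lengths = []
    with open(path) as f:
        for line in f:
            triple = json.loads(line)
            words = " ".join(triple[k] for k in ("arg1", "rel", "arg2")).split()
            lengths.append(len(words))
    return sum(lengths) / len(lengths) if lengths else 0.0

print(average_extraction_length("output.json"))  # hypothetical input path
```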