zstanjj/HTML-Pruner-Phi-3.8B

zstanjj committed on Dec 12, 2024

Commit 35b6576 (verified) · 1 parent: d2620d0

Update README.md

Files changed (1)

README.md +9 -2


@@ -67,6 +67,11 @@ document.write("Hello World!");
 </html>
 """
 simplified_html = clean_html(html)
 print(simplified_html)
@@ -80,7 +85,6 @@ print(simplified_html)
 # </html>
 ```
 ### 🔧 Configure Pruning Parameters
 The example HTML document is rather short. Real-world HTML documents can be much longer and more complex. To handle such cases, we can configure the following parameters:
@@ -107,6 +111,7 @@ MAX_CONTEXT_WINDOW_GEN = 32
 from htmlrag import build_block_tree
 block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
 for block in block_tree:
     print("Block Content: ", block[0])
     print("Block Path: ", block[1])
@@ -175,6 +180,7 @@ import torch
 # construct a finer block tree
 block_tree, pruned_html=build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
 for block in block_tree:
     print("Block Content: ", block[0])
     print("Block Path: ", block[1])
@@ -189,7 +195,7 @@ for block in block_tree:
 # Block Path:  ['html', 'p']
 # Is Leaf:  True
-ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
 if torch.cuda.is_available():
     device="cuda"
 else:
@@ -206,6 +212,7 @@ print(pruned_html)
 # <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
 ```
 ## Results

 </html>
 """
+# Alternatively, you can read HTML files and merge them
+# html_files=["/path/to/html/file1.html", "/path/to/html/file2.html"]
+# htmls=[open(file).read() for file in html_files]
+# html = "\n".join(htmls)
 simplified_html = clean_html(html)
 print(simplified_html)
 # </html>
 ```
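The commented-out lines added in this commit read several HTML files and join them before cleaning. A minimal sketch of that step is shown below, assuming the files are UTF-8 encoded; `merge_html_files` is a hypothetical helper introduced here for illustration and is not part of `htmlrag`.

```python
from pathlib import Path


def merge_html_files(paths, encoding="utf-8"):
    """Read each HTML file and join the contents with newlines.

    Hypothetical helper (not part of htmlrag). The UTF-8 default is an
    assumption; undecodable bytes are replaced rather than raising.
    """
    parts = []
    for path in paths:
        # read_text opens and closes the file, so no handle is leaked
        parts.append(Path(path).read_text(encoding=encoding, errors="replace"))
    return "\n".join(parts)


# Usage sketch (paths are placeholders, as in the README):
# html = merge_html_files(["/path/to/html/file1.html", "/path/to/html/file2.html"])
# simplified_html = clean_html(html)  # clean_html is imported from htmlrag in the README
```

Compared with `open(file).read()` in a list comprehension, this closes each file promptly and makes the encoding explicit, which matters when pages come from mixed sources.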
 ### 🔧 Configure Pruning Parameters
 The example HTML document is rather short. Real-world HTML documents can be much longer and more complex. To handle such cases, we can configure the following parameters:
 from htmlrag import build_block_tree
 block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
+# block_tree, simplified_html=build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED, zh_char=True) # for Chinese text
 for block in block_tree:
     print("Block Content: ", block[0])
     print("Block Path: ", block[1])
 # construct a finer block tree
 block_tree, pruned_html=build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
+# block_tree, pruned_html=build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
 for block in block_tree:
     print("Block Content: ", block[0])
     print("Block Path: ", block[1])
 # Block Path:  ['html', 'p']
 # Is Leaf:  True
+ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
 if torch.cuda.is_available():
     device="cuda"
 else:
 # <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
 ```
+---
 ## Results