# Extraction & Chunking Strategies API

This documentation covers the API reference for extraction and chunking strategies in Crawl4AI.

## Extraction Strategies

All extraction strategies inherit from the base `ExtractionStrategy` class and implement two key methods:

- `extract(url: str, html: str) -> List[Dict[str, Any]]`
- `run(url: str, sections: List[str]) -> List[Dict[str, Any]]`
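As a rough illustration of this interface, here is a minimal custom strategy. The method signatures follow the description above; the class name and its paragraph-splitting logic are hypothetical examples, not library code:

```python
from typing import Any, Dict, List

class ParagraphExtractionStrategy:
    """Toy strategy implementing the ExtractionStrategy interface sketched above."""

    def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
        # Treat each blank-line-separated block as one extracted item.
        blocks = [b.strip() for b in html.split("\n\n") if b.strip()]
        return [{"index": i, "content": b, "url": url} for i, b in enumerate(blocks)]

    def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
        # Apply extract() to each pre-split section and merge the results.
        results: List[Dict[str, Any]] = []
        for section in sections:
            results.extend(self.extract(url, section))
        return results
```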
### LLMExtractionStrategy

Extracts structured data using a Language Model.

```python
LLMExtractionStrategy(
    # Required parameters
    provider: str = DEFAULT_PROVIDER,   # LLM provider (e.g., "ollama/llama2")
    api_token: Optional[str] = None,    # API token for the provider

    # Extraction configuration
    instruction: str = None,            # Custom extraction instruction
    schema: Dict = None,                # Pydantic model schema for structured data
    extraction_type: str = "block",     # "block" or "schema"

    # Chunking parameters
    chunk_token_threshold: int = 4000,  # Maximum tokens per chunk
    overlap_rate: float = 0.1,          # Overlap between adjacent chunks
    word_token_rate: float = 0.75,      # Word-to-token conversion rate
    apply_chunking: bool = True,        # Enable/disable chunking

    # API configuration
    base_url: str = None,               # Base URL for the API
    extra_args: Dict = {},              # Additional provider arguments
    verbose: bool = False               # Enable verbose logging
)
```
### CosineStrategy

Extracts and clusters content based on embedding similarity.

```python
CosineStrategy(
    # Content filtering
    semantic_filter: str = None,     # Topic/keyword filter
    word_count_threshold: int = 10,  # Minimum words per cluster
    sim_threshold: float = 0.3,      # Similarity threshold

    # Clustering parameters
    max_dist: float = 0.2,           # Maximum cluster distance
    linkage_method: str = 'ward',    # Clustering method
    top_k: int = 3,                  # Number of top clusters to return

    # Model configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
    verbose: bool = False            # Enable verbose logging
)
```
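The name refers to cosine similarity between text embeddings. As a self-contained illustration of the underlying idea only — using bag-of-words counts rather than the sentence-transformers embeddings the strategy actually uses:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, using bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Texts on the same topic score higher than unrelated ones,
# which is the signal CosineStrategy's sim_threshold cuts on.
same = cosine_similarity("the cat sat on the mat", "a cat on a mat")
diff = cosine_similarity("the cat sat on the mat", "stock prices fell sharply")
```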
### JsonCssExtractionStrategy

Extracts structured data using CSS selectors.

```python
JsonCssExtractionStrategy(
    schema: Dict[str, Any],  # Extraction schema
    verbose: bool = False    # Enable verbose logging
)

# Schema structure
schema = {
    "name": str,          # Schema name
    "baseSelector": str,  # Base CSS selector
    "fields": [           # List of fields to extract
        {
            "name": str,       # Field name
            "selector": str,   # CSS selector
            "type": str,       # Field type: "text", "attribute", "html", "regex"
            "attribute": str,  # For type="attribute"
            "pattern": str,    # For type="regex"
            "transform": str,  # Optional: "lowercase", "uppercase", "strip"
            "default": Any     # Default value if extraction fails
        }
    ]
}
```
## Chunking Strategies

All chunking strategies inherit from `ChunkingStrategy` and implement the `chunk(text: str) -> list` method.

### RegexChunking

Splits text on regex patterns.

```python
RegexChunking(
    patterns: List[str] = None  # Regex patterns for splitting
                                # Default: [r'\n\n']
)
```
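The default pattern splits on blank lines. A minimal sketch of this splitting behavior (an illustration, not the library implementation):

```python
import re
from typing import List, Optional

def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> list:
    """Split text on each pattern in turn, mirroring RegexChunking's default."""
    patterns = patterns or [r"\n\n"]  # default: split on blank lines
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c for c in chunks if c.strip()]  # drop empty fragments
```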
### SlidingWindowChunking

Creates overlapping chunks with a sliding-window approach.

```python
SlidingWindowChunking(
    window_size: int = 100,  # Window size in words
    step: int = 50           # Step size between windows
)
```
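With the defaults (`window_size=100`, `step=50`), consecutive chunks share 50 words. A minimal sketch of the sliding-window logic, for illustration only:

```python
def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> list:
    """Split text into word windows of window_size, advancing step words each time."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]  # short text: one chunk
    return [
        " ".join(words[i:i + window_size])
        for i in range(0, len(words) - window_size + 1, step)
    ]
```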
### OverlappingWindowChunking

Creates chunks with a specified overlap.

```python
OverlappingWindowChunking(
    window_size: int = 1000,  # Chunk size in words
    overlap: int = 100        # Overlap size in words
)
```
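This parametrization is equivalent to a sliding window whose step is `window_size - overlap`. A minimal sketch (hypothetical, not library code; assumes `overlap < window_size`):

```python
def overlapping_chunk(text: str, window_size: int = 1000, overlap: int = 100) -> list:
    """Split text into word chunks, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = window_size - overlap  # advance by the non-overlapping portion
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + window_size]))
        if i + window_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```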
## Usage Examples

### LLM Extraction

```python
import json

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define the schema
class Article(BaseModel):
    title: str
    content: str
    author: str

# Create the strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Article.schema(),
    instruction="Extract article details"
)

# Use with the crawler
result = await crawler.arun(
    url="https://example.com/article",
    extraction_strategy=strategy
)

# Access the extracted data
data = json.loads(result.extracted_content)
```
### CSS Extraction

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Define the schema
schema = {
    "name": "Product List",
    "baseSelector": ".product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text",
            "transform": "strip"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

# Create and use the strategy
strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
    url="https://example.com/products",
    extraction_strategy=strategy
)
```
### Content Chunking

```python
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Create the chunking strategy
chunker = OverlappingWindowChunking(
    window_size=500,  # 500 words per chunk
    overlap=50        # 50 words of overlap
)

# Pair it with an extraction strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    chunking_strategy=chunker
)

result = await crawler.arun(
    url="https://example.com/long-article",
    extraction_strategy=strategy
)
```
## Best Practices

1. **Choose the right strategy**
   - Use `LLMExtractionStrategy` for complex, unstructured content
   - Use `JsonCssExtractionStrategy` for well-structured HTML
   - Use `CosineStrategy` for content similarity and clustering

2. **Optimize chunking**

   ```python
   # For long documents, use smaller chunks with modest overlap
   strategy = LLMExtractionStrategy(
       chunk_token_threshold=2000,  # Smaller chunks
       overlap_rate=0.1             # 10% overlap
   )
   ```

3. **Handle errors**

   ```python
   try:
       result = await crawler.arun(
           url="https://example.com",
           extraction_strategy=strategy
       )
       if result.success:
           content = json.loads(result.extracted_content)
   except Exception as e:
       print(f"Extraction failed: {e}")
   ```

4. **Monitor performance**

   ```python
   strategy = CosineStrategy(
       verbose=True,             # Enable logging
       word_count_threshold=20,  # Filter out short content
       top_k=5                   # Limit the number of results
   )
   ```
