### Content Selection
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
#### CSS Selectors
Extract specific content using a `CrawlerRunConfig` with CSS selectors:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    # Target the main article content
    config = CrawlerRunConfig(css_selector=".main-article")
    result = await crawler.arun(url="https://crawl4ai.com", config=config)

    # Target the heading and content blocks together
    config = CrawlerRunConfig(css_selector="article h1, article .content")
    result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
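The selector determines what ends up in the crawl result, so the generated markdown contains only the matched elements. A quick way to confirm, assuming the snippet above has run:
```python
# Preview the markdown generated from the selected elements only
print(result.markdown)
```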
#### Content Filtering
Control content inclusion or exclusion with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    word_count_threshold=10,  # Minimum words per text block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Tags to strip entirely
    exclude_external_links=True,  # Remove external links
    exclude_social_media_links=True,  # Remove social media links
    exclude_external_images=True  # Remove external images
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
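One way to sanity-check the filters is to inspect the result's link map; a minimal sketch, assuming the `config` above (`result.links` groups discovered links under `internal` and `external` keys):
```python
# With exclude_external_links=True, external links are filtered from the
# output, so the "external" bucket should be empty (or near-empty)
print(len(result.links.get("internal", [])), "internal links kept")
print(len(result.links.get("external", [])), "external links after filtering")
```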
#### Iframe Content
Process iframe content by enabling specific options in `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    process_iframes=True,  # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
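With `process_iframes` enabled, the crawler attempts to inline each iframe's content into the parent page before extraction, so it shows up in the same `result.markdown` and `result.cleaned_html` output as the rest of the page.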
#### Structured Content Selection Using LLMs
Leverage LLMs for intelligent content extraction:
```python
import json

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```
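Note that `extracted_content` is a JSON string, and LLM strategies may return a list of extracted blocks rather than a single object when a large page is processed in chunks. A defensive access pattern, as a sketch:
```python
data = json.loads(result.extracted_content)
# Normalize: chunked extraction can yield a list of block objects
article = data[0] if isinstance(data, list) else data
print(article["title"])
```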
#### Pattern-Based Selection
Extract content matching repetitive patterns:
```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```
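Because `baseSelector` matches every `article.news-item` on the page, the decoded result is a list with one object per match. A short usage sketch over the fields defined above:
```python
for item in articles:
    meta = item.get("metadata", {})
    print(f"{item['headline']} ({item.get('category', 'uncategorized')}) "
          f"by {meta.get('author', 'unknown')} on {meta.get('date', 'n/a')}")
```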
#### Comprehensive Example
Combine different selection methods using `CrawlerRunConfig`:
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define configuration
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
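To call the helper from synchronous code, wrap it with `asyncio.run`; a minimal sketch using a placeholder URL:
```python
import asyncio

articles = asyncio.run(extract_article_content("https://crawl4ai.com"))
print(articles)
```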