### Content Selection
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
#### CSS Selectors
Extract specific content using a `CrawlerRunConfig` with CSS selectors:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    # Target the main article content
    config = CrawlerRunConfig(css_selector=".main-article")
    result = await crawler.arun(url="https://crawl4ai.com", config=config)

    # Target the heading and content blocks together
    config = CrawlerRunConfig(css_selector="article h1, article .content")
    result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
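The selector determines what ends up in the crawl result, so the generated markdown contains only the matched elements. A quick way to confirm, assuming the snippet above has run:
```python
# Preview the markdown generated from the selected elements only
print(result.markdown)
```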
#### Content Filtering
Control content inclusion or exclusion with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    word_count_threshold=10,  # Minimum words per text block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Tags to strip entirely
    exclude_external_links=True,  # Remove external links
    exclude_social_media_links=True,  # Remove social media links
    exclude_external_images=True  # Remove external images
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
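One way to sanity-check the filters is to inspect the result's link map; a minimal sketch, assuming the `config` above (`result.links` groups discovered links under `internal` and `external` keys):
```python
# With exclude_external_links=True, external links are filtered from the
# output, so the "external" bucket should be empty (or near-empty)
print(len(result.links.get("internal", [])), "internal links kept")
print(len(result.links.get("external", [])), "external links after filtering")
```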
#### Iframe Content
Process iframe content by enabling specific options in `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    process_iframes=True,  # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
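With `process_iframes` enabled, the crawler attempts to inline each iframe's content into the parent page before extraction, so it shows up in the same `result.markdown` and `result.cleaned_html` output as the rest of the page.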
#### Structured Content Selection Using LLMs
Leverage LLMs for intelligent content extraction:
```python
import json

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```
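Note that `extracted_content` is a JSON string, and LLM strategies may return a list of extracted blocks rather than a single object when a large page is processed in chunks. A defensive access pattern, as a sketch:
```python
data = json.loads(result.extracted_content)
# Normalize: chunked extraction can yield a list of block objects
article = data[0] if isinstance(data, list) else data
print(article["title"])
```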
#### Pattern-Based Selection
Extract content matching repetitive patterns:
```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```
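Because `baseSelector` matches every `article.news-item` on the page, the decoded result is a list with one object per match. A short usage sketch over the fields defined above:
```python
for item in articles:
    meta = item.get("metadata", {})
    print(f"{item['headline']} ({item.get('category', 'uncategorized')}) "
          f"by {meta.get('author', 'unknown')} on {meta.get('date', 'n/a')}")
```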
#### Comprehensive Example
Combine different selection methods using `CrawlerRunConfig`:
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define configuration
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
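To call the helper from synchronous code, wrap it with `asyncio.run`; a minimal sketch using a placeholder URL:
```python
import asyncio

articles = asyncio.run(extract_article_content("https://crawl4ai.com"))
print(articles)
```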