Spaces:
Paused
Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
Enabling Downloads
To enable downloads, set the accept_downloads parameter in the BrowserConfig object and pass it to the crawler.
from crawl4ai.async_configs import BrowserConfig, AsyncWebCrawler
async def main():
config = BrowserConfig(accept_downloads=True) # Enable downloads globally
async with AsyncWebCrawler(config=config) as crawler:
# ... your crawling logic ...
asyncio.run(main())
Or, enable it for a specific crawl by using CrawlerRunConfig:
from crawl4ai.async_configs import CrawlerRunConfig
async def main():
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(accept_downloads=True)
result = await crawler.arun(url="https://example.com", config=config)
# ...
Specifying Download Location
Specify the download directory using the downloads_path attribute in the BrowserConfig object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the .crawl4ai folder in your home directory.
from crawl4ai.async_configs import BrowserConfig
import os
downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
os.makedirs(downloads_path, exist_ok=True)
config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
async def main():
async with AsyncWebCrawler(config=config) as crawler:
result = await crawler.arun(url="https://example.com")
# ...
Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use js_code in CrawlerRunConfig to simulate these actions and wait_for to allow sufficient time for downloads to start.
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig(
js_code="""
const downloadLink = document.querySelector('a[href$=".exe"]');
if (downloadLink) {
downloadLink.click();
}
""",
wait_for=5 # Wait 5 seconds for the download to start
)
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
Accessing Downloaded Files
The downloaded_files attribute of the CrawlResult object contains paths to downloaded files.
if result.downloaded_files:
print("Downloaded files:")
for file_path in result.downloaded_files:
print(f"- {file_path}")
file_size = os.path.getsize(file_path)
print(f"- File size: {file_size} bytes")
else:
print("No files downloaded.")
Example: Downloading Multiple Files
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import os
from pathlib import Path
async def download_multiple_files(url: str, download_path: str):
config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
async with AsyncWebCrawler(config=config) as crawler:
run_config = CrawlerRunConfig(
js_code="""
const downloadLinks = document.querySelectorAll('a[download]');
for (const link of downloadLinks) {
link.click();
await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
}
""",
wait_for=10 # Wait for all downloads to start
)
result = await crawler.arun(url=url, config=run_config)
if result.downloaded_files:
print("Downloaded files:")
for file in result.downloaded_files:
print(f"- {file}")
else:
print("No files downloaded.")
# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
Important Considerations
- Browser Context: Downloads are managed within the browser context. Ensure
js_codecorrectly targets the download triggers on the webpage. - Timing: Use
wait_forinCrawlerRunConfigto manage download timing. - Error Handling: Handle errors to manage failed downloads or incorrect paths gracefully.
- Security: Scan downloaded files for potential security threats before use.
This revised guide ensures consistency with the Crawl4AI codebase by using BrowserConfig and CrawlerRunConfig for all download-related configurations. Let me know if further adjustments are needed!