
trafilatura: A Python tool for web scraping and crawling that extracts main text, metadata, and comments from web pages

Submitted by: @import:tldr-pages

Problem

How to use the trafilatura command, a Python tool for web scraping and crawling that extracts main text, metadata, and comments from web pages. It is designed for creating text corpora and extracting structured content. More information: <https://trafilatura.readthedocs.io/en/latest/usage-cli.html#further-information>.

Solution

Extract text from a URL:
trafilatura {{[-u|--URL]}} {{url}}


Extract text and save the result to an output directory:
trafilatura {{[-u|--URL]}} {{url}} {{[-o|--output-dir]}} {{path/to/output_directory}}


Extract text in JSON format:
trafilatura {{[-u|--URL]}} {{url}} --json


Extract text from multiple URLs listed in a file:
trafilatura {{[-i|--input-file]}} {{path/to/url_list.txt}}
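
The batch workflow above can be sketched as a small script. This is a minimal example rather than an official recipe: the `example.org` URLs, the file names, and the `extract_all` helper are placeholders invented for illustration, and it assumes the list file holds one URL per line.

```shell
#!/bin/sh
# Sketch: batch extraction from a URL list (all paths and URLs are placeholders).

workdir=$(mktemp -d)              # scratch directory for this example
url_list="$workdir/url_list.txt"  # hypothetical list file
out_dir="$workdir/extracted"      # hypothetical output directory

# Write the list file with one URL per line.
printf '%s\n' \
  'https://example.org/' \
  'https://example.org/about' > "$url_list"

# Wrap the call in a function so nothing is downloaded until it is invoked.
extract_all() {
  trafilatura --input-file "$url_list" --output-dir "$out_dir"
}

echo "URL list written to $url_list; call extract_all to start."
```

Keeping the call inside a function makes it easy to inspect or edit the list before any page is fetched.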


Crawl a website using its sitemap:
trafilatura --sitemap {{url_to_sitemap.xml}}
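
A sitemap crawl can also be split into two passes: first list the discovered links, then extract them. The sketch below assumes the installed version supports the `--list` option (which prints URLs instead of downloading pages); `crawl_sitemap` is a hypothetical helper name, not part of trafilatura.

```shell
#!/bin/sh
# Sketch: list the pages found via a sitemap, then extract them in a second pass.

crawl_sitemap() {
  site_url=$1   # homepage or sitemap URL
  out_dir=$2    # directory for the extracted files
  links=$(mktemp)

  # --list prints the discovered URLs instead of downloading the pages.
  trafilatura --sitemap "$site_url" --list > "$links" || return 1

  # Review or filter "$links" here if needed, then extract everything in it.
  trafilatura --input-file "$links" --output-dir "$out_dir"
}
```

For example, `crawl_sitemap "https://example.org/sitemap.xml" extracted` would write the results to a directory named `extracted`; both arguments are placeholders.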


Extract text while preserving text formatting (bold, italic, etc.):
trafilatura {{[-u|--URL]}} {{url}} --formatting


Extract text without comments (they are included by default):
trafilatura {{[-u|--URL]}} {{url}} --no-comments


Display help:
trafilatura {{[-h|--help]}}

Context

tldr-pages: common/trafilatura