htmlq: extract content from HTML in Linux easily

You may need to extract content from an HTML document. If you already use the jq command to extract data from JSON documents, htmlq gives you a similar tool for HTML. It is written in the Rust programming language.

The htmlq tool is not limited to Linux: it is available for other Unix-like systems, so you can also use it on FreeBSD, macOS, etc. It uses CSS selectors to extract snippets of content from HTML files, which is how you point to the elements you want from a web page. For example, you can extract the images or the text from the page at a given URL.

The first step is to install htmlq on your Linux system. Taking a DEB-based distro as a reference (on others the process is similar, but with the corresponding package manager), we can use:

sudo apt install cargo

cargo install htmlq

cargo is Rust's package manager, as pip is for Python. With it you can easily install a multitude of packages written in Rust. By the way, you will also need the rustc package installed, if you don't already have it on your distro.
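If the shell cannot find htmlq after the installation, a common cause is that Cargo's bin directory is not on your PATH. The following is a minimal sketch, assuming the default Cargo home of ~/.cargo (a custom CARGO_HOME changes the path); the variable name cargo_bin is just illustrative:

```shell
# Cargo installs binaries into $CARGO_HOME/bin, which defaults to ~/.cargo/bin.
cargo_bin="${CARGO_HOME:-$HOME/.cargo}/bin"

# Add that directory to PATH for the current session if it is not already there.
case ":$PATH:" in
  *":$cargo_bin:"*) ;;
  *) PATH="$cargo_bin:$PATH"; export PATH ;;
esac

# Report whether htmlq can now be found.
if command -v htmlq >/dev/null 2>&1; then
  echo "htmlq found at: $(command -v htmlq)"
else
  echo "htmlq not found; check that 'cargo install htmlq' finished without errors"
fi
```

To make the change permanent, the same PATH line can go in your shell's startup file, for example ~/.bashrc.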

Once installed, its use is simple. For example, imagine you want to find content on a page by its ID:

curl -s url | htmlq '#css-selector'
curl -s url2 | htmlq '#css-selector'
curl -s https://www.linuxadictos.com/ | htmlq --pretty '#content' | more
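To try selectors without depending on a live page, you can query a local file instead. This is only a sketch: the file name sample.html and its contents are made up for illustration, and the --attribute and --text options are the ones listed in the tool's help output.

```shell
# Create a small sample page to query (file name and contents are illustrative).
cat > sample.html <<'EOF'
<html>
  <body>
    <h1 id="title">Hello</h1>
    <img src="logo.png" alt="Logo">
    <img src="photo.jpg" alt="Photo">
  </body>
</html>
EOF

# Run the queries only if htmlq is installed.
if command -v htmlq >/dev/null 2>&1; then
  # Print the src attribute of every <img> element.
  htmlq --attribute src img < sample.html
  # Print only the text inside the element whose id is "title".
  htmlq --text '#title' < sample.html
fi
```

The first query selects by element name and the second by ID, the same two kinds of selector used in the commands above.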

Or, to find all the links on a page, you can use this other command:

curl -s https://www.linuxadictos.com | htmlq --attribute href a
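A list of links extracted this way often contains duplicates. As a small sketch, the output can be piped through standard tools to clean it up. The HTML fragment here is invented so the example runs offline; with a real page the input would come from curl as above.

```shell
# Illustrative HTML fragment; a real page would come from curl instead.
html='<p><a href="/about">About</a> <a href="https://example.com/">Ext</a> <a href="/about">About again</a></p>'

# Run the query only if htmlq is installed.
if command -v htmlq >/dev/null 2>&1; then
  # Extract every href value, then sort the list and drop duplicates.
  printf '%s\n' "$html" | htmlq --attribute href a | sort -u
fi
```

Because htmlq prints one value per line, it combines naturally with line-oriented tools such as sort, grep, or wc.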

Finally, if you have questions about the options available in htmlq, you can check its help with this command:

htmlq --help

I hope this little tutorial has helped you. As you can see, its use is simple, and you can combine it with tools such as curl, among others.

LinuxAdictos
