Hi,
I completed this exercise with the following command:
curl -s https://www.inlanefreight.com | tr -d \'\" | grep -o -E "(href|url|src)=[^ >]+" | cut -d '=' -f 2 | grep -vE ".*(defer|\.org|google|themeansar).*" | cut -d "?" -f 1 | sort | uniq | tee /dev/stderr | wc -l
Let me explain each step:
curl -s https://www.inlanefreight.com
This fetches the page content. The -s (silent) option suppresses the progress meter that curl prints to stderr when its output is piped; without it, that progress output gets mixed with the fetched data on screen and pollutes the result you see.
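You can see the difference by running the first step with and without -s (only the flag changes):

curl https://www.inlanefreight.com | head -n 3
curl -s https://www.inlanefreight.com | head -n 3

The first command shows the progress meter (written to stderr) interleaved with the page on your terminal; the second prints only the first lines of the HTML.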
tr -d \'\"
URLs and other HTML attribute values may be wrapped in single quotes, in double quotes, or not quoted at all. So, to make parsing easier, I prefer removing the quotes.
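A quick illustration on a made-up HTML snippet:

echo "<a href='/about'><img src=\"/logo.png\">" | tr -d \'\"

This prints <a href=/about><img src=/logo.png>, i.e. the same markup with both kinds of quotes stripped.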
grep -o -E "(href|url|src)=[^ >]+"
In HTML, URLs mostly appear in href and src attributes (and in url(...) values, for example in inline CSS). So I use grep to retrieve only the attribute and its URL value, the regex stopping at the first space or > it meets.
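For example, on an invented line of HTML:

echo '<a href=/index.php><script src=/js/main.js defer></script>' | grep -o -E "(href|url|src)=[^ >]+"

Thanks to -o, only the matches are printed, one per line: href=/index.php and src=/js/main.js; the surrounding markup is dropped.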
cut -d '=' -f 2
Now, I am left with entries having the following structure: (href|url|src)=url
So, I split each entry on the = delimiter and keep only the URL part.
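For instance, with a made-up value:

echo 'href=/index.php' | cut -d '=' -f 2

prints /index.php. Note that -f 2 keeps only the second =-separated field, so anything after a second = (for example inside a query string) is also dropped; that is fine here because query strings are removed in a later step anyway.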
grep -vE ".*(defer|\.org|google|themeansar).*"
Now, I have all the extracted URLs, but not all of them belong to the target domain. So I use grep's -v option to keep only the lines that do not match the given regex, in which I listed specific words found in the URLs I wanted to ignore.
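To illustrate the filtering (the sample URLs are invented):

printf '%s\n' '/index.php' 'https://fonts.google.com/css' 'https://wordpress.org/news' | grep -vE ".*(defer|\.org|google|themeansar).*"

Only /index.php survives: the other two lines match google and \.org respectively, so -v drops them.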
cut -d "?" -f 1
URLs may have query strings. A URL points to a resource, but that resource may be a script taking parameters to produce the right result. As the exercise asks us to count unique URLs in the domain, we have to ignore query strings / parameters. So I split the URLs on the ? delimiter, since that character marks the start of the query string, and I keep only the first part.
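For example, with an invented query string:

echo '/index.php?p=102' | cut -d "?" -f 1

prints /index.php, so the same resource is counted once whatever parameters it was requested with.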
sort | uniq | tee /dev/stderr | wc -l
Finally, I just need to sort the found URLs with the sort command, remove duplicates with the uniq command and count the lines with the wc command to get the result.
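A minimal illustration of that last part, with made-up paths:

printf '%s\n' '/about' '/index.php' '/about' | sort | uniq | wc -l

prints 2: the duplicate /about is collapsed before counting. Remember that uniq only removes adjacent duplicates, which is why the sort comes first.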
You can imagine that I didn't find the result in one shot. I built it step by step, and to see the detailed results and the number of URLs found at each step, I used a very useful command: tee.
tee lets you copy a stream to a file and, at the same time, to standard output so you can keep piping. So, before counting lines, I used tee to send a copy of the stream to /dev/stderr, the special file for the error output: the URLs appear on screen while the same data continues down the pipe and gets counted.
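The same trick works at any intermediate stage while you are building the pipeline, for example right after the attribute extraction:

curl -s https://www.inlanefreight.com | tr -d \'\" | grep -o -E "(href|url|src)=[^ >]+" | tee /dev/stderr | wc -l

The matched attributes are displayed via stderr and, at the same time, counted by wc -l. You can also give tee a file name instead of /dev/stderr to keep the intermediate results for later inspection.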
I decided to explain the command to illustrate the thinking process and the benefit of the tee command, but also to remind beginners of something important about URLs. URLs may be absolute or relative, and when they are relative, they belong to the website's own domain. So, when looking for URLs, it is important not to search for schemes like 'http' but for the attributes which may contain URLs, and then to parse the values keeping in mind that a scheme or a domain may or may not be present, and that many schemes exist (ex: mailto, ftp, ...).
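To make that last point concrete (the links are made up), searching by attribute catches all three of these, while grepping for 'http' would only find the second one:

echo '<a href=/about><a href=https://www.inlanefreight.com/contact><a href=mailto:info@inlanefreight.com>' | grep -o -E "href=[^ >]+"

This prints href=/about, href=https://www.inlanefreight.com/contact and href=mailto:info@inlanefreight.com.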