Hi,
I completed this exercise with the following command:
curl -s https://www.inlanefreight.com | tr -d \'\" | grep -o -E "(href|url|src)=[^ >]+" | cut -d '=' -f 2 | grep -vE ".*(defer|\.org|google|themeansar).*" | cut -d "?" -f 1 | sort | uniq | tee /dev/stderr | wc -l
Let me explain each step:
curl -s https://www.inlanefreight.com
This fetches the page content. The -s (silent) option suppresses the progress meter that curl prints to stderr when its output is piped; without it, that progress output gets mixed with the fetched data on screen and pollutes the result you see.
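You can see the difference by running the first step with and without -s (only the flag changes):

curl https://www.inlanefreight.com | head -n 3
curl -s https://www.inlanefreight.com | head -n 3

The first command shows the progress meter (written to stderr) interleaved with the page on your terminal; the second prints only the first lines of the HTML.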
tr -d \'\"
URLs and other HTML attribute values may be wrapped in single quotes, in double quotes, or not quoted at all. So, to make parsing easier, I prefer removing the quotes.
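A quick illustration on a made-up HTML snippet:

echo "<a href='/about'><img src=\"/logo.png\">" | tr -d \'\"

This prints <a href=/about><img src=/logo.png>, i.e. the same markup with both kinds of quotes stripped.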
grep -o -E "(href|url|src)=[^ >]+"
In HTML, URLs mostly appear in href and src attributes (and in url(...) values, for example in inline CSS). So I use grep to retrieve only the attribute and its URL value, the regex stopping at the first space or > it meets.
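For example, on an invented line of HTML:

echo '<a href=/index.php><script src=/js/main.js defer></script>' | grep -o -E "(href|url|src)=[^ >]+"

Thanks to -o, only the matches are printed, one per line: href=/index.php and src=/js/main.js; the surrounding markup is dropped.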
cut -d '=' -f 2
Now, I am left with entries having the following structure: (href|url|src)=url
So, I split each entry on the = delimiter and keep only the URL part.
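For instance, with a made-up value:

echo 'href=/index.php' | cut -d '=' -f 2

prints /index.php. Note that -f 2 keeps only the second =-separated field, so anything after a second = (for example inside a query string) is also dropped; that is fine here because query strings are removed in a later step anyway.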
grep -vE ".*(defer|\.org|google|themeansar).*"
Now, I have all the extracted URLs, but not all of them belong to the target domain. So I use grep's -v option to keep only the lines that do not match the given regex, in which I listed specific words found in the URLs I wanted to ignore.
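To illustrate the filtering (the sample URLs are invented):

printf '%s\n' '/index.php' 'https://fonts.google.com/css' 'https://wordpress.org/news' | grep -vE ".*(defer|\.org|google|themeansar).*"

Only /index.php survives: the other two lines match google and \.org respectively, so -v drops them.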
cut -d "?" -f 1
URLs may have query strings. A URL points to a resource, but that resource may be a script taking parameters to produce the right result. As the exercise asks us to count unique URLs in the domain, we have to ignore query strings / parameters. So I split the URLs on the ? delimiter, since that character marks the start of the query string, and I keep only the first part.
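For example, with an invented query string:

echo '/index.php?p=102' | cut -d "?" -f 1

prints /index.php, so the same resource is counted once whatever parameters it was requested with.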
sort | uniq | tee /dev/stderr | wc -l
Finally, I just need to sort the found URLs with the sort command, remove duplicates with the uniq command and count the lines with the wc command to get the result.
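A minimal illustration of that last part, with made-up paths:

printf '%s\n' '/about' '/index.php' '/about' | sort | uniq | wc -l

prints 2: the duplicate /about is collapsed before counting. Remember that uniq only removes adjacent duplicates, which is why the sort comes first.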
You can imagine that I didn't find the result in one shot. I built it step by step, and to see the detailed results and the number of URLs found at each step, I used a very useful command: tee.
tee lets you copy a stream to a file and, at the same time, to standard output so you can keep piping. So, before counting lines, I used tee to send a copy of the stream to /dev/stderr, the special file for the error output: the URLs appear on screen while the same data continues down the pipe and gets counted.
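The same trick works at any intermediate stage while you are building the pipeline, for example right after the attribute extraction:

curl -s https://www.inlanefreight.com | tr -d \'\" | grep -o -E "(href|url|src)=[^ >]+" | tee /dev/stderr | wc -l

The matched attributes are displayed via stderr and, at the same time, counted by wc -l. You can also give tee a file name instead of /dev/stderr to keep the intermediate results for later inspection.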
I decided to explain the command to illustrate the thinking process and the benefit of the tee command, but also to remind beginners of something important about URLs. URLs may be absolute or relative, and when they are relative, they belong to the website's own domain. So, when looking for URLs, it is important not to search for schemes like 'http' but for the attributes which may contain URLs, and then to parse the values keeping in mind that a scheme or a domain may or may not be present, and that many schemes exist (ex: mailto, ftp, ...).
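To make that last point concrete (the links are made up), searching by attribute catches all three of these, while grepping for 'http' would only find the second one:

echo '<a href=/about><a href=https://www.inlanefreight.com/contact><a href=mailto:info@inlanefreight.com>' | grep -o -E "href=[^ >]+"

This prints href=/about, href=https://www.inlanefreight.com/contact and href=mailto:info@inlanefreight.com.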