Bearing in mind it was a while ago and my memory sucks, lol
For the query string, in this case no, since items with different query strings are counted as different paths. (I guess it depends on what your end goal is, as index.php?page=home and index.php?page=about aren't the same, even though they run from the same backend script.) Although the word "path" does make me think of a directory… anyhow…
I don’t believe there were any anchor/fragment links in the output, so there was no need to filter on that. But yes, you are right, usually you should stop at the # as well.
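If you ever did want to strip those out, here's a rough sketch that reuses the pipeline from further down in the thread and just adds extra cuts at the ? and the # (purely illustrative, since this page apparently has neither):

curl -s https://www.inlanefreight.com | tr " " "\n" | grep "www.inlanefreight.com" | tr "'" '"' | cut -d'"' -f2 | cut -d'?' -f1 | cut -d'#' -f1 | sort -u | wc -l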
For some reason the original command didn’t work but I used similar techniques to get the right answer.
curl https://www.inlanefreight.com | tr " " "\n" | grep "www.inlanefreight.com" | tr "'" '"' | cut -d'"' -f2 | sort -u | wc -l
This gives 35, which includes an extra entry at the beginning that isn't a link, so minus 1 gives 34, which was the answer.
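An alternative sketch, assuming that stray entry doesn't start with http: filter the cut output down to fields that look like full URLs, so there's no need to subtract 1 by hand (-s just hides curl's progress output):

curl -s https://www.inlanefreight.com | tr " " "\n" | grep "www.inlanefreight.com" | tr "'" '"' | cut -d'"' -f2 | grep "^http" | sort -u | wc -l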
This question kept me up for days trying to find the right solution on my own, but I couldn't come up with anything. Instead of wasting more time, I tried the commands given in the HTB forum, and eventually one seemed to work. So I'll jot down some of the commands that might help, even though not all of them produce the correct outcome:
In this command the grep -P flag enables Perl-compatible regular expressions, a special way to match specific patterns in the given text, HTML files, etc. I got the answer '33', which is very close to the previous one.
I've tried other approaches too, but this seems to do the job, as mentioned by you guys.
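For reference, a minimal sketch of what a PCRE-based (grep -P) pipeline might look like; the exact regex from that post isn't shown, so this pattern is my own assumption, which could explain a slightly different count like 33 vs 34:

curl -s https://www.inlanefreight.com | grep -oP "https?://www\.inlanefreight\.com[^\"']*" | sort -u | wc -l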
The commands in here work, but it’s a shame you have to come here to cheat.
They should've taught more filtering methods, or at least pointed you to more useful info that would help you complete this task more easily. This is one of the things that is quite annoying about HTB: they don't seem to nudge beginners in the right direction.
It seems like the website cannot be accessed anymore. I did an nslookup, found the IP, and ran a quick nmap; it says the host seems to be down. That would explain why none of my curl commands go through.
It looks up and running to me; I can connect using the spider and can use all the curl commands in here.
If it's more about the points than the assignment, you can find the answer in this forum.
Not quite sure, I'm quite new as well, only on my 4th module at the moment. Hope someone more experienced can help you with this.
Maybe if it keeps going wrong you could take this to Discord; it seems to be a lot more active and a lot quicker for getting answers.
tr " " "\n" → replaces all whitespace with new lines
grep -E 'regex' → extracts all the URLs (I found this regex on the internet)
grep inlane → we only want the URLs from inlanefreight
sort -u → removes the duplicates
wc -l → counts the total number of output lines
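Put together, a sketch of that pipeline; the regex here is my own guess, since the post only says it was found on the internet:

curl -s https://www.inlanefreight.com | tr " " "\n" | grep -oE "https?://[^\"']+" | grep inlane | sort -u | wc -l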
curl "https://www.inlanefreight.com" | tr " " "\n" | cut -d "'" -f 2 | cut -d '"' -f 2 | grep "https://www.inlanefreight.com" | sort -u | wc -l
This command also outputs the correct answer. It can be broken down like this:
curl "https://www.inlanefreight.com" will download the source code.
tr " " "\n" will replace every space with a new line. This step makes sure all the unique paths end up on separate lines.
cut -d "'" -f 2 | cut -d '"' -f 2 | grep "https://www.inlanefreight.com" will keep only the lines that contain the pattern "https://www.inlanefreight.com", using ' and " as delimiters and keeping the field after each delimiter. After this step we already have all the paths of the domain. However, some of the paths are duplicated, so it would output the wrong answer if we just used the wc -l command here.
So we need to remove all the duplicated lines using sort -u, and then the result is correct: 34.
Personally I think the key here is to read the source code and work out the pattern of how the unique paths are written; then we can find the solution. Many thanks to everyone who posted a solution before me, as you all helped me understand the logic of solving the problem. My reply is just trying to explain it in more detail.
I also have a question here: can anyone tell me the difference between running sort -u and running sort | uniq?
Thanks
I can't even reach the site listed in the task, either from my Kali box connected to the Academy VPN or through the Pwnbox, like it mentions in the actual question.
Sadly, I did have to come find this thread to check whether it was me messing up the command. I tried several of the other commands people suggested and still had no luck. Finally I tried it outside of the VPN, and it worked fine. So it seems like anyone hitting "curl: (6) Could not resolve host: www.inlanefreight.com" isn't dealing with a them issue, but a VPN issue.
Here site.txt is the output of curl. Then I use grep to show only the words that contain the domain followed by any number of characters until a " or ' is reached, since " and ' basically mean the path has ended. Then I sort it so I can use uniq and count the lines. I know it uses regex, but personally I think it's much easier this way.
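A sketch of what that could look like, with the regex reconstructed from the description above:

curl -s https://www.inlanefreight.com > site.txt
grep -oE "https?://www\.inlanefreight\.com[^\"']*" site.txt | sort | uniq | wc -l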
Even though I can do it in a shorter way, the first time I did it this way: replaced each space with a \n newline, then every > with \n, and every single quote with a double quote, and got the answer.
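A sketch of that longer route, assuming the extraction at the end follows the same grep/cut pattern as the earlier commands in this thread:

curl -s https://www.inlanefreight.com | tr " " "\n" | tr ">" "\n" | tr "'" '"' | grep "www.inlanefreight.com" | cut -d'"' -f2 | sort -u | wc -l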