I use this:
curl "https://inlanefreight.com" > index.html
sed "s/\(\"\|'\)\(https:\/\/www\.inlanefreight\.com\S*\)\1/\1\n\2\n\1/g" index.html | grep "https://www.inlane" | sort -u | wc -l
hail regex
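For anyone wondering what the sed expression does, here is a minimal sketch on a small made-up sample instead of the live page. The sample HTML and the variable names are assumptions for illustration only. It relies on GNU sed (for \|, \S and \n in the replacement). The idea is that every quoted URL is pushed onto its own line, so grep can pick it up cleanly.

```shell
# Hypothetical sample standing in for the downloaded index.html
html=$(cat <<'EOF'
<link rel='stylesheet' href='https://www.inlanefreight.com/wp-content/themes/ben_theme/style.css' type='text/css' />
<a href="https://www.inlanefreight.com/index.php/news/">News</a>
<a href="https://www.inlanefreight.com/index.php/news/">More news</a>
<a href="https://example.org/">Elsewhere</a>
EOF
)

# Group 1 is the opening quote, group 2 the URL, and \1 requires the same quote to close it.
# The replacement puts the URL on a line of its own (GNU sed syntax).
urls=$(printf '%s\n' "$html" \
  | sed "s/\(\"\|'\)\(https:\/\/www\.inlanefreight\.com\S*\)\1/\1\n\2\n\1/g" \
  | grep "https://www.inlane" \
  | sort -u)

printf '%s\n' "$urls"
count=$(printf '%s\n' "$urls" | wc -l)
echo "$count"
```

Note that \S* is greedy, so on a line where several quoted attributes sit next to each other without spaces it could grab more than one URL. The sample above avoids that case.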
This was a difficult one to figure out. I couldn’t do it by myself.
The command that finally worked for me:
Still trying to figure out how this command worked…
Can anyone help explain this? Thanks
I solved it this way:
curl -s "https://www.inlanefreight.com" | grep 'https://www.inlanefreight.com' | cut -d'.' -f3,4 | cut -d'<' -f1 | grep -v 'org/' | sort | uniq | wc -l
I am not very clear about the definition of a path. Do the two extracted URLs below count as the same path?
I think they are both the same as
"https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed"
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&format=xml
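In URL terms, everything from the first "?" onward is the query string, not part of the path. Whether the exercise counts it that way is an assumption on my part, but a minimal sketch of treating the two URLs as one path could look like this (the variable names are made up):

```shell
# The two URLs from the post above
urls='https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&format=xml'

# Keep only the text before the first "?" (drops the query string), then de-duplicate
paths=$(printf '%s\n' "$urls" | cut -d'?' -f1 | sort -u)

printf '%s\n' "$paths"
path_count=$(printf '%s\n' "$paths" | wc -l)
echo "$path_count"
```

With the query strings removed, both lines collapse into a single entry.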
Very helpful, thank you.
I got stuck on this for ages. This is what I did that worked.
curl -s https://www.inlanefreight.com | grep -o 'www\.inlanefreight\.com[^"]*' | awk -F 'www.inlanefreight.com' '{print $2}' | sort | uniq | wc -l
And then there was an entry with just \
so I took 1 off the count (output was 35, took 1 off for 34)
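Instead of subtracting by hand, the stray entry could be filtered out before counting. This is only a sketch on a made-up list of extracted paths. I have not checked where the lone backslash comes from on the real page.

```shell
# Hypothetical output of the extraction step, including the stray "\" entry
paths='/
/index.php/news/
\
/xmlrpc.php'

# -v inverts the match, -x requires the whole line to match,
# -F treats the pattern as a literal string rather than a regex
count=$(printf '%s\n' "$paths" | grep -vxF '\' | sort -u | wc -l)
echo "$count"
```

Appending the same grep stage to the pipeline above, just before sort, should give the corrected count directly.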
It’s a really dumb question.
With some help from the comments above, I tried a few tweaks of my own. The result is rather long, but it got the job done.
I will try to explain.
Please note: while trying to reply, I was alerted that new users may only post 2 links, so wherever you see
THE_DOMAIN_LINK it refers to https://www.inlanefreight.com
curl -s THE_DOMAIN_LINK > freight.txt
1. curl -s https://www.inlanefreight.com > freight.txt
I first saved the content to a file so I would not have to run curl every time I wanted to try something new.
2. cat freight.txt | grep "THE_DOMAIN_LINK" | tr " " "\n" | grep "THE_DOMAIN_LINK" | grep -o "https[^']*" | cut -d '"' -f1 | sort -u | wc -l
=>grep "THE_DOMAIN_LINK"
I piped the saved text to grep to find every appearance of the domain name. This returned a number of lines, but as you already know, they included all the other text before the actual
THE_DOMAIN_LINK. My goal now was to find a way to extract only the text starting with THE_DOMAIN_LINK.
=>tr " " "\n" | grep "THE_DOMAIN_LINK"
So I piped it to the tr command to replace every space in each line with a newline, so that each occurrence of THE_DOMAIN_LINK ends up on its own line. Of course, there was still other text in front of it.
=>grep -o "https[^']*"
This gets all the text starting with "https" up to, but not including, the next single quote. NOTE: I did not match on href because some of the links were in src or just url(, so this was the best way in my opinion. The result was lots of lines beginning with THE_DOMAIN_LINK.
=>cut -d '"' -f1
Some lines had a double quote (") at the end, so you would have both THE_DOMAIN_LINK and THE_DOMAIN_LINK", and to the system those are two separate lines. I used cut with the double quote (") as the separator and kept the first field with -f1. This way all lines start with THE_DOMAIN_LINK and have no quote at the end.
I do believe there's a better approach here.
=>sort -u | wc -l
Finally, sort -u keeps only the unique lines and wc -l counts them.
Surely there are many solutions better and easier than this. If it were a Java or Python program I would have finished this off easily, but as a Linux newbie, this is what I came up with.
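The steps above can be tried offline on a small sample. This is a minimal sketch: the sample HTML and the variable names are made up, and the real page will give a different count.

```shell
domain='https://www.inlanefreight.com'

# Hypothetical sample standing in for freight.txt
page=$(cat <<'EOF'
<link rel='stylesheet' href='https://www.inlanefreight.com/wp-content/themes/ben_theme/style.css' type='text/css' />
<a class="nav" href="https://www.inlanefreight.com/index.php/news/">News</a>
<a class="nav" href="https://www.inlanefreight.com/index.php/news/">More news</a>
<p>No links here</p>
EOF
)

# Stage by stage:
#   grep "$domain"      keep only lines that mention the domain
#   tr ' ' '\n'         split those lines at every space
#   grep "$domain"      keep only the pieces that contain the domain
#   grep -o "https..."  print from "https" up to the next single quote
#   cut -d '"' -f1      drop a trailing double quote and anything after it
#   sort -u             keep unique lines only
links=$(printf '%s\n' "$page" \
  | grep "$domain" \
  | tr ' ' '\n' \
  | grep "$domain" \
  | grep -o "https[^']*" \
  | cut -d '"' -f1 \
  | sort -u)

printf '%s\n' "$links"
link_count=$(printf '%s\n' "$links" | wc -l)
echo "$link_count"
```

Running the stages one at a time (cutting the pipeline after each stage) is a handy way to see what every command contributes.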
Hope it helps someone
thank you
Genius!
This one worked for me thanks!
It took me 3 hours to formulate this command:
curl -s https://www.inlanefreight.com | grep -oE "href=[\"'][^\"']+" | sort -u | wc -l
The forum may display the quotes above as curly quotes (“ ” ‘ ’). If so, retype them as plain " and ' before running the command.
Question:
$ curl -s https://www.inlanefreight.com | grep -o 'www\.inlanefreight\.com[^"]*' | awk -F 'www.inlanefreight.com' '{print $2}' | sort | uniq | cut -d? -f1 | grep -v \> | sed 's/\\//g' | sort > paths.txt; cat paths.txt; cat paths.txt | wc -l; rm paths.txt
Answer:
/
/index.php/about-us/
/index.php/career/
/index.php/comments/feed/
/index.php/contact/
/index.php/feed/
/index.php/news/
/index.php/offices/
/index.php/wp-json/
/index.php/wp-json/oembed/1.0/embed
/index.php/wp-json/wp/v2/pages/7
/wp-content/themes/ben_theme/css/animate.css
/wp-content/themes/ben_theme/css/bootstrap-progressbar.min.css
/wp-content/themes/ben_theme/css/bootstrap.css
/wp-content/themes/ben_theme/css/colors/default.css
/wp-content/themes/ben_theme/css/font-awesome.css
/wp-content/themes/ben_theme/css/jquery.smartmenus.bootstrap.css
/wp-content/themes/ben_theme/css/magnific-popup.css
/wp-content/themes/ben_theme/css/owl.carousel.css
/wp-content/themes/ben_theme/css/owl.transitions.css
/wp-content/themes/ben_theme/images/breadcrumb-back.jpg
/wp-content/themes/ben_theme/js/bootstrap.min.js
/wp-content/themes/ben_theme/js/jquery.smartmenus.bootstrap.js
/wp-content/themes/ben_theme/js/jquery.smartmenus.js
/wp-content/themes/ben_theme/js/navigation.js
/wp-content/themes/ben_theme/js/owl.carousel.min.js
/wp-content/themes/ben_theme/style.css
/wp-includes/css/dist/block-library/style.min.css
/wp-includes/js/jquery/jquery-migrate.min.js
/wp-includes/js/jquery/jquery.min.js
/wp-includes/js/wp-embed.min.js
/wp-includes/js/wp-emoji-release.min.js
/wp-includes/wlwmanifest.xml
/xmlrpc.php
34
I recommend this command:
curl https://www.inlanefreight.com | tr ' ' '\n' | tr "'" '\n' | tr '"' '\n' | grep https://www.inlanefreight.com/ | sort -u | wc -l
curl is obvious. The tr calls replace spaces and both kinds of quote characters with newlines, so that each URL ends up isolated on its own line, which looks nicer in my opinion. grep for the site name then gives one URL per line without any extra characters, sort -u ensures uniqueness, and the number of lines is the answer.
I completely agree… The only other thing to do is pay for a subscription that will give you access to the questions walk-through. If any of you have done this, please provide your input on how helpful the walk through option is. I don’t mind putting some money down to get a better quality of learning, but I would like to know that it is worth it first.
The correct answer, ladies and gentlemen:
curl -s https://www.inlanefreight.com | tr ' ' '\n' | tr "'" '\n' | tr '"' '\n' | grep 'https://www.inlanefreight.com/' | sort -u | wc -l
Note that the forum can sometimes change ' to ’ and " to “. They are not the same characters, so retype the quotes if the command fails after copying.
Bruh, why didn’t you use grep -o?
Why did you use "tr" 3 times?
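Both questions can be checked on a small sample. The sketch below is only an illustration: the sample HTML and the variable names are made up. It shows that a single tr call with a set of characters, or grep -o on its own, gives the same count as the three tr calls on this input.

```shell
# Hypothetical sample page
page=$(cat <<'EOF'
<link rel='stylesheet' href='https://www.inlanefreight.com/wp-includes/css/dist/block-library/style.min.css' />
<script src="https://www.inlanefreight.com/wp-includes/js/jquery/jquery.min.js"></script>
<a href="https://www.inlanefreight.com/index.php/contact/">Contact</a>
<a href="https://www.inlanefreight.com/index.php/contact/">Contact us</a>
EOF
)

# One tr call: space, single quote and double quote are each mapped to a newline
with_tr=$(printf '%s\n' "$page" | tr " '\"" '\n\n\n' | grep 'https://www.inlanefreight.com/' | sort -u | wc -l)

# grep -o prints only the matching part, so no splitting is needed beforehand
with_grep=$(printf '%s\n' "$page" | grep -o "https://www\.inlanefreight\.com/[^\"' ]*" | sort -u | wc -l)

echo "$with_tr $with_grep"
```

On the real page the two variants may differ slightly if URLs are terminated by other characters, so treat the equality here as a property of this sample only.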
Hello.
Solved in this way :
curl https://www.inlanefreight.com | grep -oP '(https:\/\/www\.inlanefreight\.com[^\s?#"]*)' | sort | uniq | wc -l
It didn't work. Can anyone help? The problem seems to be in the grep regex pattern.
Nope still does nothing.
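Without the error message it is hard to say what went wrong. Two possible causes are curly quotes picked up when copying from the forum, and a grep build without -P support. As a sketch under those assumptions, here is the same idea with straight quotes and -E (extended regex) instead of -P, run on a made-up sample:

```shell
# Hypothetical sample page with a fragment and query strings
page=$(cat <<'EOF'
<a href="https://www.inlanefreight.com/index.php/offices/#map">Offices</a>
<a href="https://www.inlanefreight.com/index.php/offices/?lang=en">Offices</a>
<a href="https://www.inlanefreight.com/xmlrpc.php?rsd">RSD</a>
EOF
)

# [:space:] is the POSIX class for whitespace (used in place of \s from -P).
# The match stops at whitespace, "?", "#" or a double quote.
found=$(printf '%s\n' "$page" \
  | grep -oE 'https://www\.inlanefreight\.com[^[:space:]?#"]*' \
  | sort -u)

printf '%s\n' "$found"
found_count=$(printf '%s\n' "$found" | wc -l)
echo "$found_count"
```

Because the match stops at "?" and "#", the two offices links collapse into one entry.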
Hello,
My solution is not perfect but it does what we want.
curl "https://www.inlanefreight.com" | tr "\"" "\n" | tr "'" "\n" | grep "https://www.inlanefreight.com" | sort -u | wc -l