Creepy Crawlies | Information Gathering - Web Edition Module

I’m struggling with the creepy crawlies section. It seems that there should be a target to crawl but I don’t see the target button.

I’ve been having the same issue. I completed this module a while ago; then, when the new content was added and I went to re-complete it, I ran into this.

At first I figured we were meant to scan a target we would’ve spawned earlier in the module, kind of assuming we’re doing the whole thing in one go, but that doesn’t seem to be the case.


I am having a similar issue with this module. I am unable to use Scrapy because HTB doesn’t allow “pip install scrapy”, but they do allow “sudo apt install scrapy” (which causes DLL errors when trying to use ReconSpider with Scrapy). They need to update the guide to reflect this. Anyway, I had finished this module a while ago and received the badge for it, so I moved on after roughly two hours of trying to get the command to work.

inlanefreight.com is weird for me as well. There’s HTTPS inlanefreight, HTTP inlanefreight, and www inlanefreight. Way too many IMO, and sometimes they don’t work.

I tried using the ZAP spider and was still unable to find the “reports” page the question was talking about. Was anyone able to figure this out?

“For Linux Users”
Go to the Python documentation page “12. Virtual Environments and Packages” (Python 3.12.4 documentation) and follow the instructions on how to set up a virtual environment. Once you have done that, run “pip3 install scrapy”; then you can bring ReconSpider into the virtual environment (same dir as the venv). After that you’re good to run the command. The CLI will give you some goop; look for the JSON file.
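If it helps, here is roughly what that looks like end to end on a Linux VM. Treat it as a sketch: the venv name is just an example, and ReconSpider.py is assumed to be the copy you downloaded from the module.

# Create and activate a virtual environment (name is arbitrary)
python3 -m venv scrapy-venv
source scrapy-venv/bin/activate

# Install Scrapy inside the venv
pip3 install scrapy

# Copy the ReconSpider.py from the module into this same directory,
# then point it at the always-on target
python3 ReconSpider.py http://inlanefreight.com

# Amid the console goop, look for the JSON results file it writes in the current directory
ls *.json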


Thanks mate!

Bruh how are you guys even scanning a host? There is no option to spin up a target box.

For those confused about not being able to spin up a target, the target is the example used in the module, http://inlanefreight[.]com

I don’t think I’m following.

I need a host to spider. Since it doesn’t give the option to spin one up, what am I supposed to spider?

The module gives a few examples of using inlanefreight. This host is always up and the one the question wants you to spider.

Ohhhh, is it because it’s “.com” as opposed to “.htb”?

That would cause it to not go to the correct host. You can always make sure you have the right host by pinging it first and making sure you get a response.
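For example, two quick sanity checks (assuming the module’s public domain, not a .htb vhost):

# Does the name resolve and the host answer?
ping -c 4 www.inlanefreight.com

# Does the web server respond?
curl -I http://www.inlanefreight.com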

This was super helpful. If you follow the tutorial line by line you won’t be able to run ReconSpider. Guys, www.inlanefreight.com is up and running all the time; we don’t need to spin up a target! It’s also accessible without the HTB VPN.


I cannot get ReconSpider.py to run at all on my VM. I get the following error:

from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware
ModuleNotFoundError: No module named ‘scrapy.downloadermiddlewares.offsite’

It appears the script was written against a different version of the Scrapy framework than the one I have installed. Any pointers on where to download a compatible version?

Many thx!
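Before hunting for a different ReconSpider, it might be worth checking which Scrapy your script is actually importing. These are generic checks, not something from the module:

# Which Scrapy is being picked up, and from where?
python3 -c "import scrapy; print(scrapy.__version__, scrapy.__file__)"

# Does that install provide the import ReconSpider is asking for?
python3 -c "from scrapy.downloadermiddlewares.offsite import OffsiteMiddleware; print('ok')"

# If the second check fails, installing Scrapy fresh inside a virtual environment
# (as suggested earlier in the thread) instead of using the apt package is the usual fix
python3 -m venv scrapy-venv && source scrapy-venv/bin/activate && pip3 install --upgrade scrapy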

This is a simple script I created, and it will solve the issue. Install the dependencies first:

pip install requests beautifulsoup4 rich tqdm

Then run the script below:

import argparse
import requests
from bs4 import BeautifulSoup, Comment
from urllib.parse import urljoin, urlparse
from collections import deque
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from tqdm import tqdm
import signal
import sys

console = Console()
visited = set()
all_internal_links = []
all_external_links = []
all_comments = []
progress_bar = None

def accessi_logo():
    # Use a raw string so the backslashes in the ASCII art are kept literally
    # instead of being treated as escape sequences or line continuations
    logo = r"""
     \_______/
 `.,-'\_____/`-.,'
  /`..'\ _ /`.,'\
 /  /`.,' `.,'\  \
/__/__/     \__\__\__
\  \  \     /  /  /
 \  \,'`._,'`./  /
  \,'`./___\,'`./
 ,'`-./_____\,-'`.
     /       \
     Ly0kha                      
    """
    console.print(logo, style="bold red")

def get_args():
    parser = argparse.ArgumentParser(description="web crawler for bug bounty recon")
    parser.add_argument("url", help="the target url", type=str)
    parser.add_argument("-d", "--depth", help="recon depth level", type=int, default=5)
    parser.add_argument("-b", "--breadth", help="use breadth-first search", action="store_true")
    parser.add_argument("-f", "--filter", help="filter out extensions like .jpg,.png,.pdf", type=str, default="")
    parser.add_argument("-t", "--timeout", help="timeout for http requests", type=int, default=10)
    parser.add_argument("-o", "--output", help="save recon results to html file", type=str, default=None)
    return parser.parse_args()

def fetch_page(url, timeout):
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        console.print(f"[bold red]Error fetching {url}: {e}")
        return None

def extract_links_and_comments(html, base_url, filter_exts):
    soup = BeautifulSoup(html, "html.parser")
    internal_links = set()
    external_links = set()
    comments = []

    # Extracting all anchor links
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        full_url = urljoin(base_url, href)
        if full_url.startswith("mailto:") or full_url.startswith("javascript:"):
            continue
        if any(full_url.endswith(ext) for ext in filter_exts):
            continue
        if urlparse(full_url).netloc == urlparse(base_url).netloc:
            internal_links.add(full_url)
        else:
            external_links.add(full_url)

    # Extracting comments
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    
    return internal_links, external_links, comments

def display_results(url, internal_links, external_links, comments):
    console.print(Panel(f"[bold cyan]Recon results for {url}:", style="bold white"))

    # Display internal links
    table = Table(title="[bold green]Internal Links", show_lines=True)
    table.add_column("No.", justify="center")
    table.add_column("Internal Link", justify="left", style="green")
    for i, link in enumerate(internal_links, 1):
        table.add_row(str(i), link)
    console.print(table)

    # Display external links
    if external_links:
        ext_table = Table(title="[bold red]External Links", show_lines=True)
        ext_table.add_column("No.", justify="center")
        ext_table.add_column("External Link", justify="left", style="red")
        for i, link in enumerate(external_links, 1):
            ext_table.add_row(str(i), link)
        console.print(ext_table)
    else:
        console.print("[bold yellow]No external links found.", style="bold yellow")

    # Display comments
    if comments:
        comment_table = Table(title="[bold yellow]HTML Comments", show_lines=True)
        comment_table.add_column("No.", justify="center")
        comment_table.add_column("Comment", justify="left", style="yellow")
        for i, comment in enumerate(comments, 1):
            comment_table.add_row(str(i), comment.strip())
        console.print(comment_table)
    else:
        console.print("[bold yellow]No comments found.", style="bold yellow")

def save_results_to_html(output_file):
    spider_logo_url = "https://www.svgrepo.com/show/400766/spider.svg"
    html_content = f"""
    <html>
    <head>
        <title>Recon Results</title>
        <style>
            body {{
                font-family: "Courier New", Courier, monospace;
                background-color: #1e1e1e;
                color: #dcdcdc;
            }}
            h1, h2 {{
                text-align: center;
                color: #c0c0c0;
            }}
            table {{
                width: 100%;
                border-collapse: collapse;
                margin: 20px 0;
                background-color: #2e2e2e;
            }}
            th, td {{
                border: 1px solid #444;
                padding: 10px;
                text-align: left;
                font-family: "Courier New", Courier, monospace;
            }}
            th {{
                background-color: #3e3e3e;
                color: #dcdcdc;
            }}
            td {{
                color: #f5f5f5;
            }}
            a {{
                color: #5ac8fa;
                text-decoration: none;
            }}
            a:hover {{
                text-decoration: underline;
                color: #8fbcbb;
            }}
            .spider-logo {{
                display: block;
                margin-left: auto;
                margin-right: auto;
                width: 150px;
            }}
            .internal-links td {{
                color: #b0e0e6;
            }}
            .external-links td {{
                color: #f08080;
            }}
        </style>
    </head>
    <body>
        <h1>Recon Results</h1>
        <img src="{spider_logo_url}" alt="Spider Logo" class="spider-logo" />
        <h2>Internal Links</h2>
        <table class="internal-links">
            <tr>
                <th>No.</th>
                <th>Internal Link</th>
            </tr>
    """
    
    for i, link in enumerate(all_internal_links, 1):
        html_content += f'<tr><td>{i}</td><td><a href="{link}">{link}</a></td></tr>'
    
    html_content += "</table>"

    html_content += """
        <h2>External Links</h2>
        <table class="external-links">
            <tr>
                <th>No.</th>
                <th>External Link</th>
            </tr>
    """
    
    for i, link in enumerate(all_external_links, 1):
        html_content += f'<tr><td>{i}</td><td><a href="{link}">{link}</a></td></tr>'
    
    html_content += "</table>"

    html_content += """
        <h2>HTML Comments</h2>
        <table class="comments">
            <tr>
                <th>No.</th>
                <th>Comment</th>
            </tr>
    """
    
    for i, comment in enumerate(all_comments, 1):
        html_content += f'<tr><td>{i}</td><td>{comment}</td></tr>'
    
    html_content += """
        </table>
    </body>
    </html>
    """

    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    
    console.print(f"Results saved to {output_file}")

def should_crawl(url, filter_exts):
    if url in visited:
        return False
    if any(url.endswith(ext) for ext in filter_exts):
        return False
    return True

def breadth_first_crawl(start_url, depth_limit, timeout, filter_exts):
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth > depth_limit or url in visited:
            continue
        console.print(f"[bold blue]Crawling: {url} [depth: {depth}]")
        html = fetch_page(url, timeout)
        if not html:
            continue
        internal_links, external_links, comments = extract_links_and_comments(html, url, filter_exts)
        all_internal_links.extend(internal_links)
        all_external_links.extend(external_links)
        all_comments.extend([comment.strip() for comment in comments])
        display_results(url, internal_links, external_links, comments)
        visited.add(url)
        for link in internal_links:
            if should_crawl(link, filter_exts):
                queue.append((link, depth + 1))

def depth_first_crawl(url, depth, depth_limit, timeout, filter_exts):
    if depth > depth_limit or url in visited:
        return
    console.print(f"[bold blue]Crawling: {url} [depth: {depth}]")
    html = fetch_page(url, timeout)
    if not html:
        return
    internal_links, external_links, comments = extract_links_and_comments(html, url, filter_exts)
    all_internal_links.extend(internal_links)
    all_external_links.extend(external_links)
    all_comments.extend([comment.strip() for comment in comments])
    display_results(url, internal_links, external_links, comments)
    visited.add(url)
    for link in internal_links:
        if should_crawl(link, filter_exts):
            depth_first_crawl(link, depth + 1, depth_limit, timeout, filter_exts)

def handle_exit_signal(signal, frame):
    if args.output:
        console.print("\n[bold yellow]Recon interrupted! Saving results...")
        save_results_to_html(args.output)
    sys.exit(0)

def main():
    global args
    accessi_logo()
    args = get_args()
    filter_exts = args.filter.split(",") if args.filter else []
    console.print("[bold cyan]Starting recon...", style="bold green")
    console.print(f"URL: {args.url}")
    console.print(f"Depth limit: {args.depth}")
    console.print(f"Filter extensions: {filter_exts}")
    console.print(f"Timeout: {args.timeout} seconds")
    
    if args.output:
        console.print(f"[bold magenta]Saving to file: {args.output}")
        signal.signal(signal.SIGINT, handle_exit_signal)

    if args.breadth:
        breadth_first_crawl(args.url, args.depth, args.timeout, filter_exts)
    else:
        depth_first_crawl(args.url, 0, args.depth, args.timeout, filter_exts)

    if args.output:
        save_results_to_html(args.output)

if __name__ == "__main__":
    main()
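For what it’s worth, assuming you save the script above as crawler.py (the filename is arbitrary), a run against the module’s always-on target could look like this:

# Depth-first crawl (the default), two levels deep, skipping images, saving an HTML report
python3 crawler.py http://www.inlanefreight.com -d 2 -f .jpg,.png,.gif -o results.html

# Or breadth-first with a shorter request timeout
python3 crawler.py http://www.inlanefreight.com -b -d 2 -t 5 -o results.html

Pressing Ctrl+C mid-crawl still writes the report when -o is given, because the script registers a SIGINT handler that saves before exiting.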

pip3 install scrapy

Worked! Thank you much!