What's the best tool to spider a web site, get a site map, and also get any request variables?

What’s the best tool to spider a web site, get a site map, and also get any request variables (so all post and get variables) for each particular page?


The answer I gave to your question about web scraping still applies here (see your other post): it’s the technical foundation of what you are trying to do… however, you’ll soon come to realize this is a pretty in-depth question:

I’ve been writing an automated scanner for Burp: while I browse a site, it just hacks away in the background for me.

While working on that vulnerability scanner I was pondering your question: can’t I just have this fully automated? Instead of me having to browse the site (which is basically just feeding requests to my scanner), can’t I just give it a base url, have my system crawl the site, construct meaningful requests from that and try to hack whatever parameters it finds?

… it’s in fact pretty complex:

If you think about it: variables can be as obvious as being present in the url (as with a get request), however post requests are different: you’d actually have to parse the website (as in the source code), detect all forms, retrieve whatever they are likely to post, and interpret their fields so you can send meaningful values along with that post request.
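As a rough illustration of those two cases, here is a minimal Python sketch using only the standard library. The helper names `get_params` and `extract_forms` are made up for this example, and it only sees forms that are present in the static HTML, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qsl


def get_params(url):
    """Return the query-string parameters of a URL as (name, value) pairs."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


class FormCollector(HTMLParser):
    """Collect every <form> with its method, action and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "action": attrs.get("action") or "",
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:  # fields without a name are not submitted
                default_type = "text" if tag == "input" else tag
                self._current["fields"].append({
                    "name": name,
                    "type": (attrs.get("type") or default_type).lower(),
                    "value": attrs.get("value") or "",
                })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Parse an HTML document and return a list of form descriptions."""
    collector = FormCollector()
    collector.feed(html)
    return collector.forms
```

For a get request the parameter names fall straight out of the url; for a post you only learn them by reading the form that would produce the request.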

To make it more concrete: say you spider a base url, crawl a page and are able to detect a login form on that page, you now need to parse the source code of that form to retrieve the names of the variables that will be posted. Up till here, that’s pretty doable… However:

It becomes even more complex when there are mechanisms in place like ‘cross site request forgery tokens’ or when you start to think about the fact that cookies or sessions themselves might store variables as well:

Just think about a web shop where you purchase stuff that ends up in your cart: when you post the purchase, those variables are likely stored in your session, which is tied to your cookie. But not entirely: if you start from a spider/crawl, the session is also tied to your previous behavior and actions on that site. So if you want to automate that, you’d first need to send a bunch of requests (in this case to fill up your shopping cart). That’s a task that needs more interpretation of a site than machines are able to offer today, I think.

And we haven’t even touched on the fact that javascript can strongly alter variables once a button is pressed.
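To give an idea of the token part only: a post usually has to carry the hidden fields from a freshly fetched copy of the page, because a token from an older page will typically be rejected. A minimal sketch, assuming tokens live in hidden inputs and have recognizable names (the name pattern and the helper names are my own guesses, and it does nothing about server-side session state):

```python
import re
from html.parser import HTMLParser

# Assumed naming hints for anti-CSRF fields; real sites may use anything.
TOKEN_NAME = re.compile(r"csrf|xsrf|token|nonce|authenticity", re.IGNORECASE)


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    parser = HiddenInputs()
    parser.feed(html)
    return parser.hidden


def likely_tokens(hidden):
    """Hidden fields whose name suggests a per-request token."""
    return {name: value for name, value in hidden.items() if TOKEN_NAME.search(name)}


def build_post_data(fresh_html, user_values):
    """Start from the page's current hidden fields, then add our own values."""
    data = hidden_fields(fresh_html)
    data.update(user_values)
    return data
```

The point is that `build_post_data` has to be fed the page as it looks right now, in the same session, which is exactly what makes replaying crawled requests unreliable.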

So it depends what you are trying to do:

If you just want to do simple ‘retrieving of parameters from actual get or post requests’, then yeah: pretty simple. You can write something in Python or write a Burp plugin that uses the Burp proxy as input and have a near 100% success rate.

When you are talking about ‘all possible post and get variables starting from a spider or even a crawl’: gain some knowledge/experience and you’ll come to realize how deep your question actually goes. In reality your success rate will drop significantly (in the range of 20%) on more complex sites.
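The simple case, reading parameters out of a request you already captured, can be sketched in a few lines of Python. This assumes the request is available as raw text and that the body is form-urlencoded; `request_params` is a made-up helper name, and multipart or JSON bodies are deliberately ignored here.

```python
from urllib.parse import urlsplit, parse_qsl


def request_params(raw_request):
    """Split a raw HTTP request into its get and post parameters."""
    head, _, body = raw_request.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)

    # Header names are case-insensitive, so normalize them to lowercase.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    params = {
        "get": parse_qsl(urlsplit(target).query, keep_blank_values=True),
        "post": [],
    }
    content_type = headers.get("content-type", "")
    if method.upper() == "POST" and content_type.startswith(
        "application/x-www-form-urlencoded"
    ):
        params["post"] = parse_qsl(body, keep_blank_values=True)
    return params
```

Because the browser already did all the hard work (filling in forms, running javascript, carrying the session), there is nothing left to interpret: you only split what is on the wire.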

Concrete example: with my scanner I detect whether a page holds a login form, and if I find one, I automatically test it against ‘default credentials’.
It’s probably the simplest example of what you might be trying to do. But however simple it may seem, quite a lot is going on:

In this example, you have a post form: the first thing you need is some code that interprets what this form is for. There are many kinds of forms, and you need this logic to know whether or not a given form is a valid candidate for trying default credentials. One way to decide: if you see the pattern of ‘something that looks like a username + something that looks like a password + some button to post that’, then you probably have a login form (note the ‘probably’).
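That pattern can be written down as a small heuristic. The sketch below is not my scanner’s actual code, just an illustration: it assumes each form field is a dict with `name` and `type` keys, and the list of username hints is an arbitrary guess you would have to extend.

```python
import re

# Assumed hints for "something that looks like a username".
USERNAME_HINT = re.compile(r"user|login|email|uname|account", re.IGNORECASE)


def looks_like_login(fields):
    """Guess whether a list of form fields describes a login form."""
    passwords = [f for f in fields if f.get("type") == "password"]
    usernames = [
        f for f in fields
        if f.get("type", "text") in ("text", "email")
        and USERNAME_HINT.search(f.get("name", ""))
    ]
    has_button = any(
        f.get("type") in ("submit", "button", "image") for f in fields
    )
    # Exactly one password field: two usually means a registration
    # or change-password form rather than a login.
    return len(passwords) == 1 and len(usernames) >= 1 and has_button
```

Note that this still only answers ‘probably’: a login form whose username field is called `f1` slips straight through.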

Once you have established that this is indeed a login form, you then need to process the fields in that form to retrieve the names of the variables: descriptions of fields or surrounding fields may reveal, for example, that the first input field is the place where a human is expected to enter his username and that internally this field is called ‘uname’. I need to know both the human use and the internal name, so that when I construct a post request, I post all default usernames to that ‘uname’ parameter.
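A sketch of that mapping step, again as an illustration under assumptions rather than the way my scanner does it: here the username field is guessed purely by position (the last text field before the password field), hidden fields are copied along unchanged, and the function only builds request bodies without sending anything. The helper names are hypothetical.

```python
from urllib.parse import urlencode

TEXT_TYPES = ("text", "email")


def map_login_fields(fields):
    """Map the human roles 'username' and 'password' to internal field names."""
    for index, field in enumerate(fields):
        if field.get("type") == "password":
            # Assumption: the username is the closest text field before it.
            before = [
                f for f in fields[:index]
                if f.get("type", "text") in TEXT_TYPES
            ]
            if before:
                return {"username": before[-1]["name"], "password": field["name"]}
    return None


def credential_bodies(fields, credentials):
    """Build one urlencoded post body per (username, password) pair."""
    roles = map_login_fields(fields)
    if roles is None:
        return []
    # Keep hidden fields so the request resembles a real submission.
    base = {
        f["name"]: f.get("value", "")
        for f in fields if f.get("type") == "hidden"
    }
    bodies = []
    for username, password in credentials:
        data = dict(base)
        data[roles["username"]] = username
        data[roles["password"]] = password
        bodies.append(urlencode(data))
    return bodies
```

Guessing by position does not care what the field is called, but it breaks as soon as the form puts another text field (say, a domain or a captcha) between the username and the password.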

This is about the simplest example of how complex these things can become… and up till here: as long as a form uses meaningful names and you trained your program to recognize those names, you’re good to go… but if a coder is lazy, uses a foreign language or just gives some crazy name to some input field, you’re soon f*cked.

Anyway, not sure if this is going towards what you are trying to achieve with your question, but I hope I brought some insight into how deep your question really goes, and trust me: it becomes really intricate real soon.
Simple as requests or sites may seem, a lot goes on when you are trying to capture ‘all get and post parameters’, certainly when you are starting from a crawl and not from the requests themselves.

This being said: I do like the way you are thinking… in the end hacking is just about putting data in parameters (regardless of where they reside) and when it comes to machine testing, there is no difference between the mechanisms of ‘sql injection’ and ‘command injection’: to the machine it’s just ‘inserting a payload in a parameter’ and the most fundamental place where this process begins is by ‘finding all parameters’. So your question is very fundamental and intelligent.

What my experience has taught me is that finding the parameters is the easy part, but even in this easy part, you’ll have a hard time completing the task 100%… and then you haven’t even tackled the hard or near impossible parts, like server side data and the massive challenge of interpreting detected parameters and constructing meaningful requests from them.

Anyway, don’t let that stop you. Focus on the parts that are possible… as Pareto as well as Ford taught us: even if you focus on the 20% that is possible, 80% of all problems are likely to disappear.

Best of luck!