Content Discovery
Updated on January 4, 2025
This content could be, for example, pages or portals intended for staff usage, older versions of the website, backup files, configuration files, administration panels, etc. There are three main ways of discovering content on a website. Manually, Automated and OSINT (Open-Source Intelligence).
Flow
check /robots.txt
Restricted directories or resources which are not allowed for search engine indexing(/crawling)
Identify favicon
Sometimes when frameworks are used to build a website, a favicon that is part of the installation gets leftover, and if the website developer doesn't replace this with a custom one, this can give us a clue on what framework is in use. OWASP Favicon Database
curl url-to-favicon | md5sum
Check /sitemap.xml
The sitemap.xml file gives a list of every file the website owner wishes to be listed on a search engine. These can sometimes contain areas of the website that are a bit more difficult to navigate to or even list some old webpages that the current site no longer uses but are still working behind the scenes.
Check HTTP Response Headers
When we make requests to the web server, the server returns various HTTP headers. These headers can sometimes contain useful information such as the webserver software and possibly the programming/scripting language in use.
Framework and its Version
Analyzing framework and its version helps to know vulnerability in the web page
Google Dorking
site:
returns results only from the specified website address
example: site:shinjith.dev
inurl:
returns results that have the specified word in the URL
example: inurl:admin
filetype:pdf
returns results which are a particular file extension
example: filetype:pdf
intitle:
returns results that contain the specified word in the title
example: intitle:admin
More operators: GoogleHackingCheatSheet.pdf
Use OSINT Tools
Wappalyzer
Wappalyzer is an online tool and browser extension that helps identify what technologies a website uses, such as frameworks, Content Management Systems (CMS), payment processors and much more, and it can even find version numbers as well.
Wayback Machine
The Wayback Machine is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.
Github
Repositories can either be set to public or private and have various access controls. You can use GitHub's search feature to look for company names or website names to try and locate repositories belonging to your target.
S3 Bucket
The owner of the files can set access permissions to either make files public, private and even writable. Sometimes these access permissions are incorrectly set and inadvertently allow access to files that shouldn't be available to the public.
http(s)://org-name.s3.amazonaws.com
Fuzzing
Using fuzzing tools like ffuf to discover hidden web-contents
ffuf -w wordlist -u target/FUZZ