Search Engines for Reconnaissance
Search engines have evolved far beyond simple web page indexing. Modern search engines—both traditional web search and specialized cybersecurity-focused platforms—provide powerful reconnaissance capabilities for discovering vulnerable devices, exposed data, network infrastructure, and sensitive information. This lecture explores how attackers and security professionals leverage search engines to map attack surfaces and identify security weaknesses without directly interacting with target systems.
Traditional Search Engines for Reconnaissance
Google Dorking (Google Hacking)
Definition: Using advanced Google search operators to find security-related information, vulnerabilities, and sensitive data indexed by Google's web crawlers.
Why It Works: Organizations often unintentionally expose sensitive information on public-facing web servers. Google's aggressive crawling indexes this data, making it searchable through specific query operators.
Google Search Operators
Basic Operators
site: - Limit results to specific domain
site:example.com
site:example.com -www
site:*.example.com
filetype: or ext: - Search for specific file types
site:example.com filetype:pdf
site:example.com ext:xls
site:example.com (filetype:doc OR filetype:pdf OR filetype:xls)
inurl: - Search for terms in URL
intitle: - Search for terms in page title
intext: - Search for terms in page body
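cache: - View Google's cached version of a page (Google retired cached pages in 2024, so this operator no longer returns results; use the Wayback Machine instead)
link: - Find pages linking to a specific URL (deprecated by Google in 2017; results are unreliable)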
Advanced Combinations
Exposed Directories:
Configuration Files:
Database Files:
Credentials and Sensitive Data:
Server Information:
Vulnerable Applications:
API Keys and Tokens:
Backup Files:
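The example queries for the categories above did not survive in this copy of the notes. The following is a minimal Python sketch that rebuilds representative ones. The helper names (`dork`, `any_of`, `build_dorks`) and the specific operator combinations are illustrative assumptions modeled on common GHDB-style patterns, not the original examples.

```python
def dork(domain, *terms):
    """Join a site: restriction with additional operator terms."""
    return " ".join([f"site:{domain}", *terms])


def any_of(operator, values):
    """Build an OR group such as (ext:sql OR ext:db)."""
    return "(" + " OR ".join(f"{operator}:{v}" for v in values) + ")"


# Illustrative templates only; tune them to the target's technology stack.
CATEGORY_TERMS = {
    "exposed_directories": ['intitle:"index of"'],
    "configuration_files": [any_of("ext", ["conf", "ini", "env", "yml"])],
    "database_files": [any_of("ext", ["sql", "db", "mdb"])],
    "credentials": ['intext:"password"', any_of("ext", ["log", "txt"])],
    "server_information": ['intitle:"phpinfo()"'],
    "api_keys": ['intext:"api_key"', any_of("ext", ["env", "json"])],
    "backup_files": [any_of("ext", ["bak", "old", "backup"])],
}


def build_dorks(domain):
    """Return {category: query} for one authorized target domain."""
    return {name: dork(domain, *terms) for name, terms in CATEGORY_TERMS.items()}


if __name__ == "__main__":
    for name, query in build_dorks("example.com").items():
        print(f"{name}: {query}")
```

Keeping every query scoped with site: limits results to a domain you are authorized to assess.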
Google Hacking Database (GHDB)
The Exploit Database maintains the Google Hacking Database, a curated collection of useful Google dorks categorized by purpose:
Footholds: Finding login pages and vulnerable apps
Files containing usernames: Exposed user lists
Sensitive directories: Configuration and backup directories
Web server detection: Server version and type
Vulnerable files: Known vulnerable file locations
Vulnerable servers: Server misconfigurations
Error messages: Information disclosure through errors
Files containing juicy info: Passwords, financial data, PII
Files containing passwords: Direct password exposures
Sensitive online shopping info: E-commerce vulnerabilities
Access: https://www.exploit-db.com/google-hacking-database
Practical Google Dorking Workflow
Phase 1: Domain Enumeration
Phase 2: Subdomain Discovery
Phase 3: Technology Identification
Phase 4: Sensitive File Discovery
Phase 5: Login Portal Discovery
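The five phases can be sketched as an ordered query plan. This is a hypothetical helper, not the original workflow: the per-phase templates are assumptions (several reuse queries from the exercises in these notes), and the code only builds search URLs to open manually. It does not send automated requests, which Google typically blocks.

```python
from urllib.parse import quote_plus

# Hypothetical phase plan; each template is an illustrative query.
PHASES = [
    ("Domain Enumeration", "site:{d}"),
    ("Subdomain Discovery", "site:*.{d} -site:www.{d}"),
    ("Technology Identification", 'site:{d} "powered by"'),
    ("Sensitive File Discovery",
     "site:{d} (filetype:pdf OR filetype:xls OR filetype:doc)"),
    ("Login Portal Discovery", "site:{d} (inurl:login OR inurl:admin)"),
]


def plan(domain):
    """Return (phase, query, search_url) tuples in workflow order."""
    steps = []
    for phase, template in PHASES:
        query = template.format(d=domain)
        url = "https://www.google.com/search?q=" + quote_plus(query)
        steps.append((phase, query, url))
    return steps


if __name__ == "__main__":
    for phase, query, url in plan("example.com"):
        print(f"{phase}\n  {query}\n  {url}")
```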
Other Traditional Search Engines
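Bing - Microsoft's search engine. It supports operators Google lacks, such as ip: (sites hosted on a given IP address) and contains: (pages that link to a given file type).
DuckDuckGo - Privacy-focused engine that does not profile searchers, which helps when you do not want research queries tied to your account.
Yandex - Russian search engine with strong image search; its reverse image search often finds matches that Google misses.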
Shodan: Search Engine for Internet-Connected Devices
What is Shodan?
Shodan is the world's first search engine for internet-connected devices. Unlike traditional search engines that index web page content, Shodan continuously scans the entire IPv4 address space and indexes information about services, devices, and systems directly connected to the internet.
Created: By John Matherly in 2009
Purpose: Originally for security research, now used by security professionals, researchers, and unfortunately, attackers
What Shodan Indexes:
Web servers and their banners
Industrial Control Systems (ICS/SCADA)
Network devices (routers, switches, firewalls)
Internet of Things (IoT) devices
Databases exposed to the internet
Webcams and surveillance systems
Smart home devices
Medical equipment
Building management systems
Power grid components
How Shodan Works
Scanning: Shodan continuously scans common ports across all IPv4 addresses
Banner Grabbing: Captures service banners containing software versions, configurations
Indexing: Stores data in searchable database
Categorization: Tags devices by type, location, organization
Vulnerability Matching: Cross-references with known vulnerabilities
Common Ports Scanned:
21 (FTP), 22 (SSH), 23 (Telnet), 25 (SMTP)
80 (HTTP), 443 (HTTPS), 8080, 8443 (HTTP alternates)
3306 (MySQL), 5432 (PostgreSQL), 27017 (MongoDB)
1883 (MQTT), 502 (Modbus), 102 (S7)
And hundreds more...
Shodan Search Syntax
Basic Searches
Search by hostname:
Search by IP address:
Search by port:
Search by country:
Search by city:
Search by organization:
Advanced Filters
Operating System:
Product/Software:
Version:
Vulnerability (CVE):
Has Screenshot (for services with web interfaces):
HTTP Components:
Practical Shodan Queries
Find Webcams:
Find Remote Desktop Services:
Find Exposed Databases:
Find Industrial Control Systems:
Find Vulnerable Systems:
Find Specific Organizations:
Find Default Credentials:
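The example queries for the labels above were lost in this copy, so here is a small sketch of how Shodan filter queries are composed. The filters used (port, country, org, product, vuln, has_screenshot, http.title) are standard Shodan filters. The `shodan_query` helper and the sample values are assumptions for illustration. Some filters, notably vuln, are usually restricted to paid or academic plans.

```python
def shodan_query(filters, text=""):
    """Compose a Shodan query from free text plus filter:value pairs.

    Values containing spaces are quoted; booleans render as true/false.
    """
    parts = [text] if text else []
    for name, value in filters.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
            if " " in rendered:
                rendered = f'"{rendered}"'
        parts.append(f"{name}:{rendered}")
    return " ".join(parts)


# Illustrative queries matching the categories listed above.
EXAMPLES = {
    "remote_desktop": shodan_query({"port": 3389, "has_screenshot": True}),
    "exposed_database": shodan_query({"product": "MongoDB", "port": 27017}),
    "industrial_control": shodan_query({"port": 502}),
    "vulnerable_systems": shodan_query({"vuln": "CVE-2019-0708"}),
    "organization": shodan_query({"org": "Example Corp", "country": "US"}),
    "login_pages": shodan_query({"http.title": "login"}, text="webcam"),
}

if __name__ == "__main__":
    for label, query in EXAMPLES.items():
        print(f"{label}: {query}")
```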
Shodan CLI and API
Shodan Command Line Interface:
Shodan API (Python Example):
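The original CLI and API listings are missing from this copy. On the command line, the official client is typically driven with `shodan init <api key>`, `shodan search <query>`, and `shodan host <ip>`. For Python, the sketch below uses the official `shodan` package (`shodan.Shodan(key).search(query)`). The environment variable name `SHODAN_API_KEY` and the `summarize` helper are assumptions of this example. The library is imported lazily so the summarizing logic can be tried without a key or network access.

```python
import os


def summarize(results, limit=5):
    """Reduce a Shodan search response to (ip, port, org) tuples."""
    rows = []
    for match in results.get("matches", [])[:limit]:
        rows.append((match.get("ip_str"), match.get("port"), match.get("org")))
    return rows


def search(query, api_key):
    """Run a live search; needs `pip install shodan` and a valid API key."""
    import shodan  # imported lazily so summarize() works without the package

    api = shodan.Shodan(api_key)
    return api.search(query)


if __name__ == "__main__":
    key = os.environ.get("SHODAN_API_KEY")  # assumed variable name
    if key:
        response = search('org:"Your Organization"', key)
        print("Total results:", response["total"])
        for ip, port, org in summarize(response):
            print(ip, port, org)
    else:
        print("Set SHODAN_API_KEY to run a live query.")
```

Each search consumes query credits on most plans, so test queries in the web interface before scripting them.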
Shodan Alternatives
Censys (https://censys.io/):
Similar to Shodan but with focus on SSL/TLS certificates
Better for finding subdomains via certificate transparency
Free academic access
More detailed SSL/TLS information
Search syntax:
ZoomEye (https://www.zoomeye.org/):
Chinese alternative to Shodan
Good coverage of Asian networks
Web and host search capabilities
BinaryEdge (https://www.binaryedge.io/):
Comprehensive internet scanning
Includes DNS, Tor, and Torrents
Historical data available
GreyNoise (https://www.greynoise.io/):
Focuses on internet background noise
Distinguishes malicious vs. benign scanning
Useful for threat intelligence
FOFA (https://fofa.info/):
Cyberspace search engine
Strong in Chinese networks
Advanced query syntax
Shodan for Defensive Reconnaissance
Organizations should use Shodan to discover their own exposed assets:
Asset Discovery:
Identify Exposed Services:
Find services that shouldn't be public
Locate forgotten or shadow IT assets
Discover misconfigurations
Vulnerability Assessment:
Monitoring:
Set up Shodan monitors for your IP ranges
Receive alerts when new services appear
Track changes over time
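One way to turn the defensive steps above into a repeatable check is to compare Shodan results against what you expect to be public. The sketch below is a hypothetical example: it assumes you already have a list of result dictionaries with `ip_str` and `port` keys (the shape of Shodan search matches), plus your own CIDR ranges and an approved-port list.

```python
import ipaddress


def unexpected_services(matches, owned_networks, allowed_ports):
    """Return (ip, port) pairs inside owned ranges whose port is not approved."""
    networks = [ipaddress.ip_network(n) for n in owned_networks]
    findings = []
    for match in matches:
        ip = ipaddress.ip_address(match["ip_str"])
        in_scope = any(ip in net for net in networks)
        if in_scope and match["port"] not in allowed_ports:
            findings.append((match["ip_str"], match["port"]))
    return sorted(findings)


if __name__ == "__main__":
    sample = [
        {"ip_str": "203.0.113.10", "port": 443},   # approved web service
        {"ip_str": "203.0.113.11", "port": 3389},  # RDP should not be public
        {"ip_str": "198.51.100.5", "port": 23},    # outside our ranges
    ]
    print(unexpected_services(sample, ["203.0.113.0/24"], {80, 443}))
```

Results outside your ranges are ignored rather than flagged, which keeps the report focused on assets you can actually remediate.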
Other Specialized Search Engines
PublicWWW
Purpose: Search for specific code, scripts, or tracking IDs across websites
Use Cases:
Find all websites using specific Google Analytics ID
Discover sites using same advertising code
Identify websites by technology footprint
Example:
Certificate Search (crt.sh)
Purpose: Search certificate transparency logs for SSL/TLS certificates
Use Cases:
Subdomain enumeration
Find all domains owned by organization
Discover forgotten or test domains
Example queries:
URL: https://crt.sh/
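The example queries are missing here. As a sketch of the subdomain enumeration use case: crt.sh accepts `%` as a wildcard and can return JSON when `output=json` is added to the query string, with hostnames in the `name_value` field. The function names below are assumptions of this example, and the parsing is shown against a sample payload rather than a live request.

```python
import json
from urllib.parse import quote


def crtsh_url(domain):
    """URL for a wildcard search (%.domain) with JSON output."""
    return "https://crt.sh/?q=" + quote(f"%.{domain}") + "&output=json"


def subdomains(crtsh_json, domain):
    """Extract unique hostnames under `domain` from a crt.sh JSON response."""
    names = set()
    for entry in json.loads(crtsh_json):
        # name_value may hold several newline-separated names
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # drop wildcard prefix
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    sample = '[{"name_value": "www.example.com\\n*.dev.example.com"}]'
    print(crtsh_url("example.com"))
    print(subdomains(sample, "example.com"))
```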
Pastebin and Code Search
GitHub Code Search:
Find leaked credentials in public repositories
Discover API keys and tokens
Identify technology stack from code
Pastebin Search (https://psbdmp.ws/):
Monitor pastes mentioning your organization
Find leaked credentials or data
Track data breaches
Wayback Machine
Internet Archive (https://archive.org/web/):
View historical versions of websites
Recover deleted content
Find old vulnerabilities or information
Use Cases:
See old employee directories
Find removed documentation
Discover changed infrastructure
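Historical URLs can also be listed programmatically. The sketch below assumes the Wayback Machine's CDX server endpoint, which returns JSON rows with a header row first when `output=json` is set. The function names and the chosen fields are assumptions of this example, and the parser is demonstrated on sample rows rather than a live response.

```python
from urllib.parse import urlencode


def cdx_url(domain, limit=50):
    """Build a CDX query listing archived URLs under a domain."""
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "urlkey",  # one row per unique URL
        "limit": limit,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)


def snapshot_links(rows):
    """Turn CDX JSON rows into replay URLs; the first row is the header."""
    if not rows:
        return []
    header, *records = rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


if __name__ == "__main__":
    sample_rows = [
        ["timestamp", "original"],
        ["20150101000000", "http://example.com/staff.html"],
    ]
    print(cdx_url("example.com", limit=10))
    print(snapshot_links(sample_rows))
```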
Ethical and Legal Considerations
Passive Nature
Search engines for reconnaissance are generally passive
You're querying a search engine, not the target directly
Information is already publicly accessible
Legal Gray Areas
Accessing exposed data may still violate laws (CFAA in U.S.)
Some jurisdictions consider accessing misconfigured systems illegal
Terms of service violations can have legal consequences
Responsible Use
Don't access exposed systems: Finding is reconnaissance; accessing is intrusion
Responsible disclosure: Report serious exposures to affected organizations
Authorization required: Only access systems you have permission to test
Document findings: Keep records of what you find and why
Notification Dilemma
If you discover serious exposures (e.g., medical records, financial data):
Consider responsible disclosure to organization
May report to CERT/CC or similar organizations
Balance risk of notification with risk of exposure
Document decision-making process
Defensive Measures
Preventing Search Engine Exposure
1. robots.txt Configuration:
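Note: robots.txt is advisory. It asks crawlers to skip the listed paths but cannot enforce that, and because the file is public it also tells attackers which paths you consider sensitive.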
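The original configuration listing is missing from this copy. As an illustration, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The sample rules and the `disallowed_paths` helper are assumptions of this example. It shows both sides of the trade-off: a compliant crawler honors the rules, while anyone reading the file learns which paths exist.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /config/
"""


def disallowed_paths(robots_text):
    """List the paths a robots.txt file asks crawlers to avoid."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_text.splitlines()
        if line.lower().startswith("disallow:")
    ]


parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# A compliant crawler skips /admin/ but may fetch everything else.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/login"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))   # True

# The same file doubles as a map of interesting paths for an attacker.
print(disallowed_paths(SAMPLE_ROBOTS))  # ['/admin/', '/backup/', '/config/']
```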
2. Remove from Index:
Google Search Console: Request URL removal
Meta tags:
<meta name="robots" content="noindex">
X-Robots-Tag HTTP header
3. Authentication and Access Controls:
Require authentication for sensitive areas
Don't rely on "security through obscurity"
Use proper access controls, not just hidden URLs
4. Regular Monitoring:
5. Information Disclosure Prevention:
Disable directory listing
Remove verbose error messages
Strip server version banners
Don't expose internal file structures
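A quick way to verify banner stripping is to inspect response headers for version strings. The sketch below is a hypothetical check: the header list and the version regex are assumptions and will not catch every disclosure. It operates on a plain dictionary of headers, so it can be fed from any HTTP client.

```python
import re

# Headers that commonly reveal software versions (illustrative, not exhaustive).
DISCLOSING_HEADERS = ("Server", "X-Powered-By", "X-AspNet-Version")
VERSION_PATTERN = re.compile(r"\d+(\.\d+)+")


def disclosed_versions(headers):
    """Return {header: value} for headers that reveal a version number."""
    normalized = {k.lower(): v for k, v in headers.items()}
    findings = {}
    for name in DISCLOSING_HEADERS:
        value = normalized.get(name.lower())
        if value and VERSION_PATTERN.search(value):
            findings[name] = value
    return findings


if __name__ == "__main__":
    response_headers = {
        "server": "Apache/2.4.41 (Ubuntu)",
        "X-Powered-By": "PHP/7.4.3",
        "Content-Type": "text/html",
    }
    print(disclosed_versions(response_headers))
```

A header such as `Server: nginx` passes this check because it names the product without a version.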
Shodan Protection
1. Minimize Internet Exposure:
Only expose services that must be public
Use VPN for administrative access
Implement network segmentation
2. Regular Shodan Audits:
3. Set Up Alerts:
Use Shodan monitoring service
Alert on new exposed services
Track changes in footprint
4. Banner Modification:
Modify server banners to remove versions
Use generic responses
Don't advertise technology stack
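Regular audits and footprint tracking come down to comparing snapshots over time. The following minimal sketch assumes each snapshot is a list of dictionaries with `ip_str` and `port` keys, as returned in Shodan search matches. The function names are hypothetical.

```python
def footprint(matches):
    """Reduce a snapshot to a set of (ip, port) pairs."""
    return {(m["ip_str"], m["port"]) for m in matches}


def diff_footprint(previous, current):
    """Report services that appeared or disappeared between two snapshots."""
    before, after = footprint(previous), footprint(current)
    return {"new": sorted(after - before), "gone": sorted(before - after)}


if __name__ == "__main__":
    last_week = [
        {"ip_str": "203.0.113.10", "port": 443},
        {"ip_str": "203.0.113.10", "port": 22},
    ]
    today = [
        {"ip_str": "203.0.113.10", "port": 443},
        {"ip_str": "203.0.113.12", "port": 27017},
    ]
    print(diff_footprint(last_week, today))
```

Entries under "new" are the ones to investigate first, since they represent services that became reachable since the last audit.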
Practical Exercises
Exercise 1: Google Dorking Challenge
Objective: Find sensitive information about a target organization (with permission)
Start with basic site search:
site:example.com
Look for exposed directories:
site:example.com intitle:"index of"
Find document types:
site:example.com (filetype:pdf OR filetype:xls OR filetype:doc)
Search for login pages:
site:example.com (inurl:login OR inurl:admin)
Look for technology indicators:
site:example.com "powered by"
Document:
What sensitive information did you find?
What types of files are exposed?
What technologies are in use?
Exercise 2: Shodan Reconnaissance
Objective: Understand your organization's internet footprint
Search by organization name:
org:"Your Organization"
Search by domain:
hostname:yourcompany.com
Analyze exposed services and ports
Check for known vulnerabilities:
org:"Your Organization" vuln:*
Document findings and risk assessment
Exercise 3: Certificate Transparency
Objective: Enumerate subdomains via certificate logs
Visit https://crt.sh
Search for %.example.com
Compile list of discovered subdomains
Cross-reference with DNS enumeration results
Identify previously unknown assets
Exercise 4: Self-OSINT via Search Engines
Objective: Understand what search engines reveal about you
Google your name in quotes with variations
Search your email address(es)
Search your username(s)
Check images (Google Images, your name)
Review what's accessible and consider privacy
Integration with Reconnaissance Methodology
Search engine reconnaissance fits into the overall process:
Passive Reconnaissance: Queries go to the search engine, not the target, so they generate no logs on target systems
Early Phase: Use before active scanning to understand scope
Continuous: Search engines index new content constantly
Validation: Confirm technical findings with search data
Intelligence: Combine with OSINT for comprehensive picture
Key Takeaways
Traditional search engines (Google, Bing) can reveal sensitive data through dorking
Shodan and similar platforms index internet-connected devices and services
Search engine reconnaissance is largely passive but incredibly effective
Organizations must monitor their own search engine exposure
Legal and ethical considerations apply even to public data
Defensive reconnaissance helps organizations understand their attack surface
Combine multiple search engines for comprehensive coverage
Regular monitoring and mitigation reduces exposure
Additional Resources
Google Hacking
Google Hacking Database: https://www.exploit-db.com/google-hacking-database
"Google Hacking for Penetration Testers" by Johnny Long - Definitive guide
Google Search Operators: https://support.google.com/websearch/answer/2466433
Shodan
Shodan: https://www.shodan.io/
Shodan Documentation: https://help.shodan.io/
Shodan CLI: https://cli.shodan.io/
Book of Shodan: https://leanpub.com/shodan - Comprehensive Shodan guide
Alternatives and Tools
Censys: https://censys.io/
ZoomEye: https://www.zoomeye.org/
BinaryEdge: https://www.binaryedge.io/
crt.sh: https://crt.sh/
PublicWWW: https://publicwww.com/
Practice and Learning
HackTheBox: Includes boxes requiring search engine reconnaissance
TryHackMe - Google Dorking Room: Guided Google hacking exercises
Shodan Training: Regular webinars and tutorials on Shodan.io
Conclusion
Search engines have evolved into powerful reconnaissance tools that allow security professionals and attackers alike to discover vast amounts of information without directly interacting with targets. From Google dorking revealing misconfigured web servers to Shodan exposing critical infrastructure, these tools demonstrate that passive reconnaissance can be devastatingly effective. Organizations must adopt a defensive mindset by regularly auditing their search engine footprint and implementing controls to prevent sensitive information exposure. Remember: if a search engine can find it, so can an attacker—and unlike active reconnaissance, search engine queries leave no traces on your systems.