Search Engines for Reconnaissance

Search engines have evolved far beyond simple web page indexing. Modern search engines—both traditional web search and specialized cybersecurity-focused platforms—provide powerful reconnaissance capabilities for discovering vulnerable devices, exposed data, network infrastructure, and sensitive information. This lecture explores how attackers and security professionals leverage search engines to map attack surfaces and identify security weaknesses without directly interacting with target systems.

Traditional Search Engines for Reconnaissance

Google Dorking (Google Hacking)

Definition: Using advanced Google search operators to find security-related information, vulnerabilities, and sensitive data indexed by Google's web crawlers.

Why It Works: Organizations often unintentionally expose sensitive information on public-facing web servers. Google's aggressive crawling indexes this data, making it searchable through specific query operators.

Google Search Operators

Basic Operators

site: - Limit results to specific domain

site:example.com
site:example.com -site:www.example.com
site:*.example.com

filetype: or ext: - Search for specific file types

site:example.com filetype:pdf
site:example.com ext:xls
site:example.com (filetype:doc OR filetype:pdf OR filetype:xls)

inurl: - Search for terms in URL

intitle: - Search for terms in page title

intext: - Search for terms in page body

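cache: - View Google's cached version of a page (Google retired this operator in 2024, so it no longer returns results)

link: - Find pages linking to a specific URL (deprecated by Google in 2017; results are unreliable)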
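The operators above combine by simple concatenation. A minimal Python sketch of that composition follows; `build_dork` and its parameters are hypothetical names introduced for illustration, and `example.com` is a placeholder domain.

```python
def build_dork(site=None, filetypes=(), inurl=None, intitle=None,
               intext=None, exclude_sites=()):
    """Compose a Google query string from the basic operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    # Excluding a host is done with a negated site: operator.
    parts.extend(f"-site:{host}" for host in exclude_sites)
    if len(filetypes) == 1:
        parts.append(f"filetype:{filetypes[0]}")
    elif filetypes:
        # Several file types are OR-ed together inside parentheses.
        parts.append("(" + " OR ".join(f"filetype:{t}" for t in filetypes) + ")")
    for operator, value in (("inurl", inurl), ("intitle", intitle), ("intext", intext)):
        if value:
            # Quote values containing spaces so they match as a phrase.
            quoted = f'"{value}"' if " " in value else value
            parts.append(f"{operator}:{quoted}")
    return " ".join(parts)


print(build_dork(site="example.com", filetypes=["doc", "pdf", "xls"]))
print(build_dork(site="example.com", intitle="index of"))
```

Keeping each operator as a separate argument makes it easy to vary one term at a time while the `site:` scope stays fixed.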

Advanced Combinations

Exposed Directories:

Configuration Files:

Database Files:

Credentials and Sensitive Data:

Server Information:

Vulnerable Applications:

API Keys and Tokens:

Backup Files:
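For the categories listed above, the sketch below shows one commonly seen query pattern per category, scoped to a single domain. The specific dorks are illustrative assumptions chosen for this example, not a list taken from the GHDB, and `example.com`, `dorks_for`, and `search_url` are placeholder or hypothetical names.

```python
from urllib.parse import quote_plus

# One illustrative pattern per category; "{d}" is replaced by the target domain.
CATEGORY_DORKS = {
    "Exposed Directories": 'site:{d} intitle:"index of"',
    "Configuration Files": "site:{d} (ext:conf OR ext:ini OR ext:env OR ext:yml)",
    "Database Files": "site:{d} (ext:sql OR ext:db OR ext:mdb)",
    "Credentials and Sensitive Data": 'site:{d} intext:"password" (ext:txt OR ext:log)',
    "Server Information": 'site:{d} (intitle:"phpinfo()" OR inurl:server-status)',
    "Vulnerable Applications": "site:{d} (inurl:phpmyadmin OR inurl:wp-admin)",
    "API Keys and Tokens": 'site:{d} (intext:"api_key" OR intext:"access_token")',
    "Backup Files": "site:{d} (ext:bak OR ext:old OR ext:backup OR ext:zip)",
}


def dorks_for(domain):
    """Fill the domain into every category pattern."""
    return {category: pattern.format(d=domain)
            for category, pattern in CATEGORY_DORKS.items()}


def search_url(query):
    """URL-encode a query so it can be opened in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


for category, query in dorks_for("example.com").items():
    print(f"{category}: {query}")
```

Scoping every pattern with `site:` keeps the results limited to the domain being assessed.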

Google Hacking Database (GHDB)

The Exploit Database maintains the Google Hacking Database, a curated collection of useful Google dorks categorized by purpose:

  • Footholds: Finding login pages and vulnerable apps

  • Files containing usernames: Exposed user lists

  • Sensitive directories: Configuration and backup directories

  • Web server detection: Server version and type

  • Vulnerable files: Known vulnerable file locations

  • Vulnerable servers: Server misconfigurations

  • Error messages: Information disclosure through errors

  • Files containing juicy info: Passwords, financial data, PII

  • Files containing passwords: Direct password exposures

  • Sensitive online shopping info: E-commerce vulnerabilities

Access: https://www.exploit-db.com/google-hacking-database

Practical Google Dorking Workflow

Phase 1: Domain Enumeration

Phase 2: Subdomain Discovery

Phase 3: Technology Identification

Phase 4: Sensitive File Discovery

Phase 5: Login Portal Discovery
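The five phases can be written down as an ordered query plan. This is a sketch under assumptions: the query chosen for each phase is illustrative (several reuse the queries from Exercise 1), and `dorking_plan` is a hypothetical helper.

```python
def dorking_plan(domain):
    """Return (phase, query) pairs in the order the workflow runs them."""
    return [
        ("Domain Enumeration", f"site:{domain}"),
        ("Subdomain Discovery", f"site:*.{domain} -site:www.{domain}"),
        ("Technology Identification", f'site:{domain} "powered by"'),
        ("Sensitive File Discovery",
         f"site:{domain} (filetype:pdf OR filetype:xls OR filetype:doc)"),
        ("Login Portal Discovery", f"site:{domain} (inurl:login OR inurl:admin)"),
    ]


for number, (phase, query) in enumerate(dorking_plan("example.com"), start=1):
    print(f"Phase {number}: {phase}: {query}")
```

Running the phases in this order narrows the scope step by step: the broad `site:` query comes first, and the targeted file and login searches come last.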

Other Traditional Search Engines

Bing - Microsoft's search engine with unique operators:

DuckDuckGo - Privacy-focused, less aggressive caching, useful for avoiding detection:

Yandex - Russian search engine, excellent image search (reverse image search often better than Google):

Shodan: Search Engine for Internet-Connected Devices

What is Shodan?

Shodan is the world's first search engine for internet-connected devices. Unlike traditional search engines that index web page content, Shodan continuously scans the entire IPv4 address space and indexes information about services, devices, and systems directly connected to the internet.

Created: By John Matherly in 2009

Purpose: Originally for security research; now used by security professionals, researchers, and, unfortunately, attackers

What Shodan Indexes:

  • Web servers and their banners

  • Industrial Control Systems (ICS/SCADA)

  • Network devices (routers, switches, firewalls)

  • Internet of Things (IoT) devices

  • Databases exposed to the internet

  • Webcams and surveillance systems

  • Smart home devices

  • Medical equipment

  • Building management systems

  • Power grid components

How Shodan Works

  1. Scanning: Shodan continuously scans common ports across all IPv4 addresses

  2. Banner Grabbing: Captures service banners containing software versions, configurations

  3. Indexing: Stores data in searchable database

  4. Categorization: Tags devices by type, location, organization

  5. Vulnerability Matching: Cross-references with known vulnerabilities
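Steps 2 and 3 amount to turning a raw service banner into searchable fields. The sketch below illustrates that idea with a simple regular expression and made-up sample banners. It is not Shodan's actual parser, which is certainly more elaborate. The addresses come from documentation ranges, and the function names are hypothetical.

```python
import re

# Matches "Product/1.2.3" or "Product_1.2.3" tokens found in many banners.
BANNER_PATTERN = re.compile(
    r"(?P<product>[A-Za-z][A-Za-z\-]*)[/_](?P<version>\d+(?:\.\d+)*[A-Za-z0-9]*)"
)


def parse_banner(banner):
    """Extract a product name and version from a banner, if present."""
    match = BANNER_PATTERN.search(banner)
    if match is None:
        return {"product": None, "version": None}
    return {"product": match.group("product"), "version": match.group("version")}


def index_banners(observations):
    """Group (ip, port) pairs by parsed product name (the indexing step)."""
    index = {}
    for ip, port, banner in observations:
        product = parse_banner(banner)["product"] or "unknown"
        index.setdefault(product, []).append((ip, port))
    return index


sample = [
    ("192.0.2.1", 22, "SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.1"),
    ("192.0.2.2", 80, "Server: nginx/1.18.0"),
    ("192.0.2.3", 80, "Server: nginx"),
]
print(index_banners(sample))
```

A banner with no version token, like the last sample, ends up under "unknown". That is the effect banner stripping aims for.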

Common Ports Scanned:

  • 21 (FTP), 22 (SSH), 23 (Telnet), 25 (SMTP)

  • 80 (HTTP), 443 (HTTPS), 8080, 8443 (HTTP alternates)

  • 3306 (MySQL), 5432 (PostgreSQL), 27017 (MongoDB)

  • 1883 (MQTT), 502 (Modbus), 102 (S7)

  • And hundreds more...

Shodan Search Syntax

Basic Searches

Search by hostname:

Search by IP address:

Search by port:

Search by country:

Search by city:

Search by organization:

Advanced Filters

Operating System:

Product/Software:

Version:

Vulnerability (CVE):

Has Screenshot (for services with web interfaces):

HTTP Components:
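Each filter above takes the form `name:value`, and filters combine when separated by spaces. The sketch below composes such queries. The filter names used (`hostname`, `net`, `port`, `country`, `city`, `org`, `os`, `product`, `version`, `vuln`, `has_screenshot`, `http.title`, `http.component`) are Shodan filters; `shodan_query` is a hypothetical helper and all values are placeholders. Some filters, `vuln` among them, are only available on paid or academic plans.

```python
def shodan_query(filters, free_text=""):
    """Compose a Shodan search string from a dict of filter names and values."""
    parts = [free_text] if free_text else []
    for name, value in filters.items():
        if isinstance(value, bool):
            # Boolean filters such as has_screenshot use lowercase true/false.
            value = "true" if value else "false"
        value = str(value)
        if " " in value:
            # Values with spaces must be quoted to stay attached to the filter.
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


print(shodan_query({"hostname": "example.com"}))
print(shodan_query({"net": "192.0.2.0/24", "port": 22}))
print(shodan_query({"org": "Example Corp", "country": "US", "city": "Boston"}))
print(shodan_query({"product": "nginx", "version": "1.18.0"}))
print(shodan_query({"http.title": "Login", "has_screenshot": True}))
```

Passing the filters as a dict rather than keyword arguments allows dotted names such as `http.title`.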

Practical Shodan Queries

Find Webcams:

Find Remote Desktop Services:

Find Exposed Databases:

Find Industrial Control Systems:

Find Vulnerable Systems:

Find Specific Organizations:

Find Default Credentials:

Shodan CLI and API

Shodan Command Line Interface:

Shodan API (Python Example):
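A minimal sketch of an API search using the official `shodan` Python package (`pip install shodan`). The environment variable name `SHODAN_API_KEY`, the query, and the helper names are assumptions made for this example. The helpers read only the `total` and `matches` fields of a search result and the `ip_str`, `port`, and `org` fields of each match.

```python
import os


def summarize(results, limit=5):
    """Reduce a Shodan search result dict to (ip, port, org) tuples."""
    return [
        (match.get("ip_str"), match.get("port"), match.get("org"))
        for match in results.get("matches", [])[:limit]
    ]


def search(api, query):
    """Run a query on any client exposing search(); return total and summary."""
    results = api.search(query)
    return results.get("total", 0), summarize(results)


if __name__ == "__main__" and os.environ.get("SHODAN_API_KEY"):
    import shodan  # third-party package, imported only when a key is present

    client = shodan.Shodan(os.environ["SHODAN_API_KEY"])
    try:
        total, rows = search(client, 'org:"Example Corp" port:443')
        print(f"Total results: {total}")
        for ip, port, org in rows:
            print(ip, port, org)
    except shodan.APIError as error:
        print(f"Shodan API error: {error}")
```

Because `search` accepts any object with a `search()` method, the summary logic can be exercised with a stub client before any API credits are spent.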

Shodan Alternatives

Censys (https://censys.io/):

  • Similar to Shodan but with focus on SSL/TLS certificates

  • Better for finding subdomains via certificate transparency

  • Free academic access

  • More detailed SSL/TLS information

Search syntax:

ZoomEye (https://www.zoomeye.org/):

  • Chinese alternative to Shodan

  • Good coverage of Asian networks

  • Web and host search capabilities

BinaryEdge (https://www.binaryedge.io/):

  • Comprehensive internet scanning

  • Includes DNS, Tor, and Torrents

  • Historical data available

GreyNoise (https://www.greynoise.io/):

  • Focuses on internet background noise

  • Distinguishes malicious vs. benign scanning

  • Useful for threat intelligence

FOFA (https://fofa.info/):

  • Cyberspace search engine

  • Strong in Chinese networks

  • Advanced query syntax

Shodan for Defensive Reconnaissance

Organizations should use Shodan to discover their own exposed assets:

  1. Asset Discovery:

  2. Identify Exposed Services:

    • Find services that shouldn't be public

    • Locate forgotten or shadow IT assets

    • Discover misconfigurations

  3. Vulnerability Assessment:

  4. Monitoring:

    • Set up Shodan monitors for your IP ranges

    • Receive alerts when new services appear

    • Track changes over time
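The discovery and monitoring steps above reduce to comparing what is visible from the internet with what is approved, and with what was visible last time. The sketch below shows that comparison, assuming the observed services have already been exported as (IP, port) pairs. The allowlist, addresses, and function names are hypothetical.

```python
def unexpected_exposures(observed, approved):
    """Return observed (ip, port) pairs that are not on the approved list."""
    return sorted(set(observed) - set(approved))


def footprint_changes(previous, current):
    """Compare two snapshots and report services that appeared or disappeared."""
    previous, current = set(previous), set(current)
    return {
        "appeared": sorted(current - previous),
        "disappeared": sorted(previous - current),
    }


approved = [("192.0.2.10", 443), ("192.0.2.10", 80)]
last_week = [("192.0.2.10", 443), ("192.0.2.10", 80)]
today = [("192.0.2.10", 443), ("192.0.2.10", 80), ("192.0.2.25", 3389)]

print(unexpected_exposures(today, approved))
print(footprint_changes(last_week, today))
```

Anything in the first result is a candidate for shadow IT or a misconfiguration. The second result is the kind of change a monitoring alert would report.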

Other Specialized Search Engines

PublicWWW

Purpose: Search for specific code, scripts, or tracking IDs across websites

Use Cases:

  • Find all websites using specific Google Analytics ID

  • Discover sites using same advertising code

  • Identify websites by technology footprint

Example:

Certificate Search (crt.sh)

Purpose: Search certificate transparency logs for SSL/TLS certificates

Use Cases:

  • Subdomain enumeration

  • Find all domains owned by organization

  • Discover forgotten or test domains

Example queries:

URL: https://crt.sh/
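crt.sh returns results as JSON when `output=json` is added to the query string, and each record lists certificate names in a `name_value` field, separated by newlines. The sketch below builds the query URL and extracts unique subdomains from such records. The sample records are made up, the function names are hypothetical, and the network request itself is left out.

```python
from urllib.parse import urlencode


def crtsh_url(domain):
    """Build a crt.sh JSON query for every name under the domain."""
    # "%" is the wildcard crt.sh accepts; urlencode turns it into %25.
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})


def extract_subdomains(records, domain):
    """Collect unique hostnames under the domain from crt.sh JSON records."""
    domain = domain.lower()
    suffix = "." + domain
    names = set()
    for record in records:
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # treat wildcard entries as their base name
            if name == domain or name.endswith(suffix):
                names.add(name)
    return sorted(names)


sample_records = [
    {"name_value": "www.example.com\n*.dev.example.com"},
    {"name_value": "mail.example.com"},
    {"name_value": "example.org"},
]
print(crtsh_url("example.com"))
print(extract_subdomains(sample_records, "example.com"))
```

Filtering on the domain suffix drops unrelated names that share a certificate with the target.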

GitHub Code Search:

  • Find leaked credentials in public repositories

  • Discover API keys and tokens

  • Identify technology stack from code

Pastebin Search (https://psbdmp.ws/):

  • Monitor pastes mentioning your organization

  • Find leaked credentials or data

  • Track data breaches

Wayback Machine

Internet Archive (https://archive.org/web/):

  • View historical versions of websites

  • Recover deleted content

  • Find old vulnerabilities or information

Use Cases:

  • See old employee directories

  • Find removed documentation

  • Discover changed infrastructure
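The Internet Archive also exposes a CDX query endpoint that lists captured URLs for a domain, which is convenient for the use cases above. The sketch below builds such a query and parses the JSON response, whose first row is a header. The parameter choices (`fl`, `collapse`, `limit`) and the sample rows are assumptions for this example, and the function names are hypothetical.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(domain, limit=100):
    """Build a CDX query listing captured URLs under a domain."""
    params = {
        "url": f"{domain}/*",          # every path under the domain
        "output": "json",
        "fl": "timestamp,original",    # only return these two fields
        "collapse": "urlkey",          # one row per distinct URL
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx(rows):
    """Turn CDX JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


sample_rows = [
    ["timestamp", "original"],
    ["20150101000000", "http://example.com/staff/directory.html"],
    ["20180601000000", "http://example.com/docs/old-api.pdf"],
]
for capture in parse_cdx(sample_rows):
    print(capture["timestamp"], capture["original"])
```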

Passive Nature

  • Search engines for reconnaissance are generally passive

  • You're querying a search engine, not the target directly

  • Information is already publicly accessible

  • Accessing exposed data may still violate laws (for example, the CFAA in the U.S.)

  • Some jurisdictions consider accessing misconfigured systems illegal

  • Terms of service violations can have legal consequences

Responsible Use

  1. Don't access exposed systems: Finding is reconnaissance; accessing is intrusion

  2. Responsible disclosure: Report serious exposures to affected organizations

  3. Authorization required: Only access systems you have permission to test

  4. Document findings: Keep records of what you find and why

Notification Dilemma

If you discover serious exposures (e.g., medical records, financial data):

  • Consider responsible disclosure to organization

  • May report to CERT/CC or similar organizations

  • Balance risk of notification with risk of exposure

  • Document decision-making process

Defensive Measures

Preventing Search Engine Exposure

1. robots.txt Configuration:

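Note: robots.txt is advisory. It asks compliant crawlers to skip the listed paths but enforces nothing, and because the file is public it also tells anyone who reads it where those paths are.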
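A sketch using Python's standard `urllib.robotparser` to show how a compliant crawler interprets a robots.txt file. The rules and paths are hypothetical examples, and `allowed` is a helper introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to skip three directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /config/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="*"):
    """Report whether a rule-following crawler would fetch the path."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("/index.html"))   # True
print(allowed("/admin/login"))  # False
```

The check happens entirely on the crawler's side, so a crawler that ignores the file can still fetch `/admin/` unless the server requires authentication.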

2. Remove from Index:

  • Google Search Console: Request URL removal

  • Meta tags: <meta name="robots" content="noindex">

  • X-Robots-Tag HTTP header
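A sketch that checks whether a response asks search engines not to index it, through either the meta tag or the HTTP header. `is_noindex` is a hypothetical helper. Headers are passed as a plain dict and the HTML as a string, so no network access is involved.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def is_noindex(headers, html):
    """Return True if the header or the page markup requests noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(is_noindex({}, page))
print(is_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))
print(is_noindex({"Content-Type": "text/html"}, "<html></html>"))
```

The header form matters for PDFs, spreadsheets, and other non-HTML files, which cannot carry a meta tag.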

3. Authentication and Access Controls:

  • Require authentication for sensitive areas

  • Don't rely on "security through obscurity"

  • Use proper access controls, not just hidden URLs

4. Regular Monitoring:

5. Information Disclosure Prevention:

  • Disable directory listing

  • Remove verbose error messages

  • Strip server version banners

  • Don't expose internal file structures

Shodan Protection

1. Minimize Internet Exposure:

  • Only expose services that must be public

  • Use VPN for administrative access

  • Implement network segmentation

2. Regular Shodan Audits:

3. Set Up Alerts:

  • Use Shodan monitoring service

  • Alert on new exposed services

  • Track changes in footprint

4. Banner Modification:

  • Modify server banners to remove versions

  • Use generic responses

  • Don't advertise technology stack
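A sketch for auditing banner modification: it flags response headers that still reveal a version number. The header list and the version pattern are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# A dotted number such as 2.4 or 1.18.0 is treated as a version string.
VERSION_PATTERN = re.compile(r"\d+\.\d+")

# Headers that commonly advertise the technology stack.
REVEALING_HEADERS = ("server", "x-powered-by", "x-aspnet-version")


def leaks_version(header_value):
    """Return True if the header value contains a version-like number."""
    return bool(VERSION_PATTERN.search(header_value))


def audit_headers(headers):
    """Return the names of stack-revealing headers that expose a version."""
    return sorted(
        name for name, value in headers.items()
        if name.lower() in REVEALING_HEADERS and leaks_version(value)
    )


before = {"Server": "Apache/2.4.41 (Ubuntu)", "X-Powered-By": "PHP/7.4.3",
          "Content-Type": "text/html"}
after = {"Server": "Apache", "Content-Type": "text/html"}
print(audit_headers(before))
print(audit_headers(after))
```

An empty result means the listed headers no longer expose versions. It does not mean the service is hidden, because the open port remains visible to scanners.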

Practical Exercises

Exercise 1: Google Dorking Challenge

Objective: Find sensitive information about a target organization (with permission)

  1. Start with basic site search: site:example.com

  2. Look for exposed directories: site:example.com intitle:"index of"

  3. Find document types: site:example.com (filetype:pdf OR filetype:xls OR filetype:doc)

  4. Search for login pages: site:example.com (inurl:login OR inurl:admin)

  5. Look for technology indicators: site:example.com "powered by"

Document:

  • What sensitive information did you find?

  • What types of files are exposed?

  • What technologies are in use?

Exercise 2: Shodan Reconnaissance

Objective: Understand your organization's internet footprint

  1. Search by organization name: org:"Your Organization"

  2. Search by domain: hostname:yourcompany.com

  3. Analyze exposed services and ports

  4. Check for known vulnerabilities: org:"Your Organization" vuln:*

  5. Document findings and risk assessment

Exercise 3: Certificate Transparency

Objective: Enumerate subdomains via certificate logs

  1. Search crt.sh for %.example.com

  2. Compile list of discovered subdomains

  3. Cross-reference with DNS enumeration results

  4. Identify previously unknown assets

Exercise 4: Self-OSINT via Search Engines

Objective: Understand what search engines reveal about you

  1. Google your name in quotes with variations

  2. Search your email address(es)

  3. Search your username(s)

  4. Check image results (search your name in Google Images)

  5. Review what's accessible and consider privacy

Integration with Reconnaissance Methodology

Search engine reconnaissance fits into the overall process:

  1. Passive Reconnaissance: Search engine queries are passive and generate no logs on the target

  2. Early Phase: Use before active scanning to understand scope

  3. Continuous: Search engines index new content constantly

  4. Validation: Confirm technical findings with search data

  5. Intelligence: Combine with OSINT for comprehensive picture

Key Takeaways

  • Traditional search engines (Google, Bing) can reveal sensitive data through dorking

  • Shodan and similar platforms index internet-connected devices and services

  • Search engine reconnaissance is largely passive but incredibly effective

  • Organizations must monitor their own search engine exposure

  • Legal and ethical considerations apply even to public data

  • Defensive reconnaissance helps organizations understand their attack surface

  • Combine multiple search engines for comprehensive coverage

  • Regular monitoring and mitigation reduces exposure

Additional Resources

Google Hacking

Shodan

Alternatives and Tools

Practice and Learning

  • HackTheBox: Includes boxes requiring search engine reconnaissance

  • TryHackMe - Google Dorking Room: Guided Google hacking exercises

  • Shodan Training: Regular webinars and tutorials on Shodan.io

Conclusion

Search engines have evolved into powerful reconnaissance tools that allow security professionals and attackers alike to discover vast amounts of information without directly interacting with targets. From Google dorking revealing misconfigured web servers to Shodan exposing critical infrastructure, these tools demonstrate that passive reconnaissance can be devastatingly effective. Organizations must adopt a defensive mindset by regularly auditing their search engine footprint and implementing controls to prevent sensitive information exposure. Remember: if a search engine can find it, so can an attacker—and unlike active reconnaissance, search engine queries leave no traces on your systems.
