Search Engines for Reconnaissance

Search engines have evolved far beyond simple web page indexing. Modern search engines—both traditional web search and specialized cybersecurity-focused platforms—provide powerful reconnaissance capabilities for discovering vulnerable devices, exposed data, network infrastructure, and sensitive information. This lecture explores how attackers and security professionals leverage search engines to map attack surfaces and identify security weaknesses without directly interacting with target systems.

Traditional Search Engines for Reconnaissance

Google Dorking (Google Hacking)

Definition: Using advanced Google search operators to find security-related information, vulnerabilities, and sensitive data indexed by Google's web crawlers.

Why It Works: Organizations often unintentionally expose sensitive information on public-facing web servers. Google's aggressive crawling indexes this data, making it searchable through specific query operators.

Google Search Operators

Basic Operators

site: - Limit results to specific domain

site:example.com
site:example.com -www
site:*.example.com

filetype: or ext: - Search for specific file types

site:example.com filetype:pdf
site:example.com ext:xls
site:example.com (filetype:doc OR filetype:pdf OR filetype:xls)

inurl: - Search for terms in URL

inurl:admin
inurl:login
inurl:config
site:example.com inurl:admin

intitle: - Search for terms in page title

intitle:"index of"
intitle:"dashboard" inurl:admin
intitle:"Apache Status" "Apache Server Status for"

intext: - Search for terms in page body

intext:"confidential" site:example.com
intext:"password" filetype:log

cache: - View Google's cached version of page

cache:example.com

link: - Find pages linking to specific URL

link:example.com

Advanced Combinations

Exposed Directories:

intitle:"index of" "parent directory"
intitle:"index of" "backup"
intitle:"index of" ".git"
intitle:"index of" "database"

Configuration Files:

site:example.com ext:conf OR ext:config OR ext:cfg
site:example.com inurl:web.config
filetype:env "DB_PASSWORD"

Database Files:

filetype:sql "INSERT INTO" "VALUES"
filetype:sql "CREATE TABLE" intext:password
site:example.com ext:sql

Credentials and Sensitive Data:

site:example.com intext:"password" filetype:xls
site:example.com filetype:txt "username" "password"
"index of" intext:credentials
filetype:log intext:password

Server Information:

intitle:"Apache Status" "Server Version"
intitle:"IIS7" "Detailed Error"
intitle:"phpinfo()" "PHP Version"

Vulnerable Applications:

inurl:wp-admin site:example.com
"Powered by WordPress" site:example.com
inurl:jmx-console site:example.com

API Keys and Tokens:

site:github.com "API_KEY"
site:github.com "aws_secret_access_key"
site:github.com "authorization: Bearer"
filetype:env "API_KEY"

Backup Files:

site:example.com ext:bak
site:example.com inurl:backup
site:example.com filetype:old
intitle:"index of" "backup" site:example.com

Google Hacking Database (GHDB)

The Exploit Database maintains the Google Hacking Database, a curated collection of useful Google dorks categorized by purpose:

  • Footholds: Finding login pages and vulnerable apps

  • Files containing usernames: Exposed user lists

  • Sensitive directories: Configuration and backup directories

  • Web server detection: Server version and type

  • Vulnerable files: Known vulnerable file locations

  • Vulnerable servers: Server misconfigurations

  • Error messages: Information disclosure through errors

  • Files containing juicy info: Passwords, financial data, PII

  • Files containing passwords: Direct password exposures

  • Sensitive online shopping info: E-commerce vulnerabilities

Access: https://www.exploit-db.com/google-hacking-database

Practical Google Dorking Workflow

Phase 1: Domain Enumeration

site:example.com
site:*.example.com

Phase 2: Subdomain Discovery

site:*.example.com -www
site:*.example.com -site:www.example.com

Phase 3: Technology Identification

site:example.com "powered by"
site:example.com "built with"
site:example.com inurl:wp-content

Phase 4: Sensitive File Discovery

site:example.com filetype:pdf confidential
site:example.com ext:sql
site:example.com intitle:"index of"

Phase 5: Login Portal Discovery

site:example.com inurl:login
site:example.com inurl:admin
site:example.com intitle:dashboard

Other Traditional Search Engines

Bing - Microsoft's search engine with unique operators:

ip:192.168.1.1
contains:pdf site:example.com
url:admin site:example.com

DuckDuckGo - Privacy-focused, less aggressive caching, useful for avoiding detection:

site:example.com
filetype:pdf site:example.com

Yandex - Russian search engine, excellent image search (reverse image search often better than Google):

site:example.com

Shodan: Search Engine for Internet-Connected Devices

What is Shodan?

Shodan is the world's first search engine for internet-connected devices. Unlike traditional search engines that index web page content, Shodan continuously scans the entire IPv4 address space and indexes information about services, devices, and systems directly connected to the internet.

Created: By John Matherly in 2009 Purpose: Originally for security research, now used by security professionals, researchers, and unfortunately, attackers

What Shodan Indexes:

  • Web servers and their banners

  • Industrial Control Systems (ICS/SCADA)

  • Network devices (routers, switches, firewalls)

  • Internet of Things (IoT) devices

  • Databases exposed to the internet

  • Webcams and surveillance systems

  • Smart home devices

  • Medical equipment

  • Building management systems

  • Power grid components

How Shodan Works

  1. Scanning: Shodan continuously scans common ports across all IPv4 addresses

  2. Banner Grabbing: Captures service banners containing software versions, configurations

  3. Indexing: Stores data in searchable database

  4. Categorization: Tags devices by type, location, organization

  5. Vulnerability Matching: Cross-references with known vulnerabilities

Common Ports Scanned:

  • 21 (FTP), 22 (SSH), 23 (Telnet), 25 (SMTP)

  • 80 (HTTP), 443 (HTTPS), 8080, 8443 (HTTP alternates)

  • 3306 (MySQL), 5432 (PostgreSQL), 27017 (MongoDB)

  • 1883 (MQTT), 502 (Modbus), 102 (S7)

  • And hundreds more...

Shodan Search Syntax

Basic Searches

Search by hostname:

hostname:example.com

Search by IP address:

ip:192.168.1.1
net:192.168.1.0/24

Search by port:

port:22
port:3389

Search by country:

country:US
country:DE

Search by city:

city:"New York"
city:London

Search by organization:

org:"Amazon"
org:"Google"

Advanced Filters

Operating System:

os:"Windows"
os:"Linux"
os:"Ubuntu"

Product/Software:

product:"Apache"
product:"nginx"
product:"MySQL"

Version:

version:"2.4.41"
product:Apache version:2.4

Vulnerability (CVE):

vuln:CVE-2021-44228 (Log4Shell)
vuln:CVE-2017-0144 (EternalBlue)

Has Screenshot (for services with web interfaces):

has_screenshot:true

HTTP Components:

http.title:"Dashboard"
http.status:200
http.favicon.hash:12345678

Practical Shodan Queries

Find Webcams:

"webcam" has_screenshot:true
"Server: IP Webcam Server"

Find Remote Desktop Services:

port:3389 country:US
"Remote Desktop Protocol" city:London

Find Exposed Databases:

port:27017 product:"MongoDB"
"MongoDB Server Information" -authentication
port:3306 "MySQL" -authentication
"port:6379 Redis"

Find Industrial Control Systems:

"Modbus"
port:502
"Siemens" country:US

Find Vulnerable Systems:

vuln:CVE-2017-0144
"IIS/7.5" vuln:CVE-2015-1635
product:Apache version:2.4.49

Find Specific Organizations:

org:"Target Corporation"
hostname:example.com has_screenshot:true

Find Default Credentials:

"default password" port:23
"Authentication disabled" port:22

Shodan CLI and API

Shodan Command Line Interface:

# Install
pip install shodan

# Initialize with API key
shodan init YOUR_API_KEY

# Search
shodan search "hostname:example.com"
shodan search "port:22 country:US" --limit 100

# Get host information
shodan host 8.8.8.8

# Count results
shodan count "apache"

# Stream real-time data
shodan stream

Shodan API (Python Example):

import shodan

api = shodan.Shodan('YOUR_API_KEY')

# Search Shodan
results = api.search('hostname:example.com')

print(f'Results found: {results["total"]}')

for result in results['matches']:
    print(f'IP: {result["ip_str"]}')
    print(f'Port: {result["port"]}')
    print(f'Data: {result["data"]}')

Shodan Alternatives

Censys (https://censys.io/):

  • Similar to Shodan but with focus on SSL/TLS certificates

  • Better for finding subdomains via certificate transparency

  • Free academic access

  • More detailed SSL/TLS information

Search syntax:

parsed.names: example.com
protocols: "443/https"
location.country: "United States"

ZoomEye (https://www.zoomeye.org/):

  • Chinese alternative to Shodan

  • Good coverage of Asian networks

  • Web and host search capabilities

BinaryEdge (https://www.binaryedge.io/):

  • Comprehensive internet scanning

  • Includes DNS, Tor, and Torrents

  • Historical data available

Greynoise (https://www.greynoise.io/):

  • Focuses on internet background noise

  • Distinguishes malicious vs. benign scanning

  • Useful for threat intelligence

FOFA (https://fofa.info/):

  • Cyberspace search engine

  • Strong in Chinese networks

  • Advanced query syntax

Shodan for Defensive Reconnaissance

Organizations should use Shodan to discover their own exposed assets:

  1. Asset Discovery:

    org:"Your Organization Name"
    hostname:yourcompany.com
  2. Identify Exposed Services:

    • Find services that shouldn't be public

    • Locate forgotten or shadow IT assets

    • Discover misconfigurations

  3. Vulnerability Assessment:

    org:"Your Organization" vuln:*
  4. Monitoring:

    • Set up Shodan monitors for your IP ranges

    • Receive alerts when new services appear

    • Track changes over time

Other Specialized Search Engines

PublicWWW

Purpose: Search for specific code, scripts, or tracking IDs across websites

Use Cases:

  • Find all websites using specific Google Analytics ID

  • Discover sites using same advertising code

  • Identify websites by technology footprint

Example:

"UA-12345678" (Google Analytics ID)
"GTM-ABC123" (Google Tag Manager)

Certificate Search (crt.sh)

Purpose: Search certificate transparency logs for SSL/TLS certificates

Use Cases:

  • Subdomain enumeration

  • Find all domains owned by organization

  • Discover forgotten or test domains

Example queries:

%.example.com

URL: https://crt.sh/

GitHub Code Search:

  • Find leaked credentials in public repositories

  • Discover API keys and tokens

  • Identify technology stack from code

org:company_name password
filename:.env DB_PASSWORD

Pastebin Search (https://psbdmp.ws/):

  • Monitor pastes mentioning your organization

  • Find leaked credentials or data

  • Track data breaches

Wayback Machine

Internet Archive (https://archive.org/web/):

  • View historical versions of websites

  • Recover deleted content

  • Find old vulnerabilities or information

Use Cases:

  • See old employee directories

  • Find removed documentation

  • Discover changed infrastructure

Passive Nature

  • Search engines for reconnaissance are generally passive

  • You're querying a search engine, not the target directly

  • Information is already publicly accessible

  • Accessing exposed data may still violate laws (CFAA in U.S.)

  • Some jurisdictions consider accessing misconfigured systems illegal

  • Terms of service violations can have legal consequences

Responsible Use

  1. Don't access exposed systems: Finding is reconnaissance; accessing is intrusion

  2. Responsible disclosure: Report serious exposures to affected organizations

  3. Authorization required: Only access systems you have permission to test

  4. Document findings: Keep records of what you find and why

Notification Dilemma

If you discover serious exposures (e.g., medical records, financial data):

  • Consider responsible disclosure to organization

  • May report to CERT/CC or similar organizations

  • Balance risk of notification with risk of exposure

  • Document decision-making process

Defensive Measures

Preventing Search Engine Exposure

1. robots.txt Configuration:

User-agent: *
Disallow: /admin/
Disallow: /config/
Disallow: /backup/

Note: robots.txt doesn't prevent crawling, only requests it

2. Remove from Index:

  • Google Search Console: Request URL removal

  • Meta tags: <meta name="robots" content="noindex">

  • X-Robots-Tag HTTP header

3. Authentication and Access Controls:

  • Require authentication for sensitive areas

  • Don't rely on "security through obscurity"

  • Use proper access controls, not just hidden URLs

4. Regular Monitoring:

site:yourcompany.com filetype:pdf confidential
site:yourcompany.com intitle:"index of"

5. Information Disclosure Prevention:

  • Disable directory listing

  • Remove verbose error messages

  • Strip server version banners

  • Don't expose internal file structures

Shodan Protection

1. Minimize Internet Exposure:

  • Only expose services that must be public

  • Use VPN for administrative access

  • Implement network segmentation

2. Regular Shodan Audits:

# Monitor your organization
shodan search 'org:"Your Company"'

# Monitor your IP ranges
shodan host YOUR_IP_ADDRESS

3. Set Up Alerts:

  • Use Shodan monitoring service

  • Alert on new exposed services

  • Track changes in footprint

4. Banner Modification:

  • Modify server banners to remove versions

  • Use generic responses

  • Don't advertise technology stack

Practical Exercises

Exercise 1: Google Dorking Challenge

Objective: Find sensitive information about a target organization (with permission)

  1. Start with basic site search: site:example.com

  2. Look for exposed directories: site:example.com intitle:"index of"

  3. Find document types: site:example.com (filetype:pdf OR filetype:xls OR filetype:doc)

  4. Search for login pages: site:example.com (inurl:login OR inurl:admin)

  5. Look for technology indicators: site:example.com "powered by"

Document:

  • What sensitive information did you find?

  • What types of files are exposed?

  • What technologies are in use?

Exercise 2: Shodan Reconnaissance

Objective: Understand your organization's internet footprint

  1. Search by organization name: org:"Your Organization"

  2. Search by domain: hostname:yourcompany.com

  3. Analyze exposed services and ports

  4. Check for known vulnerabilities: org:"Your Organization" vuln:*

  5. Document findings and risk assessment

Exercise 3: Certificate Transparency

Objective: Enumerate subdomains via certificate logs

  1. Search for %.example.com

  2. Compile list of discovered subdomains

  3. Cross-reference with DNS enumeration results

  4. Identify previously unknown assets

Exercise 4: Self-OSINT via Search Engines

Objective: Understand what search engines reveal about you

  1. Google your name in quotes with variations

  2. Search your email address(es)

  3. Search your username(s)

  4. Check images (Google Images, your name)

  5. Review what's accessible and consider privacy

Integration with Reconnaissance Methodology

Search engine reconnaissance fits into the overall process:

  1. Passive Reconnaissance: Search engines are passive, generate no target logs

  2. Early Phase: Use before active scanning to understand scope

  3. Continuous: Search engines index new content constantly

  4. Validation: Confirm technical findings with search data

  5. Intelligence: Combine with OSINT for comprehensive picture

Key Takeaways

  • Traditional search engines (Google, Bing) can reveal sensitive data through dorking

  • Shodan and similar platforms index internet-connected devices and services

  • Search engine reconnaissance is largely passive but incredibly effective

  • Organizations must monitor their own search engine exposure

  • Legal and ethical considerations apply even to public data

  • Defensive reconnaissance helps organizations understand their attack surface

  • Combine multiple search engines for comprehensive coverage

  • Regular monitoring and mitigation reduces exposure

Additional Resources

Google Hacking

Shodan

Alternatives and Tools

Practice and Learning

  • HackTheBox: Includes boxes requiring search engine reconnaissance

  • TryHackMe - Google Dorking Room: Guided Google hacking exercises

  • Shodan Training: Regular webinars and tutorials on Shodan.io

Conclusion

Search engines have evolved into powerful reconnaissance tools that allow security professionals and attackers alike to discover vast amounts of information without directly interacting with targets. From Google dorking revealing misconfigured web servers to Shodan exposing critical infrastructure, these tools demonstrate that passive reconnaissance can be devastatingly effective. Organizations must adopt a defensive mindset by regularly auditing their search engine footprint and implementing controls to prevent sensitive information exposure. Remember: if a search engine can find it, so can an attacker—and unlike active reconnaissance, search engine queries leave no traces on your systems.

Last updated