ODIS Metadata Crawler & Search

This application finds, indexes, and searches ODIS (Ocean Data and Information System) JSON-LD metadata.

Built with Symfony 8.0 and Elasticsearch.

Usage Documentation

1. Crawling and Indexing

The core functionality is provided by a Symfony command that fetches all ODIS-Arch records from the catalogue API, extracts metadata links, and indexes the content into Elasticsearch.

Run the crawler:

Complete crawl:
    php bin/console app:odis:crawl
Targeted crawl (by ID):
    php bin/console app:odis:crawl 3215 3125
Skip specific IDs:
    php bin/console app:odis:crawl --skip 3215,3125
Parallel crawl:
    php bin/console app:odis:crawl --parallel --concurrency 5
Testing with limits:
    php bin/console app:odis:crawl --limit 10
Clear index:
    php bin/console app:odis:crawl --clear-index
Clear stats & reports:
    php bin/console app:odis:clear-stats
Clear EVERYTHING:
    php bin/console app:odis:clear-stats --all
Verbose output:
    php bin/console app:odis:crawl -v

What the crawler does:

  • If IDs are provided, it processes only those data sources.
  • If --skip is used, it excludes the specified IDs.
  • If --parallel (or -p) is used, it spawns a separate sub-process for each data source to speed up the crawl.
  • The --concurrency (or -c) option sets the maximum number of simultaneous processes (default: 5).
  • If --limit (or -l) is used, it caps the number of metadata records indexed per data source. This is useful for testing without a full crawl.
  • If --clear-index is used, it deletes and recreates the Elasticsearch index before starting the crawl. This is required when fixing mapping conflicts.
  • It fetches all data source IDs and metadata from the ODIS-Arch Records API.
  • It analyzes each source for its ODIS-Arch URL and ODIS-Arch Type.
  • It parses Sitemaps or direct Sitegraph JSON-LD files.
  • It indexes the extracted metadata into the odis_metadata index in Elasticsearch.
  • Successive crawls update existing entries instead of creating duplicates, so the dataset is cumulative.
  • All issues are logged in detail and can be viewed in the Crawler Dashboard.
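
This page does not say how the crawler avoids duplicates on repeated crawls. One common approach, shown here purely as an assumption, is to derive the Elasticsearch document ID deterministically from the record's URL, so that indexing the same record again replaces the earlier version. The doc_id helper and the example URL are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive a stable document ID from a record URL.
# The same URL always hashes to the same 40-character hex string.
doc_id() {
  printf '%s' "$1" | sha1sum | cut -d' ' -f1
}

url="https://example.org/dataset/42.jsonld"   # illustrative URL only
id=$(doc_id "$url")

# Indexing with an explicit ID creates the document or replaces the existing one.
echo "PUT /odis_metadata/_doc/$id"
```

Because the ID depends only on the URL, a second crawl of the same record targets the same document rather than adding a new one.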

2. Searching & Monitoring

Use the built-in search interface to query the indexed metadata. It runs an Elasticsearch multi_match query across the name, description, and keywords fields.
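
As a rough illustration of the query described above, the sketch below builds a multi_match request body over those three fields. The build_query and json_escape helpers are hypothetical, and the application's actual query may set options (such as boosts or fuzziness) that are not shown here.

```shell
#!/bin/sh
# Escape backslashes and double quotes so the term is safe inside a JSON string.
# (Control characters such as newlines are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Print a multi_match request body for the given search term.
build_query() {
  printf '{"query":{"multi_match":{"query":"%s","fields":["name","description","keywords"]}}}\n' \
    "$(json_escape "$1")"
}

build_query "sea surface temperature"

# To send it to Elasticsearch (assumes the variables from the Configuration section are exported):
#   build_query "sea surface temperature" | curl -s \
#     -u "$ELASTICSEARCH_USER:$ELASTICSEARCH_PASSWORD" \
#     -H 'Content-Type: application/json' \
#     "$ELASTICSEARCH_URL/odis_metadata/_search" -d @-
```

Running the same body by hand against the _search endpoint can help when debugging why a record does or does not appear in the search interface.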

For real-time crawl monitoring and error tracking, visit the Crawler Dashboard.

3. Configuration

Environment variables in the .env file control the Elasticsearch connection:

  • ELASTICSEARCH_URL: Usually https://localhost:9200.
  • ELASTICSEARCH_USER: odis_metadata
  • ELASTICSEARCH_PASSWORD: The secure password provided during setup.
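
Put together, the relevant part of the .env file might look like the fragment below. The password value is a placeholder, not a real credential.

```shell
# .env (sketch; replace the placeholder password with the one from your setup)
ELASTICSEARCH_URL=https://localhost:9200
ELASTICSEARCH_USER=odis_metadata
ELASTICSEARCH_PASSWORD=change-me
```

In Symfony projects, secrets and machine-specific values are usually kept in .env.local, which overrides .env and is not committed.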

4. Resetting Project Data

To clear all data and start from scratch, you can use the built-in maintenance command or the button on the Dashboard:

1. Clear Everything (Index + Stats):
Wipe both the Elasticsearch index and the local database stats:
    php bin/console app:odis:clear-stats --all

2. Clear ONLY Stats & Reports:
Empty the crawl history table but keep the search index:
    php bin/console app:odis:clear-stats

3. Clear ONLY Elasticsearch Index:
Delete and recreate the metadata index. The crawl still runs afterwards, so --limit 1 keeps it to one record per data source:
    php bin/console app:odis:crawl --clear-index --limit 1

4. Clear Log Files:
Remove any temporary crawl logs:
    rm -f crawl_*.log

Warning: These operations are destructive and cannot be undone. All indexed metadata and crawl history will be lost.
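
Because these commands cannot be undone, it can help to preview them first. The sketch below wraps the reset commands in a dry-run helper. DRY_RUN, run, and reset_all are hypothetical names and not part of the project.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

reset_all() {
  # Wipe the Elasticsearch index and the crawl stats.
  run php bin/console app:odis:clear-stats --all
  # The shell expands the glob before run is called, so a dry run lists the
  # log files that would be removed (or the literal pattern if none match).
  run rm -f crawl_*.log
}

reset_all
```

Running the script with DRY_RUN=0 from the project root executes the commands for real.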

System Status

Framework: Symfony 8.0
Storage: Elasticsearch 9.3
PHP Version: 8.5+