How the WoodCentral Archive section works

Overview

The system is a static inverted-index search engine for the legacy PHP articles. It reads the .php files in the articles directory, extracts their content, and builds three JSON caches so that keyword and search-term lookups run without querying a database.

The three JSON outputs are:

archive_meta.json – Metadata per file (title, snippet, keywords).
archive_keywords.json – Frequency count of all meta keywords across files.
archive_index.json – Inverted index mapping words → files containing them.

Workflow

1. File Discovery

Uses glob($articlesDir.'/*.php') to locate all PHP article files.
Iterates over each file and reads the contents via file_get_contents().
Files that fail to read are skipped with a warning.
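
The discovery step can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name wc_read_articles() and the use of error_log() for the warning are assumptions; only glob() and file_get_contents() come from the description above.

```php
<?php
// Hypothetical helper: returns [filename => contents] for every readable
// .php file directly inside $articlesDir (glob() here is non-recursive).
function wc_read_articles(string $articlesDir): array
{
    $articles = [];
    // glob() returns false on error; treat that as "no files found".
    $paths = glob($articlesDir . '/*.php') ?: [];
    foreach ($paths as $path) {
        $html = @file_get_contents($path);
        if ($html === false) {
            // Assumption: the warning goes to the PHP error log.
            error_log("archive indexer: could not read $path, skipping");
            continue;
        }
        $articles[basename($path)] = $html;
    }
    return $articles;
}
```

Keying the result by basename() matches the filename keys used in the JSON caches described later.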

2. Metadata Extraction

Title: Extracted using regex: /<title>(.*?)<\/title>/si.
Meta keywords: Extracted using regex:

/<meta\s+name=["']keywords["']\s+content=["'](.*?)["']\s*\/?>/si

Keywords are normalized: lowercased, trimmed, and stored in both:
- $fileKeywords for the current file.
- $keywords global array for counting occurrences.
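
A minimal sketch of this step, using the two regexes quoted above. The function name wc_extract_meta(), the trim() on the title, and the assumption that keywords are comma-separated inside the content attribute are illustrative choices, not confirmed details of the script.

```php
<?php
// Hypothetical helper wrapping the title and meta-keywords regexes.
function wc_extract_meta(string $html): array
{
    $title = '';
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m)) {
        $title = trim($m[1]); // trim() is an addition in this sketch
    }

    $fileKeywords = [];
    $re = '/<meta\s+name=["\']keywords["\']\s+content=["\'](.*?)["\']\s*\/?>/si';
    if (preg_match($re, $html, $m)) {
        // Assumption: keywords are separated by commas.
        foreach (explode(',', $m[1]) as $kw) {
            $kw = strtolower(trim($kw)); // normalize: lowercase and trim
            if ($kw !== '') {
                $fileKeywords[] = $kw;
            }
        }
    }
    return ['title' => $title, 'keywords' => $fileKeywords];
}

// Global frequency counter, playing the role of $keywords.
$keywords = [];
$meta = wc_extract_meta(
    '<title>A Recipe</title><meta name="keywords" content="Russ Fairfield, Spalting">'
);
foreach ($meta['keywords'] as $kw) {
    $keywords[$kw] = ($keywords[$kw] ?? 0) + 1;
}
```

Note that the keywords regex only matches when the name attribute comes directly before content; a tag with the attributes in the other order would yield no keywords.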

3. Content Extraction for Search

Primary content comes from <main id="wc-main">:

/<main[^>]*id=["']wc-main["'][^>]*>(.*?)<\/main>/si

If <main> is missing, the entire file content is used as a fallback.
HTML tags are removed from the content with strip_tags(), and whitespace is normalized with preg_replace('/\s+/',' ', ...).
A display snippet for search result previews is created with mb_substr($snippetText, 0, 300).
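
The extraction steps above might look like this in practice. The function name wc_extract_text() and the final trim() are assumptions made for the sketch; the regex, strip_tags(), the whitespace normalization, and the 300-character mb_substr() come from the description.

```php
<?php
// Hypothetical helper: returns the plain text used for the snippet and index.
function wc_extract_text(string $html): string
{
    // Prefer the <main id="wc-main"> block; fall back to the whole file.
    if (preg_match('/<main[^>]*id=["\']wc-main["\'][^>]*>(.*?)<\/main>/si', $html, $m)) {
        $content = $m[1];
    } else {
        $content = $html;
    }
    $text = strip_tags($content);
    // preg_replace() returns null on error, hence the ?? '' guard.
    $text = preg_replace('/\s+/', ' ', $text) ?? '';
    return trim($text); // trim() is an addition in this sketch
}

$snippetText = wc_extract_text(
    "<nav>Menu</nav><main id=\"wc-main\"><h1>Spalted   Wood</h1>\n<p>Since the question comes up.</p></main>"
);
$snippet = mb_substr($snippetText, 0, 300); // display snippet for previews
```

Because the snippet is cut at a fixed character count, it can end mid-word; trimming back to the last space would be a small refinement.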

4. Inverted Index Generation

Full text for indexing: $text = strtolower($title . ' ' . $snippetText);
Clean-up for search indexing:

$text = preg_replace('/[^\p{L}\p{N}\s\-]+/u',' ', (string)$text);

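This replaces every run of characters that are not letters, digits, whitespace, or hyphens with a space, so hyphenated tokens such as 1-qt survive intact.
Lowercasing has already been applied by strtolower() above, which makes the search case-insensitive.
The text is then split into words using preg_split('/\s+/u', (string)$text, -1, PREG_SPLIT_NO_EMPTY).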
Words are filtered:
- Minimum length: 2 characters.
- Stop words excluded (common words like the, and, for).
For each valid word, add the filename to the inverted index $index[word][].
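After all files are processed, duplicates in $index[word] are removed using array_unique(). Because array_unique() preserves the original keys, the result should be passed through array_values() so that each list is encoded as a JSON array rather than an object.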
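
Put together, the indexing step can be sketched as below. The function name wc_tokenize(), the three-word stop list, the sample documents, and the use of mb_strlen() for the length check are illustrative assumptions; the real $stop array is not shown in this document.

```php
<?php
// Illustrative stop list only; the script's actual $stop array is longer.
$stop = ['the', 'and', 'for'];

// Hypothetical helper: turns a title plus body text into indexable words.
function wc_tokenize(string $title, string $snippetText, array $stop): array
{
    $text = strtolower($title . ' ' . $snippetText);
    $text = preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', (string)$text);
    $words = preg_split('/\s+/u', (string)$text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Keep words of at least 2 characters that are not stop words.
    return array_values(array_filter($words, static function (string $w) use ($stop): bool {
        return mb_strlen($w) >= 2 && !in_array($w, $stop, true);
    }));
}

// Made-up sample data: [title, body text] per file.
$docs = [
    'russ04.php'        => ['A RECIPE FOR CREATING SPALTED WOOD', 'Mix 1-qt of water and oak leaves.'],
    'other_article.php' => ['Oak finishing', 'The oak was sanded.'],
];

$index = [];
foreach ($docs as $file => [$title, $body]) {
    foreach (wc_tokenize($title, $body, $stop) as $word) {
        $index[$word][] = $file;
    }
}
foreach ($index as $word => $files) {
    // array_values() re-indexes after array_unique() so json_encode() emits arrays.
    $index[$word] = array_values(array_unique($files));
}
```

Removing duplicates once at the end keeps the inner loop simple, at the cost of temporarily holding repeated filenames in memory.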

5. JSON Cache Files

archive_meta.json – keyed by filename:

{
  "russ04.php": {
    "file": "russ04.php",
    "title": "A RECIPE FOR CREATING SPALTED WOOD",
    "snippet": "Since the question of spalting comes up time and again, I will share a description of ...",
    "keywords": ["russ fairfield"]
  }
}

archive_keywords.json – frequency map:

{
  "russ fairfield": 1,
  "woodturning": 12,
  ...
}

archive_index.json – inverted index:

{
  "oak": ["russ04.php", "other_article.php"],
  "leaves": ["russ04.php", "article_521.php"],
  "spalted": ["russ04.php", "spalt_article.php"],
  ...
}

Enables fast lookup of files containing any given word.
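
Writing and reading the three caches could be done along these lines. The helper names, the $cacheDir parameter, and the json_encode() flags are assumptions chosen for readability of the output files; the document does not specify how the script encodes them.

```php
<?php
// Hypothetical writer for one cache file (e.g. archive_index.json).
function wc_write_cache(string $cacheDir, string $name, array $data): bool
{
    // Assumed flags: pretty-printed, with Unicode and slashes left unescaped.
    $json = json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return false; // e.g. invalid UTF-8 in a title or snippet
    }
    return file_put_contents($cacheDir . '/' . $name, $json) !== false;
}

// Hypothetical reader: returns [] when the file is missing or not valid JSON.
function wc_read_cache(string $cacheDir, string $name): array
{
    $path = $cacheDir . '/' . $name;
    if (!is_file($path)) {
        return [];
    }
    $data = json_decode((string)file_get_contents($path), true);
    return is_array($data) ? $data : [];
}
```

Decoding with the second argument set to true yields associative arrays, which is what the array-based search described in the next section expects.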

6. Search Process (Client-Side Reference)

User input (the search parameter) is normalized the same way as the indexed text: lowercased, stripped of punctuation, and split into words.
The inverted index is consulted for each word.
Intersection of arrays ensures that multiple-word searches only return files containing all words.
Keyword filters (keywords[]) are applied post-search using array_intersect() with the article’s meta keywords.
Final results are sorted alphabetically by title.
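
The search flow above can be sketched as a single function. Several details are assumptions, since the search code itself is not reproduced here: the name wc_search(), treating a query word that is absent from the index as "no results", matching a file when it carries any one of the selected keywords, returning nothing for an empty query, and using strcasecmp() for the title sort.

```php
<?php
// Hypothetical search over the decoded archive_index.json ($index)
// and archive_meta.json ($meta) arrays.
function wc_search(string $query, array $keywordFilters, array $index, array $meta): array
{
    // Normalize the query like the indexed text.
    $q = preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', strtolower($query));
    $words = preg_split('/\s+/u', (string)$q, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Intersect the posting lists so only files containing all words remain.
    $results = null;
    foreach ($words as $word) {
        $files = $index[$word] ?? [];
        $results = $results === null ? $files : array_intersect($results, $files);
    }
    $results = array_values($results ?? []);

    // Post-filter: keep files sharing at least one selected keyword (assumption).
    if ($keywordFilters !== []) {
        $results = array_filter($results, static function ($file) use ($keywordFilters, $meta): bool {
            return array_intersect($keywordFilters, $meta[$file]['keywords'] ?? []) !== [];
        });
    }

    // Sort by title, case-insensitively; usort() also re-indexes the array.
    usort($results, static function ($a, $b) use ($meta): int {
        return strcasecmp($meta[$a]['title'] ?? '', $meta[$b]['title'] ?? '');
    });
    return $results;
}
```

With this sketch, a query containing a stop word returns no results, because stop words never enter the index; dropping stop words from the query first would avoid that.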

7. Technical Notes

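PHP 8.3 Compatibility: preg_replace() returns null on failure, and passing null to preg_split() is deprecated since PHP 8.1; the (string)$text cast ensures preg_split() always receives a string.
Unicode Support: Regex uses \p{L} and \p{N} with the u modifier for proper Unicode word handling. Note that strtolower() does not lowercase non-ASCII letters in UTF-8 text; mb_strtolower() would be needed for case-insensitive matching of accented or non-Latin words.
Performance: All JSON caches are prebuilt, and runtime search works entirely on arrays, so no database queries are needed.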
Extensibility:
- Additional stop words can be added to $stop.
- Snippet length adjustable via mb_substr().
- Indexing rules can be modified (e.g., keep additional punctuation, change the minimum word length, or index phrases).
- Files without <main> are safely indexed using fallback content.

8. Recommendations for Modifications

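Exact Phrase Matching: Currently the search is word-based. Post-filtering $results with stripos($content, $searchPhrase) would restrict results to exact phrase matches. Since archive_meta.json stores only the 300-character snippet, $content has to be the article's full text, either re-read from the file or added to the cache.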
Multi-Language Support: Extend stop words and character classes for non-English letters.
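Incremental Indexing: The script currently rebuilds everything on each run. It could be optimized to re-index only files that have changed.
Search Scoring / Ranking: Add frequency-based scoring using $keywords or $index[word] counts. Because duplicates are removed from $index[word], per-file word counts would have to be recorded during indexing before they can be used for ranking.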
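
As an illustration of the exact-phrase recommendation above, the post-filter might look like this. The function name wc_filter_by_phrase() and the $loadText callback (anything that returns an article's normalized plain text for a filename) are hypothetical; only the stripos() check comes from the recommendation itself.

```php
<?php
// Hypothetical post-filter: keeps only the files whose text contains the phrase.
function wc_filter_by_phrase(array $results, string $searchPhrase, callable $loadText): array
{
    // Collapse whitespace so the phrase is compared against normalized text.
    $phrase = trim(preg_replace('/\s+/', ' ', $searchPhrase) ?? '');
    if ($phrase === '') {
        return $results; // nothing to filter on
    }
    return array_values(array_filter(
        $results,
        static function (string $file) use ($phrase, $loadText): bool {
            // stripos() is case-insensitive for ASCII letters only.
            return stripos($loadText($file), $phrase) !== false;
        }
    ));
}

// Usage with made-up texts standing in for the real article content.
$texts = [
    'russ04.php'        => 'A recipe for creating spalted wood from oak leaves',
    'other_article.php' => 'Wood that has spalted on its own',
];
$matches = wc_filter_by_phrase(
    ['russ04.php', 'other_article.php'],
    'Spalted  Wood',
    static fn(string $file): string => $texts[$file] ?? ''
);
```

Running the word-based search first and the phrase check second keeps the expensive full-text comparison limited to the few files that already contain every word.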

This documentation should allow another programmer to understand, maintain, and extend the indexing/search system, including handling new articles, keywords, or advanced search features.