How the WoodCentral Archive section works

Overview

The system is a static inverted-index search engine for your legacy PHP articles. It reads .php files from the articles directory, extracts content, builds three JSON caches, and enables fast keyword or search-term lookups without querying a database.

The three JSON outputs are:

  1. archive_meta.json – Metadata per file (title, snippet, keywords).
  2. archive_keywords.json – Frequency count of all meta keywords across files.
  3. archive_index.json – Inverted index mapping words → files containing them.

Workflow

1. File Discovery

  • Uses glob($articlesDir.'/*.php') to locate all PHP article files.
  • Iterates over each file and reads the contents via file_get_contents().
  • Files that fail to read are skipped with a warning.
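
The discovery step above can be sketched roughly as follows. This is a minimal illustration, not the production script: the function name discoverArticles(), the use of basename() as the array key, and error_log() as the warning channel are all assumptions.

```php
<?php
// Hypothetical helper (not from the original script): returns
// [filename => raw contents] for every readable .php file in $articlesDir.
function discoverArticles(string $articlesDir): array
{
    $articles = [];
    // glob() can return false on error, so fall back to an empty list.
    $paths = glob($articlesDir . '/*.php') ?: [];
    foreach ($paths as $path) {
        $html = @file_get_contents($path);
        if ($html === false) {
            // Assumed warning mechanism: skip the file and log the problem.
            error_log("Skipping unreadable file: $path");
            continue;
        }
        // Assumption: caches are keyed by bare filename, e.g. "russ04.php".
        $articles[basename($path)] = $html;
    }
    return $articles;
}

// Example call; a missing directory simply yields an empty array.
$articles = discoverArticles(__DIR__ . '/articles');
```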

2. Metadata Extraction

  • Title: Extracted using regex: /<title>(.*?)<\/title>/si.
  • Meta keywords: Extracted using regex:
/<meta\s+name=["']keywords["']\s+content=["'](.*?)["']\s*\/?>/si
  • Keywords are normalized: lowercased, trimmed, and stored in both:
    • $fileKeywords for the current file.
    • $keywords global array for counting occurrences.
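
As a rough sketch of this step, the two regexes can be applied like this. The function name extractMeta() is hypothetical, and splitting the content attribute on commas is an assumption; the description above does not say how the keyword string is divided.

```php
<?php
// Hypothetical helper: pulls the title and meta keywords out of one file.
function extractMeta(string $html): array
{
    $title = '';
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m)) {
        $title = trim($m[1]);
    }

    $fileKeywords = [];
    $pattern = '/<meta\s+name=["\']keywords["\']\s+content=["\'](.*?)["\']\s*\/?>/si';
    if (preg_match($pattern, $html, $m)) {
        // Assumption: keywords are comma-separated inside the content attribute.
        foreach (explode(',', $m[1]) as $kw) {
            $kw = strtolower(trim($kw)); // normalize: lowercase and trim
            if ($kw !== '') {
                $fileKeywords[] = $kw;
            }
        }
    }
    return ['title' => $title, 'keywords' => $fileKeywords];
}

// Tally each keyword into the global frequency map.
$keywords = [];
$meta = extractMeta(
    '<title>Spalted Wood</title>'
    . '<meta name="keywords" content="Russ Fairfield, Woodturning">'
);
foreach ($meta['keywords'] as $kw) {
    $keywords[$kw] = ($keywords[$kw] ?? 0) + 1;
}
```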

3. Content Extraction for Search

  • Primary content comes from <main id="wc-main">:
/<main[^>]*id=["']wc-main["'][^>]*>(.*?)<\/main>/si
  • If <main> is missing, the entire file content is used as a fallback.
  • Content is stripped of HTML tags using strip_tags(), and whitespace is normalized with preg_replace('/\s+/',' ', ...).
  • A display snippet is created using mb_substr($snippetText, 0, 300) and is used in search result previews.
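
A minimal sketch of the extraction and fallback logic described above. The function name extractText() and the final trim() are assumptions added for illustration.

```php
<?php
// Hypothetical helper: returns plain, whitespace-normalized text for one file.
function extractText(string $html): string
{
    // Prefer the <main id="wc-main"> block; otherwise keep the whole file.
    if (preg_match('/<main[^>]*id=["\']wc-main["\'][^>]*>(.*?)<\/main>/si', $html, $m)) {
        $html = $m[1];
    }
    $text = strip_tags($html);
    // Collapse runs of whitespace; trim() is an added assumption.
    return trim((string) preg_replace('/\s+/', ' ', $text));
}

$snippetText = extractText(
    "<nav>Menu</nav><main id=\"wc-main\"><p>Since the question\n of   spalting comes up</p></main>"
);
// First 300 characters (not bytes) for the search result preview.
$snippet = mb_substr($snippetText, 0, 300);
```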

4. Inverted Index Generation

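  • Full text for indexing: $text = strtolower($title . ' ' . $snippetText); Here $snippetText is the stripped, whitespace-normalized text from step 3 before it is truncated to 300 characters.
  • Clean-up for search indexing:
$text = preg_replace('/[^\p{L}\p{N}\s\-]+/u',' ', (string)$text);
    • Replaces every run of characters other than letters, digits, whitespace, and hyphens with a space, so hyphenated terms such as 1-qt survive as single words.
    • Case-insensitive search comes from the strtolower() call above. As of PHP 8.2 strtolower() folds ASCII letters only; mb_strtolower() would be needed to lowercase non-ASCII letters.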
  • Split text into words using preg_split('/\s+/u', (string)$text, -1, PREG_SPLIT_NO_EMPTY).
  • Words are filtered:
    • Minimum length: 2 characters.
    • Stop words excluded (common words like the, and, for).
  • For each valid word, add the filename to the inverted index $index[word][].
  • After all files, duplicates in $index[word] are removed using array_unique().
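
Put together, the indexing steps above might look like the sketch below. The tokenize() helper, the sample $docs array, and the short $stop list are illustrative assumptions. The array_values() call is also an addition: array_unique() preserves keys, and without re-indexing json_encode() may emit an object instead of a list.

```php
<?php
// Hypothetical helper: turns raw text into a list of indexable words.
function tokenize(string $text, array $stop): array
{
    $text = strtolower($text);
    // Keep letters, digits, whitespace, and hyphens; everything else becomes a space.
    $text = preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', (string) $text);
    $words = preg_split('/\s+/u', (string) $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Drop one-character words and stop words.
    return array_values(array_filter(
        $words,
        fn(string $w): bool => mb_strlen($w) >= 2 && !in_array($w, $stop, true)
    ));
}

$stop = ['the', 'and', 'for']; // sample stop words only
$docs = [                      // sample data standing in for title + text
    'russ04.php'        => 'A Recipe for Creating Spalted Wood: oak and the 1-qt jar',
    'other_article.php' => 'Oak leaves, oak bark',
];

$index = [];
foreach ($docs as $file => $text) {
    foreach (tokenize($text, $stop) as $word) {
        $index[$word][] = $file;
    }
}
// Remove duplicate filenames and re-index so each entry stays a JSON list.
foreach ($index as $word => $files) {
    $index[$word] = array_values(array_unique($files));
}
```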

5. JSON Cache Files

  1. archive_meta.json – keyed by filename:
{
  "russ04.php": {
    "file": "russ04.php",
    "title": "A RECIPE FOR CREATING SPALTED WOOD",
    "snippet": "Since the question of spalting comes up time and again, I will share a description of ...",
    "keywords": ["russ fairfield"]
  }
}
  2. archive_keywords.json – frequency map:
{
  "russ fairfield": 1,
  "woodturning": 12,
  ...
}
  3. archive_index.json – inverted index:
{
  "oak": ["russ04.php", "other_article.php"],
  "leaves": ["russ04.php", "article_521.php"],
  "spalted": ["russ04.php", "spalt_article.php"],
  ...
}
  • Enables fast lookup of files containing any given word.
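
How the caches are written is not described above; one plausible approach is sketched here. The writeCache() name, the chosen json_encode() flags, and the temporary output path are assumptions.

```php
<?php
// Hypothetical helper: serializes one cache array to a JSON file.
function writeCache(string $path, array $data): bool
{
    $json = json_encode(
        $data,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES
    );
    if ($json === false) {
        return false; // encoding failed, e.g. invalid UTF-8 in the data
    }
    // LOCK_EX avoids a reader seeing a half-written file during a rebuild.
    return file_put_contents($path, $json, LOCK_EX) !== false;
}

// Example: write a small keyword frequency map to a temporary location.
$cachePath = sys_get_temp_dir() . '/archive_keywords.json';
$ok = writeCache($cachePath, ['russ fairfield' => 1, 'woodturning' => 12]);
```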

6. Search Process (Client-Side Reference)

  • User input (search parameter) is normalized similarly: lowercase, punctuation removed, split into words.
  • The inverted index is consulted for each word.
  • Intersection of arrays ensures that multiple-word searches only return files containing all words.
  • Keyword filters (keywords[]) are applied post-search using array_intersect() with the article’s meta keywords.
  • Final results are sorted alphabetically by title.
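
The lookup flow above can be sketched as follows. The searchArchive() function is hypothetical, and two behaviors are assumptions not stated above: an empty query returns every article, and a keyword filter keeps an article if it has at least one of the selected keywords.

```php
<?php
// Hypothetical search routine over the decoded index and meta caches.
function searchArchive(string $query, array $selectedKeywords, array $index, array $meta): array
{
    // Normalize the query the same way the indexer normalizes text.
    $q = strtolower($query);
    $q = (string) preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', $q);
    $words = preg_split('/\s+/u', $q, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Intersect the file lists so only files containing every word remain.
    $files = null;
    foreach ($words as $word) {
        $hits = $index[$word] ?? [];
        $files = $files === null ? $hits : array_values(array_intersect($files, $hits));
    }
    $files = $files ?? array_keys($meta); // assumption: no words means all files

    // Post-filter by meta keywords (assumption: any selected keyword matches).
    if ($selectedKeywords !== []) {
        $wanted = array_map('strtolower', $selectedKeywords);
        $files = array_filter(
            $files,
            fn($f): bool => array_intersect($wanted, $meta[$f]['keywords'] ?? []) !== []
        );
    }

    $results = [];
    foreach ($files as $f) {
        if (isset($meta[$f])) {
            $results[] = $meta[$f];
        }
    }
    // Sort alphabetically by title, ignoring case.
    usort($results, fn(array $a, array $b): int => strcasecmp($a['title'], $b['title']));
    return $results;
}
```

Note that this sketch does not drop stop words from the query, so a search containing a word that was never indexed returns no results.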

7. Technical Notes

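  • PHP 8.3 Compatibility: preg_replace() returns null on failure, and passing null to preg_split() has been deprecated since PHP 8.1. The (string)$text cast guarantees that preg_split() always receives a string.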
  • Unicode Support: Regex uses \p{L} and \p{N} with the u modifier for proper Unicode word handling.
  • Performance: All JSON caches are prebuilt; runtime search is array-based → avoids database queries.
  • Extensibility:
    • Additional stop words can be added to $stop.
    • Snippet length adjustable via mb_substr().
    • Indexing rules can be modified (e.g., exclude numbers, keep additional punctuation, or index phrases).
    • Files without <main> are safely indexed using fallback content.

8. Recommendations for Modifications

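  1. Exact Phrase Matching: Currently the search is word-based. Post-filtering $results with stripos($content, $searchPhrase) !== false will restrict results to phrase matches. The strict comparison matters because stripos() returns 0 for a match at the start of the text. Here $content must be the article's full text, since the cached snippet holds only the first 300 characters.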
  2. Multi-Language Support: Extend stop words and character classes for non-English letters.
  3. Incremental Indexing: Currently, the script rebuilds everything. Could optimize to only re-index changed files.
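  4. Search Scoring / Ranking: Add frequency-based scoring. count($index[word]) gives the number of files containing a word, and $keywords gives the number of files carrying each meta keyword. Per-file term counts are not stored because duplicates are removed, so they would have to be recorded at index time.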
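
As an illustration of the first recommendation, a phrase post-filter could be sketched like this. The filterByPhrase() name and the $loadText callback are hypothetical; the callback stands in for whatever supplies an article's full plain text, since the caches hold only a 300-character snippet.

```php
<?php
// Hypothetical post-filter: keeps only results whose full text contains the phrase.
function filterByPhrase(array $results, string $searchPhrase, callable $loadText): array
{
    $searchPhrase = trim($searchPhrase);
    if ($searchPhrase === '') {
        return $results; // nothing to filter on
    }
    return array_values(array_filter(
        $results,
        // stripos() returns 0 for a match at the very start, so compare with false.
        fn(array $r): bool => stripos($loadText($r['file']), $searchPhrase) !== false
    ));
}

// Sample data; in practice $loadText would re-read and strip the article file.
$texts = [
    'a.php' => 'Spalted wood is oak left to rot',
    'b.php' => 'Wood that is spalted',
];
$results = [
    ['file' => 'a.php', 'title' => 'A'],
    ['file' => 'b.php', 'title' => 'B'],
];
$phraseHits = filterByPhrase($results, 'spalted wood', fn(string $f): string => $texts[$f]);
```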

This documentation should allow another programmer to understand, maintain, and extend the indexing/search system, including handling new articles, keywords, or advanced search features.



Licensed under CC BY-NC 4.0

DevOps viewpoints are those of its owner. You may share and adapt this article for non-commercial purposes, provided proper attribution is given. Attribution should include:

Title: How the WoodCentral Archive section works
Author: peter arthur martin
Original URL: https://www.woodcentral.com/-/peter/how-the-woodcentral-archive-section-works/
License: CC BY-NC 4.0
