Overview
The system is a static inverted-index search engine for your legacy PHP articles. It reads .php files from the articles directory, extracts content, builds three JSON caches, and enables fast keyword or search-term lookups without querying a database.
The three JSON outputs are:
- archive_meta.json – Metadata per file (title, snippet, keywords).
- archive_keywords.json – Frequency count of all meta keywords across files.
- archive_index.json – Inverted index mapping words → files containing them.
Workflow
1. File Discovery
- Uses glob($articlesDir.'/*.php') to locate all PHP article files.
- Iterates over each file and reads the contents via file_get_contents().
- Files that fail to read are skipped with a warning.
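The discovery step above could be sketched roughly as follows. The function name readArticles and the exact warning text are assumptions for illustration; only glob() and file_get_contents() come from the description.

```php
<?php
// Hypothetical helper: returns [filename => contents] for every .php file in $articlesDir.
function readArticles(string $articlesDir): array
{
    $articles = [];
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob($articlesDir . '/*.php') ?: [] as $path) {
        // Suppress the native warning; a custom one is raised below instead.
        $html = @file_get_contents($path);
        if ($html === false) {
            trigger_error("Could not read $path, skipping", E_USER_WARNING);
            continue;
        }
        $articles[basename($path)] = $html;
    }
    return $articles;
}
```

Keying the result by basename() matches the filename keys used in the JSON caches.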
2. Metadata Extraction
- Title: Extracted using regex:
  /<title>(.*?)<\/title>/si
- Meta keywords: Extracted using regex:
  /<meta\s+name=["']keywords["']\s+content=["'](.*?)["']\s*\/?>/si
- Keywords are normalized (lowercased and trimmed) and stored in both:
  - $fileKeywords for the current file.
  - The $keywords global array for counting occurrences.
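A minimal sketch of the extraction above, using the two regexes as given. The helper name extractMeta, the comma-separated keyword format, and the trim() on the title are assumptions; the document does not state how the content attribute is split.

```php
<?php
// Hypothetical helper: pulls the title and meta keywords out of one article's HTML.
function extractMeta(string $html): array
{
    $title = '';
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m)) {
        $title = trim($m[1]); // trimming the title is an addition in this sketch
    }

    $fileKeywords = [];
    $pattern = '/<meta\s+name=["\']keywords["\']\s+content=["\'](.*?)["\']\s*\/?>/si';
    if (preg_match($pattern, $html, $m)) {
        // Assumption: keywords are comma-separated inside the content attribute.
        foreach (explode(',', $m[1]) as $kw) {
            $kw = strtolower(trim($kw));
            if ($kw !== '') {
                $fileKeywords[] = $kw;
            }
        }
    }
    return ['title' => $title, 'keywords' => $fileKeywords];
}

// Usage: accumulate the global keyword counts across files.
$keywords = [];
$meta = extractMeta('<title>A RECIPE FOR CREATING SPALTED WOOD</title>'
    . '<meta name="keywords" content="Russ Fairfield, Woodturning">');
foreach ($meta['keywords'] as $kw) {
    $keywords[$kw] = ($keywords[$kw] ?? 0) + 1;
}
```

Note that the keywords regex expects the name attribute directly before content; a tag written with the attributes in the other order would not match.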
3. Content Extraction for Search
- Primary content comes from <main id="wc-main">:
  /<main[^>]*id=["']wc-main["'][^>]*>(.*?)<\/main>/si
- If <main> is missing, the entire file content is used as a fallback.
- Content is stripped of HTML tags using strip_tags(), and whitespace is normalized with preg_replace('/\s+/',' ', ...).
- A display snippet is created using mb_substr($snippetText, 0, 300) and used in search result previews.
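The content extraction above might look like this in practice. The helper name extractText and the returned array shape are assumptions; the regex and the 300-character limit are taken from the description.

```php
<?php
// Hypothetical helper: returns the cleaned text and a 300-character snippet.
function extractText(string $html): array
{
    $pattern = '/<main[^>]*id=["\']wc-main["\'][^>]*>(.*?)<\/main>/si';
    // Fall back to the whole file when the <main> block is absent.
    $body = preg_match($pattern, $html, $m) ? $m[1] : $html;

    // Strip tags, then collapse every whitespace run into a single space.
    $snippetText = trim((string) preg_replace('/\s+/', ' ', strip_tags($body)));

    return [
        'text'    => $snippetText,
        'snippet' => mb_substr($snippetText, 0, 300),
    ];
}

$out = extractText(
    "<nav>Menu</nav><main id=\"wc-main\"><h1>Spalting</h1>\n<p>Since the  question</p></main>"
);
```

One thing to keep in mind: strip_tags() removes tags without inserting whitespace, so text in adjacent elements with nothing between them is joined into one word.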
4. Inverted Index Generation
- Full text for indexing:
  $text = strtolower($title . ' ' . $snippetText);
  - Lowercasing at this point makes the search case-insensitive.
- Clean-up for search indexing:
  $text = preg_replace('/[^\p{L}\p{N}\s\-]+/u',' ', (string)$text);
  - Replaces every run of characters other than letters, digits, whitespace, and hyphens with a space, so punctuation is dropped while hyphenated tokens (e.g., 1-qt) stay intact.
- Split text into words using preg_split('/\s+/u', (string)$text, -1, PREG_SPLIT_NO_EMPTY).
- Words are filtered:
  - Minimum length: 2 characters.
  - Stop words excluded (common words like the, and, for).
- For each valid word, add the filename to the inverted index $index[word][].
- After all files are processed, duplicates in $index[word] are removed using array_unique().
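To make the indexing steps concrete, here is a small sketch under stated assumptions: the tokenize helper, the sample documents, and the three-word $stop list are invented for illustration (the real stop-word list is not shown), and word length is measured with mb_strlen(), which may differ from the original script.

```php
<?php
// Hypothetical stop-word list; the real $stop array is not shown in this document.
$stop = ['the', 'and', 'for'];

// Hypothetical helper: turns a title plus body text into a list of index terms.
function tokenize(string $title, string $snippetText, array $stop): array
{
    $text = strtolower($title . ' ' . $snippetText);
    $text = preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', (string) $text);
    $words = preg_split('/\s+/u', (string) $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Keep words of at least 2 characters that are not stop words.
    return array_values(array_filter(
        $words,
        fn (string $w): bool => mb_strlen($w) >= 2 && !in_array($w, $stop, true)
    ));
}

// Sample input: filename => [title, cleaned body text].
$docs = [
    'russ04.php'        => ['A Recipe for Creating Spalted Wood', 'Oak leaves and a 1-qt jar.'],
    'other_article.php' => ['Oak Bowls', 'Turning oak for the first time.'],
];

$index = [];
foreach ($docs as $file => [$title, $body]) {
    foreach (tokenize($title, $body, $stop) as $word) {
        $index[$word][] = $file;
    }
}

// array_unique() preserves keys, so re-index to keep each list a JSON array.
foreach ($index as $word => $files) {
    $index[$word] = array_values(array_unique($files));
}
```

The array_values() call is worth checking against the real script: without it, a list with gaps in its keys is encoded by json_encode() as a JSON object rather than an array.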
5. JSON Cache Files
archive_meta.json – keyed by filename:
{
  "russ04.php": {
    "file": "russ04.php",
    "title": "A RECIPE FOR CREATING SPALTED WOOD",
    "snippet": "Since the question of spalting comes up time and again, I will share a description of ...",
    "keywords": ["russ fairfield"]
  }
}
archive_keywords.json – frequency map:
{
  "russ fairfield": 1,
  "woodturning": 12,
  ...
}
archive_index.json – inverted index:
{
  "oak": ["russ04.php", "other_article.php"],
  "leaves": ["russ04.php", "article_521.php"],
  "spalted": ["russ04.php", "spalt_article.php"],
  ...
}
- Enables fast lookup of files containing any given word.
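Writing and reloading one of these caches could be sketched as below. The helper name writeCache, the chosen json_encode() flags, and the temporary output directory are assumptions; the document does not say where or how the real script writes its files.

```php
<?php
// Hypothetical helper: writes one cache file, failing loudly on encode or write errors.
function writeCache(string $path, array $data): void
{
    $json = json_encode(
        $data,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
    // LOCK_EX reduces the chance of a reader seeing a half-written file.
    if (file_put_contents($path, $json, LOCK_EX) === false) {
        throw new RuntimeException("Could not write $path");
    }
}

// Assumption: the output directory is not specified in this document.
$cacheDir = sys_get_temp_dir();
$index = ['oak' => ['russ04.php', 'other_article.php']];
writeCache($cacheDir . '/archive_index.json', $index);

// Reading the cache back as associative arrays, as a search script would.
$loaded = json_decode(
    file_get_contents($cacheDir . '/archive_index.json'),
    true,
    512,
    JSON_THROW_ON_ERROR
);
```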
6. Search Process (Client-Side Reference)
- User input (search parameter) is normalized similarly: lowercase, punctuation removed, split into words.
- The inverted index is consulted for each word.
- Intersection of arrays ensures that multiple-word searches only return files containing all words.
- Keyword filters (keywords[]) are applied post-search using array_intersect() with the article’s meta keywords.
- Final results are sorted alphabetically by title.
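The search flow above can be sketched as a single function over the decoded caches. The function name search, its signature, the "any keyword matches" filter semantics, and the empty result for an empty query are all assumptions made for this illustration.

```php
<?php
// Hypothetical search helper over the decoded archive_index and archive_meta caches.
function search(string $query, array $keywordFilter, array $index, array $meta): array
{
    // Normalize the query the same way the indexer normalizes article text.
    $q = preg_replace('/[^\p{L}\p{N}\s\-]+/u', ' ', strtolower($query));
    $words = preg_split('/\s+/u', (string) $q, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Intersect the posting lists so only files containing every word remain.
    $files = null;
    foreach ($words as $word) {
        $hits = $index[$word] ?? [];
        $files = $files === null ? $hits : array_intersect($files, $hits);
    }
    $files = array_filter($files ?? [], fn ($f) => isset($meta[$f]));

    // Assumption: a file passes if it shares at least one keyword with the filter.
    if ($keywordFilter !== []) {
        $files = array_filter(
            $files,
            fn ($f) => array_intersect($keywordFilter, $meta[$f]['keywords'] ?? []) !== []
        );
    }

    $results = array_map(fn ($f) => $meta[$f], array_values($files));
    usort($results, fn ($a, $b) => strcasecmp($a['title'], $b['title']));
    return $results;
}

$index = [
    'oak'     => ['russ04.php', 'other_article.php'],
    'spalted' => ['russ04.php', 'spalt_article.php'],
];
$meta = [
    'russ04.php' => ['file' => 'russ04.php', 'title' => 'A RECIPE FOR CREATING SPALTED WOOD',
        'snippet' => '', 'keywords' => ['russ fairfield']],
    'other_article.php' => ['file' => 'other_article.php', 'title' => 'Oak Bowls',
        'snippet' => '', 'keywords' => ['woodturning']],
    'spalt_article.php' => ['file' => 'spalt_article.php', 'title' => 'Spalting Basics',
        'snippet' => '', 'keywords' => []],
];
$both = search('Oak, spalted!', [], $index, $meta);
```

In this sketch a query that contains a stop word returns nothing, because stop words are never indexed; dropping stop words from the query before the lookup is one way to avoid that.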
7. Technical Notes
- PHP 8.3 Compatibility: (string)$text ensures preg_split() never receives null. preg_replace() returns null on error, and passing null to a string parameter has been deprecated since PHP 8.1.
- Unicode Support: Regex uses \p{L} and \p{N} with the u modifier for proper Unicode word handling. Note that strtolower() lowercases ASCII letters only (as of PHP 8.2, regardless of locale), so non-ASCII capitals are indexed unchanged unless mb_strtolower() is used instead.
- Performance: All JSON caches are prebuilt and runtime search is array-based, so no database queries are needed.
- Extensibility:
  - Additional stop words can be added to $stop.
  - Snippet length is adjustable via mb_substr().
  - Indexing rules can be modified (e.g., include numbers, punctuation, or phrases).
  - Files without <main> are safely indexed using fallback content.
8. Recommendations for Modifications
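- Exact Phrase Matching: Currently the search is word-based. Post-filtering $results with stripos($content, $searchPhrase) !== false would restrict results to exact phrase matches. This needs the article text at search time; the cached snippet holds only the first 300 characters.
- Multi-Language Support: Extend the stop-word list for other languages and, if needed, the character classes (\p{L} already matches letters in any script).
- Incremental Indexing: Currently, the script rebuilds everything on each run. It could be optimized to re-index only changed files.
- Search Scoring / Ranking: Add frequency-based scoring using $keywords or $index[word] counts. Because $index[word] is de-duplicated, per-file term counts would have to be recorded before array_unique() runs.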
This documentation should allow another programmer to understand, maintain, and extend the indexing/search system, including handling new articles, keywords, or advanced search features.