Previously…
I recently built my e-bookshelf using a Hugo template. See my previous post, which covers the cataloging process as well as a CSS-based search engine.
As snappy as the search is, it falls short by accepting only whole keywords from a book’s JSON metadata (title, author, etc.). What we are used to instead is that a result with the keyword, say, David should also appear in response to substring queries like Davi, Dav, vid, etc. These substrings are also known as n-grams (of a word).
The search has since been improved: the search bar now accepts all n-grams with at least three letters (i.e., n>2). Give it a try!
My solution is a command line program that I wrote in C. See its git repo here. Yes, I know, extracting JSON strings and generating n-grams is definitely a scriptable task using command line tools (jq, cut, sort, sed and/or awk, etc.). However, I wanted to avoid endless chains of piped commands. The previous one, which generates whole keywords only, was already quite messy-looking.
Besides, writing a program for it makes it easier to modularize tasks, allows finer control and most likely results in much faster runtime, especially when the dataset gets large.
jngram
The program is called jngram (ngrams produced from json strings). The json-c library is used to parse and operate on JSON objects/arrays. See its official API here. I also found this tutorial very helpful.
What we want (for the CSS search engine)
Recall that our goal is to produce CSS code for the search engine, given a JSON file (with as many books as you want).
Consider the following example library.json with only one book:
[
  {
    "title": "The Four Loves",
    "subtitle": "",
    "publishyear": "1960",
    "publisher": "HarperOne",
    "author": "C. S. Lewis",
    "lccnumber": "BV4639-L45-2017",
    "_comment": "BV4639 .L45 2017",
    "booktags": [
      "cslewis"
    ]
  }
]
Here is the CSS code that jngram produces (see this section in the previous post, where a search engine is built upon this CSS):
#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lov' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='love' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='loves' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csl' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csle' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }
Note that ngrams generated above always start from the first letter. jngram is also capable of generating all ngrams (many lines in the middle omitted):
#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='our' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
[...]
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='slewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }
Compile jngram
Nothing crazy here; just make sure that json-c is installed and that its header directory is passed to the compiler:
gcc -I/usr/include/json-c/ jngram.c -ljson-c -o jngram
What jngram does
Usage: jngram [flags] filename
Flags:
-l num minimal length of search keywords (default: 3);
-r print raw keyword ngrams (default: formatted css code);
-a print all ngrams (default: ngrams that include the first letter).
Formatted CSS code
./jngram library.json
#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lov' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='love' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='loves' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csl' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csle' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }
Raw ngrams
./jngram -r library.json
the
fou
four
lov
love
loves
lew
lewi
lewis
csl
csle
cslew
cslewi
cslewis
Raw all ngrams
./jngram -ra library.json
the
fou
our
four
lov
...
(middle part omitted)
...
cslew
slewi
lewis
cslewi
slewis
cslewis
Raw all ngrams (at least four letters long)
./jngram -ra -l4 library.json
four
love
oves
loves
lewi
ewis
lewis
csle
slew
lewi
ewis
cslew
slewi
lewis
cslewi
slewis
cslewis