Previously…

I recently built my e-bookshelf using a Hugo template. See my previous post, which covers the cataloging process as well as a CSS-based search engine.

As snappy as the search is, it falls short by accepting only whole keywords from a book’s JSON metadata (title, author, etc.). What we usually expect instead is that a book with the keyword, say, David, also shows up for substring queries such as Davi, Dav, or vid. Such substrings are also known as n-grams (of a word).

This has since been improved: the search bar now accepts all n-grams of at least three letters (i.e., n > 2). Give it a try!
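To make the matching rule concrete, here is a minimal sketch of the check the search is meant to emulate: a query counts as a hit if it is a contiguous substring of a keyword, is long enough, and matches regardless of case. The function name `is_ngram_of` is made up for illustration and is not part of jngram.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from jngram): returns true if `query` is an
 * n-gram of `keyword`, i.e. a contiguous substring with at least `min_len`
 * characters, compared without regard to ASCII case. */
bool is_ngram_of(const char *keyword, const char *query, size_t min_len)
{
    size_t klen = strlen(keyword);
    size_t qlen = strlen(query);

    if (qlen < min_len || qlen > klen)
        return false;

    /* Try every start position at which the query could still fit. */
    for (size_t start = 0; start + qlen <= klen; start++) {
        size_t i = 0;
        while (i < qlen &&
               tolower((unsigned char)keyword[start + i]) ==
               tolower((unsigned char)query[i]))
            i++;
        if (i == qlen)
            return true;
    }
    return false;
}
```

With a minimum length of 3, `is_ngram_of("David", "vid", 3)` is true, while `is_ngram_of("David", "Da", 3)` is false because the query is too short.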

My solution is a command-line program written in C. See its git repo here. Yes, I know: extracting JSON strings and generating n-grams is definitely scriptable with command-line tools (jq, cut, sort, sed and/or awk, etc.). However, I want to avoid endless chains of piped commands. The previous pipeline, which generated whole keywords only, was already quite messy.

Besides, writing a dedicated program makes it easier to modularize tasks, allows finer control, and will most likely run much faster, especially as the dataset grows.

jngram

The program is called jngram (n-grams produced from JSON strings). It uses the json-c library to parse and operate on JSON objects and arrays. See its official API here. I also found this tutorial very helpful.

What we want (for the CSS search engine)

Recall that our goal is to produce CSS code for the search engine, given a JSON file (with as many books as you want). Consider the following example library.json with only one book:

[
	{
		"title": "The Four Loves",
		"subtitle": "",
		"publishyear": "1960",
		"publisher": "HarperOne",
		"author": "C. S. Lewis",
		"lccnumber": "BV4639-L45-2017",
		"_comment": "BV4639 .L45 2017",
		"booktags": [
			"cslewis"
		]
	}
]

Here is the CSS code that jngram produces (see this section in the previous post, where the search engine is built upon this CSS):

#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lov' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='love' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='loves' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csl' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csle' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }

Note that the n-grams generated above always start from the first letter of a keyword. jngram is also capable of generating all n-grams (many lines in the middle omitted):

#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='our' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
[...]
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='slewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }
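The difference between the two modes can be sketched as a pair of nested loops: the outer loop runs over n-gram lengths, the inner one over start positions, and the first-letter mode simply pins the start position to 0. This is only an illustration under my own naming (`emit_ngrams` is hypothetical); the ordering, shortest first and then left to right, is chosen to mirror the sample output, not copied from jngram's source.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from jngram): writes the n-grams of `word` into
 * `buf`, one per line, shortest first and left to right within a length.
 * all == false: only n-grams that start at the first letter.
 * all == true:  n-grams from every start position.
 * Returns the number of n-grams written; stops early when `buf` is full. */
size_t emit_ngrams(const char *word, size_t min_len, bool all,
                   char *buf, size_t bufsize)
{
    size_t len = strlen(word);
    size_t used = 0, count = 0;

    if (bufsize == 0)
        return 0;
    buf[0] = '\0';
    if (min_len == 0)
        min_len = 1;

    for (size_t n = min_len; n <= len; n++) {
        size_t last_start = all ? len - n : 0;
        for (size_t start = 0; start <= last_start; start++) {
            if (used + n + 2 > bufsize)  /* n chars + '\n' + '\0' */
                return count;
            memcpy(buf + used, word + start, n);
            used += n;
            buf[used++] = '\n';
            buf[used] = '\0';
            count++;
        }
    }
    return count;
}
```

For the word four and a minimum length of 3, the first-letter mode yields fou and four, while the all mode additionally yields our.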

Compile jngram

Nothing crazy here; just make sure that json-c is installed and that its headers and library can be found by the compiler and linker:

gcc -I/usr/include/json-c/ jngram.c -ljson-c -o jngram

What jngram does

Usage: jngram [flags] filename


Flags:
        -l num  minimal length of search keywords (default: 3);
        -r      print raw keyword ngrams (default: formatted css code);
        -a      print all ngrams (default: ngrams that include the first letter).

Formatted CSS code

./jngram library.json
#booksearchresults li { display: none }
input[value='the' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='fou' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='four' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lov' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='love' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='loves' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='lewis' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csl' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='csle' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslew' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewi' i] ~ #booksearchresults #BV4639-L45-2017,
input[value='cslewis' i] ~ #booksearchresults #BV4639-L45-2017 { display: list-item }
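The output above follows a fixed pattern: one rule that hides every list item, then a comma-separated selector list with one selector per n-gram, closed by the declaration that shows the matching book. A rough sketch of how such lines could be assembled is given below; `format_selector` and `print_book_css` are hypothetical names, and the code assumes that n-grams and ids never need CSS escaping.

```c
#include <stdio.h>

/* Hypothetical helper (not from jngram): formats the selector for one n-gram
 * and one book id into `buf`. The last selector of a rule (`last` != 0) is
 * followed by the declaration block instead of a comma.
 * Assumes `ngram` and `book_id` contain no quotes or other characters that
 * would need escaping in CSS. Returns snprintf's return value. */
int format_selector(char *buf, size_t bufsize, const char *ngram,
                    const char *book_id, int last)
{
    return snprintf(buf, bufsize,
                    "input[value='%s' i] ~ #booksearchresults #%s%s",
                    ngram, book_id,
                    last ? " { display: list-item }" : ",");
}

/* Prints the hide-everything rule followed by the rule for a single book.
 * Lines longer than the local buffer are truncated by snprintf. */
void print_book_css(FILE *out, const char *book_id,
                    const char *const ngrams[], size_t count)
{
    char line[256];

    fputs("#booksearchresults li { display: none }\n", out);
    for (size_t i = 0; i < count; i++) {
        format_selector(line, sizeof line, ngrams[i], book_id,
                        i + 1 == count);
        fprintf(out, "%s\n", line);
    }
}
```

Calling `print_book_css(stdout, "BV4639-L45-2017", ngrams, count)` with the raw n-grams of a book would print one selector per line in the same shape as the listing above.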

Raw ngrams

./jngram -r library.json
the
fou
four
lov
love
loves
lew
lewi
lewis
csl
csle
cslew
cslewi
cslewis

Raw all ngrams

./jngram -ra library.json
the
fou
our
four
lov
...
(middle part omitted)
...
cslew
slewi
lewis
cslewi
slewis
cslewis

Raw all ngrams (that are at least four letters long)

./jngram -ra -l4 library.json
four
love
oves
loves
lewi
ewis
lewis
csle
slew
lewi
ewis
cslew
slewi
lewis
cslewi
slewis
cslewis