Bashing JSON into Shape with SQLite
Published on , 1561 words, 6 minutes to read
It is clear that most of the world has decided that they want to use JSON for their public-facing API endpoints. However, most of the time you will need to deal with storage engines that don't deal with JSON very well. This can be confusing to deal with because you need to fit a square peg into a round hole.
However, SQLite added JSON functions to allow you to munge and modify JSON data in whatever creative ways you want. You can use these and SQLite triggers in order to automatically massage JSON into whatever kind of tables you want. Throw in upserts and you'll be able to make things even more automated. This support was added in SQLite 3.9.0 (released in 2015), so assuming Debian didn't disable it for no good reason, you should be able to use it today.
For this example, we're going to be querying publicly available JSONFeed endpoints and turning that into SQL tables. Let's start with a table schema that looks like this:
CREATE TABLE IF NOT EXISTS jsonfeed_raw
( feed_url TEXT PRIMARY KEY
, scrape_date TEXT NOT NULL DEFAULT (DATE('now'))
, raw TEXT NOT NULL
);
The scrape date is essentially the date that the JSONFeed row was inserted into the database. This can be useful when writing other parts of the stack to automatically query feeds for changes. This is left as an exercise to the reader.
You can then insert things into the SQLite database using Python's sqlite3
module:
#!/usr/bin/env nix-shell
#! nix-shell -p python39 --run python
import sqlite3
import urllib.request
con = sqlite3.connect("data.db")
def get_feed(feed_url):
req = urllib.request.Request(feed_url, headers={"User-Agent": "Xe/feedfetch"})
with urllib.request.urlopen(req) as response:
cur = con.cursor()
body = response.read()
cur.execute("""
INSERT INTO jsonfeed_raw
(feed_url, raw)
VALUES
(?, json(?))
""", (feed_url, body))
con.commit()
print("got feed %s" % (feed_url))
get_feed("https://xeiaso.net/blog.json")
So now let's play with the data! Let's load the database schema in with the
sqlite3
command:
$ sqlite3 data.db < schema.sql
Then run that python script to populate the database:
$ python ./jsonfeedfetch.py
got feed https://xeiaso.net/blog.json
Then open up the SQLite command line:
$ sqlite3 data.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>
And now we can play with a few of the JSON functions. First let's show off
json_extract
.
This lets you pull a value out of a JSON object. For example, let's get the feed
title out of my website's JSONFeed:
sqlite> select json_extract(raw, '$.title') from jsonfeed_raw;
Xe's Blog
We can use this function to help us create a table that stores the metadata we care about from a JSONFeed, such as this:
CREATE TABLE IF NOT EXISTS jsonfeed_metadata
( feed_url TEXT PRIMARY KEY
, title TEXT NOT NULL
, description TEXT
, home_page_url TEXT
, updated_at TEXT NOT NULL DEFAULT (DATE('now'))
);
If you ask my coworkers, they can confirm that I actually do real life unironcally write SQL like that.
Then we can populate that table with a query like this:
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
, updated_at
)
SELECT jsonfeed_raw.feed_url AS feed_url
, json_extract(jsonfeed_raw.raw, '$.title') AS title
, json_extract(jsonfeed_raw.raw, '$.description') AS description
, json_extract(jsonfeed_raw.raw, '$.home_page_url') AS home_page_url
, DATE('now') AS updated_at
FROM jsonfeed_raw;
The AS
keyword lets you bind values in a SELECT
statement to names for use
elsewhere in the query. I don't know if it's strictly needed, however it makes
the names line up and SQLite doesn't complain about it, so it's probably
fine.
Now this is workable, however you know what's easier than writing statements in the SQLite console like that? Not having to! SQLite triggers allow us to run database statements automatically when certain conditions happen. The main condition we want to care about right now is when we insert new data. We can turn that statement into an after-insert trigger like this:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_ins
AFTER INSERT ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
)
VALUES ( NEW.feed_url
, json_extract(NEW.raw, '$.title')
, json_extract(NEW.raw, '$.description')
, json_extract(NEW.raw, '$.home_page_url')
);
END;
Then we can run a few commands to nuke all the database state:
sqlite3> DELETE FROM jsonfeed_metadata;
sqlite3> DELETE FROM jsonfeed_raw;
And run that python script again, then the data should automatically show up:
sqlite3> SELECT * FROM jsonfeed_metadata;
https://xeiaso.net/blog.json|Xe's Blog|My blog posts and rants about various technology things.|https://xeiaso.net|2022-01-04
It's like magic!
However, if you run that python script again without deleting the rows, you will get a primary key violation. We can fix this by turning the insert into an upsert with something like this:
cur.execute("""
INSERT INTO jsonfeed_raw
(feed_url, raw)
VALUES
(?, json(?))
ON CONFLICT DO
UPDATE SET raw = json(?)
""", (feed_url, body, body))
And also make a complementary update trigger for the jsonfeed_raw
table:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd
AFTER UPDATE ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_metadata
( feed_url
, title
, description
, home_page_url
)
VALUES ( NEW.feed_url
, json_extract(NEW.raw, '$.title')
, json_extract(NEW.raw, '$.description')
, json_extract(NEW.raw, '$.home_page_url')
)
ON CONFLICT DO
UPDATE SET
title = json_extract(NEW.raw, '$.title')
, description = json_extract(NEW.raw, '$.description')
, home_page_url = json_extract(NEW.raw, '$.home_page_url')
;
You should probably update the original trigger to be an upsert too. You can
follow this trigger as a guide. Be sure to DROP TRIGGER jsonfeed_raw_upd;
first though!
We can also scrape the feed items out too with json_each
. json_each
lets you
iterate a JSON array and returns SQLite rows for every value in that array.
Let's take this for example:
sqlite> select * from json_each('["foo", "bar"]');
0|foo|text|foo|1||$[0]|$
1|bar|text|bar|2||$[1]|$
The schema for the temporary table that json_each
(and the related
json_tree
) uses can be found here.
You can also grab things out of a list in an object with the second argument to
json_each
, so you can do things like this:
sqlite> select * from json_each('{"spam": ["foo", "bar"]}', '$.spam');
0|foo|text|foo|3||$.spam[0]|$.spam
1|bar|text|bar|4||$.spam[1]|$.spam
Using this, we can make a table for each of the feed items that looks something like this:
CREATE TABLE IF NOT EXISTS jsonfeed_posts
( url TEXT PRIMARY KEY
, feed_url TEXT NOT NULL
, title TEXT NOT NULL
, date_published TEXT NOT NULL
);
And then munge everything out of the data in the database with a query like this:
INSERT INTO jsonfeed_posts
( url
, feed_url
, title
, date_published
)
SELECT
json_extract(json_each.value, '$.url') AS url
, jsonfeed_raw.feed_url AS feed_url
, json_extract(json_each.value, '$.title') AS title
, json_extract(json_each.value, '$.date_published') AS date_published
FROM
jsonfeed_raw
, json_each(jsonfeed_raw.raw, '$.items');
This will fetch all of the values of the items
field in every JSONFeed and
then automatically populate them into the jsonfeed_posts
table. However
turning this into a trigger with the naiive approach will not instantly work.
Let's say we have the trigger form that looks like this:
CREATE TRIGGER IF NOT EXISTS jsonfeed_raw_upd_posts
AFTER INSERT ON jsonfeed_raw
BEGIN
INSERT INTO jsonfeed_posts
( url
, feed_url
, title
, date_published
)
SELECT
json_extract(json_each.value, '$.url') AS url
, NEW.feed_url AS feed_url
, json_extract(json_each.value, '$.title') AS title
, json_extract(json_each.value, '$.date_published') AS date_published
FROM json_each(NEW.raw, '$.items')
ON CONFLICT DO
UPDATE SET title = excluded.title
, date_published = excluded.date_published
;
END;
If you paste this into your SQLite console, you'll get this error:
Error: near "DO": syntax error
This is actually due to a parsing ambiguity in
SQLite. In order to fix this you will
need to add WHERE TRUE
between the FROM
and ON CONFLICT
clauses of the
trigger:
-- ...
FROM json_each(NEW.raw, '$.items')
WHERE TRUE
ON CONFLICT DO
-- ...
And thus the day is saved by the wheretrue, the hidden apex predator of the SQLite realm, a fated value that is only non-falsy at night. Weep in terror lest it add you to its table of victims!
The correlating insert trigger change is also an exercise for the reader.
Now you can add JSONFeeds how you want and all of the data will automatically be updated. This can probably be vastly simplified further with the use of generated columns, however this should work admirably for most needs.
SQLite is able to be a NOSQL database. It's good enough for your needs. If you want to play with the code I wrote while writing this article, check it out here. This post was written live on twitch.tv. Please follow or subscribe to be kept up to date on when I go live!
The VOD for this post is here. The corresponding YouTube upload is here. It won't be available immediately after this post goes live, but it will go up in time.
Here is my favorite message from the chat while I was researching this post:
jbpratt: if you were married to sqlite, i'd be reporting domestic abuse. This is awesome
Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.
Tags: sqlite, json