8 minutes
Key, index, and path transforms
It’s time for one of the most exciting important parts: transforms! This
is probably one of the last few posts on this blog (at least for this year),
since GSoC 2019 is almost finished and I’m wrapping up my project soon. In
this post, I’ll explain how I implemented key, index, and path transforms to
allow querying with JSON values. Actually, they aren’t made into different
transforms, but a transform (called KeyTransform
) that can be chained to
form a JSON path. Then, we can do a lookup with the value at that path.
The following is the demonstration from the docs.
To query based on a given dictionary key, simply use that key as the lookup name:
>>> Dog.objects.create(name='Rufus', data={
... 'breed': 'labrador',
... 'owner': {
... 'name': 'Bob',
... 'other_pets': [{
... 'name': 'Fishy',
... }],
... },
... })
>>> Dog.objects.create(name='Meg', data={'breed': 'collie', 'owner': None})
>>> Dog.objects.filter(data__breed='collie')
<QuerySet [<Dog: Meg>]>
Multiple keys can be chained together to form a path transform:
>>> Dog.objects.filter(data__owner__name='Bob')
<QuerySet [<Dog: Rufus>]>
If the key is an integer, it will be interpreted as an index transform in an array:
>>> Dog.objects.filter(data__owner__other_pets__0__name='Fishy')
<QuerySet [<Dog: Rufus>]>
Note that the transform examples given above implicitly use the exact
lookup.
PostgreSQL
On PostgreSQL, if only one key or index is used, the SQL operator ->
is
used. If multiple operators are used then the #>
operator is used. For
example, the last lookup translates into the following SQL query:
SELECT * FROM myapp_dog WHERE (data #> ['owner', 'other_pets', '0', 'name']) = 'Fishy'
If the transform is chained with a lookup expecting a text lhs, the operators
used are ->>
and #>>
, respectively.
MySQL and MariaDB
On MySQL and MariaDB, we can make use of the
JSON_EXTRACT
function. The function accepts at least
two arguments: the JSON document and the path. However, instead of an array,
the path is a JSONPath string. So, the query becomes:
SELECT * FROM myapp_dog WHERE JSON_EXTRACT(data, '$.owner.other_pets[0].name') = '"Fishy"'
Except… things aren’t that easy.
The exact
lookup isn’t always used with scalar right-hand-side value in
Python. We also want to be able to use dictionaries and lists, so we can do
something like:
>>> Dog.objects.filter(data__owner__other_pets__0={'name': 'Fishy'})
<QuerySet [<Dog: Rufus>]>
For some reason, using just a JSON-encoded RHS doesn’t always work. I’ve spent a lot of time trying to get it right on both MySQL and MariaDB with many lookup cases without making the code look too complex and confusing. I can’t remember all of the attempts I’ve tried.
In the end, I managed to do it by also wrapping the RHS with JSON_EXTRACT
,
but with '$'
(root) as the path. It’s kind of like the trick I did on Oracle
in the previous post, except that I don’t have to wrap it inside a JSON object
with the "val"
key because MySQL and MariaDB support storing scalar JSON
values.
So now, the previous SQL query for the
data__owner__other_pets__0__name='Fishy'
lookup becomes:
SELECT * FROM myapp_dog
WHERE JSON_EXTRACT(data, '$.owner.other_pets[0].name') =
JSON_EXTRACT('"Fishy"', '$')
However, on MariaDB, that conditional returns false. After some fiddling, I
found that I have to wrap the RHS with JSON_UNQUOTE
to yield a result. Now,
the query becomes like this (only if the RHS is a string):
SELECT * FROM myapp_dog
WHERE JSON_EXTRACT(data, '$.owner.other_pets[0].name') =
JSON_UNQUOTE(JSON_EXTRACT('"Fishy"', '$'))
Funnily, as you can see on the fiddle, the expression itself actually still
returns false (0
). I have no idea why. ¯\_(ツ)_/¯
If you think that I probably should also wrap the LHS with JSON_UNQUOTE
,
well… I thought so, too. However, doing that will actually mess things up
if the value returned by JSON_EXTRACT
is a JSON object or array, because it
seems that JSON_UNQUOTE
will convert it into a normal string. I’m not
entirely sure if that’s the case, but I ran the tests and some queries failed
to retrieve what they’re supposed to. So, I ended up doing this only for
lookups that expect a text LHS (and the above case with MariaDB). Conveniently,
the existing PostgreSQL implementation also does something similar by having
a mixin called KeyTransformTextLookupMixin
.
That’s just for exact
lookup. Thankfully, other lookups that can be applied
to a KeyTransform
only expect scalar values. By unquoting the extracted JSON
value on MySQL and MariaDB, lookups that expect text LHS like contains
(the
original one), startswith
, and endswith
can be implemented just by
inheriting the mixin and the lookups.
However, there’s another problem. For some reason, LIKE
operator on MySQL
doesn’t really like (heh) JSON strings. Look at
this fiddle for illustration. On MariaDB, the LIKE
operator at least works
correctly for case-insensitive matching when JSON_UNQUOTE
is applied. Try
switching into MySQL with the same fiddle and you can see that it’s not the
case (heh).
In order for case-insensitive matching to work, we have to convert both side to
lowercase (or uppercase). This can easily be done by creating another mixin,
let’s call it CaseInsensitiveMixin
, and wrapping the LHS and RHS with
LOWER()
in its process_lhs
and process_rhs
methods.
For numeric lookups (i.e. lt
, lte
, gt
, gte
), I decided to create
another mixin called KeyTransformNumericLookupMixin
. The original PostgreSQL
implementation didn’t have this mixin and just leave the lookups to the
superclass. I’m not really sure how, but it just works. Even the RHS is still
a JSON-encoded numeric value, which means we pass it as a string to psycopg2
.
For all other backends, we just need to do json.loads
on the RHS to decode it
into numerical values (int
, float
). Oh, and obviously, we don’t wrap the
LHS with JSON_UNQUOTE
in the mixin to avoid converting the LHS to string.
On Oracle database, we need to wrap the LHS with TO_NUMBER
to convert the
LHS to numeric.
One thing I don’t really get is how MariaDB can correctly do numeric
comparisons with values retrieved by JSON_EXTRACT
, but it can’t do ordering
by numeric JSON values retrieved by JSON_EXTRACT
. Instead, it orders the data
by the string representation of the numeric values. Oh, well…
Fun fact: while writing this post, I realized I made a mistake and used
KeyTransformTextLookupMixin
on numeric lookups as well. I hadn’t even created
the KeyTransformNumericLookupMixin
. The tests still passed because the test
data were too simple and comparing the string representation of the numeric
values would yield the same result as if we’re comparing the actual numbers.
For example, 4 < 7
and '4' < '7'
are both true. I’ve fixed the tests and
the implementation, so it should work correctly now.
That’s pretty much it for the transforms on MySQL and MariaDB. Oh, and you
might have read from somewhere that MySQL also has the ->
and ->>
operator to extract JSON values like on PostgreSQL. The ->
operator is
actually a shorthand for JSON_EXTRACT()
and the ->>
operator is a
shorthand for JSON_UNQUOTE(JSON_EXTRACT())
. However, MariaDB hasn’t got
those two operators yet, so I decided not to use it to avoid diverging the code
too much between the two.
Oracle
On Oracle Database, it’s pretty much the same except that instead of using
JSON_EXTRACT
, we use JSON_QUERY
and JSON_VALUE
. As I explained in my
previous post, JSON_QUERY
is used to retrieve JSON objects and arrays, while
JSON_VALUE
is used to retrieve JSON scalar values. However, since a transform
is only used to, well, transform the LHS (correct me if I’m wrong here), we
cannot know whether we should use JSON_QUERY
or JSON_VALUE
. In the end, I
decided to make use of the COALESCE
function to combine the two.
For the exact
lookup, as I also explained in my previous post, it’s much
easier to wrap the RHS with a dummy JSON object and call JSON_QUERY
or
JSON_VALUE
on that object. We also do that here.
So, the data__owner__other_pets__0__name='Fishy'
lookup translates into:
SELECT * FROM myapp_dog
WHERE COALESCE(
JSON_QUERY(data, '$.owner.other_pets[0].name'),
JSON_VALUE(data, '$.owner.other_pets[0].name')
) = JSON_VALUE('{"val": "Fishy"}', '$.val')
I could simplify the LHS so that it only use either JSON_QUERY
or
JSON_VALUE
, but that can only be determined by checking the RHS. I think it
wouldn’t make sense to do that in process_lhs
method.
After that’s done, we just need to fix some things. First, we need to fix
exact
lookup if the RHS is None
, which means that the path exists but the
value is JSON null
. Both JSON_QUERY
and JSON_VALUE
returns SQL NULL
if
the value at the path is JSON null
. Which is also the case if the path
doesn’t exist. So, what do we do?
Thankfully, JSON_EXISTS
exists (heh) on Oracle.
Therefore, we can define the WHERE
clause as something like:
JSON_EXISTS(...) AND COALESCE(JSON_QUERY(...), JSON_VALUE(...)) IS NULL
Which basically reads, “the path exists, but the value is null”, and that’s exactly what we want.
Next, we need to fix isnull
lookup. If the RHS is False
, there’s no
problem as we can just use JSON_EXISTS
. However, if the RHS is True
,
NOT JSON_EXISTS
only returns true if the JSON path doesn’t exist and the
data is not empty. We need to fix the condition so it also returns true when
the data is empty. So, we need to append the condition with
OR JSON_QUERY(column_name, '$') IS NULL
.
Finally, we need to fix numeric lookups. It’s simple as we just need to wrap
the LHS with TO_NUMBER
to cast it into a, well… number.
SQLite
I have to hand it to SQLite: the JSON1 extension works so much better than I
expected. As in MySQL, there’s also a JSON_EXTRACT
on
SQLite. However, it automatically unquotes JSON strings. Also, we don’t
need to wrap the RHS with a JSON function whatsoever. It just works.
So, the data__owner__other_pets__0__name='Fishy'
lookup translates into:
SELECT * FROM myapp_dog WHERE JSON_EXTRACT(data, '$.owner.other_pets[0].name') = 'Fishy'
We just need to fix exact
lookup if the RHS is None
and isnull
lookup so
that it uses the JSON_TYPE
function to determine JSON null
value and
missing keys, which I have explained a few times before. Other than that, we’re
good to go!
And, that’s it! We have (pretty much) completed all the lookups and transforms
for JSONField
. In the next few posts, I’ll probably explain how I found a
vulnerability in contrib.postgres.fields.JSONField
and do some recaps to
wrap everything up. Until then, see you!