While working with a customer, a usecase came up that was hard to implement with a relational database (due to NDA, I cannot share the usecase itself). It was necessary to store some information for every row in the Oracle database, but although the structure of the data was similar, it contained a lot of special cases and different hierarchies. It would have been possible, with a clever table structure und a bit of denormalization. In fact, we did a PoC with a relational model, but it soon turned out that reads were slow and the queries to filter by this data grew more and more complex. There were good and productive discussions about the best approaches, and at some point, somebody said “This would be a good usecase for a document DB, wouldn’t it?”. Unfortunately, installing infrastructure for this was out of the picture, as the customer did not have experience with this technology and we were not keen on maintaining even more infrastructure, but we agreed on creating another PoC with the whole hierarchy being stored as a JSON in Oracle DB. The PoC was successful and the solution is productive and performing well.
All the informtion in this post is compiled from the comprehensive JSON Developer’s Guide for Oracle 19c. As Oracle added a lot of funtionality to its JSON features in the latest versions, the approaches described here may not be working on other Oracle Database versions.
I decided to write this up as a lessons-learned kind of post, because we ran into some issues which are documented, but might not be easy to find. You do not need deep Oracle knowledge to understand the article, knowing basic SQL is sufficient. Here is what I learned:
JSON objects are strings, and are treated like this in an Oracle database. If we want to include a column with JSON data into a table, there are several data types we can choose from:
As VARCHAR2
columns have limits, and JSON data we’d like to store is very dynamic, I’d rule out VARCHAR2
, leaving CLOB
(Character Large OBject) and BLOB
(Binary Large OBject). Oracle recommends CLOB with the following explanation:
This is particularly relevant if the database character set is the Oracle-recommended value of AL32UTF8. In AL32UTF8 databases CLOB instances are stored using the UCS2 character set, which means that each character requires two bytes. This doubles the storage needed for a document if most of its content consists of characters that are represented using a single byte in character set AL32UTF8.
Even in cases where the database character set is not AL32UTF8, choosing BLOB over CLOB storage has the advantage that it avoids the need for character-set conversion when storing the JSON document
Additionally, Oracle recommends storing the column as SECUREFILE
, which, in contrast to the default BASICFILE
, enables you to activate encryption and compression.
To summarize: If you chose CLOB
for your JSON column type, you’d (probably) double the storage needed, you’d add an unnecessary character-set conversion to your operations, while having zero benefits (except that CLOB
fields are easier to handle from SQLPlus).
As we will access the documents in our table with functions that expect the JSON content to be valid, it is a good idea to add a constraint that checks this content on every insert. This also makes sure that our data is consistent, and we don’t save invalid JSON by accident. Oracle created the IS JSON
constraint, which does exactly that. If you choose not to add this index, Oracle will not allow you to use the simple dot-notation to access the documents content (we will have a look at this later).
The constraint also allows you to decide if your constraint should enforce strict or lax JSON rules, whith lax being the default. Listing up the differences in detail would result in me just copying from the reference to this post, but the important differencee to most users is that the lax syntax allows JSON keywords to be single quoted, while the strict syntax does not.
With all the above mentioned recommendations from the documentation we can create a simple table containing a JSON column. I have found a resource with Noble Prize winners as a JSON document online, so we will store these to have some data as an example:
CREATE SEQUENCE seq_nobel_prize
START WITH 1
INCREMENT BY 1;
CREATE TABLE nobel_prize (
id NUMBER(19) DEFAULT seq_nobel_prize.nextval NOT NULL,
document BLOB NOT NULL,
CONSTRAINT pk_nobel_prize PRIMARY KEY (id),
CONSTRAINT document_is_json CHECK (document IS JSON (STRICT))
) LOB (document) STORE AS SECUREFILE;
As explained above, we chose to store the JSON in a BLOB
field as SECUREFILE
to optimize storage and access. We also apply an IS JSON
check to the column, which checks for strict syntax. We also added a primary key column to the table which is filled by seq_nobel_prizes
.
This step is not specific to storing JSON data, it works just like inserting text into a database. As the data from the web resource is one “large” JSON document, I split it up with some Javascript and transformed it into file with SQL insert statements, which you can find here. An example for a single row looks like this:
INSERT INTO nobel_prize (document) values ('{"year":"2020","category":"chemistry","laureates":[{"id":"991","firstname":"Emmanuelle","surname":"Charpentier","motivation":"\"for the development of a method for genome editing\"","share":"2"},{"id":"992","firstname":"Jennifer A.","surname":"Doudna","motivation":"\"for the development of a method for genome editing\"","share":"2"}]}');
The document for every entry will look like this:
{"year":"2020",
"category":"chemistry",
"laureates":[
{"id":"991",
"firstname":
"Emmanuelle",
"surname":"Charpentier",
"motivation":"\"for the development of a method for genome editing\"",
"share":"2"},
{"id":"992",
"firstname":"Jennifer A.",
"surname":"Doudna",
"motivation":"\"for the development of a method for genome editing\"",
"share":"2"}]}
After executing the whole script, the table should contain 652 rows.
In the following paragraphs we will concentrate on the dot-notation and the json_table. There are many more different ways and functions to access JSON, but I found these two to be the most common.
As described above, Oracle allows us to use the dot-notation to access JSON values, if we applied an IS JSON
check on a column. The notation works as if we accessed the fields with JavaScript directly. If we wanted the ID, the year and the category from the table, we’d build a query like this:
SELECT id,
np.document."year", //year is a reserved keyword, so we quote it
np.document.category
FROM nobel_prize np
FETCH FIRST 5 ROWS ONLY;
The result:
id | year | category |
---|---|---|
1588 | 2003 | medicine |
1589 | 2002 | chemistry |
1590 | 2002 | economics |
1591 | 2002 | literature |
1592 | 2002 | peace |
This looks pretty straightforward for single values, like year
and category
. How does it work with array values, e.g. if we’d like to get the surnames of all laureates? We can do this with a query that uses JSON path expressions:
SELECT id,
np.document."year",
np.document.category,
np.document.laureates[*].surname
FROM nobel_prize np
FETCH FIRST 5 ROWS ONLY;
The result:
id | year | category | laureates |
---|---|---|---|
1588 | 2003 | medicine | [“Lauterbur”,”Mansfield”] |
1589 | 2002 | chemistry | [“Fenn”,”Tanaka”,”Wüthrich”] |
1590 | 2002 | economics | [“Kahneman”,”Smith”] |
1591 | 2002 | literature | Kertész |
1592 | 2002 | peace | Carter |
We receive an array as a response if there are multiple values, and a single value if there is only one. We can work with that, but it is not really consistent. Also, as a developer, we are not exactly happy about having multiple values in one column. We will see how to deal with this later.
The dot notation can also be used in a WHERE
clause. If we wanted to know, who won the Nobel Prize for physics in 1950, this would be our query:
SELECT np.document.laureates[*].firstname||' '||np.document.laureates[*].surname
FROM nobel_prize np
WHERE np.document."year" = 1950
AND np.document.category = 'physics';
This returns Cecil Powell
as a single result. We need to keep in mind, that the database has to parse and filter all documents in the table, if we filter like this. This is suprisingly fast! If we have ~ 3 million entries in a production-grade database with adequate performance, these queries can execute in one or two seconds, while scanning all the data, without any type of index being applied!
In the example using the dot-notation we saw, that nested JSON values were represented as an array result. The json_table
function helps us if we wanted to join these values with our original table as a kind of internal view, in way that feels more natural to the way it works traditionally with relational data models. These columns can be used anywhere “normal” columns can be used, e.g. in a WHERE clause:
SELECT np.document."year",
np.document.category,
jt.firstname,
jt.surname
FROM nobel_prize np,
json_table(np.document, '$.laureates[*]' columns (
firstname varchar2(64) PATH '$.firstname',
surname varchar2(64) PATH '$.surname')) jt
WHERE surname LIKE 'T%';
The result:
year | CATEGORY | FIRSTNAME | SURNAME |
---|---|---|---|
2002 | chemistry | Koichi | Tanaka |
2018 | literature | Olga | Tokarczuk |
2017 | economics | Richard H. | Thaler |
2017 | physics | Kip S. | Thorne |
2016 | physics | David J. | Thouless |
2015 | medicine | Youyou | Tu |
2014 | economics | Jean | Tirole |
2011 | literature | Tomas | Tranströmer |
2008 | chemistry | Roger Y. | Tsien |
1913 | literature | Rabindranath | Tagore |
1906 | physics | J.J. | Thomson |
1998 | peace | David | Trimble |
1998 | physics | Daniel C. | Tsui |
1993 | physics | Joseph H. | Taylor Jr. |
1990 | physics | Richard E. | Taylor |
1990 | medicine | E. Donnall | Thomas |
1989 | peace | Lhamo | Thondup |
1987 | medicine | Susumu | Tonegawa |
1984 | peace | Desmond | Tutu |
1983 | chemistry | Henry | Taube |
1981 | economics | James | Tobin |
1951 | medicine | Max | Theiler |
1948 | chemistry | Arne | Tiselius |
1937 | physics | George Paget | Thomson |
1958 | medicine | Edward | Tatum |
1957 | chemistry | Lord | Todd |
1955 | medicine | Hugo | Theorell |
1976 | physics | Samuel C.C. | Ting |
1975 | medicine | Howard M. | Temin |
1973 | medicine | Nikolaas | Tinbergen |
1969 | economics | Jan | Tinbergen |
1965 | physics | Sin-Itiro | Tomonaga |
1964 | physics | Charles H. | Townes |
1958 | physics | Igor Y. | Tamm |
With the JSON path $.laureates[*]
we specified the laureate array as the root of our JSON table. For every column we want to be present, we have to specify a name, a data type and the path, relative to the elements in the array. In this example we mixed the dot-notation with json_table
, because year
and category
are on a higher level than our root path in the JSON document.
Again, as for the dot-notation, querying and displaying columns from the json_table
requires the database to parse and filter all documents in the whole table.
In our data set, not all laureates have both a firstname and a lastname. This is because sometimes organizations are awarded with Nobel Prizes:
SELECT jt.*
FROM nobel_prize np,
json_table(np.document, '$.laureates[*]' columns (
firstname varchar2(64) PATH '$.firstname',
surname varchar2(64) PATH '$.surname')) jt
WHERE np.document."year" = 2020
AND np.document.category = 'peace';
The result:
firstname | surname |
---|---|
World Food Programme | null |
This seems logical - the value for surname does not exist, so the surname itself is null. This is because the default error clause for the json_table
is NULL ON ERROR
. A missing key in the document is, in fact, an error, and the query interprets this as null
. If we typed out the defaults, it would look like this:
SELECT jt.*
FROM nobel_prize np,
json_table(np.document, '$.laureates[*]' columns (
firstname varchar2(64) PATH '$.firstname' NULL ON ERROR,
surname varchar2(64) PATH '$.surname' NULL ON ERROR)) jt
WHERE np.document."year" = 2020
AND np.document.category = 'peace';
There are many error clauses which apply with different JSON functions within Oracle. The clause ERROR ON ERROR
tells the engine to raise the error, instead of returning null:
SELECT jt.*
FROM nobel_prize np,
json_table(np.document, '$.laureates[*]' columns (
firstname varchar2(64) PATH '$.firstname' ERROR ON ERROR,
surname varchar2(64) PATH '$.surname' ERROR ON ERROR)) jt
WHERE np.document."year" = 2020
AND np.document.category = 'peace';
The result:
Error code: ORA-40462
Description: JSON_VALUE evaluated to no value
Cause: The provided JavaScript Object Notation (JSON) path expression did not select a value.
Action: Correct the JSON path expression.
Keep in mind, that this will not only bubble the error at row level! If we filtered for rows that all have a firstname
and a lastname
, the error will occur as long as any row in the whole dataset does not have a lastname
!
Depending on the JSON functions, there are also options to return empty arrays, empty objects, true
, false
or a default literal on error.
For both the dot-notation and json_table
, we have seen that the database needs to read, parse and filter on all the properties of a JSON document in our column. And although the database does an incredible job of doing this fast and efficiently, we don’t want our queries on large tables with millions of rows to take two seconds and longer, do we? Luckily, there are several ways of indexing JSON data. We will look at two of them, which both work with the json_value
function and dot-notation queries and some of them with json_table
.
We can use the function json_value
to create a function based index. To understand how this works (and why I did not introduce it in the “Exploring data” paragraph), it is helpful to run this function on our table, first:
SELECT json_value(document, '$.year') "year",
json_value(document, '$.category') category,
json_value(document, '$.laureates[*].firstname') lastname
FROM NOBEL_PRIZE;
The result:
year | CATEGORY | FIRSTNAME |
---|---|---|
2003 | medicine | null |
2002 | chemistry | null |
2002 | economics | null |
2002 | literature | Imre |
2002 | peace | Jimmy |
This exposes an unexpected behavior for the json_value
function: If there are multpile values for one path expression, it simply returns null. This is because, again, NULL ON ERROR
is the default behavior for error handling. And indeed, if we append ERROR ON ERROR
to the lastname column, we get an error message:
Error code: ORA-40456
Description: JSON_VALUE evaluated to non-scalar value
Cause: The provided JavaScript Object Notation (JSON) path expression selected a non-scalar value.
Action: Correct the JSON path expression or use JSON_QUERY.
This behavior is documented. This means, if we created an index using json_value
on the properties of our laureates
, the index would contain null
values for all prizes that were shared between multiple laureates, which is obviously not helpful. The Oracle documentation states, that a JSON_VALUE
function based index is the only one which works with json_table
, and only if the JSON_VALUE
function uses the ERROR ON ERROR
clause. This guarantees, that all the values from the document are present in the table, but also implies that the indexed value needs to be present in all documents. As schemaless storage is one of the benefits of document stores in databases, this seems rather restricted to me. You are able to create indexes with the NULL ON ERROR
clause and use them with the JSON_VALUE
clause in queries as well, but this index won’t be used by a json_table
at any time:
-- Cannot be used by json_table queries
CREATE INDEX idx_prize_year
ON nobel_prize (json_value(document,
'$.laureates[*].firstname'
NULL ON ERROR));
-- Can be used by json_table queries, but will throw errors because the path does
-- not point to a scalar result
CREATE INDEX idx_prize_year
ON nobel_prize (json_value(document,
'$.laureates[*].firstname'
ERROR ON ERROR));
-- Can be used by json_table queries, because year is a scalar result
CREATE INDEX idx_prize_year
ON nobel_prize (json_value(document,
'$.year'
ERROR ON ERROR));
If you are running this example on your own database, please be aware that your database might not use the index, because the optimizer decided it would be faster to just access all rows.
Using the dot-notation to create indexes is handled in the same chapter of the Oracle documentation as the json_value
based indexes, so I assume that they are working in a similar way “under the hood”. One upside is, that indexes with the dot-notation can access and index non-scalar values:
CREATE INDEX idx_prize_firstname
ON nobel_prize np (np.document.laureates[*].firstname);
When querying for all the prizes with laureates whose first names start with ‘T’, we can verify this with the explain plan:
SELECT np.*
FROM nobel_prize np
WHERE np.document.laureates[*].firstname LIKE 'T%';
OPERATION | OPTIONS | OBJECT_NAME | OBJECT_TYPE | OPTIMIZER | COST | CARDINALITY |
---|---|---|---|---|---|---|
SELECT STATEMENT | ALL_ROWS | 8 | 33 | |||
TABLE ACCESS | BY INDEX ROWID BATCHED | NOBEL_PRIZE | TABLE | ANALYZED | 8 | 33 |
INDEX | RANGE SCAN | IDX_PRIZE_FIRSTNAME | INDEX | ANALYZED | 2 | 6 |
Without the index, the explain plan shows that the query will do a full table scan (which is not that expensive on a table with 600 entries):
OPERATION | OPTIONS | OBJECT_NAME | OBJECT_TYPE | OPTIMIZER | COST | CARDINALITY |
---|---|---|---|---|---|---|
SELECT STATEMENT | ALL_ROWS | 26 | 33 | |||
TABLE ACCESS | FULL | NOBEL_PRIZE | TABLE | ANALYZED | 26 | 33 |
If we want to list the same laureates with firstname and lastname, we can use the json_table
to join them as a new table and the dot-notation to query the table in a way that can use the created index:
SELECT jt.*
FROM nobel_prize np,
json_table(np.document, '$.laureates[*]' columns (
firstname varchar2(64) PATH '$.firstname' NULL ON ERROR,
surname varchar2(64) PATH '$.surname' NULL ON ERROR)) jt
WHERE np.document.laureates[*].firstname LIKE 'T%';
The result:
FIRSTNAME | SURNAME |
---|---|
T.S. | Eliot |
The | Svedberg |
Theodor | Kocher |
Theodor | Mommsen |
Theodore | Roosevelt |
Theodore W. | Richards |
Thomas | Mann |
Thomas H. | Morgan |
Tomas | Tranströmer |
Toni | Morrison |
Trygve | Haavelmo |
The explain plan with idx_prize_firstname
present:
OPERATION | OPTIONS | OBJECT_NAME | OBJECT_TYPE | OPTIMIZER | COST | CARDINALITY |
---|---|---|---|---|---|---|
SELECT STATEMENT | ALL_ROWS | 906 | 266277 | |||
NESTED LOOPS | 906 | 266277 | ||||
TABLE ACCESS | BY INDEX ROWID BATCHED | NOBEL_PRIZE | TABLE | ANALYZED | 8 | 33 |
INDEX | RANGE SCAN | IDX_PRIZE_FIRSTNAME | INDEX | ANALYZED | 2 | 6 |
JSONTABLE EVALUATION |
The explain plan without idx_prize_firstname
present and full table:
OPERATION | OPTIONS | OBJECT_NAME | OBJECT_TYPE | OPTIMIZER | COST | CARDINALITY |
---|---|---|---|---|---|---|
SELECT STATEMENT | ALL_ROWS | 924 | 266277 | |||
NESTED LOOPS | 924 | 266277 | ||||
TABLE ACCESS | FULL | NOBEL_PRIZE | TABLE | ANALYZED | 26 | 33 |
JSONTABLE EVALUATION |
The JSON features in Oracle database have become more powerful with every subsequent release. Unfortunately, with every release containing new functionality, there are a lot of different ways to create, index and query JSON, with different strengths and weaknesses, and not all of them can or should be combined. The dot-notation currently seems to be the most complete and useful to me, because it has low syntax-overhead and works well with non-scalar values. For accessing array values in a “relational” way, json_table
is a helpful tool, although its columns should not be used to filter data, as it does not work well with index access. Accessing properly indexed JSON documents in a column can be very fast. If you are already running an Oracle database, you might not need to set up an additional document-based database, if your requirements are not too complex.