Efficient JSON with Oracle Database

While working with a customer, a usecase came up that was hard to implement with a relational database (due to NDA, I cannot share the usecase itself). It was necessary to store some information for every row in the Oracle database, but although the structure of the data was similar, it contained a lot of special cases and different hierarchies. It would have been possible, with a clever table structure und a bit of denormalization. In fact, we did a PoC with a relational model, but it soon turned out that reads were slow and the queries to filter by this data grew more and more complex. There were good and productive discussions about the best approaches, and at some point, somebody said “This would be a good usecase for a document DB, wouldn’t it?”. Unfortunately, installing infrastructure for this was out of the picture, as the customer did not have experience with this technology and we were not keen on maintaining even more infrastructure, but we agreed on creating another PoC with the whole hierarchy being stored as a JSON in Oracle DB. The PoC was successful and the solution is productive and performing well.

All the informtion in this post is compiled from the comprehensive JSON Developer’s Guide for Oracle 19c. As Oracle added a lot of funtionality to its JSON features in the latest versions, the approaches described here may not be working on other Oracle Database versions.

I decided to write this up as a lessons-learned kind of post, because we ran into some issues which are documented, but might not be easy to find. You do not need deep Oracle knowledge to understand the article, knowing basic SQL is sufficient. Here is what I learned:

Creating a JSON column in Oracle

Which data type to choose?

JSON objects are strings, and are treated like this in an Oracle database. If we want to include a column with JSON data into a table, there are several data types we can choose from:

VARCHAR2(4000) / VARCHAR2(32767)
CLOB
BLOB

As VARCHAR2 columns have limits, and JSON data we’d like to store is very dynamic, I’d rule out VARCHAR2, leaving CLOB (Character Large OBject) and BLOB (Binary Large OBject). Oracle recommends CLOB with the following explanation:

This is particularly relevant if the database character set is the Oracle-recommended value of AL32UTF8. In AL32UTF8 databases CLOB instances are stored using the UCS2 character set, which means that each character requires two bytes. This doubles the storage needed for a document if most of its content consists of characters that are represented using a single byte in character set AL32UTF8.

Even in cases where the database character set is not AL32UTF8, choosing BLOB over CLOB storage has the advantage that it avoids the need for character-set conversion when storing the JSON document

Additionally, Oracle recommends storing the column as SECUREFILE, which, in contrast to the default BASICFILE, enables you to activate encryption and compression.

To summarize: If you chose CLOB for your JSON column type, you’d (probably) double the storage needed, you’d add an unnecessary character-set conversion to your operations, while having zero benefits (except that CLOB fields are easier to handle from SQLPlus).

Making sure that the column only contains JSON

As we will access the documents in our table with functions that expect the JSON content to be valid, it is a good idea to add a constraint that checks this content on every insert. This also makes sure that our data is consistent, and we don’t save invalid JSON by accident. Oracle created the IS JSON constraint, which does exactly that. If you choose not to add this index, Oracle will not allow you to use the simple dot-notation to access the documents content (we will have a look at this later).

The constraint also allows you to decide if your constraint should enforce strict or lax JSON rules, whith lax being the default. Listing up the differences in detail would result in me just copying from the reference to this post, but the important differencee to most users is that the lax syntax allows JSON keywords to be single quoted, while the strict syntax does not.

DDL for a table with a JSON column

With all the above mentioned recommendations from the documentation we can create a simple table containing a JSON column. I have found a resource with Noble Prize winners as a JSON document online, so we will store these to have some data as an example:

CREATE SEQUENCE seq_nobel_prize
  START WITH 1 
  INCREMENT BY 1;

CREATE TABLE nobel_prize (
	id       NUMBER(19) DEFAULT seq_nobel_prize.nextval NOT NULL,
	document BLOB                                       NOT NULL,
	
	CONSTRAINT pk_nobel_prize   PRIMARY KEY (id),
	CONSTRAINT document_is_json CHECK       (document IS JSON (STRICT))
) LOB (document) STORE AS SECUREFILE;

As explained above, we chose to store the JSON in a BLOB field as SECUREFILE to optimize storage and access. We also apply an IS JSON check to the column, which checks for strict syntax. We also added a primary key column to the table which is filled by seq_nobel_prizes.

Inserting and test data

This step is not specific to storing JSON data, it works just like inserting text into a database. As the data from the web resource is one “large” JSON document, I split it up with some Javascript and transformed it into file with SQL insert statements, which you can find here. An example for a single row looks like this:

INSERT INTO nobel_prize (document) values ('{"year":"2020","category":"chemistry","laureates":[{"id":"991","firstname":"Emmanuelle","surname":"Charpentier","motivation":"\"for the development of a method for genome editing\"","share":"2"},{"id":"992","firstname":"Jennifer A.","surname":"Doudna","motivation":"\"for the development of a method for genome editing\"","share":"2"}]}');

The document for every entry will look like this:

{"year":"2020",
 "category":"chemistry",
 "laureates":[
 	{"id":"991",
 	 "firstname":
 	 "Emmanuelle",
 	 "surname":"Charpentier",
 	 "motivation":"\"for the development of a method for genome editing\"",
 	 "share":"2"},
 	{"id":"992",
 	 "firstname":"Jennifer A.",
 	 "surname":"Doudna",
 	 "motivation":"\"for the development of a method for genome editing\"",
 	 "share":"2"}]}

After executing the whole script, the table should contain 652 rows.

Exploring data

In the following paragraphs we will concentrate on the dot-notation and the json_table. There are many more different ways and functions to access JSON, but I found these two to be the most common.

The dot-notation

Reading data

As described above, Oracle allows us to use the dot-notation to access JSON values, if we applied an IS JSON check on a column. The notation works as if we accessed the fields with JavaScript directly. If we wanted the ID, the year and the category from the table, we’d build a query like this:

SELECT id, 
       np.document."year", //year is a reserved keyword, so we quote it
       np.document.category
FROM nobel_prize np
FETCH FIRST 5 ROWS ONLY;

The result:

id	year	category
1588	2003	medicine
1589	2002	chemistry
1590	2002	economics
1591	2002	literature
1592	2002	peace

This looks pretty straightforward for single values, like year and category. How does it work with array values, e.g. if we’d like to get the surnames of all laureates? We can do this with a query that uses JSON path expressions:

SELECT id, 
       np.document."year",
       np.document.category,
       np.document.laureates[*].surname
FROM nobel_prize np
FETCH FIRST 5 ROWS ONLY;

The result:

id	year	category	laureates
1588	2003	medicine	[“Lauterbur”,”Mansfield”]
1589	2002	chemistry	[“Fenn”,”Tanaka”,”Wüthrich”]
1590	2002	economics	[“Kahneman”,”Smith”]
1591	2002	literature	Kertész
1592	2002	peace	Carter

We receive an array as a response if there are multiple values, and a single value if there is only one. We can work with that, but it is not really consistent. Also, as a developer, we are not exactly happy about having multiple values in one column. We will see how to deal with this later.

Filtering data

The dot notation can also be used in a WHERE clause. If we wanted to know, who won the Nobel Prize for physics in 1950, this would be our query:

SELECT np.document.laureates[*].firstname||' '||np.document.laureates[*].surname
FROM nobel_prize np
WHERE np.document."year" = 1950
  AND np.document.category = 'physics';

This returns Cecil Powell as a single result. We need to keep in mind, that the database has to parse and filter all documents in the table, if we filter like this. This is suprisingly fast! If we have ~ 3 million entries in a production-grade database with adequate performance, these queries can execute in one or two seconds, while scanning all the data, without any type of index being applied!

The JSON table

Reading data

In the example using the dot-notation we saw, that nested JSON values were represented as an array result. The json_table function helps us if we wanted to join these values with our original table as a kind of internal view, in way that feels more natural to the way it works traditionally with relational data models. These columns can be used anywhere “normal” columns can be used, e.g. in a WHERE clause:

SELECT np.document."year",
	    np.document.category,
       jt.firstname,
       jt.surname
FROM nobel_prize np,
     json_table(np.document, '$.laureates[*]' columns (  
                             firstname varchar2(64) PATH '$.firstname',
                             surname   varchar2(64) PATH '$.surname')) jt
WHERE surname LIKE 'T%';

The result:

year	CATEGORY	FIRSTNAME	SURNAME
2002	chemistry	Koichi	Tanaka
2018	literature	Olga	Tokarczuk
2017	economics	Richard H.	Thaler
2017	physics	Kip S.	Thorne
2016	physics	David J.	Thouless
2015	medicine	Youyou	Tu
2014	economics	Jean	Tirole
2011	literature	Tomas	Tranströmer
2008	chemistry	Roger Y.	Tsien
1913	literature	Rabindranath	Tagore
1906	physics	J.J.	Thomson
1998	peace	David	Trimble
1998	physics	Daniel C.	Tsui
1993	physics	Joseph H.	Taylor Jr.
1990	physics	Richard E.	Taylor
1990	medicine	E. Donnall	Thomas
1989	peace	Lhamo	Thondup
1987	medicine	Susumu	Tonegawa
1984	peace	Desmond	Tutu
1983	chemistry	Henry	Taube
1981	economics	James	Tobin
1951	medicine	Max	Theiler
1948	chemistry	Arne	Tiselius
1937	physics	George Paget	Thomson
1958	medicine	Edward	Tatum
1957	chemistry	Lord	Todd
1955	medicine	Hugo	Theorell
1976	physics	Samuel C.C.	Ting
1975	medicine	Howard M.	Temin
1973	medicine	Nikolaas	Tinbergen
1969	economics	Jan	Tinbergen
1965	physics	Sin-Itiro	Tomonaga
1964	physics	Charles H.	Townes
1958	physics	Igor Y.	Tamm

With the JSON path $.laureates[*] we specified the laureate array as the root of our JSON table. For every column we want to be present, we have to specify a name, a data type and the path, relative to the elements in the array. In this example we mixed the dot-notation with json_table, because year and category are on a higher level than our root path in the JSON document.

Again, as for the dot-notation, querying and displaying columns from the json_table requires the database to parse and filter all documents in the whole table.

Error handling

In our data set, not all laureates have both a firstname and a lastname. This is because sometimes organizations are awarded with Nobel Prizes:

SELECT jt.*
FROM nobel_prize np,
     json_table(np.document, '$.laureates[*]' columns (  
                             firstname varchar2(64) PATH '$.firstname',
                             surname   varchar2(64) PATH '$.surname')) jt
WHERE np.document."year" = 2020
  AND np.document.category = 'peace';

The result:

firstname	surname
World Food Programme	null

This seems logical - the value for surname does not exist, so the surname itself is null. This is because the default error clause for the json_table is NULL ON ERROR. A missing key in the document is, in fact, an error, and the query interprets this as null. If we typed out the defaults, it would look like this:

SELECT jt.*
FROM nobel_prize np,
     json_table(np.document, '$.laureates[*]' columns (  
                             firstname varchar2(64) PATH '$.firstname' NULL ON ERROR,
                             surname   varchar2(64) PATH '$.surname'   NULL ON ERROR)) jt
WHERE np.document."year" = 2020
  AND np.document.category = 'peace';

There are many error clauses which apply with different JSON functions within Oracle. The clause ERROR ON ERROR tells the engine to raise the error, instead of returning null:

SELECT jt.*
FROM nobel_prize np,
     json_table(np.document, '$.laureates[*]' columns (  
                             firstname varchar2(64) PATH '$.firstname' ERROR ON ERROR,
                             surname   varchar2(64) PATH '$.surname'   ERROR ON ERROR)) jt
WHERE np.document."year" = 2020
  AND np.document.category = 'peace';

The result:

Error code: ORA-40462
Description: JSON_VALUE evaluated to no value
Cause: The provided JavaScript Object Notation (JSON) path expression did not select a value.
Action: Correct the JSON path expression.

Keep in mind, that this will not only bubble the error at row level! If we filtered for rows that all have a firstname and a lastname, the error will occur as long as any row in the whole dataset does not have a lastname!

Depending on the JSON functions, there are also options to return empty arrays, empty objects, true, false or a default literal on error.

Indexing JSON data

For both the dot-notation and json_table, we have seen that the database needs to read, parse and filter on all the properties of a JSON document in our column. And although the database does an incredible job of doing this fast and efficiently, we don’t want our queries on large tables with millions of rows to take two seconds and longer, do we? Luckily, there are several ways of indexing JSON data. We will look at two of them, which both work with the json_value function and dot-notation queries and some of them with json_table.

JSON_VALUE function based index

We can use the function json_value to create a function based index. To understand how this works (and why I did not introduce it in the “Exploring data” paragraph), it is helpful to run this function on our table, first:

SELECT json_value(document, '$.year') "year",
       json_value(document, '$.category') category,
       json_value(document, '$.laureates[*].firstname') lastname
FROM NOBEL_PRIZE;

The result:

year	CATEGORY	FIRSTNAME
2003	medicine	null
2002	chemistry	null
2002	economics	null
2002	literature	Imre
2002	peace	Jimmy

This exposes an unexpected behavior for the json_value function: If there are multpile values for one path expression, it simply returns null. This is because, again, NULL ON ERROR is the default behavior for error handling. And indeed, if we append ERROR ON ERROR to the lastname column, we get an error message:

Error code: ORA-40456
Description: JSON_VALUE evaluated to non-scalar value
Cause: The provided JavaScript Object Notation (JSON) path expression selected a non-scalar value.
Action: Correct the JSON path expression or use JSON_QUERY.

This behavior is documented. This means, if we created an index using json_value on the properties of our laureates, the index would contain null values for all prizes that were shared between multiple laureates, which is obviously not helpful. The Oracle documentation states, that a JSON_VALUE function based index is the only one which works with json_table, and only if the JSON_VALUE function uses the ERROR ON ERROR clause. This guarantees, that all the values from the document are present in the table, but also implies that the indexed value needs to be present in all documents. As schemaless storage is one of the benefits of document stores in databases, this seems rather restricted to me. You are able to create indexes with the NULL ON ERROR clause and use them with the JSON_VALUE clause in queries as well, but this index won’t be used by a json_table at any time:

-- Cannot be used by json_table queries
CREATE INDEX idx_prize_year 
  ON nobel_prize (json_value(document, 
                             '$.laureates[*].firstname'
                             NULL ON ERROR));

-- Can be used by json_table queries, but will throw errors because the path does
-- not point to a scalar result
CREATE INDEX idx_prize_year 
  ON nobel_prize (json_value(document, 
                             '$.laureates[*].firstname'
                             ERROR ON ERROR));

-- Can be used by json_table queries, because year is a scalar result
CREATE INDEX idx_prize_year 
  ON nobel_prize (json_value(document, 
                             '$.year'
                             ERROR ON ERROR));

If you are running this example on your own database, please be aware that your database might not use the index, because the optimizer decided it would be faster to just access all rows.

Dot-notation syntax index

Using the dot-notation to create indexes is handled in the same chapter of the Oracle documentation as the json_value based indexes, so I assume that they are working in a similar way “under the hood”. One upside is, that indexes with the dot-notation can access and index non-scalar values:

CREATE INDEX idx_prize_firstname 
  ON nobel_prize np (np.document.laureates[*].firstname);

When querying for all the prizes with laureates whose first names start with ‘T’, we can verify this with the explain plan:

SELECT np.*
FROM nobel_prize np
WHERE np.document.laureates[*].firstname LIKE 'T%';

OPERATION	OPTIONS	OBJECT_NAME	OBJECT_TYPE	OPTIMIZER	COST	CARDINALITY
SELECT STATEMENT				ALL_ROWS	8	33
TABLE ACCESS	BY INDEX ROWID BATCHED	NOBEL_PRIZE	TABLE	ANALYZED	8	33
INDEX	RANGE SCAN	IDX_PRIZE_FIRSTNAME	INDEX	ANALYZED	2	6

Without the index, the explain plan shows that the query will do a full table scan (which is not that expensive on a table with 600 entries):

OPERATION	OPTIONS	OBJECT_NAME	OBJECT_TYPE	OPTIMIZER	COST	CARDINALITY
SELECT STATEMENT				ALL_ROWS	26	33
TABLE ACCESS	FULL	NOBEL_PRIZE	TABLE	ANALYZED	26	33

If we want to list the same laureates with firstname and lastname, we can use the json_table to join them as a new table and the dot-notation to query the table in a way that can use the created index:

SELECT jt.*
FROM nobel_prize np, 
     json_table(np.document, '$.laureates[*]' columns (  
                             firstname varchar2(64) PATH '$.firstname' NULL ON ERROR,
                             surname   varchar2(64) PATH '$.surname'   NULL ON ERROR)) jt
WHERE np.document.laureates[*].firstname LIKE 'T%';

The result:

FIRSTNAME	SURNAME
T.S.	Eliot
The	Svedberg
Theodor	Kocher
Theodor	Mommsen
Theodore	Roosevelt
Theodore W.	Richards
Thomas	Mann
Thomas H.	Morgan
Tomas	Tranströmer
Toni	Morrison
Trygve	Haavelmo

The explain plan with idx_prize_firstname present:

OPERATION	OPTIONS	OBJECT_NAME	OBJECT_TYPE	OPTIMIZER	COST	CARDINALITY
SELECT STATEMENT				ALL_ROWS	906	266277
NESTED LOOPS					906	266277
TABLE ACCESS	BY INDEX ROWID BATCHED	NOBEL_PRIZE	TABLE	ANALYZED	8	33
INDEX	RANGE SCAN	IDX_PRIZE_FIRSTNAME	INDEX	ANALYZED	2	6
JSONTABLE EVALUATION

The explain plan without idx_prize_firstname present and full table:

OPERATION	OPTIONS	OBJECT_NAME	OBJECT_TYPE	OPTIMIZER	COST	CARDINALITY
SELECT STATEMENT				ALL_ROWS	924	266277
NESTED LOOPS					924	266277
TABLE ACCESS	FULL	NOBEL_PRIZE	TABLE	ANALYZED	26	33
JSONTABLE EVALUATION

Summary

The JSON features in Oracle database have become more powerful with every subsequent release. Unfortunately, with every release containing new functionality, there are a lot of different ways to create, index and query JSON, with different strengths and weaknesses, and not all of them can or should be combined. The dot-notation currently seems to be the most complete and useful to me, because it has low syntax-overhead and works well with non-scalar values. For accessing array values in a “relational” way, json_table is a helpful tool, although its columns should not be used to filter data, as it does not work well with index access. Accessing properly indexed JSON documents in a column can be very fast. If you are already running an Oracle database, you might not need to set up an additional document-based database, if your requirements are not too complex.

← Previous Post Next Post →