How to Extract Value from Nested JSON with Special Characters in AWS Athena

If you've worked in SQL, you must have encountered nested JSON. Extracting value is usually straightforward with some flavor of json_extract

However, I recently came across a case with special characters in the JSON keys (names) in a table in Athena. Three different kinds!

And ChatGPT couldn't give me a straight answer to extract the values. I finally figured it out - hope to share in case this would be helpful to some of you.

Let's say you have a table with a column named user_profile with the values below. You want to extract the values of the 4 keys in the nested JSON.

{
"first_session_channel":"google",
"converted?":"true",
"have've_opt_in":"sms",
"bought_product/service":"true"
}

However, three of them have special characters: apostrophe ('), question mark (?) , forward slash (/).

So the typical json_extract_scalar(user_profile,'$.first_session_channel')would only work for the first column.

This is what you need to do instead.

Use the code below to replace the special characters with underscore, and then you can use json_extract_scalar as usual.

regexp_replace (user_profile,'['|| chr(39) || chr(63) || chr(47) ||']','') as user_profile_updated

Side note

In AWS Athena, CHR() is a function that returns the character corresponding to the ASCII code given as its argument.

  1. CHR(39): This function returns the character associated with the ASCII code 39, which is an apostrophe (').
  2. CHR(63): This function returns the character associated with the ASCII code 63, which is the question mark (?).
  3. CHR(47): This function returns the character associated with the ASCII code 47, which is the forward slash (/).

In AWS Athena, these CHR() values are often used to manipulate strings, escape special characters, or concatenate characters that might otherwise interfere with query parsing.