Interestingly, the past years JSON files have been experiencing increased support from all sorts of software and mobile applications, web services and APIs (e.g. the Twitter API). As a result, not only the number of stored files has grown rapidly, but also the amount of information stored within each file has also increased substantially.
Applying data analytics on such a source can provide valuable business insights. However, it could be also very challenging, especially when the data are comprised of millions of JSON files, which are large, deeply nested and contain a lot of arrays of variable size.
So, how can we analyze a large number of complex and deeply nested JSON files?
Analysis of complex JSON files with Microsoft technologies
Within a Microsoft environment (see categories A and B below, for an on-premise and cloud solution, respectively), the following ways allow for a successful extraction and analysis of information stored in complex and deeply nested JSON files:
A. SQL Server 2016 or later
- Build-in JSON commands: The functions “JSON_VALUE” (extracts a scalar value from a JSON string), “JSON_QUERY” (extracts an object or an array from a JSON string), and “OPENJSON” (creates a relational view of the JSON data), can be combined to extract information from deeply nested files and store them to tables for further analysis. The ability to use JSON path expressions is key for this process. For more information, check also the following link: https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
- R/Python code in T-SQL: The programming languages R and Python offer libraries (e.g. “jsonlite” and “json” respectively) which allow extraction of information from deeply nested files. Such scripts can be executed on the SQL Server in conjunction with the add-ons “2017 Machine Learning Services” (for both R and Python) or “R Services in SQL Server 2016” (only R support). For more information on how to run R/Python using T-SQL, check also the following link: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-2017
B: Microsoft Azure
- Azure Data Lake Analytics: For analysis of JSON data at scale, there is the possibility to do so by use of U-SQL, a query language that combines SQL with C#. The Azure Data Lake Analytics framework is able to query JSON files with the addition of custom assemblies, i.e. precompiled code in the form of DLL files, which needs to be uploaded to the Azure Data Lake Store and referenced in the U-SQL script. These two assemblies are the “Newtonsoft.Json.dll” and the “Microsoft.Analytics.Samples.Formats.dll”, and by referencing them, we can use the function “MultiLevelJsonExtractor” to extract data from multiple JSON paths at a single pass. For more detailed description, check also the following link: https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats
- Azure HDInsight: Another solution which offers the possibility to analyze complex JSON files at scale is by using Apache Hive in Azure HDInsight together with either the custom Serializer/Deserializer “SerDe” or Python user-defined functions (UDFs). For more information, check also the following link: https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/using-json-in-hive