research-article
Open access

dsJSON: A Distributed SQL JSON Processor

Published: 30 May 2023

Abstract

The popularity of JSON as a data interchange format has produced a large number of datasets available for processing. Users would like to analyze this data with SQL queries, but existing distributed systems limit them to two specific formats, JSONLine and GeoJSON. The complexity of JSON schemas makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel while producing records with a unified schema. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter operations down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files that are not supported by existing distributed parsers.
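
To make the idea of projecting nested attributes into a unified schema concrete, here is a minimal Python sketch. It is not dsJSON's implementation: the `projection` dictionary, the `project` function, and the sample records are hypothetical, and the sketch parses each record fully with the standard `json` module before pruning it, whereas the abstract describes applying the selection during parsing. It only illustrates how a tree of wanted attributes can yield rows that all share the same columns.

```python
import json

# Hypothetical projection tree (an assumption for illustration):
# nested dicts name the attributes to keep; None marks a leaf column.
projection = {"id": None, "user": {"name": None}}

def project(value, tree, prefix="", row=None):
    """Copy the attributes named in `tree` into a flat row.

    Nested attributes become dotted column names (e.g. "user.name").
    Attributes missing from a record are filled with None, so every
    row ends up with the same set of columns.
    """
    row = {} if row is None else row
    obj = value if isinstance(value, dict) else {}
    for key, sub in tree.items():
        name = prefix + key
        if sub is None:
            row[name] = obj.get(key)      # leaf: keep the value as-is
        else:
            project(obj.get(key), sub, name + ".", row)
    return row

# Two records with different shapes (made-up sample data).
lines = [
    '{"id": 1, "user": {"name": "a", "age": 30}, "tags": ["x"]}',
    '{"id": 2, "extra": true}',
]
rows = [project(json.loads(line), projection) for line in lines]
```

Both rows carry the columns `id` and `user.name`, even though the second record has no `user` object, which is the property a SQL processor needs from its input.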

Supplemental Material

  • MP4 file: dsJSON is a scalable JSON processor that supports analytical queries on very large, structurally complex data. In this video, we go over the motivation behind this work and describe some of the components in the system.
  • PDF file: Read me
  • ZIP file: Source code

Information & Contributors

Information

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. JSON
  2. JSONPath
  3. SQL
  4. distributed processing
  5. schema inference
  6. spark

Qualifiers

  • Research-article
