Open access

dsJSON: A Distributed SQL JSON Processor

Authors:

Zhijia Zhao

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 103, Pages 1 - 25

https://doi.org/10.1145/3588957

Published: 30 May 2023

Abstract

The popularity of JSON as a data interchange format has made a large number of datasets available for processing. Users would like to analyze this data using SQL queries, but existing distributed systems limit them to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schemas makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter operations down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers.
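The abstract does not describe how the projection tree is laid out, so the following is only a minimal sketch of the general idea, assuming the requested attributes arrive as dotted paths (for example, the columns of a SQL SELECT). The names `build_projection_tree` and `project` are hypothetical and are not taken from dsJSON. For simplicity, the sketch prunes a document that Python's `json` module has already parsed in full; a selective parser, as the abstract describes, would presumably skip the unneeded attributes while scanning the input rather than afterwards.

```python
import json


def build_projection_tree(paths):
    """Build a nested-dict tree (a trie) from dotted attribute paths.

    Hypothetical helper: an empty dict marks a leaf, i.e. an attribute
    that should be kept whole.
    """
    tree = {}
    for path in paths:
        node = tree
        for key in path.split("."):
            node = node.setdefault(key, {})
    return tree


def project(value, tree):
    """Keep only the attributes named in the tree.

    Attributes that are missing from the document become None, so every
    output record has the same shape.
    """
    if not tree:
        # Leaf of the projection tree: keep the value as it is.
        return value
    if isinstance(value, list):
        # Apply the same projection to every array element.
        return [project(item, tree) for item in value]
    # Anything that is not an object is treated as having no attributes.
    fields = value if isinstance(value, dict) else {}
    return {key: project(fields.get(key), sub) for key, sub in tree.items()}


# Assumed example input: two documents with different sets of attributes.
TREE = build_projection_tree(["id", "user.name", "user.geo.city"])
DOCS = [
    '{"id": 1, "user": {"name": "a", "age": 30, '
    '"geo": {"city": "X", "zip": "9"}}, "tags": [1, 2]}',
    '{"id": 2, "user": {"name": "b"}}',
]
RECORDS = [project(json.loads(doc), TREE) for doc in DOCS]
for record in RECORDS:
    print(record)
```

Both output records contain exactly the attributes `id`, `user.name`, and `user.geo.city`, with `None` standing in where the second document has no `geo` object. This uniform shape is what lets a SQL engine treat the records as rows of one table.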

Supplemental Material

MP4 File (21.59 MB)

dsJSON is a scalable JSON processor that supports analytical queries on data of very large size and high structural complexity. In this video, we go over the motivation behind this work and describe some of the components of the system.

PDF File (79.31 KB)

Read me

ZIP File (256.01 MB)

Source Code

Index Terms

dsJSON: A Distributed SQL JSON Processor
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
      1. MapReduce algorithms
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Semi-structured data
      2. Relational database model

Information & Contributors

Information

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN: 2836-6573

DOI: 10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation
