Various Upgrades
* Update simdjson to 3.2.2
* Support GHC 9.6
* Drop support for GHC 8.10
* Drop support for text < 2.0
* Replace attoparsec-iso8601 with text-iso8601
* Replace `Scientific` parser with aeson version
* Remove `attoparsec` dependency
* Remove unnecessary allocation for array and object iterators
* Remove unnecessary allocation for objects and field lookups
* Remove unnecessary strictness in iterator loops
* Update benchmarks
* Fix bug where internal path was not being reset on each parse
* Add array and object reset behavior for better `Alternative` instance
* Expose `listOfInt` and `listOfDouble` for users who don't rely on rewrite rules
* Remove `withArray` and `withObject`
* Add `object` which replaces obsolete `withObject`
velveteer committed Aug 20, 2023
1 parent f997ed9 commit 0008acf
Showing 20 changed files with 115,975 additions and 33,019 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -15,7 +15,7 @@ jobs:
name: ${{ matrix.os }} / ghc ${{ matrix.ghc }}
strategy:
matrix:
ghc: ['9.4', '9.2', '8.10']
ghc: ['9.6', '9.4', '9.2']
cabal: ['latest']
os: [ubuntu-latest, macOS-latest, windows-latest]
exclude:
78 changes: 43 additions & 35 deletions README.md
@@ -27,7 +27,7 @@ This library exposes functions that can be used to write decoders for JSON docum
With this in mind, `Data.Hermes` parsers can decode Haskell types faster than traditional `Data.Aeson.FromJSON` instances, especially in cases where you only need to decode a subset of the document. This is because `Data.Aeson.FromJSON` converts the entire document into a `Data.Aeson.Value`, which means memory usage increases linearly with the input size. The `simdjson::ondemand` API does not have this constraint because it iterates over the JSON string in memory without constructing an intermediate tree. This means decoders are truly lazy and you only pay for what you use.

For an incremental JSON parser in Haskell, see [json-stream](https://hackage.haskell.org/package/json-stream).
Hermes requires the entire document in memory. For an incremental JSON parser that supports streaming, see [json-stream](https://hackage.haskell.org/package/json-stream).

## Usage

@@ -40,26 +40,26 @@ import qualified Data.ByteString as BS
import qualified Data.Hermes as H

personDecoder :: H.Decoder Person
personDecoder = H.withObject $ \obj ->
personDecoder = H.object $
Person
<$> H.atKey "_id" H.text obj
<*> H.atKey "index" H.int obj
<*> H.atKey "guid" H.text obj
<*> H.atKey "isActive" H.bool obj
<*> H.atKey "balance" H.text obj
<*> H.atKey "picture" (H.nullable H.text) obj
<*> H.atKey "latitude" H.scientific obj
<$> H.atKey "_id" H.text
<*> H.atKey "index" H.int
<*> H.atKey "guid" H.text
<*> H.atKey "isActive" H.bool
<*> H.atKey "balance" H.text
<*> H.atKey "picture" (H.nullable H.text)
<*> H.atKey "latitude" H.scientific

-- Decode a strict ByteString.
decodePersons :: BS.ByteString -> Either H.HermesException [Person]
decodePersons = H.decodeEither $ H.list personDecoder
```
### Aeson Integration

While it is not recommended to use hermes if you need the full DOM, we still provide a performant interface to decode aeson `Value`s. See an example of this in the `hermes-aeson` subpackage. Ideally, you could use hermes to selectively decode aeson `Value`s on demand, for example:
While it is not recommended to use hermes if you need the full DOM, we still provide a performant interface to decode aeson `Value`s. See an example of this in the `hermes-aeson` subpackage. You could use hermes to selectively decode aeson `Value`s on demand, for example:

```haskell
> H.decodeEither (H.atPointer "/statuses/99/user/screen_name" H.hValueToAeson) twitter
> decodeEither (atPointer "/statuses/99/user/screen_name" hValueToAeson) twitter
Right (String "2no38mae")
```

@@ -68,65 +68,73 @@ Right (String "2no38mae")
When decoding fails for a known reason, you will get a `Left HermesException` indicating if the error came from `simdjson` or from an internal `hermes` call.

```haskell
> decodeEither (withObject . atKey "hello" $ list text) "{ \"hello\": [\"world\", false] }"
Left (SIMDException (DocumentError {path = "/hello/1", errorMsg = "Error while getting value of type text. The JSON element does not have the requested type."))
> decodeEither (object . atKey "hello" $ list text) "{ \"hello\": [\"world\", false] }"
Left (SIMDException (DocumentError {path = "/hello/1", errorMsg = "Error while getting value of type text. INCORRECT_TYPE: The JSON element does not have the requested type."}))
```

## Benchmarks
We benchmark the following operations using both `hermes-json` and `aeson` strict ByteString decoders:
* Decode an array of 1 million 3-element arrays of doubles
* Decode a small array of 3-element arrays of doubles
* Full decoding of a large-ish (12 MB) JSON array of Person objects
* Partial decoding of Twitter status objects to highlight the on-demand benefits
* Decoding entire documents into `Data.Aeson.Value`

### Specs

* GHC 9.4.4
* aeson-2.1.2.1 (using `Data.Aeson.Decoding`) with text-2.0.2
* GHC 9.4.6 w/ -O1
* aeson-2.2 with text > 2.0
* Apple M1 Pro

![](https://raw.githubusercontent.com/velveteer/hermes/master/hermes-bench/bench.svg)

<!-- AUTO-GENERATED-CONTENT:START (BENCHES) -->
| Name | Mean (ps) | 2*Stdev (ps) | Allocated | Copied | Peak Memory |
| --------------------------------------- | ------------- | ------------ | ---------- | ---------- | ----------- |
| All.Decode.Arrays.Hermes | 267914650000 | 10610366160 | 503599934 | 439150544 | 541065216 |
| All.Decode.Arrays.Aeson | 2214928800000 | 160279563772 | 7094759111 | 2392723388 | 1166016512 |
| All.Decode.Persons.Hermes | 47338175000 | 4290343628 | 144901928 | 57032737 | 1166016512 |
| All.Decode.Persons.Aeson | 132864400000 | 9509102680 | 357269946 | 188529742 | 1166016512 |
| All.Decode.Partial Twitter.Hermes | 241083593 | 13856196 | 348540 | 3088 | 1166016512 |
| All.Decode.Partial Twitter.JsonStream | 2116192187 | 158907568 | 15259526 | 273821 | 1166016512 |
| All.Decode.Partial Twitter.Aeson | 4254060937 | 262619196 | 12538003 | 4634594 | 1166016512 |
| All.Decode.Persons (Aeson Value).Hermes | 106420425000 | 3747538126 | 303886293 | 135388183 | 1166016512 |
| All.Decode.Persons (Aeson Value).Aeson | 119489550000 | 9713032080 | 286148916 | 177027852 | 1166016512 |
| All.Decode.Twitter (Aeson Value).Hermes | 4164246875 | 240020934 | 12368752 | 4149211 | 1166016512 |
| All.Decode.Twitter (Aeson Value).Aeson | 4810817187 | 345165042 | 12539421 | 5527424 | 1166016512 |
| Name | Mean (ps) | 2*Stdev (ps) | Allocated | Copied | Peak Memory |
| --------------------------------------- | ------------ | ------------ | --------- | --------- | ----------- |
| All.Decode.Arrays.Hermes | 1219116015 | 78989100 | 4099496 | 44470 | 94371840 |
| All.Decode.Arrays.Aeson | 16966725000 | 977804574 | 70812389 | 2086285 | 94371840 |
| All.Decode.Persons.Hermes | 49972775000 | 1732589320 | 124747018 | 37844939 | 178257920 |
| All.Decode.Persons.Aeson | 127911600000 | 8272557088 | 349500116 | 129658138 | 254803968 |
| All.Decode.Partial Twitter.Hermes | 257494824 | 11743286 | 281756 | 223 | 254803968 |
| All.Decode.Partial Twitter.JsonStream | 2409346875 | 133458318 | 15089153 | 13163 | 254803968 |
| All.Decode.Partial Twitter.Aeson | 2640857812 | 229909714 | 12165866 | 142686 | 254803968 |
| All.Decode.Persons (Aeson Value).Hermes | 112281500000 | 4051755174 | 270059665 | 106866287 | 254803968 |
| All.Decode.Persons (Aeson Value).Aeson | 111117800000 | 10907092380 | 273064166 | 106977568 | 254803968 |
| All.Decode.Twitter (Aeson Value).Hermes | 2752631250 | 113826340 | 10745274 | 182823 | 254803968 |
| All.Decode.Twitter (Aeson Value).Aeson | 2791598437 | 118553528 | 12220666 | 211154 | 254803968 |
| |
<!-- AUTO-GENERATED-CONTENT:END (BENCHES) -->
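
The `Mean (ps)` column is in picoseconds, which makes the rows hard to compare at a glance. As a minimal sketch (not part of the benchmark suite; the helper names `psToMs` and `speedup` are made up for illustration), the following converts a few means from the updated table to milliseconds and prints the Aeson-to-Hermes ratio for each group:

```haskell
import Text.Printf (printf)

-- Convert picoseconds (the unit of the "Mean (ps)" column) to milliseconds.
-- 1 ms = 1e9 ps.
psToMs :: Double -> Double
psToMs ps = ps / 1e9

-- How many times longer the baseline takes than the candidate.
speedup :: Double -> Double -> Double
speedup baselinePs candidatePs = baselinePs / candidatePs

-- (benchmark group, Hermes mean, Aeson mean), copied from the updated table.
rows :: [(String, Double, Double)]
rows =
  [ ("Arrays",          1219116015,  16966725000)
  , ("Persons",         49972775000, 127911600000)
  , ("Partial Twitter", 257494824,   2640857812)
  ]

main :: IO ()
main = mapM_ report rows
  where
    report :: (String, Double, Double) -> IO ()
    report (name, hermes, aeson) =
      printf "%-16s hermes %9.3f ms  aeson %9.3f ms  ratio %.1fx\n"
        name (psToMs hermes) (psToMs aeson) (speedup aeson hermes)
```

On these numbers the ratio comes out at roughly 14x for Arrays, 2.6x for Persons, and 10x for Partial Twitter. The `(Aeson Value)` rows are left out because their means are within the reported deviation of each other.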

## Performance Tips

* Use `text` >= 2.0 to benefit from its UTF-8 implementation.
* Decode to `Text` instead of `String` wherever possible!
* Decode to `Int` or `Double` instead of `Scientific` if you can.
* Decode your object fields in order. If encoding with `aeson`, you can leverage `toEncoding` to enforce ordering.

If you need to decode in tight loops or long-running processes (like a server), consider using the `withHermesEnv/mkHermesEnv` and `parseByteString` functions instead of `decodeEither`. This ensures the simdjson instances are not re-created on each decode. Please see the [simdjson performance docs](https://github.com/simdjson/simdjson/blob/master/doc/performance.md#performance-notes) for more info. But please ensure that you use one `HermesEnv` per thread, as simdjson is [single-threaded by default](https://github.com/simdjson/simdjson/blob/master/doc/basics.md#thread-safety).
If you need to decode in tight loops or long-running processes (like a server), consider using the `withHermesEnv/mkHermesEnv` and `parseByteString` functions instead of `decodeEither`. This ensures the simdjson instances are not re-created on each decode. See the [simdjson performance docs](https://github.com/simdjson/simdjson/blob/master/doc/performance.md#performance-notes) for more info. Please ensure that you use one `HermesEnv` per thread, as simdjson is [single-threaded by default](https://github.com/simdjson/simdjson/blob/master/doc/basics.md#thread-safety).
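
The environment-reuse advice above might look like the sketch below. The signatures are an assumption, since this commit does not show them: `mkHermesEnv` is taken to accept an optional capacity and `parseByteString` to accept the environment, a decoder, and the input. Check the `Data.Hermes` Haddocks for the version you use. `decodeAll` is a hypothetical helper name.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (evaluate)
import qualified Data.ByteString as BS
import qualified Data.Hermes as H

-- ASSUMPTION: these signatures are not shown in this commit; verify them
-- against the Data.Hermes Haddocks before relying on this sketch.
--   H.mkHermesEnv     :: Maybe Int -> IO H.HermesEnv
--   H.parseByteString :: H.HermesEnv -> H.Decoder a -> BS.ByteString
--                     -> Either H.HermesException a

-- Decode many documents with a single environment, so the simdjson
-- instances are created once rather than once per document.
decodeAll :: [BS.ByteString] -> IO [Either H.HermesException [Int]]
decodeAll docs = do
  -- Nothing is assumed to select the default capacity.
  env <- H.mkHermesEnv Nothing
  -- evaluate forces each parse here, on the thread that owns env,
  -- instead of leaving a thunk that another thread might force later.
  mapM (evaluate . H.parseByteString env (H.list H.int)) docs

main :: IO ()
main = decodeAll ["[1, 2, 3]", "[4, 5]"] >>= mapM_ print
```

In a multi-threaded server, each worker thread would create its own environment in the same way rather than sharing one.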

## Limitations

Because the On Demand API uses a forward-only iterator (except for object fields), you must be mindful to not access values out of order. This library tries to prevent this as much as possible, i.e. making `Decoder Value` impossible.
Because the On Demand API in simdjson uses a forward-only iterator (except for object fields), it is possible to introduce [unsafe iteration](https://github.com/simdjson/simdjson/blob/master/doc/ondemand_design.md#iteration-safety). Hermes tries to prevent this as much as possible with the type system.

Because the On Demand API does not validate the entire document upon creating the iterator (besides UTF-8 validation and basic well-formed checks), it is possible to parse an invalid JSON document but not realize it until later. If you need the entire document to be validated up front then a DOM parser is a better fit for you.
The On Demand API does not validate the entire document upon creating the iterator (besides UTF-8 validation and basic well-formed checks). It is possible to parse an invalid JSON document but not realize it until later. If you need the entire document to be validated up front then a DOM parser is a better fit for you.

> The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?
This library currently cannot decode scalar documents, e.g. a single string, number, boolean, or null as a JSON document.
Other limitations inherited from simdjson:

* Cannot decode scalar documents, e.g. a single string, number, boolean, or null as a JSON document.
* 4GB is the maximum document size that simdjson supports.

## Portability

Per the `simdjson` documentation:

> A recent compiler (LLVM clang6 or better, GNU GCC 7.4 or better, Xcode 11 or better) on a 64-bit (PPC, ARM or x64 Intel/AMD) POSIX systems such as macOS, freeBSD or Linux. We require that the compiler supports the C++11 standard or better.
However, this library relies on `std::string_view` without a shim, so C++17 or better is highly recommended.
However, this library relies on `std::string_view` without a shim, so C++17 or later is required.

The `native_comp` cabal flag enables passing `-march=native` to the C++ compiler.

> Passing -march=native to the compiler may make On Demand faster by allowing it to use optimizations specific to your machine. You cannot do this, however, if you are compiling code that might be run on less advanced machines. That is, be mindful that when compiling with the -march=native flag, the resulting binary will run on the current system but may not run on other systems (e.g., on an old processor).
> If you are compiling on an ARM or POWER system, you do not need to be concerned with CPU selection during compilation. The -march=native flag is useful for best performance on x64 (e.g., Intel) systems but it is generally unsupported on some platforms such as ARM (aarch64) or POWER.
35 changes: 17 additions & 18 deletions cbits/lib.cpp
@@ -44,19 +44,15 @@ extern "C" {
return doc.at_pointer(pointerSv).get(out);
}

error_code get_object_from_value(
ondemand::value &val,
ondemand::object &out) {
return val.get_object().get(out);
error_code get_object_from_value(ondemand::value &val) {
return val.get_object().error();
}

error_code get_object_iter_from_value(
ondemand::value &val,
ondemand::object_iterator &iterOut) {
error_code get_object_iter_from_value(ondemand::value &val) {
ondemand::object obj;
auto error = val.get_object().get(obj);
if (error != SUCCESS) { return error; }
return obj.begin().get(iterOut);
return obj.begin().error();
}

bool obj_iter_is_done(ondemand::object_iterator &obj) {
@@ -91,8 +87,8 @@ extern "C" {

error_code get_array_len_from_value(
ondemand::value &val,
ondemand::array &out,
size_t &len) {
ondemand::array out;
auto error = val.get_array().get(out);
if (error) { return error; }
return out.count_elements().get(len);
@@ -116,23 +112,18 @@ extern "C" {
return SUCCESS;
}

error_code get_array_iter_from_value(
ondemand::value &val,
ondemand::array_iterator &iterOut) {
error_code get_array_iter_from_value(ondemand::value &val) {
ondemand::array arr;
auto error = val.get_array().get(arr);
if (error != SUCCESS) { return error; }
return arr.begin().get(iterOut);
return arr.begin().error();
}

error_code get_array_iter_len_from_value(
ondemand::value &val,
ondemand::array_iterator &iterOut,
size_t &len) {
error_code get_array_iter_len_from_value(ondemand::value &val, size_t &len) {
ondemand::array arr;
auto error = val.get_array().get(arr);
if (error != SUCCESS) { return error; }
error = arr.begin().get(iterOut);
error = arr.begin().error();
if (error != SUCCESS) { return error; }
return arr.count_elements().get(len);
}
@@ -149,6 +140,14 @@ extern "C" {
++arr;
}

void reset_array(ondemand::array &arr) {
arr.reset();
}

void reset_object(ondemand::object &obj) {
obj.reset();
}

error_code find_field(
ondemand::object &obj,
const char *key,