The Need for the Creation of the STS Plugin

From a Web Application in Tomcat
The idea of having a server to manage the dataset was born in 2012, during the performance tests of the income tax declaration application of the French Ministry of Public Finance. The dataset consisted of millions of lines to simulate tens of thousands of people filling out their income tax return form per hour, and a dozen injectors distributed the injection load of a performance shot. The dataset was consumed: once the line with a person's information was read, it could not be taken again. Centralized management of the dataset was implemented with a Java web application (WAR) running in Tomcat, and the injectors requested rows of the dataset from this web application.

To a Plugin for Apache JMeter
The need for centralized management of the dataset, especially with an architecture of several JMeter injectors, led to the creation in 2014 of a plugin for Apache JMeter named HTTP Simple Table Server, or STS. This plugin takes over the features of the dedicated Tomcat application mentioned above, but with a lighter and simpler technical solution based on the NanoHTTPD library.

Manage the Dataset With HTTP Simple Table Server (STS)

Adding Centralized Management of the Dataset
Performance tests with JMeter can be run with several JMeter injectors or load generators (without a user interface, on remote machines) and a JMeter controller (with a graphical or command-line interface on the local machine). The JMeter script and properties are sent to the injectors over the RMI protocol, and the results of the calls are returned periodically to the JMeter controller. However, the dataset and CSV files are not transferred from the controller to the injectors.

It is natively not possible with JMeter to read the dataset randomly or in reverse order. Several external plugins allow you to read a data file randomly, but not in a centralized way. It is also natively not possible to save data created during the test, such as file numbers or new documents, to a file. Values can be saved with a Groovy script in JMeter, but not in a centralized way when several injectors are used during the performance test.

The main idea is to use a small HTTP server to manage the dataset files, with simple commands to retrieve or add data lines. This HTTP server can be launched on its own as an external program or inside the JMeter tool. The HTTP server is called the "Simple Table Server" or STS. The name is also a nod to LoadRunner's Virtual Table Server (VTS), which has different functionality and a different technical implementation but covers similar use cases. STS is very useful in a test with multiple JMeter injectors, but it also brings interesting features when a single JMeter instance is used for performance testing.

Some Possibilities of Using the HTTP Simple Table Server

Reading Data in a Distributed and Collision-Free Way
Some applications do not tolerate two users connecting with the same login at the same time, and it is generally recommended not to use the same data for two simultaneously connected users in order to avoid conflicts on the data used. The STS can easily manage the dataset in a distributed way to ensure that logins are different for each virtual user at any given time T of the test.
A dataset of logins/passwords with more lines than the number of active threads at any given time is required. The logins/passwords file is loaded at STS startup, manually, or by script, and the injectors ask the STS for a login/password line with the command READ, KEEP=TRUE, and READ_MODE=FIRST. The dataset can thus be considered to be read in a circular way.

Reading Single-Use Data
The dataset can be single-use, meaning that each line is used only once during the test. For example, people who register on a site can no longer register with the same information because the system detects a duplicate, and documents awaiting validation by an administrator can no longer be validated again once they have changed state. To do this, we read a data file into the memory of the HTTP STS, and the virtual users read by deleting the value at the top of the list. When the test is stopped, the values remaining in memory can be saved to a file (with or without a timestamp prefix), or the HTTP STS can keep running for another test with the values still in memory.

Producers and Consumers of a Queue
With the STS, we can manage a queue with producers who add data to the queue and consumers who consume it. In practice, we start a script with producers who create data such as new documents or newly registered people. The identifiers of created documents or the information of newly registered people are stored in the HTTP STS with the ADD command in mode ADD_MODE=LAST. The consumers start a little later so that there is data in the queue. The consumers must also not consume faster than the producers produce; otherwise, the case where the list no longer contains a value must be handled by detecting that no value is available and waiting a few seconds before retrying in a loop. The consumers consume the values with the commands READ, READ_MODE=FIRST, KEEP=FALSE. In the original article, a schema illustrates producers calling ADD and consumers calling READ on the same queue.

Producer and Consumer With Search by FIND
This is a variation of the previous solution.

Producer: A user with rights limited to a geographical sector creates documents and adds a line with his login + the document number (e.g., login23;D12223):

Plain Text
login1;D120000
login2;D120210
login23;D12223
login24;D12233
login2;D120214
login23;D12255

Consumer: A user will modify the characteristics of a document, but to do so, he must choose from the list in memory only the lines with the same login as the one he is currently using, for reasons of rights and geographic sector. He searches the list with the FIND command and FIND_MODE=SUBSTRING. The string searched for is the login of the connected person (e.g., login23), so LINE=login23; (with the separator ;). In this example, the returned line will be the first login23 line, and the file number will be used to search for files in the tested application. Here, the result of the FIND by substring (SUBSTRING) is: login23;D12223. The line can be consumed with KEEP=FALSE, or kept and placed at the end of the list with KEEP=TRUE.
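For illustration, here is what the producer and consumer calls could look like using the ADD, FIND, and READ commands described later in this article; the hostname, port, and file name are only example values, and the READ parameters follow the same FILENAME/READ_MODE/KEEP pattern documented for READMULTI.

Plain Text
# Producer: append a line to the in-memory list dossier.csv
http://localhost:9191/sts/ADD?FILENAME=dossier.csv&LINE=login23;D12223&ADD_MODE=LAST
# Consumer bound to login23: fetch and remove the first matching line
http://localhost:9191/sts/FIND?FILENAME=dossier.csv&LINE=login23;&FIND_MODE=SUBSTRING&KEEP=FALSE
# Simple queue consumer: fetch and remove the first line
http://localhost:9191/sts/READ?FILENAME=dossier.csv&READ_MODE=FIRST&KEEP=FALSE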
Enrich the Dataset as the Shot Progresses
The idea is to start with a reduced dataset that is read at the start of the shot and to add new lines to it as the shot progresses, in order to enrich the dataset. The dataset is therefore larger at the end of the shot than at the beginning. At the end of the shot, the enriched dataset can be saved with the SAVE command, and the new file can become the input file for a future shot. For example, a scenario adds people to the application; these people are also added to the search dataset in memory, so searches are performed both on the first people of the file read at the beginning and on the people added as the performance test progresses.

Verification of the Dataset
The initial dataset can contain values that will generate errors in the script, because the navigation falls into a particular case or because the entered value is refused as incorrect or already existing. We read the dataset with INITFILE or at STS startup, then READ and use each value; if the script runs to the end (the Thread Group is configured with "Start Next Thread Loop on Sampler Error"), we save the value with ADD into a file (valuesok.csv). At the end of the test, the values of the valuesok.csv file are saved (for example, with a "tearDown Thread Group"). The verified data in valuesok.csv can then be used for subsequent tests.

Storing the Added Values in a File
In the script, the values added to the application database are saved to a file with the ADD command. At the end of the test, the values in memory are saved to a file (for example, with a "tearDown Thread Group"). The file containing the created values can then be used to verify the additions to the database a posteriori, or to delete the created values in order to return to the initial state before the test. A script dedicated to creating a data set can create values and store them in a data file; this file is then used by the JMeter script during performance tests. Conversely, a dedicated deletion script can take the file of created values and delete them in a loop.

Communication Between Software
The HTTP STS can be used as a means to communicate values between JMeter and other software. The software can be a desktop client, a web application, a shell script, a Selenium test, another JMeter instance, a LoadRunner instance, etc. We can also launch two command-line JMeter tests in a row, using an "external standalone" STS to store values created by the first test and used by the second one. The software can add or read values by calling the URL of the HTTP STS, and JMeter can read these added values or add its own. This possibility facilitates performance tests but also non-regression tests.

Gradual Exiting the Test on All JMeter Injectors
JMeter has no equivalent of LoadRunner's "GradualExiting," that is, a way to tell the vuser at the end of an iteration whether it should continue to iterate or stop. We can simulate this "GradualExiting" with the STS and a little code in the JMeter script. With the STS, we load a file "status.csv," which contains a single line with a particular value such as "RUN." At the beginning of each iteration, the vuser asks for the value of the line in "status.csv." If the value equals "STOP," the vuser stops; if the value is empty or different from STOP, such as "RUN," the vuser continues.

Gradual Exiting After a Fixed Date and Time
We can also program the stop request for a fixed time with this system. The date and time at which the status value should change to STOP (e.g., 2024-07-31_14h30m45s) is passed as a parameter to a small JMeter script that runs alongside the current load test. The script is launched and calculates the number of milliseconds remaining before the requested stop time. The vuser is then put on hold for the calculated duration. After the wait, the "status.csv" file in the STS is cleared and the STOP value is written, which allows the JMeter script that is already running to read the status value, detect status == "STOP," and stop properly on all the JMeter injectors, or on a single JMeter instance.
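As an illustration of this pattern, built only from the STS commands described later in this article (hostname, port, and file name are example values):

Plain Text
# Before the test: load status.csv, which contains the single line RUN
http://localhost:9191/sts/INITFILE?FILENAME=status.csv
# Each vuser, at the start of an iteration: read the status without consuming it
http://localhost:9191/sts/READ?FILENAME=status.csv&READ_MODE=FIRST&KEEP=TRUE
# Stop script, once the target time is reached: clear the list and write STOP
http://localhost:9191/sts/RESET?FILENAME=status.csv
http://localhost:9191/sts/ADD?FILENAME=status.csv&LINE=STOP&ADD_MODE=FIRST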
Start the HTTP Simple Table Server

Declaration and Start of the STS From the Graphical Interface
The Simple Table Server is located in the "Non-Test Elements" menu. Click the "Start" button to start the HTTP STS. By default, the directory containing the files is <JMETER_HOME>/bin.

Start From the Command Line
It is also possible to start the STS server from the command line:

Shell
<JMETER_HOME>\bin\simple-table-server.cmd (Windows OS)
<JMETER_HOME>/bin/simple-table-server.sh (Linux OS)

Or automatically when starting JMeter from the command line (CLI), by declaring these two lines in jmeter.properties:

Properties files
jsr223.init.file=simple-table-server.groovy
jmeterPlugin.sts.loadAndRunOnStartup=true

If the value is false, the STS is not started when JMeter is launched from the command line without GUI. The default port is 9191. It can be changed with the property:

Properties files
# jmeterPlugin.sts.port=9191

If jmeterPlugin.sts.port=0, the STS does not start when JMeter is launched in CLI mode.

The property jmeterPlugin.sts.addTimestamp=true indicates that a saved file (e.g., info.csv) will be prefixed with the date and time (e.g., 20240112T13h00m50s.info.csv); with jmeterPlugin.sts.addTimestamp=false, the file info.csv is written or overwritten.

The property jmeterPlugin.sts.daemon=true is used when the STS is launched as an external application with the Linux nohup command (example: nohup ./simple-table-server.sh &). In this case, the STS does not listen for the <ENTER> key to exit; it runs in an infinite loop, so you must call the /sts/STOP command or kill the process to stop it. When jmeterPlugin.sts.daemon=false, the STS waits for <ENTER> to exit. This is the default mode.

Loading Files at Simple Table Server Startup
The STS can load files into memory at startup. Files are loaded when the STS is launched as an external application (<JMETER_HOME>\bin\simple-table-server.cmd or <JMETER_HOME>/bin/simple-table-server.sh), and also when JMeter is launched from the command line without GUI or via the JMeter Maven plugin. Files are not loaded when JMeter runs in GUI mode. The files are read from the directory indicated by the property jmeterPlugin.sts.datasetDirectory; if this property is null, they are read from <JMETER_HOME>/bin. The files to load are declared with the following properties:

Properties files
jmeterPlugin.sts.initFileAtStartup=article.csv,filename.csv
jmeterPlugin.sts.initFileAtStartupRegex=false

or

Properties files
jmeterPlugin.sts.initFileAtStartup=.+?\.csv
jmeterPlugin.sts.initFileAtStartupRegex=true

When jmeterPlugin.sts.initFileAtStartupRegex=false, the property jmeterPlugin.sts.initFileAtStartup contains the list of files to be loaded, with the comma character "," as the file name separator (e.g., jmeterPlugin.sts.initFileAtStartup=article.csv,filename.csv). At startup, the STS will try to load (INITFILE) the file article.csv and then filename.csv.
When jmeterPlugin.sts.initFileAtStartupRegex=true, the property jmeterPlugin.sts.initFileAtStartup contains a regular expression that is matched against the files in the directory given by jmeterPlugin.sts.datasetDirectory (e.g., jmeterPlugin.sts.initFileAtStartup=.+?\.csv loads into memory (INITFILE) all files with the ".csv" extension). The file name must not contain special characters that would allow changing the reading directory, such as ..\..\fichier.csv, /etc/passwd, or ../../../tomcat/conf/server.xml. The maximum size of a file name is 128 characters (not counting the directory path).

Management of the Encoding of Files to Read/Write and of the HTML Response
It is possible to define the encoding used when reading or writing data files. Files with accented characters are read (INITFILE) or written (SAVE) with a charset such as UTF-8 or ISO8859_15, and all CSV files need to be in the same charset encoding:

Properties files
jmeterPlugin.sts.charsetEncodingReadFile=UTF-8
jmeterPlugin.sts.charsetEncodingWriteFile=UTF-8

Files are read (INITFILE) with the charset declared by jmeterPlugin.sts.charsetEncodingReadFile and written (SAVE) with the charset declared by jmeterPlugin.sts.charsetEncodingWriteFile. The default value of both properties corresponds to the system property file.encoding. All data files must be in the same charset if they contain non-ASCII characters.

For the HTML responses to the different commands, especially READ, the charset declared in the response header is given by jmeterPlugin.sts.charsetEncodingHttpResponse=<charset> (use UTF-8): the HTTP header then contains "Content-Type:text/html; charset=<charset>". The default value is the JMeter property sampleresult.default.encoding. For the charset names, see the Oracle documentation on Supported Encodings (use the canonical name from the java.io API and java.lang API column).

Help With Use
The URL of an STS command has the form <HOSTNAME>:<PORT>/sts/<COMMAND>?<PARAMETERS>. The commands and parameter names are in uppercase (case sensitive). If no command is indicated, the help message is returned: http://localhost:9191/sts/.

Commands and Configuration
The following is a list of the HTTP STS commands and configuration, with extracts from the documentation on the jmeter-plugins.org site. The calls are atomic (synchronized): a read or an add completes before the next request is processed. The commands are sent to the Simple Table Server with HTTP GET and/or POST calls, depending on the command. Documentation is available on the JMeter Plugins website.

Distributed Architecture for JMeter
The Simple Table Server runs on the JMeter controller (master), and the load generators (slaves), or injectors, make calls to the STS to get, find, or add data. At the beginning of the test, the first load generator loads the data into memory (initial call), and at the end of the test it asks the STS to save the values to a file. All the load generators ask for data from the same STS, which is started on the JMeter controller. The INITFILE can also be done at STS startup (without an initial call from the first load generator).
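Because the STS only speaks plain HTTP GET, any load generator or external tool can use it with a few lines of code. Below is a minimal Java sketch of such a client, assuming the STS runs on localhost:9191; the class and method names are hypothetical, and the READ parameters shown follow the same FILENAME/READ_MODE/KEEP pattern documented for READMULTI in the next section.

Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StsClient {
    private static final String BASE = "http://localhost:9191/sts/"; // example host and port
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Sends one STS command and returns the raw HTML response.
    static String call(String commandAndParams) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(BASE + commandAndParams)).GET().build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // e.g., <html><title>OK</title> <body>login1;password1</body> </html>
    }

    // Extracts the value between <body> and </body>; the title tag carries OK or KO.
    static String body(String html) {
        int start = html.indexOf("<body>") + "<body>".length();
        int end = html.indexOf("</body>");
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) throws Exception {
        // Load the file into memory, consume the first line, then append a new line.
        System.out.println(body(call("INITFILE?FILENAME=logins.csv")));                        // number of lines read
        System.out.println(body(call("READ?FILENAME=logins.csv&READ_MODE=FIRST&KEEP=FALSE"))); // first line, removed
        System.out.println(body(call("ADD?FILENAME=dossier.csv&LINE=D0001123&ADD_MODE=LAST"))); // empty body on success
    }
}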
Example of a dataset file logins.csv:

Plain Text
login1;password1
login2;password2
login3;password3
login4;password4
login5;password5

INITFILE: Load a File in Memory
http://hostname:port/sts/INITFILE?FILENAME=logins.csv

HTML
<html><title>OK</title>
<body>5</body> => number of lines read
</html>

Linked list after this command:

Plain Text
login1;password1
login2;password2
login3;password3
login4;password4
login5;password5

The files are read from the directory indicated by the property jmeterPlugin.sts.datasetDirectory; if this property is null, from the directory <JMETER_HOME>/bin/.

READ: Get One Line From the List
http://hostname:port/sts/READ?FILENAME=logins.csv&READ_MODE={FIRST,LAST,RANDOM}&KEEP={TRUE,FALSE}

HTML
<html><title>OK</title>
<body>login1;password1</body>
</html>

Available options:
READ_MODE=FIRST => login1;password1
READ_MODE=LAST => login5;password5
READ_MODE=RANDOM => login?;password?
KEEP=TRUE => The line is kept and moved to the end of the list
KEEP=FALSE => The line is removed

READMULTI: Get Several Lines From the List in One Request
GET Protocol:
http://hostname:port/sts/READMULTI?FILENAME=logins.csv&NB_LINES={Nb lines to read}&READ_MODE={FIRST, LAST, RANDOM}&KEEP={TRUE, FALSE}

Available options:
NB_LINES=Number of lines to read: 1 <= Nb lines (Integer) <= list size
READ_MODE=FIRST => Start reading at the first line
READ_MODE=LAST => Start reading at the last line (reverse)
READ_MODE=RANDOM => Read n lines randomly
KEEP=TRUE => The lines are kept and moved to the end of the list
KEEP=FALSE => The lines are removed

ADD: Add a Line to a File (GET or POST HTTP Protocol)
Parameters: FILENAME=dossier.csv, LINE=D0001123, ADD_MODE={FIRST, LAST}

HTML format:

HTML
<html><title>OK</title>
<body></body>
</html>

Available options:
ADD_MODE=FIRST => Add to the beginning of the list
ADD_MODE=LAST => Add to the end of the list
FILENAME=dossier.csv => If the list does not already exist, a linked list is created in memory
LINE=1234;98763 => The line to add
UNIQUE => Do not add the line if the list already contains an identical line (returns KO)

GET Protocol:
http://hostname:port/sts/ADD?FILENAME=dossier.csv&LINE=D0001123&ADD_MODE={FIRST, LAST}
The same parameters can also be sent in an HTTP POST request.

FIND: Find a Line in the File (GET or POST HTTP Protocol)
The FIND command finds a line (LINE) in a file (FILENAME). The LINE to find depends on FIND_MODE:
A string: SUBSTRING (default, a line in the file contains the string to find) or EQUALS (the string to find equals a line in the file)
A regular expression: REGEX_FIND (contains) or REGEX_MATCH (the entire line matches the pattern)
KEEP=TRUE => The line is kept and moved to the end of the list
KEEP=FALSE => The line is removed

GET Protocol:
http://hostname:port/sts/FIND?FILENAME=colors.txt&LINE=(BLUE|RED)&[FIND_MODE=[SUBSTRING,EQUALS,REGEX_FIND,REGEX_MATCH]]&KEEP={TRUE, FALSE}

FIND returns the first matching line, starting the search at the first line of the file (linked list):

HTML
<html><title>OK</title>
<body>RED</body>
</html>

If no line is found, the title is KO and the body contains an error message.
HTML
<html><title>KO</title>
<body>Error : Not find !</body>
</html>

LENGTH: Return the Number of Remaining Lines in a Linked List
http://hostname:port/sts/LENGTH?FILENAME=logins.csv

HTML format:

HTML
<html><title>OK</title>
<body>5</body> => remaining lines
</html>

STATUS: Display the List of Loaded Files and the Number of Remaining Lines
http://hostname:port/sts/STATUS

HTML format:

HTML
<html><title>OK</title>
<body>
logins.csv = 5<br />
dossier.csv = 1<br />
</body>
</html>

SAVE: Save the Specified Linked List to a File in the datasetDirectory Location
http://hostname:port/sts/SAVE?FILENAME=logins.csv

If jmeterPlugin.sts.addTimestamp is set to true, a timestamp is added to the file name. The file is stored in jmeterPlugin.sts.datasetDirectory or, if that property is null, in the <JMETER_HOME>/bin directory: 20240520T16h33m27s.logins.csv. You can override the addTimestamp value with the ADD_TIMESTAMP parameter in the URL:
http://hostname:port/sts/SAVE?FILENAME=logins.csv&ADD_TIMESTAMP={true,false}

HTML format:

HTML
<html><title>OK</title>
<body>5</body> => number of lines saved
</html>

RESET: Remove All Elements From the Specified List
http://hostname:port/sts/RESET?FILENAME=logins.csv

HTML format:

HTML
<html><title>OK</title>
<body></body>
</html>

The RESET command is often used in a "setUp Thread Group" to clear the values kept in the in-memory linked list from a previous test. It always returns OK, even if the file does not exist.

STOP: Shut Down the Simple Table Server
http://hostname:port/sts/STOP

The STOP command is usually used when the HTTP STS is launched by a shell script and we want to stop the STS at the end of the test. When jmeterPlugin.sts.daemon=true, you need to call http://hostname:port/sts/STOP or kill the process to stop the STS.

CONFIG: Display the STS Configuration
http://hostname:port/sts/CONFIG

Displays the STS configuration, e.g.:

Plain Text
jmeterPlugin.sts.loadAndRunOnStartup=false
startFromCli=true
jmeterPlugin.sts.port=9191
jmeterPlugin.sts.datasetDirectory=null
jmeterPlugin.sts.addTimestamp=true
jmeterPlugin.sts.demon=false
jmeterPlugin.sts.charsetEncodingHttpResponse=UTF-8
jmeterPlugin.sts.charsetEncodingReadFile=UTF-8
jmeterPlugin.sts.charsetEncodingWriteFile=UTF-8
jmeterPlugin.sts.initFileAtStartup=
jmeterPlugin.sts.initFileAtStartupRegex=false
databaseIsEmpty=false

Error Response KO
When the command and/or a parameter is wrong, the result is an HTML page with status 200, but the title contains the label KO. Examples:

Send an unknown command. Be careful: commands are case sensitive (READ != read).

HTML
<html><title>KO</title>
<body>Error : unknown command !</body>
</html>

Try to read a value from a file not yet loaded with INITFILE.

HTML
<html><title>KO</title>
<body>Error : logins.csv not loaded yet !</body>
</html>

Try to read a value from a file when there are no more lines in the linked list.
HTML
<html><title>KO</title>
<body>Error : No more line !</body>
</html>

Try to save lines in a file whose name contains illegal characters like ".." or ":".

HTML
<html><title>KO</title>
<body>Error : Illegal character found !</body>
</html>

Command FIND when no line is found:

HTML
<html><title>KO</title>
<body>Error : Not find !</body>
</html>

Command FIND with FIND_MODE=REGEX_FIND or REGEX_MATCH and an invalid pattern:

HTML
<html><title>KO</title>
<body>Error : Regex compile error !</body>
</html>

Modifying STS Parameters From the Command Line
You can override STS settings using command-line options:

Plain Text
-DjmeterPlugin.sts.port=<port number>
-DjmeterPlugin.sts.loadAndRunOnStartup=<true/false>
-DjmeterPlugin.sts.datasetDirectory=<path/to/your/directory>
-DjmeterPlugin.sts.addTimestamp=<true/false>
-DjmeterPlugin.sts.daemon=<true/false>
-DjmeterPlugin.sts.charsetEncodingHttpResponse=<charset like UTF-8>
-DjmeterPlugin.sts.charsetEncodingReadFile=<charset like UTF-8>
-DjmeterPlugin.sts.charsetEncodingWriteFile=<charset like UTF-8>
-DjmeterPlugin.sts.initFileAtStartup=<files to read at STS startup, e.g., article.csv,users.csv>
-DjmeterPlugin.sts.initFileAtStartupRegex=<false: no regular expression, files with comma separator; true: read files matching the regular expression>

Example:

Shell
jmeter.bat -DjmeterPlugin.sts.loadAndRunOnStartup=true -DjmeterPlugin.sts.port=9191 -DjmeterPlugin.sts.datasetDirectory=c:/data -DjmeterPlugin.sts.charsetEncodingReadFile=UTF-8 -n -t testdemo.jmx

STS in the POM of a Test With the jmeter-maven-plugin
It is possible to use the STS in a performance test launched with the jmeter-maven-plugin. To do this:

Put your CSV files in the <project>/src/test/jmeter directory (e.g., logins.csv).
Put simple-table-server.groovy (Groovy script) in the <project>/src/test/jmeter directory.
Put your JMeter script in the <project>/src/test/jmeter directory (e.g., test_login.jmx).
In the Maven build section, in the <jmeterExtensions> configuration, declare the artifact kg.apc:jmeter-plugins-table-server:<version>.
Declare user properties for the STS configuration and automatic start.
If you use localhost and a proxy configuration, you can add a proxy exclusion with <hostExclusions>localhost</hostExclusions>.

Extract of a pom.xml dedicated to the HTTP Simple Table Server:

XML
<build>
  <plugins>
    <plugin>
      <groupId>com.lazerycode.jmeter</groupId>
      <artifactId>jmeter-maven-plugin</artifactId>
      <version>3.8.0</version>
      ...
      <configuration>
        <jmeterExtensions>
          <artifact>kg.apc:jmeter-plugins-table-server:5.0</artifact>
        </jmeterExtensions>
        <propertiesUser>
          <!-- properties configuration for HTTP Simple Table Server with automatic start when JMeter starts -->
          <jmeterPlugin.sts.port>9191</jmeterPlugin.sts.port>
          <jmeterPlugin.sts.addTimestamp>true</jmeterPlugin.sts.addTimestamp>
          <jmeterPlugin.sts.datasetDirectory>${project.build.directory}/jmeter/testFiles</jmeterPlugin.sts.datasetDirectory>
          <jmeterPlugin.sts.loadAndRunOnStartup>true</jmeterPlugin.sts.loadAndRunOnStartup>
          <jsr223.init.file>${project.build.directory}/jmeter/testFiles/simple-table-server.groovy</jsr223.init.file>
        </propertiesUser>
      </configuration>
    </plugin>
  </plugins>
</build>

Links
jmeter-plugins.org documentation
“Insanity is doing the same thing over and over again, but expecting different results.” - Source unknown

As the quote suggests, humans have a tendency to retry things even when the result is not going to change, and this bias has carried over into how we design systems. If you look closely, there are two broad categories of failures:

Cases where a retry makes sense
Cases where a retry doesn't make sense

In the first category, transient failures such as network glitches or intermittent service overloads are good candidates for retrying. In the second, where the failure originates from the request itself being malformed, from requests being throttled (429s), or from the service load-shedding (5xx), retrying doesn't make much sense. Both categories need attention and appropriate handling, but in this article we will focus on the first category, where it makes sense to retry.

The Case for Retries
In the modern world, a typical customer-facing service is made up of many microservices. A customer query goes through a complex call graph and typically touches many services. On a happy day, a single customer query meets no error (or independent failure), giving the illusion of a single successful call. But a service that depends on 5 services, each with 99.99% availability, can only achieve a maximum of about 99.95% availability for its callers (reference doc). The key point is that even though each individual dependency has an excellent availability of 99.99%, the cumulative effect of depending on 5 such services lowers the overall availability of the main service, because the probability of all 5 dependencies succeeding on a single request is the product of their individual probabilities:

Overall Availability = (Individual Availability) ^ (# of Dependencies)

With 5 dependencies at 99.99%, that is 0.9999^5, or roughly 99.95%. Using the same formula, to maintain 99.99% availability at the main service without any retries, every dependency would need an availability higher than about 99.998%. So this begs the question: how do you achieve 99.99% availability at the main service? The answer is that we need retries.

Retries Sweet Spot
We saw above that the maximum availability you can achieve without retries is about 99.95% with those 5 dependencies. Retries change the picture because each retry gives a failed call another chance to succeed: with a given number of retries per call, the effective availability of a single dependency becomes 1 - (1 - Individual Availability) ^ (1 + # of Retries), and the overall availability becomes:

Overall Availability = (1 - (1 - Individual Availability) ^ (1 + # of Retries)) ^ (# of Dependencies)

Plugging in the numbers, a single retry lifts each dependency from 99.99% to roughly 99.999999% effective availability, which brings the composed availability of the main service back above the 99.99% target; a second retry adds further margin. This demonstrates how retries can be an effective strategy to overcome the cumulative availability reduction of relying on multiple dependencies and to meet the desired overall service-level objectives.

This is great and obvious! So why this article!?
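Before we get to that, here is a quick, illustrative sanity check of the arithmetic above; the figures and method names are only for demonstration.

Java
public class AvailabilityMath {
    // Availability of a service that needs all of its dependencies to succeed,
    // where each call to a dependency is attempted up to (1 + retries) times.
    static double composedAvailability(double individual, int dependencies, int retries) {
        double perDependency = 1.0 - Math.pow(1.0 - individual, 1 + retries);
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        System.out.printf("No retries : %.6f%%%n", 100 * composedAvailability(0.9999, 5, 0)); // about 99.950010%
        System.out.printf("One retry  : %.6f%%%n", 100 * composedAvailability(0.9999, 5, 1)); // about 99.999995%
        System.out.printf("Two retries: %.6f%%%n", 100 * composedAvailability(0.9999, 5, 2)); // effectively 100.000000% at this precision
    }
}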
Retry Storm and Call Escalation
While retries help, they also bring trouble with them. The main problem shows up when one of the services you depend on is in trouble or having a bad day. Retrying against a service that is already down is like kicking it where it hurts, and it can delay that service's recovery.

Now think of a scenario where the call graph is multiple levels deep; for example, the main service depends on 5 sub-services, which in turn each depend on another 5 services. When there is a failure and you retry at the top, that can lead to 5 * 5 = 25 retried calls. If those services are themselves dependent on 5 more services, one retry at the top may end up as 5 * 5 * 5 calls, and so on. Retries don't speed up recovery, and this extra generated load can take down a service that was still partially operating. When those calls fail, they trigger yet more retries, failures compound, and a retry storm begins that can create long-lasting outages. With even a single retry at every level of a call graph that is N levels deep, the call volume at the lowest level can grow on the order of 2^N, which is catastrophic; the system would have recovered much faster had there been no retries at all.

This brings us to the meat of the article: we like retries, but they are dangerous if not done right. So how can we do them right?

Strategies for Limiting Retries
To benefit from retries while preventing retry storms and cascading failures, we need a way to stop excessive retries. As highlighted above, even a single retry at each level can cause too much damage when the call graph is deep (2^N). We need to limit retries in aggregate. Here are some practical ways to do that:

Bounded Retries
The idea behind bounded retries is to put an upper bound on how many retries you can make. The bound can be time-based, for example, at most 10 retries per minute; or it can be based on the success rate, for example, for every 1,000 successful calls you earn a single retry credit, accumulating credits up to a fixed cap.

Circuit Breaking
The philosophy of the circuit-breaker technique is simple: don't hammer what's already down. With the circuit-breaker pattern, when you hit errors, you open the circuit, stop making calls to the server, and give it breathing room. To check for recovery, a probe periodically makes a single call to the service. Once the service has recovered, you gradually ramp traffic back up and return to normal operation. This gives the receiving service the time it needs to recover. The topic is covered in much more detail in Martin Fowler's article.

Retry Strategies
Techniques from TCP congestion control, such as AIMD (additive increase, multiplicative decrease), can also be employed. AIMD slowly increases the traffic to a service (think of it as connections + 1) but immediately reduces it when an error occurs (think of it as active connections / 2), and you keep doing this until you reach an equilibrium. Exponential backoff is another technique: upon an error you back off for a period of time and increase that period on each subsequent failure, up to some maximum. The sequence is typically 1, 2, 4, 8, ... (2^retryCount), but it could also be Fibonacci-based: 1, 2, 3, 5, 8, .... There is another gotcha while retrying: keep jitter in mind, so that clients don't all retry at the same moment. Marc Brooker has a great article that goes into more depth on exponential backoff and retries.
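For illustration, here is a minimal sketch of capped exponential backoff with full jitter; the retry limit, base delay, and cap are arbitrary example values.

Java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class RetryWithBackoff {
    private static final int MAX_ATTEMPTS = 3;       // example bound on attempts
    private static final long BASE_DELAY_MS = 100;   // example base delay
    private static final long MAX_DELAY_MS = 2_000;  // example cap on the backoff

    // Runs the operation, retrying transient failures with capped exponential backoff and full jitter.
    static <T> T callWithRetry(Supplier<T> operation) throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException transientFailure) {
                last = transientFailure;
                // Exponential backoff: base * 2^attempt, capped at MAX_DELAY_MS.
                long exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * (1L << attempt));
                // Full jitter: sleep a random duration in [0, exponential] so clients don't retry in lockstep.
                Thread.sleep(ThreadLocalRandom.current().nextLong(exponential + 1));
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws InterruptedException {
        // Example: a flaky operation that fails randomly, to exercise the retry path.
        String result = callWithRetry(() -> {
            if (ThreadLocalRandom.current().nextInt(3) == 0) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        });
        System.out.println(result);
    }
}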
While these techniques address client-side protection, we can also employ guardrails on the server side. Some of the things to consider are:

Explicit backpressure contract: When under load, reject the caller's request and pass along metadata saying that you are failing because of overload, so the caller can propagate it to its own upstream, and so on.
Avoid wasted work: When you expect the service to be under load, avoid doing work that will be discarded. If you can detect that a caller has already timed out and no longer needs an answer, drop the work altogether.
Load shedding: When under stress, shed load aggressively. Invest in mechanisms to identify duplicate requests and discard one of them, and use signals like CPU and memory utilization to complement the load-shedding decision.

Conclusion
In distributed systems, transient failures in remote interactions are unavoidable. To build higher-availability systems, we rely on retries, but retries can also lead to big outages because of call escalation. Clients can adopt the approaches above to avoid overloading the system during failures, and services should employ techniques to protect themselves in case a client goes rogue.

Disclaimer
The guidance provided here offers principles and practices that can broadly improve the reliability of services in many conditions. However, I would advise that you do not view this as a one-size-fits-all mandate. Instead, I suggest that you and your team evaluate these recommendations through the lens of your specific needs and circumstances.
Garbage collection (GC) plays an important role in Java's memory management, reclaiming memory that is no longer in use. A garbage collector uses its own set of threads, called GC threads, to do this work. Sometimes a JVM can end up with either too many or too few GC threads. In this post, we will discuss why that happens, the consequences of each case, and potential solutions.

How To Find Your Application's GC Thread Count
You can determine your application's GC thread count with a thread dump analysis, as outlined below:

Capture a thread dump from your production server.
Analyze the dump using a thread dump analysis tool.
The tool will report the GC thread count, as shown in the figure below.

Figure 1: fastThread tool reporting GC thread count

How To Set the GC Thread Count
You can manually adjust the number of GC threads by setting the following two JVM arguments:

-XX:ParallelGCThreads=n: Sets the number of threads used in the parallel phases of the garbage collector
-XX:ConcGCThreads=n: Controls the number of threads used in the concurrent phases of the garbage collector

What Is the Default GC Thread Count?
If you don't explicitly set the GC thread count with the above two JVM arguments, the default is derived from the number of CPUs available to the server or container.

-XX:ParallelGCThreads default: On Linux/x86 machines, it is derived from the formula:

if (num of processors <= 8) {
  return num of processors;
} else {
  return 8 + (num of processors - 8) * (5/8);
}

So if your JVM is running on a server with 32 processors, the ParallelGCThreads value will be 23 (i.e., 8 + (32 - 8) * (5/8)).

-XX:ConcGCThreads default: It is derived from the formula:

max((ParallelGCThreads + 2) / 4, 1)

So if your JVM is running on a server with 32 processors:

The ParallelGCThreads value will be 23 (i.e., 8 + (32 - 8) * (5/8)).
The ConcGCThreads value will be 6 (i.e., max(25/4, 1)).

Can a JVM End Up With Too Many GC Threads?
It's possible for your JVM to unintentionally have too many GC threads, often without your awareness, because the default number of GC threads is determined automatically from the number of CPUs in your server or container. For example, on a machine with 128 CPUs, the JVM might allocate around 80 threads for the parallel phase of garbage collection and about 20 threads for the concurrent phase, for a total of roughly 100 GC threads. If you run multiple JVMs on this 128-CPU machine, each JVM could end up with around 100 GC threads, leading to excessive resource usage because all of these threads compete for the same CPU resources. The problem is particularly noticeable in containerized environments where multiple applications share the same CPU cores: the JVM allocates more GC threads than necessary, which can degrade overall performance.
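For illustration, here is a small sketch of the default formulas quoted above, useful for estimating the defaults for a given CPU count; it mirrors the fractional 5/8 scaling described in this article rather than the exact HotSpot source.

Java
public class GcThreadDefaults {
    // Default for -XX:ParallelGCThreads on Linux/x86, per the formula above.
    static int parallelGcThreads(int processors) {
        return processors <= 8 ? processors : (int) (8 + (processors - 8) * 5 / 8.0);
    }

    // Default for -XX:ConcGCThreads, derived from ParallelGCThreads.
    static int concGcThreads(int parallelGcThreads) {
        return Math.max((parallelGcThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        for (int cpus : new int[] {8, 32, 128}) {
            int parallel = parallelGcThreads(cpus);
            System.out.println(cpus + " CPUs -> ParallelGCThreads=" + parallel
                    + ", ConcGCThreads=" + concGcThreads(parallel));
        }
        // Prints roughly: 8 -> 8/2, 32 -> 23/6, 128 -> 83/21, in line with the examples above.
    }
}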
Why Is Having Too Many GC Threads a Problem?
While GC threads are essential for efficient memory management, having too many of them can lead to significant performance challenges in your Java application:

Increased context switching: When the number of GC threads is too high, the operating system must frequently switch between them. This increases context-switching overhead, where more CPU cycles are spent managing threads than executing your application's code, and your application may slow down significantly.
CPU overhead: Each GC thread consumes CPU resources. If too many threads are active simultaneously, they compete for CPU time, leaving less processing power for your application's primary tasks. This competition can degrade performance, especially in environments with limited CPU resources.
Memory contention: With an excessive number of GC threads, contention for memory resources increases. Multiple threads trying to access and modify memory simultaneously can lead to lock contention, which further slows down your application and can cause performance bottlenecks.
Increased GC pause times and lower throughput: When too many GC threads are active, garbage collection can become less efficient, leading to longer GC pauses during which the application is halted. These extended pauses can cause noticeable delays or stutters. And as more time is spent on garbage collection rather than processing requests, overall throughput drops: the application handles fewer transactions or requests per second and struggles to scale and perform under load.
Higher latency: Increased GC activity due to an excessive number of threads leads to higher latency when responding to user requests or processing tasks. This is particularly problematic for applications that require low latency, such as real-time systems or high-frequency trading platforms, where even slight delays have significant consequences.
Diminishing returns: Beyond a certain point, adding more GC threads does not improve performance. The overhead of managing the extra threads outweighs the benefit of faster garbage collection, degrading performance rather than optimizing it.

Why Is Having Too Few GC Threads a Problem?
While having too many GC threads can create performance issues, having too few can be equally problematic for your Java application. Here's why:

Longer garbage collection times: With fewer GC threads, garbage collection may take significantly longer to complete. Since fewer threads are available to handle the workload, the time required to reclaim memory increases, extending GC pause times.
Increased application latency: Longer garbage collection times translate into higher latency, particularly for applications that require low-latency operations. Users may experience delays while the application becomes unresponsive waiting for garbage collection to finish.
Reduced throughput: With fewer GC threads, the garbage collector can't work as efficiently, reducing overall throughput. Your application may process fewer requests or transactions per second, limiting its ability to scale under load.
Inefficient CPU utilization: With too few GC threads, the CPU cores may not be fully utilized during garbage collection: some cores remain idle while others are overburdened.
Increased risk of OutOfMemoryErrors and memory leaks: If the garbage collector cannot keep up with the rate of memory allocation because it has too few threads, it may not reclaim memory quickly enough. This increases the risk of the application running out of memory, resulting in OutOfMemoryErrors and potential crashes. Insufficient GC threads can also exacerbate memory leaks by slowing down collection, allowing more unused objects to accumulate in memory; over time, this leads to excessive memory usage and further degrades application performance.
Solutions To Optimize GC Thread Count
If your application is suffering from performance issues due to an excessive or insufficient number of GC threads, consider setting the GC thread count manually with the JVM arguments mentioned above:

-XX:ParallelGCThreads=n
-XX:ConcGCThreads=n

Before making these changes in production, it's essential to study your application's GC behavior. Start by collecting and analyzing GC logs with appropriate tools. This analysis will help you identify whether the current thread count is causing performance bottlenecks, so you can make informed adjustments to the GC thread count without introducing new issues.

Note: Always test changes in a controlled environment first to confirm that they improve performance before rolling them out to production.

Conclusion
Balancing the number of GC threads is key to keeping your Java application running smoothly. By carefully monitoring and adjusting these settings, you can avoid potential performance issues and keep your application operating efficiently.
Teams often consider external caches when the existing database cannot meet the required service-level agreement (SLA). This is a clearly performance-oriented decision. Putting an external cache in front of the database is commonly used to compensate for subpar latency stemming from various factors, such as inefficient database internals, driver usage, infrastructure choices, traffic spikes, and so on.

Caching might seem like a fast and easy solution because the deployment can be implemented without tremendous hassle and without incurring the significant cost of database scaling, database schema redesign, or even a deeper technology transformation. However, external caches are not as simple as they are often made out to be. In fact, they can be one of the more problematic components of a distributed application architecture. In some cases, they are a necessary evil, such as when you require frequent access to transformed data resulting from long and expensive computations and you've tried all the other means of reducing latency. But in many cases, the performance boost just isn't worth it: you solve one problem but create others. Here are some often-overlooked risks related to external caches, and how teams achieve a performance boost plus cost savings by replacing the combination of core database and external cache.

Why Not Cache?
We've worked with countless teams struggling with the costs, hassles, and limits of traditional attempts to improve database performance. Here are the top struggles we've seen teams experience after putting an external cache in front of their database.

An External Cache Adds Latency
A separate cache means another hop on the way. When a cache surrounds the database, the first access occurs at the cache layer. If the data isn't in the cache, the request is sent to the database, adding latency to an already slow path of uncached data. One may claim that when the entire data set fits in the cache, the additional latency doesn't come into play. However, unless your data set is considerably small, storing it entirely in memory considerably magnifies costs and is thus prohibitively expensive for most organizations.

An External Cache Is an Additional Cost
Caching means expensive DRAM, which translates to a higher cost per gigabyte than solid-state disks (see this P99 CONF talk by Grafana's Danny Kopping for more details). Rather than provisioning an entirely separate infrastructure for caching, it is often better to use the existing database memory, and even increase it, for internal caching. Modern database caches can be just as efficient as traditional in-memory caching solutions when sized correctly. When the working set size is too large to fit in memory, databases often shine at optimizing I/O access to flash storage, making the database alone (no external cache) a preferred and cheaper option.

External Caching Decreases Availability
No cache's high-availability solution can match that of the database itself. Modern distributed databases have multiple replicas; they are also topology-aware and speed-aware and can sustain multiple failures without data loss. For example, a common replication pattern is three local replicas, which generally allows reads to be balanced across those replicas to make efficient use of the database's internal caching mechanism. Consider a nine-node cluster with a replication factor of three: essentially, every node will hold roughly a third of your total data set.
As requests are balanced among different replicas, this grants you more room for caching your data, which could completely eliminate the need for an external cache. Conversely, if an external cache happens to invalidate entries right before a surge of cold requests, availability could be impeded for a while, since the database won't have that data in its internal cache (more on this below).

Caches often lack high-availability properties and can easily fail or invalidate records depending on their heuristics. Partial failures, which are more common, are even worse in terms of consistency. When the cache inevitably fails, the database will get hit by the unmitigated firehose of queries and will likely wreck your SLAs. In addition, even if a cache itself has some high-availability features, it can't coordinate handling such a failure with the persistent database it sits in front of. The bottom line: rely on the database, rather than making your latency SLAs dependent on a cache.

Application Complexity: Your Application Needs to Handle More Cases
External caches introduce application and operational complexity. Once you have an external cache, it is your responsibility to keep the cache up to date with the database. Irrespective of your caching strategy (such as write-through, cache-aside, etc.), there will be edge cases where your cache can run out of sync with your database, and you must account for these during application development. Your client settings (such as failover, retry, and timeout policies) need to match the properties of both the cache and your database to function when the cache is unavailable or goes cold. Such scenarios are usually hard to test and implement.

External Caching Ruins the Database Caching
Modern databases have embedded caches and complex policies to manage them. When you place a cache in front of the database, most read requests reach only the external cache, and the database won't keep these objects in its memory. As a result, the database cache is rendered ineffective. When requests eventually reach the database, its cache will be cold and the responses will come primarily from disk. As a result, the round trip from the cache to the database and then back to the application is likely to add latency.

External Caching Might Increase Security Risks
An external cache adds a whole new attack surface to your infrastructure. Encryption, isolation, and access control for data placed in the cache are likely to differ from those at the database layer itself.

External Caching Ignores the Database Knowledge and Database Resources
Databases are quite complex and built for specialized I/O workloads. Many queries access the same data, and some amount of the working set can be cached in memory to save disk accesses. A good database has sophisticated logic to decide which objects, indexes, and accesses it should cache, as well as eviction policies that determine when new data should replace existing (older) cached objects. An example is scan-resistant caching: when scanning a large data set, say a large range or a full-table scan, a lot of objects are read from disk. The database can recognize this as a scan (not a regular query) and choose to leave these objects out of its internal cache. However, an external cache (following a read-through strategy) would treat the result set like any other and attempt to cache the results.
The database automatically synchronizes the contents of its cache with the disk according to the incoming request rate, so neither the user nor the developer needs to do anything to make sure that lookups to recently written data are performant and consistent. Therefore, if for some reason your database doesn't respond fast enough, it means that:

The cache is misconfigured
It doesn't have enough RAM for caching
The working set size and request pattern don't fit the cache
The database cache implementation is poor

A Better Option: Let the Database Handle It
How can you meet your SLAs without the risks of external database caches? Many teams have found that by moving to a faster database with a specialized internal cache, they're able to meet their latency SLAs with less hassle and lower costs. Although external caches are a great companion for reducing latencies (such as serving static content and personalization data not requiring any level of durability), they often introduce more problems than benefits when placed in front of a database. The top tradeoffs include elevated costs, increased application complexity, additional round trips to your database, and an additional security surface area. By rethinking your existing caching strategy and switching to a modern database providing predictable low latencies at scale, teams can simplify their infrastructure and minimize costs. At the same time, they can still meet their SLAs without the extra hassles and complexities introduced by external caches.
Numerous AI projects launched with promise never set sail. This is not usually because of the quality of the machine learning (ML) models: poor implementation and system integration sink 90% of projects. Organizations can save their AI endeavors by adopting adequate MLOps practices and choosing the right set of tools. This article discusses MLOps practices and tools that can save sinking AI projects and strengthen robust ones, potentially doubling project launch speed.

MLOps in a Nutshell
MLOps is a mix of machine learning application development (Dev) and operational activities (Ops). It is a set of practices that helps automate and streamline the deployment of ML models, standardizing the entire ML lifecycle. MLOps is complex: it requires harmony between data management, model development, and operations, and it may also require shifts in technology and culture within an organization. Adopted smoothly, MLOps allows professionals to automate tedious tasks, such as data labeling, and make deployment processes transparent. It helps ensure that project data is secure and compliant with data privacy laws. Through MLOps practices, organizations enhance and scale their ML systems, make collaboration between data scientists and engineers more effective, and foster innovation.

Weaving AI Projects From Challenges
MLOps professionals transform raw business challenges into streamlined, measurable machine learning goals. They design and manage ML pipelines, ensuring thorough testing and accountability throughout an AI project's lifecycle. In the initial phase of an AI project, called use case discovery, data scientists work with the business to define the problem. They translate it into an ML problem statement and set clear objectives and KPIs.

MLOps framework

Next, data scientists team up with data engineers. They gather data from various sources, and then clean, process, and validate this data. When the data is ready for modeling, data scientists design and deploy robust ML pipelines integrated with CI/CD processes. These pipelines support testing and experimentation and help track data, model lineage, and associated KPIs across all experiments. In the production deployment stage, ML models are deployed in the chosen environment: cloud, on-premises, or hybrid. Data scientists monitor the models and infrastructure, using key metrics to spot changes in data or model performance. When they detect changes, they update the algorithms, data, and hyperparameters, creating new versions of the ML pipelines. They also manage memory and computing resources to keep models scalable and running smoothly.

MLOps Tools Meet AI Projects
Picture a data scientist developing an AI application to enhance a client's product design process. This solution will accelerate the prototyping phase by providing AI-generated design alternatives based on specified parameters. Data scientists navigate diverse tasks, from designing the framework to monitoring the AI model in real time. They need the right tools and a grasp of how to use them at every step.

Better LLM Performance, Smarter AI Apps
At the core of an accurate and adaptable AI solution are vector databases and these key tools to boost LLM performance:

Guardrails is an open-source Python package that helps data scientists add structure, type, and quality checks to LLM outputs. It automatically handles errors and takes actions, like re-querying the LLM, if validation fails.
It also enforces guarantees on output structure and types, such as JSON.
Data scientists need a tool for efficient indexing, searching, and analyzing of large datasets. This is where LlamaIndex steps in. The framework provides powerful capabilities to manage and extract insights from extensive information repositories.
The DUST framework allows LLM-powered applications to be created and deployed without execution code. It helps with the introspection of model outputs, supports iterative design improvements, and tracks different solution versions.

Track Experiments and Manage Model Metadata
Data scientists experiment to better understand and improve ML models over time. They need tools to set up a system that enhances model accuracy and efficiency based on real-world results.

MLflow is an open-source powerhouse, useful for overseeing the entire ML lifecycle. It provides features like experiment tracking, model versioning, and deployment capabilities. This suite lets data scientists log and compare experiments, monitor metrics, and keep ML models and artifacts organized.
Comet ML is a platform for tracking, comparing, explaining, and optimizing ML models and experiments. Data scientists can use Comet ML with Scikit-learn, PyTorch, TensorFlow, or HuggingFace; it provides insights to improve ML models.
Amazon SageMaker covers the entire machine learning lifecycle. It helps label and prepare data, as well as build, train, and deploy complex ML models. Using this tool, data scientists quickly deploy and scale models across various environments.
Microsoft Azure ML is a cloud-based platform that helps streamline machine learning workflows. It supports frameworks like TensorFlow and PyTorch, and it can also integrate with other Azure services. This tool helps data scientists with experiment tracking, model management, and deployment.
DVC (Data Version Control) is an open-source tool meant to handle large data sets and machine learning experiments. It makes data science workflows more agile, reproducible, and collaborative. DVC works with existing version control systems like Git, simplifying how data scientists track changes and share progress on complex AI projects.

Optimize and Manage ML Workflows
Data scientists need optimized workflows to achieve smoother and more effective processes on AI projects. The following tools can assist:

Prefect is a modern open-source tool that data scientists use to monitor and orchestrate workflows. Lightweight and flexible, it offers options to manage ML pipelines (Prefect Orion UI and Prefect Cloud).
Metaflow is a powerful workflow management tool built for data science and machine learning. It lets data scientists focus on model development without the hassle of MLOps complexities.
Kedro is a Python-based tool that helps data scientists keep a project reproducible, modular, and easy to maintain. It applies key software engineering principles to machine learning (modularity, separation of concerns, and versioning), helping data scientists build efficient, scalable projects.

Manage Data and Control Pipeline Versions
ML workflows need precise data management and pipeline integrity. With the right tools, data scientists stay on top of these tasks and handle even the most complex data challenges with confidence.

Pachyderm helps data scientists automate data transformation and offers robust features for data versioning, lineage, and end-to-end pipelines. These features run seamlessly on Kubernetes.
Pachyderm supports integration with various data types: images, logs, videos, CSVs, and multiple languages (Python, R, SQL, and C/C++). It scales to handle petabytes of data and thousands of jobs.LakeFS is an open-source tool designed for scalability. It adds Git-like version control to object storage and supports data version control on an exabyte scale. This tool is ideal for handling extensive data lakes. Data scientists use this tool to manage data lakes with the same ease as they handle code. Test ML Models for Quality and Fairness Data scientists focus on developing more reliable and fair ML solutions. They test models to minimize biases. The right tools help them assess key metrics, like accuracy and AUC, support error analysis and version comparison, document processes, and integrate seamlessly into ML pipelines. Deepchecks is a Python package that assists with ML models and data validation. It also eases model performance checks, data integrity, and distribution mismatches.Truera is a modern model intelligence platform that helps data scientists increase trust and transparency in ML models. Using this tool, they can understand model behavior, identify issues, and reduce biases. Truera provides features for model debugging, explainability, and fairness assessment.Kolena is a platform that enhances team alignment and trust through rigorous testing and debugging. It provides an online environment for logging results and insights. Its focus is on ML unit testing and validation at scale, which is key to consistent model performance across different scenarios. Bring Models to Life Data scientists need reliable tools to efficiently deploy ML models and serve predictions reliably. The following tools help them achieve smooth and scalable ML operations: BentoML is an open platform that helps data scientists handle ML operations in production. It helps streamline model packaging and optimize serving workloads for efficiency. It also assists with faster setup, deployment, and monitoring of prediction services.Kubeflow simplifies deploying ML models on Kubernetes (locally, on-premises, or in the cloud). With this tool, the entire process becomes straightforward, portable, and scalable. It supports everything from data preparation to prediction serving. Simplify the ML Lifecycle With End-To-End MLOps Platforms End-to-end MLOps platforms are essential for optimizing the machine learning lifecycle, offering a streamlined approach to developing, deploying, and managing ML models effectively. Here are some leading platforms in this space: Amazon SageMaker offers a comprehensive interface that helps data scientists handle the entire ML lifecycle. It streamlines data preprocessing, model training, and experimentation, enhancing collaboration among data scientists. With features like built-in algorithms, automated model tuning, and tight integration with AWS services, SageMaker is a top pick for developing and deploying scalable machine learning solutions.Microsoft Azure ML Platform creates a collaborative environment that supports various programming languages and frameworks. It allows data scientists to use pre-built models, automate ML tasks, and seamlessly integrate with other Azure services, making it an efficient and scalable choice for cloud-based ML projects.Google Cloud Vertex AI provides a seamless environment for both automated model development with AutoML and custom model training using popular frameworks. 
Integrated tools and easy access to Google Cloud services make Vertex AI well suited to simplifying the ML process, helping data science teams build and deploy models at scale with far less effort.

Signing Off

MLOps is not just another hype cycle. It is a significant discipline that helps professionals train and analyze large volumes of data more quickly, accurately, and easily. We can only guess how the field will evolve over the next ten years, but it is clear that AI, big data, and automation are just beginning to gain momentum.
Efficient data synchronization is crucial in high-performance computing and multi-threaded applications. This article explores an optimization technique for scenarios where frequent writes to a container occur in a multi-threaded environment. We’ll examine the challenges of traditional synchronization methods and present an advanced approach that significantly improves performance for write-heavy environments. The method in question is beneficial because it is easy to implement and versatile, unlike pre-optimized containers that may be platform-specific, require special data types, or bring additional library dependencies. Traditional Approaches and Their Limitations Imagine a scenario where we have a cache of user transactions: C++ struct TransactionData { long transactionId; long userId; unsigned long date; double amount; int type; std::string description; }; std::map<long, std::vector<TransactionData>> transactionCache; // key - userId In a multi-threaded environment, we need to synchronize access to this cache. The traditional approach might involve using a mutex: C++ class SimpleSynchronizedCache { public: void write(const TransactionData&& transaction) { std::lock_guard<std::mutex> lock(cacheMutex); transactionCache[transaction.userId].push_back(transaction); } std::vector<TransactionData> read(const long&& userId) { std::lock_guard<std::mutex> lock(cacheMutex); try { return transactionCache.at(userId); } catch (const std::out_of_range& ex) { return std::vector<TransactionData>(); } } std::vector<TransactionData> pop(const long& userId) { std::lock_guard<std::mutex> lock(_cacheMutex); auto userNode = _transactionCache.extract(userId); return userNode.empty() ? std::vector<TransactionData>() : std::move(userNode.mapped()); } private: std::map<int, std::vector<TransactionData>> transactionCache; std::mutex cacheMutex; }; As system load increases, especially with frequent reads, we might consider using a shared_mutex: C++ class CacheWithSharedMutex { public: void write(const TransactionData&& transaction) { std::lock_guard<std::shared_mutex> lock(cacheMutex); transactionCache[transaction.userId].push_back(transaction); } std::vector<TransactionData> read(const long&& userId) { std::shared_lock<std::shared_mutex> lock(cacheMutex); try { return transactionCache.at(userId); } catch (const std::out_of_range& ex) { return std::vector<TransactionData>(); } } std::vector<TransactionData> pop(const long& userId) { std::lock_guard<std::shared_mutex> lock(_cacheMutex); auto userNode = _transactionCache.extract(userId); return userNode.empty() ? std::vector<TransactionData>() : std::move(userNode.mapped()); } private: std::map<int, std::vector<TransactionData>> transactionCache; std::shared_mutex cacheMutex; }; However, when the load is primarily generated by writes rather than reads, the advantage of a shared_mutex over a regular mutex becomes minimal. The lock will often be acquired exclusively, negating the benefits of shared access. Moreover, let’s imagine that we don’t use read() at all — instead, we frequently write incoming transactions and periodically flush the accumulated transaction vectors using pop(). As pop() involves reading with extraction, both write() and pop() operations would modify the cache, necessitating exclusive access rather than shared access. 
Thus, the shared_lock becomes entirely useless in terms of optimization over a regular mutex, and maybe even performs worse — its more intricate implementation is now used for the same exclusive locks that a faster regular mutex provides. Clearly, we need something else. Optimizing Synchronization With the Sharding Approach Given the following conditions: A multi-threaded environment with a shared containerFrequent modification of the container from different threadsObjects in the container can be divided for parallel processing by some member variable. Regarding point 3, in our cache, transactions from different users can be processed independently. While creating a mutex for each user might seem ideal, it would lead to excessive overhead in maintaining so many locks. Instead, we can divide our cache into a fixed number of chunks based on the user ID, in a process known as sharding. This approach reduces the overhead and yet allows the parallel processing, thereby optimizing performance in a multi-threaded environment. C++ class ShardedCache { public: ShardedCache(size_t shardSize): _shardSize(shardSize), _transactionCaches(shardSize) { std::generate( _transactionCaches.begin(), _transactionCaches.end(), []() { return std::make_unique<SimpleSynchronizedCache>(); }); } void write(const TransactionData& transaction) { _transactionCaches[transaction.userId % _shardSize]->write(transaction); } std::vector<TransactionData> read(const long& userId) { _transactionCaches[userId % _shardSize]->read(userId); } std::vector<TransactionData> pop(const long& userId) { return std::move(_transactionCaches[userId % _shardSize]->pop(userId)); } private: const size_t _shardSize; std::vector<std::unique_ptr<SimpleSynchronizedCache>> _transactionCaches; }; This approach allows for finer-grained locking without the overhead of maintaining an excessive number of mutexes. The division can be adjusted based on system architecture specifics, such as size of a thread pool that works with the cache, or hardware concurrency. Let’s run tests where we check how sharding accelerates cache performance by testing different partition sizes. Performance Comparison In these tests, we aim to do more than just measure the maximum number of operations the processor can handle. We want to observe how the cache behaves under conditions that closely resemble real-world scenarios, where transactions occur randomly. Our optimization goal is to minimize the processing time for these transactions, which enhances system responsiveness in practical applications. The implementation and tests are available in the GitHub repository. C++ #include <thread> #include <functional> #include <condition_variable> #include <random> #include <chrono> #include <iostream> #include <fstream> #include <array> #include "SynchronizedContainers.h" const auto hardware_concurrency = (size_t)std::thread::hardware_concurrency(); class TaskPool { public: template <typename Callable> TaskPool(size_t poolSize, Callable task) { for (auto i = 0; i < poolSize; ++i) { _workers.emplace_back(task); } } ~TaskPool() { for (auto& worker : _workers) { if (worker.joinable()) worker.join(); } } private: std::vector<std::thread> _workers; }; template <typename CacheImpl> class Test { public: template <typename CacheImpl = ShardedCache, typename ... CacheArgs> Test(const int testrunsNum, const size_t writeWorkersNum, const size_t popWorkersNum, const std::string& resultsFile, CacheArgs&& ... 
cacheArgs) : _cache(std::forward<CacheArgs>(cacheArgs)...), _writeWorkersNum(writeWorkersNum), _popWorkersNum(popWorkersNum), _resultsFile(resultsFile), _testrunsNum(testrunsNum), _testStarted (false) { std::random_device rd; _randomGenerator = std::mt19937(rd()); } void run() { for (auto i = 0; i < _testrunsNum; ++i) { runSingleTest(); logResults(); } } private: void runSingleTest() { { std::lock_guard<std::mutex> lock(_testStartSync); _testStarted = false; } // these pools won’t just fire as many operations as they can, // but will emulate real-time occuring requests to the cache in multithreaded environment auto writeTestPool = TaskPool(_writeWorkersNum, std::bind(&Test::writeTransactions, this)); auto popTestPool = TaskPool(_popWorkersNum, std::bind(&Test::popTransactions, this)); _writeTime = 0; _writeOpNum = 0; _popTime = 0; _popOpNum = 0; { std::lock_guard<std::mutex> lock(_testStartSync); _testStarted = true; _testStartCv.notify_all(); } } void logResults() { std::cout << "===============================================" << std::endl; std::cout << "Writing operations number per sec:\t" << _writeOpNum / 60. << std::endl; std::cout << "Writing operations avg time (mcsec):\t" << (double)_writeTime / _writeOpNum << std::endl; std::cout << "Pop operations number per sec: \t" << _popOpNum / 60. << std::endl; std::cout << "Pop operations avg time (mcsec): \t" << (double)_popTime / _popOpNum << std::endl; std::ofstream resultsFilestream; resultsFilestream.open(_resultsFile, std::ios_base::app); resultsFilestream << _writeOpNum / 60. << "," << (double)_writeTime / _writeOpNum << "," << _popOpNum / 60. << "," << (double)_popTime / _popOpNum << std::endl; std::cout << "Results saved to file " << _resultsFile << std::endl; } void writeTransactions() { { std::unique_lock<std::mutex> lock(_testStartSync); _testStartCv.wait(lock, [this] { return _testStarted; }); } std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now(); // hypothetical system has around 100k currently active users std::uniform_int_distribution<> userDistribution(1, 100000); // delay up to 5 ms for every thread not to start simultaneously std::uniform_int_distribution<> waitTimeDistribution(0, 5000); std::this_thread::sleep_for(std::chrono::microseconds(waitTimeDistribution(_randomGenerator))); for ( auto iterationStart = std::chrono::steady_clock::now(); iterationStart - start < std::chrono::minutes(1); iterationStart = std::chrono::steady_clock::now()) { auto generatedUser = userDistribution(_randomGenerator); TransactionData dummyTransaction = { 5477311, generatedUser, 1824507435, 8055.05, 0, "regular transaction by " + std::to_string(generatedUser)}; std::chrono::steady_clock::time_point operationStart = std::chrono::steady_clock::now(); _cache.write(dummyTransaction); std::chrono::steady_clock::time_point operationEnd = std::chrono::steady_clock::now(); ++_writeOpNum; _writeTime += std::chrono::duration_cast<std::chrono::microseconds>(operationEnd - operationStart).count(); // make span between iterations at least 5ms std::this_thread::sleep_for(iterationStart + std::chrono::milliseconds(5) - std::chrono::steady_clock::now()); } } void popTransactions() { { std::unique_lock<std::mutex> lock(_testStartSync); _testStartCv.wait(lock, [this] { return _testStarted; }); } std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now(); // hypothetical system has around 100k currently active users std::uniform_int_distribution<> userDistribution(1, 100000); // delay up to 100 ms for every 
thread not to start simultaneously std::uniform_int_distribution<> waitTimeDistribution(0, 100000); std::this_thread::sleep_for(std::chrono::microseconds(waitTimeDistribution(_randomGenerator))); for ( auto iterationStart = std::chrono::steady_clock::now(); iterationStart - start < std::chrono::minutes(1); iterationStart = std::chrono::steady_clock::now()) { auto requestedUser = userDistribution(_randomGenerator); std::chrono::steady_clock::time_point operationStart = std::chrono::steady_clock::now(); auto userTransactions = _cache.pop(requestedUser); std::chrono::steady_clock::time_point operationEnd = std::chrono::steady_clock::now(); ++_popOpNum; _popTime += std::chrono::duration_cast<std::chrono::microseconds>(operationEnd - operationStart).count(); // make span between iterations at least 100ms std::this_thread::sleep_for(iterationStart + std::chrono::milliseconds(100) - std::chrono::steady_clock::now()); } } CacheImpl _cache; std::atomic<long> _writeTime; std::atomic<long> _writeOpNum; std::atomic<long> _popTime; std::atomic<long> _popOpNum; size_t _writeWorkersNum; size_t _popWorkersNum; std::string _resultsFile; int _testrunsNum; bool _testStarted; std::mutex _testStartSync; std::condition_variable _testStartCv; std::mt19937 _randomGenerator; }; void testCaches(const size_t testedShardSize, const size_t workersNum) { if (testedShardSize == 1) { auto simpleImplTest = Test<SimpleSynchronizedCache>( 10, workersNum, workersNum, "simple_cache_tests(" + std::to_string(workersNum) + "_workers).csv"); simpleImplTest.run(); } else { auto shardedImpl4Test = Test<ShardedCache>( 10, workersNum, workersNum, "sharded_cache_" + std::to_string(testedShardSize) + "_tests(" + std::to_string(workersNum) + "_workers).csv", 4); shardedImpl4Test.run(); } } int main() { std::cout << "Hardware concurrency: " << hardware_concurrency << std::endl; std::array<size_t, 7> testPlan = { 1, 4, 8, 32, 128, 4096, 100000 }; for (auto i = 0; i < testPlan.size(); ++i) { testCaches(testPlan[i], 4 * hardware_concurrency); } // additional tests with diminished load to show limits of optimization advantage std::array<size_t, 4> additionalTestPlan = { 1, 8, 128, 100000 }; for (auto i = 0; i < additionalTestPlan.size(); ++i) { testCaches(additionalTestPlan[i], hardware_concurrency); } } We observe that with 2,000 writes and 300 pops per second (with a concurrency of 8) — which are not very high numbers for a high-load system — optimization using sharding significantly accelerates cache performance, by orders of magnitude. However, evaluating the significance of this difference is left to the reader, as, in both scenarios, operations took less than a millisecond. It’s important to note that the tests used a relatively lightweight data structure for transactions, and synchronization was applied only to the container itself. In real-world scenarios, data is often more complex and larger, and synchronized processing may require additional computations and access to other data, which can significantly increase the time of operation itself. Therefore, we aim to spend as little time on synchronization as possible. The tests do not show the significant difference in processing time when increasing the shard size. The greater the size the bigger is the maintaining overhead, so how low should we go? I suspect that the minimal effective value is tied to the system's concurrency, so for modern server machines with much greater concurrency than my home PC, a shard size that is too small won’t yield the most optimal results. 
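If the sweet spot really is tied to the machine's concurrency, one way to pick a starting shard count is to derive it from std::thread::hardware_concurrency() rather than hard-coding it. The snippet below is only a sketch under that assumption: the multiplier of 4 mirrors the worker multiplier used in the test program above and is not a measured optimum, so validate it against measurements like the ones in this article.

C++
// Hypothetical helper: derive the shard count from the hardware concurrency
// instead of hard-coding it. The multiplier of 4 mirrors the worker multiplier
// used in the test program above; it is an assumption, not a measured optimum.
#include <algorithm>
#include <cstddef>
#include <thread>

std::size_t defaultShardSize() {
    const std::size_t concurrency =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    return 4 * concurrency; // e.g., 32 shards on an 8-thread machine
}

// Usage: ShardedCache cache(defaultShardSize());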
I would love to see the results on other machines with different concurrency that may confirm or disprove this hypothesis, but for now I assume it is optimal to use a shard size that is several times larger than the concurrency. You can also note that the largest size tested — 100,000 — effectively matches the mentioned earlier approach of assigning a mutex to each user (in the tests, user IDs were generated within the range of 100,000). As can be seen, this did not provide any advantage in processing speed, and this approach is obviously more demanding in terms of memory. Limitations and Considerations So, we determined an optimal shard size, but this is not the only thing that should be considered for the best results. It’s important to remember that such a difference compared to a simple implementation exists only because we are attempting to perform a sufficiently large number of transactions at the same time, causing a “queue” to build up. If the system’s concurrency and the speed of each operation (within the mutex lock) allow operations to be processed without bottlenecks, the effectiveness of sharding optimization decreases. To demonstrate this, let’s look at the test results with reduced load — at 500 writes and 75 pops (with a concurrency of 8) — the difference is still present, but it is no longer as significant. This is yet another reminder that premature optimizations can complicate code without significantly impacting results. It’s crucial to understand the application requirements and expected load. Also, it’s important to note that the effectiveness of sharding heavily depends on the distribution of values of the chosen key (in this case, user ID). If the distribution becomes heavily skewed, we may revert to performance more similar to that of a single mutex — imagine all of the transactions coming from a single user. Conclusion In scenarios with frequent writes to a container in a multi-threaded environment, traditional synchronization methods can become a bottleneck. By leveraging the ability of parallel processing of data and predictable distribution by some specific key and implementing a sharded synchronization approach, we can significantly improve performance without sacrificing thread safety. This technique can prove itself effective for systems dealing with user-specific data, such as transaction processing systems, user session caches, or any scenario where data can be logically partitioned based on a key attribute. As with any optimization, it’s crucial to profile your specific use case and adjust the implementation accordingly. The approach presented here provides a starting point for tackling synchronization challenges in write-heavy, multi-threaded applications. Remember, the goal of optimization is not just to make things faster, but to make them more efficient and scalable. By thinking critically about your data access patterns and leveraging the inherent structure of your data, you can often find innovative solutions to performance bottlenecks.
An architectural pattern named Event Sourcing is gaining more and more recognition from developers who aim for strong and scalable systems. In this article, we’ll take a closer look at the concept of it: what it entails, its benefits, general flow, and key concepts. Moreover, we will discuss how to implement ES — some details on the technologies that make adoption easy. This article is aimed at software architects, system developers, and project managers who might be contemplating or are already engaged in integrating Event Sourcing into their systems. What Is Event Sourcing? The basic idea of Event Sourcing is to store the history of changes in an application state as a sequence of events. These events can't be changed once stored in an immutable log, which means that the system's current state can be recovered by playing back these events. The idea is different from what happens in typical systems where the state itself is kept and not how it came to be. It is important to also outline the most fundamental principles of ES: State as a sequence of events: Store changes as a series of immutable events rather than the current stateEvent immutability: Events cannot be changed or deleted upon creationEvent replay: Reconstruct the current state or a view by replaying events from the beginning Event Sourcing has a number of advantages that make the system strong and adaptable. One key advantage is that it ensures data integrity and keeps everything well-organized with a clear audit trail, making it easy to trace any changes. This ensures that every change made is unchangeable, hence providing an elaborate history of changes made, which makes it valuable during system failures. The other benefit includes easy horizontal scaling where the load can be distributed to multiple parallel handlers of the events thus complemented by architectural flexibility. Furthermore, an introduction to Event Sourcing would be incomplete without mentioning its perfect fit in systems needing an exhaustive audit trail such as fintech projects. Key Terms in Event Sourcing Command: An action request (imperative verb) performed by a user or external actor (external system).Event: The result of a command (verb in the past tense).Aggregate Root: A key concept from Domain-Driven Design (DDD) representing the main entity in a group of closely related entities (aggregate). This entity handles commands and generates events based on the state of the aggregate (command, aggregateState) => events[].AggregateState: The state of the aggregate. It doesn't need to contain all data, only the data required for validation and decision-making on which events to generate.Saga: An entity opposite to an aggregate - it listens to events and generates commands. A saga is typically a state machine and often requires persistence for its state.View: An entity that listens to events and performs side effects, such as creating a database table with data optimized for fast frontend access.Query: A data read request to the aggregateState or to the data produced in a view. General Flow The process of updating an event-sourced entity can be represented in pseudocode: (0): Restoring state from snapshot or by replaying events from the beginning.(1): Transform a command into events or an error based on the current state.(2): Persist events in the log.(3): Apply each event sequentially to the state generating a new state.(4): Persist the new state snapshot for quick access.(5): Respond to the caller with a report on the completed work. 
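To make the numbered flow above concrete, here is a minimal, hypothetical C++ sketch of a tiny event-sourced aggregate. The deposit/account domain and the names DepositMoney, AmountDeposited, decide, apply, and replay are invented for illustration and belong to no particular framework; snapshotting (step 4) and the asynchronous views and sagas are left out for brevity.

C++
// Minimal, illustrative event-sourcing flow (not a framework): a command is
// decided against the current state, the resulting events are appended to an
// immutable log, and the state is rebuilt by applying/replaying those events.
#include <iostream>
#include <stdexcept>
#include <vector>

// Command: an action request (imperative).
struct DepositMoney { double amount; };

// Event: the result of a command (past tense).
struct AmountDeposited { double amount; };

// AggregateState: only what is needed for validation and decision-making.
struct AccountState { double balance = 0.0; };

// Aggregate root: (command, state) => events[] or an error.
std::vector<AmountDeposited> decide(const AccountState& state, const DepositMoney& cmd) {
    (void)state; // validation could also inspect the current state
    if (cmd.amount <= 0) throw std::invalid_argument("deposit must be positive");
    return { AmountDeposited{ cmd.amount } };
}

// Apply: event => new state (a pure function, no side effects).
AccountState apply(AccountState state, const AmountDeposited& event) {
    state.balance += event.amount;
    return state;
}

// Event replay: reconstruct the current state from the immutable log.
AccountState replay(const std::vector<AmountDeposited>& log) {
    AccountState state;
    for (const auto& event : log) state = apply(state, event);
    return state;
}

int main() {
    std::vector<AmountDeposited> eventLog;                    // stand-in for the event store

    AccountState state = replay(eventLog);                    // step (0)
    auto newEvents = decide(state, DepositMoney{ 100.0 });    // step (1)
    eventLog.insert(eventLog.end(), newEvents.begin(), newEvents.end()); // step (2)
    for (const auto& e : newEvents) state = apply(state, e);  // step (3)
    // step (4), snapshotting, is omitted here
    std::cout << "balance: " << state.balance << "\n";        // step (5): report back
}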
Additionally, events are applied asynchronously to views and sagas, enabling flexible, decoupled architectures that can evolve over time. In the image below, you can see the typical data flow for event-sourced applications when updating data by sending commands to the aggregate. For reading data, you can send a query to the aggregate for a strongly consistent but potentially slower result, or send a query to the view for an optimized but usually eventually consistent result. Benefits System state recovery allows the restoration of the system state at any given point (perhaps even before an unfortunate failure).Audit and traceability ensure that a detailed history of all changes made is provided for further analysis.Scalability can be achieved through horizontal scaling of event processing. Flexibility: Easy adaptation to new requirements and changes. Views can be rebuilt from scratch, and aggregate states can be reconstructed from scratch. Challenges On the other hand, Event Sourcing brings along with it several challenges that need to be managed with finesse for its effective implementation. Designing correct events and aggregates: Mistakes in the design of aggregates can be costly. The need for Event Storming and the adaptation of DDD modeling arises to find good boundaries.Storage requirements: Events can accumulate over time. Solution: Consider databases — NoSQL stores or specialized solutions for the right event storage.Performance issues: Optimizing event store queries.Ensuring consistency: Since views are updated eventually, implementing eventual consistency in the UI can be challenging, requiring careful management of delays to ensure they do not disrupt the user experience.Testing: Developing tests for systems based on ES.Single-writer pattern implementation in a cluster or multi-datacenter: Implementing ES in a cluster can be challenging because you need to create transactional boundaries to make sure only one node must write events (step (2) in the example) for a single entity at a time. ES in a multi-datacenter is even more challenging, often requiring modeling of additional commands and events for explicit cross-data center synchronization processes. Where To Use Event Sourcing Event Sourcing is a powerful architectural pattern that excels in specific scenarios: Complex Business Domains If there are lots of state changes, coupled with convoluted business logic, Event Sourcing creates a perfect audit trail. This is mostly applicable to cases of compliance, and even more to financial services (including banking and trading systems) as well as healthcare (comprising patient records and treatment histories). Auditing and Compliance The immutable event log of Event Sourcing is highly beneficial in industries that have strict regulatory requirements because every change is documented. This practice finds its primary applications in insurance (for keeping records of claim processing and policy changes) as well as government (for legislative processes and public records management). Debugging and Troubleshooting A key benefit of Event Sourcing is that it allows the recreation of past states by replaying events— this can help developers greatly with error analysis as well as testing. It lets you trace back through the sequence of events that led up to any particular issue, enabling root cause identification or even running simulations based on an event playback. 
Event-Driven Architectures

Event-driven architectures work hand in hand with Event Sourcing: events provide a natural mechanism for synchronizing state between microservices and for powering real-time applications with live updates and collaboration features, keeping the whole system performant as demand grows.

Scalability and Performance

Decoupling the read and write models optimizes performance, making Event Sourcing well suited to high-volume systems like e-commerce platforms and social networks, as well as to data analysis that runs complex queries over historical data.

Technologies for Implementation

Implementing Event Sourcing properly calls for a set of technologies matched to the different demands of the architecture. For event stream persistence, EventStoreDB can be recommended without hesitation: it is purpose-built for Event Sourcing and ships with tools covering storage, retrieval, and real-time event handling, all of which help sustain the performance and consistency of event-driven systems. Beyond specialized storage, two frameworks play an important role here: Akka and Pekko. They help manage the sophisticated flow of asynchronous data, providing the resilience (the ability to remain operational) and scalability (the ability to handle increased load) that the dynamic environment of an event-driven architecture demands. For organizations with unique needs that generic solutions cannot address, a custom Event Sourcing implementation offers the required adaptability. Such bespoke systems still have to be built, and only a solid grasp of ES principles will keep them clear of typical traps like performance bottlenecks and data consistency problems. Be ready for a heavy investment in development and testing if you are going to create your own Event Sourcing system, and make sure the result is not only robust but also scalable.

Conclusion

Event Sourcing, implemented correctly, greatly improves the quality and traceability of applications and increases their reliability and scalability. It demands a thorough understanding of its concepts and careful planning, but it offers substantial long-term benefits. If you are building systems that must meet not only current needs but also future changes and challenges, consider embracing Event Sourcing.
Istio is the most popular service mesh, but the DevOps and SRE community constantly complains about its performance. Istio Ambient is a sidecar-less approach from the Istio project (driven largely by SOLO.io) to improve performance. With so much promotion of Ambient mesh being production-ready, many of our prospects and enterprise customers are eager to try it or migrate to it. Architecturally, Istio Ambient mesh is a great design for improving performance, but whether it actually delivers lower latency is still an open question. We tried Istio Ambient mesh and measured its performance repeatedly between January 2024 and July 2024, and we have yet to see any significant performance gains. Below is the lab setup on which we ran our experiments.

Lab Setup to Load Test Istio Ambient Mesh

Load testing tool: Fortio
Application configuration: Bookinfo application
Total requests fired: 1,000 queries/second (QPS), 10 connections, for 30 seconds
Cluster configuration: Azure (AKS) clusters with 3 nodes
Node configuration: 2 vCPUs and 7 GB memory per node
CNI used: Kube CNI and Cilium (We did not use Flannel because it was not working well with AKS.)

Note: We kept all the applications and Fortio on different nodes. We exposed the Rating microservice, and NOT the Details service, to handle external traffic. Because the Details microservice is written in Ruby, it is unfit for handling higher QPS: we sent loads of 100 QPS and 1,000 QPS to the Details service without Istio, and the p99 latency was around 6 ms at 100 QPS but rose to about 50 ms at 1,000 QPS.

Performance Test on Istio Ambient Mesh With Kube CNI and Cilium

We carried out the load test for the following cases:

Kube CNI
Kube CNI + Istio sidecar (mTLS enabled)
Kube CNI + Istio Ambient mesh (mTLS enabled)
Cilium CNI
Cilium CNI + Istio sidecar (mTLS enabled)
Cilium CNI + Istio Ambient mesh (mTLS enabled)

Although we tested the load for each case multiple times, we have attached only one screenshot per case to show the standard deviation of p99 latency. Please refer to the load test results in the next section.

Load Test Results for Kube CNI Without Istio
Observed (median) p99 latency: 1.12 ms
Figure 1: Kube CNI + without Istio

Load Test of Kube CNI and Istio Sidecar (mTLS Enabled)
Observed (median) p99 latency: 4.72 ms
Figure 2: Kube CNI + with Istio sidecar (mTLS enabled)

Load Test of Kube CNI and Istio Ambient Mesh (mTLS Enabled)
Observed (median) p99 latency: 3.6 ms
Figure 3: Kube CNI + with Istio Ambient (mTLS enabled)

Load Test of Cilium CNI Without Istio
Observed (median) p99 latency: 4.5 ms
Figure 4: Cilium CNI + without Istio

Load Test of Cilium CNI and Istio Sidecar (mTLS Enabled)
Observed (median) p99 latency: 8.8 ms
Figure 5: Cilium CNI + with Istio sidecar

Load Test of Cilium CNI and Istio Ambient Mesh (mTLS Enabled)
Observed (median) p99 latency: 6.8 ms
Figure 6: Cilium CNI + with Istio Ambient

Final Load Test Results and Benchmarking of Rating Service With and Without Istio

Here are the benchmarking results for the p99 latency of the Rating service with and without Istio (sidecar and Ambient mesh).
Sl No  Case                                            p99 latency (ms)
1      Kube CNI                                        1.12
2      Kube CNI + Istio sidecar (mTLS enabled)         4.72
3      Kube CNI + Istio Ambient mesh (mTLS enabled)    3.6
4      Cilium CNI                                      4.5
5      Cilium CNI + Istio sidecar (mTLS enabled)       8.8
6      Cilium CNI + Istio Ambient mesh (mTLS enabled)  6.8

Conclusion

Three conclusions follow from this extensive load test of Istio Ambient mesh:

Istio Ambient mesh will not give you dramatic latency improvements over plain Kube CNI. Ztunnel still adds network hops for encryption, which increases latency.
It is, however, better than the Istio sidecar architecture. Regardless of the CNI used, the p99 latency of Istio Ambient mesh is about 20% better than that of the Istio sidecar.
Combining Cilium and Istio (sidecar or Ambient) produces undesirable results. If you are looking for performance improvements, avoid this mix.
Having worked with over 50 Snowflake customers across Europe and the Middle East, I've analyzed hundreds of Query Profiles and identified many issues including issues around performance and cost. In this article, I'll explain: What is the Snowflake Query Profile, and how to read and understand the componentsHow the Query Profile reveals how Snowflake executes queries and provides insights about Snowflake and potential query tuningWhat to look out for in a Query Profile and how to identify and resolve SQL issues By the end of this article, you should have a much better understanding and appreciation of this feature and learn how to identify and resolve query performance issues. What Is a Snowflake Query Profile? The Snowflake Query Profile is a visual diagram explaining how Snowflake has executed your query. It shows the steps taken, the data volumes processed, and a breakdown of the most important statistics. Query Profile: A Simple Example To demonstrate how to read the query profile, let's consider this relatively simple Snowflake SQL: select o_orderstatus, sum(o_totalprice) from orders where year(o_orderdate) = 1994 group by all order by 1; The above query was executed against a copy of the Snowflake sample data in the snowflake_sample_data.tpch_sf100.orders table, which holds 150m rows or about 4.6GB of data. Here's the query profile it produced. We'll explain the components below. Query Profile: Steps The diagram below illustrates the Query Steps. These are executed from the bottom up, and each step represents an independently executing process that processes a batch of a few thousand rows of data in parallel. There are various types available, but the most common include: TableScan [4] - Indicating data being read from a table; Notice this step took 94.8% of the overall execution time. This indicates the query spent most of the time scanning data. Notice we cannot tell from this whether data was read from the virtual warehouse cache or remote storage.Filter[3] - This attempts to reduce the number of rows processed by filtering out the data. Notice the Filter step takes in 22.76M rows and outputs the same number. This raises the question of whether the where clause filtered any results.Aggregate [2] - This indicates a step summarizing results. In this case, it produced the sum(orders.totalprice). Notice that this step received 22.76M rows and output just one row.Sort [1] - Which represents the order by orderstatus. This sorts the results before returning them to the Result Set. Note: Each step also includes a sequence number to help identify the sequence of operation. Read these from highest to lowest. Query Profile: Overview and Statistics Query Profile: Overview The diagram below summarises the components of the Profile Overview, highlighting the most important components. The components include: Total Execution Time: This indicates the actual time in seconds the query took to complete. Note: The elapsed time usually is slightly longer as it includes other factors, including compilation time and time spent queuing for resources.Processing %: Indicates the percentage of time the query spends waiting for the CPU; When this is a high percentage of total execution time, it indicates the query is CPU-bound — performing complex processing.Local Disk I/O %: Indicates the percentage of time waiting for SSDRemote Disk I/O %: This indicates the percentage of time spent waiting for Remote Disk I/O. A high percentage indicates the query was I/O bound. 
This suggests that the performance can be best improved by reducing the time spent reading from the disk.Synchronizing %: This is seldom useful and indicates the percentage of time spent synchronizing between processes. This tends to be higher as a result of sort operations.Initialization %: Tends to be a low percentage of overall execution time and indicates time spent compiling the query; A high percentage normally indicates a potentially over-complex query with many sub-queries but a short execution time. This suggests the query is best improved by simplifying the query design to reduce complexity and, therefore, compilation time. Query Profile Statistics The diagram below summarises the components of the Profile Statistics, highlighting the most important components. The components include: Scan Progress: This indicates the percentage of data scanned. When the query is still executing, this can be used to estimate the percentage of time remaining.Bytes Scanned: This indicates the number of bytes scanned. Unlike row-based databases, Snowflake fetches only the columns needed, and this indicates the data volume fetched from Local and Remote storage.Percentage Scanned from Cache: This is often mistaken for a vital statistic to monitor. However, when considering the performance of a specific SQL statement, Percentage Scanned from Cache is a poor indicator of good or bad query performance and should be largely ignored.Partitions Scanned: This indicates the number of micro partitions scanned and tends to be a critical determinant of query performance. It also indicates the volume of data fetched from remote storage and the extent to which Snowflake could partition eliminate — to skip over partitions, explained below.Partitions Total: Shows the total number of partitions in all tables read. This is best read in conjunction with Partitions Scanned and indicates the efficiency of partition elimination. For example, this query fetched 133 of 247 micro partitions and scanned just 53% of the data. A lower percentage indicates a higher rate of partition elimination, which will significantly improve queries that are I/O bound. A Join Query Profile While the simple example above illustrates how to read a query profile, we need to know how Snowflake handles JOIN operations between tables to fully understand how Snowflake works. The SQL query below includes a join of the customer and orders tables: SQL select c_mktsegment , count(*) , sum(o_totalprice) , count(*) from customer , orders where c_custkey = o_custkey and o_orderdate between ('01-JAN-1992') and ('31-JAN-1992') group by 1 order by 1; The diagram below illustrates the relationship between these tables in the Snowflake sample data in the snowflake_sample_data.tpch_sf100 schema. The diagram below illustrates the Snowflake Query Plan used to execute this query, highlighting the initial steps that involve fetching data from storage. One of the most important insights about the Query Profile above is that each step represents an independently operating parallel process that runs concurrently. This uses advanced vectorized processing to fetch and process a few thousand rows at a time, passing them to the next step to process in parallel. Snowflake can use this architecture to break down complex query pipelines, executing individual steps in parallel across all CPUs in a Virtual Warehouse. It also means Snowflake can read data from the ORDERS and CUSTOMER data in parallel using the Table Scan operations. How Does Snowflake Execute a JOIN Operation? 
The diagram below illustrates the processing sequence of a Snowflake JOIN operation. To read the sequence correctly, always start from the Join step and take the left leg, in this case, down to the TableScan of the ORDERS table, step 5. The diagram above indicates the steps were as follows: TableScan [5]: This fetches data from the ORDERS table, which returns 19.32M rows out of 150M rows. This reduction is explained by the Snowflake's ability to automatically partition eliminate - to skip over micro-partitions, as described in the article on Snowflake Cluster Keys. Notice that the query spent 9.3% of the time processing this step.Filter [4]: Receives 19.32M rows and logically represents the following line in the above query: SQL and o_orderdate between ('01-JAN-1992') and ('31-JAN-1992') This step represents filtering rows from the ORDERS table before passing them to the Join [3] step above. Surprisingly, this step appears to do no actual work as it receives and emits 19.32M rows. However, Snowflake uses Predicate Pushdown, which filters the rows in the TableScan [4] step before reading them into memory. The output of this step is passed to the Join step. Join [3]: Receives ORDERS rows but needs to fetch corresponding CUSTOMERentries. We, therefore, need to skip down to the TableScan [7] step.TableScan [7]: Fetches data from the CUSTOMER table. Notice this step takes 77.7% of the overall execution time and, therefore, has the most significant potential benefit from query performance tuning. This step fetches 28.4M rows, although Snowflake automatically tunes this step, as there are 1.5 Bn rows on the CUSTOMER table.JoinFilter [6]: This step represents an automatic Snowflake performance tuning operation that uses a Bloom Filter to avoid scanning micro-partitions on the right-hand side of a Join operation. In summary, as Snowflake has already fetched the CUSTOMER entries, it only needs to fetch ORDERS for the matching CUSTOMER rows. This explains the fact the TableScan [7] returns only 28M of the 1.5Bn possible entries. It's worth noting this performance tuning is automatically applied, although it could be improved using a Cluster Key on the ORDERS table on the join columns.Join [3]: This represents the actual join of data in the CUSTOMER and ORDERStables. It's important to understand that every Snowflake Join operation is implemented as a Hash Join. What Is a Snowflake Hash Join? While it may appear we're disappearing into the Snowflake internals, bear with me. Understanding how Snowflake executes JOIN operations highlights a critical performance-tuning opportunity. The diagram below highlights the essential statistics to watch out for in any Snowflake Join operation. The diagram above shows the number of rows fed into the JOIN and the total rows returned. In particular, the left leg delivered fewer rows (19.32M) than the right leg (28.4M). This is important because it highlights an infrequent but critical performance pattern: The number of rows fed into the left leg of a JOIN must always be less than the right. The reason for this critical rule is revealed in the way Snowflake executes a Hash Join, which is illustrated in the diagram below: The above diagram illustrates how a Hash Join operation works by reading an entire table into memory and generating a unique hash key for each row. It then performs a full table scan, which looks up against the in-memory hash key to join the resulting data sets. 
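To see why the choice of build side matters, here is a toy, hypothetical C++ sketch of the hash join idea: the smaller input is hashed into memory on the join key, and the larger input is then streamed past it. The row types and sample values are invented for illustration; this is not Snowflake's actual implementation.

C++
// Toy hash join: hash the smaller ("build") input into memory on the join key,
// then stream the larger ("probe") input past it.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct CustomerRow { long custKey; std::string mktSegment; }; // smaller input
struct OrderRow    { long custKey; double totalPrice; };      // larger input

int main() {
    std::vector<CustomerRow> customers = { {1, "BUILDING"}, {2, "MACHINERY"} };
    std::vector<OrderRow> orders = { {1, 100.0}, {2, 250.0}, {1, 75.5}, {3, 10.0} };

    // Build phase: every row of the smaller input is hashed by the join key.
    std::unordered_multimap<long, const CustomerRow*> hashTable;
    hashTable.reserve(customers.size());
    for (const auto& c : customers) hashTable.emplace(c.custKey, &c);

    // Probe phase: the larger input is scanned once, probing the in-memory table.
    for (const auto& o : orders) {
        auto range = hashTable.equal_range(o.custKey);
        for (auto it = range.first; it != range.second; ++it)
            std::cout << it->second->mktSegment << " " << o.totalPrice << "\n";
    }
    // Swapping the two inputs would force the larger relation into memory,
    // which is exactly the pathological case described in the text.
}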
Therefore, it's essential to correctly identify the smaller of the two data sets and read it into memory while scanning the larger of the two, but sometimes Snowflake gets it wrong. The screen-shot below illustrates the situation: In the above example, Snowflake needs to read eight million entries into memory, create a hash key for each row, and perform a full scan of just 639 rows. This leads to very poor query performance and a join that should take seconds but often takes hours. As I have explained previously in an article on Snowflake Performance Tuning, this is often the result of multiple nested joins and group by operations, which makes it difficult for Snowflake to identify the cardinality correctly. While this happens infrequently, it can lead to extreme performance degradation and the best practice approach is to simplify the query, perhaps breaking it down into multiple steps using transient or temporary tables. Identifying Issues Using the Query Profile Query Profile Join Explosion The screenshot below illustrates a common issue that often leads to both poor query performance and (more importantly) incorrect results. Notice the output of the Join [4] step doesn't match the values input on the left or right leg despite the fact the query join clause is a simple join by CUSTKEY? This issue is often called a "Join Explosion" and is typically caused by duplicate values in one of the tables. As indicated above, this frequently leads to poor query performance and should be investigated and fixed. Note: One potential way to automatically identify Join Explosion is to use the Snowflake function GET_OPERATOR_QUERY_STATS , which allows programmatic access to the query profile. Unintended Cartesian Join The screenshot below illustrates another common issue easily identified in the Snowflake query profile: a cartesian join operation. Similar to the Join Explosion above, this query profile is produced by a mistake in the SQL query. This mistake produces an output that multiplies the size of both inputs. Again, this is easy to spot in a query profile, and although it may, in some cases, be intentional, if not, it can lead to very poor query performance. Disjunctive OR Query Disjunctive database queries are queries that include an OR in the query WHEREclause. This is an example of a valid use of the Cartesian Join, but one which can be easily avoided. Take, for example, the following query: SQL select distinct l_linenumber from snowflake_sample_data.tpch_sf1.lineitem, snowflake_sample_data.tpch_sf1.partsupp where (l_partkey = ps_partkey) or (l_suppkey = ps_suppkey); The above query produced the following Snowflake Query Profile and took 7m 28s to complete on an XSMALL warehouse despite scanning only 28 micro partitions. However, when the same query was rewritten (below) to use a UNION statement, it took just 3.4 seconds to complete, a 132 times performance improvement for very little effort. SQL select l_linenumber from snowflake_sample_data.tpch_sf1.lineitem join snowflake_sample_data.tpch_sf1.partsupp on l_partkey = ps_partkey union select l_linenumber from snowflake_sample_data.tpch_sf1.lineitem join snowflake_sample_data.tpch_sf1.partsupp on l_suppkey = ps_suppkey; Notice the Cartesian Join operation accounted for 95.8% of the execution time. Also, the Profile Overview indicates that the query spent 98.9% of the time processing. This is worth noting as it demonstrates a CPU-bound query. 
Wrapping Columns in the WHERE Clause While this issue is more challenging to identify from the query profile alone, it illustrates one the most important statistics available, the Partitions Scanned compared to Partitions Total. Take the following SQL query as an example: SQL select o_orderpriority, sum(o_totalprice) from orders where o_orderdate = to_date('1993-02-04','YYYY-MM-DD') group by all; The above query was completed in 667 milliseconds on an XSMALL warehouse and produced the following profile. Notice the sub-second execution time and that the query only scanned 73 of 247 micro partitions. Compare the above situation to the following query, which took 7.6 seconds to complete - 11 times slower than the previous query to produce the same results. SQL select o_orderpriority, sum(o_totalprice) from orders where to_char(o_orderdate, 'YYYY-MM-DD') = '1993-02-04' group by all; The screenshot above shows the second query was 11 times slower because it needed to scan 243 micro-partitions. The reason lies in the WHERE clause. In the first query, the WHERE clause compares the ORDERDATE to a fixed literal. This meant that Snowflake was able to perform partition elimination by date. SQL where o_orderdate = to_date('1993-02-04','YYYY-MM-DD') In the second query, the WHERE clause modified the ORDERDATE field to a character string, which reduced Snowflake's ability to filter out micro-partitions. This meant more data needed to be processed which took longer to complete. SQL where to_char(o_orderdate, 'YYYY-MM-DD') = '1993-02-04' Therefore, the best practice is to avoid wrapping database columns with functions, especially not user-defined functions, which severely impact query performance. Identifying Spilling to Storage in the Snowflake Query Profile As discussed in my article on improving query performance by avoiding spilling to storage, this tends to be an easy-to-identify and potentially resolve issue. Take, for example, this simple benchmark SQL query: SQL select ss_sales_price from snowflake_sample_data.TPCDS_SF100TCL.STORE_SALES order by SS_SOLD_DATE_SK, SS_SOLD_TIME_SK, SS_ITEM_SK, SS_CUSTOMER_SK, SS_CDEMO_SK, SS_HDEMO_SK, SS_ADDR_SK, SS_STORE_SK, SS_PROMO_SK, SS_TICKET_NUMBER, SS_QUANTITY; The above query sorted a table with 288 billion rows and took over 30 hours to complete on a SMALL virtual warehouse. The critical point is that the Query Profile Statistics showed that it spilled over 10 TB to local storage and 8 TB to remote storage. Furthermore, because it took so long, it cost over $183 to complete. The screenshot above shows the query profile, execution time, and bytes spilled to local and remote storage. It's also worth noting that the query spent 70.9% of the time waiting for Remote Disk I/O, consistent with the data volume spilled to Remote Storage. Compare the results above to the screenshot below. This shows the same query executed on an X3LARGE warehouse. The Query Profile above shows that the query was completed in 38 minutes and produced no remote spilling. In addition to completing 48 times faster than on the SMALL warehouse also cost $121.80, a 66% reduction in cost. Assuming this query was executed daily, that would amount to an annual savings of over $22,000. Conclusion The example above illustrates my point in an article on controlling Snowflake costs. Snowflake Data Engineers and Administrators tend to put far too much emphasis on tuning performance. 
However, Snowflake has changed the landscape, and we need to focus on both maximizing query performance and controlling costs. The two goals may seem at odds with each other, but with the Snowflake Query Profile and the techniques described in this article, there is no reason we cannot deliver both.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.

Time series data has become an essential part of data collection in various fields due to its ability to capture trends, patterns, and anomalies. Through continuous or periodic observation, organizations are able to track how key metrics change over time. This simple abstraction powers a broad range of use cases. The widespread adoption of time series data stems from its versatility and applicability across numerous domains. For example:

Financial institutions analyze market trends and predict future movements.
IoT devices continuously generate time-stamped data to monitor the telemetry of everything from industrial equipment to home appliances.
IT infrastructure relies on temporal data to track system performance, detect issues, and ensure optimal operation.

As the volume and velocity of time series data have surged, traditional databases have struggled to keep pace with the unique demands of such workloads. This has led to the development of specialized databases, known as time series databases (TSDBs). TSDBs are purpose-built to handle the specific needs of ingesting, storing, and querying temporal data.

Core Features and Advantages of Time Series Databases

TSDBs combine efficient data ingestion and storage capabilities with optimized querying and analytics to manage large volumes of real-time data.

Data Ingestion and Storage

TSDBs use a number of optimizations to ensure scalable and performant loading of high-volume data. Several of these optimizations stand out as key differentiators:

Table 1. Ingestion and storage optimizations

Feature: Advanced compression
Description: Columnar compression techniques such as delta, dictionary, run-length, and array-based LZ encoding
Expected impact: Dramatically reduces the amount of data that needs to be stored on disk and, consequently, scanned at query time

Feature: Data aggregation and downsampling
Description: Creation of summaries over specified intervals
Expected impact: Reduces data volumes without a significant loss of information

Feature: High-volume write optimization
Description: A suite of features such as append-only logs, parallel ingestion, and an asynchronous write path
Expected impact: Ensures that there are no bottlenecks in the write path and that data can continuously arrive and be processed by these features working together

Optimized Querying and Analytics

To ensure fast data retrieval at query time, several optimizations are essential. These include specialized time-based indexing, time-based sharding/partitioning, and precomputed aggregates. These techniques take advantage of the time-based, sequential nature of the data to minimize the amount of data scanned and reduce the computation required during queries. An overview of these techniques is given below.

Indexing

Various indexing strategies are employed across TSDBs to optimize data retrieval. Some TSDBs use an adapted form of the inverted index, which allows for rapid indexing into relevant time series by mapping metrics or series names to their locations within the dataset. Others implement hierarchical structures, such as trees, to efficiently index time ranges, enabling quick access to specific time intervals. Additionally, some TSDBs utilize hash-based indexing to distribute data evenly and ensure fast lookups, while others may employ bitmap indexing for compact storage and swift access.
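As a rough, hypothetical illustration of the inverted-index idea combined with time bucketing, the C++ sketch below maps a series name to hour-sized partitions of points so that a time-range query only touches the relevant buckets. All names (TinyTsdb, Point, and so on) are invented for this example; production TSDBs rely on far more sophisticated variants of the compressed, tree-, hash-, or bitmap-based indexes described above.

C++
// Hypothetical sketch: an inverted index from series name to its data, with each
// series partitioned into hour-sized time buckets so a range query only touches
// the relevant buckets.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Point { std::int64_t timestampSec; double value; };

class TinyTsdb {
public:
    void append(const std::string& series, Point p) {
        _index[series][bucketOf(p.timestampSec)].push_back(p);
    }

    // Scan only the buckets overlapping [fromSec, toSec) for one series.
    std::vector<Point> query(const std::string& series,
                             std::int64_t fromSec, std::int64_t toSec) const {
        std::vector<Point> out;
        const auto seriesIt = _index.find(series);              // index lookup by name
        if (seriesIt == _index.end() || toSec <= fromSec) return out;
        const auto& buckets = seriesIt->second;
        for (auto it = buckets.lower_bound(bucketOf(fromSec));
             it != buckets.end() && it->first <= bucketOf(toSec - 1); ++it)
            for (const Point& p : it->second)
                if (p.timestampSec >= fromSec && p.timestampSec < toSec)
                    out.push_back(p);
        return out;
    }

private:
    static std::int64_t bucketOf(std::int64_t ts) { return ts / 3600; } // 1-hour partitions

    // series name -> (hour bucket -> points); the ordered inner map gives cheap range scans
    std::unordered_map<std::string, std::map<std::int64_t, std::vector<Point>>> _index;
};

int main() {
    TinyTsdb db;
    db.append("cpu.load", {1000, 0.42});
    db.append("cpu.load", {5000, 0.55});
    std::cout << db.query("cpu.load", 0, 3600).size() << " point(s) in the first hour\n";
}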
These diverse strategies enhance the performance of TSDBs, making them capable of handling large volumes of time-stamped data with speed and precision.

Partitioning

Partitioning consists of separating logical units of time into separate structures so that they can be accessed independently.

Figure 1. Data partitioning to reduce data scan volume

Pre-Computed Aggregates

A simplified version of pre-computation is shown below. In practice, advanced statistical structures (e.g., sketches) may be used so that more complex calculations (e.g., percentiles) can be performed over the segments.

Figure 2. Visualizing pre-computation of aggregates

Scalability and Performance

Several tactics and features ensure TSDBs remain reliable and performant as data velocity and volume increase. These are summarized in the table below:

Table 2. Scalability tactics and features

Feature: Distributed architecture
Description: Provides seamless horizontal scaling
Expected impact: Allows for transparently increasing the amount of processing power available to both producing and consuming applications

Feature: Partitioning and sharding
Description: Allows data to be isolated to distributed processing units
Expected impact: Ensures that both write and read workloads can fully utilize the distributed cluster

Feature: Automated data management
Description: Enables data to move through different tiers of storage automatically based on its temporal relevance
Expected impact: Guarantees that the most frequently used data is automatically stored in the fastest access path, while less-used data has retention policies automatically applied

Time Series Databases vs. Time Series in OLAP Engines

Due to the ubiquity of time series data within businesses, many databases have co-opted the features of TSDBs in order to provide at least some baseline of the capabilities that a specialized TSDB would offer. In some cases, this may satisfy the use cases of a particular organization. However, outlined below are some key considerations and differentiating features to evaluate when choosing whether an existing OLAP store or a time-series-optimized platform best fits a given problem.

Key Considerations

An organization's specific requirements will drive which approach makes the most sense. Understanding the three topics below will provide the necessary context to determine whether bringing in a TSDB can provide a high return on investment.

Data Volume and Ingestion Velocity

TSDBs are designed to handle large volumes of continuously arriving data, and they may be a better fit where loading volumes are high and the business requires low latency from event generation to insight.

Typical Query Patterns

It is important to consider whether the typical queries fetch specific time ranges of data, aggregate over time ranges, perform real-time analytics, or frequently downsample. If they do, the benefits of a TSDB will be worth introducing a new data framework into the ecosystem.

Existing Infrastructure and Process

When considering introducing a TSDB into an analytic environment, it is worthwhile to first survey the existing tooling, since many query engines now support a subset of temporal features. Determine where functionality gaps exist within the existing toolset and use that as a starting point for assessing the fit of a specialized back end such as a TSDB.

Differentiating Features

There are many differences in implementation, and the specific feature differences will vary depending on the platforms being considered.
Differentiating Features

There are many differences in implementation, and the specific differences will vary depending on the platforms being considered. Generally, however, two feature sets are broadly emphasized in TSDBs: time-based indexing and data management constructs. This emphasis stems from the fact that both feature sets are tightly coupled with time-based abstractions. Use of a TSDB will be most successful when these features can be fully leveraged.

Time-Based Indexing

Efficient data access is achieved through constructs that leverage the sequential nature of time series data, allowing for fast retrieval while maintaining low ingest latency. This critical feature allows TSDBs to excel in use cases where traditional databases struggle to scale effectively.

Data Management Constructs

Time-based retention policies, efficient compression, and downsampling simplify the administration of large datasets by reducing the manual work required to manage time series data. These specialized primitives are purposefully designed for managing and analyzing time series data, and they include functionality that traditional databases typically lack.

Use Cases of Time Series Databases in Analytics

There are various uses for time series data across all industries. Furthermore, emerging trends such as edge computing are putting the power of real-time time series analytics as close to the source of data generation as possible, reducing the time to insight and removing the need for continuous connectivity to centralized platforms. This opens up a host of applications that were, until recently, difficult or impossible to implement. A few curated use cases are described below to demonstrate the value that can be derived from effectively leveraging temporal data.

Telemetry Analysis and Anomaly Detection

One of the most common use cases for TSDBs is the observation and analysis of real-time metrics. These metrics come from a variety of sources, a few of the most prominent of which are described below.

IT and Infrastructure Monitoring

TSDBs enable real-time monitoring of servers, networks, and application performance, allowing for immediate detection of and response to issues. This real-time capability supports performance optimization by identifying bottlenecks, determining capacity needs, and detecting security intrusions. Additionally, TSDBs enhance alert systems by identifying anomalous patterns and breaches of predefined thresholds, proactively informing staff to prevent potential problems. They also support custom dashboards and visualizations for quick and effective data interpretation, making them an invaluable tool for modern IT operations.

IoT and Sensor Data

TSDBs are vital for telemetry analysis and anomaly detection in IoT and sensor data applications, particularly when aligned with edge computing. They efficiently handle the large volumes of temporal data generated by IoT devices and sensors, enabling real-time monitoring and analysis at the edge of the network. This proximity allows for immediate detection of anomalies, such as irregular patterns or deviations from expected behavior, which is crucial for maintaining the health and performance of IoT systems. By processing data locally, TSDBs reduce latency and bandwidth usage, enhancing the responsiveness and reliability of IoT operations.
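As a concrete illustration of the threshold-style alerting mentioned above, here is a minimal anomaly detection sketch in Python using a rolling z-score. It is a deliberately simplified toy, not how any particular TSDB implements alerting; in practice, a rule like this would usually run inside the database's alerting engine or a stream processor fed by it.

```python
# Toy threshold-based anomaly detection over a metric stream (illustrative only).
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=30, threshold=3.0):
    """Yield (index, value) pairs whose value deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        history.append(value)


# Usage: a flat latency signal with one injected spike.
signal = [20.0 + (i % 3) * 0.1 for i in range(200)]
signal[150] = 95.0  # simulated incident
print(list(detect_anomalies(signal)))
```

Even this simple rule captures the core idea: compare each new sample against a statistical summary of its recent history and raise an alert when the deviation crosses a configured threshold.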
Smart Cities and Utilities

Extreme weather and the need for rapid response have driven growth in the use of temporal data within city and utility infrastructures, where quickly deriving insights from deviations in normal operations can have a significant impact. TSDBs enable this through both the ability to ingest large volumes of data quickly and native support for highly performant real-time analytics. For instance, this can mean the difference between high winds causing live-wire breakages that increase fire risk and an automated shutdown that significantly reduces that risk. Furthermore, better information about energy generation and demand can be used to improve the efficiency of such systems by ensuring that supply and demand are appropriately matched. This is particularly important during periods of heavy strain on the energy grid, such as unusual heat or cold, when effective operation can save lives.

Trend Analysis

The usefulness of TSDBs is not limited to real-time analytics; they are also used for long-term trend analysis and often provide the most value when identifying real-time deviations from longer-term trends. The optimizations mentioned above, such as pre-computation and partitioning, allow TSDBs to maintain high performance even as data volumes grow dramatically.

Financial Analytics

In the realm of financial analytics, TSDBs are indispensable for trend analysis. Analysts can identify patterns and trends over time, helping to forecast market movements and inform investment strategies. The ability to process and analyze this data in real time allows for timely decision making, reducing the risk of losses and capitalizing on market opportunities. Additionally, TSDBs support the integration of various data sources, providing a comprehensive view of financial markets and enhancing the accuracy of trend analysis.

Healthcare and Biometric Data

Medical devices and wearables generate vast amounts of time-stamped data, including heart rates, glucose levels, and activity patterns. TSDBs facilitate the storage and real-time analysis of this data, allowing healthcare providers to monitor patients continuously and promptly detect deviations from normal health parameters. Trend analysis using TSDBs can also help in predicting the onset of diseases, monitoring the effectiveness of treatments, and tailoring personalized healthcare plans. This proactive approach not only improves patient outcomes but also enhances the efficiency of healthcare delivery.

Industrial Predictive Maintenance

Industries deploy numerous sensors on equipment to monitor parameters such as vibration, temperature, and pressure. By collecting and analyzing time-stamped data, TSDBs enable the identification of patterns that indicate potential equipment failures. This trend analysis allows maintenance teams to predict when machinery is likely to fail and to schedule timely maintenance, thereby preventing costly unplanned downtime. Moreover, TSDBs support the optimization of maintenance schedules based on actual usage and performance data, enhancing overall operational efficiency and extending the lifespan of industrial equipment.
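The long sensor and metric histories described in these use cases are typically kept manageable through the retention and downsampling policies discussed earlier (see the automated data management row in Table 2). Below is a minimal sketch assuming a hypothetical two-tier scheme in which raw points are kept for seven days and older data is reduced to hourly averages; real TSDBs apply such policies natively in the storage engine rather than in application code.

```python
# Illustrative two-tier retention policy (hypothetical scheme, not a TSDB feature).
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=7)


def apply_retention(points, now=None):
    """points: list of (timestamp, value); returns (raw_kept, hourly_rollups)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - RAW_RETENTION
    raw_kept, buckets = [], {}
    for ts, value in points:
        if ts >= cutoff:
            # Recent data stays at full resolution.
            raw_kept.append((ts, value))
        else:
            # Older data collapses into hourly averages.
            hour = ts.replace(minute=0, second=0, microsecond=0)
            total, count = buckets.get(hour, (0.0, 0))
            buckets[hour] = (total + value, count + 1)
    rollups = {hour: total / count for hour, (total, count) in sorted(buckets.items())}
    return raw_kept, rollups


# Usage: 14 days of hourly samples; only the newest 7 days stay raw.
now = datetime(2024, 6, 15, tzinfo=timezone.utc)
data = [(now - timedelta(hours=h), float(h % 24)) for h in range(14 * 24)]
raw, old = apply_retention(data, now=now)
print(len(raw), len(old))
```

The trade-off this sketch makes explicit is the one TSDBs automate: recent data stays at full resolution for detailed troubleshooting, while historical data retains enough fidelity for the trend analysis described above at a fraction of the storage cost.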
Conclusion

Time series databases offer tools that simplify working with temporal data, thereby enabling businesses to improve operational efficiency, predict failures, and enhance security. The expanding capabilities of TSDBs highlight the value of real-time analytics and edge processing. Features like time-based partitioning, fast ingestion, and automated data retention, which are now found in some traditional databases, encourage TSDB adoption by allowing proofs of concept on existing infrastructure. Such trials demonstrate where investing in TSDBs can yield significant benefits, pushing the boundaries of temporal data management and optimizing analytics ecosystems.

Integration with machine learning and AI for advanced analytics, enhanced scalability, and the adoption of cloud-native solutions for flexibility are driving forces behind continued adoption. TSDBs will support edge computing and IoT for real-time processing, strengthen security and compliance, and improve data retention management. Interoperability with other tools and support for open standards will create a cohesive data ecosystem, while real-time analytics and advanced visualization tools will enhance data interpretation and decision making. Together, these factors will ensure that TSDBs remain an essential piece of data infrastructure for years to come.

This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.