DZone Spotlight

Sunday, October 27 View All Articles »

How to Choose a Server Stack at Product Launch

By Grigorii Novikov

Choosing the perfect server stack for launching a product is a decision that carries a lot of weight. This choice influences not just the initial deployment but the long-term adaptability and efficiency of your app. If you're a senior developer or leading a team, you shoulder the responsibility of these architecture decisions, sifting through a sea of languages and frameworks to find the perfect fit for your project's unique needs. Your task here is to make an important choice, one that will hold up as your project evolves and expands. I am Grigorii Novikov, a Senior Backend Developer with years of experience in sculpting and rolling out software architectures. Throughout my career, I've been faced with plenty of critical decisions on server stack selection. Each decision has added layers to my understanding of how to align technology with the requirements of a growing project. In this article, I will share with you some of those hard-earned insights, helping you pick a server stack that will fit your project's current needs and support its future growth. I invite you to explore with me the ins and outs of making tech decisions that pave the way for success, making sure your project stands on a ground ripe for growth, flexibility, and innovation. If you're a senior developer or leading a team, you shoulder the responsibility of these architecture decisions, sifting through a sea of languages and frameworks to find the perfect fit for your project's unique needs. 1. Autogenerating Documentation Although not related to code per se, this point is so important it should be discussed first. Robust documentation is a cornerstone of efficient development, especially when it comes to client-side development and app testing. Tools for autogenerating documentation have revolutionized this process, ensuring that documentation keeps pace with the latest API changes, streamlining development workflows, and cutting down on the manual effort of keeping your project’s documentation up to date. Among the tools available to a developer, I recommend Swagger for its versatility, widespread adoption, and powerful community support. Another popular option is Redoc, which offers an attractive, customizable interface for API documentation. For projects requiring more extensive customization, tools like Apiary provide flexibility alongside documentation capabilities, though they may demand more initial setup. Whichever tool you choose, the objective should be to optimize the documentation process for efficiency without allowing the tool itself to become a significant time sink. Opt for a solution that minimizes manual documentation efforts while offering the flexibility to adapt to your project's unique requirements. 2. Bug Tracker Support Efficient bug tracking is critical for maintaining the health of your application. For effective bug-tracking integration, I use tools like Jira and Bugzilla, both boasting a rich feature set and flexibility. Jira, in particular, offers robust integration capabilities with many development environments; Bugzilla, on the other hand, is known for its simplicity and effectiveness, especially in open-source projects where straightforward bug tracking is a priority. Here’s an insight for you: integrating bug trackers with instant messengers and version control systems will boost your team’s collaboration and efficiency. For instance, the Jira+Bitbucket combo streamlines workflows, allowing for seamless issue tracking within the version control environment. This pairing facilitates a transparent, agile development process, where code updates and issue resolutions are closely linked, enabling faster iterations and improved code quality. Another powerful integration is Mattermost+Focalboard, which offers a comprehensive collaboration platform. It combines the direct communication benefits of Mattermost with the project and task management capabilities of Focalboard, empowering teams with real-time updates on bug tracking, alongside the flexibility to manage tasks and workflows within a unified interface. Such integrations not only optimize the bug resolution process but also foster a more cohesive and agile development environment, ultimately enhancing productivity and project outcomes. 3. Scaling on Growing When your product starts to catch on, you will face the challenge of scaling. And I don’t mean simply a surging number of users. Scaling involves fitting in new features, handling a growing database, and keeping the performance levels of your codebase and database optimal. This is when the architecture you chose for your server stack really comes into play. For instance, at the launch of your project, going for a monolithic architecture might seem like a balanced approach. But as your product grows and changes, you'll start to see where it falls short. Transitioning to a microservices architecture or bringing in scalable cloud services can give you much finer control over different aspects of your application. For scalable server stack solutions, I lean towards technologies like Kubernetes and Docker. These tools will give you the flexibility to scale services independently, manage deployments efficiently, and ensure consistency across your environments. Furthermore, cloud service providers like Amazon Web Services, Google Cloud, and Microsoft Azure offer stellar managed services that can really simplify your scaling journey. Choosing a scalable architecture means balancing the perks of scalability with the complexities of managing a distributed system. Ultimately, your aim here is to pick a server stack that meets your present needs and has the flexibility to handle future growth. 4. Finding the Perfect Fit: Between Community and Security There is no shortage of programming languages and frameworks available, each with its own set of perks like community support, resource availability, and even security features. This diversity allows a broad choice of solutions that not only address immediate development challenges but also align with long-term project goals, including security and scalability. Technologies backed by large communities and abundant resources, such as Python and JavaScript – and their respective frameworks within languages like Django or React – provide a wealth of knowledge and ready-to-use code examples. This wealth significantly cuts down the time you'd otherwise spend on troubleshooting, given the slim odds of encountering an issue not tackled by someone before you. Conversely, newer or niche technologies may bring unique perks to the table, but will often leave you bracing for a tougher time when it comes to finding quick solutions. Another crucial moment is balancing security and usability. For projects where source code protection is a major concern, consider using languages and technologies that support easy obfuscation and secure packaging. For instance, Java and .NET have established tools and ecosystems for obfuscating code. Containerization technologies like Docker will also help you here. By packaging the application and its environment into a container, you ensure that the client receives everything needed to run the app without directly accessing your code. This method not only secures the code but also simplifies the deployment process. 5. Cost Cost considerations are critical in the selection of a technology stack. It’s just about the cost of the initial setup, you also have to think long-term about what it'll cost to maintain and scale your system. Open-source technologies come with the sweet perk of zero licensing fees upfront. For startups or any project on a tight budget, this can be a major draw. Additionally, the vast pool of adept developers will help you keep labor costs more manageable. On the other hand, more complex and specialized technologies, such as blockchain or advanced data analytics platforms, may require a higher initial investment. While they offer significant pros in terms of performance and security, you should weigh the total cost of ownership against the projected benefits. Furthermore, cloud services, while reducing the need for physical infrastructure, come with their own set of costs. The above-mentioned AWS, Google Cloud, and Azure offer various pricing models that can scale with your usage; yet without careful management, these costs can spiral as your project grows. 6. Code Delivery Ensuring efficient code delivery focuses on the deployment process, primarily through Continuous Integration/Continuous Deployment (CI/CD) pipelines. This method underscores the importance of automating the transfer of code into various environments, streamlining development and production workflows. Tools such as GitLab CI and CircleCI offer robust solutions for automating testing and deployment processes. Additionally, the use of scripting tools like Ansible and Terraform further enhances this automation, allowing for the provisioning and management of infrastructure through code. These technologies will help you build a seamless pipeline that moves code from development to production with precision and reliability. By integrating these tools into your workflow, you establish a framework that not only accelerates development cycles but also ensures consistency and stability across environments. 7. Environment Creating and managing the development environment is a foundational yet complex aspect of any project's lifecycle. Designing a scalable and maintainable environment can seem daunting, especially for teams with no dedicated DevOps specialist. For many teams, the answer to the question of the best approach to environment management lies in leveraging cloud-based services and containerization. Again, AWS, Google Cloud, and Azure offer a range of services that can be tailored to fit the size and complexity of your project. These platforms provide the tools necessary to create flexible, scalable environments without the need for extensive infrastructure management. Furthermore, the adoption of technologies like Docker and Kubernetes makes deployment across different stages of development, testing, and production consistent and reliable. Building an effective and comfortable environment is not about the server setup only but also about the configuration of local environments for developers. This aspect is crucial for DevOps, as they often craft scripts to simplify the process of launching projects locally. However, this task is not always an easy one. For instance, preparing local environments in .NET can be quite challenging, highlighting the need for choosing technologies and tools that streamline both server and local setups. Ensuring developers have seamless access to efficient local development environments is essential for maintaining productivity and facilitating a smooth workflow. Choosing the right server stack for your project is like setting the foundations for a building: it requires careful consideration, foresight, and a balance between current needs and future growth. Each choice you make impacts your project’s success and its capacity to adapt and flourish in the dynamic technological landscape. With this article, I aimed to guide you through these critical decisions, equipping you with the insights to handle the complexities ahead. I hope that the insights you gained today will help you make informed choices that lead you to the success of your current and future projects! Case Study A: Mass Lie Detector Project In the development of a groundbreaking lie detector designed for mass testing, a project marked as the first of its kind in Eastern Europe, I was faced with the server stack choice as the development team’s lead. The project's core requirements — a vast number of microservice connections and extensive file operations to process diverse sensor outputs — required a robust yet flexible backend solution. We opted for Python with FastAPI over other contenders like Python/Django and Go/Fiber. The decision hinged on FastAPI's superior support for asynchronous programming, a critical feature for handling the project's intensive data processing needs efficiently. Django, while powerful, was set aside due to its synchronous nature, which could not meet our requirements for high concurrency and real-time data handling. Similarly, Go was considered for its performance but ultimately passed over in favor of FastAPI's rapid development capabilities and its built-in support for Swagger documentation, which was invaluable for our tight MVP development timeline. At the same time, the project demanded the creation of a softcam feature capable of managing webcam connections and directing the video stream across various channels. C++ became the language of choice for this task, thanks to its unparalleled execution speed and cross-platform compatibility. The decisions we made on that project have not only facilitated the project's initial success but have laid a solid foundation for its continuous growth and adaptation. Case Study B: Martial Arts Club CRM For this project, I initially opted for Python and Django, choosing them for their rapid development capabilities essential for a swift launch. This choice proved effective in the early stages, directly contributing to increased club revenue through improved attendance management. As the project's scope expanded to include features like employee management, analytics, and an internal messaging system, the limitations of Django for handling complex, concurrent processes became apparent. This realization led me to integrate Go, leveraging its goroutines and Fasthttp for the development of our internal messenger. Go's performance in managing concurrent tasks helped us expand the CRM's functionality, allowing us to maintain high performance with minimal overhead. The decision to use a hybrid technology approach, utilizing Django for core functionalities and Go for high-performance components, proved to be a critical one. This strategy allowed me to balance rapid development and scalability, ensuring the CRM could evolve to meet the growing needs of the club. More

Leveraging Event-Driven Data Mesh Architecture With AWS for Modern Data Challenges

By Sunil Sharma

In today's data-driven world, businesses must adapt to rapid changes in how data is managed, analyzed, and utilized. Traditional centralized systems and monolithic architectures, while historically sufficient, are no longer adequate to meet the growing demands of organizations that need faster, real-time access to data insights. A revolutionary framework in this space is event-driven data mesh architecture, and when combined with AWS services, it becomes a robust solution for addressing complex data management challenges. The Data Dilemma Many organizations face significant challenges when relying on outdated data architectures. These challenges include: Centralized, Monolithic, and Domain Agnostic Data Lake A centralized data lake is a single storage location for all your data, making it easy to manage and access but potentially causing performance issues if not scaled properly. A monolithic data lake combines all data handling processes into one integrated system, which simplifies setup but can be hard to scale and maintain. A domain-agnostic data lake is designed to store data from any industry or source, offering flexibility and broad applicability but may be complex to manage and less optimized for specific uses. Traditional Architecture Failure Pressure Points Centralized Data Architecture In traditional data systems, several problems can occur. Data producers may send large volumes of data or data with errors, creating issues downstream. As data complexity increases and more diverse sources contribute to the system, the centralized data platform can struggle to handle the growing load, leading to crashes and slow performance. Increased demand for rapid experimentation can overwhelm the system, making it hard to quickly adapt and test new ideas. Data response times may become a challenge, causing delays in accessing and using data, which affects decision-making and overall efficiency. Divergence Between Operational and Analytical Data Landscapes In software architecture, issues like siloed ownership, unclear data use, tightly coupled data pipelines, and inherent limitations can cause significant problems. Siloed ownership occurs when different teams work in isolation, leading to coordination issues and inefficiencies. Lack of a clear understanding of how data should be used or shared can result in duplicated efforts and inconsistent results. Coupled data pipelines, where components are too dependent on each other, make it difficult to adapt or scale the system, leading to delays. Finally, inherent limitations in the system can slow down the delivery of new features and updates, hindering overall progress. Addressing these pressure points is crucial for a more efficient and responsive development process. Challenges With Big Data Online Analytical Processing (OLAP) systems organize data in a way that makes it easier for analysts to explore different aspects of the data. To answer queries, these systems must transform operational data into a format suitable for analysis and handling large volumes of data. Traditional data warehouses use ETL (Extract, Transform, Load) processes to manage this. Big data technologies, like Apache Hadoop, improved data warehouses by addressing scaling issues and being open source, which allowed any company to use it as long as they could manage the infrastructure. Hadoop introduced a new approach by allowing unstructured or semi-structured data, rather than enforcing a strict schema upfront. This flexibility, where data could be written without a predefined schema and structured later during querying, made it easier for data engineers to handle and integrate data. Adopting Hadoop often meant forming a separate data team: data engineers handled data extraction, data scientists managed cleaning and restructuring, and data analysts performed analytics. This setup sometimes led to problems due to limited communication between the data team and application developers, often to prevent impacting production systems. Problem 1: Issues With Data Model Boundaries The data used for analysis is closely linked to its original structure, which can be problematic with complex, frequently updated models. Changes to the data model affect all users, making them vulnerable to these changes, especially when the model involves many tables. Problem 2: Bad Data, The Costs of Ignoring the Problem Bad data often goes unnoticed until it causes issues in a schema, leading to problems like incorrect data types. Since validation is often delayed until the end of the process, bad data can spread through pipelines, resulting in expensive fixes and inconsistent solutions. Bad data can lead to significant business losses, such as billing errors costing millions. Research indicates that bad data costs businesses trillions annually, wasting substantial time for knowledge workers and data scientists. Problem 3: Lack of Single Ownership Application developers, who are experts in the source data model, typically do not communicate this information to other teams. Their responsibilities often end at their application and database boundaries. Data engineers, who manage data extraction and movement, often work reactively and have limited control over data sources. Data analysts, far removed from developers, face challenges with the data they receive, leading to coordination issues and the need for separate solutions. Problem 4: Custom Data Connections In large organizations, multiple teams may use the same data but create their own processes for managing it. This results in multiple copies of data, each managed independently, creating a tangled mess. It becomes difficult to track ETL jobs and ensure data quality, leading to inaccuracies due to factors like synchronization issues and less secure data sources. This scattered approach wastes time, money, and opportunities. Data mesh addresses these issues by treating data as a product with clear schemas, documentation, and standardized access, reducing bad data risks and improving data accuracy and efficiency. Data Mesh: A Modern Approach Data Mesh Architecture Data mesh redefines data management by decentralizing ownership and treating data as a product, supported by self-service infrastructure. This shift empowers teams to take full control over their data while federated governance ensures quality, compliance, and scalability across the organization. In simpler terms, It is an architectural framework that is designed to resolve complex data challenges by using decentralized ownership and distributed methods. It is used to integrate data from various business domains for comprehensive data analytics. It is also built on top of strong data sharing and governance policies. Goals of Data Mesh Data mesh helps various organizations get some valuable insights into the data at scale; in short, handling an ever-changing data landscape, the growing number of data sources and users, the variety of data transformations needed, and the need to quickly adapt to changes. Data mesh solves all the above-mentioned problems by decentralizing control, so teams can manage their own data without it being isolated in separate departments. This approach improves scalability by distributing data processing and storage, which helps avoid slowdowns in a single central system. It speeds up insights by allowing teams to work directly with their own data, reducing delays caused by waiting for a central team. Each team takes responsibility for their own data, which boosts quality and consistency. By using easy-to-understand data products and self-service tools, data mesh ensures that all teams can quickly access and manage their data, leading to faster, more efficient operations and better alignment with business needs. Key Principles of Data Mesh Decentralized data ownership: Teams own and manage their data products, making them responsible for their quality and availability. Data as a product: Data is treated like a product with standardized access, versioning, and schema definitions, ensuring consistency and ease of use across departments. Federated governance: Policies are established to maintain data integrity, security, and compliance, while still allowing decentralized ownership. Self-service infrastructure: Teams have access to scalable infrastructure that supports the ingestion, processing, and querying of data without bottlenecks or reliance on a centralized data team. How Do Events Help Data Mesh? Events help a data mesh by allowing different parts of the system to share and update data in real-time. When something changes in one area, an event notifies other areas about it, so everyone stays up-to-date without needing direct connections. This makes the system more flexible and scalable because it can handle lots of data and adapt to changes easily. Events also make it easier to track how data is being used and managed, and let each team handle their own data without relying on others. Finally, let us look at the event-driven data mesh architecture. Event-Driven Data Mesh Architecture This event-driven approach lets us separate the producers of data from the consumers, making the system more scalable as domains evolve over time without needing major changes to the architecture. Producers are responsible for generating events, which are then sent to a data-in-transit system. The streaming platform ensures these events are delivered reliably. When a producer microservice or datastore publishes a new event, it gets stored in a specific topic. This triggers listeners on the consumer side, like Lambda functions or Kinesis, to process the event and use it as needed. Leveraging AWS for Event-Driven Data Mesh Architecture AWS offers a suite of services that perfectly complement the event-driven data mesh model, allowing organizations to scale their data infrastructure, ensure real-time data delivery, and maintain high levels of governance and security. Here’s how various AWS services fit into this architecture: AWS Kinesis for Real-Time Event Streaming In an event-driven data mesh, real-time streaming is a crucial element. AWS Kinesis provides the ability to collect, process, and analyze real-time streaming data at scale. Kinesis offers several components: Kinesis Data Streams: Ingest real-time events and process them concurrently with multiple consumers. Kinesis Data Firehose: Delivers event streams directly to S3, Redshift, or Elastic search for further processing and analysis. Kinesis Data Analytics: Processes data in real-time to derive insights on the fly, allowing immediate feedback loops in data processing pipelines. AWS Lambda for Event Processing AWS Lambda is the backbone of serverless event processing in the data mesh architecture. With its ability to automatically scale and process incoming data streams without requiring server management, Lambda is an ideal choice for: Processing Kinesis streams in real-time Invoking API Gateway requests in response to specific events Interacting with DynamoDB, S3, or other AWS services to store, process, or analyze data AWS SNS and SQS for Event Distribution AWS Simple Notification Service (SNS) acts as the primary event broadcasting system, sending real-time notifications across distributed systems. AWS Simple Queue Service (SQS) ensures that messages between decoupled services are delivered reliably, even in the event of partial system failures. These services allow decoupled microservices to interact without direct dependencies, ensuring that the system remains scalable and fault-tolerant. AWS DynamoDB for Real-Time Data Management In decentralized architectures, DynamoDB provides a scalable, low-latency NoSQL database that can store event data in real time, making it ideal for storing the results of data processing pipelines. It supports the Outbox pattern, where events generated from the application are stored in DynamoDB and consumed by the streaming service (e.g., Kinesis or Kafka). AWS Glue for Federated Data Catalog and ETL AWS Glue offers a fully managed data catalog and ETL service, essential for federated data governance in the data mesh. Glue helps catalog, prepare, and transform data in distributed domains, ensuring discoverability, governance, and integration across the organization. AWS Lake Formation and S3 for Data Lakes While the data mesh architecture moves away from centralized data lakes, S3 and AWS Lake Formation play a crucial role in storing, securing, and cataloging data that flows between various domains, ensuring long-term storage, governance, and compliance. Event-Driven Data Mesh in Action With AWS and Python Event Producer: AWS Kinesis + Python In this example, we use AWS Kinesis to stream events when a new customer is created: Python import boto3 import json kinesis = boto3.client('kinesis') def send_event(event): kinesis.put_record( StreamName="CustomerStream", Data=json.dumps(event), PartitionKey=event['customer_id'] ) def create_customer_event(customer_id, name): event = { 'event_type': 'CustomerCreated', 'customer_id': customer_id, 'name': name } send_event(event) # Simulate a new customer creation create_customer_event('123', 'ABC XYZ') Event Processing: AWS Lambda + Python This Lambda function consumes Kinesis events and processes them in real time. Python import json import boto3 dynamodb = boto3.resource('dynamodb') table = dynamodb.Table('CustomerData') def lambda_handler(event, context): for record in event['Records']: payload = json.loads(record['kinesis']['data']) if payload['event_type'] == 'CustomerCreated': process_customer_created(payload) def process_customer_created(event): table.put_item( Item={ 'customer_id': event['customer_id'], 'name': event['name'] } ) print(f"Stored customer data: {event['customer_id']} - {event['name']}") Conclusion By leveraging AWS services such as Kinesis, Lambda, DynamoDB, and Glue, organizations can fully realize the potential of event-driven data mesh architecture. This architecture provides agility, scalability, and real-time insights, ensuring that organizations remain competitive in today’s rapidly evolving data landscape. Adopting an event-driven data mesh architecture is not just a technical enhancement but a strategic imperative for businesses that want to thrive in the era of big data and distributed systems. More

Trend Report

Kubernetes in the Enterprise

In 2014, Kubernetes' first commit was pushed to production. And 10 years later, it is now one of the most prolific open-source systems in the software development space. So what made Kubernetes so deeply entrenched within organizations' systems architectures? Its promise of scale, speed, and delivery, that is — and Kubernetes isn't going anywhere any time soon.DZone's fifth annual Kubernetes in the Enterprise Trend Report dives further into the nuances and evolving requirements for the now 10-year-old platform. Our original research explored topics like architectural evolutions in Kubernetes, emerging cloud security threats, advancements in Kubernetes monitoring and observability, the impact and influence of AI, and more, results from which are featured in the research findings.As we celebrate a decade of Kubernetes, we also look toward ushering in its future, discovering how developers and other Kubernetes practitioners are guiding the industry toward a new era. In the report, you'll find insights like these from several of our community experts; these practitioners guide essential discussions around mitigating the Kubernetes threat landscape, observability lessons learned from running Kubernetes, considerations for effective AI/ML Kubernetes deployments, and much more.

Refcard #303

API Integration Patterns

By Thomas Jardinet

CORE

Refcard #389

Threat Detection

By Sudip Sengupta

CORE

AI/ML Innovation in the Kubernetes Ecosystem

As organizations put artificial intelligence and machine learning (AI/ML) workloads into continuous development and production deployment, they need to have the same levels of manageability, speed, and accountability as regular software code. The popular way to deploy these workloads is Kubernetes, and the Kubeflow and KServe projects enable them there. Recent innovations like the Model Registry, ModelCars feature, and TrustyAI integrations in this ecosystem are delivering these improvements for users who rely on AI/ML. These, and other improvements, have made open source AI/ML ready for use in production. More improvements are coming in the future. Better Model Management AI/ML analyzes data and produces output using machine learning "models," which consist of code, data, and tuning information. In 2023, the Kubeflow community identified a key requirement to have better ways of distributing tuned models across large Kubernetes clusters. Engineers working on Red Hat's OpenShift AI agreed and started work on a new Kubeflow component, Model Registry. "The Model Registry provides a central catalog for developers to index and manage models, their versions, and related artifacts metadata," explained Matteo Mortari, Principal Software Engineer at Red Hat and Kubeflow contributor. "It fills a gap between model experimentation and production activities, providing a central interface for all users to effectively collaborate on ML models." The AI/ML model development journey, from initial experimentation to deployment in production, requires coordination between data scientists, operations staff, and users. Before Model Registry, this involved coordinating information scattered across many places in the organization – even email! With Model Registry, system owners can implement efficient machine learning operations (MLOps), letting them deploy directly from a dedicated component. It's an essential tool for researchers looking to run many instances of a model across large Kubernetes clusters. The project is currently in Alpha, and was included in the recent Kubeflow 1.9 release. Faster Model Serving Kubeflow makes use of the KServe project to "serve," or run, models on each server in the Kubernetes cluster. Users care a great deal about latency and overhead when serving models: they want answers as quickly as possible, and there's never enough GPU power. Many organizations have service level objectives (SLO) for response times, particularly in regulated industries. "One of the challenges that we faced when we first tried out LLMs on Kubernetes was to avoid unnecessary data movements as much as possible," said Roland Huss, Senior Principal Software Engineer at Red Hat and KServe and Knative contributor. "Copying over a multi-gigabyte model from an external storage can take several minutes which adds to the already lengthy startup of an inference service. Kubernetes itself knows how to deal with large amounts of data when it comes to container images, so why not piggyback on those matured techniques?" This thinking led to the development of Modelcars, a passive "sidecar" container holding the model data for KServe. That way, a model needs to be present only once at a cluster node, regardless how many replicas are accessing it. Container image handling is a very well explored area in Kubernetes, with sophisticated caching and performance optimization for the image handling. The result has been faster startup times for serving models, and greatly reduced disk space requirements for cluster nodes. Huss also pointed out that Kubernetes 1.31 recently introduced an image volume type that allows the direct mount of OCI images. When that feature is generally available, which may take a year, it can replace ModelCar for even better performance. Right now, ModelCar is available in KServe v0.12 and above. Safer Model Usage AI/ML systems are complex, and it can be difficult to figure out how they arrive at their output. Yet it's important to ensure that unexpected bias or logic errors don't create misleading results. TrustyAI is a new open source project which aims to bring "responsible AI" to all stages of the AI/ML development lifecycle. "The TrustyAI community strongly believes that democratizing the design and research of responsible AI tooling via an open source model is incredibly important in ensuring that those affected by AI decisions – nowadays, basically everyone – have a say in what it means to be responsible with your AI," stated Rui Vieira, Senior Software Engineer at Red Hat and TrustyAI contributor. The project uses an approach where a core of techniques/algorithms, mostly focused on AI explainability, metrics and guardrails, can be integrated at different stages of the lifecycle. For example, a Python TrustyAI library can be used through Jupyter notebooks during the model experimentation stage to identify biases. The same functionality can be also used for continuous bias detection of production models by incorporating the tool as a pipeline step before model building or deployment. TrustyAI is in its second year of development and KServe supports TrustyAI. Future AI/ML Innovations With these features and tools, and others, development and deployment of AI/ML models is becoming more consistent, reliable, efficient, and verifiable. As with other generations of software, this allows organizations to adopt and customize their own open source AI/ML stacks that would have been too difficult or risky before. The Kubeflow and KServe community is working hard on the next generation of improvements, usually in the Kubernetes Serving Working Group (WG Serving). This includes the LLM Serving Catalog, to provide working examples for popular model servers and explore recommended configurations and patterns for inference workloads. WG Serving is also exploring the LLM Instance Gateway to more efficiently serve distinct LLM use cases on shared model servers running the same foundation model, allowing scheduling requests to pools of model servers. The KServe project is working on features to support very large models. One is multi-host/multi-node support for models which are too big to run on a single node/host. Support for "Speculative Decoding," another in-development feature, speeds up large model execution and improves inter-token latency in memory-bound LLM inference. The project is also developing "LoRA adapter" support which permits serving already trained models with in-flight modifications via adapters to support distinct use cases instead of re-training each of them from scratch before serving. The KServe community is also working on Open Inference Protocol extension to GenAI Task APIs that provide community-maintained protocols to support various GenAI task specific APIs. The community is also working closely with WG Serving to integrate with the efforts like LLM Instance Gateway and provide KServe examples in the Serving Catalog. These and other features are in the KServe Roadmap. The author will be delivering a keynote about some of these innovations at KubeCon's Cloud Native AI Day in Salt Lake City. Thanks to all of the ingenuity and effort being poured into open source AI/ML, users will find the experience of building, running, and training models to keep getting more manageable and performant for many years to come. This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.View the Event

By Yuan Tang

Securing Cloud-Native Applications: A CISO’s Perspective on Broken Access Control

When we talk about security in cloud-native applications, broken access control remains one of the most dangerous vulnerabilities. The OWASP Top 10 lists it as the most prevalent security risk today, and for good reason: the impact of mismanaged permissions can lead to catastrophic outcomes like data breaches or ransomware attacks. For CISOs, addressing broken access control isn't just a technical challenge—it’s a strategic priority that touches nearly every aspect of an organization’s security posture. As part of my job as the VP of Developer Relations in Permit.io, I consulted with dozens of CISOs and security engineers leaders, from small garage startup founders to Fortune 100 enterprise security staff. This article will try to provide the most comprehensive perspective I gathered from these chats, guiding you in considering broken access control challenges in cloud-native applications. Understanding the Threat At its core, broken access control occurs when unauthorized users gain access to parts of an application they shouldn’t be able to see or modify. This vulnerability can manifest in several ways: from users gaining admin privileges they shouldn’t have to attackers exploiting weak session management to move laterally within a system. What makes this threat particularly dangerous in cloud-native environments is the complexity of modern application architectures. Microservices, third-party APIs, and distributed resources create a multifaceted ecosystem where data flows across various services. Each connection is a potential point of failure. CISOs must ensure that access control mechanisms are ironclad—every request to access sensitive data or perform critical operations must be carefully evaluated and tightly controlled. The Three Pillars of Access Control Addressing broken access control requires a comprehensive strategy built on three key pillars: authentication, permissions, and session management. Each plays a critical role in securing cloud-native applications: Authentication: This is the first line of defense, ensuring that users are who they claim to be. Strong authentication methods like multi-factor authentication (MFA) can drastically reduce the risk of unauthorized access. Permissions: Even after authentication, not all users should have equal access. Permissions dictate what authenticated users can do. In cloud-native apps, fine-grained permissions are essential to prevent privilege escalation and data leakage. Session Management: Proper session management ensures that once a user is authenticated and authorized, their activities are monitored, and their access remains limited to the session’s scope. Poor session management can allow attackers to hijack sessions or escalate privileges. Why Permissions Matter More Than Ever While all three pillars are crucial, permissions are the backbone of modern access control. In a cloud-native environment, where services and resources are distributed across different infrastructures, managing permissions becomes exponentially more challenging. A one-size-fits-all approach, like assigning simple roles (e.g., Admin, User), isn’t sufficient. Today’s applications require a more nuanced approach to permissions management. Fine-Grained Authorization To prevent unauthorized access, organizations should implement fine-grained authorization models. These models allow for more precise control by evaluating multiple attributes—such as a user’s role, location, or even payment method—before granting access. This granular level of control is necessary to avoid both horizontal and vertical privilege escalation. For example, imagine a SaaS product with different pricing tiers. A user’s access to features shouldn’t just depend on their role (e.g., admin or regular user) but also on their subscription level, which should automatically update based on their payment status in an external payment application. Implementing fine-grained permissions ensures that only users who have paid for premium features can access them, even if they have elevated roles within the system. The Importance of Least Privilege A critical part of permissions management is enforcing the principle of least privilege. Simply put, users should have the minimal level of access required to perform their tasks. This principle is especially important in cloud-native applications, where microservices may expose sensitive data across various parts of the system. For example, a developer working on one service shouldn’t have full access to every service in the environment. Limiting access in this way reduces the risk of an attacker exploiting one weak point to gain broader access. It also prevents insider threats, where an internal user might misuse their privileges. Managing Sessions to Contain Threats While permissions control access to features and data, session management ensures that users’ activities are properly constrained during their session. Strong session management practices include limiting session duration, detecting unusual behavior, and ensuring that session tokens are tightly secured. Session hijacking, where attackers steal a user’s session token and take over their session, is a common attack vector in cloud-native environments. Implementing session timeouts, MFA for high-risk actions, and token revocation mechanisms can help mitigate these risks. Effective session management also includes ensuring that users cannot escalate their privileges within the session. For example, a user who starts a session with standard permissions shouldn’t be able to gain admin-level privileges without re-authenticating. The CISO’s Role in Securing Access Control For a CISO, the challenge of preventing broken access control goes beyond simply setting policies. It involves fostering collaboration between security teams, developers, and product managers. This ensures that access control isn’t just a checkbox in compliance reports but a living, adaptive process that scales with the organization’s needs. A Strategic Approach to Collaboration CISOs must ensure that developers have the resources and tools they need to build secure applications without becoming bottlenecks in the process. Traditional access control systems often put too much burden on developers, requiring them to manually write permission logic into the code. This not only slows down development, but also introduces the risk of human error. Instead, CISOs should promote a culture of collaboration where security, development, and product teams can work together on defining and managing access control policies. By implementing automated and scalable tools, CISOs can empower teams to enforce security policies effectively while maintaining agility in the development process. Authorization-as-a-Service One of the most effective ways to manage permissions in a scalable and secure manner is through authorization-as-a-service solutions. These platforms can provide a centralized, no-code interface for defining and managing authorization policies, making it easier for non-technical stakeholders to be involved in the process. By leveraging these tools, organizations can reduce their reliance on developers to manually manage permissions. This not only speeds up the process, but also ensures that permissions are consistently enforced across all services. With real-time policy updates, automated monitoring, and auditability features, authorization-as-a-service platforms allow organizations to stay agile while maintaining strong access control measures. The flexibility of these solutions also allows for easier scaling as the application and user base grow, ensuring that permission models can evolve without requiring significant re-engineering. Additionally, having a no-code UI allows for rapid adjustments to access policies in response to changing business needs or security requirements, without creating unnecessary dependencies on development teams. Conclusion Preventing broken access control vulnerabilities in cloud-native applications is a critical priority for CISOs. It requires a strategic focus on fine-grained permissions, the principle of least privilege, and robust session management. Collaboration across teams and the adoption of modern tools like authorization-as-a-service platforms can greatly simplify this complex challenge, enabling organizations to secure their environments without sacrificing speed or flexibility. By addressing these areas, CISOs can help ensure that their organizations remain resilient to access control vulnerabilities while empowering their teams to manage permissions effectively and securely.

By Gabriel L. Manor

Dust: Open-Source Actors for Java

Virtual Threads Java 21 saw the supported introduction of virtual threads. Unlike regular Java threads (which usually correspond to OS threads), virtual threads are incredibly lightweight, indeed an application can create and use 100,000 or more virtual threads simultaneously. This magic is achieved by two major changes to the JVM: A virtual thread is managed by the JVM, not the OS. If it is executing, it is bound to a platform thread (known as a carrier); if it is not executing (say it is blocked waiting for some form of notification), the JVM "parks" the virtual thread and frees the carrier thread so it can schedule a different virtual thread. A platform thread typically has about 1 megabyte of memory preassigned to it for its stack, etc. In contrast, a virtual thread’s stack is managed in the heap and can be as little as a few hundred bytes — growing and shrinking as needed. The API for managing cooperation and communication between virtual threads is exactly the same as for legacy platform threads. This has good and bad points: The good: Implementers are familiar with the interface. The bad: You are still faced with all the usual "hard" parts of multi-threaded applications — synchronized blocks, race conditions, etc. — only now the problem is increased by orders of magnitude. Moreover, a virtual thread cannot get parked in a synchronized block – so the more synchronized blocks are used, the less efficient virtual threads become. What is needed is a new approach. One that can exploit the ability to run millions of virtual threads in a meaningful way but do so while making multi-threaded programming easier. In fact, such a model exists and it was first discussed 50 years ago: Actors. Actors and Dust The Actor concept arose during the 1970s at MIT with research by Carl Hewitt. The Actor concept is at the core of languages like Erlang and Elixir and frameworks like Dust: an open-source (Apache2 license) implementation of Actors for Java 21+. Different implementations of Actors vary in the details, so from now on we will describe the specific Dust Actor model: An Actor is a Java object associated with exactly one virtual thread. An Actor has a "mailbox" that receives and queues messages from other Actors. The thread wait()s on this mailbox, retrieves a message, processes it, and returns to waiting for its next message. How the Actor processes messages is called its Behavior. Note that if the Actor has no pending messages then, since the mailbox thread is virtual the JVM will "park" the Actor and reuse its thread. When a message is received, the JVM will un-park the Actor and give it a thread to process the message. This is all transparent to the developer whose only cares are messages and behaviors. An Actor may have its own mutable state which is inaccessible outside the Actor. In response to receipt of a message an Actor may: Mutate its state Send immutable messages to other Actors Create or destroy other Actors Change its Behavior That’s it. Note that an Actor is single threaded so there are no locking/synchronization issues within an Actor. The only way an Actor can influence another Actor is by sending it an immutable message – so there are no synchronization issues between Actors. The order of messages sent by one Actor to another is preserved by the receiving Actor but continuity is not guaranteed. If two Actors send messages to the same Actor at the same time, the messages may be interleaved but the order of each stream is preserved. Actors are managed by an ActorSystem. It has a name, and, optionally a port number. If the port is specified, then Actors in the ActorSystem can receive messages sent remotely — either from another port or another host entirely. The ActorSystem takes care of (de)serialization of messages in the remote case. An Actor has a unique address which resembles a URL: dust://host:port/actor-system-name/a1/a2/a3. If you are communicating with Actors in the same Actor System, the URL can be reduced to: /a1/a2/a3. This is more than a pathname, though: it expresses a parent/child relationship between Actors, namely: An Actor was created with the name a1. It then created an Actor called a2 : a1 is the "parent" and a2 the "child" of a1. Actor a2 then created a child of its own called a3. Actors can create many children. The only requirement is their names be distinct from their "siblings." Actor Structure Actors extend the Actor class. It is important to note that Actors are not created directly with a "new" but use a different mechanism. This is needed to set up correct parent-child relationships. We use the Props class for this as in the following simple example: /** * A very simple Actor */ public class PingPongActor extends Actor { private int max; /** * Used internally to call the appropriate constructor */ public static Props props(int max) { Props.create(PingPongActor.class, max); } public PingPongActor(int max) { this.max = max } // Define the initial Behavior @Override protected ActorBehavior createBehavior() { return message → { switch(message) { case PingPongMsg → { sender.tell(message, self); if (0 == --max) stopSelf(); } default → System.out.println(“Strange message …”); } } } } Actors are created from their Props (see below), which can also include initialization parameters. So in the above, our PingPongActor initialization includes a max count, whose use we will show shortly. Actors are created by other Actors, but that chain has to begin somewhere. When an ActorSystem is created, it creates several default top-level Actors, including one called /user. An application can then create children of this Actor via the ActorSystem: ActorSystem system = new ActorSystem('PingPong'); ActorRef ping = system.context.actorOf(PingPongActor.props(1000000), ‘ping’); The context of an ActorSystem provides the actorOf() method, which creates children of the /user Actor. Actors themselves have an identical actorOf() for creating their children. If we now looked into the ActorSystem, we would see a new PingPongActor whose name is ping and whose path is /user/ping. The value returned by this creation step is an ActorRef — a "handle" to that particular Actor. Let's build another: ActorRef pong = system.context.actorOf(PingPongActor.props(1000000), ‘pong’); So now we have two instances of PingPongActor, with their "max" state set to 1000000 and both are waiting to receive messages in their mailbox. When it has a message, it passes it to the createBehavior() lambda, which implements our behavior. So what does this behavior do? First, we need a nice message class to get things fired up: public class PingPongMsg implements Serializable {} The only constraint on messages is they must be serializable. So now let’s look at our setup: ActorSystem system = new ActorSystem('PingPong'); ActorRef ping = system.context.actorOf(PingPongActor.props(1000000), ‘ping’); ActorRef pong = system.context.actorOf(PingPongActor.props(1000000), ‘pong’); pong.tell(new PingPongMsg(), ping); ActorRefs have a tell() method which takes a Serializable message object and a (nullable) ActorRef. Thus, in the above an instance of PingPongMsg is delivered to the Actor at pong. Since the second argument was not null, that ActorRef (ping) is available as the "sender" variable in the recipient's behavior. Recall that the part of the behavior that dealt with a PingPongMsg was: case PingPongMsg → { sender.tell(message, self); if (0 == --max) stopSelf(); } The sender of this message gave me his ActorRef (ping) so I am simply sending the message back to him, telling him that I (pong) am the sender via the self variable. Rinse, lather, and repeat one million times. So the same message will have been passed back and forth two million times in total between the two Actors, and once their counters hit 0, each Actor will destroy itself. Beyond PingPong PingPongActor was just about the simplest example capable of giving a feel for Actors and Dust, but is clearly of limited value otherwise. GitHub contains several Dust repos which constitute a small library around the Dust framework. dust-core– The heart of Dust: Actors, persistent Actors, various structural Actors for building pipelines, scalable servers, etc. Programmer documentation dust-http – Small library to make it easy for Actors to access Internet endpoints, etc. dust-html – A small library to make manipulating web page content easy in idiomatic Dust dust-feeds – Actors to access RSS feeds, crawl ,websites, and use SearXNG for web searches dust-nlp – Actors to access ChatGPT (and similar) endpoints and the Hugging Face embeddings API The Actor paradigm is an ideal match for event-driven scenarios. Dust has been used to create systems such as: Intelligent news reader using LLMs to identify and follow trending topics Building occupancy management using WiFi signal strengths as proxies for people A digital twin of a toy town – 8000 Actors just to simulate flocking birds! A system to find and analyze data for M&A activities

By Alan Littleford

Symbolic and Connectionist Approaches: A Journey Between Logic, Cognition, and the Future Challenges of AI

This article explores two major approaches to artificial intelligence: symbolic AI, based on logical rules, and connectionist AI, inspired by neural networks. Beyond the technical aspects, the aim is to question concepts such as perception and cognition and to reflect on the challenges that AI must take up to better manage contradictions and aim to imitate human thought. Preamble French researcher Sébastien Konieczny was recently named EuAI Fellow 2024 for his groundbreaking work on belief fusion and inconsistency management in artificial intelligence. His research, focused on reasoning modeling and knowledge revision, opens up new perspectives to enable AI systems to tend to reason even more reliably in the face of contradictory information, and thus better manage the complexity of the real world. Konieczny's work is part of a wider context of reflection and fundamental questioning about the very nature of artificial intelligence. These questions are at the root of the long-standing debate between symbolic and connectionist approaches to AI, where technical advances and philosophical reflections are intertwined. Introduction In the field of artificial intelligence, we can observe two extreme perspectives: on the one hand, boundless enthusiasm for AI's supposedly unlimited capabilities, and on the other, deep concerns about its potential negative impact on society. For a clearer picture, it makes sense to go back to the basics of the debate between symbolic and connectionist approaches to AI. This debate, which goes back to the origins of AI, compares two fundamental visions against each other. On the one hand, the symbolic approach sees intelligence as the manipulation of symbols according to logical rules. On the other, the connectionist approach is inspired by the neuronal functioning of the human brain. By refocusing the discussion on the relationship between perception, cognition, learning, generalization, and common sense, we can elevate the debate beyond speculation about the alleged consciousness of today's AI systems. The Symbolic Approach The symbolic approach sees the manipulation of symbols as fundamental to the formation of ideas and the resolution of complex problems. According to this view, conscious thought relies on the use of symbols and logical rules to represent knowledge and, from there, to reason. “Although recently connectionist AI has started addressing problems beyond narrowly defined recognition and classification tasks, this mostly remains a promise: it remains to be seen if connectionist AI can accomplish complex tasks that require commonsense reasoning and causal reasoning, all without including knowledge and symbols.” - Ashok K. Goel Georgia Institute of Technology The Connectionist Approach This vision, projected by the symbolic approach, is contested by proponents of the connectionist approach, who maintain that intelligence emerges from complex interactions between numerous simple units, much like the neurons in the brain. They argue that current AI models, based on deep learning, demonstrate impressive capabilities without explicit symbol manipulation. Konieczny's work on reasoning modeling and knowledge revision provides food for thought in this debate. By focusing on the ability of AI systems to handle uncertain and contradictory information, this research highlights the complexity of autonomous reasoning. They highlight what’s perhaps the real challenge: how to enable an AI to revise its knowledge in the face of new information while maintaining internal consistency. Experiencing the World Now, we know very well that seeing, touching, and hearing the world (in other words, experiencing the world through the body) are essential for humans to build cognitive structures. When we consider the evolution of AI systems, particularly in the field of Generative AI (GenAI) democratized by the release of ChatGPT in 2022, AI systems are approaching a form of “thinking” in their own way. This theory is based on the fact that on the basis of massive data sets, collected from real-world sensors, advanced systems, such as autonomous systems for example, would already be able to establish models that mimic forms of understanding. Mimicking Cognitive Abilities Although AI lacks consciousness, these systems process and react to their environment in a way that suggests a data-driven way of mimicking cognitive abilities. This imitation of cognition raises fascinating questions about the nature of intelligence and consciousness. Are we creating truly intelligent entities, or simply extremely sophisticated systems of imitation? "In many computer science fields, one needs to synthesize a coherent belief from several sources. The problem is that, in general, these sources contradict each other. So the merging of belief sources is a non-trivial issue” - Sébastien Konieczny, CNRS Konieczny's observation reinforces the idea that AI needs to reconcile contradictory information. This problem, far from being purely and solely technical, opens the way to deeper reflections on the nature of reasoning and understanding. The theme of managing inconsistencies echoes philosophical debates on experience and common sense in AI. Indeed, the ability to detect and resolve contradictions is a fundamental quality of human reasoning and a key element in our understanding of the world. The Concept of Experience If we were to transpose Kant's concept of experience onto AI technologies, we might suggest that the risk for some — and the opportunity for others — lies in how our understanding of "experience" itself is evolving. If, for Kant, experience is the result of a synthesis between sensible data — i.e., the raw information perceived by our senses — and the concepts of understanding, on what criteria can we base our assertion that machines can acquire experience? This transposition prompts us to reflect on the very nature of experience, and on the possibility of machines gaining access to a form of understanding comparable to that of humans. That would be quite a significant leap from this reflection to asserting that machines can truly acquire experience. . . The Concept of Common Sense If we now turn to the concept of “common sense," we can conceive of it as a form of practical wisdom derived from everyday experience. In the context of our thinking on AI, common sense could be seen as an intuitive ability to navigate the real world, to make rapid inferences without resorting to formal reasoning. We can attribute to common sense the ability to form a bridge between perception and cognition. This suggests that lived experience is crucial to understanding the world. So how can a machine, devoid of direct sensory experience, develop this kind of intuitive intelligence? This question raises another challenge for AI: to reproduce not only formal intelligence but also that form of wisdom that we humans often take for granted. We need to understand that, when machines integrate data from our human experience, even though they haven't experienced it for themselves, they are making the closest thing we have to a “decision," not a choice. Decision vs. Choice It's necessary here to make a clear distinction between “decision” and “choice." A machine can make decisions by executing algorithms and analyzing data, but can it really make choices? Where decision involves a logical process of selection among options, based on predefined criteria, choice on the other hand involves an extra dimension of free will, common sense, self-awareness, and moral responsibility. When an AI “decides," it follows a logical path determined by its programming and data. But a real choice, like those made by humans, implies a deep understanding of the consequences, an ability to reason abstractly about values, and potentially even an intuition that goes beyond mere logic. This distinction highlights a fundamental limitation of today's AI: although it can make extremely complex and sophisticated decisions, it remains devoid of the ability to make choices in the fully human sense of the term. While this distinction turns out to be far more philosophical than technical, it is nonetheless often discussed in debates on artificial intelligence and consciousness and the capacity to think. Konieczny's research into knowledge fusion and revision sheds interesting light on this distinction. By working on methods enabling AI to handle conflicting information and estimate the reliability of sources, this work could help develop systems capable of making more nuanced decisions, perhaps coming closer to the notion of “choice” as we conceive it for humans. See and Act in the World AI, in processing data, is not granted with consciousness or experience. As Dr. Li Fei-Fei, Co-Director of Stanford’s Human-Centered AI Institute,) puts it: “To truly understand the world, we must not only see but also act in it." She used to highlight the fundamental limitation of machines, which, deprived of autonomous action, subjectivity, and the ability to “choose," cannot truly experience the world as humans do. In her lecture “What We See and What We Value: AI with a Human Perspective,” Dr. Li addresses the issue of visual intelligence as an essential component of animal and human intelligence. She argues that it is necessary to enable machines to perceive the world in a similar way while raising fundamental ethical questions about the implications of developing AI systems capable of seeing and interacting with the world around us. This reflection is fully in line with the wider debate on perception and cognition in AI, suggesting that while AI can indeed process visual data with remarkable efficiency, it remains lacking the human values and subjective experience that characterize our understanding of the world. This perspective brings us back to the central question of the experience and “lived experience” of machines, highlighting once again the gap that exists between data processing, however sophisticated, and the true understanding of the world as we humans conceive it. Conclusion While the progress of AI is undeniable and impressive, the debate between symbolic and connectionist approaches reminds us that we are still far from fully understanding the nature of intelligence and consciousness. This debate will continue to influence the development of AI while pushing us to reflect on what really makes us thinking and conscious beings. One More Thing It's important to stress that this article is not intended to suggest that machines might one day acquire true consciousness, comparable to that of humans. Rather, by exploring philosophical concepts such as experience and choice, the intention is to open up avenues of reflection on how to improve artificial intelligence. These theoretical reflections offer a framework for understanding how AI could, through advanced data processing methods, better mimic certain aspects of human cognition without claiming to achieve consciousness. It’s in this search for better techniques, and not in speculation about artificial consciousness, that the purpose of this exploration lies.

By Frederic Jacquet

CORE

How to Implement Client-Side Load Balancing With Spring Cloud

It is common for microservice systems to run more than one instance of each service. This is needed to enforce resiliency. It is therefore important to distribute the load between those instances. The component that does this is the load balancer. Spring provides a Spring Cloud Load Balancer library. In this article, you will learn how to use it to implement client-side load balancing in a Spring Boot project. Client and Server Side Load Balancing We talk about client-side load balancing when one microservice calls another service deployed with multiple instances and distributes the load on those instances without relying on external servers to do the job. Conversely, in the server-side mode, the balancing feature is delegated to a separate server, that dispatches the incoming requests. In this article, we will discuss an example based on the client-side scenario. Load Balancing Algorithms There are several ways to implement load balancing. We list here some of the possible algorithms: Round robin: The instances are chosen one after the other sequentially, in a circular way (after having called the last instance in the sequence, we restart from the first). Random choice: The instance is chosen randomly. Weighted: The choice is made by a weight assigned to each node, based on some quantity (CPU or memory load, for example). Same instance: The same instance previously called is chosen if it's available. Spring Cloud provides easily configurable implementations for all of the above scenarios. Spring Cloud Load Balancer Starter Supposing you work with Maven, to integrate Spring Cloud Load Balancer in your Spring Boot project, you should first define the release train in the dependency management section: XML <dependencyManagement> <dependencies> <dependency> <groupId>org.springframework.cloud</groupId> <artifactId>spring-cloud-dependencies</artifactId> <version>2023.0.0</version> <type>pom</type> <scope>import</scope> </dependency> </dependencies> </dependencyManagement> Then you should include the starter named spring-cloud-starter-loadbalancer in the list of dependencies: XML <dependencies> <dependency> <groupId>org.springframework.cloud</groupId> <artifactId>spring-cloud-starter-loadbalancer</artifactId> </dependency> ... </dependencies> Load Balancing Configuration We can configure our component using the application.yaml file. The @LoadBalancerClients annotation activates the load balancer feature and defines a configuration class by the defaultConfiguration parameter. Java @SpringBootApplication @EnableFeignClients(defaultConfiguration = BookClientConfiguration.class) @LoadBalancerClients(defaultConfiguration = LoadBalancerConfiguration.class) public class AppMain { public static void main(String[] args) { SpringApplication.run(AppMain.class, args); } } The configuration class defines a bean of type ServiceInstanceListSupplier and allows us to set the specific balancing algorithm we want to use. In the example below we use the weighted algorithm. This algorithm chooses the service based on a weight assigned to each node. Java @Configuration public class LoadBalancerConfiguration { @Bean public ServiceInstanceListSupplier discoveryClientServiceInstanceListSupplier(ConfigurableApplicationContext context) { return ServiceInstanceListSupplier.builder() .withBlockingDiscoveryClient() .withWeighted() .build(context); } } Testing Client-Side Load Balancing We will show an example using two simple microservices, one that acts as a server and the other as a client. We imagine the client as a book service of a library application that calls an author service. We will implement this demonstration using a JUnit Test. You can find the example in the link at the bottom of this article. The client will call the server through OpenFeign. We will implement a test case simulating the calls on two server instances using Hoverfly, an API simulation tool. The example uses the following versions of Java, Spring Boot, and Spring Cloud. Spring Boot: 3.2.1 Spring Cloud: 2023.0.0 Java 17 To use Hoverfly in our JUnit test, we have to include the following dependency: XML <dependencies>  <dependency> <groupId>io.specto</groupId> <artifactId>hoverfly-java-junit5</artifactId> <scope>test</scope> </dependency> </dependencies> We will configure the load balancer in the client application with the withSameInstancePreference algorithm. That means that it will always prefer the previously selected instance if available. You can implement that behavior with a configuration class like the following: Java @Configuration public class LoadBalancerConfiguration { @Bean public ServiceInstanceListSupplier discoveryClientServiceInstanceListSupplier(ConfigurableApplicationContext context) { return ServiceInstanceListSupplier.builder() .withBlockingDiscoveryClient() .withSameInstancePreference() .build(context); } } We want to test the client component independently from the external environment. To do so we disable the discovery server client feature in the application.yaml file by setting the eureka.client.enabled property to be false. We then statically define two author service instances, on ports 8091 and 8092: YAML spring: application: name: book-service cloud: discovery: client: simple: instances: author-service: - service-id: author-service uri: http://author-service:8091 - service-id: author-service uri: http://author-service:8092 eureka: client: enabled: false We annotate our test class with @SpringBootTest, which will start the client's application context. To use the port configured in the application.yaml file, we set the webEnvironment parameter to the value of SpringBootTest.WebEnvironment.DEFINED_PORT. We also annotate it with @ExtendWith(HoverflyExtension.class), to integrate Hoverfly into the running environment. Using the Hoverfly Domain-Specific Language, we simulate two instances of the server application, exposing the endpoint /authors/getInstanceLB. We set a different latency for the two, by the endDelay method. On the client, we define a /library/getAuthorServiceInstanceLB endpoint, that forwards the call through the load balancer and directs it to one instance or the other of the getInstanceLB REST service. We will perform 10 calls to /library/getAuthorServiceInstanceLB in a for loop. Since we have configured the two instances with very different delays we expect most of the calls to land on the service with the least delay. We can see the implementation of the test in the code below: Java @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.DEFINED_PORT) @ExtendWith(HoverflyExtension.class) class LoadBalancingTest { private static Logger logger = LoggerFactory.getLogger(LoadBalancingTest.class); @Autowired private TestRestTemplate restTemplate; @Test public void testLoadBalancing(Hoverfly hoverfly) { hoverfly.simulate(dsl( service("http://author-service:8091").andDelay(10000, TimeUnit.MILLISECONDS) .forAll() .get(startsWith("/authors/getInstanceLB")) .willReturn(success("author-service:8091", "application/json")), service("http://author-service:8092").andDelay(50, TimeUnit.MILLISECONDS) .forAll() .get(startsWith("/authors/getInstanceLB")) .willReturn(success("author-service:8092", "application/json")))); int a = 0, b = 0; for (int i = 0; i < 10; i++) { String result = restTemplate.getForObject("http://localhost:8080/library/getAuthorServiceInstanceLB", String.class); if (result.contains("8091")) { ++a; } else if (result.contains("8092")) { ++b; } logger.info("Result: ", result); } logger.info("Result: author-service:8091={}, author-service:8092={}", a, b); } } If we run the test we can see all the calls targeting the instance with 20 milliseconds delay. You can change the values by setting a lower range between the two delays to see how the outcome changes. Conclusion Client load balancing is an important part of microservices systems. It guarantees system resilience when one or more of the nodes serving the application are down. In this article, we have shown how it can be implemented by using Spring Cloud Load Balancer. You can find the source code of the example of this article on GitHub.

By Mario Casari

“Let’s Cook!”: A Beginner's Guide to Making Tasty Web Projects

When I was a child, I loved making pancakes with my grandmother. As time went on, I became a web developer, and now, instead of pancakes, I create various web projects with my colleagues. Every time I start a new project, I’m haunted by one question: How can I make this development "tasty" not only for the user but also for my colleagues who will work on it? This is a crucial question because over time, the development team may change, or you might decide to leave the project and hand it over to someone else. The code you create should be clear and engaging for those who join the project later. Moreover, you should avoid a situation where the current developers are dissatisfied with the final product yet have to keep adding new "ingredients" (read: functions) to satisfy the demands of the "restaurant owner." Important note: Before I describe my recipe, I want to point out that methods can vary across different teams and, of course, they depend on their preferences. However, as we know, some people have rather peculiar tastes, so I believe it's essential to reiterate even the simplest truths. Selecting the Ingredients: Choose Your Technology Stack Before you start cooking a dish, you usually check what ingredients you already have. If something is missing, you go to the store or look for alternative ways to acquire them, like venturing out to pick them up in the woods. The web development process is similar: before starting your work on a new project, you need to understand what resources you currently have and what you want to achieve in the end. To prepare for creating your technological masterpiece, it helps to answer a series of questions: What are the functional requirements for the project? What exactly needs to be done? Is anything known about the expected load on your future product? Do you have any ready-made solutions that can be reused? If you’re working in a team: What knowledge and skills does each team member possess? Is the programming language/framework you want to use for development still relevant? How difficult and expensive is it to find developers who specialize in the chosen technologies? Even if you’re working on the project alone, always remember the "bus factor" — the risk associated with losing key personnel. Anything can happen to anyone (even you), so it’s crucial to prepare in advance for any hypothetical issues. No Arbitrary Action: Stick to Coding Standards How about trying oatmeal with onions? It’s hard to believe, but I once had such a dish in Kindergarten. This memory is vividly etched in my mind, and it taught me an important lesson. Coding standards were invented for the same reason as “compatibility standards” of food ingredients. They are supposed to improve code readability and understanding by all developers on the team. We avoid debates about the best way to write code, which constructs to use, and how to structure it. When everyone follows the same rules, the code becomes easier to read, understand, and maintain (also mind that maintenance becomes cheaper this way). But that's not the only reason for having standards: adhering to them helps reduce the number of bugs and errors. For instance, strict rules for using curly braces can prevent situations where an operation is accidentally left out of a condition. Line length restrictions make code more readable, and consistent rules for creating conditions in an if statement help avoid logical errors. Strict rules for data types and type casting in languages with less strict typing also help prevent many runtime errors. Coding standards help reduce dependency on specific developers, which is also good for the developers themselves since they won't be bothered with silly questions during their vacation. In popular programming languages, there are generally accepted coding standards supported by the development team and the community around the language. For example, PEPs (Python Enhancement Proposals) are maintained and managed by the Python developer community under the guidance of the Python Software Foundation (PSF). PSR (PHP Standards Recommendations) is a set of standards developed by PHP-FIG (PHP Framework Interoperability Group) for PHP. Golang has stricter coding standards maintained by the language's developers. However, each development team or company may have its own standards in addition to (or instead of) those supported by the community. There can be many reasons for this; for example, the main codebase might have been written long before any standards were established, making it too costly to rewrite. To maintain uniformity in the codebase, the rules may be adjusted. There are tools for automatically checking standards, known as static code analyzers. These tools generally have a wide range of functionality that can be further expanded and customized. They can also detect errors in the code before it is released to production. Examples of such tools include PHPStan, Psalm, and PHP_CodeSniffer for PHP; Pylint, Flake8, and Mypy for Python; golint and go vet for Golang. There are also tools for automatic code fixing code and bringing it up to existing standards when possible. This means much of this work now should not require as much manual labor and resources as it used to. Keep Ingredients Fresh: Constant Refactoring What happens if you don't keep the code fresh, and how can it lose its freshness? Programming languages and libraries (these are the ingredients) get updated, and with that — old ingredients rot. Establish rules on how you keep the code fresh, use automation tools, and update libraries. This advice may seem obvious, but it's frequently neglected: make sure your project's dependencies and server software are constantly monitored and regularly updated. This is especially important since outdated or insecure code presents an easy target for attackers. Just as with code checking and fixing, you don't have to manually update everything; there are numerous automation tools that can assist. For instance, GitHub’s Dependabot automatically identifies outdated or vulnerable dependencies and proposes updates. It's also vital to automate the renewal of security certificates. Expired certificates can cause significant issues, but automating this process is a straightforward task. For instance, if you're using Let's Encrypt certificates, Certbot can automate their renewal. The same concept applies to server software. For larger projects with multiple servers, tools like Puppet, Salt, Ansible, or Chef can be employed to handle updates. For those working with Linux, especially Debian/Ubuntu-based systems, Unattended Upgrades can efficiently manage this. Taste (And Test) Along the Way A good chef usually tastes the dish at different preparation stages to ensure everything is going according to plan. In a similar fashion, a professional developer should check not just the final, but also the intermediate results using tests. Testing is often associated with just detecting bugs. Indeed, it catches errors and unexpected behaviors before the product reaches users, improving overall quality and reducing the likelihood of issues down the line. But in fact, its importance is much bigger than that. Effective testing is crucial for delivering high-quality, dependable, and well-understood code: Code comprehension: Writing test scenarios demands a deep understanding of the code’s architecture and functionality, leading to better insights into how the program operates and how different parts interact. Supplemental documentation: Tests can also serve as practical examples of how functions and methods are used, helping to document the project’s capabilities and providing new team members with real-world use cases. It’s pretty much clear that achieving 100% test coverage for complex code is unrealistic. Therefore, developers must focus on testing critical functions and essential code segments, and knowing when to stop is key to avoiding an endless cycle of testing. Also, testing can consume significant resources, especially during the early stages of development. So, it’s important to strike a balance between the necessity of testing and the available time and resources. Chef’s Logbook: Add a Pinch of Documentation It’s common knowledge that many famous types of food, like mozzarella, nachos, and even french fries, were discovered by accident. Others took decades to develop through countless trial and error. In both cases, all of them would be just one-off products if knowledge about them could not have been passed on. It is the same with tech: each project needs proper documentation. The lack of such paperwork makes it much harder to identify and fix errors, complicates maintenance and updates, and slows down the onboarding process for new team members. While teams lacking documentation get bogged down in repetitive tasks, projects with well-structured documentation demonstrate higher efficiency and reliability. According to the 2023 Stack Overflow Developer Survey, 90.36% of respondents rely on technical documentation to understand the functionality of technologies. Yet, even with documentation, they often struggle to find the information they need, turning to other resources like Stack Overflow (82.56%) and blogs (76.69%). Research by Microsoft shows that developers spend an average of 1-2% of their day (8-10 minutes) on documentation, and 10.3% report that outdated documentation forces them to waste time searching for answers. The importance of documentation is also a significant concern for the academic community, as evidenced by the millions of scientific publications on the topic. Researchers from HAN University of Applied Sciences and the University of Groningen in the Netherlands identified several common issues with technical documentation: Developer productivity is measured solely by the amount of working software. Documentation is seen as wasteful if it doesn’t immediately contribute to the end product. Informal documentation, often used by developers, is difficult to understand. Developers often maintain a short-term focus, especially in continuous development environments. Documentation is frequently out of sync with the actual software. These “practices” should be avoided at all costs in any project, but it is not always up to developers. Getting rid of these bad habits often involves changes to planning, management, and long-term vision of the entire company from top management to junior dev staff. Conclusion As you see, food and tech project development (including, but not limited to web) has a lot in common with cooking. Proper recipes, fresh and carefully selected ingredients, meticulous compliance with standards, and checking intermediate and final results — sticking to this checklist is equally essential for a chef and for a developer. And, of course, they both should ideally have a strong vision, passion for what they do, and a strong appetite for innovation. I am sure you recognize yourself in this description. Happy cooking!

By Filipp Shcherbanich

Solving a Common Dev Grievance: Architecture Documentation

The global developer population is expected to reach 28.7 million people by 2024, surpassing the entire population of Australia. Among such a large group, achieving unanimous agreement on anything is remarkable. Yet, there's widespread consensus on one point: good technical documentation is crucial and saves considerable time. Some even consider it a cornerstone of engineering success, acting as a vital link between ideas, people, and visions. Despite this, many developers battle daily with poor, incomplete, or inaccurate documentation. It’s a common grievance in the tech community, where developers lament the hours spent on documentation, searching scattered sources for information, or enduring unnecessary meetings to piece together disjointed details. Vadim Kravcenko in his essay on Healthy Documentation highlights a pervasive issue: “The constant need to have meetings is a symptom of a deeper problem — a lack of clear, accessible, and reliable documentation. A well-documented workflow doesn't need an hour-long session for clarification. A well-documented decision doesn't need a room full of people to understand its rationale. A well-documented knowledge base doesn't need a group huddle whenever a new member joins the team.” Documentation, especially that of system architecture, is often seen as a burdensome afterthought, fraught with tedious manual diagramming and verbose records spread across various platforms. It’s important to highlight that bad documentation is not just a source of frustration for developers, but it also has a very tangible business impact. After all, time is money. When developers waste time manually recording information or looking for something in vain, they are being diverted from building new features, optimizing performance, and, in general, producing value for end users. This article examines the evolving requirements of modern system architecture documentation and how system architecture observability might be a way to reduce overhead for teams and provide them with the information they need when they need it. Why System Architecture Documentation Is Important System documentation is crucial as it captures all aspects of a software system’s development life cycle, from initial requirements and design to implementation and deployment. There are two primary benefits of comprehensive system architecture documentation: 1. Empowers All Stakeholders While Saving Time System design is inherently collaborative, requiring inputs from various stakeholders to ensure the software system meets all business and technical requirements while remaining feasible and maintainable. Documentation serves different needs for different stakeholders: New Team Additions: Comprehensive documentation helps new members quickly understand the system's architecture, technical decisions, and operational logic, facilitating smoother and faster onboarding. Existing Engineering Team: Serves as a consistent reference, guiding the team's implementation efforts and reducing the frequency of disruptive clarification meetings. Cross-Functional Teams: Enables teams from different functional areas to understand the system’s behavior and integration points, which is crucial for coordinated development efforts. Security Teams and External Auditors: Documentation provides the necessary details for compliance checks, security audits, and certifications, detailing the system’s structure and security measures. Effective documentation ensures that all team members, regardless of their role, can access and utilize crucial project information, enhancing overall collaboration and efficiency. 2. Persisted, Single Source of Company Knowledge A dynamic, comprehensive repository of system knowledge helps mitigate risks associated with personnel changes, code redundancy, and security vulnerabilities. It preserves critical knowledge, preventing the 'single point of failure' scenario where departing team members leave a knowledge vacuum. This central source of truth also streamlines problem-solving and minimizes time spent on context-switching, duplicated efforts, and unnecessary meetings. By centralizing system information across various platforms — like Jira, GitHub, Confluence, and Slack — teams can avoid the pitfalls of fragmented knowledge and ensure that everyone has access to the latest, most accurate system information. Modern Systems Have Outgrown Traditional Documentation The requirements for system architecture documentation have evolved dramatically from 20 or even 10 years ago. The scale, complexity, and distribution of modern systems render traditional documentation methods inadequate. Previously, a team might grasp a system's architecture, dependencies, and integrations by reviewing a static diagram, skimming the codebase, and browsing through some decision records. Today, such an approach is insufficient due to the complexity and dynamic nature of contemporary systems. Increased Technological Complexity Modern technologies have revolutionized system architecture. The rise of distributed architectures, cloud-native applications, SaaS, APIs, and composable platforms has added layers of complexity. Additionally, the aging of software and the proliferation of legacy systems necessitate continual evolution and integration. This technological diversity and modularity increase interdependencies and complicate the system's communication structure, making traditional diagramming tools inadequate for capturing and understanding the full scope of system behaviors. Accelerated System Evolution The adoption of Agile methodologies and modern design practices like Continuous and Evolutionary Architecture has significantly increased the rate of change within software systems. Teams have to update their systems to reflect changes in external infrastructure, new technologies, evolving business requirements, or a plethora of other aspects that might change during the lifetime of any software system. That’s why a dynamic documentation approach that can keep pace with rapid developments is necessary. Changing Engineering Team Dynamics The globalization of the workforce and the demand from users for global, scalable, and performant applications have led to more distributed engineering teams. Coordinating across different cross-functional teams, offices, and time zones, introduces numerous communication challenges. The opportunity for misunderstandings and failures becomes an order N squared problem: adding a 10th person to a team adds 9 new lines of communication to worry about. That’s also reflected in the famous Fred Brooks quote from the The Mythical Man-Month book: “Adding [human] power to a late software project makes it later.” This complexity is compounded by the industry's high turnover rate, with developers often changing roles every 1 to 2 years, underscoring the necessity for robust, accessible documentation. New Requirements of System Architecture Documentation System architecture documentation should be accurate, current, understandable, maintainable, easy to access, and relevant. Despite these goals, traditional documentation methods have often fallen short due to several inherent challenges: Human Error and Inconsistencies: Relying on individuals, whether software architects, technical writers, or developers, to document system architecture introduces human error, inconsistencies, and quickly outdated information. These issues are compounded by barriers such as interpersonal communication, lack of motivation, insufficient technical writing skills, or time constraints. Documentation as Code: While self-documenting code is a step forward, using comments to clarify code logic can only provide so much clarity. It lacks critical contextual information like decision rationales or system-wide implications. Fragmented Tooling: Documentation generators that scan source code and other artifacts can create documentation based on predefined templates and rules. However, these tools often provide fragmented views of the system, requiring manual efforts to integrate and update disparate pieces of information. The complexity and dynamism of modern software systems intensify these documentation challenges. In response, new requirements have emerged: Automation: Documentation processes need to minimize manual efforts, allowing for the automatic creation and maintenance of diagrams, component details, and decision records. Tools should enable the production of interactive, comprehensive visuals quickly and efficiently. Reliability and Real-Time Updates: Documentation must not only be reliable but also reflect real-time system states. This is essential to empowering engineers to make accurate, informed decisions based on the current state of the system. This immediacy helps troubleshoot issues efficiently and prevents wasted effort on tasks based on outdated information. Collaborative Features: Modern tooling must support both synchronous and asynchronous collaboration across distributed teams, incorporating features like version control and advanced search capabilities to manage and navigate documentation easily. In today's fast-paced software development environment, documentation should evolve alongside the systems it describes, facilitating seamless updates without imposing additional overhead on engineering teams. Observability Could Solve the Biggest Pain Points Leveraging observability could be the key to keeping system architecture documentation current while significantly reducing the manual overhead for engineering teams. The growing adoption of open standards, such as OpenTelemetry (OTel), is crucial here. These standards enhance interoperability among various tools and platforms, simplifying the integration and functionality of observability infrastructures. Imagine a scenario where adding just a few lines of code to your system allows a tool to automatically discover, track, and detect drift in your architecture, dependencies, and APIs. Such technology not only exists but is becoming increasingly accessible. Building software at scale remains a formidable challenge. It's clear that merely increasing the number of engineers, or pursing traditional approaches to technical documentation doesn’t equate to better software — what's needed are more effective tools. Developers deserve advanced tools that enable them to visualize, document, and explore their systems’ architecture effortlessly. Just as modern technology has exponentially increased the productivity of end-users, innovative tools for system design and documentation are poised to do the same for developers, transforming their capacity to manage and evolve complex systems.

By Thomas Johnson

How to Secure Your Raspberry Pi and Enable Safe, Resilient Updates

The venerable Raspberry Pi has been around for over a decade (officially created in 2009) and it has become a standard in many robotics, home automation, and other types of uses, especially for “makers” and other tinkerers. But it has also made serious inroads into the professional and enterprise world — just more quietly. It’s a capable, reliable, and powerful single-board computer (SBC) with a robust user community. For all its strengths, it does have a few notable weaknesses. The biggest one is security. Not because the SBC itself is insecure, and not because the Operating System (OS) is insecure (it runs Linux, which can be very well secured). The most vulnerable part of the Raspberry Pi is the fact that it boots and runs off of a micro SD card. While that micro SD card boot mechanism is certainly convenient, it does leave the Pi extremely vulnerable to physical tampering. After all, someone can simply walk up to the Pi, remove the SD card, and they have access to all of the programs and data that was running. They can put that card into their own Pi and they have full access to everything. Well, with a little password hacking, etc. Making that Pi absolutely secure against physical tampering as well as electronic tampering is a critical step in making a Raspberry Pi a secure device for deploying applications in the field. Seamless updates of your Pi is also, often, a hassle. Especially if you have more than a handful of them. You have to log in to each one, run the updates, and then hope that nothing goes wrong. Which leads me to recoverability: What happens if one of those updates fails for some reason? especially if it’s in some remote location. How do you ensure that the device is recoverable, and how can you get it back online as quickly as possible? Clearly, I’m going to cover a lot of ground, but in the end, I’ll show you how you can secure your Raspberry Pi from physical tampering making it virtually impossible to steal your programs and data, how to run secure, remote updates, and how to ensure that those updates are recoverable in case of failure. Let’s Build a New Pi To start off, let’s build a Raspberry Pi from scratch. (If you aren’t interested in this part and just want to secure an existing Pi, scroll down to the section, "Securing Your Pi.") Here are all the things you’ll need in order to complete this. I will include links to the ones I have direct, personal experience using. Raspberry Pi (I’m using a Raspberry Pi 4, but you can also use a Raspberry Pi 5 if you prefer) Power supply for your Pi: It is important to use a good power supply that supplies enough power for the Pi4 or Pi5. The official supply from the Pi Foundation is recommended. High-quality Micro SD Card: I recommend a 32GB card. SD Card Reader/Writer (if there isn’t one built-in to your computer) High-quality USB Flash drive (Note: It should be 2x the size of your SD Card): I like this one from PNY. Zymkey, HSM4, or HSM6 (I’m using a Zymkey here, but an HSM6 will work just fine if you want Hardware Signing.) CR1025 Battery (A battery is not strictly necessary, but I’m including it here for completeness. The Zymkey uses it to maintain the Real Time Clock (RTC).) Format and Image Your Micro SD Now that you have all the parts assembled, let’s get started. I’m using the Raspberry Pi Imager tool, but you can use Balena Etcher or any other SD Card imaging tool you prefer. When you first start the Pi Imager, you’ll see that you have to make some choices: First, you’ll want to choose which Pi model you have. I’m using a Pi 4. Choose the hardware you have, of course. Next, you’ll choose the Operating System. We are going to use the most recent version (Bookworm, 64-bit), but we won’t be needing the full Desktop environment, so I’m choosing the "Lite" version. Next, you’ll identify the Micro SD Card you’d like to write to. If you haven’t already, insert the Micro SD Card into the SD Card writer and plug it into your computer. The last step before actually writing the OS to the disk is to set any additional settings you’d like for the Pi. I recommend at least setting up a hostname and username/password, and if you would like to use your local WiFi, the WiFi credentials. Once you’ve got all the settings right, it’s time to write it all to the card. Note that this will completely erase any existing data on the SD Card, so be careful. After that, you can sit back and enjoy a cup of coffee while your OS is written on the card. Once it’s done, we can move on to configuring the hardware. Set up the Hardware This is always my favorite part! First, let’s just see what we need: Before plugging the Pi in, let’s get the Zymkey put together and installed. About the only thing you need to do is to insert the CR1025 battery into the battery holder. Make sure that the Zymkey is well-seated on the header pins. Once the hardware is all put together, insert the SD Card into the slot on the underside of the Pi. Now it’s time to plug the Pi into the power supply, wait for it to boot, and get started setting up our security! Securing Your Pi Now that we’ve got a happily running Pi, let’s go about the important job of making sure that it is secure, updateable, and recoverable. In that order. Configure Your Zymkey Before we can configure the Zymkey, we need to ensure that the Pi can talk to it. The Zymkey software communicates with the device via I2C, so we need to make sure that the Pi’s I2C interface is enabled. Shell $ sudo raspi-config This gets you to the configuration utility. You’ll then select “Interface Options” and then “I2C”: You can then exit and save raspi-config: All of these steps are covered in greater detail in the documentation, so if anything here is confusing, you can always double-check there. Next, we need to install the required Zymkey Software. Shell curl -G https://s3.amazonaws.com/zk-sw-repo/install_zk_sw.sh | sudo bash Install any updates, and then download and install the required Zymbit software. Note: Installing this software will trigger an automatic reboot of the Pi, so you should not be doing anything else with it while the software is installing. After the reboot has completed, you should notice that the blue light is no longer flashing rapidly, but is flashing once every 3 seconds. This is your indication that the Zymbit software is properly installed, and able to communicate with the Zymkey. If you’d like to test to make sure that the Zymkey is installed and functioning properly, you can download and install the test scripts: Shell wget https://community.zymbit.com/uploads/short-url/eUkHVwo7nawfhESQ3XwMvf28mBb.zip unzip eUkHVwo7nawfhESQ3XwMvf28mBb.zip sudo mkdir -p /usr/local/share/zymkey/examples/ sudo mv *.py /usr/local/share/zymkey/examples/ python3 /usr/local/share/zymkey/examples/zk_app_utils_test.py Shell Testing data lock... Original Data 01 02 03 04 Encrypted Data 8B B8 06 67 00 00 35 80 82 75 AA BE 89 8C A8 D5 6D 7B 71 48 83 47 B9 9A B7 3A 09 58 41 E6 33 BC 4E 48 7A 32 3A B0 26 D8 59 4F 8C 58 59 97 03 20 3C 99 CF AF 2D CC 47 E5 1B AB 83 FC 6A 3D DE D8 F3 24 9F 73 B5 72 B7 0D 77 8E C6 A8 A3 B3 22 D6 94 8F BD 6A 6C 96 38 EE Testing data unlock... Decryped Data 01 02 03 04 Turning LED on... Testing get_random() with 512 bytes... B7 B6 BD 78 C6 62 7A CC 80 E0 BD 04 C7 43 29 AC 7A 48 2D 3F E5 43 33 AA 7C 37 F6 BA 7D 3F F2 D3 A9 4B B3 A9 16 4C FD AD 48 61 72 9E 7F B9 09 AE A7 4A 4F 54 0D CE 6E 85 E6 87 F5 8C D6 58 4B 0E 12 03 4C 71 BD 3A F0 34 79 06 66 5E 65 DC 6E CF AF 12 72 C1 F1 5D 24 79 A8 D0 F9 40 3E 8E 59 D7 5C ED C5 1E 0E FF 4A 04 69 22 54 F5 13 A1 2E A7 3C B4 CD 30 E7 61 10 B7 E5 07 AD DC E0 FF E9 6E 58 32 50 DA 9F 33 51 F5 8C 16 B5 0C 0F 57 08 E6 E8 00 89 79 DF 16 2A BD FC 27 E0 E4 6C 1B 05 28 EB DE 5B 63 2E F0 E0 21 E8 C5 39 31 26 2A E5 64 79 31 04 7A 60 ED D7 32 6A 8B 4A 29 DD 79 EC D9 2B 72 AC 2E 9A 08 FF 56 06 DB 1C 91 FF D9 3F 10 3E 57 9C 5E B4 32 FD 2E 09 BF 8D 04 6A C8 12 88 06 7C C1 93 FD F7 61 47 90 DD 0D 50 78 78 6C 83 0A 94 DD 5E 9D 83 3F FD 0B 1E 73 23 72 0D 4D D1 82 1F 42 DB EE 1E 7F 85 B9 F1 94 24 54 1B 28 2E 47 24 05 8B 17 0B AE 90 6A DF 0B BC E1 53 B2 96 1C 87 D4 FD A0 EC FC 85 E4 9F 04 F6 B8 E0 37 B2 40 17 33 3A FA 96 01 0C B2 4C 4D FE E7 64 0E 87 4E 4B A8 D0 97 C6 A5 42 F4 02 E4 CC 7C 2B 3A A8 C7 33 22 3C 76 1C 40 42 1F 5A 78 7B 23 FB 0B 39 BD 9F 38 13 6B FE D9 54 C9 D2 F3 97 C6 39 F3 09 9C 6B DC 82 C1 25 99 70 8B 2B 46 FD CD 51 C9 09 20 16 DA 4C D3 58 B6 BB D7 C3 E4 A9 34 F0 5C 85 D7 19 6D A8 F7 26 D6 41 6F 27 04 2C A0 C4 50 9D 28 43 0D DC E2 7E D4 9E 29 FE 45 B2 BF 14 77 A7 AD F4 43 4B 51 85 85 06 7F 02 BF 21 DA C4 BD A4 9B 94 71 FA 21 8B 9E B6 07 48 7F 50 A7 CF 32 2F 8F 98 A1 E1 FE 1B 2E 24 B5 BF 69 E7 DE 3D 11 6C 48 5B 56 5C BF 96 FB 30 BB 86 13 C4 53 61 AD 6E 09 0C A9 4B C1 2F 12 3F BF 34 FB 01 D7 62 13 7A Turning LED off... Flashing LED off, 500ms on, 100ms off... Testing zkCreateRandDataFile with 1MB... Turning LED off... Testing get_ecdsa_public_key()... 20 AD 20 7A 0E D9 A5 81 BF 44 80 54 C6 DC A7 8C D1 D5 7B EE 6D C5 E3 B4 92 8C 0E BF 42 6E D9 9E AA 04 29 CD 4C D9 3A BC 58 5B DD 47 43 39 30 C8 2E FD C6 D9 C9 82 60 06 A4 A0 7F EA F9 C0 76 E9 Testing create_ecdsa_public_key_file()... $ python3 /usr/local/share/zymkey/examples/zk_crypto_test.py Signing data...OK Verifying data...OK Verifying tainted data...FAIL, yay! Generating random block from Zymkey (131072 bytes)... Encrypting random block... Decrypting encrypted block... PASS: Decrypted data matches original random data Done! Congratulations! Finally, Making It Secure Now that we have a proper security device installed, tested, and ready let’s secure this thing. At the very same time, let’s make sure that we can securely update the device when the time comes, and that it is built to be recoverable in case an update fails. Ordinarily, this would be a ton of work, but we’re going to simplify everything and do it pretty much all at once. A Place To Put the Backup Image Since we will be using Bootware(r) to secure our device, we will need a place for the system to copy the entire SD Card as it encrypts it. For this, we’re going to use a USB Drive. We need to make sure that we can use our USB Drive properly. I often reuse them for other tasks, so here’s how I like to start out. After plugging the USB Drive in, I make sure to “zero out” the drive, then create a brand new partition map and file system on it. Shell sudo dd if=/dev/zero of=/dev/sda bs=512 count=1 conv=notrunc Shell 1+0 records in 1+0 records out 512 bytes copied, 0.0197125 s, 26.0 kB/s That clears the previous file system, if any. Shell sudo fdisk -W always /dev/sda Shell Welcome to fdisk (util-linux 2.38.1). Changes will remain in memory only, until you decide to write them. Be careful before using the write command. Device does not contain a recognized partition table. Created a new DOS (MBR) disklabel with disk identifier 0x27b0681a. Command (m for help): n Partition type p primary (0 primary, 0 extended, 4 free) e extended (container for logical partitions) Select (default p): p Partition number (1-4, default 1): First sector (2048-125313282, default 2048): Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-125313282, default 125313282): Created a new partition 1 of type 'Linux' and of size 59.8 GiB. Partition #1 contains a ext4 signature. The signature will be removed by a write command. Command (m for help): w The partition table has been altered. Calling ioctl() to re-read partition table. Syncing disks. The important parts there are. Once you’ve entered the sudo fdisk -W always /dev/sda , you will enter n to create a new partition map. Then p to make it a Primary partition, and finally w to write the partition map to the disk. For everything else, I just accept the defaults as presented. Finally, now that we have a partitioned USB Drive, we have to create a proper file system on it. Shell sudo mkfs.ext4 -j /dev/sda1 -F Shell mke2fs 1.47.0 (5-Feb-2023) Creating filesystem with 15663904 4k blocks and 3916304 inodes Filesystem UUID: 4a3af5d0-bac4-4903-965f-aa6caa8532cf Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424 Allocating group tables: done Writing inode tables: done Creating journal (65536 blocks): done Writing superblocks and filesystem accounting information: done Tip: If, like me, you get tired of typing sudo all the time, you can run sudo -i once and get a root prompt from which to run all your commands. But remember, with great power comes great responsibility! Installing Bootware(r) Bootware is the Zymbit tool for securing and updating your Raspberry Pi. It is a powerful tool that allows you to update one, or an entire fleet, or Pis across your enterprise. And it allows you to do it safely, securely, and in a way that is recoverable if something goes wrong. First, we have to run the installer: Shell $ curl -sSf https://raw.githubusercontent.com/zymbit-applications/zb-bin/main/install.sh | sudo bash This installer will ask you a couple of simple questions, so let’s go through the answers. The first is whether or not you’d like to include Hardware Signing. If you have an HSM6 or SCM-based product, you can answer yes to this question. If you’ve got a Zymkey or HSM4, Hardware Signing is not supported, so you don’t need to install it. Even with software signing, your final LUKS encrypted partitions will be protected by the Zymbit HSM keys. Next, it will ask you which version of Bootware to install. Choose the most recent version. Shell zb-install.sh: bootstrapping the zbcli installer --------- Pi Module: Raspberry Pi 4/Compute Module 4 Operating System: Rpi-Bookworm Zymbit module: Zymkey Kernel: kernel8.img --------- ✔ 'zbcli' comes with software signing by default. Include hardware key signing? (Requires SCM or HSM6) · No ✔ Select version · zbcli-1.2.0-rc.26 Installing zbcli Installed zbcli. Run 'zbcli install' to install Bootware onto your system or 'zbcli --help' for more options. zb-install.sh: cleaning up Now that the installer is ready, it’s time to install Bootware itself: Shell sudo zbcli install The installer will ask you if you’re ready to reboot when it’s done: Shell --------- Pi Module: Raspberry Pi 4 Operating System: Rpi-Bookworm Zymbit module: Zymkey Kernel: kernel8.img --------- Found kernel '/boot/firmware/kernel8.img' Created '/etc/zymbit/zboot/mnt' Created '/etc/zymbit/zboot/scripts' Created '/etc/zymbit/zboot/zboot_backup' Created '/boot/firmware/zboot_bkup' Installed 'u-boot-tools' Created '/etc/fw_env.config' Created '/usr/bin/zbconfig' Found OpenSSL 3 Created '/boot/firmware/zb_config.enc' Modified zbconfig 'kernel_filename' Installed zboot Modified '/etc/rc.local' Created '/lib/cryptsetup/scripts/zk_get_shared_key' Modified '/boot/firmware/config.txt' Created '/etc/update-motd.d/01-zymbit-fallback-message' Modified /etc/update-motd.d/01-zymbit-fallback-message ✔ A reboot into zboot is required. Reboot now? · yes Finished in 29.1s Configuring Bootware This is where the real fun begins! If you’ve ever used LUKS to encrypt a Pi filesystem before, you know that, while it’s a great step in securing your Pi, you still have to store that encryption key somewhere that is accessible at boot time. With Bootware and a Zymbit HSM, the LUKS encryption key is locked by the Zymbit HSM, making it much more secure. Bootware expects the boot image to be in a specific, encrypted format called a z-image. The Bootware CLI tool helps you create and manage these images for deployment across your enterprise. So let’s create our first z-image, and we’ll use the current system as the basis for it. First, we need to mount the USB Drive so that we have a place to put our z-image: Shell sudo mount /dev/sda1 /mnt Next, we’ll run the imaging tool to create an encrypted z-image of our current system: Shell sudo zbcli imager Shell Validated bootware installation --------- Pi Module: Raspberry Pi 4 Operating System: Rpi-Bookworm Zymbit module: Zymkey Kernel: kernel8.img --------- Created '/etc/zymbit/zboot/update_artifacts/tmp' ✔ Enter output directory · /mnt ✔ Enter image name · z-image-1 ✔ Select image type · Full image of live system ✔ (Optional) enter image version · 1.0 ✔ Select key · Create new software key Notice that I used the mount point for the USB Drive as our output directory. I then chose a name and version number for the image and chose to use a software key, since I’m using a Zymkey. Don’t be surprised if this step takes a while. What it’s doing is making a complete copy of the files on the running disk, and signing it with the hardware key that it has generated. Shell Created signing key Created '/etc/zymbit/zboot/update_artifacts/file_manifest' Created '/etc/zymbit/zboot/update_artifacts/file_deletions' Verified path unmounted '/etc/zymbit/zboot/mnt' Cleaned '/etc/zymbit/zboot/mnt' Deleted '/etc/crypttab' Verified disk size (required: 2.33 GiB, free: 26.39 GiB) Created initramfs Created snapshot of boot (/etc/zymbit/zboot/update_artifacts/tmp/.tmpBgEBJk/z-image-1_boot.tar) Created snapshot of root (/etc/zymbit/zboot/update_artifacts/tmp/.tmpBgEBJk/z-image-1_rfs.tar) Created '/mnt/tmp' Cleaned '/mnt/tmp' Created staging directory (/mnt/tmp/.tmpEhjNN7) Created '/mnt/tmp/.tmpEhjNN7/header.txt' Created tarball (/mnt/tmp/.tmpEhjNN7/update_artifact.tar) Created header signature Created update artifact signature Created file manifest signature Created file deletions signature Created '/mnt/tmp/.tmpEhjNN7/signatures' Created signatures (/mnt/tmp/.tmpEhjNN7/signatures) Copied file (/etc/zymbit/zboot/update_artifacts/file_manifest) to (/mnt/tmp/.tmpEhjNN7/file_manifest) Copied file (/etc/zymbit/zboot/update_artifacts/file_deletions) to (/mnt/tmp/.tmpEhjNN7/file_deletions) Created tarball (/mnt/z-image-1.zi) Created '/mnt/z-image-1_private_key.pem' Saved private key '/mnt/z-image-1_private_key.pem' Created '/mnt/z-image-1_pub_key.pem' Saved public key '/mnt/z-image-1_pub_key.pem' Cleaned '/mnt/tmp' Saved image '/mnt/z-image-1.zi' (2.33 GiB) Finished in 384.8s The public/private key pair is saved on the USB Drive, and we will need it later. A/B Partitioning Some background here is probably appropriate. The idea of A/B partitioning is an important concept for recoverability. If you have a single disk partition that your devices boot from, and you update critical items in that partition that are somehow corrupted, your device may be left in a state where it is impossible to boot or recover. It’s bricked. The only way to recover such a device typically is to physically access the device and make direct changes to the SD Card. This is not always practical, or even possible. With A/B partitioning, you create dual boot partitions and only run from one. That is the known-good or primary partition. You then have a secondary partition where you can apply updates. Once an update is applied to the secondary partition, the device reboots from that newly updated partition. If the update is successful, your system is back up and running and that partition is then marked as the primary, and it will reboot from that known-good partition from now on. If the update fails, for some reason, and the device cannot properly boot from the updated partition, the system reboots from the previously used primary partition, and it can continue to run until a fixed update can be deployed. With this partitioning scheme in place, your Pi is much less likely to end up bricked as you can maintain a known-good partition at all times from which to boot. Bootware encrypts the A, B, and DATA partitions. The A and B partitions are locked with unique LUKS keys, meaning you cannot access the Backup partition from the Active partition. The encrypted DATA partition is accessible from either the A or B partition. Setting up this A/B partitioning scheme is usually quite cumbersome and difficult to implement. Zymbit’s Bootware has taken that process and simplified it such that it’s a relatively easy process. So let’s go through that process now and make your Pi both secure and resilient. Create A/B Partitions Since we’ve not previously had a backup B partition, we will create one, and we will place the current image (which we know is good, since we’re currently running it) into that partition. To do that, we will update the configuration (really create it) with the zbcli tool. Shell sudo zbcli update-config Shell Validated bootware installation --------- Pi Module: Raspberry Pi 4 Operating System: Rpi-Bookworm Zymbit module: Zymkey Kernel: kernel8.img --------- Info the root file system will be re-partitioned with your chosen configuration. This process will ask you some questions to determine how to lay out your partitions. The first is what device partition layout you would like to use. Choose the recommended option: Shell ? Select device partition layout after an update › ❯ [RECOMMENDED] A/B: This will take the remaining disk space available after the boot partition and create two encrypted partitions, each taking up half of the remaining space. Most useful for rollback and reco Using partition layout (A/B) Info the root file system will be re-partitioned with your chosen configuration. Next, you will select the update policy. Again, just choose the recommended one. Shell ? Select update policy › ❯ [RECOMMENDED] BACKUP: Applies new updates to current backup filesystem and swap to booting the new updated backup partition as the active partition now. If the new update is bad, it will rollback into the pre Running [========================================] 2/1 (00:00:17): WARNING! Detected active partition (28.71GB) is larger than 14.86GB needed for two filesystems. Active partition won't be saved!!! Changing update mode to UPDATE_BOTH!!! Using update mode (UPDATE_BOTH) Data partition size currently set to: 512 MB Info bootware will create a shared data partition after A/B in size MB specified Next, you can select the size of the data partition. It defaults to 512MB, but I suggest increasing that to 1024 MB. Shell ✔ Enter size of data partition in MB · 1024 Using Data Partition Size 1024MB Defaulting to configured endpoint '/dev/sda1' Info update endpoints can be either an HTTPS URL or an external mass storage device like a USB stick. Found update name 'z-image-1' Saved update name 'z-image-1' Using update endpoint '/dev/sda1' Configuration settings saved Finished in 42.1s We’ve now got a system that is configured to have A/B partitioning and to apply updates to the backup partition when they are available. To complete the process, we will actually apply the update (which is really just a copy of the currently running system). This will trigger the re-partitioning and a reboot. First, though, we need to get the public key (created previously and stored on the USB Drive) so that we will be able to decrypt the image. To do that, let’s copy it to the local directory: Shell sudo mount /dev/sda1 /mnt cp /mnt/z-image-1_pub_key.pem . sudo zbcli update Shell Validated bootware installation --------- Pi Module: Raspberry Pi 4 Operating System: Rpi-Bookworm Zymbit module: Zymkey Kernel: kernel8.img --------- Cleaned '/etc/zymbit/zboot/update_artifacts/tmp' Found update configs ? Proceed with current configs? These can be modified through 'zbcli update-config' --------- Update endpoint /dev/sda1 Update name z-image-1 Endpoint type LOCAL Partition layout A/B Update policy UPDATE_BOTH --------- Created temporary directory (/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c) ✔ Enter public key file (Pem format) · ./z-image-1_pub_key.pem Mounted '/dev/sda1' to '/etc/zymbit/zboot/update_artifacts/tmp/.tmpyKYgR3' Found image tarball (/etc/zymbit/zboot/update_artifacts/tmp/.tmpyKYgR3/z-image-1.zi) Unpacked '/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c/update_artifact.tar' Unpacked '/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c/signatures' Unpacked '/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c/header.txt' Unpacked '/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c/file_manifest' Unpacked '/etc/zymbit/zboot/update_artifacts/tmp/.tmpCfhm6c/file_deletions' Decoded header signature Decoded image signature Decoded manifest signature Decoded deletions signature Found header data Found image data Found manifest data Found file deletions data Verified header signature Verified image signature Verified manifest signature Verified file deletions signature Modified zbconfig 'public_key' Modified zbconfig 'new_update_needed' Modified zbconfig 'root_a' Modified zbconfig 'root_b' Modified zbconfig 'root_dev' Copied file (/boot/firmware/usr-kernel.enc) to (/boot/firmware/zboot_bkup/usr-kernel-A.enc) Copied file (/boot/firmware/kernel8.img) to (/boot/firmware/zboot_bkup/kernel8.img) Modified zbconfig 'update_with_new_image' Modified zbconfig 'kernel_filename' ? Scheduled update for the next reboot. Reboot now? (y/n) › yes When it asks to reboot, say yes, and then wait. Once your Pi is rebooted, log in and check to see that it’s correct. Shell lsblk Shell NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 1 59.8G 0 disk └─sda1 8:1 1 59.8G 0 part mmcblk0 179:0 0 29.7G 0 disk ├─mmcblk0p1 179:1 0 512M 0 part /boot/firmware ├─mmcblk0p2 179:2 0 14.1G 0 part │ └─cryptrfs_A 254:0 0 14.1G 0 crypt / ├─mmcblk0p3 179:3 0 14.1G 0 part └─mmcblk0p4 179:4 0 1G 0 part └─cryptrfs_DATA 254:1 0 1008M 0 crypt Notice that we now have two cryptfs devices. These are fully signed and encrypted filesystems. What if the update had failed? Here’s the beauty of A/B partitioning with Bootware: if the system fails to boot (it fails to reach a systemd init target 3 times in a row), Bootware will revert to the known-good partition, bringing your device back online. Bonus Exercise Here, we have simply used the USB drive as the source for the update. But there are other options! we could copy that z-image to an Amazon S3 bucket, or one of our own servers, and then configure Bootware to pull the update from that location. You would need to re-run zbcli update-config and, for the endpoint, use the location on the internet where you stored the image. Conclusion We have now built a complete, secure Raspberry Pi from scratch. Just as importantly, we have now enabled that Pi to be updated safely and securely and we can be assured that a failed update won’t brick the Pi. Are all of these things possible without a Zymkey and Bootware? Yes, mostly. You can encrypt your filesystem with LUKs, but then you have to manually manage where the key is stored, and make sure you keep it safe. You can also do remote updates, but (and this is a very large caveat), you have no assurances that the update will succeed, that the update won’t brick your device, or that the update can’t be tampered with in some way. With the device we have just built, we can be assured that the filesystems are securely signed and encrypted, that we don’t have to worry about managing the encryption keys, that the keys themselves are stored securely, and that we can reliably update the device and not have to worry about it failing to boot after an update. If you have further questions or would like to talk more about Bootware, Zymkey, or any of the topics covered here, leave your feedback and any questions you may have.

By David G. Simmons

CORE

*You* Can Shape Trend Reports: Join DZone's Observability Research + Enter the Prize Drawing!

Hey, DZone Community! We have a survey in progress as part of our original research for the upcoming Trend Report. We would love for you to join us by sharing your experiences and insights (anonymously if you choose) — readers just like you drive the content that we cover in our Trend Reports. check out the details for our research survey below Observability and Performance Research DZone's annual research on application performance dives deeper into the emerging trends and techniques around monitoring and observability, both of which are must-haves to support the performance, reliability, and scalability of today's complex applications and system architectures. Our 10-minute research survey that will help guide the narrative of our November Observability and Performance Trend Report explores: Observability models, techniques, and tools OpenTelemetry use, benefits, and drawbacks Performance metrics and degradation root causes AI analytics capabilities for observability and monitoring Join the Observability Research Over the coming months, we will compile, observe, and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Content and Community team

By Caitlin Candelmo

Three Habits of Highly Effective Observability Teams

As organizations adopt microservices and containerized architectures, they often realize that they need to rethink their approach to basic operational tasks like security or observability. It makes sense: in a world where developers – rather than operations teams – are keeping applications up and running, and where systems are highly distributed, ephemeral, and interconnected, how can you take the same approach you have in the past? From a technology perspective, there has been a clear shift to open source standards, especially in the realm of observability. Protocols like OpenTelemetry and Prometheus, and agents like Fluent Bit, are now the norm – according to the 2023 CNCF survey, Prometheus usage increased to 57% adoption in production workloads, with OpenTelemetry and Fluent both at 32% adoption in production. But open source tools alone can’t help organizations transform their observability practices. As I’ve had the opportunity to work with organizations who have solved the challenge of observability at scale, I’ve seen a few common trends in how these companies operate their observability practices. Let's dig in. Measure Thyself — Set Smart Goals With Service Level Objectives Service Level Objectives were first introduced by the Google SRE book in 2016 with great fanfare. But I’ve found that many organizations don’t truly understand them, and even fewer have implemented them. This is unfortunate because they are secretly one of the best ways to predict failures. SLOs (Service Level Objectives) are specific goals that show how well a service should perform, like aiming for 99.9% uptime. SLIs (Service Level Indicators) are the actual measurements used to see if the SLOs are met — think about tracking the percentage of successful requests. Error budgeting is the process of allowing a certain amount of errors or downtime within the SLOs, which helps teams balance reliability and new features — this ensures they don’t push too hard at the risk of making things unstable. Having SLOs on your key services and using error budgeting allows you to identify impending problems and act on them. One of the most mature organizations that I’ve seen practicing SLOs is Doordash. For them, the steaks are high (pun intended). If they have high SLO burn for a service, that could lead to a merchant not getting a food order on time, right, or at all. Or it could lead to a consumer not getting their meal on time or experiencing errors in the app. Getting started with SLOs doesn’t need to be daunting. My colleague recently wrote up her tips on getting started with SLOs. She advises to keep SLOs practical and achievable, starting with the goals that truly delight customers. Start small by setting an SLO for a key user journey. Collaborate with SREs and business users to define realistic targets. Be flexible and adjust SLOs as your system evolves. Embrace Events — The Only Constant in your Cloud-Native Environment is Change In DevOps, things are always changing. We're constantly shipping new code, turning features on and off, updating our infrastructure, and more. This is great for innovation and agility, but it also introduces change, which opens the door for errors. Plus, the world outside our systems is always shifting too, from what time of day it is to what's happening in the news. All of this can make it hard to keep everything running smoothly. These everyday events that result in changes are the most common causes of issues in production systems. And the challenge is that these changes are initiated by many different types of systems, from feature flag management to CI/CD, cloud infrastructure, security, and more. Interestingly, 67% of organizations don’t have the ability to identify change(s) in their environments that caused performance issues according to the Digital Enterprise Journal. The only way to stay on top of all of these changes is to connect them into a central hub to track them. When people talk about “events” as a fourth type of telemetry, outside of metrics, logs, and traces, this is typically what they mean. One organization I’ve seen do this really well is Dandy Dental. They’ve found that the ability to understand change in their system, and quickly correlate it to the changes in behavior, has made debugging a lot faster for developers. Making a habit of understanding what changed has allowed Dandy to improve their observability effectiveness. Adopt Hypothesis-Driven Troubleshooting — Enable Any Developer to Fix Issues Faster When a developer begins troubleshooting an issue, they start with a hypothesis. Their goal is to quickly prove or disprove that hypothesis. The more context they have about the issue, the faster they can form a good hypothesis to test. If they have multiple hypotheses, they will need to test each one in order of likelihood to determine which one is the culprit. The faster a developer can prove or disprove a hypothesis, the faster they can solve the problem. Developers use observability tools to both form their initial hypotheses and to prove/disprove them. A good observability tool will give the developer the context they need to form a likely hypothesis. A great observability tool will make it as easy as possible for a developer with any level of expertise or familiarity with the service to quickly form a likely hypothesis and test it. Organizations that want to improve their MTTR can start by shrinking the time to create a hypothesis. Tooling that provides the on-call developer with highly contextual alerts that immediately focus them on the relevant information can help shrink this time. The other advantage of explicitly taking a hypothesis-driven troubleshooting approach is concurrency. If the issue is high severity, or has significant complexity, they may need to call in more developers to help them concurrently prove or disprove each hypothesis to speed up troubleshooting time. An AI software company we work with uses hypothesis-driven troubleshooting. I recently heard a story about how they were investigating a high error rate on a service, and used their observability tool to narrow it down to two hypotheses. Within 10 minutes they had proven their first hypothesis to be correct – that the errors were all occurring in a single region that had missed the most recent software deploy. Taking the Next Step If you're committed to taking your observability practice to the next level, these tried-and-true habits can help you take the initial steps forward. All three of these practices are areas that we’re passionate about. If you’ll be at KubeCon and want to discuss this more, please come say hello! This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.View the Event

By Dan Juengst