When building systems, there are some operational elements that it pays to get to grips with sooner rather than later:

- Deployment
- Packaging
- Configuration
- Monitoring
- Logging

Failing to address these elements is detrimental to core aspects of what we need to do from day one:

- Get changes out - ship a new feature, deploy an urgent bug-fix or make a tweak to handle a load-spike.
- Determine if things have started up and configured properly.
- Be sure things are still running right.
- Identify and react to problems quickly.
- Obtain data important to future architectural decisions.

Even in light of the above, many of us are still tempted into leaving this until later, by which time:

- Our software will have grown substantially, making it difficult and expensive to adapt when we do decide to address the operational issues.
- We’ll be losing inordinate amounts of time on manual trouble-shooting and dealing with the consequences of human error (a key contributor to downtime and other problems).
- Operations will likely have become tightly bound to whatever our software currently looks like, such that when we start addressing the issues, we’ll break all their assumptions (and the tooling they built around them).

Some Specifics

Having configuration buried inside your binaries where it cannot be easily managed is an inconvenience. We don’t really want to have to do a whole new build just to change configuration settings (though one might want to do a re-deploy of the whole lot together to allow for audit-trails and have half a chance of having all boxes configured similarly at the same time).
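One way to keep settings out of the binary is to read them from the process environment at startup. The following is a minimal sketch, not a definitive implementation: the helper names `config_get` and `config_get_long` and the setting name `APP_LISTEN_PORT` are assumptions made for illustration and do not come from any particular system.

```c
#include <stdlib.h>

/* Hypothetical helper: return the value of an environment setting,
   or the supplied fallback when it is unset or empty. */
const char *config_get(const char *name, const char *fallback) {
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : fallback;
}

/* Hypothetical helper: read an integer setting. A malformed value
   (for example "90x") is ignored and the fallback is used instead. */
long config_get_long(const char *name, long fallback) {
    const char *value = getenv(name);
    if (value == NULL || value[0] == '\0') {
        return fallback;
    }
    char *end = NULL;
    long parsed = strtol(value, &end, 10);
    return (*end == '\0') ? parsed : fallback;
}
```

A service might call `config_get_long("APP_LISTEN_PORT", 8080)` at startup, so that changing the port needs a restart with a different environment rather than a new build.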

When it comes to deployment and packaging it pays to adopt something akin to the xcopy install approach. Everything required is contained inside of the distribution with minimal external dependencies (necessary external dependencies should ideally be satisfied dynamically at runtime rather than with static configuration). Such an approach for desktop software would be unattractive but with servers and an imperative to automate installation it’s very attractive.

What about all those existing packaging systems such as rpm? Many of these mechanisms have a design assumption around a single version of something on a machine. This can inhibit fast rollback because, rather than stopping one process and starting another, one has to (in simple terms):

1. Stop a process.
2. Uninstall its binaries and dependencies.
3. Install the binaries and dependencies for the old process.
4. Start the other process up.

In some cases it will also be necessary to perform further configuration (did we back it up?). Suddenly it’s looking like a lot of work to buy ourselves appropriate risk-mitigation for broken upgrades.

Monitoring often requires an amount of configuration which can make for a bootstrap problem where one needs monitoring to detect a configuration issue but the monitoring isn’t configured yet. Thus it can be useful to have some very simple monitoring based on a primitive that can run without explicit configuration such as multicast.

Important Step

These key operational elements should be accounted for early on in the design of a system and grown alongside other functional aspects.* There’s plenty of information on this topic publicly available including:

* Initially the implementation can be simple scripts, but at some point it becomes necessary to take a more serious approach to tools and infrastructure development. This means investing in properly skilled architects and engineers, performing appropriate testing etc.

Tags: Architecture, operations

Those specifying requirements often express them without consideration for the passing of time, assuming that actions are instantaneous. A naive development team with limited experience in distributed systems will then make the classic mistake of attempting to implement those requirements to the letter. This can lead to a bunch of undesirable outcomes including:

- Brittleness in the face of failure.
- High cost solutions.
- Poor scaling properties.
- Disappointment as the expectations of the requirements source aren’t met.

Consider a system where we have two (network) hops to an observer and one hop to the initiator of an action (assuming uniform network latency for each hop). Potentially for every two actions there will be a single observation. Thus each observation of the system is out of date by the time it reaches the observer.
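The arithmetic behind that claim can be sketched as follows. The symbols are assumptions introduced here for illustration: $d$ is the uniform one-way latency of a single hop, and the initiator is assumed to issue actions back to back, one every $d$.

```latex
% Assumptions (not in the original text): uniform one-way latency d per hop,
% and an initiator that issues one action every d.
\begin{align*}
  t_{\text{observe}} &= 2d, \qquad \lambda_{\text{actions}} = \frac{1}{d} \\
  \text{actions arriving while one observation is in flight}
    &= \lambda_{\text{actions}} \cdot t_{\text{observe}}
     = \frac{1}{d} \cdot 2d = 2
\end{align*}
```

Under these assumptions, two further actions can reach the system while a single observation is still travelling to the observer.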

Administrative actions can suffer similar problems, in that it could take several hops for the request to arrive at the system. A user may be only one hop away and could be performing many operations in the time it takes for one of our actions to reach the system. For example if we wish to block a user, whilst our request is in transit they might perform several operations.

Things are made worse by network failures which can further delay or prevent execution of an action and slow down the rate of updates to an observer.

How then do we account for these troubles when specifying requirements? By qualifying them with appropriate SLAs. In the example above, appropriate SLAs might include:

- Time for propagation of an administrative action.
- Maximum acceptable time after the action is triggered for a user to be blocked.

SLAs such as the above:

- Help us to identify appropriate solutions (e.g. do we need to pay for multiple independent routes between data-centres?).
- Allow us to make appropriate use of asynchronous operations and eventual consistency.

Since SLAs have a significant impact on the way in which a requirement will be implemented, it is essential to perform appropriate expectation management, discussing and communicating the implications with the requirements source. SLAs cannot be solely the domain of techies. Remember also that in many situations customers prefer availability over consistency.

Tags: Architecture, development, Distributed Systems

How big does a website have to get before custom infrastructure becomes necessary? When a website reaches this stage, what infrastructure gets built? Before trying to answer these questions we must have some means of measuring the size of a website. I’ve settled on the number of machines as a reasonable approximation because:

- As a codebase grows it must be split up along functional boundaries and spread across multiple processes. More code equals more processes and more machines to run them on.
- More customers means more load and requires more machines to handle it.
- More data means more storage and more processors to chew through it.

Now let’s see how many machines some of the big players are running and what infrastructure they’re talking about:

TicketMaster have at least 3000 machines and have built Spine to help them manage configuration of their infrastructure.

eBay have built a custom deployment tool (Roller), logging infrastructure, configuration management for their software services, messaging software and more. They’re running around 15000 machines across four geographical locations.

Microsoft have built a custom deployment, configuration and monitoring infrastructure called Autopilot focused on many thousands of machines. In fact we’re talking hundreds of thousands.

Google are dealing in a million or more machines and expending effort on software to handle staged, automatic upgrades. Of course they’ve already built GFS, Chubby etc.

Twitter have moved beyond the half-dozen or so machines they used to have to “a lot of servers” (hundreds?) and are seemingly still hiring operations staff but have built a custom queue server.

Facebook have at least 10000 webservers, 800 MemcacheD instances and 1800 MySQL instances. They’ve built a custom configuration-serving infrastructure, management and monitoring tools. They also contribute to MemcacheD and have built Cassandra and Thrift. They also appear to be busy building their own optimized webservers and a replacement for squid.

Amazon have tens of thousands of servers (surely more?) and have constructed Dynamo, S3, EC2, SQS etc.

A few tentative conclusions:

- It would seem that by the time a website has moved into the thousands of boxes it will have had to address configuration and monitoring, which suggests development efforts started before this threshold (perhaps at a couple of hundred boxes?).
- As the machine count moves towards the tens of thousands, automated deployment becomes essential and there’s a need to develop more service-specific infrastructure.

Tags: Architecture, infrastructure, Systems

Neglecting to account for failure is an age-old problem. Consider this common error (Purify anybody?):

#include <stdio.h>
#include <stdlib.h>
struct rhubarb {
  int aVal;
  int anotherVal;
  char* aString;
};
......
  struct rhubarb* mystruct;
  mystruct = malloc(sizeof(struct rhubarb));
  mystruct->aVal = 55;
......

Of course the following code should have been included after the malloc:

/*
  If memory wasn't allocated, do something appropriate.
*/
if (mystruct == NULL) {
  .....
}
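Put together, a version that accounts for allocation failure might look like the sketch below. The constructor name `rhubarb_new` is hypothetical, and returning NULL to the caller is only one possible "something appropriate".

```c
#include <stdio.h>
#include <stdlib.h>

struct rhubarb {
    int aVal;
    int anotherVal;
    char *aString;
};

/* Hypothetical constructor: checks the result of malloc before use and
   returns NULL to the caller if memory could not be allocated. */
struct rhubarb *rhubarb_new(int aVal) {
    struct rhubarb *mystruct = malloc(sizeof *mystruct);
    if (mystruct == NULL) {
        /* Report the failure and let the caller decide how to recover. */
        fprintf(stderr, "rhubarb_new: out of memory\n");
        return NULL;
    }
    mystruct->aVal = aVal;
    mystruct->anotherVal = 0;
    mystruct->aString = NULL;
    return mystruct;
}
```

The caller still has to check the return value; the failure has been made explicit, not removed.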

An equivalent mistake is easily possible when building a distributed system in http or RMI by ignoring error codes or exceptions that are designed to communicate failures that we ought to handle. It’s similarly easy to ignore latency, or implement brittle and dumb retry logic or assume something is reliable (like a message queue) when it isn’t. Many have managed to concoct systems with http that breach the idempotent “constraints” of REST and whilst Erlang provides link() and receive timeouts, we’re not forced to use them.
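As a contrast to brittle retry logic, here is a minimal sketch of a bounded retry with a capped, doubling delay. Everything named here is an assumption for illustration: the result codes, the callback types and the `flaky_call` stub that stands in for a remote operation. The sketch is only safe when the remote operation is idempotent, since a retry may repeat work that actually succeeded.

```c
/* Result codes are assumptions for this sketch. */
enum call_result { CALL_OK, CALL_RETRYABLE, CALL_FATAL };

typedef enum call_result (*remote_call_fn)(void *ctx);
typedef void (*pause_fn)(unsigned int millis, void *ctx);

/* Try the call at most max_attempts times. Only CALL_RETRYABLE results are
   retried; the delay doubles after each failure but never exceeds max_delay_ms. */
enum call_result call_with_retry(remote_call_fn call, pause_fn pause, void *ctx,
                                 int max_attempts,
                                 unsigned int base_delay_ms,
                                 unsigned int max_delay_ms) {
    unsigned int delay = (base_delay_ms < max_delay_ms) ? base_delay_ms : max_delay_ms;
    enum call_result result = CALL_FATAL;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        result = call(ctx);
        if (result != CALL_RETRYABLE) {
            return result;              /* success or a failure not worth retrying */
        }
        if (attempt < max_attempts) {
            pause(delay, ctx);
            delay = (delay > max_delay_ms / 2) ? max_delay_ms : delay * 2;
        }
    }
    return result;                      /* attempts exhausted */
}

/* Stub standing in for a remote call: fails a set number of times, then succeeds. */
struct flaky_ctx { int failures_left; int calls; unsigned int slept_ms; };

enum call_result flaky_call(void *ctx) {
    struct flaky_ctx *f = ctx;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return CALL_RETRYABLE;
    }
    return CALL_OK;
}

/* Stub pause that records the requested delay instead of sleeping. */
void record_pause(unsigned int millis, void *ctx) {
    ((struct flaky_ctx *)ctx)->slept_ms += millis;
}
```

Passing the pause function in as a parameter keeps the retry policy testable without real delays. A production version would probably add jitter and an overall deadline.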

In essence there is no way to ensure developers do the right thing in a single-process or distributed context. No technology, tool or design approach can prevent developers from making poor implementation decisions which limits the value in re-hashing (Steve, Steve and Stu) RPC rights and wrongs.

I believe the best chance we have for doing distributed right is not by providing some de-facto standard toolset, rather it’s through education[1] and mentoring to encourage the correct mindset. Such a mindset allows a developer building a distributed system to choose the most appropriate tools and use them right.

[1] Material to be covered would be substantially broader than the fallacies, failure handling and latency, and should probably include: logical time, FLP, failure detectors, global snapshots and Paxos.

Tags: development, Distributed Systems

Amazon has had a few problems of late, one of the more interesting ones being something S3 users encountered. It took Amazon a little while to identify the root cause:

We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream.

Perhaps they had anticipated this scenario as the S3 API features explicit support for software-level check-summing via MD5:

For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 as part of the PUT response code in the ETag. By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content MD5 header wasn’t specified in the PUT request. Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted. This is a best practice that we’ll emphasize more heavily in our documentation to help customers build applications that can handle this situation.

Some developers were surprised that any of this was necessary, expecting TCP/UDP checksums to be sufficient. However, Stevens points out in TCP/IP Illustrated Vol I:

Also, if your data is valuable, you might not want to trust the UDP or the TCP checksum, since these are simple checksums and were not meant to catch all possible errors.
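To make the "simple checksums" point concrete, the sketch below implements a 16-bit ones' complement sum in the style of RFC 1071, the kind of checksum TCP and UDP use. Because it is just a sum of 16-bit words, reordering the words, or making two changes that cancel out, leaves the result unchanged. The function name is an assumption, and the example says nothing about the specific corruption Amazon saw.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement checksum over a byte buffer (RFC 1071 style).
   Bytes are combined into big-endian 16-bit words; an odd trailing byte
   is padded with zero. */
uint16_t internet_checksum(const uint8_t *data, size_t len) {
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }
    /* Fold any carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

For example, the byte sequences `12 34 56 78` and `56 78 12 34` produce the same checksum, which is one reason an end-to-end digest such as MD5 over the whole object is a much stronger check.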

Takeaways:

- Not all types of failure are binary - working or not working.
- Leaving the responsibility of data-safety to software layers further down the stack may not be best.
- Mechanisms for failure handling must be embedded in APIs.

Tags: availability, Distributed Systems, networks

The act of design includes:

- Consideration of many possible technology options
- Examination and identification of constraints
- Thought about the pros and cons of using various patterns and styles
- Comparing various splits of role and responsibility
- Looking at various tradeoffs of complexity versus function
- Formulation of opinion on possible future directions of system growth

A large proportion of this information is lost when the design document is written, because the focus is typically on providing a (notionally) definitive view of how a system should be structured. That view might take the form of a bunch of UML diagrams, or merely a collection of Visio-type diagrams with an explanation of what each of the boxes does. At the code-level there is almost no chance that any of this information will have been retained. Yet this information is of high value since it:

- is the explanation as to why a design is the way it is
- provides reviewers with a clearer view of what was and was not considered
- forms the basis for assessment of the maturity of a designer and can be used for coaching/mentoring
- can provide insight for those with less experience
- contains assumptions which if breached by changes in circumstance would dictate a re-design
- dictates to a large extent how suitable for purpose a design might be

Thus I believe it’s important to expose elements of the act of design via documentation alongside the design itself, conversations during the design work etc.

Tags: design, development, Software

It seems it’s generally accepted[1] that SOA means breaking up your system into a set of co-operating components partitioned by business process. If you’re not doing that, you’re not doing SOA. It never ceases to amaze me how we get so zealous about fixed methods for architecting a system. I suspect it’s because we’d like to believe that architecture (and much of the act of development) can be done with fixed rules, cookie cutter style: get your catalog of patterns and technology, apply them, job done. The ultimate embodiment of this behaviour is deployment of a piece of technology in the belief that once the integration is complete the system has radically shifted in terms of its architecture (e.g. deploying an ESB suddenly makes your system SOA).

So if the fixed methods of SOA are thrown out and technology is not the solution, how do we build a system? Let’s first consider some of the things we’d like from our architecture:

- Avoid integration via the database - otherwise data coupling will cripple us
- Support for granular updates - taking down the whole system is not desirable
- Fast rollback of changes - in case an update breaks
- In-production testing - there’s no substitute for real traffic in tests
- Minimal shared resources such as storage - so should there be an outage, impact is minimised
- Horizontal scaling - more boxes equals more power
- Support for scalable development - dev teams should be able to act in isolation most of the time
- Support for appropriate CAP tradeoffs - making everything consistent can be bad for availability

Although we wish to avoid coupling via the database, the reality is that our code still requires access to the data in some form or another. The best we can do under this circumstance is to limit the amount of code that directly accesses the data. We achieve this by vertically slicing (as opposed to horizontal sharding) our data and consolidating the code that is most closely related to it (e.g. performs updates) into a single encapsulated unit. All other access to the data must go via the code element of its associated unit (note that one needn’t always go to a unit for the data, it’s perfectly acceptable to cache).

In this way we limit the impact of data-schema changes to its associated unit. Other parts of the system need not be concerned, but there’s still some work to do. If the code within a unit were to be co-located within all processes containing code that wishes to make use of it, we’d need to restart all those processes when we wish to deploy a new version of that code (for whatever reason). Such a deployment model also encourages several bad habits:

- Ignoring the remoteness of the data - it’s hidden behind some form of interface and it’s tempting to attempt to hide failure behind that interface
- Focus on synchronous method calls - it’s natural for a developer to write synchronous method calls when the code being called looks local (note that method calls can support asynchronous behaviours)

To avoid these issues, we deploy each unit in its own process accessed via some network endpoint that dependants use to interact with it thus:

- Each unit can now easily be allocated its own independent storage, apply its own sharding policy etc.
- The network endpoint can support multiple protocol versions or we can opt to terminate multiple network endpoints onto a unit, a powerful primitive for supporting several versions of a remote interface simultaneously.
- The network endpoint can be terminated onto some form of load balancer or custom routing implementation (which might be part of the code within the unit itself, perhaps because it’s P2P based), facilitating horizontal scaling, hot upgrades, A/B testing, in-production tests etc.
- Each unit can be assigned to a development team and much work can be done independently of development efforts elsewhere, making for less contention in development.
- Each unit can implement whatever CAP tradeoff makes sense.

If we arrange for the network endpoint of each unit to be discovered dynamically at runtime we gain the ability to move our units around (e.g. for DR reasons) and have means for our system to dynamically knit itself together reducing configuration issues. Such an arrangement can also make it easier to deal with ordered startup issues (where some set of things must be available before others).

Of course it’s not all good news. We will have to manage our desire for ACID guarantees because many of the mechanisms (such as two-phase commit) for achieving this in a distributed system are fraught with problems. Fortunately, people have been thinking about this for a while. We’ll also have to take care of the fallacies, but even this has some positive aspects as failure and upgrade in some cases can be considered the same (noting that abstractions for message passing, failure detectors and the like can be implemented in many languages, not just Erlang).

So what remoting approaches might we use? REST/http, WS-*, RMI, CORBA, messages, custom protocol - whatever is suitable for our situation (noting that some choices impact the means by which we can handle evolution of protocols etc). What guidelines might we follow in determining how to split our code and data? There are a number of different approaches including:

- Considering similarities in consistency, availability and partitioning (CAP) requirements
- Data access localities
- Data relationships
- Jurisdictional requirements
- Roles and responsibilities (at coarser level than OO)
- Features (e.g. recommendations)
- Business processes
- Constituent elements of an overall business process

Most systems likely require a combination of these rather than one fixed approach; taste and gut instinct count for a lot. And what might we call these units I speak of? I prefer to call them services, as do a few other people, but there’s no doubt that’ll be confusing, have to think of something else...

[1] I know that Steve might well argue otherwise.

Tags: Architecture, Distributed Systems

There are many distributed algorithms and they vary in lots of ways including:

Communication Method: Possibilities include shared memory, point-to-point or broadcast messages etc.

Failure Model: Perhaps the algorithm assumes complete reliability. Perhaps it copes with some types of processor failure (including stop, transient failure or byzantine where the processor behaves arbitrarily). It might cope with problems in its communications layers (including message loss and duplication).

Timing Model: The algorithm might require computation and communication to progress in lock-step (synchronous) or it might cope with steps in arbitrary order with arbitrary speed (asynchronous). In between these two extremes exists an area of algorithms that have partial timing information (e.g. processors can access partially synchronised clocks). Asynchronous/Synchronous is independently applied to processors and communication channels.

The easiest to program are the synchronous algorithms. Asynchronous algorithms are harder to program because the order of happenings is uncertain; however, they have the advantage of needing no consideration of timing. Asynchronous algorithms also present some unique challenges for consensus which can be addressed by means of a failure detector. Many distributed systems provide stronger guarantees in respect of timing than is assumed in the asynchronous model. Thus we get to the partially synchronous model which, perhaps surprisingly, is the most difficult to program. Algorithms in this class are potentially efficient and the most realistic, but care must be taken to ensure the timing assumptions they make are not violated (perhaps by failing to arrange for some aspect of process behaviour to act within the assumptions).

Such a classification helps us choose algorithms appropriate to our network environment (which should include consideration of how often manual intervention will be required). A popular leader election algorithm simply requires each process to broadcast its UID across the network and maintenance of a lease. If a process doesn’t receive a UID higher than its own it can assume it is the leader. This algorithm works in a synchronous network with no failures. It can also be adapted to work in an asynchronous network with reliable FIFO channels and no failures. However, it can fail in the presence of a network partition or packet loss, leading to split brain behaviour which would need to be addressed with manual action or additional fault handling in other parts of the system.
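The decision rule at the heart of that election can be sketched in a few lines. This is a simplified illustration only: the function name is hypothetical, and the lease handling, the broadcast itself and any timing are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A process assumes leadership if none of the UIDs it received during the
   round is higher than its own. UIDs that were lost or cut off by a
   partition are simply absent from the array. */
bool believes_it_is_leader(uint64_t own_uid, const uint64_t *received, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (received[i] > own_uid) {
            return false;
        }
    }
    return true;
}
```

With processes 3, 7 and 9 and reliable delivery, only 9 concludes it is the leader. If a partition isolates 9, process 7 hears only from 3 and also concludes it is the leader, while 9 (hearing from nobody) still believes the same: the split brain described above.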

Tags: Architecture, Distributed Systems, networks

Jim Waldo writes:

My own conclusion is that system design is really a matter of technique, a way of thinking rather than a subject that can be taught in a particular course. It might be possible to build a program that teaches system design by putting students through a series of courses that hone their system design skills as they move through the subject matter of the courses. Such a series of courses would, in effect, be a formalized version of the apprenticeship that is now the way people acquire their system design technique…..

…..Even worse than not being visible to the customer, work done on designing the system is not visible to the management of the company that is developing the system. Even though managers will pay lip service to the teaching of The Mythical Man Month, there is still the worry that engineers who aren’t producing code are not doing anything useful. While there are few companies that explicitly measure productivity in lines-of-code per week, there is still pressure to produce something that can be seen. The notion that design can take weeks or months and that during that time little or no code will be written is hard to sell to managers. Harder still is selling the notion that any code that does get written will be thrown away, which often appears to be regression rather than progress.

In such an environment lip service often extends to technical strategy as well.

Tags: Architecture, design

Some of the more common software development mistakes I’ve seen…..

Ignoring the triangle - The triangle represents a trade-off between three core elements of software delivery - resource, product (features, non-functionals, quality) and schedule. One can only ever control two elements, the third being determined by the decisions regarding the other two. So if one wishes to dictate product and schedule, sufficient resource must be made available to complete the task in the allotted time. If one wishes to dictate product and resource, then the schedule cannot be limited. It is simply “as long as it takes”. And if one wishes to dictate resource and schedule, then product features, quality etc must be traded away to allow completion of development within the time allotted.

It’s amazing how often organisations attempt to dictate all three elements and are then surprised when a project gets messy. Of course, development processes have evolved in recognition of this trade-off - agile for example is great for prioritising, dropping features and getting something useful out the door in a reasonable timeframe with limited resources.

Heroic efforts - these are a bad sign. A regular pattern of projects turning into mad hack-fests, saved by some apparently super-talented individual(s) is indicative of broken processes. One step in addressing this problem involves an honest surgery immediately after the project to determine root causes (e.g. inadequate risk management) of the meltdown and methods of prevention for future projects (e.g. regular risk review and identification of appropriate mitigations).

In the very worst cases, management actively encourages such heroism via recognition and reward. Worshipping this kind of carnage and supposed miracle recovery is tantamount to approving bad project management. Note that well-intentioned management can unknowingly drive this behaviour. From McConnell’s Rapid Development:

Some managers encourage heroic behaviour when they focus too strongly on can-do attitudes. By elevating can-do attitudes above accurate and sometimes gloomy status reporting, such project managers undercut their ability to take corrective action. They don’t even know they need to take corrective action until the damage is done. As Tom DeMarco says, can-do attitudes escalate minor setbacks into true disasters.

No Risk Management - One can never predict or spot all the risks but there are some obvious ones that get missed over and over. For example, we’re building a piece of software that relies on a component we’ve not used before. This is a big risk, one that can be mitigated by writing a test-harness or simulation of the way in which we plan to use the component.

The simulation should include realistic load, failure conditions, maintenance etc and should be as close to the beginning of the project as possible to surface any issues early (we cannot afford to wait until final QA or deployment testing). There can be no shirking here because should our chosen component fail, we will need these tests in place so we can validate potential replacements as quickly and easily as possible.

Waterfall Agile - We’re supposedly “doing” agile but one or more of the following are true:

- There’s a fixed deadline, with fixed features and fixed resources.
- All Negotiation/Trade-off is done prior to project commencement with no review between sprints.
- All sprints have been planned out in advance right up to the release date with no spare time.
- There are no risks to manage (because there aren’t any apparently).
- No-one is entertaining the idea of unknowns.
- When a sprint doesn’t deliver as anticipated, outstanding work is simply crammed into the remaining sprints.

Tags: development, Software

From On Designing and Deploying Internet-Scale Services (James Hamilton - Windows Live Services Platform):

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. Throughout the sections that follow, a consensus emerges that firm separation of development, test, and operations isn’t the most effective approach in the services world. The trend we’ve seen when looking across many services is that low-cost administration correlates highly with how closely the development, test, and operations teams work together.

Yep, I’m a firm believer too….

Tags: Architecture, design, development

One more chance to feel alive,

One more chance to wind up dead,

One more chance to climb so high,

One more chance to bust my head,

One more chance to get on top,

One more chance to make a change,

One more chance for shit to give,

One more chance to even the blame…….

Saliva - One More Chance - Blood Stained Love Story

Tags: Music, Personal

A couple of false economies software development indulges in:

1. It’s quicker for me to write the code than explain the design to someone else.
2. Automated deployment will have to wait until we have more time.

Number one costs a software development team in a number of ways:

- The career development of other members of the team is slowed - if one never discusses design how does one expect to obtain good designers or architects?
- The team’s development capacity is reduced - essentially projects bottleneck around the uncommunicative heroic individual.
- The team’s effectiveness is reduced - project load cannot be divided efficiently because individuals have skills in narrow areas limiting the breadth of work they can perform.
- Team morale is damaged with other developers feeling left out, unfulfilled and unable to influence project decisions.

Number two yields costs including:

- We save on some development time but the cost is re-surfacing in staff-hours required to perform the deployment.
- An increasing number of mistakes that extend deployment time or break releases.
- We save development time once and pay the price for that saved time with each and every deployment.
- The cost of each successive deployment increases because the system’s size is growing.
- As each deployment takes ever longer, the gap between releases is likely to increase.

Tags: development, Engineering, Software

No one would deliberately drive at night with their headlights off. It’s obvious why: we might make it to our destination but we’ll have run down a few pedestrians, bounced off some curbs, hit a lamp-post or two and slid into a few ditches. Were we to continue this way, our car would quickly turn into a useless wreck.

Driving with the headlights on allows us to see ahead, plan and anticipate a little, to think. In turn, our journey is more pleasant, the car lasts a lot longer and there’s much less risk of a fiery end.

Yet many companies drive with their headlights off when developing software. Silly deadlines with a non-negotiable set of features, fixed resource and no time to think. The result? A tangled mess of systems with zero-architecture, huge legacy, horrible brittleness and poor availability. And that most desired property of quick delivery is lost as it takes longer and longer to do even the most simple things.

There’s no substitute for prior thought and realistic planning. Yet so many eschew it whilst complaining about the results.

Technorati Tags: development, engineering, software

I mentioned a while back that one could exploit DNS to ease some of the common static configuration issues around hostnames, ports etc. What follows is a simple outline solution. We’ve moved a long way beyond this at Betfair but the details will have to remain secret for now (sorry).

Let’s assume that we have several different releases in testing at any one time such that we wish to segment our development/testing systems into separate enclaves (each handling a separate release) and may wish to add more enclaves over time. Assume also that production is an enclave in its own right.

Firstly we define a set of logical hostnames that refer to the significant components of our system such as databases, file servers etc. Other elements such as webservers are probably independent and not referenced from other parts of the system and thus do not need names. These logical hostnames are what feature in our configuration files and do not need to change from enclave to enclave because we are going to use DNS to map from these logical hostnames to real physical machines.

Thus what we want is a separate namespace for hosts in each of these enclaves so as to prevent leakage. To that end we map each namespace onto a separate domain within our DNS setup.

[Note our DNS setup would typically consist of a set of servers that maintain records for our own internal domains and possibly forward other requests, for say external web addresses, to other servers.]

Each enclave therefore has:

- A separate namespace represented as a unique domain
- A set of services deployed onto physical machines
- A mapping from logical machine names to physical machine names (or IP addresses)
- A collection of configuration files all referencing logical machine names

Each domain (namespace) contains the logical to physical mapping of machines for its associated enclave. Each domain can be a separate zone and is thus kept in a separate file read by our DNS master. This allows us to maintain a template file which can be quickly edited to create a new domain (namespace). Thus whenever we wish to create a new enclave we set up a new zone, containing the definition of a new domain which is the namespace for that enclave.

To actually resolve a logical hostname we must ensure that it is concatenated with the domain appropriate to the enclave’s namespace. Before discussing options, note that each machine will be allocated to an enclave and must be configured accordingly which we can exploit to our advantage:

- Simple configuration - ensure that the application has access to the domain to concatenate. This could be done via command-line argument but better is to source it from a well-known file on the machine which could be set up as part of allocating it to an enclave.
- Default search domain - any name not fully qualified has the default search domain appended to it. This default is typically part of the resolver configuration of the operating system and again can be set up as part of allocating a machine to an enclave.
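The first option amounts to joining the logical hostname and the enclave’s domain before handing the result to the resolver. A minimal sketch follows; the function name and the example names `db-primary` and `release42.test.example` are assumptions made for illustration, not names from the setup described here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "<logical>.<enclave_domain>" in out.
   Returns 0 on success, or -1 if an argument is missing or empty or the
   result does not fit in the buffer. */
int qualify_hostname(const char *logical, const char *enclave_domain,
                     char *out, size_t out_len) {
    if (logical == NULL || enclave_domain == NULL || out == NULL || out_len == 0) {
        return -1;
    }
    if (logical[0] == '\0' || enclave_domain[0] == '\0') {
        return -1;
    }
    int written = snprintf(out, out_len, "%s.%s", logical, enclave_domain);
    return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}
```

The qualified name can then be passed to a standard resolver call such as `getaddrinfo`. With the second option (a default search domain) this step is unnecessary, because the operating system’s resolver appends the domain itself.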

Missing from the above is the handling of ports which might change from one enclave to the next. This can be tackled with a similar logical/physical mapping approach but must be based on the use of DNS SRV records rather than simple hostname mappings. The JDK provides little help out of the box for querying these records so something like dnsjava will be required.

Technorati Tags: distributed systems, dns, testing
