Checklist: Software Architecture Document
This checklist helps make sure that the Software Architecture Document is stable, correct and complete.
Check Items
General
Overall, the system is soundly based architecturally, because:
-
The architecture appears to be stable.
The need for stability is dictated by the nature of the Construction phase: in Construction the project
typically expands, adding developers who will work in parallel, communicating loosely with other developers as
they produce the product. The degree of independence and parallelism needed in Construction simply cannot be
achieved if the architecture is not stable.
The importance of a stable architecture cannot be overstated. Do not be deceived into thinking that 'pretty
close is good enough' - unstable is unstable, and it is better to get the architecture right and delay the
onset of Construction rather than proceed. The coordination problems involved in trying to repair the
architecture while developers are trying to build upon its foundation will easily erase any apparent benefits
of accelerating the schedule. Changes to architecture during Construction have broad impact: they tend to be
expensive, disruptive and demoralizing.
The real difficulty of assessing architectural stability is that "you don't know what you don't know";
stability is measured relative to expected change. As a result, stability is essentially a subjective measure.
We can, however, base this subjectivity on more than just conjecture. The architecture itself is developed by
considering 'architecturally significant' scenarios - sub-sets of use cases which represent the most
technologically challenging behavior the system must support. Assessing the stability of the architecture
involves ensuring that the architecture has broad coverage, to ensure that there will be no 'surprises' in the
architecture going forward.
Past experience with the architecture can also be a good indicator: if the rate of change in the architecture
is low, and remains low as new scenarios are covered, there is good reason to believe that the architecture is
stabilizing. Conversely, if each new scenario causes changes in the architecture, it is still evolving and
baselining is not yet warranted.
-
The complexity of the system matches the functionality it provides.
-
The conceptual complexity is appropriate given the skill and experience of its:
-
users
-
operators
-
developers
-
The system has a single consistent, coherent architecture
-
The number and types of components are reasonable
-
The system has a consistent system-wide security facility. All the security components work together to
safeguard the system.
-
The system will meet its availability targets.
-
The architecture will permit the system to be recovered in the event of a failure within the required amount of
time.
-
The products and techniques on which the system is based match its expected life.
-
An interim (tactical) system with a short life can safely be built using old technology because it will
soon be discarded.
-
A system with a long life expectancy (most systems) should be built on up-to-date technology and methods so
it can be maintained and expanded to support future requirements.
-
The architecture defines clear interfaces to enable partitioning for parallel team development.
-
The designer of a model element can understand enough from the architecture to successfully design and develop the
model element.
-
The packaging approach reduces complexity and improves understanding.
-
Packages have been defined to be highly cohesive within the package, while the packages themselves are loosely
coupled.
-
Similar solutions within the common application domain have been considered.
-
The proposed solution can be easily understood by someone generally knowledgeable in the problem domain.
-
All people on the team share the same view of the architecture as the one presented by the software architect.
-
The Software Architecture Document is current.
-
The Design Guidelines have been followed.
-
All technical risks have either been mitigated or addressed in a contingency plan. Newly discovered risks have
been documented and analyzed for their potential impact.
-
The key performance requirements (established budgets) have been satisfied.
-
Test cases, test harnesses, and test configurations have been identified.
-
The architecture does not appear to be "over-designed".
-
The mechanisms in place appear to be simple enough to use.
-
The number of mechanisms is modest and consistent with the scope of the system and the demands of the
problem domain.
-
All use-case realizations defined for the current iteration can be executed by the architecture, as demonstrated by
diagrams depicting:
-
Interactions between objects,
-
Interactions between tasks and processes,
-
Interaction between physical nodes.
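Several of the items above (consistent partitioning, loose coupling between packages, clean layering) can be spot-checked mechanically. The sketch below, with an invented package list and dependency map, verifies that declared package dependencies contain no cycles, which a consistent layering requires:

```python
# Sketch: check that declared package dependencies are acyclic, so the
# layering claimed in the architecture is logically consistent.
# Package names and the dependency map are hypothetical examples.

def find_cycle(deps):
    """Return a dependency cycle as a list of packages, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {pkg: WHITE for pkg in deps}
    stack = []

    def visit(pkg):
        color[pkg] = GRAY
        stack.append(pkg)
        for dep in deps.get(pkg, ()):
            if color.get(dep, WHITE) == GRAY:
                # dep is on the current path: a cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                color.setdefault(dep, WHITE)
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[pkg] = BLACK
        return None

    for pkg in list(deps):
        if color[pkg] == WHITE:
            cycle = visit(pkg)
            if cycle:
                return cycle
    return None

layered_deps = {              # upper layers depend only on lower layers
    "presentation": ["application"],
    "application": ["domain"],
    "domain": [],
}
assert find_cycle(layered_deps) is None
```

A cyclic dependency between packages would be returned as an explicit path, making the layering violation easy to report in a review.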
Models
Overall
-
Subsystem and package partitioning and layering is logically consistent.
-
All analysis mechanisms have been identified and described.
Subsystems
-
The services (interfaces) of subsystems in upper-level layers have been defined.
-
The dependencies between subsystems and packages correspond to dependency relationships between the contained
classes.
-
The classes in a subsystem support the services identified for the subsystem.
Classes
-
The key entity classes and their relationships have been identified.
-
Relationships between key entity classes have been defined.
-
The name and description of each class clearly reflects the role it plays.
-
The description of each class accurately captures the responsibilities of the class.
-
The entity classes have been mapped to analysis mechanisms where appropriate.
-
The role names of aggregations and associations accurately describe the relationship between the related
classes.
-
The multiplicities of the relationships are correct.
-
The key entity classes and their relationships are consistent with the business model (if it exists), domain
model (if it exists), requirements, and glossary entries.
General Model Considerations
-
The model is at an appropriate level of detail given the model objectives.
-
For the business model, requirements model or the design model during the elaboration phase, there is not an
over-emphasis on implementation issues.
-
For the design model in the construction phase, there is a good balance of functionality across the model
elements, using composition of relatively simple elements to build a more complex design.
-
The model demonstrates familiarity and competence with the full breadth of modeling concepts applicable to the
problem domain; modeling techniques are used appropriately for the problem at hand.
-
Concepts are modeled in the simplest way possible.
-
The model is easily evolved; expected changes can be easily accommodated.
-
At the same time, the model has not been overly structured to handle unlikely change, at the expense of
simplicity and comprehensibility.
-
The key assumptions behind the model are documented and visible to reviewers of the model. If the assumptions
are applicable to a given iteration, then the model should be able to evolve within those assumptions, but
not necessarily outside of them. Documenting assumptions protects designers from the expectation that they
have examined "all" possible requirements: in an iterative process, it is impossible to analyze every
possible requirement and to define a model which will handle every future requirement.
Diagrams
-
The purpose of the diagram is clearly stated and easily understood.
-
The graphical layout is clean and clearly conveys the intended information.
-
The diagram conveys just enough to accomplish its objective, but no more.
-
Encapsulation is effectively used to hide detail and improve clarity.
-
Abstraction is effectively used to hide detail and improve clarity.
-
Placement of model elements effectively conveys relationships; similar or closely coupled elements are grouped
together.
-
Relationships among model elements are easy to understand.
-
Labeling of model elements contributes to understanding.
Documentation
-
Each model element has a distinct purpose.
-
There are no superfluous model elements; each one plays an essential role in the system.
Error recovery
-
For each error or exception, a policy defines how the system is restored to a "normal" state.
-
For each possible type of input error from the user or wrong data from external systems, a policy defines how
the system is restored to a "normal" state.
-
There is a consistently applied policy for handling exceptional situations.
-
There is a consistently applied policy for handling data corruption in the database.
-
There is a consistently applied policy for handling database unavailability, including whether data can still
be entered into the system and stored later.
-
If data is exchanged between systems, there is a policy for how systems synchronize their views of the data.
-
If the system utilizes redundant processors or nodes to provide fault tolerance or high availability, there is
a strategy for ensuring that no two processors or nodes can 'think' that they are primary, and that at least
one processor or node is primary.
-
The failure modes for a distributed system have been identified and strategies defined for handling the
failures.
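One common strategy for the "no two primaries" item above is a lease granted by a single arbiter: a node may act as primary only while its lease is unexpired. The sketch below is a minimal illustration; the Arbiter class, node names, and timings are invented for the example, not a prescribed mechanism:

```python
# Sketch of a lease-based guard against "two primaries": a node may act
# as primary only while it holds an unexpired lease from a single arbiter.
import time

class Arbiter:
    """Grants at most one primary lease at a time (illustrative)."""
    def __init__(self, lease_seconds=2.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires = 0.0

    def acquire(self, node, now=None):
        """Return True if 'node' now holds the primary lease."""
        now = time.monotonic() if now is None else now
        if self.holder in (None, node) or now >= self.expires:
            self.holder = node
            self.expires = now + self.lease_seconds
            return True
        return False

arbiter = Arbiter()
assert arbiter.acquire("node-a", now=0.0)      # node-a becomes primary
assert not arbiter.acquire("node-b", now=1.0)  # lease still held by node-a
assert arbiter.acquire("node-b", now=3.0)      # lease expired: failover
```

In a real system the arbiter itself must be highly available (or replaced by a consensus protocol), and lease renewal must be faster than lease expiry; the sketch shows only the invariant being checked.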
Transition and Installation
-
The process for upgrading an existing system without loss of data or operational capability is defined and has
been tested.
-
The process for converting data used by previous releases is defined and has been tested.
-
The amount of time and resources required to upgrade or install the product is well-understood and documented.
-
The functionality of the system can be activated one use case at a time.
Administration
-
Disk space can be reorganized or recovered while the system is running.
-
The responsibilities and procedures for system configuration have been identified and documented.
-
Access to the operating system or administration functions is restricted.
-
Licensing requirements are satisfied.
-
Diagnostics routines can be run while the system is running.
-
The system monitors operational performance itself (e.g. capacity threshold, critical performance threshold,
resource exhaustion).
-
The actions taken when thresholds are reached are defined.
-
The alarm handling policy is defined.
-
The alarm handling mechanism is defined and has been prototyped and tested.
-
The alarm handling mechanism can be 'tuned' to prevent false or redundant alarms.
-
The policies and procedures for network (LAN, WAN) monitoring and administration are defined.
-
Faults on the network can be isolated.
-
There is an event tracing facility that can be enabled to aid in troubleshooting.
-
The overhead of the facility is understood.
-
The administration staff possesses the knowledge to use the facility effectively.
-
It is not possible for a malicious user to:
-
enter the system.
-
destroy critical data.
-
consume all resources.
Performance
-
Performance requirements are reasonable and reflect real constraints in the problem domain; their specification
is not arbitrary.
-
Estimates of system performance exist (modeled as necessary using a Workload Analysis Model), and these
indicate that the performance requirements are not significant risks.
-
System performance estimates have been validated using architectural prototypes, especially for
performance-critical requirements.
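A Workload Analysis Model formalizes estimates of this kind; a back-of-the-envelope version can be as simple as a single-server queueing formula. The sketch below uses the standard M/M/1 mean-response-time result with invented arrival and service rates:

```python
# Sketch of a rough workload estimate using the M/M/1 queueing model:
# mean response time = 1 / (service_rate - arrival_rate), valid only
# while utilization is below 1. Rates below are hypothetical.

def mm1_response_time(arrival_rate, service_rate):
    """Mean response time (seconds) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("system is saturated; estimate undefined")
    return 1.0 / (service_rate - arrival_rate)

# 80 requests/s against a server completing 100/s -> 50 ms on average
estimate = mm1_response_time(80.0, 100.0)
assert abs(estimate - 0.05) < 1e-9
```

Such a model only indicates whether the performance requirements are plausible; the checklist's next item, validation against architectural prototypes, is what actually retires the risk.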
Memory Utilization
-
Memory budgets for the application have been defined.
-
Actions have been taken to detect and prevent memory leaks.
-
There is a consistently applied policy defining how the virtual memory system is used, monitored and tuned.
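One concrete way to act on the "detect memory leaks" item is snapshot comparison with Python's standard tracemalloc module; the simulated leak and the threshold below are illustrative:

```python
# Sketch: detect unexpected memory growth by comparing tracemalloc
# snapshots taken before and after a workload. The allocation and the
# 500 KB threshold are invented for the example.
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

leaky = [bytearray(1024) for _ in range(1000)]   # simulated leak: ~1 MB

current = tracemalloc.take_snapshot()
growth = sum(stat.size_diff
             for stat in current.compare_to(baseline, "lineno"))
assert growth > 500_000          # flag the growth as suspicious
tracemalloc.stop()
```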
Cost and Schedule
-
The actual number of lines of code developed thus far agrees with the estimated lines of code at the current
milestone.
-
The estimation assumptions have been reviewed and remain valid.
-
Cost and schedule estimates have been re-computed using the most recent actual project experience and
productivity performance.
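Re-computing the schedule from actuals, as the last item requires, can be as simple as projecting forward the observed productivity. All figures in this sketch are hypothetical:

```python
# Sketch: re-estimate remaining duration from actual productivity to date.

def remaining_weeks(total_loc, actual_loc, weeks_elapsed):
    """Project remaining weeks from observed lines-of-code per week."""
    if weeks_elapsed <= 0 or actual_loc <= 0:
        raise ValueError("need actual progress to re-estimate")
    productivity = actual_loc / weeks_elapsed      # LOC/week, observed
    return (total_loc - actual_loc) / productivity

# 40,000 LOC planned; 10,000 done in 5 weeks -> 2,000 LOC/week -> 15 weeks
assert remaining_weeks(40_000, 10_000, 5) == 15.0
```

Lines of code is a crude proxy, as the surrounding items acknowledge; the point is that the estimate is recomputed from measured experience rather than the original assumptions.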
Portability
-
Portability requirements have been met.
-
Programming Guidelines provide specific guidance on creating portable code.
-
Design Guidelines provide specific guidance on designing portable applications.
-
A 'test port' has been done to verify portability claims.
Reliability
-
Measures of quality (MTBF, number of outstanding defects, etc.) have been met.
-
The architecture provides for recovery in the event of disaster or system failure.
Security
-
Security requirements have been met.
Organizational Issues
-
Are the teams well-structured? Are responsibilities well-partitioned between teams?
-
Are there political, organizational or administrative issues that restrict the effectiveness of the teams?
-
Are there personality conflicts?
The Use-Case View
In the Use-Case View section of the Software Architecture Document:
-
each use case is architecturally significant, identified as such because it:
-
is vitally important to the customer
-
motivates key elements in the other views
-
is a driver for mitigating one or more major risks, including any challenging non-functional
requirements.
-
there are no use cases whose architectural concerns are already covered by another use case
-
the architecturally significant aspects of the use case are clear, and not lost in details
-
the use case is clear and unlikely to change in a way that affects the architecture, or there is a plan in
place for how to achieve such clarity and stability
-
no architecturally significant use cases have been missed (may require some analysis of the use cases not
selected for this view).
The Logical View
The Logical View section of the Software Architecture Document:
-
accurately and completely presents an overview of the architecturally significant elements of the design.
-
presents the complete set of architectural mechanisms used in the design along with the rationale used in their
selection.
-
presents the layering of the design, along with the rationale used to partition the layers.
-
presents any frameworks or patterns used in the design, along with the rationale used to select the patterns or
frameworks.
-
The number of architecturally significant model elements is proportionate to the size and scope of the system,
and is of a size which still renders the major concepts at work in the system understandable.
The Process View
Resource Utilization
-
Potential race conditions (process competition for critical resources) have been identified and avoidance and
resolution strategies have been defined.
-
There is a defined strategy for handling "I/O queue full" or "buffer full" conditions.
-
The system monitors itself (capacity threshold, critical performance threshold, resource exhaustion) and is
capable of taking corrective action when a problem is detected.
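A defined strategy for "buffer full" conditions, as the item above requires, usually means the producer learns of the rejection and can apply backpressure rather than the system silently dropping work or exhausting memory. A minimal sketch, with invented names and sizes:

```python
# Sketch of one "buffer full" strategy: a bounded queue that rejects new
# work explicitly, so the caller can retry, shed load, or raise an alarm.
from queue import Full, Queue

work = Queue(maxsize=2)

def submit(item):
    """Return True if accepted, False if the buffer is full."""
    try:
        work.put_nowait(item)
        return True
    except Full:
        return False

assert submit("job-1")
assert submit("job-2")
assert not submit("job-3")   # full: rejected visibly, not dropped silently
```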
Performance
-
Response time requirements for each message have been identified.
-
There is a diagnostic mode for the system which allows message response times to be measured.
-
The nominal and maximal performance requirements for important operations have been specified.
-
There are a set of performance tests capable of measuring whether performance requirements have been met.
-
The performance tests cover the "extra-normal" behavior of the system (startup and shutdown, alternate and
exceptional flows of events of the use cases, system failure modes).
-
Architectural weaknesses creating the potential for performance bottlenecks have been identified. Particular
emphasis has been given to:
-
Use of some finite shared resource such as (but not limited to) semaphores, file handles, locks,
latches, shared memory, etc.
-
inter-process communication. Communication across process boundaries is always more expensive than
in-process communication.
-
inter-processor communication. Communication across processor boundaries is always more expensive than
inter-process communication.
-
physical and virtual memory usage; the point at which the system runs out of physical memory and starts
using virtual memory is a point at which performance usually drops precipitously.
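The diagnostic mode called for above, in which message response times are measured against their stated requirements, can be sketched as a timing wrapper around the dispatch path. The handler and limits here are hypothetical:

```python
# Sketch of a diagnostic dispatch wrapper that measures each message's
# response time and compares it to its requirement. The message name,
# handler, and limit are invented for the example.
import time

RESPONSE_LIMITS = {"get_status": 0.100}   # seconds, from the requirements

def timed_dispatch(name, handler, *args):
    start = time.perf_counter()
    result = handler(*args)
    elapsed = time.perf_counter() - start
    within_limit = elapsed <= RESPONSE_LIMITS.get(name, float("inf"))
    return result, elapsed, within_limit

result, elapsed, ok = timed_dispatch("get_status", lambda: "OK")
assert result == "OK" and ok
```

Running the same wrapper during the "extra-normal" scenarios (startup, shutdown, failure modes) gives the coverage the performance-test items above require.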
Fault Tolerance
-
Where there are primary and backup processes, the potential for more than one process believing that it is
primary (or no process believing that it is primary) has been considered and specific design actions have been
taken to resolve the conflict.
-
There are external processes that will restore the system to a consistent state when an event like a process
failure leaves the system in an inconsistent state.
-
The system is tolerant of errors and exceptions, such that when an error or exception occurs, the system can
revert to a consistent state.
-
Diagnostic tests can be executed while the system is running.
-
The system can be upgraded (hardware, software) while it is running, if required.
-
There is a consistent policy for handling alarms in the system, and the policy has been consistently applied.
The alarm policy addresses:
-
the "sensitivity" of the alarm reporting mechanism;
-
the prevention of false or redundant alarms;
-
the training and user interface requirements of staff who will use the alarm reporting mechanism.
-
The performance of the alarm reporting mechanism has been assessed and falls within acceptable performance
thresholds as established in the performance requirements.
-
The workload/performance requirements have been examined and have been satisfied. In the case where the
performance requirements are unrealistic, they have been re-negotiated.
-
Memory budgets, to the extent that they exist, have been identified and the software has been verified to meet
those requirements. Measures have been taken to detect and prevent memory leaks.
-
A policy exists for use of the virtual memory system, including how to monitor and tune its usage.
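The alarm-policy items above, in particular "prevention of false or redundant alarms", are commonly implemented with hysteresis: an alarm is raised once when a high threshold is crossed and not repeated until the value falls below a lower reset threshold. A minimal sketch with illustrative thresholds:

```python
# Sketch of an alarm with hysteresis: report once per episode; suppress
# repeats until the value clears a lower reset threshold.

class Alarm:
    def __init__(self, raise_at, clear_at):
        self.raise_at, self.clear_at = raise_at, clear_at
        self.active = False

    def update(self, value):
        """Return True only when a new alarm should be reported."""
        if not self.active and value >= self.raise_at:
            self.active = True
            return True
        if self.active and value <= self.clear_at:
            self.active = False
        return False

cpu = Alarm(raise_at=90, clear_at=70)
assert cpu.update(95) is True     # first crossing: report
assert cpu.update(96) is False    # still high: redundant, suppressed
cpu.update(60)                    # drops below 70: episode cleared
assert cpu.update(95) is True     # new episode: report again
```

The gap between the two thresholds is the 'tuning' knob the checklist refers to: widening it trades responsiveness for fewer redundant alarms.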
Modularity
-
Processes are sufficiently independent of one another that they can be distributed across processors or nodes
when required.
-
Processes which must remain co-located (because of performance and throughput requirements, or the
inter-process communication mechanism (e.g. semaphores or shared memory)) have been identified, and the impact
of not being able to distribute this workload has been taken into consideration.
-
Messages which can be made asynchronous, so that they can be processed when resources are more available, have
been identified.
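Work identified as safely asynchronous, per the last item, can be deferred through a queue so the caller returns immediately and a worker drains the backlog when resources allow. A self-contained sketch with an invented message and workload:

```python
# Sketch of deferring an asynchronous message: the caller enqueues and
# returns; a worker thread processes the backlog independently.
import queue
import threading

outbox = queue.Queue()
processed = []

def worker():
    while True:
        msg = outbox.get()
        if msg is None:           # sentinel: shut the worker down
            break
        processed.append(msg.upper())   # the deferrable work

t = threading.Thread(target=worker)
t.start()
outbox.put("audit-event")         # caller does not wait for processing
outbox.put(None)
t.join()
assert processed == ["AUDIT-EVENT"]
```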
The Deployment View
-
The throughput requirements have been satisfied by the distribution of processing across nodes, and potential
performance bottlenecks have been addressed.
-
Where information is distributed and potentially replicated across several nodes, information integrity is
ensured.
-
Requirements for reliable transport of messages, to the extent that they exist, have been satisfied.
-
Requirements for secure transport of messages, to the extent that they exist, have been satisfied.
-
Processing has been distributed across nodes in such a way that network traffic and response time have been
minimized subject to consistency and resource constraints.
-
System availability requirements, to the extent that they exist, have been satisfied.
-
The maximum system down-time in the event of a server or network failure has been determined and is
within acceptable limits as defined by the requirements.
-
Redundant and stand-by servers have been defined in such a way that it is not possible for more than
one server to be designated as the "primary" server.
-
All potential failure modes have been documented.
-
Faults in the network can be isolated, diagnosed and resolved.
-
The amount of "headroom" in the CPU utilization has been identified, and the method of measurement has been
defined.
-
There is a stated policy for the actions to be taken when the maximum CPU utilization is exceeded.
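The headroom and policy items above can be made concrete by computing headroom relative to the busiest observed interval and attaching the stated policy action to a negative result. The ceiling and samples in this sketch are hypothetical:

```python
# Sketch: CPU headroom from utilization samples, with a policy hook for
# when the stated maximum is exceeded. Figures are illustrative.

MAX_UTILIZATION = 0.80     # stated requirement: keep 20% headroom

def headroom(samples):
    """Headroom relative to the busiest observed interval."""
    return MAX_UTILIZATION - max(samples)

observed = [0.55, 0.62, 0.71]        # hypothetical utilization samples
room = headroom(observed)
assert abs(room - 0.09) < 1e-9       # 9% below the ceiling
if room < 0:
    print("policy: shed load or add capacity")
```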
© Copyright IBM Corp. 1987, 2006. All Rights Reserved.