Checklist: Software Architecture Document
This checklist helps make sure that the Software Architecture Document is stable, correct and complete.
Check Items
General
Overall, the system is soundly based architecturally, because:
-
The architecture appears to be stable.
The need for stability is dictated by the nature of the Construction phase: in Construction the project
typically expands, adding developers who will work in parallel, communicating loosely with other developers as
they produce the product. The degree of independence and parallelism needed in Construction simply cannot be
achieved if the architecture is not stable.
The importance of a stable architecture cannot be overstated. Do not be deceived into thinking that 'pretty
close is good enough' - unstable is unstable, and it is better to get the architecture right and delay the
onset of Construction rather than proceed. The coordination problems involved in trying to repair the
architecture while developers are trying to build upon its foundation will easily erase any apparent benefits
of accelerating the schedule. Changes to architecture during Construction have broad impact: they tend to be
expensive, disruptive and demoralizing.
The real difficulty of assessing architectural stability is that "you don't know what you don't know";
stability is measured relative to expected change. As a result, stability is essentially a subjective measure.
We can, however, base this subjectivity on more than just conjecture. The architecture itself is developed by
considering 'architecturally significant' scenarios - sub-sets of use cases which represent the most
technologically challenging behavior the system must support. Assessing the stability of the architecture
involves ensuring that the architecture has broad coverage, to ensure that there will be no 'surprises' in the
architecture going forward.
Past experience with the architecture can also be a good indicator: if the rate of change in the architecture
is low, and remains low as new scenarios are covered, there is good reason to believe that the architecture is
stabilizing. Conversely, if each new scenario causes changes in the architecture, it is still evolving and
baselining is not yet warranted.
-
The complexity of the system matches the functionality it provides.
-
The conceptual complexity is appropriate given the skill and experience of its:
-
users
-
operators
-
developers
-
The system has a single consistent, coherent architecture
-
The number and types of components are reasonable
-
The system has a consistent system-wide security facility. All the security components work together to
safeguard the system.
-
The system will meet its availability targets.
-
The architecture will permit the system to be recovered in the event of a failure within the required amount of
time.
-
The products and techniques on which the system is based match its expected life.
-
An interim (tactical) system with a short life can safely be built using old technology because it will
soon be discarded.
-
A system with a long life expectancy (most systems) should be built on up-to-date technology and methods so
it can be maintained and expanded to support future requirements.
-
The architecture defines clear interfaces to enable partitioning for parallel team development.
-
The designer of a model element can understand enough from the architecture to successfully design and develop the
model element.
-
The packaging approach reduces complexity and improves understanding.
-
Packages have been defined to be highly cohesive within the package, while the packages themselves are loosely
coupled.
-
Similar solutions within the common application domain have been considered.
-
The proposed solution can be easily understood by someone generally knowledgeable in the problem domain.
-
All people on the team share the same view of the architecture as the one presented by the software architect.
-
The Software Architecture Document is current.
-
The Design Guidelines have been followed.
-
All technical risks have either been mitigated or addressed in a contingency plan. Newly discovered risks have
been documented and analyzed for their potential impact.
-
The key performance requirements (established budgets) have been satisfied.
-
Test cases, test harnesses, and test configurations have been identified.
-
The architecture does not appear to be "over-designed".
-
The mechanisms in place appear to be simple enough to use.
-
The number of mechanisms is modest and consistent with the scope of the system and the demands of the
problem domain.
-
All use-case realizations defined for the current iteration can be executed by the architecture, as demonstrated by
diagrams depicting:
-
Interactions between objects,
-
Interactions between tasks and processes,
-
Interaction between physical nodes.
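Several of the items above (consistent partitioning, loose coupling between packages, clean layering) can be spot-checked mechanically. The sketch below, with an invented package list and dependency map, verifies that declared package dependencies contain no cycles, which a consistent layering requires:

```python
# Sketch: check that declared package dependencies are acyclic, so the
# layering claimed in the architecture is logically consistent.
# Package names and the dependency map are hypothetical examples.

def find_cycle(deps):
    """Return a dependency cycle as a list of packages, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {pkg: WHITE for pkg in deps}
    stack = []

    def visit(pkg):
        color[pkg] = GRAY
        stack.append(pkg)
        for dep in deps.get(pkg, ()):
            if color.get(dep, WHITE) == GRAY:
                # dep is on the current path: a cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                color.setdefault(dep, WHITE)
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[pkg] = BLACK
        return None

    for pkg in list(deps):
        if color[pkg] == WHITE:
            cycle = visit(pkg)
            if cycle:
                return cycle
    return None

layered_deps = {              # upper layers depend only on lower layers
    "presentation": ["application"],
    "application": ["domain"],
    "domain": [],
}
assert find_cycle(layered_deps) is None
```

A cyclic dependency between packages would be returned as an explicit path, making the layering violation easy to report in a review.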
Models
Overall
-
Subsystem and package partitioning and layering is logically consistent.
-
All analysis mechanisms have been identified and described.
Subsystems
-
The services (interfaces) of subsystems in upper-level layers have been defined.
-
The dependencies between subsystems and packages correspond to dependency relationships between the contained
classes.
-
The classes in a subsystem support the services identified for the subsystem.
Classes
-
The key entity classes and their relationships have been identified.
-
Relationships between key entity classes have been defined.
-
The name and description of each class clearly reflects the role it plays.
-
The description of each class accurately captures the responsibilities of the class.
-
The entity classes have been mapped to analysis mechanisms where appropriate.
-
The role names of aggregations and associations accurately describe the relationship between the related
classes.
-
The multiplicities of the relationships are correct.
-
The key entity classes and their relationships are consistent with the business model (if it exists), domain
model (if it exists), requirements, and glossary entries.
General Model Considerations
-
The model is at an appropriate level of detail given the model objectives.
-
For the business model, requirements model or the design model during the elaboration phase, there is not an
over-emphasis on implementation issues.
-
For the design model in the construction phase, there is a good balance of functionality across the model
elements, using composition of relatively simple elements to build a more complex design.
-
The model demonstrates familiarity and competence with the full breadth of modeling concepts applicable to the
problem domain; modeling techniques are used appropriately for the problem at hand.
-
Concepts are modeled in the simplest way possible.
-
The model is easily evolved; expected changes can be easily accommodated.
-
At the same time, the model has not been overly structured to handle unlikely change, at the expense of
simplicity and comprehensibility.
-
The key assumptions behind the model are documented and visible to reviewers of the model. If the assumptions
are applicable to a given iteration, then the model should be able to evolve within those assumptions, but
not necessarily outside of them. Documenting assumptions protects designers from the expectation that they
have examined "all" possible requirements: in an iterative process, it is impossible to analyze every
possible requirement and to define a model which will handle every future requirement.
Diagrams
-
The purpose of the diagram is clearly stated and easily understood.
-
The graphical layout is clean and clearly conveys the intended information.
-
The diagram conveys just enough to accomplish its objective, but no more.
-
Encapsulation is effectively used to hide detail and improve clarity.
-
Abstraction is effectively used to hide detail and improve clarity.
-
Placement of model elements effectively conveys relationships; similar or closely coupled elements are grouped
together.
-
Relationships among model elements are easy to understand.
-
Labeling of model elements contributes to understanding.
Documentation
-
Each model element has a distinct purpose.
-
There are no superfluous model elements; each one plays an essential role in the system.
Error recovery
-
For each error or exception, a policy defines how the system is restored to a "normal" state.
-
For each possible type of input error from the user or wrong data from external systems, a policy defines how
the system is restored to a "normal" state.
-
There is a consistently applied policy for handling exceptional situations.
-
There is a consistently applied policy for handling data corruption in the database.
-
There is a consistently applied policy for handling database unavailability, including whether data can still
be entered into the system and stored later.
-
If data is exchanged between systems, there is a policy for how systems synchronize their views of the data.
-
If the system utilizes redundant processors or nodes to provide fault tolerance or high availability, there is
a strategy for ensuring that no two processors or nodes can 'think' that they are primary, and that at least
one processor or node is primary.
-
The failure modes for a distributed system have been identified and strategies defined for handling the
failures.
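One common strategy for the "no two primaries" item above is a lease granted by a single arbiter: a node may act as primary only while its lease is unexpired. The sketch below is a minimal illustration; the Arbiter class, node names, and timings are invented for the example, not a prescribed mechanism:

```python
# Sketch of a lease-based guard against "two primaries": a node may act
# as primary only while it holds an unexpired lease from a single arbiter.
import time

class Arbiter:
    """Grants at most one primary lease at a time (illustrative)."""
    def __init__(self, lease_seconds=2.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires = 0.0

    def acquire(self, node, now=None):
        """Return True if 'node' now holds the primary lease."""
        now = time.monotonic() if now is None else now
        if self.holder in (None, node) or now >= self.expires:
            self.holder = node
            self.expires = now + self.lease_seconds
            return True
        return False

arbiter = Arbiter()
assert arbiter.acquire("node-a", now=0.0)      # node-a becomes primary
assert not arbiter.acquire("node-b", now=1.0)  # lease still held by node-a
assert arbiter.acquire("node-b", now=3.0)      # lease expired: failover
```

In a real system the arbiter itself must be highly available (or replaced by a consensus protocol), and lease renewal must be faster than lease expiry; the sketch shows only the invariant being checked.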
Transition and Installation
-
The process for upgrading an existing system without loss of data or operational capability is defined and has
been tested.
-
The process for converting data used by previous releases is defined and has been tested.
-
The amount of time and resources required to upgrade or install the product is well-understood and documented.
-
The functionality of the system can be activated one use case at a time.
Administration
-
Disk space can be reorganized or recovered while the system is running.
-
The responsibilities and procedures for system configuration have been identified and documented.
-
Access to the operating system or administration functions is restricted.
-
Licensing requirements are satisfied.
-
Diagnostics routines can be run while the system is running.
-
The system monitors operational performance itself (e.g. capacity threshold, critical performance threshold,
resource exhaustion).
-
The actions taken when thresholds are reached are defined.
-
The alarm handling policy is defined.
-
The alarm handling mechanism is defined and has been prototyped and tested.
-
The alarm handling mechanism can be 'tuned' to prevent false or redundant alarms.
-
The policies and procedures for network (LAN, WAN) monitoring and administration are defined.
-
Faults on the network can be isolated.
-
There is an event tracing facility that can be enabled to aid in troubleshooting.
-
The overhead of the facility is understood.
-
The administration staff possesses the knowledge to use the facility effectively.
-
It is not possible for a malicious user to:
-
enter the system.
-
destroy critical data.
-
consume all resources.
Performance
-
Performance requirements are reasonable and reflect real constraints in the problem domain; their specification
is not arbitrary.
-
Estimates of system performance exist (modeled as necessary using a Workload Analysis Model), and these
indicate that the performance requirements are not significant risks.
-
System performance estimates have been validated using architectural prototypes, especially for
performance-critical requirements.
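A Workload Analysis Model formalizes estimates of this kind; a back-of-the-envelope version can be as simple as a single-server queueing formula. The sketch below uses the standard M/M/1 mean-response-time result with invented arrival and service rates:

```python
# Sketch of a rough workload estimate using the M/M/1 queueing model:
# mean response time = 1 / (service_rate - arrival_rate), valid only
# while utilization is below 1. Rates below are hypothetical.

def mm1_response_time(arrival_rate, service_rate):
    """Mean response time (seconds) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("system is saturated; estimate undefined")
    return 1.0 / (service_rate - arrival_rate)

# 80 requests/s against a server completing 100/s -> 50 ms on average
estimate = mm1_response_time(80.0, 100.0)
assert abs(estimate - 0.05) < 1e-9
```

Such a model only indicates whether the performance requirements are plausible; the checklist's next item, validation against architectural prototypes, is what actually retires the risk.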
Memory Utilization
-
Memory budgets for the application have been defined.
-
Actions have been taken to detect and prevent memory leaks.
-
There is a consistently applied policy defining how the virtual memory system is used, monitored and tuned.
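One concrete way to act on the "detect memory leaks" item is snapshot comparison with Python's standard tracemalloc module; the simulated leak and the threshold below are illustrative:

```python
# Sketch: detect unexpected memory growth by comparing tracemalloc
# snapshots taken before and after a workload. The allocation and the
# 500 KB threshold are invented for the example.
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

leaky = [bytearray(1024) for _ in range(1000)]   # simulated leak: ~1 MB

current = tracemalloc.take_snapshot()
growth = sum(stat.size_diff
             for stat in current.compare_to(baseline, "lineno"))
assert growth > 500_000          # flag the growth as suspicious
tracemalloc.stop()
```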
Cost and Schedule
-
The actual number of lines of code developed thus far agrees with the estimated lines of code at the current
milestone.
-
The estimation assumptions have been reviewed and remain valid.
-
Cost and schedule estimates have been re-computed using the most recent actual project experience and
productivity performance.
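Re-computing the schedule from actuals, as the last item requires, can be as simple as projecting forward the observed productivity. All figures in this sketch are hypothetical:

```python
# Sketch: re-estimate remaining duration from actual productivity to date.

def remaining_weeks(total_loc, actual_loc, weeks_elapsed):
    """Project remaining weeks from observed lines-of-code per week."""
    if weeks_elapsed <= 0 or actual_loc <= 0:
        raise ValueError("need actual progress to re-estimate")
    productivity = actual_loc / weeks_elapsed      # LOC/week, observed
    return (total_loc - actual_loc) / productivity

# 40,000 LOC planned; 10,000 done in 5 weeks -> 2,000 LOC/week -> 15 weeks
assert remaining_weeks(40_000, 10_000, 5) == 15.0
```

Lines of code is a crude proxy, as the surrounding items acknowledge; the point is that the estimate is recomputed from measured experience rather than the original assumptions.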
Portability
-
Portability requirements have been met.
-
Programming Guidelines provide specific guidance on creating portable code.
-
Design Guidelines provide specific guidance on designing portable applications.
-
A 'test port' has been done to verify portability claims.
Reliability
-
Measures of quality (MTBF, number of outstanding defects, etc.) have been met.
-
The architecture provides for recovery in the event of disaster or system failure.
Security
-
Security requirements have been met.
Organizational Issues
-
Are the teams well-structured? Are responsibilities well-partitioned between teams?
-
Are there political, organizational or administrative issues that restrict the effectiveness of the teams?
-
Are there personality conflicts?
The Use-Case View
In the Use-Case View section of the Software Architecture Document:
-
each use case is architecturally significant, identified as such because it:
-
is vitally important to the customer
-
motivates key elements in the other views
-
is a driver for mitigating one or more major risks, including any challenging non-functional
requirements.
-
there are no use cases whose architectural concerns are already covered by another use case
-
the architecturally significant aspects of the use case are clear, and not lost in details
-
the use case is clear and unlikely to change in a way that affects the architecture, or there is a plan in
place for how to achieve such clarity and stability
-
no architecturally significant use cases have been missed (may require some analysis of the use cases not
selected for this view).
The Logical View
The Logical View section of the Software Architecture Document:
-
accurately and completely presents an overview of the architecturally significant elements of the design.
-
presents the complete set of architectural mechanisms used in the design along with the rationale used in their
selection.
-
presents the layering of the design, along with the rationale used to partition the layers.
-
presents any frameworks or patterns used in the design, along with the rationale used to select the patterns or
frameworks.
-
The number of architecturally significant model elements is proportionate to the size and scope of the system,
and is of a size which still renders the major concepts at work in the system understandable.
The Process View
Resource Utilization
-
Potential race conditions (process competition for critical resources) have been identified and avoidance and
resolution strategies have been defined.
-
There is a defined strategy for handling "I/O queue full" or "buffer full" conditions.
-
The system monitors itself (capacity threshold, critical performance threshold, resource exhaustion) and is
capable of taking corrective action when a problem is detected.
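A defined strategy for "buffer full" conditions, as the item above requires, usually means the producer learns of the rejection and can apply backpressure rather than the system silently dropping work or exhausting memory. A minimal sketch, with invented names and sizes:

```python
# Sketch of one "buffer full" strategy: a bounded queue that rejects new
# work explicitly, so the caller can retry, shed load, or raise an alarm.
from queue import Full, Queue

work = Queue(maxsize=2)

def submit(item):
    """Return True if accepted, False if the buffer is full."""
    try:
        work.put_nowait(item)
        return True
    except Full:
        return False

assert submit("job-1")
assert submit("job-2")
assert not submit("job-3")   # full: rejected visibly, not dropped silently
```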
Performance
-
Response time requirements for each message have been identified.
-
There is a diagnostic mode for the system which allows message response times to be measured.
-
The nominal and maximal performance requirements for important operations have been specified.
-
There are a set of performance tests capable of measuring whether performance requirements have been met.
-
The performance tests cover the "extra-normal" behavior of the system (startup and shutdown, alternate and
exceptional flows of events of the use cases, system failure modes).
-
Architectural weaknesses creating the potential for performance bottlenecks have been identified. Particular
emphasis has been given to:
-
Use of some finite shared resource such as (but not limited to) semaphores, file handles, locks,
latches, shared memory, etc.
-
inter-process communication. Communication across process boundaries is always more expensive than
in-process communication.
-
inter-processor communication. Communication across processor boundaries is always more expensive than
inter-process communication.
-
physical and virtual memory usage; the point at which the system runs out of physical memory and starts
using virtual memory is a point at which performance usually drops precipitously.
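The diagnostic mode called for above, in which message response times are measured against their stated requirements, can be sketched as a timing wrapper around the dispatch path. The handler and limits here are hypothetical:

```python
# Sketch of a diagnostic dispatch wrapper that measures each message's
# response time and compares it to its requirement. The message name,
# handler, and limit are invented for the example.
import time

RESPONSE_LIMITS = {"get_status": 0.100}   # seconds, from the requirements

def timed_dispatch(name, handler, *args):
    start = time.perf_counter()
    result = handler(*args)
    elapsed = time.perf_counter() - start
    within_limit = elapsed <= RESPONSE_LIMITS.get(name, float("inf"))
    return result, elapsed, within_limit

result, elapsed, ok = timed_dispatch("get_status", lambda: "OK")
assert result == "OK" and ok
```

Running the same wrapper during the "extra-normal" scenarios (startup, shutdown, failure modes) gives the coverage the performance-test items above require.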
Fault Tolerance
-
Where there are primary and backup processes, the potential for more than one process believing that it is
primary (or no process believing that it is primary) has been considered and specific design actions have been
taken to resolve the conflict.
-
There are external processes that will restore the system to a consistent state when an event like a process
failure leaves the system in an inconsistent state.
-
The system is tolerant of errors and exceptions, such that when an error or exception occurs, the system can
revert to a consistent state.
-
Diagnostic tests can be executed while the system is running.
-
The system can be upgraded (hardware, software) while it is running, if required.
-
There is a consistent policy for handling alarms in the system, and the policy has been consistently applied.
The alarm policy addresses:
-
the "sensitivity" of the alarm reporting mechanism;
-
the prevention of false or redundant alarms;
-
the training and user interface requirements of staff who will use the alarm reporting mechanism.
-
The performance of the alarm reporting mechanism has been assessed and falls within acceptable performance
thresholds as established in the performance requirements.
-
The workload/performance requirements have been examined and have been satisfied. In the case where the
performance requirements are unrealistic, they have been re-negotiated.
-
Memory budgets, to the extent that they exist, have been identified and the software has been verified to meet
those requirements. Measures have been taken to detect and prevent memory leaks.
-
A policy exists for use of the virtual memory system, including how to monitor and tune its usage.
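The alarm-policy items above, in particular "prevention of false or redundant alarms", are commonly implemented with hysteresis: an alarm is raised once when a high threshold is crossed and not repeated until the value falls below a lower reset threshold. A minimal sketch with illustrative thresholds:

```python
# Sketch of an alarm with hysteresis: report once per episode; suppress
# repeats until the value clears a lower reset threshold.

class Alarm:
    def __init__(self, raise_at, clear_at):
        self.raise_at, self.clear_at = raise_at, clear_at
        self.active = False

    def update(self, value):
        """Return True only when a new alarm should be reported."""
        if not self.active and value >= self.raise_at:
            self.active = True
            return True
        if self.active and value <= self.clear_at:
            self.active = False
        return False

cpu = Alarm(raise_at=90, clear_at=70)
assert cpu.update(95) is True     # first crossing: report
assert cpu.update(96) is False    # still high: redundant, suppressed
cpu.update(60)                    # drops below 70: episode cleared
assert cpu.update(95) is True     # new episode: report again
```

The gap between the two thresholds is the 'tuning' knob the checklist refers to: widening it trades responsiveness for fewer redundant alarms.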
Modularity
-
Processes are sufficiently independent of one another that they can be distributed across processors or nodes
when required.
-
Processes which must remain co-located (because of performance and throughput requirements, or the
inter-process communication mechanism (e.g. semaphores or shared memory)) have been identified, and the impact
of not being able to distribute this workload has been taken into consideration.
-
Messages which can be made asynchronous, so that they can be processed when resources are more available, have
been identified.
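Work identified as safely asynchronous, per the last item, can be deferred through a queue so the caller returns immediately and a worker drains the backlog when resources allow. A self-contained sketch with an invented message and workload:

```python
# Sketch of deferring an asynchronous message: the caller enqueues and
# returns; a worker thread processes the backlog independently.
import queue
import threading

outbox = queue.Queue()
processed = []

def worker():
    while True:
        msg = outbox.get()
        if msg is None:           # sentinel: shut the worker down
            break
        processed.append(msg.upper())   # the deferrable work

t = threading.Thread(target=worker)
t.start()
outbox.put("audit-event")         # caller does not wait for processing
outbox.put(None)
t.join()
assert processed == ["AUDIT-EVENT"]
```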
The Deployment View
-
The throughput requirements have been satisfied by the distribution of processing across nodes, and potential
performance bottlenecks have been addressed.
-
Where information is distributed and potentially replicated across several nodes, information integrity is
ensured.
-
Requirements for reliable transport of messages, to the extent that they exist, have been satisfied.
-
Requirements for secure transport of messages, to the extent that they exist, have been satisfied.
-
Processing has been distributed across nodes in such a way that network traffic and response time have been
minimized subject to consistency and resource constraints.
-
System availability requirements, to the extent that they exist, have been satisfied.
-
The maximum system down-time in the event of a server or network failure has been determined and is
within acceptable limits as defined by the requirements.
-
Redundant and stand-by servers have been defined in such a way that it is not possible for more than
one server to be designated as the "primary" server.
-
All potential failure modes have been documented.
-
Faults in the network can be isolated, diagnosed and resolved.
-
The amount of "headroom" in the CPU utilization has been identified, and the method of measurement has been
defined.
-
There is a stated policy for the actions to be taken when the maximum CPU utilization is exceeded.
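The headroom and policy items above can be made concrete by computing headroom relative to the busiest observed interval and attaching the stated policy action to a negative result. The ceiling and samples in this sketch are hypothetical:

```python
# Sketch: CPU headroom from utilization samples, with a policy hook for
# when the stated maximum is exceeded. Figures are illustrative.

MAX_UTILIZATION = 0.80     # stated requirement: keep 20% headroom

def headroom(samples):
    """Headroom relative to the busiest observed interval."""
    return MAX_UTILIZATION - max(samples)

observed = [0.55, 0.62, 0.71]        # hypothetical utilization samples
room = headroom(observed)
assert abs(room - 0.09) < 1e-9       # 9% below the ceiling
if room < 0:
    print("policy: shed load or add capacity")
```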
© Copyright IBM Corp. 1987, 2006. All Rights Reserved.