by M. E. Kabay
This review
is intended to provide guidance for critical reading of research results about
computer crime. It will also alert designers of research instruments who may
lack formal training in survey design and analysis to the need for professional
support in developing questionnaires and analyzing results.
Security
specialists are often asked about computer crime; for example, customers want
to know who is attacking which systems, how often, using what methods. These
questions are perceived as important because they bear upon the strategies
of risk management; in theory, in order to estimate the appropriate level
of investment in security, it would be helpful to have a sound grasp of the
probability of different levels of damage. Ideally, one would want to evaluate
an organization’s level of risk by evaluating the experiences of other organizations
with similar system and business characteristics. Such comparisons would be
useful in competitive analysis and in litigation over standards of due care
and diligence in protecting corporate assets.
Unfortunately,
in the current state of information security, no one can give reliable answers
to such questions. There are two fundamental difficulties preventing us from
developing accurate statistics of this kind. These difficulties are known
as the problems of ascertainment.
The
first problem is that an unknown number of crimes of all kinds are undetected.
For example, even outside the computer crime field, we don't know how many
financial frauds are being perpetrated. We don't know because some of them
are not detected. How do we know they're not detected? Because some frauds
are discovered long after they have occurred. Similarly, computer crimes may
not be detected by their victims but may be reported by the perpetrators.
In
a landmark series of tests at the Department of Defense, the Defense Information
Systems Agency found that very few of the penetrations it engineered against
unclassified systems within the DoD seem to have been detected by system managers.
These studies were carried out from 1994 through 1996 and attacked 68,000
systems. About two-thirds of the attacks succeeded; however, only 4% of these
attacks were detected.
A
commonly-held view within the information security community is that only
one-tenth or so of all the crimes committed against and using computer systems
are detected.
The
second problem of ascertainment is that even if attacks are detected, it seems
that few are reported in a way that allows systematic data collection. This
belief is based in part on the unquantified experience of information security
professionals who have conducted interviews of their clients; it turns out
that only about ten percent of the attacks against computer systems revealed
in such interviews were ever reported to any kind of authority or to the public.
The Department of Defense studies mentioned above were consistent with this
belief; of the few penetrations detected, only a fraction of one percent were
reported to appropriate authorities.
Given
these problems of ascertainment, computer crime statistics should generally
be treated with skepticism.
Generalizations
in this field are difficult to justify; even if we knew more about types of
criminals and the methods they use, it would still be difficult to have the
kind of actuarial statistic that is commonplace in the insurance field. For
example, the establishment of uniform building codes in the 1930s in the United
States led to the growth in fire insurance as a viable business. With official
records of fires in buildings that could be described using a standard typology,
statistical information began to provide an actuarial basis for using probabilities
of fires and associated costs to calculate reasonable insurance rates.
In
contrast, even if we had access to accurate reports, it would be difficult
to make meaningful generalizations about vulnerabilities and incidence of
successful attack for the information technology field. We use a bewildering
variety of types and versions of processors, operating systems, firewalls, encryption,
application software, backup methods and media, communications channels, identification,
authentication, authorization, compartmentalization, and operations.
How
would we generalize from data about the risks at (say) a mainframe-based network
running MVS in a military installation to the kinds of risks faced by a UNIX-based
intranet in an industrial corporation, or to a Windows NT-based Web server
in a university setting? There are so many differences among systems that
if we were to establish a multidimensional analytical table where every variable
was an axis, many cells would likely contain no or only a few examples. Such
sparse matrices are notoriously difficult to use in building statistical models
for predictive purposes.
This is not
an article about social sciences research. However, many discussions of computer
crime seem to take published reports as gospel, even though the studies
being discussed may have no validity whatsoever. In this short section, we
will look at some fundamentals of research design so that readers will be
able to judge how much faith to put in computer crime research results.
The way a scientist
or reporter represents data can make enormous differences in the readers’
impressions.
Suppose three
companies reported the following losses from penetration of their computer
systems: $1M, $2M and $6M. We can describe these results in many ways. For
example, we can simply list the raw data; however, such lists could become
unacceptably long as the number of reports increased, and it is hard to make
sense of the raw data.
We could define classes such as "2 million or less"
and "more than 2 million" and count how many occurrences there were
in each class:
| Class | Freq |
|---|---|
| ≤ $2M | 2 |
| > $2M | 1 |

Alternatively, we might define the classes with finer granularity
as < $1M, ≥ $1M but < $2M, and so on; such a table
might look like this:

| Class | Freq |
|---|---|
| < $1M | 0 |
| ≥ $1M & < $2M | 1 |
| ≥ $2M & < $3M | 1 |
| ≥ $3M & < $4M | 0 |
| ≥ $4M & < $5M | 0 |
| ≥ $5M & < $6M | 0 |
| ≥ $6M | 1 |

Notice how the
definition of the classes affects perception of the results: the first table
gives the impression that the results are clustered around $2M and gives no
information about the upper or lower bounds.
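
As a minimal sketch of how class definitions drive the resulting table, the two tabulations above can be reproduced in a few lines of Python. The function names and class labels here are illustrative assumptions, not part of any standard tool.

```python
from collections import Counter

M = 1_000_000
losses = [1 * M, 2 * M, 6 * M]  # the three reported losses from the example

def coarse_class(loss):
    """Two classes, as in the first table."""
    return "<= $2M" if loss <= 2 * M else "> $2M"

def fine_class(loss):
    """One class per whole million, as in the second table."""
    k = loss // M  # number of whole millions
    if k < 1:
        return "< $1M"
    if k >= 6:
        return ">= $6M"
    return f">= ${k}M & < ${k + 1}M"

# Counter lists only non-empty classes; the empty ones have frequency 0.
coarse = Counter(coarse_class(x) for x in losses)
fine = Counter(fine_class(x) for x in losses)

print(dict(coarse))
print(dict(fine))
```

The same three numbers produce two quite different pictures, which is the point of the comparison: the reader sees only what the chosen classes allow.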
One of the most
obvious ways we describe data is to say where they lie in a particular dimension.
The central tendency of our three original data ($1M, $2M and $6M)
can be represented in various ways; for example, two popular measures are
the arithmetic mean ($3M) and the median ($2M).
Note that if
we tried to compute the mean and the median from the first table (with its
approximate classes), we would get the wrong value. Such statistics should
be computed from the original data, not from summary tables.
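
For readers who want to check such figures themselves, here is a small sketch using Python's standard `statistics` module on the three original values, expressed in millions of dollars. The variable names are assumptions chosen for illustration.

```python
from statistics import mean, median

losses_m = [1.0, 2.0, 6.0]  # original data, in $M

central_mean = mean(losses_m)      # sum of the values divided by their count
central_median = median(losses_m)  # middle value of the sorted data

print(f"mean = ${central_mean}M, median = ${central_median}M")
```

The single large loss pulls the mean above the median, which is one reason reports should say which measure they are quoting.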
Another aspect
of our data that we frequently need is dispersion – i.e., variability.
The simplest measure of dispersion is the range – the difference between the
smallest and the largest value we found; in our example, we could say that
the range was from $1M to $6M or that it was $5M. Sometimes the range is expressed
as a percentage of the mean; then we would say that the range was 5/3 ≈ 1.67,
or about 167%.
The
variance (s²) of these particular data is the average of the squared deviations
from the arithmetic mean; the variance of the three numbers would be
s² = [(1-3)² + (2-3)² + (6-3)²]/3 = (4+1+9)/3 ≈ 4.67.
The square root
of the variance (s) is called the standard deviation
and is often used to describe dispersion. In our example, s = √4.67
≈ 2.16.
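
The dispersion measures for the three example values can be checked with a short sketch. It assumes the data are in millions of dollars and uses `pvariance` and `pstdev` from Python's `statistics` module, which divide by n as the text does here.

```python
from statistics import mean, pstdev, pvariance

losses_m = [1.0, 2.0, 6.0]  # original data, in $M

value_range = max(losses_m) - min(losses_m)   # largest minus smallest
relative_range = value_range / mean(losses_m) # range as a fraction of the mean
var_n = pvariance(losses_m)                   # average squared deviation (divisor n)
sd_n = pstdev(losses_m)                       # square root of that variance

print(f"range = {value_range}, relative range = {relative_range:.0%}")
print(f"variance = {var_n:.2f}, standard deviation = {sd_n:.2f}")
```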
Dispersion is
particularly important when we compare estimates about information from different
groups. The greater the variance of a measure, the more difficult it is to
form reliable generalizations about an underlying phenomenon, as I’ll describe
in the next section.
We can accurately
describe any data using descriptive statistics; the question is what we then
do with those measures.
Usually we expect
to extend the findings in a sample or subset of a population
to make generalizations about the population. For example, we might be trying
to estimate the losses from computer crime in commercial organizations with
offices in the United States and with more than 30,000 employees. Or perhaps
our sample would represent commercial organizations with offices in the United
States and with more than 30,000 employees and whose network security staff
was willing to respond to a survey questionnaire.
In such cases,
we try to infer the characteristics of the population from the characteristics
of the sample. Statisticians say that we try to estimate the parametric statistics by using the sample statistics.
For example,
we estimate the parametric (population) variance (usually designated σ²) by multiplying the variance of the sample (4.67 above) by n/(n-1). Thus
we would say that the estimate of the parametric variance (s²)
from our sample above would be s² = 4.67 × 3/2 = 7. The estimate
of the parametric standard deviation (s) would be s = √7 ≈
2.65.
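
The n/(n-1) adjustment can be verified with a brief sketch. In Python's `statistics` module, `variance` and `stdev` already use the n-1 divisor, while `pvariance` uses n; the variable names below are illustrative assumptions.

```python
from statistics import pvariance, stdev, variance

losses_m = [1.0, 2.0, 6.0]  # the sample, in $M
n = len(losses_m)

# Scaling the divisor-n variance by n/(n-1) gives the population estimate...
scaled = pvariance(losses_m) * n / (n - 1)
# ...which is what statistics.variance computes directly.
estimate = variance(losses_m)
estimate_sd = stdev(losses_m)

print(f"estimated variance = {estimate:.2f}, estimated sd = {estimate_sd:.2f}")
```

With only three observations the adjustment is large (a factor of 1.5); with hundreds of observations it becomes negligible.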
Another kind
of inference that we try to make from data is hypothesis testing. For
example, suppose we were interested in whether there was any association between
the presence or absence of firewalls and the occurrence of system penetration.
We can imagine collecting the following data about penetrations into systems
with or without firewalls:
| Firewalls | Penetration: No | Penetration: Yes | Totals |
|---|---|---|---|
| No | 25 | 75 | 100 |
| Yes | 70 | 130 | 200 |
| Totals | 95 | 205 | 300 |

We would frame
the hypothesis (the null hypothesis, sometimes represented as H0)
that there was no relationship between the two independent variables,
penetration and firewalls, and test that hypothesis by performing a
test of independence of these variables. In our example, a simple chi-square
test of independence would give a test statistic of χ²[1] = 2.636. If there really were no association between penetration
and firewalls in the population of systems under examination, the parametric
value of this statistic would be zero. In our imaginary example, we can show
that such a large value (or larger) of χ²[1] would occur in only 10.4% of the samples taken from a population
where firewalls had no effect on penetration. Put another way, if we took
lots of samples from a population where the presence of firewalls was not
associated with any change in the rate of penetration, we’d see around 10.4%
of those samples producing χ²[1] statistics as large as or larger than 2.636.
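
The test statistic can be recomputed from the table with a short sketch in plain Python. The function name and its `yates` parameter are assumptions for illustration. The quoted value of 2.636 is reproduced when Yates' continuity correction is applied; without the correction the same table gives about 3.08. The probability uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(√(χ²/2)).

```python
from math import erfc, sqrt

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            chi2 += diff ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))  # upper tail, 1 degree of freedom
    return chi2, p_value

# Rows: firewalls No / Yes; columns: penetration No / Yes
observed = [[25, 75], [70, 130]]
chi2, p_value = chi_square_2x2(observed)
chi2_raw, p_raw = chi_square_2x2(observed, yates=False)

print(f"with correction:    chi2 = {chi2:.3f}, p = {p_value:.3f}")
print(f"without correction: chi2 = {chi2_raw:.3f}, p = {p_raw:.3f}")
```

That the two variants differ noticeably on the same data is itself a reason to ask which test a published report actually used.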
Statisticians
have agreed on some conventions for deciding whether a test statistic deviates
enough from the value expected under the null hypothesis to warrant inferring
that the null hypothesis is wrong. Generally we describe the likelihood that
the null hypothesis is true – often shown as p(H0) – as follows:
The most important
element of sampling is randomness. We say that a sample is random or
randomized when every member of the population we are studying has
an equal probability of being selected. When a population is defined one way
but the sample is drawn non-randomly, the sample is described as biased.
For example, if the population we are studying was designed to be, say, all
companies worldwide with more than 30,000 full-time employees, but we sampled
mostly from such companies in the United States, the sample would be biased
towards US companies and their characteristics. Similarly, if we were supposed
to be studying security in all companies in the United States with more than
30,000 full-time employees but we sampled only from those companies who were
willing to respond to a security survey, we would be at risk of having a biased
sample.
In this last
example involving studying only those who respond to a survey, we say that
we are potentially confounding variables: we are looking at people-who-respond-to-surveys
and hoping they are representative of the larger population of people from
all companies in the desired population. But what if the people who are willing
to respond are those who have better security and those who don’t respond
have terrible security? Then responding to the survey is confounded
with quality of security and our biased sample could easily mislead
us into overestimating the level of security in the desired population.
Another example
of how variables can be confounded is comparisons of results from surveys
carried out in different years. Unless exactly the same people are interviewed
in both years, we may be confounding individual variations in responses with
changes over time; unless exactly the same companies are represented, we may
be confounding differences among companies with changes over time; if external
events have led people to be more or less willing to respond truthfully to
questions, we may be confounding willingness to respond with changes over
time. If the surveys are carried out with different questions or by different
research groups, we may be confounding changes in methodology with changes
over time.
Because random
samples naturally vary around the parametric (population) statistics, it is
not very helpful to report a point estimate of the parametric value.
For example, if we read that the mean damage from computer crimes in a survey
was $180,000 per incident, what does that imply about the population mean?
To express our
confidence in the sample statistic, we calculate the likelihood of being right
if we give an interval estimate of the population value. For example,
we might find that we would have a 95% likelihood of being right in asserting
that the mean damage was between $160,000 and $200,000. In another sample,
we might be able to narrow these 95% confidence limits to $175,000
and $185,000.
In general,
the larger the sample size, the narrower the confidence limits will be for
particular statistics.
The calculation
of confidence limits for statistics depends on some necessary assumptions:
If any of these
assumptions is wrong, the calculated confidence limits for our estimates will
be wrong; i.e., they will be misleading. There are tests of these assumptions
that analysts should carry out before reporting results; if the data do not
follow Normal error distributions, sometimes one can apply normalizing
transformations.
In particular,
percentages do not follow a Normal distribution. Here is a reference table
of confidence limits for various percentages in a few representative sample
sizes.
95% Confidence Limits for percentages, by sample size

| Percentage | n = 100 | n = 500 | n = 1000 |
|---|---|---|---|
| 0 | 0-3.0% | 0-0.6% | 0-0.3% |
| 10 | 4.9-17.6% | 7.5-13.0% | 8.2-12.0% |
| 20 | 12.7-29.1% | 16.6-23.8% | 17.6-22.6% |
| 50 | 40.0-60.1% | 45.5-54.5% | 46.9-53.1% |
| 80 | 70.9-87.3% | 76.2-83.4% | 77.4-82.4% |
| 90 | 82.4-95.1% | 87.0-92.5% | 88.0-91.8% |
| 100 | 97.0-100% | 99.4-100% | 99.7-100% |

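
For sample sizes not in the table, confidence limits for a percentage can be approximated with a sketch like the following. It uses the exact binomial (Clopper-Pearson) method solved by bisection, which is an assumption on my part: the source does not say how the table was computed, so small differences from the tabled values are possible. The function names are illustrative.

```python
from math import exp, lgamma, log

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)

def bisect(f, target, increasing):
    """Find p in (0, 1) where the monotonic function f equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, confidence=0.95):
    """Two-sided exact confidence limits for k successes out of n."""
    alpha = 1 - confidence
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper

for n in (100, 500, 1000):
    lo, hi = exact_ci(n // 10, n)  # an observed 10% in each sample size
    print(f"10% of n = {n}: {lo:.1%} to {hi:.1%}")
```

Running the loop shows the interval around an observed 10% shrinking as the sample grows, which is the pattern the table illustrates.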
One of the most
frequent errors in reporting results of studies is to provide only part of
the story. For example, one can read statements such as “Over 70% of the systems
without firewalls were penetrated last year.” Such a statement may be true,
but it cannot correctly be interpreted as meaning that systems with firewalls
were necessarily more or less vulnerable to penetration than systems without
firewalls. The statement is incomplete; to make sense of it, we need the other
part of the implied contingency table – the percentage of systems with
firewalls that were penetrated last year – before making any assertions about
the relationship between firewalls and penetrations. Compare, for example,
these two hypothetical tables:
| FIREWALLS AND PENETRATION | Without Firewalls | With Firewalls in Default Config |
|---|---|---|
| Penetrated | 70 | 70 |
| Not Penetrated | 30 | 30 |

| FIREWALLS AND PENETRATION | Without Firewalls | With Firewalls Properly Config |
|---|---|---|
| Penetrated | 70 | 10 |
| Not Penetrated | 30 | 90 |

In both cases,
someone could say that “70% of the systems without firewalls were penetrated,”
but the implications would be radically different in the two data sets. Without
knowing the right-hand column, the original assertion would be meaningless.
Continuing our
example with rates of penetration, another error that untrained people often
make when studying statistical information is to mistake association
for causality. Imagine that a study showed that a lower percentage
of systems with fire extinguishers was penetrated than systems without fire
extinguishers, and that this difference was statistically highly significant.
Would such a result necessarily mean that fire extinguishers caused
the reduction in penetration? No, we know that it’s far more reasonable to
suppose that the fire extinguishers were installed in organizations whose
security awareness and security policies were more highly developed than in
the organizations where no fire extinguishers were installed. In this imaginary
example, the fire extinguishers might actually have no causal effect
whatever on resistance to penetration. This result would illustrate the effect
of confounding variables – presence of a fire extinguisher with state
of security awareness and policies.
Finally, to
finish our penetration example, one way to distinguish between association
and causality is to control for variables. For example, one could measure
the state of security awareness and policy as well as the presence or absence
of fire extinguishers and make comparisons only among groups with the same
level of awareness and policy. There are also statistical techniques for mathematically
controlling for differences in such independent variables.
Amateurs or beginners sometimes forget the principle of random sampling that underlies all statistical inference (see section 10.2.1.4). None of the hypothesis tests or confidence limit calculations work if a sample is not random. For example, if someone is wandering through a supermarket and notices that Granny Smith apples seem to be bigger than McIntosh apples, selecting a sample – even a random sample – of the apples that specifically gave rise to the hypothesis will not allow reliable computations of the probability that the apples have the same average weight. The problem is that those particular apples would not have been sampled at all had the observer not been moved to formulate the hypothesis. So even if a particular statistical comparison produces a sample statistic that appears to have a probability of, say, 0.001, it is not possible to know how much the sampling deviated from randomness.
Applying statistical tests to data after one notices an interesting descriptive value, comparison or trend is known as a posteriori testing. Formulating an hypothesis, obtaining a random sample and computing the statistics and probabilities in accordance with the assumptions of those statistics and probabilities is known as a priori testing.
A well-used example of the perils of a posteriori testing is the unfortunate habit of searching through sequences of results such as long strings of guesses collected in student tests of paranormal abilities and calculating statistical values on carefully-selected subsets of the strings. These a posteriori tests are then presented as if they were a priori and cause great confusion and arguments such as, “Look – even though the overall proportion of correct guesses was [say] 50.003% in this run of <some very large number> guesses, there was a run of <much smaller number> guesses that were correct <any value greater than 50%> of the time! The probability of such a result by chance is <very small number>. That proves that there was a real effect of <whatever the treatment was>.” Unfortunately, a long series of numbers can produce any desired non-random-looking string; there are even tests known as runs tests that can help a researcher evaluate the non-randomness of such occurrences.
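
A small simulation makes the danger concrete. The sketch below generates a long series of purely random guesses (each correct with probability 0.5) and then searches it after the fact for the best-looking stretch. The seed, the series length and the window size are arbitrary assumptions chosen for illustration.

```python
import random

def best_window_rate(guesses, window):
    """Highest proportion of correct guesses in any contiguous window."""
    hits = sum(guesses[:window])
    best = hits
    for i in range(window, len(guesses)):
        hits += guesses[i] - guesses[i - window]  # slide the window by one
        best = max(best, hits)
    return best / window

rng = random.Random(2024)  # fixed seed so the run is repeatable
guesses = [1 if rng.random() < 0.5 else 0 for _ in range(100_000)]

overall = sum(guesses) / len(guesses)
best = best_window_rate(guesses, 20)

print(f"overall proportion correct: {overall:.3%}")
print(f"best run of 20 guesses:     {best:.0%}")
```

No effect of any kind is built into these data, yet a hand-picked subset will look far better than chance. Treating that subset as if it had been chosen a priori is exactly the error described above.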
In practical terms, statisticians have established a convention for limiting the damaging effects of a posteriori testing: use the 0.001 level of probability as the equivalent of the minimum probability of the null hypothesis. This custom makes it far less likely that an a posteriori comparison will trick the user into accepting what is in fact a random variation that caught someone’s eye.
The best solution to the bias implicit in a posteriori testing is to use a completely new sample for the comparison. In the apple example, one could ask the store manager for new, unobserved and randomly-selected batches of both types of apples. The comparison statistics would then be credible and could be expected to follow the parametric distribution underlying calculations of probability of the null hypothesis. Mind you, the populations from which these apples were selected would still have to be carefully determined. Would the populations be apples at this particular store? For this particular chain? For this particular region of the country or of the world?
Interviewing
individuals can be illuminating. In general, interviews provide a wealth of
data that are unavailable through any other method. For example, one can learn
details of computer crime cases or motivations and techniques used by computer
criminals. Interviews can be structured (using precise lists of questions)
or unstructured (allowing the interviewer to respond to new information by
asking additional questions at will).
Interviewers
can take notes or record the interviews for later word-for-word transcription.
In unstructured interviews, skilled interviewers can probe responses to
elucidate nuances of meaning that might be lost using cruder techniques such
as surveys. Techniques such as thematic analysis can reveal patterns of responses
that can then be examined using exploratory data analysis.[1]
Thematic analysis is a technique for organizing non-quantitative information
without imposing a pre-existing framework on the data; exploratory data analysis
uses statistical techniques to identify possibly interesting relationships
that can be tested with independently-acquired data. Such exploratory techniques
can correctly include a posteriori testing as described in section 10.2.1.9, but the results are used to propose further studies that
can use a priori tests for the best
use of resources.
Focus groups
are like group interviews. Generally the facilitator uses a list of predetermined
questions and encourages the participants to respond freely and to interact
with each other. Often the proceedings are filmed from behind a one-way mirror
for later detailed analysis. Such analysis can include non-verbal communications
such as facial expressions and other body language as the participants speak
or listen to others speak about specific topics.
Surveys consist
of asking people to answer a fixed series of questions with lists of allowable
answers. They can be carried out face-to-face, or by distributing and retrieving
questionnaires by telephone, mail, fax, and e-mail. Some questionnaires have
been posted on the Web.
The critical
issue when considering the reliability of surveys is self-selection bias
– the obvious problem that survey results include only the responses of people
who agreed to participate. Before basing critical decisions on survey data,
it is useful to find out what the response rate was; although there are no
absolutes, in general we tend to trust survey results more when the response
rate is high. Unfortunately, response rates for telephone surveys are often
less than 10%; response rates for mail and e-mail surveys can be less than
1%. It is very difficult to make any case for random sampling under such circumstances,
and all results from such low-response-rate surveys should be viewed as indicating
the range of problems or experiences of the respondents rather than as indicators
of population statistics.
As for Web-based
surveys, there are two types from a statistical point of view: those using
strong identification and authentication and those that don’t. Those that
do not are vulnerable to fraud such as repeated voting by the same individuals.
Those that provide individual URLs to limit voting to one per person nonetheless
suffer from the same problems of self-selection bias as any other survey.
Interviews and
other social-sciences research methodologies can suffer from a systematic
tendency for respondents to shape their answers to please the interviewer,
or to express opinions that may be closer to the norm in whatever group they
see themselves. Thus, if it is well known that every organization ought to
have a business continuity plan, some respondents may misrepresent the state
of their business continuity planning to look better than they really are.
In addition,
survey instruments may distort responses by phrasing questions in a biased
way; for example, the question “Does your business have a completed business
continuity plan?” may elicit more accurate responses than the question,
“Does your business comply with industry standards for having a completed
business continuity plan?” The latter question is not neutral and is likely
to increase the proportion of “yes” answers.
The sequence
of answers may bias responses; exposure to the first possible answers can
inadvertently establish a baseline for the respondent. For example, a question
about the magnitude of virus infections might ask
“In
the last 12 months, has your organization experienced total losses from virus
infections of
(a)
$1M or greater;
(b)
less than $1M but greater than or equal to $100,000;
(c)
less than $100,000;
(d)
none at all?”
To test for
bias, the designer can create versions of the instrument in which the same
information is obtained using the opposite sequence of answers:
“In
the last 12 months, has your organization experienced total losses from virus
infections of
(a)
none at all;
(b)
less than $100,000;
(c)
less than $1M but greater than or equal to $100,000;
(d)
$1M or greater?”
The sequence
of questions can bias responses; having provided a particular response to
a question, the respondent will tend to make answers to subsequent questions
about the same topic conform to the first answer in the series. To test for
this kind of bias, the designer can create versions of the instrument with
questions in different sequences.
Another instrument
validation technique inserts questions with no valid answers or with meaningless
jargon to see if respondents are thinking critically about each question or
merely providing any answer that pops into their heads. For example, one might
insert the nonsensical question, “Does your company use steady-state quantum
interference methodologies for intrusion detection?” into a questionnaire
about security, and invalidate the results of respondents who answer “Yes”
to this and other diagnostic questions.
Finally, independent
verification of answers provides strong evidence of whether respondents are
answering truthfully. However, such intrusive investigations are rare.
In summary,
all studies about computer crime should be examined carefully before we place
reliance on their results. Some basic take-home
questions about such research:
Textbooks:
If you are interested
in learning more about survey design and statistical methods, you can study
any elementary textbook on social sciences statistics. Here are some sample
titles.
Babbie, E. R., F. S. Halley & J. Zaino (2003). Adventures in Social Research: Data Analysis Using SPSS 11.0/11.5 for Windows, 5th Ed. Pine Forge Press (ISBN 0-761-98758-4).
Bachman, R. & R. K. Schutt (2007). The Practice of Research in Criminology and Criminal Justice, Third Edition. Sage Publications (ISBN 1-412-95032-5).
Chambliss, D. F. & R. K. Schutt (2006). Making Sense of the Social World: Methods of Investigation, Second Edition. Pine Forge Press (ISBN 1-412-92717-X).
Sirkin, R. M. (2005). Statistics for the Social Sciences, 3rd Ed. Sage Publications (ISBN 1-412-90546-X).
Web sites:
Creative Research Systems “Survey Design” http://www.surveysystem.com/sdesign.htm
New York University “Statistics & Social Science” http://www.nyu.edu/its/socsci/statistics.html
StatPac “Survey & Questionnaire Design” http://www.statpac.com/surveys/
University of Miami Libraries “Research Methods in the Social Sciences: An Internet Resource List” http://www.library.miami.edu/netguides/psymeth.html
[1] Kabay, M. E. (2006). CATA: Computer-aided thematic analysis. <http://www2.norwich.edu/mkabay/methodology/CATA.pdf> with narrated lectures available at <http://www2.norwich.edu/mkabay/methodology/index.htm>