I had some magical moments in my life. Perhaps the most magical was the summer I went to work wearing shorts and running shoes. Captain C. said to me, as I sat in our shared office, “If you want to work during the weekend, you can borrow the keys to the building.” It didn’t seem like a big deal at the time. I had already borrowed a military vehicle that, as per Captain C., I could park anywhere in Canada without receiving a parking ticket. So I borrowed the keys to the building. I went to work the following day. I failed to take into account that I was in a military installation with its own SLOWPOKE nuclear reactor. It didn’t occur to me to advise security of my entry in advance. So what became of me? I was alone in a computer lab when I was confronted by an armed guard apparently trying to control several large barking dogs. I think I asked him, certainly not in a calm manner, “Is . . . anything wrong?” He sized me up – as I sat there in my shorts – and he said, “No, sir, there seems to be a mistake. There was a report of unauthorized entry in this area.” This is what he said formally, although he seemed to swear quite a bit afterwards in his French-Canadian accent. The dogs stopped barking. The guard took them away. Then once again I was left to myself in solitude in the computer lab, mulling over how to express risk in the face of many different types of data. It was a magical time to be young, paid to do research, wear shorts, have keys to a military installation, and not to be shot by security due to silly oversights. This blog is about my thoughts on the expression of risk in data.
Risk is easy, or at least easier, to understand when it is explained as an actuarial problem. For example, given a thousand underground storage tanks, a certain percentage will leak. Some will leak in surroundings that will threaten human health. How does one know what to clean up, where, and in what manner without actually surveying sites and testing? An actuarial estimate seems destined to be understated. Yet conceptually, it is possible to examine a distribution of tanks in order to estimate the possible clean-up costs. I believe that the actuarial approach has become so dominant that it might crowd out other approaches, including engineering. In any event, one would expect an engineer examining a tank and its surroundings to provide “authoritative guidance” on whether that specific tank can be expected to leak, and what kind of damage would follow. If I bring my pick-up truck to a garage for service, I would raise my eyebrow in a Vulcan-like manner if the mechanic said something like, “Well, feller, I’d expect a 5 percent chance it’s your alternator; a 10 percent chance it’s the belt to the alternator; 15 percent chance it’s the corroded lines to your battery; 50 percent chance it’s the actual battery; 10 percent chance it’s your starter; and 10 percent chance it’s something else in the car.” Am I hiring a mechanic or a statistician to fix the car? An estimate of risk based on an examination by an engineer can be reasonably expected to provide better guidance than statistics alone. However, there are limits to the extent to which something can be examined in order to gather data.
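The actuarial estimate for the tank population can be sketched in a few lines. Every rate and cost below is invented for illustration; the point is only that the calculation never examines any single tank.

```python
# Hypothetical actuarial estimate of cleanup costs for a population of
# underground storage tanks. All rates and costs are assumptions made
# up for this sketch -- nothing site-specific enters the arithmetic.

n_tanks = 1000
leak_rate = 0.08            # assumed fraction of tanks that will leak
harmful_fraction = 0.25     # assumed fraction of leaks threatening health
cost_per_cleanup = 120_000  # assumed average remediation cost (dollars)

expected_cleanups = n_tanks * leak_rate * harmful_fraction
expected_cost = expected_cleanups * cost_per_cleanup

print(f"Expected harmful leaks: {expected_cleanups:.0f}")
print(f"Expected cleanup cost:  ${expected_cost:,.0f}")
```

With these made-up numbers, the expected value comes to 20 harmful leaks and about $2.4 million, which says nothing at all about which twenty tanks those are.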
In investigating a problem through an actuarial or engineering lens, it should be apparent that one perspective is anchored in past data; the other is more focused on the future. Here is a personal sentiment, albeit the product of considerable deliberation: expressions of “risk” from past events are largely illogical or at least superfluous; and it is conceptually questionable to assert a future risk based purely on past events. “We expected this underground storage tank to leak.” Although risk is implied by the statement, it goes without saying that a past event either caused or didn’t cause damage or negative consequences; and so it is not a risk after the fact. A risk is something before the fact having the potential to cause adverse impacts. I don’t consider it unusual for risk to be described as the outcome of probabilities and consequences – thereby inviting the use of past data to characterize future risk. This then would expand the role of the actuary while diminishing that of the engineer. We might pose to the actuary, “What is the risk of a serious terror incident at this installation?” He or she can honestly say, “It has never happened before. The data therefore suggests that the risk is negligible.” The risk becomes structurally invisible. The question of risk is not actually statistical. Chance “should be” irrelevant except when the underlying phenomenon or case is not worth the trouble to study, or when knowledge cannot provide guidance even after study. In the absence of engineering data, a reasonable guess is, well, reasonable. What is the difference between the types of data?
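The actuary’s honest answer about the terror incident can be made concrete. A purely frequency-based estimate of an event that has never occurred is exactly zero, however plausible the event is; the incident log below is invented for illustration.

```python
# Why purely historical data makes rare risks "structurally invisible":
# a relative-frequency estimate assigns zero probability to any event
# absent from the record. The log below is a made-up example.

incident_log = ["flood", "fire", "flood", "spill"]  # past events at a site

def frequency_estimate(event, log):
    """Actuarial-style probability: relative frequency in past data."""
    return log.count(event) / len(log) if log else 0.0

print(frequency_estimate("flood", incident_log))            # 0.5
print(frequency_estimate("terror incident", incident_log))  # 0.0
```

The second estimate is the “negligible” risk of the quote above: not a finding about the world, just an artifact of the data never having contained the event.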
An actuary doesn’t necessarily make use of detailed site data. I use the term “site data” as it relates to contaminated sites and therefore engineers. An actuary is not necessarily an engineer. Having said this, many more people than just engineers make use of data connected to physical things and places such as tanks and sites. Actuaries generally have little need for “structural data” – data that relates to how pieces fit together. However, actuaries benefit from having many cases of the same phenomenon repeated many times in the data: e.g. cases of heart attacks for people between the ages of 45 and 50. Consider in contrast an engineering report containing data pertaining to borehole drilling results around a particular contaminated site. The borehole data is not meant to be used for actuarial purposes but rather to map out the plume of contamination associated with the actual physical site. The engineer can predict how the contamination will spread – not, hopefully, from probabilities, but from actual site conditions and readings. The engineer’s data-gathering is therefore multifarious and explorative. I suggest that conceptually, an assessment of risk requires data beyond statistics. I am not trying to dismiss statistics. I am only saying that the engineer requires statistics and a lot more. Statistics can establish general initial behaviour: e.g. the rate at which a contaminant can be “expected” to spread in certain soil conditions. But the analyst will also need accurate data about the soil; the nature of the contaminant; groundwater in the area; weather and flood patterns; surrounding land use; activities of local communities; susceptibilities of these populations; development planning. Ecologies of data surround metrics.
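The contrast between case-repetitive data and structural data can be sketched as two data shapes. All values and field names below are invented: actuarial data is many repetitions of the same thin record; site data is one deep, heterogeneous record about a single place.

```python
# Two shapes of data, with made-up values.

# Actuarial: many identical-shaped cases, meaningful only in aggregate.
cases = [
    {"age": 47, "heart_attack": True},
    {"age": 49, "heart_attack": False},
    {"age": 46, "heart_attack": True},
]
rate = sum(c["heart_attack"] for c in cases) / len(cases)

# Structural: one site, many heterogeneous facts about how pieces fit
# together. The identifier and readings are hypothetical.
site = {
    "site_id": "UST-0042",
    "boreholes": [
        {"depth_m": 6.5, "benzene_ppb": 140},
        {"depth_m": 9.0, "benzene_ppb": 12},
    ],
    "groundwater_flow": "northeast",
    "nearby_land_use": ["residential", "school"],
}

print(f"aggregate rate: {rate:.2f}")
print(f"boreholes at site: {len(site['boreholes'])}")
```

Adding a thousandth case sharpens the aggregate rate; adding a third borehole sharpens the map of one plume. The two kinds of accumulation answer different questions.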
It is possible that an engineer might distribute boreholes randomly. But in the face of scarce resources, I am certain that selection is important. In a manner of speaking, I suppose an engineer can “hunt” for contamination. Perhaps less grammatically sensible but more conceptually congruent: the engineer hunts for “explanation.” He or she is concerned about placement, position, condition, and circumstances, which all play a role in the explanation. Now, I am not an engineer. When I was hired to assist with research (by a researcher with a PhD in engineering) at the Royal Military College in Kingston, it was because risk had to be considered in different contexts apart from engineering. For example, legal liability was an important context along with financial costs, medical health, and social impacts. Imagine what would have happened if the military guard who confronted me in the computer lab had simply applied rigid criteria: e.g. individuals without proper clearance must be presumed armed and dangerous. The ability of the guard to consider a “risky situation” in different contexts helped to protect me. Did the guard use “probabilities” in his assessment? Arguably, I represented an improbable risk based on past experience. How often are civilians allowed into the college to participate in research projects? I actually suggest that the guard had no prior experience that could enable any level of evidence-based decision-making. He was trained to recognize threats and assess risk, and this is precisely what he did.
An assessment of risk is physical – or, to put it more generically, structural. The probability of risk, on the other hand, is statistical. The fact that data might present a person as an improbable risk does nothing to negate the structural risk. True enough, the likelihood of a man with a gun entering a mall and shooting people randomly is negligible. But there are many reasons to argue that a man with a gun entering a mall represents a structural risk. First of all, he has a gun. Secondly, he is in a mall with a gun. There is no need to examine the stats further. The man is a risk – code white alert. It is a structural risk because the relationship between a man and a gun is, probably, that the man will shoot the gun. The relationship between a man with a gun and a mall is, probably, that he will shoot people in the mall using the gun. These facts help to explain how the pieces likely fit together – to give rise to metrics likely to include fatalities. The actuarial analysis is fairly moot in this case. The actuary does, however, have the upper hand in reporting events after they occur: numbers are great for charts; numbers can substantiate trends; and, of particular interest to politicians, numbers can indicate progress over time and also return on investment.
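The difference between the structural flag and the statistical one can be sketched as two tests on the same scenario. The scenario fields and the rule are invented for illustration: the structural rule looks at how the pieces fit together, not at how often the combination has occurred here before.

```python
# A structural assessment can flag a risk with zero historical cases;
# a frequency-based view of the same scenario cannot. Fields are
# hypothetical.

def structural_risk(scenario):
    """Flag risk from the relationships in the scenario itself."""
    return scenario.get("armed") and scenario.get("location") == "mall"

scenario = {"armed": True, "location": "mall", "prior_incidents_here": 0}

# Frequency-based view: no prior incidents at this mall -> "negligible".
frequency_view = scenario["prior_incidents_here"] > 0

print("structural flag:", structural_risk(scenario))  # True
print("frequency flag: ", frequency_view)             # False
```

The structural rule fires on the first occurrence; the frequency test, by construction, can only fire on the second.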
The tendency to project from past data might not be to measure risk per se but rather to determine the scope or substantive parameters of hypothetical scenarios. A hypothetical situation can include almost anything; and “anything” is difficult to handle or manage. It becomes necessary to substantiate what gets included and excluded from analysis. Another way to think of hypothetical cases is as narratives. Narratives don’t have to be real. The power is in the story. An engineer is usually not described as a storyteller. The story isn’t in the telling. It is in the collection of data. He or she can create a story of ground contamination that is then confirmed by borehole drilling. Risk is a narrative construct. Even my car mechanic tells me stories: “This is okay for now. But in six maybe eight months, it might start to get a bit noisy.” Actual data could be collected to confirm this assertion (“risk data”) if I had some type of data-logger; or of course I could simply listen nervously to my pick-up. Consequently, risk extends from narrative; it is multifarious, structural, and often associated with some type of actual physical structure or body. Data science, as it corresponds with the work of actuaries, wades precariously into risk analysis in the absence of structural data. Its data is less site-specific and more case-repetitive; more focused on subsuming reality to fit into unrealistic pigeonholes; less applicable to site-specific situations.
The “embodiment of data” is the extent to which data is connected to a body. Or expressed differently, it is the extent to which metrics gain possession of a body of structural events. Those who accuse statisticians of being disconnected from organizations and processes might, I suppose, be making rude personal observations about individuals. But I believe that such comments relate more to the distance between (the data of) metrics and (the data of) phenomena – not to be confused with the “metrics of phenomena.” Statisticians deal with disembodied data – that is to say, metrics. If data is embodied, it becomes possible to examine the structural circumstances. There are different levels and aspects of embodiment. On one hand, the focal point might be the body itself, which is useful if events are internal; if events are not so much internal as internalized, then the external world should be considered. The story might not be the body itself but what the body does and how it interacts with the world. Embodiment is therefore not necessarily about the body as an object but rather the subject of its narrative. Embodiment means that the narrative can be shaped by the body. On the other hand, data science has been influenced by methods of disembodiment: “narrative” is not really an ontological instrument to give rise to data; rather, the term is used in relation to the outcomes of analysis.
There is no greater irony than to be delivered a convincing story after every aspect of narrative is ignored during data gathering. The fact that a contaminated site is “unlikely” to affect water supplies entering an aboriginal community might dominate a story. But the associated risks – that contamination can indeed occur – cannot be ignored even if the narrative seems implausible. For one thing, a statistical analysis leading to probabilities might be alienated from reality. The metrics are literally decapitated from relevant bodies of data. Authoritative assertions pertaining to specific sites should not be legitimized using actuarial analysis. If the community’s water supply becomes contaminated, it is because of negligence. No sensible or reasonable person would suggest that aggregate results should be used to guide decisions involving specific cases: e.g. “Most of the people in jail are black. The accused is black. Therefore he probably belongs in jail. This is what science tells us. I go by the evidence.” The assertion is not based on evidence. The prosecutor has no evidence – scientific or otherwise. Secondly, something that I think tends to be understated is the history of failure surrounding conventional methods of analysis. It is illogical, in light of a history of failure, to summarily place one’s faith in the same methodologies and approaches. Thirdly, the dismissal of narrative as an ontological instrument to gather data seems the exact opposite of what should happen when decision-making borders on the clinically insane.
In the past, I expressed some negative sentiments towards relational databases. I actually don’t have any issues with relational databases per se but rather their wholesale use. Tables are fine for actuarial purposes. Imagine being in a call centre where the client system is designed to hold only a specific number of specific facts. The idea of holding a specific number of specific facts is incompatible with client narrative: the client cannot do much narrating. What the client has to say must be pigeon-holed into the relational database – assuming the client’s input must later be accessed as symbolic tags. What the client has to say becomes largely “externally defined” by those responsible for the data system – or the managers who managed the design. Consequently – if the objective is truly to obtain “client data” (rather than “company data” about the client) – a relational database would be illogical and perhaps incapable of achieving the desired outcomes. I return to the issue of “risk” – the potential for future adverse consequences. The actuary, or the data scientist who has chosen to behave like an actuary, can examine all sorts of trends disassociated from reality. But risk is invisible to this person. Sales might increase one day and decline sharply the next; the analyst cannot be certain why, since the data lacks substantive details. The analyst relying entirely on statistical analysis has little understanding of the cause of changes; he or she should therefore be the last person to provide guidance on future risk.
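The pigeon-holing of the call-centre scenario can be made concrete with a tiny schema. The table, its fields, and the sample call are all hypothetical; the point is that a fixed schema truncates the client’s narrative at data-entry time.

```python
# Sketch of the pigeon-holing problem: a fixed relational schema keeps
# only predeclared fields, so only a code survives data entry. The
# schema and sample data are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        client_id   INTEGER,
        reason_code TEXT,     -- must be one of the codes management chose
        resolved    INTEGER
    )
""")

client_story = ("The heater only fails when it rains and the van is "
                "parked downhill; the last technician never saw it fail.")

# The schema has nowhere to put the story; a symbolic tag goes in instead.
conn.execute("INSERT INTO calls VALUES (?, ?, ?)", (101, "HEATER_FAULT", 0))

row = conn.execute("SELECT reason_code FROM calls").fetchone()
print(row[0])  # the narrative has been reduced to a tag
```

Everything in `client_story` that doesn’t map to a column – the rain, the slope, the intermittency – is simply gone, which is exactly the “externally defined” client data described above.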