Reliability and Risk
The Challenge of Managing Interconnected Infrastructures
Emery Roe and Paul R. Schulman


Chapter 1

The Infrastructure Society

Knowing how something works is not the same as knowing how it can fail.
—Giandomenico Majone, “Technology Assessment in a Dialectic Key”

History teaches us that a crisis often causes problems to correlate in a manner undreamed of in more tranquil times.
—Warren Buffett, “What Worries Warren Buffett”

If Alexis de Tocqueville had drowned on a late-November night in 1831, we would have no Democracy in America, let alone the other works of that first great interpreter of the American setting. He, along with some two hundred others, was a passenger on a steamboat, the Fourth of July, when it hit a rock or sandbank in the Ohio River near Wheeling, West Virginia. “The cry ‘we sink!’ immediately rang out,” Tocqueville wrote. “The ship, crew, and passengers together began their journey toward eternity. I have never heard a nastier noise than the noise the water made as it rushed inside the boat” (quoted in Jardin 1988, 165). Nearly drowned, he was rescued along with others a couple of hours later.

Memories of the incident preyed on Tocqueville, and in early 1832, while still traveling, he pressed Joel Roberts Poinsett, U.S. politician and statesman, as to why steamboats weren’t better made. Poinsett told Tocqueville,

There is a general feeling among us [Americans] that prevents our aiming at the durable in anything: there reigns in America a popular and universal faith in the progress of the human mind. They are always expecting that improvements will be discovered in everything, and in fact they are often right. For instance, a few years ago I asked the builders of steamboats for the North River why they made their vessels so fragile. They answered that, as it was, the boats would perhaps last too long because the art of steam navigation was making daily progress. As a matter of fact, the vessels, which steamed at 8 or 9 miles an hour, could no longer a short time afterwards sustain competition with others whose construction allowed them to make 12 to 15. (Quoted in Kammen 1997)

Sound familiar? If we were to bring this part of the story up to the present, it would show not only how risk taking—more neutrally, technological innovation—has always been a key part of developing America’s critical infrastructures (and not just shipping) but also how infrastructure reliability in terms of safety and dependability has increased since Tocqueville. This would be a story about how taking risks actually improves reliability.

But that story is true only as far as it goes, and we need to take it a good deal further if we are to understand reliability and risks in today’s infrastructure. Even by the time Tocqueville got there, Wheeling, West Virginia, was a transportation hub of the Americas. In 1818, the National Road, the first major highway in the United States, connecting the Potomac and Ohio Rivers, had reached Wheeling, thereby enabling goods and services to move from the Ohio Valley eastward as well as westward into the frontier (the Ohio River flows into the Mississippi River). Today it is estimated that more than 230 million tons of cargo are shipped on the Ohio River each year, mostly coal and much of it by barge (including fifteen-barge tows of a thousand feet or longer).

Originally a shallow river, the Ohio has been deepened through a large infrastructure of dams, reservoirs, and locks that connect the more than 975 miles of that river (and its adjacent cities, major ports, and terminals) to the deepwater ports of New Orleans, Louisiana, and Mobile, Alabama, and beyond. Five million people depend on the river for their drinking water, and some twenty-five million people or more, nearly 10 percent of the U.S. population, are said to reside in the Ohio River Basin.1 By the time Tocqueville’s steamboat sank, the Ohio had already been interconnected with the country’s nascent infrastructure sectors and was well on its way to being interconnected to ever more important infrastructures. Reliability and risk have also been transformed by this interconnectivity, not just by technological innovation alone. It is that story—the wider story with its implications of our having become an infrastructure society—that we tell in this book.


This is a book about the capabilities of humans to manage complex and increasingly connected infrastructure systems that supply clean water; provide communications, transportation, electricity, flood protection, and financial services; and ensure major emergency response. It is about our capacity to operate these systems at levels of dependability and safety that match the intensive, continued, and predictable operation we now expect of them within modern society. We argue that interconnected infrastructures are reaching limits in the degree to which these systems can be managed reliably.

We know that in the modern world of infrastructures, reliability cannot be a property of single infrastructures and their managing organizations but rather must be the property of relationships among very different infrastructures producing very different services. If we think of reliability as the safe and continuous provision of a critical service, then it must be interinfrastructural today. No critical service, such as water or energy, can be provided reliably unless its supply relies in turn on other infrastructures such as telecommunications or transportation. Reliability depends on networks of networks and on a complex physical and organizational interconnectivity that even many experts do not fully appreciate. These networks, moreover, are at risk.

That much we know, but it is far from certain that we know how to manage interinfrastructural reliability across multiple systems and organizations. Because of the increasing complexity of interconnections, the world seems riskier and more prone to catastrophic and near-catastrophic events. The global financial crisis in 2008 and its aftermath may well be a harbinger of things to come (Roe 2013). The Indonesian and Fukushima tsunami disasters are other examples of interconnected failures with catastrophic consequences. Worldwide pandemics and the more pointed effects of global climate change may be the next megachallenges for humanity in the coming decades.


In important ways our infrastructures define modern society (Ascher 2007; Huler 2010). Advances in communication and transportation, by way of example, have led to profound changes in social organization ranging from shifting residential patterns to the evolution of family structures. Long-linked infrastructures of water supply and storage and large-scale pumping and irrigation capacity have fueled high population concentrations in deserts and across floodplains and in other geographical areas that prevented such dense settlements in the past.

Our capacities for transportation, communication, health care, financial transactions, and reliable water and energy so shape contemporary life and are so intimately interwoven with its pace, possibilities, and widely shared expectations and aspirations that their failures assume crisis proportions. The major electricity blackouts in the northeastern United States in August 2003 not only led to deaths but also disrupted air traffic (shutting down regional airports because of inability to screen passengers and process electronic tickets) and automobile traffic because of signal light failures, and they stopped Amtrak rail transportation throughout the regional corridor. The blackouts interrupted cellular and telephone communications and cable television, and they affected water supplies through reducing pumping pressure and increasing contamination as a result of purification and sewage treatment plant failures. The blackouts likewise affected hospital and emergency medical services, as well as the food service sector and a wide variety of other services. It has been estimated that these blackouts cost between $6 billion and $7 billion in lost production, wages, spoilage, and the like (ELCON 2004).

When infrastructure failures occur in conjunction with an earthquake, major storm, or tsunami, they become an integral part of the catastrophe—adding to the death toll and hobbling recovery. Our infrastructures simultaneously create the capacities of modern society and introduce the possibility for catastrophic risks to its continuance (Beck 1992; Giddens 2002; Graham 2009). Both reliability and risk have become integral features of modern infrastructures, the former even contributing to the latter.


Physically, infrastructures consist of structural elements and material designs that enable them to function as major social and capital assets. Such large systems include the nation’s many dams, reservoirs, generators and transmission lines, levees, and roads and bridges. But our critical infrastructures also consist of organizations and their management and thus are more than technical systems. For this reason, we refer to them as large sociotechnical systems rather than as technical or physical systems only.

Organizationally, these critical systems incorporate a framework for management and control. This framework consists of the roles, rules, procedures, and protocols prescribing their operation and the skills their personnel (managers and operators) need to operate them. These systems also include the design assumptions and analytic models covering their operation and the data networks to monitor and assess their real-time condition.

Moreover, these infrastructures are not just organizations in addition to their technologies: critical infrastructures function as institutions in society. Institutionally, infrastructure systems include the laws, regulatory agencies, and public subsidies that constrain and support their operation through time. Together these technical and organizational and institutional dimensions make up an infrastructure as a whole system that, unlike other systems, is meant to operate and be maintained continuously, safely, and for the foreseeable—and unforeseeable—future.

As we demonstrate, the organizational and institutional factors are just as palpably critical as the technical factors. Wherever you see railcars and tracks, electrical generators and transmission lines, or dams, reservoirs, and pipeline systems you find they have control rooms for real-time management of the physical assets.2 These control elements are often distributed over significant distances, and many are in separate organizations, separately operated and managed. Without managed control elements, the technical elements would have little chance to function reliably.

The networked property of infrastructures makes coordination of activities required for their operation and maintenance a significant and constant managerial challenge. No management challenge in infrastructures is solved solely by investing in new technology, including more sophisticated computerized and automated control systems or physical assets such as new transmission lines or large generators. To the contrary, designs and technologies have needlessly increased that management challenge, a point this book underscores with numerous examples. As we show, the geographical and organizational dispersion of critical infrastructures also leads to more complexity and a wider set of potential system conditions or states than is typical with simpler technical systems under unitary command and control.

While infrastructures are quite specialized in structure and service, they are highly generalized in terms of the foundations they provide for a huge variety of follow-on capabilities. Thus electrical grids allow for communication, transportation, health care, and many other services. In these enhancements that they extend to other capacities, infrastructures become the means to multiple ends. As one analyst (Frischmann 2005) puts it, our infrastructures generate enormous positive externalities that create an unbounded set of potential benefits. They act as economic growth multipliers by providing capacities that generate and facilitate economic transactions and improvements that lead to growing new industries. More than ever before, major industries and economic sectors rest on high-speed, worldwide communication and transportation capabilities. Fast-acting global supply chains remain core to today’s international economy (Sheffi 2005).

Infrastructures have additional attributes that distinguish them from other technologies and human interventions. They are designed for large-scale public uses, not individual or solely private uses. Even though infrastructures such as airlines, telecommunication networks, and power grids may be privately or investor owned, their operation provides a distinctively public service (Frischmann 2005).

Large sociotechnical systems also have distinctive histories with respect to their system components. Infrastructures evolve as different parts wear out and are replaced; improvements are made selectively over extended periods. Consequently, many infrastructures are hybrids of technical, organizational, and institutional elements. An electrical grid, for instance, must rely on many generators of different ages, start-up (ramp) rates, and generation capacities and highly differentiated transmission lines with varying electrical load limits and differing vulnerabilities to weather conditions such as wind and temperature. So too, in their own fashion, do railroads and water supplies.

As a result of these variegated factors, it becomes unlikely that a single formal analytic model or approach could adequately portray for management purposes the full behavioral diversity of these infrastructures. Efforts to automate key operations often fail to adjust for this variety and complexity. They then create surprises that their modelers and designers had neither predicted nor prepared for. On this analysis, the experience, background, and memory of system operators move center stage in the reliable performance of these large sociotechnical systems.

Finally, infrastructures feature complex interconnections and relationships, both internally and with one another. This complexity enlarges dramatically with the expansion of the number of technical, organizational, and institutional components of infrastructures. The interactive possibilities for a set of elements increase mathematically as a multiple of those elements. This in turn can and often does create new system conditions, or states. New types or arrangements of energy users, for example, have changed the expected profile for electricity demand at different times during the day.

Our research underscores that the character of interconnections among elements within single or among multiple infrastructures differs quite considerably under conditions of failure compared with normal operation, so there may be at least two additional configurations (manifest and latent) for each added element. For example, two separate transportation infrastructures, highways and rail transit, operate independently during normal operations, but if one fails during rush hour it can overload the other. A flood takes out a roadway and in turn affects the ability of repair vehicles to reach downed electrical power lines. Managing this complex interconnectivity challenges human ability to anticipate manifold permutation possibilities under differing conditions of operation.
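One simple way to see how quickly interactive possibilities can grow is to count pairs. The sketch below is an illustration of ours rather than a model taken from the research: it assumes that every pair of elements can be connected and that each connection is either manifest (active in normal operation) or latent (appearing only under failure). The function names are hypothetical.

```python
from math import comb


def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n elements, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


def link_configurations(n: int, states_per_link: int = 2) -> int:
    """Number of system-wide arrangements if each pairwise link can
    independently be in one of `states_per_link` states.

    Assumption for illustration only: two states per link,
    manifest or latent.
    """
    return states_per_link ** pairwise_links(n)


if __name__ == "__main__":
    # Print element count, possible pairs, and possible arrangements.
    for n in (2, 4, 8):
        print(n, pairwise_links(n), link_configurations(n))
```

Under these assumptions, doubling the number of elements from four to eight raises the number of possible pairs from 6 to 28, and the number of manifest-or-latent arrangements from 64 to more than 268 million.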


A major pillar of modernity is that the social pace, pattern, and scale of contemporary life have evolved largely to match if not drive current infrastructure capacities (McLuhan 1966). This means that modern social life is closely attuned to, not merely dependent on, the functioning of its infrastructures. We rely so heavily on our infrastructures that it would be hard to imagine what life would be like without them. Always-on infrastructure reliability is not just taken for granted; it is a prerequisite of up-to-the-minute social life.

Lapses in infrastructure reliability are simultaneously disorienting on multiple levels and especially so when the lapses ramify to other infrastructures and throughout other diverse critical-service systems. We don’t just lose water and electricity when nothing comes out of the tap and the fridge shuts off; routine social practices are disrupted, and widely accepted standards and expectations for social life and everyday experience are threatened.

When infrastructures fail, their failures assume distinct normative dimensions. Our reliance on infrastructures has in fact transformed a variety of services into presumed rights to delivery. When California underwent a series of blackouts in 2001 on the heels of its wholesale electricity market restructuring, it quickly became evident in public reactions that always-on electricity had been elevated to the status of a public entitlement, a core part of the definition of contemporary California. Indeed, public unrest over the electricity crisis played a significant role in the recall of California’s governor in 2003.

As others have also asserted, an infrastructure crisis is under way and not just in the United States. The crisis we describe differs in major respects, however. To many, the reliability and risk of infrastructures are properties that lie primarily within the design of physical systems. In this view, the structural brittleness or robustness of the constituent elements of these systems defines their overall reliability, determined as it is by structural features such as materials strength or the redundancy of key elements. In contrast, we do not identify the infrastructure crisis narrowly in terms of degrading or out-of-date assets; the crisis we focus on in this book lies in the real-time management of the hardware and software assets we have.

Our research has revealed that reliability is as much a function of managerial skill as of technology and physical structure (Roe and Schulman 2008). One engineer estimated that up to 85 percent of all infrastructure failures result from human and organizational factors rather than technical failures (Bea 2006). We highlight these organizational and management factors and their importance in infrastructural and interinfrastructural reliability. The following chapters make clear that safeguarding the reliability of many modern social activities and technical capacities lies in recognizing and managing our interconnected critical infrastructures as valued social institutions.


Commonly proposed solutions to the infrastructure crisis, such as finding new financing mechanisms to generate the trillions needed for new assets, are inherently ill advised, we argue, when the real organizational and institutional dimensions of the reliability challenge have not been recognized. Likewise, technical innovations that undermine rather than enhance the flexibility of real-time control operators to respond to inevitable technical shortfalls and rude surprises are fatally flawed at their core. The infrastructure crisis this book highlights is the one caused by design undermining better managerial practice and by innovation that assumes management will always adapt, whatever the situation. We show why a design solution to these problems must take seriously the management of unavoidably incomplete designs or design errors and their inadvertent consequences.

Yet as important as they are, the reliability and risks of our interconnected infrastructures are not well understood and not accurately reflected in risk assessment and risk management methodologies currently applied to infrastructures. This argument stands in marked contrast to much current thinking about the threats to critical infrastructures—namely, that their risks are self-evident, ranging from all too obvious aging structures to vulnerabilities visible to the determined terrorist. Given this diagnosis, risk analysis and modeling are now widely argued for, applied, and said to be getting better and better (Ostrom and Wilhelmsen 2012). Our argument, on the other hand, is that these methods are seriously flawed when compared with the unique real-time risk assessment and management skills of control operators in the critical infrastructures themselves. Moreover, many policy makers, regulators, and system designers display a studied indifference to this fact, if not silent dismissal of it. Policy makers and system designers also do not appreciate the challenges to operator skills imposed by faulty technical designs, misdirected policy, and regulatory error (all of which are described in this book). At the same time, our infrastructures have become interconnected in ways far more complex than even their operators and support staff fully understand.

Since a good many of the presumed infrastructure problems and methods of risk analysis have been misidentified, it is unsurprising that technical, policy, and management prescriptions applied are not just wrong but at times dangerously counterproductive. Whereas a crisis of critical infrastructures has been described in terms of underinvestment and overutilization of the nation’s physical structures, we insist that better understanding the management crisis opens up an entirely new terrain for more cost-effective remedies.

In framing our argument, we must take reliability beyond a single infrastructure, such as electricity transmission, which was the focus of our earlier research (Roe and Schulman 2008). Here instead, we frame and analyze risks of interconnected operations and failure among electricity, telecommunications, port, water supply, levee, marine navigation, road, and railroad infrastructures in the San Francisco Bay–Sacramento River–San Joaquin River Delta, our case study area. This research has been part of the larger 2009–2013 Resilient and Sustainable Infrastructure Networks (RESIN) initiative, funded by the National Science Foundation and housed in the Center for Catastrophic Risk Management at the University of California, Berkeley (for other results from the RESIN project, see Hamedifar 2012). Since our findings offer a new way of understanding the complex interrelationships among these systems, we develop and considerably extend our earlier framework on high reliability management (Roe and Schulman 2008). In so doing, we build on the research and literature of other scholars and practitioners on networked reliability among infrastructures (e.g., de Bruijne 2006; van Eeten et al. 2011).


Any tour d’horizon of the reliability literature must necessarily be incomplete, but for our purposes the beginning lay in the research into high reliability organizations (HROs). The hazardous organizations studied were mandated to maintain reliable (safe and continuous) operations even during peak periods and simultaneously guard against accidents or other events that must not happen with such systems (LaPorte and Consolini 1991; Roberts 1993). The early research was largely taxonomic in seeking to identify key HRO features or factors (e.g., Rochlin 1993; Schulman 1993), and much HRO research has been contrasted with the normal accidents theory of Charles Perrow (Perrow [1984] 1999; Rijpma 1997).

Subsequent studies built on this base or extended it in diverse ways, including contributions to management practice (e.g., Weick and Sutcliffe 2001), the field of safety science (e.g., Hopkins 2014; Amalberti 2013), resilience theory and practice (e.g., Hollnagel, Woods, and Leveson 2006; Boin and van Eeten 2013), networked reliability (e.g., de Bruijne 2006), and the statistical analyses of high reliability as a continuous quantifiable variable in the operations of health care, nuclear power, and other industries (Vogus and Sutcliffe 2007; Schöbel 2009; May 2013; O’Neil and Kriz 2013).

As for this book’s contribution, it is the latest chapter in an evolving story for us. It began with over seven years of research at the California Independent System Operator (CAISO), the organization charged with responsibility for managing the California high-voltage transmission grid. We studied its struggle to maintain reliable electrical service operations for California in the midst of its 2001 electricity crisis and in the years afterward. This book takes its point of departure from that study and adds four more years of a much expanded analysis of varying forms of reliability challenges among interconnected infrastructures.


The question of the reliability of infrastructures across their interconnections takes us to the edge of our understanding of high reliability itself. This book examines in detail what happens when large-scale infrastructures turn out to be far more complexly interconnected than their designers, managers, and even operators have conceived on the basis of formal design and the experience of normal operations.

To summarize, we are at high reliability’s edge in these systems for several reasons. First, a single infrastructure’s reliability and its risks are no longer a self-contained property (this has important implications, as we see later, for infrastructure resilience and robustness). Second, the interconnected system failures we are concerned about are likely to be those that lie outside prior analysis and experience within any individual system. Last, reliability cannot simply be about normal operation and avoiding the risks of failure. It must also include the probabilities of successful restoration and recovery. “Reliability” under these conditions has a different meaning than that attached to it in earlier HRO literature. Our intent here is to examine and illustrate alternative meanings for reliability, given the variety of system states and conditions that interconnected infrastructures assume.

These new reliability challenges give rise to an authentic infrastructure crisis, which, while unreported or underacknowledged, should be the paramount one of concern to society. This crisis revolves around threats to the managerial dimension of reliability—the skills and strategies that have up to this point contained risks, limited the spread of interconnected failures, and provided new options and resources for interinfrastructural reliability. No arguments about the state of infrastructures can ignore the daunting nature of their ever more complex interconnectivity. Better appreciation of this challenge, and how it has hitherto been misunderstood, is the first step to understanding risk and reliability in our modern infrastructure society. Consequently, we begin with an analysis of modern interinfrastructure connectivity.


1. For more details, see Longfellow 2013; Ohio River Valley Water Sanitation Commission, n.d.; Schneider 2012; and Neff 2010.

2. Of course, not all critical infrastructures, as officially classified, have control rooms (e.g., education).