diff --git a/.DS_Store b/.DS_Store
index 9455158..775bc81 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/TDO-Chapters1-3/.DS_Store b/TDO-Chapters1-3/.DS_Store
deleted file mode 100644
index 7216769..0000000
Binary files a/TDO-Chapters1-3/.DS_Store and /dev/null differ
diff --git a/TDO-Chapters1-3/META-INF/container.xml b/TDO-Chapters1-3/META-INF/container.xml
deleted file mode 100644
index 317aaed..0000000
--- a/TDO-Chapters1-3/META-INF/container.xml
+++ /dev/null
@@ -1,6 +0,0 @@
-
-
To appear in The Discipline of Organizing, 2012. Robert J. Glushko - glushko@berkeley.edu
To organize is to create capabilities by intentionally imposing order and structure. Organizing is such a common activity that we often do it without thinking much about it. We organize the shoes in our closet, the books on our book shelves, the spices in our kitchen, and the folders into which we file information for tax and other purposes. Quite a few of us have jobs that involve specific types of organizing tasks. We might even have been explicitly trained to perform them by following specialized disciplinary practices. We might learn to do these tasks very well, but even then we often don’t reflect on the similarity of the organizing tasks we do and those done by others, or on the similarity of those we do at work and those we do at home. We take for granted and as givens the concepts and methods used in the organizing system we work with most often.
The goal of this book is to help readers become more self-conscious about what it means to organize things – whether they are physical resources like printed books and shoes or digital resources like web pages and MP3 files – and about the principles by which the resources are organized. In particular, this book introduces the concept of an Organizing System; that is, an intentionally arranged collection of resources and the interactions they support. The book analyzes the design decisions that go into any systematic organization of resources and the design patterns for the interactions that make use of the resources.
This book evolved from a master’s level university course on “Information Organization & Retrieval” I taught for several years at the University of California, Berkeley’s School of Information. My goal was to synthesize insights from library science and computer science to provide my students with a richer understanding about information organization than either discipline alone could provide. I came to realize that information was just one of the many types of resources to organize and that it would be beneficial to think about the art and science of organizing in a more abstract way. This book is the product of countless discussions with students and faculty colleagues at Berkeley and other schools, and we are collaboratively developing a new discipline that unifies four types of organizing, as follows:
We organize physical things. Each of us organizes many kinds of things in our lives—our books on book shelves; printed financial records in folders and filing cabinets; clothes in dressers and closets; cooking and eating utensils in kitchen drawers and cabinets. Public libraries organize printed books, periodicals, maps, CDs, DVDs, and maybe some old record albums. Research libraries also organize rare manuscripts, pamphlets, musical scores, and many other kinds of printed information. Museums organize paintings, sculptures, and other artifacts of cultural, historical, or scientific value. Stores and suppliers organize their goods for sale to consumers and to each other.
We organize information about physical things. Each of us organizes information about things, when we inventory the contents of our house for insurance purposes, when we sell our unwanted stuff on eBay, or when we rate a restaurant on Yelp. Library card catalogs, and their online replacements, tell us what books a library’s collection contains and where to find them. Sensors and RFID tags track the movement of goods - even library books - through supply chains, and the movement (or lack of movement) of cars on highways.
We organize digital things. Each of us organizes personal digital information (email, documents, e-books, MP3 and video files, appointments, and contacts) on our computers, smartphones, and e-book readers, or in “the cloud,” through information services that use Internet protocols. Large research libraries organize digital journals and books, computer programs, government and scientific datasets, databases, and many other kinds of digital information. Companies organize their digital business records and customer information in enterprise applications, content repositories, and databases. Hospitals and medical clinics maintain and exchange electronic health records and digital X-rays and scans.
We organize information about digital things. Digital library catalogs, web portals and aggregation websites organize links to other digital resources. Web search engines use content and link analysis along with relevance ratings to organize the billions of web pages competing for our attention. Web-based services, data feeds and other information resources can be combined as “mash-ups” or choreographed to carry out information-intensive business models.
Let’s take a closer look at these four different types or contexts of organizing. Are there clear, systematic and useful distinctions between them? We contrasted “organizing things” with “organizing information.” This comparison might seem obvious and natural to people for whom personal computers, email, and the web have always been part of their lives. At first glance it might seem that organizing physical things like books, machine parts, or cooking utensils has an entirely different character than organizing intangible digital things. The organization of printed books on library shelves and the way you interact with them isn’t at all like how you store and read books on your Kindle or iPad. Arranging, storing, and accessing X-rays printed on film might appear to have little in common with these activities when the X-rays are in digital form.
But the era of ubiquitous digital information of the last decade or two is just a blip in time compared with the more than ten thousand years of human experience with information carved in stone, etched in clay, or printed with ink on papyrus, parchment or paper. These tangible information artifacts have deeply embedded the notion of information as a physical thing in culture, language, and methods of information design and organization, and that notion will persist so long as we humans inhabit a physical world that contains people and objects that we name, classify, and organize. Organizing things and organizing information don’t differ much when information is represented in a tangible way because in both cases we feel as though our interactions are direct and unmediated.
We also contrasted “organizing things” with “organizing information about things.” This difference is easy to understand if we consider the traditional library card catalog, whose printed cards describe and specify the location of books on library shelves. When the things and the information about them are both in physical format, it is easy to see that the former is a primary resource and the latter a surrogate or associated resource that describes or relates to it. But what about “organizing information about digital things?” When you search for a book using a search engine, first you get the catalog description of the book, and if you’re lucky the book itself is just a click away. When the things and the information about them are both digital, the contrast we posed isn’t as sharp as when one or both of them is in a physical format. And while we used X-rays – on film or in digital format – as examples of things we might organize, when a physician studies an X-ray, isn’t it being used as information about the subject of the X-ray, namely the patient?
These differences and relationships between “physical things” and “digital things” have long been discussed and debated by philosophers, linguists, psychologists and others (See the sidebar, WHAT IS INFORMATION?).
The distinctions among organizing physical things, organizing digital things, or organizing information about physical things or digital things are challenging to describe because many of the words we might use are as overloaded with multiple meanings as information itself. For example, some people use the term “document” to refer only to traditional physical forms, while others use it more abstractly to refer to any self-contained unit of information independent of its instantiation in physical or digital form. The most abstract definition appears in “What is a Document?”, where Buckland provocatively asserts that an antelope is both “information as thing” and also a “document” when it is in a zoo, even though it is just an animal when it is running wild on the plains of Africa. Similar definitional variation occurs with “author” or “creator.”[2]
If we allow the concept of information to be anything we can study – to be “anything that informs” – the concept becomes unbounded. Our goal in this book is to bridge the intellectual gulf that separates the many disciplines that share the goal of organizing but that differ in what they organize. This requires us to focus on situations where information exists because of intentional acts to create or organize.
We propose to unify many perspectives about organizing and information with the concept of an Organizing System, an intentionally arranged collection of resources and the interactions they support. This definition brings together several essential ideas that we will briefly introduce in this chapter and then develop in detail in subsequent chapters. Figure 1.1 depicts a conceptual model of an Organizing System that shows intentionally arranged resources, interactions (distinguished by different types of arrows), and the human and computational agents interacting with the resources in different contexts.
Our concept of the Organizing System was in part inspired by and generalizes to physical and web-based resource domains the concepts proposed in 2000 for bibliographic domains by Elaine Svenonius in “The Intellectual Foundation of Information Organization.” She recognized that the traditional information organization activities of bibliographic description and cataloging were complemented, and partly compensated for, by automated text processing and indexing that were usually treated as part of a separate discipline of information retrieval. She proposed that decisions about organizing information and decisions about retrieving information were inherently linked by a tradeoff principle and thus needed to be viewed as an interconnected system: “The effectiveness of a system for accessing information is a direct function of the intelligence put into organizing it” (p. ix). We celebrate and build upon her insights by beginning each of the sub-parts of Design Decisions in Organizing Systems with a quote from her book.[3]
A systems view of information organization and information retrieval captures and provides structure for the inherent tradeoffs obscured by the silos of traditional disciplinary and category perspectives: the more effort put into organizing information, the more effectively it can be retrieved, and the more effort put into retrieving information, the less it needs to be organized first. A systems view no longer contrasts information organization as a human activity and information retrieval as a machine activity, or information organization as a topic for library and information science and information retrieval as one for computer science. Instead, we readily see that computers now assist people in organizing and that people contribute much of the information used by computers to enable retrieval.
Resource has an ordinary sense of “anything of value that can support goal-oriented activity.” This definition means that a resource can be a physical thing, a non-physical thing, information about physical things, information about non-physical things, or anything you want to organize. Other words that aim for this broad scope are entity, object, item, and instance. Document is often used for an information resource in either digital or physical format; artifact refers to resources created by people, and asset refers to resources with economic value.
“Resource” has a specialized meaning in Internet architecture. It is conventional to describe Web pages, images, videos, product catalogs, and so on as “resources,” and HTTP, the protocol for accessing them, uses “Uniform Resource Identifiers” (URIs).[4]
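As a minimal sketch of how a URI identifies a resource, the Python standard library can split one into its component parts. The URI below is a made-up example chosen for illustration; it does not come from the text.

```python
from urllib.parse import urlparse

# Hypothetical URI used only for illustration.
uri = "https://library.example.org/catalog/books?id=42"

parts = urlparse(uri)
scheme = parts.scheme  # protocol used to access the resource
host = parts.netloc    # server that hosts the resource
path = parts.path      # location of the resource on that server
query = parts.query    # extra parameters that select a specific resource
```

Anything that can be given an identifier of this kind can be treated as a resource, whether or not it is a document.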
Treating as a resource anything that can be identified is an important generalization of the concept because it enables web-based services, data feeds, objects with RFID tags, sensors or other “smart devices” or computational agents to be part of Organizing Systems.
Instead of emphasizing the differences between tangible and intangible resources, we consider it essential to determine whether the tangible resource has information content – whether it needs to be treated as being “about” or “representing” some other resource rather than being treated as a thing in itself. Whether a book is printed or digital, we focus on its information content, what it is about, and its tangible properties become secondary. In contrast, the shoes in our closet and the cooking utensils in our kitchen aren’t about anything else, which makes their tangible properties more important.
Many of the resources in Organizing Systems are description or surrogate resources such as physical or online catalog records that describe the primary resources that comprise the collection. In museums, information about the production, discovery, or history of ownership of a resource can be more important than the resource; a few shards of pottery are of little value without these associated information resources. Similarly, business or scientific data often can’t be understood or analyzed without additional information about the manner in which they were collected.
Resources that describe or are associated with other resources are sometimes called metadata. However, when we look more broadly at Organizing Systems, it is often difficult to distinguish between the resource being described and any resource that describes it or is associated with it. One challenge is that when descriptions are embedded in resources, as metadata often is in the title page of a book, in the masthead of a newspaper, or in the source of web pages, deciding which resources are primary is often arbitrary. A second challenge is that what serves as metadata for one person or process can function as a primary resource or data for another one. Rather than being an inherent distinction, the difference between primary and associated resources is often just a decision about which resource we are focusing on in some situation. An animal specimen in a natural history museum might be a primary resource for museum visitors and scientists interested in anatomy, but information about where the specimen was collected is the primary resource for scientists interested in ecology or migration.
Organizing Systems can refer to people as resources, and we often use that term to avoid specifying the gender or specific role of an employee or worker, as in the management concept of the “human resources” or HR department in a firm. The shift from a manufacturing to an information and services economy in the last few decades has resulted in greater emphasis on intellectual resources represented in skills and knowledge rather than on the natural resources of production materials and physical goods.[5] It is important to consider the capabilities and motivations of the people who create and participate in Organizing Systems. We might discuss how human resources are selected, organized, and managed over time just as we might discuss these activities with respect to library resources. Nevertheless, these topics are much more appropriate for texts on human resources management and organizational behavior so we will not consider them further in this book.
A collection is a group of resources that have been selected for some purpose. Similar terms are set (mathematics), aggregation (data modeling), dataset (science and business), and corpus (linguistics and literary analysis).
We prefer “collection” because it has fewer specialized meanings. Collection is typically used to describe personal sets of physical resources (my stamp or record album collection) as well as digital ones (my collection of digital music). A collection can contain identifiers for resources along with or instead of the resources themselves, which enables a resource to be part of more than one collection, like songs in playlists.
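The playlist case can be sketched in a few lines of Python. This is an illustrative model only; the song titles, identifiers, and function names are assumptions introduced for the example.

```python
# Hypothetical song resources, keyed by an identifier.
songs = {
    "s1": "Blue in Green",
    "s2": "So What",
    "s3": "Naima",
}

# Each playlist is a collection of identifiers, not copies of the songs.
playlists = {
    "late night": ["s1", "s3"],
    "classics": ["s2", "s1"],
}

def collections_containing(song_id):
    """Return the names of all playlists that reference the given song."""
    return sorted(name for name, ids in playlists.items() if song_id in ids)

def resolve(playlist_name):
    """Look up the actual resources behind a playlist's identifiers."""
    return [songs[song_id] for song_id in playlists[playlist_name]]
```

Because the playlists hold identifiers, the song "s1" belongs to both collections while being stored only once.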
A collection itself is also a resource. Like other resources, a collection can have description resources associated with it. For example, an index is a description resource that contains information about the locations and frequencies of terms in a document collection to enable it to be searched efficiently.
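A simplified version of such an index can be sketched as follows. The three documents are invented for illustration, and real indexes typically add steps (such as punctuation handling and stemming) that are omitted here.

```python
from collections import defaultdict

# Hypothetical three-document collection.
documents = {
    "d1": "organize the spices in the kitchen",
    "d2": "organize the books",
    "d3": "the kitchen shelf",
}

def build_index(docs):
    """Map each term to {document id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

index = build_index(documents)

def search(term):
    """Return ids of documents containing the term, without rescanning text."""
    return sorted(index.get(term, {}))

def frequency(term, doc_id):
    """Return how many times the term occurs in one document."""
    return len(index.get(term, {}).get(doc_id, []))
```

The index is itself a resource that describes the collection: it records where and how often each term appears, so a search consults the index rather than the documents.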
Because collections are an important and frequently used kind of resource it is important to distinguish them as a separate concept. In particular, the concept of collection has deep roots in libraries, museums and other institutions that select, assemble, arrange, and maintain resources. Organizing Systems in these domains can often be described as collections of collections that are variously organized according to resource type, author, creator, or collector of the resources in the collection, or any number of other principles or properties.
Intentional arrangement emphasizes explicit or implicit acts of organization by people, or by computational processes acting as proxies for or as implementations of human intentionality. Intentional arrangement excludes naturally-occurring patterns created by physical, geological, biological or genetic processes. There is information in the piles of debris left after a tornado or tsunami and the strata of the Grand Canyon. But they aren’t Organizing Systems because the patterns of arrangement were created by deterministic natural forces rather than by an identifiable agent following one or more organizing principles selected by a human agent.
Requiring arrangement to be intentional also excludes self-organizing systems from our definition of Organizing System. These are systems that can change their internal structure or their function in response to feedback or changed circumstances. Self-organizing has been used in physics, chemistry, and mathematics to explain phase transitions and equilibrium states. Self-organizing is also used to describe numerous natural and man-made phenomena like climate, communication networks, business and biological ecosystems, traffic and habitation patterns, neural networks, and online communities. All of these systems involve collections of resources that are very large and open, with complex interactions among the resources. The resource arrangements that emerge can’t always be interpreted as the result of intentional or deterministic principles and instead are more often described in probabilistic or statistical terms. Adam Smith’s “invisible hand” in economic markets and “natural selection” in evolutionary biology are classic examples of self-organizing mechanisms. The web as a whole with its more than a trillion unique pages is in many ways self-organizing, but at its core it follows clear organizing principles (See the Sidebar, THE WEB AS AN ORGANIZING SYSTEM).[6]
Taken together, the intentional arrangements of resources in an Organizing System are the result of decisions about what is organized, why it is organized, how much it is organized, when it is organized, and how or by whom it is organized (each of these will be discussed in greater detail in Design Decisions in Organizing Systems). An Organizing System is defined by the composite impact of the choices made on these design dimensions. Because these questions are interrelated their answers come together in an integrated way to define an Organizing System.
The arrangements of resources in an Organizing System follow or embody one or more organizing principles that enable the Organizing System to achieve its purposes. Organizing principles are directives for the design or arrangement of a collection of resources that are ideally expressed in a way that does not assume any particular implementation or realization.
When we organize a bookshelf, home office, kitchen, or the MP3 files on our music player, the resources themselves might be new and modern, but many of the principles that govern their organization are those that have influenced the design of organizing systems for thousands of years. For example, we organize resources using easily perceived properties to make them easy to locate, we group together resources that we often use together, and we make resources that we use often more accessible than those we use infrequently. Very general and abstract organizing principles are sometimes called design heuristics (for example, “make things easier to find”). More specific and commonly used organizing principles include alphabetical ordering (arranging resources according to their names) and chronological ordering (arranging resources according to the date of their creation or other important event in the lifetime of the resource). Some organizing principles sort resources into pre-defined categories and other organizing principles rely on novel combinations of resource properties to create new categories.
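Alphabetical and chronological ordering can each be expressed as a sort key over the same collection. The sketch below uses invented file names and dates purely for illustration.

```python
from datetime import date

# Hypothetical resources, each with a name and a creation date.
resources = [
    {"name": "budget.xls", "created": date(2011, 3, 1)},
    {"name": "agenda.doc", "created": date(2012, 1, 15)},
    {"name": "contacts.csv", "created": date(2010, 7, 4)},
]

# Each organizing principle is a sort key, stated independently of
# how or where the resources are stored.
alphabetical = sorted(resources, key=lambda r: r["name"].lower())
chronological = sorted(resources, key=lambda r: r["created"])

alphabetical_names = [r["name"] for r in alphabetical]
chronological_names = [r["name"] for r in chronological]
```

The same three resources yield two different arrangements, depending only on which property the principle selects.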
Expressing organizing principles in a way that separates design and implementation aligns well with the three-tier architecture familiar to software architects and designers: user interface (implementation of interactions), business logic (intentional arrangement), and data (resources). See the Sidebar, THE THREE TIERS OF ORGANIZING SYSTEMS.
The logical separation between organizing principles and their implementation is easy to see with digital resources. In a digital library it does not matter to a user if the resources are stored locally or retrieved over a network. The essence of a library Organizing System emerges from the resources that it organizes and the interactions with the resources that it enables. Users typically care a lot about the interactions they can perform, like the kinds of searching and sorting allowed by the online library catalog. How the resources and interactions are implemented is typically of little concern. Similarly, many email applications have migrated to the web and the system of filters and folders that manage email messages is no longer implemented in a local network or on personal computers, but most people neither notice nor care.
The separation of organizing principles and their implementation is harder to recognize in an Organizing System that only contains physical resources, such as your kitchen or clothes closet, where you appear to have unmediated interactions with resources rather than accessing them through some kind of user interface or “presentation tier” that supports the principles specified in the “middle tier” and realized in the “storage tier.” Nevertheless, you can see these different tiers in the organization of spices in a kitchen. Different kitchens might all embody an “alphabetic order” organizing principle for arranging a collection of spices, but the exact locations and arrangement of the spices in any particular kitchen depend on the configuration of shelves and drawers, whether a spice rack or rotating tray is used, and other storage-tier considerations. Similarly, spices could be logically organized by cuisine, with Indian spices separated from Mexican spices, but this organizing principle doesn’t imply anything about where they can be found in the kitchen.
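The spice example can be sketched in code by keeping the organizing principle and the storage layout in separate functions. The spice names, the function names, and the "slots per shelf" model of a kitchen are all assumptions made for this illustration.

```python
# Hypothetical spice collection.
spices = ["turmeric", "cumin", "cardamom", "oregano"]

def alphabetical(collection):
    """Organizing principle (middle tier): order spices by name."""
    return sorted(collection)

def place_on_shelves(ordered_names, slots_per_shelf):
    """Storage tier: assign ordered spices to (shelf, slot) locations."""
    return {
        name: (i // slots_per_shelf, i % slots_per_shelf)
        for i, name in enumerate(ordered_names)
    }

ordered = alphabetical(spices)

# Two kitchens share the principle but differ in shelf configuration.
small_kitchen = place_on_shelves(ordered, slots_per_shelf=2)
large_kitchen = place_on_shelves(ordered, slots_per_shelf=4)
```

Both kitchens follow the same alphabetic principle, yet turmeric ends up on the second shelf in one and the first shelf in the other.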
Because tangible things can only be in one place at a time, many Organizing Systems—like that in the modern library with online catalogs and physical collections—resolve this constraint by creating digital proxies or surrogates to organize their tangible resources, or create parallel digital resources like digitized books.[8] The implications for arranging, finding, using and reusing resources in any Organizing System directly reflect the mix of these two embodiments of information; in this way we can think of the modern library as a digital Organizing System that primarily relies on digital resources to organize a mixture of physical and digital ones.
The Organizing System for a small collection can sometimes use only the minimal or default organizing principle of “co-location” – putting all the resources in the same container, on the same shelf, or in the same email inbox. If you don’t cook much and have only a small number of spices in your kitchen, you don’t need to alphabetize them because it is easy to find the one you want.[9]
Some organization emerges implicitly through a “frequency of use” principle. In your kitchen or clothes closet, the resources you use most often migrate to the front. But as a collection grows in size, the time to arrange, locate, and retrieve a particular resource becomes more important and the collection must be explicitly organized to make these interactions efficient. As a result, most Organizing Systems employ organizing principles that make use of properties of the resources being organized (for example, name, color, shape, date of creation, semantic or biological category), and multiple properties are often used simultaneously. For example, in your kitchen you might arrange your cooking pots and pans by size and shape so you can nest them and store them compactly, but you might also arrange things by cuisine or style and separate your grilling equipment from the wok and other items you use for making Chinese food.
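The implicit "frequency of use" principle resembles what programmers call a move-to-front ordering. The sketch below is one simple way to model it; the shelf contents and the `use` function are hypothetical.

```python
def use(shelf, item):
    """Return a new ordering with the just-used item moved to the front."""
    return [item] + [other for other in shelf if other != item]

# Hypothetical kitchen shelf, front of the shelf first.
shelf = ["wok", "saucepan", "skillet", "stockpot"]

# Each use pulls the item forward; no one ever explicitly sorts the shelf.
for item in ["skillet", "saucepan", "skillet"]:
    shelf = use(shelf, item)
```

After these three uses the skillet and saucepan sit at the front, while the unused wok and stockpot have drifted to the back.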
Unlike those for physical resources, the most useful organizing properties for information resources are those based on their content and meaning, and these are not directly apparent when you look at a book or document. Significant intellectual effort or computation is necessary to reveal these properties when assigning subject terms or creating an index. The most effective organizing systems for information resources often are based on properties that emerge from analyzing the collection as a whole. For example, the relevance of documents to a search query is higher when they contain a higher than average frequency of the query terms compared to other documents in the collection, or when they are linked to relevant documents.
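One common way to turn "higher than average frequency of the query terms" into a number is TF-IDF weighting. The text does not name a specific formula, so the sketch below is an assumption: a basic TF-IDF variant applied to three invented documents.

```python
import math

# Hypothetical three-document collection.
documents = {
    "d1": "spices spices kitchen",
    "d2": "kitchen shelf books",
    "d3": "books library catalog",
}

def tf_idf(term, doc_id, docs):
    """Score a term higher when it is frequent in this document
    but rare across the collection as a whole."""
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)
    containing = sum(1 for text in docs.values() if term in text.split())
    if containing == 0:
        return 0.0
    idf = math.log(len(docs) / containing)
    return tf * idf

def rank(query, docs):
    """Order document ids by their summed TF-IDF score for the query terms."""
    scores = {
        doc_id: sum(tf_idf(term, doc_id, docs) for term in query.split())
        for doc_id in docs
    }
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

The score of any one document depends on the other documents in the collection, which is why this property only emerges from analyzing the collection as a whole.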
Many disciplines have specialized job titles to distinguish among the people who organize resources (for example: cataloguer, archivist, indexer, curator, collections manager…).[10] However, we use the more general word agent for any entity capable of autonomous and intentional organizing effort because it treats organizing work done by people and organizing work done by computers as having common goals, despite obvious differences in methods.
We can analyze agents in Organizing Systems to understand how human and computational efforts to arrange resources complement and substitute for each other. We can determine the economic, social, and technological contexts in which each type of agent can best be employed. We can determine how the Organizing System allocates effort and costs among its creators, users, maintainers and other stakeholders.
A group of people can be an organizing agent, as when a group of people come together in a service club or standards body technical committee in which the members of the group subordinate their own individual agency to achieve a collective good.
We also use “agent” when we discuss interactions with Organizing Systems. The entities that most typically access the contents of libraries, museums, or other collections of physical resources are “human agents” - that is, people. In other organizing systems like business information systems or data repositories, interactions with resources are carried out by computational processes, robotic devices, or other entities that act autonomously on behalf of a person or group.
In some Organizing Systems the resources themselves are capable of initiating interactions with other resources or with external agents. This is most obvious with human or other living resources and is also the case with resources augmented with computational or communication capabilities. We are all familiar with RFID tags, which enable the precise identification and location of physical resources as they move through supply chains and stores.
An interaction is an action, function, service, or capability that makes use of the resources in a collection or the collection as a whole. The interaction of access is fundamental in any collection of resources, but many Organizing Systems provide additional functions to make access more efficient and to support additional interactions with the accessed resources. For example, libraries and similar Organizing Systems implement catalogs to enable interactions for finding a known resource, identifying any resource in the collection, and discriminating or selecting among similar resources.[11]
Some of the interactions with resources in an Organizing System are inherently determined by the characteristics of the resource. Because many museum resources are unique or extremely valuable, visitors are allowed to view them but can’t borrow them, in contrast with most of the resources in libraries. A library might have multiple printed copies of Moby Dick but can never lend more of them than it possesses. After a printed book is checked out from the library, there are many types of interactions that might take place – reading, translating, summarizing, annotating, and so on – but these are not directly supported by the library Organizing System and are invisible to it. For works not in the public domain, copyright law gives the copyright holder the right to prevent some uses, but at the same time “fair use” and similar copyright doctrines enable certain limited uses even for copyrighted works.[12]
Digital resources enable a greater range of interactions than physical ones. Any number of people or processes can request a weather forecast from a web-based weather service because the forecast isn’t used up by the request and the marginal cost of allowing another access is nearly zero. Furthermore, with digital resources many new kinds of interactions can be enabled through application software, web services, or application program interfaces (APIs) in the Organizing System. In particular, translation, summarization, annotation, and keyword suggestion are highly useful services that are commonly supported by web search engines and other web applications. Similarly, an Organizing System with digital resources can implement a “keep everything up to date” interaction that automatically pushes current content to your browser or computing device.
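The contrast between lending printed copies and serving a digital forecast can be sketched with two small classes. The class names, the copy count, and the forecast text are assumptions made for illustration.

```python
class PrintedBook:
    """A physical resource: each loan uses up one of a fixed number of copies."""

    def __init__(self, title, copies):
        self.title = title
        self.available = copies

    def lend(self):
        if self.available == 0:
            return False  # cannot lend more copies than the library owns
        self.available -= 1
        return True


class WeatherForecast:
    """A digital resource: a request does not use up the forecast."""

    def __init__(self, text):
        self.text = text
        self.requests = 0

    def request(self):
        self.requests += 1
        return self.text


book = PrintedBook("Moby Dick", copies=2)
loans = [book.lend() for _ in range(3)]  # third attempt fails

forecast = WeatherForecast("sunny")
answers = [forecast.request() for _ in range(3)]  # every request succeeds
```

The physical resource constrains the interaction through its copy count, while the digital one can serve any number of requests.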
But just as technology can enable interactions, it can prevent or constrain them. If your collection of digital resources (ebooks or music, for example) is not stored on your own computer or device and instead is implemented as access rights to resources stored elsewhere, a continuous Internet connection is a requirement for access. In addition, access control policies and digital rights management technology (DRM) can limit the devices that can access the collection and prevent copying, annotation and other actions that might otherwise be enabled by the fair use doctrine.
Just as with organizing principles, it is useful to think of interactions in an abstract or logical way that does not assume an implementation because it can encourage innovative designs for Organizing Systems. See the Sidebar, THE DIGITAL ZOO.
A set of resources is transformed by an organizing system when the resources are described or arranged to enable interactions with them. Explicitly or by default, this requires many interdependent decisions about the identities of resources; their names, descriptions and other properties; the classes, relations, structures and collections in which they participate; and the people or technologies who interact with them.
One important contribution of the idea of the Organizing System is that it moves beyond the debate about the definitions of things, documents, and information with the unifying concept of resource while acknowledging that “what is being organized” is just one of the questions or dimensions that need to be considered.
These decisions are deeply intertwined, but it is easier to introduce them as if they were independent. We introduce five groups of design decisions, itemizing the most important dimensions in each group:
How well these decisions coalesce in an Organizing System depends on the requirements and goals of its human and computational users, and on understanding the constraints and tradeoffs that any set of requirements and goals imposes. How and when these constraints and tradeoffs are handled can depend on the legal, business and technological contexts in which the Organizing System is designed and deployed; on the relationship between the designers and users of the Organizing System (who may be the same people or different ones); on the economic or emotional or societal purpose of the Organizing System; and on numerous other design, deployment, and use factors.
Classifying Organizing Systems according to the kind of resources they contain is the most obvious and traditional approach. We can also classify Organizing Systems by their dominant purposes, by their intended user community, or in other ways. No single fixed set of categories is sufficient by itself to capture the commonalities and contrasts between Organizing Systems.
We can augment the categorical view of Organizing Systems by thinking of them as existing in a multi-faceted or multi-dimensional design space in which we can consider many types of collections at the same time.
We distinguish law libraries from software libraries, knowledge management systems from data warehouses, and personal stamp collections from coin collections primarily because they contain different kinds of resources. Similarly, we distinguish document collections by resource type, contrasting narrative document types like novels and biographies with transactional ones like catalogs and invoices, with hybrid forms like textbooks and encyclopedias in between.
But there are three other conventional ways to classify Organizing Systems. A second way to distinguish Organizing Systems is by their dominant purposes or the priority of their common purposes. For example, libraries, museums, and archives are often classified as “memory institutions” to emphasize their primary focus on resource preservation. In contrast, “management information systems” or “business systems” are categories that include the great variety of software applications that implement the Organizing Systems needed to carry out day-to-day business operations.
A third conventional approach for classifying Organizing Systems is according to the nature or size of the intended user community. This size or scope can range from personal Organizing Systems created and used by a single person; to “community-based” Organizing Systems used by informal social groups; to those used by the employees, customers or stakeholders of an enterprise; to those used by an entire community or nation; to global ones potentially used by anyone in the world.
A fourth way to distinguish Organizing Systems is according to the technology used to implement them. Large businesses use different software applications for inventory management, records management, content management, knowledge management, customer relationship management, data warehousing and business intelligence, e-mail archiving, and other subcategories of collections.[13]
We can get overwhelmed by this proliferation of ways to classify collections of resources, especially when the classification isn’t clearly based on just one of these many approaches. For example, the list of “library types” used by the International Federation of Library Associations to organize its activities includes resource-based distinctions (e.g., art libraries, law libraries, social science libraries), purpose-based ones (e.g., academic and research libraries), and user-based distinctions (e.g., public libraries, school libraries, libraries serving persons with print disabilities).[14]
A type of resource and its conventional Organizing System are often the focal point of a discipline. Category labels like library, museum, zoo, and data repository have core meanings and many associated experiences and practices. Specialized concepts and vocabularies often evolve to describe these. The richness that follows from this complex social and cultural construction makes it difficult to define category boundaries precisely.
Consider Borgman’s commonly accepted definition of libraries as institutions that “select, collect, organize, conserve, preserve, and provide access to information on behalf of a community of users.” Many Organizing Systems are described as libraries, even though they differ from traditional libraries in important respects. See the Sidebar, WHAT IS A LIBRARY?
We can always create new categories by stretching the conventional definitions of “library” or other familiar Organizing Systems and adding modifiers, as when Flickr is described as a web-based photo-sharing library. But whenever we define an Organizing System with respect to a familiar category, the typical or mainstream instances and characteristics of that category that are deeply embedded in language and culture are reinforced, and those that are atypical are marginalized. In the Flickr case this means we suggest features that aren’t there (like authoritative classification) or omit the features that are distinctive (like tagging by users).
More generally, a categorical view of Organizing Systems makes it matter greatly which category is used to anchor definitions or comparisons. The Google Books project makes out-of-print and scholarly works vastly more accessible, but framing it in library terms to suggest it is a public good upsets many people with a more traditional sense of what the library category implies. We can readily identify design choices in Google Books that are more characteristic of the Organizing Systems in business domains, and the project might have been perceived more favorably had it been described as an online bookstore that offered many beneficial services for free.
A complementary perspective on Organizing Systems is that they exist in a multifaceted or multi-dimensional design space. This framework for describing and comparing Organizing Systems overcomes some of the biases and conservatism built into familiar categories like libraries, museums, and archives, while enabling us to describe them as design patterns that embody characteristic configurations of design choices. We can then use these patterns to support multi-disciplinary work that cuts across categories and applies knowledge about familiar domains to unfamiliar ones. A dimensional perspective makes it easier to translate between category and discipline-specific vocabularies so that people from different disciplines can have mutually intelligible discussions about their organizing activities. They might realize that they have much in common, and they might be working on similar or even the same problems.
A faceted or dimensional perspective acknowledges the diversity of instances of collection types and provides a generative, forward-looking framework for describing hybrid types that don’t cleanly fit into the familiar categories. Even though it might differ from the conventional categories on some dimensions, an Organizing System can be designed and understood by its “family resemblance” on the basis of its similarities on other dimensions to a familiar type of resource collection.
Thinking of Organizing Systems as points or regions in a design space makes it easier to invent new or more specialized types of collections and their associated interactions. If we think metaphorically of this design space as a map of Organizing Systems, the empty regions or “white space” between the densely-populated centers of the traditional categories represent Organizing Systems that do not yet exist. We can consider the properties of an Organizing System that could occupy that white space and analyze the technology, process, or policy innovations that might be required to let us build it there.
But even though digital technology is radically subdividing the traditional categories of collections by supporting new kinds of specialized information-intensive applications, an opposite and somewhat paradoxical trend has emerged. Jennifer Trant argues that the common challenges of “going digital,” and the architectural and functional constraints imposed by web implementations, are causing some convergence in the operation of libraries, museums, and archives. Similarly, Anne Gilliland suggests that giving every physical resource in a collection a digital surrogate or proxy that is searchable and viewable in a web browser is “erasing the distinctions between custodians of information and custodians of things.”[16]
Taken together, these two trends have one profound implication. If the traditional categories for thinking about collections are splintering in some respects and converging in others, they are less useful in describing innovative collections and their associated interactions. Thus, we need a new concept – the Organizing System – that
“What is difficult to identify is difficult to describe and therefore difficult to organize” (Svenonius, 2000, p. 13).
Before we can begin to organize any resource we often need to identify it. It might seem straightforward to devise an Organizing System around tangible resources, but we must be careful not to assume what a resource is. In different situations, the same thing can be treated as a unique item, as one of many equivalent members of a broad category, or as a component of an item rather than as an item on its own. For example, in a museum collection, a handmade carved chess piece might be a separately identified item, identified as part of a set of carved chess pieces, or treated as one of the 33 unidentified components of an item identified as a chess set (including the board). When merchants assign a stock-keeping unit (SKU) to identify the things they sell, a SKU can be associated with a unique item, with sets of items treated as equivalent for inventory or billing purposes, or with intangible things like warranties.
You probably don’t have explicit labels on the cabinets and drawers in your kitchen or clothes closet, but department stores and warehouses have signs in the aisles and on the shelves because of the larger number of things a store needs to organize. As a collection of resources grows, it often becomes necessary to identify each one explicitly; to create surrogates like bibliographic records or descriptions that distinguish one resource from another; and to create additional organizational mechanisms like shelf labels, store directories, library card catalogs and indexes that facilitate understanding the collection and locating the resources it contains. These organizational mechanisms often suggest or parallel the organizing principles used to organize the collection itself.
Organization mechanisms like aisle signs, store directories and library card catalogs are embedded in the same physical environment as the resources being organized. But when these mechanisms or surrogates are digitized, the new capabilities that they enable create design challenges. This is because a digital Organizing System can be designed and operated according to more abstract and less constraining principles than an Organizing System that only contains physical resources. A single physical resource can only be in one place at a time, and interactions with it are constrained by its size, location, and other properties. In contrast, digital copies and surrogates can exist in many places at once and enable searching, sorting, and other interactions with an efficiency and scale impossible for tangible things.
When the resources being organized consist of information content, deciding on the unit of organization is challenging because it might be necessary to look beyond physical properties and consider conceptual or intellectual equivalence. A high school student told to study Shakespeare’s play “Macbeth” might treat any printed copy or web version as equivalent, and might even try to outwit the teacher by watching a film adaptation of the play. To the student, all versions of Macbeth seem to be the same resource, but librarians and scholars make much finer distinctions.[17]
Archival Organizing Systems implement a distinctive answer to the question of what is being organized. Archives are a type of collection that focuses on resources created by a particular person, organization, or institution, often during a particular time period. This means that archives have themselves been previously organized as a result of the processes that created and used them. This “original order” embodies the implicit or explicit Organizing System of the person or entity that created the documents and it is treated as an essential part of the meaning of the collection. As a result, the unit of organization for archival collections is the fonds—the original arrangement or grouping, preserving any hierarchy of boxes, folders, envelopes, and individual documents—and thus they are not re-organized according to other (perhaps more systematic) classifications.[18]
Some Organizing Systems contain legal, business or scientific documents or data that are the digital descendants of paper reports or records of transactions or observations. These Organizing Systems might need to deal with legacy information that still exists in paper form or in electronic formats like image scans that are different from the structural digital format in which more recent information is likely to be preserved. When legacy conversions from printed information artifacts are complete or unnecessary, an Organizing System no longer deals with any of the traditional tangible artifacts. Digital libraries dispense with these artifacts, replacing them with the capability to print copies if needed. This enables libraries of digital documents or data collections to be vastly larger and more accessible across space and time than any library that stores tangible, physical items could ever be.
An increasing number of Organizing Systems handle resources that are born digital. Ideally, digital texts can be encoded with explicit markup that captures structural boundaries and content distinctions, which can be used to facilitate organization, retrieval, or both. In practice the digital representations of texts are often just image scans that do not support much processing or interaction. A similar situation exists for the digital representations of music, photographs, videos, and other non-text content like sensor data, where the digital formats are structurally and semantically opaque.
“The central purpose of systems for organizing information [is] bringing like things together and differentiating among them” (Svenonius, 2000, p. xi).
Almost by definition, the essential purpose of any Organizing System is to describe or arrange resources so they can be located and accessed later. The organizing principles needed to achieve this goal depend on the types of resources or domains being organized, and on the personal, social, or institutional setting in which organization takes place. “Bringing like things together” is an informal organizing principle for many Organizing Systems. But there will likely be a number of more precise requirements or constraints to satisfy.
Organizing Systems involving physical resources are more likely to emphasize aesthetic or emotional goals than those for information resources, which more often are dominated by functional goals like efficiency of storage and access. This contrast is often magnified by the tendency for major library and museum collections to be housed in buildings designed as architectural monuments that over time become symbols of national or cultural identity.
The fine distinctions between Organizing Systems that have many characteristics in common reflect subtle differences in the priority of their shared goals. For example, many Organizing Systems create collections and enable interactions with the goals of supporting scientific research, public education, and entertainment. We can contrast zoos, animal theme parks, and wild animal preserves in terms of the absolute and relative importance of these three goals.
When individuals manage their papers, books, documents, record albums, compact discs, DVDs, and other information resources, their Organizing Systems vary greatly. This is in part because the content of the resources being organized becomes a consideration. Furthermore, many of the Organizing Systems used by individuals are implemented by web applications, and this makes them more accessible because their resources can be accessed from anywhere with a web browser.[19]
Put another way, an information resource inherently has more potential uses than resources like forks or frying pans, so it isn’t surprising that the Organizing Systems in offices are even more diverse than those in kitchens.
When the scale of the collection or the number of intended users increases, not everyone is likely to share the same goals and design preferences for the Organizing System. If you share a kitchen with housemates, you might have to negotiate and compromise on some of the decisions about how the kitchen is organized so you can all get along. In more formal or institutional Organizing Systems conflicts between stakeholders can be much more severe, and the organizing principles might even be specified in commercial contracts or governed by law. For example, Bowker and Star note that physicians view the creation of patient records as central to diagnosis and treatment, insurance companies think of them as evidence needed for payment and reimbursement, and researchers think of them as primary data. Not surprisingly, policymaking and regulations about patient records are highly contentious.[20]
We can look back to the invention of mechanized printing in the fifteenth century, which radically increased the number of books and periodicals, as the motivation for libraries to develop systematic methods for cataloging and classifying what they owned and to view themselves as doing more than just preserving a collection. Libraries began progressively more refined efforts to state the functional requirements for their Organizing Systems and to be explicit about how they met those requirements.
Today, the Organizing Systems in a large academic research library must also support many functions and services other than those that directly support search and location of resources in their collections. In these respects, the Organizing Systems in non-profit libraries have much in common with those in corporate information repositories and business applications. See the Sidebar, LIBRARY {AND, OR, VS.} BUSINESS ORGANIZING SYSTEMS.
Preserving documents in their physical or original form is the primary purpose of archives and similar Organizing Systems that contain culturally, historically, or economically significant documents that have value as long-term evidence. Preservation is also an important motivation for the Organizing Systems of information- and knowledge-intensive firms. Businesses and governmental agencies are usually required by law to keep records of financial transactions, decision-making, personnel matters, and other information essential to business continuity, compliance with regulations and legal procedures, and transparency. As with archives, it is sometimes critical that these business knowledge or records management systems can retrieve the original documents, although digital copies that can be authenticated are increasingly being accepted as legally equivalent.
Chapter 7, “Classification,” more fully explains the different purposes for Organizing Systems, the organizing principles they embody, and the methods for assigning items to classifications.
“It is a general bibliographic truth that not all documents should be accorded the same degree of organization” (Svenonius, 2000, p. 24).
Not all resources should be accorded the same degree of organization. In this section we will briefly unpack this notion of degree of organization into its two most important and related dimensions: the amount of description or organization applied to each resource and the amount of organization of resources into classes or categories. Chapter 4, “Describing Resources,” Chapter 5, “Categories,” and Chapter 8, “Implementing Resource Descriptions,” more thoroughly address these questions about the nature and extent of description in Organizing Systems.
Not all resources in a collection require the same degree of description, for the simple reason that Organizing Systems exist for different purposes and to support different kinds of interactions or functions (see Why is it Being Organized?). Let’s contrast two ends of the “degree of description” continuum. Many people use “current events awareness” or “news feed” applications that select news stories whose titles or abstracts contain one or more keywords. This exact match algorithm is easy to implement, but its all-or-none and one-item-at-a-time comparison misses any stories that use synonyms of the keyword, that are written in languages different from that of the keyword, or that are otherwise relevant but don’t contain the exact keyword in the limited part of the document that is scanned. However, users with current events awareness goals don’t need to see every news story about some event, and this limited amount of description for each story and the simple method of comparing descriptions are sufficient.
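The exact match behavior described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `matches_keyword` and the three sample stories are hypothetical, invented here rather than taken from any real news feed application.

```python
import re

def matches_keyword(story, keywords):
    """Return True if any keyword appears as a whole word in the title or abstract.

    Only the title and abstract are scanned, mirroring the limited part of
    the document that a simple news feed filter looks at.
    """
    scanned = (story["title"] + " " + story["abstract"]).lower()
    words = set(re.findall(r"[a-z0-9]+", scanned))
    return any(k.lower() in words for k in keywords)

# Hypothetical stories, invented for illustration only.
stories = [
    {"title": "Earthquake shakes coastal city",
     "abstract": "Damage reports are still arriving."},
    {"title": "Strong tremor felt across region",
     "abstract": "Seismologists measured magnitude 6.1."},
    {"title": "Terremoto sacude la costa",
     "abstract": "Las autoridades evalúan los daños."},
]

selected = [s["title"] for s in stories if matches_keyword(s, ["earthquake"])]
print(selected)
```

With the keyword “earthquake,” only the first story is selected. The second story uses a synonym and the third is written in another language, so both are missed even though all three concern the same kind of event.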
On the other hand, this simple Organizing System is inadequate for the purpose of comprehensive retrieval of all documents that relate to some concept, event, or problem. This is a critical task for scholars, scientists, inventors, physicians, attorneys and similar professionals who might need to discover every relevant document in some domain. Instead, this type of Organizing System needs rich bibliographic and semantic description of each document, most likely assigned by professional cataloguers, and probably using terms from a controlled vocabulary to enforce consistency in what descriptions mean.
Similarly, different merchants or firms might make different decisions about the extent or granularity of description when they assign SKUs because of differences in suppliers, targeted customers, or other business strategies. If you take your car to the repair shop because windshield wiper fluid is leaking, you might be dismayed to find that the broken rubber seal that is causing the leak can’t be ordered separately and you have to pay to replace the “wiper fluid reservoir” for which the seal is a minor but vital part. Likewise, when two business applications try to exchange and merge customer information, integration problems will arise if one describes a customer as a single “NAME” component while the other separates the customer’s name into “TITLE”, “FIRSTNAME,” and “LASTNAME.”
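The customer name mismatch can be made concrete with a small Python sketch. The functions `split_name` and `join_name`, the list of titles, and the sample names are all hypothetical; the splitting rule is a deliberately naive assumption, not a recommended integration technique.

```python
def split_name(full_name, known_titles=("Dr.", "Mr.", "Ms.", "Mrs.", "Prof.")):
    """Naively split a single NAME string into TITLE, FIRSTNAME, LASTNAME.

    The heuristic assumes "[title] first rest-is-last" order, which does
    not hold for many real names; that weakness is the point of the sketch.
    """
    parts = full_name.split()
    title = parts.pop(0) if parts and parts[0] in known_titles else ""
    first = parts[0] if parts else ""
    last = " ".join(parts[1:]) if len(parts) > 1 else ""
    return {"TITLE": title, "FIRSTNAME": first, "LASTNAME": last}

def join_name(record):
    """Merge TITLE, FIRSTNAME, LASTNAME into one NAME string."""
    fields = (record["TITLE"], record["FIRSTNAME"], record["LASTNAME"])
    return " ".join(f for f in fields if f)

# Going from the fine-grained description to the coarse one is easy.
print(join_name({"TITLE": "Dr.", "FIRSTNAME": "Maria", "LASTNAME": "Garcia Lopez"}))

# Going the other way requires guessing where the boundaries are.
print(split_name("Dr. Maria Garcia Lopez"))
print(split_name("Mary Ann Smith"))
```

Merging the three components into one is straightforward, but splitting a single “NAME” has to guess at boundaries. Under this naive rule, “Mary Ann Smith” is assigned the first name “Mary” and the last name “Ann Smith,” which is wrong if the customer’s first name is “Mary Ann.”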
Even when faced with the same collection of resources, people differ in how much organization they prefer or how much disorganization they can tolerate. A classic study by Tom Malone of how people organize their office workspaces and desks contrasted the strategies and methods of “filers” and “pilers.” Filers maintain clean desktops and systematically organize their papers into categories, while pilers have messy work areas and make few attempts at organization. This contrast has analogues in other Organizing Systems, and we can easily imagine what happens if a “neat freak” and a “slob” become roommates.[21]
Different preferences and disagreements between stakeholders in an Organizing System about how much organization is necessary often result because of the implications for who does the work and who gets the benefits. Physicians prefer narrative descriptions and broad classification systems because they make it easier to create patient notes. In contrast, insurance companies and researchers want fine-grained “form-filling” descriptions and detailed classifications that would make the physician’s work more onerous.[22]
The cost-effectiveness of creating systematic and comprehensive descriptions of the resources in an information collection has been debated for nearly two centuries and in the last half century the scope of the debate grew to consider the role of computer-generated resource descriptions.[23]
An alternative and complement to human-created descriptions for each resource are computer-generated indexes of their textual contents. These indexes typically assign weights to the terms according to calculations that consider the frequency and distribution of the terms in both the individual documents and the collection as a whole to create a description of what the documents are about. These descriptions of the documents in the collection are more consistent than those created by human organizers. They allow for more complex query processing and comparison operations by the retrieval functions in the Organizing System. For example, query expansion mechanisms or thesauri can automatically add synonyms and related terms to the search. Additionally, retrieved documents can be arranged by relevance, while “citing” and “cited-by” links can be analyzed to find related relevant documents.
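One widely used weighting of this kind is TF-IDF (term frequency multiplied by inverse document frequency). The text above does not name a particular scheme, so choosing TF-IDF is an assumption made for illustration, as are the function name `tf_idf` and the three tiny sample documents in this Python sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a TF-IDF weight for each term in each document.

    docs maps a document id to its text. Term frequency is the term's share
    of the document's tokens; inverse document frequency is
    log(N / number of documents containing the term).
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)

    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    weights = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        weights[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Hypothetical three-document collection, invented for illustration.
docs = {
    "d1": "wine list red wine",
    "d2": "wine cellar storage",
    "d3": "library catalog list",
}
weights = tf_idf(docs)
top_term = max(weights["d1"], key=weights["d1"].get)
print(top_term)
```

In this toy collection the highest-weighted term for the first document is “red,” not “wine.” Although “wine” occurs twice in that document, it also appears in another document, so it distinguishes the first document less well than a term found nowhere else in the collection.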
A second constraint on the degree of organization comes from the absolute size of the collection within the scope of the Organizing System. Organizing more resources requires more descriptions to distinguish any particular resource from the rest and more constraining organizing principles. Similar resources need to be grouped or classified to emphasize the most important distinctions among the complete set of resources in the collection. A small neighborhood restaurant might have a short wine list with just ten wines, arranged in two categories for “red” and “white” and described only by the wine’s name and price. In contrast, a gourmet restaurant might have hundreds of wines in its wine list, which would subdivide its “red” and “white” high-level categories into subcategories for country, region of origin, and grape varietal. The description for each wine might in addition include a specific vineyard from which the grapes were sourced, the vintage year, ratings of the wine, and tasting notes.
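The wine list contrast can be sketched as grouping the same resources by fewer or more description fields. The helper `group_by`, the field names, and the sample wines in this Python sketch are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def group_by(items, keys):
    """Group item names under a tuple of the given description fields."""
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[k] for k in keys)].append(item["name"])
    return dict(groups)

# Hypothetical wine list entries, invented for illustration.
wines = [
    {"name": "House Red", "color": "red",
     "country": "France", "varietal": "Merlot"},
    {"name": "Estate Red", "color": "red",
     "country": "Italy", "varietal": "Sangiovese"},
    {"name": "House White", "color": "white",
     "country": "France", "varietal": "Chardonnay"},
]

# A short list needs only the high-level categories.
coarse = group_by(wines, ["color"])
# A long list subdivides them using more of each wine's description.
fine = group_by(wines, ["color", "country", "varietal"])
print(len(coarse), len(fine))
```

The coarse grouping yields two categories and the fine grouping yields three. The finer categories are only possible because each wine carries the additional description fields, which is why larger collections tend to need both more description and more constraining organizing principles.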
At some point a collection grows so large that it is not economically feasible for people to create bibliographic descriptions or to classify each separate resource, unless there are so many users of the collection that their aggregated effort is comparably large; this is organizing by “crowdsourcing” (see the Sidebar on WEB 2.0, ENTERPRISE 2.0, LIBRARY 2.0, MUSEUM 2.0, SCIENCE 2.0, GOV 2.0 in How (or by Whom) is it Organized?). This leaves two approaches that can be done separately or in tandem. The simpler approach is to describe sets of resources or documents as a set or group, which is especially sensible for archives with their emphasis on the fonds (see What is Being Organized?). The second approach is to rely on automated and more general-purpose organizing technologies that organize resources through computational means. Search engines are familiar examples of computational organizing technology, and section 6.5, “Computational Classification,” describes other common techniques in machine learning, clustering, and discriminant analysis that can be used to create a system of categories and to assign resources to them.
Chapter 8, “Implementing Resource Descriptions,” focuses on the representation and management of descriptions, taking a more technological or implementation perspective. Chapter 9, “Interactions in Organizing Systems,” discusses how the nature and extent of descriptions determines the capabilities of the interactions that locate, compare, combine, or otherwise use resources in information-intensive domains.
Because bibliographic description, when manually performed, is expensive, it seems likely that the “pre” organizing of information will continue to shift incrementally toward “post” organizing (Svenonius, 2000, p. 194-195).
The Organizing System framework recasts the traditional tradeoff between information organization and information retrieval as the decision about when the organization is imposed. We can contrast organization imposed on resources “on the way in” when they are created or made part of a collection with “on the way out” organization imposed when an interaction with resources takes place.
When an author writes a document, he or she gives it some internal organization via title, section headings, typographic conventions, page numbers, and other mechanisms that identify its parts and their significance or relationship to each other. The document could also have some external organization implied by the context of its publication, like the name of its author and publisher, its web address if it is online or has a website, and citations or links to other documents or web pages.
Digital photos, videos, and documents are generally organized to some minimal degree when they are created because some descriptions like time and location are assigned automatically to these types of resources by the technology used to create them.[24]
Digital resources created by automated processes generally exhibit a high degree of organization and structure because they are generated automatically in conformance with data or document schemas. These schemas implement the business rules and information models for the orders, invoices, payments, and the numerous other types of document resources that are created and managed in business organizing systems.
Before a resource becomes part of a library collection, its author-created organization is often supplemented by additional information supplied by the publisher or other human intermediaries, such as an ISBN or Library of Congress call number or subject headings.
In contrast, Google and other search engines apply massive computational power to analyze the contents and associated structures (like links between web pages) to impose organization on resources that have already been published or made available so that they can be retrieved in response to a user’s query “on the way out.” Google makes use of existing organization within and between information resources when it can, but its unparalleled technological capabilities and scale yield competitive advantage in imposing organization on information that wasn’t previously organized digitally. Indeed, Geoff Nunberg criticized Google for ignoring or undervaluing the descriptive metadata and classifications previously assigned by people and replacing them with algorithmically assigned descriptors, many of which are incorrect or inappropriate.[25]
Google makes almost all of its money through personalized ad placement, so much of the selection and ranking of search results is determined “on the way out” in the fraction of a second after the user submits a query by using information about the user’s search history and current context. Of course, this “on the way out” organization is only possible because of the more generic organization that Google’s algorithms have imposed, but that only reminds us of how much the traditional distinction between “information organization” and “information retrieval” is no longer defensible.
In many Organizing Systems the nature and extent of organization changes over time as the resources governed by the Organizing System are used. The arrangement of resources in a kitchen or in an office changes incrementally as frequently used things end up in the front of the pantry, drawer, shelf or filing cabinet or on the top of a pile of papers. Printed books or documents acquire margin notes, underlining, turned down pages or coffee cup stains that differentiate the most important or most frequently used parts. Digital documents don’t take on coffee cup stains, but when they are edited, their new revision dates put them at the top of directory listings.
The scale of emergent organization of web sites, photos on Flickr, blog posts, and other resources that can be accessed and used online dwarfs the incremental evolution of individual Organizing Systems. This organization is clearly visible in the pattern of links, tags, or ratings that are explicitly associated with these resources, but search engines and advertisers also exploit the less visible organization created over time by information about which resources were viewed and which links were followed.
This sort of organic or emergent change in Organizing Systems that takes place over time contrasts with the planned and systematic maintenance of Organizing Systems described as curation or governance. These two terms are related but distinct. The former is most often used for libraries, museums, or archives and the latter for enterprise or inter-enterprise contexts. Curation usually refers to the methods or systems that add value to and preserve resources, while the concept of governance more often emphasizes the institutions or organizations that carry out those activities (see Maintaining Resources for more discussion).
The Organizing Systems for businesses and industries often change because of the development of de facto or de jure standards, or because of regulations, court decisions, or other events or mandates from entities with the authority to impose them.
“The rise of the Internet is affecting the actual work of organizing information by shifting it from a relatively few professional indexers and catalogers to the populace at large. … An important question today is whether the bibliographic universe can be organized both intelligently (that is, to meet the traditional bibliographic objectives) and automatically” (Svenonius, 2000, p. 26).
In the preceding quote Svenonius identifies three different ways for the “work of organizing information” to be performed: by professional indexers and catalogers, by the populace at large, and by automated (computerized) processes. Our notion of the Organizing System is broader than her “bibliographic universe,” making it necessary to extend her taxonomy. Authors are increasingly organizing the content they create, and it is important to distinguish users in informal and formal or institutional contexts. We have also introduced the concept of an organizing agent (The Concept of “Organizing Principle”) to unify organizing done by people and by computer algorithms.
Professional indexers and cataloguers undergo extensive training to learn the concepts, controlled descriptive vocabularies, and standard classifications in the particular domains in which they work. Their goal is not only to describe individual resources, but to position them in the larger collection in which they reside.[26] They can create and maintain Organizing Systems with consistent high quality, but their work often requires additional research, which is costly.
Expanding the scope of Organizing Systems beyond the bibliographic universe expands the class of professional organizers to include the employees of commercial information services like Westlaw and LexisNexis, who add controlled and, often, proprietary metadata to legal and government documents and other news sources. Scientists and scholars with deep expertise in a domain often function as the professional organizers for data collections, scholarly publications and proceedings, and other specialized information resources in their respective disciplines. The National Association of Professional Organizers claims several thousand members who will organize your media collection, kitchen, closet, garage or entire house or will help you downsize to a smaller living space.[27]
Many of today’s content creators are unlikely to be professional organizers, but presumably the author best understands why something was created and the purposes for which it can be used. To the extent that authors want to help others find a resource, they will assign descriptions or classifications that they expect will be useful to those users. But unlike professional organizers, many authors will be unfamiliar with controlled vocabularies and standard classifications, and as a result their descriptions will be more subjective and less consistent with those for the larger collection.
Similarly, most of us don’t hire professionals to organize the resources we collect and use in our personal lives, and thus our organizing systems reflect our individual preferences and idiosyncrasies.
Non-author users in the “populace at large” are most often creating organization for their own benefit. Not only are these ordinary users unlikely to use standard descriptors and classifications, the organization they impose sometimes so closely reflects their own perspective and goals that it isn’t useful or accurate for others.
Fortunately most users of “Web 2.0” or “community content” applications at least partly recognize that in these applications the organization of resources emerges from the aggregated contributions of all users, which provides incentive to use less egocentric descriptors and classifications. The staggering number of users and resources on the most popular applications inevitably leads to “tag convergence” simply because of the statistics of large sample sizes.
Finally, the vast size of the web and the even greater size of the deep or invisible web composed of the information stores of business and proprietary information services make it impossible to imagine today that it could be organized by anything other than the massive computational power of search engine providers like Google and Microsoft.[28] Nevertheless, in the earliest days of the web, significant human effort was applied to organize it. Most notable is Yahoo!, founded by Jerry Yang and David Filo in 1994 as a directory of favorite web sites. For many years the Yahoo! homepage was the best way to find relevant websites by browsing the extensive system of classification. Today’s Yahoo! homepage emphasizes a search engine that makes it appear more like Google or Microsoft Bing, but the Yahoo! directory can still be found if you search for it.
Devising concepts, methods, and technologies for describing and organizing resources have been essential human activities for millennia, evolving both in response to human needs and to enable new ones. Organizing Systems enabled the development of civilization, from agriculture and commerce to government and warfare. Today Organizing Systems are embedded in every domain of purposeful activity, including research, education, law, medicine, business, science, institutional memory, sociocultural memory, governance, public accountability, as well as in the ordinary acts of daily living.
Many of the foundational topics for a discipline of organizing have traditionally been presented from the perspective of the public sector library and taught as “library and information science.” These include bibliographic description, classification, naming, authority control, and information standards. We need to update and extend the coverage of these topics to include more private sector and non-bibliographic contexts, multi- and social media, and new information-intensive applications and service systems enabled by mobile, pervasive, and scientific computing. In so doing we can reframe the foundational concepts to make them equally compatible with the disciplinary perspectives of informatics, data and process modeling, and document engineering.
With the Web and ubiquitous digital information, along with effectively unlimited processing, storage and communication capability, millions of people create and browse web sites, blog, tag, tweet, and upload and download content of all media types without thinking “I’m organizing now” or “I’m retrieving now.” Writing a book used to mean a long period of isolated work by an author followed by the publishing of a completed artifact, but today some books are continuously and iteratively written and published through the online interactions of authors and readers. When people use their smart phones to search the web or run applications, location information transmitted from their phone is used to filter and reorganize the information they retrieve. Arranging results to make them fit the user’s location is a kind of computational curation, but because it takes place quickly and automatically we hardly notice it.
Likewise, almost every application that once seemed predominantly about information retrieval is now increasingly combined with activities and functions that most would consider to be information organization. Google, Microsoft, and other search engine operators have deployed millions of computers to analyze billions of web pages and millions of books and documents to enable the almost instantaneous retrieval of published or archival information. However, these firms increasingly augment this retrieval capability with information services that organize information in close to real-time. Further, the selection and presentation of search results, advertisements, and other information can be tailored for the person searching for information using his implicit or explicit preferences, location, or other contextual information.
Taken together, these innovations in technology and its application mean that the distinction between “information organization” and “information retrieval” that is often manifested in academic disciplines and curricula is much less important than it once was. This book has few sharp divisions between “information organization” (IO) and “information retrieval” (IR) topics. Instead, it explains the key concepts and challenges in the design and deployment of Organizing Systems in a way that continuously emphasizes the relationships and tradeoffs between IO and IR. The concept of the Organizing System highlights the design dimensions and decisions that collectively determine the extent and nature of resource organization and the capabilities of the processes that compare, combine, transform and interact with the organized resources.
Chapter 2. Developing a view that brings together how we organize as individuals with how libraries, museums, governments, research institutions, and businesses create Organizing Systems requires that we generalize the organizing concepts and methods from these different domains. Chapter 2 surveys a wide variety of Organizing Systems and describes four activities or functions shared by all of them: selecting resources, organizing resources, designing resource-based interactions and services, and maintaining resources over time.
Chapter 3. The design of an Organizing System is strongly shaped by what is being organized, the first of the five design decisions we introduced earlier in What is Being Organized?. To enable a broad perspective on this fundamental issue we use resource to refer to anything being organized, an abstraction that we can apply to physical things, digital things, information about either of them, or web-based services or objects. Chapter 3 discusses the challenges and methods for identifying the resources in an Organizing System in great detail and emphasizes how these decisions reflect the goals and interactions that must be supported – the “why” design decisions introduced in Why is it Being Organized?.
Chapter 4, “Resource Description and Metadata”. The principles by which resources are organized and the kinds of services and interactions that can be supported for them largely depend on the nature and explicitness of the resource descriptions. This “how much description” design question was introduced in How Much is it Being Organized?; Chapter 4 presents a systematic process for creating effective descriptions and analyzes how this general approach can be adapted for different types of Organizing Systems.
Chapter 5, “Describing Relationships and Structures”. An important aspect of organizing a collection of resources is describing the relationships between them. Chapter 5 introduces the specialized vocabulary used to describe semantic relationships between resources and between the concepts and words used in resource descriptions. It also discusses the structural relationships within multipart resources and between resources, like those expressed as citations or hypertext links.
Chapter 6, “Categorization: Describing Resource Classes and Types”. Groups or sets of resources with similar or identical descriptions can be treated as equivalent, making them members of an equivalence class or category. Identifying and using categories are essential human activities that take place automatically for perceptual categories like “red things” or “round things.” Categorization is deeply ingrained in language and culture, and we use linguistic and cultural categories without realizing it, but categorization can also be a deeply analytic and cognitive process. Chapter 6 reviews theories of categorization from the point of view of how categories are created and used in Organizing Systems.
Chapter 7, “Classification: Assigning Resources to Categories”. The terms categorization and classification are often used interchangeably but they are not the same. Classification is applied categorization – the assignment of resources to a system of categories, called classes, using a predetermined set of principles. Chapter 7 describes three different approaches to classification: faceted, social/distributed, and computational. The chapter briefly introduces some of the most commonly used classification systems in libraries and museums as well as new computational approaches for classifying email messages as spam or classifying music by genre.
Chapter 8, “Implementing Resource Descriptions”. Chapter 8 complements the conceptual and methodological perspective on the creation of resource descriptions with a technological and implementation perspective. Resource descriptions are often called metadata, literally “data about data” but the latter concept implies a narrower range of relationships between descriptions and the resources they describe than the former, which we prefer. Chapter 8 reviews the approaches for describing resources from the library and information science disciplines as well as the emerging perspectives from the semantic web, linked data, and microformat communities.
Chapter 9, “Interactions in Organizing Systems”. When Organizing Systems overlap, intersect, or are combined (temporarily or permanently), differences in resource descriptions can make it difficult or impossible to locate resources, access them, or otherwise impair their use. Chapter 9 reviews some of the great variety of concepts and techniques that different domains use when interacting with resources in Organizing Systems – integration, interoperability, data mapping, crosswalks, mashups, and so on. Similarly, processes for information retrieval are often characterized as comparing the description of a user’s needs with descriptions of the resources that might satisfy them. Chapter 9 extends and more broadly applies this core idea to describe IR and related applications of natural language processing (NLP) in terms of locating, comparing, and ranking descriptions.
Chapter 10, “The Organizing System Road Map”. Chapter 10 complements the descriptive perspective of Chapters 2-9 with a more prescriptive one that analyzes the design choices and tradeoffs that must be made in different phases in an Organizing System’s lifecycle. System lifecycle models exhibit great variety, but we use a generic four-phase model that distinguishes a domain identification and scoping phase, a requirements phase, a design and implementation phase, and an operational phase.
[1] [Citation]
Nunberg 1996, 2011. Buckland 1991. See also Bates 2006.
[2] [LIS]
Buckland (1997); Glushko and McGrath (2005) and others with an informatics or computer science perspective take an abstract view of “document” that separates its content from its presentation or container (see Identity and Information Components in Chapter 3). In contrast, the library science perspective often uses presentation or implementation properties in definitions of “document.” On authorship: when we say that “Charles Dickens is the author of A Tale of Two Cities” the meaning of “author” doesn’t depend on whether we have a printed copy or a Kindle copy of the book in mind, but what counts as authorship varies a great deal across academic disciplines. Furthermore, different standards for describing resources disagree in the precision with which they identify the person(s) or organization(s) primarily responsible for creating the intellectual content of the resource, which creates interoperability problems (See Chapter 9).
[3] [Citation]
Svenonius 2000.
[4] [Computing]
The URI identifies a resource as an abstract entity that can have multiple representations, which are the things that are actually exposed in application or user interfaces. The HTTP protocol can transfer the representation that best satisfies the content properties specified by a web client, most often a browser. This means that interactions with web resources are always with their representations rather than directly with the resource per se. The representation of the resource might seem to be implied by the URI (as when it ends in .htm or .html to suggest text in HTML format), but the URI is not required to indicate anything about the representation. A web resource can be a static web page, but it can also be dynamic content generated at the time of access by a program or service associated with the URI. Some resources like geolocations have no representations at all; the resource is simply some point or space and the interaction is “show me how to get there.” The browser and web server can engage in “content negotiation” to determine which representation to retrieve, and this is particularly important when the format requires an external application or “plug-in” to be rendered properly, as it does when the URI points to a PowerPoint file or other format not built into the browser. Internet architecture’s definition of resource as a conceptual entity that is never directly interacted with is difficult for most people to apply when the resources are physical or tangible objects, because then it surely seems like we are interacting with something real. So we will most often talk about interactions with resources, and will mention resource representations only when it is necessary to align precisely with the narrower Internet architecture sense.
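The “content negotiation” mentioned in this note can be illustrated with a small sketch. It is a deliberate simplification, not the full HTTP algorithm: it ranks the media types in a client's Accept header by their q-values and returns the first one the server can supply. The function names (`parse_accept`, `negotiate`) and the sample representations are hypothetical and exist only for illustration.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # a media type without an explicit q-value defaults to 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their order in the header
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header, available):
    """Return the first available media type the client accepts, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return next(iter(available), None)
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" becomes "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in available:
            return media_type
    return None


# Hypothetical representations of one abstract resource (assumed for this sketch).
REPRESENTATIONS = {
    "text/html": "<p>Hello</p>",
    "application/json": '{"greeting": "Hello"}',
}

chosen = negotiate("application/json;q=0.9, text/html;q=0.8, */*;q=0.1", REPRESENTATIONS)
print(chosen)
```

The same URI yields a different representation depending on what the client says it can handle, which is why the note stresses that interactions are with representations rather than with the resource itself.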
[5] [Business]
The intellectual resources of a firm are embodied in a firm’s people, systems, management techniques, history of strategy and design decisions, customer relationships, and intellectual property like patents, copyrights, trademarks, and brands. Some of this knowledge is explicit, tangible, and traceable in the form of documents, databases, organization charts, and policy and procedure manuals. But much of it is tacit: informal and not systematized in tangible form because it is held in the minds and experiences of people; a synonym is “know-how.” A more modern term is “Intellectual Capital,” a concept originated in a 1997 book with that title (Stewart, 1997).
[6] [Citation]
Banzhaf, 2009.
[7] [Computing]
The “plain web” (Wilde, 2008), whose evolution is managed by the World Wide Web Consortium, is rigorously standardized, but unfortunately the larger ecosystem of technologies and formats in which the web exists is becoming less so. Web-based organizing systems often contain proprietary media formats and players (like Flash) or are implemented as closed environments that are intentionally isolated from the rest of the web (like Facebook or Apple’s iTunes and other smart phone “app stores”).
[8] [Computing]
Instead of thinking of a digital book as a “parallel resource” to a printed book, we could consider both of them as alternate representations of the same abstract resource that are linked together by an “alternative” relationship, just as we can use the HTML “alt” attribute to associate text with an image so its content and function can be understood by text-only readers.
[9] [Computing]
For collections of non-trivial size the choice of searching or sorting algorithm in computer programs is a critical design decision because they differ greatly in the time they take to complete and the storage space they require. For example, if the collection is arranged in an unorganized or random manner (as a “pile”) and every resource must be examined, the time to find a particular item increases linearly with the collection size. If the collection is maintained in an ordered manner, a binary search algorithm can locate any item in a time proportional to the logarithm of the number of items. Analysis of algorithms is a fundamental topic in computer science; a popular textbook is “Introduction to Algorithms” by Thomas Cormen et al (2009).
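The difference between the two growth rates can be made concrete with a small sketch that counts how many items each approach examines. The function names and the one-million-item collection are assumptions chosen for illustration; the linear version treats the collection as a “pile,” and the binary version relies on the collection being kept in order.

```python
def linear_search(pile, target):
    """Examine items one by one; return (index, number of items examined)."""
    for examined, item in enumerate(pile, start=1):
        if item == target:
            return examined - 1, examined
    return -1, len(pile)


def binary_search(ordered, target):
    """Halve the candidate range at each step; return (index, number of steps)."""
    lo, hi, steps = 0, len(ordered), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if ordered[mid] < target:
            lo = mid + 1      # target can only be in the upper half
        elif ordered[mid] > target:
            hi = mid          # target can only be in the lower half
        else:
            return mid, steps
    return -1, steps


# An ordered collection of one million items; searching for the last one
# is the worst case for the linear scan.
collection = list(range(1_000_000))
target = 999_999

linear_index, linear_steps = linear_search(collection, target)
binary_index, binary_steps = binary_search(collection, target)
print("linear scan examined", linear_steps, "items")
print("binary search took", binary_steps, "steps")
```

For a collection of this size the linear scan examines every item, while the binary search needs at most about twenty steps, because each step discards half of the remaining candidates.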
[10] [LIS]
For precise distinctions, see the US Department of Labor, Bureau of Labor Statistics occupational outlook handbooks at http://www.bls.gov/oco/ocos065.htm and http://www.bls.gov/oco/ocos068.htm and http://www.michellemach.com/jobtitles/realjobs.html
[11] [LIS]
The four objectives listed in this paragraph are those proposed in 1997 by the International Federation of Library Associations and Institutions (IFLA). The first statement of the objectives for a bibliographic system was made by Cutter (1876), which Svenonius (2000) says is likely the most cited text in the bibliographic literature. Cutter called his three objectives “finding,” “co-locating,” and “choice.”
[12] [Law]
Copyright law, license or contract agreements, terms of use and so on that shape interactions with resources are part of the Organizing System, but compliance with them might not be directly implemented as part of the system. With digital resources, digital rights management (DRM), passwords, and other security mechanisms can be built into the Organizing System to enforce compliance.
[13] [Computing]
Sometimes many of these Organizing Systems and their associated applications are implemented using a unified storage foundation provided by an enterprise content management (ECM) or enterprise data management (EDM) system. An integrated storage tier can improve the integrity and quality of the information but is invisible to users of the applications.
[14] [Citation]
IFLA 2011
[15] [Law]
In 2004 Google began digitizing millions of books from several major research libraries with the goal of making them available through its search engine (Brin, 2009). But many millions of these books are still in copyright, and in 2005 Google was sued for copyright infringement by several publishers and an authors’ organization. In 2011 a US District Court judge rejected the proposed settlement the parties had negotiated in 2008 because many others objected to it, including the US Justice Department, several foreign governments, and numerous individuals (Samuelson, 2011). The major reason for the rejection was that the settlement was a “bridge too far” that went beyond the claims made against Google to address issues that were not in litigation. In particular, the judge objected to the treatment of the so-called “orphan books” that were still under copyright but out of print because money they generated went to the parties in the settlement and not to the rights holders who could not be located (why the books are “orphans”) or to defray the costs of subscriptions to the digital book collection. The judge also was concerned that the settlement didn’t adequately address the concerns of academic authors – who wrote most of the books scanned from research libraries – who might prefer to make their books freely available rather than seek to maximize profits from them. Other concerns were that the settlement would have entrenched Google’s monopoly in the search market and that there were inadequate controls for protecting the privacy of readers. Google’s plan would have dramatically increased access to out of print books, and the rejection of the proposed settlement has heightened calls for an open public digital library (Darnton, 2011), which could perhaps be started using the digital copies that the research libraries received in return for giving Google books to scan. In 2010 the Alfred P. Sloan Foundation provided funding to launch the Digital Public Library of America (DPLA, 2011). This non-proprietary goal might induce the US Congress and other governments to pass legislation that fixes the copyright problems for orphan works.
[16] [Citation]
Trant 2009a. Gilliland-Swetland, 2000.
[17] [LIS]
Organizing Systems that follow the rules set forth in the Functional Requirements for Bibliographic Records (FRBR) (Tillett, 2004) treat all the Macbeths as the same “work.” However, they also enforce a hierarchical set of distinctions for finer-grained organization. FRBR views books and movies as different “expressions,” different print editions as “manifestations,” and each distinct physical thing in a collection as an “item.” This Organizing System thus encodes the degree of intellectual equivalence while enabling separate identities where the physical form is important, which is often the case for scholars.
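The four-level hierarchy described in this note can be sketched as a nested data structure. This is only an illustrative model under stated assumptions, not an implementation of the FRBR specification: the class names mirror the FRBR terms, while the labels, identifiers, and the `all_items` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    identifier: str  # one distinct physical thing in a collection


@dataclass
class Manifestation:
    label: str  # for example, a particular print edition
    items: list = field(default_factory=list)


@dataclass
class Expression:
    label: str  # for example, the book versus a movie
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    title: str  # the abstract intellectual creation
    expressions: list = field(default_factory=list)

    def all_items(self):
        """Collect every physical item that embodies this work."""
        return [
            item
            for expression in self.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items
        ]


# Hypothetical holdings, invented for this sketch.
macbeth = Work("Macbeth", [
    Expression("book", [
        Manifestation("print edition A", [Item("copy-1"), Item("copy-2")]),
        Manifestation("print edition B", [Item("copy-3")]),
    ]),
    Expression("movie", [
        Manifestation("DVD release", [Item("disc-1")]),
    ]),
])

print([item.identifier for item in macbeth.all_items()])
```

Asking for all items of the work treats every Macbeth as equivalent, while walking down one branch preserves the separate identities of editions and copies.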
[18] [LIS]
Typical examples of archives might be national or government document collections or the specialized Julia Morgan archive at the University of California, Berkeley (Online Archive of California, 2011), which houses documents by the famous architect who designed many of the university’s most notable buildings as well as the famous Hearst Castle along the central California coast. The “original order” organizing principle of archival Organizing Systems was first defined by 19th century French archivists and is often described as “respect pour les fonds.”
[19] [Computing]
For example, many people manage their digital photos with Flickr, their home libraries with Library Thing, and their preferences for dining and shopping with Yelp. It is possible to use these “tagging” sites solely in support of individual goals, as tags like “myfamily,” “toread,” or “buythis” clearly demonstrate. But maintaining a personal Organizing System with these web applications potentially augments the individual’s purpose with social goals like conveying information to others, developing a community, or promoting a reputation. Furthermore, because these community or collaborative applications aggregate and share the tags applied by individuals, they shape the individual Organizing Systems embedded within them when they suggest the most frequent tags for a particular resource.
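How aggregated tags can feed back into individual organizing can be shown with a minimal sketch. It assumes a simple list of (user, resource, tag) records and a hypothetical `suggest_tags` function; real tagging sites are certainly more elaborate, so this illustrates only the basic idea of suggesting the most frequent tags for a resource.

```python
from collections import Counter


def suggest_tags(taggings, resource, n=3):
    """Return the n tags most often applied to a resource across all users."""
    counts = Counter(tag for _user, res, tag in taggings if res == resource)
    return [tag for tag, _count in counts.most_common(n)]


# Hypothetical tagging records: (user, resource, tag).
taggings = [
    ("ana", "photo42", "sunset"),
    ("ben", "photo42", "sunset"),
    ("cat", "photo42", "beach"),
    ("ana", "photo42", "myfamily"),
    ("dan", "photo42", "beach"),
    ("eve", "photo42", "sunset"),
    ("ben", "book7", "toread"),
]

print(suggest_tags(taggings, "photo42", n=2))
```

A personal tag like “myfamily” remains in the data, but because it is rare it is not suggested to other users, which nudges later taggers toward the shared vocabulary.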
[20] [Citation]
Bowker and Star (2000).
[21] [Citation]
Malone (1983) is the seminal research study, but individual differences in organizing preferences were the basis of Neil Simon’s Broadway play “The Odd Couple” in 1965, which then spawned numerous films and TV series.
[22] [Citation]
See Grudin’s classic work on non-technological barriers to the successful adoption of collaboration technology (Grudin, 1994).
[23] [LIS]
Sir Anthony Panizzi is most often associated with the origins of modern library cataloging. In 1841 he published 91 cataloging rules for the British Museum library (Panizzi, 1841) that defined authoritative forms for titles and author names, but the complexity of the rules and the resulting resource descriptions were widely criticized. For example, the famous author and historian Thomas Carlyle argued that a library catalog should be nothing more than a list of the names of the books in it. Standards for bibliographic description are essential if resources are to be shared between libraries. See Denton (2007) and Anderson and Perez-Carballo (2001a, 2001b).
[24] [Computing]
At a minimum, these descriptions include the creation time and storage format for the resource, or an auto-assigned filename that reflects chronological order (IMG00001.JPG, IMG00002.JPG, etc.), but often they are much more detailed. Most digital cameras annotate each photo with detailed information about the camera and its settings in the Exchangeable Image File Format (EXIF), and many mobile phones can associate their location along with any digital object they create. Nevertheless, these descriptions are not always correct. For example, Microsoft Office applications extract the author name from any template associated with a document, presentation, or spreadsheet and then embed it in the new documents. And if you haven’t set the time correctly in your digital camera any timestamp it associates with a photo will be wrong.
[25] [LIS]
Nunberg (2009) calls Google’s Book Search a “disaster for scholars” and a “metadata train wreck.” He lists scores of errors in titles, publication dates, and classifications. For example, he reports that a search on “Internet” in books published before 1950 yields 527 results. The first 10 hits for Whitman’s “Leaves of Grass” are variously classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism, Biography & Autobiography, and Counterfeits and Counterfeiting.
[26] [LIS]
This is an important distinction in library science education and library practice. Individual resources are described (“formal” cataloguing) using “bibliographic languages” and their classification in the larger collection is done using “subject languages” (Svenonius 2000, Chapters 4 and 8, respectively). These two practices are generally taught in different library school courses because they use different languages, methods and rules and are generally carried out by different people in the library. In other organizations, the resource description (both formal and subject) is created in the same step and by the same person.
[27] [Citation]
NAPO.net. Downsizingdiva.com.
[28] [Computing]
He et al. (2007) estimate that there are hundreds of thousands of web sites and databases whose content is accessible only through query forms and web services, and there are over a million of those. The amount of content in this hidden web is many hundreds of times larger than that accessible in the surface or visible web.
[29] [Citation]
The “manifesto” for Web 2.0 is Tim O’Reilly’s “What is Web 2.0?” (http://oreilly.com/web2/archive/what-is-web-20.html). “Folksonomy” was coined by Thomas Vander Wal at about the same time in 2004; see http://vanderwal.net/folksonomy.html and Trant (2009b). The term “Crowdsourcing” was invented by Jeff Howe in a June 2006 article in Wired magazine, http://www.wired.com/wired/archive/14.06/crowds.html, and the concept was developed further in a book published two years later (Howe, 2008). Millen et al (2005) describe an enterprise application of social bookmarking at IBM called Dogear. The Library 2.0 idea is presented in Maness (2006), and several more recent surveys of Web 2.0 features in university library web sites have been reported by Xu et al (2009) and Harinarayana and Raju (2010). Nina Simon’s book on “The Participatory Museum” is itself an example of Web 2.0 concepts, available online with reader comments (www.participatorymuseum.org/read/). For Science 2.0, see Shneiderman (2008). For Government 2.0, see Robinson et al (2009) and Drapeau (2010).
_To appear in The Discipline of Organizing, 2012_ Robert J. Glushko, Erik Wilde, Jess Hemerly
There are four activities that occur naturally in every organizing system; how explicit they are depends on the scope, the breadth or variety of the resources, and the scale, the number of resources that the organizing system encompasses. Consider the routine, everyday task of managing your wardrobe. When you organize your clothes closet, you are unlikely to write a formal selection policy that specifies what things go in the closet. You do not consciously itemize and prioritize the ways you expect to search for and locate things, and you are unlikely to consider explicitly the organizing principles that you use to arrange them. From time to time you will put things back in order and discard things you no longer wear, but you probably will not schedule this as a regular activity on your calendar.
Your clothes closet is an organizing system, defined in Chapter 1 as “an intentionally arranged collection of resources and the interactions they support.” As such, it exposes these four highly interrelated and iterative activities: selection, organizing, interaction design, and maintenance.
Figure 2.1 illustrates these four activities in all organizing systems, framing the depiction of the organizing and interaction design activities shown in Figure 1.1 with the selection and maintenance activities that necessarily precede and follow them.
These activities can be informal for your clothes closet because its scope and scale are limited. In institutional organizing systems the selection, organizing, interaction design, and maintenance activities are often highly formal; they are deeply ingrained in academic curricula and professional practices, with domain-specific terms for their methods and results.
For example, libraries and museums usually make their selection principles explicit in collection development policies. Adding a resource to a library collection is called acquisition, but adding to a museum collection is called accessioning. Documenting the contents of library and museum collections to organize them is called cataloging. Circulation is a central interaction in libraries, but because museum resources don’t circulate, the primary interactions for museum users are viewing or visiting the collection. Maintenance activities are usually described as preservation or curation.
In contrast, in business information systems, selection of resources can involve data generation, capture or extraction. Adding resources could involve loading, integration or insertion. Schema development and data transformation are important organizing activities. Supported interactions could include querying, reporting, analysis, or visualization. Maintenance activities are often described as data cleaning, data cleansing, governance, or compliance.
These domain-specific methods and vocabularies evolve over time to capture the complex and distinctive sets of experiences and practices of their respective disciplines. We can identify correspondences and overlapping meanings, but they are not synonyms or substitutes for each other. We propose more general terms like selection and maintenance, not as lowest common denominator replacements for these more specialized ones, but to facilitate communication and cooperation across the numerous disciplines that are concerned with organizing.
It might sound odd to describe the animals in a zoo as resources, to think of viewing a painting in a museum as an interaction, or to say that destroying information to comply with privacy regulations is maintenance. Taking a broader perspective on the activities in organizing systems so that we can identify best practices and patterns enables people with different backgrounds and working in different domains to understand and learn from each other.
Part of what a database administrator can learn from a museum curator follows from the rich associations the curator has accumulated around the concept of curation that are not available around the more general concept of maintenance. Without the shared concept of maintenance to bridge their disciplines, this learning could not take place.
In “The Concept of ‘Resource’” and “What is Being Organized?” we briefly discussed the fundamental concept of a resource. In this chapter, we describe the four primary activities with resources, using examples from many different kinds of organizing systems. We emphasize the activities of organizing and of designing resource-based interactions that make use of the organization imposed on the resources. We discuss selection and maintenance to create the context for the organizing activities and to highlight the interdependencies of organizing and these other activities. This broad survey enables us to compare and contrast the activities in different resource domains, setting the stage for a more thorough discussion of resources and resource description in Chapters 3 and 4.
When we talk about organizing systems, we often do so in terms of the contents of their collections. This implies that the most fundamental decision for an organizing system is determining its resource domain, the group or type of resources that are being organized. This decision is usually a constraint, not a choice; we acquire or encounter some resources that we need to interact with over time, and we need to organize them so we can do that effectively.
Selecting is the process by which resources are identified, evaluated, and then added to a collection in an organizing system. Selection is first shaped by the domain and then by the scope of the organizing system, which can be analyzed through six interrelated aspects:
In Chapter 10, “The Organizing System Road Map,” we discuss these six aspects in more detail.
Selection must be an intentional process because by definition an organizing system contains resources whose selection and arrangement were determined by human or computational agents, not by natural processes. Selection methods and criteria vary across resource domains. Resource selection policies are often shaped by laws, regulations or policies that require or prohibit the collection of certain kinds of objects or types of information.[30]
Libraries and museums typically formalize their selection principles in collection development policies that establish priorities for acquiring resources that reflect the people they serve and the services they provide to them. Digitization is substantially changing how libraries select resources. Digital content can be delivered anywhere quickly and cheaply, making it easier for a group of cooperating libraries to share resources. For example, while each campus of the University of California system has its own libraries and library catalogs, system-wide catalogs and digital content delivery reduce the need for every campus to have any particular resource in its own collection.[31]
Adding a resource to a museum implies an obligation to preserve it forever, so many museums follow rigorous accessioning procedures before accepting it. Likewise, archives usually perform an additional appraisal step to determine the quality and value of materials offered to them.[32]
In the for-profit sector, well-run firms are similarly systematic in selecting the resources that must be managed and the information needed to manage them. The organizing systems for managing sales, orders, customers, inventory, personnel, and finance information are tailored to the specific information needed to run that part of the company’s operations. Identifying this information is the job of business analysts and data modelers. Much of this operational data is combined in huge “data warehouses” to support the “business analytics” function in which novel combinations and relationships among data items are explored by selecting subsets of the data.[33]
Digitization has had extremely important impacts on the manner in which collections of information resources are created in information-intensive domains such as transportation, retailing, supply chain management, healthcare, energy management, and “big science” where a torrent of low-level information is captured from GPS devices, RFID tags, sensors and science labs. Businesses that once had to rely on limited historical data analysis and printed reports now have to deal with a constant stream of real-time information.
An analogous situation has evolved with personal collections of photographs. Less than two decades ago, before the digital camera became a consumer product, the time and expense of developing photographs induced people to take photos carefully and cautiously. Today the proliferation of digital cameras and photo-capable phones has made it so easy to take digital photos and videos that people are less selective and take many photos or videos of the same scene or event.
Selection is an essential activity in creating organizing systems whose purpose is to combine separate web services or resources to create a composite service or application according to the business design philosophy of “Service Oriented Architecture” (SOA).[34] When an information-intensive enterprise combines its internal services with outsourced ones provided by other firms, the resources are selected to create a combined collection of services according to the “core competency” principle: resources are selected and combined to exploit the enterprise’s internal capabilities and those of its service partners better than any other combination of services could.[35]
The nature and scale of the web change how we collect resources and fundamentally challenge how we think of resources in the first place. Web-based resources cannot be selected for a collection by consulting a centralized authoritative directory, catalog, or index because one does not exist. And although your favorite web search engine consults an index or directory of web resources when you enter a search query, you do not know where that index or directory came from or how it was assembled.[36]
The contents of a collection and how it is organized always reflect its intended users and uses. But the web has universal scope and global reach, making most of the web irrelevant to most people most of the time. Researchers have attacked this problem by treating the web as a combination of a very large number of topic-based or domain-specific collections of resources, and then developing techniques for extracting these collections as digital libraries targeted for particular users and uses.[37]
Even when the selection principles behind a collection are clear and consistent, they can be unconventional, idiosyncratic, or otherwise biased by the perspective and experience of the collector. This is sometimes the case in museum or library collections that began or grew opportunistically through the acquisition of private collections that reflect a highly individual point of view.
It is especially easy to see the collector’s point of view in personal collections. Most of the clothes and shoes you own have a reason for being in your closet, but could anyone else explain the contents of your closet and its organizing system, and why you bought that crazy-looking dress or shirt?
Organizing systems arrange their resources according to many different principles. In libraries, museums, businesses, government agencies and other long-lived institutions these organizing principles are typically documented as cataloging rules, information management policies, or other explicit and systematic procedures so that different people can apply them consistently over time. In contrast, the principles for arranging resources in personal or small-scale organizing systems are not usually stated in any formal way and might even be inconsistent or conflicting.
For most types of resources, any number of principles could be used as the basis for their organization depending on the answers to the “why?” (Why is it Being Organized?), “how much?” (How Much is it Being Organized?), and “how?” (How (or by Whom) is it Organized?) questions posed in Chapter 1.
A simple principle for organizing resources is “co-location” – putting them in the same place. However, most organizing systems use principles that are based on specific resource properties or properties derived from the collection as a whole. Which properties are significant and how to think about them depend on the number of resources being organized, the purposes for which they are being organized, and the experiences and implicit or explicit biases of the intended users of the organizing system. The implementation of the organizing system also shapes the need for, and the nature of, the resource properties.[38]
Many resource collections – even very large ones – acquire resources one at a time or in sets of related resources that can initially be treated in the same way. Therefore, it is natural to arrange resources based on properties of individual resources that can be assessed and interpreted when the resource is selected and becomes part of the collection.
This means that decisions about which resource properties will be used in organizing must often precede the creation or collection of the resources. This is especially critical for archeologists, naturalists, and scientists of every type. Without information about the context of creation or discovery, what might otherwise be important resources could be just a handful of pottery shards, a dead animal, or some random set of bits on a computer.
“Subject matter” organization involves the use of a classification system that provides categories and descriptive terms for indicating what a resource is about. Because they use properties like “aboutness” that are not directly perceived, methods for assigning subject classifications are intellectually intensive and require rigorous training to be performed consistently and appropriately for the intended users.[39] The cost and time required for this human effort motivate many organizing systems for information resources to use computational approaches for arranging them.
When the resources being arranged are physical or tangible things – such as books, paintings, animals, or cooking pots – any resource can be in only one place at a time in libraries, museums, zoos, or kitchens. Similarly, when organizing involves recording information in a physical medium – carving in stone, imprinting in clay, applying ink to paper by hand or with a printing press – how this information can be organized is subject to the intrinsic properties and constraints of physical things.
The inescapable tangibility of physical resources means that their organizing systems are often strongly influenced by the material or medium in which the resources are presented or represented. For example, museums generally collect original artifacts and their collections are commonly organized according to the type of thing being collected. There are art museums, sculpture museums, craft museums, toy museums, science museums, and so on.
Similarly, because they have different material manifestations, we usually organize our printed books in a different location than our record albums, which might be near but remain separate from our CDs and DVDs. This is partly because the storage environments for physical resources (shelves, cabinets, closets, and so on) have co-evolved with the physical resources they store.[40]
The resource collections of organizing systems in physical environments often grow to fit the size of the environment or place in which they are maintained – the bookshelf, closet, warehouse, library or museum building. Their scale can be large: the Smithsonian in Washington, D.C., the world’s largest museum and research complex, consists of 19 museums, 9 research facilities, a zoo and a library with 1.5 million books. However, at some point, any physical space gets too crowded, and it is difficult and expensive to add new floors or galleries to an existing library or museum.
Physical resources are often organized according to intrinsic physical properties like their size, color or shape, or intrinsically associated properties such as the place and time they were created or discovered. The shirts in your clothes closet might be arranged by color, fabric, or style. We can view dress shirts, T-shirts, Hawaiian shirts and other styles as configurations of shirt properties that are so frequent and familiar that they have become linguistic and cultural categories. Other people might think about these same properties or categories differently, using a greater or lesser number of colors or ordering them differently, sorting the shirts by style first and then by color, or vice versa.
In addition to, or instead of, physical properties of your shirts, you might employ behavioral or usage-based properties to arrange them. You might separate your party and Hawaiian shirts from those you wear to the office. You might put the shirts you wear most often in the front of the closet so they are easy to locate. Unlike intrinsic properties of resources, which do not change, behavioral or usage-based properties are dynamic. You might move to Hawaii, where you can wear Hawaiian shirts to the office, or you could get tired of what were once your favorite shirts and stop wearing them as often as you used to.
Some arrangements of physical resources are constrained or precluded by resource properties that might cause problems for other resources or for their users. Hazardous or flammable materials should not be stored where they might spill or ignite; lions and antelopes should not share the same zoo habitat or the former will eat the latter; and adult books and movies should not be kept in a library where children might accidentally find them. For almost any resource, it seems possible to imagine a combination with another resource that might have unfortunate consequences. We have no shortage of professional certifications, building codes, MPAA movie ratings, and other types of laws and regulations designed to keep us safe from potentially dangerous resources.
To overcome the inherent constraints of organizing physical resources, organizing systems often use additional physical resources that describe the primary physical ones, with the library card catalog being the classic example. A specific physical resource might be in a particular place, but multiple description resources for it can be in many different places at the same time.
When the description resources are themselves digital, as when the printed library card catalog is put online, the additional layer of abstraction created enables additional organizing possibilities that can ignore physical properties of resources and many of the details about how they are stored.
In organizing systems that use additional resources to identify or describe primary ones, “adding” to a collection is a logical act that need not require any actual movement, copying, or reorganization of the primary resources. This virtual addition allows the same resources to be part of many collections at the same time; the same book can be listed in many bibliographies, the same web page can be in many lists of web bookmarks and have incoming links from many different pages, and a publisher’s digital article repository can be licensed to any number of libraries.
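The idea of virtual addition can be illustrated with a minimal sketch, assuming that each primary resource has an identifier and that a collection is nothing more than a set of such identifiers. The identifiers, titles, collection names, and function names below are hypothetical and chosen only for illustration.

```python
# A single store of primary resources, keyed by a (hypothetical) identifier.
resources = {
    "book-001": {"title": "A Hypothetical Book", "location": "Main stacks, shelf 12"},
}

# Each collection holds only identifiers that point at primary resources.
collections = {
    "course-reading-list": set(),
    "bibliography": set(),
}


def add_to_collection(collection_name, resource_id):
    """Logically add a resource to a collection without moving or copying it."""
    if resource_id not in resources:
        raise KeyError(f"unknown resource: {resource_id}")
    collections[collection_name].add(resource_id)


def collections_containing(resource_id):
    """Return the names of all collections that list the resource."""
    return sorted(name for name, ids in collections.items() if resource_id in ids)


add_to_collection("course-reading-list", "book-001")
add_to_collection("bibliography", "book-001")
```

After both calls the book is listed in two collections, yet its stored location has not changed and no copy has been made; only the sets of identifiers grew.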
Organizing systems that arrange digital resources like digital documents or information services have some important differences from those that organize physical resources. Because digital resources can be easily copied or interlinked, they are free from the “one place at a time” limitation.[41] The actual storage locations for digital resources are no longer visible or very important. It hardly matters if a digital document or video resides on a computer in Berkeley or Bangalore if it can be located and accessed efficiently.[42]
Moreover, because the functions and capabilities of digital resources are not directly manifested as physical properties, the constraints imposed on all material objects do not matter to digital content in many circumstances.[43] [44]
An organizing system for digital resources can also use digital description resources that are associated with them. Since the incremental costs of adding processing and storage capacity to digital organizing systems are small, collections of both primary digital resources and description resources can be arbitrarily large. Digital organizing systems can support collections and interactions at a scale that is impossible in organizing systems that are entirely physical, and they can implement services and functions that exploit the exponentially growing processing, storage and communication capabilities available today.[45]
There are inherently more choices in the arrangement of digital resources than there are for physical ones, but this difference emerges as much from the multiple implementation platforms for the organizing system as from the nature of the resources. Nevertheless, the organizing systems for digital books, music and video collections often maintain the distinctions embodied in the organizing systems for physical resources, either because doing so enables their co-existence or simply because of legacy inertia. As a result, the organizing systems for collections of digital resources tend to be coarsely distinguished by media type (e.g., document management, digital music collection, digital video collection, digital photo collection, etc.).
Information resources in either physical or digital form are typically organized using intrinsic properties like author names, creation dates, publisher, or the set of words that they contain. Information resources can also be organized using extrinsic or behavioral properties like subject classifications, assigned names or identifiers, or even access frequency.[46]
Complex organization and interactions are possible when organizing systems with digital resources are based on the datatype or data model of the digital content (e.g., text, numeric, multimedia, statistical, geospatial, logical, scientific, or personnel data). These distinctions are often strongly identifiable with business functions: operational, transactional and process control activities require the most fine-grained data, while strategic functions might rely on more qualitative analyses represented in narrative text formats. Managerial and decision support functions might require a mixture of digital content types.
With digital resources we don’t have to worry about hazardous resources blowing up or one resource eating another, although viruses, worms and other malevolent agents present threats to digital resources as dire as those faced by the zoo antelopes if lions are kept too close. Accordingly, just as there are many laws and regulations that restrict the organization of physical resources, there are laws and regulations that constrain the arrangements of digital ones. Many information systems that generate or collect transactional data are prohibited from sharing any records that identify specific people. Banking, accounting, and legal organizing systems are made more homogeneous by compliance and reporting standards and rules.
The Domain Name System is the most inherent scheme for organizing web resources, and top-level domains for generic resource categories (.com, .edu, .org, .gov, etc.) and countries provide some clues about the resources organized by a web site. These clues are most reliable for large established enterprises and publishers; we know what to expect at ibm.com, berkeley.edu, and jstor.org.[47]
The network of hyperlinks among web resources challenges the notion of a collection, because it makes it impossible to define a precise boundary around any collection smaller than the complete web.[48] Furthermore, authors are increasingly using “web-native” publication models, creating networks of articles that blur the notions of articles and journals. For example, scientific authors are linking scientific findings to their underlying research data, to discipline-specific data repositories, or to software for analyzing, visualizing, simulating, or otherwise interacting with the information.[49]
The conventional library is both a collection of books and the physical space in which the collection is managed. On the web, rich hyperlinking and the fact that the actual storage location of web resources is unimportant to the end users fundamentally undermine the idea that organizing systems must collect resources and then arrange them under some kind of local control to be effective. The spectacular rise and fall of the AOL “walled garden,” created on the assumption that the open web was unreliable, insecure, and pernicious, was for a time a striking historical reminder and warning to designers of closed resource collections.[50] But Facebook so far is succeeding by following a walled garden strategy.
Multiple properties of the resources, the person organizing or intending to use them, and the social and technological environment in which they are being organized can collectively shape their organization. For example, the way you organize your home kitchen is influenced by the physical layout of counters, cabinets, and drawers; the dishes you cook most often; your skills as a cook, which may influence the number of cookbooks, specialized appliances and tools you own and how you use them; the sizes and shapes of the packages in the pantry and refrigerator; and even your height.
If multiple resource properties are considered in a fixed order, the resulting arrangement forms a logical hierarchy. The top level categories of resources are created based on the values of the property evaluated first, and then each category is further subdivided using other properties until each resource is classified in only a single category. A typical example of hierarchical arrangement for digital resources is the system of directories or folders used by a professor to arrange his personal document collection in a computer file system; the first level distinguishes personal documents from work-related documents; work is then subdivided into teaching and research, teaching is subdivided by year, and year divided by course. For physical resources, an additional step of mapping categories to physical locations is required; for example, resources in the category “kitchen utensils” might all be arranged in drawers near a workspace, with “silverware” arranged more precisely to separate knives, forks, and spoons.
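The fixed-order subdivision described above can be sketched in a few lines of Python. This is a simplified illustration rather than a definitive implementation: the file names and the property names `area`, `activity`, and `year` are hypothetical, and every document is assumed to carry a value for every property.

```python
from collections import defaultdict


def build_hierarchy(resources, property_order):
    """Nest resources by evaluating their properties in a fixed order."""
    if not property_order:
        # No properties left to evaluate: this is a leaf category.
        return [r["name"] for r in resources]
    first, rest = property_order[0], property_order[1:]
    groups = defaultdict(list)
    for r in resources:
        groups[r[first]].append(r)
    # Subdivide each category using the remaining properties.
    return {value: build_hierarchy(members, rest) for value, members in groups.items()}


documents = [
    {"name": "syllabus.pdf", "area": "work", "activity": "teaching", "year": 2011},
    {"name": "lecture1.pdf", "area": "work", "activity": "teaching", "year": 2012},
    {"name": "grant.doc", "area": "work", "activity": "research", "year": 2012},
    {"name": "taxes.xls", "area": "personal", "activity": "finance", "year": 2012},
]

tree = build_hierarchy(documents, ["area", "activity", "year"])
```

Here `tree["work"]["teaching"][2012]` holds only `lecture1.pdf`. Passing the same properties in a different order would produce a different hierarchy over the same documents, which is why the order has to be fixed for a hierarchical arrangement to be stable.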
An alternative to hierarchical organization that is often used in digital organizing systems is faceted classification, in which the different properties for the resources can be evaluated in any order. For example, you can select wines from the wine.com store catalog by type of grape, cost, or region and consider these property facets in any order. Three people might each end up choosing the same moderately-priced Kendall Jackson California Chardonnay, but one of them might have started the search based on price, one based on the grape varietal, and the third with the region. This kind of interaction in effect generates a different logical hierarchy for every different combination of property values, and each user makes the final selection from a different set of wines.
Another way to understand faceted classification is that it allows a collection of description resources to be dynamically re-organized into as many categories as there are combinations of values on the descriptive facets, depending on the priority or point of view the user applies to the facets. Of course this only works because the physical resources are not themselves being rearranged, only their digital descriptions.
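The wine example can be made concrete with a small sketch. The catalog below is invented for illustration (the wine names, facet names, and price bands are assumptions, not data from any real store); what it shows is that the same facet values applied in different orders lead through different intermediate sets to the same final choice.

```python
# A hypothetical catalog of description resources, one record per wine.
catalog = [
    {"name": "Wine A", "grape": "Chardonnay", "region": "California", "price": "moderate"},
    {"name": "Wine B", "grape": "Chardonnay", "region": "California", "price": "premium"},
    {"name": "Wine C", "grape": "Chardonnay", "region": "Burgundy", "price": "premium"},
    {"name": "Wine D", "grape": "Pinot Noir", "region": "California", "price": "moderate"},
    {"name": "Wine E", "grape": "Merlot", "region": "Washington", "price": "moderate"},
    {"name": "Wine F", "grape": "Pinot Noir", "region": "Oregon", "price": "moderate"},
    {"name": "Wine G", "grape": "Zinfandel", "region": "California", "price": "premium"},
]


def select(records, choices):
    """Apply (facet, value) choices in order; record the size of each intermediate set."""
    remaining = list(records)
    path_sizes = []
    for facet, value in choices:
        remaining = [r for r in remaining if r[facet] == value]
        path_sizes.append(len(remaining))
    return [r["name"] for r in remaining], path_sizes


wanted = {"grape": "Chardonnay", "region": "California", "price": "moderate"}

# Three shoppers apply the same facet values in three different orders.
by_price = select(catalog, [(f, wanted[f]) for f in ("price", "grape", "region")])
by_grape = select(catalog, [(f, wanted[f]) for f in ("grape", "region", "price")])
by_region = select(catalog, [(f, wanted[f]) for f in ("region", "price", "grape")])
```

All three shoppers end with the same single wine, but the sizes of the sets they pass through differ, which corresponds to each of them traversing a different logical hierarchy. Only the description records are filtered; nothing about the bottles themselves is rearranged.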
Chapter 7, “Classification,” explains principles and methods for hierarchical and faceted classification in more detail.
There would be no point in selecting and organizing resources if they could not be accessed or interacted with in some way. Organizing systems vary a great deal in the types of resource-based interactions they enable and in the nature and extent of access they allow.
It is essential to distinguish the interactions that are designed into and directly supported by an organizing system from those that can take place with resources after they have been accessed. For example, when a book is checked out of a library it might be read, translated, summarized, criticized, or otherwise used – but none of these interactions are directly designed into the library. We need to focus on the interactions that are enabled because of the intentional acts of description or arrangement that transform a collection of resources into an organizing system. Note that some of these interactions might be explicitly supported in an organizing system containing digital books, as in Google’s search engine where language translation is a supported service.
Users have direct access to original resources in a collection when they browse through library stacks or wander in museum galleries.[51] They have mediated or indirect access when they use catalogs or search engines, and sometimes they can only interact with copies or descriptions of the resources.
The concept of affordance, introduced by J.J. Gibson and then extended and popularized by Don Norman, captures the idea that physical resources and their environments have inherent actionable properties that determine, in conjunction with an actor’s capabilities and cognition, what can be done with the resource.[52]
When organizing resources involves arranging physical resources using boxes, bins, cabinets, or shelves, the affordances and the implications for access and use are immediately evident. Resources of a certain size and weight can be picked up and carried away. Books on the lower shelves of bookcases are easy to reach, but those stored ten feet from the ground cannot be easily accessed. Overhead and end-of-aisle signs support navigation and orientation in libraries and stores, and the information on book spines or product packages helps us select a specific resource.
We can analyze the organizing systems with physical resources to identify the affordances and the possible interactions they imply. We can compare the affordances or overall interaction capability enabled by different organizing systems for some type of physical resources, and we often do this without thinking about it. The tradeoffs between the amount of work that goes into organizing a collection of resources and the amount of work required to find and use them are inescapable when the resources are physical objects or information resources in physical form. We can immediately see that storing information on scrolls does not enable the random access capability that is possible with books. When you implement the organizing system for your clothes closet, you implicitly consider the tradeoff between extensive and minimal organization and the implications for the amount of interaction effort required to put away and find clothes in each case.
What and how to count to compare the capabilities of organizing systems becomes more challenging the further we get from collections of static physical resources, like books or shoes, where it is usually easy to perceive and understand the possible interactions. For information systems, capability can be assessed by counting their functions, services, or application programming interfaces. However, this very coarse measure does not take into account differences in the capability or generality of a particular interaction. For example, two organizing systems might both have a search function, but differences in the operators they allow, the sophistication of pre-processing of the content to create index terms, or their usability can make them differ vastly in power, precision, and effectiveness.[53]
An analogous measure of functional capability for a system with dynamic or living resources is the behavioral repertoire, the number of different activities, or range of actions, that can be initiated.
We should not assume that supporting more types of interactions necessarily makes a system better or more capable; what matters is how much value is created or invoked in each interaction. Doors that open automatically when their sensors detect an approaching person do not need handles. Organizing systems can use stored or computed information about user preferences or past interactions to anticipate user needs or personalize recommendations. This has the effect of substituting information for interaction to make interactions unnecessary or simpler.
For example, a current awareness service that automatically informs you about relevant news from many sources makes it unnecessary to search any of them separately. Similarly, a “smart travel agent” service can use a user’s appointment calendar, past travel history, and information sources like airline and hotel reservation services to transform a minimal interaction like “book a business trip to New York for next week’s meeting” into numerous hidden queries that would have otherwise required separate interactions.[54]
A useful way to distinguish types of interactions with resources is according to the way in which they create value, using a classification proposed by Apte and Mason. They noted that interactions differ not just in their overall intensity but in the absolute and relative amounts of physical manipulation, interpersonal or empathetic contact, and symbolic manipulation or information exchange involved in the interaction. Furthermore, Apte and Mason recognized that the proportions of these three types of value creating activities can be treated as design parameters, especially where the value created by retrieving or computing information could be completely separated or disaggregated from the value created by physical actions and person-to-person encounters.[55]
Physical manipulation is often the intrinsic type of interaction with collections of physical resources. The resource might have to be handled or directly perceived in order to interact with it, and often the experience of interacting with the resource is satisfying or entertaining, making it a goal in its own right. People often visit museums, galleries, zoos, animal theme parks or other institutions that contain physical resources because they value the direct, perceptual, or otherwise unmediated interaction that these organizing systems support.
Physical manipulation and interpersonal contact might be required to interact with information resources in physical form like the printed books in libraries. A large university library contains millions of books and academic journals, and access to those resources can require a long walk deep into the library stacks after a consultation with a reference librarian. For decades library users searched through description resources – first printed library cards, and then online catalogs and databases of bibliographic citations – to locate the primary resources they wanted to access. The surrogate descriptions of the resources needed to be detailed so that users could assess the relevance of the resource without expending the significant effort of examining the primary resource.[56]
However, for most people the primary purpose of interacting with a library is to access the information contained in its resources. For most people, access in a digital library to copies of printed documents or books is equivalent to or even better than access to the original physical resource because the incidental physical and interpersonal interactions have been eliminated.[57]
In some organizing systems robotic devices, computational processes, or other entities that can act autonomously with no need for a human agent carry out interactions with physical resources. Robots have profoundly increased efficiency in materials management, “picking and packing” in warehouse fulfillment, office mail delivery, and in many other domains where human agents once located, retrieved, and delivered physical resources. A “librarian robot” that can locate books and grasp them from the shelves shows promise.[58]
With digital resources, neither physical manipulation nor interpersonal contact is required for interactions, and the essence of the interaction is information exchange or symbolic manipulation of the information contained in the resource.[59] Put another way, by replacing interactions that involve people and physical resources with symbolic ones, organizing systems can lower their costs without reducing user satisfaction. This is why so many businesses have automated their information-intensive processes with self-service technology like ATMs, websites, or smartphone apps.
Similarly, web search engines eliminate the physical effort required to visit a library and enable users to consult more readily accessible digital resources. A search engine returns a list of the page titles of resources that can be directly accessed with just another click, so it takes little effort to go from the query results to the primary resource. This reduces the need for the rich surrogate descriptions that libraries have always been known for because it enables rapid evaluation and iterative query refinement based on inspection of the primary resources.[60]
The ease of use and speed of search engines in finding web resources creates the expectation that any resource worth looking at can be found on the web. This is certainly false, or Google would never have begun its ambitious and audacious project to digitize millions of books from research libraries. But while research libraries strive to provide access to authoritative and specialized resources, the web is undeniably good enough for answering most of the questions ordinary users put to search engines, which largely deal with everyday life, popular culture, personalities, and news of the day.
Libraries recognize that they need to do a better job integrating their collections into the “web spaces” and web-based activities of their users if they hope to change the provably suboptimal strategies of “information foraging” most people have adopted that rely too much on the web and too little on the library.[61] Some libraries are experimenting with Semantic Web and “Linked Data” technologies that would integrate their extensive bibliographic resources with resources on the open web. But there is insufficient agreement about exactly how libraries should expose their collections and some ambivalence about whether to do it at all.[62]
There seems to be less ambivalence for museums, which have aggressively embraced the web to provide access to their collections. While few museum visitors would prefer viewing a digital image over experiencing an original painting, sculpture, or other physical artifact, the alternative is often no access at all. Most museum collections are far larger than the space available to display them, so the web makes it possible to provide access to otherwise hidden resources.[63]
The variety and functions of interactions with digital resources are determined by the amount of structure and semantics represented in their digital encoding, in the descriptions associated with the resources, or by the intelligence of the computational processes applied to them. Digital resources can support enhanced interactions of searching, copying, zooming, and other transformations. Digital or “e-books” demonstrate how access to content can be enhanced once it is no longer tied to the container of the printed book, but some e-book formats have a limited interaction repertoire: typically only “page turning,” resizing, and full-text search.[64]
Richer interactions with digital text resources are possible when they are encoded in an application or presentation-independent format. Automated content reuse and “single-source” publishing are most efficiently accomplished when text is encoded in XML (Extensible Markup Language), but much of this XML is produced by transforming text originally created in word processing formats. Once it is in XML, digital information can be distributed, processed, reused, transformed, mixed, remixed, and recombined into different formats for different purposes, applications, devices, or users in ways that are almost impossible to imagine when it is represented in a tangible (and therefore static) medium like a book on a shelf or a box full of paper files.[65]
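As a minimal sketch of single-source publishing, assume a hypothetical product description encoded in XML. The element names (`product`, `name`, `feature`) and the two rendering functions are illustrative assumptions, not a format this book prescribes. The point is only that one presentation-independent source can be transformed into more than one output.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source content; the element names are assumptions.
SOURCE = """
<product id="p1">
  <name>Trail Lamp</name>
  <feature>Rechargeable battery</feature>
  <feature>Water resistant</feature>
</product>
"""

def to_plain_text(root):
    """Render the source as a plain-text catalog entry."""
    lines = [root.findtext("name").upper()]
    lines += ["- " + feature.text for feature in root.findall("feature")]
    return "\n".join(lines)

def to_html(root):
    """Render the same source as an HTML fragment."""
    items = "".join(
        "<li>" + escape(feature.text) + "</li>"
        for feature in root.findall("feature")
    )
    return "<h1>" + escape(root.findtext("name")) + "</h1><ul>" + items + "</ul>"

# Parse once, then produce two different presentations of the same content.
root = ET.fromstring(SOURCE.strip())
catalog_entry = to_plain_text(root)
web_fragment = to_html(root)
```

Because neither renderer changes the source, a correction to a feature description would appear in both outputs the next time they are generated.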
Businesses that create or own their information resources can readily take advantage of the enhanced interactions that digital formats enable. For libraries, however, copyright is often a barrier to digitization, both as a matter of law and because digitization enables copyright enforcement to a degree not possible with physical resources. As a result, digital books are somewhat controversial and problematic for libraries, whose access models were created based on the economics of print publication and the social contract of the copyright first sale doctrine that allowed libraries to lend printed books.[66]
Software-based agents do analogous work to robots in “moving information around” after accessing digital resources such as web services or sensors that produce digital information. These agents can control or choreograph a set of interactions with digital resources to carry out complex business processes.
Different levels of interactions or access can apply to different resources in a collection or to different categories of users. For example, library collections can range from completely open and public, to allowing limited access, to wholly private and restricted. The library stacks might be open to anyone, but the rare documents in a special collection might be accessible only to authorized researchers. The same is true of museums, which typically have only a fraction of their collections on public display.
Because of their commercial and competitive purposes, organizing systems in business domains are more likely to enforce a granular level of access control that distinguishes people according to their roles and that further distinguishes them according to the nature of their interactions with resources. For example, administrative assistants in a company’s Human Resources department are not allowed to see employee salaries; HR employees in a benefits administration role can see the salaries but not change them; management-level employees in HR can change the salaries. Some firms limit access to specific times from authorized computers or IP addresses.[67]
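The HR example can be sketched as a small role-based access check. The role names, action names, network range, and business hours below are all hypothetical values chosen for illustration; a real system would draw them from its own policy definitions.

```python
import ipaddress

# Hypothetical roles and permitted actions, mirroring the HR example.
PERMISSIONS = {
    "hr_assistant": set(),                           # cannot see salaries
    "benefits_admin": {"read_salary"},               # can see but not change
    "hr_manager": {"read_salary", "change_salary"},  # can see and change
}

# Hypothetical context restrictions: an authorized network and working hours.
AUTHORIZED_NETWORK = ipaddress.ip_network("10.20.0.0/16")
BUSINESS_HOURS = range(8, 18)  # 08:00 through 17:59

def is_allowed(role, action):
    """Return True if the role's permission set includes the action."""
    return action in PERMISSIONS.get(role, set())

def is_allowed_in_context(role, action, ip, hour):
    """Also require an authorized IP address and an hour within business hours."""
    return (
        is_allowed(role, action)
        and ipaddress.ip_address(ip) in AUTHORIZED_NETWORK
        and hour in BUSINESS_HOURS
    )
```

Separating the role check from the context check keeps the two kinds of policy independent, so a firm could tighten the time or network restrictions without redefining any role.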
A noteworthy situation arises when the person accessing the organizing system is the one who designed and implemented it. In this case, the person will have qualitatively better knowledge of the resources and the supported interactions. This situation most often arises in the organizing systems in kitchens, home closets, and other highly personal domains but can also occur in knowledge-intensive business and professional domains like consulting, customer relationship management, and scientific research.
Many of the organizing systems used by individuals are embedded in physical contexts where the access controls are applied in a coarse manner. We need a key to get into the house, but we do not need additional permissions or passwords to enter our closets or kitchens or to take a book from a bookshelf. In our online lives, however, we readily accept and impose more granular access controls on our personal computers and in the applications we use, as when we allow or block individual “friend” requests on Facebook or mark photos on Flickr as public, private, or viewable only by named groups or individuals.
We can further contrast access policies based on their origins or motivations. Designed Resource Access Policies are established by the designer or operator of an organizing system to satisfy internally generated requirements. Examples of designed access policies are: (1) giving more access to “inside” users (e.g., residents of a community, students or faculty members at a university, or employees of a company) than to anonymous or “outside” users; (2) giving more access to paying users than to users who don’t pay; (3) giving more access to users with capabilities or competencies that can add value to the organizing system (e.g., material culture researchers like archaeologists or anthropologists, who often work with resources in museum collections that are not on display).
Imposed Policies are mandated by an external entity and the organizing system must comply with them. For example, an organizing system might have to follow information privacy or security regulations that restrict access to resources or the interactions that can be made with them. University libraries typically complement or replace parts of their print collections with networked access to digital content licensed from publishers. Typical licensing terms then require them to restrict access to users that are associated with the university, either by being on campus or by using VPN software that controls remote access to the library network.[68] Copyright law limits the uses of a substantial majority of the books in the collections of major libraries, prohibiting them from being made fully available in digital formats. Museums often prohibit photography because they do not own the rights to modern works they display.
Whether an access policy is designed or imposed is not always clear. Policies that were originally designed for a particular organizing system may over time become best practices or industry standards, which regulators or industry groups not satisfied with “self-regulation” later impose. Museums might aggressively enforce a ban on photography not just to comply with copyright law, but also to enhance the revenue they get from selling posters and reproductions.
Maintaining resources is an important activity in every organizing system regardless of the nature of its collection because resources or surrogates for them must be available at the time they are needed. Beyond these basic shared motivations are substantial differences in maintenance goals and methods depending on the domain of the organizing system.
Different domains sometimes use the same terms to describe different maintenance activities and different terms for similar activities. The most common terms are storage, preservation, curation, and governance. Storage is most often used when referring to physical or technological aspects of maintaining resources; backup (for short-term storage), archiving (for long-term storage), and migration (moving stored resources from one storage device to another) are similar in this respect. The other three terms generally refer to activities or methods and more closely overlap in meaning; we will distinguish them in Preservation-Governance.
Ideally, maintenance requirements for resources should be anticipated when organizing principles are defined and implemented. In particular, resource descriptions to support long-term preservation of digital resources are important.[69]
The concept of “memory institution” broadly applies to a great many organizing systems that share the goal of preserving knowledge and cultural heritage. The primary resources in libraries, museums, data archives or other “memory institutions” are fixed cultural, historic, or scientific artifacts that are maintained because they are unique and original items with future value. This is why the Louvre preserves the portrait of the Mona Lisa and the United States National Archives preserves the Declaration of Independence.[70]
In contrast, in the organizing systems used by businesses many of the resources that are collected and managed have limited intrinsic value. The motivation for preservation and maintenance is economic; resources are maintained because they are essential in running the business. For example, businesses collect and preserve information about employees, inventory, orders, invoices, etc., because it ensures internal goals of efficiency, revenue generation and competitive advantage. The same resources (such as information about a customer) are often used by more than one part of the business.[71] Maintaining the accuracy and consistency of changing resources is a major challenge in business organizing systems.[72]
Other business organizing systems preserve information needed to satisfy externally imposed regulatory or compliance policies and serve largely to avoid possible catastrophic costs from penalties and lawsuits. In all these cases, resources are maintained as one of the means employed to preserve the business as an ongoing enterprise, not as an end in itself.
Unlike libraries, archives, and museums, indefinite preservation is not the central goal of most business organizing systems. These organizing systems mostly manage information needed to carry out day-to-day operations or relatively recent historical information used in decision support and strategic planning. In addition to these internal mandates, businesses have to conform to securities, taxation, and compliance regulations that impose requirements for long-term information preservation.[73]
Of course, libraries, museums, and archives also confront economic issues as they seek to preserve and maintain their collections and themselves as memory institutions. They view their collections as intrinsically valuable in ways that firms generally do not. Art galleries are an interesting hybrid because they organize and preserve collections that are valuable, but if they do not manage to sell some things, they will not stay in business.
In between these contrasting purposes of preservation and maintenance are the motives in personal collections, which occasionally are created because of the inherent value of the items but more typically because of their value in supporting personal activities. Some people treasure old photos or collectibles that belonged to their parents or grandparents and imagine their own children or grandchildren enjoying them, but many old collections seem to end up as offerings on eBay. In addition, many personal organizing systems are task-oriented, so their contents need not be preserved after the task is completed.[74]
At the most basic level, preservation of resources means maintaining them in conditions that protect them from physical damage or deterioration. Libraries, museums, and archives aim for stable temperatures and low humidity. Permanently or temporarily out-of-service aircraft are parked in deserts where dry conditions reduce corrosion. Risk-aware businesses create continuity plans that involve offsite storage of the data and documents needed to stay in business in the event of a natural disaster or other disruption.
When the goal is indefinite preservation, other maintenance issues arise if resources deteriorate or are damaged. How much of an artifact’s worth is locked in with the medium used to express it? How much restoration should be attempted? How much of the essence of an artifact is retained if it is converted to a digital format?
Preservation is often a key motive for digitization, but digitization alone is not preservation. Digitization creates preservation challenges because technological obsolescence of computer software and hardware requires ongoing efforts to ensure the digitized resources can be accessed.
Technological obsolescence is the major challenge in maintaining digital resources, and it shows up in three ways. The most visible is the relentless evolution of the physical media and environments used to store digital information in both institutional or business and personal organizing systems. Computer data began to be stored on magnetic tape and hard disk drives six decades ago, on floppy disks four decades ago, on CDs three decades ago, on DVDs two decades ago, on solid-state drives half a decade ago, and in “cloud-based” or “virtual” storage environments in the last decade. As the capacity of storage technologies grows from kilobytes to megabytes to gigabytes to terabytes to petabytes, economic and efficiency considerations often make the case for adopting new technology to store newly acquired digital resources and raise questions about what to do with the existing ones.[75]
The second challenge might seem paradoxical. Even as the capacities of digital storage technologies increase at a staggering pace, the expected useful lifetimes of the physical storage media are measured in years or at best in decades. Colloquial terms for this problem are data rot or bit rot. In contrast, books printed on acid-free paper can last for centuries. The contrast between printed and digital resources is striking; books on library shelves don’t disappear if no one uses them, but digital data can be lost just because no one wants access to it within a year or two after its creation.[76]
However, limits to the physical lifetime of digital storage media are much less significant than the third challenge, the fact that the software and its associated computing environment used to parse and interpret the resource at the time of preservation might no longer be available when the resource needs to be accessed. Twenty-five years ago most digital documents were created using the WordPerfect word processor, but today the vast majority are created using Microsoft Word and few people still use WordPerfect. Software and services that convert documents from old formats to new ones are widely available, but they are only useful if the old file can be read from its legacy storage medium.[77]
Because almost every digital device has storage associated with it, problems posed by multiple storage environments can arise at all scales of organizing systems. Only a few years ago people often struggled with migrating files from their old computer, music player or phone when they got new ones. Web-based email and applications and web-based storage services like Dropbox, Amazon Cloud Drive, and Apple iCloud eliminate some data storage and migration problems by making them someone else’s responsibility, but in doing so introduce privacy and reliability concerns.
It is easy to say that the solutions to the problems of digital preservation are regular recopying of the digital resources onto new storage media and then migrating them to new formats when significantly better ones come along. In practice, however, how libraries, businesses, government agencies or other enterprises deal with these problems depends on their budgets and on their technical sophistication. In addition, not every resource should or can always be migrated, and the co-existence of multiple storage technologies makes an organizing system more complex because different storage formats and devices can be mutually incompatible. Dealing with interoperability and integration problems will be discussed further in Chapter 9, “Interactions in Organizing Systems.”
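A minimal sketch of the recopying step might look like the following. Comparing checksums after the copy is an added assumption rather than something the text prescribes, and the function names and the choice of SHA-256 are illustrative.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def recopy(source, destination_dir):
    """Copy a file onto new storage and confirm the copy matches the original."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / Path(source).name
    shutil.copy2(source, target)  # copies content and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError("copy of %s does not match the original" % source)
    return target
```

This addresses only the storage medium. It does nothing about the format problem, because a perfect copy of a file in an obsolete format is still unreadable without software that can interpret it.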
Preservation of web resources is inherently problematic. Unlike libraries, museums, archives, and many other kinds of organizing systems that contain collections of unchanging resources, organizing systems on the web often contain resources that are highly dynamic. Some web sites change by adding content, and others change by editing or removing it.[78]
Longitudinal studies have shown that hundreds of millions of web pages change at least once a week, even though most web pages never change or change infrequently.[79] Nevertheless, the continued existence of a particular web page is hardly sufficient to preserve it if it is not popular and relevant enough to show up in the first few pages of search results. Persistent access requires preservation, but preservation isn’t meaningful if there is no realistic probability of future access.
Comprehensive web search engines like Google and Bing use crawlers to continually update their indexed collections of web pages and their search results link to the current version, so preservation of older versions is explicitly not a goal. Furthermore, search engines don’t reveal any details about how frequently they update their collections of indexed pages.[80]
A focus on preserving particular resource instances is most clear in museums and archives, where collections typically consist of unique and original items. There are many copies and derivative works of the Mona Lisa, but if the original Mona Lisa were destroyed none of them would be acceptable as a replacement.[82]
Archivists and historians argue that it is essential to preserve original documents because they convey more information than just their textual content. Paul Duguid recounts how a medical historian used faint smells of vinegar in eighteenth century letters to investigate a cholera epidemic because disinfecting letters with vinegar was thought to prevent the spread of the disease. Obviously, the vinegar smell would not have been part of a digitized letter.[83]
Zoos often give a distinctive or attractive animal a name and then market it as a special or unique instance. For example, the Berlin Zoo successfully marketed a polar bear named Knut to become a world famous celebrity, and the zoo made millions of dollars a year through increased visits and sales of branded merchandise. Merchandise sales have continued even though Knut died unexpectedly in March 2011, which suggests that the zoo was less interested in preserving that particular polar bear than in preserving the revenue stream based on that resource.[84]
Most business organizing systems, especially those that “run the business” by supporting day-to-day operations, are designed to preserve instances. These include systems for order management, customer relationship management, inventory management, digital asset management, record management, email archiving, and more general-purpose document management. In all of these domains, it is often necessary to retrieve specific information resources to serve customers or to meet compliance or traceability goals.
Some business organizing systems are designed to preserve types or classes of resources rather than resource instances. In particular, systems for content management typically organize a repository of reusable or “source” information resources from which specific “product” resources are then generated. For example, content management systems might contain modular information about a company’s products that are assembled and delivered in sales or product catalogs, installation guides, operating guides, or repair manuals.[85]
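The relationship between reusable “source” resources and generated “product” resources can be sketched as follows. The module names, their text, and the product definitions are invented for illustration and do not describe any particular content management system.

```python
# Hypothetical reusable "source" modules about one product.
MODULES = {
    "overview": "The Trail Lamp is a compact rechargeable light.",
    "specs": "Weight: 120 g. Runtime: 10 hours.",
    "install": "Charge fully before first use.",
    "repair": "Replace the seal if water enters the housing.",
}

# Each "product" resource is defined only by which modules it reuses, in order.
PRODUCTS = {
    "sales_catalog": ["overview", "specs"],
    "installation_guide": ["overview", "install"],
    "repair_manual": ["specs", "repair"],
}

def generate(product):
    """Assemble a product resource from the current text of its source modules."""
    return "\n\n".join(MODULES[name] for name in PRODUCTS[product])
```

What the repository preserves is the set of modules and assembly rules. Any particular catalog or manual can be regenerated on demand, which is why the instance itself need not be kept.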
Businesses strive to preserve the collective knowledge embodied in the company’s people, systems, management techniques, past decisions, customer relationships, and intellectual property. Much of this knowledge is “know how” – knowing how to get things done or knowing how things work – that is tacit or informal. Knowledge management systems are a type of business organizing system whose goal is to capture and systematize these information resources.[86] As with content management, the focus of knowledge management is the reuse of “knowledge as type,” putting the focus on the knowledge rather than the specifics of how it found its way into the organizing system.
When businesses implement information-intensive processes employing web-based services, it is highly desirable to organize them as a collection of service types rather than service instances because this makes them more robust and maintainable. An abstract description of services or resources allows one service provider to transparently substitute for another. For example, the user of the organizing system that implements an Internet-based retail business model need not know and probably doesn’t care which delivery service carries out a request to deliver a package from a warehouse. Similarly, an abstract service description might allow a computational process to substitute for one carried out by a person, or vice versa. For example, a credit card terminal in a restaurant offers the customer the capability to specify no tip, a specific amount, or a percentage of the total that the terminal calculates.
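The delivery example can be sketched with an abstract service type and two interchangeable providers. The class names, the `deliver` method, and its parameters are hypothetical; the sketch assumes only that every provider accepts a package identifier and an address.

```python
from typing import Protocol

class DeliveryService(Protocol):
    """Abstract service type: anything that can deliver a package."""
    def deliver(self, package_id: str, address: str) -> str: ...

class CourierA:
    def deliver(self, package_id: str, address: str) -> str:
        return "CourierA delivering %s to %s" % (package_id, address)

class CourierB:
    def deliver(self, package_id: str, address: str) -> str:
        return "CourierB delivering %s to %s" % (package_id, address)

def fulfill_order(package_id: str, address: str, service: DeliveryService) -> str:
    """The retail process depends on the service type, not on a specific provider."""
    return service.deliver(package_id, address)
```

Here `fulfill_order` never names a specific courier, so one provider can be swapped for another without changing the retail process.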
Libraries have a similar emphasis on preserving resource types rather than instances. The bulk of most library collections, especially public libraries, is made up of books that have many equivalent copies in other collections. When a library has a copy of Moby Dick it is preserving the abstract “work” rather than the particular physical “instance” – unless the copy of Moby Dick is a rare first edition signed by Melville.
Even when zoos give their popular animals individual names, it seems logical that the zoo’s goal is to preserve animal species rather than instances because any particular animal has a finite lifespan and cannot be preserved forever.[87]
In some organizing systems any specific resource might be of little interest or importance in its own right but is valuable because of its membership in a collection of essentially identical items. This is the situation in the data warehouses used by businesses to identify trends in customer or transaction data or in the huge data collections created by scientists. These collections are typically analyzed as complete sets. A scientist does not borrow a single data point when she accesses a data collection; she borrows the complete data set consisting of millions or billions of data points. This requirement raises difficult questions about what additional software or equipment needs to be preserved in an organizing system along with the data to ensure that it can be reanalyzed.[88]
At other times specific items in a collection might have some value or interest on their own, but they acquire even greater significance and enhanced meaning because of the context created by other items in the collection that are related in some essential way. The odd collection of “things people swallow that they should not” at the Mütter Museum is a perfect example.[89]
For almost a century “curation” has been used to describe the processes by which a resource in a collection is maintained over time, which may include actions to improve access or to restore or transform its representation or presentation.[90] Furthermore, especially in cultural heritage collections, curation also includes research to identify, describe, and authenticate resources in a collection. Resource descriptions are often updated to reflect new knowledge or interpretations about the primary resources.[91]
Curation takes place in all organizing systems – at a personal scale when we rearrange a bookshelf to accommodate new books or create new file folders for this year’s health insurance claims, at an institutional scale when a museum designs a new exhibit or a zoo creates a new habitat, and at web scale when people select photos to upload to Flickr or Facebook and then tag or “Like” those uploaded by others.
An individual, company, or any other creator of a web site can make decisions and employ technology that maintains the contents, quality and character of the site over time. In that respect web site curation and governance practices are little different from those for the organizing systems in memory institutions or business enterprises. The key is having clear policies for collecting resources and maintaining them over time that enable people and automated processes to ensure that resource descriptions or data are authoritative, accurate, complete, consistent, and non-redundant.
Curation is most necessary and explicit in institutional organizing systems where the large number of resources or their heterogeneity requires choices to be made about which ones should be most accessible, how they should be organized to ensure this access, and which ones need most to be preserved to ensure continued accessibility over time. Curation might be thought of as an ongoing or deferred selection activity because curation decisions must often be made on an item-by-item basis.
Curation in these institutional contexts requires extensive professional training. The institutional authority empowers individuals or groups to make curation decisions. No one questions whether a museum curator or a compliance manager should be doing what they do.[92]
Resource descriptions are more important in company Intranets than in the open web because the contents of the former lack the links that are critical in the latter.
Curation by individuals has been studied a great deal in the research discipline of Personal Information Management.[93] Much of this work has been influenced for decades by a seminal article written by Vannevar Bush titled “As We May Think.” Bush envisioned the Memex, “a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.” Bush’s most influential idea was his proposal for organizing sets of related resources as “trails” connected by associative links, the ancestor of the hypertext links that define today’s web.[94]
Many individuals spend a great amount of time curating their own web sites, but when a site can attract large numbers of users, it often allows users to annotate, “tag,” “like,” “+1,” and otherwise evaluate its resources. The concept of curation has recently been adapted to refer to these volunteer efforts of individuals to create, maintain, and evaluate web resources.[95] The massive scale of these bottom-up and distributed activities is curation by “crowdsourcing,” the continuously aggregated actions and contributions of users.[96]
The informal and organic “folksonomies” that result from their aggregated effort create organization and authority through network effects.[97] This undermines traditional centralized mechanisms of organization and governance and threatens any business model in publishing, education, and entertainment that has relied on top-down control and professional curation.[98] In addition, professional curators are not pleased to have the ad hoc work of untrained people working on web sites described as curation.
Most web sites are not curated in a systematic way, and the decentralized nature of the web and its easy extensibility mean that the web as a whole defies curation. It is easy to find many copies of the same document, image, music file, or video and not easy to determine which is the original, authoritative, or authorized version. Broken links return “Error 404 Not Found” messages.[99]
Nevertheless, problems like these that result from lazy or careless webmastering are minor compared to those that result from deliberate misclassification, falsification, or malice. An entirely new vocabulary has emerged to describe these web resources with bad intent (See WEB RESOURCES WITH BAD INTENT).
Since we cannot prevent these deceptions by controlling what web resources are created in the first place, we have to respond to them after the fact with “defensive curation” techniques. These include filters and firewalls that block access to particular sites or resource types, but whether this is curation or censorship is often debated, and from the perspective of the government or organization doing the censorship it is certainly curation. Nevertheless, the decentralized nature of the web and its open protocols can sometimes enable these controls to be bypassed.
Search engines continuously curate the web because the algorithms they use for determining relevance and ranking determine what resources people are likely to access. At a smaller scale, there are many kinds of tools for managing the quality of a web site, such as ensuring that HTML content is valid, that links work, and that the site is being crawled completely. Another familiar example is the spam and content filtering that takes place in our email systems that automatically classifies incoming messages and sorts them into appropriate folders.
In organizing systems that contain data, there are numerous tools for “name matching,” the task of determining when two different text strings denote the same person, object, or other named entity. This problem of eliminating duplicates and establishing a controlled or authoritative version of the data item arises in numerous application areas; familiar ones include law enforcement and counter-terrorism. Done incorrectly, it might mean that you end up on a “watch list” and are hassled every time you want to fly on a commercial plane.
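A minimal sketch of name matching can make the task concrete. It assumes a simple approach that is not taken from any particular tool: normalize each string (case, accents, punctuation, token order) and then compare the results with the Python standard library's SequenceMatcher. The function names and the 0.85 threshold are illustrative assumptions, not recommended settings.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    # Sorting tokens makes "Last, First" and "First Last" compare as equal.
    return " ".join(sorted(cleaned.split()))


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two name strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Guess whether two strings denote the same entity (hypothetical threshold)."""
    return name_similarity(a, b) >= threshold
```

With these assumptions, same_entity("Glushko, Robert", "Robert Glushko") is true, but so is same_entity("Jon Smith", "John Smith"), even though these could be two different people. That kind of false match is the "watch list" risk described above, which is why real systems usually combine string similarity with other properties of the entity.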
One might think that computational curation is always more reliable than any curation carried out by people. Certainly, it seems that we should always be able to trust any assertion created by context-aware resources like a sensor that reports the temperature or current location. But can we trust the accuracy of web content? Search engines use the popularity of web pages and the structure of links between them to compute relevance in response to a query. But popularity and relevance don’t always ensure accuracy. We can easily find popular pages that prove the existence of UFOs or claim to validate wacky conspiracy theories.
Furthermore, search engines have long been accused of bias built into their algorithms. For example, Google’s search engine has been criticized for giving too much credibility to web sites with .edu domain names, to sites that have been around for a long time, and to sites that are owned by or partner with the company, like Google Maps or YouTube.[101]
“Governance” overlaps with “curation” in meaning but typically has more of a policy focus (what should be done) than a process focus (how to do it). Governance is also more frequently used to describe the curation of the resources in business and scientific organizing systems rather than in libraries, archives, and museums.
Governance has a broader scope than curation because it extends beyond the resources in a collection and also applies to the software, computing, and networking environments needed to use them. This broader scope also means that governance must specify the rights and responsibilities for the different types of people who might interact with the resources, the circumstances under which that might take place, and the methods they would be allowed to use.
“Corporate governance” is a common term applied to the ongoing maintenance and management of the relationship between operating practices and long-term strategic goals. Libraries and museums must also deal with long-term strategy, but the lesser visibility of “library governance” and “museum governance” might simply reflect the greater concerns about fraud and malfeasance in for-profit business contexts than in non-profit contexts and the greater number of standards or “best practices” for corporate governance.[102]
Data governance policies are often shaped by laws, regulations, or policies that prohibit the collection of certain kinds of objects or types of information. Privacy laws prohibit the collection or misuse of personally identifiable information about healthcare, education, telecommunications, and video rental, and might soon restrict the information collected during web browsing.[103]
Governance is essential to deal with the frequent changes in business organizing systems and the associated activities of data quality management, access control to ensure security and privacy, compliance, deletion, and archiving. For many of these activities, effective governance involves the design and implementation of standard services in the organizing system to ensure that the activities are performed in an effective and consistent manner.[104]
Today’s information-intensive businesses capture and create large amounts of digital data. The concept of “business intelligence” emphasizes the value of data in identifying strategic directions and the tactics to implement them in marketing, customer relationship management, supply chain management and other information-intensive parts of the business.[105] A management aspect of governance in this domain is determining which resources and information will potentially provide economic or competitive advantages and determining which will not. A conceptual and technological aspect of governance is determining how best to organize the useful resources and information in business operations and information systems to secure the potential advantages.
Business intelligence is only as good as the data it is based on, which makes business data governance a critical concern that has rapidly developed its own specialized techniques and vocabulary. The most fundamental governance activity in information-driven businesses is identifying the “master data” about customers, employees, materials, products, suppliers, etc. that is reused by different business functions and is thus central to business operations.[106]
Because digital data can be easily copied, data governance policies might require that all sensitive data be anonymized or encrypted to reduce the risk of privacy breaches. To identify the source of a data breach or to facilitate the assertion of a copyright infringement claim, a digital watermark can be embedded in digital resources.[107]
Scientific data poses special governance problems because of its enormous scale, which dwarfs the data sets managed in most business organizing systems. A scientific data collection might contain tens of millions of files and petabytes of data. Furthermore, because scientific data is often created using specialized equipment or computers and undergoes complex workflows, it can be necessary to curate the technology and processing context along with the data in order to preserve it. An additional barrier to effective scientific data curation is the lack of incentives in scientific culture and publication norms to invest in data retention for reuse by others.[108]
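One way such a policy might be applied is sketched below. It replaces sensitive field values with keyed hashes (HMAC-SHA256 from the Python standard library). Strictly speaking this is pseudonymization rather than full anonymization, and the function names, field names, and key handling are assumptions made only for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a sensitive value with a keyed hash that cannot be read back."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical customer record and key, for illustration only.
customer = {"name": "Pat Example", "email": "pat@example.org", "region": "West"}
safe_copy = anonymize_record(customer, {"name", "email"}, secret_key=b"demo-key")
```

Because the same input and key always yield the same token, records about one customer can still be linked for analysis without exposing the underlying identity. The key itself then becomes a resource that the governance policy has to protect.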
[30] [Law]
Some governments attempt to preserve and prevent misappropriation of “cultural property” by enforcing import or export controls on antiquities that might be stolen from archeological sites (Merryman, 2009). For digital resources, privacy laws prohibit the collection or misuse of personally identifiable information about healthcare, education, telecommunications, and video rental, and might soon restrict the information collected during web browsing.
[31] [LIS]
See Borgman (2000). But while shared collections benefit users and reduce acquisition costs, if a library has defined itself as a physical place and emphasizes its holdings – the resources it directly controls – it might resist anything that reduces the importance of its physical reification, the size of its holdings or the control it has over resources (Sandler, 2006). A challenge facing conventional libraries today is to make the transition from a perspective that emphasizes creation and preservation of physical collections to facilitating the use and creation of knowledge regardless of the medium of its representation and the physical or virtual location from which it is accessed.
[32] [LIS]
Large research libraries have historically viewed their collections as their intellectual capital and have policies that specify the subjects and sources that they intend to emphasize as they build their collections. See Evans (2000). Museums are often wary of accepting items that might not have been legally acquired or that have claims on them from donor heirs or descendant groups; in the US much controversy exists because museums contain many human skeletal remains and artifacts that Native American groups want to be “repatriated.” In archives, common appraisal criteria include uniqueness, the credibility of the source, the extent of documentation, and the rights and potential for reuse. To oversimplify: libraries decide what to keep, museums decide what to accept, and archives decide what to throw away.
[33] [Citation]
On data modeling, see Kent (2012), Silverston (2000), and Glushko & McGrath (2005). For data warehouses, see Turban et al (2010).
[34] [Computing]
See Cherbakov et al, 2005, Erl 2005. The essence of SOA is to treat business services or functions as components that can be combined as needed. An SOA enables a business to quickly and cost-effectively change how it does business and whom it does business with (suppliers, business partners, or customers). SOA is generally implemented using web services that exchange XML documents in real-time information flows to interconnect the business service components. If the business service components are described abstractly it can be possible for one service provider to be transparently substituted for another – a kind of real-time resource selection – to maintain the desired quality of service. For example, a web retailer might send a Shipping Request to many delivery services, one of which is selected to provide the service. It probably does not matter to the customer which delivery service handles his package, and it might not even matter to the retailer.
[35] [Business]
The idea that a firm’s long-term success can depend on just a handful of critical capabilities that cut across current technologies and organizational boundaries makes a firm’s core competency a very abstract conceptual model of how it is organized. This concept was first proposed by Prahalad and Hamel (1990), and since then there have been literally hundreds of business books that all say essentially the same thing: you can’t be good at everything; choose what you need to be good at and focus on getting better at it; let someone else do the things that you don’t need to be good at doing.
[36] [Computing]
(Arasu et al 2001; Manning et al 2008). The web is a graph, so all web crawlers use graph traversal algorithms to find URIs of web resources and then add any hyperlink they find to the list of URIs they visit. The sheer size of the web makes crawling its pages a bandwidth- and computation-intensive process, and since some pages change frequently and others not at all, an effective crawler must be smart about how it prioritizes the pages it collects and how it re-crawls pages. A web crawler for a search engine can determine the most relevant, popular, and credible pages from query logs and visit them more often. For other sites a crawler adjusts its “revisit frequency” based on the “change frequency” (Cho and Garcia-Molina 2000).
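The graph traversal described in this note can be sketched in a few lines. The sketch assumes a toy web held in memory as a dictionary that maps each URI to the URIs it links to; a real crawler would fetch and parse pages over the network. The function name and the max_pages limit are illustrative assumptions.

```python
from collections import deque


def crawl_breadth_first(link_graph: dict, seed: str, max_pages: int = 100) -> list:
    """Visit URIs in breadth-first order, starting from a seed URI."""
    frontier = deque([seed])  # URIs discovered but not yet visited
    seen = {seed}             # URIs already added to the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        uri = frontier.popleft()
        visited.append(uri)
        # Add every newly discovered hyperlink to the list of URIs to visit.
        for target in link_graph.get(uri, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


# Hypothetical four-page web: "a" links to "b" and "c", and so on.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
crawl_order = crawl_breadth_first(toy_web, "a")
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, as "c" does to "a" in the toy web.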
[37] [Computing]
Web resources are typically discovered by computerized “web crawlers” that find them by following links in a methodical automated manner. Web crawlers can be used to create topic-based or domain-specific collections of web resources by changing the “breadth-first” policy of generic crawlers to a “best-first” approach. Such “focused crawlers” only visit pages that have a high probability of being relevant to the topic or domain, which can be estimated by analyzing the similarity of the text of the linking and linked pages, terms in the linked page’s URI, or locating explicit semantic annotation that describes their content or their interfaces if they are invokable services (Bergmark et al, 2002, Ding et al 2004).
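A best-first policy can be sketched by replacing the crawler's queue with a priority queue ordered by estimated relevance. The sketch below makes several simplifying assumptions: the web is a small in-memory dictionary, relevance is estimated only from the overlap between a link's anchor text and a set of topic terms, and links scoring below a hypothetical min_score are never followed. None of these choices comes from the cited papers.

```python
import heapq


def relevance(text: str, topic_terms: set) -> float:
    """Fraction of the topic terms that appear in the text."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def crawl_best_first(pages: dict, seed: str, topic_terms: set,
                     min_score: float = 0.5, max_pages: int = 100) -> list:
    """Visit the most promising pages first and skip unpromising links."""
    frontier = [(-1.0, seed)]  # heapq pops the smallest item, so scores are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, uri = heapq.heappop(frontier)
        visited.append(uri)
        for anchor_text, target in pages.get(uri, []):
            if target in seen:
                continue
            seen.add(target)
            score = relevance(anchor_text, topic_terms)
            if score >= min_score:
                heapq.heappush(frontier, (-score, target))
    return visited


# Hypothetical pages: each URI maps to (anchor text, target URI) pairs.
toy_pages = {
    "seed": [("museum curation practices", "m1"),
             ("cheap shoes sale", "s1"),
             ("digital curation", "m2")],
    "m1": [("museum archives curation", "m3")],
    "s1": [("more shoes", "s2")],
}
focused_order = crawl_best_first(toy_pages, "seed", {"museum", "curation"})
```

In this toy run the pages about shoes are never visited, and the page reached through the more relevant link on "m1" is visited before the less relevant "m2".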
[38] [CogSci]
In this book we use “property” in a generic and ordinary sense as a synonym for “feature” or “characteristic.” Many cognitive and computer scientists are more precise in defining these terms and reserve “property” for binary predicates (e.g., something is red or not, round or not, and so on). If multiple values are possible, the “property” is called an “attribute,” “dimension,” or “variable.” See Barsalou and Hale (1993) for a rigorous contrast between feature lists and other representational formalisms in models of human categories.
[39] [LIS]
Libraries and bookstores use different classification systems. The kitchen in a restaurant is not organized like a home kitchen because professional cooks think of cooking differently than ordinary people do. Scientists use the Latin or binominal (genus + species) scheme for identifying and classifying living things to avoid the ambiguities and inconsistencies of common names, which differ across languages and often within different regions in a single language community.
[40] [Citation]
Battles (2003).
[41] [Law]
In principle, it is easy to make perfect copies of digital resources. In practice, however, many industries employ a wide range of technologies including digital rights management, watermarking, and license servers to prevent copying of documents, music or video files, and other digital resources. The degree of copying allowed is a design choice in digital organizing systems that is shaped by law.
[42] [Computing]
Web-based or “cloud” services are invoked through URIs, and good design practice makes them permanent even if the implementation or location of the resource they identify changes (Berners-Lee, 1998). Digital resources are often replicated in content delivery networks to improve performance, reliability, scalability, and security (Pathan et al, 2008); the web pages served by a busy site might actually be delivered from different parts of the world, depending on where the accessing user is located.
[43] [Computing]
Whether a digital resource seems intangible or tangible depends on the scale of the digital collection and whether we focus on individual resources or the entire collection. An email message is an identified digital resource in a standard format, RFC 5322 (Resnick, 2008). We can compare different email systems according to the kinds of interactions they support and how easy it is to carry them out, but how email resources are represented does not matter to us and they surely seem intangible. Similarly, the organizing system we use to manage email might employ a complex hierarchy of folders or just a single searchable inbox, but whether that organization is implemented in the computer or smartphone we use for email or exists somewhere “in the cloud” for web-based email does not much matter to us either. An email message is tangible when we print it on paper, but all that matters then is that there is a well-defined mapping between the different representations of the abstract email resource. On the other hand, at the scale at which Google and Microsoft handle billions of email messages in their Gmail and Hotmail services, the implementation of the email organizing system is extremely relevant and involves many tangible considerations. The location and design of data centers, the configuration of processors and storage devices, the network capacity for delivering messages, whether messages and folder structures are server or client based, and numerous other considerations contribute to the quality of service that we experience when we interact with the email organizing system.
[44] [LIS]
An emerging issue in the field of digital humanities (Schreibman, Siemens, and Unsworth, 2005) is the requirement to recognize the materiality of the environment that enables people to create and interact with digital resources (Leonardi, 2010). Even if the resources themselves are intangible, it can be necessary to study and preserve the technological and social context in which they exist to fully understand them. For example, a “Born-Digital Archives” program at Emory University is preserving a collection of the author Salman Rushdie’s work that includes his four personal computers and an external hard drive (Kirschenbaum, 2008; Kirschenbaum et al, 2009).
[45] [Computing]
For example, a car dealer might be able to keep track of a few dozen new and used cars on his lot even without a computerized inventory system, but web-based AutoTrader.com offered more than 2,000,000 cars in 2012. The cars are physical resources where they are located in the world, but they are represented in the AutoTrader.com organizing system as digital resources, and cars can be searched for using any combination of the many resource properties in the car listings: price, body style, make, model, year, mileage, color, location, and even specific car features like sunroofs or heated seats.
[46] [Computing]
Even when organizing principles such as alphabetical, chronological, or numerical ordering do not explicitly consider physical properties, how the resources are arranged in the “storage tier” of the organizing system can still be constrained by their physical properties and by the physical characteristics of the environments in which they are arranged. Books can only be stacked so high whether they are arranged alphabetically or by frequency of use, and large picture books often end up on the taller bottom shelf of bookcases because that’s the only shelf they fit. Nevertheless, it is important to treat these idiosyncratic outcomes in physical storage as exceptions and not let them distort the choice of the organizing principles in the “logic tier.”
[47] [Computing]
The Domain Name System or DNS (Mockapetris, 1987) is the hierarchical naming system that enables the assignment of meaningful domain names to groups of Internet resources. The responsibility for assigning names is delegated in a distributed way by the Internet Corporation for Assigned Names and Numbers (ICANN) (http://www.icann.org). DNS is an essential part of the Web’s organizing system but predates it by several years.
[48] [Computing]
HTML5 defines a “manifest” mechanism for making the boundary around a collection of web resources explicit even if somewhat arbitrary to support an “offline” mode of interaction in which all needed resources are continually downloaded (http://www.w3.org/TR/html5/offline.html), but many people consider it unreliable and subject to strange side effects.
[49] [Citation]
(Aalbersberg and Kahler, 2011).
[50] [Citation]
(Munk, 2004).
[51] [LIS]
Except when the resources on display are replicas of the originals, which is more common than you might suspect. Many nineteenth century museums in the United States largely contained copies of pieces from European museums. Today, museums sometimes display replicas when the originals are too fragile or valuable to risk damage (Wallach, 1998). Whether the “resource-based interaction” is identical for the replica and original is subjective and depends on how well the replica is implemented.
[52] [Citation]
Gibson (1977), Norman (1988). See also (Norman 1999) for a short and simple explanation of Norman’s (re-)interpretation of Gibson.
[53] [Citation]
See Hearst (2009), Buttcher et al (2010).
[54] [Citation]
Glushko and Nomorosa (2012)
[55] [Business]
Apte and Mason (1995) introduced this framework to analyze services rather than interactions per se. They paid special attention to services where the value created by symbolic manipulation or information exchange could be completely separated or disaggregated from the value created by person-to-person interactions. This configuration of value creation enables automated self-service, in which the human service provider can be replaced by technology, and outsourcing, in which the human provider is separated in space or time from the customer.
[56] [LIS]
Furthermore, many of the resources might not be available in the user’s own library and could only be obtained through inter-library loan, which could take days or weeks.
[57] [LIS]
In addition, many of the interactions in libraries are searches for known items, and this function is easily supported by digital search. In contrast, far fewer interactions in museum collections are searches for known items, and serendipitous interactions with previously unknown resources are often the goal of museum visitors. As a result, few museum visitors would prefer an online visit to experiencing an original painting, sculpture, or other physical artifact. However, it is precisely because of the unique character of museum resources that museums allow access to them but do not allow visitors to borrow them, in clear contrast to libraries.
[58] [Citation]
(Viswanadham, 2002; Madrigal 2009). (Prats et al 2008).
[59] [LIS]
Providing access to knowledge is a core mission of libraries, and it is worth pointing out that library users obtain knowledge both from the primary resources in the library collection and from the organizing system that manages the collection.
[60] [LIS]
It also erodes the authority and privilege that apply to resources because they are inside the library when a web search engine can search the “holdings” of the web faster and more comprehensively than you can search a library’s collection through its online catalog.
[61] [Citation]
(Pirolli, 2007).
[62] [Citation]
(Byrne and Goddard, 2010).
[63] [Citation]
See (Simon, 2011). An exemplary project to enhance museum access is Delphi (Schmitz and Black, 2008), the collections browser for the Phoebe A. Hearst Museum of Anthropology at UC Berkeley. Delphi very cleverly uses natural language processing techniques to build an easy-to-use faceted browsing user interface that lets users view over 600,000 items stored in museum warehouses. Delphi is being integrated into CollectionSpace (http://www.collectionspace.org/), an open source web collections management system for museum collections that is being developed collaboratively by UC Berkeley, Cambridge University, the Ontario College of Art and Design, and numerous museums.
[64] [Computing]
To augment digital resources with text structures, multimedia, animation, interactive 3-D graphics, mathematical functions, and other richer content types requires much more sophisticated representation formats that tend to require a great deal of “hand-crafting.” An alternative to hand-crafted resource description is sophisticated computer processing guided by human inputs. For example, Facebook and many web-based photo organizing systems implement face recognition analysis that detects faces in photos, compares the features of detected faces to the features of previously identified faces, and encourages people to tag photos to make the recognition more accurate. Some online retailers use similar image classification techniques to bring together shoes, jewelry, or other items that look alike.
[65] [Computing]
However, even sophisticated text representation formats such as XML have inherent limitations: one important problem that arises in complex management scenarios, humanities scholarship, and bioinformatics is that XML markup cannot easily represent overlapping substructures in the same resource (Schmidt, 2009).
[66] [Law]
Digital books change the economics and first sale is not as well-established for digital works, which are licensed rather than sold (Aufderheide and Jaszi, 2011). To protect their business models, many publishers are limiting the number of times e-books can be lent before they “self-destruct.” Some librarians have called for boycotts of publishers in response (http://boycottharpercollins.com). In contrast to these new access restrictions imposed by publishers on digital works, many governments as well as some progressive information providers and scientific researchers have begun to encourage the reuse and reorganization of their content by making geospatial, demographic, environmental, economic, and other datasets available in open formats, as web services, or as data feeds rather than as “fixed” publications (Bizer, 2009; Robinson et al, 2009). And we have made this book available as an open content repository so that it can be collaboratively maintained and customized.
[67] [Business]
These access controls to the organizing system or its host computer are enforced using passwords and more sophisticated software and hardware techniques. Some access control policies are mandated by regulations to ensure privacy of personal data, and policies differ from industry to industry and from country to country. Access controls can improve the credibility of information by identifying who created or changed it, especially important when traceability is required (e.g. financial accounting).
[68] [LIS]
In response to this trend, however, many libraries are supporting “open access” initiatives that strive to make scholarly publications available without restriction (Bailey, 2007). Libraries and e-book vendors are engaged in a tussle about the extent to which the “first sale” rule that allows libraries to lend physical books without restrictions also applies to e-books (Howard, 2011).
[69] [Citation]
(Guenther and Wolfe, 2009).
[70] [LIS]
Today the United States National Archives displays the Declaration of Independence, Bill of Rights, and Constitution in sealed titanium cases filled with inert argon gas. Unfortunately, for over a century these documents were barely preserved at all; the Declaration hung on the wall at the United States Patent Office in direct sunlight for about 40 years.
[71] [Business]
Customer information drives day-to-day operations, but is also used in decision support and strategic planning.
[72] [Computing]
For businesses “in the world,” a “customer” is usually an actual person whose identity was learned in a transaction, but for many web-based businesses and search engines a customer is a computational model extracted from browser access and click logs that is a kind of “theoretical customer” whose actual identity is often unknown. These computational customers are the targets of the computational advertising in search engines.
[73] [Law]
The Sarbanes-Oxley Act in the United States and similar legislation in other countries require firms to preserve transactional and accounting records and any document that relates to “internal controls,” which arguably includes any information in any format created by any employee (Langevoort 2006). Civil procedure rules that permit discovery of evidence in lawsuits have long required firms to retain documents, and the proliferation of digital document types like email, voice mail, shared calendars and instant messages imposes new storage requirements and challenges (Levy and Casey, 2006). However, if a company has a data retention policy that includes the systematic deletion of documents when they are no longer needed, courts have noted that this is not willful destruction of evidence.
[74] [CogSci]
For example, students writing a term paper usually organize the printed and digital resources they rely on; the former are probably kept in folders or in piles on the desk, and the latter in a computer file system. This organizing system is not likely to be preserved after the term paper is finished. An exception that proves the rule is the task of paying income taxes, for which (in the US) taxpayers are legally required to keep evidence for up to seven years after filing a tax return (Internal Revenue Service, 2011).
[75] [Citation]
(Rothenberg, 1995).
[76] [Citation]
(Pogue, 2009).
[77] [Computing]
Many of those WordPerfect documents were stored on floppy disks because floppy disk drives were built into almost every personal computer for decades, but it would be hard to find such disk drives today. And even if someone with a collection of word processor documents stored on floppy disks in 1995 had copied those files to newer storage technologies, it is unlikely that the current version of the word processor would be able to read them. Software application vendors usually preserve “backwards compatibility” for a few years with earlier versions to give users time to update their software, but few would support older versions indefinitely because to do so can make it difficult to implement new features. Digital resources can be encoded using non-proprietary and standardized data formats to ensure “forward compatibility” in any software application that implements the version of the standard. However, if the e-book reader, web browser, or other software used to access the resource has capabilities that rely on later versions of the standards, the “old data” won’t have taken advantage of them.
[78] [Computing]
This is tautologically true for sites that publish news, weather, product catalogs with inventory information, stock prices, and similar continually updated content because many of their pages are automatically revised when events happen or as information arrives from other sources. It is also true for blogs, wikis, Facebook, Flickr, YouTube, Yelp and the great many other “Web 2.0” sites whose content changes as they incorporate a steady stream of user-generated content. In some cases the changes are attempts to rewrite history and prevent preservation by removing all traces of information that later turned out to be embarrassing, contradictory, or politically incorrect.
[79] [Citation]
(Fetterly et al, 2003).
[80] [Computing]
However, when a web site disappears its first page can often be found in the search engine’s index “cache” rather than by following what would be a broken link.
[81] [Computing]
The Memento project has proposed a specification for using HTTP headers to perform “datetime negotiation” with the Wayback Machine and other archives of web pages, making it unnecessary for Memento to save anything on its own. Memento is implemented as a browser plug-in to “browse backwards in time” whenever older versions of pages are available from archives that use its specification (Van de Sompel, 2011).
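As a rough illustration of datetime negotiation, the sketch below builds the request header that a client would send to ask an archive for the version of a page closest to a given moment. The Accept-Datetime header name comes from the Memento specification; the function name is a hypothetical helper, and nothing is actually sent over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def datetime_negotiation_headers(when: datetime) -> dict:
    """Build headers asking for the archived version closest to `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    formatted as an HTTP date.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Ask for a page as it existed at the start of 2008 (illustrative date).
headers = datetime_negotiation_headers(datetime(2008, 1, 1, tzinfo=timezone.utc))
# headers == {"Accept-Datetime": "Tue, 01 Jan 2008 00:00:00 GMT"}
```

A client would attach these headers to an ordinary HTTP request for the page, and an archive that supports the specification would answer with, or redirect to, the stored version nearest to that date.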
[82] [Computing]
But people might still enjoy the many Mona Lisa parodies and recreations. See http://www.megamonalisa.com, http://www.oddee.com/item_96790.aspx, http://www.chilloutpoint.com/art_and_design/the-best-mona-lisa-parodies.html
[83] [Citation]
(Brown and Duguid, 2000).
[84] [Citation]
(Savodnik, 2011).
[85] [Computing]
The set of content modules and their assembly structure for each kind of generated document conforms to a template or pattern that is called the document type model when it is expressed in XML.
[86] [Business]
Company intranets, wikis, and blogs are often used as knowledge management technologies; Lotus Notes and Microsoft SharePoint are popular commercial systems.
[87] [Business]
In addition, the line between “preserving species” and “preserving marketing brands” is a fine one for zoos with celebrity animals, and in animal theme parks like SeaWorld, it seems to have been crossed. “Shamu” was the first killer whale (orca) to survive long in captivity and performed for several years at SeaWorld San Diego. Shamu died in 1971 but over forty years later all three US-based SeaWorld parks have Shamu shows and Shamu webcams.
[88] [Citation]
(Manyika et al, 2011).
[89] [LIS]
The College of Physicians of Philadelphia’s Mütter Museum houses a novel collection of artifacts meant to “educate future doctors about anatomy and human medical anomalies.” No museum in the world is like it; it contains display cases full of human skulls, abnormal fetuses in jars, preserved human bodies, a garden of medicinal herbs, and many other unique collections of resources. However, one sub-collection best reflects the distinctive and idiosyncratic selection and arrangement of resources in the museum. Chevalier Jackson, a distinguished laryngologist, collected over 2,000 objects extracted from the throats of patients. Because of the peculiar educational focus of this collection, and because there are few shared characteristics of “things people swallow that they should not,” the characteristics and principles used to organize and describe the collection would be of little use in another organizing system. What other collection would include toys, bones, sewing needles, coins, shells, and dental material? It is hard to imagine that any other collection would include all of these items plus a fully annotated record of the sex and approximate age of the patient, the amount of time the extraction procedure took, the tool used, and whether or not the patient survived.
[90] [LIS]
Curation is a very old concept whose Medieval meaning focused on the “preservation and cure of souls” by a pastor, priest, or “curate” (Simpson and Weiner, 2009). A set of related and systematized curation practices for some class of resources is often called a curation system, especially when they are embodied in technology.
[91] [LIS]
Information about which resources are most often interacted with in scientific or archival collections is essential in understanding resource value and quality.
[92] [LIS]
In memory institutions, the most common job titles include “curator” or “conservator”. In for-profit contexts where “governance” is more common than “curation” job titles reflect that difference. In addition to “governance”, job titles often include “recordkeeping”, “compliance”, or “regulatory” prefixes to “officer”, “accountant”, or “analyst” job classifications.
[93] [CogSci]
Because personal collections are strongly biased by the experiences and goals of the organizer, they are highly idiosyncratic, but still often embody well-thought-out and carefully executed curation activities (Kirsh, 2000; Marshall, 2007; Marshall, 2008).
[94] [Citation]
(Bush, 1945).
[95] [Citation]
(Howe, 2008).
[96] [LIS]
The most salient example of this so-called “community curation” activity is the work to maintain the Wikipedia open-source encyclopedia according to a curation system of roles and functions that governs how and under what conditions contributors can add, revise, or delete articles; receive notifications of changes to articles; and resolve editing disputes (Lovink and Tkacz, 2011). Some museums and scientific data repositories also encourage voluntary curation to analyze and classify specimens or photographs (Wright, 2010).
[97] [Citation]
(Trant, 2009b).
[98] [Business]
Some popular “community content” sites like Yelp where people rate local businesses have been criticized for allowing positive rating manipulation. Yelp has also been criticized for allowing negative manipulation of ratings when competitors slam their rivals.
[99] [Computing]
The resource might have been put someplace else when the site was reorganized or a new web server was installed. It is no longer the same resource because it will have another URI, even if its content did not change.
[100] [Citation]
(Brown, 2009).
[101] [Citation]
(Diaz, 2008; Grimmelmann, 2009).
[102] [Citation]
(Kim, Nofsinger, and Mohr, 2009).
[103] [Computing]
Data governance decisions are also often shaped by the need to conform to information or process model standards, or to standards for IT service management like the Information Technology Infrastructure Library (ITIL, 2011).
[104] [Business]
In this context, these management and maintenance activities are often described as “IT governance” (Weill and Ross, 2004). Data classification is an essential IT governance activity because the confidentiality, competitive value, or currency of information are factors that determine who has access to it, how long it should be preserved, and where it should be stored at different points in its lifecycle.
[105] [Citation]
(Turban et al, 2010).
[106] [Computing]
This master data must be continually “cleansed” to remove errors or inconsistencies, and “de-duplication” techniques are applied to ensure an authoritative source of data and to prevent the redundant storage of many copies of the same resource. Redundant storage can result in wasted time searching for the most recent or authoritative version, cause problems if an outdated version is used, and increase the risk of important data being lost or stolen (Loshin, 2008).
[107] [Citation]
(Cox et al, 2007).
[108] [Law]
Recently imposed requirements by the National Science Foundation, National Institutes of Health and other research granting agencies for researchers to submit “data management plans” as part of their proposals should make digital data curation a much more important concern (Borgman, 2011). (NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp).
To appear in The Discipline of Organizing, 2012. Robert J. Glushko, Daniel D. Turner, Kimra McPherson, Jess Hemerly
This chapter builds upon the foundational concepts introduced in Chapter 1 to explain more carefully what we mean by resource. In particular, we focus on the issue of identity – what will be treated as a separate resource – and discuss the issues and principles we need to consider when we give each resource a name or identifier.
In Organizing How to Think About Resources we introduce four distinctions we can make when we discuss resources: domain, format, agency, and focus. In Resource Identity we apply these distinctions as we discuss how resource identity is determined for physical resources, bibliographic resources, resources in information systems, as well as for active resources and “smart things.” Naming Resources then tackles the problems and principles for naming: once we have identified resources, how do we name and distinguish them? Finally, Resources Over Time considers issues that emerge with respect to resources over time.
Resources are what we organize.
We introduced the concept of “resource” in The Concept of “Resource” with its ordinary sense of “anything of value that can support goal-oriented activity” and emphasized that a group of resources can be treated as a “collection” in an organizing system. And what do we mean by “anything of value,” exactly? It might seem that the question of identity, of what a single resource is, shouldn’t be hard to answer. After all, we live in a world of resources, and finding, selecting, describing, arranging, and referring to them are everyday activities.
Nevertheless, even when the resources we are dealing with are tangible things, how we go about organizing them is not always obvious, or at least not obvious in the same way to each of us at all times. Not everyone thinks of them in the same way. Recognizing something in the sense of perceiving it as a tangible thing is only the first step toward being able to organize it and other resources like it. Which properties garner our attention, and which we use in organizing, depend on our experiences, purposes, and context.
We add information to a resource when we name or describe it; it then becomes more than “it.” We can describe the same resource in many different ways. At various times we can consider any given resource to be one of many members of a broad category, as one of the few members of a narrow category, or as a unique instance of a category with only one member. For example, we might recognize something as a piece of clothing, as a sock, or as the specific dirty sock with the hole worn in the heel from yesterday’s long hike. However, even after we categorize something, we might not be careful how we talk about it; we often refer to two objects as “the same thing” when what we mean is that they are “the same type of thing.” Indeed, we could debate whether a category with only one possible member is really a category, because it blurs an important distinction between particular items or instances and the class or type to which they belong.
The issues that matter and the decisions we need to make about resource instances and resource classes and types are not completely separable. Nevertheless, we will strive to focus on the former ones in this chapter and the latter ones in Chapter 6, “Categories: Describing Resource Classes and Types.”
As tricky as it can be to decide what a resource is when you are dealing with single objects, it is even more challenging when the resources are objects or systems composed of other parts. In these cases, we must focus on the entirety of the object or system and treat it as a resource, treat its constituent parts as resources, and deal with the relationships between the parts and the whole, as we do with engineering drawings and assembly procedures.
How many things is a car? If you’re imagining the car being assembled you might think of several dozen large parts like the frame, suspension, drive train, gas tank, brakes, engine, exhaust system, passenger compartment, doors, and other preassembled components. Of course, each of those components is itself made up of many parts – think of the engine, or even just the radio. Some sources have counted ten or fifteen thousand parts in the average car, but even at that fine granularity many parts are still complex things. There are screws and wires and fasteners and on and on; really too many to count.
This ambiguity about the number of parts holds for information resources too; a newspaper can be considered a single resource but it might also consist of multiple sections, each of which contains separate stories, each of which has many paragraphs, and so on. From the typesetter’s point of view, each character in a sentence can be taken as a distinct resource, selected from a font of similar resources.
Information resources generally pose additional challenges in their identification and description because their most important property is usually their content, which is not easily and consistently recognizable. Organizing systems for information resources in physical form, like those for libraries, have to juggle the duality of their tangible embodiment with what is inherently an abstract information resource; that is, the printed book versus the knowledge the book contains. Here the organizing system emphasizes description resources or surrogates, like bibliographic records, that describe the information content of the resources rather than their physical properties.
Another important question in libraries is what set of resources should be treated as the same work because they contain essentially similar intellectual or artistic content. We may talk about Shakespeare’s play “Macbeth,” but what is this thing we call “Macbeth”? Is it a particular string of words, saved in a computer file or handwritten upon a folio? Is it the collection of words printed with some predetermined font and pagination? Are all the editions and printings of these words the same “Macbeth”? How should we organize the numerous live and recorded performances of plays and movies that share the “Macbeth” name? What about works based on or inspired by “Macbeth” that do not share the title “Macbeth,” like the Kurosawa film “Kumonosu-jo” (“Throne of Blood”) that transposes the plot to feudal Japan?
Information system designers and architects face analogous design challenges when they describe the “information components” in business or scientific organizing systems. Information content is intrinsically merged or confounded with structure and presentation whenever it is used in a specific instance and context. From a logical perspective, an order form contains information components for ITEM, CUSTOMER NAME, ADDRESS, and PAYMENT INFORMATION, but the arrangement of these components, their type font and size, and other non-semantic properties can vary a great deal in different order forms and even across a single information system that repurposes these components for letters, delivery notices, mailing labels, and database entries.[109]
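The separation between logical information components and their presentation can be sketched in a few lines of code. The Python below is only an illustration under stated assumptions: the `Order` class, its field names, and the two rendering functions are hypothetical and are not taken from any particular information system. It shows the same four components being repurposed for a mailing label and for a database entry.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Order:
    """Logical information components of an order (hypothetical names)."""
    item: str
    customer_name: str
    address: str
    payment_information: str


def mailing_label(order: Order) -> str:
    """One presentation: only the name and address, one per line."""
    return f"{order.customer_name}\n{order.address}"


def database_entry(order: Order) -> dict:
    """Another presentation: every component as a keyed field."""
    return asdict(order)


# Made-up sample data, used only to exercise the two presentations.
order = Order(
    item="Garden hose",
    customer_name="Pat Example",
    address="1 Sample Street, Springfield",
    payment_information="invoice",
)

print(mailing_label(order))
print(database_entry(order))
```

The components are defined once, and each presentation selects and arranges them differently, which is the sense in which content is logically distinct from structure and presentation.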
Similar questions about resource identity are posed by the emergence of ubiquitous or pervasive computing, in which information processing capability and connectivity are embedded into physical objects, in devices like smart phones, and in the surrounding environment. Equipped with sensors, radio-frequency identification (RFID) tags, GPS data, and user-contributed metadata, these “smart things” create a jumbled torrent of information about location and other properties that must be sorted into identified streams and then matched or associated with the original resource.
Resource Identity discusses the issues and methods for determining “what is a resource?” for physical resources as well as for the bibliographic resources, information components and “smart things” discussed here in Resources with Parts.
The answer to the question “What is a resource?” has two parts. The first part is identity: what thing are we treating as the resource? The second part is identification: differentiating between this single resource and other resources like it. These problems are closely related. Once you’ve decided what to treat as a resource, you create a name or an identifier so that you can refer to it reliably. A name is a label for a resource that is used to distinguish one from another. An identifier is a special kind of name assigned in a controlled manner and governed by rules that define possible values and naming conventions. For a digital resource, its identifier serves as the input to the system or function that determines its location so it can be retrieved, a process called resolving the identifier or resolution.
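The difference between a name and an identifier, and the idea of resolution, can be made concrete with a small sketch. Everything in the following Python is an assumption made for illustration: the `doc:` prefix rule, the `LOCATIONS` registry, and the example URLs are hypothetical. The sketch shows an identifier being checked against a naming rule and then resolved to a location.

```python
# Hypothetical registry: identifier -> current location of the resource.
LOCATIONS = {
    "doc:0001": "https://example.org/archive/2011/report.pdf",
    "doc:0002": "https://example.org/archive/2012/memo.pdf",
}


def is_valid_identifier(identifier: str) -> bool:
    """Rule assumed for this sketch: the prefix 'doc' followed by four digits."""
    prefix, _, number = identifier.partition(":")
    return (
        prefix == "doc"
        and len(number) == 4
        and number.isascii()
        and number.isdigit()
    )


def resolve(identifier: str) -> str:
    """Resolve an identifier to a location, or raise if that is not possible."""
    if not is_valid_identifier(identifier):
        raise ValueError(f"not a well-formed identifier: {identifier!r}")
    try:
        return LOCATIONS[identifier]
    except KeyError:
        raise LookupError(f"no location registered for {identifier!r}") from None


print(resolve("doc:0001"))
```

An ordinary name such as "the annual report" would fail the rule check, while a well-formed identifier can still fail to resolve if nothing is registered for it.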
Choosing names and identifiers—be it for a person, a service, a place, a trend, a work, a document, a concept, etc.—is hardly straightforward. In fact, naming can often be challenging and is often highly contentious. Naming is made difficult by countless factors, including the audience that will need to access, share, and use the names, the limitations of language, institutional politics, and personal and cultural biases.
A common complication arises when a resource has more than one name or identifier. When something has more than one name, each of the multiple names is a synonym or alias. A particular physical instance of a book might be called a hardcover or paperback or simply a text. George Furnas and his research collaborators called this issue of multiple names for the same resource or concept the “vocabulary problem.”[110]
Whether we call it a book or a text, the resource will usually have a Library of Congress catalog number as well as an ISBN as an identifier. When the book is in a carton of books being shipped from the publisher to a bookstore or library, that carton will have a bar-coded tracking number assigned by the delivery service, and a manifest or receipt document created by the publisher whose identifier associates the shipment with the customer. Each of these identifiers is unique with respect to some established scope or context.
A partial solution to the vocabulary problem is to use a controlled vocabulary. We can impose rules that standardize the way in which names and labels for resources are assigned in the first place. Alternatively, we can define mappings from terms used in our natural language to the authoritative or controlled terms. However, vocabulary control can’t remove all ambiguity. Even if a passport or national identity system requires authoritative full names rather than nicknames, there could easily be more than one Robert John Smith in the system.
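The mapping approach can be sketched as a simple lookup. In the Python below, the `CONTROLLED_TERMS` table and the sample records are hypothetical and chosen only to echo the examples in the text. The sketch maps everyday terms to one authoritative term and then shows that identical authoritative names can still refer to different people.

```python
# Hypothetical mapping from everyday terms to one authoritative term.
CONTROLLED_TERMS = {
    "hardcover": "book",
    "paperback": "book",
    "text": "book",
    "book": "book",
}


def to_controlled_term(term: str) -> str:
    """Map a natural-language term to its controlled term, if one is defined."""
    key = term.strip().lower()
    if key not in CONTROLLED_TERMS:
        raise KeyError(f"no controlled term defined for {term!r}")
    return CONTROLLED_TERMS[key]


# Vocabulary control does not remove all ambiguity: two distinct people
# can share the same authoritative full name (made-up records).
people = [
    {"id": 1, "name": "Robert John Smith"},
    {"id": 2, "name": "Robert John Smith"},
]
matches = [p for p in people if p["name"] == "Robert John Smith"]

print(to_controlled_term("Paperback"))
print(len(matches))
```

Telling the two matching records apart requires an identifier, here the assumed `id` field, rather than a more carefully controlled name.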
Controlling the language used for a particular purpose raises other questions: Who writes and enforces these rules? What happens when organizing systems that follow different rules get compared, combined, or otherwise brought together in contexts different from those for which they were originally intended?
The nature of the resource is critical for the creation and maintenance of quality organizing systems. There are four distinctions we make in discussing resources: domain, format, agency, and focus.
Every resource has some essence or type that distinguishes it from other resources, which we call the resource domain. Domain is an intuitive notion that we can help define by contrasting it with the alternative of ad hoc or arbitrary groupings of resources that just happen to be in the same place at some moment, rather than being based on natural or intrinsic characteristics.
For physical resources domains can be coarsely distinguished according to the type of matter of which they are made using properties that can be readily perceived. All languages and cultures make basic contrasts between animal, vegetable, or material substances and then make further distinctions to create a hierarchical system of domain categories. Many aspects of this system of domain categories are determined by natural constraints on category membership that are manifested in patterns of shared properties; once a resource is identified as a member of one category it must also be a member of another with which it shares some but not all properties. For example, a marble statue in a museum must also be a kind of material resource, and a fish in an aquarium must also be a kind of animal resource.
For information resources, easily perceived properties are less reliable and correlated, so we more often distinguish domains based on semantic properties; the definitions of the “encyclopedia,” “novel,” and “invoice” resource types distinguish them according to their typical subject matter, or the type of content, rather than according to the great variety of physical forms in which we might encounter them. Arranging books by color or size might be sensible for very small collections, or in a photo studio, but organizing according to physical properties would make it extremely impractical to find books in a large library.
We can arrange types of information resources in a hierarchy but because the category boundaries are not sharp it is more useful to view domains of information resources on a continuum from weakly-structured narrative content to highly structured transactional content. This framework, called the Document Type Spectrum by Glushko and McGrath, captures the idea that the boundaries between resource domains, like those between colors in the rainbow, are easy to see for colors far apart in the spectrum but hard to see for adjacent ones.[111] See the Sidebar, THE “DOCUMENT TYPE SPECTRUM”, and its corresponding figure.
Information resources can exist in numerous formats with the most basic format distinction being whether the resource is physical or digital. This distinction is most important when it comes to the implementation of a resource storage or preservation system because that is where physical properties are usually considerations, and very possibly constraints. This distinction is less important at the logical level when we design interactions with resources because it is often possible to use digital surrogates for the physical resources to overcome the constraints posed by their physical properties. When we search for cars or appliances in an online store it doesn’t matter where the actual cars or appliances are located or how they are physically organized.
Many digital representations can be associated with either physical or digital resources, but it is important to know which one is the original or primary resource, especially for unique or valuable ones.
Today a great many resources in organizing systems are born digital. They are created in word processors and digital cameras, or by audio and video recorders. Other resources are produced in digital form by the many types of sensors in “smart things” and by the systems that create digital resources when they interact with barcodes, QR (“quick response”) codes, RFID tags, or other mechanisms for tracking identity and location.[112]
Other digital resources are created by digitization, the process for transforming an artifact whose original format is physical so that it can be stored and manipulated by a computer. We can digitize the printed word, photographs, blueprints and record albums. Printed text, for example, can be digitized by scanning the pages and employing character recognition software or simply by re-typing it.[113]
There are a vast number of digital formats. The simplest digital format for “plain text” documents typically consists of only the characters that you see on your computer keyboard; the alphanumerics and symbols with which we are all by now familiar. Most document formats also explicitly encode a hierarchy of structural components, such as chapters, sections or semantic components like descriptions or procedural steps, and sometimes the appearance of the rendered or printed form.[114] Another important distinction to note is whether the information is encoded as a sequence of text characters so that it is human readable as well as computer readable. Text formats such as EBCDIC, ASCII and Unicode are progressively more modern character encoding formats in use today. Encoding character content with XML, for example, allows for layering of intentional coding or markup interwoven with the “plain text” content. The most complex digital formats are those for multimedia resources and multidimensional data, where the data format is highly optimized for specialized analysis or applications.[115]
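A short sketch can make the difference between character encodings, and between plain text and marked-up text, more tangible. The Python below uses only the standard library. It assumes code page 500 (`cp500`) as a representative EBCDIC variant, and the `step` and `title` element names are hypothetical markup invented for the example.

```python
import xml.etree.ElementTree as ET

# The same letter is stored as a different byte in EBCDIC and in ASCII.
letter = "A"
print(letter.encode("cp500").hex())  # EBCDIC (code page 500 assumed)
print(letter.encode("ascii").hex())  # ASCII

# Unicode covers characters that ASCII cannot represent.
word = "Mütter"
print(word.encode("utf-8").hex())
try:
    word.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")

# XML layers markup over the plain-text content (hypothetical element names).
step = ET.fromstring("<step><title>Scan</title>Scan each page.</step>")
print(step.find("title").text)    # the marked-up component
print("".join(step.itertext()))   # the character content with markup removed
```

The markup lets software address the title as a component in its own right, which the undifferentiated character sequence does not.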
Digitization of non-text resources such as film photography, drawings, and analog audio and visual recordings raises a complicated set of choices about pixel density, color depth, sampling rate, frequency filtering, compression, and numerous other technical issues that determine the digital representation.[116] There may be multiple intended uses and devices for the digitized resource that might require different digitization approaches and formats. Moreover, downstream users of digitized resources often need to know the format in which the digital artifact has been created so they can reuse it as is or process it in other ways.
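The effect of these choices is easy to see with some arithmetic on uncompressed sizes. The Python below is a rough sketch: the function names and the example page size, resolutions, and audio settings are assumptions chosen for illustration, and real formats add headers and usually apply compression.

```python
def image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scanned image, from pixel density and color depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def audio_bytes(seconds: float, sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed size of a recording, from sampling rate and sample size."""
    samples = round(seconds * sample_rate_hz)
    return samples * bits_per_sample * channels // 8


# An 8 x 10 inch photograph in 24-bit color at two pixel densities.
photo_archival = image_bytes(8, 10, 600, 24)
photo_screen = image_bytes(8, 10, 72, 24)

# One minute of stereo audio at 44,100 samples per second, 16 bits per sample.
minute_of_audio = audio_bytes(60, 44_100, 16, 2)

print(photo_archival, photo_screen, minute_of_audio)
```

Under these assumptions the archival scan is roughly seventy times larger than the screen-resolution one, which is one reason a single digitization rarely suits every intended use.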
Some digital formats support interactions that are qualitatively different and more powerful than those possible with physical resources. Museums are using virtual world technology to create interactive exhibits in which visitors can fly through the solar system, scan their own bodies, and change gravity so they can bounce off walls. Sophisticated digital document formats can enable interactions with annotated digital images or video, 3-D graphics or embedded data sets. The Google Art Project contains extremely high resolution photographs of famous paintings that make it possible to see details that are undetectable under the normal viewing conditions in museums.[117]
Nevertheless, digital representations of physical resources can also lose important information and capabilities. The distinctive sounds of hip hop music produced by “scratching” vinyl records on turntables cannot be produced from digital MP3 music files.[118]
Copyright often presents a barrier to digitization, both as a matter of law and because digitization itself enables a degree of copyright enforcement that was not previously possible, by eliminating common forms of access and interaction that are inherent in physical printed books, like the ability to give or sell them to someone else.[119]
Agency, the extent to which a resource can initiate actions on its own, is the third distinction we make about a resource. Another way to express this contrast is between passive resources that are acted upon and active resources that can initiate actions. Telephone answering and fax machines are agents because they are capable of independently responding to an outside stimulus, accepting and managing messages. An ordinary mercury thermometer is not capable of communicating its own reading, but a digital wireless thermometer or “weather station” can. Passive resources serve as nouns or operands, while active resources serve as verbs or operants.[120]
Organizing systems that contain passive or operand resources are ubiquitous for the simple reason that we live in a world of physical resources that we identify and name in order to interact with them. Passive resources are usually tangible and static and thus they become valuable only as a result of some action or interaction with them.
Most organizing systems with physical resources, or with resources that are their digitized equivalents, treat those resources as passive. A printed book on a library shelf, a digital book in an e-book reader, a statue in a museum gallery, or a case of beer in a supermarket refrigerator creates value only when it is checked out, viewed, or consumed. None of these resources exhibits any agency, and none can initiate actions to create value on its own.
Active resources create effects or value on their own, sometimes when they initiate interactions with passive resources. Active resources can be people, other living resources, computational agents, active information sources, or web-based services. We can exploit computing capability, storage capacity and communication bandwidth to create active resources that can do things and support interactions that are impossible for ordinary physical passive resources.
Objects become active resources when they contain sensing or communication capabilities. RFID chips, which are essentially bar codes with built-in radio transponders, enable automated location tracking and context sensing. RFID receivers are built into store shelves, loading docks, parking lots, and toll booths to detect when some RFID-tagged resource is at some meaningful location. RFID tags can be made “smarter” by having them record and transmit information from sensors that detect temperature, humidity, acceleration, and even biological contamination.[121]
Smart phones are also active resources that can identify and share their own location, orientation, acceleration and a growing number of other contextual parameters to enable personalization of information services. Self-regulating appliances are active resources when they communicate with each other in a “smart building” to minimize energy consumption.
Many organizing systems on the web consist of collections or configurations of active digital resources. Interactions among these active resources often implement information-intensive business models where value is created by exchanging, manipulating, transforming, or otherwise processing information, rather than by manipulating, transforming, or otherwise processing physical resources.
“Service Oriented Architecture” (SOA) is an emerging design discipline for organizing active resources as functional business components that can be combined in different ways. SOA is generally implemented using web services that exchange XML documents in real-time information flows to interconnect the business service components.
A familiar design pattern for an organizing system composed from active digital resources is the “online store.” The store can be analyzed as a composition or choreography in which some web pages display catalog items, others serve as “shopping carts” to assemble the order, and then a “checkout” page collects the buyer’s payment and delivery information that gets passed on to other service providers who process payments and deliver the goods.
The web has enabled the novel application of human resources as active resources to carry out tasks of short duration that can be precisely described but which can’t be done reliably by computers. These tasks include image classification or annotation, spoken language transcription, and sentiment analysis. The people doing these tasks over the web are sometimes called “Mechanical Turks” by analogy to a fake chess playing machine from the 18th century that had a human hidden inside who was secretly moving the pieces.[122]
A fourth contrast between types of resources distinguishes primary or original resources from resources that describe them. Any primary resource can have one or more description resources associated with it to facilitate finding, interacting with, or interpreting the primary one. Description resources are essential in an organizing system where the primary resources are not under the system’s control and can only be accessed or interacted with through the description. Description resources are often called metadata.
The distinction between primary resources and description resources, or metadata, is deeply embedded in library science and traditional organizing systems whose collections are predominantly text resources like books, articles, or other documents. In these contexts description resources are commonly called bibliographic resources or catalogs, and each primary resource is typically associated with one or more description resources.
In business enterprises, the organizing systems for digital information resources, such as business documents, or data records created by transactions or automated processes, almost always employ resources that describe, or are associated with, large sets or classes of primary resources.[123]
The contrast between primary resources and description resources is very useful in many contexts, but when we look more broadly at organizing systems, it is often difficult to distinguish them, and determining which resources are primary and which are metadata is often just a decision about which resource is currently the focus of our attention.
For example, many people who use Twitter focus on the 140-character message body as the primary resource, while the associated metadata about the sender and the message (is it a forward, reply, link, and so on?) is less important to them. However, for firms in the growing ecosystem of services that use Twitter metadata to measure sender and brand impact, identify social networks, and assess trends, the focus is on the metadata, not the message content.[124]
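The shift in focus can be illustrated with the same records read in two ways. The Python below is a sketch with made-up data: the field names `sender`, `kind`, and `body` are hypothetical and are not Twitter’s actual data model.

```python
from collections import Counter

# Hypothetical message records; field names are illustrative only.
messages = [
    {"sender": "alice", "kind": "original", "body": "Reading about organizing systems."},
    {"sender": "bob", "kind": "forward", "body": "Reading about organizing systems."},
    {"sender": "alice", "kind": "reply", "body": "Glad you liked it."},
]

# Focus 1: the message body is the primary resource; the rest is metadata.
bodies = [m["body"] for m in messages]

# Focus 2: the metadata is the primary resource; the body is ignored.
messages_per_sender = Counter(m["sender"] for m in messages)
kinds = Counter(m["kind"] for m in messages)

print(bodies)
print(messages_per_sender)
print(kinds)
```

Nothing in the records marks one field as primary; the choice is made by whichever interaction is being supported.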
As another example, the players on professional sports teams are human resources that we enjoy watching as they compete, but millions of people participate in fantasy sports leagues where teams consist of fantasy players that are simulated resources based on the statistics generated by the actual human players. Put another way, the associated resources in the actual sports are treated as the primary ones in the fantasy leagues.
Applying the format contrast between physical and digital resources to the focus distinction between primary and descriptive resources yields a useful framework with four categories of resources.
The oldest relationship between descriptive resources and physical resources is when descriptions or other information about physical resources are themselves encoded in a physical form. Nearly ten thousand years ago in Mesopotamia small clay tokens kept in clay containers served as inventory information to count units of goods or livestock. It took 5,000 years for the idea of stored tokens to evolve into cuneiform writing in which marks in clay stood for the tokens and made both the tokens and containers unnecessary.[125]
Here the digital resource describes a physical resource. The most familiar example of this relationship is the online library catalog used to find the shelf location of physical library resources, which beginning in the 1970s replaced the physical cards with database records. The online catalogs for museums usually contain a digital photograph of the painting, item of sculpture, or other museum object that each catalog entry describes.
Digital description resources for primary physical resources are essential in supply chain management, logistics, retailing, transportation, and every business model that depends on having timely and accurate information about where things are or about their current states. This digital description resource is created as a result of an interaction with a primary physical resource like a temperature sensor or with some secondary physical resource that is already associated with the primary physical resource like an RFID tag, barcode, or two-dimensional QR code.
Augmented reality systems combine a layer of real-time digital information about some physical object with a digital view or representation of it. The yellow “first down” lines superimposed in broadcasts of football games are a familiar example. Augmented reality techniques that superimpose identifying or descriptive metadata have been used in displays to support the operation or maintenance of complex equipment, in smartphone navigation and tourist guides, in advertising, and in other domains where users might otherwise need to consult a separate information source. Advanced airplane cockpit technology includes heads-up displays that present critical data based on available instrumentation, including augmented reality runway lights when visibility is poor because of clouds or fog.
Here the digital resource describes a digital resource. This is the relationship in a digital library or any web-based organizing system and it makes it possible to access the primary digital resource directly from the digital secondary resource.
This is the relationship implemented when we encounter an embedded QR barcode in newspaper or magazine advertisements, on billboards, sidewalks, t-shirts, or on store shelves. Scanning the QR code with a mobile phone camera can launch a web site that contains information about a product or service, place an order for one unit of the pointed-to item in a web catalog, dial a phone number, or initiate any other application or service identified by the QR code.[126]
Determining the identity of resources that belong in a domain, deciding which properties are important or relevant to the people or systems operating in that domain, and then specifying the principles by which those properties encapsulate or define the relationships among the resources are the essential tasks when building any organizing system. In organizing systems used by individuals or with small scope, the methods for doing these tasks are often ad hoc and unsystematic, and the organizing systems are therefore idiosyncratic and do not scale well. At the other extreme, organizing systems designed for institutional or industry-wide use, especially in information-intensive domains, require systematic design methods to determine which resources will have separate identities and how they are related to each other. These resources and their relationships are then described in conceptual models which then are used to guide the implementation of the systems that manage the resources and support interactions with them.[127]
Our human visual and cognitive systems do a remarkable job at picking out objects from their backgrounds and distinguishing them from each other. In fact, we have little difficulty recognizing an object or a person even if we’re seeing them from a novel distance and viewing angle or with different lighting, shading, and so on. When we watch a football game, we don’t have any trouble perceiving the players moving around the field, and their contrasting uniform colors allow us to see that there are two different teams.
The perceptual mechanisms that make us see things as permanent objects with contrasting visible properties are just the prerequisite for the organizing tasks of identifying the specific object, determining the categories of objects to which it belongs, and deciding which of those categories is appropriate to emphasize. Most of the time we carry out these tasks in an automatic, unconscious way; at other times we make conscious decisions about them. For some purposes we consider a sports team as a single resource; for others we treat it as a collection of separate players, as offense and defense, as starters and reserves, and so on.[128]
Although we have many choices about how we can organize football players, all of them will include the concept of a single player as the smallest identifiable resource. We are never going to think of a football player as an intentional collection of separately identified leg, arm, head, and body resources because there are no other ways to “assemble” a human from body parts. Put more generally, there are some natural constraints on the organization of matter into parts or collections based on sizes, shapes, materials, and other properties that make us identify some things as indivisible resources in some domain.
Pondering the question of identity is something relatively recent in the world of librarians and catalogers. Libraries have been around for about 4000 years, but until the last few hundred years librarians created “bins” of headings and topics to organize resources without bothering to give each individual item a separate identifier or name. This meant searchers first had to make an educated guess as to which bin might house their desired information—“Histories”? “Medical and Chemical Philosophy”?—then scour everything in the category in a quest for their desired item. The choices were ad hoc and always local—that is, each cataloger decided the bins and groupings for each catalog.[129]
The first systematic approach to dealing with the concept of identity for bibliographic resources was developed by Antonio Panizzi at the British Museum in the mid-19th century. Panizzi wondered: How do we differentiate similar objects in a library catalog? His solution was a catalog organized by author name with an index of subjects, along with his newly concocted Rules for the Compilation of the Catalogue. This contained 91 rules about how to identify and arrange author names and titles and what to do with anonymous works. The Rules were meant to codify how to differentiate and describe each singular resource in his library. Taken together, the rules serve to group all the different editions and versions of a work together under a single identity.[130]
The concept of identity for bibliographic resources was refined in the 1950s by Lubetzky, who enlarged the concept of “the work” to make it a more abstract idea of an author’s intellectual or artistic creation. According to Lubetzky’s principle, an audio book, a video recording of a play, and an electronic book should each be listed as distinct items, yet still linked to the original because of their overlapping intellectual origin.[131]
The distinctions put forth by Lubetzky, Svenonius and other library science theorists have evolved today into a four-step abstraction hierarchy between the abstract work, an expression in multiple formats or genres, a particular manifestation in one of those formats or genres, and a specific physical item. The broad scope from the abstract work to the specific item is essential because organizing systems in libraries must organize tangible artifacts while expressing the conceptual structure of the domains of knowledge represented in their collections.
If we revisit the question “What is this thing we call Macbeth?” we can see how different ways of answering fit into this abstraction hierarchy. The most specific answer is that “Macbeth” is a specific item, a very particular and individual resource, like that dog-eared paperback with yellow marked pages that you owned when you read “Macbeth” in high school. A more abstract answer is that “Macbeth” is an idealization called a work, a category that includes all the plays, movies, ballets, or other intellectual creations that share a recognizable amount of the plot and meaning from the original Shakespeare play.
This hierarchy is defined in the Functional Requirements for Bibliographic Records (FRBR), published as a standard by the International Federation of Library Associations and Institutions (IFLA).[132]
In information-intensive domains, documents, databases, software applications, or other explicit repositories or sources of information are ubiquitous and essential to the creation of value for the user, reader, consumer, or customer. Value is created through the comparison, compilation, coordination or transformation of information in some chain or choreography of processes operating on information flowing from one information source or process to another. These processes are employed in accounting, financial services, procurement, logistics, supply chain management, insurance underwriting and claims processing, legal and professional services, customer support, computer programming, and energy management.
The processes that create value in information-intensive domains are “glued together” by shared information components that are exchanged in documents, records, messages, or resource descriptions of some kind. Information components are the primitive and abstract resources in information-intensive domains. They are the units of meaning that serve as building blocks of composite descriptions and other information artifacts.
The value creation processes in information-intensive domains work best when their component parts come from a common controlled vocabulary for components, or when each uses a vocabulary with a granularity and semantic precision compatible with the others. For example, the value created by a personal health record emerges when information from doctors, clinics, hospitals, and insurance companies can be combined because they all share the same “patient” component as a logical piece of information.
This abstract definition of information components doesn’t help identify them, so we’ll introduce some heuristic criteria: An “information component” can be (1) Any piece of information that has a unique label or identifier or (2) Any piece of information that is self-contained and comprehensible on its own.[133]
These two criteria for determining the identity of information components are often easy to satisfy through observations, interviews, and task analysis because people naturally use many different types of information and talk easily about specific components and the documents that contain them. Some common components (e.g., person, location, date, item) and familiar document types (e.g., report, catalog, calendar, receipt) can be identified in almost any domain. Other components need to be more precisely defined to meet the more specific semantic requirements of narrower domains. These smaller or more fine-grained components might be viewed as refined or qualified versions of the generic components and document types, like course grade and semester components in academic transcripts, airport codes and flight numbers in travel itineraries and tickets, and drug names and dosages in prescriptions.
Decades of practical and theoretical effort in conceptual modeling, relational theory, and database design have resulted in rigorous methods for identifying information components when requirements and business rules for information can be precisely specified. For example, in the domain of business transactions, required information like item numbers, quantities, prices, payment information, and so on must be encoded as a particular type of data (integer, decimal, Unicode string, etc.) with clearly defined possible values and clear occurrence rules.[134]
Identifying components can seem superficially easy at the transactional end of the Document Type Spectrum (see Sidebar in Resource Domain), with orders or invoices, forms requiring data entry, or other highly structured document types like product catalogs, where pieces of information are typically labeled and delimited by boxes, lines, white space or other presentation features that encode the distinctions between types of content. For example, the presence of ITEM, CUSTOMER NAME, ADDRESS, and PAYMENT INFORMATION labels on the fields of an online order form suggests these pieces of information are semantically distinct components in a retail application. They follow the “self-contained and comprehensible” heuristic enough to interconnect the order management, payment, and delivery services that work together to carry out the transaction. In addition, these labels might have analogues in variable names in the source code that implements the order form, or as tags in an XML document created by the ordering application; <CustName>John Smith</CustName> and <Item>A-19</Item> in the order document can be easily identified when it is sent to the other services by the order management application.
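As a minimal sketch of how a receiving service might pick out such labeled components, the Python fragment below parses a small order document. The Order wrapper element and the function name extract_components are assumptions made for illustration; only the CustName and Item tags come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document. Only <CustName> and <Item> appear in the text;
# the <Order> wrapper element is assumed for illustration.
ORDER_XML = "<Order><CustName>John Smith</CustName><Item>A-19</Item></Order>"


def extract_components(xml_text):
    """Return a dict mapping each labeled component (its tag) to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


components = extract_components(ORDER_XML)
print(components["CustName"])  # John Smith
print(components["Item"])      # A-19
```

Because each component carries its own label, the payment or delivery service can find the piece it needs without knowing how the order form was laid out on the screen.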
But the theoretically grounded methods for identifying components like those of relational theory and normalization that work for structured data do not strictly apply when information requirements are more qualitative and less precise at the narrative end of the Document Type Spectrum. These information requirements are typical of narrative, unstructured and semi-structured types of documents, and information sources like those often found in law, education, and professional services. Narrative documents include technical publications, reports, policies, procedures and other less structured information, where semantic components are rarely labeled explicitly and are often surrounded by text that is more generic. Unlike transactional documents that depend on precise semantics because they are used by computers, narrative documents are used by people, who can ask if they aren’t sure what something means, so there is less need to explicitly define the meaning of the information components. Occasional exceptions, such as where components in narrative documents are identified with explicit labels like NOTE and WARNING, only prove the rule.
Active resources (Reduce Synonymy and Homonymy with Controlled Vocabularies) initiate effects or create value on their own. In many cases an inherently passive physical resource like a product package or shipping pallet is transformed into an active one when it is associated with an RFID tag or bar code. Mobile phones contain device or subscriber IDs so that any information they communicate can be associated both with the phone and often, through indirect reference, with a particular person. If the resource has an IP address, it is said to be part of the “Internet of Things.”[135]
Organizing systems that create value from active resources often co-exist with or complement organizing systems that treat their resources as passive. In a traditional library, books sat passively on shelves and required users to read their spines to identify them. Today, some library books contain active RFID tags that make them dynamic information sources that self-identify by publishing their own locations. Similarly, a supermarket or department store might organize its goods as physical resources on shelves, treating them as passive resources, while superimposed on that traditional organizing system is one that uses point-of-sale transaction information created when items are scanned at checkout counters to automatically re-order goods and replenish the inventory at the store where they were sold. In some stores the shelves contain sensors that continually “talk to the goods” and the information they gather can maintain inventory levels and even help prevent theft of valuable merchandise by tracking goods through a store or warehouse. The inventory becomes a collection of active resources, each item eager to announce its own location and ready to conduct its own sale.
Blogjects—objects that blog—and Tweetjects—objects that post messages to Twitter—are neologisms for active resources that are plugged into the social web. Blogjects don’t write editorial commentary about their experiences, but they use APIs and customized programs to harness the information captured by sensors and RFID that then appears on blogs in the form of human-readable maps, charts, and text.[136]
Tweetjects are sensors that send information about measurements or events to a Twitter account. For example, SparkFun Electronics sells a kit consisting of a soil sensor that sends information about the water level in the soil through an Arduino circuit board, converting thresholds to Twitter messages like, “Please water me, I’m thirsty!”[137]
The extent to which an active resource is “smart” depends on how much computing capability it has available to refine the data it collects and communicates. A large collection of sensors can transmit a torrent of captured data that requires substantial processing to distinguish significant events from those that reflect normal operation, and also from those that are statistical outliers with strange values caused by random noise. This challenge gets qualitatively more difficult as the amount of data grows, because a one-in-a-million event might be a statistical outlier that can be ignored, but if there are a thousand similar outliers in a billion sensor readings, this cluster of data probably reveals something important. On the other hand, giving every sensor the computing capability to refine its data so that it only communicates significant information might make the sensors too expensive to deploy.[138]
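The arithmetic behind this point can be sketched in a few lines of Python. The function name and the reading counts are illustrative assumptions, not measurements from any real sensor network.

```python
def expected_count(readings, one_in):
    """Expected number of events that occur about once per `one_in` readings."""
    return readings / one_in


# A one-in-a-million event is almost never seen in a thousand readings...
print(expected_count(1_000, 1_000_000))          # 0.001

# ...but the same rate yields about a thousand occurrences in a billion readings.
print(expected_count(1_000_000_000, 1_000_000))  # 1000.0
```

At the smaller scale a single odd reading can reasonably be discarded; at the larger scale the same rate produces enough similar readings to form a cluster worth examining.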
Determining the identity of the thing, document, information component, or data item we need isn’t always enough. We often need to give that resource a name, a label that will help us understand and talk about what it is. But naming isn’t just the simple task of assigning a sequence of characters. In this section, we’ll discuss why we name, some of the problems with naming, and the principles that help us name things in useful ways.
When a child is born, its parents give it a name, often a very stressful and contentious decision. Names serve to distinguish one person from another. Names also, intentionally or unintentionally, suggest characteristics or aspirations. The name given to us at birth is just one of the names we will be identified with during our lifetimes. We have nicknames, names we use professionally, names we use with friends, and names we use online. Our banks, our schools, and our governments will know who we are because of numbers they associate with our names. As long as it serves its purpose to identify you, your name could be anything.[139]
Resources other than people need names so we can find them, describe them, reuse them, refer or link to them, record who owns them, and otherwise interact with them. In many domains the names assigned to resources are also influenced or constrained by rules, industry practice, or technology considerations.
Giving names to anything, from a business to a concept to an action, can be a difficult process and it is possible to do it well or do it poorly. The following section details some of the major challenges in assigning a name to a resource.
Every natural language offers more than one way to express any thought, and in particular there are usually many words that can be used to refer to the same thing or concept. The words people choose to name or describe things are embodied in their experiences and context, so people will often disagree in the words they use. Moreover, people are often a bit surprised when it happens, because what seems like the natural or obvious name to one person isn’t natural or obvious to another.[140]
Back in the 1980s in the early days of computer user interface design, George Furnas and his colleagues at Bell Labs conducted a set of experiments to measure how much people would agree when they named some resource or function. The short answer: very little. Left to our own devices, we come up with a shockingly large number of names for a single common thing.
In one experiment, a thousand pairs of people were asked to “write the name you would give to a program that tells about interesting activities occurring in some major metropolitan area.” Fewer than 12 pairs of people agreed on a name. Furnas called this phenomenon “the vocabulary problem,” concluding that no single word could ever be considered the “best” name.[141]
Sometimes the same word can refer to different resources—a “bank” can be a financial institution or the side of a river. When two words are spelled the same but have different meanings they are homographs; if they are also pronounced the same they are homonyms. If the different meanings of the homographs are related, they are called polysemes.
Resources with homonymous and polysemous names are sometimes incorrectly identified, especially by an automated process that can’t use common sense or context to determine the correct referent. Polysemy can cause more trouble than simple homography because the overlapping meaning might obscure the misinterpretation. If one person thinks of a “shipping container” as being a cardboard box and orders some of them, while another person thinks of a “shipping container” as the large box carried by semi-trailers and stacked on cargo ships, their disagreement might not be discovered until the wrong kinds of containers arrive.[142]
Many words in different languages have common roots, and as a result are often spelled the same or nearly the same. This is especially true for technology words; for example, “computer” has been borrowed by many languages. The existence of these cognates and borrowed words makes us vulnerable to false cognates. When a word in one language has a different meaning and refers to different resources in another, the results can be embarrassing or disastrous. “Gift” is poison in German; “pain” is bread in French.
False cognates are a special category of words that make poor names, and there are many stories of product marketing mistakes in which a product name or description translates poorly into other languages or cultures and picks up undesirable associations.[143] Furthermore, these undesirable associations differ across cultures. For example, even though floor numbers have the straightforward purpose of identifying floors from lowest to highest, most buildings in Western cultures skip the 13th floor because many people think 13 is an unlucky number. In many East and Southeast Asian buildings, the 4th floor is skipped. In China the number 4 is dreaded because it sounds like the word for “death,” while 8 is prized because it sounds like the word for “wealth.”
While it can be tempting to dismiss unfamiliar biases and beliefs about names and identifiers as harmless superstitions and practices, their implications are ubiquitous and far from benign. Alphabetic ordering might seem like a fair and non-discriminatory arrangement of resources, but because it is easy to choose the name at the top of an alphabetical list, many firms in service businesses select names that begin with “A,” “AA,” or even “AAA” (look in any printed service directory). A consequence of this bias is that people or resources with names that begin with letters late in the alphabet are systematically discriminated against because they are often not considered, or because they are evaluated in the context created by resources earlier in the alphabet rather than on their own merit.[144]
Many resources are given names based on attributes that can be problematic later if the attribute changes in value or interpretation.
Web resources are often referred to using URLs that contain the domain name of the server on which the resource is located, followed by the directory path and file name on the computer running the server. This treats the current location of the resource as its name, so the name will change if the resource is moved. It also means that resources that are identical in content, like those at an archive or mirror web site, will have different names than the original even though they are exact copies. An analogous problem is faced by restaurants or businesses with street names or numbers in their names if they lose their leases or want to expand.[145]
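A minimal Python sketch makes the problem concrete. The two URLs and the function name location_parts are hypothetical, invented for illustration; urlsplit is from the Python standard library.

```python
from urllib.parse import urlsplit


def location_parts(url):
    """Split a URL into the server, directory path, and file name it encodes."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    return {"server": parts.netloc, "directory": directory, "file": filename}


# Hypothetical original and mirror copies of the same document.
original = location_parts("http://www.example.org/reports/2011/summary.html")
mirror = location_parts("http://mirror.example.net/archive/summary.html")

print(original["server"])                  # www.example.org
print(original["file"] == mirror["file"])  # True
print(original == mirror)                  # False
```

Every part of such a name describes where the copy currently lives, so two identical copies end up with different names, and moving either copy changes its name.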
Some dynamic web resources that are generated by programs have URLs that contain information about the server technology used to create them. When the technology changes, the URLs will no longer work.[146]
Other resources have names that include page numbers, which disappear or change when the resource is accessed in a digital form. For example, the standard citation format for legal opinions uses the page number from the printed volume issued by West Publishing, which has a virtual monopoly on the publishing of court opinions and other types of legal documents.[147]
Some resources have names that contain dates, years or other time indicators, most often to point to the future. The film studio named “20th Century Fox” took on that name in the 1930s to give it a progressive identity, but today a name with “20th Century” in it does the opposite because it looks backward in time.[148]
Another naming problem can arise when names are assigned by automated processes in ways that are conceptually different than how people do it. The difference in conceptual perspective in resource naming and description has been called the semantic gap.[149]
The semantic gap is largest when computer programs or sensors obtain and name some information in a format optimized for efficient capture, storage, decoding, or other technical criteria. The names — like IMG20268.jpg on a digital photo — might make sense for the camera as it stores consecutively taken photos but they are not good names for people. We may prefer names that describe the content of the picture, like “goldengatebridge.jpg.”
And if we try to examine the content of computer-created or sensor-captured resources, like a clip of music or a compiled software program, a human-language text rendering of the content simply looks like nonsense. It was designed to be interpreted by a computer program, not by a person.
If someone tells you they are having dinner with their best friend, a cousin, someone with whom they play basketball, and their professional mentor from work, how many places at the table will be set? Anywhere from two to five; it’s possible all those relational descriptions refer to a single person, or to four different people, and because “friend,” “cousin,” “basketball teammate” and “mentor” don’t name specific people you’ll have to guess who is coming to dinner.
If instead of descriptions you’re told that the dinner guests are Bob, Carol, Ted, and Alice, you can count four names and you know how many people are having dinner. But you still can’t be sure exactly which four people are involved because there are many people with those names.
The uncertainty is completely eliminated only if we use identifiers for the people rather than names. Identifiers are names that refer unambiguously to a specific person, place, or resource because they are assigned in a controlled way. Identifiers are often created as strings of numbers or letters rather than words to avoid the biases and associations that words can convey. For example, in some universities professors grade final exams that are identified with student numbers rather than names so that grades are assigned without the bias that could arise if the professor knows the student.
The distinction between names and identifiers for people is often not appreciated. See the Sidebar, NAMES {AND, OR, VS} IDENTIFIERS.
The most basic principle of naming is to choose names that are informative, which makes them easier to understand and remember. It is easier to tell what a computer program or XML document is doing if it uses names like “ItemCost” and “TotalCost” rather than just “I” or “T”. People will enter more consistent and reusable address information if a form asks explicitly for “Street,” “City,” and “PostalCode” instead of “Line1” and “Line2.”
Identifiers can be designed with internal structure and semantics that convey information beyond the basic aspect of pointing to a specific resource. An International Standard Book Number like “ISBN 978-0-262-07261-8” identifies a resource (07261=“Document Engineering”) and also reveals that the resource is a book (978), in English (0), and published by MIT Press (262).[150]
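The internal structure of such an identifier can be read mechanically. The following Python sketch splits the hyphenated ISBN above into its parts and verifies the final check digit using the standard ISBN-13 rule of alternating weights 1 and 3. The function names and dictionary keys are assumptions chosen for illustration, and the split relies on the hyphens being present because the parts vary in length from one ISBN to another.

```python
def parse_isbn13(hyphenated):
    """Split a hyphenated ISBN-13 into its five labeled parts."""
    prefix, group, publisher, title, check = hyphenated.split("-")
    return {
        "prefix": prefix,        # 978 marks the resource as a book
        "group": group,          # 0 is an English-language group
        "publisher": publisher,  # 262 is MIT Press
        "title": title,          # 07261 is the specific publication
        "check_digit": check,
    }


def has_valid_check_digit(hyphenated):
    """Verify the final digit with the ISBN-13 weights 1, 3, 1, 3, ..."""
    digits = [int(ch) for ch in hyphenated if ch.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


parts = parse_isbn13("978-0-262-07261-8")
print(parts["publisher"])                          # 262
print(has_valid_check_digit("978-0-262-07261-8"))  # True
```

The check digit adds no meaning about the book itself; it exists so that a mistyped identifier can usually be detected before it is used to look anything up.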
The navigation points that mark intersections of radial signals from ground beacons or satellites that are crucial to aircraft pilots used to be meaningless five-letter codes. These identifiers were changed to make them suggest their locations, making them semantic landmarks that made pilots less likely to enter the wrong names into navigation systems. For example, some of the navpoints near Orlando, Florida, the home of Disney World, are MICKI, MINEE, and GOOFY.[151]
One way to encourage good names for a given resource domain or task is to establish a controlled vocabulary. A controlled vocabulary can be thought of as a fixed or closed dictionary that includes all the terms that can be used in a particular domain. A controlled vocabulary shrinks the number of words used, reducing synonymy and homonymy and eliminating undesirable associations, leaving behind a set of words with precisely defined meanings and rules governing their use. Controlled vocabularies are applied in many organizing systems, from bibliographic languages that determine the ways books are catalogued in a library to business languages that define the set of information components that can be used in transactional documents.
A controlled vocabulary isn’t simply a set of allowed words; it also includes their definitions and often specifies rules by which the vocabulary terms can be used and combined. Different domains can create specific controlled vocabularies for their own purposes, but the important thing is that the vocabulary be used consistently throughout that domain.[152]
For bibliographic resources, important aspects of vocabulary control include determining the authoritative forms for author names, uniform titles of works, and the set of terms by which a particular subject will be known. In library science, the process of creating and maintaining these standard names and terms is known as authority control. When evaluating what name to use for an author, librarians typically look for the name form that’s used most commonly across that author’s body of work while conforming to rules for handling prefixes, suffixes and other name parts that often cause name variations. For example, a name like that of Johann Wolfgang von Goethe might be alphabetized as either a “G” name or a “V” name, but using “G” is the authoritative way. “See” and “see also” references then map the variations to the authoritative name. Similar rules are followed for identifying the authoritative form of titles when multiple translations and editions exist.[153]
Official authority files are maintained for many resource domains: a gazetteer associates names and locations and tells us whether we should be referring to Bombay or Mumbai; the Domain Name System maps human-oriented domain and host names to their IP addresses; the Chemical Abstracts Service Registry assigns unique identifiers to every chemical described in the open scientific literature; numerous institutions assign unique identifiers to different categories of animal species.[154]
In some cases, authority files are created or maintained by a community, as in the case of MusicBrainz, an “open music encyclopedia” to which users contribute information about artists, releases, tracks, and other aspects of music. Music metadata is notoriously unreliable; one study found over 100 variations in the description of the “Knockin’ on Heaven’s Door” song (written by Bob Dylan) as recorded by Guns N’ Roses.[155]
A controlled vocabulary is extremely useful to people who use it, but if you are designing an organizing system for other people who do not or cannot use it, you need to accommodate the variety of words they will actually use when they seek or describe resources. The authoritative name of a certain fish species is Amphiprion ocellaris, but most people would search for it as “clownfish,” “anemone fish,” or even by its more familiar film name of “Nemo.”
Furnas suggests “unlimited aliasing” to connect the uncontrolled or natural vocabularies that people use with the controlled one employed by the organizing system. By this he means that there must be many alternate access routes to each word or function that a user is trying to find. For example, the birth name of the 42nd US President is “William Jefferson Clinton,” but web pages that refer to him as “Bill Clinton” are vastly more common, and searches for the former are redirected to the latter. A related mechanism used by search engines is spelling correction, essentially treating all the incorrect spellings as aliases of the correct one (“did you mean California?” when you typed “Claifornia”).
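A minimal sketch of unlimited aliasing, assuming a small hand-built alias table: every variant a person might type is mapped to the one authoritative term, and near-miss spellings are treated as further aliases. The table contents, the function name `resolve_term`, and the similarity cutoff are illustrative assumptions, not part of any particular search engine.

```python
import difflib

# Hypothetical alias table: each key is a lower-cased variant, each value is
# the authoritative term that the organizing system actually uses.
ALIASES = {
    "amphiprion ocellaris": "Amphiprion ocellaris",
    "clownfish": "Amphiprion ocellaris",
    "anemone fish": "Amphiprion ocellaris",
    "nemo": "Amphiprion ocellaris",
    "bill clinton": "Bill Clinton",
    "william jefferson clinton": "Bill Clinton",
    "california": "California",
}


def resolve_term(query, cutoff=0.75):
    """Return the authoritative term for a query, or None if nothing matches."""
    # Normalize case and whitespace so trivial variations are not new aliases.
    key = " ".join(query.lower().split())
    if key in ALIASES:
        return ALIASES[key]
    # Spelling correction: treat a close misspelling as one more alias.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

The design choice worth noting is that the controlled vocabulary stays small while the alias table is free to grow: adding another access route never changes the authoritative term.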
Even though an identifier refers to a single resource, this doesn’t mean that no two identifiers are identical. One military inventory system might use stock number 99 000 1111 to identify a 24-hour, cold-climate ration pack, while in another inventory system the same number could be used to identify an electronic radio valve. Each identifier is unique in its inventory system, but if a supply request gets sent to the wrong warehouse, hungry soldiers could be sent radio valves instead of rations.[156] [157]
We can prevent or reduce identifier collisions by adding information about the namespace, the domain from which the names or identifiers are selected, thus creating what are often called qualified names. There are several dozen US cities named “Springfield” and “Washington,” but adding state codes to mail addresses distinguishes them. Likewise, we can add prefixes to XML element names when we create documents that reuse components from multiple document types, distinguishing <book:Title> from <legal:Title>.
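The XML case can be sketched with Python's standard `xml.etree.ElementTree` module. The namespace URIs, the wrapping `record` element, and the two title values are hypothetical; the point is only that a prefix bound to a namespace turns two colliding `Title` names into two distinct qualified names.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces; any URIs the document designers control would do.
NAMESPACES = {
    "book": "http://example.com/ns/book",
    "legal": "http://example.com/ns/legal",
}

# A document that reuses a "Title" component from two document types.
DOCUMENT = """\
<record xmlns:book="http://example.com/ns/book"
        xmlns:legal="http://example.com/ns/legal">
  <book:Title>Organizing Systems</book:Title>
  <legal:Title>Warranty Deed</legal:Title>
</record>
"""

root = ET.fromstring(DOCUMENT)


def qualified_text(prefix, local_name):
    """Look up an element by its qualified name (prefix plus local name)."""
    element = root.find(f"{prefix}:{local_name}", NAMESPACES)
    return element.text if element is not None else None


book_title = qualified_text("book", "Title")
legal_title = qualified_text("legal", "Title")
# The bare name matches neither element, because both are namespace-qualified.
unqualified = root.find("Title")
```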
We can fix problems like these by qualifying or extending the identifier, or by creating a globally unique identifier (or GUID), one that will never be the same as another identifier in any organizing system anywhere else. One easy method to create a GUID is to use a URL you control and append a string to it, the same approach that gives every web page a unique address. GUIDs are often used to identify software objects, the resources in distributed systems, or data collections.[158]
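As a rough illustration, the sketch below builds identifiers in three ways: by appending a local identifier to a URL prefix, by deriving a name-based UUID from that URL, and by generating a random UUID. The `example.org` prefixes and the function names are assumptions made for the example; the `uuid` calls are from Python's standard library.

```python
import uuid

# Hypothetical URL prefixes, one per inventory system. In practice each
# would be a URL that the owner of that system actually controls.
RATIONS_BASE = "http://example.org/rations/"
ELECTRONICS_BASE = "http://example.org/electronics/"


def url_identifier(base_url, local_id):
    """Qualify a local identifier by appending it to a controlled URL."""
    return base_url + local_id.replace(" ", "-")


def name_based_uuid(base_url, local_id):
    """Derive a repeatable 128-bit identifier from the qualified URL."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url_identifier(base_url, local_id))


def random_uuid():
    """Generate a random 128-bit identifier that needs no central registry."""
    return uuid.uuid4()


# The same stock number no longer collides once it is qualified.
ration_pack = url_identifier(RATIONS_BASE, "99 000 1111")
radio_valve = url_identifier(ELECTRONICS_BASE, "99 000 1111")
```

Name-based identifiers are reproducible by anyone who knows the URL, whereas random ones carry no information at all about the resource; which is preferable depends on whether the identifier should be opaque.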
Because they aren’t created by an algorithm whose results are provably unique, we do not consider fingerprints, or other biometric information, to be globally unique identifiers for people, but for all practical purposes they are.[159]
Library call numbers are identifiers that do not contain any information about where the resource can be found in the library stacks or in a digital repository. This separation enables this identification system to work when there are multiple copies in different locations, in contrast to URLs that serve as both identifiers and locations much of the time. When the identifier does not contain information about resource location, we need a way to interpret or resolve it to determine the location. With physical resources, resolution takes place with the aid of signs, maps, or other associated resources that describe the arrangement of resources in some physical environment; for example, “you are here” maps list the buildings in an area and associate each with a coordinate or other means of finding it on the map. With digital resources, the resolver is a directory system or service that interprets an identifier and looks up its location or directly initiates the retrieval of the resource.
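A resolver can be sketched as nothing more than a lookup table kept apart from the identifiers themselves. The identifiers, locations, and function names below are hypothetical; the sketch only shows that a copy can be moved or added without the identifier changing.

```python
# Hypothetical directory: identifier -> every known location of a copy.
DIRECTORY = {
    "CALL-0001": ["Main Library, Floor 3", "Engineering Library, Floor 1"],
    "DIGITAL-0001": ["https://example.org/repository/item-0001"],
}


def resolve(identifier):
    """Return the current locations for an identifier (empty if unknown)."""
    return list(DIRECTORY.get(identifier, []))


def relocate(identifier, old_location, new_location):
    """Move one copy; only the directory changes, never the identifier."""
    locations = DIRECTORY[identifier]
    locations[locations.index(old_location)] = new_location
```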
Problems of “what is the resource?” and “how do we identify it?” are complex and often require ongoing work to ensure they are properly answered as the content and context of an organizing system evolves. As a result, we might need to know how a resource does or does not change over time (its persistence), whether its state and content come into play at a specified point in time (its effectivity), whether the resource is what it is said to be (its authenticity), and sometimes who has certified its authenticity over time (its provenance).
Even if you have reached an agreement as to the meaning of “a thing” in your organizing system, you still face the question of the identity of the resource over time, or its persistence.
How long must an identifier last? Coyle gives the conventional, if unsatisfying, answer: “As long as it’s needed”.[160] In some cases, the time frame is relatively short. When you order a specialty coffee and the barista asks for your name, this identifier only needs to last until you pick up your order at the end of the counter. But other time frames are much longer. For libraries and repositories of scientific, economic, census, or other data the time frame might be “forever.”
The design of a scheme for persistent identifiers must consider both the required time frame and the number of resources to be identified. When the Internet Protocol was designed in 1980, it contained a 32-bit address scheme, sufficient for over 4 billion unique addresses. But the enormous growth of the Internet and the application of IP addresses to resources of unexpected types have required a new addressing scheme with 128 bits.[161]
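The arithmetic behind these address schemes is easy to check. The sketch below uses Python's standard `ipaddress` module to count the addresses in the whole 32-bit space and in the whole 128-bit space of IPv6, the successor scheme; the variable names are ours.

```python
import ipaddress

# The network "0.0.0.0/0" spans every 32-bit address; "::/0" spans every
# 128-bit address.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

# How many 32-bit address spaces fit inside the 128-bit one.
expansion_factor = ipv6_total // ipv4_total
```

Here `ipv4_total` is 2 to the 32nd power, or 4,294,967,296, and `ipv6_total` is 2 to the 128th power, so the newer space is 2 to the 96th power times larger.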
Recognition that URLs are often not persistent as identifiers for web-based resources led the Association of American Publishers (AAP) to develop the Digital Object Identifier (DOI) system. The location and owner of a digital resource can change, but its DOI is permanent.[162]
Even though persistence often has a technology dimension, it is more important to view it as a commitment by an institution or organization to perform activities over time to ensure that a resource is available when it is needed. Put another way, preservation (Organizing Digital Resources) and governance (Organizing with Multiple Resource Properties) are activities carried out to ensure the outcome of persistence.
The subtle relationship between preservation and persistence raises some interesting questions about what it means for a resource to stay the same over time. One way to think of persistence is that a persistent resource is never changed. However, physical resources often require maintenance, repair, or restoration to keep them accessible and usable, and we might question whether at some point these activities have transformed them into different resources.[163] Likewise, digital resources require regular backup and migration to keep them available and this might include changing their digital format.
We might instead think of persistence more abstractly, and expect that persistent resources need only to remain functionally the same to support the same interactions at any point in their lifetimes even if their physical properties change. Active resources implemented as computational agents or web services might be re-implemented numerous times, but as long as they don’t change their interfaces they can be deemed to be persistent from the perspective of any other resource that uses them. Similarly, many resources like online newspapers or blog feeds continually change their content but still could have persistent identifiers.
Some organizing systems closely monitor their resources and every interaction with them to prevent or detect tampering with them or other unauthorized changes. Some organizing systems, like those for software or legal documents, explicitly maintain every changed version to satisfy expectations of persistence because different users might not be relying on the same version. With digital resources, determining whether two resources are the same or determining how they are related or derived from one another are very challenging problems.[164]
Many resources also have effectivity, meaning that they come into effect, or being, at a particular time and may cease to be effective at some future date. Effectivity is sometimes known as time-to-live. It consists of a date on which the resource is effective, and optionally a date on which the resource ceases to be effective, or becomes stale. For some types of resources, the effective date can be the moment when they’re created, but for others, the effective date can be a time different from the moment of creation. For example, a law can be passed in November but not take effect until January 1 of the following year. An effectivity date is the counterpart of the “Best Before” date on perishable goods. That date indicates when a product goes bad, whereas an item’s effectivity date is when it “goes good” and the resource that it supersedes needs to be disposed of or archived.
In most cases effectivity implies persistence requirements because it is important to be able to determine and reconstruct the configuration of resources that was in effect at some prior time. A new tax might go into effect on January 1, but if the government audits your tax returns what matters is whether you followed the law that was in effect when you filed your returns.[165]
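One way to sketch the reconstruction problem is to keep every version of a resource together with the date on which it takes effect, and then look up the version that was in force on any given day. The version labels and dates below are invented for illustration.

```python
import bisect
from datetime import date

# Hypothetical versions of a tax rule, sorted by the date each takes effect.
# A version stays in effect until the next one supersedes it.
VERSIONS = [
    (date(2010, 1, 1), "rule-2010"),
    (date(2011, 1, 1), "rule-2011"),
    (date(2012, 1, 1), "rule-2012"),
]


def in_effect(on_date):
    """Return the version in effect on a date, or None if none had started."""
    effective_dates = [effective for effective, _ in VERSIONS]
    # Count the versions whose effective date is on or before on_date.
    position = bisect.bisect_right(effective_dates, on_date)
    return VERSIONS[position - 1][1] if position else None
```

An audit of a return filed on April 15, 2011 would call `in_effect(date(2011, 4, 15))` and get the 2011 rule, regardless of what has been enacted since. This only works if superseded versions are archived rather than deleted, which is the persistence requirement in concrete form.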
In ordinary use we say that something is authentic if it can be shown to be, or has come to be accepted as what it claims to be. It is easy to think of examples where authenticity of a resource matters: a signed legal contract, a work of art, a historical artifact, even a person’s signature. The importance and nuance of questions about authenticity can be seen in the many words we have to describe the relationship between “the real thing” (the “original”) and something else: copy, reproduction, replica, fake, phony, forgery, counterfeit, pretender, imposter, ringer, and so on.
The creator or operator of an organizing system, whether human or machine, can authenticate a newly created resource. A third party can also serve as proof of authenticity. Many professional careers are based on figuring out if a resource is authentic.[166]
There is a large body of techniques for establishing the identity of a person or physical resource. We often use judgments about the physical integrity of recorded information when we consider the integrity of its contents.
Digital authenticity is more difficult to establish. Digital resources can be reproduced at almost no cost, exist in multiple locations, carry different names on identical documents or identical names on different documents, and bring about other complications that do not arise with physical items. Technological solutions for ensuring digital authenticity include time stamps, watermarking, encryption, and digital signatures. However, while scholars generally trust technological methods, technologists are more skeptical of them because they can imagine ways for them to be circumvented or counterfeited. Even when a technologically sophisticated system for establishing authenticity is in place, we can still only assume the constancy of identity as far back as this system reaches in the “chain of custody” of the document.
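To make the idea concrete, here is a minimal sketch using Python's standard `hashlib` and `hmac` modules. It substitutes a keyed hash (an HMAC) for a true digital signature, which would use public-key cryptography that the standard library does not provide; the function names and the sample contract text are assumptions for the example.

```python
import hashlib
import hmac


def fingerprint(content: bytes) -> str:
    """A content digest: any change to the bytes changes the digest."""
    return hashlib.sha256(content).hexdigest()


def seal(content: bytes, key: bytes) -> str:
    """A keyed digest that only a holder of the secret key can produce."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def is_authentic(content: bytes, key: bytes, tag: str) -> bool:
    """Check content against a seal, using a constant-time comparison."""
    return hmac.compare_digest(seal(content, key), tag)


# Hypothetical document and key.
original = b"The buyer agrees to pay 100 dollars."
tampered = b"The buyer agrees to pay 900 dollars."
secret_key = b"held-only-by-the-notary"
tag = seal(original, secret_key)
```

The check confirms only that the content matches what was sealed by someone holding the key. It says nothing about what happened to the document before it was sealed, which is the limit on the chain of custody described above.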
The idea that important documents must be created in an authenticatable manner and then preserved with an unbroken chain of custody goes back to ancient Rome. Notaries witnessed the creation of important documents, which were then protected to maintain their integrity or value as evidence. In Organizing Systems like museums and archives that preserve rare or culturally important objects or documents this concern is expressed as the principle of provenance. This is the history of the ownership of a collection or the resources in it, where they have been and who has had them.
A uniquely Chinese technique in Organizing Systems is the imprinting of elaborate red seals on documents, books, and paintings that collectively record the provenance of ownership and the review and approval of the artifact by emperors or important officials.
[109] [Business]
Separating information content from its structure and presentation is essential to repurposing it for different scenarios, applications, devices, or users. The global information economy is increasingly driven by automated information exchange between business processes. When information flows efficiently from one type of document to another in this chain of related documents, the overlapping content components act as the “glue” that connects the information systems or web services that produce and consume the documents. Glushko and McGrath (2005).
[110] [Citation]
Furnas, Landauer, Gomez, and Dumais (1987).
[111] [Citation]
Glushko and McGrath (2005).
[112] [Citation]
Kuniavsky (2010)
[113] [LIS]
Project Gutenberg, begun in 1971, was the first large-scale effort to digitize books; its thousands of volunteers have created about 40,000 digital versions of classic printed works. Systematic research in digital libraries began in the 1990s when the US National Science Foundation, the Advanced Research Projects Agency, and NASA launched a Digital Library Initiative that emphasized the enabling technologies and infrastructure. At about the same time numerous pragmatic efforts to digitize library collections began, characterized by some as a race against time as old books in libraries were literally disintegrating and turning into dust. The Internet Archive, started in 1996, now has a collection of over 3 million texts and has estimated the cost of digitizing to be about $30 for the average book. Multiply this by the scores of millions of books held in the world’s research libraries and it is easy to see why many libraries endorsed Google’s offer to digitize their collections.
[114] [CogSci]
Encoding of structure in documents is valuable because titles, sections, links and other structural elements can be leveraged to enhance the user interface and navigational interactions with the digital document and enable more precise information retrieval. Some uses of documents require formats that preserve their printed appearance. For example, “presentational fidelity” is essential if we imagine a banker or customs inspector carefully comparing a printed document with a computer-generated one to ensure they are identical.
[115] [Computing]
Text encoding specs are well-documented; see (http://www.wotsit.org/list.asp?fc=10).
[116] [Citation]
(Chapman and Chapman, 2009).
[117] [LIS]
Numerous museums have created web collections, but a great many of them seem to have focused on the quantity of information they could put online rather than on the user experience they were creating. Perhaps not surprisingly, the ambitious use of virtual world technology to create novel forms of interaction described by Rothfarb and Doherty (2007) reflects the highly interactive character of its host museum, the Exploratorium in San Francisco (http://www.exploratorium.edu/). Similarly, the Google Art Project (googleartproject.com) is notable for its goal of complementing and extending, rather than merely imitating, the museum visitor’s encounter with artwork (Proctor, 2011). A feature that lets people create a “personal art collection” is very popular, enabling a fan of Van Gogh to bring together paintings that hang in different museums.
[118] [Computing]
However, scratching can be simulated using a smartphone or tablet app called djay. See http://www.algoriddim.com/djay.
[119] [Law]
As a result, digital books are somewhat controversial and problematic for libraries, whose access models were created based on the economics of print publication and the social contract of the copyright first sale doctrine that allowed libraries to lend printed books. Digital books change the economics, and first sale is not as well-established for digital works, which are licensed rather than sold (Aufderheide and Jaszi, 2011). To protect their business models, many publishers are limiting the number of times e-books can be lent before they “self-destruct.” Some librarians have called for boycotts of publishers in response (http://boycottharpercollins.com).
[120] [Business]
The opposing categories of operands and operants have their roots in debates in political economics about the nature of work and the creation of value (Vargo, Lusch, & Morgan 2006) and have more recently played a central role in the development of modern thinking about service design (Constantin & Lusch, 1994; Maglio et al 2009). The concept of agency or operant resources is needed to bring resources that are active information sources, or computational in character, into the organizing system framework. This concept also lets us include living resources, or more specifically, humans, into discussions about organizing systems in a more general way that emphasizes their agency and de-emphasizes other characteristics that could otherwise be distracting.
[121] [Citation]
See Allmendinger and Lombreglia (2005), Want (2006).
[122] [CogSci]
Luis Von Ahn (Von Ahn, 2004) was the first to use the web to get people to perform “microwork” or “human computation” tasks when he released what he called “the ESP game” that randomly paired people trying to agree on labeling an image. Not long afterwards Amazon created the MTurk platform (www.mturk.com) that lets people propose microwork and others sign up to do it, and today there are both hundreds of thousands of tasks offered and hundreds of thousands of people offering to be paid to do them.
[123] [Computing]
For semi-structured or more narrative documents these descriptions might be authoring templates used in word processors or other office applications, document schemas in XML applications, style sheets, or other kinds of transformations that change one resource representation into another one. Primary resources that are highly and regularly structured are invariably organized in databases or enterprise information management systems in which a data schema specifies the arrangement and type of data contained in each field or component of the resource.
[124] [Computing]
There are a large number of 3rd party Twitter apps. See http://twitter.pbworks.com/w/page/1779726/Apps. For a scholarly analysis see Efron (2011).
[125] [Citation]
(Schmandt-Besserat, 1997)
[126] [LIS]
We treat resource format and resource focus as distinct dimensions, so there are four categories here. This contrasts with David Weinberger’s three “orders of order” that he proposes in the first chapter of a book called Everything is Miscellaneous (Weinberger, 2007). Weinberger starts with the assumption that physical resources are inherently the primary ones, so the first “order of order” emerges when physical resources are arranged. The second “order of order” emerges when physical description resources are arranged, and the third “order of order” emerges when digital description resources for physical resources are arranged. Later in the book Weinberger mentions the use of bar codes associated with web sites, a physical description of a digital resource, but because he started with the assumption that physical resources define the “first order” this example doesn’t fit into his orders of order.
[127] [Computing]
These methods go by different names in different disciplines, including “data modeling,” “systems analysis,” and “document engineering” (e.g., Kent, 1978/2000; Silverston, 2001; Glushko & McGrath, 2005). What they have in common is that they produce conceptual models of a domain that specify their components or parts and the relationships among these components or parts. These conceptual models are called “schemas” or “domain ontologies” in some modeling approaches, and are typically implemented in models that are optimized for particular technologies or applications.
[128] [CogSci]
Specifically, an NFL football team needs to be considered a single resource for games through the season and in playoffs, and 53 individual players for other situations, like the NFL draft or play-calling. The team and the team’s roster can be thought of as resources, and the team’s individual players are also resources that make up the whole team.
[129] [LIS]
Denton (2007) is a highly readable retelling of the history of cataloguing that follows four themes – the use of axioms, user requirements, the “work,” and standardization and internationalization – culminating with their synthesis in the Functional Requirements for Bibliographic Records (FRBR).
[130] [LIS]
This was a surprisingly controversial activity. Many people opposed Panizzi’s efforts as a waste of time and effort because they assumed that “building a catalog was a simple matter of writing down a list of titles” (Denton 2007, p. 38).
[131] [LIS]
Lubetzky worked for the US Library of Congress from 1943 to 1960, where he tirelessly sought to simplify the proliferating mass of special-case cataloguing rules proposed by the American Library Association, because at the time the LOC had the task of applying those rules and making the catalog cards other libraries used. Lubetzky’s book on Cataloguing Rules and Principles (Lubetzky, 1953) bluntly asks “Is this rule necessary?” and was a turning point in cataloguing.
[132] [LIS]
In between the abstraction of the WORK and the specific single ITEM are two additional levels in the FRBR abstraction hierarchy. An EXPRESSION denotes the multiple realizations of a work in some particular medium or notation, where it can actually be perceived. There are many editions and translations of Macbeth, but they are all the same expression, and they are a different expression from all of the film adaptations of Macbeth. A MANIFESTATION is the set of physical artifacts with the same expression. All of the copies of the Folger Library print edition of Macbeth are the same manifestation.
[133] [Computing]
This kind of advice can be found in many data or conceptual modeling texts, but this particular statement comes from Glushko, Weaver, Coonan, and Lincoln (1988).
[134] [Computing]
A group of techniques collectively called normalization produces a set of tightly defined information components that have minimal redundancy and ambiguity. Imagine that a business keeps information about customer orders using a “spreadsheet” style of organization in which a row contains cells that record the date, order number, customer name, customer address, item ID, item description, quantity, unit price, and total price. If an order contains multiple products, these would be recorded on additional rows, as would subsequent orders from the same customer. All of this information is important to the business, but this way of organizing it has a great deal of redundancy and inefficiency. For example, the customer address recurs in every order, and the customer address field merges street, city, state and zip code into a large unstructured field rather than separating them as atomic components of different types of information with potentially varying uses. Similar redundancy exists for the products and prices. Cancelling an order might result in the business deleting all the information it has about a particular customer or product. Normalization divides this large body of information into four separate tables, one for customers, one for customer orders, one for the items contained in each order, and one for item information. This normalized information model encodes all of the information in the “spreadsheet style” model, but eliminates the redundancy and avoids the data integrity problems that are inherent in it. Normalization is taught in every database design course. The concept and methods were proposed by Codd (1970), who invented the relational data model, and have been taught to countless students in numerous database design textbooks like Date (2003).
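The four-table split can be sketched in a few lines of Python, with plain dictionaries standing in for database tables. The sample rows, the field names, and the use of the customer name as a key are simplifying assumptions; a real design would assign customer identifiers and split the address into atomic parts.

```python
# Hypothetical "spreadsheet style" rows: one row per item on each order.
FLAT_ROWS = [
    {"date": "2012-03-01", "order_no": 1001, "customer": "Ann Lee",
     "address": "1 Main St, Springfield, IL 62701",
     "item_id": "A1", "description": "Widget", "qty": 2, "unit_price": 5.00},
    {"date": "2012-03-01", "order_no": 1001, "customer": "Ann Lee",
     "address": "1 Main St, Springfield, IL 62701",
     "item_id": "B2", "description": "Gadget", "qty": 1, "unit_price": 12.50},
    {"date": "2012-03-09", "order_no": 1002, "customer": "Ann Lee",
     "address": "1 Main St, Springfield, IL 62701",
     "item_id": "A1", "description": "Widget", "qty": 4, "unit_price": 5.00},
]


def normalize(rows):
    """Split flat rows into customers, orders, order items, and items."""
    customers, orders, order_items, items = {}, {}, [], {}
    for row in rows:
        # Each fact is stored once, keyed by the thing it describes.
        customers[row["customer"]] = {"address": row["address"]}
        orders[row["order_no"]] = {"date": row["date"],
                                   "customer": row["customer"]}
        items[row["item_id"]] = {"description": row["description"],
                                 "unit_price": row["unit_price"]}
        order_items.append({"order_no": row["order_no"],
                            "item_id": row["item_id"], "qty": row["qty"]})
    return customers, orders, order_items, items


def order_total(order_no, order_items, items):
    """Total price is derived on demand instead of being stored in a row."""
    return sum(line["qty"] * items[line["item_id"]]["unit_price"]
               for line in order_items if line["order_no"] == order_no)


customers, orders, order_items, items = normalize(FLAT_ROWS)
```

With the tables separated, cancelling order 1002 means deleting one entry from `orders` and its lines from `order_items`; the customer and item records are untouched.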
[135] [Computing]
The “Internet of Things” concept spread very quickly after it was proposed in 1999 by Kevin Ashton, who co-founded the Auto-ID center at MIT that year to standardize RFID and sensor information. For a popular introduction, see (Gershenfeld, Krikorian, & Cohen, 2004). For a recent technical survey and a taxonomy of application domains and scenarios see (Atzori, Iera, & Morabito, 2010).
[136] [Computing]
University of Southern California professor Julian Bleecker (2006) coined the term “Blogjects” to describe objects that blog (p. 2). Bleecker’s early example of a Blogject is Beatriz da Costa’s Pigeon Blog. Da Costa, a Los Angeles—based artist working at the intersection of life sciences, politics, and technology, armed urban pigeons with pollution sensors and locative tracking devices, released them, and created a web interface—in this case Pigeon Blog—to display their flight patterns on Google Maps alongside the pollution levels in the air as they flew. “Whereas once the pigeon was an urban varmint whose value as a participant in the larger social collective was practically nil or worse, the Pigeon that Blogs now attains first-class citizen status” (Bleecker, 2006, p. 5).
[137] [Computing]
IBM’s Andy Stanford-Clark has been credited with coining the term when he wired his house with sensors, enabling appliances to send information to the house’s Twitter account, @andy_house (MacManus, 2009, para. 4). The house plant kit: http://www.sparkfun.com/products/10334. See also http://supermechanical.com/twine/
[138] [Computing]
Pattern analysis can help escape this dilemma by enabling predictive modeling to make optimal use of the data. In designing smart things and devices for people, it is helpful to create a smart model in order to predict the kinds of patterns and locations relevant to the data collected or monitored. These allow designers to develop a set of dimensions and principles that will act as smart guides for the development of smart things. Modeling helps to enable automation, security, or energy efficiency, and baseline models can be used to detect anomalies. As for location, exact locations are unnecessary; use of a “symbolic space” to represent each “sensing zone”—e.g., rooms in a house—and an individual’s movement history as a string of symbols—e.g., abcdegia—works sufficiently as a model of prediction. See (Das et al 2002).
[139] [Law]
Well, maybe not anything. Books list traditional meanings of various names, charts rank names by popularity in different eras, and dozens of websites tout themselves as the place to find a special and unique name. See http://www.ssa.gov/oact/babynames/ for historical trends about baby names in the US; an interactive visualization is at http://www.babynamewizard.com/voyager#. Different countries have rules about characters or words that may be used in names. In Germany, for example, the government regulates the names parents can give to their children; there’s even a book, the International Handbook of Forenames, to guide them (Kulish, 2009). In Portugal, the Ministry of Justice publishes lists of prohibited names (Cornell, 2006). Meanwhile, in 2007, Swedish tax officials rejected a family’s attempt to name their daughter Metallica (BBC, 2007). We can also change our names, whether a woman takes on her husband’s surname after marriage or we just find something that better suits us than the name given by our parents. In California in the 1990s, a high school student made waves by changing his name to “Trout Fishing in America” (Associated Press, 1994).
[140] [CogSci]
And while you may think that certain terms are more obviously “good” than others, studies show that “there is no one good access term for most objects. The idea of an ‘obvious,’ ‘self-evident,’ or ‘natural’ term is a myth!” (Furnas et al, 1987, p. 967).
[141] [CogSci]
The most common names for this service were activities, calendar and events, but in all over a hundred different names were suggested, including cityevents, whatup, sparetime, funtime, weekender, nightout, and many more. “People use a surprisingly great variety of words to refer to the same thing,” Furnas wrote. “If everyone always agreed on what to call things, the user’s word would be the designer’s word would be the system’s word. … Unfortunately, people often disagree on the words they use for things” (Furnas, 1987, p. 964).
[142] [CogSci]
This example comes from (Farish, 2002), who analyzes “What’s in a Name?” and suggests that multiple names for the same thing might be a good idea because non-technical business users, data analysts, and system implementers need to see things differently and no one standard for assigning names will work for all three audiences.
[143] [CogSci]
See, for example, Handbook of Cross-Cultural Marketing, (Herbig 1998). The Starbucks coffee chain seemingly goes out of its way to confuse its customers by calling the smallest of its three coffee sizes (12 ounces) the “tall” size, calling its 16-ounce size a “grande,” and calling its largest a “venti,” which is Italian for 20 (ounces). Outside of Starbucks, something that is “tall” is never also considered “small.” Ironically, despite having about 20,000 stores in about 60 countries, Starbucks has none in Italy where “venti” would be in the local language.
[144] [Business]
Economist, As easy as YZX, August 30 2001. For example, the convention to list the co-authors of scientific publications in alphabetic order has been shown to affect reputation and employment by giving undeserved advantages to people whose names start with letters that come early in the alphabet. This bias might also affect admission to selective schools. (Efthyvoulou, 2008).
[145] [Business]
The Kentucky Fried Chicken franchise solved this problem by changing its name to KFC, which you can now find in Beijing, Moscow, London and other locations not anywhere near Kentucky and where many people have probably never heard of the place.
[146] [Computing]
Tim Berners-Lee, the founder of the web, famously argued that “Cool URIs Don’t Change” (Berners-Lee, 1998).
[147] [Law]
Any online citation to one of the West printed court reports will use the West format. However, when Mead Data wanted to use the West page numbers in its LEXIS online service to link to specific pages, West sued for copyright infringement. The citation for the West Publishing vs. Mead Data Central case is 799 F.2d 1219 (8th Cir 1986), which means that the case begins on page 1219 of volume 799 in the set of opinions from the 8th circuit court of appeals that West published in print form. West won the case and Mead Data had to pay substantial royalties. Fortunately, the logic behind this decision was repudiated by the US Supreme Court a few years later in a case that West published as Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991), and West can no longer claim copyright on page numbers.
[148] [CogSci]
When George Orwell gave the title “1984” to a novel he wrote in 1949 he intended it as a warning about a totalitarian future as the Cold War took hold in a divided Europe, but today 1984 is decades in the past and the title doesn’t have the same impact.
[149] [Citation]
(Dorai and Venkatesh, 2001).
[150] [Computing]
Identifiers with meaningful internal structure are said to be structured or intelligent. Those that contain no additional information are sometimes said to be unstructured, opaque, or dumb. The 8 in the ISBN example is a check digit, not technically part of the identifier, that is algorithmically derived from the other digits to detect errors in entering the ISBN.
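As an illustration of how a check digit is derived, the sketch below computes the final digit of a 13-digit ISBN from the first twelve, using the standard alternating weights of 1 and 3. The function names are ours, and the sample number in the usage note is a commonly cited valid ISBN rather than the one in the example above.

```python
def isbn13_check_digit(first_twelve: str) -> int:
    """Compute the ISBN-13 check digit from the first twelve digits."""
    digits = [int(ch) for ch in first_twelve if ch.isdigit()]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the last digit matches the computed check digit."""
    digits = [ch for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    return isbn13_check_digit("".join(digits[:12])) == int(digits[12])
```

For example, `is_valid_isbn13("978-0-306-40615-7")` returns `True`, while changing the final digit to 8 makes the check fail, which is how a data-entry error gets caught.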
[151] [Citation]
(McCartney 2006).
[152] [LIS]
Svenonius (2000) calls vocabulary control “the sine qua non of information organization” (p. 89). “The imposition of vocabulary control creates an artificial language out of a natural language” (p. 89), leaving behind an official, normalized set of terms and their uses.
[153] [LIS]
This mapping is “the means by which the language of the user and that of a retrieval system are brought into sync” (Svenonius, 2000, p. 93) and allows an information-seeker to understand the relationship between, say, Samuel Clemens and Mark Twain. The Library of Congress maintains a list of standard, accepted names for authors, subjects, and titles called the Name Authority File. http://id.loc.gov/authorities/names.html
[154] [Citation]
PESI www.eu-nomen.eu/pesi; CBOL www.barcoding.si.edu/; http://services.natureserve.org/BrowseServices/getSpeciesData/getSpeciesListREST.jsp
[155] [Citation]
(Hemerly 2011).
[156] [Law]
This rations / radio confusion is described in (Wheatley 2004). In 2008 a similar mistake in managing inventory at a US military warehouse led to missile launch fuses being sent to Taiwan instead of helicopter batteries, causing a high-level diplomatic furor when the Chinese government objected to this as a treaty violation (Hoffman 2008).
[157] [LIS]
Organizing systems in libraries, museums, and businesses often give sequential accession numbers to resources when they are added to a collection, but these identifiers are of no use outside of the context in which they are assigned, as when a union catalog or merged database is created.
[158] [Computing]
A more general technique is to use the UUID standard, which standardizes some algorithms that generate 128-bit tokens that, for all practical purposes, will be unique for hundreds, if not thousands, of years.
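A minimal sketch of generating such a token, using the `uuid` module in Python's standard library. The helper name `new_identifier` is hypothetical, and the random (version 4) algorithm is only one of the algorithms the standard defines.

```python
import uuid


def new_identifier() -> str:
    """Return a randomly generated (version 4) UUID as a string."""
    return str(uuid.uuid4())


token = new_identifier()
print(token)                              # a different value on every run
print(len(uuid.UUID(token).bytes) * 8)    # 128
```

Because the token is generated without reference to any local counter, two collections can assign identifiers independently and later be merged with a negligible chance of collision, unlike sequential accession numbers.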
[159] [Computing]
The OASIS XML Common Biometric Format (XCBF) was developed to standardize the use of biometric data like DNA, fingerprints, iris scans, and hand geometry to verify identity (OASIS 2003).
[160] [Citation]
(Coyle, 2006, p. 429).
[161] [Computing]
IPv6 for Internet addresses. The threat of exhaustion was the motivation for remedial technologies, such as classful networks, Classless Inter-Domain Routing (CIDR) methods, and network address translation (NAT), that extend the usable address space.
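To give a sense of the scale involved, the following sketch uses Python's standard `ipaddress` module to compare the size of the 32-bit and 128-bit address spaces and to show how CIDR notation describes a block of addresses. The specific block shown is a range reserved for documentation, chosen here only for illustration.

```python
import ipaddress

# Total number of addresses in each address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(ipv4_total)                  # 4294967296, i.e. 2**32
print(ipv6_total == 2**128)        # True

# A CIDR block: the "/24" prefix fixes the first 24 bits,
# leaving 8 bits (256 addresses) for hosts within the block.
block = ipaddress.ip_network("192.0.2.0/24")
print(block.num_addresses)         # 256
print(ipaddress.ip_address("192.0.2.17") in block)  # True
```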
[162] [Computing]
Digital Object Identifier system (www.doi.org). However, DOI has its issues too. It’s a highly political, publisher-controlled system, not a universal solution to persistence.
[163] [CogSci]
This is called the Paradox of Theseus, a philosophical debate since ancient times. Every day that Theseus’s ship is in the harbor, a single plank gets replaced, until after a few years the ship is completely rebuilt: not a single original plank remains. Is it still the ship of Theseus? And suppose, meanwhile, the shipbuilders have been building a new ship out of the replaced planks? Is that the ship of Theseus? (Furner, 2008, p. 6)
[164] [Citation]
See (Renear and Dubin 2003), (Wynholds 2011).
[165] [Business]
Effectivity in the tax code is simple compared to that relating to documents in complex systems, like commercial aircraft. Because of their long lifetimes—the Boeing 737 has been flying since the 1960s—and continual upgrading of parts like engines and computers, each airplane has its own operating and maintenance manual that reflects the changes made to the plane over time. Every change to the plane requires an update to the repair manual, making the old version obsolete. And while an aircraft mechanic might refer to “the 737 maintenance manual,” each 737 aircraft actually has its own unique manual.
[166] [Law]
Notaries public are used on a daily basis to verify that a signature on an important document such as a mortgage or other contract is authentic, much as signet rings and sealing wax once proved that no one had tampered with a document since it was sealed.