Chapter 1 Navigating a diverse field

What is multimodality?

‘Multimodality’ is a term that is now widely used in the academic world. The number of publication titles featuring the term has grown exponentially since it was first coined in the mid-1990s. Since then, a myriad of conferences, monographs, edited volumes and other academic discussion forums have been produced that are dedicated to multimodality. Signs of its becoming a shorthand term for a distinct field include the publication of the first edition of the Handbook of Multimodal Analysis (Jewitt, 2009), now a revised second edition (Jewitt, 2014), the launch of the Routledge Series in Multimodality Studies (2011) and the launch of a journal titled Multimodal Communication (2012). These and many other outlets inviting contributions in the area of multimodality provide platforms for scholars working in different disciplines, including semiotics, linguistics, media studies, new literacy studies, education, sociology and psychology, addressing a wide range of different research questions.

With the term being used so frequently and widely, it may seem as though a shared phenomenon of interest has been recognized and a common object of study identified. Indeed, we can, in relatively generic terms, describe that phenomenon, or object of interest, as something like, ‘We make meaning in a variety of ways’, or, ‘We communicate in a variety of ways’. Yet we must immediately add that ‘multimodality’ (and related concepts, including ‘mode’/‘modality’, ‘[semiotic] resource’) is differently construed. Exactly how the concept is articulated and ‘operationalized’ varies widely, both across and within the different disciplines and research traditions in which the term is now commonly used. Therefore, it is very difficult and potentially problematic to talk about multimodality without making explicit one’s theoretical and methodological stance.

Before going any further, we turn to those who first used the term and explore what it was that they were trying to draw attention to. As far as we can reconstruct, the term first appeared in the middle to late 1990s in different parts of the world. It is used, for instance, by Charles Goodwin, in a seminal article that he submitted to the Journal of Pragmatics in 1998 (Goodwin, 2000). It also features in Gunther Kress and Theo van Leeuwen’s Multimodal Discourse: The Modes and Media of Contemporary Communication (2001), the manuscript of which had been ‘in the making’ for a number of years. These scholars started using the term more or less independently of each other, with Goodwin in the US working in the tradition of ethnomethodology and conversation analysis, and Kress and van Leeuwen (then) in the UK in the tradition of social semiotics. Around this same time, O’Halloran, working (then) in Australia and drawing on earlier work by O’Toole (1994) and Kress and van Leeuwen (1996), began to use the term ‘multisemiotic’ to describe the multimodal character of mathematics texts (see, for instance, O’Halloran [1999b], published in Semiotica).

If a ‘means for making meaning’ is a ‘modality’, or ‘mode’, as it is usually called, then we might say that the term ‘multimodality’ was used to highlight that people use multiple means of meaning making. But that formulation alone does not accurately describe the conceptual shift these scholars were trying to mark and promote. After all, disciplines such as linguistics, semiotics and sociology have studied different forms of meaning making since well before the term ‘multimodality’ was introduced. Indeed, Ferdinand de Saussure (1857–1913), writing in the early 20th century, already suggested that ‘linguistics’ was a ‘branch’ of a more general science he called semiology. Yet the branches of that imaginary science have continued to specialize in the study of one or a small set of means for making meaning: linguistics on speech and writing, semiotics on image and film, musicology on music; and new subdisciplines have emerged: visual sociology, which is concerned with, for example, photography; visual anthropology, which is concerned with, for example, dress. These (sub)disciplines focus on the means of meaning making that fall within their ‘remit’; they do not systematically investigate synergies between the modes that fall inside and outside that remit.

Multimodality questions a strict ‘division of labour’ among the disciplines traditionally focused on meaning making, on the grounds that in the world we’re trying to account for, different means of meaning making are not separated but almost always appear together: image with writing, speech with gesture, math symbolism with writing and so forth. It is that recognition of the need for studying how different kinds of meaning making are combined into an integrated, multimodal whole that scholars attempted to highlight when they started using the term ‘multimodality’. It was a recognition of the need to move beyond the empirical boundaries of existing disciplines and develop theories and methods that can account for the ways in which we use gesture, inscription, speech and other means together in order to produce meanings that cannot be accounted for by any of the existing disciplines. This fact only became more noticeable with the introduction of digital technologies, which enable people to combine means of making meaning that were more difficult or impossible to disseminate before – for the majority of people anyway (moving image being one pertinent example). So that is how the introduction of the notion of multimodality marks a significant turn in theorizing and analysing meaning.

What the early adopters of the term recognized was not only the need to look at the co-occurrence and interplay of different means of making meaning but also that each ‘mode’ offers distinct possibilities and constraints. It had often been argued (e.g. by Saussure and Vygotsky) that language has, ultimately, the highest ‘reach’, that it can serve the widest range of communicative functions or that it enables the highest, most complex forms of thinking and is therefore the ‘most important’. Those who first introduced the notion of multimodality, including Goodwin, Kress and van Leeuwen, have pointed out that there are differences between semiotic resources in terms of the possibilities they offer for making meaning but that it is not the case that one resource has more or less potential than the other. The same point was made by O’Halloran, who in her definition of ‘multisemiotic’ emphasized the significance of the combination of different resources, each with its own potential. Thus multimodality marks a departure from the traditional opposition of ‘verbal’ and ‘non-verbal’ communication, which presumes that the verbal is primary and that all other means of making meaning can be dealt with by one and the same term.

We can now formulate three key premises of multimodality:

1 Meaning is made with different semiotic resources, each offering distinct potentialities and limitations.
2 Meaning making involves the production of multimodal wholes.
3 If we want to study meaning, we need to attend to all semiotic resources being used to make a complete whole.

We should add four important footnotes to this. First, not everyone working in multimodality uses the notion of meaning making. Depending on their disciplinary background and focus, they might say that they are interested in ‘multimodal communication’, ‘multimodal discourse’, or ‘multimodal interaction’. We will use the term ‘meaning making’ unless we are writing about a specific approach to multimodality. Nor does everyone working in multimodality use the term ‘mode’: some prefer to talk about ‘resource’, or ‘semiotic resource’, and generally avoid drawing strong boundaries between different resources, highlighting instead the significance of the multimodal whole (‘gestalt’). Indeed, for that very reason, some scholars whose work we subsume under the heading of ‘multimodality’ do not use that term themselves, while otherwise committing to the three key premises we just presented.

Second, scholarly interest in the connections between different means of making meaning predates the notion of multimodality. For instance, the study of gesture and its relation to speech, gaze and the built environment has a long history in linguistic anthropology, interactional sociology and other disciplines (see e.g. Goffman, 1981; Kendon, 2004a; Mehan, 1980); the relation between image and writing has been studied in semiotics (e.g. Barthes, 1977 [1964]) and so on. These early contributions have produced important insights into what we now call multimodality. At the same time, we should note that the potential empirical scope of multimodality goes further still. We can see a development from an exclusive interest in language to an interest in language and its relations to other means of making meaning, to an interest in making meaning more generally, without a clear base point, whether language or any other mode.

Third, while those using the term ‘multimodality’ generally aim to develop a framework that accounts for the ways in which people combine distinctly different kinds of meaning making, their epistemological perspectives (i.e. their perspective on how we can know the world) are different. As we shall see later on in this chapter, in some approaches to multimodality the assumption is that it is possible and indeed necessary to develop an integrated theoretical and methodological framework for some kinds of meaning making, for instance for the study of speech, gesture, gaze and the material environment. In other approaches, the assumption is that it is possible and necessary to develop an encompassing theoretical and methodological framework to account for all kinds of meaning making – whether in image or in gesture or in writing or in any other mode. So researchers who adopt the notion of multimodality (or whose work is treated by others as being part of the field of multimodality) still draw different boundaries around what it is in the empirical world that they aim to account for. This is not a matter of ambition but a matter of epistemology: some argue that the differences between, say, image and speech are too great to handle within one and the same framework; others argue that, notwithstanding the differences, it is still possible, at a more general level, to establish common principles of meaning making.

Fourth, when exploring how the notion of multimodality has been and is being developed along diverse lines and schools of thought, it is important to keep an eye on the ‘original’ premises we just outlined. Fundamental to all those premises is a concern with the cultural and social resources for making meaning, not with the senses. While there are, of course, important relations to be explored between the senses and the means for making meaning, it is important not to conflate the two. The focus on the cultural and the social shaping of resources used for making meaning also sets the approaches apart from the popular notion that observation of ‘non-verbal behaviour’ offers a ‘way in’ to what an individual ‘really’ thinks (as suggested in e.g. best-selling guidebooks on ‘successful business communication’).

What makes a study ‘multimodal’?

When reviewing literature or when planning your own study, it is important to clarify what makes a study multimodal. The following sets of questions about aims, theory and method can help you assess the centrality (or marginality) of multimodality in a study:

1 Aims and research questions: Does the study address research questions about meaning, communication, discourse or interaction? Is one aim of the study to contribute to the development of a theory of multimodality? For instance, you might find questions such as, ‘What is the semiotic relation between objects displayed in museums and their captions? What is the role of gaze in turn taking?’
2 Theory: What is the place of multimodality in the theoretical framework of the study? Is it a central concept, or is it referenced but not expanded on? It may also be that a theory is presented that could be described as multimodal even though it is not described as such by the authors/researchers themselves.
3 Method: What empirical materials are collected and analysed, and how? Do the collected materials include documentation of human artefacts and social interactions? Do the researchers attend to all (or at least a number of different) means of meaning making that can be reconstructed from the collected materials? Do they give equally systematic attention to all?

Considering the place of multimodality on these dimensions, we can distinguish between:

Doing multimodality: Designing a study in which multimodality is central to aims/research questions, theory and method;
Adopting multimodal concepts: Designing a study in which multimodal concepts (such as mode, semiotic resource) are used selectively.

When adopting multimodal concepts, you can draw selectively from approaches to multimodality such as the ones we discuss in the book. But picking and mixing can be a tricky approach. When selecting concepts from the frameworks and connecting them to concepts derived from other frameworks, it is important to reflect on their ‘compatibility’. Drawing on a theory raises expectations about methods used. For example, claiming to ‘use’ a theory from one of the approaches discussed in this book raises the expectation (among others, as we shall see in the next section) that you will analyse human artefacts or social interactions. So if you choose to combine that theory with the method of the interview, you are likely to be seen as having produced an incoherent framework. If you believe there are good reasons to use the interview as a method, you need to make a case for it (alternatively, you could treat the interview not as a method but as an object of study and analyse it multimodally).

Making explicit what the place of multimodality is in one’s study along these lines can be a way of setting appropriate expectations about the coherence of the research design. When you submit a research paper to a journal and suggest that the study you present is multimodal, some reviewers will expect multimodality to be central throughout the paper. When you explain that you adopt selected multimodal concepts, reviewers are more likely to assess the ‘fit’ between those concepts and the theoretical and methodological frame within which you integrate them. We will elaborate on the issue of mixing approaches in Chapter 6.

Three approaches to multimodal research

In Chapters 3, 4 and 5 of this book we discuss three approaches to doing multimodality. Each is grounded in a distinct discipline, with a distinct theoretical and methodological outlook: conversation analysis, systemic functional linguistics and social semiotics. Not all scholars working in these originating disciplines are interested in multimodality. For instance, many conversation analysts or systemic functional linguists continue to focus on the study of ‘talk’ or ‘speech’. Yet within each of the three disciplines, we can identify a substantial and growing body of literature and a community of scholars engaging with multimodal research. It is these bodies of work that we will focus on.

While there are significant differences between them, they share a number of important features:

They draw on disciplines that originally focused on language in use, that is disciplines concerned with what people do with language in their everyday life, notably how they construe the social world through language.
They have a preference for collecting and analysing observable traces of meaning making, notably those found on human artefacts and video recordings of social interaction.
They aim to describe, transcribe, annotate and analyse materials at a micro level, that is with attention to the fine-grained detail of form and meaning.
Through micro analysis, they have produced rich, detailed metalanguages for theorizing about the social world.
They have all in recent decades branched out, maintaining the focus on the social and incorporating means of meaning making beyond speech and writing in their theoretical and methodological frameworks.

As the last bullet point suggests, the approaches that we focus on in this book have developed a more encompassing multimodal frame, largely by expanding their original frame: all had developed sophisticated toolkits to investigate language in use and then branched out, as it were, to explore meaning made with other means – gesture, for instance, or image. We should point out from the outset that the risk of branching out is that the new territory is described in the terms of the originating discipline. Indeed, this is a common critique of all three approaches, and one that we will attend to throughout the book. When expanding the traditional scope, it is important to keep a close eye on what is typical of a mode or semiotic resource and what may count as a more general principle of meaning making, making sure that linguistic categories are not imposed onto other modes. Every time the frame is expanded, old terms and categories need to be revisited and re-evaluated, in the light of the wider range of empirical cases being considered. So we might ask, ‘What would the counterpart be of a verb in image?’ But we can ask that only if we then immediately add, ‘Maybe image doesn’t have anything like the verb. Maybe it has categories unlike anything language has’.

The same can be said about the names of the originating disciplines. The terms ‘conversation analysis’ and ‘systemic functional linguistics’ no longer match the scope of the disciplines they describe. A number of new terms are now being used to mark the changing scopes of these disciplines. We will, for the moment, continue to use some of the old names and use new names if they are widely used within the community they represent. Thus we use the term ‘systemic functional multimodal discourse analysis’ (SF-MDA) but not, for instance, ‘multimodal conversation analysis’.

We will discuss the three approaches at length in Chapters 3, 4 and 5, respectively. Here we summarize them by briefly introducing their aims, history, theory of meaning, concept of mode, empirical focus and methodology. We also present a typical research question for each approach. If you have problems understanding some of the bullet points at this point, rest assured that we will come back to all of them.

Systemic functional linguistics

Aim: To understand the ways in which language is organized and used to fulfil a range of social functions.
History: Originally developed by Michael Halliday in the 1960s in the UK, influenced by European functionalism. O’Toole, van Leeuwen, Martin and O’Halloran and others have taken SFL procedures to explore what is now often called multimodal discourse, resulting in systemic functional multimodal discourse analysis (SF-MDA). The approach aims to understand and describe the functions of different semiotic resources as systems of meaning and to analyse the meanings that arise when semiotic choices combine in multimodal phenomena over space and time.
Theory of meaning: Language is conceptualized as a social semiotic resource for creating meaning. The meaning potential of language is reflected in its underlying organization, which is modelled as interrelated systems of meaning. The systems are ‘networks of interlocking options’ (i.e. choice between different forms), and ‘text’ is a process and product of selection (and materialization) from that potential (e.g. Halliday, 2008).
Concept of mode: The notion of semiotic resource is central to multimodal studies grounded in SFL. Semiotic resources are seen to fulfil four main social functions: (1) to construct our experience of the world, (2) to make logical connections in that world, (3) to enact social relations and (4) to organize the message.
Empirical focus: ‘Artefacts’ of all kinds, including print and digital texts, videos and three-dimensional objects and sites. Usually these artefacts are readily available as popular media (advertisements, TV programmes, websites, social media), educational media (e.g. textbooks and other education materials) and art and crafts (e.g. sculptures, buildings).
Methodology: Detailed transcription and analysis of selected fragments of the texts, as well as the analysis of larger corpora and ‘multimodal analytics’ (e.g. Bateman, 2008, 2014c; O’Halloran, Tan & E, 2014; O’Toole, 2011).
Typical research question: Unsworth (2007), building on Martinec and Salway (2005), explored the nature of text–image relations in school textbooks and other educational materials by developing a classification system that documents the types of logical relations established between the text and images. In this SFL approach, the questions that are asked include, ‘What is the nature of the text–image relations? Are these compatible with the communicative purpose of the educational materials? What, if any, challenges do they represent for young learners?’

Social semiotics

Aim: To recognize the agency of social actors and social/power relations between them.
History: Pioneered by Gunther Kress and Bob Hodge in the early 1980s in Australia, building on critical linguistics, SFL, semiotics and social theory. Van Leeuwen brought inspiration from music and film studies.
Theory of meaning: Based on the notion of the motivated sign (Kress, 1993), which holds that the relation between signifier and signified is always motivated and never ‘arbitrary’, as Saussure suggested.
Concept of mode: Central to social semiotic theory. Indeed most theorizing on what counts as mode comes from social semiotics. A short definition would be ‘a set of socially and culturally shaped resources for making meaning’ that has distinct ‘affordances’ (cf. Kress, 2014).
Empirical focus: Initially focused on ‘artefacts’ (especially print media, film and games – both ‘professional’, e.g. an advert in a magazine, and ‘vernacular’, e.g. a child’s drawing), it then also began to account for social interaction recorded on video through fieldwork.
Methodology: Typically detailed analysis of selected small fragments (e.g. a drawing or a small set of them), sometimes involving historical comparisons; is often combined with ethnography.
Typical research question: Mavers (2011) looked at a teacher’s instructions and the drawings that children made subsequently in a science classroom. As in any social semiotic study, questions she addressed included, ‘How did the sign makers use the modes available to them (in this case, drawing and writing) to re-present the world? What did they attend to? What did they highlight? What was gained and lost in the process of “translating” from one mode to another?’

Conversation analysis

Aim: To recognize ‘order’ in the ways in which people organize themselves in and through interaction.
History: Originally developed in the US in the early 1960s by Schegloff, Sacks and Jefferson, influenced by interactionism and ethnomethodology. Goodwin, Heath, Mondada and others have taken CA procedures to explore what is now often called multimodal interaction.
Theory of meaning: Based on the notion of sequentiality: Action unfolds in time, one action after another. Each social action is understood in relation to the actions that preceded and followed it. That principle provides a basis for making claims about the meanings that people make.
Concept of mode: While it is recognized that people use a range of different (‘semiotic’) ‘resources’ and that these resources are ‘mutually elaborating’, the term ‘mode’ is rarely used.
Empirical focus: Video recordings of ‘naturally occurring’ social encounters (i.e. encounters that were not initiated by the researcher), obtained through fieldwork, showing all participants involved in an activity. Activities include those in which speech is only used occasionally, such as when two people assemble a piece of furniture.
Methodology: Typically involves detailed transcription and analysis of (collections of) small fragments or strips of interaction (say, 30 seconds) illustrating a phenomenon of interest.
Typical research question: Goodwin and Tulbert (2011) looked at toothbrushing in family houses. In ‘plain English’, their question can be formulated as, ‘How do parents get their children to brush their teeth?’ In a CA framework, such formulations get translated into questions, such as, ‘How do members of a community (in this case, families) organize their routine activities? How do they use their bodies, objects and the built environment as resources for the accomplishment of these activities? How do they achieve a joint focus of attention? How do they jointly “build” the activity up? How are the activities related?’

Throughout the book, we will cross-reference and point out differences and similarities among these three focal approaches. The main differences are summarized in Table 1.1.

We want to highlight two significant differences here: one theoretical and one methodological.

The theoretical point is, first of all, an issue of naming. The three approaches have different terminological preferences, coupled with different conceptualizations of what we have described so far as ‘means for making meaning’. In social semiotics (SS) and systemic functional linguistics (SFL), the terms ‘mode’ and ‘semiotic resource’ are both used, and definitions have been proposed that make a distinction between the two. In conversation analysis (CA), ‘(semiotic) resource’ is used, but ‘mode’ is not, or very rarely, and some attempts at defining ‘(semiotic) resource’ have been made. Yet none of these definitions is (as yet) widely and consistently used beyond those who proposed them.

**Table 1.1** Mapping three approaches to multimodality: SFL, social semiotics and conversation analysis

| | SFL | Social semiotics | CA |
|---|---|---|---|
| Aims | Recognition of social functions of forms | Recognition of power and agency | Recognition of social order in interaction |
| Theory of meaning | Meaning as choice | Motivated sign | Sequentiality |
| History | European functionalism | SFL, critical linguistics, semiotics | American interactionism, ethnomethodology |
| Conceptualization of ‘means for making meaning’ | Semiotic resource, mode | Mode, semiotic resource | (Semiotic) resource |
| Example representatives | O’Toole, Martin, Unsworth, O’Halloran | Kress, van Leeuwen | Goodwin, Heath, Mondada |
| Empirical focus | Artefacts, including texts and objects | Artefacts, mostly texts | Researcher-generated video recordings of interaction |
| Method of analysis | Micro analysis of selected short fragments, corpus analysis, multimodal analytics | Micro analysis of selected short fragments, historical analysis | Micro analysis of (collections of) selected short fragments |

There is, put simply, much variation in the meanings ascribed to mode and (semiotic) resource. Gesture and gaze, image and writing seem plausible candidates, but what about colour or layout? And is photography a separate mode? What about facial expression and body posture? Are action and movement modes? You will find different answers to these questions not only between different research publications but also within them. To avoid potential confusion, it is important to make a deliberate decision on what categories and terms to use when engaging with multimodal research. It will be helpful to formulate some ‘working definitions’, drawing on the ones already put forward by the approach you adopt. Even though the working definition is unlikely to be entirely satisfactory, it is important to strive for maximum conceptual clarity and consistency. We will discuss the definitions proposed within our focal approaches in the respective chapters.

The methodological point is this. CA is primarily interested in meanings made in situ, in dynamic, face-to-face interactions. It looks at artefacts only insofar as these artefacts are being oriented to in observed interactions. So, for instance, Charles Goodwin (2000) looked at the Munsell chart, a tool that the archaeologists participating in the interactions he had video-recorded used to determine the colour of soil. In social semiotics, artefacts have been explored in situ – for instance the use of 3D models in the science classroom (Kress and van Leeuwen, 2001) – but in other social semiotic work, artefacts have also been studied away from specific situated interactions. For instance, Bezemer and Kress (2008) studied textbooks. Their focus was on meanings made by the makers of textbooks (including authors and graphic designers), not on the meanings of those who engage with textbooks, such as teachers and students. In SFL, a similar position is usually taken, recognizing that it is possible to reconstruct meanings from (collections of) artefacts. Thus SS and SFL generally cover a wider empirical scope than CA. For instance, the architectural design of a building would normally fall outside the scope of CA.

There is, of course, significant variation in the degree to which scholars stay close (some might say ‘faithful’) to the principles put forward by the founders of the originating disciplines. Indeed there is a tension between staying faithful to concepts as they were originally defined and the need to revise old concepts in the light of the changing world. After all, the world we live in now looks very different from what it looked like when the originating disciplines appeared. Social, cultural and technological changes constantly challenge old notions.

There are close connections between scholars working with the different frameworks, and indeed some are active members of more than one community. The closest links are between SFL and SS; there is far less interaction between representatives of CA, on the one hand, and SFL and SS, on the other. For instance, at the International Conference on Multimodality, CA has to date been under-represented, while SS and SFL were hardly represented at the tenth edition of the International Conference on Conversation Analysis (2010), which was dedicated to ‘multimodal interaction’. CA is closely linked with interactional (socio)linguistics and linguistic anthropology, and this connection is reflected in early work on the role of, for example, gaze in classroom interaction (see the work of Ray McDermott and Frederick Erickson). Social semiotics is closely linked with critical discourse analysis, which developed as a separate branch of critical linguistics. That is visible, for instance, in the joint work of David Machin and Theo van Leeuwen on media discourse (e.g. 2007).

In many studies, selected elements of one of the three approaches have been adopted and brought into connection with concepts and methods derived from other disciplines, such as psychology. For instance, you could use eye-tracking technology to ‘test’ certain concepts proposed in social semiotics (Holsanova, 2012). Other work has attempted to bring together concepts from social semiotics with ethnography. We will elaborate on how elements of the three approaches have been incorporated into other approaches in Chapter 6.