The challenges of data management and analysis on a large longitudinal qualitative research project
Computer aided qualitative data analysis has the potential to revolutionise both the scale of research and the analysis techniques possible. Yet the software itself still imposes limits that prevent this potential from being fully realised. This post looks at the large and complex dataset created as part of the Welfare Conditionality research project, the analytical approach adopted, and the challenges QDAS faces.
The Welfare Conditionality project sets out two broad research questions in considering the issues surrounding sanctions, support, and behaviour change. Firstly, is conditionality ‘effective’ – and if so for whom, under what conditions, and by what definition of ‘effective’? Secondly, is welfare conditionality ‘ethical’ – how do people justify or criticise its use, and for what reasons? To answer these questions, we have undertaken the ambitious task of collecting a trove of qualitative data on conditional forms of welfare. Our work spans nine policy areas, each of which has a dedicated ‘policy team’ responsible for the research. The policy areas are: unemployed people, Universal Credit claimants, lone parents, disabled people, social tenants, homeless people, individuals/families subject to antisocial behaviour orders or family intervention projects, (ex-)offenders, and migrants. Research has consisted of 45 interviews with policy stakeholders (MPs, civil servants, heads of charities), 27 focus groups with service providers, and three waves of repeat qualitative interviews with 481 welfare service users across 10 interview locations in England and Scotland.
Our first data management and analysis task was dealing with the logistics of storing and organising data on this scale. One of our key protocols has been the creation of a centralised Excel sheet used to collate participant information, contact details, and the stage each interview is at. It tells us, for example, when the interview recording has been uploaded to a shared network drive, transcribed, anonymised, added to our NVivo project file, given a case node, assigned attributes, auto-coded, and coded and summarised in a framework matrix. On the analysis side, we have been using the server edition of NVivo. It became clear early in the fieldwork that working with multiple stand-alone project files, regularly merged and then redistributed, would be impractical – with a high risk of merge conflicts arising from the complexity of our data. The server project means multiple team members can access and work in the project file at the same time.
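As a rough illustration of what such a tracker records, here is a minimal sketch in Python. The stage names, participant IDs, and helper functions are invented for the example; the actual tracker lives in Excel.

```python
# Hypothetical sketch of the tracking sheet as plain Python data;
# field names and IDs are illustrative, not the project's actual columns.
STAGES = ["uploaded", "transcribed", "anonymised", "imported_to_nvivo",
          "case_node_created", "attributes_assigned", "autocoded",
          "coded_and_summarised"]

def new_record(participant, wave, done):
    """One row per interview: participant ID, wave, and a flag per stage."""
    return {"participant": participant, "wave": wave,
            **{stage: stage in done for stage in STAGES}}

tracker = [
    new_record("WSU-001", 1, set(STAGES)),      # fully processed
    new_record("WSU-002", 1, set(STAGES[:6])),  # awaiting auto-coding
]

def pending(rows, stage):
    """Participants whose interviews have not yet reached a given stage."""
    return [row["participant"] for row in rows if not row[stage]]

print(pending(tracker, "autocoded"))  # -> ['WSU-002']
```

Even this toy version shows the point of the protocol: at any moment the team can see exactly which interviews are held up at which stage.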
Another emerging challenge was the difficulty for team members of being involved in time-intensive fieldwork while dedicating sufficient time to analysis. We also needed an analytical approach that could offer information at a range of levels: by individual over time, as well as across and within the policy areas and welfare domains under investigation. There was debate amongst team members over whether each policy team should independently conduct its own analysis or whether a shared approach should be adopted. Some felt a shared approach would be too time consuming compared to coding for specific outputs, and that there were not enough commonalities between all the policy areas for a shared approach to be workable. Others felt that coding for specific outputs would result in unnecessary repetition of analysis and make it difficult to reach general conclusions across the whole sample.
Diversifying Analysis Strategies
The use of QDAS has enabled an approach that facilitates both styles of analysis. Working iteratively as a team, we carried out initial coding and suggested preliminary nodes. Taking this initial work forward, the suggestions were refined at a series of meetings based on the principle that the main node schema would be comprehensive but not exhaustive: containing the ‘top-level’ and ‘middle-level’ nodes applicable across the whole welfare service user sample. As far as possible, the definitions of nodes accepted for inclusion were broad enough to incorporate areas covered by nodes that were too narrow or policy-area specific. Output-specific coding remains possible, however, because policy teams can use the results of the main node schema to do further coding without having to code every interview from scratch. For example, after interviews had been coded with the common schema, the disability team could use a query to return everything coded at ‘1g_Health_issues’ for participants in receipt of disability-related benefits, and use the returned sections to do further coding for specific disabilities and health issues. With many participants having experiences relevant across different policy areas, this also aimed to stop policy teams repeating the same coding work, as would happen if there were no shared coding.
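The logic of that example query can be sketched in Python, assuming coded segments were accessible as plain data structures (NVivo exposes no such API; the segment layout, benefit names, and values below are invented for illustration, apart from the ‘1g_Health_issues’ node named above):

```python
# Illustrative representation of coded interview segments.
segments = [
    {"participant": "WSU-101", "node": "1g_Health_issues",
     "benefit": "ESA", "text": "My back problems stop me working."},
    {"participant": "WSU-102", "node": "1g_Health_issues",
     "benefit": "JSA", "text": "I was signed off for two weeks."},
    {"participant": "WSU-103", "node": "2a_Sanctions",
     "benefit": "ESA", "text": "They stopped my money for a month."},
]

DISABILITY_BENEFITS = {"ESA", "DLA", "PIP"}  # illustrative list

# Everything coded at the health node for participants on
# disability-related benefits, ready for finer-grained coding.
hits = [s for s in segments
        if s["node"] == "1g_Health_issues"
        and s["benefit"] in DISABILITY_BENEFITS]

print([s["participant"] for s in hits])  # -> ['WSU-101']
```

The returned sections become the starting corpus for the policy team's own, more specific coding, rather than the full set of transcripts.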
Three further QDAS features have been key to diversifying the analysis strategies possible on the project. The first has been using the main node schema to code every interview and summarise the data in a framework matrix. Framework matrices in NVivo are laid out similarly to an Excel spreadsheet, with a row for each participant and a column for each node. The advantage of doing this in NVivo rather than on pen and paper or in Excel is maintaining a much closer linkage to the original data. Alongside a framework matrix it is possible to display an ‘associate view’ with multiple display options, including showing the full transcript for the participant row selected, or only the interview data relevant to the specific cell selected – such as everything coded for that participant in relation to their experience of sanctions. Change over time can be captured both in the structure of the summaries and by using the associate view to switch between viewing data related to each wave. Use of framework matrices thus helps to avoid what Gilbert (2002: 218) calls the “coding trap”, where QDAS users become overly concerned with the process of coding at the expense of seeing the data as a whole. With such a large dataset and a tight time frame for the project, this was a real risk. It solves the problem not with less coding, but with more creativity in how the coding can be viewed and engaged with.
The second key feature is the capability to reshape and recombine the data. Once a summary has been written for a participant at one node in a framework matrix, it will also appear in any other matrix containing that cell. After coding and summarisation, therefore, it becomes possible to quickly create new matrices that filter information for different sub-samples or for a selection of the main nodes, and to use attribute data to organise the order in which participants appear – such as by the number of times sanctioned or by geographical area. With each summary linked to the coded sections of the interview, it is possible to move with little effort from a framework matrix for a selected group, to the summaries for a particular participant, to the interview sections coded at each of the main themes, and then to open the transcript at the exact point a chosen section appears, to see the wider context. Although achieving this requires a heavy amount of coding work, the use of matrices, and the ways they can reshape and recombine the data, prevents the process of coding from resulting in the “deferring [of] thinking about the data” (Bazeley and Richards 2000: 62). The final key feature is the way the data can be explored through search folders and queries, which add to the multiple points of entry for navigating the data. To reflect the complexity of our sample, we have search folders that group participants by policy area, location, benefits received, and number of times sanctioned, amongst others. Queries have been used to find key phrases, return coded content for specific groups, and generate node matrices.
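The reshaping described above amounts to keying each summary once and then projecting it into any number of matrices. A minimal Python sketch, with invented node names, summaries, and attributes, shows the idea:

```python
# Each summary is written once, keyed by (participant, node);
# all data below is invented for illustration.
summaries = {
    ("WSU-201", "Sanctions"): "Sanctioned twice; income shock each time.",
    ("WSU-201", "Support"): "Found work coach supportive at wave 2.",
    ("WSU-202", "Sanctions"): "Never sanctioned but fears it constantly.",
}
attributes = {
    "WSU-201": {"area": "Homelessness", "times_sanctioned": 2},
    "WSU-202": {"area": "Lone parents", "times_sanctioned": 0},
}

def matrix(participants, nodes):
    """Build a fresh framework matrix for any sub-sample and node selection;
    existing summaries reappear automatically, without being rewritten."""
    return {p: {n: summaries.get((p, n), "") for n in nodes}
            for p in participants}

# Sub-sample ordered by an attribute: most-sanctioned participants first.
ordered = sorted(attributes,
                 key=lambda p: attributes[p]["times_sanctioned"],
                 reverse=True)
print(matrix(ordered, ["Sanctions"]))
```

The point is that a new matrix is cheap to create once the coding and summarising are done: the expensive step happens once, and every subsequent view is a query over it.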
Problems with QDAS
Although many of the early criticisms of QDAS are outdated, since they do not reflect what is possible with the latest versions of the software (Silver and Rivers 2016; Salmona and Kaczynski 2016), there remain many barriers to greater adoption of QDAS, and not all of them are receiving the attention they deserve in the methodology literature. The price of the software and the server licence is high, and would make a significant dent in the budget of smaller projects. This is made worse because, although marketed as a ‘server’ product, NVivo Server (recently renamed NVivo for Teams) only operates efficiently on a local network. For team members on our project outside of York, accessing our server project file requires connecting first to a virtual desktop and then to the project file from there. This introduces a chain of potential IT and compatibility issues between different university systems that can, and in our case did, cause problems.
More broadly, an issue with QDAS is that data is often stored in a relational database with no means for the user to access it directly except through the limited interface the software provides on top of it. A related issue is that none of the main software packages allow interacting with the data via a programming language. Incorporating one would vastly speed up the creation of complex queries, which is currently only possible through laborious point-and-click interfaces. A user may, for example, need to return multiple lists of participants matching the same criteria for three attributes but with a different value for a fourth. The open source programming languages R and Python allow quantitative researchers to write short and simple code that loops through all the values of the fourth attribute, running the same query each time but replacing, say, ‘Jobseeker’ with ‘Lone Parent’. In contrast, within NVivo a user must create each query manually, every time working through a series of dropdown menus and dialogue boxes to specify the three criteria that stay the same before, finally, specifying the fourth criterion that varies. While presenting a supposedly ‘user-friendly’ interface, in the long run this is more a hindrance than an aid.
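The contrast can be made concrete with a short sketch, assuming the participant data were accessible as plain Python structures (the attribute names and values are invented; ‘Jobseeker’ and ‘Lone Parent’ are the groups mentioned above):

```python
# Three fixed criteria, one varying attribute: with direct data access,
# a few lines replace a run of dialogue boxes per list in NVivo.
participants = [
    {"id": "WSU-301", "group": "Jobseeker",   "area": "Bath",
     "wave": 1, "sanctioned": True},
    {"id": "WSU-302", "group": "Lone Parent", "area": "Bath",
     "wave": 1, "sanctioned": True},
    {"id": "WSU-303", "group": "Jobseeker",   "area": "Bath",
     "wave": 1, "sanctioned": False},
]

FIXED = {"area": "Bath", "wave": 1, "sanctioned": True}  # shared criteria

lists_by_group = {}
for group in ("Jobseeker", "Lone Parent"):  # the fourth, varying attribute
    lists_by_group[group] = [
        p["id"] for p in participants
        if p["group"] == group
        and all(p[key] == value for key, value in FIXED.items())
    ]

print(lists_by_group)
# -> {'Jobseeker': ['WSU-301'], 'Lone Parent': ['WSU-302']}
```

Adding a tenth group to the tuple costs one word here; in a dialogue-box interface it costs an entire extra query built from scratch.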
This is only one of many issues we have experienced with the user interface when needing to create a multitude of complex search folders, matrices, and queries. In some cases, I have been able to work around the user interface by writing scripts in AutoHotkey, a scripting language that can be used to automate and shorten repetitive tasks. To aid other users, I have in my own time been refactoring the code so that it can be adapted for use in other projects. My GitHub page hosts scripts that use the keyboard to greatly speed up entering attribute data and coding transcripts. Currently, only the first has usage instructions and a demonstration video, but I plan to add the same for the second in the coming months. It is important to note that these scripts can speed up some tasks only by sitting ‘on top’ of the software and automating user input, rather than interacting directly with the data. They offer a glimpse of what could be possible, but are ultimately limited compared with a programming language that operates directly on the data.
Another issue is that at times the software can be painfully slow, with no real computational reason for this given the relative simplicity of the actions being performed. For example, as the project file has grown larger it can now take 8–10 seconds for the ‘Select Project Items’ dialogue window to open. Along with a slightly clunky UI in places, this makes the seams more obvious when exploring the data. While QDAS remains more time efficient in the long run through the power of queries and the like, any performance issues and difficulties in undertaking basic tasks make it harder for users to see this when they compare it to approaches they are already comfortable with and have used in the past.
Finally, commercial QDAS tend to use proprietary file types that risk ‘locking in’ the data and the analysis performed on it. Another trend in quantitative research, alongside the move to open source software, has been making available not only the data but also the code used in the analysis. This both allows peers to check the robustness of the analysis and aids further secondary analysis. Qualitative research funders increasingly stipulate that transcripts are archived following the research, but generally do not ask for the project file, with its record of the coding and queries performed on the data, as there is no open standard file format available. Without one, anyone wanting to access the data must pay for a software licence. One positive development in this area is Dedoose, a cloud-based QDAS package, being able to import file formats used by many other QDAS packages. The drawback is that it requires a monthly fee to use, so a financial burden on secondary analysis remains. An open standard would instead make it easier to develop open source QDAS packages that remove any financial cost to the user. RQDA, an existing open source QDAS written in R, is still under active development; it implements only the basic features expected of a QDAS package, but it shows what is possible. In my own time, I have started work on a Python-based QDAS, Pythia, for which I hope to have a working prototype of the coding interface complete by spring 2018. A further benefit of developing an open source option would be the ability to enhance mixed and truly blended methods, by making it easier for existing open source quantitative textual packages to work with the results of qualitative analysis.
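To make the open-standard argument concrete: a coding record needs little more than positions in a transcript, node names, and attributes, all of which serialise to plain text. The schema below is purely an invention for this sketch, not any existing or proposed standard.

```python
# Purely illustrative: an open project format could be as simple as JSON
# recording codings alongside archived transcripts, readable by any tool
# without a licence. The schema and values here are invented.
import json

project = {
    "transcripts": {"WSU-401_w1.txt": "(transcript text)"},
    "codings": [
        {"file": "WSU-401_w1.txt", "node": "Sanctions",
         "start": 120, "end": 245},  # character offsets of the coded span
    ],
    "attributes": {"WSU-401": {"area": "Sheffield"}},
}

# Round-trip through plain text: no proprietary reader required.
serialised = json.dumps(project, indent=2)
restored = json.loads(serialised)
print(restored["codings"][0]["node"])  # -> Sanctions
```

Anything this simple could be archived alongside the transcripts, letting secondary analysts inspect the coding with nothing more than a text editor or a standard library.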
It is only through QDAS that we have been able to manage and analyse a large qualitative dataset from a multitude of angles. Qualitative research in general is on the cusp of entering new territory, as software makes manageable, and opens up, analytical approaches that are impossible, or prohibitively time and resource intensive, with traditional ‘pen and paper’ strategies. Many of the early, and sadly still widely held, concerns about QDAS were solved not by relying less on the software but by taking full advantage of the features it can offer. The shared coding and framework matrices help form a flexible scaffolding that supports further analysis and maintains the broad picture despite the intensive coding work required. By diversifying the ways the data can be filtered, shaped, and recombined through search folders, queries, and filtered matrices, this approach also aids analysis for specific outputs and reduces the unnecessary repetition of work that would occur if different teams were each analysing only for their own outputs. Yet QDAS is not without its flaws, and there are serious limitations in the capabilities of proprietary software that are not widely covered in the methodology literature. Open source solutions are gaining popularity in quantitative research, and it is perhaps time that qualitative researchers joined them.
Bazeley P (2007) Qualitative Data Analysis with NVivo, 2nd edition. London: SAGE Publications.
Bazeley P and Richards L (2000) The NVivo Qualitative Project Book. London: SAGE Publications.
Gilbert LS (2002) Going the distance: “Closeness” in qualitative data analysis software. International Journal of Social Research Methodology 5(3): 215–228.
Richards L (2002) Qualitative computing – a methods revolution? International Journal of Social Research Methodology 5(3): 263–276.
Salmona M and Kaczynski D (2016) Don’t Blame the Software: Using Qualitative Data Analysis Software Successfully in Doctoral Research. Forum: Qualitative Social Research 17(3): 42–64.
Silver C and Rivers C (2016) The CAQDAS Postgraduate Learning Model: an interplay between methodological awareness, analytic adeptness and technological proficiency. International Journal of Social Research Methodology 19(5): 593–609.