Linking survey and digital trace data
Understanding online behaviors, attitudes, and identities was a key challenge for social science in the 21st century. At the same time, the opportunities provided by digital trace data were substantial, as researchers could access huge quantities of precise observational data relatively quickly, easily, and cheaply. However, the fact that these data were not designed for social researchers created challenges: researchers had a limited understanding of who (or what, in the case of 'bots') was included in the data and the biases the data may have had, and limited control over what information was collected to ensure it answered their research questions.
The event explored the feasibility, challenges, and opportunities of linking digital trace data with survey data, drawing on experiences and findings from the ESRC-funded ‘Understanding (Online/Offline) Society’ project. It focused specifically on experiences linking X (formerly known as Twitter) and LinkedIn data with survey data, although the findings could be applied to digital trace data more broadly. It was split into three sessions, each focusing on a key methodological question:
- How can digital trace data and survey data enhance each other?
- How can we maximise informed consent to link survey and digital trace data?
- How can digital trace data be collected, linked to survey data, and shared in a legal and ethical manner that maintains utility?
Session 1: How can digital trace data and survey data enhance each other?
Digital trace data (DTD) and survey data, when linked together, became greater than the sum of their parts. DTD often lacked demographic information, making it difficult to understand who was represented. However, when DTD was linked to survey data, the issue of representation could be addressed. DTD could also enhance data collected through surveys by filling in time gaps between waves of a data collection exercise, potentially capturing fluctuations in employment trajectories, mental health, or political allegiances. Surveys collected data using a focused set of standardized questions, while DTD provided a greater breadth of information on attitudes, beliefs, and behaviors that a questionnaire might not have captured. Finally, much DTD had a network element, allowing researchers to situate individuals within a web of wider social connections.
The session included examples of how DTD and survey data could be used together to enhance each other and was followed by a wider discussion of the opportunities when linking survey and DTD.
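As a rough illustration of the linkage idea described above, the following minimal Python sketch joins survey records to digital trace records for consenting respondents only, and counts posts made between two survey waves. It is not the project's method: the field names (`pid`, `consent`, `wave_dates`, `posted`), the function `link_between_waves`, and all data values are hypothetical and invented for this example.

```python
from datetime import date

# Hypothetical survey records: one per respondent, with a consent flag
# and the interview dates of two consecutive waves.
survey = [
    {"pid": "R001", "age_group": "25-34", "consent": True,
     "wave_dates": (date(2020, 1, 10), date(2021, 1, 12))},
    {"pid": "R002", "age_group": "55-64", "consent": False,
     "wave_dates": (date(2020, 1, 15), date(2021, 1, 20))},
]

# Hypothetical digital trace records keyed by the same pseudonymous ID.
posts = [
    {"pid": "R001", "posted": date(2020, 6, 1)},
    {"pid": "R001", "posted": date(2020, 11, 3)},
    {"pid": "R001", "posted": date(2021, 3, 9)},
    {"pid": "R002", "posted": date(2020, 7, 7)},
]


def link_between_waves(survey, posts):
    """Link posts to consenting respondents and count posts between waves."""
    linked = []
    for person in survey:
        # Only respondents who gave consent are linked.
        if not person["consent"]:
            continue
        start, end = person["wave_dates"]
        n_posts = sum(
            1 for p in posts
            if p["pid"] == person["pid"] and start <= p["posted"] < end
        )
        linked.append({
            "pid": person["pid"],
            "age_group": person["age_group"],  # demographic from the survey
            "posts_between_waves": n_posts,    # activity from the trace data
        })
    return linked


linked = link_between_waves(survey, posts)
```

In this sketch the survey supplies the demographic information that the trace data lacks, while the trace data supplies activity in the gap between waves.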
Shujun Liu: Exploring the impact of social class and political affiliation on X / Twitter usage motives and actual activities
Social media platforms, particularly X / Twitter, are instrumental in enabling individuals to accrue social capital by fostering networking opportunities and enabling self-presentation via tweets. This study delves into the influence of various factors, such as social class and political affiliation, on the motives behind X / Twitter usage and the resultant behavioural trends on the platform. By integrating survey responses with real X / Twitter activities, the research seeks to uncover the links between an individual's social class and political leanings, their reasons for using X / Twitter, and their specific activities on the platform, including frequency of posts, following and follower patterns, and the presence of bio information.
Tarek Al Baghal: Linking survey and X / Twitter data: Survey behaviours and data quality
Linking social media and survey data at the individual level has the potential to add evidence to a variety of research questions. Using linked Understanding Society Innovation Panel and X / Twitter data, this study explores how the combined data can be used to understand important survey behaviours that have an impact on data quality, particularly in a longitudinal setting. We explore the potential use of added social media data on predicting survey attrition and survey measurement. While small sample sizes impact the power of some analyses, the methods developed are illustrative of ways to use this novel data source. To the extent that social media metrics are predictive of these behaviours, the use of the data may improve strategies for future survey design.
Session 2: How can we maximise informed consent to link survey and digital trace data?
A key step in linking survey and digital trace data (DTD) was receiving informed consent from participants to do so and enabling them to provide that access. If rates of consent or data provision were low, the risk of bias in the sample could increase and the quality of the research could be undermined. It could also reduce the amount of data available for analysis or increase costs, as more effort was required to reach sufficient sample sizes for robust analysis. It was also important to ensure that any decision was appropriately informed, for ethical and, potentially, legal reasons.
This session included two presentations looking at these issues – the extent to which non-consent introduced bias into the sample, and public attitudes to data linkage and approaches to maximising informed consent. It was followed by a broader discussion of consent to link survey and DTD.
Curtis Jessop: Understanding and improving consent to link survey and X / Twitter data
Previous research has shown that consent rates to link survey and X / Twitter data are relatively low. This is a problem for studies as lower consent rates increase the risk of bias being introduced into the sample and of having insufficient data for robust analysis. This paper will update previous research looking at the bias introduced by non-consent using a larger sample size. It will look firstly at the socio-demographic characteristics associated with consent to data linkage and then if and how consent is associated with self-reported X / Twitter use. It will then present evidence from qualitative research on public attitudes to consent to data linkage and experimental evidence on the effectiveness of different approaches to improving consent rates.
Shujun Liu: Associations with consent to link survey and X / Twitter data
Linking survey and social media data has gained popularity. However, obtaining consent from respondents to link social media is a known challenge. Using data from a nationally representative survey of the UK, this study investigated whether respondents’ a) activity frequency, b) activity variety and c) technical skills with smartphones are associated with consent to link X / Twitter data to survey responses. Additionally, this study explored the mediating role of privacy and security concerns and the moderating effects of age, gender, employment, and educational level to better understand the influence of privacy concerns on X / Twitter linkage consent.
Session 3: How can digital trace data be collected, linked to survey data, and shared in a legal and ethical manner that maintains utility?
The public nature of some digital trace data (DTD) that made them so accessible to researchers also meant that, in their raw form, individuals were identifiable from them. This was problematic because, although these data may have been public and users agreed to terms and conditions that said the data they produced may be used for research, they may not have read those terms, fully appreciated the context, or considered this use at the point of posting the information online. Further, when these data were linked to an individual’s survey data (which were not public), this would de-anonymize the survey responses. However, anonymizing data to minimize the risk of harm to participants risked undermining the additional utility of the DTD for certain types of analysis.
This session explored our experiences of sharing such potentially disclosive data within the team and archiving them for public use, and reflected on how new forms of data challenged our assumptions and approaches to data governance.
Luke Sloan: Accessing and sharing linked data within a research team
Whilst cross-institutional projects and data sharing agreements (DSAs) are common practice in collaborative work, our experiences indicate a lack of shared understanding and expectations about enabling access to identifiable and linked data. In this session we reflect on the difficulties and challenges we experienced in working with this data and setting up DSAs, including how to ensure people in different organisations adhere to specific requirements, the provision of secure access via locked-down devices, and how everyone, from legal and contracts through to risk and compliance, has an opinion on how these things should be done.
Paulo Serodio: Archiving social media and survey data from the Understanding Society Innovation Panel
Social media corpora are increasingly used in social science research, albeit rarely accompanied by survey data due to the risk of unmasking respondents’ identity. We outline a framework to archive social media data and survey data in a linked format that minimizes the risk of disclosure of respondents’ identity. Leveraging data from the Innovation Panel of the UK Household Longitudinal Survey, which asked for respondents’ consent to link their survey responses to their X / Twitter data, we propose a systematic and transparent approach to generating summary metrics of X / Twitter activity at both tweet and user levels, spanning disciplinary boundaries to accommodate the multi-dimensional nature of the survey.
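To make the idea of user-level summary metrics concrete, here is a minimal Python sketch that aggregates tweet-level records into per-respondent summaries, so that raw text and account handles need not be archived alongside survey responses. It is an illustration under stated assumptions, not the framework proposed in the talk: the record keys (`pid`, `text`, `is_retweet`), the function `user_level_metrics`, and the three metrics chosen are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def user_level_metrics(tweets):
    """Aggregate tweet-level records into per-respondent summary metrics.

    Each tweet is a dict with hypothetical keys:
    'pid' (pseudonymous survey ID), 'text', and 'is_retweet'.
    The raw text is used only to compute its length and is not
    carried into the output.
    """
    by_pid = defaultdict(list)
    for tweet in tweets:
        by_pid[tweet["pid"]].append(tweet)

    summary = {}
    for pid, items in by_pid.items():
        summary[pid] = {
            "n_tweets": len(items),
            "share_retweets": sum(1 for t in items if t["is_retweet"]) / len(items),
            "mean_length": mean(len(t["text"]) for t in items),
        }
    return summary


# Invented example data for two respondents.
tweets = [
    {"pid": "R001", "text": "Hello world", "is_retweet": False},
    {"pid": "R001", "text": "RT something", "is_retweet": True},
    {"pid": "R002", "text": "Short", "is_retweet": False},
]
metrics = user_level_metrics(tweets)
```

Summaries of this kind trade some analytical detail for a lower risk of disclosure; which metrics remain safe to release would depend on the disclosure assessment for the specific dataset.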
Deb Wiltshire: Is it time to re-examine our approach to data governance?
With the increased availability of sensitive and highly detailed data, we’ve seen a corresponding focus on data governance. Trusted Research Environments (TREs) serve an important function here, setting up carefully considered frameworks of policies, procedures, standards, and guidelines used to ensure safe and ethical data use, thus enabling the sharing of data previously deemed too disclosive to share.
With the rise of the open data movement and methodologies like web-scraping, it seems an opportune moment to re-examine our approach to data governance. This session aims to start this discussion by focusing on key questions including: Do we know where risk comes from? Are our systems built on control or trust? Are we achieving what we think we are?
Luke Sloan, Deputy Director, Social Data Science Lab
Luke Sloan is a Professor in the School of Social Sciences at Cardiff University, Deputy Director of the Social Data Science Lab, and Principal Investigator on the ESRC project ‘Understanding [Online/Offline] Society: Linking Surveys with Twitter Data’ (ES/S015175/1). His key research interests are understanding representation on Twitter and augmenting social media data through data linkage.
Shujun Liu, Doctor of Philosophy, Cardiff University
Shujun Liu is a Research Associate in the School of Social Sciences at Cardiff University, where she works as part of the ESRC project ‘Understanding [Online/Offline] Society: Linking Surveys with Twitter Data’ (ES/S015175/1). Her key research interests include digital media studies, computational social science, climate communication, and political communication.
Tarek Al Baghal, Deputy Director of Understanding Society and Professor, University of Essex
Tarek Al Baghal is a Professor of Survey Methodology at the Institute of Social and Economic Research, University of Essex, and is Deputy Director of Understanding Society, one of the largest longitudinal studies in the world. His research focuses on linkage of new digital data sources to surveys and the uses of these combined sources.
Curtis Jessop, Director of Attitudinal Surveys and the NatCen Panel, National Centre for Social Research
Curtis Jessop is the Director of Attitudinal Surveys and the NatCen Panel at the National Centre for Social Research where he oversees the development and delivery of the British Social Attitudes study and the NatCen Panel, the UK’s first open mixed-mode random probability research panel.
Curtis is an expert in survey research and has conducted research in a wide range of substantive and methodological areas. Prior to this role he has worked on large, mixed-mode longitudinal projects such as Understanding Society and Next Steps. He has also conducted research into combining survey data with digital trace data and was the lead for the ‘New social media, new social science’ collaborative network.
Paulo Serodio, Senior Research Officer, University of Essex
Paulo Serodio is a Senior Research Officer at the Institute for Social and Economic Research at the University of Essex. He has an MA and a PhD in Political Economy from the same university. Prior to joining ISER, he held positions with the D’Amore-McKim School of Business at Northeastern University, Brasenose College at the University of Oxford, the Centre for the Sociology of Organisations at Sciences Po, and the University of Barcelona. His research primarily delves into survey augmentation and methodology, sampling strategies for hard-to-reach populations, commercial determinants of health, and network science.
Deb Wiltshire, Social Scientist
Deb Wiltshire is a historical demographer and social scientist, with many years’ experience in teaching quantitative research methods and data skills in the UK. For the last ten years she has worked in secure data access services and currently heads up a Trusted Research Environment in Germany where she specialises in safe access for sensitive data. She regularly trains and advises the research community on statistical disclosure control and developing data governance frameworks.