By Ninian Frenguelli
It is increasingly difficult to access data from social media platforms. Researcher access to Meta platforms was removed in 2024 when CrowdTangle was shut down, though access had already been eroding before that, with Meta closing researchers’ accounts in 2021. This leaves researchers in a difficult position: with no official route to the necessary data, understanding the legal and ethical frameworks for accessing it can be time-consuming and complicated.
This was the position I was in at the start of a project I’m working on. After reading around the issue and discussing it with the IT and data protection departments at my university, I chose Apify to collect social media data. This blog sets out the considerations that went into that decision and the methods I tried and discarded before reaching it.
Manual Data Collection
Manually collecting large amounts of data is often the most time-consuming option. The quickest way to do this is to take screenshots of social media posts and the associated elements that are relevant to your research question. This is good for image analysis but falls down for textual analysis because there is no easy way to enter data into analysis software, search for key words or phrases, or copy and paste content you’re interested in.
NVivo now offers the option of highlighting parts of images (or text in image format) for analysis. This is good for highlighting codes and themes, but you cannot copy and paste text from these screenshot images into your working document, and you cannot put the screenshots into corpus analysis software such as AntConc. A solution is to save your screenshots as PDFs and then use a PDF-to-OCR converter, which turns the images of text into searchable text that can be copied and pasted. Unfortunately, unless your institution has already paid for access to a service that provides this, bulk conversion will have a cost attached. It can be done for free in Python, but this is fiddly without an existing Python and IDE setup (there are instructions on YouTube, and a rough sketch below).
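To give a rough idea, a free conversion pipeline might look like the sketch below. It assumes the pytesseract and pdf2image Python packages (and the Tesseract OCR engine and Poppler utilities they depend on) are installed; the folder names are placeholders.

```python
# A minimal sketch of free PDF-to-searchable-text conversion.
# Assumes pytesseract and pdf2image are installed, along with the
# Tesseract and Poppler system tools they wrap. Folder names are
# placeholders.
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path

Path("ocr_output").mkdir(exist_ok=True)

for pdf in Path("screenshots").glob("*.pdf"):
    text = ""
    for page in convert_from_path(str(pdf)):  # render each PDF page as an image
        text += pytesseract.image_to_string(page)  # OCR the rendered page
    Path("ocr_output", pdf.stem + ".txt").write_text(text, encoding="utf-8")
```

The resulting plain-text files can then be searched, copied from, and loaded into corpus tools such as AntConc.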
A similar drawback applies to quantitative analysis of the numbers of comments, likes, reposts, and so on. These figures would need to be entered manually after data collection, which is time-consuming and carries a risk of human error.
This option is possible and would work well for smaller datasets. It would also be easier with a small amount of research funding (or Python know-how) for the PDF-to-OCR conversion. I rejected this method because of the time it would have taken to get the collection of screenshots into a usable format.
Developing a Web Scraper
Automated collection of research data (web scraping) can be done legally and ethically if it does not overburden platforms’ servers, complies with local laws (for the EU/UK, the GDPR), and does not violate a platform’s terms of service. Web scrapers should send minimal requests to websites and always leave delays between those requests. For GDPR compliance, the data collected must have been “manifestly made public by the data subject,” be collected for a task in the public interest, and be the minimum amount of data necessary to complete that public-interest task. You must not use an account to access social media sites for web scraping, as this violates most social media companies’ terms of service and risks a potentially permanent ban for your account.
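To make the “minimal requests with delays” point concrete, the pattern looks something like the sketch below; the URLs, contact address, and delay value are illustrative placeholders rather than recommendations for any particular platform.

```python
# An illustrative sketch of polite scraping: few requests, a fixed pause
# between them, and a contact address in the User-Agent. URLs, contact
# details, and the delay are placeholders, not platform-specific advice.
import time

import requests

URLS = [
    "https://example.com/public-page-1",
    "https://example.com/public-page-2",
]
HEADERS = {"User-Agent": "research-scraper (contact: researcher@example.ac.uk)"}

for url in URLS:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # stop on errors rather than retrying blindly
    # ... parse and store only the minimum data needed from response.text ...
    time.sleep(5)  # always pause between requests so the server is not burdened
```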
There are several Instagram scraping packages that can be run from the command line or used in your own code, often accompanied by YouTube videos demonstrating how to use them. Instaloader is an Instagram scraper that stores each scraped post in one folder containing text and image files. This is useful because it keeps the numerical, textual, and image data from a post together, but it can be awkward for larger datasets: the data will need to be moved to another format (such as an Excel, CSV, or text file) if you want to do things like compare likes and engagement over time. The packages available vary by platform, so some research is needed to ensure that scraping a given platform is both ethical and feasible; a minimal Instaloader example follows below.
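For illustration, a minimal Instaloader run looks something like the sketch below. It stays logged out, in line with the terms-of-service point above; the profile name is a placeholder, and whether anonymous access works at any given moment depends on Instagram’s current restrictions.

```python
# A minimal Instaloader sketch: download the posts of one public profile
# without logging in. The profile name is a placeholder, and anonymous
# access is subject to Instagram's changing restrictions.
import instaloader

L = instaloader.Instaloader()  # no login, per the terms-of-service point above

profile = instaloader.Profile.from_username(L.context, "some_public_profile")
for post in profile.get_posts():
    L.download_post(post, target=profile.username)  # text + image files per post
```

The same can be done from the command line by running instaloader followed by the profile name.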
I decided against this option because sorting the data for analysis after collection would have taken too long, and it would not have resulted in the collection of all the necessary data.
Paid-for Services
Using a paid-for service to collect your data can save a lot of time, but it costs money, and the data you collect will still need some time investment to get it into a format you can analyse. You will also need to check that the service complies with your local laws and employs ethical scraping. I decided to use Apify for my data collection because its costs were low and it hosts extensive information about its ethical scraping and GDPR compliance on its website. I have included the costs of using this method as an illustration for other researchers.
Apify offers $5 of platform usage for free each month, enough for a dataset of about 1,300 Instagram posts. If you want more, you will need to buy a month’s plan for $39, which provides platform credit and the option to buy more on a pay-as-you-go basis. The $39 will get you a dataset of about 15,000 Instagram posts (downloadable in Excel or JSON format, among others). However, this does not include the images. For those, you will need to use the Dataset Image Downloader & Uploader. (If you do this, set the file name function to “({ item }) => `${item.id}`” so that the file names of the images match the IDs of the Instagram posts.) This actor does not have an inherent cost; instead, you pay for the platform usage. For all Apify actors, the live cost is shown as the actor runs, and you can set a cap on your monthly spend. For us, a dataset of 5,000 Instagram posts cost $11.70 and took two hours; downloading the images from this cost a further $1.75 and took half an hour.

Facebook posts can cost more, as the scraper must be rented for $35 per month. Fortunately, there is a 7-day free trial, so if you complete your data collection within those 7 days and then cancel the rental, it will be significantly cheaper. Scraping 3,300 posts from Facebook cost $1.93 and took 23 minutes.
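If you would rather drive the scrape from code than from the web console, Apify’s official Python client follows the pattern sketched below. The actor input fields shown (directUrls, resultsLimit) are examples of Instagram Scraper options and should be checked against the actor’s current input schema; the token and account URL are placeholders.

```python
# A sketch of running an Apify actor with the official apify-client
# package. The input fields (directUrls, resultsLimit) are illustrative
# Instagram Scraper options; check the actor's current input schema.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token

# Start the actor and wait for the run to finish.
run = client.actor("apify/instagram-scraper").call(
    run_input={
        "directUrls": ["https://www.instagram.com/some_public_account/"],
        "resultsLimit": 100,
    }
)

# Stream the scraped items out of the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```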
Another thing to be aware of is that the columns of the CSV file your data is downloaded in change based on the content scraped, so you are best off scraping all your data in one batch. I recommend doing a small test scrape to familiarise yourself with the process and then collecting all your data at once; the sketch below shows what goes wrong otherwise. Finally, data storage, data transfer, and use of proxies (which Apify uses by default) all cost money. Storing 650 posts from Instagram Scraper for about 18 hours cost me 5¢. I have also spent 57¢ transferring data internally and externally on the platform and $3.50 on proxies. These costs could ramp up with larger datasets, so deleting your runs once you have downloaded your output, and not transferring more data than necessary, will help minimise them.
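As a quick illustration of the mismatched-columns problem, the pandas sketch below concatenates two hypothetical batch exports: any column present in only one batch is filled with NaN for the other batch’s rows.

```python
# A minimal pandas sketch of why single-batch scraping is safer: columns
# that appear in only one batch leave NaN gaps in the combined table.
# File names are hypothetical.
import pandas as pd

batch1 = pd.read_csv("instagram_batch_1.csv")
batch2 = pd.read_csv("instagram_batch_2.csv")

# pandas aligns on column names; mismatched columns become NaN columns.
combined = pd.concat([batch1, batch2], ignore_index=True, sort=False)
print(combined.isna().sum())  # inspect which fields are missing and how often
```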
The bottom line is that it is still possible to access this data in an ethical manner despite the removal of researcher access. Social media research does not need to be hindered by these events. While this blog is just a summary and you should still research your method before collecting any data, it remains possible to collect large datasets from social media platforms. Small projects, probably best suited to deep qualitative analysis, can be completed manually on many different platforms, and on Instagram with a free $5 from Apify. Web scraping, when done properly, can also increase the amount of data researchers can collect. A research budget of $50–$100 will get you a dataset of thousands to tens of thousands of social media posts. We should press on.
Ninian Frenguelli is a postdoctoral researcher at Swansea University on the Trans-Atlantic Platform-funded project Repairing Sociality, Safeguarding Democracy: Transatlantic North-South Narratives and Practices of Deep Equality (Social Repair), which is based across Brazil, Canada, South Africa, and Wales. His PhD studied gender-related beliefs on right-wing extremist websites, and his work on the Social Repair project looks at practices of human connection and deep equality on social media.