SE Radio 503: Diarmuid McDonnell on Web Scraping

Diarmuid McDonnell, a Lecturer in Social Sciences at the University of the West of Scotland, talks about the increasing use of computational approaches for data collection and data analysis in social sciences research. Host Kanchan Shringi speaks with McDonnell about web scraping, a key computational tool for data collection. Diarmuid talks about what a social scientist or data scientist must evaluate before starting on a web scraping project, what they should learn and watch out for, and the challenges they may encounter. The discussion then focuses on the use of Python libraries and frameworks that aid web scraping, as well as the processing of the gathered data, which centers around collapsing the data into aggregate measures.
This episode sponsored by TimescaleDB.

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Kanchan Shringi 00:00:57 Hi, all. Welcome to this episode of Software Engineering Radio. I’m your host, Kanchan Shringi. Our guest today is Diarmuid McDonnell. He’s a lecturer in Social Sciences at the University of the West of Scotland. Diarmuid graduated with a PhD from the Faculty of Social Sciences at the University of Stirling in Scotland. His research employs large-scale administrative datasets. This has led Diarmuid on the path of web scraping. He has run webinars and published these on YouTube to share his experiences and educate the community on what a developer or data scientist must evaluate before starting out on a web scraping project, as well as what they should learn and watch out for, and finally, the challenges that they may encounter. Diarmuid, it’s so great to have you on the show. Is there anything else you’d like to add to your bio before we get started?

Diarmuid McDonnell 00:01:47 Nope, that’s an excellent introduction. Thank you so much.

Kanchan Shringi 00:01:50 Great. So big picture. Let’s spend a little bit of time on that. And my first question would be what’s the difference between screen scraping, web scraping, and crawling?

Diarmuid McDonnell 00:02:03 Well, I think they’re three varieties of the same approach. Web scraping is traditionally where we try to collect information, particularly text and often tables, maybe images, from a website using some computational means. Screen scraping is roughly the same, but I guess a bit more of a broader term for collecting all of the information that you see on a screen from a website. Crawling is very similar, but in that instance you’re less interested in the content that’s on the webpage or the website and more interested in the links that exist on a website. So crawling is about finding out how websites are connected together.

Kanchan Shringi 00:02:42 How would crawling and web scraping be related? You definitely need to find the sites you need to scrape first.

Diarmuid McDonnell 00:02:51 Absolutely. They’ve got different purposes, but they have a common first step, which is requesting the URL of a webpage. In the first instance, web scraping, the next step is to collect the text or the video or image information on the webpage. But with crawling, what you’re interested in are all of the hyperlinks that exist on that web page and where they link to going forward.
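The distinction drawn here can be illustrated with a minimal sketch using only Python’s standard library. The class name `PageParser`, the sample HTML, and the `example.org` address are hypothetical and chosen for illustration; they do not come from the episode. The same parsed page yields text (the scraping concern) and hyperlinks (the crawling concern).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects page text (scraping) and hyperlinks (crawling) from HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Crawling cares about where each <a href="..."> points.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Scraping cares about the text content of the page.
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (text, links) for one page that has already been requested."""
    parser = PageParser(base_url)
    parser.feed(html)
    return " ".join(parser.text_parts), parser.links


# Hypothetical page content, standing in for the response to a URL request.
sample = (
    "<html><body><h1>Charity A</h1><p>Income: 5000</p>"
    '<a href="/charity/2">Next</a></body></html>'
)
text, links = parse_page(sample, "https://example.org/charity/1")
```

A scraper would keep `text` and move on; a crawler would queue each entry in `links` for its next request.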

Kanchan Shringi 00:03:14 So we’ll get into some of the use cases, but before that, why use web scraping in this day and age with the prevalent APIs provided by most vendors?

Diarmuid McDonnell 00:03:28 That’s a good question. APIs are a very important development in general for the public and for developers. As academics, we find them useful, but they don’t provide the full spectrum of information that we may be interested in for research purposes. Many public services, for example, are accessed through websites. They provide lots of interesting information on policies and on statistics, for example, and these web pages change quite frequently. Through an API, you can get maybe some of the same information, but of course it’s restricted to whatever the data provider thinks you need. So in essence, it’s about what you think you may need in total to do your research, for example, versus what’s available from the data provider based on their policies.

Kanchan Shringi 00:04:11 Okay. Now let’s drill into some of the use cases. What in your mind are the key use cases for which web scraping is employed, and what was yours?

Diarmuid McDonnell 00:04:20 Well, I’ll pick up mine. As an academic and as a researcher, I’m interested in large-scale administrative data about non-profits around the world. There are lots of different regulators of these organizations, and many do provide data downloads in common open-source formats. However, there’s lots of information about these sectors that the regulator holds but doesn’t necessarily make available in their data download. So for example, the people running these organizations: that information is typically available on the regulator’s website, but not in the data download. So a good use case for me as a researcher: if I want to analyze how these organizations are governed, I need to know who sits on the board of these organizations. So for me, often the use case in academia and in research is that the value-added, richer information we need for our research exists on web pages, but not necessarily in the publicly available data downloads. And I think this is a common use case across industry and potentially for personal use also: the value-added and richer information is available on websites but has not necessarily been packaged nicely as a data download.

Kanchan Shringi 00:05:28 Can you start with an actual problem that you solved? You hinted at one, but could you guide us through the entire issue? Did something unexpected happen as you were trying to scrape the information? What was the purpose, just to get us started?

Diarmuid McDonnell 00:05:44 Absolutely. One particular jurisdiction I’m interested in is Australia. It has quite a vibrant non-profit sector, known as charities in that jurisdiction. And I was interested in the people who governed these organizations. Now, there is some limited information on these people in the publicly available data download, but the value-added information on the webpage shows how these trustees are also on the board of other non-profits, on the board of other organizations. So those network connections were what I was particularly interested in for Australia. That led me to develop a reasonably simple web scraping application that would get me the trustee information for Australian non-profits. There are some common approaches and techniques I’m sure we’ll get into, but one particular challenge was that the regulator’s website does have an idea of who’s making requests for their web pages. And I haven’t counted exactly, but every 1,000 or 2,000 requests, it would block that IP address. So I was setting my scraper up at night, which would be the morning over there. I was assuming it was running, and I would come back in the morning and find that my script had stopped working midway through the night. So that led me to build in some protections and some conditionals that meant that every couple of hundred requests I would send my web scraping application to sleep for five or 10 minutes, and then start again.
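The pause-and-resume protection described above might look something like the following sketch. It is an illustration under stated assumptions, not the code used in the project: the function name `scrape_with_pauses`, the default batch size of 200, and the five-minute pause are hypothetical values chosen to match "every couple of hundred requests" and "five or 10 minutes". The `fetch` argument is any callable that requests one URL, so the throttling logic can be tried without touching a real website.

```python
import time


def scrape_with_pauses(urls, fetch, batch_size=200, pause_seconds=300,
                       sleep=time.sleep):
    """Fetch each URL in turn, sleeping after every `batch_size` requests.

    `urls` is a list of addresses, `fetch` is any callable taking a URL, and
    `sleep` is injectable so the pause can be replaced when experimenting.
    """
    results = {}
    for count, url in enumerate(urls, start=1):
        results[url] = fetch(url)
        # Pause after each full batch, but not after the final request.
        if count % batch_size == 0 and count < len(urls):
            sleep(pause_seconds)
    return results
```

With the Requests package discussed later in the episode, `fetch` could be as simple as `lambda url: requests.get(url, timeout=30).text`.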

Kanchan Shringi 00:07:06 So was this the first time you had done web scraping?

Diarmuid McDonnell 00:07:10 No, I’d say this is probably somewhere in the middle. My first experience of this was quite simple. I was on strike for my university and fighting for our pensions. I had two weeks, and a colleague had been using Python for a different application. And I thought I would try to access some data that looked particularly interesting back in my home country of the Republic of Ireland. So I sat there for two weeks, tried to learn some Python quite slowly, and tried to download some data from an API. But what I quickly realized in my field of non-profit studies is that there aren’t too many APIs, but there are lots of websites with lots of rich information on these organizations. And that led me to employ web scraping quite frequently in my research.

Kanchan Shringi 00:07:53 So there must be a reason though why these websites don’t actually provide all this data as part of their APIs. Is it actually legal to scrape? What is legal and what’s not legal to scrape?

Diarmuid McDonnell 00:08:07 It would be lovely if there was a very clear distinction between which websites were legal and which were not. In the UK, for example, there isn’t a specific piece of legislation that forbids web scraping. A lot of it comes under our copyright legislation, intellectual property legislation, and data protection legislation. Now that’s not the case in every jurisdiction; it varies, but those are the common issues you come across. It’s less to do with the fact that you can’t, in an automated manner, collect information from websites, though sometimes a website’s terms and conditions say you cannot use a computational means of collecting data from the website. In general, it’s not about not being able to computationally collect the data. It’s that there are restrictions on what you can do with the data, having collected it through your web scraper. So that’s the real barrier, particularly for me in the UK and particularly for the applications I have in mind: it’s the restrictions on what I can do with the data. I may be able to technically and legally scrape it, but I might not be able to do any analysis or repackage it or share it in some findings.

Kanchan Shringi 00:09:13 Do you first check the terms and conditions? Does your scraper first parse through the terms and conditions to decide?

Diarmuid McDonnell 00:09:21 This is actually one of the manual tasks associated with web scraping. In fact, it’s the detective work that you have to do to get your web scrapers set up. It’s not actually a technical task or a computational task. It’s simply clicking on the website’s terms of service or terms and conditions, usually a link found near the bottom of web pages. And you have to read them and say, does this website specifically forbid automated scraping of their web pages? If it does, then you may usually write to that website and ask for their permission to run a scraper. Sometimes they do say yes. Often it’s a blanket statement that you’re not allowed to web scrape, but if you have a good public interest reason, as an academic, for example, you may get permission. But often websites aren’t explicit in banning web scraping, but they will have lots of conditions about the use of the data you find on the web pages. That’s usually the biggest obstacle to overcome.

Kanchan Shringi 00:10:17 In terms of the terms and conditions, are they different if it’s a public page versus a page that’s protected by user login, where you’ve actually logged in?

Diarmuid McDonnell 00:10:27 Yes, there is a distinction between those different levels of access to pages. In general, web scraping may just be forbidden by the terms of service. Often, even if information is accessible via web scraping, that usually does not apply to information held behind authentication. So private pages and members-only areas are usually restricted from your web scraping activities, often for good reason, and it’s not something I’ve ever tried to overcome, though there are technical means of doing so.

Kanchan Shringi 00:11:00 That makes sense. Let’s now talk about the technology that you used to employ web scraping. So let’s start with the challenges.

Diarmuid McDonnell 00:11:11 The challenges, of course. When I began learning to conduct web scraping, it began as an intellectual pursuit, and in social sciences there’s increasing use of computational approaches in our data collection and data analysis methods. One way of doing that is to write your own programming applications. So instead of using software out of the box, so to speak, I’ll write a web scraper from scratch using the Python programming language. Of course, the natural first challenge is you’re not trained as a developer or as a programmer, and you don’t have those ingrained good practices in terms of writing code. For us as social scientists in particular, we call it the grilled cheese methodology, which is that your programs just have to be good enough. And you’re not too focused on performance and shaving microseconds off the performance of your web scraper. You’re focused on making sure it collects the data you want and does so when you need it to.

Diarmuid McDonnell 00:12:07 So the first challenge is to write effective code, even if it’s not necessarily efficient. But I guess if you are a developer, you will be focused on efficiency also. The second major challenge is the detective work I outlined earlier. Often the terms and conditions or terms of service of a web page are not entirely clear. They may not expressly prohibit web scraping, but they may have lots of clauses around, you know, you may not download or use this data for your own purposes and so on. So, you may be technically able to collect the data, but you may be in a bit of a bind in terms of what you can actually do with the data once you’ve downloaded it. The third challenge is building some reliability into your data collection activities. This is particularly important in my area, as I’m interested in public bodies and regulators whose web pages tend to update very, very quickly, often on a daily basis as new information comes in.

Diarmuid McDonnell 00:13:06 So I need to ensure not just that I know how to write a web scraper and direct it to collect useful information; that brings me into more software applications and systems software, where I need to have a personal server that’s running, and then I need to maintain that as well to collect data. And it brings me into a couple of other areas that are not natural, I think, to a non-developer and a non-programmer. I’d see those as the three main obstacles and challenges, particularly for a non-programmer to overcome when web scraping.

Kanchan Shringi 00:13:37 Yeah, these are certainly challenges even for somebody that’s experienced, because I know this is a very popular question at interviews that I’ve actually encountered. So, it’s certainly an interesting problem to solve. So, you mentioned being able to write effective code, and earlier in the episode you did talk about having learned Python over a very short period of time. How do you then manage to write the effective code? Is it like a back and forth between the code you write and your learning?

Diarmuid McDonnell 00:14:07 Absolutely. It’s a case of experiential learning or learning on the job. Even if I had the time to engage in formal training in computer science, it’s probably more than I could ever possibly need for my purposes. So, it’s very much project-based learning for social scientists in particular to become good at web scraping. So, it definitely needs to be a project that really, really grabs you, one that would sustain your intellectual interest long after you start encountering the challenges that I’ve mentioned with web scraping.

Kanchan Shringi 00:14:37 It’s definitely interesting to talk to you because of the background and the fact that the actual use case led you into learning the technologies for embarking on this journey. So, in terms of reliability, early on you also mentioned the fact that some of these websites will have limits that you have to overcome. Can you talk more about that? You know, for that one specific case, were you able to use that same methodology for every other case that you encountered? Have you built that into the framework that you’re using to do the web scraping?

Diarmuid McDonnell 00:15:11 I’d like to say that all websites present the same challenges, but they do not. So in that particular use case, the challenge was that no matter who was making the request, after a certain number of requests, somewhere in the 1,000 to 2,000 requests in a row, that regulator’s website would cancel any further requests or simply wouldn’t respond. With a different regulator in a different jurisdiction, it was a similar challenge, but the solution was a little bit different. This time it was less to do with how many requests you made and more to do with the fact that you couldn’t make consecutive requests from the same IP address, so from the same computer or machine. In that case, I had to implement a solution which basically cycled through public proxies, so a public list of IP addresses. I would select from those and make my request using one of those IP addresses, cycle through the list again, make my request from a different IP address, and so on and so forth for the, I think it was something like 10,000 or 15,000 requests I needed to make for records. So, there are some common properties to some of the challenges, but actually the solutions need to be specific to the website.
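Cycling through a list of proxies, as described above, can be sketched with `itertools.cycle` from the standard library. This is an assumed shape rather than the actual implementation: the function name `fetch_with_rotating_proxies` is hypothetical, and `fetch` is a caller-supplied callable that makes one request through one proxy.

```python
from itertools import cycle


def fetch_with_rotating_proxies(urls, proxies, fetch):
    """Call fetch(url, proxy) for each URL, using the next proxy each time.

    When the proxy list is exhausted, `cycle` starts again from the top, so
    consecutive requests never share a proxy if there are at least two.
    """
    proxy_pool = cycle(proxies)
    results = []
    for url in urls:
        proxy = next(proxy_pool)
        results.append((url, proxy, fetch(url, proxy)))
    return results
```

With the Requests package, `fetch` could pass the chosen address through the `proxies` argument, for example `requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)`. Public proxies are frequently slow or offline, so a real scraper would likely also need to retry a failed request with the next proxy.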

Kanchan Shringi 00:16:16 I see. What about data quality? How do you know if you’re not reading duplicate information which is in different pages or broken links?

Diarmuid McDonnell 00:16:26 Data quality, thankfully, is an area a lot of social scientists have a lot of experience with. So that particular aspect of web scraping is common. Whether I conduct a survey of individuals, whether I collect data downloads, run experiments and so on, the data quality challenges are largely the same. Dealing with missing observations, dealing with duplicates, that’s usually not problematic. What can be quite difficult is the updating of websites, which does tend to happen reasonably frequently. If you’re running your own little personal website, then maybe it gets updated weekly or monthly. A public service or UK government website, for example, gets updated multiple times across multiple web pages every day, sometimes on a minute-by-minute basis. So for me, you certainly have to build in some scheduling of your web scraping activities, but thankfully, depending on the webpage you’re interested in, there’ll be some clues about how often the webpage actually updates.

Diarmuid McDonnell 00:17:25 So for regulators, they have different policies about when they show the records of new non-profits. Some regulators say every day we get a new non-profit we’ll update; some do it monthly. So usually there are persistent links and the information changes on a predictable basis. But of course there are definitely times where older webpages become obsolete. I’d like to say there are sophisticated means I have of addressing that, but largely, particularly for a non-programmer like myself, that comes back to the detective work of frequently checking in with your scraper, making sure that the website is working as intended and looks as you expect, and making any necessary changes to your scraper.

Kanchan Shringi 00:18:07 So in terms of maintenance of these tools, have you done research in terms of how other people might be doing that? Is there a lot of information available for you to rely on and learn?

Diarmuid McDonnell 00:18:19 Yes, there are actually some free and some paid-for solutions that do help you with the reliability of your scrapers. There’s, I think it is an Australian product, called morph.io, which allows you to host your scrapers and set a frequency with which the scrapers execute. And then there’s a webpage on the morph site which shows the results of your scraper, how often it runs, what results it produces and so on. That does have some limitations. It means you have to make the results of your scraping and your scraper public, and you may not want to do that, particularly if you’re a commercial institution, but there are other packages and software applications that do help you with the reliability. It’s certainly technically something you can do with a reasonable level of programming skills, but I’d imagine for most people, particularly as researchers, that would go well beyond what we’re capable of. In that case we’re looking at solutions like morph.io and Scrapy applications and so on to help us build in some reliability.

Kanchan Shringi 00:19:17 I do want to walk through just all the different steps in how you would get started on what you would implement. But before that I did have two or three more areas of challenges. What about JavaScript heavy sites? Are there specific challenges in dealing with that?

Diarmuid McDonnell 00:19:33 Yes, absolutely. Web scraping does work best when you have a static webpage, so what you see loaded up in your browser is exactly what you see when you request it using a scraper. Often there are dynamic web pages, so there’s JavaScript that produces responses depending on user input. Now, there are a couple of different ways around this, depending on the webpage. If there are forms or drop-down menus on the web page, there are solutions that you can use in Python. There’s the Selenium package, for example, that allows you to essentially mimic user input. It’s essentially like launching a browser that’s in the Python programming language, and you can give it some input. And that will mimic you actually manually inputting information in the fields, for example. Sometimes there’s JavaScript or there’s user input that you can actually see the backend of.

Diarmuid McDonnell 00:20:24 So the Irish regulator of non-profits, for example: its website actually draws information from an API. And the link to that API is nowhere on the webpage. But if you look in the developer tools, you can actually see what link it’s calling the data in from, and in that instance, I can go direct to that link. There are certainly some web pages that present some very difficult JavaScript challenges that I have not overcome myself just now. The Singapore non-profit sector, for example, has a lot of JavaScript and a lot of menus that have to be navigated that I think are technically possible, but have beaten me in terms of time spent on the problem, certainly.
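Going direct to the link found in the browser’s developer tools can be sketched as follows, using only the standard library. Everything specific here is an assumption for illustration: the episode does not give the regulator’s endpoint or response format, so the `results` and `name` field names and the sample payload are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """Request a JSON endpoint directly and decode the response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_names(payload):
    """Pull organisation names out of a decoded payload.

    The "results" and "name" keys are hypothetical; a real endpoint's
    structure has to be read off the response seen in the developer tools.
    """
    return [record["name"] for record in payload.get("results", [])]


# A made-up payload standing in for what fetch_json(...) might return.
sample_payload = json.loads(
    '{"results": [{"name": "Charity A"}, {"name": "Charity B"}]}'
)
names = extract_names(sample_payload)
```

Because the data arrives already structured, no HTML parsing is needed in this case, which is usually simpler and more robust than driving a browser with Selenium.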

Kanchan Shringi 00:21:03 Is there a community that you can leverage to solve some of these issues and bounce ideas and get feedback?

Diarmuid McDonnell 00:21:10 There’s not so much an active community in my area of social science. Generally, there are increasingly social scientists who use computational methods, including web scraping. We have a very small, loose community, but it is quite supportive. But in the main we’re quite lucky that web scraping is a fairly mature computational approach in terms of programming. Therefore I’m able to consult a vast corpus of questions and solutions that others have posted on Stack Overflow, for example. There are innumerable useful blogs I won’t even mention; if you just Google solutions to IP addresses getting blocked and so on, there are some excellent web pages in addition to Stack Overflow. So, for somebody coming into it now, you’re quite lucky that all the solutions have largely been developed. And it’s just a matter of finding those solutions using good search practices. But I wouldn’t say I need an active community. I’m reliant more on those detailed solutions that have already been posted on the likes of Stack Overflow.

Kanchan Shringi 00:22:09 So a lot of this data is unstructured as you’re scraping. So how do you understand the content? For example, there may be a price listed, but then there may be further annotations on discounts. So how would you figure out what the actual price is based on your web scraper?

Diarmuid McDonnell 00:22:26 Absolutely. In terms of your web scraper, all it’s recognizing is text on a webpage. Even if that text is something we would recognize as numeric as humans, your web scraper is just seeing reams and reams of text on a webpage that you’re asking it to collect. So, you’re quite right. There’s a lot of data cleaning post-scraping. Some of that data cleaning can occur during your scraping. So, you may use regular expressions to search for certain terms, which helps you refine what you’re actually collecting from the webpage. But in general, certainly for research purposes, we need to get as much information as possible, and then we use our common techniques for cleaning up quantitative data in particular, usually in a different software package. You can keep everything within the same programming language: your collection, your cleaning, your analysis can all be done in Python, for example. But for me, it’s about getting as much information as possible and dealing with the data cleaning issues at a later stage.
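As a small illustration of using regular expressions during scraping, the sketch below pulls currency amounts out of a run of scraped text and converts them to numbers. The pattern, the function name `extract_amounts`, and the sample sentence are assumptions made for this example; deciding which of the extracted amounts is the "actual" price remains a cleaning decision for later.

```python
import re

# A currency symbol, optional whitespace, then digits with optional
# thousands separators and an optional decimal part.
PRICE_PATTERN = re.compile(r"[£$€]\s*(\d[\d,]*(?:\.\d+)?)")


def extract_amounts(text):
    """Return every currency amount found in `text` as a float."""
    return [float(match.replace(",", "")) for match in PRICE_PATTERN.findall(text)]


# Hypothetical scraped text: to the scraper this is just a string.
scraped = "Was £1,250.00, now £999.50 (20% off)"
amounts = extract_amounts(scraped)
```

Here `amounts` holds both the original and the discounted figure, while the "20%" is ignored because it carries no currency symbol.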

Kanchan Shringi 00:23:24 How expensive have you found this endeavor to be? You mentioned a few things, you know. You have to use different IPs, so I suppose you’re doing that with proxies. You mentioned some tooling, like that provided by morph.io, which helps you host your scraper code and maybe schedule it as well. So how expensive has this been for you? Maybe you can talk about all the open-source tools you use versus places where you actually had to pay.

Diarmuid McDonnell 00:23:52 I think I can say that in the last four years of engaging in web scraping and using APIs, I have not spent a single pound, penny, dollar, or euro. That’s all been using open-source software, which has been absolutely fantastic, particularly as an academic. We don’t have large research budgets usually, if even any research budget. So being able to do things as cheaply as possible is a strong consideration for us. So I’ve been able to use completely open-source tools: Python as the main programming language for developing the scrapers, and any additional packages or modules like Selenium, for example, are again open source and can be downloaded and imported into Python. I guess maybe I’m minimizing the cost. I do have a personal server hosted on DigitalOcean, which I guess I don’t technically need, but the other alternative would be leaving my work laptop running pretty much all of the time and scheduling scrapers on a machine that’s not very capable, frankly.

Diarmuid McDonnell 00:24:49 So having a personal server does cost something in the region of 10 US dollars per month. So a truer cost might be that I’ve spent about $150 in four years of web scraping, which is hopefully a very good return for the information that I’m getting back. And in terms of hosting and version control, GitHub is very good for that purpose. As an academic, I can get a free version that works perfectly for my uses as well. So it’s all largely been open source, and I’m very grateful for that.

Kanchan Shringi 00:25:19 Can you now just walk through the step-by-step of how you would go about implementing a web scraping project? So maybe you can choose a use case and then we can walk through that. The things I wanted to cover were, you know, how you start with actually generating the list of sites, making the HTTP calls, parsing the content, and so on.

Diarmuid McDonnell 00:25:39 Absolutely. A recent project I’ve just about finished was looking at the impact of the pandemic on non-profit sectors globally. There were eight non-profit sectors that we were interested in: the four that we have in the UK and the Republic of Ireland, the US and Canada, Australia, and New Zealand. So, it’s eight different websites, eight different regulators. There aren’t eight different ways of collecting the data, but there were at least four. So we had that challenge to begin with. The selection of sites came from the pure substantive interests of which jurisdictions we were interested in. And then there’s still more manual detective work. So you’re going to each of these webpages and saying, okay, so on the Australian regulator’s website, for example, everything gets scraped from a single page. And then you scrape a link at the bottom of that page, which takes you to additional information about that non-profit.

Diarmuid McDonnell 00:26:30 And you scrape that one as well, and then you’re done, and you move on to the next non-profit and repeat that cycle. For the US for example, it’s different, you visit a webpage, you search it for a recognizable link and that has the actual data download. And you tell your scraper, visit that link and download the file that exists on that webpage. And for others it’s a mix. Sometimes I’m downloading files, and sometimes I’m just cycling through tables and tables of lists of organizational information. So that’s still the manual part you know, figuring out the structure, the HTML structure of the webpage and where everything is.

Kanchan Shringi 00:27:07 To generate links, wouldn’t you have leveraged any sites to go through the list of hyperlinks that they actually link out to? Have you not leveraged those to then figure out the additional sites that you would like to scrape?

Diarmuid McDonnell 00:27:21 Not so much. For research purposes, to use a term that may be relevant, it’s less about data mining and, you know, searching through everything in the hope that some interesting patterns will appear. We usually start with a very narrowly defined research question, and you’re just collecting information that helps you answer that question. So I personally haven’t had a research question that was about, you know, say, visiting a non-profit’s own organization webpage and then saying, well, what other non-profit organizations does that link to? I think that’s a very valid question, but it’s not something I’ve investigated myself. So I think in research and academia, it’s less about crawling web pages to see where the connections lie, though sometimes that may be of interest. It’s more about collecting specific information on the webpage that goes on to help you answer your research question.

Kanchan Shringi 00:28:13 Okay. So generating the list, in your experience or in your realm, has been more manual. So what next, once you have the list?

Diarmuid McDonnell 00:28:22 Yes, exactly. Once I have a good sense of the information I want, then it becomes the computational approach. So you’re getting at the eight separate websites, and you’re setting up your scraper, usually in the form of separate functions for each jurisdiction, because if you were to simply cycle through each jurisdiction, each web page looks a little bit different and your scraper would break down. So there are different functions or modules for each regulator that I then execute separately, just to have a bit of protection against potential issues. Usually the process is to request a data file, so one of the publicly available data files. I do that computationally: I request it, I open it up in Python, and I extract unique IDs for all of the non-profits. Then the next stage is building another link, which is the personal webpage of that non-profit on the regulator’s website, and then cycling through those lists of non-profit IDs. So for every non-profit, request its webpage and then collect the information of interest: its latest income, when it was founded, and if it’s been disbanded, what was the reason for its removal or its disorganization, for example. So then that becomes a separate process for each regulator, cycling through those lists, collecting all of the information I need. And then the final stage essentially is packaging all of those up into a single data set as well, usually a single CSV file with all the information I need to answer my research question.
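The workflow just described (extract unique IDs from a data file, build each organisation’s page link, collect the fields of interest, package everything as CSV) can be sketched with the standard library alone. The column name `charity_id`, the `regulator.example` address, and the function names are hypothetical placeholders; `scrape_one` stands in for a regulator-specific function that requests and parses one page.

```python
import csv
import io


def read_ids(csv_text, id_column="charity_id"):
    """Extract unique organisation IDs, in order, from a CSV data download."""
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        org_id = row[id_column].strip()
        if org_id and org_id not in ids:
            ids.append(org_id)
    return ids


def build_url(org_id, base="https://regulator.example/charity/"):
    """Build the (hypothetical) page address for one organisation."""
    return f"{base}{org_id}"


def collect(ids, scrape_one):
    """Cycle through the IDs, scraping each organisation's page in turn."""
    return [{"id": org_id, **scrape_one(build_url(org_id))} for org_id in ids]


def to_csv(rows, fieldnames):
    """Package the collected rows as a single CSV document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping one `scrape_one` function per regulator mirrors the separation described above: if one website changes its layout, only that function needs attention.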

Kanchan Shringi 00:29:48 So can you talk about the actual tools or libraries that you’re using to make the calls and parsing the content?

Diarmuid McDonnell 00:29:55 Yeah, thankfully there aren’t too many for my purposes, certainly. So it’s all done in the Python programming language. The main two for web scraping specifically are the Requests package, which is a very mature, well-established, well-tested module in Python, and also Beautiful Soup. So Requests is excellent for making the request to the website. Then the information that comes back, as I said, scrapers at that point just see as a blob of text. The Beautiful Soup module in Python tells Python that you’re actually dealing with a webpage and that there are certain tags and structure to that page. And then Beautiful Soup allows you to pick out the information you need and then save that to a file. As a social scientist, I’m interested in the data at the end of the day. So I want to structure and package all of the scraped data. So I’ll then use the CSV or the JSON modules in Python to make sure I’m exporting it in the correct format for use later on.
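A minimal sketch of how Requests and Beautiful Soup divide the work, assuming both packages are installed (`pip install requests beautifulsoup4`). The page structure is invented for illustration: the `h1` heading and the `td` cell with class `income` are hypothetical, and a real regulator’s page would need its own selectors worked out by inspecting the HTML.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Request a page and return its body as text (the 'blob of text')."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def parse_record(html):
    """Give the text structure with Beautiful Soup and pick out two fields."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1")
    income = soup.find("td", class_="income")  # hypothetical class name
    return {
        "name": name.get_text(strip=True) if name else "",
        "income": income.get_text(strip=True) if income else "",
    }


# Hypothetical page content, standing in for fetch_html(some_url).
sample_html = (
    "<html><body><h1> Charity A </h1>"
    '<table><tr><td class="income">5000</td></tr></table></body></html>'
)
record = parse_record(sample_html)
```

Each `record` is a plain dictionary, so a list of them can be written out with `csv.DictWriter` or `json.dump` from the standard library for later analysis.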

Kanchan Shringi 00:30:50 So you had mentioned Scrapy as well earlier. So are Beautiful Soup and Scrapy used for similar purposes?

Diarmuid McDonnell 00:30:57 Scrapy is basically a software application overall that you can use for web scraping. So you can use its own functions to request web pages and to build your own functions. So you do everything within the Scrapy module or the Scrapy package. In my case, instead, I’ve been building it, I guess, from the ground up using the Requests and Beautiful Soup modules and the CSV and JSON modules. I don’t think there’s a correct way. Scrapy probably saves time and it has more functionality than I currently use, but I certainly find it’s not too much effort, and I don’t lose any accuracy or functionality for my purposes, just by writing the scraper myself using those four key packages that I’ve just outlined.

Kanchan Shringi 00:31:42 So Scrapy sounds like more of a framework, and you would have to learn it a little bit before you start to use it and you haven’t felt the need to go there yet, or have you actually tried it before?

Diarmuid McDonnell 00:31:52 That’s exactly how it’s described. Yes, it’s a framework that doesn’t take a lot of effort to operate, but I haven’t felt the strong push to move from my approach into it just yet. I’m familiar with it because colleagues use it. So when I’ve collaborated with more able data scientists on projects, I’ve noticed that they tend to use Scrapy and build their scrapers in that. But going back to the grilled cheese analogy that our colleague in Liverpool came up with, it’s, at the end of the day, just about getting it working, and there are not such strong incentives to make things as efficient as possible.

Kanchan Shringi 00:32:25 And maybe something I should have asked you earlier, but now that I think about it, you know, you started to learn Python just so that you could embark on this journey of web scraping. So why Python, what drove you to Python versus Java for example?

Diarmuid McDonnell 00:32:40 In academia you’re entirely influenced by the person above you. So it was my former PhD supervisor who said he had started using Python, and he had found it very interesting just as an intellectual challenge and very useful for handling large-scale unstructured data. So it really was as simple as who in your department is using a tool, and that’s just common in academia. There’s not often a lot of thought that goes into the merits and disadvantages of different Open Source approaches. It’s purely that that was what was suggested. And I’ve found it very hard to give up Python for that purpose.

Kanchan Shringi 00:33:21 But in general, I think I’ve done some basic research and people only talk about Python when talking about web scraping. So certainly I’d be curious to know if you ever assessed something else and rejected it, or it sounds like you knew your path before you chose the framework.

Diarmuid McDonnell 00:33:38 Well, that’s a good question. I mean, there’s a lot of, I guess, path dependency. So once you start on something that you’re usually given, it’s very difficult to move away from it. In the Social Sciences, we tend to use the statistical software language ‘R’ for a lot of our data analysis work. And of course, you can perform web scraping in ‘R’ quite easily, just as easily as in Python. So I do find, when I’m training, you know, the upcoming social scientists, many of them will use ‘R’ and then say, why can’t I use ‘R’ to do our web scraping? You know, you’re teaching me Python, should I be using ‘R’? But I guess, as we’ve been discussing, there’s really not much of a distinction between which one is better or worse; it becomes a preference. And as you say, a lot of people prefer Python, which is good for support and communities and so on.

Kanchan Shringi 00:34:27 Okay. So you’ve pulled the content into a CSV, as you mentioned. What next? Do you store it, where do you store it, and how do you then use it?

Diarmuid McDonnell 00:34:36 For some of the larger-scale, frequent data collection exercises I do through web scraping, storing it on my personal server is usually the best way. I’d like to say I could store it on my university server, but that’s not an option at the moment. Hopefully it will be in the future. So it’s stored on my personal server, usually as CSV. So even if the data is available in JSON, I’ll do that little bit of an extra step to convert it from JSON to CSV in Python, because when it comes to analysis, when I want to build statistical models to predict outcomes in the non-profit sector, for example, a lot of my software applications don’t really accept JSON. We as social scientists, maybe even more broadly than that, are used to working with rectangular or tabular data sets and data formats. So CSV is enormously helpful if the data comes in that format to begin with, and if it can be easily packaged into that format during the web scraping, that makes things a lot easier when it comes to analysis as well.
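
The JSON-to-CSV step mentioned here can be sketched with the standard library alone. This is a hedged example rather than the code used in the research: it assumes the JSON is an array of flat objects, and the field names in the usage note are made up.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Flatten a JSON array of flat objects into CSV text.

    The header is the union of keys across records, in first-seen order;
    values missing from a record are written as empty cells.
    """
    records = json.loads(json_text)
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For instance, `json_to_csv('[{"id": "100", "income": 5000}, {"id": "200", "removed": "2021-03-01"}]')` produces a rectangular table with the columns `id`, `income`, and `removed`, which is the shape most statistical packages expect. Nested JSON would need flattening first, which this sketch does not attempt.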

Kanchan Shringi 00:35:37 Have you used any tools to actually visualize the results?

Diarmuid McDonnell 00:35:41 Yeah. So in Social Science we tend to use, well, it depends; there are three or four different analysis packages. But yes, regardless of whether you’re using Python or Stata or the ‘R’ statistical software language, visualization is the first step in good data exploration. And I guess that’s true in academia as much as it is in industry and data science and research and development. So, yeah, we’re interested in, you know, the links between a non-profit’s income and its probability of dissolving in the coming year, for example. A scatter plot would be an excellent way of looking at that relationship. So data visualizations for us as social scientists are the first step in exploration and are often the products at the end, so to speak, that go into our journal articles and into our public publications as well. So it is a very important step, particularly for larger-scale data, to condense that information and derive as much insight as possible.

Kanchan Shringi 00:36:36 In terms of challenges, like the websites themselves not allowing you to scrape data or, you know, putting terms and conditions or adding limits. Another thing that comes to mind, which probably is not really related to scraping, is CAPTCHAs. Has that been something you’ve had to invent special techniques to deal with?

Diarmuid McDonnell 00:36:57 Yes, there is usually a way around them. Well, certainly there was a way around the original CAPTCHAs, but I think, certainly in my experience, the more recent ones of selecting images and so on have become quite difficult to overcome using web scraping. There are absolutely better people than me, more technical, who may have solutions, but I certainly haven’t implemented or found an easy solution to overcoming CAPTCHAs. So, certainly on those dynamic web pages, as we’ve mentioned, it’s probably the major challenge to overcome, because, as we’ve discussed, there are ways around the other limits, such as proxies and making a limited number of requests and so on. CAPTCHAs are probably the outstanding problem, certainly for academia and researchers.

Kanchan Shringi 00:37:41 Do you envision using machine learning or natural language processing on the data that you’re gathering sometime in the future, if you haven’t already?

Diarmuid McDonnell 00:37:51 Yes and no is the academic’s answer. In terms of machine learning, for us that’s the equivalent of statistical modeling. So that’s, you know, trying to estimate the parameters that fit the data best. Quantitative social scientists have similar tools. So different types of linear and logistic regression, for example, are very coherent with machine learning approaches. But certainly natural language processing is an enormously rich and valuable area for social science. As you said, a lot of the information stored on web pages is unstructured and in text, and making good sense of that and quantitatively analyzing the properties of the text and its meaning is certainly the next big step, I think, for empirical social scientists. But for machine learning, we kind of have similar tools that we can implement. Natural language processing is certainly something we don’t currently do within our discipline. You know, we don’t have our own solutions, and we certainly need them to help us make sense of data that we scrape.

Kanchan Shringi 00:38:50 For the analytic aspects, how much data do you feel that you need? And can you give an example of when you’ve specifically used this, and what kind of analysis have you gathered from the data you’ve captured?

Diarmuid McDonnell 00:39:02 Well, one of the benefits of web scraping, certainly for research purposes, is that data can be collected at a scale that’s very difficult to achieve through traditional means like surveys or focus groups, interviews, experiments, and so on. So we can collect data, in my case, for entire non-profit sectors. And then I can repeat that process for different jurisdictions. So when I’ve been looking at the impact of the pandemic on non-profit sectors, for example, I’m collecting, you know, tens of thousands, if not millions, of records for each jurisdiction. So thousands and tens of thousands of individual non-profits, and I’m aggregating all of that information into a time series of the number of charities or non-profits that are disappearing every month, for example. I’m tracking that for a few years before the pandemic, so I have to have a good, long time series in that direction. And I have to frequently collect data since the pandemic for these sectors as well.

Diarmuid McDonnell 00:39:56 So what I’m tracking is: because of the pandemic, are there now fewer charities being formed? And if there are, does that mean that some needs will go unmet because of that? So, some communities may have a need for mental health services, and if there are now fewer mental health charities being formed, what’s the impact, and what kind of planning should government do? And then the flip side: if more charities are now disappearing as a result of the pandemic, then what impact is that going to have on public services in certain communities? So, to be able to answer what seem to be reasonably simple, understandable questions does need large-scale data that’s collected frequently, processed, and then collapsed into aggregate measures over time. That can be done in Python, that can be done in any particular programming or statistical software package; my personal preference is to use Python for data collection. I think it has lots of computational advantages for doing that. And I kind of like to use traditional social science packages for the analysis. But again, that’s entirely a personal preference, and everything can be done in Open Source software: the whole data collection, cleaning and analysis.
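
Collapsing individual records into a monthly time series, as described above, might look something like this in Python. It is only a sketch under stated assumptions: each record is a dict with an ISO-formatted date string, and the field name `dissolved` is hypothetical. The guest does this step in a separate statistics package.

```python
from collections import Counter
from datetime import date


def monthly_counts(records, date_field):
    """Count records per calendar month, keyed as 'YYYY-MM' and sorted by month.

    Records whose date field is empty or missing are skipped.
    """
    counts = Counter()
    for record in records:
        value = record.get(date_field)
        if value:
            day = date.fromisoformat(value)
            counts[f"{day.year:04d}-{day.month:02d}"] += 1
    return dict(sorted(counts.items()))


# Made-up rows: one per non-profit, with an optional dissolution date.
rows = [
    {"id": "1", "dissolved": "2020-03-15"},
    {"id": "2", "dissolved": "2020-03-30"},
    {"id": "3", "dissolved": "2020-04-02"},
    {"id": "4", "dissolved": ""},
]
print(monthly_counts(rows, "dissolved"))
```

Calling the same function with a registration-date field would give the matching series of newly formed organisations.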

Kanchan Shringi 00:41:09 I would be curious to hear what packages you use for this.

Diarmuid McDonnell 00:41:13 Well, I use the Stata statistical software package, which is a proprietary piece of software by a company in Texas. And that has been built for the types of analysis that quantitative social scientists tend to do. So, regressions, time series analyses, survival analysis, these kinds of things that we traditionally do. Those are now being imported into the likes of Python and ‘R’. So, as I said, it is getting possible to do everything in a single language, but certainly I can’t do any of the web scraping within the traditional tools that I’ve been using, Stata or SPSS, for example. So, I guess I’m building a workflow of different tools, tools that I think are particularly good for each distinct task, rather than trying to do everything in a single tool.

Kanchan Shringi 00:41:58 It makes sense. Could you talk more about what happens once you start using the tool that you’ve chosen? What kind of aggregations do you try to use the tool for, and what kind of additional input might you have to provide? It would be interesting to kind of close that loop here.

Diarmuid McDonnell 00:42:16 I see, yeah, of course. Web scraping is simply stage one of completing this piece of analysis. So once I’ve transferred the raw data into Stata, which is what I use, then it begins a data cleaning process, which is centered really around collapsing the data into aggregate measures. So, in the raw data, every row is a non-profit and there’s a date field: a date of registration or a date of dissolution. So I’m collapsing all of those individual records into monthly observations of the number of non-profits that are formed and dissolved in a given month. Analytically, then, the approach I’m using is that the data forms a time series. So there’s X number of charities formed in a given month. Then we have what we would call an exogenous shock, which is the pandemic. So this is, you know, something that was not predictable, at least analytically.

Diarmuid McDonnell 00:43:07 We may have arguments about whether it was predictable from a policy perspective. So we essentially have an experiment where we have a before period, which is, you know, almost like the control group. And we have the pandemic period, which is like the treatment group. And then we’re seeing if that time series of the number of non-profits that are formed is discontinued or disrupted because of the pandemic. So we have a technique called interrupted time series analysis, which is a quasi-experimental research design and mode of analysis. And then that gives us an estimate of to what degree the number of charities has now changed and whether the long-term temporal trend has changed also. So to give a specific example, what we’ve just concluded is that the pandemic certainly led to many fewer charities being dissolved. So that sounds a bit counterintuitive. You would think such a big economic shock would lead to more non-profit organizations actually disappearing.

Diarmuid McDonnell 00:44:06 The opposite happened. We actually had far fewer dissolutions than we would expect from the pre-pandemic trend. So there’s been a massive shock in the level, a massive change in the level, but the long-term trend is the same. So over time, there’s not been much deviation in the number of charities dissolving, and that’s how we see it going forward as well. So it’s like a one-off shock, it’s like a one-off drop in the number, but the long-term trend continues. And specifically, if you’re interested, the reason is that the pandemic affected regulators who process the applications of charities to dissolve; a lot of their activities were halted. So they couldn’t process the applications, and hence we have lower levels. And that’s in combination with the fact that a lot of governments around the world put in place financial support packages that kept organizations that would naturally fail, if that makes sense, from doing so, and kept them afloat for a much longer period than we could expect. So at some point we’re expecting a reversion to the level, but it hasn’t happened yet.
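
For readers unfamiliar with interrupted time series analysis, one common textbook way to write the model is as a segmented regression. This is offered for orientation only: the transcript does not say which specification was actually estimated, and all of the symbols below are introduced here.

```latex
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_0) D_t + \varepsilon_t,
\qquad
D_t =
\begin{cases}
0 & \text{if } t < T_0 \\
1 & \text{if } t \ge T_0
\end{cases}
```

Here $Y_t$ is the number of dissolutions in month $t$, $T_0$ is the first month of the pandemic period, $\beta_1$ is the pre-pandemic trend, $\beta_2$ is the change in level at the interruption, and $\beta_3$ is the change in trend afterwards. In these terms, the finding described above (a large one-off drop with an unchanged long-term trend) would correspond to a negative $\beta_2$ and a $\beta_3$ close to zero.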

Kanchan Shringi 00:45:06 Thank you for that detailed download. That was very, very interesting and certainly helped me close the loop in terms of the benefits that you’ve had. And it would have been absolutely impossible for you to have come to this conclusion without doing the due diligence and scraping different sites. So, thanks. So you’ve been educating the community, I’ve seen some of your YouTube videos and webinars. So what led you to start that?

Diarmuid McDonnell 00:45:33 Could I say money? Would that be... no, of course not. I became interested in the methods myself during my post-doctoral studies, and I had a fantastic opportunity to join one of the UK’s flagship data archives, which is called the UK Data Service. And I got a position as a trainer in their social science division, and like a lot of research councils here in the UK, and I guess globally as well, they’re becoming more interested in computational approaches. So with a colleague, we were tasked with developing a new set of materials that looked at the computational skills social scientists should really have moving into this kind of modern era of empirical research. So really it was carte blanche, so to speak, for my colleague and me. So we started doing a little bit of a mapping exercise, seeing what was available and what the core skills were that social scientists might need.

Diarmuid McDonnell 00:46:24 And fundamentally it did keep coming back to web scraping, because even if you have really interesting things like natural language processing, which is very popular, or social network analysis, which is becoming a huge area in the social sciences, you still have to get the data from somewhere. It’s not as common anymore for those data sets to be packaged up neatly and made available via a data portal, for example. So you do still need to go out and get your data as a social scientist. So that led us to focus quite heavily on the web scraping and the API skills that you need to have to get data for your research.

Kanchan Shringi 00:46:58 What have you learned along the way as you were teaching others?

Diarmuid McDonnell 00:47:02 Well, that there’s an apprehension, so to speak. I teach a lot of quantitative social science, and there’s usually a natural apprehension or anxiety about doing those topics because they’re based on mathematics. I think it’s less so with computers; for social scientists, it’s not so much a fear or an apprehension, but it’s mystifying. You know, if you don’t do any programming or you don’t engage with the kind of hardware and software aspects of your machine, then it’s very difficult to see, A, how these methods could apply to you, you know, why web scraping would be of any value, and B, it’s very difficult to see the process of learning. I usually like to use the analogy of an obstacle course, which has, you know, a 10-foot-high wall, and you’re staring at it going, there’s absolutely no way I can get over it. But with a little bit of support, from a colleague, for example, once you’re over the barrier, suddenly it becomes a lot easier to clear the course. And I think learning computational methods for somebody who’s a non-programmer, a non-developer, there’s a very steep learning curve at the beginning. And once you get past that initial bit and learn how to make requests sensibly, learn how to use Beautiful Soup for parsing webpages, and do some very simple scraping, then people really become enthused and see fantastic applications in their research. So there’s a very steep barrier at the beginning. And if you can get people over that with a really interesting project, then people see the value and get fairly enthusiastic.

Kanchan Shringi 00:48:29 I think that’s quite similar to the way developers learn as well, because there’s always a new technology or a new language to learn a lot of the time. So it makes sense. How do you keep up with this topic? Do you listen to any specific podcasts or YouTube channels? Or Stack Overflow, is that the place where you do most of your research?

Diarmuid McDonnell 00:48:51 Yes. In terms of learning the techniques, it’s usually through Stack Overflow, but actually, increasingly, it’s through public repositories made available by other academics. There’s a big push in general in higher education to make research materials Open Access. We’re maybe a bit late to that compared to the developer community, but we’re getting there. We’re making our data and our syntax and our code available. So increasingly I’m learning from other academics and their projects. And I’m looking at, for example, people in the UK who’ve been looking at scraping. The NHS, or National Health Service, releases lots of information about where it procures clinical services or personal protective equipment from, and there are people involved in scraping that information. That tends to be a bit more difficult than what I usually do, so I’ve been learning quite a lot about handling lots of unstructured data at a scale I’ve never worked at before. So that’s an area I’m moving into now: data that’s far too big for my server or my personal machine. So I’m largely learning from other academics at the moment. To learn the initial skills, I was highly dependent on the developer community, Stack Overflow in particular, and a couple of select blogs and websites and a couple of books as well. But now I’m really looking at full-scale academic projects and learning how they’ve done their web scraping activities.

Kanchan Shringi 00:50:11 Awesome. So how can people contact you?

Diarmuid McDonnell 00:50:14 Yeah. I’m happy to be contacted about learning or applying these skills, particularly for research purposes, but more generally, usually it’s best to use my academic email. So it’s my first name dot last [email protected]. So as long as you don’t have to spell my name, you can find me very, very easily.

Kanchan Shringi 00:50:32 We’ll probably put a link in our show notes if that’s okay.

Diarmuid McDonnell 00:50:35 Yes.

Kanchan Shringi 00:50:35 So it was great talking to you today. I certainly learned a lot and I hope our listeners did too.

Diarmuid McDonnell 00:50:41 Fantastic. Thank you for having me. Thanks everyone.

Kanchan Shringi 00:50:44 Thanks everyone for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

Join the discussion

2 comments

Chirag says:

February 6, 2023 at 2:32 am

Hello Diarmuid McDonnell, Podcast was great. I learned a lot about web scraping. is this podcast available on Spotify?
SE Radio says:

March 29, 2023 at 4:23 pm

You can find SE Radio on Spotify at https://open.spotify.com/show/6UO3XQclSuNnGxB39QdAnL
