SE Radio 332: John Doran on Fixing a Broken Development Process

Jeremy Jung talks with John Doran about fixing a broken development process. In this episode we discuss how a project that started as a desktop application transitioned to becoming a SaaS business that serves thousands of businesses. John tells the story of a business that found commercial success but struggled with outages, a lack of consistency, and serious performance problems as they grew. To save the business, the company froze feature development and hired additional team members to focus on fixing their broken development process. By introducing monitoring, a continuous integration process, Docker, and some fresh perspectives the team was able to turn their situation around and thrive.

Transcript

Transcript brought to you by IEEE Software
[0:00:00]

Automated: This is Software Engineering Radio, the podcast for professional developers, on the Web se-radio.net. SE Radio is brought to you by the IEEE Computer Society, the IEEE Software magazine, online at computer.org/software.

[Music playing]

Jeremy Jung: Hello, this is Jeremy Jung for Software Engineering Radio. Today, I have John Doran with me. John is the director of engineering at Phorest, a Dublin-based SaaS company that processes appointments for the hair and beauty industry. He previously worked as a technical lead at Travelport Digital, where he supported major airlines and hotel chains with their mobile platforms. I’ll be speaking with John about the early days of their business, the challenges they’ve faced while scaling, and how they were able to reshape their processes and team to overcome those challenges. John, welcome to Software Engineering Radio.

John Doran: Hey, Jeremy, thanks so much for having me.

Jeremy Jung: The first thing I’d like to discuss is the early days of Phorest.

[0:01:00]

To just give the listeners a little bit of background, what type of product is Phorest?

John Doran: Sure. So, Phorest is, essentially, it’s a salon software focused in the hair and beauty industry. And it didn’t actually start off as that. Back in 2003, it was actually a messaging service, actually built by a few students in Trinity College, one of which his name was Ronan Perceval. Ronan is actually our current CEO. So, in 2003, that messaging service was supporting nightclubs, dentists, various small businesses around Dublin, and the guys were finding it really hard to get some traction in those various different industries. So, Ronan actually went and worked as a hair receptionist in a salon, and what he learned from that was that, through using messaging and a platform, that they were able to actually increase revenue for salons and actually get more money in the tills, which was a hugely powerful thing.

[0:02:00]

So from there they were able to really focus on that particular industry; they built supplementary features and a product around that messaging service.

So, in 2004, it became kind of a fully-fledged appointment book, and from there, then, they integrated that appointment book with the messaging service. So, by 2006, then, I guess you could classify Phorest as a full salon software, so it had things like stock take, financial reporting, and staff roster. That fully-based salon software system was pretty popular in Ireland, and actually, between 2006 and 2008, they became the number one in the industry, in Ireland. So, what that meant was, you know, the majority of salons in Ireland were running on the Phorest platform, and that was actually an on-premise system, so, all of the data would’ve been stored locally in the hair salon, and there was no backend.

Jeremy Jung: Just so I understand correctly, so you say it was running on-premise; it was an appointment system.

[0:03:00]

So, is this where somebody would come in to the salon, and make an appointment, and they would enter it into a local computer, and it would be just stored there?

John Doran: Exactly. So, what Ronan figured out, throughout his time working in the salon, was that actually sending customers text messages to remind them about their appointments really helped cut down the no-show rates. Meaning that customers did turn up for their appointments when they were due, and meaning that the staff members didn’t have to sit around waiting for customers to walk in. So, as Phorest, I guess, as the company developed, we moved into building extra features around that core system, which is an appointment book, which manages the day-to-day roles of a hair stylist. So we built e-mail and marketing retention tools around that. I guess a really important point about Phorest’s history is, when the recession hit in 2008, in Ireland, we moved into the UK.

[0:04:00]

So, as we were kind of the number one provider in Ireland, we felt, when the recession hit, that we needed to move into the UK, but being on-premise meant there was a lot of friction actually installing the system into the salons. So, in 2011, they actually took a small seed round, to build out, I guess, the cloud backend. Once that kind of cloud backend was built, it took about a year to get it off the ground and released, and as the company kind of gained traction in the UK, they migrated all of their on-premise customers onto the cloud solution.

Jeremy Jung: I guess you would say that, when it was on-premise, a lot of the engineering effort or the support effort was probably in keeping the software working for your customers, and just addressing technical issues or questions and things like that, and that was probably taking a lot of your time, is that correct?

John Doran: Precisely. The team was quite small, so we had five engineers who were essentially building out the cloud backend, and one engineer who was maintaining that Delphi on-premise application.

[0:05:00]

So, what was happening was, our CEO, Ronan, was actually the product owner at that time, and the guys were making pretty drastic and kind of quick-fire decisions in terms of features being added to the product, based on getting a certain customer in that really needed to pay the bills. And some of those decisions, I guess, made the product a bit more complex as a group, but it certainly was a big improvement from the on-premise solution.

Jeremy Jung: So, the on-premise solution, you said, was written in Delphi, is that correct?

John Doran: Yeah.

Jeremy Jung: When it was first started, was it just a single developer?

John Doran: Exactly, yeah, so it was literally put together by some outsourcers and a single developer managing it; there was no real in-house developers, there was a good bit of turnover there. But when that small seed round came in, the guys put that together, the foundations of the cloud-based backend, which was a Java kind of classic N-tier application, with WebSocket to update the appointment screen if anything changed on the backend.

[0:06:00]

And you would kind of consider it a majestic monolith, as such.

Jeremy Jung: When you started the cloud solution, were you maintaining both systems simultaneously?

John Doran: Yeah, so, for a full year, the guys were building out that backend, and at the same time, there was one guy who was literally maintaining, fixing bugs in that Delphi application. And just to kind of give you an example, one of the guys, who was actually working on support, he actually went and taught himself SQL, and he used to tunnel into the salons at nighttime, to fix any database issues. And, yeah, so it was, you know, hardcore stuff. Another big thing about not being cloud-based, and one of the big reasons we needed to become cloud-based was we, you know, as people move online and, you know, it’s quite common to book your cinema or something else online.

[0:07:00]

But Ronan could see that trend coming for online bookings, and we needed to be cloud-based to build that online booking system. And just to kind of give you an idea of the scale, like, last year, we would’ve processed about over €2 billion worth of transactions through the system. So, it’s really growing and it’s huge-scale, at the moment. But I guess looking back at the past, the guys would’ve built a great robust system, getting us to about that 1,000-salon mark, particularly in the UK. But that would’ve been the point that the guys would’ve started seeing some shakiness in terms of stability and the speed at which we could deliver new features.

Jeremy Jung: You were saying the initial cloud version took about a year to create?

John Doran: Exactly, yeah.

Jeremy Jung: And you had five engineers working on it, after the seed round?

John Doran: Mm-hmm.

[0:08:00]

Jeremy Jung: At that time, when you first started working on the cloud version of the application, did you have a limited rollout to kind of weed out defects? Or how did you start transferring customers over?

John Doran: So, there was definitely some reluctant customers to move across. We did it, I guess, gradually; there was a lot of reluctance for people. People were quite scared about their data not being stored in their salon, so, it was quite hard to get some of those customers across. And only two weeks ago, we actually officially stopped supporting that, and our final two customers are finished up. So, you know, it took us a good seven years to finish that transition.

Jeremy Jung: Okay, so it was a very gradual transition where you actually – did you ask customers whether they wanted to move, or how did you –

John Doran: Oh, yeah, it was a huge, huge sales and team effort to get people across the line, but I would say the majority of people either would’ve churned or would’ve moved across – the more forward-thinking people, you know, they would’ve been getting new features and a better service.

[0:09:00]

Jeremy Jung: Right, so it’s kind of more of a marketing push from your side to say, “Hey, if you move over to our cloud solution, you’ll get all these additional capabilities,” but it was ultimately up to them to decide whether they wanted to move.

John Doran: Yeah, so, some companies, they kind of build a product with a different name, and they try and sell it. But Phorest, we actually kept the UI very similar, so it wasn’t very intrusive to the users; it was just kind of seen as an upgrade with, I guess, less friction.

Jeremy Jung: Right, right. I wanna talk a little bit about the early days where you said you spent about a year to build the MVP. At that point, let’s say after that year had passed, were you still able to iterate quickly in terms of features, or were you having any problems with performance or downtime, at that point?

John Doran: So, in 2012 when the cloud-based product launched, particularly in the UK, once we hit about 1,000 customers, we started to see –

[0:10:00]

– creaking issues in the backend: lots of JVM garbage collection problems, lots of database contention, and lots of outages. So, we got to a point where we were throwing hardware at the problem, to make things a little bit faster. Some other problems were, we kind of relied a lot on a single person to do a lot of the deployments. So it wasn’t really a team effort to ship things, it was more so developer finishes coding on their machine, pushes it off, maybe at the end of the month we’d ship. I guess the big problem was the stability, so, essentially what happened was, in terms of the architecture, we were introducing caches at various levels, to try and cope with performance. So, a layer of caching on the client’s side was introduced, Memcached was introduced, level-2 Hibernate caching, always just really focusing on fixing the immediate problem without looking at kind of the bigger picture.

[0:11:00]

I mentioned that 1,000 salons as a marker – I guess once we hit, like, 1,200, the guys had to introduce the idea of silos, which was, like, essentially 1,000 customers are gonna be linked to a specific URL, and that URL will host the API to return back the data that they need. And then, the other silo would serve the other 200 growing to, say, 1,000 businesses. So, essentially, if you think about it, you’ve got, I guess, a big point of failure: if that server goes down, there’s no load balancing between servers, and those two servers are their biggest size possible. So, I guess a big red herring was the cost, I guess, implications of that, you know, it was the largest instance type on Amazon at an RDS and EC2 level.

Jeremy Jung: The entire system was on a single instance for each silo?

[0:12:00]

John Doran: Yeah, so, if you imagine, when you log in, you’ll get returned a URL for a particular silo. So, what would happen, then, would be x businesses go to that silo and y businesses go to the other silo. And what that did was, basically, it load-balanced the businesses at kind of database level.
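To make the silo idea concrete, here is a minimal Python sketch of login-time routing as John describes it: each business is pinned to one silo URL, and the client uses that URL for every later call. The class name `SiloDirectory`, the example URLs, and the "fill the first silo with room" placement rule are all assumptions for illustration; the interview does not say how Phorest actually assigned businesses to silos.

```python
from dataclasses import dataclass, field

SILO_CAPACITY = 1000  # roughly the per-silo size mentioned in the interview


@dataclass
class SiloDirectory:
    """Maps each business to the silo URL it was pinned to (hypothetical)."""

    silo_urls: list
    capacity: int = SILO_CAPACITY
    assignments: dict = field(default_factory=dict)

    def assign(self, business_id: str) -> str:
        """Pin a new business to the first silo that still has room."""
        if business_id in self.assignments:
            return self.assignments[business_id]
        for url in self.silo_urls:
            used = sum(1 for u in self.assignments.values() if u == url)
            if used < self.capacity:
                self.assignments[business_id] = url
                return url
        raise RuntimeError("all silos are full; another silo is needed")

    def login(self, business_id: str) -> str:
        """Return the silo URL the client should use for its API calls."""
        return self.assignments[business_id]
```

Note what the sketch does not have: nothing moves a business to another silo when its server fails, which is the missing fault tolerance John goes on to describe.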

Jeremy Jung: You were mentioning how you had different caching layers, for example, Memcached and things like that, but those were all being run on the same instance, is that correct?

John Doran: They would’ve been hosted by Amazon.

Jeremy Jung: Oh, okay, so those would’ve been Amazon’s hosted services, so –

John Doran: Yeah, yeah. It’s kind of like when you build that MVP or you build that initial stage of your product, you’re focusing on building features, you’re focusing on getting bums on seats, and you – it was that point, that 1,000 to 1,200 salons, where we felt that scaling pain.

[0:13:00]

Jeremy Jung: So in a way, like you said, you were doing multi-tenancy, but it was kind of split per 1,000 customers.

John Doran: Yeah, exactly. So, if you imagine, if a failure happened on one of those servers, there is no fault tolerance. If the deployment goes wrong in terms of, like, putting an instance in service, those 1,000 customers can’t make purchases, their customers can’t make online bookings, there’s no appointments being served, you can’t run transactions through the till, so, it would cause huge friction.

Jeremy Jung: Right. What were the managed services you were using in combination with the EC2 instance?

John Doran: So, a really good decision at the start of the guys moving to the cloud was making a big bet on Amazon in terms of utilizing them for RDS, EC2, caching. There was no deployment stack or there was no deployment infrastructure as code. It was all, I guess, manually done through Amazon Console, which is something we later addressed, which we’ll chat about, but it was all heavily reliant on Amazon.

[0:14:00]

Jeremy Jung: And you had mentioned that you were relying on one person to do deployment. Was that still the case, at this time?

John Doran: Yeah, so, up until, I guess, 2014, it was all reliant on one guy who literally had to bring his laptop on holidays with him, and tether from cafes if something went down. To deploy new code, he was the only guy who really knew how to do it, so, it was a huge pain point _____ factor.

Jeremy Jung: So, it sounds like in terms of how the team was split up, there was basically, you have people working on development, and you have a single person in the sort of ops role?

John Doran: Yeah, and essentially, when this kind of thing happens, the people who write the code don’t ship it, you get all sorts of problems in terms of dependencies and tangles and just knowledge silos.

[0:15:00]

And also, because the guys were working kind of in their own verticals at different areas of the product, there was no consistency. Consistency in terms of the engineering values, how they were working, practices, procedures, deployments, that sort of stuff. It was all very isolated, so people did their own thing. So, you could imagine for, say, trying to hire someone new it would be quite hard, because, you know, for someone to come in, very different depending on which engineer you talk to. That make sense?

Jeremy Jung: Yeah. Was this a locally-located team, or was this a remote team?

John Doran: Yeah, so, most of the guys were actually in Dublin; one or two people traveled a little bit, worked remotely, and a couple of people did actually move abroad. So, it was predominantly based in Dublin, but some people traveled a bit.

[0:16:00]

Jeremy Jung: In terms of processes for someone knowing how to deploy or how to work on a feature, it was mostly tribal knowledge; it’s more just trying to piece together a story from talking to different people. Is that correct?

John Doran: Precisely. So, you had no consistency in languages or frameworks, except, I would say that that monolith, that initial part of the platform, was extremely consistent in terms of the patterns used. I guess the way it communicated with the database and how the API was built was extremely strong and is the heart of, still is the heart of, the organization. So, say, for example, there’s a lot of really good, say, integration unit test, there, but they got abandoned for a little while, and we had to bring them back, back to life, to enable us to start moving faster again, and to give us a lot more confidence in our releases.

[0:17:00]

Jeremy Jung: So, it sounds like maybe the initial version in the first year or so had a pretty solid foundation, but then, as – I’m not sure if it was the team that grew or just the rate of features, would you say that kind of –

John Doran: I would say it was a combination of the growth of the company, in terms of the number of the customers at it, and the focus on delivering features, the focus on feature development, rather than thinking about scalability and being extremely aware of it.

Jeremy Jung: How fast were you gaining customers, at that time? Was this a steady increase or large spikes?

John Doran: Oh, you’re talking 30 percent annually, so, [crosstalk] and really low churn rate, as well.

Jeremy Jung: So, what would you feel was the turning point where it felt like your software or your business had to fundamentally change due to the number of customers you had?

[0:18:00]

John Doran: So, it was essentially those issues around stability and cost were unsustainable for the business, customers complaining, our staff not being able to do their job. So, you know, part of Phorest’s core values and mission is to help the salon owner grow their business and use the tools that we provide to do that. And if people are firefighting and not being able to support our customers to be able to help them send really great marketing campaigns to boost their revenue, if we’re not doing that and we’re firefighting, the company would have been pointless. So, we weren’t fulfilling our mission by coping with outages and panicking all the time. The costs, again, were unsustainable, and the team was just, I guess, uncomfortable with the state we were in. So, the turning point would have been, I would say, in, like, 2014, when we essentially hired in some people who had more experience in –

[0:19:00]

– I would say, high scalability systems, and people who cared a little bit more about quality and best practices.

So, when you hire three or four people like that, you kind of, you bring in a different way of thinking, you kind of, you hire these different values. You know, when you try to talk to a team and try and get these things out, they’re normally quite organic. Well, if you bring people in from maybe a similar _____ _____ from, you know, a different industry but similar experience, you kind of get that for free. That’s what Phorest did. So, basically, in 2014 and since, now, we’ve invested heavily in hiring, and hiring the right people, in terms of how they operate and in terms of how they think, but also bringing that back to our values and what we try to do.

[0:20:00]

Jeremy Jung: Do you think that bringing in new people, new talent, is really one of the largest factors that allowed you to make such large changes, to change your culture and change your way of thinking?

John Doran: The other thing would be, I would say, the trust that Ronan, CEO, and the leadership team _____ has, and their openness to change. I think that a lot of other organizations would be quite scared of this type of change, in terms of heavily investing in the product to make it better. Just from experience and talking to other people, you know, it would’ve been very easy to not invest and just leave the software ticking along with bugs and handling the downtime. But it was about the organization and their values around really helping the salon owners, and not spending that time firefighting.

[0:21:00]

Jeremy Jung: So, it sounds like within two years or so of launch was when you decided to make this change.

John Doran: Yeah, so, it’s not an easy one to make, because it’s really hard to find talent, and we were lucky to really get some great people in. And it wasn’t about making radical change at the start, you know, it started from foundations, so, it was things like, “Let’s get a continuous integration server going here, guys, and let’s bring back all the broken tests and make sure they’re running, so that we can have a bit more confidence in what we _____.” We introduced code reviews and pull requests back into things, and a bit more collaboration and getting rid of those pockets of knowledge, you know, reliance on individuals.

Jeremy Jung: I do wanna go more into those a little bit later.

John Doran: Okay.

[0:22:00]

Jeremy Jung: But before that, when you were having performance issues or having outages before all these changes, how were you finding out? Was it being reported by users, or did you have any kind of process, you know, to notify you?

John Doran: So, the quite common thing was, basically, the phones would light up, and there was very little transparency of what was going on in the system. It got to a stage where we actually installed a physical red button on the support floor, which texted everyone on the engineering team.

Jeremy Jung: Oh, wow, okay. [Laughs] One of the things that we often hear is, when a system has issues like this, it’s difficult to free up people to fix the underlying problems due to the time investment required, and as you mentioned, all the firefighting going on. How did you overcome this issue?

John Doran: So, I guess that, beforehand, it was a matter of, you know, “Restart the server. Let’s keep going with our features.”

[0:23:00]

But it was really about stopping to think about, you know, “What really happened here? And let’s write down that incident report and gather some data about what actually happened _____ _____.” And a few things, a few questions, key questions, could be raised from that, like: What are we gonna do to stop this from happening again? Why didn’t we know about it before the customers? And what were the steps we made to reproduce and actually fix this issue? And what are the actions that are gonna happen? And how are we gonna track that those actions do happen after the issue?

Jeremy Jung: Let me see if I understand this correctly: you actually did build sort of a process, when you would have incidents, to figure out, “Okay – ”

John Doran: That was the first step, I would say, yeah, so, “Let’s figure out what happened and how – ” And it was just about gathering data and getting information about what was really going on, so, it led us to identify common things that happened that maybe usually we would just –

[0:24:00]

– restart server and forget, or failover database and everything’s not normal in a couple of areas. But as we started gathering that data, we started to see common problems, so, maybe our deployment process isn’t good enough and it’s error prone, or this specific message broker isn’t fault tolerant, or the IOPS in the database are too high at this time due to these queries happening. But after we got that data and we started really digging deep into the system, we realized that this isn’t something that you could just take two days in your sprint to start to fix. And just coming back to your question on finding that time to fix things, we kind of had to make a tough call, when we looked at everything, to say, “Let’s stop feature work, and let’s stop product work, and let’s fix this properly.”

[0:25:00]

Jeremy Jung: So, basically, you got more information on why you were having these problems, why you were having downtime or performance issues, and started to build kind of a picture of and realized that, “Oh, this is actually a very large problem.” And then as a company, you made the decision that, “Okay, we’re gonna stop feature development, to make sure we have enough resources to really tackle this problem.”

John Doran: Precisely. And from the product side of things, this was a big driving factor in it, you know, we wanted to build all these amazing features to help salons to grow, but we just couldn’t actually deliver on them. And we couldn’t have any predictability on when we delivered them because, ’cause of that firefighting and ’cause we were sidetracked so much, there was no confidence in release cycle and stability or what we could actually deliver. So, yeah, it was a pretty hard decision for us to make in terms of the business, ’cause we had a lot of deliverables and commitments to customers and to our sales team, so we had to make that call.

[0:26:00]

Jeremy Jung: You were mentioning, earlier, about how you started to bring in a continuous integration process. Before you had done that, you also mentioned that there were tests in the system initially, but then the tests kind of went away. Could you kind of elaborate on what you meant by that?

John Doran: Yeah, so, as I said, the kind of the core of the system was built with a lot of integrity and a lot of really good patterns. For example, a lot of integration tests are running against the APIs, where _____ maybe were written against a specific feature, but they were never run as a full suite. So, what would happen is there’d be maybe one or two flaky ones, and because there was no continuous integration server, it was easy enough for a developer to run specific tests for that functionality that they were building.

[0:27:00]

But because the CI wasn’t there, there was no full suite run, so, when it came time to actually do that, we realized, you know, 70 percent of them are broken.

Jeremy Jung: So, they were building tests as they developed, but then they were maybe not running those as the [crosstalk].

John Doran: Before

Jeremy Jung: Right, okay, so, adding the continuous integration process, having some kind of build process, really forced people to pay attention to whether those tests were working or not.

John Doran: Exactly, and just a step on from that was, you know, a huge delay in getting stuff to test. Because we relied on that one guy to build stuff, and actually, that was done from a little Linux box in the engineering floor which was quite temperamental, you’d be quite delayed in actually even just getting stuff into people’s hands. And it’s kind of what the core software development’s all about, right, it’s getting what you build into people’s hands, and we just couldn’t do it.

[0:28:00]

Jeremy Jung: Just because the process of actually doing a build and a deployment was so difficult.

John Doran: Yeah, exactly.

Jeremy Jung: When you added the continuous integration process, were there other benefits that you maybe didn’t expect when you started bringing this in?

John Doran: So, yeah, I mentioned the deployments is a big one; I think that people started to see real benefit in terms of their workflow. I guess, along with the continuous integration, it was more disciplined in terms of how we worked. So, the CI server introduced a better workflow for us, and it helped us see real clarity in terms of the quality of the system, where it had coverage, where it didn’t. And it also helped us break up the system a little bit. So, I mentioned majestic monolith, so, it was actually, when we went to look at it, there was five application servers sitting in one repo.

[0:29:00]

And the CI server and some crafty engineering helped us split that up quite well, to break at the dash repo into multiple application servers.

Jeremy Jung: So actually bringing in the continuous integration actually encouraged you to rearchitect your application in certain ways, and break it down into smaller pieces?

John Doran: Exactly, yeah, and really, it was all about confidence and being able to test and know that we weren’t regressing.

Jeremy Jung: What do you think people saw in terms of the pain or the challenges from that sort of monolith setup that you think sort of inspired them to break it up.

John Doran: The big one was, a bug fix in one small area of the system meant the whole stack had to be redeployed, which took hours and hours. The other thing would’ve been the speed of development in terms of navigating around a pretty large code base.

[0:30:00]

And just the slowness of the test suite to run, which was around 35 minutes when we started and got them all going.

Jeremy Jung: The pain of running the tests and having it possible to break so many things with just one change maybe encouraged people to shrink things down, so they wouldn’t have to think so much about the whole picture all the time.

John Doran: Exactly, we started to see a small fix or a small feature breaking something completely nonrelated. A typical example would’ve been due to HTTP connection configuration on a client breaking completely unrelated areas of the system.

Jeremy Jung: One thing I’d like to talk about next is the monitoring. You mentioned, earlier, that it was really just phone calls would come in to support, and you even had the big red button you could press. What did you do to add monitoring to your application?

[0:31:00]

John Doran: It’s pretty important to mention that, you know, we talked about making a decision to stop _____ _____ and start fixing stuff, so that’s when we started looking at the monitoring and everything else like continuous integration and bringing back tests. But kind of a key point of this evolutionary project was the monitoring, so, we did a few things. So, we upgraded our systems to be using New Relic, to help us find errors. And it was there, but it wasn’t being utilized in a good enough way, so we used APM there. We looked at CloudWatch and we reintroduced CloudWatch metrics, to help us watch traffic, to help us see slow transactions. Logentries helped us a lot in terms of spotting anomalies in the logs.

Pingdom was actually a really surprisingly good addition to the monitoring. It simply just calls any health check endpoint you want, and has some nice Slack and messaging integration.

[0:32:00]

That was great for us; it helped us a lot. So, we did a couple of other things like some small end-to-end tests that would give us kind of a heartbeat to how the system was running, and they also gave us the kind of confidence that we would know about an issue before a customer, and being able to allow us to get rid of that red button.

Jeremy Jung: All of these are managed services that you either send logs to or check health endpoints on your system. Did you configure them, somehow, to text your team or send messages to your team when certain conditions were met, or – ?

John Doran: So, we started with just, like, a simple Slack channel that would send us any kind of DevOps-related issues into there, and that’s kind of what helped us change the culture a little bit in terms of being more aware of the infrastructure and the operations.

[0:33:00]

And Pingdom was great for setting up a team with people who would get notifications for various parts of the system. And our CloudWatch alarms, we set up a little Lambda function that would just forward on any critical messages to text us.
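The alarm-forwarding Lambda John mentions might look roughly like the Python sketch below. It assumes the CloudWatch alarms are delivered to the function through an SNS topic (the interview does not say), in which case each record carries the alarm as a JSON string with fields such as `AlarmName`, `NewStateValue`, and `NewStateReason`. The `send_text` callable is a hypothetical stand-in for whatever SMS provider the team used.

```python
import json


def critical_alarms(event):
    """Pull out the alarms that are currently in the ALARM state."""
    texts = []
    for record in event.get("Records", []):
        # SNS wraps the CloudWatch alarm payload as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            texts.append(f"{message['AlarmName']}: {message['NewStateReason']}")
    return texts


def make_handler(send_text):
    """Build a Lambda-style handler around a hypothetical SMS sender."""

    def handler(event, context):
        alarms = critical_alarms(event)
        for text in alarms:
            send_text(text)
        return {"forwarded": len(alarms)}

    return handler
```

Passing the sender in as an argument keeps the filtering logic testable without any network access; in a real deployment `send_text` would wrap the provider's client.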

Jeremy Jung: And before this, you said it was basically just user calls, and now you were actually shifting to kind of proactively identifying the problems.

John Doran: Yeah, exactly, there were some really small alerts there, but nothing as comprehensive as this. We actually, we changed some of the applications; we introduced health endpoints to all of them, so, they would report on their ability to connect to a message broker, their ability to connect to a database, any dependencies that they actually needed, we would actually check as part of pinging that endpoint. So, if you hit any of our servers, any new or older ones, they would all have, like, a forward-slash health endpoint, and that would give you back a JSON structure, and give us a good insight into how healthy that component was.
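A health endpoint of the kind described here can be sketched in a few lines of Python. The function below runs one check per dependency and builds the JSON body a `/health` route could return. The field names, the 200/503 status codes, and the check callables are assumptions for illustration; the interview only says the endpoint reported on the message broker, the database, and other dependencies in a JSON structure.

```python
import json


def health_report(checks):
    """Run each dependency check and return (status_code, json_body).

    `checks` maps a dependency name to a callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = {"status": "up"}
        except Exception as exc:  # any failure marks the dependency down
            results[name] = {"status": "down", "error": str(exc)}
    healthy = all(r["status"] == "up" for r in results.values())
    body = {"status": "up" if healthy else "down", "dependencies": results}
    return (200 if healthy else 503), json.dumps(body)
```

Returning a non-200 status when any dependency is down lets an external checker such as Pingdom alert on the status code alone, while the JSON body tells the engineer which component failed.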

[0:34:00]

Jeremy Jung: And if there was a problem and you were trying to debug issues, were you mostly able to log in to these third-party services, like Logentries or New Relic, to actually track down the problem?

John Doran: Yeah, so, again, those services gave us that information, but it would always come back to, you know, be it if you needed to get into a server – and a big thing, which we’ll talk about, is Docker – we don’t have SSH access into those servers, so, we rely on these third parties to give us that information. But in the past, maybe we would’ve had to get in and look at the processes and take dumps, but with Logentries and New Relic, we were able to do that stuff without needing to.

Jeremy Jung: So, previously, you might have someone actually SSH into the individual boxes, and look at log files and things like that.

[0:35:00]

John Doran: Exactly. So, quite easy when you’ve got one server, but, as we’ll discuss, when you’ve got many small containers, it’s extremely complicated.

Jeremy Jung: Next, I’d like to talk about, since you mentioned Docker, how did you make the decision to move to Docker?

John Doran: So, it was something our CTO was really aware of and he really wanted us to explore this. The big benefit, for us, was that shift in mindset of one guy not being responsible for deployments, but us actually developing and using Docker in our day-to-day workflow. And the cost implications, as well, the fact that we could, instead of having, say, an EC2 large running 1 application server, we could have 12 containers running on much smaller EC2 instances. So, it was that idea of being able to maximize CPU and memory usage that was a huge, huge benefit for us.

Jeremy Jung: So, the primary driver was almost your AWS bill or your –

[0:36:00]

John Doran: Big time, yeah, portable applications that had much less maintenance. We didn’t have to go in and worry about – because we had, I guess, we mentioned this earlier, like, these kind of siloed tech stacks – we didn’t need to worry about a Ruby environment or a PHP environment or Java JVM install, it was just the container. And that was a hugely important thing for us to do, and was really well thought-out by our CTO at the time.

Jeremy Jung: So, you mentioned Ruby containers and JVMs and things like that – does your application actually have a bunch of different frameworks in a bunch of different languages?

John Doran: Yeah, so, as we split out that monolith, we also, I guess, started building smaller domain-specific – not micro, I’d say, kind of services responsible for areas of the system. Our online booking stack, so, if you go to any of our customers, you know, you can book in their point of sales system in the salon, but you can also book on your phone.

[0:37:00]

And we have a custom domain for every one of those salons, so it’s, like, phorest.com/book/foundationhair, if you click on that, you’re gonna be brought to the online booking stack, which is a Rails app, actually, and an Ember.js frontend. So, the system, as we started splitting it apart, became more and more distributed, and Docker was great for us in terms of consistency and that portability, particularly around different tech stacks.

Jeremy Jung: Migrating to Docker made it easier for you to both develop and to deploy, using a bunch of different tech stacks.

John Doran: Exactly.

Jeremy Jung: When running through your CI pipeline, would you configure it to create an image that you would put into a private registry, such as Amazon’s Elastic Container Registry?

John Doran: Yeah, so, we made the mistake of building and hosting our own registry, at the start. We quickly realized the pain in that around three, four months in –

[0:38:00]

– and, actually, at the same time as Amazon released ECR. So, I guess the main reason we did that ourselves was because we were early adopters, and we paid a little tax on that, but we did, we moved to ECR. So, our typical application kind of pipeline is build, unit tests, maybe integration and acceptance tests, build a container. And some of those applications, they run acceptance tests against the container running on the CI server. Push to the registry, and after it’s pushed to the registry, then we would configure deployment and trigger it off.
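
The stage order in that pipeline can be sketched as a simple fail-fast runner. This is only an illustration of the sequence John lists, not the team's actual CI configuration; the stage callables are placeholders that would shell out to real build, test, and push commands.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first failure.

    `stages` is a list of (name, callable) pairs; each callable returns
    True on success. Returns the names of the stages that ran and whether
    the whole pipeline succeeded.
    """
    ran = []
    for name, step in stages:
        ran.append(name)
        if not step():
            return ran, False
    return ran, True


# Stage names follow the order given in the interview; the lambdas are
# placeholders standing in for real commands.
STAGES = [
    ("build", lambda: True),
    ("unit tests", lambda: True),
    ("integration and acceptance tests", lambda: True),
    ("build container", lambda: True),
    ("acceptance tests against container", lambda: True),
    ("push to registry", lambda: True),
    ("trigger deployment", lambda: True),
]
```

Because the runner stops at the first failing stage, an image is never pushed or deployed unless every earlier test stage has passed.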

Jeremy Jung: Do you have a manual process where somebody watches the pipeline go through and you make the call to push or not? Or is it an automated process?

John Doran: No, it’s automated. So, we built a small kind of deployment framework, again, ’cause we were early adopters of Amazon’s ECS, their container service.

[0:39:00]

So we built a small deployment stack, which allowed us to essentially configure new services in ECS, and deploy new versions of our application, through CI, to our ECS clusters. So, it was all automated.

Jeremy Jung: Were you using an infrastructure-as-code solution, such as CloudFormation?

John Doran: Yeah, so, when we were looking back at the problems in the good old days, we’d seen that one was, you know, things were just configured on the AWS console, and we knew we needed infrastructure as code, and we needed repeatability and the ability to recreate stuff. So, we used CloudFormation and, essentially, something very similar to Terraform, and we do use Terraform for managing our _____ clusters and some other things.

Jeremy Jung: Okay, so you maybe initially moved from having someone manually going in and creating virtual machines, to more of an infrastructure-as-code approach.

John Doran: Exactly, yeah.

[0:40:00]

Jeremy Jung: You had mentioned that one of the primary drivers of using Docker was performance – did you start creating performance metrics so that you knew how much progress you were making on that front?

John Doran: Yeah, so, essentially, that effort to kind of make our infrastructure more reliable, it was kind of a set of steps to get there, and we started with API-level testing to make sure that anything we changed under the hood didn’t break the intended functionality. And we also wrote a bunch of performance tests, particularly around pulling down appointments, creating appointments, and sending large volumes of messages. We knew we couldn’t have any regressions there, so we used Gatling to do those performance tests, and we would run that from our continuous integration server, and we’d do various types of soak testing, to make sure we weren’t taking any steps backwards.
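
The "no regressions" check can be illustrated with a small gate that compares latency samples from a new build against a baseline. This is a plain Python sketch of the idea, not Gatling itself; the 95th percentile and the 10 percent tolerance are assumed values chosen for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def has_regressed(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """True if the current percentile latency exceeds the baseline by more than `tolerance`."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(current_ms, pct) > limit
```

A CI job could fail the build whenever `has_regressed` returns True for a key scenario such as fetching appointments.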

Jeremy Jung: So, each time you would do a deployment, you would run performance tests, to ensure that you weren’t getting slower or you weren’t having any problems from the new deployment.

[0:41:00]

John Doran: Yeah. I would say, though, that this kind of effort – we called it Project R, internally – it had a few goals, but it was all about becoming fault tolerant, being more scalable, reducing Amazon costs. And during Project R, we didn’t just move our 1,200-1,500 salons, we didn’t just drop them and move them to Docker. There were so many changes under the hood that these performance tests were key to giving us a pulse on how we were doing. But I guess when we were done with Project R, and everything was on Docker, and everyone was much happier, we’d just run those performance tests ad hoc and as part of some release pipelines.

Jeremy Jung: Okay, so initially, you were undergoing a lot of big changes, and that was where it was really important to check the performance metrics with every build.

[0:42:00]

John Doran: Exactly, yeah.

Jeremy Jung: What were some of the big challenges? ’Cause you mentioned you were changing a lot of things, what were some of the big challenges moving your existing architecture to Docker and to ECS?

John Doran: There were a couple – there’s two huge ones. So, one was state; getting state out of those big servers was extremely hard. We needed to remove the level-2 cache; because we needed to turn that one server into smaller load-balanced containers, we needed to remove the state. Because we didn’t want somebody on one computer terminal fetching their appointments, and then on their Phorest GO Mobile app, looking at different data. So, we had to get rid of state, and the challenge there was that MySQL performance just wasn’t good enough for us, so, we actually had to look really hard at migrating to Amazon Aurora, which is what we did.

[0:43:00]

Again, coming back to cost, Aurora is much more cost-effective in terms of, the system beforehand was provisioned for peak load.

So, we would have provisioned IOPS for Friday afternoon, the busiest time that the salons were using the system, and we were paying the same amount on a Sunday night. Compare that to Aurora, where you’re paying for the IOPS you use, with the additional benefits of performance around how Amazon rebuilt the storage engine there. So, that’s the caching side of things. The other big, big challenge was the VPC. So, we needed to get all of our applications into a VPC, to be able to use the latest instance types on Amazon, and also, for our application to be able to talk securely to the Aurora database. So those two were definitely the biggest challenges.

Jeremy Jung: With the MySQL setup, it sounded like you had to pay for your peak usage, whereas, with Aurora, it automatically scales up and down, is that correct?

[0:44:00]

John Doran: No, you’re actually charged per read and write, so that would be the big difference.

Jeremy Jung: Oh, I see, per read and write, okay, so it’s just per operation. So, you don’t actually think about what your needs are gonna be, it kind of just charges you based on how it’s used.

John Doran: Yeah, the other really nice thing was, looking back at our incident reports, a really common issue would’ve been, “Hey, the database has run out of storage,” and Aurora does actually auto-scale its storage engine.
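
The billing difference discussed above, paying for peak capacity all week versus paying per read and write, can be made concrete with a little arithmetic. Every number below is invented for illustration, including both prices and the load profile; none of them are AWS list prices or Phorest figures.

```python
def provisioned_cost(peak_iops, price_per_iops_hour, hours):
    """Cost when capacity is reserved for the peak for the whole period."""
    return peak_iops * price_per_iops_hour * hours


def per_request_cost(iops_by_hour, price_per_million_requests):
    """Cost when each read or write is billed individually."""
    total_requests = sum(iops * 3600 for iops in iops_by_hour)
    return total_requests / 1_000_000 * price_per_million_requests


# Hypothetical week: 10 busy hours at 1,000 IOPS, 158 quiet hours at 50 IOPS.
week = [1000] * 10 + [50] * 158

# Both prices are made up for the example.
reserved = provisioned_cost(peak_iops=1000, price_per_iops_hour=0.0001, hours=len(week))
metered = per_request_cost(week, price_per_million_requests=0.20)
```

With this spiky profile the metered bill comes out lower than the reserved one; a workload that ran near its peak all week would see the comparison move the other way.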

Jeremy Jung: You mentioned removing state from your servers, and you mentioned removing the level-2 cache. Can you explain, sort of at a high level, what that means, to those who don’t understand that part?

John Doran: Sure. So, in the Java world, when you have an ORM framework like Hibernate, essentially, when you query a database, Hibernate will store that data in its level-2 cache. And what that means is that it doesn’t need to hit the database for every query.

[0:45:00]

And that was the solution for Phorest, as we were in that MVP/early-days, but it wasn’t the solution for us to become fault tolerant.

Jeremy Jung: So, it would be, someone makes a query to an ORM – in this case, it’s Hibernate – and on the server’s memory, it would retrieve the results from there instead of from the database.

John Doran: Yeah, exactly. And that’s what I was coming back to around querying an API for a list of appointments. If you had two servers deployed, with them using an L2 cache, you would get different state, because _____ cache.
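
The stale-read problem with per-server caches can be reproduced in a few lines. The sketch below is a deliberately simplified model in Python rather than Hibernate's actual second-level cache: `Server` and its methods are hypothetical, and a dictionary stands in for the shared database.

```python
class Server:
    """An app server with a private in-memory cache in front of a shared database."""

    def __init__(self, database):
        self.database = database
        self.local_cache = {}

    def get_appointments(self, salon_id):
        # Serve from local memory when possible, as a per-server cache would.
        if salon_id not in self.local_cache:
            self.local_cache[salon_id] = list(self.database[salon_id])
        return self.local_cache[salon_id]

    def add_appointment(self, salon_id, appointment):
        self.database[salon_id].append(appointment)
        # Only this server's cache is refreshed; other servers are not told.
        self.local_cache[salon_id] = list(self.database[salon_id])


database = {"salon-1": ["09:00 cut"]}
server_a, server_b = Server(database), Server(database)

# Both servers warm their caches, then a write lands on server A only.
server_a.get_appointments("salon-1")
server_b.get_appointments("salon-1")
server_a.add_appointment("salon-1", "10:00 colour")
```

After the write, `server_a` returns both appointments while `server_b` still returns only the first, which is the inconsistency a load balancer would expose to users.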

Jeremy Jung: So, did you put a different cache in place, or did you remove the cache entirely?

John Doran: So, we removed that cache entirely, but we did have a REST cache, which was Memcached, and that’s distributed, and we used cache keys based on entity versions.

[0:46:00]

So, that was distributed and worked well with multiple containers behind the load balancer.
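
One common way to build cache keys from entity versions is sketched below. The key format and the `VersionedCache` class are assumptions for illustration, and a plain dictionary stands in for the shared Memcached instance; the interview does not describe the actual key scheme.

```python
def cache_key(entity_type, entity_id, version):
    """Build a key that changes whenever the entity's version changes."""
    return f"{entity_type}:{entity_id}:v{version}"


class VersionedCache:
    """Cache-aside lookup on a shared store, keyed by entity version."""

    def __init__(self, store):
        self.store = store  # shared dict standing in for Memcached

    def get(self, entity_type, entity_id, version, load):
        key = cache_key(entity_type, entity_id, version)
        if key not in self.store:
            # Miss: load from the source of truth and share the result.
            self.store[key] = load()
        return self.store[key]
```

Because an update bumps the version, every container computes a new key and misses the old entry, so no explicit invalidation message is needed between containers.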

Jeremy Jung: So, you removed the cache from the individual servers, but you do have a managed Memcached instance that you use.

John Doran: Yeah, exactly. And getting rid of that level-2 cache, our performance tests told us that MySQL just wasn’t performant enough, whereas, Aurora was much better at handling those types of queries, similar to JOINs, it’s a big relational database.

Jeremy Jung: So, we’ve talked about adding in continuous integration monitoring, performance metrics, Aurora, Docker – did any of these processes require large changes to your code base?

John Doran: To be honest, not really; it was more about plumbing things together, and a lot of orchestration from a human point of view. So, people being aware of how all this stuff worked, and essentially, making sure that we all knew we were on the right page.

[0:47:00]

The biggest piece of engineering and coding work was the deployment and infrastructure script, so, provisioning the VPCs, writing the integrations with ECS, that sort of thing. But in terms of actual coding, it wasn’t too invasive.

Jeremy Jung: I think that’s actually a positive for a lot of people, because I believe there are people who think they need to do a big rewrite if they have performance problems or problems keeping track of the status of their system. But I think this is a good case study that shows that you don’t necessarily need to do a rewrite, you just need to put additional processes and checks in place, and maybe change your deployment process to kind of get the results that you want.

John Doran: It’s about the foundations, as well.

[0:48:00]

If you have some really strong people at the start who pave some good roads there in terms of good practices like, just for example, a really good database change management setup, some good packaging of the code, really good packaging of the code, so it was quite easy for us to split out five services from that big monolith. It’s about the foundations at the start, because it would be quite easy to build an MVP with some people who write thousand-line PHP scripts, and the product works, and – that’s a different case study, because, you know, you can’t fix that, essentially.

Jeremy Jung: Right, so it’s because the original foundation was strong, that you were able to undergo this sort of transformation.

John Doran: Yeah, truly, yeah.

Jeremy Jung: Adopting all of these processes, did they resolve all of the key problems your business faced?

John Doran: When we look back and we see that all of our systems are running on Docker, we see a huge cost benefit, so, that problem was certainly solved.

[0:49:00]

We were able to see issues before our customers, so we had better transparency in the system. No longer were we dependent on one big server – 1,000 customers were no longer dependent on one big server, so it meant that we had, and we do have, really good fault tolerance on those containers. If one of them dies, ECS will literally kill it and bring up a new one. It’ll also do some auto-scaling for us, say, on a Monday morning you maybe have 8 containers running, but on a Friday, maybe it’ll auto-scale to 14. So, that’s been groundbreaking for us. In terms of how we work, we went from shipping somewhere between monthly and quarterly to daily. And something I use as a team health metric, right now, is our frequency of deployment.

[0:50:00]

And I’d say we’re hitting about 25 deployments a week, now, compared to the olden days. It’s great, we always wanna get better at it. I would say that those have been really amazing things for us. But also, in terms of the team, it’s a lot easier for us, now, to hire a new engineer and bring them in, because of this consistency. And also, I guess, we’re not reliant on these pockets of knowledge anymore, so, again, around hiring, it’s a lot easier for someone to come into the system and know how things work. And I think in terms of hiring as well, when you talk about that kind of setup, it’s, you know, there’s some good stuff happening there.
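
Deployment frequency is straightforward to compute from a list of deployment dates. The sketch below groups dates by ISO week; where the dates come from (a CI server, a deploy log) is left open, and the function name is made up for the example.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates):
    """Count deployments per (ISO year, ISO week)."""
    counts = Counter()
    for day in deploy_dates:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)
```

Tracking this count week over week gives a simple trend line for the team health metric John mentions.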

Jeremy Jung: It sounds like you have a better picture in terms of monitoring the system, you brought your costs down significantly, the deployment process is much easier, the existence of the containers and ECS is –

[0:51:00]

– kind of serving the purpose of where people used to have to monitor the individual servers and bring them up themselves, but now you’ve sort of outsourced that to Amazon to take care of. Does that all sound correct?

John Doran: Yeah, spot on.

Jeremy Jung: And I find it interesting, too, that you had mentioned that improving all of your processes actually made it easier to bring new people in, and that’s because, you were saying, things are now more clearly defined in terms of what to do, rather than all of this information kind of being tribal, in a sense.

John Doran: Yeah. A typical example would be, like, “Hey, let’s redeploy this bug fix.” And say, previously, it might be a Capistrano deploy or, you know, “Oh, you need to get SSH keys to this thing, and you need to log in here, and you need to build this code on this local machine, and try and ship it up, and – ”

[0:52:00]

That just all goes away, particularly with Docker, and that continuous integration pipeline is just, it sets a really good set of standards and things that people should find quite easy and repeatable.

Jeremy Jung: So now, in terms of deployment, you can use something like CloudFormation, and you have the continuous integration process that can deploy your software without somebody having to know specifically about how that part works?

John Doran: Exactly. So, I would say if we wanted to create a new service responsible for some new functionality in Phorest – say, a Spring Boot application, a Java application – they can simply provide a Dockerfile and get that deployed to dev, staging, or production, with, I would say, ten lines of YAML configuration. So you could go from initial setup of a project to production in a day, if you wanted to – there’s zero friction there, I would say.

[0:53:00]

Jeremy Jung: Mm-hmm, it really makes the onboarding a lot easier, then.

John Doran: Yeah.

Jeremy Jung: Do you think your team waited too long to change their processes? Or do you think these changes came at just the right time?

John Doran: I would say if we’d waited any longer it could’ve been detrimental to, I guess, the health of the business. I think that the guys did a great job in terms of getting us to a certain point, but we would’ve risked technical decay, I would say, and kind of really harming the organization if it had gone any further. I would say it was a lot of work to do this, and it could’ve been easier if we had paid more attention to technical debt and making the right decisions earlier on, so maybe saying no to that customer who wants a bespoke piece of functionality. But you have to do what you have to do.

[0:54:00]

Jeremy Jung: So, you would say maybe identifying earlier on just how much the current processes were causing you problems. If you had identified that earlier, you think you might have made the decision to try and make these changes at an earlier time.

John Doran: Yeah, so the guys, earlier, were making really good decisions, but maybe they didn’t have the experience for higher scalability solutions and systems. So, it’s about hiring the right people at different stages of where the product is evolving, I would say.

Jeremy Jung: Given what you’ve gone through with the migration at Phorest, what advice would you give to someone building a new process? What can they do to keep ahead of either technical debt or any of the other issues you faced?

John Doran: I think it’s about how – it’s actually a people and cultural thing, along with tech decisions.

[0:55:00]

So, everybody needs to be really aligned in terms of the decisions that they’re making, rather than letting people go on an individual basis. I think there needs to be good leadership in terms of getting a group of people thinking the same way. I reckon that technical currency is extremely important, and as your system grows, you need to be able to look back and identify areas of pain. And by pain I mean speed of deployment, speed of development, and ability to adapt and change your software. So, if you notice that a feature that used to maybe take a week is now taking two weeks, you know, you probably need to take a really hard look at that area of the system and figure out could it be simplified and why is it taking too long.

Jeremy Jung: Basically, identifying exactly where your pain points are, so that you can really focus your efforts and have an idea of what you’re really going for.

[0:56:00]

John Doran: Yeah, you need to build an environment of trust, and I would also say that you need to be able to be confident and okay with failure in terms of taking risks, sometimes, and saying no to features and customers. To be able to push back on leadership, and make sure that you’re really evolving the system the right way and not just becoming a feature factory.

Jeremy Jung: Yeah, it’s always gonna be a kind of balance on how much can you pull back but still stay competitive in whatever space you’re in.

John Doran: Yeah. So, what we’re doing right now, based on those lessons, is we try to do a six- to eight-week burst of work, and we would always try and take a week or two wiggle room between that and starting something new –

[0:57:00]

– to look back at what we just built and make sure we’re happy with it. But also look at our technical backlog, and see if there’s anything there that’s really paining us. And just even, for example, this week, we noticed an issue with a lot of builds failing on our CI, because of how it was set up to push Docker images, so, occasionally they would fail. And that was actually a real pain point for us just over the last couple of months, because maybe a deployment which should take 20 minutes was taking 40, ’cause you’d have to retrigger it. So, that’s an example of us looking at what was high value, and making sure we just fix it before we start something new.

Jeremy Jung: So, making sure that you don’t kind of end up in the same situation where you started, where these technical issues sort of build without people noticing them. Instead kind of, in shorter iterations, doing sort of a sanity check, and making sure everything is working and we’re all going in the right direction?

[0:58:00]

John Doran: Yeah, it’s about the team, and I mentioned, before, it’s about the leadership and a group of people, together, talking through common issues. And maybe meet every two or three weeks, talk about some key metrics in the system, why is this too high, why is this too low. Through the kind of trigger pairs, you can really see the pain points, and they’ll more than likely tell you them.

Jeremy Jung: When you look back at all the different technologies and processes you adopted, did you feel that any of them had too much overhead for someone starting out? What was your experience in general?

John Doran: So, some people just didn’t like doing code reviews, some people just really just felt that they could just push what they needed, and that it was almost a judgment on them in terms of the code review process, which it totally wasn’t. I would say some people found Jenkins and continuous integration a bit, you know, “What’s the point?” and so, we had some pain points there.

[0:59:00]

But as we got to Docker, as people seen the benefits of these things, you know, less bugs going into production, less things breaking, people being able to go home nice and early in the evening and not be woken up in the middle of the night with an outage call, those were all the benefits, and that’s us reaping the rewards of thinking like this.

Jeremy Jung: Your team was bringing on a bunch of new things at once – what was your process for adopting all these new things without overwhelming your team?

John Doran: So, it was starting at the foundation, so, the continuous integration, the code reviews were incrementally brought in, and we had regular team meetings to discuss pros and cons. And it was really important for people to input on those things, rather than to just implement them. They would’ve failed if we hadn’t done it like that.

[1:00:00]

It took time. I would say we’re still not in a perfect world, but it’s about group consensus and making sure that everyone’s bought-in to what we’re trying to achieve.

Jeremy Jung: So, basically, getting everyone in the same room, and making sure they understand what exactly the goal is, and everyone’s on the same page?

John Doran: Yeah, so, we try to make a big effort, particularly for people who are working remotely, to get them all in the same room once a quarter. We talk about our challenges, talk about our goals, talk about our values, and make sure we’re all on the same page. And sometimes we tweak them, and that’s how we feel it’s best to do it.

Jeremy Jung: Great. Finally, I wanna talk about what’s next for Phorest. What are the remaining pain points you have now, and what do you plan on tackling next?

John Doran: So, right now we’re around 4,000 salons on our platform.

[1:01:00]

We’re really happy with the state of the infrastructure to get us to maybe 8,000 to 10,000 salons, but we need to be really conscious of the company’s growth and our goals. So, we need to make sure that we can scale at a much bigger level, and we also need to make sure that our customers aren’t affected by our growth. We’re looking at serverless for any kind of newer pieces of the product, to see if they can help us reduce costs even more, and help us stay agile in terms of our infrastructure and how we roll out. A couple years ago when we launched into the USA, we noticed it doubled our overhead in terms of infrastructure, operations, and deployment. And as we grow in the US, we need to be really conscious of not making any, I guess, mistakes from the past.

Jeremy Jung: So, you’re mostly looking forward to additional scaling challenges, and possibly addressing those with serverless or some other type of technology?

John Doran: Yeah.

[1:02:00]

So, one area in particular will be our SMS sending, so, that’s kind of a plan for the next six to eight months will be to make sure that we can continue to scale at the growth rate of SMS and e-mail sending, which is huge in the platform.

Jeremy Jung: You said so far you’ve been experiencing 30 percent growth, year over year.

John Doran: Mm-hmm.

Jeremy Jung: And you said when you moved to the US, you actually doubled your customer base?

John Doran: I’d say we doubled our overhead in terms of infrastructure managing deployments. We’re still very early-stage in the US, and that’s our big focus for the moment. But as we grow there, we need to be, I guess, more operationally aware of how it’s going over there – it’s a much bigger market.

Jeremy Jung: To kind of cap it off, how can people follow you on the Internet?

John Doran: Sure, so, you can grab me on Twitter @johnwildoran.

[1:03:00]

And if you ever want to reach out to me and talk to me about any of this type of stuff, I’d love to meet up with you. So, feel free to reach out.

Jeremy Jung: All right, John, thank you so much for coming on the show.

John Doran: Thanks so much for having me.

Jeremy Jung: This is Jeremy Jung for Software Engineering Radio. Thank you for listening.

Automated: Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our website at se-radio.net. To provide feedback, you can comment on each episode on the website, or reach us on LinkedIn, Facebook, Twitter, or through our Slack channel at seradio.slack.com. You can also e-mail us at [email protected]. This and all other episodes of SE Radio is licensed under Creative Commons License 2.5. Thanks for listening.

[Music playing]

[1:04:00]

[End of Audio]

Join the discussion

1 comment

Jens T says:

August 31, 2021 at 2:47 pm

I liked that one very much. A real nice case of ‘story telling’, easy to follow and understand. Some technical details I did not quite get, but the general improvement is clear. What was probably most important for me: there was not something “wrong” in the beginning … the on-premise solution worked probably well for many years and was a success, but in time there was need to adapt and that was not simple and required a lot of learning. I am glad you found a solution for the next big step forward.
