Notes about ShipItCon 2022
Well, here we are again. Back to physically going to a place where people talk to an audience in a structured way. It’s been quite some time.
I’m not going to deny that the feeling was a bit weird and that I got a bit (extra) anxious about being in a place with so many people.
I’ve talked about a previous ShipItCon in this blog. It is one of my favourite conferences that I’ve attended. There have not been too many editions, this was just the third one, but all of them have been full of interesting talks.
One peculiarity is that, being focused on a nebulous concept (shipping software) more than on a particular technology or area, it allows for stretching your knowledge into areas that you are not used to. Sort of thinking outside your own box. I tend to get a lot of notes and ideas to think about later.
The conference itself
First of all, let’s start by saying that the conference was impeccably run. It was hosted, as it has been in the previous editions, in The Round Room, which is an historic venue in the centre of Dublin.
As its name implies, it’s a big round room, which has its own vibe and makes it distinct from other venues.
The whole conference had its own MC, CK (Ntsoaki Phakoe). She introduced all the different speakers and set the flow of the conference. The concept of a general MC for conferences is something that I’ve only seen in this one, but it’s a great one, and CK always does a great job at it.
The main theme of the conference was resilience. This being a word we have talked about a lot in the past few years, it seemed like a good general topic for this ShipItCon 2022, so there were a lot of ideas about how to make software, and the release of it, more resilient, as we will see. I liked that, the conference being the way it is, it included not only technical ideas about it, but also ones related to people.
Talks and notes
Some ideas that I took from the different talks.
Keynote by Cian Ó Maidín.
Cian talked about his company, NearForm, and how a business based in County Waterford was initially hit hard by COVID, but later went on to develop an open source COVID tracker that was used by multiple health authorities across the world.
He talked about a lot of the different challenges of working with health organisations at that critical time, under huge pressure, and about the urge to help without thinking about profit when the occasion arises.
Failure: A building block, by Nicole Imerson
Nicole talked about her podcast Failurology, which describes different failures and presents how they came to be, and what lessons we can learn from them.
The more uncomfortable the failure, the more we need to talk about it.
She categorised failures into three kinds:
- New territory. The edges of technology. It’s very difficult to foresee the problems, as new ground is being covered. She used the installation of the first transatlantic cable in the XIX century as an example.
- Mistakes. Human errors or others. Things that could be avoided, but for some reason, are not. Sometimes, things go wrong. She talked about the 2003 North East blackouts, which affected parts of Canada and the US.
- Deliberate. When the risk is accepted as part of the process. In most cases, this can be driven by profit. I think that, in others, this can be a trade-off that can be worth it.
As an example of something that was not, she presented the case of the Boeing 737 MAX and its crashes.
Failure happens by Filipe Freire
In this lightning talk, Filipe talked about some failure stories and presented four principles that are quite interesting:
- Dive deeper. Running an ongoing system is a challenging task that will require fixing a lot of unexpected problems. Approached with the right mindset, that can focus the team, instead of leaving it feeling lost or full of embarrassment.
- Scrap! Make everything scriptable, and nothing sacred, so it can be adjusted properly.
- Feedback loops, especially when interacting with customers, to be sure that new products are used and improved for the customers.
- Empathise. Most problems are people problems.
Don’t give up! A tale of a system impossible to scale by Nicola Zaghini
Nicola talked about how the concept of resilience moved, around the year 2000, from being an obscure term related to wood properties to the concept that we use today.
He defined resilience as the capacity to persist, adjust or transform while maintaining the basic identity, and emphasised that a resilient mind is more important than a resilient system, as the first will find a way to make the second.
Stop developing in the dark by Noel King
Noel talked about the time lost due to bad code, and how we developers mostly operate in situations with a lack of data, which causes frustration.
He talked about how we can use data-driven techniques to improve the delivery and development of code by selecting an area of focus, agreeing on metrics to track that area, setting goals, tracking progress and, finally and very importantly, celebrating the successes.
Your app has died and that’s OK by Anton Whalley
Anton talked about the status of tooling for errors that are specific to production and, in particular, for fatal errors. They can be very difficult to track, as the production environment is normally undebuggable, logs only capture known unknowns, and stack traces may not capture the whole situation.
Crash analytics tools have been around for a long time, but the current tooling is adjusting to new environments, like containers and other distributed systems.
Standardisation & effects in resilience by Christine Trahe
Christine talked about how the standardisation of different processes can help not only with technical aspects, like tech debt, duplication and reliability, but also with mental aspects, like reducing the cognitive load and the learning curve for projects, and also reducing the possibility of burnout.
She explained several examples of standardisation to reinforce these concepts, from purely technical stuff, like CloudFormation, to others more related to social aspects, like knowledge sharing or incident management.
Panel with Sheeka Patak, Stephanie Sheehan, James Broadhead and Paul Dailly, moderated by Damien Marshall
An interesting talk discussing different aspects of running software services. Some random ideas that I took note of:
- Be conscious of the time spent on systems that are being phased out. Perhaps it is better to accelerate the phase-out than to fix them.
- Prepare plans with actions in case something needs to be fixed. If a pre-agreed situation happens, then activate the plan. For example, if a certain error appears more than three times, then fix it.
- Formalising on-call can lead to people deciding that the compensation is not worth it, and end up with a worse system than the non-formal one.
- One of the main roles of senior people is to act calmly in stressful situations and give confidence to less experienced engineers. Also to de-stress situations and provide external viewpoints.
- Culture is very important.
- Failure happens. It needs to be accepted as part of the process.
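The pre-agreed plan idea can be sketched in a few lines of Python. This is only my own illustration, not something shown in the panel: the ActionPlan class, the "timeout" error name and the threshold of three (taken from the example above) are all made up for the sketch.

```python
from collections import Counter


class ActionPlan:
    """Hypothetical helper: counts errors and flags when a pre-agreed limit is passed."""

    def __init__(self, threshold=3):
        # Number of occurrences that are tolerated before the plan is activated.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, error_name):
        """Record one occurrence. Return True if the error has now appeared
        more than `threshold` times, meaning the plan should be activated."""
        self.counts[error_name] += 1
        return self.counts[error_name] > self.threshold


plan = ActionPlan(threshold=3)
# Simulate the same error appearing four times in a row.
results = [plan.record("timeout") for _ in range(4)]
print(results)
```

The point is that the decision is taken in advance, when everyone is calm. When the situation happens, there is nothing to discuss, only a plan to follow.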
On-call shouldn’t be a chore, by Hannah Healey and Brian Scanlan
Hannah and Brian described some of the elements of on-call at Intercom, and how some aspects of on-call can be empowering, like fixing actual problems, or terrible, like the possible disruption of personal life.
Intercom went through an internal rethink of the process to try to decouple on-call from team membership, review the quality of the alarms, check how many people actually need to be on-call and, in general, try to make on-call something to be proud of.
They moved to a volunteer system, where the principles were aligned with the values of the company. As part of that, they included clear support for people on-call by introducing what could be described as “on-call for on-call”. They also put the accent on reviewing the problems that appear during on-call and using the time to fix them, as well as on making on-call a celebrated part of the company.
They described these elements as important parts of Intercom, and dependent on its culture, which means that each road will be different and needs to be optimised for the specific people.
A formula for failure, by Brian Long and Julia Grabos
Brian and Julia described the project of migrating to a new language they created for their application, to address several limitations, in particular, composing JSON with a templating system.
The project took around a year to release, and while it was a success and customers liked it, their alternative plan in hindsight would have been to try to work in smaller increments to release it partially, as much as possible.
Ditch the template, by Laura Nolan
Laura shared her experience writing public incident reports that have generated a lot of attention, and how to make them engaging.
The value of creating this kind of document, in public, is in sharing knowledge, creating thoughtful reflection on what actually happened and allowing long-term storage and documentation of problems. It also helps with being transparent with customers, who can understand the different problems, and it generally lifts the industry to a better ground.
She explained that the main value of incident reports is the learning that they produce. And that’s difficult to achieve just by routinely filling in a form.
Laura described several aspects for engaging incident reports:
- Supporting the reader. It should be without jargon, clear to read and can use links to get deeper into concepts.
- Be visual, reinforcing the concepts with graphs, timelines, screenshots or other creative elements.
- Analysis. If the incident report is a story, the analysis is the moral of it. What should you get from what happened? It also helps to get a feeling of resolution.
- Craft. Elements like using simple language, using headings to avoid a wall of text, using a creative title to help searching for it later, keeping a consistent tense and avoiding converting it into a sales pitch.
I really enjoyed the conference and I think it was full of great content. I think that they do a great job of selecting great speakers that show a lot of different aspects of software development. I always learn some things, and seeing what different teams are doing in areas like on-call, release and monitoring, among others, is always interesting.
I definitely recommend going to the next one if you’re interested in these aspects! Hopefully it will be earlier than the three years we had to wait for this one.