Eight Days a Week

Reflections on the Pitfalls of Story Points and Velocity Capture

Many don’t know this, but I studied mathematics for two years at university before specialising in Computer Science. One mandatory class taught the fundamentals of mathematical proof. In a nutshell, we learned how to establish that a given hypothesis was correct, or identify the contradiction that disproves it. It was immensely satisfying when you reached the end of a proof. Furthermore, verifying that you had the correct answer was pretty straightforward.

If only all hypotheses were so easy to disprove. Lately there has been a growing obsession with story points across the squads. Managers, business stakeholders and even the PO are starting to ask why a given item is X points. Worryingly, it has become the hot topic within our circle.

Why should we need to prove the correctness of our story point estimates?

A couple of teams have raised concerns that the metric reported in a centralised dashboard is incorrect, and that their true sprint velocity is actually higher. They feel pressure to justify to that same stakeholder group that they are delivering value. However, unlike in my university proof class, this conclusion has been reached through gut feel rather than mathematical proof.

Subsequent conversations have me thinking this is about more than proving velocity correctness. Do the numbers really add up? Or are there bigger lessons to learn? While trust in the metrics is indeed one problem, it is not the full picture. There are several estimation pitfalls contributing to this situation. Here we discuss story point pitfalls and how they can erode customer confidence.

Original Sin

While evaluating our stance on story points, it’s important to reflect on their origins. According to the legendary Ron Jeffries, story points originate from the early evolution of XP. Fascinatingly, they were meant to represent ideal days. Essentially, a story point was one productive developer day.

Is there such a thing as a super productive ideal developer day anymore?

Let’s take note of the terms ideal and productive in that statement. A common planning pitfall that is contributing to our current issue is implicitly mapping points to days. Humans are absolutely awful at judging how long an item will take. Couple that with the characteristic optimism of programmers, and it’s easy to see why we are setting unrealistic estimates.

There is a case to be made for joining the #NoEstimates movement. However, that is a wider journey to commit to, one that requires extensive engagement with stakeholders and the wider department. Given the state of play, a smaller series of steps could be applied to provide shorter-term improvements.

Relative estimating using either points or t-shirt sizing could be more helpful here.

Some tips outlined in Software Estimation Without Guessing by George Dinwiddie can help. A small step in the right direction would be to move to relative estimates, either as points or t-shirt sizing. Both approaches require estimates to be made relative to previously completed work. A second step would be to revisit completed items, both to give the team further estimation practice and to provide further data on the relative scale the team is using.

Crystal Clear

While estimation is a clear problem, it is important to identify the root cause of the lack of confidence in the published metrics. Lack of transparency in how metrics are calculated on the shared portal is definitely a key one. While there is a publicly queryable API, the underlying calculation remains closed. This makes it easy for people to point the finger at the numbers, and claim they are wrong.

Since these numbers are under the spotlight from management and clients alike, it’s natural for development teams to try to either game the system or defend their record. This is a classic case of weaponisation. In this case, blame has immediately fallen on the Jira workflow feeding the dashboard. No one has done the maths. No one has investigated how the dashboard number is calculated; we have only stated what we expect it to be.
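Doing the maths need not be complicated. As a minimal sketch, assuming sprint data can be exported as a list of stories with their points and their status at sprint close, the snippet below recomputes velocity under two different definitions of complete. The Story fields, the status names and the sample sprint are all hypothetical; the real dashboard calculation is closed, so this only shows how a team could check its own expectation rather than how the dashboard works.

```python
from dataclasses import dataclass


@dataclass
class Story:
    key: str      # hypothetical issue key
    points: int   # story point estimate
    status: str   # status when the sprint closed


def velocity(stories, done_statuses=("Done",)):
    """Sum the points of stories whose closing status counts as complete."""
    return sum(s.points for s in stories if s.status in done_statuses)


# Illustrative data only, not taken from any real sprint.
sprint = [
    Story("APP-1", 5, "Done"),
    Story("APP-2", 3, "Done"),
    Story("APP-3", 8, "UAT"),
    Story("APP-4", 2, "To Do"),
]

strict = velocity(sprint)                                   # only Done counts
lenient = velocity(sprint, done_statuses=("Done", "UAT"))   # UAT counts too

print(f"Strict velocity: {strict}")
print(f"Velocity if UAT counted as complete: {lenient}")
```

If the two figures differ widely, the disagreement is about the definition of done rather than the arithmetic, which is a far more useful conversation to have with stakeholders.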

The age-old saying that a bad workman blames his tools can apply to software development too. It’s a natural defence.

We always jump to blaming the tools, arguing that they don’t cover a particular portion of our lifecycle. Here, the justification for changing the workflow was that the UAT carryover phase is responsible for a large portion of incomplete stories at the end of the sprint.

The reality is that there is more to this story than incorrect metrics. A cumulative flow chart shows that a long time is spent in stages other than UAT, such as TO DO. This is partially down to unplanned work. Pulling unexpected items into the sprint puts the team under pressure to deliver the committed items in a shorter time frame. Inadequate developer testing means defects identified in UAT pull features back to development. Key acceptance criteria are missed. Tight deadlines push the team to burnout, reflected in a high rate of sickness and sprint rollover.
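The time spent in each stage can be checked with a few lines of arithmetic. The sketch below assumes a story’s status history is available as a time-ordered list of (timestamp, status) pairs; that shape, the status names and the dates are hypothetical and chosen only for illustration. It totals the days spent in each status up to the end of the sprint.

```python
from collections import defaultdict
from datetime import datetime


def time_in_status(transitions, sprint_end):
    """Return days spent in each status for one story.

    transitions: list of (timestamp, new_status) pairs, sorted by time.
    sprint_end: closes the final, still-open status interval.
    """
    totals = defaultdict(float)
    # Pair each transition with the next one; the last pairs with sprint_end.
    following = transitions[1:] + [(sprint_end, None)]
    for (start, status), (end, _) in zip(transitions, following):
        totals[status] += (end - start).total_seconds() / 86400
    return dict(totals)


# Illustrative history: the story waits, is developed, bounces back from UAT.
history = [
    (datetime(2024, 1, 1), "To Do"),
    (datetime(2024, 1, 6), "In Progress"),
    (datetime(2024, 1, 9), "UAT"),
    (datetime(2024, 1, 10), "In Progress"),
    (datetime(2024, 1, 11), "UAT"),
]

days = time_in_status(history, sprint_end=datetime(2024, 1, 12))
for status, total in days.items():
    print(f"{status}: {total:.1f} days")
```

In this made-up example the story spends longer waiting in To Do than in UAT, which is the kind of pattern worth bringing to a retrospective before changing the workflow.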

Retrospectives are a vital tool for identifying the source of issues and agreeing a remediation action plan. Use any metric trends as a conversation starter, not a justification of your work.

Calm Like a Bomb

Weaponisation of velocity metrics by management and client groups is a major concern. While I understand the desire to compare the effectiveness of squads at a higher level, story points are not the way. Story point estimates are personal and unique to a team. They depend on numerous factors, including but not limited to:

Team experience
Degree of squad autonomy
Average story complexity
Level of Product Owner engagement
Development team collaboration and cohesiveness maturity levels
Degree of process and testing automation

If story points cannot give a comparable value, some may seek another metric. Story count could be considered a viable alternative. However, it unfairly discriminates against small teams, or squads that encounter gaps in work for events such as vacation or long-term leave. Furthermore, it can be gamed by writing minutely small stories that may not reflect true customer value.

Comparing stories across teams can encourage bad gamesmanship as teams try to beat the system.

Assuming a single set of metrics fits all approaches for all squads is a monumental mistake. A pitfall in large-scale Agile transformations is that a single paradigm is enforced from the top. It shouldn’t matter whether squads adopt Scrum, Kanban, Scrumban or any other paradigm. That choice should be made by the team depending on their collective circumstances. Allowing them to prescribe their own practice empowers teams to be in control of their own tracked metrics and the hypothesised effect their deliverables are expected to have.

If you start to get questions about your point commitments in a sprint, escalate immediately. Even some of the alternatives such as number of completed stories per sprint cannot provide a fair comparison of team effectiveness. What is more important is tracking of sprint and product goals using the corresponding Key Performance Indicators, or KPIs. If those goals are not defined and quantified well, it is impossible to show the Product Owner or any other stakeholders that valuable increments are being delivered.

Prove It

Any fascination with story points or sprint velocity by key stakeholders should be considered unhealthy. In my experience over the years, a significant cause is often management pressure. Someone is looking at that trend of points and asking why it is lower than last time.

Scrutinising a single metric will not demonstrate team effectiveness or productivity. Number of points committed or completed isn’t important. Neither is whether you finish every single story committed to in the sprint. Neither is how many stories a given team completed in a sprint.

Success is a straight road for all development teams.

What is the true measure of success? Why, it’s client value, of course! KPIs are one way to measure this. If these are tied to the product vision, an increase or decrease in these values can provide an empirical method of measuring team effectiveness. Over time they allow us to test our hypothesis against one simple question. Are we going in the right direction?

Thanks for reading!