When Good Quality Ideas Don't Stick: Lessons from a Failed Framework That I Loved

Red neon sign reading "THERE IS A BETTER WAY" mounted on a brick wall

In my ongoing pursuit of organizing all the knowledge I have stored across tools and apps, I've been decluttering my digital “stuff” and recently came across some old docs from a quality assessment framework we used when I started at Dropbox back in 2016.

I first learned about the Quality Maturity Evaluation (QME) framework, developed by Dropbox’s founding QA Engineer Alex Hoffer, after I had been with the company for a month or two. My onboarding buddy, also on the QA team, explained that I was to assess the quality maturity of each product team I worked with, write detailed reports, and assign each a score based on their practices.

I thought it was brilliant.

…But, it ultimately didn't survive the evolution of our org. Some teams found it useful but also very heavyweight, and as the QA team shrank over time it required more dedicated time than we could spare. We tried evolving the process to account for these changes by creating a "QME Lite" version and even retro-style formats, but none of them stuck either. Still, there was so much there that helped teams think critically about and improve their quality practices.

So what helped? Looking back, these are the pieces that stood out as genuinely useful.

What Actually Worked

Asking about effectiveness, not just existence. The QME evaluated teams across 10 categories: Release Process, Triage Process, Bug/Tech Debt, Manual Test Coverage, Automated Test Coverage, Testability, Communication, Field Bug Capture, Spec Review, and Inspection. But instead of asking "do you have automated tests?" we'd ask "how do your automated tests actually prevent regressions from reaching users?" or "how does your triage process ensure serious issues get fixed quickly?". This pushed people to think about whether their practices were actually working.

The team discussions were way more valuable than the scores. Instead of QA writing assessments about teams, we'd get everyone in a room - PMs, engineers, QA, design, CX, etc. - and have them talk through the team’s practices. We’d dig into questions like "how do you handle incoming bugs and decide what gets fixed when?" and "what's your process for deciding if a feature is ready to release?". Teams consistently told us these conversations were useful even when they hated filling out the assessment.

We also evolved this approach over time based on feedback from the teams. The process started with QA engineers writing assessments about teams, moved to facilitated team discussions, then tried survey-style "QME Lite" versions, and even experimented with retro-style "liked/lacked/longed for" formats. Each iteration taught us that teams resist assessment but embrace reflection when you change how you frame it.

Context was everything. Teams could weight each assessment category based on what mattered for their situation. A team with lower automation coverage might be doing exactly the right thing for their stage, while a mature product team with the same score would have real gaps - the scoring reflected that.
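
To make the weighting concrete, here's a minimal sketch of context-weighted scoring - not the actual QME tooling, just an illustration. The category names come from the list above; the weights and scores are made up:

```python
# Hypothetical sketch of context-weighted scoring (not the real QME tooling).
# Categories are from the framework; the weights and scores below are illustrative only.

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Average the category scores, weighted by what matters for this team's context."""
    total_weight = sum(weights.get(cat, 1.0) for cat in scores)
    return sum(score * weights.get(cat, 1.0) for cat, score in scores.items()) / total_weight

# An early-stage team might deliberately down-weight automation and up-weight release process...
early_stage = weighted_score(
    scores={"Automated Test Coverage": 2, "Release Process": 4, "Triage Process": 4},
    weights={"Automated Test Coverage": 0.5, "Release Process": 2.0, "Triage Process": 1.0},
)

# ...while a mature product team with the same raw scores is held to a different standard.
mature_team = weighted_score(
    scores={"Automated Test Coverage": 2, "Release Process": 4, "Triage Process": 4},
    weights={"Automated Test Coverage": 2.0, "Release Process": 1.0, "Triage Process": 1.0},
)

print(f"early-stage: {early_stage:.2f}, mature: {mature_team:.2f}")  # same inputs, different picture
```

With the same raw scores, the early-stage team lands around 3.7 while the mature team lands at 3.0 - the weights are doing the contextual judgment.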

Patterns across teams showed things no individual team could see. When you're running these across 20+ teams, you start noticing patterns like "all teams in this area identified technical debt as their biggest challenge" or "teams that improved their project kick-offs consistently saw fewer post-release bugs". We'd spot things like team reorgs consistently hurting quality scores (who’d have thunk it?), and automation investment patterns that predicted which teams would struggle with releases. That kind of insight was super valuable to leadership.

It was also surprisingly effective at pointing new teams toward useful quality practices they should stand up early. By using the QME as a forward-looking framework, teams could easily identify which mature quality practices made sense to adopt for their own context and stage.

Why It Didn't Work

QMEs were great and all, but they weren’t perfect. The full evaluation was a lot: 10 categories, detailed scoring, formal write-ups with action items. Teams found the discussions valuable but the process exhausting. We're talking several hours of team time plus prep, research, writing, and follow-up for the QAE leading the evaluation.

Even our lighter versions needed more facilitation than we could manage as the QA team shrank and priorities shifted. Without dedicated support, many teams started skipping them entirely. The teams that didn't skip them would sometimes rush through the evaluations just to “check the box”, which resulted in less valuable insights, which led to the argument that they weren't helpful… you see where this is going.

Plus the framework couldn’t keep up with how teams actually worked. The biggest example: our automation questions focused on basic stuff like coverage when teams were actually struggling with feature flag complexity, deployment pipeline reliability, and gating strategies. Unless you had a senior QAE running the evaluation who knew to dig deeper, you'd miss the real challenges teams faced.

Honestly, though, I think the biggest reason they didn't really work is that we couldn't consistently get teams to prioritize the resulting action items. Teams would identify the same problems quarter after quarter: flaky tests, unclear release criteria, poor incident response. But without a commitment to actually fix these things, the assessments became an exercise in documenting known problems rather than solving them. Super demoralizing for everyone involved.

What Keeps Me Coming Back

Now, I wouldn’t blame you if you're reading this and wondering why I’d want to do any of this again 😅

The thing that keeps me coming back is that the teams who bought into the process would surface learnings that eventually benefited the broader org. One team started reviewing specs together at project kick-off as a result of their QME discussions. Engineers found it so valuable that when they reorged onto other teams without QE support, they’d champion the process themselves. Another team began collaborative test case brainstorming for complex features, which helped the whole team think through edge cases they would have missed otherwise. That practice was then shared across teams as part of our self-serve testing toolkit.

We were surprised to find that teams began using QMEs to decide which quality practices to stand up early instead of learning the hard way - they'd look at what mature teams were doing and figure out which practices made sense to adopt. The teams that followed through on action items (tracking them in Jira, making them part of sprint planning, hearing leadership follow up on them) saw measurable improvements in their confidence about shipping quality software. Some teams even shifted to holding frequent quality-focused retrospectives based on the QME discussions.

These teams figured out that the real value was in using the assessment data to identify gaps, then building shared understanding about what quality meant for their specific context and using that insight to guide their own improvements. That's what I want to rebuild - that data-driven model where teams reflect on their practices together and actually improve them. But how do you get all of that context together for meaningful quality discussions across teams without requiring a QA engineer/coach/unicorn to lead every conversation?

If I Joined a New Company Tomorrow

I don't have the perfect framework yet (is there one? 🫠), but I know that I'd start with:

  1. embedding these questions into existing team rhythms (retros, planning sessions, post-incident reviews),
  2. looking for teams that are already having quality conversations and figuring out how to amplify what's working, and
  3. making sure insights actually lead to organizational support for change

Some ideas are worth preserving even when their original implementation doesn't survive. This framework taught me that teams genuinely want to reflect on their quality practices - they just need the right format and support to actually fix what's broken. Because assessment without action is just really expensive organizational therapy 🤷‍♀️
