Most interview questions get asked because someone else asks them. They're inherited from a template, copied off a Notion doc, or recycled from the last hire. Almost none of them predict whether the candidate will actually do the job. Which is awkward, because that's the whole point.
The research on what does predict performance has shifted hard in the last few years. Here's what it shows, what to keep asking, and the questions that should have been retired a long time ago.
Why so many interview questions don't actually work
A typical unstructured interview produces the right hire roughly 57% of the time. Slightly better than a coin flip. That number comes from decades of hiring research, and it's stubbornly stable. The structure of the interview matters more than the questions inside it, and most interviews have no structure at all. Every candidate gets different questions. Scoring happens in the interviewer's head. Hire/no-hire calls are made on impressions formed in the first three minutes.
For a long time, the canonical reference here was Schmidt and Hunter's 1998 meta-analysis on personnel selection methods. It put structured interviews at a predictive validity of r = 0.51 and unstructured at r = 0.38. That paper anchored almost every hiring textbook for two decades.
Then in 2022, Sackett, Zhang, Berry and Lievens reran the math with corrected statistical assumptions. The validity numbers came out lower across the board, and the ranking changed too. Structured interviews moved up to the top of the list. Cognitive ability tests, long considered the gold standard, dropped to about r = 0.31. In other words: how you run the interview matters more than which test you run alongside it.
The numbers, side by side:
| Method | Predictive validity (r) | Source |
|---|---|---|
| Work sample tests | 0.54 | Schmidt & Hunter (1998) |
| Structured interviews | 0.42 (Sackett), 0.51 (classic) | Sackett et al. (2022); Schmidt & Hunter (1998) |
| Cognitive ability tests | 0.31 (Sackett), 0.51 (classic) | Sackett et al. (2022); Schmidt & Hunter (1998) |
| Unstructured interviews | 0.38 | Schmidt & Hunter (1998) |
| Reference checks | 0.26 | Schmidt & Hunter (1998) |
| Years of experience | 0.06 | Schmidt & Hunter (1998) |
| Brainteasers | ~0 | Google internal study (Bock, 2013) |
A reminder on what r values mean: a validity of 1.0 would mean the interview perfectly predicts on-the-job performance. A validity of 0 means it's noise. Anything above 0.40 is considered strong for personnel selection. Anything below 0.20 is barely useful. Most of what's still in active rotation in interviews falls into that lower bucket. Our piece on passive candidate sourcing applies the same research-skeptical lens to a different sourcing assumption.
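If you want to see what an r value looks like on your own data, here is a minimal sketch in Python. It computes a plain Pearson correlation between interview scores and later performance ratings. The function name and the sample numbers are made up for illustration. A handful of hires gives a very noisy estimate, and you only ever see performance for the people you actually hired, so treat the result as a rough sanity check rather than a validity study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations, and sums of squared deviations for each list.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: interview scores (1-5) and performance ratings
# for the same eight hires a year later (1-5).
interview_scores = [2, 3, 3, 4, 5, 1, 4, 2]
performance = [3, 3, 2, 4, 4, 2, 5, 3]

r = pearson_r(interview_scores, performance)
print(f"observed r = {r:.2f}")
```

Run the same calculation on a question you suspect is filler. If the scores it produces don't move with later performance, that's your answer.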
The questions to retire
Here are the questions that show up in interview templates everywhere and contribute almost nothing to a hire/no-hire decision. Some of them feel insightful. None of them are.
Brainteasers. "Why are manholes round?" "How many golf balls fit in a 747?" Google was famous for asking these and tested whether they worked. Laszlo Bock, who ran Google's people operations through that era, published the result: zero correlation between brainteaser performance and actual job performance. Across tens of thousands of interviews. Google dropped them. Most of the industry didn't get the memo.
"What's your greatest weakness?" Every candidate has rehearsed this one. The answer is a humblebrag wrapped in self-awareness. "I work too hard." "I care too much." Nothing in the answer shows behavior under real constraints, which is what you actually need to know.
"Where do you see yourself in five years?" It's an aspiration. Aspirations don't predict performance. The candidate who says "I want to lead a team of fifty" and the candidate who says "I want to be a senior IC writing better code" both might do the job equally well or equally badly. The answer tells you about their fantasy, not their fit.
"Why should we hire you?" This is a sales pitch. You're scoring how well the candidate can pitch themselves. Some great hires are bad at this. Some terrible hires are great at it. The variance in pitch ability swamps the signal you're looking for.
Culture-fit gut calls. "Did they feel like one of us?" is a vibe check, not an assessment. First impressions are among the worst predictors in the entire selection literature. Worse, they tend to filter for demographic similarity rather than competence, which is how teams become accidentally homogeneous. If you want to assess values alignment, write specific behavioral questions about working through values trade-offs. Don't outsource it to your gut.
Years of experience as a proxy. Validity of r = 0.06. Almost the worst signal on a resume. A candidate with three years can outperform a candidate with twelve. Filtering by years is a convenience, not a signal. Use it to narrow a pile, not to rank a shortlist.
If you're spending energy filtering through bad-fit candidates by asking bad-fit questions, the leak is upstream. Sourcing on self-reported keywords gives you a pile that's mostly noise, and weak interview questions are what's left to clean it up with. Sourcing on what a candidate has actually done, mapped to the skills the role needs, narrows the funnel before the interview starts. Glozo's Skill Graph handles that part.
The questions that actually predict performance
This is what the research backs. None of these are new, but they're consistently underused in real interviews because they take more setup than asking off the cuff.
Past behavior questions. "Tell me about a time when..." The behavioral consistency principle is the foundation here: the best predictor of how someone will behave is how they've behaved in similar situations before. The trick is to draw the question from a critical incident in the actual job, not a generic prompt. Instead of "tell me about a time you failed," ask "tell me about a time you had to ship something on a deadline you didn't think was realistic, and how you handled the trade-offs." The first question is filler. The second one gets you behavior.
Situational questions with a rubric. "What would you do if a senior engineer pushed back on your hiring recommendation?" Situational questions work, but only when there's a predefined scoring rubric drawn from how high performers actually handled the situation. Without the rubric, you're scoring on personal preference. With it, you're scoring against a standard. Validity drops by about half if you skip the rubric.
Work sample tasks. The highest-validity method in Schmidt and Hunter's classic estimates, at r = 0.54. A small piece of the actual job, scored against the same rubric for every candidate. For a recruiter role: have them write a sourcing search for a hard-to-fill role and explain their logic. For a developer: a short code review or a small bug fix. For a designer: a critique of an existing screen. The cost is interviewer time. The payoff is direct evidence.
Cognitive screening tied to the role. Cognitive ability still predicts performance, even at the revised lower validity numbers. The question isn't whether to assess it. The question is whether you assess it through a job-relevant problem or through abstract puzzles. The first builds signal. The second is the brainteaser trap by another name.
How to structure an interview that actually predicts
The research is unambiguous: structure is what makes the interview work. Questions in a vacuum don't. Same questions, asked in the same order, scored against the same rubric. Every candidate. No improvisation in the moment. Improvisation is where bias enters.
A scoring rubric in advance, written down. Each question has a definition of what a strong answer, a passable answer, and a weak answer look like. Drawn from how your current high performers actually answer the same question. Without a rubric, you're scoring on impression, and impressions are noisy.
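A written rubric can be as simple as one record per question. The sketch below is one possible shape, assuming a three-point scale (1 = weak, 2 = passable, 3 = strong). The class name, field names, and anchor wording are all hypothetical; real anchors should come from how your own high performers answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricQuestion:
    """One interview question with a written anchor for each score level."""
    prompt: str
    weak: str      # what a score of 1 sounds like
    passable: str  # what a score of 2 sounds like
    strong: str    # what a score of 3 sounds like

    def anchor_for(self, score: int) -> str:
        """Return the written anchor for a score, rejecting off-scale values."""
        anchors = {1: self.weak, 2: self.passable, 3: self.strong}
        if score not in anchors:
            raise ValueError(f"score must be 1, 2, or 3, got {score!r}")
        return anchors[score]


# Hypothetical anchors, written before any candidate is interviewed.
deadline_question = RubricQuestion(
    prompt=("Tell me about a time you had to ship something on a deadline "
            "you didn't think was realistic, and how you handled the trade-offs."),
    weak="Blames the deadline or others; names no trade-off they made.",
    passable="Describes cutting scope, but not how they decided or communicated it.",
    strong="Names the trade-offs, who they told, and what happened after shipping.",
)

print(deadline_question.anchor_for(3))
```

Freezing the record is a deliberate choice: the anchors shouldn't change between the first candidate and the last.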
Multiple interviewers, scoring independently before they talk. Group discussion before individual scores collapses into the highest-status person's opinion. Each interviewer writes their score, with notes, then shares. This is where weak signals get caught.
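One way to make the independent-scoring step concrete is to collect every interviewer's scores first and only then compare them. This is a rough sketch, assuming scores on a 1 to 3 scale and a disagreement threshold of more than one point. The function name, the question ids, and the threshold are illustrative assumptions, not a standard.

```python
from statistics import mean


def summarize_scores(scores_by_interviewer, max_spread=1):
    """Combine independently written scores and flag questions to discuss.

    scores_by_interviewer maps interviewer -> {question_id: score}.
    A question is flagged when the gap between the highest and lowest
    score is larger than max_spread.
    """
    question_ids = set()
    for scores in scores_by_interviewer.values():
        question_ids.update(scores)

    summary = {}
    for qid in sorted(question_ids):
        values = [s[qid] for s in scores_by_interviewer.values() if qid in s]
        spread = max(values) - min(values)
        summary[qid] = {
            "mean": mean(values),
            "spread": spread,
            "discuss": spread > max_spread,
        }
    return summary


# Hypothetical scores, each written down before the debrief.
scores = {
    "interviewer_a": {"deadline": 3, "pushback": 2},
    "interviewer_b": {"deadline": 1, "pushback": 2},
    "interviewer_c": {"deadline": 3, "pushback": 3},
}

for qid, row in summarize_scores(scores).items():
    print(qid, row)
```

The flagged questions are where the debrief should spend its time. Agreement needs no meeting; a two-point gap does.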
How many questions you need: the diminishing returns curve flattens around eight to twelve well-designed questions. Past that, you're adding interview time without adding signal. Most companies under-design their question set and run their interviews too long. The fix is not more questions. The fix is sharper ones.
From skill-based sourcing to skill-based interviewing
The interview is only as predictive as the candidate pool you're testing. If sourcing is broad and keyword-based, the interview is doing salvage work: trying to find the right person inside a pile assembled on the wrong signals. The math doesn't favor the interviewer.
When sourcing is skill-based, the relationship inverts. You source for the skills the role actually needs, mapped to evidence the candidate has done that work. The interview validates the specific skills you sourced for. The behavioral questions are tied to those skills. The work sample tests those skills. You spend interview time confirming what you already have evidence of, not searching for it.
This is the loop Glozo's product is built around: source by skill graph, interview by the same skill graph. Candidates who surface through skill-based search arrive with the relevant past behavior already mapped. The interview becomes a check, not a fishing trip. Try it on a search you'd otherwise run on LinkedIn and compare what comes back.
If you want to read about how that upstream piece fits with the rest of the recruiting funnel, we wrote about candidate sourcing as a discipline separate from recruiting and the mechanics of running a search using boolean operators or natural language.
Source with skill clarity, interview with skill clarity
Strong interviews start before the interview. If you spend less time filtering bad-fit candidates with weak questions, you can spend more time validating skill fit with the questions that actually predict. Try Glozo on the next hard-to-fill role and see what skill-based sourcing looks like end to end.

