
AI Voice Cloning Has Crossed the Indistinguishable Threshold: What Security Teams Must Do Now

Report

Written by

Brightside Team

Published on

It's 8:47 AM on a Monday. Your CFO's voice comes through the phone. The tone is urgent but controlled. There's a wire transfer that needs to happen before the markets open. The voice sounds exactly right: the cadence, the slight formality, the way he trails off at the end of sentences. Four minutes later, the call ends. The transfer is authorized.

The CFO never made that call.

This isn't a hypothetical. It's a pattern that security researchers and fraud investigators documented repeatedly throughout 2024 and 2025, and the technology making it possible has only gotten more accessible. In late 2025, researchers at Queen Mary University of London published findings confirming what many in the security field had suspected: AI-generated voices are now, in controlled testing conditions, indistinguishable from real human voices. Some AI voices were rated as more trustworthy than the authentic recordings they were compared against.

The lead researcher put it plainly: "It just shows how accessible and sophisticated AI voice technology has become."

This article is for security teams who need to act on that finding. Not just understand it, act on it. We'll cover what AI voice cloning actually is and how it works, why the defenses most organizations currently rely on are already failing, what you need to do right now, and which platforms are helping teams build the right response.

Key Terms You Need to Know

Before going further, let's define the terms that matter here. These aren't concepts to skim past; they describe the actual mechanics of the threat.

Vishing (voice phishing): A social engineering attack conducted over the phone, where an attacker impersonates a trusted individual to manipulate the target into disclosing sensitive information, authorizing a transaction, or taking a harmful action.

AI voice cloning: The use of machine learning models to generate a synthetic replica of a specific person's voice from a short audio sample. Modern tools can produce a convincing clone from as little as three seconds of source audio.

Deepfake voice: An AI-generated audio stream that mimics a target person's vocal identity with enough fidelity to deceive a human listener. The output can be a pre-recorded file or a live real-time stream during an active call.

Real-time voice conversion (RTVC): A live deepfake technique where the attacker speaks naturally, and their voice is transformed into the cloned voice in near-real-time during an active phone call. The person on the receiving end hears the target's voice, not the attacker's.

The indistinguishable threshold: The point at which AI-synthesized speech can no longer be reliably distinguished from authentic human speech by human listeners. Peer-reviewed research confirmed this threshold was crossed in 2025.

Social engineering: Psychological manipulation that exploits trust, authority, urgency, or reciprocity to bypass security controls without exploiting any technical vulnerability. The human is the vulnerability.

Hybrid attack: A coordinated multi-channel attack that combines a phishing email and an AI voice call in a single campaign. The email builds initial credibility, the call closes the deception.

How AI Voice Cloning Actually Works

Understanding the mechanism matters because it changes what defenses make sense.

The technical pipeline, in plain terms

Modern voice cloning systems use neural networks trained on large datasets of human speech. The model learns to separate "what is being said" from "who is saying it." Voice identity gets encoded as a mathematical pattern, a kind of vocal fingerprint, that can be separated from the content of speech and reapplied to new content.

Give the model a short reference recording of your target and it maps their vocal identity onto that fingerprint. From that point forward, it can render any text or speech in that person's voice, with their intonation, rhythm, pauses, and emotional register intact.

In the current generation of tools, that reference recording can be three seconds long. McAfee research found that three seconds of audio is enough to produce a voice clone with 85% accuracy. A few minutes of source audio produces a clone that's nearly impossible to detect in a live call context.
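The "vocal fingerprint" idea can be made concrete with a toy sketch. Real systems use neural speaker encoders trained on speech, but the geometry is the same: a voice becomes a fixed-length vector, and a good clone lands close to the target in that vector space. Everything below (the random vectors, the noise scale) is illustrative, not a real model.

```python
import math
import random

random.seed(0)

def rand_vec(n=256, scale=1.0):
    """Toy stand-in for a speaker embedding: a fixed-length vector."""
    return [random.gauss(0.0, scale) for _ in range(n)]

# The target's "vocal fingerprint" as a neural encoder would extract it.
target_voice = rand_vec()

# A clone built from a short reference sample lands close to the target
# in embedding space; an unrelated speaker does not.
clone_voice = [t + n for t, n in zip(target_voice, rand_vec(scale=0.1))]
other_voice = rand_vec()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(f"target vs clone: {cosine(target_voice, clone_voice):.3f}")  # close to 1.0
print(f"target vs other: {cosine(target_voice, other_voice):.3f}")  # near 0
```

The separation between those two similarity scores is what lets a synthesizer reapply the target's identity to arbitrary new speech.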

How it runs during a live call

This is the part that changes the threat model. Early voice cloning produced pre-recorded audio files. An attacker would generate the clip, play it during a call, and hope the conversation didn't go off-script.

Real-time voice conversion removes that constraint. The attacker speaks naturally into a microphone. Their voice is processed and transformed into the cloned voice in near-real-time, and that's what the target hears on the other end. The attacker can respond to anything the target says, adjust their approach, and hold a fully dynamic conversation, all while sounding like the CFO, the IT director, or the CEO's assistant.

Latency has dropped to imperceptible levels on current hardware. There's no delay that gives it away.

Where attackers get the source audio

This is often the part that surprises security professionals. Attackers don't need to breach your systems to harvest source audio. They're usually pulling from:

  • Earnings calls and investor presentations, publicly archived on company websites or financial platforms

  • LinkedIn and YouTube profile videos, company "meet the team" pages, and conference talk recordings

  • Podcast appearances and media interviews featuring executives

  • Internal company town halls or training recordings that have been leaked, scraped, or improperly shared

Any employee with a public-facing video presence is a potential source voice. That includes people well below the C-suite.

The Scale of the Problem

The technology didn't appear overnight, but its accessibility and use in attacks have accelerated sharply.

In 2019, a UK energy firm CEO authorized a $243,000 transfer after receiving a call he believed was from his German parent company's CEO. That was accomplished with 2019-era voice synthesis, a fraction of what's available today, and the fraud still worked.

Fast-forward to 2025 and the numbers reflect a different scale entirely:

  • Vishing attacks surged 442–449% year-over-year in 2024–2025

  • Deepfake-enabled vishing increased over 1,600% in Q1 2025 compared to Q4 2024

  • 35% of people cannot identify an AI-cloned voice from a genuine one

  • 25% of employees are fooled by deepfake voices even in controlled simulation scenarios

  • Average vishing cost per organization: $14 million per year

  • Generative AI fraud losses are projected to reach $40 billion globally by 2027

One pattern that's become especially common is help desk impersonation. Attackers call employees posing as IT support, claim there's a security incident requiring urgent action, and talk them through resetting their password or modifying their MFA settings. There's no malware involved, no technical exploit. The attack is a phone call that ends with the attacker holding valid credentials and full account access.

Why Detection-First Defenses Are Already Failing

Most organizations have responded to the vishing threat by adding it to their awareness training: "Be careful about phone calls. Listen for unusual requests. Verify the caller's identity." The intent is right. The approach has a hard ceiling.

Research shows that 33% of trained employees still disclose sensitive information under vishing pressure, even after being explicitly warned about AI voice cloning risks. That's not a training failure in the usual sense. It's a perceptual limitation. The auditory cues that used to reveal synthetic voices (the slight robotic quality, unnatural pauses, flat emotional range) have largely disappeared from current-generation models. There's nothing left to hear.

Training employees to "trust their ears" is telling them to rely on a sense that the technology has already outpaced. Further training doesn't lower that 33% floor once perceptual discrimination fails.

The response has to shift in focus: from trying to detect a fake voice to building processes that hold regardless of how the voice sounds.

5 Myths About AI Voice Cloning That Are Leaving Organizations Exposed

Before getting to the action steps, let's clear away some assumptions that are giving security teams false confidence.

Myth 1: "You need hours of audio to clone someone's voice convincingly."

You need three seconds. That's the McAfee finding for an 85% accuracy match. Researchers at Queen Mary University achieved convincing clones with a few minutes of source recording. The idea that cloning requires long or specially captured recordings is simply outdated.

Myth 2: "Our employees would recognize an unusual request, even from the CEO's voice."

Attackers don't make unusual-sounding requests. They use pretexting to build a believable scenario, urgency to short-circuit rational thinking, and authority to override normal approval instincts. The 2019 UK energy CEO transfer was for a plausible business reason, framed as time-sensitive, and came from a voice the target trusted. Unusual framing would have killed the attack. Convincing framing made it work.

Myth 3: "AI-generated voices still sound slightly robotic. Our people will hear it."

The Queen Mary University study found AI voices are now perceived as more trustworthy than real human voices in some test conditions. The perceptual tells are gone. There is no reliable auditory signal left to catch.

Myth 4: "This only affects large enterprises with public-facing executives."

Any employee who appears in a company video, a conference recording, a LinkedIn post, or a podcast is a potential source voice. Mid-market companies and SMBs are increasingly targeted because their verification processes tend to be less formal and their employees less likely to have received vishing-specific training.

Myth 5: "Strong security awareness training is sufficient protection."

It's necessary, but not sufficient on its own. A September 2025 study from UC San Diego found that cybersecurity training programs as currently implemented by most large organizations "do little to reduce the risk that employees will fall for phishing scams." The programs that produce real risk reduction combine simulations with process controls and technical safeguards, not awareness content alone.

5 Steps Security Teams Must Take Now

These steps work because they don't ask employees to detect a fake voice. They build procedural redundancy that protects against social engineering even when everything sounds completely authentic.

1. Replace voice authentication with out-of-band verification

Any request that involves a financial transaction, credential change, or sensitive data disclosure needs a second verification step that runs through a separate, independently confirmed channel. The rule is simple: hang up, find the number independently from a verified internal directory, and call back. Never use a phone number provided during the suspicious call itself; that number routes back to the attacker.

This one step would have stopped most of the major vishing fraud incidents documented in the past two years.
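The hang-up-and-call-back rule is simple enough to encode as policy rather than leave to judgment. This is an illustrative sketch only: the directory structure, role names, and phone numbers are hypothetical stand-ins for your verified internal directory, which in practice lives in an HR or identity system, not in code.

```python
# Stand-in for a verified internal directory; entries are fictional.
VERIFIED_DIRECTORY = {
    "cfo": "+1-555-0100",
    "it-helpdesk": "+1-555-0101",
}

def callback_number(role: str, number_offered_on_call: str) -> str:
    """Return the number to call back, ignoring whatever the caller offered."""
    verified = VERIFIED_DIRECTORY.get(role)
    if verified is None:
        # No verified entry means no callback path: escalate, don't proceed.
        raise LookupError(f"no verified directory entry for {role!r}")
    # The number supplied during the suspicious call is discarded on
    # principle: if the call is an attack, that number is the attacker's.
    return verified

# The "CFO" on the line offers a number; policy ignores it entirely.
print(callback_number("cfo", "+1-555-9999"))  # always the directory number
```

The key design choice is that the number offered mid-call never enters the decision at all, so there's no path for the attacker to redirect the verification step.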

2. Establish pre-shared code words for high-risk requests

Finance teams, HR, IT help desks, and executives should have pre-established code words for any out-of-the-ordinary requests, particularly wire transfers, MFA resets, and password changes. If the code word isn't present in the call, the request doesn't move forward. The code word can't be guessed from public information, it can't be cloned from audio, and it doesn't require the employee to make a judgment call about what the voice sounds like.
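A code-word check can be implemented as an exact, constant-time comparison: no fuzzy matching, no judgment call, and no timing leak if the check is ever automated. The code word below is a made-up example for illustration; a real one belongs in a secrets manager, rotated regularly, never hard-coded.

```python
import hmac

# Hypothetical pre-shared code word; in practice, store it in a secrets
# manager and rotate it rather than hard-coding it like this.
EXPECTED_CODE_WORD = "osprey-ledger-42"

def code_word_matches(spoken: str) -> bool:
    """Exact match after normalizing case and whitespace. No 'close
    enough', and no judgment about how the voice sounds."""
    normalized = " ".join(spoken.lower().split())
    # compare_digest runs in constant time, so an automated check
    # doesn't leak how much of the guess was correct.
    return hmac.compare_digest(normalized, EXPECTED_CODE_WORD)

print(code_word_matches("Osprey-Ledger-42"))    # True
print(code_word_matches("urgent CEO request"))  # False
```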

3. Require dual-approval workflows for phone-authorized actions

No financial transaction, no credential change, and no sensitive data disclosure should be authorized by a single employee based on a phone call alone. Multi-party approval needs to be built into the workflow at the process level, not just in policy documents that employees are expected to remember under pressure. When two people need to sign off, a single social engineering attempt can't complete the action.
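Dual approval is easiest to enforce when it's structural: the action simply cannot complete with fewer than two distinct approvers. A minimal sketch, with illustrative role names:

```python
def transfer_authorized(approvers: set[str], required: int = 2) -> bool:
    """A set deduplicates automatically, so one person approving twice
    (or one attacker with one compromised identity) still counts once."""
    return len(approvers) >= required

print(transfer_authorized({"cfo"}))                # False: single approver
print(transfer_authorized({"cfo", "controller"}))  # True: two distinct people
```

Using a set rather than a counter is the point: the control defeats a single successful social engineering call by construction, not by policy reminder.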

4. Run regular AI vishing simulations

This is where process and training meet. Static awareness modules and annual video courses don't build the procedural reflexes employees need to respond correctly when an urgent call arrives from a convincing voice. Regular simulations using live adaptive AI phone calls, not scripted recordings, create realistic conditions where employees practice following verification protocols under pressure.

Simulation-based programs consistently produce better outcomes than passive training: organizations running ongoing simulation campaigns see phishing susceptibility drop 50–80%, with the gains concentrated in programs that run continuously rather than once a year. The cadence matters as much as the content.

5. Reduce your organization's audio attack surface

Audit what public-facing audio and video content featuring your executives and high-value employees is available online. Restrict access to internal town halls, training recordings, and all-hands sessions to verified employees only. Establish a clear policy that executives don't participate in unsolicited calls or video meetings without prior confirmation through a verified calendar invite from a known address.

Attackers need source audio to build a convincing clone. Reducing that supply raises the effort required and lowers the fidelity of any clone they manage to create.

5 Top-Rated Platforms for Vishing Attack Prevention and Employee Training

Not every security awareness platform handles vishing the same way. Some offer only email phishing with a basic callback component. Others run live AI voice calls that adapt in real time to what your employee says. The gap between those two things is significant: one trains pattern recognition, the other trains behavioral response under pressure.

Here are the five platforms best suited for organizations building a serious vishing training program.

1. Brightside AI

Brightside AI is a Swiss cybersecurity awareness platform that covers email phishing, vishing, and deepfake simulations in a single admin-controlled suite. Its vishing module uses GenAI to conduct live, adaptive phone conversations, not pre-recorded audio, which means each simulation call responds dynamically to what your employee actually says.

Admins build simulation templates in five steps: attack goal, context, tactics, voice selection, and review. The platform's Recommended Strategy system suggests proven social engineering tactic combinations organized into Foundation, Approach, and Pressure layers, with explanations for the psychological reasoning behind each. You can choose from eight preset AI voices across English, French, German, and Italian, or upload a 1–2 minute recording to create a custom cloned voice for executive impersonation scenarios.

Hybrid attack mode coordinates a phishing email and an AI phone call in a single campaign workflow. A dedicated vishing metrics dashboard tracks answer rate, failed rate, and median call duration across 7, 30, and 90-day windows. When an employee fails a simulation, follow-up training triggers automatically. And before any campaign goes live, you can test the full simulation flow directly in your browser to hear how the call sounds and how the AI adapts in real time.
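To make those dashboard metrics concrete: answer rate, failed rate, and median call duration can each be derived from per-call records. The record shape below is a toy illustration for clarity, not Brightside's actual schema.

```python
from statistics import median

# Toy simulation-call records; field names are illustrative only.
calls = [
    {"answered": True,  "failed": True,  "duration_s": 212},
    {"answered": True,  "failed": False, "duration_s": 95},
    {"answered": False, "failed": False, "duration_s": 0},
    {"answered": True,  "failed": True,  "duration_s": 301},
]

answered = [c for c in calls if c["answered"]]
answer_rate = len(answered) / len(calls)
# Failed rate is computed over answered calls: you can only fail
# a simulation you actually picked up.
failed_rate = sum(c["failed"] for c in answered) / len(answered)
median_duration_s = median(c["duration_s"] for c in answered)

print(f"answer rate: {answer_rate:.0%}")     # 75%
print(f"failed rate: {failed_rate:.0%}")     # 67%
print(f"median call: {median_duration_s}s")  # 212s
```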

2. Mirage Security

Mirage Security focuses specifically on AI vishing simulation, building attack campaigns from OSINT and publicly available data about your organization to create highly targeted scenarios. Its standout feature is Adaptive Training: employees answer a simulated call in their browser and receive immediate, personalized 1:1 coaching on what they missed and why. It's designed to complement existing awareness platforms rather than replace them, useful for organizations that want deep vishing fidelity alongside an existing broader program.

3. Keepnet Labs

Keepnet Labs offers over 2,700 voice phishing scenario templates across 120+ languages, with campaigns that can be launched in approximately five minutes. A published case study from a global bank CISO reported a 92% improvement in employees recognizing fake calls after regular Keepnet campaigns. It covers callback phishing alongside live voice simulation, making it a solid multi-scenario option for voice-focused training programs.

4. Jericho Security

Jericho Security raised $15 million Series A at RSA 2025 and uses AI-driven technology to generate personalized phishing campaigns through a no-code interface. Its simulator supports conversational phishing, meaning multi-message exchanges that build trust before a malicious request, and deepfake phishing using AI-generated audio or video. Voice phishing is available as a separate simulator product. When an employee interacts with a simulated email, they're immediately redirected to targeted training content.

5. Adaptive Security

Adaptive Security is the only cybersecurity startup backed by the OpenAI Startup Fund, having raised $55 million through September 2025. It simulates deepfake voice, video, and messaging attacks with OSINT-driven personalization and real-time risk scoring that surfaces the most vulnerable employees for prioritized coaching. It's a strong fit for large enterprises that need a fully AI-native, multi-channel social engineering simulation platform.

Try our vishing simulator

Experience the most advanced voice phishing simulator built for security teams. Create scenarios, test voice cloning, and explore automation features.

Frequently Asked Questions

How long does it actually take to clone someone's voice?

Based on current research, as little as three seconds of source audio can produce a voice clone with 85% accuracy. A higher-fidelity clone, the kind that can sustain a four-minute phone conversation without detection, can be produced from a few minutes of publicly available audio using commercially available tools and minimal technical expertise. The barrier to entry is lower than most security teams realize.

Can deepfake detection tools catch AI voice calls in real time?

Real-time deepfake audio detection exists in research settings, but no enterprise-grade product reliably catches all current-generation synthetic voices in live call conditions. Existing detectors struggle with diverse datasets and their accuracy drops under adversarial conditions, meaning attackers can tune their output to evade known detection signatures. This is exactly why the primary defensive recommendation shifts from detection to process controls: callback verification, pre-shared code words, and multi-party approval that remain effective even when detection fails.

Our employees have completed security awareness training. Are we protected?

Awareness training significantly reduces risk but doesn't eliminate it. Research shows 33% of trained employees still disclose sensitive information under vishing pressure, even after being explicitly warned about voice cloning risks. That floor doesn't come down with more training because it reflects a perceptual limitation, not a knowledge gap. The programs that produce the best outcomes combine regular AI vishing simulations with protocol-based defenses (verification workflows, code words, and multi-party approval) rather than relying on awareness content alone.

How often should we run vishing simulations?

Quarterly simulations are a reasonable baseline for most organizations, with monthly cadences recommended for higher-risk populations: finance, HR, IT help desks, and executives. The frequency should go up after major personnel changes, company announcements, or any known increase in regional attack activity. The research is consistent on one point: ongoing programs produce far better outcomes than annual one-time drills.

What's the difference between a vishing simulation and a callback phishing simulation?

A callback phishing simulation sends an email directing the target to call a phone number. The "attack" only happens if the employee initiates the call. A vishing simulation involves an inbound AI-generated phone call to the target, conducted in real time with a live adaptive conversation that responds to what the employee actually says. Callback phishing tests whether someone follows a suspicious email's instructions. Vishing simulation trains employees to handle active, voice-based social engineering under realistic time pressure, a meaningfully different skill.

What makes a vishing simulation realistic enough to change behavior?

The most effective simulations use live adaptive AI conversations, not scripted recordings or voicemails. They apply real social engineering tactics (urgency creation, authority impersonation, pretexting, commitment escalation), and they're personalized to the target's role and organizational context. Platforms that support custom voice cloning for executive impersonation create the highest-fidelity conditions. Equally important, simulations should be followed immediately by targeted coaching tied to what the employee did in that specific call, not generic training modules that aren't connected to the experience.

The Organizations That Adapt Now Will Be Better Positioned When the Call Comes

The technology shift here is real, peer-reviewed, and already being exploited at scale. Voice cloning that required expensive hardware and expert knowledge in 2019 now runs on commercial tools that cost almost nothing and require almost no expertise. The indistinguishable threshold isn't a future milestone; it's already behind us.

The organizations that respond well aren't necessarily the ones with the biggest security budgets. They're the ones that update their assumptions first. They've accepted that perceptual detection has a ceiling, and they've built verification processes that don't depend on it. They run simulations that create real pressure, not ones that tick a compliance box. And they treat vishing as a first-order risk, not a footnote at the bottom of a phishing awareness program.

Your employees will eventually get one of these calls. The only variable is whether they'll have practiced what to do before it happens.

If you want to see what that call looks like from your employees' side, Brightside AI's vishing simulator lets you run a live AI voice simulation against your own team, with custom voice cloning, adaptive conversations, and immediate metrics on how your organization responded.