A customer calls your contact center and reads out their credit card number to an agent. A prospect joins a sales call and shares their name, work email, and company. Both conversations are recorded and transcribed, and both now contain sensitive personal data sitting in plain text in your database.
PII redaction in speech-to-text is the process of automatically detecting and replacing personally identifiable information (such as names, addresses, phone numbers, and financial data) in audio transcripts before they are stored or processed. It strips sensitive data at the source, so the transcripts you store and analyze are compliant by default.
What counts as PII?
PII (Personally Identifiable Information) is any data that can be used, alone or in combination, to identify a specific individual. In the context of audio transcription, PII goes far beyond just names and email addresses. It spans financial records, health data, government IDs, and more.
Here’s a breakdown of the main categories:
| Category | Examples |
| --- | --- |
| Personal identifiers | Names (given, family), date of birth, Social Security Numbers and international equivalents (Canadian SIN, German Sozialversicherungsnummer, French NIR), passport numbers, driver's license numbers |
| Contact information | Email addresses, phone numbers, mailing addresses (down to street level), IP addresses |
| Financial data (PCI) | Credit card numbers, CVV codes, bank account numbers, IBANs |
| Health information (PHI) | Medical conditions, insurance IDs, prescription information |
| Employment & education data | Employee IDs, student numbers, payroll data |
Why PII redaction matters
Recording and transcribing conversations without redacting sensitive information introduces serious legal, financial, and reputational risk.
Regulatory compliance
Several major compliance regulations require businesses to protect personal data:
- GDPR (Europe) — mandates strict handling of personal data, with heavy fines for non-compliance.
- HIPAA (US healthcare) — requires safeguarding protected health information (PHI).
- PCI DSS — governs how credit card data must be handled and stored.
- CCPA (California) — gives consumers control over their personal data.
If transcripts contain raw PII, they fall under these regulations — increasing your compliance burden.
Security risk
Unredacted transcripts are a high-value target. If your database is compromised, attackers gain access to names, financial information, and other sensitive data in a readable format.
Redacting PII at transcription time sharply limits what an attacker can learn from a breach, because the sensitive values were never written to storage.
Data minimization
Modern privacy frameworks emphasize data minimization: collecting and storing only what you truly need. If your analytics don’t require PII values like raw credit card numbers or full addresses, there’s no reason to keep them.
How PII redaction works in speech-to-text
Modern speech-to-text systems integrate PII detection directly into the transcription pipeline.
Here’s how it typically works:
- Entity detection: The model identifies sensitive entities (names, card numbers, emails, etc.) using NER (Named Entity Recognition) and pattern recognition.
- Classification: The detected entity is categorized (e.g., credit card, phone number, person name).
- Replacement strategy: The system replaces the detected text using a defined redaction method:
  - Full removal
  - Category tagging (e.g., [CREDIT_CARD])
  - Masking (e.g., **** **** **** 1234)
- Secure output: Only the redacted data is stored or returned via API.
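The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular speech-to-text vendor implements redaction: the two regular expressions and the function names detect_entities and redact are assumptions made for the example, and a real system would pair patterns like these with an NER model to catch names and other free-form entities.

```python
import re

# Hypothetical patterns, for illustration only. They cover two entity
# categories; real pipelines combine patterns with an NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def detect_entities(text):
    """Entity detection and classification: return (start, end, category)."""
    entities = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), category))
    return sorted(entities)


def redact(text):
    """Replacement strategy: swap each detected span for its category tag."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, category in reversed(detect_entities(text)):
        redacted = redacted[:start] + f"[{category}]" + redacted[end:]
    return redacted


raw = "Send the receipt to jane@example.com, the card is 4242 4242 4242 4242."
safe = redact(raw)  # Secure output: only `safe` is stored or returned.
print(safe)
# Send the receipt to [EMAIL], the card is [CREDIT_CARD].
```

Replacing spans from the end of the string backwards avoids recomputing offsets after each substitution.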
Redaction vs masking
There are two main approaches:
1. Marker replacement
This is where you replace PII with a label indicating the PII category.
Original transcript:
"My name is Sarah Johnson and my credit card number is 4242 4242 4242 4242."
Redacted transcript (marker):
"My name is [PERSON_NAME] and my credit card number is [CREDIT_CARD]."
This is ideal for analytics while preserving structure.
2. Partial masking
Sensitive information is partially hidden.
Original transcript:
“My card number is 4242 4242 4242 4242.”
Masked transcript:
“My card number is **** **** **** 4242.”
This is useful when partial visibility is operationally necessary (e.g., verification flows).
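To make the difference concrete, here is a small Python sketch that applies both approaches to the same sentence. The card pattern and the helper names tag_card and mask_card are assumptions for this example; the pattern only matches 16-digit numbers written in groups of four.

```python
import re

# Hypothetical pattern: 16 digits in groups of four, optionally separated.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def tag_card(match):
    """Marker replacement: drop the value, keep only the category."""
    return "[CREDIT_CARD]"


def mask_card(match):
    """Partial masking: hide every digit except the last four."""
    card = match.group(0)
    total_digits = sum(ch.isdigit() for ch in card)
    seen = 0
    masked = []
    for ch in card:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > total_digits - 4 else "*")
        else:
            masked.append(ch)  # keep spaces and dashes so the shape survives
    return "".join(masked)


original = "My card number is 4242 4242 4242 4242."
print(CARD_PATTERN.sub(tag_card, original))
# My card number is [CREDIT_CARD].
print(CARD_PATTERN.sub(mask_card, original))
# My card number is **** **** **** 4242.
```

Note that the masked form still exposes four real digits, so it should only reach systems that genuinely need them.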
When should you enable PII redaction?
Enable PII redaction whenever transcripts cross system boundaries. That’s the rule. If your speech-to-text output is:
- Stored in a database
- Sent to a CRM
- Indexed in search
- Passed to an LLM
- Used for embeddings
- Retained for QA or training
Then PII will propagate unless removed at the source. Here’s where this becomes critical:
- Contact centers: Agents collect card numbers, account IDs, DOBs. If transcripts are logged unredacted, your analytics stack, BI tools, and logging systems now contain regulated data.
- Sales and CS calls: Transcripts are often auto-pushed into CRM systems and summarization workflows. Once there, PII spreads across enrichment tools, exports, and dashboards.
- LLM pipelines: If transcripts are embedded or used for fine-tuning, unredacted PII can end up inside vector stores or model training data. That’s difficult to unwind later.
- Regulated industries (healthcare, fintech, insurance): Even incidental mentions of identifiers may place your storage systems under HIPAA, PCI DSS, or GDPR scope.
If transcripts are ephemeral and never stored, redaction may not be required. However, if transcripts are persisted, even temporarily, redaction should be the default.
The question is not “Do we handle PII?” It’s “Do we want raw identifiers permanently embedded in our data infrastructure?”
PII redaction with Gladia
At Gladia, every layer of the pipeline is designed to reduce risk and make deployment viable in highly regulated industries like healthcare, finance, insurance, and public sector environments.
Gladia’s PII Redaction detects and replaces sensitive entities in transcripts, so private data doesn’t leak into your outputs. Available for pre-recorded transcription.
How it works
You add two fields to your transcription request:
- pii_redaction: true, which enables the feature
- pii_redaction_config, which controls what gets redacted and how
Gladia runs NER on the transcript, detects entities matching your config, and replaces them in the output. The redacted text shows up in full_transcript, utterances, sentences, subtitles,...
| processed_text_type | Example output | When to use |
| --- | --- | --- |
| MARKER (default) | [NAME_1], [EMAIL_1] | Tracking references across a transcript; good for LLM tasks |
| MASK | #### ##### | Full character-level obfuscation |
💡 MARKER is smart about consistency: the same entity (e.g. "John Smith") always gets the same marker ID ([NAME_1]) across the entire transcript. Great for reasoning downstream.
Entity types
You can pass presets (regulation shortcuts) or individual entity types in entity_types.
Presets
The available presets are: GDPR, GDPR_SENSITIVE, HIPAA_SAFE_HARBOR, CPRA, QUEBEC_PRIVACY_ACT, APPI, APPI_SENSITIVE, PCI, HEALTH_INFORMATION.
To check which entities each preset redacts, see the documentation.
Presets are the easiest way to get compliant quickly. Use individual types when you want fine-grained control.
Request with config
{
  "audio_url": "https://example.com/your-audio.mp3",
  "pii_redaction": true,
  "pii_redaction_config": {
    "entity_types": ["GDPR"],
    "processed_text_type": "MARKER"
  }
}
Output example
Raw transcript:
Hi, I'm calling about the order for John Smith. Can you confirm the delivery to john.smith@company.com? Yes, John Smith placed it yesterday.
With MASK:
Hi, I'm calling about the order for #### #####. Can you confirm the delivery to ######################? Yes, #### ##### placed it yesterday.
With MARKER:
Hi, I'm calling about the order for [NAME_1]. Can you confirm the delivery to [EMAIL_1]? Yes, [NAME_1] placed it yesterday.
Note how [NAME_1] is reused — the same entity, same marker.
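If it helps to see the two output styles as logic, the Python sketch below imitates them on the raw transcript above. It is an illustration of the behavior described here, not Gladia's implementation: the entity list is written by hand (in the real pipeline it comes from NER), and apply_markers, apply_mask, and the helper _entity_pattern are names invented for the example.

```python
import re


def _entity_pattern(values):
    # Longest values first, so a short entity never splits a longer one.
    ordered = sorted(values, key=len, reverse=True)
    return re.compile("|".join(re.escape(value) for value in ordered))


def apply_markers(text, entities):
    """MARKER style: each distinct entity value gets one stable [TYPE_N] tag."""
    if not entities:
        return text
    marker_for = {}  # entity value -> marker
    counters = {}    # entity type -> last index handed out
    for value, entity_type in entities:
        if value not in marker_for:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            marker_for[value] = f"[{entity_type}_{counters[entity_type]}]"
    return _entity_pattern(marker_for).sub(lambda m: marker_for[m.group(0)], text)


def apply_mask(text, entities):
    """MASK style: every non-space character of an entity becomes '#'."""
    if not entities:
        return text
    values = {value for value, _ in entities}
    return _entity_pattern(values).sub(
        lambda m: re.sub(r"\S", "#", m.group(0)), text
    )


raw = ("Hi, I'm calling about the order for John Smith. Can you confirm the "
       "delivery to john.smith@company.com? Yes, John Smith placed it yesterday.")
# Entities as a detector might report them: (value, type). Hand-written here.
entities = [("John Smith", "NAME"), ("john.smith@company.com", "EMAIL"),
            ("John Smith", "NAME")]

print(apply_markers(raw, entities))
print(apply_mask(raw, entities))
```

Keying the marker table on the entity value is what makes the second "John Smith" resolve to [NAME_1] again instead of getting a new ID.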
Code samples
Below are code samples. For full configuration details, see the documentation.
Python:
import os
from time import sleep

import requests
from dotenv import load_dotenv

# Load GLADIA_API_KEY from a local .env file.
load_dotenv(override=True)

headers = {
    "x-gladia-key": os.getenv("GLADIA_API_KEY", ""),
    "accept": "application/json",
}

# Transcription request with PII redaction enabled (GDPR preset, MASK output).
data = {
    "audio_url": "https://www.youtube.com/watch?v=NhAwCo2wX38",
    "sentences": True,
    "language_config": {"code_switching": False, "languages": ["en"]},
    "pii_redaction": True,
    "pii_redaction_config": {
        "entity_types": ["GDPR"],
        "processed_text_type": "MASK",
    },
}

print("Sending request...")
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded/",
    headers=headers,
    json=data,
)
post_data = response.json()
print("Job created:", post_data)

# Poll the result URL once per second until the job finishes or fails.
result_url = post_data.get("result_url")
if result_url:
    while True:
        poll = requests.get(result_url, headers=headers).json()
        status = poll.get("status")
        print("Status:", status)
        if status == "done":
            print("Result:", poll.get("result"))
            break
        elif status == "error":
            print("Error:", poll)
            break
        sleep(1)
TypeScript:
import * as dotenv from "dotenv";

// Load GLADIA_API_KEY from a local .env file.
dotenv.config({ override: true });

const GLADIA_API_KEY = process.env.GLADIA_API_KEY || "";

const headers: Record<string, string> = {
  "x-gladia-key": GLADIA_API_KEY,
  accept: "application/json",
  "content-type": "application/json",
};

// Small fetch wrapper that sends JSON and returns the parsed response body.
async function makeRequest(
  url: string,
  method: "GET" | "POST" = "GET",
  data?: object
): Promise<any> {
  const options: RequestInit = {
    method,
    headers,
    ...(data ? { body: JSON.stringify(data) } : {}),
  };
  const response = await fetch(url, options);
  return response.json();
}

// Transcription request with PII redaction enabled (GDPR preset, MASK output).
const data = {
  audio_url: "https://www.youtube.com/watch?v=NhAwCo2wX38",
  sentences: true,
  language_config: { code_switching: false, languages: ["en"] },
  pii_redaction: true,
  pii_redaction_config: {
    entity_types: ["GDPR"],
    processed_text_type: "MASK",
  },
};

async function main() {
  console.log("- Sending request to Gladia API...");
  const postResponse = await makeRequest(
    "https://api.gladia.io/v2/pre-recorded/",
    "POST",
    data
  );
  console.log("Post response with Transcription ID:", postResponse);

  // Poll the result URL once per second until the job finishes or fails.
  const resultUrl: string | undefined = postResponse.result_url;
  if (resultUrl) {
    while (true) {
      console.log("Polling for results...");
      const pollResponse = await makeRequest(resultUrl);
      if (pollResponse.status === "done") {
        console.log("- Transcription done:\n");
        console.log(pollResponse.result);
        break;
      } else if (pollResponse.status === "error") {
        console.log("- Transcription failed");
        console.log(pollResponse);
        break;
      } else {
        console.log("Transcription status:", pollResponse.status);
        await new Promise((resolve) => setTimeout(resolve, 1000));
      }
    }
  }
  console.log("- End of work");
}

main();
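Both samples poll until the job reports "done" or "error". In production you will probably want an upper bound on that wait. The Python sketch below is one possible shape for it, under stated assumptions: poll_until_done is a hypothetical helper, the 300-second timeout is arbitrary, and the "queued" and "processing" values are placeholders for whatever in-progress statuses the API returns. The HTTP call is passed in as a function so the sketch runs offline.

```python
import time


def poll_until_done(fetch, interval=1.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the job is done, fails, or the timeout elapses.

    fetch is any callable returning the parsed JSON of the result URL,
    e.g. lambda: requests.get(result_url, headers=headers).json()
    """
    deadline = clock() + timeout
    while True:
        payload = fetch()
        status = payload.get("status")
        if status == "done":
            return payload.get("result")
        if status == "error":
            raise RuntimeError(f"transcription failed: {payload}")
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)


# Stand-in for the HTTP call; the in-progress status names are placeholders.
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "done", "result": {"ok": True}},
])
result = poll_until_done(lambda: next(responses), interval=0.0)
print(result)
```

Raising on "error" and on timeout, rather than printing and breaking, lets the caller decide whether to retry or alert.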
Best practices for PII-safe voice applications
Redaction is necessary, but insufficient on its own. PII safety is about limiting how far sensitive data can travel inside your system.
A production-grade approach includes:
Redact before persistence: Never rely on batch cleanup jobs. Once raw transcripts are written to logs, caches, or storage, they’re already replicated.
Control transcript fan-out: Be explicit about which services receive transcript data. If your architecture includes:
- LLM summarization
- Embedding pipelines
- CRM sync
- Data warehouse exports
Ensure those consumers only receive redacted text.
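One way to enforce this in code is to give redacted text its own type and route every downstream delivery through a single exit point that rejects anything else. The Python sketch below shows the idea; RedactedTranscript, TranscriptFanOut, and the consumer names are all hypothetical, and the wrapper only records that redaction happened upstream, it does not perform it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactedTranscript:
    """Wrapper marking text that has already been through redaction."""
    text: str


class TranscriptFanOut:
    """Single exit point for transcript data; refuses anything unredacted."""

    def __init__(self):
        self._consumers = {}

    def register(self, name, handler):
        self._consumers[name] = handler

    def publish(self, transcript):
        if not isinstance(transcript, RedactedTranscript):
            raise TypeError("only RedactedTranscript may be sent downstream")
        for handler in self._consumers.values():
            handler(transcript.text)


received = []
fan_out = TranscriptFanOut()
fan_out.register("crm_sync", lambda text: received.append(("crm_sync", text)))
fan_out.register("embeddings", lambda text: received.append(("embeddings", text)))

fan_out.publish(RedactedTranscript("Order placed by [NAME_1]."))

try:
    fan_out.publish("Order placed by John Smith.")  # raw string is rejected
except TypeError as error:
    print("blocked:", error)
```

The explicit registry also doubles as documentation of which services receive transcript data.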
Avoid mixing raw and processed storage: Keep raw audio (if retained) isolated from processed transcript data. Shared buckets or indexes increase accidental exposure.
Minimize retention windows: If transcripts are only needed for short-term QA, implement automated deletion. Indefinite storage compounds risk without adding value.
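As a rough sketch of what automated deletion can look like, the Python snippet below selects transcripts that have outlived a retention window. The 30-day window, the record layout, and the function name expired are assumptions for the example; in practice the selection would run as a scheduled job against your own store, or be replaced by the store's built-in expiry if it has one.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed window; set it from your QA policy


def expired(records, now, retention=RETENTION):
    """Return the IDs of transcripts older than the retention window."""
    cutoff = now - retention
    return [record["id"] for record in records if record["created_at"] < cutoff]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    {"id": "t1", "created_at": datetime(2024, 12, 15, tzinfo=timezone.utc)},
    {"id": "t2", "created_at": datetime(2025, 1, 20, tzinfo=timezone.utc)},
]
to_delete = expired(records, now)
print(to_delete)
# ['t1']
```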
Restrict access by role, not convenience: Engineers building dashboards typically don’t need raw transcript access. Apply RBAC deliberately.
Encrypt by default: TLS in transit. Encryption at rest. No plaintext transcript logs.
The goal is not just compliance; it’s reducing blast radius. If a storage bucket is exposed, or an API token leaks, the difference between a major breach and a minor incident is whether raw identifiers were ever stored there to begin with.
Frequently asked questions
Does PII redaction affect transcription accuracy?
No. Redaction runs on the transcript text, after the speech has been transcribed and the entities detected, so the underlying transcription quality remains unchanged.
Can I choose which types of PII to redact?
Yes. Most advanced APIs allow configurable redaction, enabling or disabling specific entity categories. With Gladia, you pass presets or individual entity types in entity_types.
Is redaction reversible?
No. In properly designed systems, redaction is one-way. The original data is not stored alongside the redacted output.
Final thoughts
As voice interfaces become central to modern applications, protecting personal data is not optional; it’s foundational. PII redaction keeps raw identifiers out of your speech-to-text infrastructure and supports alignment with modern privacy standards from day one.
If you want to see how it works in practice, try Gladia’s PII redaction in your workflow. We’d love to hear how it performs on your data.