How ElevenLabs Voice AI Is Replacing Screens in Warehouse and Manufacturing Operations

A picking operation is the process of collecting items from storage locations to fulfil customer orders.

It is one of the most labour-intensive activities in logistics, accounting for up to 55% of total warehouse operating costs.

Example of warehouse layout where operators need to pick in multiple locations – (Image by Samir Saci)

For each order, an operator receives a list of items to collect from their storage locations.

They walk to each location, identify the product, pick the right quantity, and confirm the operation before moving to the next line.

In most warehouses, operators rely on RF scanners or handheld tablets to receive instructions and confirm each pick.

What happens when operators need both hands for handling?
How do you onboard operators who don’t read the local language?

Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

Illustration of an operator using voice picking – (Image by Samir Saci)

When I was designing supply chain solutions in logistics companies, vocalisation was the default choice, especially for price-sensitive projects.

In my experience, operators’ productivity with vocalisation can reach 250 boxes/hour for retail and FMCG operations.

The concept is not new. Hardware providers and software editors have offered voice-picking solutions since the early 2000s.

But these systems come with significant constraints:

Proprietary hardware at $2,000 to $5,000 per headset
Vendor-locked software with limited customisation
Long deployment cycles of 3 to 6 months per site
Rigid language support that requires retraining for each new language

For a 50-FTE warehouse, the total investment reaches $150K to $300K, excluding training costs.

It is too expensive for my customers.

What if you could achieve similar results using a smartphone, a custom-made web application, and modern AI voice technology?

In this article, I will show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

Example of screens of this app designed to be used on a smartphone with a vocal interface – (Image by Samir Saci)

This web application has been deployed in the distribution centre of a small supermarket chain with great results (the customer is happy!).

The objective is not to design solutions that compete with market leaders, but rather to offer an alternative to logistics and manufacturing operations that lack the capacity to invest in expensive equipment and want customised solutions.

Problem Statement

Before we get into voice-picking powered by ElevenLabs, let me introduce the logistics operations this AI-powered web application will support.

Layout of the distribution centre – (Image by Samir Saci)

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

Layout of the warehouse with 10 aisles and 12 pallet positions displayed on the app – (Image by Samir Saci)

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:

Each location stores a specific item (called SKU) with a known quantity in boxes.
Operators need to know where to go and what to expect when they arrive.

What is the objective? Boost the operators’ productivity!

The customer was not happy with the order allocation and walking paths provided by their old system.

Solutions used to optimise picking operations for this warehouse – (Image by Samir Saci)

They first asked to reduce operators’ walking distance and boost the number of boxes picked per hour using the solutions presented in this article.

The solution was a web application connected to the Warehouse Management System (WMS) database that guides the operator through the warehouse.

Operators can check their picking list but also detailed information per location – (Image by Samir Saci)

This visual layout provides a real-time view of what we have in the system, with a better routing solution.

Our objective is to go from a productivity of 75 boxes/hour to 200 boxes/hour with:

A better allocation of orders, with spatial clustering and pathfinding to minimise the walking distance per box picked
Voice-picking to guide operators in a flawless manner

How the Picking Flow Works

Before jumping into the vocalisation of the tool, let me introduce the process of order picking.

Three stores sent orders to the warehouse:

Store 1 ordered 3 boxes of Organic Green Tea 500g that are located in Location A1
Store 2 ordered 2 boxes of Earl Grey Tea 250g that are located in Location A3
Store 3 ordered 5 boxes of Arabica Coffee Beans 1kg that are located in Location B2

A picking batch is a group of store orders consolidated into a single work assignment.

The operator will prepare the three orders in a single batch – (Image by Samir Saci)

The system generates a batch with multiple order lines with instructions:

Where to go (the storage location)
What to pick (the SKU reference)
How many boxes to collect
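To make the structure of a batch concrete, here is a minimal sketch of how the three store orders above could be represented. The class names, field names, and the batch identifier are illustrative assumptions, not the schema of the actual application or of the WMS.

```python
from dataclasses import dataclass, field


@dataclass
class PickLine:
    """One order line: where to go, what to pick, how many boxes."""
    store: str
    location: str
    sku: str
    quantity: int  # number of boxes to collect


@dataclass
class PickingBatch:
    """A group of store orders consolidated into one work assignment."""
    batch_id: str
    lines: list[PickLine] = field(default_factory=list)

    def total_boxes(self) -> int:
        return sum(line.quantity for line in self.lines)


# The three orders from the example above ("BATCH-001" is a made-up identifier).
batch = PickingBatch(
    batch_id="BATCH-001",
    lines=[
        PickLine("Store 1", "A1", "Organic Green Tea 500g", 3),
        PickLine("Store 2", "A3", "Earl Grey Tea 250g", 2),
        PickLine("Store 3", "B2", "Arabica Coffee Beans 1kg", 5),
    ],
)

for line in batch.lines:
    print(f"{line.location}: pick {line.quantity} boxes of {line.sku} for {line.store}")
print("Total boxes:", batch.total_boxes())
```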

Picking list (left), layout (middle), details of location (right) – (Image by Samir Saci)

The operator just has to process each line sequentially.

Once they confirm a pick, the system advances to the next instruction.

This sequential flow is critical because it determines the walking path through the warehouse using the optimisation algorithms.
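The optimisation algorithms themselves are not detailed here. As a rough illustration of how a pick sequence can shape the walking path, the sketch below uses a greedy nearest-neighbour heuristic with Manhattan distance. This is a deliberately simple stand-in, not the method used in the application. It assumes location codes of the form aisle letter plus position number (such as B2), a start point at A1, and a grid where the operator can walk freely between any two positions, which ignores racks and aisle ends.

```python
def to_xy(location: str) -> tuple[int, int]:
    """Map a code like 'B2' to grid coordinates (aisle index, position)."""
    aisle, position = location[0].upper(), int(location[1:])
    return ord(aisle) - ord("A"), position


def manhattan(a: str, b: str) -> int:
    """Walking distance between two locations on the assumed open grid."""
    (ax, ay), (bx, by) = to_xy(a), to_xy(b)
    return abs(ax - bx) + abs(ay - by)


def nearest_neighbour_route(locations: list[str], start: str = "A1") -> list[str]:
    """Greedy heuristic: always walk to the closest location not yet visited."""
    remaining = list(locations)
    route, current = [], start
    while remaining:
        nxt = min(remaining, key=lambda loc: manhattan(current, loc))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route


def route_length(route: list[str], start: str = "A1") -> int:
    """Total distance walked from the start point through every stop."""
    stops = [start, *route]
    return sum(manhattan(a, b) for a, b in zip(stops, stops[1:]))


# Hypothetical pick locations, listed in the order the orders were received.
received_order = ["C5", "A2", "B4"]
optimised = nearest_neighbour_route(received_order)
print("Received order:", received_order, "distance", route_length(received_order))
print("Greedy route:  ", optimised, "distance", route_length(optimised))
```

A greedy route is not guaranteed to be the shortest one, but it is enough to show why the order of the lines in the batch matters for the distance walked.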

Example of the original pathfinding solution (bottom) and the optimised (top)

As this is a custom application, we could implement this optimisation without relying on an external editor.

Why build a custom solution? Because it’s cheaper and easier to implement.

Initially, the customer planned to purchase a commercial solution and wanted me to integrate the pathfinding solution.

After investigation, we discovered that it would have been more expensive to integrate the app into the vendor solution than to build something from scratch.

What is the process without the AI-based voice feature?

Manual Mode: The Screen-Based Baseline

In manual mode, the operator reads each instruction on screen and confirms by tapping a button.

Two actions are available at each step:

Confirm Pick: operator collected the right quantity
Report Issue: the location is empty, the quantity doesn’t match, or the product is damaged

Our operator has to press the button to confirm the picking or report an issue – (Image by Samir Saci)

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.

But it keeps the operator’s eyes and one hand tied to the device at every step.

We need to add vocal commands!

Voice Mode: Hands-Free with ElevenLabs

Now that you know why we want the voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Technical architecture of this application – (Image by Samir Saci)

Text-to-Speech: ElevenLabs Reads the Instructions

When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.

Instead of reading “Location A-03-2, pick 4 boxes of SKU-1042” on a screen, the operator hears a natural voice say:

“Location Alpha Three Two. Pick four boxes.”
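Before any audio is generated, the on-screen instruction has to be turned into a sentence that sounds right when spoken. The sketch below shows one way to do that. The function names and the spelling rules are assumptions: letters are read with the NATO alphabet, leading zeros are dropped, and the remaining digits are read one by one.

```python
import string

NATO_WORDS = (
    "Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo Lima "
    "Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey "
    "X-ray Yankee Zulu"
).split()
NATO = dict(zip(string.ascii_uppercase, NATO_WORDS))
DIGITS = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
SMALL_NUMBERS = ["zero", "one", "two", "three", "four", "five", "six",
                 "seven", "eight", "nine", "ten", "eleven", "twelve"]


def spoken_location(code: str) -> str:
    """Turn 'A-03-2' into 'Alpha Three Two'."""
    words = []
    for part in code.split("-"):
        if part.isdigit():
            part = str(int(part))  # drop leading zeros: '03' -> '3'
        for ch in part:
            words.append(DIGITS[int(ch)] if ch.isdigit() else NATO.get(ch.upper(), ch))
    return " ".join(words)


def spoken_instruction(location: str, quantity: int) -> str:
    """Build the sentence that will be sent to the text-to-speech service."""
    # Small quantities are written as words; larger ones are left as digits.
    qty = SMALL_NUMBERS[quantity] if 0 <= quantity < len(SMALL_NUMBERS) else str(quantity)
    unit = "box" if quantity == 1 else "boxes"
    return f"Location {spoken_location(location)}. Pick {qty} {unit}."


print(spoken_instruction("A-03-2", 4))
```

The SKU reference is left out of the spoken sentence on purpose, as in the example above, to keep the instruction short.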

ElevenLabs provides several advantages over basic browser-based TTS:

Natural intonation that is easy to understand in a noisy warehouse
29+ languages available out of the box, with no retraining
Consistent voice quality across all instructions
Sub-second generation for short sentences like pick instructions
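For readers who want to try this, here is a minimal sketch of a text-to-speech request using only the Python standard library. It assumes the public ElevenLabs REST endpoint (a POST to /v1/text-to-speech/ followed by a voice ID, authenticated with an xi-api-key header) and the eleven_multilingual_v2 model. The voice ID and API key are placeholders, and the exact parameters should be checked against the current ElevenLabs documentation. This is not the code of the application itself.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"  # assumed public endpoint


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Prepare the HTTP request without sending it."""
    payload = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=payload,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesise(text: str, voice_id: str, api_key: str) -> bytes:
    """Send the request and return the audio bytes (requires network access)."""
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


# Placeholders only: replace with a real voice ID and API key before calling synthesise().
request = build_tts_request(
    "Location Alpha Three Two. Pick four boxes.",
    voice_id="YOUR_VOICE_ID",
    api_key="YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
```

Separating the request construction from the network call makes it easy to test the payload offline and to cache audio for instructions that repeat often.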

But what about speech recognition?

Speech-to-Text: The Operator Confirms Verbally

After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.

Here, I made a deliberate design choice: rely on both the speech recognition and the reasoning capabilities of ElevenLabs.

Using a single endpoint, we capture the response and match it against expected commands:

“Confirm” or “Done” to validate the pick
“Problem” or “Issue” to flag a discrepancy
“Repeat” to hear the instruction again

The agentic part translates the operator’s feedback and tries to match it to the expected interactions (CONFIRM, ISSUE, or REPEAT).
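To illustrate the target of that matching step, here is a much simpler, English-only keyword matcher that maps a transcript to one of the three expected interactions. It is a stand-in for the agentic matching described above, not a reproduction of it: it does no translation, and the UNKNOWN value and the priority order are assumptions added for the sketch.

```python
import re
from enum import Enum


class Intent(Enum):
    CONFIRM = "CONFIRM"
    ISSUE = "ISSUE"
    REPEAT = "REPEAT"
    UNKNOWN = "UNKNOWN"  # assumption: nothing matched, ask the operator again


# Checked in this order, so an ambiguous sentence never counts as a confirmation.
KEYWORDS = {
    Intent.ISSUE: {"problem", "issue"},
    Intent.REPEAT: {"repeat"},
    Intent.CONFIRM: {"confirm", "done"},
}


def match_intent(transcript: str) -> Intent:
    """Return the first intent whose keywords appear as whole words in the transcript."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    for intent, keywords in KEYWORDS.items():
        if words & keywords:
            return intent
    return Intent.UNKNOWN


for sentence in ["Confirm", "Done!", "There is a problem here", "Repeat please", "Hello?"]:
    print(f"{sentence!r} -> {match_intent(sentence).value}")
```

A matcher like this breaks as soon as operators answer in another language or with a different wording, which is exactly the limitation the ElevenLabs-based approach is meant to remove.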

The complete process from left to right: Step 1 -> Step 2 -> Step 3 – (Image by Samir Saci)

For a multilingual warehouse, this is a significant benefit:

A Czech operator and a Filipino operator can both receive instructions in their native language from the same system, without any hardware change.
I don’t have to account for every possible language in the design of the solution.

Why use ElevenLabs?

For another feature, the inventory cycle count tool presented in this video, I have used n8n with AI agent nodes to perform the same task.

n8n workflow for the voice-powered inventory cycle count tools – (Image by Samir Saci)

This was working quite well, but it required a more complex setup:

Two AI nodes: one for the audio transcription using OpenAI models, and one AI agent to format the output of the transcription
The system prompts were assuming that the operator was speaking English.

I have replaced that with a single ElevenLabs endpoint with multi-lingual capabilities.

Putting both components together, a single pick cycle looks like this:

The Complete Voice Picking Cycle – (Image by Samir Saci)

The app calls ElevenLabs to generate the audio instruction
The operator hears: “Location Alpha Three Two. Pick four boxes.”
The operator walks to the location (hands free, eyes free)
The operator picks the items and says, “Confirm”
The speech recognition endpoint processes the confirmation and moves to the next picking location

The entire interaction takes a few seconds of system time.
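The cycle above can be sketched as a small loop. To keep the example runnable offline, the text-to-speech and speech recognition calls are passed in as plain functions (speak and listen), so fake versions can be used for testing. The function name, the outcome labels, and the limit of three attempts per line are assumptions, not the behaviour of the deployed application.

```python
from typing import Callable, Iterable


def run_voice_session(
    instructions: Iterable[tuple[str, str]],
    speak: Callable[[str], None],
    listen: Callable[[], str],
    max_attempts: int = 3,
) -> list[tuple[str, str]]:
    """Process (location, spoken_text) pairs in order and return a pick log."""
    log = []
    for location, text in instructions:
        outcome = "UNRESOLVED"  # assumption: hand over to manual mode in this case
        for _ in range(max_attempts):
            speak(text)                         # the operator hears the instruction
            command = listen().strip().lower()  # the operator answers verbally
            if command in ("confirm", "done"):
                outcome = "CONFIRMED"
                break
            if command in ("problem", "issue"):
                outcome = "ISSUE"
                break
            # "repeat" or anything unrecognised: say the instruction again
        log.append((location, outcome))
    return log


# Fake text-to-speech and speech recognition for a dry run.
spoken: list[str] = []
replies = iter(["repeat", "confirm", "problem"])
log = run_voice_session(
    [
        ("A-03-2", "Location Alpha Three Two. Pick four boxes."),
        ("B-01-1", "Location Bravo One One. Pick two boxes."),
    ],
    speak=spoken.append,
    listen=lambda: next(replies),
)
print(log)
```

In a real session, speak would play the generated audio and listen would return the recognised command.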

What about the costs?

This is where the comparison with traditional systems becomes striking.

Comparative study – (Image by Samir Saci)

For this mid-size warehouse with 50 FTEs, they estimated that the traditional approach costs roughly $60K to $150K in the first year.

The AI-powered approach costs a few API calls.

The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.

If the voice service fails, the manual mode remains available as a fallback.

This AI-powered approach offers accessibility and speed for organisations that cannot justify a six-figure investment.

What Does That Mean for Operations Managers and Decision Makers?

Voice picking is no longer a technology reserved for the largest 3PLs and retailers with large budgets.

If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days.

It is easy to test it on a real batch to measure the impact before committing any significant budget for productisation.

Three scenarios where this approach makes particular sense:

Multilingual facilities where operators struggle with screen-based instructions in a language that is not their own
Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
High-turnover environments where training time on complex scanning systems directly impacts productivity

What about other processes?

Good news: the same architecture extends beyond picking.

Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.

You can find a live demo of an inventory cycle counting tool here:

How to start this journey?

As you could easily guess, the front end of these applications has been vibecoded using Lovable and Claude Code.

For the backend, if you have limited coding capabilities, I would suggest starting with n8n.

Example of n8n workflows – (Image by Samir Saci)

n8n is a low-code automation platform that lets you connect APIs and AI models using visual workflows.

The initial version of this solution was built with this tool:

I started with a backend connected to a Telegram Bot
Users were playing with the tool using this interface
After validation, we moved that to a web application

This is the easiest way to start, even with limited coding skills.

I share a step-by-step tutorial with free templates to start automating from day 1 in this video:

Let me know what you plan to build using all these nice tools!

About Me

Let’s connect on LinkedIn and Twitter. I am a Supply Chain Engineer who is using data analytics to improve logistics operations and reduce costs.

If you’re looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.
