Voice-PLAN Interactive Examples

Overview

This section provides interactive examples and demonstrations of the voice-PLAN capabilities in the Vision-Language-Action (VLA) system. These examples illustrate how natural language commands are processed through the Whisper integration and executed by the autonomous humanoid robot system.
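
Before diving into the examples, it may help to see what the speech-recognition step looks like in isolation. The sketch below uses the open-source openai-whisper package; the model size and audio file name are placeholder choices, and the production system may wrap this step differently.

import whisper

# Load a small, CPU-friendly model; larger models trade speed for accuracy
model = whisper.load_model("base")
# transcribe() returns a dict; the "text" key holds the transcription
result = model.transcribe("command.wav")
print(result["text"])   # e.g. "Take me to the kitchen"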

Example 1: Simple Navigation Command

Command: "Take me to the kitchen"

Processing Pipeline:

  1. Speech Recognition: Whisper converts speech to text: "Take me to the kitchen"
  2. Language Understanding: LLM-4 identifies intent as navigation request
  3. Location Resolution: System identifies "kitchen" in the environment map
  4. Path Planning: NAVIGATE system plans safe route to kitchen
  5. Execution: Robot moves to kitchen while maintaining safety

Interactive Simulation:

User: "Take me to the kitchen"
↓
Whisper: "Take me to the kitchen" [Confidence: 0.92]
↓
LLM-4: {
  "intent": "navigation",
  "target_location": "kitchen",
  "action_sequence": [
    {"type": "navigate", "destination": "kitchen"}
  ]
}
↓
NAVIGATE: Path planned (3.2m, 4 waypoints)
↓
Robot: Moving to kitchen... [Progress: 65%]
↓
Result: Arrived at kitchen. What would you like me to do next?

Code Implementation:

def handle_navigation_command(audio_input):
    # Step 1: Speech to text
    transcription = whisper_transcribe(audio_input)

    # Step 2: Intent recognition
    intent_data = llm4_process_command(transcription)

    # Step 3: Location resolution
    target_location = resolve_location(intent_data['target_location'])

    # Step 4: Path planning and execution
    if intent_data['intent'] == 'navigation':
        path = navigate_system.plan_path_to(target_location)
        navigate_system.execute_path(path)
        return f"Arrived at {target_location}"
    return "Command was not a navigation request"

Example 2: Object Manipulation Command

Command: "Please pick up the red cup from the table"

Processing Pipeline:

  1. Speech Recognition: Whisper processes: "Please pick up the red cup from the table"
  2. Command Parsing: LLM-4 identifies manipulation intent with object specification
  3. Object Recognition: Vision system identifies "red cup" in environment
  4. Grasp Planning: MANIPULATE system plans optimal grasp
  5. Execution: Robot navigates, grasps, and picks up the object

Interactive Simulation:

User: "Please pick up the red cup from the table"
↓
Whisper: "Please pick up the red cup from the table" [Confidence: 0.88]
↓
LLM-4: {
  "intent": "manipulation",
  "action": "pick_up",
  "object": {
    "color": "red",
    "type": "cup",
    "location": "table"
  }
}
↓
Vision: Found red cup at position [1.2, 0.8, 0.75]
↓
MANIPULATE: Grasp planned (quality: 0.85)
↓
Robot: Approaching red cup... Grasping... Cup picked up successfully!
↓
Result: I've picked up the red cup. Where should I place it?

Code Implementation:

def handle_manipulation_command(command):
    # Step 1: Process command with LLM-4
    command_data = llm4_process_command(command)

    # Step 2: Identify target object
    target_object = vision_system.find_object(
        color=command_data['object']['color'],
        type=command_data['object']['type'],
        location=command_data['object']['location']
    )

    # Step 3: Plan and execute grasp
    if target_object:
        grasp_plan = manipulate_system.plan_grasp(target_object)
        result = manipulate_system.execute_grasp(grasp_plan)
        return f"Successfully picked up the {command_data['object']['color']} {command_data['object']['type']}"
    else:
        return f"Could not find {command_data['object']['color']} {command_data['object']['type']}"

Example 3: Complex Multi-Step Command

Command: "Go to the living room and turn on the lamp next to the sofa"

Processing Pipeline:

  1. Speech Recognition: Whisper captures the multi-step command
  2. Task Decomposition: LLM-4 breaks into navigation and interaction tasks
  3. Environment Analysis: Vision system identifies lamp and sofa relationship
  4. Sequential Execution: NAVIGATE to location, then manipulation action
  5. Result Verification: Confirm lamp is turned on

Interactive Simulation:

User: "Go to the living room and turn on the lamp next to the sofa"
↓
Whisper: "Go to the living room and turn on the lamp next to the sofa" [Confidence: 0.85]
↓
LLM-4: {
  "intent": "complex_task",
  "subtasks": [
    {
      "id": 1,
      "type": "navigation",
      "target": "living room",
      "next": 2
    },
    {
      "id": 2,
      "type": "manipulation",
      "action": "press_switch",
      "object": "lamp",
      "reference": "sofa"
    }
  ]
}
↓
NAVIGATE: Path to living room planned (5.1m)
↓
Robot: Navigating to living room...
↓
Vision: Identified lamp next to sofa at [2.1, 3.2, 0.9]
↓
MANIPULATE: Approach and press switch planned
↓
Robot: Approaching lamp... Switch pressed!
↓
Vision: Confirming lamp status - ON
↓
Result: I've turned on the lamp next to the sofa in the living room.

Code Implementation:

def handle_complex_command(command):
    # Step 1: Decompose the complex command into ordered subtasks
    task_plan = llm4_decompose_task(command)

    # Step 2: Execute subtasks sequentially
    for subtask in task_plan['subtasks']:
        if subtask['type'] == 'navigation':
            navigate_system.go_to(subtask['target'])
        elif subtask['type'] == 'manipulation':
            # Identify the target object relative to its reference
            # (e.g. the lamp next to the sofa)
            target = vision_system.find_object_relative_to(
                object_type=subtask['object'],
                reference_object=subtask['reference']
            )
            manipulate_system.execute_action(subtask['action'], target)

    return "Complex task completed successfully"

Example 4: Conditional Command

Command: "If you see the blue book, bring it to me; otherwise, tell me what you found instead"

Processing Pipeline:

  1. Speech Recognition: Whisper processes conditional logic
  2. Conditional Parsing: LLM-4 identifies if-then-else structure
  3. Object Search: Vision system actively looks for "blue book"
  4. Conditional Execution: Execute appropriate branch based on findings
  5. Result Communication: Report outcome based on condition result

Interactive Simulation:

User: "If you see the blue book, bring it to me; otherwise, tell me what you found instead"
↓
Whisper: "If you see the blue book, bring it to me; otherwise, tell me what you found instead" [Confidence: 0.82]
↓
LLM-4: {
  "intent": "conditional_task",
  "condition": {
    "check": "object_exists",
    "object": {"color": "blue", "type": "book"}
  },
  "if_true": {
    "action": "bring_object",
    "target": "user"
  },
  "if_false": {
    "action": "report_alternative",
    "target": "user"
  }
}
↓
Vision: Scanning for blue book...
↓
Vision: Blue book found at [0.5, 1.8, 0.85]!
↓
MANIPULATE: Grasp planned for blue book
↓
Robot: Approaching blue book... Grasping... Bringing to user...
↓
Result: Here is the blue book you asked for.

Code Implementation:

def handle_conditional_command(command):
    # Step 1: Parse conditional structure
    conditional_plan = llm4_parse_conditional(command)

    # Step 2: Evaluate condition
    condition_result = evaluate_condition(conditional_plan['condition'])

    # Step 3: Execute appropriate branch
    if condition_result:
        result = execute_action(conditional_plan['if_true'])
    else:
        result = execute_action(conditional_plan['if_false'])

    return result


def evaluate_condition(condition):
    if condition['check'] == 'object_exists':
        found_object = vision_system.find_object(
            color=condition['object']['color'],
            type=condition['object']['type']
        )
        return found_object is not None
    return False
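
execute_action is left undefined above. One plausible shape, mirroring the branch payloads in the simulation, is a small dispatcher; the handler bodies and the list_visible_objects call are assumptions:

# Hypothetical dispatcher for the conditional branches. Action names
# mirror the simulation output; the handlers themselves are illustrative.
def execute_action(branch):
    if branch['action'] == 'bring_object':
        target = vision_system.find_object(color="blue", type="book")
        grasp_plan = manipulate_system.plan_grasp(target)
        manipulate_system.execute_grasp(grasp_plan)
        return "Here is the blue book you asked for."
    if branch['action'] == 'report_alternative':
        visible = vision_system.list_visible_objects()   # assumed API
        return f"I could not find the blue book, but I can see: {visible}"
    return f"Unknown action: {branch['action']}"

A fuller implementation would thread the object specification from the condition into the branch handlers instead of hard-coding it here.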

Example 5: Temporal Command

Command: "Wait for me to say 'go', then bring me the newspaper"

Processing Pipeline:

  1. Speech Recognition: Whisper identifies waiting instruction
  2. State Management: System enters listening state for trigger word
  3. Event Detection: Monitor for "go" command while maintaining readiness
  4. Action Execution: After trigger, execute manipulation task
  5. Result Reporting: Confirm completion

Interactive Simulation:

User: "Wait for me to say 'go', then bring me the newspaper"
↓
Whisper: "Wait for me to say 'go', then bring me the newspaper" [Confidence: 0.87]
↓
LLM-4: {
  "intent": "temporal_task",
  "waiting_state": true,
  "trigger": "go",
  "action": "bring_newspaper",
  "on_trigger": {
    "action": "locate_and_grasp",
    "object": "newspaper"
  }
}
↓
System: Waiting for 'go' command... [Listening]
↓
User: "go"
↓
Whisper: "go" [Confidence: 0.95]
↓
System: Trigger received! Locating newspaper...
↓
Vision: Newspaper detected at [1.5, 0.2, 0.75]
↓
MANIPULATE: Grasping newspaper...
↓
Robot: Bringing newspaper to user...
↓
Result: Here is the newspaper as requested.

Code Implementation:

import time

def handle_temporal_command(command):
    # Parse the temporal command
    temporal_plan = llm4_parse_temporal(command)

    if temporal_plan['waiting_state']:
        # Enter the waiting state until the trigger word is heard
        trigger_detected = wait_for_trigger(temporal_plan['trigger'])

        if trigger_detected:
            # Execute the post-trigger action
            target_object = vision_system.find_object(
                type=temporal_plan['on_trigger']['object']
            )
            grasp_plan = manipulate_system.plan_grasp(target_object)
            result = manipulate_system.execute_grasp(grasp_plan)
            return f"Action completed after trigger: {result}"
        return "Timed out waiting for the trigger word"

    return "Temporal command processed"


def wait_for_trigger(trigger_word, timeout=30):
    """Wait for a specific trigger word, giving up after `timeout` seconds."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        audio_input = get_audio_input()
        transcription = whisper_transcribe(audio_input)
        if trigger_word.lower() in transcription.lower():
            return True
    return False
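
get_audio_input is another assumed helper. One way to realize it is with the sounddevice package, recording a short mono chunk at the 16 kHz rate Whisper expects; the 3-second window is an arbitrary choice for this sketch:

# Sketch of get_audio_input using sounddevice (pip install sounddevice).
import sounddevice as sd

def get_audio_input(duration=3.0, sample_rate=16000):
    frames = int(duration * sample_rate)
    audio = sd.rec(frames, samplerate=sample_rate, channels=1,
                   dtype='float32')
    sd.wait()                  # block until the recording finishes
    return audio.flatten()     # float32 array whisper_transcribe can consume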

Interactive Exercise: Design Your Own Voice Command

Exercise Instructions:

Create a voice command for the VLA system that incorporates multiple elements from the examples above. Your command should include:

  1. Action Type: Navigation, manipulation, or both
  2. Object Specification: Color, size, or other identifying features
  3. Conditional Logic: If-then conditions (optional)
  4. Temporal Element: Timing or sequence requirements (optional)

Example Template:

"Please [ACTION] the [COLOR] [OBJECT] in the [LOCATION] and [FOLLOW-UP ACTION]"

Sample User Creation:

Command: "Go to the office and if you find the black laptop on the desk, bring it to the living room; otherwise, just tell me it's not there."

Expected Processing:

  • Navigate to office
  • Look for black laptop on desk
  • If found: bring to living room
  • If not found: report status
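
For reference, a decomposition LLM-4 might produce for this command, following the JSON shapes used in Examples 3 and 4 (illustrative only, not actual system output):

LLM-4: {
  "intent": "conditional_task",
  "subtasks": [
    {"id": 1, "type": "navigation", "target": "office"},
    {
      "id": 2,
      "type": "conditional",
      "condition": {
        "check": "object_exists",
        "object": {"color": "black", "type": "laptop", "location": "desk"}
      },
      "if_true": {"action": "bring_object", "target": "living room"},
      "if_false": {"action": "report_alternative", "target": "user"}
    }
  ]
}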

Voice Command Best Practices

Clear Speech Guidelines:

  • Speak at normal volume and pace
  • Use clear pronunciation
  • State commands directly (e.g., "Go to kitchen" vs. "Could you go to the kitchen?")

Effective Command Structure:

  • Be specific about objects and locations
  • Use spatial references when needed ("the cup on the left")
  • Break complex tasks into simpler commands if needed

Error Recovery:

  • If misunderstood, the system will ask for clarification
  • Commands can be modified or canceled during execution
  • Safety overrides always take precedence over voice commands
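
The simulations above attach a confidence score to each transcription, and clarification requests can key off that score. A minimal sketch, assuming a hypothetical whisper_transcribe_with_confidence helper and an arbitrary 0.7 threshold:

# Confidence-gated clarification. The threshold value and the
# whisper_transcribe_with_confidence helper are assumptions.
CONFIDENCE_THRESHOLD = 0.7

def transcribe_or_clarify(audio_input):
    text, confidence = whisper_transcribe_with_confidence(audio_input)
    if confidence < CONFIDENCE_THRESHOLD:
        # Low confidence: ask the user to repeat instead of guessing
        return None, "I didn't catch that. Could you repeat the command?"
    return text, None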

Advanced Voice-PLAN Features

Context-Aware Processing:

The system maintains conversation context to handle referential commands like "it" or "that one" based on previous interactions.
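
A minimal sketch of such context tracking, assuming a single "most recently mentioned object" slot (real systems track much richer state):

# Illustrative conversational context: remember the last object mentioned
# so pronouns like "it" can be resolved on the next command.
class ConversationContext:
    def __init__(self):
        self.last_object = None

    def note_object(self, obj):
        self.last_object = obj

    def resolve_reference(self, phrase):
        if phrase.lower() in ("it", "that", "that one") and self.last_object:
            return self.last_object
        return phrase   # not a pronoun; use the phrase as-is

ctx = ConversationContext()
ctx.note_object("red cup")
print(ctx.resolve_reference("it"))   # -> "red cup"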

Multi-Language Support:

Future implementations will support multiple languages with appropriate language models.

Speaker Recognition:

Advanced systems can adapt to different users' speech patterns and preferences.

These interactive examples demonstrate how the voice-PLAN system within the VLA architecture transforms natural language commands into coordinated robotic actions while maintaining safety and contextual awareness.