How To Design Effective Conversational AI Experiences: A Comprehensive Guide

Conversational AI is revolutionizing information access, offering a personalized, intuitive search experience that delights users and empowers businesses. A well-designed conversational agent acts as a knowledgeable guide, understanding user intent and effortlessly navigating vast data, which leads to happier, more engaged users, fostering loyalty and trust. Meanwhile, businesses benefit from increased efficiency, reduced costs, and a stronger bottom line. On the other hand, a poorly designed system can lead to frustration, confusion, and, ultimately, abandonment.

Achieving success with conversational AI requires more than just deploying a chatbot. To truly harness this technology, we must master the intricate dynamics of human-AI interaction. This involves understanding how users articulate needs, explore results, and refine queries, paving the way for a seamless and effective search experience.

This article will decode the three phases of conversational search, the challenges users face at each stage, and the strategies and best practices AI agents can employ to enhance the experience.

The Three Phases Of Conversational Search

To analyze these complex interactions, Trippas et al. (2018) (PDF) proposed a framework that outlines three core phases in the conversational search process:

  1. Query formulation: Users express their information needs, often facing challenges in articulating them clearly.
  2. Search results exploration: Users navigate through presented results, seeking further information and refining their understanding.
  3. Query re-formulation: Users refine their search based on new insights, adapting their queries and exploring different avenues.

Building on this framework, Azzopardi et al. (2018) (PDF) identified five key user actions within these phases: reveal, inquire, navigate, interrupt, interrogate, and the corresponding agent actions — inquire, reveal, traverse, suggest, and explain.

In the following sections, I’ll break down each phase of the conversational search journey, delving into the actions users take and the corresponding strategies AI agents can employ, as identified by Azzopardi et al. (2018) (PDF). I’ll also share actionable tactics and real-world examples to guide the implementation of these strategies.

Phase 1: Query Formulation: The Art Of Articulation

In the initial phase of query formulation, users attempt to translate their needs into prompts. This process involves conscious disclosures — sharing details they believe are relevant — and unconscious non-disclosure — omitting information they may not deem important or struggle to articulate.

This process is fraught with challenges. As Jakob Nielsen aptly pointed out,

“Articulating ideas in written prose is hard. Most likely, half the population can’t do it. This is a usability problem for current prompt-based AI user interfaces.”

— Jakob Nielsen

This can manifest as:

  • Vague language: “I need help with my finances.”
    Budgeting? Investing? Debt management?
  • Missing details: “I need a new pair of shoes.”
    What type of shoes? For what purpose?
  • Limited vocabulary: Not knowing the right technical terms. “I think I have a sprain in my ankle.”
    The user might not know the difference between a sprain and a strain or the correct anatomical terms.

These challenges can lead to frustration for users and less relevant results from the AI agent.

AI Agent Strategies: Nudging Users Towards Better Input

To bridge the articulation gap, AI agents can employ three core strategies:

  1. Elicit: Proactively guide users to provide more information.
  2. Clarify: Seek to resolve ambiguities in the user’s query.
  3. Suggest: Offer alternative phrasing or search terms that better capture the user’s intent.

The key to effective query formulation is balancing elicitation and assumption. Overly aggressive questioning can frustrate users, and making too many assumptions can lead to inaccurate results.

For example,

User: “I need a new phone.”

AI: “What’s your budget? What features are important to you? What size screen do you prefer? What carrier do you use?...”

This rapid-fire questioning can overwhelm the user and make them feel like they're being interrogated. A more effective approach is to start with a few open-ended questions and gradually elicit more details based on the user’s responses.

As Azzopardi et al. (2018) (PDF) stated in the paper,

“There may be a trade-off between the efficiency of the conversation and the accuracy of the information needed as the agent has to decide between how important it is to clarify and how risky it is to infer or impute the underspecified or missing details.”

Implementation Tactics And Examples

  • Probing questions: Ask open-ended or clarifying questions to gather more details about the user’s needs. For example, Perplexity Pro uses probing questions to elicit more details about the user’s needs for gift recommendations.

For example, after clicking one of the initial prompts, “Create a personal webpage,” ChatGPT added another sentence, “Ask me 3 questions first on whatever you need to know,” to elicit more details from the user.

  • Interactive refinement: Utilize visual aids like sliders, checkboxes, or image carousels to help users specify their preferences without articulating everything in words. For example, Adobe Firefly’s side settings allow users to adjust their preferences.

  • Suggested prompts: Provide examples of more specific or detailed queries to help users refine their search terms. For example, Nielsen Norman Group provides an interface that offers a suggested prompt to help users refine their initial query.

For example, after clicking one of the initial prompts in Gemini, “Generate a stunning, playful image,” more details are added in blue in the input.

  • Offering multiple interpretations: If the query is ambiguous, present several possible interpretations and let the user choose the most accurate one. For example, Gemini offers a list of gift suggestions for the query “gifts for my friend who loves music,” categorized by the recipient’s potential music interests to help the user pick the most relevant one.

Phase 2: Search Results Exploration: A Multifaceted Journey

Once the query is formed, the focus shifts to exploration. Users embark on a multifaceted journey through search results, seeking to understand their options and make informed decisions.

Two primary user actions mark this phase:

  1. Inquire: Users actively seek more information, asking for details, comparisons, summaries, or related options.
  2. Navigate: Users navigate the presented information, browse through lists, revisit previous options, or request additional results. This involves scrolling, clicking, and using voice commands like “next” or “previous.”

AI Agent Strategies: Facilitating Exploration And Discovery

To guide users through the vast landscape of information, AI agents can employ these strategies:

  1. Reveal: Present information that caters to diverse user needs and preferences.
  2. Traverse: Guide the user through the information landscape, providing intuitive navigation and responding to their evolving interests.

During discovery, it’s vital to avoid information overload, which can overwhelm users and hinder their decision-making. For example,

User: “I’m looking for a place to stay in Tokyo.”

AI: Provides a lengthy list of hotels without any organization or filtering options.

Instead, AI agents should offer the most relevant results and allow users to filter or sort them based on their needs. This might include presenting a few top recommendations based on ratings or popularity, with options to refine the search by price range, location, amenities, and so on.

Additionally, AI agents should understand natural language navigation. For example, if a user asks, “Tell me more about the second hotel,” the AI should provide additional details about that specific option without requiring the user to rephrase their query. This level of understanding is crucial for flexible navigation and a seamless user experience.
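
As a small illustration of what this takes under the hood, here is a minimal, purely hypothetical sketch of resolving an ordinal reference such as “the second hotel” against the results shown in the previous turn. A production agent would rely on the model itself or a much richer dialogue state; every name below is illustrative.

#python

# A minimal, hypothetical sketch: resolving ordinal references ("the second hotel")
# against the list of results shown in the previous turn.
ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

def resolve_reference(user_message, last_results):
  """Return the previously shown item the user refers to, or None."""
  message = user_message.lower()
  for word, index in ORDINALS.items():
    if word in message and index < len(last_results):
      return last_results[index]
  return None

# Hypothetical results from the previous turn
hotels = ["Hotel A", "Hotel B", "Hotel C"]
print(resolve_reference("Tell me more about the second hotel", hotels))  # -> Hotel B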

Implementation Tactics And Examples

  • Diverse formats: Offer results in various formats (lists, summaries, comparisons, images, videos) and allow users to specify their preferences. For example, Gemini presents a summarized format of hotel information, including a photo, price, rating, star rating, category, and brief description to allow the user to evaluate options quickly for the prompt “I’m looking for a place to stay in Paris.”

  • Context-aware navigation: Maintain conversational context, remember user preferences, and provide relevant navigation options. For example, following the previous example prompt, Gemini reminds users of the potential next steps at the end of the response.

  • Interactive exploration: Use carousels, clickable images, filter options, and other interactive elements to enhance the exploration experience. For example, Perplexity offers a carousel of images related to “a vegetarian diet” and other interactive elements like “Watch Videos” and “Generate Image” buttons to enhance exploration and discovery.

  • Multiple responses: Present several variations of a response. For example, users can see multiple draft responses to the same query by clicking the “Show drafts” button in Gemini.

  • Flexible text length and tone: Enable users to customize the length and tone of AI-generated responses to better suit their preferences. For example, Gemini provides multiple options for welcome messages, offering varying lengths, tones, and degrees of formality.

Phase 3: Query Re-formulation: Adapting To Evolving Needs

As users interact with results, their understanding deepens, and their initial query might not fully capture their evolving needs. During query re-formulation, users refine their search based on exploration and new insights, often involving interrupting and interrogating. Query re-formulation empowers users to course-correct and refine their search.

  • Interrupt: Users might pause the conversation to:
    • Correct: “Actually, I meant a desktop computer, not a laptop.”
    • Add information: “I also need it to be good for video editing.”
    • Change direction: “I’m not interested in those options. Show me something else.”
  • Interrogate: Users challenge the AI to ensure it understands their needs and justify its recommendations:
    • Seek understanding: “What do you mean by ‘good battery life’?”
    • Request explanations: “Why are you recommending this particular model?”

AI Agent Strategies: Adapting And Explaining

To navigate the query re-formulation phase effectively, AI agents need to be responsive, transparent, and proactive. Two core strategies help here:

  1. Suggest: Proactively offer alternative directions or options to guide the user towards a more satisfying outcome.
  2. Explain: Provide clear and concise explanations for recommendations and actions to foster transparency and build trust.

AI agents should balance suggestions with relevance, explaining why certain options are suggested while avoiding overwhelming users with unrelated suggestions that increase conversational effort. A bad example would be the following:

User: “I want to visit Italian restaurants in New York.”

AI: Suggests unrelated options, like Mexican or American restaurants, even though the user is interested in Italian cuisine.

This could frustrate the user and reduce trust in the AI.

A better answer could be, “I found these highly-rated Italian restaurants. Would you like to see more options based on different price ranges?” This ensures users understand the reasons behind recommendations, enhancing their satisfaction and trust in the AI's guidance.

Implementation Tactics And Examples

  • Transparent system process: Show the steps involved in generating a response. For example, Perplexity Pro outlines the search process step by step to fulfill the user’s request.

  • Explainable recommendations: Clearly state the reasons behind specific recommendations, referencing user preferences, historical data, or external knowledge. For example, ChatGPT includes recommended reasons for each listed book in response to the question “books for UX designers.”

  • Source reference: Enhance the answer with source references to strengthen the evidence supporting the conclusion. For example, Perplexity presents source references to support the answer.

  • Point-to-select: Users should be able to directly select specific elements or locations within the dialogue for further interaction rather than having to describe them verbally. For example, users can select part of an answer and ask a follow-up in Perplexity.

  • Proactive recommendations: Suggest related or complementary items based on the user’s current selections. For example, Perplexity offers a list of related questions to guide the user’s exploration of “a vegetarian diet.”

Overcoming LLM Shortcomings

While the strategies discussed above can significantly improve the conversational search experience, LLMs still have inherent limitations that can hinder their intuitiveness. These include the following:

  • Hallucinations: Generating false or nonsensical information.
  • Lack of common sense: Difficulty understanding queries that require world knowledge or reasoning.
  • Sensitivity to input phrasing: Producing different responses to slightly rephrased queries.
  • Verbosity: Providing overly lengthy or irrelevant information.
  • Bias: Reflecting biases present in the training data.

To create truly effective and user-centric conversational AI, it’s crucial to address these limitations and make interactions more intuitive. Here are some key strategies:

  • Incorporate structured knowledge
    Integrating external knowledge bases or databases can ground the LLM’s responses in facts, reducing hallucinations and improving accuracy (see the sketch after this list).
  • Fine-tuning
    Training the LLM on domain-specific data enhances its understanding of particular topics and helps mitigate bias.
  • Intuitive feedback mechanisms
    Allow users to easily highlight and correct inaccuracies or provide feedback directly within the conversation. This could involve clickable elements to flag problematic responses or a “this is incorrect” button that prompts the AI to reconsider its output.
  • Natural language error correction
    Develop AI agents capable of understanding and responding to natural language corrections. For example, if a user says, “No, I meant X,” the AI should be able to interpret this as a correction and adjust its response accordingly.
  • Adaptive learning
    Implement machine learning algorithms that allow the AI to learn from user interactions and improve its performance over time. This could involve recognizing patterns in user corrections, identifying common misunderstandings, and adjusting behavior to minimize future errors.
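
Here is that sketch: a minimal illustration of grounding answers in an external knowledge base (retrieval-augmented prompting). All names here (knowledge_base, retrieve, call_llm) are hypothetical placeholders, not a specific product’s API.

#python

# Minimal retrieval-augmented prompting sketch. knowledge_base, retrieve, and
# call_llm are hypothetical placeholders, not a specific product's API.
knowledge_base = {
  "return policy": "Items can be returned within 30 days with a receipt.",
  "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query):
  """Naive keyword lookup; a real system would use embeddings or a search index."""
  return [fact for topic, fact in knowledge_base.items() if topic in query.lower()]

def grounded_answer(query, call_llm):
  """Build a prompt that constrains the model to the retrieved facts."""
  facts = retrieve(query)
  prompt = (
    "Answer the question using only the facts below. "
    "If the facts are not sufficient, say you don't know.\n"
    "Facts:\n" + "\n".join(f"- {fact}" for fact in facts) +
    f"\n\nQuestion: {query}"
  )
  return call_llm(prompt)
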
Training AI Agents For Enhanced User Satisfaction

Understanding and evaluating user satisfaction is fundamental to building effective conversational AI agents. However, directly measuring user satisfaction in the open-domain search context can be challenging, as Zhumin Chu et al. (2022) highlighted. Traditionally, metrics like session abandonment rates or task completion were used as proxies, but these don’t fully capture the nuances of user experience.

To address this, Clemencia Siro et al. (2023) offer a comprehensive approach to gathering and leveraging user feedback:

  • Identify key dialogue aspects
    To truly understand user satisfaction, we need to look beyond simple metrics like “thumbs up” or “thumbs down.” Consider evaluating aspects like relevance, interestingness, understanding, task completion, interest arousal, and efficiency. This multi-faceted approach provides a more nuanced picture of the user’s experience.
  • Collect multi-level feedback
    Gather feedback at both the turn level (each question-answer pair) and the dialogue level (the overall conversation). This granular approach pinpoints specific areas for improvement, both in individual responses and the overall flow of the conversation.
  • Recognize individual differences
    Understand that the concept of satisfaction varies per user. Avoid assuming all users perceive satisfaction similarly.
  • Prioritize relevance
    While all aspects are important, relevance (at the turn level) and understanding (at both the turn and session level) have been identified as key drivers of user satisfaction. Focus on improving the AI agent’s ability to provide relevant and accurate responses that demonstrate a clear understanding of the user’s intent.

Additionally, consider these practical tips for incorporating user satisfaction feedback into the AI agent’s training process:

  • Iterate on prompts
    Use user feedback to refine the prompts to elicit information and guide the conversation.
  • Refine response generation
    Leverage feedback to improve the relevance and quality of the AI agent’s responses.
  • Personalize the experience
    Tailor the conversation to individual users based on their preferences and feedback.
  • Continuously monitor and improve
    Regularly collect and analyze user feedback to identify areas for improvement and iterate on the AI agent’s design and functionality.

The Future Of Conversational Search: Beyond The Horizon

The evolution of conversational search is far from over. As AI technologies continue to advance, we can anticipate exciting developments:

  • Multi-modal interactions
    Conversational search will move beyond text, incorporating voice, images, and video to create more immersive and intuitive experiences.
  • Personalized recommendations
    AI agents will become more adept at tailoring search results to individual users, considering their past interactions, preferences, and context. This could involve suggesting restaurants based on dietary restrictions or recommending movies based on previously watched titles.
  • Proactive assistance
    Conversational search systems will anticipate user needs and proactively offer information or suggestions. For instance, an AI travel agent might suggest packing tips or local customs based on a user’s upcoming trip.

Integrating Image-To-Text And Text-To-Speech Models (Part 2)

In Part 1 of this brief two-part series, we developed an application that turns images into audio descriptions using vision-language and text-to-speech models. We combined an image-to-text model that analyzes and understands images, generating descriptions, with a text-to-speech model to create audio descriptions, helping people with sight challenges. We also discussed how to choose the right model to fit your needs.

Now, we are taking things a step further. Instead of just providing audio descriptions, we are building an app that can have interactive conversations about images or videos. This is known as Conversational AI — a technology that lets users talk to systems much like chatbots, virtual assistants, or agents.

While the first iteration of the app was great, the output still lacked some details. For example, if you upload an image of a dog, the app might produce a description close to “a dog sitting on a rock in front of a pool” but miss additional details such as the dog’s breed, the time of day, or the location.

The aim here is simply to build a more advanced version of the previously built app so that it not only describes images but also provides more in-depth information and engages users in meaningful conversations about them.

We’ll use LLaVA, a model that combines image understanding with conversational capabilities. After building our tool, we’ll explore multimodal models that can handle images, videos, text, audio, and more, all at once, to give you even more options and flexibility for your applications.

Visual Instruction Tuning and LLaVA

We are going to look at visual instruction tuning and the multimodal capabilities of LLaVA. We’ll first explore how visual instruction tuning enhances a large language model’s ability to understand and follow instructions that include visual information. After that, we’ll dive into LLaVA, which brings its own set of tools for image and video processing.

Visual Instruction Tuning

Visual instruction tuning is a technique that helps large language models (LLMs) understand and follow instructions based on visual inputs. This approach connects language and vision, enabling AI systems to understand and respond to human instructions that involve both text and images. For example, Visual IT enables a model to describe an image or answer questions about a scene in a photograph. This fine-tuning method makes the model more capable of handling these complex interactions effectively.
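
To make the idea more tangible, a single visual-instruction-tuning training example is typically an image reference plus a multi-turn instruction-and-answer conversation. The exact field names vary by dataset; the snippet below is only an illustrative sketch of the general shape, not a specific dataset’s schema.

#python

# Illustrative shape of one visual instruction tuning sample (field names vary by dataset)
sample = {
  "image": "dog_on_rock.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nWhat is the dog doing in this picture?"},
    {"from": "assistant", "value": "The dog is sitting on a rock next to a swimming pool."},
    {"from": "human", "value": "What time of day does it appear to be?"},
    {"from": "assistant", "value": "The bright, direct sunlight suggests it is around midday."},
  ],
}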

A newer training approach called LLaVAR has also been developed; you can think of it as a tool for handling tasks related to PDFs, invoices, and text-heavy images. It’s pretty exciting, but we won’t dive into it since it is outside the scope of the app we’re making.

Examples of Visual Instruction Tuning Datasets

To build good models, you need good data — rubbish in, rubbish out. So, here are two datasets that you might want to use to train or evaluate your multimodal models. Of course, you can always add your own datasets to the two I’m going to mention.

Vision-CAIR

  • Instruction datasets: English;
  • Multi-task: Datasets containing multiple tasks;
  • Mixed dataset: Contains both human and machine-generated data.

Vision-CAIR provides a high-quality, well-aligned image-text dataset created using conversations between two bots. This dataset was initially introduced in a paper titled “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” and it provides more detailed image descriptions and can be used with predefined instruction templates for image-instruction-answer fine-tuning.

There are more multimodal datasets out there, but these two should help you get started if you want to fine-tune your model.

Let’s Take a Closer Look At LLaVA

LLaVA (which stands for Large Language and Vision Assistant) is a groundbreaking multimodal model developed by researchers from the University of Wisconsin, Microsoft Research, and Columbia University. The researchers aimed to create a powerful, open-source model that could compete with the best in the field, just like GPT-4, Claude 3, or Gemini, to name a few. For developers like you and me, its open nature is a huge benefit, allowing for easy fine-tuning and integration.

One of LLaVA’s standout features is its ability to understand and respond to complex visual information, even with unfamiliar images and instructions. This is exactly what we need for our tool, as it goes beyond simple image descriptions to engage in meaningful conversations about the content.

Architecture

LLaVA’s strength lies in its smart use of existing models. Instead of starting from scratch, the researchers used two key models:

  • CLIP ViT-L/14
    This is an advanced version of the CLIP (Contrastive Language–Image Pre-training) model developed by OpenAI. CLIP learns visual concepts from natural language descriptions. It can handle any visual classification task by simply being given the names of the visual categories, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
  • Vicuna
    This is an open-source chatbot trained by fine-tuning LLaMA on 70,000 user-shared conversations collected from ShareGPT. Training Vicuna-13B costs around $300, and it performs exceptionally well, even when compared to other models like Alpaca.

These components make LLaVA highly effective by combining state-of-the-art visual and language understanding capabilities into a single powerful model, perfectly suited for applications requiring both visual and conversational AI.
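
Conceptually, the two models are glued together by a small projection layer that maps CLIP’s visual features into the language model’s embedding space (the “projection matrix” described in the training section below). Here is a simplified, illustrative sketch of that bridge, not LLaVA’s actual code, and the dimensions are only representative.

#python

import torch
import torch.nn as nn

# Illustrative sizes only: CLIP ViT-L/14 patch features on one side,
# the language model's token-embedding width on the other.
vision_dim, text_dim = 1024, 4096

projection = nn.Linear(vision_dim, text_dim)  # the "bridge" trained during feature alignment

image_features = torch.randn(1, 256, vision_dim)  # e.g., 256 visual patch tokens from CLIP
visual_tokens = projection(image_features)        # now shaped like the LLM's word embeddings
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])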

Training

LLaVA’s training process involves two important stages, which together enhance its ability to understand user instructions, interpret visual and language content, and provide accurate responses. Let’s detail what happens in these two stages:

  1. Pre-training for Feature Alignment
    LLaVA ensures that its visual and language features are aligned. The goal here is to update the projection matrix, which acts as a bridge between the CLIP visual encoder and the Vicuna language model. This is done using a subset of the CC3M dataset, allowing the model to map input images and text to the same space. This step ensures that the language model can effectively understand the context from both visual and textual inputs.
  2. End-to-End Fine-Tuning
    The entire model undergoes fine-tuning. While the visual encoder’s weights remain fixed, the projection layer and the language model are adjusted.

The second stage is tailored to specific application scenarios:

  • Instructions-Based Fine-Tuning
    For general applications, the model is fine-tuned on a dataset designed for following instructions that involve both visual and textual inputs, making the model versatile for everyday tasks.
  • Scientific reasoning
    For more specialized applications, particularly in science, the model is fine-tuned on data that requires complex reasoning, helping the model excel at answering detailed scientific questions.

Now that we know what LLaVA is and the role it plays in our application, let’s turn our attention to the next component we need for our work: Whisper.

Using Whisper For Speech-To-Text

In this section, we’ll check out Whisper, a great model for turning speech into text. Whisper is accurate and easy to use, making it perfect for transcribing users’ spoken questions in our app (the spoken replies themselves come from gTTS). We’ve used Whisper in a different article, but here, we’re going to use a new version — large-v3. This updated version of the model offers even better performance and speed.

Whisper large-v3

Whisper was developed by OpenAI, the same folks behind ChatGPT. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. The original Whisper was trained on 680,000 hours of labeled data.

Now, what’s different with Whisper large-v3 compared to other models? In my experience, it comes down to the following:

  • Better inputs
    Whisper large-v3 uses 128 Mel frequency bins instead of 80. Think of Mel frequency bins as a way to break down audio into manageable chunks for the model to process. More bins mean finer detail, which helps the model better understand the audio (see the sketch after this list).
  • More training
    This specific Whisper version was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio that was collected from Whisper large-v2. From there, the model was trained for 2.0 epochs over this mix.
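
Here is that sketch, using the openai-whisper package and assuming you have an audio file (sample.wav below is just a placeholder). The number of Mel bins is read from the loaded checkpoint, so large-v3 produces a 128-bin spectrogram where older checkpoints produce 80.

#python

import whisper

model = whisper.load_model("large-v3")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # sample.wav is a placeholder

# large-v3 expects 128 Mel bins; earlier checkpoints expect 80
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
print(mel.shape)  # torch.Size([128, 3000]) for large-v3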

Whisper models come in different sizes, from tiny to large. Here’s a table comparing the differences and similarities:

Size      Parameters  English-only  Multilingual
tiny      39 M        ✓             ✓
base      74 M        ✓             ✓
small     244 M       ✓             ✓
medium    769 M       ✓             ✓
large     1550 M      ✗             ✓
large-v2  1550 M      ✗             ✓
large-v3  1550 M      ✗             ✓
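
One detail the table hints at: the English-only variants exist only for the tiny through medium sizes, and you load them by appending .en to the model name. A quick sketch:

#python

import whisper

multilingual = whisper.load_model("base")     # multilingual checkpoint (all sizes)
english_only = whisper.load_model("base.en")  # English-only checkpoint (tiny through medium)
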
Integrating LLaVA With Our App

Alright, so we’re going with LLaVA for image inputs, and this time, we’re adding video inputs, too. This means the app can handle both images and videos, making it more versatile.

We’re also keeping the voice features, so you can speak your questions and hear the assistant’s replies, which makes the interaction even more engaging. How cool is that?

We’ll use Whisper to transcribe your spoken questions and gTTS to voice the assistant’s replies. We’ll stick with the Gradio framework for the app’s visual layout and user interface. You can, of course, always swap in other models or frameworks — the main goal is to get a working prototype.

Installing and Importing the Libraries

We will start by installing and importing all the required libraries. This includes the transformers library for loading the LLaVA model, bitsandbytes for quantization, the openai-whisper package for speech recognition, gTTS for text-to-speech, gradio for the interface, and moviepy to help in processing video files, including frame extraction.

#python
!pip install -q -U transformers==4.37.2
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio
!pip install -q gTTS
!pip install -q moviepy

With these installed, we now need to import these libraries into our environment so we can use them. We’ll use Google Colab for that:

#python
import torch
from transformers import BitsAndBytesConfig, pipeline
import whisper
import gradio as gr
from gtts import gTTS
from PIL import Image
import re
import os
import datetime
import locale
import numpy as np
import nltk
import moviepy.editor as mp

nltk.download('punkt')
from nltk import sent_tokenize

# Set up locale
os.environ["LANG"] = "en_US.UTF-8"
os.environ["LC_ALL"] = "en_US.UTF-8"
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

Configuring Quantization and Loading the Models

Now, let’s set up a 4-bit quantization to make the LLaVA model more efficient in terms of performance and memory usage.

#python

# Configuration for quantization
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_compute_dtype=torch.float16
)

# Load the image-to-text model
model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text",
  model=model_id,
  model_kwargs={"quantization_config": quantization_config})

# Load the whisper model
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large-v3", device=DEVICE)

In this code, we’ve configured the quantization to four bits, which reduces memory usage and improves performance. Then, we load the LLaVA model with these settings. Finally, we load the whisper model, selecting the device based on GPU availability for better performance.

Note: We’re using llava-1.5-7b as the model. Please feel free to explore other versions of the model. For Whisper, we’re loading the “large-v3” size, but you can also switch to another size like “medium” or “small” for your experiments.

To get our assistant up and running, we need to implement five essential functions:

  1. Handling conversations,
  2. Converting images to text,
  3. Converting videos to text,
  4. Transcribing audio,
  5. Converting text to speech.

Once these are in place, we will create another function to tie all this together seamlessly. The following sections provide the code that defines each function.

Conversation History

We’ll start by setting up the conversation history and a function to log it:

#python

# Initialize conversation history
conversation_history = []

def writehistory(text):
  """Write history to a log file."""
  tstamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
  logfile = f'{tstamp}_log.txt'
  with open(logfile, 'a', encoding='utf-8') as f:
    f.write(text + '\n')

Image to Text

Next, we’ll create a function to convert images to text using LLaVA and iterative prompts.

#python
def img2txt(input_text, input_image):
  """Convert image to text using iterative prompts."""
  try:
    image = Image.open(input_image)

    if isinstance(input_text, tuple):
      input_text = input_text[0]  # Take the first element if it's a tuple

    writehistory(f"Input text: {input_text}")
    prompt = "USER: <image>\n" + input_text + "\nASSISTANT:"
    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})

    if outputs and outputs[0]["generated_text"]:
      # Keep only the assistant's part of the generated text
      match = re.search(r'ASSISTANT:\s*(.*)', outputs[0]["generated_text"], re.DOTALL)
      reply = match.group(1) if match else "No response found."
      conversation_history.append(("User", input_text))
      conversation_history.append(("Assistant", reply))
      return reply  # Only return the first response for now
    else:
      return "No response generated."
  except Exception as e:
    return str(e)

Video to Text

We’ll now create a function to convert videos to text by extracting frames and analyzing them.

#python
def vid2txt(input_text, input_video):
  """Convert video to text by extracting frames and analyzing."""
  try:
    video = mp.VideoFileClip(input_video)
    frame = video.get_frame(1)  # Get a frame from the video at the 1-second mark
    image_path = "temp_frame.jpg"
    mp.ImageClip(frame).save_frame(image_path)
    return img2txt(input_text, image_path)
  except Exception as e:
    return str(e)
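
The function above only looks at a single frame one second in. If you want the description to reflect more of the video, one optional extension (a sketch that reuses the img2txt function and the moviepy import defined above; vid2txt_multi is a hypothetical name) is to sample a few frames and combine their descriptions:

#python

def vid2txt_multi(input_text, input_video, num_frames=3):
  """Describe a video by sampling several frames and joining the answers."""
  try:
    video = mp.VideoFileClip(input_video)
    # Sample frames evenly across the clip, avoiding the very first and last instants
    timestamps = [video.duration * (i + 1) / (num_frames + 1) for i in range(num_frames)]
    replies = []
    for t in timestamps:
      frame_path = f"temp_frame_{int(t * 1000)}.jpg"
      mp.ImageClip(video.get_frame(t)).save_frame(frame_path)
      replies.append(img2txt(input_text, frame_path))
    return " ".join(replies)
  except Exception as e:
    return str(e)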

Audio Transcription

Let’s add a function to transcribe audio to text using Whisper.

#python
def transcribe(audio_path):
  """Transcribe audio to text using Whisper model."""
  if not audio_path:
    return ''

  audio = whisper.load_audio(audio_path)
  audio = whisper.pad_or_trim(audio)
  # Match the loaded checkpoint's Mel dimensions (128 for large-v3, 80 for older models)
  mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
  options = whisper.DecodingOptions()
  result = whisper.decode(model, mel, options)
  return result.text

Text to Speech

Lastly, we create a function to convert text responses into speech.

#python
def text_to_speech(text, file_path):
  """Convert text to speech and save to file."""
  language = 'en'
  audioobj = gTTS(text=text, lang=language, slow=False)
  audioobj.save(file_path)
  return file_path

With all the necessary functions in place, we can create the main function that ties everything together:

#python

def chatbot_interface(audio_path, image_path, video_path, user_message):
  """Process user inputs and generate chatbot response."""
  global conversation_history

  # Handle audio input
  if audio_path:
    speech_to_text_output = transcribe(audio_path)
  else:
    speech_to_text_output = ""

  # Determine the input message
  input_message = user_message if user_message else speech_to_text_output

  # Ensure input_message is a string
  if isinstance(input_message, tuple):
    input_message = input_message[0]

  # Handle image or video input
  if image_path:
    chatgpt_output = img2txt(input_message, image_path)
  elif video_path:
    chatgpt_output = vid2txt(input_message, video_path)
  else:
    chatgpt_output = "No image or video provided."

  # Add to conversation history
  conversation_history.append(("User", input_message))
  conversation_history.append(("Assistant", chatgpt_output))

  # Generate audio response
  processed_audio_path = text_to_speech(chatgpt_output, "Temp3.mp3")

  return conversation_history, processed_audio_path

Using Gradio For The Interface

The final piece for us is to create the layout and user interface for the app. Again, we’re using Gradio to build that out for quick prototyping purposes.

#python

# Define Gradio interface
iface = gr.Interface(
  fn=chatbot_interface,
  inputs=[
    gr.Audio(type="filepath", label="Record your message"),
    gr.Image(type="filepath", label="Upload an image"),
    gr.Video(label="Upload a video"),
    gr.Textbox(lines=2, placeholder="Type your message here...", label="User message (if no audio)")
  ],
  outputs=[
    gr.Chatbot(label="Conversation"),
    gr.Audio(label="Assistant's Voice Reply")
  ],
  title="Interactive Visual and Voice Assistant",
  description="Upload an image or video, record or type your question, and get detailed responses."
)

# Launch the Gradio app
iface.launch(debug=True)

Here, we want to let users record or upload their audio prompts, type their questions if they prefer, upload videos, and, of course, have a conversation block.

Here’s a preview of how the app will look and work:

Looking Beyond LLaVA

LLaVA is a great model, but there are even greater ones that don’t require a separate ASR model to build a similar app. These are called multimodal or “any-to-any” models. They are designed to process and integrate information from multiple modalities, such as text, images, audio, and video. Instead of just combining vision and text, these models can do it all: image-to-text, video-to-text, text-to-speech, speech-to-text, text-to-video, and image-to-audio, just to name a few. It makes everything simpler and less of a hassle.

Examples of Multimodal Models that Handle Images, Text, Audio, and More

Now that we know what multimodal models are, let’s check out some cool examples. You may want to integrate these into your next personal project.

CoDi

So, the first on our list is CoDi or Composable Diffusion. This model is pretty versatile, not sticking to any one type of input or output. It can take in text, images, audio, and video and turn them into different forms of media. Imagine it as a sort of AI that’s not tied down by specific tasks but can handle a mix of data types seamlessly.

CoDi was developed by researchers from the University of North Carolina and Microsoft Azure. It uses something called Composable Diffusion to sync different types of data, like aligning audio perfectly with the video, and it can generate outputs that weren’t even in the original training data, making it super flexible and innovative.

ImageBind

Now, let’s talk about ImageBind, a model from Meta. This model is like a multitasking genius, capable of binding together data from six different modalities all at once: images and video, text, audio, depth, thermal, and IMU (motion) data.

Source: Meta AI.

ImageBind doesn’t need explicit supervision to understand how these data types relate. It’s great for creating systems that use multiple types of data to enhance our understanding or create immersive experiences. For example, it could combine 3D sensor data with IMU data to design virtual worlds or enhance memory searches across different media types.

Gato

Gato is another fascinating model. It’s built to be a generalist agent that can handle a wide range of tasks using the same network. Whether it’s playing games, chatting, captioning images, or controlling a robot arm, Gato can do it all.

The key thing about Gato is its ability to switch between different types of tasks and outputs using the same model.

GPT-4o

The next on our list is GPT-4o, a groundbreaking multimodal large language model (MLLM) developed by OpenAI. It can handle any mix of text, audio, image, and video inputs and give you text, audio, and image outputs. It’s super quick, responding to audio inputs in as little as 232ms, with an average of 320ms, almost like a real conversation.

There’s a smaller version of the model called GPT-4o Mini. Small models are becoming a trend, and this one shows that even small models can perform really well. Check out this evaluation to see how the small model stacks up against other large models.

Conclusion

We covered a lot in this article, from setting up LLaVA for handling both images and videos to incorporating Whisper large-v3 for top-notch speech recognition. We also explored the versatility of multimodal models like CoDi or GPT-4o, showcasing their potential to handle various data types and tasks. These models can make your app more robust and capable of handling a range of inputs and outputs seamlessly.

Which model are you planning to use for your next app? Let me know in the comments!

Bard: The New ChatGPT Competitor

In its constant quest to optimize the user experience in artificial intelligence, Google has introduced Bard, its latest and most advanced conversational system.

This innovative tool not only promises to stay up-to-date thanks to its permanent connection to the Internet, distinguishing it from other systems such as ChatGPT, but it also seeks to revolutionize the way we interact with technology. From its ability to interpret and describe images to its promising integration with other leading services such as Gmail, Docs, and Google Lens, Bard is shaping up to be the central nexus in Google’s service ecosystem. Moreover, its collaboration with Adobe Firefly suggests a horizon where the generation and understanding of visual content reach unprecedented levels. Although still in an experimental phase, Bard promises to redefine the boundaries of what we expect from an AI system.

Limited Conversations With Distributed Systems

By the way, ChatGPT suggested the title: The Art of Balancing Control and Accessibility

Background

Houston Airport had a really big problem: passengers complained about the time it took for their luggage to arrive at the terminal building after the airplane had landed. The airport invested millions to solve this pain point. They improved the process, hired more people, and introduced new technology, and they eventually succeeded in reducing the wait time to seven minutes. However, passengers still complained. The airport realized it had reached a point where optimizing the process and design was no longer paying off, so it did something different: it reframed the problem. By reframing the problem, it discovered that the issue was not the time it took to get the luggage to the terminal building but the time passengers spent standing around waiting for it. The airport decided to park the airplanes further away from the terminal building. As a result, it took passengers longer to walk to baggage claim, which reduced the time they spent waiting for their luggage, and voila! Complaints dropped drastically.

If I Was Starting My Career Today: Thoughts After 15 Years Spent In UX Design (Part 1)

My design career began in 2008. The first book that I read on the topic of design was Photoshop Tips And Tricks by Scott Kelby, a book about a very popular design tool, but not about user experience (UX) design itself. Back then, I didn’t know many of the approaches and techniques that even junior designers know today, partly because they hadn’t been invented yet and partly because I was just beginning my learning journey and finding my way in UX design. But now I have diverse experience; I’m hiring designers for my own team, and I know much more.

In my two-part series of articles, I’ll try to share with you what I wish I knew if I was starting my career today.

“If you want to go somewhere, it is best to find someone who has already been there.”

Robert Kiyosaki

The two-part series contains four sections, each roughly covering one key stage in your beginner career:

  1. Master Your Design Tools
  2. Work on Your Portfolio
  3. Preparing for Your First Interviews: Getting a First Job
  4. In Your New Junior UX Job: On the Way to Grow

I’ll cover the first three topics in this first article and the fourth one in the second article. In addition, I will include very detailed Further Reading sections at the end of each part.

When you’re about to start learning, every day, you will receive new pieces of evidence of how many things you don’t know yet. You will see people who have been doing this for years and you will doubt whether you can do this, too. But there is a nuance I want to highlight: first, take a look at the following screenshot:

This is the Amazon website in 2008 when I was about to start my design career and received my first paycheck as a beginner designer.

And this is how Amazon looked even earlier, in 2002:

Source: versionmuseum.com.

In 2002, Amazon made 3.93 billion US dollars in revenue. I dare say they could have hired the very best designers at the time. So today, when you speak to a designer with twenty years of experience and think, “Oh, this designer must be on a very high level now, a true master of his craft,” remind yourself about the state of UX design that existed when that designer’s career was about to start, sometime in the early 2000s!

A lot of the knowledge that I have learned and that is over five years old is outdated now, and the learning complexity only increases every year.

It doesn’t matter how many years you have been in this profession; what matters are the challenges you met in the last few years and the lessons you’ve learned from them.

Are you a beginner or an aspiring user interface/user experience designer? Don’t be afraid to go through all the steps in your UX design journey. Patience and a good plan will let you become a good designer faster than you think.

“The best time to start was yesterday. The next best time is now.”

H. Jackson Brown, Jr.

This was the more philosophical part of my writing, where I wanted to help you become better motivated. Now, let’s continue with the more practical things and advice!

Getting Started: Master Your Design Tools

When I was just beginning to learn, most of us did our design work in Adobe Photoshop.

In Photoshop, there were no components, styles, design libraries, auto layouts, and so on. Every screen was in a separate PSD file, and even making rounded corners on a rectangle was a difficult task. Files were “heavy,” and sometimes I needed to wait thirty or more seconds just to open a file and check which screen was inside. Changing a button’s name or label across twenty separate PSD files (each containing only one design screen, remember?) could take up to an hour, depending on the power of your computer.

There were many digital design tools at the time, including Fireworks — which some professionals considered superior to Photoshop, and for quite a few reasons — but this is not the main point of my story. One way or another, Photoshop back then became very popular among designers and we all absolutely had to have it in our toolset, no matter what other tools we also needed and used.

Now computers are much faster, and our design tools have evolved quite a bit, too. For example, I can apply multiple changes to multiple design screens in just a few seconds by using Figma components and a proper structure of the design file, I can design/prototype responsive designs by using auto-layout, and more.

In a word, knowing your design tool can be a real “superpower” for junior UX designers — a power that beginners often ignore. When you know your tool inside out, you’ll spend less time on the design routine and have more time for learning new things.

Master your tool(s) of choice (be it Figma Design or Illustrator, Sketch, Affinity Designer, Canva, Framer, and so on) in the most efficient way, and free up to a couple of extra hours every day for reading, doing tutorials, or taking longer breaks.

Learn all the key features and options, and discover and remember the most important hotkeys so you’ll be working without the need to constantly reach for your mouse and navigate the “web” of menus and sub-menus. It’s no secret that we, designers, mostly learn through doing practical tasks. So, imagine how much time it would save you within a few years of your career!

Also, it’s your chance: developers are rolling out new features for beginner designers and pro designers simultaneously, but junior designers usually have more time to learn those updates! So, be faster and get your advantage!

Getting Started: Work On Your Portfolio

You need to admit it: your portfolio (or, to put it more precisely, the lack of it) will be the main pain point at the start.

You may sometimes hear statements such as: “We understand that being a junior designer is not about having a portfolio...” But the fact is that we all would like to see some results of your work, even if it is your very early work on a few design projects or concepts. Remember, if you have something to show, it will always be a considerable advantage!

I have heard from some juniors that they don’t want to invest time in their portfolio because this work is unpaid and time-consuming. But sitting, waiting, and getting rejected again and again is also time-consuming. And spending the first few years of your career in the wrong company is also time-consuming (and disappointing, too). So my advice is to spend some time in advance on showcasing your work and then get much better results in the near future.

In case you need some extra motivation, here is a quote from Muhammad Ali, regarded as one of the most significant sports figures of the 20th century:

“I hated every minute of training, but I said to myself, ‘Do not quit. Suffer now and live the rest of your life as a champion.’”

— Muhammad Ali

Ready to get started but have no idea where to begin? Here are a few options:

  • Find a popular product with a rather difficult-to-use or not very elegant interface and research what the users of this product are complaining about the most. Then, as an exercise, design a few interface screens for this product, with their core features explained, publish them on social media, and tag that company. (This approach may not always work, but it’s worth a try.)
  • Sign up for and actively participate in hackathons. As a result, it’s possible that you may get not just a few screens redesigned in Figma but a real working product you can show (and be proud of). Also, you can meet nice people there who may recommend you if you apply for a job at one of the companies they work for.
  • Complete UXchallenge challenges and present how you solved them on LinkedIn.
    Note: You’re not limited to LinkedIn, of course; you can also use Instagram, Facebook, Behance, Dribbble, and so on. But keep in mind that many recruiters prefer LinkedIn.
  • Pick up a website that you use often and check whether it meets the “Ten Usability Heuristics for User Interface Design.” Create a detailed report that lists everything that can be (re)designed better. Publish the report on LinkedIn and also send it to the company that made this website. Don’t forget to tell them why you did that report for their website specifically and that you’re learning UX design, practicing, and actively looking for a job.
  • Visit some popular developer conferences where you would be one of the only designers attending. Talk to people and propose your help for their startups. Who knows, you may become the co-creator of some future cool startup!
  • Choose an area where digitalization hasn’t propagated yet and create a design concept using very modern technologies. For instance, people have been growing plants for thousands of years, but data analysis and visualization dramatically changed the efficiency of that process only lately. The agricultural industry has undergone a remarkable transformation thanks to UX design — a crucial element in ensuring that agricultural applications are not just functional but also intuitive and user-friendly. From precision farming to crop monitoring systems, digital tools have revolutionized the way farmers manage their operations.
    Note: You can check the following article for details: “The Evolution of UX Design in Agricultural Applications.”

Don’t wait until someone hands you your chance on a “silver platter.” There are many projects that need the designer’s hands and help but can’t get such help yet. Assist them and then show the results of your work in your first portfolio. It gives you a huge advantage over other candidates who haven’t worked on their portfolios yet!

Preparing For Your First Interviews: Getting A First Job

From what I’ve heard, getting the first job is the biggest problem for a junior designer, so I will focus on this part in more detail.

Applying For A Job

To reach the goal, you should formulate it correctly. It’s already formulated in this case, but most candidates understand it wrong. The right goal here is to be invited to an interview — not to get an offer right now or tell everything about your life in the CV document. You just need to break through the first level of filtering.

Note: Some of these tips are for absolute beginners. Do they sound too obvious to you? Apologies if so. However, all of them are based on my personal experience, so I think there are no tips that I should omit.

  • Send your CV and motivational letter (if required in the job description) from the correct email address. It’s always strange to receive a job application from an email such as ‘sad.batman2006@gmail.com’. Seniors are always responsible for the tasks that junior designers complete, and we want to know that you are a seriously-minded and responsible person to help us do our work. Small details, such as the email address you would use to get in touch, do matter.
  • Use your real name. I’ve had cases where people have used different names in their emails and CVs. I think it’s too obvious why this will look very strange, so I won’t spend time describing it in detail.
  • Skill representations. Use the well-accepted standards. I have seen some CVs created with the help of services such as CV Maker where skills (level of English, how well you know Figma, Illustrator, and other design tools, and so on) were represented as loaders or diagrams. But there are existing standards, so use them in order to be understood better. For instance, if you describe your level of English knowledge, use the CEFR levels (A1/A2, B1/B2, C1/C2). Don’t make people interpret a diagram instead.
  • Check/proofread the text in your email, CV, and portfolio. We expect that you may not know everything about design, but spelling errors don’t demonstrate exactly your desire to learn and your attention to detail. You can use Grammarly or ChatGPT to check your text, but you should not try to substitute your thoughts with some AI-“generated” ideas. Also, make sure to structure well the content of your CV and to format it properly.
  • Read the job description carefully, find matches with your skills, and reflect these in the CV. Recruiters cannot review all the CVs thoroughly. Remember, the goal is to break through the first level of filtering — the recruiter is not a designer and can’t evaluate you and your skills. However, the recruiter can decide whether your CV is relevant to the job description, so it’s very important to tweak the CV by making sure you mention all the skills that you possess and that match the ones found in the specific job description.
  • Don’t count solely on the job application form posted on the company’s website. There were cases when I had no reply after filling out and submitting the official application form but then got an offer after trying to reach a recruiter from that company directly on LinkedIn or via some other available communication channel. So don’t be shy to get in touch directly.
  • Avoid using PDF documents for portfolios or anything else that people need to download before opening. The more time it takes to open and review your portfolio, the less time people will spend checking what’s in it. A link to your portfolio on the web will always work better, and it’s also a much more professional approach! You can use platforms such as Behance (or similar), or you can create your case studies in Figma and paste the shareable link into your CV.

Note: There are many ways to show your portfolio, and Figma is only one of them. For ideas, you can check “Figma Portfolio Templates & Examples” (a curated selection of portfolio templates for Figma). Or even better, you can self-host your portfolio on your own domain and website; however, this may require some more advanced technical skills and knowledge, so you can leave this idea for later.

Completing A Test Task

The test task aims to assess what we can expect from you in the workplace. And this is not just about the quality of your design skills — it’s also about how you will communicate with others and how you will be able to propose practical solutions to problems.

What do I mean by “practical solutions”? In the real world, designers always work within certain limitations (constraints), such as time, budget, team capacity, and so on. So, if you have some bright ideas that are likely very hard to implement, keep these for the interview. The test task is a way to show that you are someone who can define the correct problems and do the proper work, i.e., find solutions to them.

A few words of advice on how to do exactly that:

  • If you have a chance to speak to the target audience, do it, especially if the test task is to make an existing product better. You don’t have to do complete research, but if it’s a popular product that everyone uses, you can ask your friends about their experience of using it. If it’s not, check what people say on Reddit, in reviews on the Apple App Store, or on Google Play. Find video reviews of this product on YouTube and analyze the comments under the video. Also, take a look at similar products and what people say about them. Defining real problems is a key skill for designers.
    Note: How can we conduct UX research when there is no or only limited access to users? Vitaly Friedman outlines a few excellent strategies in his article on this topic: “Why Designers Aren’t Understood.”
  • Prioritize features that you see and can reflect on in the test task. You can use the Kano model or another framework, but don’t skip this step! It is sometimes puzzling to see candidates spending a lot of time on dark mode UI mockups but failing to work on the required key features instead.
    Note: The Kano Analysis model is a tool that enables you to understand how customer emotional responses to products or features can be measured and explored.
  • If you need more time, say so. It will also show how you will behave when working on a real project: raising a problem at the last moment can cause big trouble for the team. It has also happened a few times in my practice that a candidate says:
    • “I didn’t fully complete the test task because I was busy.”
    • OK, if you are too busy (with other things?), then we will have to interview some other candidates.
      My advice is to show dedication and focus toward your current job application assignment.
  • In some cases, candidates try to go the “extra mile” by doing more than was initially asked of them, but with lower quality. Unfortunately, it doesn’t work this way. Instead, you need to do less, but better. Of course, there can be exceptions, such as sketching and prototyping, where showing rough ideas is perfectly OK. So try to find the balance between the volume and the quality of your work. Showing many (but weak) mockups in order to impress with the volume of your work instead of its quality is not a good idea.
  • Sometimes, we ask to redesign a screen as a test task. This is not about using better/shinier UI components. Instead, try to understand the user goals on that screen and then think about the most suitable UI components that you can use to serve these user goals.

Recommendations For The Interview

The interview is the most challenging part because the best way to prepare for it depends on the specific company you’re applying to and on the interviewer’s experience. But there are still a few “universal” things you can do to increase your chances:

  • If I were restricted to giving only one piece of advice, I would say: Be sincere! It’s not an exam, so don’t try to guess the answer if you don’t know it. No one knows everything, and that’s OK. Be honest, and it will pay off.
  • Research the company and the role before the interview. Check the company’s portfolio, cases, products, and so on, and even look up the names and titles of designers working there.
    Note: It will help a lot if the company has an “About the Team” page on its website (if not, LinkedIn will probably help, too). When you have researched the role in detail, it will help you define which of your skills are a good match so that you can highlight them during the interview.
  • The core questions in a UX design interview are not a secret. Usually, they’re about the design phases, your experience, hobbies, motivation, and so on. Work on these questions and clarify your answers before going to the interview. Write them down and read them out loud to check how they sound. Converting your design experience into exact words takes brain energy, especially when somebody in front of you is waiting for the answer, so do it beforehand, and you’ll feel much better prepared and calmer.
  • Listen carefully to the questions you are being asked. Ask the interviewer to clarify if you do not completely understand a question. It’s always awkward when a candidate gives an answer that has nothing to do with the question that was asked.
  • Don’t be late. Do your best to be on time.
    • If it’s an online interview, check the time zones, the communication tools, and everything else. There’s nothing worse than starting Zoom (or another app you know you’ll need) at the last minute and discovering that it needs an urgent update. Precious minutes will be lost while the other party patiently waits for you to come online. Also check your headphones, microphone, camera, and Bluetooth connection before the meeting starts.
    • Similarly, if it’s an in-person interview, plan your trip in advance and add some extra time for the unexpected; it’s better to arrive early than late. The problem is not only about wasting someone’s time; it’s also about your emotional balance. If you are late, you will be nervous and make mistakes you otherwise wouldn’t.
  • Don’t apply to the companies of your dreams right from the start. First, pass a few interviews with other companies, get feedback, do some retrospectives, gain some real experience, and be prepared to show your best when your chance comes.
  • Be yourself, but also clearly communicate who you are going to become, as people with goals and a plan always make a better impression. Most companies don’t hire juniors; they hire future middle-level and senior designers. If you feel that the company you’re applying to would not support you in this way, try another one. The first few years are the foundation of your future career, so do your best to get into a company where you can grow as a designer.

Conclusion

Thank you for following me so far! Hopefully, you have learned your design tools, worked on your portfolio, and prepared meticulously for your first interviews. If all goes according to plan, sooner or later, you’ll get your first junior UX job. And then you’ll face more challenges, about which I will speak in detail in the second part of my two-part article series.

But before that, do check Further Reading, where I have gathered a few resources that will be very useful if you are just about to begin your UX design career.

Further Reading

Basic Design Resources

A List of Design Resources from the Nielsen Norman Group

What ChatGPT Needs Is Context

As part of my involvement at LeadDev NYC, I had the opportunity to record a short video message that would be part of a montage played for folks between the live talks. I decided to speak about the way engineers are enabling the future of products (you can watch it here).

It seems to me that questions like “how can engineers affect the future of (whatever)” sometimes come from a place of anxiety. And these days, there’s no greater source of that anxiety than the advances — and the impacts we imagine coming from those advances — in large language models (LLMs), more broadly billed as artificial intelligence (AI).

GenAI-Infused ChatGPT: A Guide To Effective Prompt Engineering

In today's world, interacting with AI systems like ChatGPT has become an everyday experience. These AI systems can understand and respond to us in a more human-like way. But how do they do it? That's where prompt engineering comes in.

Think of prompt engineering as the instruction manual for AI. It tells AI systems like ChatGPT how to understand what we want and respond appropriately. It's like giving clear directions to a helpful friend.
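
To make that concrete, here is a minimal, hypothetical sketch of what “clear directions” can look like in practice, assuming the OpenAI Python SDK; the model name and the prompt content are placeholders of my own, not something this article prescribes:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A structured prompt: role, task, constraints, and output format spelled out,
    # rather than a vague one-liner like "plan a trip for me".
    structured_prompt = (
        "You are a travel-planning assistant.\n"
        "Task: suggest a 3-day itinerary for Lisbon.\n"
        "Constraints: budget under 500 EUR, no car, vegetarian-friendly food.\n"
        "Output format: a bulleted list per day, each item with a rough cost."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": structured_prompt}],
    )
    print(response.choices[0].message.content)

The more of the directions you move out of your head and into the prompt, the less the model has to guess.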

How to Repurpose Content with ChatGPT: A Step-by-Step Guide

Figuring out how to repurpose content with ChatGPT is one of the smartest ways to use the AI tool. With the right approach, you can take a blog post or a page and get the AI to transform it or use it as the basis for all kinds of other content. That means podcast scripts, Frequently Asked Questions (FAQ) pages, and many other options. In this article, we’ll show you how to repurpose content with ChatGPT.

Reverse engineering minified JS with ChatGPT

#703 — September 5, 2024

JavaScript Weekly

An SSR Performance Showdown — Fastify’s Matteo Collina set out to find the current state of server-side rendering performance across today’s most popular libraries. The first attempt faced negative feedback due to implementation issues, but the showdown has been improved and re-run.

Matteo Collina

Announcing Vue 3.5 — While v3.5 is a minor release, it’s one Vue users will love, with big performance and memory usage improvements in its reactivity system. With no breaking changes, upgrade and watch memory consumption drop.

Evan You

WorkOS: The Modern Identity Platform for B2B SaaS — WorkOS is a modern identity platform for B2B SaaS, offering flexible and easy-to-use APIs to integrate SSO, SCIM, and RBAC in minutes instead of months. It’s trusted by hundreds of high-growth startups such as Perplexity, Vercel, Drata, and Webflow.

WorkOS sponsor

Reverse Engineering Minified JavaScript with ChatGPT — Writing new code with AI is one thing, but could it be even better at understanding existing code that you’re struggling to grok? Yes, it seems.

Frank Fiegel

Inside ECMAScript: JavaScript Standards Get an Extra Stage — After nine years of annual updates, TC39 has tweaked the process to make rolling out new features faster and smoother. The so-called ‘Stage 2.7’ has been around for a while, but this is a neat primer on what it represents.

Mary Branscombe (The New Stack)

IN BRIEF:

⭐ Vercel goes deep into what’s new in React 19.

💰 Alpine.js creator Caleb Porzio shares his tale of passing $1m on GitHub Sponsors.

Bye NgModules, the future of Angular is standalone! Angular v19 will make standalone: true the default for components, directives, and pipes. This is already the recommended best practice, however.

Angular’s product lead, Minko Gechev, has also shared a little about what it means to manage the Angular project.

OpenAI has switched ChatGPT from Next.js to a Remix-based app, according to Remix’s Ryan Florence on X.

🇵🇱 Poland’s WarsawJS community is holding a 10th anniversary meetup on September 11. They invite you to ▶️ watch live on YouTube.

🤖 Lee Robinson shows off ▶️ the latest enhancements to Vercel’s v0, an AI-based tool for creating apps and components from prompts you supply.

[Workshop] Fix Your Front-End: JavaScript Edition — Learn practical tips to make debugging more tolerable. Join our JavaScript team live for a masterclass on Sept 24.

Sentry sponsor

RELEASES:

Node.js v22.8.0 (Current) – Adds a new API for enabling on-disk code caching at runtime, as well as options to set thresholds for code coverage success.

Astro 4.15 – The popular content site framework stabilizes Astro Actions, a solution for fully type-safe backend functions.

Jimp 1.3 – Pure JS image processing library for Node.

Turborepo 2.1, Puppeteer 23.3, Mermaid 11.1

📒 Articles & Tutorials

▶  Behind the Scenes: The Making of VS Code — A detailed conversation with two of the popular editor’s principal engineers on what makes it tick. VS Code is surely one of the world’s most widely distributed JavaScript-powered apps.

Holland, Rieken and Pasero (Microsoft)

How I Created a 3.78MB Docker Image for a JavaScript Service — The smallest JavaScript app container images tend to run into tens of megabytes, but tailoring your app to run on a lighter runtime like llrt can yield striking results.

Shenzilong

Leave Forms to SurveyJS and Get Back to What You Love Coding — Extensible JavaScript libraries for form management. Drag-and-drop UI, JSON form definitions, and seamless integration with any backend for full data control.

SurveyJS sponsor

Exploring Goja: A Go-Powered JavaScript Runtime — Goja is a pure Go(lang) JS runtime that makes it possible to embed JS into Go apps.

JT Archie

How to Use React Compiler — The compiler feature in React 19 is generating a lot of buzz — this “complete guide”, as described by this author, covers much of what you’ll need to get started.

Tapas Adhikary

Multithreaded Programming in Node.js using Atomics — Worker threads enable you to write multi-threaded Node apps, but sharing resources across them can quickly become tricky. Atomics can help avoid some of the pain.

Pavel Romanov

📄 A Complete Guide to Beginning with JavaScript – A rather epic article packed with background knowledge, context, and third party resources for starting a modern JavaScript learning journey. Cody Lindley

📄 Implementing Filtered Semantic Search Using pgvector and JavaScript Team Timescale

📄 How to Quickly (and Weightlessly) Convert Chrome Extensions to Safari Nina Torgunakova (Evil Martians)

📄 How Sentry Uses Mutation Testing on its JavaScript SDKs Lukas Stracke (Sentry)

🎤 Talking Deno 2 with Ryan Dahl Syntax․fm Podcast

🛠 Code & Tools

jsdiff 6.0: A JavaScript Text Diffing Implementation — Can compare strings for differences in various ways including creating patches. There’s an online demo. (Don’t worry – we’re not going monthly ;-))

Kevin Decker

Redwood v8.0 Released — A long-standing, opinionated React & GraphQL (and/or RSC) full-stack framework that covers all the bases for professional dev teams with best-in-class tool support. v8.0 introduces a background jobs system, Docker support, and easier SSR and RSC setup.

Redwood Team

Tests Are Dead. Meticulous Is Here — Automatically creates & maintains E2E UI tests. Zero flakes. Backed by YC, CTO of GitHub, CPO of Adobe, CEO of Vercel.

Meticulous sponsor

🇬🇧 GOV.UK Vue 1.0: Build Vue Apps, the British Way — The UK government is known for having an effective, well-designed site where Brits can complete various official tasks. Now you can get all of their components in Vue 3 form.

UK Government

👀 style-observer: A Mutation Observer for CSS — Attach JavaScript callbacks to changes in computed values of CSS properties.

Bramus Van Damme

Goxygen: Quickly Generate a Go Backend for a JS Project — A tool that sets up a new Go-based project with Angular, React, or Vue in the front-end, and Docker and Docker Compose files to make it all work.

Sasha Shpota

Typist 7.0: Tiptap-Based Rich Text Editor Component — Simple and opinionated. You can try several examples in the sidebar. Well suited for basic rich text situations like writing comments or messages and has a single-line mode.

Doist

Belt: A New Tool for Starting React Native Apps — A CLI tool for starting a new React Native app that takes various mundane decisions away from you and uses tooling and conventions established by a productive app development team.

Thoughtbot

Tinybase 5.2 – Powerful reactive data store for local‑first apps. Now with Postgres support (which can even work in-browser!)

jsdoc-to-markdown 9.0 – Generate Markdown docs from JSDoc-annotated code.

LogTape 0.5 – No-dependency logging lib for Deno, Node, Bun & browsers.

Plasmo 0.89 – Imagine Next.js but for building browser extensions.

JsonTree.js 3.0 – Customizable tree views for JSON data.

Poku 2.6 – Cross-platform JavaScript test runner.

Faker 9.0 – Generate large amounts of fake data.

Data Poisoning and Model Collapse: The Coming AI Cataclysm

Generative AI tools like ChatGPT seem too good to be true: craft a simple prompt, and the platform generates text (or images, videos, etc.) to order.

Behind the scenes, ChatGPT and its ilk leverage vast swaths of the World Wide Web as training data — the ‘large’ in ‘large language model’ (LLM) that gives this technology its name.

ChatGPT Functions: Observations, Tips, and Tricks

Recently introduced ChatGPT functions represent a huge leap forward that lets ChatGPT use your local files, data, and system services. If you supply proper functions to ChatGPT, you can ask it something like “Email Kate Bell with birthday greetings” and see a new email message pop up with the correct email address, the correct subject, and generated email text with birthday wishes.

Pretty cool, right?  
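
For a sense of how that flow looks in code, here is a minimal sketch assuming the OpenAI Python SDK’s tool-calling interface; the send_email function, its fields, and the model name are hypothetical stand-ins rather than the article’s actual setup:

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Describe a local capability to the model as a JSON-schema "tool".
    # The name and parameters below are illustrative only.
    tools = [{
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Compose and send an email from the user's mail client.",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Recipient name or address"},
                    "subject": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["to", "subject", "body"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Email Kate Bell with birthday greetings"}],
        tools=tools,
    )

    # The model returns structured arguments; it does not send anything itself.
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(call.function.name, args["to"], args["subject"])

Note that the model never sends the email on its own; it only returns structured arguments, and your own code (or, in ChatGPT’s case, the host application) decides whether and how to act on them.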

How To Avoid AI Hallucinations With ChatGPT

Tage wrote about how to prevent ChatGPT from hallucinating a couple of months ago. However, I wanted to dive deeply into one specific thing you can do to completely avoid AI hallucinations. Before I explain how to avoid hallucinations, I need to explain a little bit about what we do when we create a custom ChatGPT chatbot.

What we do is prompt engineering based on an SQL database with VSS capabilities. It could be argued that we jailbreak ChatGPT, but instead of allowing ChatGPT to go completely berserk, we significantly restrict its capabilities so that it can only answer questions related to the data found in our SQL database. To understand the process, it helps to create your own custom chatbot, something you can do below.
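
As a rough sketch of that kind of restriction, assuming the OpenAI Python SDK and a hypothetical search_vss helper standing in for whatever vector-similarity query the SQL database exposes:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def answer_from_database(question: str) -> str:
        # 1. Embed the question and fetch the closest rows from the SQL/VSS store.
        #    search_vss is a hypothetical helper, not a real library call.
        embedding = client.embeddings.create(
            model="text-embedding-3-small",  # placeholder embedding model
            input=question,
        ).data[0].embedding
        context = "\n\n".join(row.text for row in search_vss(embedding, top_k=5))

        # 2. Restrict the model to the retrieved context only.
        system_prompt = (
            "Answer using ONLY the context below. If the answer is not in the "
            "context, reply exactly with: I don't know.\n\nContext:\n" + context
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            temperature=0,  # keep answers close to the source text
        )
        return response.choices[0].message.content

In a setup along these lines, the model is not made smarter; it is fenced in by the system prompt so that anything outside the retrieved rows comes back as “I don’t know” rather than a guess.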

The Grok AI Model From X: What Does It Mean to the Market?

Elon Musk recently announced the introduction of an artificial intelligence (AI) product to compete with the OpenAI ChatGPT product suite. This product, named Grok, is currently available in beta version only, and is in limited release to a select number of users in the United States. 

As we await the full release of the Grok product, it may be worth considering its potential impact on the market and the features that Musk believes will distinguish Grok from its main competitor.