Using Multimodal AI Models For Your Applications (Part 3)

In this third and final part of a three-part series, we’re taking a more streamlined approach to an application that supports vision-language models (VLMs) and text-to-speech (TTS). This time, we’ll use models designed for all three modalities — images or videos, text, and audio (including speech-to-text) — in one model. These “any-to-any” models make things easier by allowing us to avoid switching between separate models.

Specifically, we’ll focus on two powerful models: Reka and Gemini 1.5 Pro.

Both models take things to the next level compared to the tools we used earlier. They eliminate the need for separate speech recognition models, providing a unified solution for multimodal tasks. With this in mind, our goal in this article is to explore how Reka and Gemini simplify building advanced applications that handle images, text, and audio all at once.

Overview Of Multimodal AI Models

The architecture of multimodal models has evolved to enable seamless handling of various inputs, including text, images, and audio, among others. Traditional models often require separate components for each modality, but recent advancements in “any-to-any” models like Next-GPT or 4M allow developers to build systems that process multiple modalities within a unified architecture.

DeepMind’s Gato, for instance, utilizes a 1.2-billion-parameter decoder-only transformer architecture with 24 layers, an embedding size of 2048, and a hidden size of 8196 in its feed-forward layers. This structure is optimized for general tasks across various inputs, but it still relies on extensive task-specific fine-tuning.

GPT-4o, on the other hand, takes a different approach with training on multiple media types within a single architecture. This means it’s a single model trained to handle a variety of inputs (e.g., text, images, code) without the need for separate systems for each. This training method allows for smoother task-switching and better generalization across tasks.

Similarly, CoDi employs a multistage training scheme to handle a linear number of tasks while supporting input-output combinations across different modalities. CoDi’s architecture builds a shared multimodal space, enabling synchronized generation for intertwined modalities like video and audio, making it ideal for more dynamic multimedia tasks.

Most “any-to-any” models, including the ones we’ve discussed, rely on a few key concepts to handle different tasks and inputs smoothly (a small illustrative sketch follows this list):

  • Shared representation space
    These models convert different types of inputs — text, images, audio — into a common feature space. Text is encoded into vectors, images into feature maps, and audio into spectrograms or embeddings. This shared space allows the model to process various inputs in a unified way.
  • Attention mechanisms
    Attention layers help the model focus on the most relevant parts of each input, whether it’s understanding the text, generating captions from images, or interpreting audio.
  • Cross-modal interaction
    In many models, inputs from one modality (e.g., text) can guide the generation or interpretation of another modality (e.g., images), allowing for more integrated and cohesive outputs.
  • Pre-training and fine-tuning
    Models are typically pre-trained on large datasets across different types of data and then fine-tuned for specific tasks, enhancing their performance in real-world applications.
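To make the first of those concepts, the shared representation space, a little more concrete, here is a deliberately tiny JavaScript sketch. The encoder functions are made-up stand-ins rather than any real model’s API; the point is only that every modality ends up as vectors of the same dimension, so a single stack of attention layers can process them all in one sequence.

// Illustrative only: toy "encoders" that map each modality into the same
// four-dimensional embedding space. Real models use large learned networks here.
const EMBED_DIM = 4;

const toVector = (numbers) => {
  // Pad or truncate to the shared embedding size.
  const v = numbers.slice(0, EMBED_DIM);
  while (v.length < EMBED_DIM) v.push(0);
  return v;
};

// One "token" per word, faked with character codes.
const encodeText = (text) =>
  text.split(" ").map((word) => toVector([...word].map((c) => c.charCodeAt(0) / 255)));

// One vector per image patch; here, one per row of numbers.
const encodeImage = (patches) => patches.map((patch) => toVector(patch));

// One vector per audio frame (think spectrogram columns).
const encodeAudio = (frame) => [toVector(frame)];

// The shared representation space: tokens from every modality live in one
// sequence that attention layers can attend over jointly.
const sequence = [
  ...encodeText("a striped cat"),
  ...encodeImage([[0.1, 0.5, 0.9, 0.2]]),
  ...encodeAudio([0.3, 0.1, 0.4, 0.1]),
];

console.log(sequence.length, "tokens of dimension", EMBED_DIM);
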
Reka Models

Reka is an AI research company that helps developers build powerful applications by offering models for a range of tasks. These tasks include generating text from videos and images, translating speech, and answering complex questions from long multimodal documents. Reka’s models can even write and execute code, providing flexible, real-world solutions for developers.

These are the three main models Reka offers:

  1. Reka Core
    A 67-billion-parameter multimodal language model designed for complex tasks. It supports inputs like images, videos, and text while excelling in advanced reasoning and coding.
  2. Reka Flash
    A faster, 21-billion-parameter model designed for flexibility and rapid performance in multimodal settings.
  3. Reka Edge (PDF)
    A smaller, 7-billion-parameter model built for on-device use, making it efficient for local or latency-sensitive applications.

Reka’s models can be fine-tuned and deployed securely, whether on the cloud, on-premises, or even on-device. Let’s start by testing Reka’s capabilities directly through its playground. This allows us to experiment with its multimodal features without writing any code, providing a hands-on way to see how the models handle various tasks, such as image and video comprehension.

Alright, we’ll kick things off by uploading an image of a diagram outlining the PaliGemma architecture and asking Reka for a detailed explanation.

Can you provide a more detailed explanation of this image?

Here’s what we get from Reka Core:

“This image is a schematic representation of a machine-learning model named PaliGemma. The model appears to be designed for multimodal input, combining both image and text data. At the top, there are two input sources: ‘Image Input’ and ‘Text Input’. The image input is processed by a ‘SigLIP Image Encoder’, which likely stands for a specific type of image encoding technique. The output of this encoder is then subjected to a ‘Linear Projection’, which is a common step in machine learning to transform data into a lower-dimensional space. The text input is directly fed into the model. Both the projected image data and the text input are then concatenated into ‘Concatenated Tokens’, which are then processed by a component labeled ‘Gemma’. The final output of the model is ‘Text Output’, which suggests that the model is designed to generate text based on the input data.”

It’s a clear and detailed breakdown of the PaliGemma architecture. The model accurately identifies image and text inputs, the SigLIP image encoder, and the linear projection step. It also grasps how the image and text inputs are combined and processed to generate text output. I should admit that the response is quite neat, offering a solid explanation of the architecture.

Let’s try a video input instead. We’ll upload a video about Genny, a piece of software, and ask a question to see how well Reka Core handles video comprehension.

What is this video about?
“This video is a screen recording of a group of friends discussing their plans for the weekend.”

Well, it’s not a good answer compared to what LLaVA gave us in the previous article. Now let’s try a longer video, but with the Reka Flash model. The video is about a TTS model called Parler TTS. Instead of asking what the video is about, let’s prompt it like this:

I’m an AI developer, and I’m not sure how this TTS model can help me in my projects. Can you explain how I can use it or what it’s useful for?

Reka Flash generates a solid answer this time:

“Of course! This TTS (Text-to-Speech) model can be incredibly useful for developers, especially those working on projects that involve creating more human-like interactions or providing audio feedback to users. Here are some ways you can leverage this TTS model in your projects:
  1. Accessibility: TTS models can greatly enhance the accessibility of your applications for users with visual impairments or reading difficulties. By providing spoken feedback or narrating content, you can make your applications more inclusive
  2. Voice Assistants: You can integrate this TTS model into voice assistants or chatbots, allowing them to generate more natural-sounding responses. This can improve user experience and make interactions with your bots more engaging.
  3. Multimedia Applications: This TTS model can be used in multimedia applications, such as e-books, educational tools, or games, to provide spoken narration or instructions. This can make the content more engaging and easier to understand.
  4. Audio Notifications: You can use the TTS model to generate audio notifications for your applications, providing users with spoken alerts or reminders. This can be particularly useful for users who may not always have their screens visible.
  5. Language Learning: The TTS model can be used in language learning applications to provide users with correct pronunciation and intonation. It can also generate spoken feedback or narrate lessons, making the learning process more immersive.
  6. Audio Books: You can use the TTS model to generate audio books, allowing users to listen to written content instead of reading it. This can be especially helpful for users who prefer auditory learning or have difficulty reading.
  7. Explainer Videos: The TTS model can be used to generate voiceovers for explainer videos, providing users with spoken narration that explains the content on the screen. This can make the content more engaging and easier to understand.
  8. Audio Feedback: You can use the TTS model to generate audio feedback for users in your applications, such as confirming actions, providing instructions, or giving feedback on their performance. This can make the interactions more natural and intuitive.
In summary, the TTS model can be a valuable tool for developers looking to create more human-like interactions, improve accessibility, and enhance the user experience in various applications.”

It’s a good answer that’s informative and quite practical for developers. Let’s compare what happens when we try the Reka Edge model with a PDF input titled “Meta AI — Responsible Use Guide” (PDF).

As an AI developer, how can I ensure that I am following the responsible AI guidelines outlined in this document?
“To ensure that your use of AI aligns with the principles outlined in this guide, you can follow these steps and best practices:
  1. Familiarize yourself with the principles of responsible AI, such as transparency, accountability, and user empowerment, as outlined in the guide.
  2. Understand the potential impacts of AI on users and society, and be aware of the risks and challenges associated with the development and deployment of AI systems.
  3. Embrace an iterative model-development mindset, continuously striving for improvement and learning from feedback and errors.
  4. Build a responsible AI ecosystem by working with partners, experts, and communities to develop and share best practices, tools, and benchmarking resources.
  5. Follow the guidelines and best practices outlined in the guide for each stage of model development, from data collection and preparation to fine-tuning, evaluation, and deployment.
  6. Consider the specific use case and requirements for your AI system, and tailor your approach accordingly.
  7. Regularly monitor and evaluate the performance and safety of your AI system and make adjustments as needed to ensure it aligns with the principles outlined in the guide.
  8. Be transparent and accountable in your interactions with users and stakeholders, and be open to feedback and suggestions for improvement.
  9. Stay informed about the latest developments and research in the field of AI, and be prepared to adapt your approach as needed to address new challenges and opportunities.
  10. Follow the specific guidelines and best practices outlined in the addendum for Code Llama, such as defining content policies, evaluating and benchmarking models, and addressing input- and output-level risks.”

Reka Edge’s answer is quite thorough and detailed, too!

If you want to dive into coding, start by creating a Reka account and getting an API key. Then you can follow along with this notebook. Try out their Python SDK, and make sure to explore the different models available. It’s a great way to see how these tools work under the hood!

Gemini Models

Gemini 1.5, developed by Google DeepMind, is designed to handle complex tasks efficiently, thanks to the new Mixture-of-Experts (MoE) system. MoE is like packing several specialized models inside one big model. Instead of using the whole network for every task, Gemini 1.5 activates only the parts (or what they call “experts”) that are most relevant. This means Gemini 1.5 can tackle complex tasks with less processing power than a typical monolithic model.

We can explore Gemini’s features in Google AI Studio since it’s already baked right in. We’ll begin by uploading the following image illustrating VLM capabilities.

We’ll ask Gemini to analyze the image and extract details or concepts:

Explain this image. What are the specific details or concepts it represents?
“The image is a flowchart illustrating different types of image understanding tasks with a Vision Language Model.
  • Object Localization: This task aims to identify the precise location of objects within an image. The example question asks, “Is one cat behind another?”
  • Zero-shot Segmentation: This task involves segmenting an image into different regions based on a given label, even if the model has never seen that specific label before. The example question asks to segment “striped cat.”
  • Zero-shot Visual QA: This task requires answering questions about an image, even if the model has not been specifically trained to answer that type of question. The example question asks, “What is the breed of these cats?”
  • One-shot Learning with Instructions: This task involves providing the model with a single example and instruction to learn a new concept and apply it to other images. The example question asks, “Striped cats are called tabby cats. What is the breed of the cats in the image?”

Well, Gemini nailed it — it clearly understood the image and recognized the tasks a VLM handles. This is exactly what you’d expect from a model designed for advanced image analysis.

Now, we’ll upload another image — this time of an Italian dish — and ask Gemini if it can recognize the dishes and list the ingredients needed for them. This type of “food recognition task” will help us test its ability to identify foods and ingredients.

List the different dishes in this image, also from which country, and give me the ingredients I need to make them.
“The image shows an Italian feast with the following dishes:
  1. Seafood Stew: A rich and flavorful stew with seafood like clams, mussels, shrimp, fish, and vegetables in a tomato broth.
  2. Spaghetti with Tomato Sauce: Classic spaghetti pasta tossed in a simple yet flavorful tomato sauce.
  3. Meatballs in Tomato Sauce: Tender and juicy meatballs simmered in a rich tomato sauce.
  4. Lasagna: Layered pasta with meat sauce, cheese, and tomato sauce.
  5. Side Salad: A refreshing side salad with mixed greens, tomatoes, and other vegetables.
  6. Garlic Bread: Toasted bread slices rubbed with garlic and olive oil.
  7. Marinara Sauce: A simple tomato-based sauce.
This is a typical Italian meal with various dishes showcasing the flavors of Italian cuisine.

As for the ingredients, you’ll need tomatoes for many of these dishes, specifically for the sauces and the side salad.”

Well, the model did a good job recognizing the dishes, but it didn’t give me all the necessary ingredients. Let’s see how it does with videos instead, using the same video we used earlier with the Reka models. This time, the task will focus on action recognition.

What is the purpose of this video, and what key actions or processes are being demonstrated?
“The video is a tutorial about the importance of friendship and how it can impact our lives. It demonstrates the use of a video editing software called “Genmy” to create a video and add text subtitles. The video creator is using Genmy to:
  • Record their voice: This is seen at the top of the screen with the mic icon.
  • Add text to speech: You see the “Text to Speech” option, which is likely being used to generate subtitles or captions for the video.
  • Edit the timeline: They’re working with a video timeline, adding and arranging different elements.
  • Add voice-over: The speaker is adding commentary to the video.
  • Create a visually appealing presentation: They’re using a combination of video footage, text, and voice-over to create an engaging tutorial.”

Well, Gemini gave a clear breakdown of the video, identifying it as a tutorial demonstrating the Genmy video editing software to highlight how it’s used for recording voice, adding text-to-speech, editing video timelines, and allowing voice-over. Impressive!

Let’s upload one more video, the same one we used for Reka about the Parler TTS model, and ask Gemini to analyze and summarize the contents.

Summarize this video and explain how the contents might be relevant or helpful for software engineers.
“This video showcases different voice samples generated by a text-to-speech (TTS) system called “Parler-TTS.” The video highlights various voice characteristics like gender, pace, pitch, and even the simulated environment. For software engineers, this video demonstrates the progress made in TTS technology. These are some insights:
  • Voice Variety: The video shows that Parler-TTS can create diverse voices, potentially useful for applications needing different personas or tones.
  • Fine-grained Control: The video emphasizes the ability to control voice parameters like pace and pitch, showcasing the potential for customizability.
  • Quality and Naturalness: The video focuses on the quality of the generated speech, demonstrating the advancements made in speech synthesis, which is crucial for user experience.”

Nicely done! I can go with that answer. Gemini explains adjusting voice settings, like pitch and speed, and how having different voices can be useful. Gemini also emphasizes the importance of natural, high-quality speech, which is handy for developers working with TTS systems!

Alright, for coding, you can grab the code from Google AI Studio by clicking the Get Code button. You can have the code formatted in Python, Swift, Java, and other languages.
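For reference, the JavaScript flavor of that generated code looks roughly like the sketch below, using Google’s @google/generative-ai SDK. Treat the model name, file path, and environment variable as placeholders, and compare against the exact snippet AI Studio produces for you:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { readFileSync } from "node:fs";

// Assumes GEMINI_API_KEY is set in your environment.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

// Send an image alongside a text prompt, just like we did in AI Studio.
const image = {
  inlineData: {
    data: readFileSync("vlm-tasks.png").toString("base64"),
    mimeType: "image/png",
  },
};

const result = await model.generateContent([
  "Explain this image. What are the specific details or concepts it represents?",
  image,
]);

console.log(result.response.text());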

Conclusion

Both Reka and Gemini are strong multimodal models for AI applications, but there are key differences between them to consider. Here’s a table that breaks those down:

| Feature | Reka | Gemini 1.5 |
| --- | --- | --- |
| Multimodal Capabilities | Image, video, and text processing | Image, video, and text, with extended token context |
| Efficiency | Optimized for multimodal tasks | Built with MoE for efficiency |
| Context Window | Standard token window | Up to two million tokens (Gemini 1.5 Pro) |
| Architecture | Focused on multimodal task flow | MoE improves specialization |
| Training/Serving | High performance with efficient model switching | More efficient training with MoE architecture |
| Deployment | Supports on-device deployment | Primarily cloud-based, with Vertex AI integration |
| Use Cases | Interactive apps, edge deployment | Suited for large-scale, long-context applications |
| Languages Supported | Multiple languages | Supports many languages with long context windows |

Reka stands out for on-device deployment, which is super useful for apps requiring offline capabilities or low-latency processing.

On the other hand, Gemini 1.5 Pro shines with its long context windows, making it a great option for handling large documents or complex queries in the cloud.

How A Bottom-Up Design Approach Enhances Site Accessibility

Accessibility is key in modern web design. A site that doesn’t consider how its user experience may differ for various audiences — especially those with disabilities — will fail to engage and serve everyone equally. One of the best ways to prevent this is to approach your site from a bottom-up perspective.

Understanding Bottom-Up Design

Conventional, top-down design approaches start with the big picture before breaking these goals and concepts into smaller details. Bottom-up philosophies, by contrast, consider the minute details first, eventually achieving the broader goal piece by piece.

This alternative way of thinking is important for accessibility in general because it reflects how neurodivergent people commonly think. While non-autistic people tend to think from a top-down perspective, those with autism often employ a bottom-up way of thinking.

Of course, there is considerable variation, and researchers have identified at least three specialist thinking types within the autism spectrum:

  • Visual thinkers who think in images;
  • Pattern thinkers who think of concepts in terms of patterns and relationships;
  • Verbal thinkers who think only in word detail.

Still, research shows that people with autism and ADHD show a bias toward bottom-up thinking rather than the top-down approach you often see in neurotypical users. Consequently, a top-down strategy means you may miss details your audience may notice, and your site may not feel easily usable for all users.

As a real-world example, consider the task of writing an essay. Many students are instructed to start an essay assignment by thinking about the main point they want to convey and then create an outline with points that support the main argument. This is top-down thinking — starting with the big picture of the topic and then gradually breaking down the topic into points and then later into words that articulate these points.

In contrast, someone who uses a bottom-up thinking approach might start an essay with no outline but rather just by freely jotting down every idea that comes to mind as it comes to mind — perhaps starting with one particular idea or example that the writer finds interesting and wants to explore further and branching off from there. Then, once all the ideas have been written out, the writer goes back to group related ideas together and arrange them into a logical outline. This writer starts with the small details of the essay and then works these details into the big picture of the final form.

In web design, in particular, a bottom-up approach means starting with the specifics of the user experience instead of the desired effect. You may determine a readable layout for a single blog post, then ask how that page relates to others and slowly build on these decisions until you have several well-organized website categories.

You may even get more granular. Say you start your site design by placing a menu at the bottom of a mobile site to make it easier to tap with one hand, improving ease of use. Then, you build a drop-down menu around that choice — placing the most popular or needed options at the bottom instead of the top for easy tapping. From there, you may have to rethink larger-scale layouts to work around those interactive elements being lower on the screen, slowly addressing larger categories until you have a finished site design.

In either case, the idea of bottom-up design is to begin with the most specific problems someone might have. You then address them in sequence instead of determining the big picture first.

Benefits Of A Bottom-Up Approach For Accessible Design

While neither bottom-up nor top-down approaches dominate the industry, some web engineers prefer the bottom-up approach for the accessibility benefits it provides.

Putting User Needs First

The biggest benefit of bottom-up methods is that they prioritize the user’s needs.

Top-down approaches seem organized, but they often result in a site that reflects the designer’s choices and beliefs more than it serves your audience.

Consider some of the complaints that social media users have made over the years related to usability and accessibility for the everyday user. For example, many users complain that their Facebook feed will randomly refresh as they scroll for the sake of providing users with the most up-to-date content. However, the feature makes it virtually impossible to get back to a post a user viewed that they didn’t think to save. Likewise, TikTok’s watch history feature has come and gone over the years and still today is difficult for many users to find without viewing an outside tutorial on the subject.

This is a common problem: 95.9% of the largest one million homepages have Web Content Accessibility Guidelines (WCAG) errors. While a bottom-up alternative doesn’t mean you won’t make any mistakes, it may make them less likely, as bottom-up thinking often improves your awareness of new stimuli so you can catch things you’d otherwise overlook. It’s easier to meet users’ needs when you build your entire site around their experience instead of treating UX as an afterthought.

Consider this example from Berkshire Hathaway, a multi-billion-dollar holding company. The overall design philosophy is understandable: It’s simple and direct, choosing to focus on information instead of fancy aesthetics that may not suit the company image. However, you could argue it loses itself in this big picture.

While it is simple, the lack of menus or color contrast and the small font make it harder to read and a little overwhelming. This confusion can counteract any accessibility benefits of its simplicity.

Alternatively, even a simple website redesign could include intuitive menus, additional contrast, and accessible font for easy navigation across the site.

The homepage for U.K. charity Scope offers a better example of web design centered around users’ needs. Concise, clear menus line the top of the page to aid quicker, easier navigation. The color scheme is simple enough to avoid confusion but provides enough contrast to make everything easy to read — something the sans-serif font further helps.

Ensuring Accessibility From The Start

A top-down method also makes catering to a diverse audience difficult because you may need to shoehorn features into an existing design.

For example, say, a local government agency creates a website focused on providing information and services to a general audience of residents. The site originally featured high-resolution images, bright colors, and interactive charts.

However, they realize the images are not accessible to people navigating the site with screen readers, while multiple layers of submenus are difficult for keyboard-only users. Further, the bright colors make it hard for visually impaired users to read the site’s information.

The agency, realizing these accessibility concerns, adds captions to each image. However, the captions disrupt the originally intended visual aesthetics and user flow. Further, adjusting the bright colors would involve completely rethinking the site’s entire color palette. If these considerations had been made before the site was built, the site build could have specifically accommodated these elements while still creating an aesthetically pleasing and user-friendly result.

Alternatively, a site initially built with high contrast, a calm color scheme, clear typography, simple menus, and reduced imagery would make this site much more accessible to a wide user base from the get-go.

As a real-world example, consider the Awwwards website. There are plenty of menus to condense information and ease navigation without overloading the screen — a solid accessibility choice. However, there does not seem to be consistent thought in these menus’ placement or organization.

There are far too many menus; some are at the top while others are at the bottom, and a scrolling top bar adds further distractions. It seems like Awwwards may have added extra menus as an afterthought to improve navigation, which leads to inconsistency and crowding because these elements weren’t part of the design from the start.

In contrast, bottom-up alternatives address usability issues from the beginning, which results in a more functional, accessible website.

Redesigning a system to address a usability issue it didn’t originally make room for is challenging. It can lead to errors like broken links and other unintended consequences that may hinder access for other visitors. Some sites have even claimed to lose 90% of their traffic after a redesign. While bottom-up approaches won’t eliminate those possibilities, they make them less likely by centering everything around usage from the start.

The website for the Vasa Museum in Stockholm, Sweden, showcases a more cohesive approach to ensuring accessibility. Like Awwwards, it uses menus to aid navigation and organization, but there seems to be more forethought into these features. All menus are at the top, and there are fewer of them, resulting in less clutter and a faster way to find what you’re looking for. The overall design complements this by keeping things simple and neat so that the menus stand out.

Increasing Awareness

Similarly, bottom-up design ensures you don’t miss as many accessibility concerns. When you start at the top, before determining what details fit within it, you may not consider all the factors that influence it. Beginning with the specifics instead makes it easier to spot and address problems you would’ve missed otherwise.

This awareness is particularly important for serving a diverse population. An estimated 16% of the global population — 1.6 billion people — have a significant disability. That means there’s a huge range of varying needs to account for, but most people lack firsthand experience living with these conditions. Consequently, it’s easy to miss things impacting others’ UX. You can overcome that knowledge gap by asking how everyone can use your site first.

Bottom-Up vs. Top-Down: Which Is Best for You?

As these benefits show, a bottom-up design philosophy can be helpful when building an accessible site. Still, top-down methods can be advantageous at times, too. Which is best depends on your situation.

Top-down approaches are a good way to ensure a consistent brand image, as you start with the overall idea and base future decisions on this concept. It also makes it easier to create a design hierarchy to facilitate decision-making within your team. When anyone has a question, they can turn to whoever is above them or refer to the broader goals. Such organization can also mean faster design processes.

Bottom-up methods, by contrast, are better when accessibility for a diverse audience is your main concern. It may be harder to keep everyone on the same overall design philosophy page, but it usually produces a more functional website. You can catch and solve problems early and pay great attention to detail. However, this can mean longer design cycles, which can incur extra costs.

It may come down to what your team is most comfortable with. People think and work differently, with some preferring a top-down approach while others find bottom-up more natural. Combining the two — starting with a top-down model before tackling updates from a bottom-up perspective — can be beneficial, too.

Strategies For Implementing A Bottom-Up Design Model

Should you decide a bottom-up design method is best for your goals, here are some ways you can embrace this philosophy.

Talk To Your Existing User Base

One of the most important factors in bottom-up web design is to center everything around your users. As a result, your existing user base — whether from a separate part of your business or another site you run — is the perfect place to start.

Survey customers and web visitors about their experience on your sites and others. Ask what pain points they have and what features they’d appreciate. Any commonalities between responses deserve attention. You can also turn to WCAG standards for inspiration on accessible functionality, but first-hand user feedback should form the core of your mission.

Look To Past Projects For Accessibility Gaps

Past sites and business projects can also reveal what specifics you should start with. Look for any accessibility gaps by combing through old customer feedback and update histories and using these sites yourself to find issues. Take note of any barriers or usability concerns to address in your next website.

Remember to document everything you find as you go. Up to 90% of organizations’ data is unstructured, making it difficult to analyze later. Reversing that trend by organizing your findings and recording your accessible design process will streamline future accessibility optimization efforts.

Divide Tasks But Communicate Often

Keep in mind that a bottom-up strategy can be time-consuming. One of the reasons why top-down alternatives are popular is because they’re efficient. You can overcome this gap by splitting tasks between smaller teams. However, these groups must communicate frequently to ensure separate design considerations work as a cohesive whole.

A DevOps approach is helpful here. DevOps has helped 49% of its adopters achieve a faster time to market, and 61% report higher-quality deliverables. It also includes space for both detailed work and team-wide meetings to keep everyone on track. Such benefits ensure you can remain productive in a bottom-up strategy.

Accessible Websites Need A Bottom-Up Design Approach

You can’t overstate the importance of accessible website design. By the same token, bottom-up philosophies are crucial in modern site-building. A detail-oriented approach makes it easier to serve a more diverse audience along several fronts. Making the most of this opportunity will both extend your reach to new niches and make the web a more equitable place.

The Web Accessibility Initiative’s WCAG standards are a good place to start. While these guidelines don’t necessarily describe how to apply a bottom-up approach, they do outline critical user needs and accessibility concerns your design should consider. The site also offers a free and comprehensive Digital Accessibility Foundations course for designers and developers.

Familiarizing yourself with these standards and best practices will make it easier to understand your audience before you begin designing your site. You can then build a more accessible platform from the ground up.

Additionally, the following are some valuable related reads that can act as inspiration in accessibility-centered and user-centric design.

By employing bottom-up thinking as well as resources like these in your design approach, you can create websites that not only meet current accessibility standards but actively encourage site use among users of all backgrounds and abilities.

SVG Coding Examples: Useful Recipes For Writing Vectors By Hand

Even though I am the kind of front-end engineer who manually cleans up SVG files when they are a mess, I never expected to become one of those people. You know, those crazy people that draw with code.

But here we are.

I dove deep into SVG specs last winter when I created a project to draw Calligraphy Grids, and even though I knew the basic structures and rules of SVG, it was only then that I fully tried to figure out and understand what all of those numbers meant and how they interacted with each other.

And, once you get the hang of it, it is actually very interesting and quite fun to code SVG by hand.

No <path> ahead

We won’t go into more complex SVG shapes like paths in this article; this is more about practical information for simple SVGs. When it comes to drawing curves, I still recommend using a tool like Illustrator or Affinity. However, if you are super into compounding your lines, a path is useful. Maybe we’ll do that in Part 2.

Also, this guide focuses mostly on practical examples that illustrate some of the math involved when drawing SVGs. There is a wonderful article here that goes a bit deeper into the specs, which I recommend reading if you’re more interested in that: “A Practical Guide To SVG And Design Tools.”

Drawing With Math. Remember Coordinate Systems?

Illustrator, Affinity, and all other vector programs are basically just helping you draw on a coordinate system, and then those paths and shapes are stored in SVG files.

If you open up these files in an editor, you’ll see that they are just a bunch of paths that contain lots of numbers, which are coordinates in that coordinate system that make up the lines.

But, there is a difference between the all-powerful <path> and the other, more semantic elements like <rect>, <circle>, <line>, <ellipse>, <polygon>, and <polyline>.

These elements are not that hard to read and write by hand, and they open up a lot of possibilities to add animation and other fun stuff. So, while most people might only think of SVGs as never-pixelated, infinitely scaling images, they can also be quite comprehensive pieces of code.

How Does SVG Work? unit != unit

Before we get started on how SVG elements are drawn, let’s talk about the ways units work in SVG because they might be a bit confusing when you first get started.

The beauty of SVG is that it’s a vector format, which means that the units are somewhat detached from the browser and are instead just relative to the coordinate system you’re working in.

That means you would not use a unit within SVG but rather just use numbers and then define the size of the document you’re working with.

So, your width and height might be using CSS rem units, but in your viewBox, units become just a concept that helps you in establishing sizing relationships.

What Is The viewBox?

The viewBox works a little bit like the CSS aspect-ratio property. It helps you establish a relationship between the width and the height of your coordinate system and sets up the box you’re working in. I tend to think of the viewBox as my “document” size.

Any element that is placed within the SVG with bigger dimensions than the viewBox will not be visible. So, the viewBox is the cutout of the coordinate system we’re looking through. The width and height attributes are unnecessary if there is a viewBox attribute.

So, in short, having an SVG with a viewBox makes it behave a lot like a regular image. And just like with images, it’s usually easiest to just set either a width or a height and let the other dimension be automatically sized based on the intrinsic aspect ratio dimensions.
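For example, both of the following render the same circle at the same on-screen size; the viewBox only decides what “one unit” means inside the drawing, and the rendered height follows automatically from the 2:1 aspect ratio (the 20rem width is an arbitrary choice):

<!-- A 100 x 50 coordinate system, displayed 20rem wide and therefore 10rem tall -->
<svg width="20rem" viewBox="0 0 100 50" xmlns="http://www.w3.org/2000/svg">
  <circle cx="50" cy="25" r="20" />
</svg>

<!-- The same drawing at the same display size, but now each unit is ten times smaller -->
<svg width="20rem" viewBox="0 0 1000 500" xmlns="http://www.w3.org/2000/svg">
  <circle cx="500" cy="250" r="200" />
</svg>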

So, if we were to create a function that draws an SVG, we might store three separate variables and fill them in like this:

`<svg 
  width="${svgWidth}" 
  viewBox="0 0 ${documentWidth} ${documentHeight}" 
  xmlns="http://www.w3.org/2000/svg"
>`;

SVG Things Of Note

There is a lot to know about SVG: when you want to reuse an image a lot, you may want to turn it into a symbol that can then be referenced with a use tag; you can create sprites; there are some best practices for using SVGs as icons; and so on.

Unfortunately, this is a bit out of the scope of this article. Here, we’re mainly focusing on designing SVG files and not on how we can optimize and use them.

However, one thing of note that is easier to implement from the start is accessibility.

SVGs can be used in an <img> tag, where alt tags are available, but then you lose the ability to interact with your SVG code, so inlining might be your preference.

When inlining, it’s easiest to declare role="img" and then add a <title> tag with your image title.

Note: You can check out this article for SVG and Accessibility recommendations.

<svg
  role="img"
  [...attr]
>
  <title>An accessible title</title>
  <!-- design code -->
</svg>
Drawing SVG With JavaScript

There is usually some mathematics involved when drawing SVGs. It’s usually fairly simple arithmetic (except, you know, in case you draw calligraphy grids and then have to dig out trigonometry…), but I think even for simple math, most people don’t write their SVGs in pure HTML and thus would like to use algebra.

At least for me, I find it much easier to understand SVG code when the numbers have meaning, so I always stick to JavaScript, and by giving my coordinates names, I like them immeasurably more.

So, for the upcoming examples, we’ll look at the list of variables with the simple math and then JSX-style templates for interpolation, as that gives more legible syntax highlighting than string interpolations, and then each example will be available as a CodePen.

To keep this Guide framework-agnostic, I wanted to quickly go over drawing SVG elements with just good old vanilla JavaScript.

We’ll create a container element in HTML that we can put our SVG into and grab that element with JavaScript.

<div data-svg-container></div>
<script src="template.js"></script>

To make it simple, we’ll draw a rectangle <rect> that covers the entire viewBox and uses a fill.

Note: You can add all valid CSS values as fills, so a fixed color, or something like currentColor to access the site’s text color or a CSS variable would work here if you’re inlining your SVG and want it to interact with the page it’s placed in.

Let’s first start with our variable setup.

// vars
const container = document.querySelector("[data-svg-container]");
const svgWidth = "30rem"; // use any value with units here
const documentWidth = 100;
const documentHeight = 100;
const rectWidth = documentWidth;
const rectHeight = documentHeight;
const rectFill = "currentColor"; // use any color value here
const title = "A simple square box";

Method 1: Create Element and Set Attributes

This method is easier to keep type-safe (if using TypeScript) — uses proper SVG elements and attributes, and so on — but it is less performant and may take a long time if you have many elements.

const svg = document.createElementNS("http://www.w3.org/2000/svg", "svg");
const titleElement = document.createElementNS("http://www.w3.org/2000/svg", "title");
const rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");

svg.setAttribute("width", svgWidth);
svg.setAttribute("viewBox", 0 0 ${documentWidth} ${documentHeight});
svg.setAttribute("xmlns", "http://www.w3.org/2000/svg");
svg.setAttribute("role", "img");

titleElement.textContent = title;

rect.setAttribute("width", rectWidth);
rect.setAttribute("height", rectHeight);
rect.setAttribute("fill", rectFill);

svg.appendChild(titleElement);
svg.appendChild(rect);

container.appendChild(svg);
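A second, less type-safe but faster option — the string-interpolation route mentioned earlier — is to build the whole markup as one template literal and hand it to the browser in a single innerHTML assignment. Here’s a minimal sketch using the same variables:

// Sketch: build the SVG as a string and parse it once via innerHTML.
// Attribute names are just text here, so typos won't be caught by tooling.
const svgTemplate = `
  <svg
    width="${svgWidth}"
    viewBox="0 0 ${documentWidth} ${documentHeight}"
    xmlns="http://www.w3.org/2000/svg"
    role="img"
  >
    <title>${title}</title>
    <rect width="${rectWidth}" height="${rectHeight}" fill="${rectFill}" />
  </svg>
`;

container.innerHTML = svgTemplate;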

Here, you can see that with the same coordinates, a polyline won’t draw the line between the blue and the red dot, while a polygon will. However, when applying a fill, they take the exact same information as if the shape was closed, which is the right side of the graphic, where the polyline makes it look like a piece of a circle is missing.

This is the second time where we have dealt with quite a bit of repetition, and we can have a look at how we could leverage the power of JavaScript logic to render our template faster.

But first, we need a basic implementation like we’ve done before. We’re creating objects for the circles, and then we’re chaining the cx and cy values together to create the points attribute. We’re also storing our transforms in variables.

const polyDocWidth = 200;
const polyDocHeight = 200;
const circleOne = { cx: 25, cy: 80, r: 10, fill: "red" };
const circleTwo = { cx: 40, cy: 20, r: 5, fill: "lime" };
const circleThree = { cx: 70, cy: 60, r: 8, fill: "cyan" };
const points = `${circleOne.cx},${circleOne.cy} ${circleTwo.cx},${circleTwo.cy} ${circleThree.cx},${circleThree.cy}`;
const moveToTopRight = `translate(${polyDocWidth / 2}, 0)`;
const moveToBottomRight = `translate(${polyDocWidth / 2}, ${polyDocHeight / 2})`;
const moveToBottomLeft = `translate(0, ${polyDocHeight / 2})`;

And then, we apply the variables to the template, using either a polyline or polygon element and a fill attribute that is either set to none or a color value.


<svg
  width={svgWidth}
  viewBox={`0 0 ${polyDocWidth} ${polyDocHeight}`}
  xmlns="http://www.w3.org/2000/svg"
  role="img"
>
  <title>Composite shape comparison</title>
  <g>
    <circle
      cx={circleOne.cx}
      cy={circleOne.cy}
      r={circleOne.r}
      fill={circleOne.fill}
    />
    <circle
      cx={circleTwo.cx}
      cy={circleTwo.cy}
      r={circleTwo.r}
      fill={circleTwo.fill}
    />
    <circle
      cx={circleThree.cx}
      cy={circleThree.cy}
      r={circleThree.r}
      fill={circleThree.fill}
    />
    <polyline
      points={points}
      fill="none"
      stroke="black"
    />
  </g>
  <g transform={moveToTopRight}>
    <circle
      cx={circleOne.cx}
      cy={circleOne.cy}
      r={circleOne.r}
      fill={circleOne.fill}
    />
    <circle
      cx={circleTwo.cx}
      cy={circleTwo.cy}
      r={circleTwo.r}
      fill={circleTwo.fill}
    />
    <circle
      cx={circleThree.cx}
      cy={circleThree.cy}
      r={circleThree.r}
      fill={circleThree.fill}
    />
    <polyline
      points={points}
      fill="white"
      stroke="black"
    />
  </g>
  <g transform={moveToBottomLeft}>
    <circle
      cx={circleOne.cx}
      cy={circleOne.cy}
      r={circleOne.r}
      fill={circleOne.fill}
    />
    <circle
      cx={circleTwo.cx}
      cy={circleTwo.cy}
      r={circleTwo.r}
      fill={circleTwo.fill}
    />
    <circle
      cx={circleThree.cx}
      cy={circleThree.cy}
      r={circleThree.r}
      fill={circleThree.fill}
    />
    <polygon
      points={points}
      fill="none"
      stroke="black"
    />
  </g>
  <g transform={moveToBottomRight}>
    <circle
      cx={circleOne.cx}
      cy={circleOne.cy}
      r={circleOne.r}
      fill={circleOne.fill}
    />
    <circle
      cx={circleTwo.cx}
      cy={circleTwo.cy}
      r={circleTwo.r}
      fill={circleTwo.fill}
    />
    <circle
      cx={circleThree.cx}
      cy={circleThree.cy}
      r={circleThree.r}
      fill={circleThree.fill}
    />
    <polygon
      points={points}
      fill="white"
      stroke="black"
    />
  </g>
</svg>

And here’s a version of it to play with:

See the Pen SVG Polygon / Polyline (simple) [forked] by Myriam.

Dealing With Repetition

When it comes to drawing SVGs, you may find that you’ll be repeating a lot of the same code over and over again. This is where JavaScript can come in handy, so let’s look at the composite example again and see how we could optimize it so that there is less repetition.

Observations:

  • We have three circle elements, all following the same pattern.
  • We create one repetition to change the fill style for the element.
  • We repeat those two elements one more time, with either a polyline or a polygon.
  • We have four different transforms (technically, “no transform” is one of the four in this case).

This tells us that we can create nested loops.

Let’s go back to just a vanilla implementation for this since the way loops are done is quite different across frameworks.

You could make this more generic and write separate generator functions for each type of element, but this is just to give you an idea of what you could do in terms of logic. There are certainly still ways to optimize this.

I’ve opted to have arrays for each type of variation that we have and wrote a helper function that goes through the data and builds out an array of objects with all the necessary information for each group. In such a short array, it would certainly be a viable option to just have the data stored in one element, where the values are repeated, but we’re taking the DRY thing seriously in this one.

The group array can then be looped over to build our SVG HTML.

const container = document.querySelector("[data-svg-container]");
const svgWidth = 200;
const documentWidth = 200;
const documentHeight = 200;
const halfWidth = documentWidth / 2;
const halfHeight = documentHeight / 2;
const circles = [
  { cx: 25, cy: 80, r: 10, fill: "red" },
  { cx: 40, cy: 20, r: 5, fill: "lime" },
  { cx: 70, cy: 60, r: 8, fill: "cyan" },
];
const points = circles.map(({ cx, cy }) => `${cx},${cy}`).join(" ");
const elements = ["polyline", "polygon"];
const fillOptions = ["none", "white"];
const transforms = [
  undefined,
  `translate(${halfWidth}, 0)`,
  `translate(0, ${halfHeight})`,
  `translate(${halfWidth}, ${halfHeight})`,
];
const makeGroupsDataObject = () => {
  let counter = 0;
  const g = [];
  elements.forEach((element) => {
    fillOptions.forEach((fill) => {
      const transform = transforms[counter++];
      g.push({ element, fill, transform });
    });
  });
  return g;
};
const groups = makeGroupsDataObject();
// result:
// [
//   {
//     element: "polyline",
//     fill: "none",
//   },
//   {
//     element: "polyline",
//     fill: "white",
//     transform: "translate(100, 0)",
//   },
//   {
//     element: "polygon",
//     fill: "none",
//     transform: "translate(0, 100)",
//   },
//   {
//     element: "polygon",
//     fill: "white",
//     transform: "translate(100, 100)",
//   }
// ]

const svg = document.createElementNS("http://www.w3.org/2000/svg", "svg");
svg.setAttribute("width", svgWidth);
svg.setAttribute("viewBox", 0 0 ${documentWidth} ${documentHeight});
svg.setAttribute("xmlns", "http://www.w3.org/2000/svg");
svg.setAttribute("role", "img");
svg.innerHTML = "<title>Composite shape comparison</title>";
groups.forEach((groupData) => {
  const circlesHTML = circles
    .map((circle) => {
      return `<circle 
          cx="${circle.cx}" 
          cy="${circle.cy}" 
          r="${circle.r}" 
          fill="${circle.fill}"
        />`;
    })
    .join("");
  const polyElementHTML = `<${groupData.element} 
      points="${points}" 
      fill="${groupData.fill}" 
      stroke="black" 
    />`;
  const group = `<g ${groupData.transform ? `transform="${groupData.transform}"` : ""}>
        ${circlesHTML}
        ${polyElementHTML}
      </g>`;
  svg.innerHTML += group;
});
container.appendChild(svg);

And here’s the Codepen of that:

See the Pen SVG Polygon / Polyline (JS loop version) [forked] by Myriam.

More Fun Stuff

Now, that’s all the basics I wanted to cover, but there is so much more you can do with SVG. There is more you can do with transform; you can use a mask, you can use a marker, and so on.

We don’t have time to dive into all of them today, but since this started for me when making Calligraphy Grids, I wanted to show you the two most satisfying ones, which I, unfortunately, can’t use in the generator since I wanted to be able to open my generated SVGs in Affinity and it doesn’t support pattern.

Okay, so pattern is part of the defs section within the SVG, which is where you can define reusable elements that you can then reference in your SVG.

Graph Grid with pattern

If you think about it, a graph is just a bunch of horizontal and vertical lines that repeat across the x- and y-axis.

So, pattern can help us with that. We can create a <rect> and then reference a pattern in the fill attribute of the rect. The pattern then has its own width, height, and viewBox, which defines how the pattern is repeated.

So, let’s say we want to perfectly center our graph grid in any given width or height, and we want to be able to define the size of our resulting squares (cells).

Once again, let’s start with the JavaScript variables:

const graphDocWidth = 226;
const graphDocHeight = 101;
const cellSize = 5;
const strokeWidth = 0.3;
const strokeColor = "currentColor";
const patternHeight = (cellSize / graphDocHeight) * 100;
const patternWidth = (cellSize / graphDocWidth) * 100;
const gridYStart = (graphDocHeight % cellSize) / 2;
const gridXStart = (graphDocWidth % cellSize) / 2;
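To see how this centers the grid, run the numbers: 226 % 5 and 101 % 5 both leave a remainder of 1, so gridXStart and gridYStart each come out to 0.5 — half of the leftover space, which nudges the first line inward so the partial cells at opposite edges are equal. Similarly, patternHeight works out to (5 / 101) × 100 ≈ 4.95%, meaning each vertical repetition of the pattern occupies exactly one cell’s share of the document height.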

Now, we can apply them to the SVG element:

<svg
  width={svgWidth}
  viewBox={`0 0 ${graphDocWidth} ${graphDocHeight}`}
  xmlns="http://www.w3.org/2000/svg"
  role="img"
>
  <defs>
    <pattern
      id="horizontal"
      viewBox={`0 0 ${graphDocWidth} ${strokeWidth}`}
      width="100%"
      height={`${patternHeight}%`}
    >
      <line
        x1="0"
        x2={graphDocWidth}
        y1={gridYStart}
        y2={gridYStart}
        stroke={strokeColor}
        stroke-width={strokeWidth}
      />
    </pattern>
    <pattern
      id="vertical"
      viewBox={`0 0 ${strokeWidth} ${graphDocHeight}`}
      width={`${patternWidth}%`}
      height="100%"
    >
      <line
        y1={0}
        y2={graphDocHeight}
        x1={gridXStart}
        x2={gridXStart}
        stroke={strokeColor}
        stroke-width={strokeWidth}
      />
    </pattern>
  </defs>
  <title>A graph grid</title>
  <rect
    width={graphDocWidth}
    height={graphDocHeight}
    fill="url(#horizontal)"
  />
  <rect
    width={graphDocWidth}
    height={graphDocHeight}
    fill="url(#vertical)"
  />
</svg>

And this is what that then looks like:

See the Pen SVG Graph Grid [forked] by Myriam.

Dot Grid With pattern

If we wanted to draw a dot grid instead, we could simply repeat a circle. Or, we could use a line with a stroke-dasharray to create a dotted line: a dash length of 0 combined with stroke-linecap="round" renders each dash as a round dot, and the gap value spaces the dots one cell apart. And we’d only need one line in this case.

Starting with our JavaScript variables:

const dotDocWidth = 219;
const dotDocHeight = 100;
const cellSize = 4;
const strokeColor = "black";
const gridYStart = (dotDocHeight % cellSize) / 2;
const gridXStart = (dotDocWidth % cellSize) / 2;
const dotSize = 0.5;
const patternHeight = (cellSize / dotDocHeight) * 100;

And then adding them to the SVG element:

<svg
  width={svgWidth}
  viewBox={`0 0 ${dotDocWidth} ${dotDocHeight}`}
  xmlns="http://www.w3.org/2000/svg"
  role="img"
>
  <defs>
    <pattern
      id="horizontal-dotted-line"
      viewBox={`0 0 ${dotDocWidth} ${dotSize}`}
      width="100%"
      height={`${patternHeight}%`}
    >
      <line
        x1={gridXStart}
        y1={gridYStart}
        x2={dotDocWidth}
        y2={gridYStart}
        stroke={strokeColor}
        stroke-width={dotSize}
        stroke-dasharray={`0,${cellSize}`}
        stroke-linecap="round"
      ></line>
    </pattern>
  </defs>
  <title>A Dot Grid</title>
  <rect
    x="0"
    y="0"
    width={dotDocWidth}
    height={dotDocHeight}
    fill="url(#horizontal-dotted-line)"
  ></rect>
</svg>

And this is what that looks like:

See the Pen SVG Dot Grid [forked] by Myriam.

Conclusion

This brings us to the end of our little introductory journey into SVG. As you can see, coding SVG by hand is not as scary as it seems. If you break it down into the basic elements, it becomes quite like any other coding task:

  • We analyze the problem,
  • Break it down into smaller parts,
  • Examine each coordinate and its mathematical breakdown,
  • And then put it all together.

I hope that this article has given you a starting point into the wonderful world of coded images and that it gives you the motivation to delve deeper into the specs and try drawing some yourself.

How To Build Custom Data Visualizations Using Luzmo Flex

This article is sponsored by Luzmo

In this article, I’ll introduce you to Luzmo Flex, a new feature from the Luzmo team who have been working hard making developer tooling to flatten the on-ramp for analytics reporting and data visualization.

With Luzmo Flex, you can hook up a dataset and create beautifully crafted, fully customizable interactive charts that meet your reporting needs. They easily integrate and interact with other components of your web app, allowing you to move away from a traditional “dashboard” interface and build more bespoke data products.

While many charting libraries offer similar features, I often found it challenging to get the data into the right shape that the library needed. In this article, I’ll show you how you can build beautiful data visualizations using the Google Analytics API, and you won’t have to spend any time “massaging” the data!

What Is Luzmo Flex?

Well, it’s two things, really. First of all, Luzmo is a low-code platform for embedded analytics. You can create datasets from just about anything, connect them to APIs like Google Analytics or your PostgreSQL database, or even upload static data in a .csv file and start creating data visualizations with drag and drop.

Secondly, Luzmo Flex is their new React component that can be configured to create custom data visualizations. Everything from the way you query your data to the way you display it can be achieved through code using the LuzmoVizItemComponent.
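To give a feel for what that looks like in practice, here is a rough React sketch of rendering a single chart with the LuzmoVizItemComponent. The package path, prop names, and slot structure below are assumptions for illustration only; check Luzmo’s Flex documentation for the exact API, and swap in the IDs of your own dataset columns:

// Sketch only: the package path and prop names here are assumptions, not verified API.
import { LuzmoVizItemComponent } from "@luzmo/react-embed";

export default function PageViewsChart() {
  return (
    <LuzmoVizItemComponent
      // An authorization key/token pair, typically generated server-side per user
      // (this is what enables the multi-tenant behavior described below).
      authKey={process.env.LUZMO_AUTH_KEY}
      authToken={process.env.LUZMO_AUTH_TOKEN}
      // Which visualization to render and which dataset columns feed it.
      type="bar-chart"
      slots={[
        { name: "measure", content: [{ column: "<total-users-column-id>" }] },
        { name: "category", content: [{ column: "<page-title-column-id>" }] },
      ]}
    />
  );
}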

What makes Luzmo Flex unique is that you can reuse the core functionalities of Luzmo’s low-code embedded analytics platform in your custom-coded components.

That means, besides creating ready-to-use datasets, you can set up functions like the following out-of-the-box:

  • Multi-tenant analytics: Showing different data or visualizations to different users of your app.
  • Localization: Displaying charts in multiple languages, currencies, and timezones without much custom development.
  • Interactivity: Set up event listeners to create complex interactivity between Luzmo’s viz items and any non-Luzmo components in your app.

What Can You Build With Luzmo Flex?

By combining these off-the-shelf functions with flexibility through code, Luzmo Flex makes a great solution for building bespoke data products that go beyond the limits of a traditional dashboard interface. Below are a few examples of what that could look like.

Report Builder

A custom report builder that lets users search and filter a dataset and render it out using a number of different charts.

Filter Panel

Enable powerful filtering using HTML Select inputs, which will update each chart shown on the page.

Wearables Dashboard

Or how about a sleep tracker hooked up to your phone to track all those important snoozes?

When to Consider Luzmo Flex vs Chart Libraries

When building data-intensive applications with something like Recharts, a well-known React charting library, you’ll likely need to reformat the data to fit the shape the library requires. For instance, if I request the top 3 page views from the last seven days for my site, paulie.dev, I would have to query the Google Analytics API as follows.

import dotenv from 'dotenv';
import { BetaAnalyticsDataClient } from '@google-analytics/data';
dotenv.config();

const credentials = JSON.parse(
  Buffer.from(process.env.GOOGLE_APPLICATION_CREDENTIALS_BASE64, 'base64').toString('utf-8')
);

const analyticsDataClient = new BetaAnalyticsDataClient({
  credentials,
});

const [{ rows }] = await analyticsDataClient.runReport({
  property: `properties/${process.env.GA4_PROPERTY_ID}`,
  dateRanges: [
    {
      startDate: '7daysAgo',
      endDate: 'today',
    },
  ],
  dimensions: [
    {
      name: 'fullPageUrl',
    },
    {
      name: 'pageTitle',
    },
  ],
  metrics: [
    {
      name: 'totalUsers',
    },
  ],
  limit: 3,
  metricAggregations: ['MAXIMUM'],
});

The response would look something like this:

[
  {
    "dimensionValues": [
      {
        "value": "www.paulie.dev/",
        "oneValue": "value"
      },
      {
        "value": "Paul Scanlon | Home",
        "oneValue": "value"
      }
    ],
    "metricValues": [
      {
        "value": "61",
        "oneValue": "value"
      }
    ]
  },
  {
    "dimensionValues": [
      {
        "value": "www.paulie.dev/posts/2023/11/a-set-of-sign-in-with-google-buttons-made-with-tailwind/",
        "oneValue": "value"
      },
      {
        "value": "Paul Scanlon | A set of: \"Sign In With Google\" Buttons Made With Tailwind",
        "oneValue": "value"
      }
    ],
    "metricValues": [
      {
        "value": "41",
        "oneValue": "value"
      }
    ]
  },
  {
    "dimensionValues": [
      {
        "value": "www.paulie.dev/posts/2023/10/what-is-a-proxy-redirect/",
        "oneValue": "value"
      },
      {
        "value": "Paul Scanlon | What Is a Proxy Redirect?",
        "oneValue": "value"
      }
    ],
    "metricValues": [
      {
        "value": "23",
        "oneValue": "value"
      }
    ]
  }
]

To make that data work with Recharts, I’d need to reformat it so it conforms to the following data shape.

[
  {
    "name": "Paul Scanlon | Home",
    "value": 61
  },
  {
    "name": "Paul Scanlon | A set of: \"Sign In With Google\" Buttons Made With Tailwind",
    "value": 41
  },
  {
    "name": "Paul Scanlon | What Is a Proxy Redirect?",
    "value": 23
  }
]

To accomplish this, I’d need to use an Array.prototype.map() to iterate over each item, destructure the relevant data and return a key-value pair for the name and value for each.

const data = response.rows.map((row) => {
  const { dimensionValues, metricValues } = row;

  const pageTitle = dimensionValues[1].value;
  const totalUsers = parseInt(metricValues[0].value);

  return {
    name: pageTitle,
    value: totalUsers,
  };
});

And naturally, if you’re reformatting data this way in your application, you’d also want to write unit tests to ensure the data is always formatted correctly to avoid breaking your application… and all of this before you even get on to creating your charts!
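Here’s a minimal sketch of what such a test could look like, assuming a Jest-like test runner; the formatReport helper name and the trimmed sample row are my own for illustration, not part of the article’s project:

// Hypothetical helper that wraps the mapping logic shown above
const formatReport = (rows) =>
  rows.map(({ dimensionValues, metricValues }) => ({
    name: dimensionValues[1].value,
    value: parseInt(metricValues[0].value),
  }));

test('formats a Google Analytics row into the shape Recharts expects', () => {
  const rows = [
    {
      dimensionValues: [{ value: 'www.paulie.dev/' }, { value: 'Paul Scanlon | Home' }],
      metricValues: [{ value: '61' }],
    },
  ];

  expect(formatReport(rows)).toEqual([{ name: 'Paul Scanlon | Home', value: 61 }]);
});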

With Luzmo Flex, all of this goes away, leaving you more time to focus on which data to display and how best to display it.

The First Steps to Building Bespoke Data Products

Typically, when building user interfaces that display data insights, your first job will be to figure out how to query the data source. This can take many forms, from RESTful API requests to direct database queries or sometimes reading from static files. Your next job will be figuring out when and how often these requests need to occur.

  • For data that rarely changes: Perhaps a query in the build step will work.
  • For data that changes regularly: A server-side request on page load.
  • For ever-changing data: A client-side request that polls an API on an interval.

Each will likely inform your application’s architecture, and there’s no single solution to this. Your last job, as mentioned, will be wrangling the responses, reformatting the data, and displaying it in the UI.
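To make the polling option from the list above a little more concrete, here’s a minimal sketch of a client-side request loop in a React component. The /api/analytics endpoint and the 60-second interval are assumptions for illustration only; they are not part of Luzmo or the starter project.

'use client';

import { useEffect, useState } from 'react';

export default function LiveAnalytics() {
  const [data, setData] = useState(null);

  useEffect(() => {
    // Fetch once on mount, then poll the (hypothetical) endpoint every 60 seconds
    const load = () =>
      fetch('/api/analytics')
        .then((res) => res.json())
        .then(setData);

    load();
    const interval = setInterval(load, 60000);

    return () => clearInterval(interval);
  }, []);

  return <pre>{JSON.stringify(data, null, 2)}</pre>;
}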

Below, I’ll show you how to do this using Luzmo Flex by using a simple example product.

What We’re Building: Custom Data Visualizations As Code

Here’s a screenshot of a simple data product I’ve built that displays three different charts for different reporting dimensions exposed by the Google Analytics API for page views for my site, paulie.dev, from the last seven days.

You can find all the code used in this article on the following link:

Getting Started With Luzmo

Before we get going, hop over to Luzmo and sign up for a free trial. You might also like to have a read of one of the getting started guides listed below. In this article, I’ll be using the Next.js starter.

Creating a Google Analytics Dataset

To create data visualization, you’ll first need data! To achieve this using Luzmo, head over to the dashboard, select Datasets from the navigation, and select GA4 Google Analytics. Follow the steps shown in the UI to connect Luzmo with your Google Analytics account.

With the setup complete, you can now select which reporting dimensions to add to your dataset. To follow along with this article, select Custom selection.

Lastly, select the following using the search input: Device Category, Page Title, Date, and Total users. Then click Import when you’re ready.

You now have all the data required to build the Google Analytics dashboard. You can access the dataset ID from the URL address bar in your browser. You’ll need this in a later step.

If you’ve followed along from either of the first two getting started guides, you’ll have your API Key, API Token, App server, and API host environment variables set up and saved in a .env file.

Install Dependencies

If you’ve cloned one of the starter repositories, run the following to install the required dependencies.

npm install

Next, install the Luzmo React Embed dependency which exports the LuzmoVizItemComponent.

npm install @luzmo/react-embed@latest

Now, find page.tsx located in the src/app directory, and add your dataset id as shown below.

Add the access object from the destructured response and pass access.datasets[0].id onto the LuzmoClientComponent component using a prop named datasetId.

// src/app/page.tsx


+ import dynamic from 'next/dynamic';

import Luzmo from '@luzmo/nodejs-sdk';
- import LuzmoClientComponent from './components/luzmo-client-component';
+ const LuzmoClientComponent = dynamic(() => import('./components/luzmo-client-component'), {
  ssr: false,
});


const client = new Luzmo({
  api_key: process.env.LUZMO_API_KEY!,
  api_token: process.env.LUZMO_API_TOKEN!,
  host: process.env.NEXT_PUBLIC_LUZMO_API_HOST!,
});

export default async function Home() {
  const response = await client.create('authorization', {
    type: 'embed',
    username: 'user id',
    name: 'first name last name',
    email: 'name@email.com',
    access: {
      datasets: [
        {
-          id: '<dataset_id>',
+          id: '42b43db3-24b2-45e7-98c5-3fcdef20b1a3',
          rights: 'use',
        },
      ],
    },
  });

-  const { id, token } = response;
+  const { id, token, access } = response;

-  return <LuzmoClientComponent authKey={id} authToken={token} />;
+  return <LuzmoClientComponent authKey={id} authToken={token} datasetId={access.datasets[0].id} />;
}

And lastly, find luzmo-client-component.tsx located in src/app/components. This is where you’ll be creating your charts.

Building a Donut Chart

The first chart you’ll create is a Donut chart that shows the various devices used by visitors to your site.

Add the following code to luzmo-client-component.tsx component.

// src/app/components/luzmo-client-component.tsx

'use client';

+ import { LuzmoVizItemComponent } from '@luzmo/react-embed';

interface Props {
  authKey: string;
  authToken: string;
+  datasetId: string;
}

- export default function LuzmoClientComponent({ authKey, authToken}: Props) {
+ export default function LuzmoClientComponent({ authKey, authToken, datasetId }: Props) {

+  const date = new Date(new Date().getTime() - 7 * 24 * 60 * 60 * 1000).toISOString(); // creates a date 7 days ago

  console.log({ authKey, authToken });

  return (
    <section>
+    <div className='w-1/2 h-80'>
+      <LuzmoVizItemComponent
+        appServer={process.env.NEXT_PUBLIC_LUZMO_APP_SERVER}
+        apiHost={process.env.NEXT_PUBLIC_LUZMO_API_HOST}
+        authKey={authKey}
+        authToken={authToken}
+        type='donut-chart'
+        options={{
+          title: {
+            en: `Devices from last 7 days`,
+          },
+          display: {
+            title: true,
+          },
+          mode: 'donut',
+          legend: {
+            position: 'bottom',
+          },
+        }}
+        slots={[
+          {
+            name: 'measure',
+            content: [
+              {
+                label: {
+                  en: 'Total users',
+                },
+                column: '<column id>', // Total users
+                set: datasetId,
+                type: 'numeric',
+                format: '.4f',
+              },
+            ],
+          },
+          {
+            name: 'category',
+            content: [
+              {
+                label: {
+                  en: 'Device category',
+                },
+                column: '<column id>', // Device category
+                set: datasetId,
+                type: 'hierarchy',
+              },
+            ],
+          },
+        ]}
+        filters={[
+          {
+            condition: 'or',
+            filters: [
+              {
+                expression: '? >= ?',
+                parameters: [
+                  {
+                    column_id: '<column id>', // Date
+                    dataset_id: datasetId,
+                  },
+                  date,
+                ],
+              },
+            ],
+          },
+        ]}
+      />
+    </div>
    </section>
  );
}

There’s quite a lot going on in the above code snippet, and I will explain it all in due course, but first, I’ll need to cover a particularly tricky part of the configuration.

Column IDs

You’ll notice the filters parameters, measure, and category content all require a column id.

In the filters parameters, the key is named column_id, and in the measure and category, the key is named column. Both of these are actually the column IDs from the dataset. And here’s how you can find them.

Back in the Luzmo dashboard, click into your dataset and look for the “more dots” next to each column heading. From the menu, select Copy column id. Add each column ID to the keys in the configuration objects.

In my example, I’m using the Total users for the measure, the Device category for the category, and the Date for the filter.

If you’ve added the column IDs correctly, you should be able to see a rendered chart on your screen!

… and as promised, here’s a breakdown of the configuration.

Initial Props Donut Chart

The first part is fairly straightforward. appServer and apiHost are the environment variables you saved to your .env file, and authKey and authToken are destructured from the authorization request and passed into this component via props.

The type prop determines which type of chart to render. In my example, I’m using donut-chart, but you could choose from one of the many options available: area-chart, bar-chart, bubble-chart, box-plot, and many more. You can see all the available options in the Luzmo documentation under Chart docs.

<LuzmoVizItemComponent
  appServer={process.env.NEXT_PUBLIC_LUZMO_APP_SERVER}
  apiHost={process.env.NEXT_PUBLIC_LUZMO_API_HOST}
  authKey={authKey}
  authToken={authToken}
  type='donut-chart'

The one thing I should point out is my use of Tailwind classes: w-1/2 (width: 50%) and h-80 (height: 20rem). The LuzmoVizItemComponent ships with a height of 100%, so you’ll need to wrap the component in an element that has an actual height; otherwise, the chart will be 100% of the height of an element with no height and won’t be visible on the page.

Donut Chart Options

The options object is where you can customize the appearance of your chart. It accepts many configuration options, including:

  • A title for the chart that accepts a locale with corresponding text to display.
  • A display title value to determine if the title is shown or not.
  • A mode to determine whether the chart is rendered as a donut or a pie chart.
  • A legend option to determine where the legend is positioned.

All the available configuration options can be seen in the Donut chart documentation.

options={{
  title: {
    en: `Devices from last 7 days`,
  },
  display: {
    title: true,
  },
  mode: 'donut',
  legend: {
    position: 'bottom',
  },
}}

Donut Chart Slots

Slots are where you can configure which column from your dataset to use for the category and measure.

Slots can contain multiple measures, useful for displaying two columns of data per chart, but if more than two are used, one will become the measure.

Each slot contains a content array. The content array, among many other configurations, can include the following:

  • A label and locale,
  • The column id from the dataset,
  • The datasetId,
  • The type of data you’re displaying,
  • A format for the data.

The format used here is Python-style syntax for floating-point numbers; it’s similar to JavaScript’s .toFixed() method, e.g., number.toFixed(4).
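In plain JavaScript terms, the two behave roughly the same way (the value below is made up for illustration):

// '.4f' formats a number to four decimal places, much like toFixed(4)
const totalUsers = 61;
console.log(totalUsers.toFixed(4)); // → '61.0000'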

The hierarchy type is the Luzmo standard data type. Any text column is considered a hierarchical data type.

You can read more in the Donut chart documentation about available configuration options for slots.

slots={[
  {
    name: 'measure',
    content: [
      {
        label: {
          en: 'Total users',
        },
        column: '<column id>', // Total users
        set: datasetId,
        type: 'numeric',
        format: '.4f',
      },
    ],
  },
  {
    name: 'category',
    content: [
      {
        label: {
          en: 'Device category',
        },
        column: '<column id>', // Device category
        set: datasetId,
        type: 'hierarchy',
      },
    ],
  },
]}

Donut Chart Filters

The filters object is where you can apply conditions that will determine which data will be shown. In my example, I only want to show data from the last seven days. To accomplish this, I first create the date variable:

const date = new Date(new Date().getTime() - 7 * 24 * 60 * 60 * 1000).toISOString();

This would produce an ISO date string, e.g., 2024-08-21T14:25:40.088Z, which I can use with the filter. The filter uses Luzmo’s Filter Expressions to determine whether the date for each row of the data is greater than or equal to the date variable. You can read more about Filter Expressions in Luzmo’s Academy article.

filters={[
  {
    condition: 'or',
    filters: [
      {
        expression: '? >= ?',
        parameters: [
          {
            column_id: '<column id>', // Date
            dataset_id: datasetId,
          },
          date,
        ],
      },
    ],
  },
]}

Building a Line Chart

The second chart you’ll be creating is a Line chart that displays the number of page views on each date from the last seven days from folks who visit your site.

Initial Props Line Chart

As with the Donut chart, the initial props are pretty much the same, but the type has been changed to line-chart.

<LuzmoVizItemComponent
  appServer={process.env.NEXT_PUBLIC_LUZMO_APP_SERVER}
  apiHost={process.env.NEXT_PUBLIC_LUZMO_API_HOST}
  authKey={authKey}
  authToken={authToken}
  type='line-chart'

Line Chart Options

The options for the Line chart are as follows, with the mode set to grouped.

options={{
  title: {
    en: `Site visits from last 7 days`,
  },
  display: {
    title: true,
  },
  mode: 'grouped',
}}

Line Chart Slots

The slots object is almost the same as before with the Donut chart, but for the Line chart, I’m using the date column from the dataset instead of the device category, and instead of category, I’m using the x-axis slot type. To ensure I’m formatting the data correctly (by day), I’ve used level 5. You can read more about levels in the docs.

slots={[
  {
    name: 'measure',
    content: [
      {
        label: {
          en: 'Total users',
        },
        column: '<column id>', // Total users
        set: datasetId,
        type: 'numeric',
        format: '.4f',
      },
    ],
  },
  {
    name: 'x-axis',
    content: [
      {
        label: {
          en: 'Date',
        },
        column: '<column id>', // Date
        set: datasetId,
        type: 'datetime',
        level: 5,
      },
    ],
  },
]}

Line Chart Filters

I’ve used the same filters as I used in the Donut chart.

Building a Bar Chart

The last chart you’ll be creating is a Bar chart that displays the number of page views for the top ten most viewed pages on your site.

Initial Props Bar Chart

As with the Donut and Line chart, the initial props are pretty much the same, but the type has been changed to bar-chart.

<LuzmoVizItemComponent
  className='w-full h-80'
  appServer={process.env.NEXT_PUBLIC_LUZMO_APP_SERVER}
  apiHost={process.env.NEXT_PUBLIC_LUZMO_API_HOST}
  authKey={authKey}
  authToken={authToken}
  type='bar-chart'

Bar Chart Options

The options for the Bar chart are a little more involved. I’ve included some styling options for the border-radii of the bars, limited the number of results to 10, and sorted the data by the highest page view count first using the sort by measure and direction options.

options={{
  title: {
    en: `Page views from last 7 days`,
  },
  display: {
    title: true,
  },
  mode: 'grouped',
  bars: {
    roundedCorners: 5,
  },
  limit: {
    number: 10,
  },
  sort: {
    by: 'measure',
    direction: 'desc',
  },
}}

Bar Chart Slots

As with the Line chart, I’ve used an axis for one of the columns from the dataset. In this case, it’s the y-axis which displays the page title.

slots={[
  {
    name: 'measure',
    content: [
      {
        label: {
          en: 'Total users',
        },
        column: '<column id>', // Total users
        set: datasetId,
        type: 'numeric',
        format: '.4f',
      },
    ],
  },
  {
    name: 'y-axis',
    content: [
      {
        label: {
          en: 'Page title',
        },
        column: '<column id>', // Page title
        set: datasetId,
        type: 'hierarchy',
      },
    ],
  },
]}

Bar Chart Filters

I’ve used the same filters as I used in the Donut and Line chart.

What’s Next

As you can see, there are plenty of chart types and customization options. Because this is just an “ordinary” React component, you can very easily make it configurable by an end user, allowing options to be set and unset using HTML input elements (checkbox, select, date, and so on), as sketched below.
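Here’s a rough sketch of that idea, reusing the Donut chart from earlier. The component and prop names are my own for illustration; the only Luzmo-specific detail is that the selected value is passed straight into the options.mode setting shown above.

'use client';

import { useState } from 'react';
import { LuzmoVizItemComponent } from '@luzmo/react-embed';

export default function ConfigurableDonut({ authKey, authToken, slots, filters }) {
  // Let the user flip the chart between its 'donut' and 'pie' modes
  const [mode, setMode] = useState('donut');

  return (
    <div className='w-1/2 h-80'>
      <select value={mode} onChange={(event) => setMode(event.target.value)}>
        <option value='donut'>Donut</option>
        <option value='pie'>Pie</option>
      </select>
      <LuzmoVizItemComponent
        appServer={process.env.NEXT_PUBLIC_LUZMO_APP_SERVER}
        apiHost={process.env.NEXT_PUBLIC_LUZMO_API_HOST}
        authKey={authKey}
        authToken={authToken}
        type='donut-chart'
        options={{
          title: { en: `Devices from last 7 days` },
          display: { title: true },
          mode,
          legend: { position: 'bottom' },
        }}
        slots={slots}     // same slots configuration as the earlier Donut chart
        filters={filters} // same date filter as the earlier Donut chart
      />
    </div>
  );
}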

But for me, the real power behind this is not having to mutate data!

This is particularly pertinent when displaying multiple charts with different reporting dimensions. Typically, this would require each to have their own utility function or reformatting method. That said, setting column IDs and dataset IDs is a little fiddly, but once you have the component hooked up to the dataset, you can configure and reconfigure as much as you like, all without having to rewrite data formatting functions.

If you’re interested in bringing data to life in your application and want to get it done without the usual headaches, book a free demo with the Luzmo team to learn more!

Generating Unique Random Numbers In JavaScript Using Sets


JavaScript comes with a lot of built-in functions that allow you to carry out many different operations. One of these is the Math.random() method, which generates a random floating-point number between 0 (inclusive) and 1 (exclusive) that can then be converted into integers.

However, if you wish to generate a series of unique random numbers and create more random effects in your code, you will need to come up with a custom solution for yourself because the Math.random() method on its own cannot do that for you.

In this article, we’re going to be learning how to circumvent this issue and generate a series of unique random numbers using the Set object in JavaScript, which we can then use to create more randomized effects in our code.

Note: This article assumes that you know how to generate random numbers in JavaScript, as well as how to work with sets and arrays.

Generating a Unique Series of Random Numbers

One of the ways to generate a unique series of random numbers in JavaScript is by using Set objects. The reason why we’re making use of sets is because the elements of a set are unique. We can iteratively generate and insert random integers into sets until we get the number of integers we want.

And since sets do not allow duplicate elements, they are going to serve as a filter to remove all of the duplicate numbers that are generated and inserted into them so that we get a set of unique integers.

Here’s how we are going to approach the work:

  1. Create a Set object.
  2. Define how many random numbers to produce and what range of numbers to use.
  3. Generate each random number and immediately insert the numbers into the Set until the Set is filled with a certain number of them.

The following is a quick example of how the code comes together:

function generateRandomNumbers(count, min, max) {
  // 1: Create a Set object
  let uniqueNumbers = new Set();
  while (uniqueNumbers.size < count) {
    // 2: Generate each random number
    uniqueNumbers.add(Math.floor(Math.random() * (max - min + 1)) + min);
  }
  // 3: Immediately insert the numbers into the Set...
  return Array.from(uniqueNumbers);
}
// ...set how many numbers to generate from a given range
console.log(generateRandomNumbers(5, 5, 10));

What the code does is create a new Set object and then generate and add the random numbers to the set until our desired number of integers has been included in the set. The reason why we’re returning an array is because they are easier to work with.

One thing to note, however, is that the number of integers you want to generate (represented by count in the code) cannot exceed the number of unique integers available in your range (represented by max - min + 1 in the code). Otherwise, the Set can never reach the requested size, and the code will run forever. You can add an if statement to the code to ensure that this is always the case:

function generateRandomNumbers(count, min, max) {
  // if statement checks that count does not exceed the number of integers in the range
  if (count > max - min + 1) {
    return "count cannot be greater than the number of integers in the range";
  } else {
    let uniqueNumbers = new Set();
    while (uniqueNumbers.size < count) {
      uniqueNumbers.add(Math.floor(Math.random() * (max - min + 1)) + min);
    }
    return Array.from(uniqueNumbers);
  }
}
console.log(generateRandomNumbers(5, 5, 10));

Using the Series of Unique Random Numbers as Array Indexes

It is one thing to generate a series of random numbers. It’s another thing to use them.

Being able to use a series of random numbers with arrays unlocks so many possibilities: you can use them in shuffling playlists in a music app, randomly sampling data for analysis, or, as I did, shuffling the tiles in a memory game.

Let’s take the code from the last example and work off of it to return random letters of the alphabet. First, we’ll construct an array of letters:

const englishAlphabets = [
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
];

// rest of code

Then we map the generated indexes to letters in the array:

const englishAlphabets = [
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
];

// generateRandomNumbers()

const randomAlphabets = randomIndexes.map((index) => englishAlphabets[index]);

In the original code, the generateRandomNumbers() function is logged to the console. This time, we’ll construct a new variable that calls the function so it can be consumed by randomAlphabets:

const englishAlphabets = [
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
];

// generateRandomNumbers()

const randomIndexes = generateRandomNumbers(5, 0, 25);
const randomAlphabets = randomIndexes.map((index) => englishAlphabets[index]);

Now we can log the output to the console like we did before to see the results:

const englishAlphabets = [
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
];

// generateRandomNumbers()

const randomIndexes = generateRandomNumbers(5, 0, 25);
const randomAlphabets = randomIndexes.map((index) => englishAlphabets[index]);
console.log(randomAlphabets);

And, when we put the generateRandomNumbers() function definition back in, we get the final code:

const englishAlphabets = [
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 
  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
];
function generateRandomNumbers(count, min, max) {
  if (count > max - min + 1) {
    return "count cannot be greater than the number of integers in the range";
  } else {
    let uniqueNumbers = new Set();
    while (uniqueNumbers.size < count) {
      uniqueNumbers.add(Math.floor(Math.random() * (max - min + 1)) + min);
    }
    return Array.from(uniqueNumbers);
  }
}
const randomIndexes = generateRandomNumbers(5, 0, 25);
const randomAlphabets = randomIndexes.map((index) => englishAlphabets[index]);
console.log(randomAlphabets);

So, in this example, we created a new array of alphabets by randomly selecting some letters in our englishAlphabets array.

You can pass in a count argument of englishAlphabets.length to the generateRandomNumbers function if you desire to shuffle the elements in the englishAlphabets array instead. This is what I mean:

generateRandomNumbers(englishAlphabets.length, 0, 25);

Wrapping Up

In this article, we’ve discussed how to create randomization in JavaScript by covering how to generate a series of unique random numbers, how to use these random numbers as indexes for arrays, and also some practical applications of randomization.

The best way to learn anything in software development is by consuming content and reinforcing whatever knowledge you’ve gotten from that content by practicing. So, don’t stop here. Run the examples in this tutorial (if you haven’t done so), play around with them, come up with your own unique solutions, and also don’t forget to share your good work. Ciao!

Mastering Typography In Logo Design


Typography is much more than just text on a page — it forms the core of your design. As a designer, I always approach selecting types from two angles: as a creative adventure and as a technical challenge.

Choosing the right typeface for a company, product, or service is an immensely important task. At that moment, you’re not only aligning with the brand’s identity but also laying the foundation to reinforce the company or service’s brand. Finding the right typeface can be a time-consuming process that often begins with an endless search. During this search, you can get tangled up in the many different typefaces, which, over time, all start to look the same.

In this article, I aim to provide you with the essential background and tools to enhance your typography journey and apply this knowledge to your logo design. We will focus on three key pillars:

  1. Font Choice
  2. Font Weight
  3. Letter Spacing

We will travel back in time to uncover the origins of various typefaces. By exploring different categories, we will illustrate the distinctions with examples and describe the unique characteristics of each category.

Additionally, we will discuss the different font weights and offer advice on when to use each variant. We will delve into letter-spacing and kerning, explaining what they are and how to effectively apply them in your logo designs.

Finally, we will examine how the right typeface choices can significantly influence the impact and success of a brand. With this structured approach, I will show you how to create a logo that is not only expressive but also purposeful and well-thought-out.

Understanding Typography in Logo Design

From the invention of the Gutenberg press in the mid-15th century through the creation of the first Slab Serif in 1815 and the design of the first digital typeface in 1968, the number of available fonts has grown exponentially. Today, WhatFontIs, a font finder platform, catalogs over a million fonts.

So, the one downside of not being born in the 15th century is that your task of choosing the right font has grown enormously. And once you’ve made the right choice out of a million-plus fonts, there are still many pitfalls to watch out for.

Fortunately for us, all these fonts have already been categorized. In this article, we refer to the following four categories: serif, sans serif, script, and display typefaces. But why do we have these categories, and how do we benefit from them today?

Each category has its specific uses. Serif typefaces are often used for books due to their enhancement of readability on paper, while sans serif typefaces are ideal for screens because of their clean lines. Different typefaces also evoke different emotions: for example, script can convey elegance, while sans serif offers a more modern look. Additionally, typeface categories have a rich history, with Old Style Serifs inspired by Roman inscriptions and Modern Serifs designed for greater contrast.

Today, these categories provide a fundamental basis for choosing the right typeface for any project.

As mentioned, different typefaces evoke different emotions; like people, they convey distinct characteristics:

  • Serif fonts are seen as traditional and trustworthy;
  • Sans Serif fonts are seen as modern and clear;
  • Script fonts can come across as elegant and/or informal depending on the style;
  • Display fonts are often bold and dynamic.

Historically, typefaces reflected cultural identities, but the “new typography” movement sought a universal style. Designers emphasized that typefaces should match the character of the text, a view also supported by the Bauhaus school.

Different Fonts And Their Characteristics

We have touched upon the history of different typeface categories. Now, to make a good font choice, we need to explore these categories and see what sets them apart, as each one has specific characteristics. As mentioned, in this article we refer to four categories: serif, sans serif, script, and display.

Let’s take a closer look at each category.

A serif typeface is a typeface that features small lines or decorative elements at the ends of the strokes. These small lines are called “serifs”.

A sans-serif typeface is a typeface that lacks the small lines or decorative elements at the ends of the strokes, resulting in a clean and modern appearance. The term “sans-serif” comes from the French word “sans,” meaning “without,” so sans-serif translates to “without serif.”

A script typeface is a typeface that mimics the fluid strokes of handwriting or calligraphy, featuring connected letters and flowing strokes for an elegant or artistic appearance.

A display typeface is a typeface designed for large sizes, such as headlines or titles, characterized by bold, decorative elements that make a striking visual impact.

Typeface Persona in Practice

Experts link typeface characteristics to physical traits. Sans serif faces are perceived as cleaner and more modern, while rounded serifs are friendly and squared serifs are more official. Light typefaces are seen as delicate and feminine, and heavy ones are seen as strong and masculine. Some typefaces are designed to be child-friendly with smoother shapes. Traditional serifs are often considered bookish, while sans serifs are seen as modern and no-nonsense.

Based on the provided context, we can assign the following characteristics per category:

  • Serif: Bookish, Traditional, Serious, Official, Respectable, Trustworthy.
  • Sans Serif: Clean, Modern, Technical, No-nonsense, Machine-like, Clear.
  • Script: Elegant, Informal, Feminine, Friendly, Flowing.
  • Display: Dramatic, Sophisticated, Urban, Theatrical, Bold, Dynamic.

Let me provide you with a real-life logo example to help visualize how different typeface categories convey these characteristics.

We’re focusing on ING, a major bank headquartered in the Netherlands. Before we dive into the logo itself, let’s first zoom in on some brand values. On their website, it is stated that they “value integrity above all” and “will not ignore, tolerate, or excuse behavior that breaches our values. To do so would break the trust of society and the trust of the thousands of colleagues who do the right thing.”

Given the strong emphasis on integrity, trust, and adherence to values, the most suitable typeface category would likely be a serif.

The serif font in the ING logo conveys a sense of authority, professionalism, and experience associated with the brand.

Let’s choose a different font for the logo. The font used in the example is Poppins Bold, a geometric sans-serif typeface.

The sans-serif typeface in this version of the ING logo conveys modernity, simplicity, and accessibility. These are all great traits for a company to convey, but they align less with the brand’s chosen values of integrity, trust, and adherence to tradition. A serif typeface often represents these traits more effectively. While the sans-serif version of the logo may be more accessible and modern, it could also convey a sense of casualness that misaligns with the brand’s values.

So let’s see these traits in action with a game called “Assign the Trait.” The rules are simple: you are shown two different fonts, and you choose which font best represents the given trait.

Understanding these typeface personas is crucial when aligning typography with a company’s brand identity. The choice of typeface should reflect and reinforce the brand’s characteristics and values, ensuring a cohesive and impactful visual identity.

We covered a lot of ground, and I hope you now have a better understanding of different typeface categories and their characteristics. I also hope that the little game of “Assign the Trait” has given you a better grasp of the differences between them. This game would also be great to play while you’re walking your dog or going for a run. See a certain logo on the back of a lorry? Which typeface category does it belong to, and what traits does it convey?

Now, let’s further explore the importance of aligning the typeface with the brand identity.

Brand Identity and Consistency

The most important aspect when choosing a typeface is that it aligns with the company’s brand identity. We have reviewed various typeface options, and each has its unique characteristics. You can link these characteristics to those of the company.

As discussed in the previous section, a sans-serif is more “modern” and “no-nonsense”. So, for a modern company, a sleek sans-serif typeface often fits better than a classic Serif typeface. In the previous section, we examined the ING logo and how the use of a sans-serif typeface gave it a more modern appearance, but it also reduced the emphasis on certain traits that ING wants to convey with its brand.

To further illustrate the impact of typeface on logo design, let’s explore some more ‘extreme’ examples.

Our first ‘extreme’ example is Haribo, an iconic gummy candy brand that uses a custom sans-serif typeface.

Let’s zoom in on a couple of characteristics of the typeface and explore why this is a great match for the brand.

  • Playfulness: The rounded, bold shapes give the logo a playful and child-friendly feel, aligning with its target audience of children and families.
  • Simplicity: The simple, easily readable sans-serif design makes it instantly recognizable and accessible.
  • Friendliness: The soft, rounded edges of the letters convey a sense of friendliness and positivity.

Second up is Fanta, a global soft drink brand that also uses a custom sans-serif typeface.

  • Handcrafted, Cut-Paper Aesthetic: The letters are crafted to appear as though they’ve been cut from paper, giving the typeface a distinct, hand-made look that adds warmth and creativity.
  • Expressive: The logo design is energetic and packed with personality, perfectly embodying Fanta’s fun, playful, and youthful vibe.

Using these ‘extreme’ cases, we can really see the power that a well-aligned typeface can have. Both cases embody the fun and friendly values of the brand. While the nuances may be more subtle in other cases, the power is still there.

Now, let’s delve deeper into the different typefaces and also look at weight, style, and letter spacing.

Elements of Typography in Logo Design

Now that we have a background of the different typeface categories, let’s zoom in on three other elements of typography in logo design:

Typefaces

Each category of typefaces has a multitude of options. The choice of the right typeface is crucial and perhaps the most important decision when designing a logo. It’s important to realize that often, there isn’t a single ‘best’ choice. To illustrate, we have four variations of the Adidas logo below. Each typeface could be considered a good choice. It’s crucial not to get fixated on finding the perfect typeface. Instead, ensure it aligns with the brand identity and looks good in practical use.

These four typefaces could arguably all be great choices for the Adidas brand, as they each possess the clean, bold, and sans-serif qualities that align with the brand’s values of innovation, courage, and ownership. While the details of typeface selection are important, it’s essential not to get overly fixated on them. The key is to ensure that the typeface resonates with the brand’s identity and communicates its core values effectively. Ultimately, the right typeface is one that not only looks good but also embodies the spirit and essence of the brand.

Let’s zoom in on the different weights and styles each typeface offers.

Weight and Style

Each typeface can come in anywhere from one to more than ten different styles, including choices such as Roman and Italic and various weights like Light, Regular, Semi-Bold, and Bold.

Personally, I often lean towards a Roman in Semi-Bold or Bold variant, but this choice heavily depends on the desired appearance, brand name, and brand identity. So, how do you know which font weight to choose?

When to choose bold fonts

  • Brand Identity
    If the brand is associated with strength, confidence, and modernity, bold fonts can effectively communicate these attributes.
  • Visibility and Readability
    Bold fonts are easy to read from a distance, making them perfect for signage, billboards, and other large formats.
  • Minimalist Design
    Using bold fonts in minimalist logos not only ensures that the logo stands out but also aligns with the principles of minimalism, where less is more.

Letter-spacing & Kerning

An important aspect of typography is overall word spacing, also known as tracking. This refers to the overall spacing between characters in a block of text. By adjusting the tracking in logo design, we can influence the overall look of the logo. We can make a logo more spacious and open or more compact and tight with minimal adjustments.

Designer and design educator Ellen Lupton states that kerning adjusts the spacing between individual characters in a typeface to ensure visual uniformity. When letters are spaced too uniformly, gaps can appear around certain letters like W, Y, V, T, and L. Modern digital typefaces use kerning pairs tables to control these spaces and create a more balanced look.

Tracking and kerning are often confused. To clarify, tracking (letter-spacing) adjusts the space between all letters uniformly, while kerning specifically involves adjusting the distance between individual pairs of letters to improve the readability and aesthetics of the text.

In the example shown below, we observe the concept of kerning in typography. The middle instance of “LEAF” displays the word without any kerning adjustments, where the spacing between each letter is uniform and unaltered.

In the first “LEAF,” kerning adjustments have been applied between the letters ‘A’ and ‘F’, reducing the space between them to create a more visually appealing and cohesive pair.

In the last “LEAF,” kerning has been applied differently, adjusting the space between ‘E’ and ‘A’. This alteration shifts the visual balance of the word, showing how kerning can change the aesthetics and readability of text (or logo) by fine-tuning the spacing between individual letter pairs.

Essential Techniques for Selecting Typefaces

Matching Typeface Characteristics with Brand Identity

As we discussed earlier, different categories of typefaces have unique characteristics that can align well with, or deviate from, the brand identity you want to convey. This is a great starting point on which to base your initial choice.

Inspiration

A large part of the creative process is seeking inspiration. Especially now that you’ve been able to make a choice regarding category, it’s interesting to see the different typefaces in action. This helps you visualize what does and doesn’t work for your brand. Below, I share a selection of my favorite inspiration sources:

Trust the Crowd

Some typefaces are used more frequently than others. Therefore, choosing typefaces that have been tried and tested over the years is a good starting point. It’s important to distinguish between a popular typeface and a trendy one. In this context, I refer to typefaces that have been “popular” for a long time. Let’s break down some of these typefaces.

Helvetica

One of the most well-known typefaces is Helvetica, renowned for its intrinsic legibility and clarity since its 1957 debut. Helvetica’s tall x-height, open counters, and neutral letterforms allow it to lend a clean and professional look to any logo.

Some well-known brands that use Helvetica are BMW, Lufthansa, and Nestlé.

Futura

Futura has been helping brands convey their identity for almost a century. Designed in 1927, it is celebrated for its geometric simplicity and modernist design. Futura’s precise and clean lines give it a distinctive and timeless look.

Some well-known brands that use Futura are Louis Vuitton, Red Bull, and FedEx.

That said, you naturally have complete creative freedom, and making a bold choice can turn out fantastic, especially for brands where this is desirable.

Two’s Company, Three’s a Crowd

Combining typefaces is a challenging task. But if you want to create a logo with two different typefaces, make sure there is enough contrast between the two. For example, combine a serif with a sans-serif. If the two typefaces look too similar, it’s better to stick to one typeface. That said, I would never use more than two typefaces in a logo.

Let’s Build a Brand Logo

Now that we’ve gone through the above steps, it seems a good time for a practical example. Theory is useful, but only when you put it into practice will you notice that you become more adept at it.

TIP: Try creating a text logo yourself. First, we’ll need to do a company briefing where we come up with a name, define various characteristics, and create a brand identity. This is a great way to get to know your fictional brand.

Bonus challenge: If you want to go one step further, you can also include a logo mark in the briefing. In the following steps, we are going to choose a typeface that suits the brand’s identity and characteristics. For an added challenge, include the logo mark at the start so the typeface has to match your logo mark as well. You can find great graphics at Iconfinder.

Company Briefing

Company Name: EcoWave

Characteristics:

  • Sustainable and eco-friendly products.
  • Innovative technologies focused on energy saving.
  • Wide range of ecological solutions.
  • Focus on quality and reliability.
  • Promotion of a green lifestyle.
  • Dedicated to addressing marine pollution.

Brand Identity: EcoWave is committed to a greener future. We provide sustainable and eco-friendly products that are essential for a better environment. Our advanced technologies and high-quality solutions enable customers to save energy and minimize their ecological footprint. EcoWave is more than just a brand; we represent a movement towards a more sustainable world with a special focus on combating marine pollution.

Keyword: Sustainability

Now that we’ve been briefed, we can start with the following steps:

  1. Identify key characteristics: Compile the top three defining characteristics of the company. You can add related words to each characteristic for more detail.
  2. Match the characteristics: Try to match these characteristics with the characteristics of the typeface category.
  3. Get inspired: Check the suggested links for inspiration and search for Sans-Serif fonts, for example. Look at popular fonts, but also search for fonts that fit what you want to convey about the brand (create a mood board).
  4. Make a preliminary choice: Use the gathered information to make an initial choice for the typeface. Adjust the weight and letter spacing until you are satisfied with the design of your logo.
  5. Evaluate your design: You now have the first version of your logo. Try it out on different backgrounds and photos that depict the desired look of the company. Assess whether it fits the intended identity and whether you are satisfied with the look. Not satisfied? Go back to your mood board and try a different typeface.

Let’s go over the steps for EcoWave:

1. Sustainable, Trustworthy, Innovative.

2. The briefing and brand focus primarily on innovation. When we match this aspect with the characteristics of typefaces, everything points to a Sans-Serif font, which offers a modern and innovative look.

3. Example Mood Board

4. Ultimately, I chose the IBM Plex Sans typeface. This modern, sans-serif typeface offers a fresh and contemporary look. It fits excellently with the innovative and sustainable characteristics of EcoWave. Below are the steps from the initial choice to the final result:

IBM Plex Sans Regular

IBM Plex Sans Bold

IBM Plex Sans Bold & Custom letter-spacing

IBM Plex Sans Bold & Custom edges

5. Here, you see the typeface in action. For me, this is a perfect match with the brand’s identity. The look feels just right.

Expert Insights and Trends in Typographic Logo Design

Those interested in typography might find ‘The Elements of Typographic Style’ by Robert Bringhurst insightful. In this section, I want to share an interesting part about the importance of choosing a typeface that suits the specific task.

“Choose faces that suit the task as well as the subject. You are designing, let us say, a book about bicycle racing. You have found in the specimen books a typeface called Bicycle, which has spokes in the O, an A in the shape of a racing seat, a T that resembles a set of racing handlebars, and tiny cleated shoes perched on the long, one-sided serifs of ascenders and descenders, like pumping feet on the pedals. Surely this is the perfect face for your book?

Actually, typefaces and racing bikes are very much alike. Both are ideas as well as machines, and neither should be burdened with excess drag or baggage. Pictures of pumping feet will not make the type go faster, any more than smoke trails, pictures of rocket ships, or imitation lightning bolts tied to the frame will improve the speed of the bike.

The best type for a book about bicycle racing will be, first of all, an inherently good type. Second, it will be a good type for books, which means a good type for comfortable long-distance reading. Third, it will be a type sympathetic to the theme. It will probably be lean, strong, and swift; perhaps it will also be Italian. But it is unlikely to be carrying excess ornament or freight and unlikely to be indulging in a masquerade.”

— Robert Bringhurst

As Robert Bringhurst illustrates, choosing a typeface should be appropriate not only for the subject but also for the specific task. What lessons can we draw from this for our typeface choice in our logo?

Functional and Aesthetic Considerations

The typeface must be legible in various sizes and on different mediums, from business cards to billboards. A well-designed logo should be easy to reproduce without loss of clarity.

Brand Identity

Suppose we have a brand in the bicycle industry, an innovative and modern company. In Robert Bringhurst’s example, we choose the typeface Bicycle, which, due to its name, seems to perfectly match bicycles. However, the typeface described by Robert is a serif font with many decorative elements, which does not align with the desired modern and innovative look of our brand. Therefore, this would be a mismatch.

Trends

“Styles come and go. Good design is a language, not a style.”

In this part, we discuss some new trends. However, it is also important to highlight the above quote. The basic principles we mention have been applicable for a long time and will continue to be. It can be both fun and challenging to follow the latest trends, but it is essential to integrate them with your basic principles.

Minimalism and Simplicity

Minimalism in logo design remains one of the major trends this year. The most characteristic aspect of this style is limiting the logo to its most essential elements. This creates a clear and timeless character. In typography, this benefits readability while still communicating the brand identity effectively. We also see this well reflected in the rebranding of the fast-food chain Ashton.

Customization and Uniqueness

Another growing trend is customization in typography, where designers create personalized typefaces or modify existing typefaces to give the brand a unique look. This can range from subtle adjustments in letterforms to developing a completely custom typeface. Such an approach can contribute to a distinctive visual identity. A good example of this can be seen in the Apex logo, where the ‘A’ and ‘e’ are specifically adjusted.

Conclusion

We now know that choosing the right typeface for a logo goes beyond personal taste. It has a significant impact on how powerful and recognizable a brand becomes. In this article, we have seen that finding the perfect typeface is a challenge that requires both creativity and a practical approach. With a strong focus on three key aspects:

  • Font choice,
  • Font weight,
  • Letter spacing.

We have seen that finding the right typeface can be a quest, and personal preferences certainly play a role, but with the right tools, this process can be made much easier. The goal is to create a logo that is not only beautiful but also truly adds value by resonating with the people you want to reach and strengthening the brand’s key values.

We also looked at how trends can influence the longevity of your logo. It is important to be trendy, but it is equally important to remain true to timeless principles.

In summary,

Truly understanding both the technical details and the emotional impact of typefaces is enormously important for designing a logo. This knowledge helps to develop brands that not only look good but also have a deeper strategic impact — a strong brand.

And for those of you who are interested in diving deeper, I’ve tried to capture the fundamentals we’ve discussed in this article, focusing on good typeface choices, font weights, and letter spacing, in a tool called huisstijl. While it’s not perfect yet, I hope it can help some people create a simple brand identity that they love.

Regexes Got Good: The History And Future Of Regular Expressions In JavaScript


Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.

This is especially true in JavaScript-land, where regexes languished for many years, underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.

In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).

The History of Regular Expressions in JavaScript

ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.

But that was then.

Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?

Let’s take a look at them.

Don’t worry if it’s hard to understand what some of the following features mean — we’ll look more closely at several of the key features afterward.

  • ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (/[/]/).
  • ES6/ES2015 added two new regex flags: y (sticky), which made it easier to use regexes in parsers, and u (unicode), which added several significant Unicode-related improvements along with strict errors. It also added the RegExp.prototype.flags getter, support for subclassing RegExp, and the ability to copy a regex while changing its flags.
  • ES2018 was the edition that finally made JavaScript regexes pretty good. It added the s (dotAll) flag, lookbehind, named capture, and Unicode properties (via \p{...} and \P{...}, which require ES6’s flag u). All of these are extremely useful features, as we’ll see.
  • ES2020 added the string method matchAll, which we’ll also see more of shortly.
  • ES2022 added flag d (hasIndices), which provides start and end indices for matched substrings.
  • And finally, ES2024 added flag v (unicodeSets) as an upgrade to ES6’s flag u. The v flag adds a set of multicharacter “properties of strings” to \p{...}, multicharacter elements within character classes via \p{...} and \q{...}, nested character classes, set subtraction [A--B] and intersection [A&&B], and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets [^...].

As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.

Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...} and \P{...}. And to be a completionist, ES2021 added string method replaceAll — although, when given a regex, the only difference from ES3’s replace is that it throws if not using flag g.
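Before moving on, here’s a quick illustration of two of those newer additions in action: flag d exposing match indices, and flag v’s set subtraction inside a character class. The sample strings are made up for this sketch.

// Flag d (ES2022): expose start/end indices for the match and each named group
const match = /(?<year>\d{4})/d.exec('Standardized in 2018');
console.log(match.indices.groups.year); // → [16, 20]

// Flag v (ES2024): set subtraction — word characters except digits
const re = /[\w--\d]/v;
console.log(re.test('a')); // → true
console.log(re.test('7')); // → false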

Aside: What Makes a Regex Flavor Good?

With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:

  • Performance.
    This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control — a major limitation that makes ReDoS vulnerability more common.
  • Support for advanced features that handle common or important use cases.
    Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
  • Ability to write readable and maintainable patterns.
    Here, native JavaScript has long been the worst of the major flavors since it lacks the x (“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.

So, it’s a bit of a mixed bag.

JavaScript regexes have become exceptionally powerful, but they’re still missing key features that could make regexes safer, more readable, and more maintainable (all of which hold some people back from using this power).

The good news is that all of these holes can be filled by a JavaScript library, which we’ll see later in this article.

Using JavaScript’s Modern Regex Features

Let’s look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If you’re relatively new to regex, here are some excellent tutorials you might want to start with:

Named Capture

Often, you want to do more than just check whether a regex matches — you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.

The following example matches a record with two date fields and captures the values:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;
const match = record.match(re);
console.log(match.groups);
/* → {
  admitted: '2024-01-01',
  released: '2024-01-03'
} */

Don’t worry — although this regex might be challenging to understand, later, we’ll look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...), and their results are stored on the groups object of matches.

You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>, and you can use the values within search and replace as follows:

// Change 'FirstName LastName' to 'LastName, FirstName'
const name = 'Shaquille Oatmeal';
name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');
// → 'Oatmeal, Shaquille'
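
Here’s a small, self-contained sketch of \k<name> itself, using a named backreference to find doubled words:

// Match repeated words such as 'the the' via a named backreference
const dupe = /\b(?<word>\w+) \k<word>\b/i;
dupe.test('It was the the best of times');
// → true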

For advanced regexers who want to use named backreferences within a replacement callback function, the groups object is provided as the last argument. Here’s a fancy example:

function fahrenheitToCelsius(str) {
  const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g;
  return str.replace(re, (...args) => {
    const groups = args.at(-1);
    return Math.round((groups.degrees - 32) * 5/9) + 'C';
  });
}
fahrenheitToCelsius('98.6F');
// → '37C'
fahrenheitToCelsius('May 9 high is 40F and low is 21F');
// → 'May 9 high is 4C and low is -6C'

Lookbehind

Lookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^ for the start of a string or \b for word boundaries) that don’t consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.

For example, the following regex uses a lookbehind (?<=...) to match the word “cat” (only the word “cat”) if it’s preceded by “fat ”:

const re = /(?<=fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'cat, fat pigeon, brat cat'

You can also use negative lookbehind — written as (?<!...) — to invert the assertion. That would make the regex match any instance of “cat” that’s not preceded by “fat ”.

const re = /(?<!fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'pigeon, fat cat, brat pigeon'

JavaScript’s implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.
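
For example, the following regex uses a variable-length lookbehind (which many other flavors would reject) to match “cat” only when it’s preceded by one or more digits and a space:

const re = /(?<=\d+ )cat/;
'9000 cat'.match(re)[0];
// → 'cat'
'cat'.match(re);
// → null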

The matchAll Method

JavaScript’s String.prototype.matchAll was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.

Since matchAll returns an iterator (rather than an array), it’s easy to use it in a for...of loop.

const str = 'abcd';
const re = /(?<char1>\w)(?<char2>\w)/g;
for (const match of str.matchAll(re)) {
  const {char1, char2} = match.groups;
  // Print each complete match and matched subpatterns
  console.log(`Matched "${match[0]}" with "${char1}" and "${char2}"`);
}

Note: matchAll requires its regexes to use flag g (global). Also, as with other iterators, you can get all of its results as an array using Array.from or array spreading.

const matches = [...str.matchAll(/./g)];

Unicode Properties

Unicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...} and its negated version \P{...}. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.

Note: For more details, check out the documentation on MDN.

Unicode properties require using the flag u (unicode) or v (unicodeSets).
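
For example, here’s a quick sketch matching runs of letters in any language and testing for a specific script:

// Match runs of letters, regardless of language
'¡Hola, señor!'.match(/\p{Letter}+/gu);
// → ['Hola', 'señor']

// Test whether a string contains a Greek-script character
/\p{Script=Greek}/u.test('π');
// → true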

Flag v

Flag v (unicodeSets) was added in ES2024 and is an upgrade to flag u — you can’t use both at the same time. It’s a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If you’re okay with only supporting environments with flag v (Node.js 20 and 2023-era browsers), then use flag v; otherwise, use flag u.

Flag v adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B (within character classes) to match strings in A but not in B or using A&&B to match strings in both A and B. For example:

// Matches all Greek symbols except the letter 'π'
/[\p{Script_Extensions=Greek}--π]/v

// Matches only Greek letters
/[\p{Script_Extensions=Greek}&&\p{Letter}]/v

For more details about flag v, including its other new features, check out this explainer from the Google Chrome team.

A Word on Matching Emoji

Emoji are 🤩🔥😎👌, but how emoji get encoded in text is complicated. If you’re trying to match them with a regex, it’s important to be aware that a single emoji can be composed of one or many individual Unicode code points. Many people (and libraries!) who roll their own emoji regexes miss this point (or implement it poorly) and end up with bugs.

The following details for the emoji “👩🏻‍🏫” (Woman Teacher: Light Skin Tone) show just how complicated emoji can be:

// Code unit length
'👩🏻‍🏫'.length;
// → 7
// Each astral code point (above \uFFFF) is divided into high and low surrogates

// Code point length
[...'👩🏻‍🏫'].length;
// → 4
// These four code points are: \u{1F469} \u{1F3FB} \u{200D} \u{1F3EB}
// \u{1F469} combined with \u{1F3FB} is '👩🏻'
// \u{200D} is a Zero-Width Joiner
// \u{1F3EB} is '🏫'

// Grapheme cluster length (user-perceived characters)
[...new Intl.Segmenter().segment('👩🏻‍🏫')].length;
// → 1

Fortunately, JavaScript added an easy way to match any individual, complete emoji via \p{RGI_Emoji}. Since this is a fancy “property of strings” that can match more than one code point at a time, it requires ES2024’s flag v.
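
For example:

'I 👩🏻‍🏫 can 🔥 regex'.match(/\p{RGI_Emoji}/gv);
// → ['👩🏻‍🏫', '🔥']
// Each result is a complete emoji, even when it spans multiple code points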

If you want to match emojis in environments without v support, check out the excellent libraries emoji-regex and emoji-regex-xs.

Making Your Regexes More Readable, Maintainable, and Resilient

Despite the improvements to regex features over the years, native JavaScript regexes of sufficient complexity can still be outrageously hard to read and maintain.

Regular Expressions are SO EASY!!!! pic.twitter.com/q4GSpbJRbZ

— Garabato Kid (@garabatokid) July 5, 2019


ES2018’s named capture was a great addition that made regexes more self-documenting, and ES6’s String.raw tag allows you to avoid escaping all your backslashes when using the RegExp constructor. But for the most part, that’s it in terms of readability.

However, there’s a lightweight and high-performance JavaScript library named regex (by yours truly) that makes regexes dramatically more readable. It does this by adding key missing features from Perl-Compatible Regular Expressions (PCRE) and outputting native JavaScript regexes. You can also use it as a Babel plugin, which means that regex calls are transpiled at build time, so you get a better developer experience without users paying any runtime cost.

PCRE is a popular C library used by PHP for its regex support, and it’s available in countless other programming languages and tools.

Let’s briefly look at some of the ways the regex library, which provides a template tag named regex, can help you write complex regexes that are actually understandable and maintainable by mortals. Note that all of the new syntax described below works identically in PCRE.

Insignificant Whitespace and Comments

By default, regex allows you to freely add whitespace and line comments (starting with #) to your regexes for readability.

import {regex} from 'regex';
const date = regex`
  # Match a date in YYYY-MM-DD format
  (?<year>  \d{4}) - # Year part
  (?<month> \d{2}) - # Month part
  (?<day>   \d{2})   # Day part
`;

This is equivalent to using PCRE’s xx flag.

Subroutines and Subroutine Definition Groups

Subroutines are written as \g<name> (where name refers to a named group), and they treat the referenced group as an independent subpattern that they try to match at the current position. This enables subpattern composition and reuse, which improves readability and maintainability.

For example, the following regex matches an IPv4 address such as “192.168.12.123”:

import {regex} from 'regex';
const ipv4 = regex`\b
  (?<byte> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d)
  # Match the remaining 3 dot-separated bytes
  (\. \g<byte>){3}
\b`;

You can take this even further by defining subpatterns for use by reference only via subroutine definition groups. Here’s an example that improves the regex for admittance records that we saw earlier in this article:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = regex`
  ^ Admitted:\ (?<admitted> \g<date>) \n
    Released:\ (?<released> \g<date>) $

  (?(DEFINE)
    (?<date>  \g<year>-\g<month>-\g<day>)
    (?<year>  \d{4})
    (?<month> \d{2})
    (?<day>   \d{2})
  )
`;
const match = record.match(re);
console.log(match.groups);
/* → {
  admitted: '2024-01-01',
  released: '2024-01-03'
} */

A Modern Regex Baseline

regex includes the v flag by default, so you never forget to turn it on. And in environments without native v, it automatically switches to flag u while applying v’s escaping rules, so your regexes are forward and backward-compatible.

It also implicitly enables the emulated flags x (insignificant whitespace and comments) and n (“named capture only” mode) by default, so you don’t have to continually opt into their superior modes. And since it’s a raw string template tag, you don’t have to escape your backslashes \\\\ like with the RegExp constructor.

Atomic Groups and Possessive Quantifiers Can Prevent Catastrophic Backtracking

Atomic groups and possessive quantifiers are another powerful set of features added by the regex library. Although they’re primarily about performance and resilience against catastrophic backtracking (also known as ReDoS or “regular expression denial of service,” a serious issue where certain regexes can take forever when searching particular, not-quite-matching strings), they can also help with readability by allowing you to write simpler patterns.

Note: You can learn more in the regex documentation.
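
Here’s a rough sketch of what the syntax looks like (the patterns themselves are just illustrative):

import {regex} from 'regex';

// Atomic group: once (?>...) matches, the engine won't backtrack into it,
// which defuses patterns like (\w+\s?)+ that can otherwise backtrack catastrophically
const words = regex`^ (?> \w+ \s? )+ $`;

// Possessive quantifier: ++ matches as much as it can and never gives anything back
const digits = regex`\d++`;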

What’s Next? Upcoming JavaScript Regex Improvements

There are a variety of active proposals for improving regexes in JavaScript. Below, we’ll look at the three that are well on their way to being included in future editions of the language.

Duplicate Named Capturing Groups

This is a Stage 3 (nearly finalized) proposal. Even better, as of recently, it works in all major browsers.

When named capturing was first introduced, it required that all (?<name>...) captures use unique names. However, there are cases when you have multiple alternate paths through a regex, and it would simplify your code to reuse the same group names in each alternative.

For example:

/(?<year>\d{4})-\d\d|\d\d-(?<year>\d{4})/

This proposal enables exactly this, preventing a “duplicate capture group name” error with this example. Note that names must still be unique within each alternative path.
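
In environments that support the proposal, whichever alternative matches is the one that fills groups.year:

const re = /(?<year>\d{4})-\d\d|\d\d-(?<year>\d{4})/;
'2024-06'.match(re).groups.year;
// → '2024'
'06-2024'.match(re).groups.year;
// → '2024'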

Pattern Modifiers (aka Flag Groups)

This is another Stage 3 proposal. It’s already supported in Chrome/Edge 125 and Opera 111, and it’s coming soon for Firefox. No word yet on Safari.

Pattern modifiers use (?ims:...), (?-ims:...), or (?im-s:...) to turn the flags i, m, and s on or off for only certain parts of a regex.

For example:

/hello-(?i:world)/
// Matches 'hello-WORLD' but not 'HELLO-WORLD'

Escape Regex Special Characters with RegExp.escape

This proposal recently reached Stage 3 and has been a long time coming. It isn’t yet supported in any major browsers. The proposal does what it says on the tin, providing the function RegExp.escape(str), which returns the string with all regex special characters escaped so you can match them literally.

If you need this functionality today, the most widely-used package (with more than 500 million monthly npm downloads) is escape-string-regexp, an ultra-lightweight, single-purpose utility that does minimal escaping. That’s great for most cases, but if you need assurance that your escaped string can safely be used at any arbitrary position within a regex, escape-string-regexp recommends the regex library that we’ve already looked at in this article. The regex library uses interpolation to escape embedded strings in a context-aware way.
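
Here’s a rough sketch of that workflow with escape-string-regexp (the input string is just an example):

import escapeStringRegexp from 'escape-string-regexp';

// Escape user-provided text so it can be matched literally
const userInput = 'how much $ for a (used) unicorn?';
const re = new RegExp(escapeStringRegexp(userInput), 'i');
re.test('How much $ for a (used) unicorn?');
// → true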

Conclusion

So there you have it: the past, present, and future of JavaScript regular expressions.

If you want to journey even deeper into the lands of regex, check out Awesome Regex for a list of the best regex testers, tutorials, libraries, and other resources. And for a fun regex crossword puzzle, try your hand at regexle.

May your parsing be prosperous and your regexes be readable.

If I Was Starting My Career Today: Thoughts After 15 Years Spent In UX Design (Part 2)

In the previous article in my two-part series, I explained how important it is to start by mastering your design tools, to work on your portfolio (even if you have very little work experience — which is to be expected at this stage), and to carefully prepare for your first design interviews.

If all goes according to plan, and with a little bit of luck, you’ll land your first junior UX job — and then, of course, you’ll face new challenges, which I’ll cover in this second and final article of the series.

In Your New Junior UX Job: On the Way to Grow

You have probably heard of the Pareto Rule, which states that 20% of actions provide 80% of the results.

“The Pareto Principle is a concept that specifies that 80% of consequences come from 20% of the causes, asserting an unequal relationship between inputs and outputs. The principle was named after the economist Vilfredo Pareto.”

— “The Pareto Principle, a.k.a. the Pareto Rule”

This means that some of your actions will help you grow much faster than others.

But before we go into the details, let’s briefly consider the junior UX designer path. I think it’s clear that, at first, juniors usually assist other designers with simple but time-consuming tasks. Then, the level of complexity and your responsibilities start increasing, depending on your performance.

So, you got your first design job? Great! Here are a few things you can focus on if you want to grow at a faster pace.

Chase For Challenges

The simple but slow way to go is to do your work and then wait until your superiors notice how good you are and start giving you more complex tasks. The problem is that other people are usually too focused on their own work to notice yours.

So, to “cut some corners,” you need to actively look for challenges. It’s scary, I know, but remember: the people who invented the groundbreaking UX approaches and frameworks you now see in books and manuals relied on their intuition first. The web is full of articles and lectures on any skill you might want to develop. So, define that skill, spend a day reading about the topic, find a real problem, and practice. Then, share what you did and get some feedback. After a few iterations, I bet you’ll be handed your first real task!

Use Interfaces Consciously

Take the time to look again at the screenshot of the Amazon website (from Part One):

User interfaces didn’t appear in their present form right from the start. Instead, they evolved to their current state over the span of many years. And you all were part of their evolution, albeit passively — you registered on different websites, reset your passwords quite a few times, clicked onboarding screens, filled out short and long web forms, used search, and so on.

In your design work, all tasks (or 99% of them, at least at the beginning) will be based on those UX patterns. You don’t need to reinvent the wheel; you only need to remember what you already know and pay attention to the details while using the interfaces of the apps on your smartphone and on your computer. Ask yourself:

  • Why was this designed this way?
  • What is not clear enough for me as a user?
  • What is thought out well and what is not?

All of today’s great design solutions were built based on common sense and then documented so that other people can learn how to re-use this knowledge. Develop your own “common sense” skill every day by being a careful observer and by living your life consciously. Notice the patterns of good design, try to understand and memorize them, and then implement and rethink them in your own work.

I can also highly recommend the Smart Interface Design Patterns course with Vitaly Friedman. It provides guidelines and best practices for common components in modern interfaces. Inventing a new solution for every problem takes time, and too often, it’s just unnecessary. Instead, we can rely on bulletproof design patterns to avoid issues down the line. This course helps with just that. In the course, you will study hundreds of hand-picked examples, from complex navigation to filters, tables, and forms, and you will work on actual real-life challenges.

Learn How to Present Your Work

The ability to convey complex thoughts and ideas in the form of clear sentences defines how effectively you will be able to interact with other people.

This is a core work skill — a skill that you’ll be actually using your whole life, and not only in your work. I have written about this topic in much detail previously:

“Good communication is about sharing your ideas as clearly as possible.”

— “Effective Communication For Everyday Meetings” (Smashing Magazine)

In my article, I have described all the general principles that apply to effective communication, with the most important being: to develop a skill, you need to practice.

As a quick exercise, try telling your friends about the work you do without boring them with the details. You’ll know you’re on the right track if they don’t try to change the topic and instead ask you follow-up questions!

Gather Feedback

Don’t wait for your yearly review to hear about what you were doing right and wrong. Ask people for feedback and suggestions, and ask them often.

To help them get started, first mention a weakness you’re aware of and ask for their impressions. Encourage them to expand on their input and ask for recommendations on how you could address your weaknesses. Don’t forget to tell them when you’re applying their suggestions in practice. After all, these people helped you become better, so be thankful.

Learn Business

I see a lot of designers trying to apply all of their experience to every project, and they often complain that it doesn’t work — customers refuse to follow the entire classical UX process, such as defining User Personas, creating the Information Architecture (IA), outlining the customer journey map, and so on. Sometimes, it happens because clients don’t have the time and budget for it, or they don’t see the value because the designer can’t explain it in a proper way.

But remember that many great products were built without using all of today’s available and tested UX approaches; this doesn’t mean those approaches are useless. Initially, there was only common sense and many attempts to get better results, and only then did someone describe something as a working approach and specify all the details. So, before trying to apply any of these UX techniques, think about what you need to achieve. Is there any other way to get there within your time and budget?

Learn how the business works. Talk to customers in business language and communicate the value you create and not the specific approach, framework, or tool that you’ll be using.

“Good UX design is where value comes into the picture. We add value when we transform a product or service from delivering a poor experience to providing a good experience.”

— “The Value of Great UX,” by Jared Spool

Learn How to Make Interfaces Nice-looking

Yes, user experience should be first, but let’s be honest — we also love nice things! The same goes for your customers; they can’t always see the UX part of your work but can always say whether the interface is good-looking. So, learn composition and color theory, use elegant illustrations and icons, learn typography, and always strive to make your work visually appealing. Some would say that it’s not so important, but trust me, it is.

As an exercise, try to copy the design of a few beautiful interfaces. Take a look at an interface screen, then close it and try to recreate it from memory. When you are done, compare the two and then make a few more adjustments in order to get as close a copy of the original as possible. Try to understand why the original was built the way it is. I bet this process of reproducing an interface will help you notice many details you hadn’t paid attention to before.

Save Other People’s Time and Effort

Prepare for new tasks in advance. Create a list of questions, and don’t forget to ask about the deadlines. Align your plan and the number of iterations so people know precisely what to expect from you and when. Be curious (but not annoying) by asking questions every few hours (but try to search for the answers online first). Even if you don’t find the exact answer, it’ll help you formulate better questions and get a clearer view of the “big picture.” Remember, one day, you will get a task directly from the customer, so gathering the information you need to complete tasks correctly is an excellent skill to develop.

Structure Your Knowledge and Create a Learning Plan

When you are just beginning to learn, too many articles about UX design will look like absolute “must-reads” to you. But you will drown in information if you try to read them all in no particular order. Instead of trying to read everything, first find a mentor who will help you build a learning plan and advise you along the way.

Another good way to start is to complete a solid UX online course. If you can’t, take the learning program of any popular UX course out there and research the topics from the course’s list one by one. You can also use such a structured list (going from easier to more complex UX topics) to filter the articles you are going to read.

There are many excellent courses out there, and here are a few suggestions:
  • Selection of free UX design courses, including those offering certifications,” by Cheshta Dua
    In this article, the author shares a few free UX design courses which helped her get started as a UX designer.
  • Best free UX design courses — 2024,” by Cynthia Vinney (UX Design Institute)
    This is a comparison of a few free UX design courses, both online and in-person.
  • The 10 Best Free UX Design Courses in 2024,” by Rachel Meltze (CareerFoundry)
    A selection of free UX design courses — using these you can learn the fundamentals of UX design, the tools designers use, and more about the UX design career path.
  • The HTML/CSS Basics (.dev),” by Geoff Graham
    The Basics is an excellent online course that teaches the basic principles of front-end development. It’s a good “entry point” for those just coming into front-end development or perhaps for someone with experience writing code from years ago who wants to jump into modern-day development.

Practice, Practice, Practice

Bruce Lee once said:

“I fear not the man who has practiced 10,000 kicks once, but the man who has practiced one kick 10,000 times.”

— Bruce Lee

You may have read a lot about some new revolutionary UX approaches, but only practicing allows you to convert this knowledge into a skill. Our brain continually works to clear out unnecessary information from our memory. Therefore, actively practicing the ideas and knowledge that you have learned is the only way to signal to your brain that this knowledge is essential to be retained and re-used.

On a related note, you will likely also remember the popular “10,000-hour rule,” which was popularized by Malcolm Gladwell’s bestselling book Outliers.

As Malcolm says, the rule goes like this: it takes 10,000 hours of intensive practice to achieve mastery of complex skills and materials, like playing the violin or getting as good as Bill Gates at computer programming. Turns out, practice is important, and it’s surprising how much time and effort it may take to master something complicated. But later research also suggests that someone could practice for thousands of hours and still not be a master performer. They could be outperformed by someone who practiced less but had a teacher who showed them just what to focus on at a key moment in their practice.

So, remember my advice from the previous section? Try to find a mentor because, as I said earlier, learning and practicing with a mentor and a good plan will often lead to better results.

Conclusion

Instead of a conclusion (or trying to give you the answer to the ultimate question of life, the universe, and everything), here are a few final words of advice.

Remember, there is no single correct way to do things because there are no absolute criteria for “things done properly.” You can apply all your knowledge and follow every step of the classical design process, and the product may still fail.

At the same time, someone could quickly develop a minimum viable product (MVP) without using all of the standard design phases — and still conquer the market. Don’t believe me?

The first Apple iPhone, introduced 17 years ago, didn’t even have a basic copy/paste feature, yet we all know how the iPhone conquered the world (and it’s not only the iPhone; there are many other successful MVP examples out there, often conceived by small startups). That’s because Apple’s engineers and designers got the core product design concept right; they could release a product that didn’t yet have everything in it.

So yes, you need to read a lot about UX and UI design, watch tutorials, learn design theory, try different approaches, speak to the people using your product (or the first alpha or beta version of it), and practice. But in the end, always ask yourself, “Is this the most efficient way to bring value to people and get the needed results?” If the answer is “No,” update your design plan. Things don’t happen by themselves; we, humans, make them happen.

You are the pilot of your plane, so don’t expect someone else to care about your success more than you. Do your best. Make corrections and iterate. Learn, learn, learn. And sooner or later, you’ll reach success!

Further Reading

A Selection Of Design Resources (Part One, Part Two)

  • Photoshop CS Down & Dirty Tricks, a book by Scott Kelby
    Bestselling author Scott Kelby shares an amazing collection of Photoshop tricks, including how to create the same exact effects you see every day in magazines, at the movies, on the Web, and more. These are real-world techniques, the same ones you see used by leading Photoshop photographers, designers, and special effect masters.
  • Why Designers Aren’t Understood,” by Vitaly Friedman (Smashing Magazine)
    How do we conduct UX research when there is no or only limited access to users? Here are some workarounds to run UX research or make a strong case for it. (This article is an upcoming part of the “Smart Interface Design Patterns.” — Editor’s Note)
  • UXchallenge,” by Yachin You
    This website will help you learn how to solve real problems that customers face and present case studies that are related to these problems.
  • “Kano analysis: The Kano model explained” (Qualtrics)
    Kano analysis (also known as the “Customer Delight vs. Implementation Investment” approach) is a tool that helps you enhance your products and services based on customer emotions. This guide will help you understand what Kano analysis is and how you can use it in practice.
  • “Kano Model: What It Is & How to Use It to Increase Customer Satisfaction” (Userpilot)
    The Kano model uses quick and powerful data analysis to design your product roadmap. In this article, you will learn a brief history of the Kano model, a practical explanation of how it works, five categories of potential customer reactions to new features, and a four-step process for effective Kano analysis.
  • “The Pareto Principle” (Investopedia)
    The Pareto Principle is a concept that specifies that 80% of consequences come from 20% of the causes, asserting an unequal relationship between inputs and outputs. Named after the economist Vilfredo Pareto, this principle serves as a general reminder that the relationship between inputs and outputs is not balanced. The Pareto Principle is also known as the Pareto Rule or the 80/20 Rule.
  • “Figma Portfolio Templates & Examples” (UX Crush)
    A curated selection of portfolio templates for Figma Design.
  • How to Define a User Persona,” by Raven Veal (CareerFoundry)
    As you break into a career in UX, user personas are one tool you’ll certainly want to have available as you gather user research and find design solutions to solve problems and create more human-friendly products and experiences.
  • How to design a customer journey map,” by Emily Stevens (UX Design Institute)
    A customer journey map is a visual representation of how a user interacts with your product. This detailed guide will teach you how to create such a customer journey map.
  • “Building Components For Consumption, Not Complexity” (Part 1, Part 2), by Luis Ouriach (Smashing Magazine)
    In this two-part series of articles, Luis shares his experience with design systems and how you can overcome the potential pitfalls, starting from how to make designers on your team adopt the complex and well-built system that you created to what are the best naming conventions and how to handle the auto-layout of components, indexing/search, and more.
  • Effective Communication For Everyday Meetings,” by Andrii Zhdan (Smashing Magazine)
    Before any meeting starts, we often have many ideas about what to say and how it should go. But when the meeting happens, reality may “crash” all of our plans. This article is about conducting productive meetings. The author will give you a step-by-step guide on preparing a solid meeting structure that will let you follow the original plan and reach the meeting goals.
  • The Value of Great UX,” by Jared Spool
    This crossover from poor UX design to good UX design is where value comes into the picture. We add value when we transform a product or service from delivering a poor experience to providing a good experience.
  • How Designers Should Ask For (And Receive) High-Quality Feedback,” by Andy Budd (Smashing Magazine)
    Designers often complain about the quality of feedback they get from senior stakeholders. In this article, Andy Budd shares a better way of requesting feedback: rather than sharing a linear case study that explains every design revision, the first thing to do would be to better frame the problem.
  • Designing A Better Design Handoff File In Figma,” by Ben Shih (Smashing Magazine)
    Practical tips to enhance the handoff process between design and development in product development, with provided guidelines for effective communication, documentation, design details, version control, and plugin usage.
  • The HTML/CSS Basics (.dev),” by Geoff Graham
    The Basics is an online course that teaches the basic principles of front-end development, focusing specifically on HTML and CSS. A good “entry point” for those just coming into front-end development and perhaps for someone with experience writing code years ago who wants to jump into modern-day development.
  • Selection of free UX design courses, including those offering certifications,” by Cheshta Dua
    In this article, the author shares a few free UX design courses that helped her get started as a UX designer.
  • Best free UX design courses — 2024,” by Cynthia Vinney (UX Design Institute)
    Check this comparison of several free UX design courses currently on the market, both online and in-person.
  • The 10 Best Free UX Design Courses in 2024,” by Rachel Meltze (CareerFoundry)
    A selection of free UX design courses where you can learn the fundamentals of UX design, the tools designers use, and the UX design career path. This guide provides a range of courses, from micro-tutorials to full-featured UI/UX courses.
  • Researcher Behind ‘10,000-Hour Rule’ Says Good Teaching Matters, Not Just Practice,” by Jeffrey Young (EdSurge Magazine)
    It takes 10,000 hours of intensive practice to achieve mastery of complex skills and materials, like playing the violin or getting as good as Bill Gates at computer programming. Turns out, a study also shows that there’s another important variable that Gladwell originally didn’t focus on: how good a student’s teacher is.
  • An Apple engineer details why the first iPhone didn’t have copy and paste,” by Filipe Espósito (9to5Mac)
    Apple introduced the first iPhone 17 years ago, and a lot has changed since then, but it’s hard to believe that long ago, the iPhone didn’t even have copy-and-paste options. Now, former Apple software engineer Ken Kocienda has revealed details about why the first iPhone didn’t have such features.
  • “Fifteen examples of successful MVPs,” by Ross Krawczyk (RST Software)
    Startups need to get their products to the market faster than ever in an increasingly competitive world. The minimum viable product is the way to achieve this, but you must be really able to provide the right key features that give value to a wide customer base in order to attract clients and investors on time.

It’s Time To Talk About “CSS5”

We have been talking about CSS3 for a long time. Call me a fossil, but I still remember the new border-radius property feeling like the most incredible CSS3 feature. We have moved on since we got border-radius and a slew of new features dropped in a single CSS3 release back in 2009.

CSS, too, has moved on as a language, and yet “CSS3” is still in our lexicon as the last “official” semantically-versioned release of the CSS language.

It’s not as though we haven’t gotten any new and exciting CSS features between 2009 and 2024; it’s more that the process of developing, shipping, and implementing new CSS features is a guessing game of sorts.

We see CSS Working Group (CSSWG) discussions happening in the open. We have the draft specifications and an archive of versions at our disposal. The resources are there! But the develop-ship-implement flow remains elusive and leaves many of us developers wondering: When is the next CSS release, and what’s in it?

This is a challenging balancing act. We have spec authors, code authors, and user agents working both interdependently and independently and the communication gaps are numerous and wide. The result? New features take longer to be implemented, leading to developers taking longer to adopt them. We might even consider CSS3 to be the last great big “marketing” push for CSS as a language.

That’s what the CSS-Next community is grappling with at this very moment. If you haven’t heard of the group, you’re not alone, but either way, it’s high time we shed light on it and the ideas coming from it. As someone participating in the group, I thought I would share the conversations we’re having and how we’re approaching the way CSS releases are communicated.

Meet The CSS-Next Community

Before we formally “meet” the CSS-Next group, it’s worth knowing that it is still officially referred to as the CSS4 Community Group as far as the W3C is concerned.

And that might be the very first thing you ought to know about CSS-Next: it is part of the W3C and consists of CSSWG members, developers, designers, user agents, and, really, anyone passionate about the web and who wants to participate in the discussion. W3C groups like CSS-Next are open to everyone to bring our disparate groups together, opening opportunities to shape tomorrow’s vision of the web.

CSS-Next, in particular, is where people gather to discuss the possibility of raising awareness of CSS evolutions during the last decade. At its core, the group is discussing approaches for bundling CSS features that have shipped since CSS3 was released in 2009 and how to name the bundle (or bundles, perhaps) so we have a way of referring to this particular “era” of CSS and pushing those features forward.

Why We Need A Group Like CSS-Next

Let’s go back a few years. More specifically, let’s return to the year 2020.

It all started when Safari Evangelist Jen Simmons posted an open issue in the CSSWG’s GitHub repo for CSS draft specifications requesting a definition for a “CSS4” release.

This might be one of the biggest responses — if not the biggest response — to a CSSWG issue based solely on emoji reactions.

The idea of defining CSS4 received backing from Chris Coyier, Nicole Sullivan, and PPK. The goal is to push the technology forward and help educators and site owners, even if it’s just for the sake of marketing.

But why is this important? Why should we care about another level or “CSS Saga”? To get to that point, we might need to talk about CSS3 and what exactly it defines.

What Exactly Is “CSS3”?

The CSS3 grouping of features included level-3 specs for features from typography to selectors and backgrounds. From this point on, each CSS spec has been numbered individually.

However, CSS3 is still the most common term developers use to define the capabilities of modern CSS. We see this across the web, from the way educational institutions teach CSS to the job requirements on resumes.

The term CSS3 loses meaning year-over-year. You can see the dilution everywhere. The earliest CSS3 drafts were published in June 1999 — before many of my colleagues were even born — and yet CSS is one of the fastest-growing languages in the current webscape.

What About The CSS3 Logo?

When we look at job postings, we run into vacancies asking for knowledge of CSS3, which is over 10 years old. Without an updated level, we’re just asking if you’ve written CSS since the border-radius property came out. Furthermore, when we want to learn CSS, a CSS3 logo next to educational materials no longer signals current material. It kind of feels like time has stood still.

Here’s an example job posting that illustrates the issue:

But that’s not all. If you do a Google search on “Learn CSS” and check the images, you might be surprised how many CSS3 logos you can spot:

About 50% of the images show the CSS3 badge. To me, this clearly signals:

  1. People want badges or logos to aid in signaling skills.
  2. The CSS3 brand has made a large impact on the web ecosystem.
  3. The CSS3 logo has reached the end of its efficacy.

CSS3 still has a huge impact on the ecosystem. The same logo is trying to say it covers everything from Flexbox all the way to color-mix() — a spread of hundreds of CSS features.

What Exactly Does “Modern CSS” Mean?

CSS3 and HTML5 were big improvements to those respective languages — we’ve come a long way since then. We have features that people didn’t even think were possible back in 2012 (when we officially spoke of CSS3 as a level).

For example, there was a time when people thought that containers didn’t know anything and that it would never be possible to style an element based on the width of its parent. But now, of course, we have CSS Container Queries, and all of this is possible today. The things that are possible with CSS have changed over time, as so beautifully told by Miriam Suzanne at CSS Day 2023.

We do not want to ignore the success of CSS3 and say it is wrong; in fact, we believe it’s time to repeat the tremendous success of CSS3.

Imagine yourself 10 years from now reading about a “modern” CSS feature that was introduced as many as 10 years ago. It wouldn’t add up, right? Modern is not a future-proof name, something that Geoff Graham opined when asking the correct question, “What exactly is ‘Modern CSS’?”

“Naming is always hard, yet it’s just something we have to do in CSS to properly select things. I think it’s time we start naming [CSS releases] like this, too. It’s only a matter of time before “modern” isn’t “modern” anymore.”

— Geoff Graham

This is exactly where the CSS-Next community group comes in.

Let’s Talk About “CSS Eras”

The CSS-Next community group aims to align and modernize the general understanding of CSS in the wider developer community by labeling feature sets that have shipped since the initial set of CSS3 features, helping developers upskill their understanding of CSS across the ecosystem.

Why Isn’t This Part Of The Web Platform Baseline?

The definition of what is “current” CSS changes with time. Sometimes, specs are incomplete or haven’t even been drafted. While Baseline looks at the current browser support of a feature in CSS, we want to take a look at the evolution of the language itself. The CSS levels should not care about which browser implemented it first.

It might be more nuanced than this in reality, but that’s pretty much the gist. We also don’t want it to become another “modern CSS” bucket. Indeed, referring to CSS3 as an “era” has helped compartmentalize how we can shift into CSS4, CSS5, and beyond. For example, labeling something as a “CSS4” feature provides a hint about when that feature was born. A feature that reaches “baseline,” meanwhile, merely indicates the status of that feature’s browser implementation, which is a separate concern.

A feature’s era and its implementation status are both indicators that provide meta information about that feature, but they serve different purposes.

Why Not Work With An Annual Snapshot Instead Of A Numbered Era?

It’s fair to wonder if a potential solution is to take a “snapshot” of the CSS feature set each year and use that as a mile marker for CSS feature releases. However, an annual picture of the language is less effective than defining a particular era in which specific features are introduced.

There were a handful of years when CSS was relatively quiet compared to the mad dash of the last few years. Imagine a year in which nothing, or maybe very few, CSS features are shipped, and the snapshot for that year is nearly identical to the previous year’s snapshot. Now imagine CSS explodes the following year with a deluge of new features that result in a massive delta between snapshots. It takes mental agility to compare complete snapshots of the entire language and find what’s new.

Goals And Non-Goals

I think I’ve effectively established that the term “CSS” alone isn’t clear or helpful enough to illustrate the evolution of the language, just as calling a certain feature “modern” degrades over time.

Grouping features in levels that represent different eras of releases — even from a marketing standpoint — offers a good deal of meaning and has a track record of success, as we’ve seen with CSS3.

All of this comes back to a set of goals that the CSS-Next group is rallying around:

  • Help developers learn CSS.
  • Help educators teach CSS.
  • Help employers define modern web skills.
  • Help the community understand the progression of CSS capabilities over time.
  • Create a shared vernacular for describing how CSS evolves.

What we do not want is to:

  • Affect spec definitions.
    CSS-Next is not a group that would define the working process of or influence working groups such as the CSSWG.
  • Create official developer documentation.
    Making something like a new version of MDN doesn’t get us closer to a better understanding of how the language changes between eras.
  • Define browser specification work.
    This should be conducted in relevant standardization or pre-standardization forums (such as the CSSWG or OpenUI).
  • Educate developers on CSS best practices.
    That has much more to do with feature implementations than the features themselves.
  • Manage browser compatibility data.
    Baseline is already doing that, and besides, we’ve already established that feature specifications and implementations are separate concerns.

This doesn’t mean that everything in the last list is null and void. We could, for example, have CSS eras that list all the features specced in that period. And inside that list, there could be a baseline reference for the implementations of those features, making it easier to bring forward some ideas for the next Interop, which informs Baseline.

This leaves the CSS-Next group with a super-clear focus to:

  • Research the community’s understanding of modern CSS,
  • Build a shared understanding of CSS feature evolution since CSS3,
  • Group those features into easily digestible levels (i.e., CSS4, CSS5, and so on), and
  • Educate the community about modern CSS features.

We’d Likely Start With The “CSS5” Era

A lot of thought and work has gone into the way CSS is described in eras. The initial idea was to pick up where CSS3 left off and jump straight into CSS4. But the number of features released between the two eras would be massive, even if we narrowed it down to just the features released since 2020, never mind 2009.

It makes sense, instead, to split the difference and call CSS4 a done deal as of, say, 2018 and a fundamental part of CSS in its current state as we begin with the next logical period: CSS5.

Here’s how the definitions are currently defined:

CSS3 (~2009-2012):
Level 3 CSS specs as defined by the CSSWG. (immutable)

CSS4 (~2013-2018):
Essential features that were not part of CSS3 but are already a fundamental part of CSS.

CSS5 (~2019-2024):
Newer features whose adoption is steadily growing.

CSS6 (~2025+):
Early-stage features that are planned for future CSS.

Uncle Sam CSS Wants You!

We released a request for comments last May for community input from developers like you. We’ve received a few comments that have been taken into account, but we need much more feedback to help inform our approach.

We want a big representative response from the community! But that takes awareness, and we need you to make that happen. Anything you can do to let your teams and colleagues know that the CSS-Next group is a thing and that we’re trying to improve the way we talk about CSS features is greatly appreciated. We want to know what you and others think about the things we’re wrestling with, like whether or not the way we’re grouping eras above is a sound approach, where you think those lines should be drawn, and if you agree that we’re aiming for the right goals.

We also want you to participate. Anyone is welcome to join the CSS-Next group, and we could certainly use help brainstorming ideas. There’s even an incubation group that holds a biweekly, hour-long session on Mondays at 8:00 a.m. Pacific Time (2:00 p.m. GMT).

On a completely personal note, I’d like to add that I joined the CSS-Next group purely out of interest but became much more actively involved once the mission became very clear to me. As a developer working in an agency, I see how fast CSS changes and have struggled, like many of you, to keep up.

A seasoned colleague of mine commented the other day that they wouldn’t even know how to approach vanilla CSS on a fresh website project. There is no shame in that! I know many of us feel the same way. So, why not put it in marketing terms and figure out a better way to frame discussions about CSS features based on eras? You can help get us there!

And if you think I’m blameless when it comes to talking about CSS in generic “modern” terms, all it takes is a quick look at the headline of another Smashing article I authored this year!

Let’s get going with CSS5 and spread the word! Let me hear your thoughts.

Integrating Image-To-Text And Text-To-Speech Models (Part 1)

Audio descriptions involve narrating contextual visual information in images or videos, improving user experiences, especially for those who rely on audio cues.

At the core of audio description technology are two crucial components: the description and the audio. The description involves understanding and interpreting the visual content of an image or video, which includes details such as actions, settings, expressions, and any other relevant visual information. Meanwhile, the audio component converts these descriptions into spoken words that are clear, coherent, and natural-sounding.

So, here’s something we can do: build an app that generates and announces audio descriptions. The app can integrate a pre-trained vision-language model to analyze image inputs, extract relevant information, and generate accurate descriptions. These descriptions are then converted into speech using text-to-speech technology, providing a seamless and engaging audio experience.

By the end of this tutorial, you will gain a solid grasp of the components that are used to build audio description tools. We’ll spend time discussing what VLM and TTS models are, looking at many examples of them, and covering tooling for integrating them into your work.

When we finish, you will be ready to follow along with a second tutorial in which we level up and build a chatbot assistant that you can interact with to get more insights about your images or videos.

Vision-Language Models: An Introduction

VLMs are a form of artificial intelligence that can understand and learn from visuals and linguistic modalities.

They are trained on vast amounts of data that include images, videos, and text, allowing them to learn patterns and relationships between these modalities. In simple terms, a VLM can look at an image or video and generate a corresponding text description that accurately matches the visual content.

VLMs typically consist of three main components:

  1. An image model that extracts meaningful visual information,
  2. A text model that processes and understands natural language,
  3. A fusion mechanism that combines the representations learned by the image and text models, enabling cross-modal interactions.

Generally speaking, the image model — also known as the vision encoder — extracts visual features from input images and maps them to the language model’s input space, creating visual tokens. The text model then processes and understands natural language by generating text embeddings. Lastly, these visual and textual representations are combined through the fusion mechanism, allowing the model to integrate visual and textual information.
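
To make that flow a bit more concrete, here is a minimal sketch of generating a description from an image in JavaScript. It assumes the transformers.js library and an example captioning checkpoint, with a placeholder image URL; your own model choice and setup may differ:

import { pipeline } from '@xenova/transformers';

// Load a pre-trained image-to-text (captioning) pipeline
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');

// Generate a text description for an image (placeholder URL)
const output = await captioner('https://example.com/cat-on-windowsill.jpg');
console.log(output);
// → e.g. [{ generated_text: 'a cat sitting on a window sill' }]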

VLMs bring a new level of intelligence to applications by bridging visual and linguistic understanding. Here are some of the applications where VLMs shine:

  • Image captions: VLMs can provide automatic descriptions that enrich user experiences, improve searchability, and even make visual content accessible to people with vision impairments.
  • Visual answers to questions: VLMs could be integrated into educational tools to help students learn more deeply by allowing them to ask questions about visuals they encounter in learning materials, such as complex diagrams and illustrations.
  • Document analysis: VLMs can streamline document review processes, identifying critical information in contracts, reports, or patents much faster than reviewing them manually.
  • Image search: VLMs could open up the ability to perform reverse image searches. For example, an e-commerce site might allow users to upload image files that are processed to identify similar products that are available for purchase.
  • Content moderation: Social media platforms could benefit from VLMs by identifying and removing harmful or sensitive content automatically before publishing it.
  • Robotics: In industrial settings, robots equipped with VLMs can perform quality control tasks by understanding visual cues and describing defects accurately.

This is merely an overview of what VLMs are and the pieces that come together to generate audio descriptions. To get a clearer idea of how VLMs work, let’s look at a few real-world examples that leverage VLM processes.

VLM Examples

Based on the use cases we covered alone, you can probably imagine that VLMs come in many forms, each with its unique strengths and applications. In this section, we will look at a few examples of VLMs that can be used for a variety of different purposes.

IDEFICS

IDEFICS is an open-access model inspired by DeepMind’s Flamingo, designed to understand and generate text from images and text inputs. It’s similar to OpenAI’s GPT-4 model in its multimodal capabilities but is built entirely from publicly available data and models.

IDEFICS is trained on public data and models — like LLaMA v1 and OpenCLIP — and comes in two versions: the base and instructed versions, each available in 9 billion and 80 billion parameter sizes.

The model combines two pre-trained unimodal models (for vision and language) with newly added Transformer blocks that allow it to bridge the gap between understanding images and text. It’s trained on a mix of image-text pairs and multimodal web documents, enabling it to handle a wide range of visual and linguistic tasks. As a result, IDEFICS can answer questions about images, provide detailed descriptions of visual content, generate stories based on a series of images, and function as a pure language model when no visual input is provided.

PaliGemma

PaliGemma is an advanced VLM that draws inspiration from PaLI-3 and leverages open-source components like the SigLIP vision model and the Gemma language model.

Designed to process both images and textual input, PaliGemma excels at generating descriptive text in multiple languages. Its capabilities extend to a variety of tasks, including image captioning, answering questions from visuals, reading text, detecting subjects in images, and segmenting objects displayed in images.

The core architecture of PaliGemma includes a Transformer decoder paired with a Vision Transformer image encoder that boasts an impressive 3 billion parameters. The text decoder is derived from Gemma-2B, while the image encoder is based on SigLIP-So400m/14.

Through training methods similar to PaLI-3, PaliGemma achieves exceptional performance across numerous vision-language challenges.

PaliGemma is offered in two distinct sets:

  • General Purpose Models (PaliGemma): These pre-trained models are designed for fine-tuning a wide array of tasks, making them ideal for practical applications.
  • Research-Oriented Models (PaliGemma-FT): Fine-tuned on specific research datasets, these models are tailored for deep research on a range of topics.

Phi-3-Vision-128K-Instruct

The Phi-3-Vision-128K-Instruct model is a Microsoft model that combines text and vision capabilities. It’s built on a dataset of high-quality, reasoning-dense data from both text and visual sources. Part of the Phi-3 family, the model has a context length of 128K, making it suitable for a range of applications.

You might decide to use Phi-3-Vision-128K-Instruct in cases where your application has limited memory and computing power, thanks to its relatively lightweight architecture, which helps with latency. The model works best for generally understanding images, recognizing characters in text, and describing charts and tables.

Yi Vision Language (Yi-VL)

Yi-VL is an open-source AI model developed by 01-ai that can hold multi-round conversations about images, including reading text from images and translating it. This model is part of the Yi LLM series and has two versions: 6B and 34B.

What distinguishes Yi-VL from other models is its ability to carry a conversation, whereas other models are typically limited to a single text input. Plus, it’s bilingual, making it more versatile in a variety of language contexts.

Finding And Evaluating VLMs

There are many, many VLMs and we only looked at a few of the most notable offerings. As you commence work on an application with image-to-text capabilities, you may find yourself wondering where to look for VLM options and how to compare them.

There are two resources in the Hugging Face community you might consider using to help you find and compare VLMs. I use these regularly and find them incredibly useful in my work.

Vision Arena

Vision Arena is a leaderboard that ranks VLMs based on anonymous user voting and reviews. But what makes it great is the fact that you can compare any two models side-by-side for yourself to find the best fit for your application.

And when you compare two models, you can contribute your own anonymous votes and reviews for others to lean on as well.

OpenVLM Leaderboard

OpenVLM is another leaderboard hosted on Hugging Face for getting technical specs on different models. What I like about this resource is the wealth of metrics for evaluating VLMs, including the speed and accuracy of a given VLM.

Further, OpenVLM lets you filter models by size, type of license, and other ranking criteria. I find it particularly useful for finding VLMs I might have overlooked or new ones I haven’t seen yet.

Text-To-Speech Technology

Earlier, I mentioned that the app we are about to build will use vision-language models to generate written descriptions of images, which are then read aloud. The technology that handles converting text to audio speech is known as text-to-speech synthesis or simply text-to-speech (TTS).

TTS converts written text into synthesized speech that sounds natural. The goal is to take published content, like a blog post, and read it out loud in a realistic-sounding human voice.

So, how does TTS work? First, it breaks down text into the smallest units of sound, called phonemes, and this process allows the system to figure out proper word pronunciations. Next, AI enters the mix, including deep learning algorithms trained on hours of human speech data. This is how we get the app to mimic human speech patterns, tones, and rhythms — all the things that make for “natural” speech. The AI component is key as it elevates a voice from robotic to something with personality. Finally, the system combines the phoneme information with the AI-powered digital voice to render the fully expressive speech output.

The result is automatically generated speech that sounds fairly smooth and natural. Modern TTS systems are extremely advanced in that they can replicate different tones and voice inflections, work across languages, and understand context. This naturalness makes TTS ideal for humanizing interactions with technology, like having your device read text messages out loud to you, just like Apple’s Siri or Microsoft’s Cortana.
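
To make the phoneme step from above a bit more concrete, here’s a tiny sketch using the open-source phonemizer package. This isn’t something the app we build later depends on, just an illustration, and it assumes the espeak-ng backend is installed on your system.

#python
# Requires: pip install phonemizer (and the espeak-ng backend installed on the system)
from phonemizer import phonemize

text = "Text-to-speech converts written text into natural-sounding audio."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # a string of phonetic symbols a synthesizer can work from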

TTS Examples

Just as we took a moment to review existing vision language models, let’s pause to consider some of the more popular TTS resources that are available.

Bark

Straight from Bark’s model card in Hugging Face:

“Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio — including music, background noise, and simple sound effects. The model can also produce nonverbal communication, like laughing, sighing, and crying. To support the research community, we are providing access to pre-trained model checkpoints ready for inference.”

The non-verbal communication cues are particularly interesting and a distinguishing feature of Bark. Check out the various things Bark can do to communicate emotion, pulled directly from the model’s GitHub repo:

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]

This could be cool or creepy, depending on how it’s used, but it reflects the sophistication we’re working with. In addition to laughing and gasping, Bark is different in that it doesn’t work with phonemes like a typical TTS model:

“It is not a conventional TTS model but instead a fully generative text-to-audio model capable of deviating in unexpected ways from any given script. Different from previous approaches, the input text prompt is converted directly to audio without the intermediate use of phonemes. It can, therefore, generalize to arbitrary instructions beyond speech, such as music lyrics, sound effects, or other non-speech sounds.”
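
Because Bark is hosted on the Hugging Face hub, you can experiment with it through the same pipeline API we’ll lean on later in this tutorial. Here’s a minimal sketch; I’m assuming the lighter suno/bark-small checkpoint, and note that the bracketed cues go straight into the text prompt (how faithfully they come through can vary).

#python
from transformers import pipeline
import scipy.io.wavfile as wavfile
import numpy as np

# Hypothetical setup: the lighter "small" Bark checkpoint from the Hugging Face hub
synthesiser = pipeline("text-to-speech", model="suno/bark-small")

# Bracketed cues like [laughs] are part of the prompt itself
speech = synthesiser("Well, that was unexpected! [laughs] Anyway, let's move on.")
wavfile.write("bark_output.wav", rate=speech["sampling_rate"], data=np.squeeze(speech["audio"]))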

Coqui

Coqui/XTTS-v2 can clone voices in different languages. All it needs for training is a short six-second clip of audio. This means the model can be used to translate audio snippets from one language into another while maintaining the same voice.

At the time of writing, Coqui supports 16 languages, including English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, and Korean.
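
Coqui’s models ship with their own Python package rather than the transformers pipeline. Here’s a minimal voice-cloning sketch; I’m assuming you’ve installed the TTS package and have a short reference clip saved as reference.wav.

#python
# Requires: pip install TTS
from TTS.api import TTS

# Load the multilingual XTTS-v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from a short reference clip and speak the text in French
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="reference.wav",  # placeholder: ~6-second clip of the voice to clone
    language="fr",
    file_path="cloned_voice.wav",
)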

Parler-TTS

Parler-TTS excels at generating high-quality, natural-sounding speech in the style of a given speaker. In other words, it replicates a person’s voice. This is where many folks might draw an ethical line because techniques like this can be used to essentially imitate a real person, even without their consent, in a process known as a “deepfake,” and the consequences can range from benign impersonations to full-on phishing attacks.

But that’s not really the aim of Parler-TTS. Rather, it’s good in contexts that require personalized and natural-sounding speech generation, such as voice assistants and possibly even accessibility tooling to aid visual impairments by announcing content.

TTS Arena Leaderboard

Remember how I shared the OpenVLM Leaderboard for finding and comparing vision language models? Well, there’s an equivalent leaderboard for TTS models over at the Hugging Face community called TTS Arena.

TTS models are ranked by the “naturalness” of their voices, with the most natural-sounding models ranked first. Developers like you and me vote and provide feedback that influences the rankings.

TTS API Providers

What we just looked at are TTS models that are baked into whatever app we’re making. However, some models are consumable via API, so it’s possible to get the benefits of a TTS model without the added bloat if a particular model is made available by an API provider.

Whether you decide to bundle TTS models in your app or integrate them via APIs is totally up to you. There is no right answer as far as one method being better than another — it’s more about the app’s requirements and whether the dependability of a baked-in model is worth the memory hit, or vice versa.

All that being said, I want to call out a handful of TTS API providers for you to keep in your back pocket.

ElevenLabs

ElevenLabs offers a TTS API that uses neural networks to make voices sound natural. Voices can be customized for different languages and accents, leading to realistic, engaging voices.

Try the model out for yourself on the ElevenLabs site. You can enter a block of text and choose from a wide variety of voices that read the submitted text aloud.
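
To give you a feel for what calling a hosted TTS API looks like compared to bundling a model, here’s a rough sketch of an ElevenLabs request using Python’s requests library. Treat the endpoint details, voice ID, and payload fields as assumptions to verify against the current ElevenLabs documentation.

#python
import requests

API_KEY = "your-elevenlabs-api-key"   # placeholder
VOICE_ID = "your-chosen-voice-id"     # placeholder: pick one from your ElevenLabs dashboard

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Hello! This audio was generated by a hosted text-to-speech API."},
)
response.raise_for_status()

# The response body is audio bytes (MP3 by default)
with open("elevenlabs_output.mp3", "wb") as f:
    f.write(response.content)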

Colossyan

Colossyan’s text-to-speech API converts text into natural-sounding voice recordings in over 70 languages and accents. From there, the service allows you to match the audio to an avatar to produce something like a complete virtual presentation based on your voice — or someone else’s.

Once again, this is encroaching on deepfake territory, but it’s really interesting to think of Colossyan’s service as a virtual casting call for actors to perform off a script.

Murf.ai

Murf.ai is yet another TTS API designed to generate voiceovers based on real human voices. The service provides a slew of premade voices you can use to generate audio for anything from explainer videos and audiobooks to course lectures and entire podcast episodes.

Amazon Polly

Amazon has its own TTS API called Polly. You can customize the voices using lexicons and Speech Synthesis Markup Language (SSML) tags to establish speaking styles, with affordances for adjusting things like pitch, speed, and volume.
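
For instance, a Polly call with a couple of SSML adjustments could look like the sketch below. I’m assuming the boto3 SDK, AWS credentials that are already configured, and a voice that supports the neural engine.

#python
import boto3

# Assumes AWS credentials are already configured (e.g., via environment variables)
polly = boto3.client("polly", region_name="us-east-1")

# SSML slows the speech down slightly and inserts a short pause
ssml = (
    "<speak>"
    "<prosody rate='90%'>Here is your audio description.</prosody>"
    "<break time='500ms'/>Enjoy!"
    "</speak>"
)

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    OutputFormat="mp3",
    VoiceId="Joanna",
    Engine="neural",
)

with open("polly_output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())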

PlayHT

The PlayHT TTS API generates speech in 142 languages. Type what you want it to say, pick a voice, and download the output as an MP3 or WAV file.

Demo: Building An Image-to-Audio Interface

So far, we have discussed the two primary components for generating audio from text: vision-language models and text-to-speech models. We’ve covered what they are, where they fit into the process of generating real-sounding speech, and various examples of each model.

Now, it’s time to apply those concepts to the app we are building in this tutorial (and will improve in a second tutorial). We will use a VLM so the app can glean meaning and context from images, a TTS model to generate speech that mimics a human voice, and then integrate our work into a user interface for submitting images that will lead to generated speech output.

I have decided to base our work on a VLM by Salesforce called BLIP, a TTS model from Kakao Enterprise called VITS, and Gradio as a framework for the design interface. I’ve covered Gradio extensively in other articles, but the gist is that it is a Python library for building web interfaces — only it offers built-in tools for working with machine learning models that make Gradio ideal for a tutorial like this.

You can use completely different models if you like. The whole point is less about the intricacies of a particular model than it is to demonstrate how the pieces generally come together.

Oh, and one more detail worth noting: I am working with the code for all of this in Google Colab. I’m using it because it’s hosted and ideal for demonstrations like this. But you can certainly work in a more traditional IDE, like VS Code.

Installing Libraries

First, we need to install the necessary libraries:

#python
!pip install gradio pillow transformers scipy numpy

We can upgrade the transformers library to the latest version if we need to:

#python
!pip install --upgrade transformers

Not sure if you need to upgrade? Here’s how to check the current version:

#python
import transformers
print(transformers.__version__)

OK, now we are ready to import the libraries:

#python
import gradio as gr
from PIL import Image
from transformers import pipeline
import scipy.io.wavfile as wavfile
import numpy as np

These libraries will help us process images, use models on the Hugging Face hub, handle audio files, and build the UI.

Creating Pipelines

Since we will pull our models directly from Hugging Face’s model hub, we can tap into them using pipelines. This way, we get a simple, task-oriented API for natural language processing and computer vision jobs without having to write the model-loading and inference code ourselves.

We set up our pipeline like this:

#python
caption_image = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

This establishes a pipeline for us to access BLIP for converting images into textual descriptions. Again, you could establish a pipeline for any other model in the Hugging Face hub.

We’ll need a pipeline connected to our TTS model as well:

#python
Narrator = pipeline("text-to-speech", model="kakao-enterprise/vits-ljs")

Now, we have a pipeline where we can pass our image text to be converted into natural-sounding speech.

Converting Text to Speech

What we need now is a function that handles the audio conversion. Your code will differ depending on the TTS model in use, but here is how I approached the conversion based on the VITS model:

#python

def generate_audio(text):
  # Generate speech from the input text using the Narrator (VITS model)
  Narrated_Text = Narrator(text)

  # Extract the audio data and sampling rate
  audio_data = np.array(Narrated_Text["audio"][0])
  sampling_rate = Narrated_Text["sampling_rate"]

  # Save the generated speech as a WAV file
  wavfile.write("generated_audio.wav", rate=sampling_rate, data=audio_data)

  # Return the filename of the saved audio file
  return "generated_audio.wav"

That’s great, but we need to make sure there’s a bridge that connects the text that the app generates from an image to the speech conversion. We can write a function that uses BLIP to generate the text and then calls the generate_audio() function we just defined:

#python
def caption_my_image(pil_image):
  # Use BLIP to generate a text description of the input image
  semantics = caption_image(images=pil_image)[0]["generated_text"]

  # Generate audio from the text description
  return generate_audio(semantics)
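
Before wiring up the interface, it’s worth sanity-checking the chain of pipelines directly. Here’s a quick, throwaway snippet that assumes you have some local image named test.jpg to experiment with:

#python
from PIL import Image

# Quick sanity check of the full chain, no UI involved
test_image = Image.open("test.jpg")  # placeholder: any local image
audio_file = caption_my_image(test_image)
print("Speech saved to", audio_file)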

Building The User Interface

Our app would be pretty useless if there was no way to interact with it. This is where Gradio comes in. We will use it to create a form that accepts an image file as an input and then outputs an audio file containing the speech generated from the image’s description.

#python

main_tab = gr.Interface(
  fn=caption_my_image,
  inputs=[gr.Image(label="Select Image", type="pil")],
  outputs=[gr.Audio(label="Generated Audio")],
  title=" Image Audio Description App",
  description="This application provides audio descriptions for images."
)

# Information tab
info_tab = gr.Markdown("""
  # Image Audio Description App
  ### Purpose
  This application is designed to assist visually impaired users by providing audio descriptions of images. It can also be used in various scenarios such as creating audio captions for educational materials, enhancing accessibility for digital content, and more.

  ### Limits
  - The quality of the description depends on the image clarity and content.
  - The application might not work well with images that have complex scenes or unclear subjects.
  - Audio generation time may vary depending on the input image size and content.
  ### Note
  - Ensure the uploaded image is clear and well-defined for the best results.
  - This app is a prototype and may have limitations in real-world applications.
""")

# Combine both tabs into a single app
demo = gr.TabbedInterface(
  [main_tab, info_tab],
  tab_names=["Main", "Information"]
)

demo.launch()

The interface is quite plain and simple, but that’s OK since our work is purely for demonstration purposes. You can always add to this for your own needs. The important thing is that you now have a working application you can interact with.

At this point, you could run the app and try it in Google Colab. You also have the option to deploy your app, though you’ll need hosting for it. Hugging Face also has a feature called Spaces that you can use to deploy your work and run it without Google Colab. There’s even a guide you can use to set up your own Space.

Here’s the final app that you can try by uploading your own photo:

Coming Up…

We covered a lot of ground in this tutorial! In addition to learning about VLMs and TTS models at a high level, we looked at different examples of them and then covered how to find and compare models.

But the rubber really met the road when we started work on our app. Together, we made a useful tool that generates text from an image file and then sends that text to a TTS model to convert it into speech that is read out loud and can be downloaded as a WAV file.

But we’re not done just yet! What if we could glean even more detailed information from images and our app not only describes the images but can also carry on a conversation about them?

Sounds exciting, right? This is exactly what we’ll do in the second part of this tutorial.

How To Design Effective Conversational AI Experiences: A Comprehensive Guide

Conversational AI is revolutionizing information access, offering a personalized, intuitive search experience that delights users and empowers businesses. A well-designed conversational agent acts as a knowledgeable guide, understanding user intent and effortlessly navigating vast data, which leads to happier, more engaged users, fostering loyalty and trust. Meanwhile, businesses benefit from increased efficiency, reduced costs, and a stronger bottom line. On the other hand, a poorly designed system can lead to frustration, confusion, and, ultimately, abandonment.

Achieving success with conversational AI requires more than just deploying a chatbot. To truly harness this technology, we must master the intricate dynamics of human-AI interaction. This involves understanding how users articulate needs, explore results, and refine queries, paving the way for a seamless and effective search experience.

This article will decode the three phases of conversational search, the challenges users face at each stage, and the strategies and best practices AI agents can employ to enhance the experience.

The Three Phases Of Conversational Search

To analyze these complex interactions, Trippas et al. (2018) (PDF) proposed a framework that outlines three core phases in the conversational search process:

  1. Query formulation: Users express their information needs, often facing challenges in articulating them clearly.
  2. Search results exploration: Users navigate through presented results, seeking further information and refining their understanding.
  3. Query re-formulation: Users refine their search based on new insights, adapting their queries and exploring different avenues.

Building on this framework, Azzopardi et al. (2018) (PDF) identified five key user actions within these phases (reveal, inquire, navigate, interrupt, and interrogate) and the corresponding agent actions: inquire, reveal, traverse, suggest, and explain.

In the following sections, I’ll break down each phase of the conversational search journey, delving into the actions users take and the corresponding strategies AI agents can employ, as identified by Azzopardi et al. (2018) (PDF). I’ll also share actionable tactics and real-world examples to guide the implementation of these strategies.

Phase 1: Query Formulation: The Art Of Articulation

In the initial phase of query formulation, users attempt to translate their needs into prompts. This process involves conscious disclosures — sharing details they believe are relevant — and unconscious non-disclosure — omitting information they may not deem important or struggle to articulate.

This process is fraught with challenges. As Jakob Nielsen aptly pointed out,

“Articulating ideas in written prose is hard. Most likely, half the population can’t do it. This is a usability problem for current prompt-based AI user interfaces.”

— Jakob Nielsen

This can manifest as:

  • Vague language: “I need help with my finances.”
    Budgeting? Investing? Debt management?
  • Missing details: “I need a new pair of shoes.”
    What type of shoes? For what purpose?
  • Limited vocabulary: Not knowing the right technical terms. “I think I have a sprain in my ankle.”
    The user might not know the difference between a sprain and a strain or the correct anatomical terms.

These challenges can lead to frustration for users and less relevant results from the AI agent.

AI Agent Strategies: Nudging Users Towards Better Input

To bridge the articulation gap, AI agents can employ three core strategies:

  1. Elicit: Proactively guide users to provide more information.
  2. Clarify: Seek to resolve ambiguities in the user’s query.
  3. Suggest: Offer alternative phrasing or search terms that better capture the user’s intent.

The key to effective query formulation is balancing elicitation and assumption. Overly aggressive questioning can frustrate users, and making too many assumptions can lead to inaccurate results.

For example,

User: “I need a new phone.”

AI: “What’s your budget? What features are important to you? What size screen do you prefer? What carrier do you use?...”

This rapid-fire questioning can overwhelm the user and make them feel like they're being interrogated. A more effective approach is to start with a few open-ended questions and gradually elicit more details based on the user’s responses.

As Azzopardi et al. (2018) (PDF) stated in the paper,

“There may be a trade-off between the efficiency of the conversation and the accuracy of the information needed as the agent has to decide between how important it is to clarify and how risky it is to infer or impute the underspecified or missing details.”

Implementation Tactics And Examples

  • Probing questions: Ask open-ended or clarifying questions to gather more details about the user’s needs. For example, Perplexity Pro uses probing questions to elicit more details about the user’s needs for gift recommendations.

For example, after clicking one of the initial prompts, “Create a personal webpage,” ChatGPT added another sentence, “Ask me 3 questions first on whatever you need to know,” to elicit more details from the user.

  • Interactive refinement: Utilize visual aids like sliders, checkboxes, or image carousels to help users specify their preferences without articulating everything in words. For example, Adobe Firefly’s side settings allow users to adjust their preferences.

  • Suggested prompts: Provide examples of more specific or detailed queries to help users refine their search terms. For example, Nelson Norman Group provides an interface that offers a suggested prompt to help users refine their initial query.

For example, after clicking one of the initial prompts in Gemini, “Generate a stunning, playful image,” more details are added in blue in the input.

  • Offering multiple interpretations: If the query is ambiguous, present several possible interpretations and let the user choose the most accurate one. For example, Gemini offers a list of gift suggestions for the query “gifts for my friend who loves music,” categorized by the recipient’s potential music interests to help the user pick the most relevant one.

Phase 2: Search Results Exploration: A Multifaceted Journey

Once the query is formed, the focus shifts to exploration. Users embark on a multifaceted journey through search results, seeking to understand their options and make informed decisions.

Two primary user actions mark this phase:

  1. Inquire: Users actively seek more information, asking for details, comparisons, summaries, or related options.
  2. Navigate: Users navigate the presented information, browse through lists, revisit previous options, or request additional results. This involves scrolling, clicking, and using voice commands like “next” or “previous.”

AI Agent Strategies: Facilitating Exploration And Discovery

To guide users through the vast landscape of information, AI agents can employ these strategies:

  1. Reveal: Present information that caters to diverse user needs and preferences.
  2. Traverse: Guide the user through the information landscape, providing intuitive navigation and responding to their evolving interests.

During discovery, it’s vital to avoid information overload, which can overwhelm users and hinder their decision-making. For example,

User: “I’m looking for a place to stay in Tokyo.”

AI: Provides a lengthy list of hotels without any organization or filtering options.

Instead, AI agents should offer the most relevant results and allow users to filter or sort them based on their needs. This might include presenting a few top recommendations based on ratings or popularity, with options to refine the search by price range, location, amenities, and so on.

Additionally, AI agents should understand natural language navigation. For example, if a user asks, “Tell me more about the second hotel,” the AI should provide additional details about that specific option without requiring the user to rephrase their query. This level of understanding is crucial for flexible navigation and a seamless user experience.

Implementation Tactics And Examples

  • Diverse formats: Offer results in various formats (lists, summaries, comparisons, images, videos) and allow users to specify their preferences. For example, Gemini presents a summarized format of hotel information, including a photo, price, rating, star rating, category, and brief description to allow the user to evaluate options quickly for the prompt “I’m looking for a place to stay in Paris.”

  • Context-aware navigation: Maintain conversational context, remember user preferences, and provide relevant navigation options. For example, following the previous example prompt, Gemini reminds users of the potential next steps at the end of the response.

  • Interactive exploration: Use carousels, clickable images, filter options, and other interactive elements to enhance the exploration experience. For example, Perplexity offers a carousel of images related to “a vegetarian diet” and other interactive elements like “Watch Videos” and “Generate Image” buttons to enhance exploration and discovery.

  • Multiple responses: Present several variations of a response. For example, users can see multiple draft responses to the same query by clicking the “Show drafts” button in Gemini.

  • Flexible text length and tone. Enable users to customize the length and tone of AI-generated responses to better suit their preferences. For example, Gemini provides multiple options for welcome messages, offering varying lengths, tones, and degrees of formality.

Phase 3: Query Re-formulation: Adapting To Evolving Needs

As users interact with results, their understanding deepens, and their initial query might not fully capture their evolving needs. During query re-formulation, users refine their search based on exploration and new insights, often involving interrupting and interrogating. Query re-formulation empowers users to course-correct and refine their search.

  • Interrupt: Users might pause the conversation to:
    • Correct: “Actually, I meant a desktop computer, not a laptop.”
    • Add information: “I also need it to be good for video editing.”
    • Change direction: “I’m not interested in those options. Show me something else.”
  • Interrogate: Users challenge the AI to ensure it understands their needs and justify its recommendations:
    • Seek understanding: “What do you mean by ‘good battery life’?”
    • Request explanations: “Why are you recommending this particular model?”

AI Agent Strategies: Adapting And Explaining

To navigate the query re-formulation phase effectively, AI agents need to be responsive, transparent, and proactive. Two core strategies for AI agents:

  1. Suggest: Proactively offer alternative directions or options to guide the user towards a more satisfying outcome.
  2. Explain: Provide clear and concise explanations for recommendations and actions to foster transparency and build trust.

AI agents should balance suggestions with relevance and explain why certain options are suggested while avoiding overwhelming them with unrelated suggestions that increase conversational effort. A bad example would be the following:

User: “I want to visit Italian restaurants in New York.”

AI: Suggests unrelated options, like Mexican or American restaurants, when the user is interested in Italian cuisine.

This could frustrate the user and reduce trust in the AI.

A better answer could be, “I found these highly-rated Italian restaurants. Would you like to see more options based on different price ranges?” This ensures users understand the reasons behind recommendations, enhancing their satisfaction and trust in the AI's guidance.

Implementation Tactics And Examples

  • Transparent system process: Show the steps involved in generating a response. For example, Perplexity Pro outlines the search process step by step to fulfill the user’s request.

  • Explainable recommendations: Clearly state the reasons behind specific recommendations, referencing user preferences, historical data, or external knowledge. For example, ChatGPT includes recommended reasons for each listed book in response to the question “books for UX designers.”

  • Source reference: Enhance the answer with source references to strengthen the evidence supporting the conclusion. For example, Perplexity presents source references to support the answer.

  • Point-to-select: Users should be able to directly select specific elements or locations within the dialogue for further interaction rather than having to describe them verbally. For example, users can select part of an answer and ask a follow-up in Perplexity.

  • Proactive recommendations: Suggest related or complementary items based on the user’s current selections. For example, Perplexity offers a list of related questions to guide the user’s exploration of “a vegetarian diet.”

Overcoming LLM Shortcomings

While the strategies discussed above can significantly improve the conversational search experience, LLMs still have inherent limitations that can hinder their intuitiveness. These include the following:

  • Hallucinations: Generating false or nonsensical information.
  • Lack of common sense: Difficulty understanding queries that require world knowledge or reasoning.
  • Sensitivity to input phrasing: Producing different responses to slightly rephrased queries.
  • Verbosity: Providing overly lengthy or irrelevant information.
  • Bias: Reflecting biases present in the training data.

To create truly effective and user-centric conversational AI, it’s crucial to address these limitations and make interactions more intuitive. Here are some key strategies:

  • Incorporate structured knowledge
    Integrating external knowledge bases or databases can ground the LLM’s responses in facts, reducing hallucinations and improving accuracy.
  • Fine-tuning
    Training the LLM on domain-specific data enhances its understanding of particular topics and helps mitigate bias.
  • Intuitive feedback mechanisms
    Allow users to easily highlight and correct inaccuracies or provide feedback directly within the conversation. This could involve clickable elements to flag problematic responses or a “this is incorrect” button that prompts the AI to reconsider its output.
  • Natural language error correction
    Develop AI agents capable of understanding and responding to natural language corrections. For example, if a user says, “No, I meant X,” the AI should be able to interpret this as a correction and adjust its response accordingly.
  • Adaptive learning
    Implement machine learning algorithms that allow the AI to learn from user interactions and improve its performance over time. This could involve recognizing patterns in user corrections, identifying common misunderstandings, and adjusting behavior to minimize future errors.

Training AI Agents For Enhanced User Satisfaction

Understanding and evaluating user satisfaction is fundamental to building effective conversational AI agents. However, directly measuring user satisfaction in the open-domain search context can be challenging, as Zhumin Chu et al. (2022) highlighted. Traditionally, metrics like session abandonment rates or task completion were used as proxies, but these don’t fully capture the nuances of user experience.

To address this, Clemencia Siro et al. (2023) offer a comprehensive approach to gathering and leveraging user feedback:

  • Identify key dialogue aspects
    To truly understand user satisfaction, we need to look beyond simple metrics like “thumbs up” or “thumbs down.” Consider evaluating aspects like relevance, interestingness, understanding, task completion, interest arousal, and efficiency. This multi-faceted approach provides a more nuanced picture of the user’s experience.
  • Collect multi-level feedback
    Gather feedback at both the turn level (each question-answer pair) and the dialogue level (the overall conversation). This granular approach pinpoints specific areas for improvement, both in individual responses and the overall flow of the conversation.
  • Recognize individual differences
    Understand that the concept of satisfaction varies per user. Avoid assuming all users perceive satisfaction similarly.
  • Prioritize relevance
    While all aspects are important, relevance (at the turn level) and understanding (at both the turn and session level) have been identified as key drivers of user satisfaction. Focus on improving the AI agent’s ability to provide relevant and accurate responses that demonstrate a clear understanding of the user’s intent.

Additionally, consider these practical tips for incorporating user satisfaction feedback into the AI agent’s training process:

  • Iterate on prompts
    Use user feedback to refine the prompts to elicit information and guide the conversation.
  • Refine response generation
    Leverage feedback to improve the relevance and quality of the AI agent’s responses.
  • Personalize the experience
    Tailor the conversation to individual users based on their preferences and feedback.
  • Continuously monitor and improve
    Regularly collect and analyze user feedback to identify areas for improvement and iterate on the AI agent’s design and functionality.

The Future Of Conversational Search: Beyond The Horizon

The evolution of conversational search is far from over. As AI technologies continue to advance, we can anticipate exciting developments:

  • Multi-modal interactions
    Conversational search will move beyond text, incorporating voice, images, and video to create more immersive and intuitive experiences.
  • Personalized recommendations
    AI agents will become more adept at tailoring search results to individual users, considering their past interactions, preferences, and context. This could involve suggesting restaurants based on dietary restrictions or recommending movies based on previously watched titles.
  • Proactive assistance
    Conversational search systems will anticipate user needs and proactively offer information or suggestions. For instance, an AI travel agent might suggest packing tips or local customs based on a user’s upcoming trip.

When Friction Is A Good Thing: Designing Sustainable E-Commerce Experiences

As lavish influencer lifestyles, wealth flaunting, and hauls dominate social media feeds, we shouldn’t be surprised that excessive consumption has become the default way of living. We see closets filled to the brim with cheap, throw-away items and having the latest gadget arsenal as signifiers of an aspirational life.

Consumerism, however, is more than a cultural trend; it’s the backbone of our economic system. Companies eagerly drive excessive consumption as an increase in sales is directly connected to an increase in profit.

While we learned to accept this level of material consumption as normal, we need to be reminded of the massive environmental impact that comes along with it. As Yvon Chouinard, founder of Patagonia, writes in a New York Times article:

“Obsession with the latest tech gadgets drives open pit mining for precious minerals. Demand for rubber continues to decimate rainforests. Turning these and other raw materials into final products releases one-fifth of all carbon emissions.”

— Yvon Chouinard

In the paper, Scientists’ Warning on Affluence, a group of researchers concluded that reducing material consumption today is essential to avoid the worst of the looming climate change in the coming years. This need for lowering consumption is also reflected in the UN’s Sustainability goals, specifically Goal 12, “Ensuring sustainable consumption and production patterns”.

For a long time, design has been a tool for consumer engineering, for example, by designing products with an artificially limited useful life (planned obsolescence) to ensure continuous consumption. And if we want to understand specifically UX design’s role in influencing how much and what people buy, we have to take a deeper look at pushy online shopping experiences.

Design Shaping Shopping Habits: The Problem With Current E-commerce Design

Today, most online shopping experiences are designed with persuasion, gamification, nudging and even deception to get unsuspecting users to add more things to their basket.

There are “Hurry, only one item left in stock” type messages and countdown clocks that exploit well-known cognitive biases to nudge users to make impulse purchase decisions. As Michael Keenan explains,

“The scarcity bias says that humans place a higher value on items they believe to be rare and a lower value on things that seem abundant. Scarcity marketing harnesses this bias to make brands more desirable and increase product sales. Online stores use limited releases, flash sales, and countdown timers to induce FOMO — the fear of missing out — among shoppers.”

— Michael Keenan

To make buying things quick and effortless, we remove friction from the checkout process, for example, with the one-click-buy button. As practitioners of user-centered design, we might implement the button and say: thanks to this frictionless and easy checkout process, we improved the customer experience. Or did we just do a huge disservice to our users?

Gliding through the checkout process in seconds leaves no time for the user to ask, “Do I actually want this?” or “Do I have the money for this?”. Indeed, putting users on autopilot to make thoughtless decisions is the goal.

As a business.com article says: “Click to buy helps customers complete shopping within seconds and reduces the amount of time they have to reconsider their purchase.”

Amanda Mull writes from a user perspective about how it has become “too easy to buy stuff you don’t want”:

“The order took maybe 15 seconds. I selected my size and put the shoes in my cart, and my phone automatically filled in my login credentials and added my new credit card number. You can always return them, I thought to myself as I tapped the “Buy” button. [...] I had completed some version of the online checkout process a million times before, but I never could remember it being quite so spontaneous and thoughtless. If it’s going to be that easy all the time, I thought to myself, I’m cooked.”

— Amanda Mull

This quote also highlights that this thoughtless consumption is not only harmful to the environment but also to the very same user we say we center our design process around. The rising popularity of buy-now-pay-later services, credit card debt, and personal finance gurus to help “Overcoming Overspending” are indicators that people are spending more than they can afford, a huge source of stress for many.

The one-click-buy button is not about improving user experience but building an environment where users are “more likely to buy more and buy often.” If we care to put this bluntly, frictionless and persuasive e-commerce design is not user-centered but business-centered design.

While it is not unusual for design to be a tool to achieve business goals, we, designers, should be clear about who we are serving and at what cost with the power of design. To reckon with our impact, first, we have to understand the source of power we yield — the power asymmetry between the designer and the user.

Power Asymmetry Between User And Designer

Imagine a scale: on one end sits the designer and the user on the other. Now, let’s take an inventory of the sources of power each party has in their hands in an online shopping situation and see how the scale balances.

Designers

Designers are equipped with knowledge about psychology, biases, nudging, and persuasion techniques. If we don’t have the time to learn all that, we can reach for an out-of-the-box solution that uses those exact psychological and behavioral insights. For example, Nudgify, a WooCommerce integration, promises to help “you get more sales and reduce shopping cart abandonment by creating Urgency and removing Friction.”

Erika Hall puts it this way: “When you are designing, you are making choices on behalf of other people.” We even have a word for this: choice architecture. Choice architecture refers to the deliberate crafting of decision-making environments. By subtly shaping how options are presented, choice architecture influences individual decision-making, often without their explicit awareness.

On top of this, we also collect funnel metrics, behavioral data, and A/B test things to make sure our designs work as intended. In other words, we control the environment where the user is going to make decisions, and we are knowledgeable about how to tweak it in a way to encourage the decisions we want the user to make. Or, as Vitaly Friedman says in one of his articles:

“We’ve learned how to craft truly beautiful interfaces and well-orchestrated interactions. And we’ve also learned how to encourage action to meet the project’s requirements and drive business metrics. In fact, we can make pretty much anything work, really.”

— Vitaly Friedman

User

On the other end of the scale, we have the user who is usually unaware of our persuasion efforts, oblivious about their own biases, let alone understanding when and how those are triggered.

Luckily, regulation around Deceptive Design on e-commerce is increasing. For example, companies are not allowed to use fake countdown timers. However, these regulations are not universal, and enforcement is lax, so often users are still not protected by law against pushy shopping experiences.

After this overview, let’s see how the scale balances:

When we understand this power asymmetry between designer and user, we need to ask ourselves:

  • What do I use my power for?
  • What kind of “real life” user behavior am I designing for?
  • What is the impact of the users’ behavior resulting from my design?

If we look at e-commerce design today, more often than not, the unfortunate answer is mindless and excessive consumption.

This needs to change. We need to use the power of design to encourage sustainable user behavior and thus move us toward a sustainable future.

What Is Sustainable E-commerce?

The discussion about sustainable e-commerce usually revolves around recyclable packaging, green delivery, and making the site energy-efficient with sustainable UX. All these actions and angles are important and should be part of our design process, but can we build a truly sustainable e-commerce if we are still encouraging unsustainable user behavior by design?

To achieve truly sustainable e-commerce, designers must shift from encouraging impulse purchases to supporting thoughtful decisions. Instead of using persuasion, gamification, and deception to boost sales, we should use our design skills to provide users with the time, space, and information they need to make mindful purchase decisions. I call this approach Kind Commerce.

But The Business?!

While the intent of designing Kind Commerce is noble, we have a bitter reality to deal with: we live and work in an economic system based on perpetual growth. We are often measured on achieving KPIs like “increased conversion” or “reduced cart abandonment rate”. We are expected to use UX to achieve aggressive sales goals, and often, we are not in a position to change that.

It is a frustrating situation to be in because we can argue that the system needs to change, so it is possible for UXers to move away from persuasive e-commerce design. However, system change won’t happen unless we push for it. A catch-22 situation. So, what are the things we could do today?

  • Pitch Kind Commerce as a way to build strong customer relationships that will have higher lifetime value than the quick buck we would make with persuasive tricks.
  • Highlight reduced costs. As Vitaly writes, using deceptive design can be costly for the company:
“‘Add to basket’ is beautifully highlighted in green, indicating a way forward, with insurance added in automatically. That’s a clear dark pattern, of course. The design, however, is likely to drive business KPIs, i.e., increase a spend per customer. But it will also generate a wrong purchase. The implications of it for businesses might be severe and irreversible — with plenty of complaints, customer support inquiries, and high costs of processing returns.”

— Vitaly Friedman

Helping users find the right products and make decisions they won’t regret can help the company save all the resources they would need to spend on dealing with complaints and returns. On top of this, the company can save millions of dollars by avoiding lawsuits for unfair commercial practices.

  • Highlight the increasing customer demand for sustainable companies.
  • If you feel that your company is not open to changing its practices and you are frustrated about the dissonance between your day job and values, consider looking for a position where you can support a company or a cause that aligns with your values.

A Few Principles To Design Mindful E-commerce

Add Friction

I know, I know, it sounds like an insane proposition in a profession obsessed with eliminating friction, but hear me out. Instead of “helping” users glide through the checkout process with one-click buy buttons, adding a step to review their order and giving them a pause could help reduce unnecessary purchases. A positive reframing of this technique could help express our true intentions.

Instead of saying “adding friction,” we could say “adding a protective step”. Another example of “adding a protective step” could be getting rid of the “Quick Add” buttons and making users go to the product page to take a look at what they are going to buy. For example, Organic Basics doesn’t have a “Quick Add” button; users can only add things to their cart from the product page.

Inform

Once we make sure users will visit product pages, we can help them make more informed decisions. We can be transparent about the social and environmental impact of an item or provide guidelines on how to care for the product to last a long time.

For example, Asket has a section called “Lifecycle” where they highlight how to care for, repair and recycle their products. There is also a “Full Transparency” section to inform about the cost and impact of the garment.

Design Calm Pages

Aggressive landing pages, where everything is moving or blinking, modals keep popping up, and ten different discounts are presented at once, are overwhelming, confusing, and distracting: a fertile environment for impulse decisions.

Respect your user’s attention by designing pages that don’t raise their blood pressure to 180 the second they open them. No modals automatically popping up, no flashing carousels, and no discount dumping. Aim for static banners and display offers in a clear and transparent way. For example, H&M shows only one banner highlighting a discount on their landing page, and that’s it. If a fast fashion brand like H&M can design calm pages, there is no excuse why others couldn’t.

Be Honest In Your Messaging

Fake urgency and social proof can not only get you fined for millions of dollars but can also turn users away. So simply do not add urgency messages and countdown clocks where there is no real deadline behind an offer. Don’t use fake social proof messages. Don’t say something has a limited supply when it doesn’t.

I would even take this a step further and recommend using persuasion sparingly, even when it’s honest. Instead of overloading the product page with every possible persuasion method (urgency, social proof, incentive, and so on, assuming they are all honest), choose a single, impactful persuasion point.

Disclaimer

To make it clear, I’m not advocating for designing bad or cumbersome user experiences to obstruct customers from buying things. Of course, I want a delightful and easy way to buy things we need.

I’m also well aware that design is never neutral. We need to present options and arrange user flows, and whichever way we choose to do that will influence user decisions and actions.

What I’m advocating for is at least putting the user back in the center of our design process. We read earlier that users think it is “too easy to buy things you don’t need” and feel that the current state of e-commerce design is contributing to their excessive spending. Understanding this and calling ourselves user-centered, we ought to change our approach significantly.

On top of this, I’m advocating for expanding our perspective to consider the wider environmental and social impact of our designs and align our work with the move toward a sustainable future.

Mindful Consumption Beyond E-commerce Design

E-commerce design is a practical example of how design is a part of encouraging excessive, unnecessary consumption today. In this article, we looked at what we can do on this practical level to help our users shop more mindfully. However, transforming online shopping experiences is only a part of a bigger mission: moving away from a culture where excessive consumption is the aspiration for customers and the ultimate goal of companies.

As Cliff Kuang says in his article,

“The designers of the coming era need to think of themselves as inventing a new way of living that doesn’t privilege consumption as the only expression of cultural value. At the very least, we need to start framing consumption differently.”

— Cliff Kuang

Or, as Manuel Lima puts in his book, The New Designer,

“We need the design to refocus its attention where it is needed — not in creating things that harm the environment for hundreds of years or in selling things we don’t need in a continuous push down the sales funnel but, instead, in helping people and the planet solve real problems. [...] Design’s ultimate project is to reimagine how we produce, deliver, consume products, physical or digital, to rethink the existing business models.”

— Manuel Lima

So buckle up, designers, we have work to do!

To Sum It Up

Today, design is part of the problem of encouraging and facilitating excessive consumption through persuasive e-commerce design and through designing for companies with linear and exploitative business models. For a liveable future, we need to change this. On a tactical level, we need to start advocating and designing mindful shopping experiences, and on a strategic level, we need to use our knowledge and skills to elevate sustainable businesses.

I’m not saying that it is going to be an easy or quick transition, but the best time to start is now. In a dire state of need for sustainable transformation, designers with power and agency can’t stay silent or continue proliferating the problem.

“As designers, we need to see ourselves as gatekeepers of what we are bringing into the world and what we choose not to bring into the world. Design is a craft with responsibility. The responsibility to help create a better world for all.”

— Mike Monteiro

How To Improve Your Microcopy: UX Writing Tips For Non-UX Writers

Throughout my UX writing career, I’ve held many different roles: a UX writer in a team of UX writers, a solo UX writer replacing someone who left, the first and only UX writer at a company, and even a teacher at a UX writing course, where I reviewed more than 100 home assignments. And oh gosh, what I’ve seen.

Crafting microcopy is not everyone’s strong suit, and it doesn’t have to be. Still, if you’re a UX designer, product manager, analyst, or marketing content writer working in a small company, on an MVP, or on a new product, you might have to get by without a UX writer. So you have the extra workload of creating microcopy. Here are some basic rules that will help you create clear and concise copy and run a quick health check on your designs.

Ensure Your Interface Copy Is Role-playable
Why it’s important:
  • To create a friendly, conversational experience;
  • To work out a consistent interaction pattern.

When crafting microcopy, think of the interface as a dialog between your product and your user, where:

  • Titles, body text, tooltips, and so on are your “phrases.”
  • Button labels, input fields, toggles, menu items, and other elements that can be tapped or selected are the user’s “phrases.”

Ideally, you should be able to role-play your interface copy: a product asks the user to do something — the user does it; a product asks for information — the user types it in or selects an item from the menu; a product informs or warns the user about something — the user takes action.

For example, if your screen is devoted to an event and the CTA is for the user to register, you should opt for a button label like “Save my spot” rather than “Save your spot.” This way, when a user clicks the button, it’s as if they are pronouncing the phrase themselves, which resonates with their thoughts and intentions.

Be Especially Transparent And Clear When It Comes To Sensitive Topics
Why it’s important: To build trust and loyalty towards your product.

Some topics, such as personal data, health, or money, are extremely sensitive for people. If your product involves any limitations, peculiarities, or possible negative outcomes related to these sensitive topics, you should convey this information clearly and unequivocally. You will also need to collaborate with your UX/UI Designer closely to ensure you deliver this information in a timely manner and always make it visible without requiring the user to take additional actions (e.g., don’t hide it in tooltips that are only shown by tapping).

Here’s a case from my work experience. For quite some time, I’ve been checking homework assignments for a UX writing course. In this course, all the tasks have revolved around an imaginary app for dog owners. One of the tasks students worked on was creating a flow for booking a consultation with a dog trainer. The consultation had to be paid in advance. In fact, the money was blocked on the user’s bank card and charged three hours before the consultation. That way, a user could cancel the meeting for free no later than three hours before the start time. A majority of the students added this information as a tooltip on the checkout screen; if a user didn’t tap on it, they wouldn’t be warned about the possibility of losing money.

In a real-life situation, this would cause immense negativity from users: they may post about it on social media, and it will show the company in a bad light. Even if you occasionally resort to dark patterns, make sure you can afford any reputational risks.

So, when creating microcopy on sensitive topics:

  • Be transparent and honest about all the processes and conditions. For example, you’re a fintech service working with other service providers. Because of that, you have fees built into transactions but don’t know the exact amount. Explain to users how the fees are calculated, their approximate range (if possible), and where users can see more precise info.
  • Reassure users that you’ll be extremely careful with their data. Explain why you need their data, how you will use it, store and protect it from breaches, and so on.
  • If some restrictions or limitations are implied, provide options to remove them (if possible).

Ensure That The Button Label Accurately Reflects What Happens Next
Why it’s important:
  • To make your interface predictable, trustworthy, and reliable;
  • To prevent user frustration.

The button label should reflect the specific action that occurs when the user clicks or taps it.

It might seem valid to use a button label that reflects the user’s goal or target action, even if it actually happens a bit later. For example, if your product allows users to book accommodations for vacations or business trips, you might consider using a “Book now” button in the booking flow. However, if tapping it leads the user to an order screen where they need to select a room, fill out personal details, and so on, the accommodation is not booked immediately. So you might want to opt for “Show rooms,” “Select a rate,” or another button label that better reflects what happens next.

Moreover, labels like “Buy now” or “Book now” might seem too pushy and even off-putting (especially when it comes to pricey products involving a long decision-making process), causing users to abandon your website or app in favor of ones with buttons that create the impression they can browse peacefully for as long as they need. You might want to let your users “Explore,” “Learn more,” “Book a call,” or “Start a free trial” first.

As a product manager or someone with a marketing background, you might want to create catchy and fancy button labels to boost conversion rates. For instance, when working on an investment app, you might label a button for opening a brokerage account as “Become an investor.” While this might appeal to users’ egos, it can also come across as pretentious and cheap. Additionally, after opening an account, users may still need to do many things to actually become investors, which can be frustrating. Opt for a straightforward “Open an account” button instead.

In this regard, it’s better not to promise users things that we can’t guarantee or that aren’t entirely up to us. For example, in a flow that includes an OTP password, it’s better to opt for the “Send a code” button rather than “Get a code” since we can’t guarantee there won’t be any network outages or other issues preventing the user from receiving an SMS or a push notification.

Finally, avoid using generic “Yes” or “No” buttons as they do not clearly reflect what happens next. Users might misread the text above or fail to notice a negation, leading to unexpected outcomes. For example, when asking for a confirmation, such as “Are you sure you want to quit?” you might want to go with button labels like “Quit” and “Stay” rather than just “Yes” and “No.”

Tip: If you have difficulty coming up with a button label, this may be a sign that the screen is poorly organized or the flow lacks logic and coherence. For example, a user has to deal with too many different entities and types of tasks on one screen, so the action can’t be summarized with just one verb. Or perhaps a subsequent flow has a lot of variations, making it hard to describe the action a user should take. In such cases, you might want to make changes to the screen (say, break it down into several screens) or the flow (say, add a qualifying question or attribute earlier so that the flow would be less branching).

Make It Clear To The User Why They Need To Perform The Action
Why it’s important:
  • To create transparency and build trust;
  • To boost conversion rates.

An ideal interface is self-explanatory and needs no microcopy. However, sometimes, we need to convince users to do something for us, especially when it involves providing personal information or interacting with third-party products.

You can use the following formula: “To [get this], do [this] + UI element to make it happen.” For example, “To get your results, provide your email,” followed by an input field.

It’s better to provide the reasoning (“to get your results”) first and then the instructions (“provide your email”): this way, the guidance is more likely to stick in the user’s memory, smoothly leading to the action. If you reverse the order — giving the instructions first and then the reasoning — the user might forget what they need to do and will have to reread the beginning of the sentence, leading to a less smooth and slightly hectic experience.

Ensure The UI Element Copy Doesn’t Explain How To Interact With This Very Element
Why it’s important:
  • If you need to explain how to interact with a UI element, it may be a sign that the interface is not intuitive;
  • To avoid omitting more important, useful text.

Every now and then, I come across meaningless placeholders or excessive toggle copy that explains how to interact with the field or toggle. The most frequent example is the “Search” placeholder for a search field. Occasionally, I see button labels like “Press to continue.”

Mobile and web interfaces have been around for quite a while, and users understand how to interact with buttons, toggles, and fields. Therefore, explanations such as “click,” “tap,” “enter,” and so on seem excessive in most cases. Perhaps it’s only with a group of checkboxes that you might add something like “Select up to 5.”

You might want to add something more useful. For example, instead of a generic “Search” placeholder for a search field, use specific instances a user might type in. If you’re a fashion marketplace, try placeholders like “oversized hoodies,” “women’s shorts,” and so on. Keep in mind the specifics of your website or app: ensure the placeholder is neither too broad nor too specific, and if a user types something like you’ve provided, their search will be successful.

Stick To The Rule “1 Microcopy Item = 1 Idea”
Why it’s important:
  • Not to create extra cognitive load, confusion, or friction;
  • To ensure a smooth and simple experience.

Users have short attention spans, scan text instead of reading it thoroughly, and can’t process multiple ideas simultaneously. That’s why it’s crucial to break information down into easily digestible chunks instead of, for example, trying to squeeze all the restrictions into one tooltip.

The golden rule is to provide users only with the information they need at this particular stage to take a specific action or make a decision.

You’ll need to collaborate closely with your designer to ensure the information is distributed over the screen evenly and you don’t overload one design element with a lot of text.

Be Careful With Titles Like “Done,” “Almost There,” “Attention,” And So On
Why it’s important:
  • Not to annoy a user;
  • To be more straightforward and economical with users’ time;
  • Not to overuse their attention;
  • Not to provoke anxiety.

Titles, written in bold and larger font sizes, grab users’ attention. Sometimes, titles are the only text users actually read. Titles stick better in their memory, so they must be understandable as a standalone text.

Titles like “One more thing” or “Almost there” might work well if they align with a product’s tone of voice and the flows where they appear are cohesive and can hardly be interrupted. But keep in mind that users might get distracted.

Use this quick check: set your design aside for about 20 minutes, do something else, and then open only the screen for which you’re writing a title. Is what happens on this screen still understandable from the title? Do you easily recall what has or hasn’t happened, what you were doing, and what should be done next?

Don’t Fall Back On Abstract Examples
Why it’s important:
  • To make the interface more precise and useful;
  • To ease the navigation through the product for a user;
  • To reduce cognitive load.

Some products (e.g., any B2B or financial ones) involve many rules and restrictions that must be explained to the user. To make this more understandable, use real-life examples (with specific numbers, dates, and so on) rather than distilling abstract information into a hint, tooltip, or bottom sheet.

It’s better to provide explanations using real-life examples that users can relate to. Check with engineers if it’s possible to get specific data for each user and add variables and conditions to show every user the most relevant microcopy. For example, instead of saying, “Your deposit limit is $1,000 per calendar month,” you could say, “Until Jan 31, you can deposit $400 more.” This relieves the user of unnecessary work, such as figuring out the start date of the calendar month in their case and calculating the remaining amount.

Try To Avoid Negatives
Why it’s important:
  • Not to increase cognitive load;
  • To prevent friction.

As a rule of thumb, it’s recommended to avoid double negatives, such as “Do not unfollow.” However, I’d go further and advise avoiding single negatives as well. The issue is that to decipher such a message, a user has to perform an excessive logical operation: first eliminating the negation, then trying to understand the gist.

For example, when listing requirements for a username, saying “Don’t use special characters, spaces, or symbols” forces a user to speculate (“If this is not allowed, then the opposite is allowed, which must be…”). It can take additional time to figure out what falls under “special characters.” To simplify the task for the user, opt for something like “Use only numbers and letters.”

Moreover, a user can easily overlook the “not” part and misread the message.

Another aspect worth noting is that negation often seems like a restriction or prohibition, which nobody likes. In some cases, especially in finance, all those don’ts might be perceived with suspicion rather than as precaution.

Express Action With Verbs, Not Nouns
Why it’s important:
  • To avoid wordiness;
  • To make text easily digestible.

When describing an action, use a verb, not a noun. Nouns that convey the meaning of verbs make texts harder to read and give off a legalistic vibe.

Here are some sure signs you need to paraphrase your text for brevity and simplicity:

  • Forms of “be” as the main verbs;
  • Noun phrases with “make” (e.g., “make a payment/purchase/deposit”);
  • Nouns ending in -tion, -sion, -ment, -ance, -ency (e.g., cancellation);
  • Phrases with “of” (e.g., provision of services);
  • Phrases with “process” (e.g., withdrawal process).

Make Sure You Use Only One Term For Each Entity
Why it’s important: Not to create extra cognitive load, confusion, and anxiety.

Ensure you use the same term for the same object or action throughout the entire app. For example, instead of using “account” and “profile” interchangeably, choose one and stick to it to avoid confusing your users.

The more complicated and/or regulated your product is, the more vital it is to choose precise wording and ensure it aligns with legal terms, the wording users see in the help center, and communication with support agents.

Less “Oopsies” In Error Messages
Why it’s important:
  • Not to annoy a user;
  • To save space for more important information.

At first glance, “Oops” may seem sweet and informal (yet with an apologetic touch) and might be expected to decrease tension. However, in the case of repetitive or serious errors, the effect will be quite the opposite.

Use “Oops” and similar words only if you’re sure it suits your brand’s tone of voice and you can finesse it.

As a rule of thumb, good error messages explain what has happened or is happening, why (if we know the reason), and what the user should do. Additionally, include any sensitive information related to the process or flow where the error appears. For example, if an error occurs during the payment process, provide users with information concerning their money.

No Excessive Politeness
Why it’s important: Not to waste precious space on less critical information.

I’m not suggesting we remove every single “please” from the microcopy. However, when it comes to interfaces, our priority is to convey meaning clearly and concisely and explain to users what to do next and why. Often, if you start your microcopy with “please,” you won’t have enough space to convey the essence of your message. Users will appreciate clear guidelines to perform the desired action more than a polite message they struggle to follow.

Remove Tech Jargon
Why it’s important:
  • To make the interface understandable for a broad audience;
  • To avoid confusion and ensure a frictionless experience.

As tech specialists, we’re often subject to the curse of knowledge, and despite our efforts to prioritize users, tech jargon can sneak into our interface copy. Especially if our product targets a wider audience, users may not be tech-savvy enough to understand terms like “icon.”

To ensure your interface doesn’t overwhelm users with professional jargon, a quick and effective method is to show the interface to individuals outside your product group. If that’s not feasible, here’s how to identify jargon: it’s the terminology you use in daily meetings among yourselves or in Jira task titles (e.g., authorization, authentication, and so on), or abbreviations (e.g., OTP code, KYC process, AML rules, and so on).

Ensure That Empty State Messages Don’t Leave Users Frustrated
Why it’s important:
  • For onboarding and navigation;
  • To increase discoverability of particular features;
  • To promote or boost the use of the product;
  • To reduce cognitive load and anxiety about the next steps.

Quite often, a good empty state message is a self-destructing one, i.e. one that helps a user to get rid of this emptiness. An empty state message shouldn’t just state “there’s nothing here” — that’s obvious and therefore unnecessary. Instead, it should provide users with a way out, smoothly guiding them into using the product or a specific feature. A well-crafted empty message can even boost conversions.

Of course, there are exceptions, for example, in a reactive interface like a CRM system for a restaurant displaying the status of orders to workers. If there are no orders in progress, the corresponding empty state message can’t nudge or motivate restaurant workers to create new orders themselves.

Place All Important Information At The Beginning
Why it’s important:
  • To keep the user focused;
  • Not to overload a user with info;
  • To avoid information loss due to fading or cropping.

As mentioned earlier, users have short attention spans and often don’t want to focus on the texts they read, especially microcopy. Therefore, ensure you place all necessary information at the beginning of your text. Omit lead-ins, introductory words, and so on. Save less vital details for later in the text.

Ensure Title And Buttons Are Understandable Without Body Text
Why it’s important:
  • For clarity;
  • To overcome the serial position effect;
  • To make sure the interface, the flow, and the next steps are understandable for a user even if they scan the text instead of reading.

There’s a phenomenon called the serial position effect: people tend to remember information better if it’s at the beginning or end of a text or sentence, often overlooking the middle part. When it comes to UX/UI design, this effect is reinforced by the visual hierarchy, which includes the bigger font size of the title and the accentuated buttons. What’s more, the body text is often longer, which puts it at risk of being missed. Since users tend to scan rather than read, ensure your title and buttons make sense even without the body text.

Wrapping up

Trying to find the balance between providing a user with all the necessary explanations, warnings, and reasonings on one hand and keeping the UI intuitive and frictionless on the other hand is a tricky task.

You can facilitate the process of creating microcopy with the help of ChatGPT and AI-based Figma plugins such as Writer or Grammarly. But beware of the limitations these tools have as of now.

For instance, creating a prompt that includes all the necessary details and contexts can take longer than actually writing a title or a label on your own. Grammarly is a nice tool to check the text for typos and mistakes, but when it comes to microcopy, its suggestions might be a bit inaccurate or confusing: you might want to, say, omit articles for brevity or use elliptical sentences, and Grammarly will identify it as a mistake.

You’ll still need a human eye to evaluate the microcopy, and I hope this checklist will come in handy.

Microcopy Checklist

General

✅ Microcopy is role-playable (titles, body text, tooltips, etc., are your “phrases”; button labels, input fields, toggles, menu items, etc. are the user’s “phrases”).

Information presentation & structure

✅ The user has the exact amount of information they need right now to perform an action — not less, not more.
✅ Important information is placed at the beginning of the text.
✅ It’s clear to the user why they need to perform the action.
✅ Everything related to sensitive topics is always visible and static and doesn’t require actions from a user (e.g., not hidden in tooltips).
✅ You provide a user with specific information rather than generic examples.
✅ 1 microcopy item = 1 idea.
✅ 1 entity = 1 term.
✅ Empty state messages provide users with guidelines on what to do (when possible and appropriate).

Style

✅ No tech jargon.
✅ No excessive politeness, esp. at the expense of meaning.
✅ Avoid or reduce the use of “not,” “un-,” and other negatives.
✅ Actions are expressed with verbs, not nouns.

Syntax

✅ UI element copy doesn’t explain how to interact with this very element.
✅ Button label accurately reflects what happens next.
✅ Fewer titles like “done,” “almost there,” and “attention.”
✅ “Oopsies” in error messages are not frequent and align well with the brand’s tone of voice.
✅ Title and buttons are understandable without body text.

Uniting Web And Native Apps With 4 Unknown JavaScript APIs

A couple of years ago, four JavaScript APIs landed at the bottom of awareness in the State of JavaScript survey. I took an interest in those APIs because they have so much potential to be useful but don’t get the credit they deserve. Even after a quick search, I was amazed at how many new web APIs have been added to the web platform without getting their dues, suffering from a lack of both awareness and browser support.

That situation can be a “catch-22”:

An API is interesting but lacks awareness due to incomplete support, and there is no immediate need to support it due to low awareness.

Most of these APIs are designed to power progressive web apps (PWA) and close the gap between web and native apps. Bear in mind that creating a PWA involves more than just adding a manifest file. Sure, it’s a PWA by definition, but it functions like a bookmark on your home screen in practice. In reality, we need several APIs to achieve a fully native app experience on the web. And the four APIs I’d like to shed light on are part of that PWA puzzle that brings to the web what we once thought was only possible in native apps.

You can see all these APIs in action in this demo as we go along.

1. Screen Orientation API

The Screen Orientation API can be used to sniff out the device’s current orientation. Once we know whether a user is browsing in a portrait or landscape orientation, we can use it to enhance the UX for mobile devices by changing the UI accordingly. We can also use it to lock the screen in a certain position, which is useful for displaying videos and other full-screen elements that benefit from a wider viewport.

Using the global screen object, you can access various properties the screen uses to render a page, including the screen.orientation object. It has two properties:

  • type: The current screen orientation. It can be: "portrait-primary", "portrait-secondary", "landscape-primary", or "landscape-secondary".
  • angle: The current screen orientation angle. It can be any number from 0 to 360 degrees, but it’s normally set in multiples of 90 degrees (e.g., 0, 90, 180, or 270).

On mobile devices, if the angle is 0 degrees, the type is most often going to evaluate to "portrait" (vertical), but on desktop devices, it is typically "landscape" (horizontal). This makes the type property precise for knowing a device’s true position.

The screen.orientation object also has two methods:

  • .lock(): This is an async method that takes a type value as an argument to lock the screen.
  • .unlock(): This method unlocks the screen to its default orientation.

And lastly, the screen.orientation object fires a "change" event whenever the orientation changes (the older orientationchange event on window is deprecated).

Browser Support
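Support for reading screen.orientation is fairly broad these days, but locking is patchier, so it’s worth feature-detecting before relying on it. Here’s a minimal, illustrative runtime check (the log messages are just placeholders):

// Quick runtime checks; the messages are only for illustration.
if ("orientation" in screen) {
  console.log(`Screen Orientation API available: ${screen.orientation.type}`);
}

if ("orientation" in screen && "lock" in screen.orientation) {
  // Even where lock() exists, the returned promise may still reject,
  // e.g., on desktop browsers or outside of full-screen mode.
  console.log("screen.orientation.lock() is exposed in this browser.");
}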

Finding And Locking Screen Orientation

Let’s code a short demo using the Screen Orientation API to know the device’s orientation and lock it in its current position.

This can be our HTML boilerplate:

<main>
  <p>
    Orientation Type: <span class="orientation-type"></span>
    <br />
    Orientation Angle: <span class="orientation-angle"></span>
  </p>

  <button type="button" class="lock-button">Lock Screen</button>

  <button type="button" class="unlock-button">Unlock Screen</button>

  <button type="button" class="fullscreen-button">Go Full Screen</button>
</main>

On the JavaScript side, we inject the screen orientation type and angle properties into our HTML.

let currentOrientationType = document.querySelector(".orientation-type");
let currentOrientationAngle = document.querySelector(".orientation-angle");

currentOrientationType.textContent = screen.orientation.type;
currentOrientationAngle.textContent = screen.orientation.angle;

Now, we can see the device’s orientation and angle properties. On my laptop, they are "landscape-primary" and 0, respectively.

If we listen for the screen.orientation object’s change event, we can see how the values are updated each time the screen rotates.

window.addEventListener("orientationchange", () => {
  currentOrientationType.textContent = screen.orientation.type;
  currentOrientationAngle.textContent = screen.orientation.angle;
});

To lock the screen, we first need to be in full-screen mode, so we will use another extremely useful feature: the Fullscreen API. Nobody wants a webpage to pop into full-screen mode without their consent, so we need transient activation (i.e., a user click on a DOM element) for it to work.

The Fullscreen API has two methods:

  1. Document.exitFullscreen() is used from the global document object,
  2. Element.requestFullscreen() makes the specified element and its descendants go full-screen.

We want the entire page to be full-screen, so we invoke the method on the root element, document.documentElement:

const fullscreenButton = document.querySelector(".fullscreen-button");

fullscreenButton.addEventListener("click", async () => {
  // If it is already in full-screen, exit to normal view
  if (document.fullscreenElement) {
    await document.exitFullscreen();
  } else {
    await document.documentElement.requestFullscreen();
  }
});

Next, we can lock the screen in its current orientation:

const lockButton = document.querySelector(".lock-button");

lockButton.addEventListener("click", async () => {
  try {
    await screen.orientation.lock(screen.orientation.type);
  } catch (error) {
    console.error(error);
  }
});

And do the opposite with the unlock button:

const unlockButton = document.querySelector(".unlock-button");

unlockButton.addEventListener("click", () => {
  screen.orientation.unlock();
});

Can’t We Check Orientation With a Media Query?

Yes! We can indeed check page orientation via the orientation media feature in a CSS media query. However, media queries compute the current orientation by checking if the width is “bigger than the height” for landscape or “smaller” for portrait. By contrast,

The Screen Orientation API reports the orientation of the screen rendering the page, regardless of the viewport dimensions, making it resistant to inconsistencies that may crop up with page resizing.

You may have noticed how PWAs like Instagram and X force the screen to be in portrait mode even when the native system orientation is unlocked. It is important to note that this behavior isn’t achieved through the Screen Orientation API, but by setting the orientation property in the manifest.json file to the desired orientation type.
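For reference, here’s a minimal sketch of what that manifest entry could look like; the name and display values are placeholders, and orientation also accepts values such as "landscape", "any", and "natural":

{
  "name": "My PWA",
  "short_name": "PWA",
  "display": "standalone",
  "orientation": "portrait"
}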

2. Device Orientation API

Another API I’d like to poke at is the Device Orientation API. It provides access to a device’s gyroscope sensors to read the device’s orientation in space; something used all the time in mobile apps, mainly games. The API makes this happen with a deviceorientation event that triggers each time the device moves. It has the following properties:

  • event.alpha: Orientation along the Z-axis, ranging from 0 to 360 degrees.
  • event.beta: Orientation along the X-axis, ranging from -180 to 180 degrees.
  • event.gamma: Orientation along the Y-axis, ranging from -90 to 90 degrees.

Browser Support
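One support wrinkle worth knowing: iOS Safari only delivers orientation data after the user explicitly grants permission via DeviceOrientationEvent.requestPermission(), which must be called from a user gesture. A minimal sketch of that pattern (the button selector and handler name below are hypothetical):

const enableButton = document.querySelector(".enable-orientation-button"); // hypothetical element

enableButton.addEventListener("click", async () => {
  // Only iOS Safari exposes requestPermission(); other browsers skip this step.
  if (typeof DeviceOrientationEvent !== "undefined" &&
      typeof DeviceOrientationEvent.requestPermission === "function") {
    const permission = await DeviceOrientationEvent.requestPermission();
    if (permission !== "granted") return;
  }

  window.addEventListener("deviceorientation", handleOrientation); // hypothetical handler
});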

Moving Elements With Your Device

In this case, we will make a 3D cube with CSS that can be rotated with your device! The full instructions I used to make the initial CSS cube are credited to David DeSandro and can be found in his introduction to 3D transforms.

To rotate the cube, we change its CSS transform properties according to the device orientation data:

const currentAlpha = document.querySelector(".currentAlpha");
const currentBeta = document.querySelector(".currentBeta");
const currentGamma = document.querySelector(".currentGamma");

const cube = document.querySelector(".cube");

window.addEventListener("deviceorientation", (event) => {
  currentAlpha.textContent = event.alpha;
  currentBeta.textContent = event.beta;
  currentGamma.textContent = event.gamma;

  cube.style.transform = `rotateX(${event.beta}deg) rotateY(${event.gamma}deg) rotateZ(${event.alpha}deg)`;
});

This is the result:

3. Vibration API

Let’s turn our attention to the Vibration API, which, unsurprisingly, allows access to a device’s vibrating mechanism. This comes in handy when we need to alert users with in-app notifications, like when a process is finished or a message is received. That said, we have to use it sparingly; no one wants their phone blowing up with notifications.

There’s just one method that the Vibration API gives us, and it’s all we need: navigator.vibrate().

vibrate() is available globally from the navigator object and takes an argument for how long a vibration lasts in milliseconds. It can be either a number or an array of numbers representing a pattern of vibrations and pauses.

navigator.vibrate(200); // vibrate 200ms
navigator.vibrate([200, 100, 200]); // vibrate 200ms, wait 100, and vibrate 200ms.

Browser Support

Vibration API Demo

Let’s make a quick demo where the user inputs how many milliseconds they want their device to vibrate and buttons to start and stop the vibration, starting with the markup:

<main>
  <form>
    <label for="milliseconds-input">Milliseconds:</label>
    <input type="number" id="milliseconds-input" value="0" />
  </form>

  <button class="vibrate-button">Vibrate</button>
  <button class="stop-vibrate-button">Stop</button>
</main>

We’ll add an event listener for a click and invoke the vibrate() method:

const vibrateButton = document.querySelector(".vibrate-button");
const millisecondsInput = document.querySelector("#milliseconds-input");

vibrateButton.addEventListener("click", () => {
  navigator.vibrate(millisecondsInput.value);
});

To stop vibrating, we override the current vibration with a zero-millisecond vibration.

const stopVibrateButton = document.querySelector(".stop-vibrate-button");

stopVibrateButton.addEventListener("click", () => {
  navigator.vibrate(0);
});

4. Contact Picker API

It used to be that only native apps could connect to a device’s “contacts”. But now we have the fourth and final API I want to look at: the Contact Picker API.

The API grants web apps access to the device’s contact lists. Specifically, we get the contacts.select() async method available through the navigator object, which takes the following two arguments:

  • properties: This is an array containing the information we want to fetch from a contact card, e.g., "name", "address", "email", "tel", and "icon".
  • options: This is an object that can only contain the multiple boolean property to define whether or not the user can select one or multiple contacts at a time.

Browser Support

I’m afraid that browser support is next to zilch on this one, limited to Chrome Android, Samsung Internet, and Android’s native web browser at the time I’m writing this.
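Given that, a quick runtime check is a sensible first step so we can fall back to, say, a plain form when the API is missing. A minimal sketch:

// The select() method lives on navigator.contacts where the API is implemented.
const supportsContactPicker = "contacts" in navigator && "select" in navigator.contacts;

if (!supportsContactPicker) {
  // Fall back to regular input fields for entering contact details.
  console.warn("Contact Picker API is not supported in this browser.");
}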

Selecting User’s Contacts

We will make another demo to select and display the user’s contacts on the page. Again, starting with the HTML:

<main>
  <button class="get-contacts">Get Contacts</button>
  <p>Contacts:</p>
  <ul class="contact-list">
    <!-- We’ll inject a list of contacts -->
  </ul>
</main>

Then, in JavaScript, we first grab our elements from the DOM and choose which properties we want to pick from the contacts.

const getContactsButton = document.querySelector(".get-contacts");
const contactList = document.querySelector(".contact-list");

const props = ["name", "tel", "icon"];
const options = {multiple: true};

Now, we asynchronously pick the contacts when the user clicks the getContactsButton.


const getContacts = async () => {
  try {
    const contacts = await navigator.contacts.select(props, options);
  } catch (error) {
    console.error(error);
  }
};

getContactsButton.addEventListener("click", getContacts);

Using DOM manipulation, we can then append a list item for each contact to the contactList element.

const appendContacts = (contacts) => {
  contacts.forEach(({name, tel, icon}) => {
    const contactElement = document.createElement("li");

    contactElement.innerText = `${name}: ${tel}`;
    contactList.appendChild(contactElement);
  });
};

const getContacts = async () => {
  try {
    const contacts = await navigator.contacts.select(props, options);
    appendContacts(contacts);
  } catch (error) {
    console.error(error);
  }
};

getContactsButton.addEventListener("click", getContacts);

Appending an image is a little tricky since we will need to convert it into a URL and append it for each item in the list.

const getIcon = (icon) => {
  if (icon.length > 0) {
    const imageUrl = URL.createObjectURL(icon[0]);
    const imageElement = document.createElement("img");
    imageElement.src = imageUrl;

    return imageElement;
  }
};

const appendContacts = (contacts) => {
  contacts.forEach(({name, tel, icon}) => {
    const contactElement = document.createElement("li");

    contactElement.innerText = `${name}: ${tel}`;
    contactList.appendChild(contactElement);

    const imageElement = getIcon(icon);

    // Only append an icon if the contact actually has one
    if (imageElement) {
      contactElement.appendChild(imageElement);
    }
  });
};

const getContacts = async () => {
  try {
    const contacts = await navigator.contacts.select(props, options);
    appendContacts(contacts);
  } catch (error) {
    console.error(error);
  }
};

getContactsButton.addEventListener("click", getContacts);

And here’s the outcome:

Note: The Contact Picker API will only work if the context is secure, i.e., the page is served over https:// or wss:// URLs.

Conclusion

There we go, four web APIs that I believe would empower us to build more useful and robust PWAs but have slipped under the radar for many of us. This is, of course, due to inconsistent browser support, so I hope this article brings some awareness to these APIs and gives them a better chance of showing up in future browser updates.

Aren’t they interesting? We saw how much control we have over the orientation of a device and its screen, as well as the level of access we get to a device’s hardware features (i.e., vibration) and to information from other apps that we can use in our own UI.

But as I said much earlier, there’s a sort of infinite loop where a lack of awareness begets a lack of browser support. So, while the four APIs we covered are super interesting, your mileage will inevitably vary when it comes to using them in a production environment. Please tread cautiously and refer to Caniuse for the latest support information, or check for your own devices using WebAPI Check.

What Are CSS Container Style Queries Good For?

We’ve relied on media queries for a long time in the responsive world of CSS but they have their share of limitations and have shifted focus more towards accessibility than responsiveness alone. This is where CSS Container Queries come in. They completely change how we approach responsiveness, shifting the paradigm away from a viewport-based mentality to one that is more considerate of a component’s context, such as its size or inline-size.

Querying elements by their dimensions is one of the two things that CSS Container Queries can do, and, in fact, we call these container size queries to help distinguish them from their ability to query against a component’s current styles. We call these container style queries.

Existing container query coverage has been largely focused on container size queries, which enjoy 90% global browser support at the time of this writing. Style queries, on the other hand, are only available behind a feature flag in Chrome 111+ and Safari Technology Preview.

The first question that comes to mind is What are these style query things? followed immediately by How do they work?. There are some nice primers on them that others have written, and they are worth checking out.

But the more interesting question about CSS Container Style Queries might actually be: Why should we use them? The answer, as always, is nuanced and could simply be it depends. But I want to poke at style queries a little more deeply, not at the syntax level, but at what exactly they solve and the sorts of use cases where we would find ourselves reaching for them in our work if and when they gain browser support.

Why Container Queries

Talking purely about responsive design, media queries have simply fallen short in some aspects, but I think the main one is that they are context-agnostic in the sense that they only consider the viewport size when applying styles, without involving the size or dimensions of an element’s parent or the content it contains.

This usually isn’t a problem since we only have a main element that doesn’t share space with others along the x-axis, so we can style our content depending on the viewport’s dimensions. However, if we stuff an element into a smaller parent and maintain the same viewport, the media query doesn’t kick in when the content becomes cramped. This forces us to write and manage an entire set of media queries that target super-specific content breakpoints.

Container queries break this limitation and allow us to query much more than the viewport’s dimensions.

How Container Queries Generally Work

Container size queries work similarly to media queries but allow us to apply styles depending on the container’s properties and computed values. In short, they allow us to make style changes based on an element’s computed width or height regardless of the viewport. This sort of thing was once only possible with JavaScript or the ol’ jQuery, as this example shows.

As noted earlier, though, container queries can query an element’s styles in addition to its dimensions. In other words, container style queries can look at and track an element’s properties and apply styles to other elements when those properties meet certain conditions, such as when the element’s background-color is set to hsl(0 50% 50%).

That’s what we mean when talking about CSS Container Style Queries. It’s a feature defined in the same CSS Containment Module Level 3 specification as CSS Container Size Queries, but one with far more limited browser support at the time of writing, so the difference between style and size queries can get a bit confusing as we’re technically talking about two related features under the same umbrella.

We’d do ourselves a favor to backtrack and first understand what a “container” is in the first place.

Containers

An element’s container is any ancestor with a containment context; it could be the element’s direct parent or perhaps a grandparent or great-grandparent.

A containment context means that a certain element can be used as a container for querying. Unofficially, you can say there are two types of containment context: size containment and style containment.

Size containment means we can query and track an element’s dimensions (i.e., aspect-ratio, block-size, height, inline-size, orientation, and width) with container size queries as long as it’s registered as a container. Tracking an element’s dimensions requires a little processing in the client. One or two elements are a breeze, but if we had to constantly track the dimensions of all elements — including resizing, scrolling, animations, and so on — it would be a huge performance hit. That’s why no element has size containment by default, and we have to manually register a size query with the CSS container-type property when we need it.

On the other hand, style containment lets us query and track the computed values of a container’s specific properties through container style queries. As it currently stands, we can only check for custom properties, e.g., --theme: dark, but soon we could check for an element’s computed background-color and display property values. Unlike size containment, we are checking for raw style properties before they are processed by the browser, which eases the performance cost and allows all elements to have style containment by default.

Did you catch that? While size containment is something we manually register on an element, style containment is the default behavior of all elements. There’s no need to register a style container because all elements are style containers by default.

And how do we register a containment context? The easiest way is to use the container-type property. The container-type property will give an element a containment context and its three accepted values — normal, size, and inline-size — define which properties we can query from the container.

/* Size containment in the inline direction */
.parent {
  container-type: inline-size;
}

This example formally establishes a size containment. If we had done nothing at all, the .parent element is already a container with a style containment.

Size Containment

That last example illustrates size containment based on the element’s inline-size, which is a fancy way of saying its width. When we talk about normal document flow on the web, we’re talking about elements that flow in an inline direction and a block direction that corresponds to width and height, respectively, in a horizontal writing mode. If we were to rotate the writing mode so that it is vertical, then “inline” would refer to the height instead and “block” to the width.

Consider the following HTML:

<div class="cards-container">
  <ul class="cards">
    <li class="card"></li>
  </ul>
</div>

We could give the .cards-container element a containment context in the inline direction, allowing us to make changes to its descendants when its width becomes too small to properly display everything in the current layout. We keep the same syntax as in a normal media query but swap @media for @container:

.cards-container {
  container-type: inline-size;
}

@container (width < 700px) {
  .cards {
    background-color: red;
  }
}

Container syntax works almost the same as media queries, so we can use the and, or, and not operators to chain different queries together to match multiple conditions.

@container (width < 700px) or (width > 1200px) {
  .cards {
    background-color: red;
  }
}

Elements in a size query look for the closest ancestor with size containment so we can apply changes to elements deeper in the DOM, like the .card element in our earlier example. If there is no size containment context, then the @container at-rule won’t have any effect.

/* 👎 
 * Apply styles based on the closest container, .cards-container
 */
@container (width < 700px) {
  .card {
    background-color: black;
  }
}

Just looking for the closest container is messy, so it’s good practice to name containers using the container-name property and then specify which container we’re tracking in the container query, right after the @container at-rule.

.cards-container {
  container-name: cardsContainer;
  container-type: inline-size;
}

@container cardsContainer (width < 700px) {
  .card {
    background-color: #000;
  }
}

We can use the shorthand container property to set the container name and type in a single declaration:

.cards-container {
  container: cardsContainer / inline-size;

  /* Equivalent to: */
  container-name: cardsContainer;
  container-type: inline-size;
}

The other container-type we can set is size, which works exactly like inline-size — only the containment context is both the inline and block directions. That means we can also query the container’s height sizing in addition to its width sizing.
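For the height-based query in the example below to match anything, the container also has to be registered with two-dimensional size containment. A minimal sketch, reusing our .cards-container element:

.cards-container {
  /* Establishes size containment for both width and height queries */
  container-type: size;
}

One caveat worth knowing: with full size containment, the container is sized as if it had no contents, so it needs an explicit height (or another block-size constraint) to avoid collapsing, which is why inline-size is the more common choice.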

/* When container is less than 700px wide */
@container (width < 700px) {
  .card {
    background-color: black;
  }
}

/* When container is less than 900px tall */
@container (height < 900px) {
  .card {
    background-color: white;
  }
}

And it’s worth noting here that if two separate (not chained) container rules match, the most specific selector wins, true to how the CSS Cascade works.
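As a hypothetical illustration, if the container were, say, 600px wide, both of the following rules would match, and the second one would win because .cards .card is a more specific selector than .card:

@container (width < 700px) {
  .card {
    background-color: black;
  }
}

@container (width < 900px) {
  /* Wins when both rules match: .cards .card is more specific than .card */
  .cards .card {
    background-color: red;
  }
}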

So far, we’ve touched on the concept of CSS Container Queries at its most basic. We define the type of containment we want on an element (we looked specifically at size containment) and then query that container accordingly.

Container Style Queries

The third value that is accepted by the container-type property is normal, and it sets style containment on an element. Both inline-size and size are stable across all major browsers, but normal is newer and only has modest support at the moment.

I consider normal a bit of an oddball because we don’t have to explicitly declare it on an element; all elements are style containers right out of the box. It’s possible you’ll never write it out yourself or see it in the wild.

.parent {
  /* Unnecessary */
  container-type: normal;
}

If you do write it or see it, it’s likely to undo size containment declared somewhere else. But even then, it’s possible to reset containment with the global initial or revert keywords.

.parent {
  /* All of these (re)set style containment */
  container-type: normal;
  container-type: initial;
  container-type: revert;
}

Let’s look at a simple and somewhat contrived example to get the point across. We can define a custom property in a container, say a --theme.

.cards-container {
  --theme: dark;
}

From here, we can check if the container has that desired property and, if it does, apply styles to its descendant elements. We can’t directly style the container since it could unleash an infinite loop of changing the styles and querying the styles.

.cards-container {
  --theme: dark;
}

@container style(--theme: dark) {
  .cards {
    background-color: black;
  }
}

See that style() function? In the future, we may want to check if an element has a max-width: 400px through a style query instead of checking if the element’s computed value is bigger than 400px in a size query. That’s why we use the style() wrapper to differentiate style queries from size queries.

/* Size query */
@container (width > 60ch) {
  .cards {
    flex-direction: column;
  }
}

/* Style query */
@container style(--theme: dark) {
  .cards {
    background-color: black;
  }
}

Both types of container queries look for the closest ancestor with a corresponding containment-type. In a style() query, it will always be the parent since all elements have style containment by default. In this case, the direct parent of the .cards element in our ongoing example is the .cards-container element. If we want to query non-direct parents, we will need the container-name property to differentiate between containers when making a query.

.cards-container {
  container-name: cardsContainer;
  --theme: dark;
}

@container cardsContainer style(--theme: dark) {
  .card {
    color: white;
  }
}

Weird and Confusing Things About Container Style Queries

Style queries are completely new and bring something never seen in CSS, so they are bound to have some confusing qualities as we wrap our heads around them — some that are completely intentional and well thought-out and some that are perhaps unintentional and may be updated in future versions of the specification.

Style and Size Containment Aren’t Mutually Exclusive

One intentional perk, for example, is that a container can have both size and style containment. No one would fault you for expecting that size and style containment are mutually exclusive concerns, so setting an element to something like container-type: inline-size would make all style queries useless.

However, another funny thing about container queries is that elements have style containment by default, and there isn’t really a way to remove it. Check out this next example:

.cards-container {
  container-type: inline-size;
  --theme: dark;
}

@container style(--theme: dark) {
  .card {
    background-color: black;
  }
}

@container (width < 700px) {
  .card {
    background-color: red;
  }
}

See that? We can still query the elements by style even when we explicitly set the container-type to inline-size. This seems contradictory at first, but it does make sense, considering that style and size queries are computed independently. It’s better this way since both queries don’t necessarily conflict with each other; a style query could change the colors in an element depending on a custom property, while a container query changes an element’s flex-direction when it gets too small for its contents.

But We Can Achieve the Same Thing With CSS Classes and IDs

Most container query guides and tutorials I’ve seen use similar examples to demonstrate the general concept, but I can’t stop thinking that, no matter how cool style queries are, we can achieve the same result using classes or IDs and with less boilerplate. Instead of passing the state as an inline style, we could simply add it as a class.

<ol>
  <li class="item first">
    <img src="..." alt="Roi's avatar" />
    <h2>Roi</h2>
  </li>
  <li class="item second"><!-- etc. --></li>
  <li class="item third"><!-- etc. --></li>
  <li class="item"><!-- etc. --></li>
  <li class="item"><!-- etc. --></li>
</ol>

Alternatively, we could add the position number directly inside an id so we don’t have to convert the number into a string:

<ol>
  <li class="item" id="item-1">
    <img src="..." alt="Roi's avatar" />
    <h2>Roi</h2>
  </li>
  <li class="item" id="item-2"><!-- etc. --></li>
  <li class="item" id="item-3"><!-- etc. --></li>
  <li class="item" id="item-4"><!-- etc. --></li>
  <li class="item" id="item-5"><!-- etc. --></li>
</ol>

Both of these approaches leave us with cleaner HTML than the container queries approach. With style queries, we have to wrap our elements inside a container, even if we don’t semantically need it, because containers (rightly) are unable to style themselves.

We also have less boilerplate-y code on the CSS side:

#item-1 {
  background: linear-gradient(45deg, yellow, orange); 
}

#item-2 {
  background: linear-gradient(45deg, grey, white);
}

#item-3 {
  background: linear-gradient(45deg, brown, peru);
}

See the Pen Style Queries Use Case Replaced with Classes [forked] by Monknow.

As an aside, I know that using IDs as styling hooks is often viewed as a no-no, but that’s only because IDs must be unique in the sense that no two instances of the same ID are on the page at the same time. In this instance, there will never be more than one first-place, second-place, or third-place player on the page, making IDs a safe and appropriate choice in this situation. But, yes, we could also use some other type of selector, say a data-* attribute.

There is something that could add a lot of value to style queries: a range syntax for querying styles. This is an open feature that Miriam Suzanne proposed in 2023, the idea being that it queries numerical values using range comparisons just like size queries.

Imagine if we wanted to apply a light purple background color to the rest of the top ten players in the leaderboard example. Instead of adding a query for each position from four to ten, we could add a query that checks a range of values. The syntax is obviously not in the spec at this time, but let’s say it looks something like this just to push the point across:

/* Do not try this at home! */
@container leaderboard style(4 <= --position <= 10) {
  .item {
    background: linear-gradient(45deg, purple, fuchsia);
  }
}

In this fictional and hypothetical example, we’re:

  • Tracking a container called leaderboard,
  • Making a style() query against the container,
  • Evaluating the --position custom property,
  • Looking for a condition where the custom property is set to a value equal to a number that is greater than or equal to 4 and less than or equal to 10.
  • If the custom property is a value within that range, we set a player’s background color to a linear-gradient() that goes from purple to fuchsia.

This is very cool, but since this kind of behavior is likely to be implemented using components in modern frameworks, like React or Vue, we could also set up the range check in JavaScript and toggle a .top-ten class when the condition is met.

See the Pen Style Ranged Queries Use Case Replaced with Classes [forked] by Monknow.

Sure, it’s great to see that we can do this sort of thing directly in CSS, but it’s also something with an existing well-established solution.

Separating Style Logic From Logic Logic

So far, style queries don’t seem to be the most convenient solution for the leaderboard use case we looked at, but I wouldn’t deem them useless solely because we can achieve the same thing with JavaScript. I am a big advocate of reaching for JavaScript only when necessary and only in sprinkles, but style queries, the ones where we can only check for custom properties, are most likely to be useful when paired with a UI framework where we can easily reach for JavaScript within a component. I have been using Astro an awful lot lately, and in that context, I don’t see why I would choose a style query over programmatically changing a class or ID.

However, a case can be made that implementing style logic inside a component is messy. Maybe we should keep the logic regarding styles in the CSS away from the rest of the logic logic, i.e., the stateful changes inside a component like conditional rendering or functions like useState and useEffect in React. The style logic would be the conditional checks we do to add or remove class names or IDs in order to change styles.

If we backtrack to our leaderboard example, checking a player’s position to apply different styles would be style logic. We could indeed check that a player’s leaderboard position is between four and ten using JavaScript to programmatically add a .top-ten class, but it would mean leaking our style logic into our component. In React (for familiarity, but it would be similar to other frameworks), the component may look like this:

const LeaderboardItem = ({position}) => (
  <li className={`item ${position >= 4 && position <= 10 ? "top-ten" : ""}`} id={`item-${position}`}>
    <img src="..." alt="Roi's avatar" />
    <h2>Roi</h2>
  </li>
);

Besides this being ugly-looking code, adding the style logic in JSX can get messy. Meanwhile, style queries can pass the --position value to the styles and handle the logic directly in the CSS where it is being used.

const LeaderboardItem = ({position}) => (
  <li className="item" style={{"--position": position}}>
    <img src="..." alt="Roi's avatar" />
    <h2>Roi</h2>
  </li>
);

Much cleaner, and I think this is closer to the value proposition of style queries. But at the same time, this example makes a large leap of assumption that we will get a range syntax for style queries at some point, which is not a done deal.

Conclusion

There are lots of teams working on making modern CSS better, and not all features have to be groundbreaking miraculous additions.

Size queries are definitely an upgrade from media queries for responsive design, but style queries appear to be more of a solution looking for a problem.

They simply don’t solve any specific issue, nor are they enough of an improvement to replace other approaches, at least as far as I am aware.

Even if, in the future, style queries will be able to check for any property, that introduces a whole new can of worms where styles are capable of reacting to other styles. This seems exciting at first, but I can’t shake the feeling it would be unnecessary and even chaotic: styles reacting to styles, reacting to styles, and so on with an unnecessary side of boilerplate. I’d argue that a more prudent approach is to write all your styles declaratively together in one place.

Maybe it would be useful for web extensions (like Dark Reader) so they can better check styles in third-party websites? I can’t clearly see it. If you have any suggestions on how CSS Container Style Queries can be used to write better CSS that I may have overlooked, please let me know in the comments! I’d love to know how you’re thinking about them and the sorts of ways you imagine yourself using them in your work.

The Scent Of UX: The Unrealized Potential Of Olfactory Design

Imagine that you could smell this page. The introduction would emit a subtle scent of sage and lavender to set the mood. Each paragraph would fill your room with the coconut oil aroma, helping you concentrate and immerse in reading. The fragrance of the comments section, resembling a busy farmer’s market, would nudge you to share your thoughts and debate with strangers.

How would the presence of smells change your experience reading this text or influence your takeaways?

Scents are everywhere. They fill our spaces, bind our senses to objects and people, alert us to dangers, and arouse us. Smells have so much influence over our mood and behavior that hundreds of companies are busy designing fragrances for retail (enticing visitors to purchase more), hotels (making customers feel at home), and amusement parks (evoking a warm sense of nostalgia).

At the same time, the digital world, where we spend our lives working, studying, shopping, and resting, remains entirely odorless. Our smart devices are not designed to emit or recognize scents, and every corner of the Internet, including this page, smells exactly the same.

We watch movies, play games, study, and order dinner, but our sense of smell is left unengaged. The lack of odors rarely bothers us, but occasionally, we choose analog things like books merely because their digital counterparts fail to connect with us at the same level.

Could the presence of smells improve our digital experiences? What would it take to build the “smelly” Internet, and why hasn't it been done before? Last but not least, what power do scents hold over our senses, memory, and health, and how could we harness it for the digital world?

Let’s dive deep into a fascinating and underexplored realm of odors.

Olfactory Design For The Real World

Why Do We Remember Smells?

In his novel In Search of Lost Time, French writer Marcel Proust describes a sense of déjà vu he experienced after tasting a piece of cake dipped in tea:

“Immediately the old gray house upon the street rose up like a stage set… the house, the town, the square where I was sent before lunch, the streets along which I used to run errands, the country roads we took… the whole of Combray and of its surroundings… sprang into being, town and gardens alike, all from my cup of tea.”

— Marcel Proust

The Proust Effect, the phenomenon of an ‘involuntary memory’ evoked by scents, is a common occurrence. It explains how the presence of a familiar smell activates areas in our brain responsible for odor recognition, causing us to experience a strong, warm, positive sense of nostalgia.

Smells have a potent and almost magical impact on our ability to remember and recognize objects and events. “The nose makes the eyes remember”, as the renowned Finnish architect Juhani Pallasmaa puts it: a single droplet of a familiar fragrance is often enough to bring up a wild cocktail of emotions and recollections, even those that have long been forgotten.

A memory of a place, a person, or an experience is often a memory of their smell that lingers long after the odor is gone. J. Douglas Porteous, Professor of Geography at the University of Victoria, coined the term Smellscape to describe how a collective of smells in each particular area form our perception, define our attitude, and craft our recollection of it.

To put it simply, we choose to avoid beautiful places and forget delicious meals when their odors are not to our liking. Pleasant aromas, on the other hand, alter our memory, make us overlook flaws and defects, or even fall in love.

With such an immense power that scents hold over our perception of reality, it comes as no surprise they have long become a tool in the hands of brand and service designers.

Scented Advertising

What do a luxury car brand, a cosmetics store, and a carnival ride have in common? The answer is that they all have their own distinct scents.

Carefully crafted fragrances are widely used to create brand identities, make powerful impressions, and differentiate brands “emotionally and memorably”.

Some choose to complement visual identities with subtle, tailored aromas. 12.29, a creative “olfactive branding company,” developed the “scent identity” for Cadillac, a “symbol of self-expression representing the irrepressible pursuit of life.”

The branded Cadillac scent is diffused in dealerships and auto shows around the world, evoking a sense of luxury and class. Customers are expected to remember Cadillac better for its “signature nutty coffee, dark leather, and resinous amber notes”, forging a strong emotional connection with the brand.

Next time they think of Cadillac, their brain will recall its signature fragrance and the way it made them feel. Cadillac is ready to bet they will not even consider other brands afterwards.

Others may be less subtle and employ more aggressive, fragrant marketing tactics. LUSH, a British cosmetics retailer, is known for its distinct smells. Although even the company co-founder admits that odors can be overwhelming for some, LUSH’s scents play an important role in crafting the brand’s identity.

Indeed, the aroma of their stores is so recognizable that it lures customers in from afar with ease, and few walk away without forever remembering the brand’s distinct smell.

However, retail is not the only area that employs discernible smells.

Disney takes a holistic approach to service design, carefully considering every aspect that influences customer satisfaction. Smells have long been a part of the signature “Disney experience”: Main Street smells like pastry and popcorn, Spaceship Earth is filled with the aroma of burning wood, and Soarin’ is accompanied by notes of orange and pine.

Dozens of scent-emitting devices, Smellitzers, are responsible for adding scents to each experience. Deployed around each park and perfectly synced with every other sensory stimulus, they “shoot scents toward passersby” and “trigger memories of childhood nostalgia.”

As shown in the patent, Smellitzer is a rather simple odor delivery system designed to “enhance the sense of flight created in the minds of the passengers.” Scents are carefully curated and manufactured to evoke precise emotions without disrupting the ride experience.

Disney’s attractions, lanes, and theaters are packed with smell-emitting gadgets that distribute sweet and savoury notes. The visitors barely notice the presence of added scents, but later inevitably experience a sudden but persistent urge to return to the park.

Could it be something in the air, perhaps?

Well-curated, timely delivered, recognizable scents can be a powerful ally in the hands of a designer.

They can soothe a passenger during a long flight with the subtle notes of chamomile and mint or seduce a hungry shopper with the familiar aroma of freshly baked cinnamon buns. Scents can create and evoke great memories, amplify positive emotions, or turn casual buyers into eager and loyal consumers.

Unfortunately, smells can also ruin otherwise decent experiences.

Scented Entertainment

Why Fragrant Cinema Failed

In 1929, Aldous Huxley, author of the dystopian novel Brave New World, published an essay “Silence is Golden”, reflecting on his first experience watching a sound film. Huxley despised cinema, calling it the “most frightful creation-saving device for the production of standardized amusement”, and the addition of sound made the writer concerned for the future of entertainment. Films engaged multiple senses but demanded no intellectual involvement, becoming more accessible, more immersive, and, as Huxley feared, more influential.

“Brave New World,” published in 1932, features the cinema of the future — a multisensory entertainment complex designed to distract society from seeking a deeper sense of purpose in life. Attendees enjoy a “scent organ” playing “a delightfully refreshing Herbal Capriccio — rippling arpeggios of thyme and lavender, of rosemary, basil, myrtle, tarragon,” and get to experience every physical stimulation imaginable.

Huxley’s critical take on the state of the entertainment industry was spot-on. Obsessed with the idea of multisensory entertainment, studios did not take long to begin investing in immersive experiences. The 1950s were the age of experiments designed to attract more viewers: colored cinema, 3D films, and, of course, scented movies.

In 1960, two films hit the American theaters: Scent of Mystery, accompanied by the odor-delivery technology called “Smell–O–Vision”, and Behind the Great Wall, employing the process named AromaRama. Smell–O–Vision was designed to transport scents through tubes to each seat, much like Disney’s Smellitzers, whereas AromaRama distributed smells through the theater’s ventilation.

Both scented movies were panned by critics and viewers alike. In his review for the New York Times, Bosley Crowther wrote that “...synthetic smells [...] occasionally befit what one is viewing, but more often they confuse the atmosphere”. Audiences complained about smells being either too subtle or too overpowering and the machines disrupting the viewing experience.

The groundbreaking technologies were soon forgotten, and all plans to release more scented films were scrapped.

Why did odors, so efficient at manufacturing nostalgic memories of an amusement park, fail to entertain the audience at the movies? On the one hand, it may be attributed to the technological limitations of the time. For instance, AromaRama diffused the smells into the ventilation, which significantly delayed the delivery and required scents to be removed between scenes. Suffice it to say the viewers did not enjoy the experience.

However, there could be other possible explanations.

First of all, digital entertainment is traditionally odorless. Viewers do not anticipate movies to be accompanied by smells, and their brains are conditioned to ignore them. Researchers call it “inattentional anosmia”: people connect their enjoyment with what they see on the screen, not what they smell or taste.

Moreover, background odors tend to fade and become less pronounced with time. A short exposure to a pleasant odor may be complementary. For instance, viewers could smell orange as the character in “Behind the Great Wall” cut and squeezed the fruit: an “impressive” moment, as admitted by critics. However, left to linger, even the most pleasant scents can leave the viewer uninvolved or irritated.

Finally, cinema does not require active sensory involvement. Viewers sit still in silence, rarely even moving their heads, while their sight and hearing are busy consuming and interpreting the information. Immersion requires suspension of disbelief: well-crafted films force the viewer to forget the reality around them, but the addition of scents may disrupt this state, especially if scents are not relevant or well-crafted.

For the scented movie to engage the audience, smells must be integrated into the film’s events and play an important role in the viewing experience. Their delivery must be impeccable: discreet, smooth, and perfectly timed. In time, perhaps, we may see the revival of scented cinema. Until then, rare auteur experiments and 4D–cinema booths at carnivals will remain the only places where fragrant films will live on.

Fortunately, the lessons from the early experiments helped others pave the way for the future of fragrant entertainment.

Immersive Gaming

Unlike movies, video games require active participation. Players are involved in crafting the narrative of the game and, as such, may expect (and appreciate) a higher degree of realism. Virtual Reality is a good example of technology designed for full sensory stimulation.

Modern headsets are impressive, but several companies are already working hard on the next-gen tech for immersive gaming. Meta and Manus are developing gloves that make virtual elements tangible. Teslasuit built a full-body suit that captures motion and biometry, provides haptic feedback, and emulates sensations for objects in virtual reality. We may be just a few steps away from virtual multi-sensory entertainment being as widespread as mobile phones.

Scents are coming to VR, too, albeit at a slower pace, with a few companies already selling devices for fragrant entertainment. For instance, GameScent has developed a cube that can distribute up to 8 smells, from “gunfire” and “explosion” to “forest” and “storm”, using AI to sync the odors with the events in the game.

The vast majority of experiments, however, occur in the labs, where researchers attempt to understand how smells impact gamers and test various concepts. Some assign smells to locations in a VR game and distribute them to players; others have the participants use a hand-held device to “smell” objects in the game.

The majority of studies demonstrate promising results. The addition of fragrances creates a deeper sense of immersion and enhances realism in virtual reality and in a traditional gaming setting.

A notable example of the latter is “Tainted”, an immersive game based on South-East Asian folklore, developed by researchers in 2017. The objective of the game is to discover and burn banana trees, where the main antagonist of the story — a mythical vengeful spirit named Pontianak — is traditionally believed to hide.

The way “Tainted” incorporates smells into the gameplay is quite unique. A scent-emitting module, placed in front of the player, diffuses fragrances to complement the narrative. For instance, the smell of banana signals the ghost’s presence, whereas pineapple aroma means that a flammable object required to complete the quest is nearby. Odors inform the player of dangers, give directions, and become an integral part of the gaming experience, like visuals and sound.

Some of the most creative examples of scented learning come from places that combine education and entertainment, most notably, museums.

Jorvik Viking Centre is famous for its use of “smells of Viking-age York” to capture the unique atmosphere of the past. Its scented halls, holograms, and entertainment programs turn a former archeological site into a carnival ride that teleports visitors into the 10th century and immerses them in the daily life of the Vikings.

Authentic smells are the center’s distinct feature, an integral part of its branding and marketing, and an important addition to its collection. Smells are responsible for making Jorvik exhibitions so memorable, and hopefully, for visitors walking away with a few Viking trivia facts firmly stuck in their heads.

At the same time, learning is becoming increasingly more digital, from mobile apps for foreign languages to student portals and online universities. Smart devices strive to replace classrooms with their analog textbooks, papers, gel pens, and teachers. Virtual Reality is a step towards the future of immersive digital education, and odors may play a more significant role in making it even more efficient.

Education will undoubtedly continue leveraging the achievements of the digital revolution to complement its existing tools. Tablets and Kindles are on their way to replace textbooks and pens. Phones are no longer deemed a harmful distraction that causes brain cancer.

Odors, in turn, are becoming “learning supplements”. Teachers and parents have access to personalized diffusers that distribute the smell of peppermint to enhance students’ attention. Large scent-emitting devices for educational facilities are available on the market, too.

At the same time, while trying to figure out how to upload knowledge straight into our brains, we’ve discovered a way to learn things in our sleep using smells. Several studies have shown that exposure to scents during sleep significantly improves cognitive abilities and memory. More than that, smells can activate our memory while we sleep and solidify what we have learnt while awake.

Odors may not replace textbooks and lectures, but their addition will make remembering and recalling things significantly easier. In fact, researchers from MIT built and tested a wearable scent-emitting device that can be used for targeted memory reactivation.

In time, we will undoubtedly see more smart devices that make use of scents for memory enhancement, training, and entertainment. Integrated into the ecosystems of gadgets, olfactory wearables and smart home appliances will improve our well-being, increase productivity, and even detect early symptoms of illnesses.

There is, however, a caveat.

The Challenging UX Of Scents

We know very little about smells.

Until 2004, when Richard Axel and Linda Buck received a Nobel Prize for identifying the genes that control odor receptors, we didn’t even know how our bodies processed smells or that different areas in our brains were activated by different odors.

We know that our experience with smells is deep and intimate, from the memories they create to the emotions they evoke. We are aware that unpleasant scents linger longer and have a stronger impact on our mental state and memory. Finally, we understand that intensity, context, and delivery matter as much as the scent itself and that a decent aroma diffused out of place ruins the experience.

Thus, if we wish to build devices that make the best use of scents, we need to follow a few simple principles.

Design Principle #1: Tailor The Scents To Each User

In his article about Smellscapes, J. Douglas Porteous writes:

“The smell of a certain institutional soap may carry a person back to the purgatory of boarding school. A particular floral fragrance reminds one of a lost love. A gust of odour from an ethnic spice emporium may waft one back, in memory, to Calcutta.”

— J. Douglas Porteous

Smells revive hidden memories and evoke strong emotions, but their connection to our minds is deeply personal. A rich, spicy aroma of freshly roasted coffee beans will not have the same impact on different people, and in order to use scents in learning, we need to tailor the experience to each user.

In order to maximize the potential of odors in immersion and learning, we need to understand which smells have the most impact on the user. By filtering out the smells that the user finds unpleasant or associates with sad events in their past, we can reduce any potential negative effect on their wellness or memory.

Design Principle #2: Stick To The Simpler Smells

Humans are notoriously bad at describing odors.

Very few languages in the world feature specific terms for smells. For instance, the speakers of Jahai, a language in Malaysia, enjoy the privilege of having specific names for scents like “bloody smell that attracts tigers” and “wild mango, wild ginger roots, bat caves, and petrol”.

English, on the other hand, often uses adjectives associated with flavor (“smoky vanilla”) or comparison (“smells like orange”) to describe scents. For centuries, we have been trying to work out a system that could help cluster odors.

Aristotle classified all odors into six groups: sweet, acid, severe, fatty, sour, and fetid (unpleasant). Carl Linnaeus expanded it to 7 types: aromatic, fragrant, alliaceous (garlic), ambrosial (musky), hircinous (goaty), repulsive, and nauseous. Hans Henning arranged all scent groups in a prism. None of the existing classifications, however, help accurately describe complex smells, which inevitably makes it harder to recreate them.

Academics have developed several comprehensive lists, for instance, the Odor Character Profiling that contains 146 unique descriptors. Pleasant smells from the list are easier to reproduce than unique and sophisticated odors.

Although an aroma of the “warm touch of an early summer sun” may work better for a particular user than the smell of an apple pie, the high cost of getting a complex scent wrong makes sticking to simpler smells a reasonable trade-off.

Design Principle #3: Ensure Stable And Convenient Delivery

Nothing can ruin a good olfactory experience more than an imperfect delivery system.

Disney’s Smellitzers and Jorvik’s scented exhibition set the standard for discreet, contextual, and consistent inclusion of smells to complement the experience. Their diffusers are well-concealed, and odors do not come off as overwhelming or out of place.

On the other hand, the failure of scented movies from the 1950s can at least partially be attributed to poorly designed aroma delivery systems. Critics remembered that even the purifying treatment that was used to clear the theater air between scenes left a “sticky, sweet” and “upsetting” smell.

Good delivery systems are often simple and focus on augmenting the experience without disrupting it. For instance, eScent, a scent-enhanced FFP3 mask, is engineered to reduce stress and improve the well-being of frontline workers. The mask features a slot for applicators infused with essential oil; users can choose fragrances and swap the applicator whenever they want. Besides that, eScent is no different from its “analog” predecessor: it does not require special equipment or preparation, and the addition of smells does not alter the experience of wearing a mask.

In The Not Too Distant Future

We may know little about smells, but we are steadily getting closer to harnessing their power.

In 2022, Alex Wiltschko, a former Google staff research scientist, founded Osmo, a company dedicated to “giving computers a sense of smell.” In the long run, Osmo aspires to use its knowledge to manufacture scents on demand from sustainable synthetic materials.

Today, the company operates as a research lab, using a trained AI to predict the smell of a substance by analyzing its molecular structure. Osmo’s first tests demonstrated some promising results, with the machine accurately describing the scents in 53% of cases.

Should Osmo succeed at building a machine capable of recognizing and predicting smells, it will change the digital world forever. How will we interact with our smart devices? How will we use their newly discovered sense of smell to exchange information, share precious memories with each other, or relive moments from the past? Is now the right time for us to come up with ideas, products, and services for the future?

Odors are a booming industry that offers designers and engineers a unique opportunity to explore new and brave concepts. With the help of smells, we can transform entire industries, from education to healthcare, crafting immersive multi-sensory experiences for learning and leisure.

Smells are a powerful tool that requires precision and perfection to reach the desired effect. Our past shortcomings may have tainted the reputation of scented experiences, but recent progress demonstrates that we have learnt our lessons well. Modern technologies make it even easier to continue the explorations and develop new ways to use smells in entertainment, learning, and wellness — in the real world and beyond.

Our digital spaces may be devoid of scents, but they will not remain odorless for long.

How To Hack Your Google Lighthouse Scores In 2024

This article is sponsored by Sentry.io

Google Lighthouse has been one of the most effective ways to gamify and promote web page performance among developers. Using Lighthouse, we can assess web pages based on overall performance, accessibility, SEO, and what Google considers “best practices”, all with the click of a button.

We might use these tests to evaluate out-of-the-box performance for front-end frameworks or to celebrate performance improvements gained by some diligent refactoring. And you know you love sharing screenshots of your perfect Lighthouse scores on social media. It’s a well-deserved badge of honor worthy of a confetti celebration.

Just the fact that Lighthouse gets developers like us talking about performance is a win. But, whilst I don’t want to be a party pooper, the truth is that web performance is far more nuanced than this. In this article, we’ll examine how Google Lighthouse calculates its performance scores, and, using this information, we will attempt to “hack” those scores in our favor, all in the name of fun and science — because in the end, Lighthouse is simply a good, but rough guide for debugging performance. We’ll have some fun with it and see to what extent we can “trick” Lighthouse into handing out better scores than we may deserve.

But first, let’s talk about data.

Field Data Is Important

Local performance testing is a great way to understand if your website performance is trending in the right direction, but it won’t paint a full picture of reality. The World Wide Web is the Wild West, and collectively, we’ve almost certainly lost track of the variety of device types, internet connection speeds, screen sizes, browsers, and browser versions that people are using to access websites — all of which can have an impact on page performance and user experience.

Field data — and lots of it — collected by an application performance monitoring tool like Sentry from real people using your website on their devices will give you a far more accurate report of your website performance than your lab data collected from a small sample size using a high-spec super-powered dev machine under a set of controlled conditions. Philip Walton reported in 2021 that “almost half of all pages that scored 100 on Lighthouse didn’t meet the recommended Core Web Vitals thresholds” based on data from the HTTP Archive.

Web performance is more than a single core web vital metric or Lighthouse performance score. What we’re talking about goes way beyond the type of raw data we’re working with.

Web Performance Is More Than Numbers

Speed is often the first thing that comes up when talking about web performance — just how long does a page take to load? This isn’t the worst thing to measure, but we must bear in mind that speed is probably influenced heavily by business KPIs and sales targets. Google released a report in 2018 suggesting that the probability of bounces increases by 32% if the page load time reaches higher than three seconds, and soars to 123% if the page load time reaches 10 seconds. So, we must conclude that converting more sales requires reducing bounce rates. And to reduce bounce rates, we must make our pages load faster.

But what does “load faster” even mean? At some point, we’re physically incapable of making a web page load any faster. Humans — and the servers that connect them — are spread around the globe, and modern internet infrastructure can only deliver so many bytes at a time.

The bottom line is that page load is not a single moment in time. In an article titled “What is speed?” Google explains that a page load event is:

[…] “an experience that no single metric can fully capture. There are multiple moments during the load experience that can affect whether a user perceives it as ‘fast’, and if you just focus solely on one, you might miss bad experiences that happen during the rest of the time.”

The key word here is experience. Real web performance is less about numbers and speed than it is about how we experience page load and page usability as users. And this segues nicely into a discussion of how Google Lighthouse calculates performance scores. (It’s much less about pure speed than you might think.)

How Google Lighthouse Performance Scores Are Calculated

The Google Lighthouse performance score is calculated using a weighted combination of scores based on core web vital metrics (i.e., First Contentful Paint (FCP), Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS)) and other speed-related metrics (i.e., Speed Index (SI) and Total Blocking Time (TBT)) that are observable throughout the page load timeline.

This is how the metrics are weighted in the overall score:

Metric Weighting (%)
Total Blocking Time 30
Cumulative Layout Shift 25
Largest Contentful Paint 25
First Contentful Paint 10
Speed Index 10

The weighting assigned to each score gives us insight into how Google prioritizes the different building blocks of a good user experience:

1. A Web Page Should Respond to User Input

The highest weighted metric is Total Blocking Time (TBT), a metric that looks at the total time after the First Contentful Paint (FCP) to help indicate where the main thread may be blocked long enough to prevent speedy responses to user input. The main thread is considered “blocked” any time there’s a JavaScript task running on the main thread for more than 50ms. Minimizing TBT ensures that a web page responds to physical user input (e.g., key presses, mouse clicks, and so on).
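
To make the arithmetic concrete, here is a minimal sketch of how blocking time accumulates, assuming we already know the durations of the main-thread tasks that ran after the FCP (the task list itself is made up for illustration):

#python
# Only the portion of each main-thread task beyond 50ms counts as "blocking".
def total_blocking_time(task_durations_ms):
  return sum(max(0, duration - 50) for duration in task_durations_ms)

# Three tasks of 120ms, 30ms, and 250ms:
# (120 - 50) + 0 + (250 - 50) = 270ms of blocking time.
print(total_blocking_time([120, 30, 250]))  # 270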

2. A Web Page Should Load Useful Content With No Unexpected Visual Shifts

The next most weighted Lighthouse metrics are Largest Contentful Paint (LCP) and Cumulative Layout Shift (CLS). LCP marks the point in the page load timeline when the page’s main content has likely loaded and is therefore useful.

At the point where the main content has likely loaded, you also want to maintain visual stability to ensure that users can use the page and are not affected by unexpected visual shifts (CLS). A good LCP score is anything less than 2.5 seconds (which is a lot higher than we might have thought, given we are often trying to make our websites as fast as possible).

3. A Web Page Should Load Something

The First Contentful Paint (FCP) metric marks the first point in the page load timeline where the user can see something on the screen, and the Speed Index (SI) measures how quickly content is visually displayed during page load over time until the page is “complete”.

Your page is scored based on the speed indices of real websites using performance data from the HTTP Archive. A good FCP score is less than 1.8 seconds and a good SI score is less than 3.4 seconds. Both of these thresholds are higher than you might expect when thinking about speed.

Usability Is Favored Over Raw Speed

Google Lighthouse’s performance scoring is, without a doubt, less about speed and more about usability. Your SI and FCP could be super quick, but if your LCP takes too long to paint, and if CLS is caused by large images or external content taking some time to load and shifting things visually, then your overall performance score will be lower than if your page was a little slower to render the FCP but didn’t cause any CLS. Ultimately, if the page is unresponsive due to JavaScript blocking the main thread for more than 50ms, your performance score will suffer more than if the page was a little slow to paint the FCP.

To understand more about how the weightings of each metric contribute to the final performance score, you can play about with the sliders on the Lighthouse Scoring Calculator, and here’s a rudimentary table demonstrating the effect of skewed individual metric weightings on the overall performance score, proving that page usability and responsiveness is favored over raw speed.

Description FCP (ms) SI (ms) LCP (ms) TBT (ms) CLS Overall Score
Slow to show something on screen 6000 0 0 0 0 90
Slow to load content over time 0 5000 0 0 0 90
Slow to load the largest part of the page 0 0 6000 0 0 76
Visual shifts occurring during page load 0 0 0 0 0.82 76
Page is unresponsive to user input 0 0 0 2000 0 70

The overall Google Lighthouse performance score is calculated by converting each raw metric value into a score from 0 to 100 according to where it falls on its Lighthouse scoring distribution, which is a log-normal distribution derived from the performance metrics of real website performance data from the HTTP Archive. There are two main takeaways from this mathematically overloaded information:

  1. Your Lighthouse performance score is plotted against real website performance data, not in isolation.
  2. Given that the scoring uses log-normal distribution, the relationship between the individual metric values and the overall score is non-linear, meaning you can make substantial improvements to low-performance scores quite easily, but it becomes more difficult to improve an already high score.

Read more about how metric scores are determined, including a visualization of the log-normal distribution curve on developer.chrome.com.
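
If you want to see those mechanics in code, here is a rough Python sketch of the two-step process: each raw metric value is mapped onto a 0–1 score using a log-normal curve defined by two control points (a median value that scores 0.5 and a p10 value that scores 0.9), and the results are combined using the weightings from the table above. The control points below are approximations for illustration, not the exact numbers Lighthouse ships with.

#python
import math

# Approximate (median, p10) control points per metric -- illustrative only.
CONTROL_POINTS = {
  "FCP": (3000, 1800),   # ms
  "SI":  (5800, 3387),   # ms
  "LCP": (4000, 2500),   # ms
  "TBT": (600, 200),     # ms
  "CLS": (0.25, 0.1),
}

WEIGHTS = {"FCP": 0.10, "SI": 0.10, "LCP": 0.25, "TBT": 0.30, "CLS": 0.25}

def metric_score(value, median, p10):
  """Map a raw metric value to 0..1 on a log-normal curve where the median
  scores 0.5 and the p10 value scores 0.9."""
  if value <= 0:
    return 1.0
  location = math.log(median)
  shape = (location - math.log(p10)) / 1.28155  # 1.28155 ~ z-score at 90%
  standardized = (math.log(value) - location) / (shape * math.sqrt(2))
  return 0.5 * math.erfc(standardized)

def overall_score(metrics):
  """Weighted combination of the individual metric scores, as 0..100."""
  total = sum(WEIGHTS[name] * metric_score(value, *CONTROL_POINTS[name])
              for name, value in metrics.items())
  return round(total * 100)

# Hitting the "good" threshold on every metric lands at roughly 90.
print(overall_score({"FCP": 1800, "SI": 3387, "LCP": 2500, "TBT": 200, "CLS": 0.1}))

Running the sketch shows the non-linearity in action: nudging a poor metric value toward its median moves the overall score far more than squeezing an already-good value even lower.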

Can We “Trick” Google Lighthouse?

I appreciate Google’s focus on usability over pure speed in the web performance conversation. It urges developers to think less about aiming for raw numbers and more about the real experiences we build. That being said, I’ve wondered whether today in 2024, it’s possible to fool Google Lighthouse into believing that a bad page in terms of usability and usefulness is actually a great one.

I put on my lab coat and science goggles to investigate. All tests were conducted:

  • Using the Chromium Lighthouse plugin,
  • In an incognito window in the Arc browser,
  • Using the “navigation” and “mobile” settings (apart from where described differently),
  • By me, in a lab (i.e., no field data).

That all being said, I fully acknowledge that my controlled test environment contradicts my advice at the top of this post, but the experiment is an interesting ride nonetheless. What I hope you’ll take away from this is that Lighthouse scores are only one piece — and a tiny one at that — of a very large and complex web performance puzzle. And, without field data, I’m not sure any of this matters anyway.

How to Hack FCP and LCP Scores

TL;DR: Show the smallest amount of LCP-qualifying content on load to boost the FCP and LCP scores until the Lighthouse test has likely finished.

FCP marks the first point in the page load timeline where the user can see anything at all on the screen, while LCP marks the point in the page load timeline when the main page content (i.e., the largest text or image element) has likely loaded. A fast LCP helps reassure the user that the page is useful. “Likely” and “useful” are the important words to bear in mind here.

What Counts as an LCP Element

The types of elements on a web page considered by Lighthouse for LCP are:

  • <img> elements,
  • <image> elements inside an <svg> element,
  • <video> elements,
  • An element with a background image loaded using the url() function (and not a CSS gradient), and
  • Block-level elements containing text nodes or other inline-level text elements.

The following elements are excluded from LCP consideration due to the likelihood they do not contain useful content:

  • Elements with zero opacity (invisible to the user),
  • Elements that cover the full viewport (likely to be background elements), and
  • Placeholder images or other images with low entropy (i.e., low informational content, such as a solid-colored image).

However, the notion of an image or text element being useful is completely subjective in this case and generally out of the realm of what machine code can reliably determine. For example, I built a page containing nothing but a <h1> element where, after 10 seconds, JavaScript inserts more descriptive text into the DOM and hides the <h1> element.

Lighthouse considers the heading element to be the LCP element in this experiment. At this point, the page load timeline has finished, but the page’s main content has not loaded, even though Lighthouse thinks it is likely to have loaded within those 10 seconds. Lighthouse still awards us with a perfect score of 100 even if the heading is replaced by a single punctuation mark, such as a full stop, which is even less useful.

This test suggests that if you need to load page content via client-side JavaScript, you’ll want to avoid displaying a skeleton loader screen since that requires loading more elements on the page. And since we know the process will take some time — and that we can offload the network request from the main thread to a web worker so it won’t affect the TBT — we can use some arbitrary “splash screen” that contains a minimal viable LCP element (for better FCP scoring). This way, we’re giving Lighthouse the impression that the page is useful to users quicker than it actually is.

All we need to do is include a valid LCP element that contains something that counts as the FCP. While I would never recommend loading your main page content via client-side JavaScript in 2024 (serve static HTML from a CDN instead or build as much of the page as you can on a server), I would definitely not recommend this “hack” for a good user experience, regardless of what the Lighthouse performance score tells you. This approach also won’t earn you any favors with search engines indexing your site, as the robots are unable to discover the main content while it is absent from the DOM.

I also tried this experiment with a variety of random images representing the LCP to make the page even less useful. But given that I used small file sizes — made smaller and converted into “next-gen” image formats using a third-party image API to help with page load speed — it seemed that Lighthouse interpreted the elements as “placeholder images” or images with “low entropy”. As a result, those images were disqualified as LCP elements, which is a good thing and makes the LCP slightly less hackable.

View the demo page and use Chromium DevTools in an incognito window to see the results yourself.

This hack, however, probably won’t hold up in many other use cases. Discord, for example, uses the “splash screen” approach when you hard-refresh the app in the browser, and it receives a sad 29 performance score.

Compared to my DOM-injected demo, the LCP element was calculated as some content behind the splash screen rather than elements contained within the splash screen content itself, given there were one or more large images in the focussed text channel I tested on. One could argue that Lighthouse scores are less important for apps that are behind authentication anyway: they don’t need to be indexed by search engines.

There are likely many other situations where apps serve user-generated content and you might be unable to control the LCP element entirely, particularly regarding images.

For example, if you can control the sizes of all the images on your web pages, you might be able to take advantage of an interesting hack or “optimization” (in very large quotes) to arbitrarily game the system, as was the case of RentPath. In 2021, developers at RentPath managed to improve their Lighthouse performance score by 17 points when increasing the size of image thumbnails on a web page. They convinced Lighthouse to calculate the LCP element as one of the larger thumbnails instead of a Google Map tile on the page, which takes considerably longer to load via JavaScript.

The bottom line is that you can gain higher Lighthouse performance scores if you are aware of your LCP element and in control of it, whether that’s through a hack like RentPath’s or mine or a real-deal improvement. That being said, whilst I’ve described the splash screen approach as a hack in this post, that doesn’t mean this type of experience couldn’t offer a purposeful and joyful experience. Performance and user experience are about understanding what’s happening during page load, and it’s also about intent.

How to Hack CLS Scores

TL;DR: Defer loading content that causes layout shifts until the Lighthouse test has likely finished to make the test think it has enough data. CSS transforms do not negatively impact CLS, except if used in conjunction with new elements added to the DOM.

CLS is measured on a decimal scale; a good score is less than 0.1, and a poor score is greater than 0.25. Lighthouse calculates CLS from the largest burst of unexpected layout shifts that occur during a user’s time on the page based on a combination of the viewport size and the movement of unstable elements in the viewport between two rendered frames. Smaller one-off instances of layout shift may be inconsequential, but a bunch of layout shifts happening one after the other will negatively impact your score.
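
As a rough, simplified sketch of how a single layout-shift entry is scored (impact fraction multiplied by distance fraction), assuming one fully visible element that only moves vertically between two frames:

#python
def layout_shift_score(viewport_w, viewport_h, element_w, element_h, moved_down_px):
  # Impact fraction: share of the viewport touched by the element across
  # both frames (its own area plus the strip it swept while moving).
  impact_fraction = min(1.0, (element_h + moved_down_px) * element_w / (viewport_w * viewport_h))
  # Distance fraction: how far it moved relative to the viewport's larger dimension.
  distance_fraction = moved_down_px / max(viewport_w, viewport_h)
  return impact_fraction * distance_fraction

# A 200x300 element shifting down by 100px in a 360x640 viewport
# scores roughly 0.05 -- a few of these in one burst push CLS past 0.1.
print(round(layout_shift_score(360, 640, 200, 300, 100), 3))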

If you know your page contains annoying layout shifts on load, you can defer them until after the page load event has been completed, thus fooling Lighthouse into thinking there is no CLS. This demo page I created, for example, earns a CLS score of 0.143 even though JavaScript immediately starts adding new text elements to the page, shifting the original content up. By pausing the JavaScript that adds new nodes to the DOM by an arbitrary five seconds with a setTimeout(), Lighthouse doesn’t capture the CLS that takes place.

This other demo page earns a performance score of 100, even though it is arguably less useful and useable than the last page given that the added elements pop in seemingly at random without any user interaction.

Whilst it is possible to defer layout shift events for a page load test, this hack definitely won’t work for field data and user experience over time (which is a more important focal point, as we discussed earlier). If we perform a “time span” test in Lighthouse on the page with deferred layout shifts, Lighthouse will correctly report a non-green CLS score of around 0.186.

If you do want to intentionally create a chaotic experience similar to the demo, you can use CSS animations and transforms to more purposefully pop the content into view on the page. In Google’s guide to CLS, they state that “content that moves gradually and naturally from one position to another can often help the user better understand what’s going on and guide them between state changes” — again, highlighting the importance of user experience in context.

On this next demo page, I’m using CSS transform to scale() the text elements from 0 to 1 and move them around the page. The transforms fail to trigger CLS because the text nodes are already in the DOM when the page loads. That said, I did observe in my testing that if the text nodes are added to the DOM programmatically after the page loads via JavaScript and then animated, Lighthouse will indeed detect CLS and score things accordingly.

You Can’t Hack a Speed Index Score

The Speed Index score is based on the visual progress of the page as it loads. The quicker your content loads nearer the beginning of the page load timeline, the better.

It might be possible to hack things so that the Speed Index believes a page loads more slowly than it actually does. Conversely, there’s no real way to “fake” loading content faster than it does. The only way to make your Speed Index score better is to optimize your web page for loading as much of the page as possible, as soon as possible. Whilst not entirely realistic in the web landscape of 2024 (mainly because it would put designers out of a job), you could go all-in to lower your Speed Index as much as possible by:

  • Delivering static HTML web pages only (no server-side rendering) straight from a CDN,
  • Avoiding images on the page,
  • Minimizing or eliminating CSS, and
  • Preventing JavaScript or any external dependencies from loading.

You Also Can’t (Really) Hack A TBT Score

TBT measures the total time after the FCP where the main thread was blocked by JavaScript tasks for long enough to prevent responses to user input. A good TBT score is anything lower than 200ms.

JavaScript-heavy web applications (such as single-page applications) that perform complex state calculations and DOM manipulation on the client on page load (rather than on the server before sending rendered HTML) are prone to suffering poor TBT scores. In this case, you could probably hack your TBT score by deferring all JavaScript until after the Lighthouse test has finished. That said, you’d need to provide some kind of placeholder content or loading screen to satisfy the FCP and LCP and to inform users that something will happen at some point. Plus, you’d have to go to extra lengths to hack around the front-end framework you’re using. (You don’t want to load a placeholder page that, at some point in the page load timeline, loads a separate React app after an arbitrary amount of time!)

What’s interesting is that while we’re still doing all sorts of fancy things with JavaScript in the client, advances in the modern web ecosystem are helping us all reduce the probability of a less-than-stellar TBT score. Many front-end frameworks, in partnership with modern hosting providers, are capable of rendering pages and processing complex logic on demand without any client-side JavaScript. While eliminating JavaScript on the client is not the goal, we certainly have a lot of options to use a lot less of it, thus minimizing the risk of doing too much computation on the main thread on page load.

Bottom Line: Lighthouse Is Still Just A Rough Guide

Google Lighthouse can’t detect everything that’s wrong with a particular website. Whilst Lighthouse performance scores prioritize page usability in terms of responding to user input, it still can’t detect every terrible usability or accessibility issue in 2024.

In 2019, Manuel Matuzović published an experiment where he intentionally created a terrible page that Lighthouse thought was pretty great. I hypothesized that five years later, Lighthouse might do better; but it doesn’t.

On this final demo page I put together, input events are disabled by CSS and JavaScript, making the page technically unresponsive to user input. After five seconds, JavaScript flips a switch and allows you to click the button. The page still scores 100 for both performance and accessibility.

You really can’t rely on Lighthouse as a substitute for usability testing and common sense.

Some More Silly Hacks

As with everything in life, there’s always a way to game the system. Here are some more tried and tested guaranteed hacks to make sure your Lighthouse performance score artificially knocks everyone else’s out of the park:

  • Only run Lighthouse tests using the fastest and highest-spec hardware.
  • Make sure your internet connection is the fastest it can be; relocate if you need to.
  • Never use field data, only lab data, collected using the aforementioned fastest and highest-spec hardware and super-speed internet connection.
  • Rerun the tests in the lab using different conditions and all the special code hacks I described in this post until you get the result(s) you want to impress your friends, colleagues, and random people on the internet.

Note: The best way to learn about web performance and how to optimize your websites is to do the complete opposite of everything we’ve covered in this article all of the time. And finally, to seriously level up your performance skills, use an application monitoring tool like Sentry. Think of Lighthouse as the canary and Sentry as the real-deal production-data-capturing, lean, mean, web vitals machine.

And finally-finally, here’s the link to the full demo site for educational purposes.

Integrating Image-To-Text And Text-To-Speech Models (Part 2)

In Part 1 of this brief two-part series, we developed an application that turns images into audio descriptions using vision-language and text-to-speech models. We combined an image-to-text model that analyzes and understands images, generating descriptions, with a text-to-speech model to create an audio description, helping people with sight challenges. We also discussed how to choose the right model to fit your needs.

Now, we are taking things a step further. Instead of just providing audio descriptions, we are building an app that can have interactive conversations about images or videos. This is known as Conversational AI — a technology that lets users talk to systems much like chatbots, virtual assistants, or agents.

While the first iteration of the app was great, the output still lacked some details. For example, if you upload an image of a dog, the app might produce a description like “a dog sitting on a rock in front of a pool” but miss additional details such as the dog’s breed, the time of day, or the location.

The aim here is simply to build a more advanced version of the previously built app so that it not only describes images but also provides more in-depth information and engages users in meaningful conversations about them.

We’ll use LLaVA, a model that combines understanding images and conversational capabilities. After building our tool, we’ll explore multimodal models that can handle images, videos, text, audio, and more, all at once to give you even more options and flexibility for your applications.

Visual Instruction Tuning and LLaVA

We are going to look at visual instruction tuning and the multimodal capabilities of LLaVA. We’ll first explore how visual instruction tuning enhances a large language model’s ability to understand and follow instructions that include visual information. After that, we’ll dive into LLaVA, which brings its own set of tools for image and video processing.

Visual Instruction Tuning

Visual instruction tuning is a technique that helps large language models (LLMs) understand and follow instructions based on visual inputs. This approach connects language and vision, enabling AI systems to understand and respond to human instructions that involve both text and images. For example, visual instruction tuning enables a model to describe an image or answer questions about a scene in a photograph. This fine-tuning method makes the model more capable of handling these complex interactions effectively.
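
To make that a little more tangible, here is a purely hypothetical example of what a single visual instruction-tuning sample might look like (the field names and file name are illustrative, not a fixed standard):

#python
# One illustrative training sample: an image paired with an instruction
# and the answer the model should learn to produce.
sample = {
  "image": "dog_on_rock.jpg",  # hypothetical file name
  "conversations": [
    {"from": "human", "value": "<image>\nWhat breed is the dog, and where is it sitting?"},
    {"from": "assistant", "value": "It looks like a golden retriever resting on a rock beside a pool."},
  ],
}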

There’s a new training approach called LLaVAR that has been developed, and you can think of it as a tool for handling tasks related to PDFs, invoices, and text-heavy images. It’s pretty exciting, but we won’t dive into that since it is outside the scope of the app we’re making.

Examples of Visual Instruction Tuning Datasets

To build good models, you need good data — rubbish in, rubbish out. So, here are two datasets that you might want to use to train or evaluate your multimodal models. Of course, you can always add your own datasets to the two I’m going to mention.

Vision-CAIR

  • Instruction datasets: English;
  • Multi-task: Datasets containing multiple tasks;
  • Mixed dataset: Contains both human and machine-generated data.

Vision-CAIR provides a high-quality, well-aligned image-text dataset created using conversations between two bots. This dataset was initially introduced in a paper titled “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” and it provides more detailed image descriptions and can be used with predefined instruction templates for image-instruction-answer fine-tuning.

There are more multimodal datasets out there, but these two should help you get started if you want to fine-tune your model.

Let’s Take a Closer Look At LLaVA

LLaVA (which stands for Large Language and Vision Assistant) is a groundbreaking multimodal model developed by researchers from the University of Wisconsin, Microsoft Research, and Columbia University. The researchers aimed to create a powerful, open-source model that could compete with the best in the field, just like GPT-4, Claude 3, or Gemini, to name a few. For developers like you and me, its open nature is a huge benefit, allowing for easy fine-tuning and integration.

One of LLaVA’s standout features is its ability to understand and respond to complex visual information, even with unfamiliar images and instructions. This is exactly what we need for our tool, as it goes beyond simple image descriptions to engage in meaningful conversations about the content.

Architecture

LLaVA’s strength lies in its smart use of existing models. Instead of starting from scratch, the researchers used two key models:

  • CLIP VIT-L/14
    This is an advanced version of the CLIP (Contrastive Language–Image Pre-training) model developed by OpenAI. CLIP learns visual concepts from natural language descriptions. It can handle any visual classification task by simply being given the names of the visual categories, similar to the “zero-shot” capabilities of GPT-2 and GPT-3 (a quick sketch of this zero-shot behavior appears a little further below).
  • Vicuna
    This is an open-source chatbot trained by fine-tuning LLaMA on 70,000 user-shared conversations collected from ShareGPT. Training Vicuna-13B costs around $300, and it performs exceptionally well, even when compared to other models like Alpaca.

These components make LLaVA highly effective by combining state-of-the-art visual and language understanding capabilities into a single powerful model, perfectly suited for applications requiring both visual and conversational AI.
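
To get a feel for the zero-shot behavior mentioned in the CLIP point above, here is a small sketch using the transformers pipeline; the checkpoint corresponds to CLIP ViT-L/14, while the image path and candidate labels are placeholders:

#python
from transformers import pipeline

# Zero-shot image classification with CLIP: no task-specific training,
# just a list of candidate labels to score against the image.
clip_classifier = pipeline(
  "zero-shot-image-classification",
  model="openai/clip-vit-large-patch14"
)

predictions = clip_classifier(
  "dog.jpg",  # placeholder image path
  candidate_labels=["a photo of a dog", "a photo of a cat", "a photo of a car"]
)
print(predictions)  # a list of label/score entries sorted by confidence

Swapping the label list is all it takes to point the same model at a completely different classification task, which is exactly the flexibility LLaVA builds on.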

Training

LLaVA’s training process involves two important stages, which together enhance its ability to understand user instructions, interpret visual and language content, and provide accurate responses. Let’s detail what happens in these two stages:

  1. Pre-training for Feature Alignment
    LLaVA ensures that its visual and language features are aligned. The goal here is to update the projection matrix, which acts as a bridge between the CLIP visual encoder and the Vicuna language model. This is done using a subset of the CC3M dataset, allowing the model to map input images and text to the same space. This step ensures that the language model can effectively understand the context from both visual and textual inputs.
  2. End-to-End Fine-Tuning
    The entire model undergoes fine-tuning. While the visual encoder’s weights remain fixed, the projection layer and the language model are adjusted (a short sketch of this freezing pattern follows the next list).

The second stage is tailored to specific application scenarios:

  • Instructions-Based Fine-Tuning
    For general applications, the model is fine-tuned on a dataset designed for following instructions that involve both visual and textual inputs, making the model versatile for everyday tasks.
  • Scientific reasoning
    For more specialized applications, particularly in science, the model is fine-tuned on data that requires complex reasoning, helping the model excel at answering detailed scientific questions.
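
As a minimal illustration of the freezing pattern described in the second training stage, here is a PyTorch-style sketch; the module names are placeholders rather than LLaVA’s actual implementation:

#python
import torch.nn as nn

def prepare_for_finetuning(vision_encoder: nn.Module,
                           projection: nn.Module,
                           language_model: nn.Module):
  """Freeze the visual encoder; keep the projection layer and LLM trainable."""
  for param in vision_encoder.parameters():
    param.requires_grad = False  # the CLIP-style encoder stays fixed

  # Only these parameters are handed to the optimizer during fine-tuning.
  return list(projection.parameters()) + list(language_model.parameters())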

Now that we know what LLaVA is and the role it plays in our application, let’s turn our attention to the next component we need for our work: Whisper.

Using Whisper For Speech-To-Text

In this chapter, we’ll check out Whisper, a great model for turning speech into text. Whisper is accurate and easy to use, making it perfect for transcribing spoken questions in our app. We’ve used Whisper in a different article, but here, we’re going to use a new version — large-v3. This updated version of the model offers even better performance and speed.

Whisper large-v3

Whisper was developed by OpenAI, the same folks behind ChatGPT. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. The original Whisper was trained on 680,000 hours of labeled data.

Now, what’s different with Whisper large-v3 compared to other models? In my experience, it comes down to the following:

  • Better inputs
    Whisper large-v3 uses 128 Mel frequency bins instead of 80. Think of Mel frequency bins as a way to break down audio into manageable chunks for the model to process. More bins mean finer detail, which helps the model better understand the audio (see the short snippet right after this list).
  • More training
    This specific Whisper version was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio that was collected from Whisper large-v2. From there, the model was trained for 2.0 epochs over this mix.
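
Here is a tiny snippet that illustrates the difference in Mel bins, assuming a recent openai-whisper release where log_mel_spectrogram accepts an n_mels argument; "sample.wav" is a placeholder file:

#python
import whisper

audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # placeholder file
mel_80 = whisper.log_mel_spectrogram(audio, n_mels=80)    # what the earlier sizes expect
mel_128 = whisper.log_mel_spectrogram(audio, n_mels=128)  # what large-v3 expects
print(mel_80.shape, mel_128.shape)  # e.g. (80, 3000) and (128, 3000)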

Whisper models come in different sizes, from tiny to large. Here’s a table comparing the differences and similarities:

Size Parameters English-only Multilingual
tiny 39 M ✓ ✓
base 74 M ✓ ✓
small 244 M ✓ ✓
medium 769 M ✓ ✓
large 1550 M ✗ ✓
large-v2 1550 M ✗ ✓
large-v3 1550 M ✗ ✓

Integrating LLaVA With Our App

Alright, so we’re going with LLaVA for image inputs, and this time, we’re adding video inputs, too. This means the app can handle both images and videos, making it more versatile.

We’re also keeping the speech feature so you can hear the assistant’s replies, which makes the interaction even more engaging. How cool is that?

For transcribing spoken prompts, we’ll use Whisper, and for the assistant’s voice replies, gTTS. We’ll stick with the Gradio framework for the app’s visual layout and user interface. You can, of course, always swap in other models or frameworks — the main goal is to get a working prototype.

Installing and Importing the Libraries

We will start by installing and importing all the required libraries. This includes the transformers library for loading the LLaVA model, the openai-whisper package for speech recognition, bitsandbytes for quantization, gTTS for text-to-speech, and moviepy to help in processing video files, including frame extraction.

#python
!pip install -q -U transformers==4.37.2
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio
!pip install -q gTTS
!pip install -q moviepy

With these installed, we now need to import these libraries into our environment so we can use them. We’ll use Google Colab for that:

#python
import torch
from transformers import BitsAndBytesConfig, pipeline
import whisper
import gradio as gr
from gtts import gTTS
from PIL import Image
import re
import os
import datetime
import locale
import numpy as np
import nltk
import moviepy.editor as mp

nltk.download('punkt')
from nltk import sent_tokenize

# Set up locale
os.environ["LANG"] = "en_US.UTF-8"
os.environ["LC_ALL"] = "en_US.UTF-8"
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

Configuring Quantization and Loading the Models

Now, let’s set up a 4-bit quantization to make the LLaVA model more efficient in terms of performance and memory usage.

#python

# Configuration for quantization
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_compute_dtype=torch.float16
)

# Load the image-to-text model
model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text",
  model=model_id,
  model_kwargs={"quantization_config": quantization_config})

# Load the whisper model
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large-v3", device=DEVICE)

In this code, we’ve configured the quantization to four bits, which reduces memory usage and improves performance. Then, we load the LLaVA model with these settings. Finally, we load the whisper model, selecting the device based on GPU availability for better performance.

Note: We’re using llava-v1.5-7b as the model. Please feel free to explore other versions of the model. For Whisper, we’re loading the “large-v3” size, but you can also switch to another size like “medium” or “small” for your experiments.

To get our assistant up and running, we need to implement five essential functions:

  1. Handling conversations,
  2. Converting images to text,
  3. Converting videos to text,
  4. Transcribing audio,
  5. Converting text to speech.

Once these are in place, we will create another function to tie all this together seamlessly. The following sections provide the code that defines each function.

Conversation History

We’ll start by setting up the conversation history and a function to log it:

#python

# Initialize conversation history
conversation_history = []

def writehistory(text):
  """Write history to a log file."""
  tstamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
  logfile = f'{tstamp}_log.txt'
  with open(logfile, 'a', encoding='utf-8') as f:
    f.write(text + '\n')

Image to Text

Next, we’ll create a function to convert images to text using LLaVA and iterative prompts.

#python
def img2txt(input_text, input_image):
  """Convert image to text using the LLaVA pipeline."""
  try:
    image = Image.open(input_image)

    if isinstance(input_text, tuple):
      input_text = input_text[0]  # Take the first element if it's a tuple

    # Log the prompt and query the model (these run for every input type).
    writehistory(f"Input text: {input_text}")
    prompt = "USER: <image>\n" + input_text + "\nASSISTANT:"
    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})

    if outputs and outputs[0]["generated_text"]:
      # Extract only the assistant's part of the generated text.
      match = re.search(r'ASSISTANT:\s*(.*)', outputs[0]["generated_text"])
      reply = match.group(1) if match else "No response found."
      conversation_history.append(("User", input_text))
      conversation_history.append(("Assistant", reply))
      return reply
    else:
      return "No response generated."
  except Exception as e:
    return str(e)

Video to Text

We’ll now create a function to convert videos to text by extracting frames and analyzing them.

#python
def vid2txt(input_text, input_video):
  """Convert video to text by extracting frames and analyzing."""
  try:
    video = mp.VideoFileClip(input_video)
    frame = video.get_frame(1)  # Get a frame from the video at the 1-second mark
    image_path = "temp_frame.jpg"
    mp.ImageClip(frame).save_frame(image_path)
    return img2txt(input_text, image_path)
  except Exception as e:
    return str(e)

Audio Transcription

Let’s add a function to transcribe audio to text using Whisper.

#python
def transcribe(audio_path):
  """Transcribe audio to text using Whisper model."""
  if not audio_path:
    return ''

  audio = whisper.load_audio(audio_path)
  audio = whisper.pad_or_trim(audio)
  mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)  # large-v3 expects 128 Mel bins
  options = whisper.DecodingOptions()
  result = whisper.decode(model, mel, options)
  return result.text

Text to Speech

Lastly, we create a function to convert text responses into speech.

#python
def text_to_speech(text, file_path):
  """Convert text to speech and save to file."""
  language = 'en'
  audioobj = gTTS(text=text, lang=language, slow=False)
  audioobj.save(file_path)
  return file_path

With all the necessary functions in place, we can create the main function that ties everything together:

#python

def chatbot_interface(audio_path, image_path, video_path, user_message):
  """Process user inputs and generate chatbot response."""
  global conversation_history

  # Handle audio input
  if audio_path:
    speech_to_text_output = transcribe(audio_path)
  else:
    speech_to_text_output = ""

  # Determine the input message
  input_message = user_message if user_message else speech_to_text_output

  # Ensure input_message is a string
  if isinstance(input_message, tuple):
    input_message = input_message[0]

  # Handle image or video input
  if image_path:
    chatgpt_output = img2txt(input_message, image_path)
  elif video_path:
    chatgpt_output = vid2txt(input_message, video_path)
  else:
    chatgpt_output = "No image or video provided."

  # Add to conversation history
  conversation_history.append(("User", input_message))
  conversation_history.append(("Assistant", chatgpt_output))

  # Generate audio response
  processed_audio_path = text_to_speech(chatgpt_output, "Temp3.mp3")

  return conversation_history, processed_audio_path
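
Before wiring this up to a user interface, it can help to sanity-check the function directly. Here’s a quick sketch with a placeholder image path (sample.jpg doesn’t exist in the project; substitute a real file on your machine):

#python

# Quick manual test with a placeholder image path
history, reply_audio = chatbot_interface(
  audio_path=None,
  image_path="sample.jpg",   # placeholder -- use a real file
  video_path=None,
  user_message="What is happening in this image?")

print(history[-1])    # the assistant's latest reply
print(reply_audio)    # path to the generated MP3 ("Temp3.mp3")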

Using Gradio For The Interface

The final piece for us is to create the layout and user interface for the app. Again, we’re using Gradio to build that out for quick prototyping purposes.

#python

# Define Gradio interface
iface = gr.Interface(
  fn=chatbot_interface,
  inputs=[
    gr.Audio(type="filepath", label="Record your message"),
    gr.Image(type="filepath", label="Upload an image"),
    gr.Video(label="Upload a video"),
    gr.Textbox(lines=2, placeholder="Type your message here...", label="User message (if no audio)")
  ],
  outputs=[
    gr.Chatbot(label="Conversation"),
    gr.Audio(label="Assistant's Voice Reply")
  ],
  title="Interactive Visual and Voice Assistant",
  description="Upload an image or video, record or type your question, and get detailed responses."
)

# Launch the Gradio app
iface.launch(debug=True)

Here, we want to let users record or upload their audio prompts, type their questions if they prefer, upload an image or a video, and, of course, have a conversation block.

Here’s a preview of how the app will look and work:

Looking Beyond LLaVA

LLaVA is a great model, but there are even more capable ones that don’t require a separate ASR model like Whisper to build a similar app. These are called multimodal or “any-to-any” models. They are designed to process and integrate information from multiple modalities, such as text, images, audio, and video. Instead of just combining vision and text, these models can do it all: image-to-text, video-to-text, text-to-speech, speech-to-text, text-to-video, and image-to-audio, just to name a few. It makes everything simpler and less of a hassle.

Examples of Multimodal Models that Handle Images, Text, Audio, and More

Now that we know what multimodal models are, let’s check out some cool examples. You may want to integrate these into your next personal project.

CoDi

So, the first on our list is CoDi or Composable Diffusion. This model is pretty versatile, not sticking to any one type of input or output. It can take in text, images, audio, and video and turn them into different forms of media. Imagine it as a sort of AI that’s not tied down by specific tasks but can handle a mix of data types seamlessly.

CoDi was developed by researchers from the University of North Carolina and Microsoft Azure. It uses something called Composable Diffusion to sync different types of data, like aligning audio perfectly with the video, and it can generate outputs that weren’t even in the original training data, making it super flexible and innovative.

ImageBind

Now, let’s talk about ImageBind, a model from Meta. This model is like a multitasking genius, capable of binding together data from six different modalities all at once: images and video, text, audio, depth, thermal, and motion (IMU) data.

Source: Meta AI.

ImageBind doesn’t need explicit supervision to understand how these data types relate. It’s great for creating systems that use multiple types of data to enhance our understanding or create immersive experiences. For example, it could combine 3D sensor data with IMU data to design virtual worlds or enhance memory searches across different media types.
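
To give a feel for what that shared embedding space looks like in code, here’s a rough sketch based on Meta’s open-source imagebind package. The file names and text are placeholders, and the exact API may differ between releases, so treat this as an illustration rather than a drop-in snippet:

#python

# Sketch: embedding text, an image, and an audio clip into ImageBind's shared space
# (placeholder inputs; see Meta's ImageBind repository for the exact API)
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

inputs = {
  ModalityType.TEXT: data.load_and_transform_text(["a dog barking"], device),
  ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg"], device),
  ModalityType.AUDIO: data.load_and_transform_audio_data(["bark.wav"], device),
}

with torch.no_grad():
  embeddings = model(inputs)

# Similarity between the image and audio embeddings (higher means more related)
print(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T)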

Gato

Gato is another fascinating model. It’s built to be a generalist agent that can handle a wide range of tasks using the same network. Whether it’s playing games, chatting, captioning images, or controlling a robot arm, Gato can do it all.

The key thing about Gato is its ability to switch between different types of tasks and outputs using the same model.

GPT-4o

Next on our list is GPT-4o, a groundbreaking multimodal large language model (MLLM) developed by OpenAI. It can handle any mix of text, audio, image, and video inputs and give you text, audio, and image outputs. It’s also fast, responding to audio inputs in as little as 232 milliseconds (around 320 milliseconds on average), which is close to human response time in a conversation.

There’s a smaller version of the model called GPT-4o Mini. Small models are becoming a trend, and this one shows that even small models can perform really well. Check out this evaluation to see how the small model stacks up against other large models.
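
If you want to try GPT-4o (or GPT-4o Mini) in your own app, the OpenAI Python SDK lets you combine text and an image in a single request. Here’s a minimal sketch that assumes you have an OPENAI_API_KEY set in your environment; the image URL is just a placeholder:

#python

# Minimal GPT-4o call mixing text and an image (placeholder URL)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
  model="gpt-4o",  # or "gpt-4o-mini" for the smaller variant
  messages=[{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe what is happening in this picture."},
      {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
  }],
)

print(response.choices[0].message.content)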

Conclusion

We covered a lot in this article, from setting up LLaVA for handling both images and videos to incorporating Whisper large-v3 for top-notch speech recognition. We also explored the versatility of multimodal models like CoDi or GPT-4o, showcasing their potential to handle various data types and tasks. These models can make your app more robust and capable of handling a range of inputs and outputs seamlessly.

Which model are you planning to use for your next app? Let me know in the comments!