Teaching Machines to Understand User Interfaces

Written by Tbeltramelli | Published 2017/11/20
Tech Story Tags: machine-learning | user-interface-design | deep-learning | front-end-development


This entry on our research page is more of a blog post than an actual research piece; the goal is to share what we are working on and why we are doing it.

TL;DR: bring me to the sneak peek.

1) Building apps and websites is insanely slow

Back when I was an undergraduate student, I used to work part-time as a front-end developer in a digital agency. I was fortunate enough to be part of a team of extremely talented people: art directors and UI/UX designers crafting gorgeous interfaces, and creative front-end engineers building cutting-edge applications with the latest technologies (remember when WebGL was getting cool and Adobe Flash was dying? That was that time). The team worked exclusively for high-profile clients and collected awards such as The FWA and Awwwards in recognition of the quality of its craftsmanship.

Working in the web design industry was a lot of fun but something struck me: the workflow is completely broken.

The majority of designers I worked with preferred to sketch their creative ideas on a whiteboard or in a fancy notebook instead of using a wireframing tool like Balsamiq or Axure. They would argue that these tools constrain their thoughts and kill the creative flow, and I quite frankly agree with them. No surprise that graphic tablets are so popular among designers: the device attempts to recreate the pen-and-paper feeling digitally. Some designers would even draw their ideas directly in a design tool using their graphic tablet in an attempt to save some time.

The graphic tablet, designer’s best friend. source: Wikipedia

Regardless of the method chosen to sketch ideas, designers would then have to recreate their drawings, either in a wireframing tool to get the layout validated by the client or the project manager, or directly in their favorite design tool such as Adobe Photoshop or Sketch to craft the user interface. This essentially means doing the very same work twice: converting what was produced in one format into another format. That’s the first way in which the workflow is broken.

Once designers have finalized the look of a given user interface, they ship their work to a front-end developer in order to get it implemented in code. Implementing user interfaces basically consists of re-creating in code what the designers created graphically in a design tool. Doesn’t that sound like duplicated work once again?

And here is the thing: as a developer you want to focus on implementing the client-server logic and the core functionality, and on optimizing the interactive graphics, animations, and transitions; but you end up spending the majority of your time coding user interfaces. Writing HTML/CSS is boring, repetitive, frustrating, and so time-consuming that it prevents iteration cycles with designers. In some digital agencies, designers are in charge of implementing the user interfaces they sketched, but the problem remains the same: someone has to sit down and manually write cumbersome, boring, and repetitive UI code. That’s the second way in which the workflow is broken.

The classical workflow for building apps and websites.

As shown in the figure above, these redundant steps bring zero value to the project since their one and only purpose is to convert a user interface encoded in one format into another format so that the next step in the workflow can happen. Because these conversions are performed manually by people, they are expensive, time-consuming, and frustrating, and they stifle innovation by consuming precious time that should instead be spent on iteration cycles to improve the app being built.

2) Deep Learning at the core of a possible solution

As a graduate student focusing on Machine Learning, I was amazed by the breakthroughs made possible by Deep Learning. Computers were finally able to process images in a somewhat satisfying manner. I remember being completely mind-blown reading the paper Show and Tell by Vinyals et al. at Google, in which a deep neural network was trained to generate an English description of an input picture.

Show and Tell: A Neural Image Caption Generator, Vinyals et al. 2015

Inspired by this work and many others, I figured that generating an English description from a photograph should basically be the same problem as generating computer code from a UI mockup: in both cases, you want to produce a textual output given a visual input. After letting that idea gather dust in my notebook for a long time, I finally decided to write some code and see if my assumption was correct.
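To make the analogy concrete, the experiment essentially amounts to an image-captioning-style encoder-decoder: a CNN encodes the mockup into a feature vector, and an RNN decodes that vector into a sequence of code tokens. Below is a minimal sketch in PyTorch; it illustrates the idea only and is not the actual pix2code architecture, and all layer sizes, names, and the token vocabulary are assumptions.

```python
# Minimal sketch of an image-to-code model (illustration only, not pix2code itself).
import torch
import torch.nn as nn

class ImageToCode(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        # CNN encoder: raw pixels -> a fixed-size feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # RNN decoder: the feature vector conditions the generation of code tokens
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, token_ids):
        # images: (batch, 3, H, W); token_ids: (batch, seq_len) previously generated tokens
        features = self.encoder(images)               # (batch, hidden_dim)
        h0 = features.unsqueeze(0)                    # use image features as initial hidden state
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(token_ids), (h0, c0))
        return self.head(out)                         # logits over the code-token vocabulary
```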

To my great surprise, it actually worked! Of course, it worked in a controlled environment, and a lot more work would be needed for the technology to meet real-world requirements. Nevertheless, this encouraging first step suggests that Deep Learning can indeed be leveraged for the automatic generation of code from user interface images (and Airbnb agrees with us). That was the moment I wrote the pix2code paper and decided to open-source a basic implementation and a toy dataset for educational purposes. Surprisingly, the project received quite a lot of media attention, was covered in a couple of ML-related podcasts, and was even the subject of a Two Minute Papers episode.

3) Seeing the bigger picture

At Uizard, we are essentially teaching machines to understand graphical user interfaces the same way humans do in order to propose a more efficient workflow for building apps and websites. Our core technology has evolved quite a lot since the release of our pix2code paper but the central idea remains the same: we are building a software pipeline made of neural network weights to convert pixel values (e.g. photograph, screenshot) to sequences of characters (e.g. iOS code, Sketch file). The workflow we are envisioning is pictured below.

The modern AI-driven workflow we are envisioning for building apps and websites.
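To illustrate the "pixel values to sequences of characters" side of such a pipeline, here is a hedged sketch of a greedy decoding loop: the screenshot goes in once, and code tokens come out one at a time until an end-of-sequence token is produced. The START/END token ids and the model interface (the same as the hypothetical ImageToCode sketch above) are assumptions for illustration, not our actual API.

```python
# Illustrative greedy decoding loop for an image-to-code model (assumptions labeled above).
import torch

START, END = 0, 1  # hypothetical special token ids

@torch.no_grad()
def generate_ui_code(model, image, max_len=200):
    """Greedy decoding: image tensor of shape (1, 3, H, W) -> list of code-token ids."""
    tokens = [START]
    for _ in range(max_len):
        token_ids = torch.tensor([tokens])          # (1, current_length)
        logits = model(image, token_ids)            # (1, current_length, vocab_size)
        next_token = int(logits[0, -1].argmax())    # pick the most likely next token
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the START token; map ids back to code text downstream
```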

For professional users such as designers and developers, such a technology would save critical time early in a project by enabling ideas to be tested quickly, speed up iteration cycles, and ultimately enable the development of better apps. The goal is to save as much time as possible on trivial tasks; no one enjoys redundant work. Most importantly, this would allow designers and developers to focus on what matters: bringing value to end users.

The barrier to entry for building simple apps would also become really low. Learning to use a UI design tool takes time, and learning to code probably takes even more. However, most people are able to draw a user interface on a piece of paper, which would allow your grandma to go from an idea to a working UI on her phone in a matter of seconds.

Our vision is to empower people with Artificial Intelligence because we believe in a future where machines assist humans, not replace them.

4) Sneak peek

We are working hard to make our vision a reality. In the meantime, the four of us are really excited to share some of our progress.

