Frank Liu is the Director of Operations at Zilliz, a leading provider of vector database and AI technologies. Zilliz's engineers and scientists also created LF AI Milvus®, the world's most popular open-source vector database.
What initially attracted you to machine learning?
My first exposure to the power of ML/AI was as an undergrad student at Stanford, despite it being a bit far afield from my major (Electrical Engineering). I was initially drawn to EE as a field because the ability to distill complex electrical and physical systems into mathematical approximations felt very powerful to me, and statistics and machine learning felt the same. I took more computer vision and machine learning classes during grad school and ended up writing my Master's thesis on using ML to score the aesthetic beauty of images. All of this led to my first job in the Computer Vision & Machine Learning team at Yahoo, where I was in a hybrid research and software development role. We were still in the pre-transformers AlexNet & VGG days back then, and seeing an entire field and industry move so rapidly, from data preparation to massively parallel model training to model productionization, has been amazing. In many ways, it feels a bit ridiculous to use the phrase “back then” to refer to something that happened less than 10 years ago, but such is the progress that's been made in this field.
After Yahoo, I served as the CTO of a startup that I co-founded, where we leveraged ML for indoor localization. There, we had to optimize sequential models for very small microcontrollers – a very different but nonetheless related engineering challenge to today's massive LLMs and diffusion models. We also built hardware, dashboards for visualization, and simple cloud-native applications, but AI/ML always served as a core component of the work that we were doing.
Even though I've been in or adjacent to ML for the better part of 7 or 8 years now, I still maintain a lot of love for circuit design and digital logic design. Having a background in Electrical Engineering is, in many ways, incredibly helpful for a lot of the work that I'm involved in these days as well. A lot of important concepts in digital design such as virtual memory, branch prediction, and concurrent execution in HDL help provide a full-stack view of a lot of ML and distributed systems today. While I understand the allure of CS, I hope to see a resurgence in more traditional engineering fields – EE, MechE, ChemE, etc. – within the next couple of years.
For readers who are unfamiliar with the term, what is unstructured data?
Unstructured data refers to “complex” data, which is essentially data that cannot be stored in a pre-defined format or fit into an existing data model. For comparison, structured data refers to any type of data that has a pre-defined structure – numeric data, strings, tables, objects, and key/value stores are all examples of structured data.
To help truly understand what unstructured data is and why it's traditionally been difficult to computationally process this type of data, it helps to compare it with structured data. In the simplest terms, traditional structured data can be stored via a relational model. Take, for example, a relational database with a table for storing book information: each row within the table could represent a particular book indexed by ISBN, while the columns would denote the corresponding category of information, such as title, author, publish date, and so on. Nowadays, there are much more flexible data models – wide-column stores, object databases, graph databases, and others. But the overall idea remains the same: these databases are meant to store data that fits a particular data mold or data model.
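A minimal sketch of the book table described above, assuming Python's built-in sqlite3 module. The table name, column names, sample row, and the `find_book` helper are illustrative assumptions, not anything specified in the interview.

```python
import sqlite3

# In-memory database; the schema below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE books (
        isbn TEXT PRIMARY KEY,   -- each row is one book, keyed by ISBN
        title TEXT NOT NULL,
        author TEXT NOT NULL,
        publish_date TEXT
    )
    """
)

# Placeholder row: every value fits a pre-defined column.
conn.execute(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    ("000-0-00-000000-0", "Example Title", "Example Author", "2020-01-01"),
)


def find_book(isbn):
    """Look up a single book row by its ISBN key; returns None if absent."""
    cur = conn.execute(
        "SELECT title, author, publish_date FROM books WHERE isbn = ?",
        (isbn,),
    )
    return cur.fetchone()


print(find_book("000-0-00-000000-0"))
```

The point of the sketch is that every field has a fixed slot and an exact-match lookup is enough, which is precisely what does not hold for an image or a paragraph of text.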
Unstructured data, on the other hand, can be thought of as essentially a pseudo-random blob of binary data. It can represent anything, be arbitrarily large or small, and can be transformed and read in one of countless different ways. This makes it impossible to fit into any data model, let alone a table in a relational database.
What are some examples of this type of data?
Human-generated data – images, video, audio, natural language, etc. – are great examples of unstructured data. But there are a variety of less mundane examples of unstructured data too. User profiles, protein structures, genome sequences, and even human-readable code are also great examples of unstructured data. The primary reason that unstructured data has traditionally been so hard to manage is that unstructured data can take any form and can require vastly different runtimes to process.
Taking images as an example, two photos of the same scene could have vastly different pixel values, yet both have similar overall content. Natural language is another example of unstructured data that I like to refer to. The phrases “Electrical Engineering” and “Computer Science” are extremely closely related – so much so that the EE and CS buildings at Stanford are adjacent to each other – but without a way to encode the semantic meaning behind these two phrases, a computer may naively think that “Computer Science” and “Social Science” are more related.
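One way to picture that naive comparison is a word-overlap score. The sketch below uses Jaccard similarity over lowercase word tokens; this particular measure and the `token_overlap` helper are assumptions chosen for illustration, not a method named in the interview.

```python
def token_overlap(a, b):
    """Jaccard similarity over lowercase word tokens (a purely lexical measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Shares the word "science", so the lexical score is non-zero.
cs_vs_social = token_overlap("Computer Science", "Social Science")

# Shares no words at all, so the lexical score is zero.
cs_vs_ee = token_overlap("Computer Science", "Electrical Engineering")

print(cs_vs_social, cs_vs_ee)
```

A measure like this ranks “Social Science” as the closer match simply because of the shared word, which is the failure mode that semantic encodings are meant to avoid.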
What is a vector database?
To understand a vector database, it first helps to understand what an embedding is. I'll get to that momentarily, but the short version is that an embedding is a high-dimensional vector that can represent the semantics of unstructured data. In general, two embeddings which are close to one another in terms of distance are very likely to correspond to semantically similar input data. With modern ML, we have the power to encode and transform a variety of different types of unstructured data – images and text, for example – into semantically powerful embedding vectors.
From an organization's perspective, unstructured data becomes incredibly difficult to manage once the amount grows past a certain limit. This is where a vector database such as Zilliz Cloud comes in. A vector database is purpose-built to store, index, and search across massive quantities of unstructured data by leveraging embeddings as the underlying representation. Searching across a vector database is typically done with query vectors, and the result of the query is the top N most similar results based on distance.
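The query pattern described here, a query vector in and the top N closest items out, can be sketched as an exhaustive scan in plain Python. The three-dimensional vectors, the labels, and the `top_n` helper are made-up assumptions for illustration; this is not the Milvus or Zilliz Cloud API, and a real vector database would rely on an index rather than comparing against every stored vector.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n(query, vectors, n=2):
    """Return the n keys whose vectors are closest to the query (brute force)."""
    ranked = sorted(vectors, key=lambda key: euclidean(query, vectors[key]))
    return ranked[:n]


# Toy "embeddings"; real ones typically have hundreds of dimensions or more.
stored = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.2, 0.1],
    "tax form": [0.0, 0.1, 0.9],
}

results = top_n([1.0, 0.0, 0.0], stored, n=2)
print(results)
```

The scan costs one distance computation per stored vector, which is the main reason purpose-built indexing matters once collections grow large.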
The very best vector databases have many of the usability features of traditional relational databases: horizontal scaling, caching, replication, failover, and query execution are just some of the many features that a true vector database should implement. As a category definer, we've been active in academic circles as well, having published papers in SIGMOD 2021 and VLDB 2022, the two top database conferences out there today.
Could you discuss what an embedding is?
Generally speaking, an embedding is a high-dimensional vector that comes from the activations of an intermediate layer in a multilayer neural network. Many neural networks are trained to output embeddings themselves and some applications use concatenated vectors from multiple intermediate layers as the embedding, but I won't get too deep into either of those for now. Another less common but equally important way to generate embeddings is through handcrafted features. Rather than having an ML model automatically learn the right representations for the input data, good old feature engineering can work for many applications as well. Regardless of the underlying method, embeddings for semantically similar objects are close to each other in terms of distance, and this property is what powers vector databases.
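As a rough sketch of the handcrafted-feature route, the example below turns a list of grayscale pixel values into a normalized intensity histogram and treats that histogram as the embedding. The feature choice, the bin count, and the tiny pixel lists are all assumptions for illustration; a learned embedding from a neural network would normally capture far richer semantics.

```python
import math


def histogram_embedding(pixels, bins=4):
    """Map grayscale pixel values (0-255) to a normalized intensity histogram."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two "photos" of the same scene with different raw pixel values,
# plus one unrelated image. All values are made up.
scene = [10, 20, 30, 200, 210, 220, 230, 240]
scene_retake = [12, 18, 33, 198, 214, 221, 228, 243]
unrelated = [100, 110, 120, 130, 100, 110, 120, 130]

e_scene = histogram_embedding(scene)
e_retake = histogram_embedding(scene_retake)
e_unrelated = histogram_embedding(unrelated)

print(euclidean(e_scene, e_retake), euclidean(e_scene, e_unrelated))
```

Even though no pixel value matches between the two takes of the scene, their embeddings end up much closer to each other than to the unrelated image, which is the property a vector database relies on.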
What are some of the most popular use cases with this technology?
Vector databases are great for any application that requires some form of semantic search – product recommendation, video analysis, document search, threat & fraud detection, and AI-powered chatbots are some of the most popular use cases for vector databases today. To illustrate this, Milvus, the open-source vector database created by Zilliz and the underlying core of Zilliz Cloud, has been used by over a thousand enterprise users across a variety of different use cases.
I'm always happy to chat about these applications and help folks understand how they work, but I especially enjoy going over some of the lesser-known vector database use cases as well. New drug discovery is one of my favorite “niche” vector database use cases. The challenge for this particular application is searching for potential candidate drugs to treat a certain disease or symptom amongst a database of 800 million compounds. A pharmaceutical company we communicated with was able to significantly improve the drug discovery process in addition to cutting down on hardware resources by combining Milvus with a cheminformatics library called RDKit.
Cleveland Museum of Art's (CMA) AI ArtLens is another example I like to bring up. AI ArtLens is an interactive tool that takes a query image as an input and pulls visually similar images from the museum's database. This is usually referred to as reverse image search and is a fairly common use case for vector databases, but the unique value proposition that Milvus provided to CMA was the ability to get the application up and running within a week with a very small team.
Could you discuss what the open-source platform Towhee is?
When communicating with folks from the Milvus community, we found that many of them wanted to have a unified way to generate embeddings for Milvus. This was true for nearly all of the different organizations that we spoke with, but especially so for companies that did not have many machine learning engineers. With Towhee, we aim to close this gap via what we call “vector data ETL.” While traditional ETL pipelines focus on combining and transforming structured data from multiple sources into a usable format, Towhee is meant to work with unstructured data and explicitly includes ML in the resulting ETL pipeline. Towhee accomplishes this by providing hundreds of models, algorithms, and transformations that can be used as building blocks in a vector data ETL pipeline. On top of this, Towhee also provides an easy-to-use Python API which allows developers to build and test these ETL pipelines in a single line of code.
While Towhee is its own independent project, it is also a part of the broader vector database ecosystem centered around Milvus that Zilliz is creating. We envision Milvus and Towhee to be two highly complementary projects which, when used together, can truly democratize unstructured data processing.
Zilliz recently raised a $60M Series B round. How will this accelerate the Zilliz mission?
I'd first off like to thank Prosperity7 Ventures, Pavilion Capital, Hillhouse Capital, 5Y Capital, Yunqi Capital, and others for believing in Zilliz's mission and supporting us with this Series B extension. We've now raised a total of $113M, and this latest round of funding will support our efforts to scale out engineering and go-to-market teams. In particular, we'll be improving our managed cloud offering, which is currently in early access but scheduled to open up to everybody later this year. We'll also continue to invest in cutting-edge database & AI research as we have done in the past 4 years.
Is there anything else that you would like to share about Zilliz?
As a company, we're growing rapidly, but what really sets our current team apart from others in the database and ML space is our singular passion for what we're building. We're on a mission to democratize unstructured data processing, and it's absolutely amazing to see so many talented folks at Zilliz working towards a singular goal. If any of what we're doing sounds interesting to you, feel free to get in touch with us. We'd love to have you onboard.
If you'd like to know a bit more, I'm also personally open to chatting about Zilliz, vector databases, or embedding-related advancements in AI/ML. My (figurative) door is always open, so feel free to reach out to me directly on Twitter/LinkedIn.
Last but not least, thanks for reading!
Thank you for the great interview. Readers who wish to learn more should visit Zilliz.