Jay Mishra is the Chief Operating Officer (COO) at Astera Software, a rapidly growing provider of enterprise-ready data solutions. Astera helps business users bridge the data-to-insight gap with a suite of user-friendly yet high-performance data extraction, data quality, data integration, data warehousing, and electronic data interchange solutions, which are used by both midsize and Fortune 500 companies across a range of industries.
What initially attracted you to computer science?
I've always had a deep-rooted passion for mathematics, and my journey into computer science was a natural extension of that. My undergraduate education was in Mathematics and Computer Science, and it was the logical progression from the world of math to the realm of computer science that fascinated me. What particularly caught my attention were the intricate workings of algorithms and advanced algorithmic processes which led me to pursue a specialization in algorithms during my Masters in Computer Science. Since then, my connection with computer science has remained strong, and I continually strive to stay on top of the latest developments in the field.
You’re currently the COO of Astera. Could you share with us what your day-to-day role entails?
As the COO of Astera, my role is multifaceted, reflecting our company's dynamic nature. I've been with Astera since its inception, and my responsibilities have spanned various areas of the organization. This includes everything from actively contributing to the development and coding of our products to ensuring that our features align with our customers' evolving needs. I closely collaborate with our customers, working in tandem with them to refine our solutions. My role extends beyond just product development to encompass sales and marketing, where we bring our offerings to market.
As we're in a growth phase, I have taken on additional responsibilities, including overseeing our revenue goals and strategically expanding our product portfolio to reach new markets. Essentially, I have a hand in nearly every aspect of our operations, ensuring that we not only build exceptional products but also successfully bring them to market and meet our business objectives.
For readers who are unfamiliar with this term, what is data warehousing?
Data warehousing is an architectural pattern for consolidating all your enterprise data into a centralized repository. That repository serves as the foundation for analytics, reports, and dashboards that present a true picture of where your business stands and forecast how it is going to do in the future. To cater to all of that, you bring your data together in a certain way, and that architecture is called a data warehouse.
The term is actually taken from a real-life warehouse, where products are stored on organized shelves. In the data world, you're bringing your data in from various sources: production, your website, your customers, sales and marketing, finance, and your human resources department. You bring all of that data together in one place, and that place is the data warehouse. It is designed in a certain way so that reporting, especially reporting over time, is easy. That's the core purpose of a data warehouse.
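To make the idea of a warehouse "designed so that time-based reporting is easy" concrete, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table names (dim_date, dim_product, fact_sales), the sample rows, and the monthly_revenue helper are all hypothetical and chosen only for illustration; this is not Astera's implementation.

```python
import sqlite3

# In-memory database standing in for the warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);
""")

# Hypothetical sample data: two dates, two products, three sales.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240115, 2024, 1), (20240210, 2024, 2)])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(20240115, 1, 100.0), (20240115, 2, 50.0), (20240210, 1, 75.0)])


def monthly_revenue(connection):
    """Total sales per (year, month) by joining the fact table to the date dimension."""
    return connection.execute("""
        SELECT d.year, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON f.date_key = d.date_key
        GROUP BY d.year, d.month
        ORDER BY d.year, d.month
    """).fetchall()


print(monthly_revenue(conn))
```

Because every fact row carries a key into the date dimension, a timeline report reduces to one join and a GROUP BY, regardless of which source system the sale originally came from.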
What are some of the key trends in data warehousing today?
Data warehousing has evolved quite a bit in the past 20 to 25 years. Approximately a decade ago, we witnessed the emergence of automated data warehousing, a paradigm shift that accelerated the process of building data models and data warehouses. Recently, automation has taken center stage. It addresses the repetitive nature of data warehousing tasks, streamlining processes to save time and resources.
Our product, Astera Data Warehouse Builder, for example, offers a holistic approach to automation in data warehousing. It covers everything from automating ETL (Extract, Transform, Load) pipelines and data modeling to the automatic loading of data into structures like star schemas or data vaults. Furthermore, it efficiently maintains these structures through Change Data Capture (CDC) mechanisms. This all-inclusive automation has emerged as a key trend in the data warehousing landscape.
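As a rough illustration of what Change Data Capture does, the sketch below compares two snapshots of a source table and reports what was inserted, updated, or deleted. This is the simplest snapshot-comparison form of CDC and is an assumption made for illustration; production tools often read the database transaction log instead, and nothing here describes how Astera Data Warehouse Builder implements it. The detect_changes function and the sample rows are hypothetical.

```python
def detect_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, updates, deletes): new rows, rows whose values
    changed, and the keys of rows that disappeared.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes


# Hypothetical customer snapshots: primary key -> (name, state).
previous = {1: ("Alice", "NY"), 2: ("Bob", "LA"), 3: ("Cara", "SF")}
current = {1: ("Alice", "NY"), 2: ("Bob", "TX"), 4: ("Dan", "DC")}

inserts, updates, deletes = detect_changes(previous, current)
print("inserts:", inserts)
print("updates:", updates)
print("deletes:", deletes)
```

Only the changed rows then need to be applied to the warehouse, which is what keeps structures such as star schemas or data vaults up to date without a full reload.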
Furthermore, the most recent trend is the fusion between data warehousing and artificial intelligence (AI). Specifically, generative AI has taken automation to new heights. It not only automates tasks but also aids users in decision-making.
Configuration of data warehousing components, pipelines, and decision points can be guided by AI, making data warehousing more powerful and efficient than ever before. In essence, this is automation on steroids, and it's transforming the landscape of data warehousing. The intersection between AI and data warehousing is a trend that holds immense promise for the future.
What are the four fundamental principles that businesses should consider for their data warehouse development?
1. Defining Clear Objectives
It's essential to begin by understanding precisely what you need from your data warehouse. Avoid the common pitfall of collecting excessive data without a clear purpose. Instead, identify the specific objectives you want to achieve with your data warehouse. What reports and insights are you seeking? By focusing on your goals, you can ensure that you bring in only the data that's relevant, rather than indiscriminately accumulating vast amounts of information. Even as the costs of storage and computing power keep decreasing, it's crucial to utilize these resources intelligently and ethically.
2. Choosing the Right Architectural Pattern
Architectural patterns are very important. They decide whether your data warehousing solution is going to be successful or not. There are various options, ranging from Inmon-style data warehousing to Ralph Kimball's star schemas, as well as newer patterns like Data Vault and the one big table approach advocated by columnar database vendors. Not all patterns will be suitable for every scenario.
What we mostly see is a star schema sitting on top of a data vault, so the combination of Data Vault and star schema is still the most widely used pattern. But, as I said, each requirement and each scenario is going to have a different answer. So run it past the experts and see which architectural pattern is a good fit for your scenario.
3. Selecting the Right Tools
Tools are very important, and they make a huge difference to the time and resources needed to build a solution, as well as to its accuracy and quality, which are determined by the products you use to build and maintain your data warehouse. Pay close attention to each product's capabilities, and look for products that bring the most requirements under one umbrella. Areas such as ETL (Extract, Transform, Load), data quality, data modeling, data loading, and data publishing all play a significant role. If you try to use a separate product for each of these areas, it's going to be difficult. So look for products that can handle most, if not all, of these constituents.
4. Your Team
Last but not least, the team you assemble to build your data warehouse solution is the most important part. We recommend having someone with a strong background in data architectural patterns. In terms of team composition, cross-functional teams are the best way to go: a mix of business users and people with some programming background, or at least data expertise, with close collaboration between your data custodians (the people in charge of the data) and, of course, the business. By fostering close cooperation among these different facets of your organization, you can create a cohesive and effective team responsible for building and maintaining your data warehousing solution.
Success in data warehousing hinges on achieving a balance between these four principles. These principles, when carefully followed, have proven to be a recipe for success in our experience.
Why do companies need a modern data stack?
It depends on how we define “modern” and that keeps changing, sometimes by the year, month, and even by the day. We must consider modern toolsets designed with the changing landscape of data in mind. Over the past few years, there have been significant shifts in the nature and volume of data. The rise of Big Data has transformed the data landscape, with data pouring in from sources like e-commerce websites, production databases, and various parts of your business. This data is changing not only in volume but also in its very nature.
In the past, data was mostly structured, but now, unstructured data plays a significant role. Additionally, the velocity at which data is generated and made available for use has increased. Given these changes in data, we must continuously evaluate and adapt our toolset to effectively address these evolving data challenges.
The modern data stack is designed to handle all the variations in the structures and the velocity of the data, and it is well-equipped to adapt to the emerging architectural patterns that have evolved over the past few years. Therefore, if you want to make the best use of your data, you have to look at modernizing your data stack. That's the only way to keep up with the new data challenges.
We have seen that companies stick with existing solutions that appear to be working. It's crucial to recognize that data itself is inherently dynamic. It continually evolves, presenting new challenges and opportunities. Existing solutions may not be equipped to adapt to these changes. Therefore, to harness the full potential of their data, companies must embrace the concept of modernizing their data stack. It's not about breaking what works; it's about staying agile and responsive to the evolving nature of data. By continuously evaluating and integrating advancements in data technology, businesses can remain competitive and make informed decisions in an increasingly data-driven world.
What are some of the current data management challenges that are seen in the industry?
1. Data Velocity and Integration
One of the big challenges we face today is the sheer volume of data pouring in from various applications. If you take any typical IT organization, they deal with new apps popping up all the time—dozens, sometimes even hundreds each year, especially in medium-sized organizations.
Now, all these apps generate data, and that data holds valuable insights. The primary concern here is the ability to swiftly integrate these new data sources into existing data pipelines and consolidate them into a unified view. The speed at which organizations can adapt to and incorporate these new data streams is the biggest challenge we see.
2. Varying Data Formats
Another critical challenge stems from the nature of data itself, particularly the increasing prevalence of unstructured data. With unstructured data there are, of course, different schools of thought about how to handle it.
Organizations must decide whether to store this data directly in data lakes for later use or to extract and transform it into a more structured format for immediate consumption. The challenge of how to handle unstructured data remains, and we see that even small and medium-sized companies are being impacted by it. So, devising effective strategies for handling unstructured data is essential.
3. Data Publishing and Sharing
While data integration and consolidation are crucial, the ability to share data effectively is equally important. Organizations need mechanisms for publishing and distributing data to internal departments, third-party vendors, partners, and other stakeholders. This challenge extends beyond simply making data accessible; it involves ensuring data security, privacy, and compliance with regulations. As data sharing becomes a necessity for businesses of all sizes, the technologies and products in this space are rapidly evolving to meet the demand.
What are some ways that Astera has integrated AI into customer workflow?
We look at AI intersecting with data management in two distinct ways.
1. Enhancing Usability with Generative AI
Our deep commitment to usability is a cornerstone of our product development philosophy. Over the last 12 to 13 years, we've built a strong reputation for designing products with a short learning curve, making them accessible even to non-technical users. With just a modest amount of training, individuals can effectively utilize our products to perform meaningful tasks with their data.
With the introduction of generative AI, Astera has taken usability to the next level. We utilized generative AI to create a user interface that allows customers to interact with the product using natural language commands. This AI-driven interface simplifies configuration tasks, making it more intuitive and efficient for users.
Moreover, Astera has integrated automation powered by AI to handle tasks that previously required several hours of manual work, especially in the configuration of data management products. The biggest cost factor of building a data management solution was not just buying a product, it was the time and effort spent on configuring it. We have tried to address that with AI. This approach significantly reduces the time and resources traditionally spent on product configuration.
As an example, Astera's product, ReportMiner, simplifies the extraction of data from unstructured documents by allowing users to create extraction templates based on rules. AI can now generate the initial template in a matter of seconds, a task that previously took two to three hours for a typical user. The first cut of an AI-generated template may not be perfect, but it handles approximately 90% of the workload, allowing users to make quick adjustments and complete the task in minutes instead of hours. This approach is just one example of how Astera leverages AI to enhance usability throughout its products.
We are doing similar things throughout our data stack where we are getting a significant lift in usability with artificial intelligence.
2. AI Functionality as a Toolset
Astera offers a unified data stack that covers various aspects of data management, including ingestion, transformation, data quality, data warehousing, APIs, and data publishing. The company recognizes the importance of providing AI functionality as a versatile toolset for its users. Within this toolset, Astera's customers can access AI across the data science spectrum, from building and deploying machine learning models to handling ML Ops (Machine Learning Operations). Astera also supports the use of open-source-based models, including large language models (LLMs), and facilitates fine-tuning for specific use cases.
This broader AI functionality empowers Astera's users to leverage AI for various data-related tasks, including deploying machine learning models, implementing ML Ops, and fine-tuning open-source models. Additionally, Astera continuously works on expanding its AI support, encompassing areas such as vector databases, similarity searches, embeddings, and more.
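For readers unfamiliar with embeddings and similarity search, the following is a minimal sketch of the underlying idea: each item is represented as a vector, and a query returns the items whose vectors point in the most similar direction. The three-dimensional vectors, the document names, and the top_k helper are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and a vector database would replace the plain dictionary used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings" for three document types.
index = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "resume": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
print([name for name, _ in results])
```

A brute-force scan like this is fine for a handful of vectors; the point of a vector database is to answer the same question efficiently over millions of them.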
What are some of the best practices to leverage AI and ML models in data management for large companies?
1. Stay at the Forefront of AI and ML Developments
The field of large language models is rapidly evolving. To gain a competitive edge, large companies should stay informed about the latest advancements. Astera, for example, was an early adopter of generative AI, utilizing models from OpenAI as well as Llama. Continuous monitoring of emerging technologies ensures you're well-prepared to leverage them effectively.
2. Experiment with Multiple Models and Configurations
By fine-tuning LLMs, we were able to take smaller models, in the range of 8 to 13 billion parameters, and deploy them locally. That has worked really well for us. What we recommend is that, instead of committing to a single model, you try out different base models and different configurations and see which one works for you.
Large language models come in different flavors, each with its unique capabilities. Create a configuration that allows you to choose from a wide array of options, mirroring what developers and data scientists do in their data science journeys.
To empower users, we've created a configuration system that offers an extensive array of options, akin to what developers and data scientists encounter when working with open-source libraries on their data-driven endeavors. Our aim has been to seamlessly integrate these options into our product, facilitating a dynamic and adaptable experience for users.
3. Prioritize Local Deployment Over APIs
When dealing with data-centric products, reducing delays is paramount. Relying solely on APIs for AI and ML model access may introduce unacceptable delays, particularly when handling large volumes of data. It is advisable to prioritize deploying fine-tuned models locally, dedicated to your specific scenario. This approach can significantly improve response times and overall performance.
Why is Astera a superior solution to competing platforms?
- Astera’s solutions have a code-free, intuitive, visual interface, along with enhanced usability powered by AI, which makes it easy for all users to execute complex data processes, irrespective of their technical abilities.
- Automation features of our data stack cut down on repeatable manual tasks and save time and development resources.
- Our unified platform can help users execute end-to-end data processes without switching solutions. This eliminates the expense of learning and managing multiple, siloed systems.
Thank you for the great interview. Readers who wish to learn more should visit Astera Software.