Sam is passionate about building products at the intersection of finance and machine learning. He is currently the Head of Product for the Pricing Group at Opendoor, a late-stage startup that uses algorithms to buy and sell homes instantly, saving homeowners the hassle and uncertainty of listing their home and hosting.
What initially attracted you to machine learning and data science?
After college, I worked for a large professional services firm that hired hundreds of college grads into the same entry-level position. As I became involved in hiring, I was struck, and dismayed, by how wildly people’s opinions within the firm differed about what candidate attributes led to success. It seemed like a really important problem, where clarity was lacking. But I was excited by the fact that we had ample data on past job applicants and new hire outcomes that had never been connected or deeply analyzed. So I started working on that, treating it as a statistical problem, using basic tools like linear regression. Over time, the project grew into a startup, and the methods we used became more sophisticated. For example, we wanted to process unstructured audio and text from interviews directly, and that led us to adopt more powerful machine learning models like neural networks.
Could you discuss Opendoor’s automated valuation model (OVM), and how it calculates the estimated value of a property?
The Opendoor Valuation Model (OVM) is a core piece of our business and feeds into many downstream pricing applications.
In many ways, OVM behaves like a typical buyer or seller would: it looks across a neighborhood, including the types and prices of recently sold homes. However, when it comes to pricing homes, especially given the diversity of homes around the U.S., it’s not enough to look solely at the prices of comparable sales. It’s much more complex than that. We take a variety of factors into account, ranging from the square footage and backyard space to the number of bathrooms and bedrooms, layout, busy roads, upgrades, and more. OVM is fed by a multitude of data sources, including property tax information, market trends, and many home- and neighborhood-specific signals. We also look at previous human adjustments on homes to compute the average adjustment value, and we’re able to refine these values with scale. As we collect more human adjustment data for markets, the data set grows and improves OVM’s performance. It’s a feedback loop that continuously improves performance over time.
In addition to being highly accurate, it has to run with low latency and high coverage. That means every time we enter a new market, we need to expand OVM’s capabilities to ensure it can serve homeowners across neighborhoods and home types.
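As a rough illustration of the "average adjustment value" idea mentioned above, the sketch below averages expert adjustments per market and feature. The record layout, market names, feature names, and dollar amounts are all hypothetical and are not taken from Opendoor's systems.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (market, feature, dollar adjustment made by a human expert).
adjustments = [
    ("phoenix", "busy_road", -8000.0),
    ("phoenix", "busy_road", -6000.0),
    ("phoenix", "pool", 12000.0),
    ("dallas", "busy_road", -5000.0),
]


def average_adjustments(records):
    """Return the mean adjustment for each (market, feature) pair."""
    grouped = defaultdict(list)
    for market, feature, amount in records:
        grouped[(market, feature)].append(amount)
    return {key: mean(values) for key, values in grouped.items()}


averages = average_adjustments(adjustments)
print(averages[("phoenix", "busy_road")])  # -7000.0
```

As more adjustments are recorded for a market, each average rests on more observations, which is one simple way to picture the feedback loop described in the answer.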
What are some of the different machine learning methodologies that are used?
When we first started building OVM, we relied mainly on linear statistical models to better understand our buyers’ and sellers’ decision-making process. Over time, OVM developed and is now based on a neural network, specifically an architecture called a Siamese network. We use this to embed buyers’ and sellers’ behaviors, including selecting comparable homes, adjusting them and weighting them. This is vital because we’ve found that, to achieve high accuracy, a model’s architecture needs to reflect these key steps that market participants follow.
One of the many benefits of using a neural network is that it has the precision and flexibility to digest data across all markets and detect granular local nuances. As a result, when Opendoor launches in a new market or expands inventory in an existing market, we can use the same model, bypassing much of the engineering infrastructure work that comes from instantiating a new production model. Instead, we run new data through the existing model, which significantly reduces the time our engineers spend on the process.
There are also many other machine learning methodologies we use at Opendoor, in addition to neural networks. This includes, but isn’t limited to, decision trees, clustering techniques, ranking systems and optimization algorithms.
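The three steps named above (selecting comparable homes, adjusting them, and weighting them) can be pictured with a toy sketch. This is not Opendoor's model: the embedding weights are fixed by hand rather than learned, and the feature names, the per-square-foot adjustment, and the prices are assumptions made for illustration. The only Siamese-style property shown is that the same `embed` function is applied to both homes being compared.

```python
import math

# Hypothetical shared weights. In a real Siamese network these would be learned;
# here they are fixed so the example stays self-contained.
WEIGHTS = [
    [0.001, 0.5, 0.3],    # embedding dimension 0
    [0.0005, -0.2, 0.8],  # embedding dimension 1
]


def embed(home):
    """Map a home to a small vector using the shared weights."""
    features = [home["sqft"], home["bedrooms"], home["bathrooms"]]
    return [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]


def estimate_value(subject, comps, dollars_per_sqft=100.0):
    """Weight comps by embedding similarity and average their adjusted prices."""
    # Weighting: closer comps in embedding space get larger weights.
    scores = [math.exp(-math.dist(embed(subject), embed(c))) for c in comps]
    total = sum(scores)
    weights = [s / total for s in scores]
    # Adjusting: shift each comp's price for its size difference from the subject.
    adjusted = [
        c["price"] + dollars_per_sqft * (subject["sqft"] - c["sqft"]) for c in comps
    ]
    estimate = sum(w * p for w, p in zip(weights, adjusted))
    return estimate, weights


subject = {"sqft": 1500, "bedrooms": 3, "bathrooms": 2}
comps = [
    {"sqft": 1500, "bedrooms": 3, "bathrooms": 2, "price": 300000.0},
    {"sqft": 2500, "bedrooms": 5, "bathrooms": 4, "price": 480000.0},
]
estimate, weights = estimate_value(subject, comps)
print(round(estimate), [round(w, 3) for w in weights])
```

In this toy setup the comp that matches the subject exactly receives most of the weight, so the estimate lands close to its price.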
Opendoor relies on huge amounts of data. Where is this data collected from?
The data our algorithms find most valuable is also often the data that is hardest to find. This is the data we generate ourselves or develop via proprietary relationships. We use a combination of in-house data and third-party real estate data, including data points from listings, like the sales date, number of bedrooms and bathrooms, square footage and more. In addition, we look at features that indicate the homes’ uniqueness, which are things only human expertise can provide, such as lighting, street noise, quality of appliances and finishes and much more. We collect data from the homes that are already on the market as well as off-market homes where the owners have shared information with us.
Could you discuss some of Opendoor’s efforts towards improving the speed and reliability of the infrastructure that powers its raw data ingestion?
Ahead of any new market launch, we ingest many years of historical data. High-quality data is vital to training both our algorithms and our local operators to ensure they understand the variations within that market. To improve speed, quality and reliability, we’ve built flexible data mapping tools and tools for automatically assessing the coverage of new data fields. With these tools in place, it takes us a matter of hours or days to ingest and validate large amounts of historical real estate transaction data, instead of weeks.
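One simple way to picture "automatically assessing the coverage of new data fields" is to measure, for each field, the share of ingested records that carry a usable value. The sketch below does that; the field names, sample records, and the 50% threshold are hypothetical.

```python
def field_coverage(records, fields):
    """Return the fraction of records with a non-missing value for each field."""
    total = len(records)
    coverage = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = present / total if total else 0.0
    return coverage


# Hypothetical historical transaction records from a new market.
records = [
    {"sqft": 1400, "year_built": 1995, "lot_size": None},
    {"sqft": 2100, "year_built": 2004, "lot_size": 6500},
    {"sqft": 1750, "year_built": None, "lot_size": None},
    {"sqft": 1200, "year_built": 1988},
]

coverage = field_coverage(records, ["sqft", "year_built", "lot_size"])
# Flag fields that are too sparse to rely on (threshold is an assumption).
low_coverage = [name for name, share in coverage.items() if share < 0.5]
print(coverage, low_coverage)
```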
Another strategy we’ve invested in is proactive, automated data quality monitoring. We’ve set up systems that check the distributions of the data that we’re ingesting and transforming at each step of the process, in real-time. For example, if we expect that in a particular market 20% of new listings on average are apartments, and then today 50% of the new listings are classified as apartments, that will set off an alert for an engineer to investigate.
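The apartment example above can be sketched as a small distribution check. The 20% expected share and 50% observed share come from the example; the tolerance of 10 percentage points, the function names, and the listing format are assumptions.

```python
def category_share(listings, category):
    """Return the fraction of listings classified as the given category."""
    if not listings:
        return 0.0
    return sum(1 for item in listings if item["type"] == category) / len(listings)


def should_alert(observed_share, expected_share, tolerance):
    """Return True when the observed share drifts beyond the allowed tolerance."""
    return abs(observed_share - expected_share) > tolerance


# Hypothetical batch of today's new listings: 5 of 10 are apartments.
todays_listings = [{"type": "apartment"}] * 5 + [{"type": "house"}] * 5

observed = category_share(todays_listings, "apartment")
alert = should_alert(observed, expected_share=0.20, tolerance=0.10)
print(observed, alert)  # 0.5 True
```

A production system would likely use a statistical test that accounts for batch size rather than a fixed tolerance; the fixed threshold here only keeps the sketch short.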
How is expert human judgement combined with the machine learning algorithms to create feedback loops of ever improving performance?
Our in-house pricing experts play a huge role across our pricing decisions, working in tandem with our algorithms. Where machines still have blind spots, our expert operators fill in, and we rely on them through various stages. For example, they add or verify input data, like the quality of certain renovation projects. They make intermediate decisions about what features might be hard to value, and they also make user-facing decisions, like which offers we should accept. The human element will always be critical to our strategy and we believe that marrying experts and algorithms is best.
Could you both define backtesting and discuss its importance at Opendoor?
Backtesting is a way of assessing the accuracy of a model using historical data. For example, we may train the Opendoor Valuation Model on data from January 2015 to January 2021. In this context, “train” means we feed historical inputs, like home attributes, and outcomes, like sold home prices, to the model. And, in turn, the model learns a relationship between inputs and outcomes. Then we take this model, which reflects those newly-learned relationships, and we feed in another set of historical data, say from February 2021. Because the data is historical, we know the outcomes, and we can measure how much those diverge from the predictions.
This process is very important at Opendoor, and it’s used for all our machine learning products. It reduces the risk of a problem called overfitting, which is when a machine learning model identifies patterns in historical data that aren’t really there, such as spurious correlations that don’t help with real-world forecasting. It also saves us from running costly real-world A/B tests on new products and strategies that can be eliminated based on historical data.
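The train-then-evaluate procedure described above can be sketched in a few lines. The example fits a one-variable least-squares line (price against square footage) on sales up to the end of January 2021, then measures the error on February 2021 sales. The sales data, the single feature, and the choice of mean absolute percentage error as the metric are assumptions for illustration, not a description of OVM.

```python
from datetime import date
from statistics import mean

# Hypothetical sales: (sale date, square footage, sold price).
sales = [
    (date(2020, 3, 1), 1000, 200000.0),
    (date(2020, 7, 15), 1500, 290000.0),
    (date(2020, 11, 2), 2000, 410000.0),
    (date(2021, 1, 10), 2500, 500000.0),
    (date(2021, 2, 5), 1200, 245000.0),
    (date(2021, 2, 20), 1800, 350000.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def backtest(sales, train_end, test_end):
    """Train on sales up to train_end, then score predictions on the later window."""
    train = [s for s in sales if s[0] <= train_end]
    test = [s for s in sales if train_end < s[0] <= test_end]
    slope, intercept = fit_line([s[1] for s in train], [s[2] for s in train])
    # Mean absolute percentage error between predicted and actual sold prices.
    return mean(
        abs(slope * sqft + intercept - price) / price for _, sqft, price in test
    )


error = backtest(sales, train_end=date(2021, 1, 31), test_end=date(2021, 2, 28))
print(f"held-out error: {error:.1%}")
```

The key design point is that the test window lies strictly after the training window, so the model is scored only on outcomes it could not have seen during training.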
Is there anything else that you would like to share about Opendoor?
We’re hiring! If you’re interested in building the future of real estate, and/or working at the intersection of fintech, machine learning, and consumer products, please apply! We have open roles across functions and cities. Check out our careers page here.
Thank you for the great interview. Readers who wish to learn more should visit Opendoor.