Erik joined the data practice of SPR’s Emerging Technology Group as Principal Architect in 2018.
Erik specializes in data, open source development using Java, and practical enterprise architecture, including the building of PoCs, prototypes, and MVPs.
What initially attracted you to machine learning?
Its enablement of applications to continuously learn. I had started my development career as a senior data analyst using SPSS at what became a global market research firm, and later incorporated use of a business rules engine called Drools into applications that I built for clients, but the output for all of this work was essentially static.
I later worked through process improvement training, during which time instructors demonstrated in detail how they were able to improve, through statistics and other methods, business processes used by their clients, but here again the output was largely focused on points in time. My experience working to improve a healthcare product my colleagues and I built during this same time period is what showed me why continuous learning is necessary for such efforts, but the resources now available did not exist back then.
Interestingly, my attraction to machine learning has come full circle, as my graduate adviser cautioned me against a specialty in what was then called artificial intelligence, due to the AI winter at the time. I choose instead to use terms such as ML because these hold fewer connotations, and because even AWS acknowledges that its AI services layer is really just a higher-level abstraction built on top of its ML services layer. While some of the ML hype out there is unrealistic, it provides powerful capabilities from the perspective of developers, as long as these same practitioners acknowledge the fact that the value which ML provides is only as good as the data processed by it.
You’re a huge open source advocate. Could you discuss why open source is so important?
One aspect about open source that I’ve needed to explain to executives over the years is that the primary benefit of open source is not that use of such software is made available without monetary cost, but that the source code is made freely available.
Additionally, developers making use of this source code can modify it for their own use, and if suggested changes are approved, make these changes available to other developers using it. In fact, the movement behind open source software started due to developers waiting at length for commercial firms to make changes to products they licensed, so developers took it upon themselves to write software with the same functionality, opening it up to be improved upon by other developers.
Commercialized open source takes advantage of these benefits, the reality being that many modern products make use of open source under the covers, even whilst commercial variants of such software typically provide additional components not available as part of a given open source release, providing differentiators as well as support if this is needed.
My first experiences with open source took place while building the healthcare product I mentioned earlier, making use of tooling such as Apache Ant, used to build software, and an early DevOps product at the time called Hudson (the code base of which later became Jenkins). The primary reason behind our decisions to use these open source products was that these either provided better solutions than commercial alternatives, or were innovative solutions not even offered by commercial entities. In addition, the commercial licensing of some of the products we had been using was overly restrictive, leading to excessive red tape whenever we needed more licenses, due to the costs involved.
Over time, I’ve seen open source offerings continue to evolve, providing much needed innovation. For example, many of the issues with which my colleagues and I wrestled building this healthcare product were later solved by an innovative open source Java product we started using called Spring Framework, which is still going strong after more than a decade, the ecosystem of which now stretches far beyond some of the innovations that it initially provided, now seen as commonplace, such as dependency injection.
You’ve used open source for the building of PoCs, prototypes, and MVPs. Could you share your journey behind some of these products?
As explained in one of the guiding principles I presented to a recent client, build-outs for the data platform we built for them should continue to be iteratively carried out as needed over time. The components built out for this platform should not be expected to remain static, as needs change and new components and component features will be made available over time.
When building out platform functionality, always start with what is minimally viable before adding unneeded bells and whistles, which in some cases even includes configuration. Start with what is functional, make sure you understand it, and then evolve it. Don’t waste time and money building what has low likelihood of being used, but make an effort to get ahead of future needs.
The MVP we built for this product expressly needed to be built so that additional use cases could continue to be built on top of it, even though it came packaged with implementation of a single use case, for expense anomaly detection. Unlike this product, an earlier product that I built had some history behind it prior to my arrival. In this case, stakeholders had been debating for three years (!) how they should approach a product they were looking to build. A client executive explained that one of the reasons he brought me in was to help the firm get past some of these internal debates, especially because the product that he was looking to build needed to satisfy the hierarchy of organizations involved.
I came to find that these turf wars were largely associated with the data owned by the client, its subsidiaries, and its external customers, so in this case the entire product backlog revolved around how this data would be ingested, stored, secured, and consumed for a single use case generating on-the-fly networks of healthcare providers for cost analyses.
Earlier in my career, I came to understand that an architectural quality called “usability” is not limited to end users, but also applies to software developers themselves. This is because the code that is written needs to be usable, just as user interfaces need to be usable by end users. In order for a product to become usable, proofs of concept need to be built to demonstrate that developers are going to be able to do what they set out to do, especially when related to the specific technology choices they are making. But proofs of concept are just the beginning, as products are best when evolved over time. In my view, however, the foundation for an MVP should ideally be built on prototypes exhibiting some stability, so that developers will be able to continue to evolve it.
While reviewing the book ‘Machine Learning at Enterprise Scale’ you stated that ‘use of open source products, frameworks, and languages alongside an agile architecture composed of a mix of open source and commercial components provides the nimbleness that many firms need but don’t immediately realize at the outset’. Could you go into some details as to why you believe that firms which use open source are more nimble?
Many commercial data products use key open source components under the covers, and enable developers to use popular programming languages such as Python. The firms which build these products know that the open source components they’ve chosen to incorporate give them a jump start when these are already widely used by the community.
Open source components with strong communities are easier to sell, due to the familiarity that these bring to the table. Commercially available products which consist mainly of closed source, or even open source that is largely only used by specific commercial products, often require either training by these vendors, or licenses in order to make use of the software.
Additionally, documentation for such components is largely not made publicly available, forcing the continued dependency of developers on these firms. When widely accepted open source components such as Apache Spark are the central focus, as with products such as Databricks Unified Analytics Platform, many of these items are already made available in the community, minimizing the portions on which development teams need to depend on commercial entities to do their work.
Additionally, because components such as Apache Spark are broadly accepted as de facto industry standard tooling, code can also be more easily migrated across commercial implementations of such products. Firms will always be inclined to incorporate what they view as competitive differentiators, but many developers don’t want to use products that are completely novel because this proves challenging to move between firms, and tends to cut their ties with the strong communities they have come to expect.
From personal experience, I’ve worked with such products in the past, and it can be challenging to get competent support. This is ironic, given that such firms sell their products with the customer expectation that support will be provided in a timely manner. I’ve had the experience of submitting a pull request to an open source project and seeing the fix incorporated into the build that same day, but cannot say the same about any commercial project with which I have worked.
Something else that you believe about open source is that it leads to ‘access to strong developer communities.’ How large are some of these communities and what makes them so effective?
Developer communities around a given open source product can reach into the hundreds of thousands. Adoption rates don’t necessarily point to community strength, but are a good indicator that this is the case due to their tendency to produce virtuous cycles. I consider communities to be strong when these produce healthy discussion and effective documentation, and where active development is taking place.
When an architect or senior developer works through the process to choose which such products to incorporate into what they are building, many factors typically come into play, not only about the product itself and what the community looks like, but about the development teams who will be adopting these, whether these are a good fit for the ecosystem being developed, what the roadmap looks like, and in some cases whether commercial support can be found in the case this may be needed. However, many of these aspects fall by the wayside in the absence of strong developer communities.
You have reviewed hundreds of books on your website. Are there three that you could recommend to our readers?
These days I read very few programming books, and while there are exceptions, the reality is that these are typically outdated very quickly, and the developer community usually provides better alternatives via discussion forums and documentation. Many of the books I currently read are made freely available to me, either via technology newsletters to which I subscribe, authors and publicists who reach out to me, or the ones Amazon sends to me. For example, Amazon sent me a pre-publication uncorrected proof of “The Lean Startup” for my review in 2011, introducing me to the concept of the MVP, and just recently sent me a copy of “Julia for Beginners”.
(1) One book from O’Reilly that I’ve recommended is “In Search of Database Nirvana”. The author covers in detail the challenges for a data query engine to support workloads spanning the spectrum of OLTP on one end, to analytics on the other end, with operational and business intelligence workloads in the middle. This book can be used as a guide to assess a database engine or combination of query and storage engines, geared toward meeting one’s workload requirements, whether these be transactional, analytical, or a mix of these two. Additionally, the author’s coverage of the “swinging database pendulum” in recent years is especially well done.
(2) While much has changed in the data space over the last few years, since new data analytics products continue to be introduced, “Disruptive Analytics” presents an approachable, short history of the last 50 years of innovation in analytics that I haven’t seen elsewhere, and discusses two types of disruption: disruptive innovation within the analytics value chain, and industry disruption by innovations in analytics. From the perspective of startups and analytics practitioners, success is enabled by disrupting their industries, because using analytics to differentiate a product is a way to create a disruptive business model or to create new markets. From the perspective of investing in analytics technology for their organizations, taking a wait-and-see approach might make sense because technologies at risk of disruption are risky investments due to abbreviated useful lifespans.
(3) One of the best technology business texts I’ve read is “The Limits of Strategy”, by a co-founder of Research Board (acquired by Gartner), an international think tank that investigates developments in the computing world and how corporations should adapt. The author presents very detailed notes from many of his conversations with business leaders, providing insightful analysis throughout about his experiences building (with his wife) a group of clients, major firms that needed to mesh their strategies with the exploding world of computing. As I commented in my review, what sets this book apart from other related efforts are two seemingly opposed characteristics: industry-wide breadth, and intimacy that is only available through face-to-face interaction.
You are the Principal Architect for the data practice of SPR. Could you describe what SPR does?
SPR is a digital technology consultancy based in the Chicago area, delivering technology projects for a range of clients, from Fortune 1000 enterprises to local startups. We build end-to-end digital experiences using a range of technology capabilities, everything from custom software development, user experience, data, and cloud infrastructure, to DevOps coaching, software testing, and project management.
What are some of your responsibilities with SPR?
As principal architect, my key responsibility is to drive solution delivery for clients, leading architecture and development for projects. This often means wearing other hats such as product owner, because being able to relate to how products are built from a hands-on perspective weighs heavily on how work should be prioritized, especially when building from scratch. I’m also pulled into discussions with potential clients when my expertise is needed. In addition, the company recently requested that I start an ongoing series of sessions with fellow architects in the data practice to discuss client projects, side projects, and what my colleagues are doing to keep abreast of technology. This is similar to what I had run for a prior consultancy, although the internal meetups, so to speak, for that firm involved its entire technology practice and were not specific to data work.
For the bulk of my career, I’ve specialized in open source development using Java, performing an increasing amount of data work along the way. In addition to these two specializations, I also do what my colleagues and I have come to call “practical” or “pragmatic” enterprise architecture, which means performing architecture tasks in the context of what is to be built, and actually building it, rather than just talking about it or drawing diagrams about it, realizing of course that these other tasks are also important.
In my view, these three specializations overlap with one another and are not mutually exclusive. I’ve explained to executives the last few years that the line that had been traditionally drawn by the technology industry between software development and data work is no longer well defined, partially because the tooling between these two spaces has converged, and partially because, as a result of this convergence, data work itself has largely become a software development effort. However, since traditional data practitioners typically don’t have software development backgrounds, and vice versa, I help bridge this gap.
What is an interesting project that you are currently working on with SPR?
Just recently, I published the first post in a multi-part case study series about the earlier mentioned data platform that my team and I implemented in AWS from scratch this past year for the CIO of a Chicago-based global consultancy. This platform consists of data pipelines, data lake, canonical data models, visualizations, and machine learning models, to be used by corporate departments, practices, and end customers of the client. While the core platform was to be built by the corporate IT organization run by the CIO, the goal was that this platform would be used by other organizations outside corporate IT as well to centralize data assets and data analysis across the company using a common architecture, building on top of it to meet the use case needs of each organization.
As with many established firms, use of Microsoft Excel was commonplace, with spreadsheets commonly distributed within and across organizations, as well as between the firm and external clients. Additionally, business units and consultancy practices had become siloed, each making use of disparate processes and tooling. So in addition to centralizing data assets and data analysis, another goal was to implement the concept of data ownership, and enable the sharing of data across organizations in a secure, consistent manner.
Is there anything else that you would like to share about open source, SPR or another project that you are working on?
Another project (read about it here and here) that I recently led involved successfully implementing Databricks Unified Analytics Platform, and migrating the execution of machine learning models to it from Azure HDInsight, a Hadoop distribution, for the director of data engineering of a large insurer.
All of these migrated models were intended to predict the level of consumer adoption that can be expected for various insurance products, with some having been migrated from SAS a few years prior at which time the company moved to making use of HDInsight. The biggest challenge was poor data quality, but other challenges included lack of comprehensive versioning, tribal knowledge and incomplete documentation, and immature Databricks documentation and support with respect to R usage at the time (the Azure implementation of Databricks had just been made generally available a few months prior to this project).
To address these key challenges, as a follow-up to our implementation work I made recommendations around automation, configuration and versioning, separation of data concerns, documentation, and needed alignment across their data, platform, and modeling teams. Our work convinced an initially very skeptical Chief Data Scientist that Databricks is the way to go, with their stated goal following our departure to be migration of their remaining models to Databricks as quickly as possible.
This has been a fascinating interview touching on many subjects, and I feel like I have learned a lot about open source. Readers who wish to learn more may visit the SPR corporate website or Erik Gfesser’s website.