
super{set} Lessons: When inference meets engineering

Othmane Rifki
October 30, 2022

SuperSummit


super{set} companies have been very early adopters of data engineering: leveraging software engineering to power data science workflows and solve business problems. Many of us have been on the front lines of the emerging roles that sit between an organization’s traditional software engineers and data scientists – namely the data engineers who optimize the retrieval and use of data to power ML models, and the machine learning engineers who ensure a scalable and flexible environment for ML model pipelines.

At super{summit} 2022, I organized a session for data and ML engineers to come together and share lessons learned from their particular workflow and business situation.

Our conversation distilled 3 ways that data science can benefit from engineering workflows to deliver business value:

  1. Managing the complexity of machine learning lifecycles at scale
  2. Creating business value by seeing models through to deployment and beyond
  3. Preserving data privacy to build trust with consumers

Let’s review!


Managing machine learning lifecycles at scale

Data science teams focus on building models to help businesses solve problems.

For example: identifying hate speech with deep learning models. The performance of those models is assessed with labeled datasets originating from client traffic or other sources. All this is quite manageable at a small scale, when there are only a handful of models to serve and customers that can be counted on one hand.

When data science models scale, things start to break.

The average super{set} company deals with a large number of models that need to be deployed on behalf of multiple clients in a myriad of production environments. Understanding and managing these models and their dependencies at scale, while also mitigating risks that may arise from decision automation (decision-making without human intervention), becomes critical to the success of business operations. Simply put: dollars and livelihoods are on the line as startups scale into meaningful businesses.

Data + ML engineering to the rescue!

Data engineers and ML engineers work together to:

  • Optimize the retrieval of data needed to train models
  • Integrate machine learning models into an organization’s applications and systems
  • Ensure a scalable and flexible machine learning model pipeline from design to serving to monitoring
  • Build robust automation to ease the continuous delivery of model updates while maintaining high quality


Introducing MLOps

super{set} brings more than just in-house expertise to managing lifecycles at scale – a practice known as machine learning operations, or “MLOps.”

One super{set} company, MarkovML, is entirely dedicated to solving the problem of MLOps! MarkovML helps organizations gain visibility into their end-to-end machine learning workflows to reach their business objectives.

As the team from MarkovML shared, MLOps is more than just streamlining the process of deploying, monitoring, and maintaining ML models – it’s about improving the entire lifecycle by providing valuable insights around:

  • The performance of the model
  • The relevance of the data used for training
  • Connecting performance and relevance to the target business value

Once again, it all comes down to solving business problems.

The burdensome, error-prone manual processes of keeping track of an organization’s data and models can be eliminated by automating the end-to-end machine learning workflow, freeing data science teams to focus on extracting insights tied to business objectives. MarkovML’s products make this simpler.


Data governance and model measurement workflow in MarkovML.


Creating business value means seeing models through to deployment and beyond…

Creating business value doesn’t stop with model creation. Each super{set} company made clear that the smooth deployment of new models into production is key to maximizing the value of the product offering.

Ketch, a company that enables organizations to build trust with their consumers via privacy controls and governance for data, shared the importance of ensuring that models developed in isolation in a dev environment are prepared for the production environment. For instance, when a model is developed using Python libraries and production is based on a Java runtime environment, conversion is required.

Data scientists can be well-served by using a model format such as ONNX, which is an open format built specifically to represent machine learning models. Look for model formats that are widely used, have built-in optimizations, and support a variety of machine learning frameworks, operating systems, and hardware platforms.


Post-deployment testing strategies

Deploying models into production is far from the final step in providing business value. A deployed model can start degrading in quality because a static model cannot keep up with new trends – change is the only constant.

My company, Spectrum Labs, is dedicated to protecting users from disruptive behaviors and promoting healthy exchange. We run sanity checks on our models prior to deployment and monitor their performance in production to detect degradation, which may trigger retraining of the model on more representative data.

A typical machine learning project lifecycle at Spectrum Labs.


There are two approaches we at Spectrum Labs take to evaluate and monitor model performance post-deployment:

  1. Regression tests via ground truth evaluation.
    These tests pull in data from live traffic through the following deployment process:
    1. Before deployment, the new model must pass a set of carefully curated data that previous models passed.
    2. Some time after deployment, new data is pulled from live traffic and labeled to obtain ground truth, which is used to verify that the model is not degrading relative to the metrics registered during the training phase.

  2. Smoke tests via drift detection, where the distributions of production data are monitored to make sure they don’t diverge in a statistically significant way from those observed during the training, testing, and development phases.
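The drift-detection idea above can be sketched with a two-sample Kolmogorov-Smirnov test. The feature values and significance threshold below are made up for illustration; they are not Spectrum Labs' production setup.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical reference feature (e.g. a model confidence score) captured
# during the training/testing phase.
reference = rng.normal(loc=0.0, scale=1.0, size=5000)

# Two live-traffic windows: one matching the reference, one shifted.
live_stable = rng.normal(loc=0.0, scale=1.0, size=5000)
live_shifted = rng.normal(loc=0.8, scale=1.0, size=5000)

def has_drifted(reference, live, alpha=0.01):
    """Flag drift when a two-sample Kolmogorov-Smirnov test says the live
    window diverges from the reference in a statistically significant way."""
    _, p_value = ks_2samp(reference, live)
    return p_value < alpha

print(has_drifted(reference, live_stable))   # stable window: no drift expected
print(has_drifted(reference, live_shifted))  # shifted window: drift expected
```

A drift flag like this would feed the retraining trigger described above, rather than replace labeled ground-truth evaluation.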

Sometimes, this feedback loop from the production environment back to prototyping and development is not simply about quality assurance – it is central to the product value proposition itself and how it solves a business problem.

For instance: Sturdy unifies customer feedback from a variety of sources into one channel and uses machine learning to identify signals in the data that impact revenue retention. Through automation, the signals enable Sturdy’s customers to drive critical business processes and to act on customer feedback as soon as data is received.

Sturdy puts all your customer conversations and feedback into a single dashboard.


Preserving data privacy to build trust with consumers

Models depend on data. The quality of the data used has the biggest impact on the performance of a model. In many cases, the business value is derived from data that originates from people.

As data scientists, software engineers, data engineers, and ML engineers, we are not the true owners of this data – just custodians of data that is truly owned by others. In these cases, the management of data and its privacy requires a set of controls to ensure that organizations deliver on their responsibilities to all of their stakeholders.

Of chief concern is understanding the following:

  1. The provenance of data used in training
  2. How data was collected
  3. How data was treated for bias

Beyond the ethical use of data is the secure use of data – data must be kept off local desktops and managed in a secure and traceable manner, with all personal information strictly removed.
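As a minimal, hypothetical sketch of the "personal information strictly removed" requirement, the snippet below redacts obvious identifiers with regular expressions. Production privacy tooling is far more thorough than two patterns; this only illustrates the idea.

```python
import re

# Hypothetical minimal redaction pass: strip obvious identifiers before
# data leaves a controlled environment.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# -> Contact [EMAIL] or [PHONE].
```

Pattern-based redaction catches only well-formed identifiers; names, addresses, and free-text references need dedicated PII-detection tooling.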

Conveniently, super{set} once again has in-house expertise: Ketch offers an infrastructure for data privacy, compliance, and security, and Habu offers a secure data collaboration platform (“data clean room”) with comprehensive analytics.

Habu’s data clean room software allows for privacy-safe collaboration across multiple clients’ first-party data, yielding valuable insights from aggregated data outputs.


Final thoughts

Data and ML engineering is an emerging field. As data scientists, software engineers, data engineers, and ML engineers, it’s always helpful to compare notes and get up to speed on the best practices that peers in other organizations are applying to their products.

Only at super{set} will you get a community of data practitioners that are not just leveraging data to solve business problems, but also building businesses to solve data problems!