Gigabytes of data for a bag of groceries. That’s what you get when you do robotic delivery. That’s a lot of data – especially if you repeat it over a million times, just like we have.
But the rabbit hole goes deeper. The data is also very diverse: robot sensor and image data, user interactions with our app, transaction data from orders, and more. Equally diverse are the use cases, from training deep neural networks to creating beautiful visualizations for our business partners, and everything in between.
So far, we’ve been able to handle all of this complexity with our centralized data team. But continued exponential growth has us looking for new ways of working to keep pace.
We found that the data mesh paradigm was the best way forward. I’ll describe Starship’s take on the data mesh below, but first, let’s briefly summarize the approach and why we decided to go with it.
What is a data mesh?
The primary purpose of a data mesh framework is to help large organizations remove data engineering bottlenecks and deal with complexity. As such, it addresses many details of the enterprise environment, from data quality, architecture and security to governance and organizational structure. So far, only a handful of companies have publicly declared their adherence to the data mesh paradigm – all of them large, multi-billion dollar enterprises. Nonetheless, we think it can be successfully applied to smaller companies as well.
Data mesh at Starship
Work with data close to the people producing or consuming the information
In order to run hyperlocal robotic delivery globally, we need to turn all kinds of data into valuable products. Data comes from robots (e.g. telemetry, routing decisions, ETAs), from merchants and customers (their app usage, orders, products, etc.) and from all operational aspects of the business (from short teleoperator tasks to the global logistics of spare parts and robots).
The diversity of use cases is a key reason why we are drawn to the data mesh approach – we want to work with data very close to the people who produce or consume the information. By following data mesh principles, we hope to meet the diverse data needs of our teams while maintaining central oversight.
Since Starship has not yet reached enterprise scale, we cannot implement all aspects of the data mesh. Instead, we’ve identified a simplified approach that makes sense for us now and puts us on the right path for the future.
Define what your data products are – each product has owners, interfaces and users
Applying product thinking to our data is the foundation of the entire approach. We treat anything that exposes data to other users or processes as a data product. It can expose its data in any form: as BI dashboards, Kafka topics, data warehouse views, responses from predictive microservices, etc.
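To make the product framing concrete, here is a minimal sketch of how a data product could be described by its owner, interface type, and users. All of the names below (`DataProduct`, `INTERFACE_KINDS`, the example products and owners) are hypothetical illustrations, not part of Starship’s actual tooling:

```python
from dataclasses import dataclass, field

# Interface kinds mirror the examples from the text: dashboards, Kafka
# topics, warehouse views, predictive microservices.
INTERFACE_KINDS = {"bi_dashboard", "kafka_topic", "warehouse_view", "microservice"}

@dataclass
class DataProduct:
    name: str
    owner: str                       # usually the data scientist/engineer who built it
    interface: str                   # how the product exposes its data
    users: list = field(default_factory=list)

    def __post_init__(self):
        if self.interface not in INTERFACE_KINDS:
            raise ValueError(f"unknown interface kind: {self.interface}")

# Hypothetical examples: a dashboard product and a streaming product.
dashboard = DataProduct("website-traffic", owner="alice",
                        interface="bi_dashboard", users=["marketing"])
telemetry = DataProduct("robot-telemetry", owner="bob",
                        interface="kafka_topic", users=["routing", "ml"])
```

The point of such an explicit record is simply that every product has a named owner, a declared interface, and known users – the three things product thinking asks for.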
A simple example of a data product at Starship might be a BI dashboard for tracking traffic to our website. A more involved example is a self-service pipeline that robotics software engineers use to send any type of driving data from a robot to our data lake.
In any case, we don’t see our data warehouse (actually a Databricks Lakehouse) as a single product, but as a platform that supports multiple interconnected products. Such fine-grained products are often owned by the data scientists/engineers who build and maintain them, rather than by dedicated product managers.
Product owners should know who their users are and what needs the product solves for them – and, based on that, define and meet quality expectations for the product. Partly because of this, we have started to pay more attention to interfaces, the components that are critical to usability but laborious to modify.
Most importantly, understanding users and the value each product creates for them makes it easier to prioritize between ideas. This is critical in startup environments where you need to move fast and don’t have time to make everything perfect.
Group your data products into domains that reflect your company’s organizational structure
Even before learning about the data mesh model, we had successfully used lightly embedded data scientists at Starship for a while. In practice, several key teams have a data team member working with them part-time – whatever that means for any particular team.
We went on to define data domains according to our organizational structure, this time carefully covering every part of the company. After mapping data products to domains, we assigned a data team member to manage each domain. This person is responsible for the entire suite of data products in the domain – some of which they own themselves, some of which are owned by other engineers on the domain team, and some even by other data team members (for example, for resourcing reasons).
We like many aspects of the domain setup. First and foremost, there are now people responsible for data architecture in every area of the company. Given the subtleties inherent in each domain, this would hardly be possible without dividing the work this way.
Creating structure in our data products and interfaces also helps us better understand our data landscape. For example, since there are more domains than data team members (currently 19 vs. 7), we now do a better job of ensuring that each of us works on a set of interconnected topics. We also understand that to ease the growing pains, we should minimize the number of interfaces used across domain boundaries.
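To make the idea of minimizing cross-domain interfaces concrete, here is a toy sketch that counts how many product dependencies cross a domain boundary. The product and domain names are invented for illustration and do not describe Starship’s real product map:

```python
# Hypothetical product -> domain mapping.
product_domain = {
    "robot-telemetry": "robotics",
    "eta-model": "routing",
    "orders-view": "commerce",
    "partner-dashboard": "commerce",
}

# Hypothetical dependency graph: product -> products it consumes.
dependencies = {
    "eta-model": ["robot-telemetry"],
    "partner-dashboard": ["orders-view", "eta-model"],
}

def cross_domain_edges(product_domain, dependencies):
    """Return (producer, consumer) pairs whose domains differ."""
    edges = []
    for consumer, producers in dependencies.items():
        for producer in producers:
            if product_domain[consumer] != product_domain[producer]:
                edges.append((producer, consumer))
    return edges
```

In this toy map, `eta-model` consuming `robot-telemetry` and `partner-dashboard` consuming `eta-model` both cross domain boundaries, while `partner-dashboard` consuming `orders-view` stays inside the commerce domain. Keeping that cross-domain count small is exactly the growing-pains lever described above.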
Finally, a more subtle benefit of using data domains: we now feel like we have a recipe for all kinds of new situations. Every time a new initiative emerges, it is clearer to everyone where it belongs and which existing products it should connect to.
There are still some unresolved questions. Some domains naturally tend to primarily expose source data, others tend to consume and transform it, and some do both. Should we split domains when they get too big? Or should we have subdomains within a bigger domain? These are decisions we will need to make in the future.
Empower the people who build data products by standardizing rather than centralizing
The goal of the data platform at Starship is simple: to enable a single data person (usually a data scientist) to work on a domain end-to-end, without involving the central data platform team in the day-to-day work. This requires providing domain engineers and data scientists with good tools and standard building blocks for their data products.
Does this mean you need a full data platform team to use the data mesh approach? Not really. Our data platform team consists of one data platform engineer, who also spends half of their time embedded in a domain. The main reason we can be so lean on data platform engineering is our choice of Spark+Databricks as the core of our data platform. Given the diversity of our data domains, our previous, more traditional data warehouse architecture created significant data engineering overhead.
We find it useful to clearly differentiate the components that are part of the platform from everything else in the data stack. Some examples of what we provide to domain teams as part of the data platform:
- Databricks+Spark as a working environment and a multi-functional computing platform;
- One-liner functions for data ingestion, e.g. from Mongo collections or Kafka topics;
- Airflow instance for scheduling data pipelines;
- Templates for building and deploying predictive models as microservices;
- Cost tracking of data products;
- BI and visualization tools.
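To give a flavor of what a “one-liner function for data ingestion” might look like, here is a hedged sketch. The helper below only assembles the reader configuration a domain engineer would otherwise repeat; inside Databricks, the real version would pass these options to `spark.read`. The function name, `PLATFORM_MONGO_URI`, and the option keys are all placeholders (actual option names depend on the MongoDB Spark connector version in use):

```python
# Hypothetical platform helper for ingesting a Mongo collection.
# It builds the read configuration only; no Spark session is created here.
PLATFORM_MONGO_URI = "mongodb://example-host:27017"  # placeholder, not a real endpoint

def mongo_ingest_config(database: str, collection: str) -> dict:
    """Bundle the connection details the platform team standardizes,
    so a domain data scientist only names the database and collection."""
    return {
        "format": "mongodb",
        "options": {
            "connection.uri": PLATFORM_MONGO_URI,
            "database": database,
            "collection": collection,
        },
    }

# The "one-liner" a domain engineer would write:
cfg = mongo_ingest_config("orders", "deliveries")
```

The design point is that the platform owns the connection and format details, while the domain owns only the what-to-ingest decision.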
As a general approach, we aim to standardize as much as possible in our current environment – even the parts we know won’t remain standardized forever. As long as it helps productivity now and doesn’t block any part of the process, we’re happy. Of course, some elements are still completely missing from the platform. For example, tools for data quality assurance, data discovery, and data lineage are things we have left for the future.
Feedback loops support strong individual ownership
Having fewer people and teams is actually an asset in some aspects of governance, such as easier decision-making. On the other hand, our key governance issues are also a direct consequence of our size. If there is only one data person per domain, they cannot be expected to be an expert in every potential technology. Yet they are the only ones with detailed knowledge of their domain. How do we maximize their chances of making the right choices there?
Our answer: through a culture of ownership, discussion and feedback within the team. We have borrowed heavily from Netflix’s management concepts and cultivated the following practices:
- personal responsibility for results (one’s product and domain);
- seeking out different opinions, especially on decisions affecting other areas, before deciding;
- asking for feedback and code reviews, both as a quality mechanism and as an opportunity for personal growth.
We also have some specific agreements on how to handle quality, write down our best practices (including naming conventions), and more. But we believe that a good feedback loop is a key factor in turning the guidelines into reality.
These principles also apply beyond the “build” work of our data team, which is the focus of this blog post. Obviously, there is more to how our data scientists create value at the company than building data products.
One last thought on governance – we will continue to iterate on the way we work. There will never be a single “best” way of doing things, and we know we need to adjust over time.
That’s it! These are the four core data mesh concepts applied at Starship. As you can see, we’ve found a data mesh approach that works for us as an agile growth-stage company. If this sounds appealing in your context, I hope reading about our experience was helpful.
If you have any questions or ideas, please contact me and let’s learn from each other!