You can’t simply throw more data scientists at a model and expect its accuracy to magically improve.
(Part 1 here)
Not all data science projects are created equal.
The majority of data science projects I’ve seen or built started life as one-time proofs of concept: temporary hacks to get something important working.
Some of these projects turn out to be different, a little bigger or more central to achieving the organization’s goals.
Only a select few grow and mature over time.
These special projects typically solve problems of particular interest to the organization. Examples include CTR prediction tools for online advertising networks, image segmentation models for visual effects generators, and profanity detection tools for content filtering services.
These are also the projects that justify significant corporate resources to optimize, and rightly so. Organizations should be willing to spend money when even a small improvement in some accuracy metric directly translates into revenue, or can make or break a product launch or funding round.
The resource in question in this post is the data scientist.
If you’ve never managed a project, team, or company, treating people as “resources” may seem strange. Keep in mind, however, that these are experts with limited time available, and that time should be used to accomplish tasks that benefit the organization.
In short: resources need to be managed, and their use optimized.
Once a model becomes large and central enough that more than a few data scientists are working on improving it, they may step on each other’s toes, block each other, or otherwise impede each other’s work. This must not happen. Rather, team members should be able to easily help each other and build on each other’s successes.
A common practice I’ve seen in many places is that each member of the team takes ownership of their own “something.” Depending on the project, that might be a model, an optimization algorithm, a deep learning architecture, a set of engineered features, and so on.
This mode of working may seem to keep members orthogonal: each member can work independently, and no dependencies are created that could block progress.
But as I’ve argued before, that’s far from true.
For example, if one team member gets a big win from a particularly informative feature, others may want to try the same feature in their own models.
At some point, the performance of a particular model may improve dramatically, and soon branched versions of that best model appear, each slightly different from the next. This is because optimization, whether by gradient descent or by human ingenuity, tends to search for better optima near the current one.
This scenario can result in much higher coupling and more dependencies than previously anticipated.
And even if not every team converges in this way, we would still like to introduce some standardization, such as enforcing contracts with downstream consumers, to ease deployment and save our machine learning engineers time.
We want data scientists working on the same problem to be independent on the one hand, yet able to reuse each other’s work on the other.
As a running example, suppose we are members of a team working on the Iris flower dataset. The training data is small enough to keep in an in-memory pandas dataframe, but the tools we come up with should apply to data of any type and size.
We want to allow creative freedom: each member has complete freedom in choosing a modeling framework, be it scikit-learn, Keras, custom Python logic, and so on.
Our main tool is applying OOP principles to abstract processes and normalize individual work into a unified language.
This post demonstrates how to abstract data science processes to facilitate teamwork. The main point is not the concrete abstractions we come up with. The main point is that data science managers and leaders should strive to facilitate the work of data scientists, whether through abstractions, protocols, version control, streamlined processes, or some other method.
This blog post is in no way an endorsement of reinventing the wheel. The choice between off-the-shelf products, open source tools, and an in-house solution should be made together by the data science and machine learning engineering teams involved in the project.
Now let’s get down to business.
When we’re done, we’d like a unified framework for running models through the entire pipeline, from training to prediction. So let’s start by defining a common pipeline.
- First, we get training data as input.
- We may need to extract additional features to enrich the dataset.
- We create a model and train it iteratively until we are satisfied with the loss or metrics.
- At that point, we save the model to disk or some other persistence mechanism.
- Later, we must be able to load the model back into memory.
- Then we can apply it to predict on new, unseen data.
Let’s follow the pipeline above to declare the basic structure (aka interface) of our model.
class Model:
    def add_features(self, x):
        ...

    def train(self, x, y, train_parameters=None):
        ...

    def save(self, model_dir_path):
        ...

    @classmethod
    def load(cls, model_dir_path):
        ...

    def predict(self, x):
        ...
Note that this is not much different from the interfaces of the frameworks we’re used to. However, each framework has its own little quirks, such as whether it says “fit” or “train,” and how it persists models to disk. Encapsulating the pipeline in a uniform structure means those implementation details don’t leak anywhere else, for example when swapping models in a deployment environment.
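To make the intended flow concrete, here is a minimal sketch of how such a unified pipeline might be exercised end to end. MeanModel is a hypothetical stand-in, not from the article, that simply “trains” by memorizing the mean of the targets:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy implementation of the interface above: it trains
# by memorizing the mean of y and persists that single number as JSON.
class MeanModel:
    def __init__(self):
        self.mean = None

    def add_features(self, x):
        pass  # this toy model adds no features

    def train(self, x, y, train_parameters=None):
        self.mean = sum(y) / len(y)

    def save(self, model_dir_path):
        path = Path(model_dir_path)
        path.mkdir(parents=True, exist_ok=True)
        (path / "model.json").write_text(json.dumps({"mean": self.mean}))

    @classmethod
    def load(cls, model_dir_path):
        model = cls()
        state = json.loads((Path(model_dir_path) / "model.json").read_text())
        model.mean = state["mean"]
        return model

    def predict(self, x):
        return [self.mean for _ in x]

# Exercise the full pipeline: train -> save -> load -> predict.
model = MeanModel()
model.train([[1], [2], [3]], [2.0, 4.0, 6.0])
model_dir = Path(tempfile.mkdtemp()) / "mean_model"
model.save(model_dir)
preds = MeanModel.load(model_dir).predict([[4], [5]])
print(preds)  # [4.0, 4.0]
```

The point is that the caller never needs to know how the model trains or persists itself; any implementation of the same five methods could be dropped in.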
After defining the basic structure, it’s time to explain how to actually use it.
Features
We want “features” to be elements that can easily be carried over and added to various models, keeping in mind that each model may use several of them.
Let’s implement a kind of plugin infrastructure with a base Feature class. With a common base class for all features, the Model class can apply the various features to the input data sequentially once it receives it.
Encapsulating the model
We also want to encapsulate the actual model in our system and pass it between team members, while keeping the option to change model parameters without writing much new code.
We abstract it in another class, named ModelInterface to avoid confusion with our Model class. The latter delegates calls to the corresponding methods of the former.
A feature can be thought of as a function that takes a pandas dataframe as input.
Giving each feature a unique name and the same interface as every other feature makes these features very easy to reuse.
Let’s define the base class.
from abc import ABC, abstractmethod

class Feature(ABC):
    @abstractmethod
    def add_feature(self, data):
        ...
And let’s create an implementation, for example the length of the sepal diagonal.
class SepalDiagonalFeature(Feature):
    def add_feature(self, data):
        data['SepalDiagonal'] = (data.SepalLength ** 2 +
                                 data.SepalWidth ** 2) ** 0.5
Since we will be using instances of this class, let’s create a separate file to curate all our features:
sepal_diagonal = SepalDiagonalFeature()
This particular implementation already shows some decisions we made, consciously or not.
- The output column name is a literal inside the feature’s code and is not stored anywhere else. This means we cannot easily produce a list of known columns.
- I chose to add the new column to the input dataframe inside the add_feature function, rather than returning the column and letting the outer scope add it.
- There is no way to tell which columns a feature depends on other than reading its code. Had we declared dependencies explicitly, we could have built a DAG to determine the feature creation order.
For now, these decisions are easy to reverse, but once dozens of features have been built this way, applying a change to the base class may require refactoring every one of them. In other words, you need to decide in advance what you expect from the system, and be aware of the implications of each choice.
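To illustrate the third point, here is a hypothetical variant, not part of the article’s framework, in which each feature declares its input and output columns so that a creation order can be derived automatically:

```python
from abc import ABC, abstractmethod

# Hypothetical extension: features declare their input/output columns,
# so the creation order can be derived instead of hand-maintained.
class DeclaredFeature(ABC):
    input_columns: tuple = ()
    output_column: str = ''

    @abstractmethod
    def add_feature(self, data):
        ...

def creation_order(features):
    """Order features so each runs after the ones producing its inputs."""
    produced = {f.output_column: f for f in features}
    ordered, seen = [], set()

    def visit(f):
        if f.output_column in seen:
            return
        seen.add(f.output_column)
        for col in f.input_columns:
            if col in produced:  # column made by another feature
                visit(produced[col])
        ordered.append(f)

    for f in features:
        visit(f)
    return ordered

# Two dummy features where 'b' depends on the column produced by 'a'.
class AFeature(DeclaredFeature):
    input_columns = ('x',)
    output_column = 'a'
    def add_feature(self, data):
        data['a'] = data['x'] + 1

class BFeature(DeclaredFeature):
    input_columns = ('a',)
    output_column = 'b'
    def add_feature(self, data):
        data['b'] = data['a'] * 2

order = [f.output_column for f in creation_order([BFeature(), AFeature()])]
print(order)  # ['a', 'b']
```

Whether this extra ceremony pays off depends on how many interdependent features the project accumulates; it is exactly the kind of trade-off the bullet points above ask you to weigh in advance.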
Let’s go back to the Model base class and implement the add_features function:
def __init__(self, features: Sequence[Feature] = tuple()):
    self.features = features

def add_features(self, x):
    for feature in self.features:
        feature.add_feature(x)
Now anyone can take the sepal_diagonal feature and use it when creating a model instance.
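Putting the pieces together, here is a minimal runnable sketch of a feature flowing through a model. A plain dict stands in for the pandas dataframe so the example is dependency-free:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class Feature(ABC):
    @abstractmethod
    def add_feature(self, data):
        ...

# Same feature as in the article, using item access so it also
# works on a plain dict standing in for a dataframe row.
class SepalDiagonalFeature(Feature):
    def add_feature(self, data):
        data['SepalDiagonal'] = (data['SepalLength'] ** 2 +
                                 data['SepalWidth'] ** 2) ** 0.5

sepal_diagonal = SepalDiagonalFeature()

class Model:
    def __init__(self, features: Sequence[Feature] = tuple()):
        self.features = features

    def add_features(self, x):
        for feature in self.features:
            feature.add_feature(x)

model = Model([sepal_diagonal])
row = {'SepalLength': 3.0, 'SepalWidth': 4.0}
model.add_features(row)
print(row['SepalDiagonal'])  # 5.0
```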
If our abstractions didn’t encourage reuse, Alice might copy Bob’s logic, modify it slightly to fit her preprocessing, and rename things along the way, and technical debt would pile up.
A question may arise: “What about common operations such as addition? Do I need to implement addition anew every time I use it?”
The answer is no. For this we can use instance fields, via the self parameter:
from dataclasses import dataclass

@dataclass
class AdditionFeature(Feature):
    col_a: str
    col_b: str
    output_col: str

    def add_feature(self, data):
        data[self.output_col] = data[self.col_a] + data[self.col_b]
For example, to add the petal length and width, we create an instance like this: petal_sum = AdditionFeature('petalLength', 'petalWidth', 'petalSum').
You may need to implement a class per operator or function, which can seem intimidating at first, but you’ll quickly find the list is quite short.
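One way to keep that list even shorter is a single class parameterized by the operation itself, using Python’s operator module. This is a hypothetical generalization, not from the article:

```python
import operator
from abc import ABC, abstractmethod
from dataclasses import dataclass

class Feature(ABC):
    @abstractmethod
    def add_feature(self, data):
        ...

# Hypothetical generalization of AdditionFeature: one class covering
# +, -, *, / by taking the operation as a parameter.
@dataclass
class BinaryOperationFeature(Feature):
    col_a: str
    col_b: str
    output_col: str
    op: callable = operator.add

    def add_feature(self, data):
        data[self.output_col] = self.op(data[self.col_a], data[self.col_b])

petal_sum = BinaryOperationFeature('petalLength', 'petalWidth', 'petalSum')
row = {'petalLength': 1.5, 'petalWidth': 0.5}  # dict standing in for a row
petal_sum.add_feature(row)
print(row['petalSum'])  # 2.0
```

The same instance works unchanged on a pandas dataframe, since the operators broadcast over columns.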
The abstraction we use for the model interface is:
from pathlib import Path

class ModelInterface(ABC):
    @abstractmethod
    def initialize(self, model_parameters: dict):
        ...

    @abstractmethod
    def train(self, x, y, train_parameters: dict):
        ...

    @abstractmethod
    def predict(self, x):
        ...

    @abstractmethod
    def save(self, model_interface_dir_path: Path):
        ...

    @classmethod
    def load(cls, model_interface_dir_path: Path):
        ...
An example implementation wrapping a scikit-learn model is shown below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer

class SKLRFModelInterface(ModelInterface):
    def __init__(self):
        self.model = None
        self.binarizer = None

    def initialize(self, model_parameters: dict):
        forest = RandomForestClassifier(**model_parameters)
        self.model = MultiOutputClassifier(forest, n_jobs=2)

    def train(self, x, y, train_parameters=None):
        self.binarizer = LabelBinarizer()
        y = self.binarizer.fit_transform(y)
        return self.model.fit(x, y)

    def predict(self, x):
        return self.binarizer.inverse_transform(self.model.predict(x))

    def save(self, model_interface_dir_path: Path):
        ...

    @classmethod
    def load(cls, model_interface_dir_path: Path):
        ...
As you can see, this code mostly delegates the various actions to the off-the-shelf model. In train and predict we also translate between label values and one-hot encoded vectors, adapting our business needs to the scikit-learn interface.
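The save and load bodies are elided above, and the post defers the real implementation to the companion repository. Purely as an illustration, one possible approach is to pickle the fitted model and binarizer together; the dict and string below are stand-ins for the fitted objects:

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative sketch only: persist and restore a model/binarizer pair
# with pickle. The stand-in objects here mirror the two attributes
# of SKLRFModelInterface; the same code works for fitted estimators.
class PickleModelInterface:
    def __init__(self, model=None, binarizer=None):
        self.model = model
        self.binarizer = binarizer

    def save(self, model_interface_dir_path):
        path = Path(model_interface_dir_path)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / 'model.pkl', 'wb') as f:
            pickle.dump((self.model, self.binarizer), f)

    @classmethod
    def load(cls, model_interface_dir_path):
        with open(Path(model_interface_dir_path) / 'model.pkl', 'rb') as f:
            model, binarizer = pickle.load(f)
        return cls(model, binarizer)

iface = PickleModelInterface(model={'weights': [1, 2]}, binarizer='stub')
save_dir = Path(tempfile.mkdtemp()) / 'iface'
iface.save(save_dir)
restored = PickleModelInterface.load(save_dir)
print(restored.model)  # {'weights': [1, 2]}
```

Note that pickle is only safe for artifacts you produced and trust yourself; a production system may prefer a format-specific mechanism such as joblib for scikit-learn estimators.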
We can now update the Model class to work with a ModelInterface instance. Here it is in full:
class Model:
    def __init__(self, features: Sequence[Feature] = tuple(),
                 model_interface: ModelInterface = None,
                 model_parameters: dict = None):
        model_parameters = model_parameters or {}
        self.features = features
        self.model_interface = model_interface
        self.model_parameters = model_parameters
        model_interface.initialize(model_parameters)

    def add_features(self, x):
        for feature in self.features:
            feature.add_feature(x)

    def train(self, x, y, train_parameters=None):
        train_parameters = train_parameters or {}
        self.add_features(x)
        self.model_interface.train(x, y, train_parameters)

    def predict(self, x):
        self.add_features(x)
        return self.model_interface.predict(x)

    def save(self, model_dir_path: Path):
        ...

    @classmethod
    def load(cls, model_dir_path: Path):
        ...
Once again, we create a file to curate our models, and add the following line to it:
best_model_so_far = Model([sepal_diagonal], SKLRFModelInterface(), {})
Note that best_model_so_far is a reusable instance, but it is not trained. To create a reusable trained instance, we must persist the model.
I won’t go into the details of saving and loading in this post, as it would be redundant, but feel free to check out my clean data science GitHub repository for a fully working example.
The framework proposed in this post is not a one-size-fits-all solution to standardizing the work of a data science team around a single model, nor should it be treated as one. Each project has its own nuances and niches to address.
Rather, the framework proposed here should serve as a basis for further discussion, with one topic in the spotlight: accelerating the work of data scientists.
Streamlining work should generally be a goal set by data science team leaders and managers, and abstraction is just one item in the toolbox.
Q: Shouldn’t I use a Protocol instead of an ABC if I only require specific functionality from subclasses?
A: Yes, but this is not an advanced Python course. There is a Hebrew saying: “a pedant cannot teach.” So there you go.
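For readers curious what that alternative looks like, here is a short sketch with typing.Protocol (hypothetical, not part of the article’s framework): any class with a matching add_feature method satisfies the type, with no inheritance required.

```python
from typing import Protocol

# Structural typing alternative to the Feature ABC: any object with a
# compatible add_feature method counts, no base class needed.
class FeatureProtocol(Protocol):
    def add_feature(self, data) -> None:
        ...

class SepalDiagonalFeature:  # note: no inheritance
    def add_feature(self, data) -> None:
        data['SepalDiagonal'] = (data['SepalLength'] ** 2 +
                                 data['SepalWidth'] ** 2) ** 0.5

def apply_feature(feature: FeatureProtocol, data) -> None:
    feature.add_feature(data)

row = {'SepalLength': 3.0, 'SepalWidth': 4.0}
apply_feature(SepalDiagonalFeature(), row)
print(row['SepalDiagonal'])  # 5.0
```

A type checker such as mypy verifies the structural match; at runtime the Protocol imposes nothing.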
Q: What if I want to remove a feature? That’s important too!
A: Of course. You can also choose where to drop it: in a parameterized Feature that deletes the column, or inside the ModelInterface implementation, for example.
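A sketch of the first option, a hypothetical parameterized feature that drops a column (again using a dict as a stand-in for the dataframe):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

class Feature(ABC):
    @abstractmethod
    def add_feature(self, data):
        ...

# Hypothetical parameterized feature that removes a column. It keeps
# the same add_feature interface even though it deletes rather than adds,
# so it slots into the existing pipeline unchanged.
@dataclass
class DropColumnFeature(Feature):
    column: str

    def add_feature(self, data):
        del data[self.column]  # for a dataframe, data.drop(columns=...) instead

row = {'SepalLength': 3.0, 'noise': 1}
DropColumnFeature('noise').add_feature(row)
print(sorted(row))  # ['SepalLength']
```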
Q: What if I want to compare and measure models?
A: A higher-level mechanism for tracking model metrics would be great, but that is outside the scope of this post.
Q: How can I track the trained model?
A: This could be as simple as a list of paths where you saved your trained models. Just be sure to give them meaningful names.
Q: Shouldn’t we also abstract dataset creation (before the data is passed to the train function)?
A: I tried to dodge that one, but took an arrow to the knee. It’s certainly a great idea to have different samples of a complete dataset, or simply multiple datasets, that can be passed around just like features and model interfaces.
Q: Aren’t you being harsh on data scientists?
A: There are pros and cons to weigh here. The restrictive nature of this abstraction takes some getting used to, but it can save a lot of time down the road.