What is the output of Azure Data Factory pipeline that uses the AzureMLBatchExecution activity to retrain a model?
prediction file
model.ilearner file
training data file
All the options
Answer: model.ilearner file
Customers working with Azure Machine Learning models have been leveraging the built-in AzureMLBatchExecution activity in Azure Data Factory pipelines to operationalize ML models in production and score new data against pre-trained models at scale. But as the trends and variables that influence a model's parameters change over time, this pipeline should ideally also support recurring, automated retraining and updates to the model with the latest training data. Azure Data Factory now lets you do just that with the newly released AzureMLUpdateResource activity.
With Azure ML you typically first set up your scoring and training experiments, then publish two separate web service endpoints, one for each experiment. Next, you can use the AzureMLBatchExecution activity in Data Factory both to score incoming data against the latest model hosted by the scoring web service and to run scheduled retraining with the latest training data. The scoring web service endpoint also exposes an Update Resource method that can be used to update the model the scoring web service uses. This is where the new AzureMLUpdateResource activity comes into the picture: it takes the model generated by the training activity and provides it to the scoring web service, updating the model used for scoring on a schedule, all automated within your existing data factory pipeline.
Retrain Pipeline
Setting up the experiment endpoints
Below is an overview of the relationship between training and scoring endpoints in Azure ML. Both originate from an experiment in Azure ML Studio, and both are available as Batch Execution Services. A training web service receives training data and produces trained model(s). A scoring web service receives unlabeled data examples and makes predictions.
AML Retraining Diagram
To create the retraining and updating scenario, follow these general steps:
Create your experiment in Azure ML Studio.
When you are satisfied with your model, use Azure ML Studio to publish web services for both the training experiment and the scoring experiment.
The scoring web service endpoint is used to make predictions on new data examples. The prediction output can take various forms, such as a .csv file or rows in an Azure SQL database, depending on the configuration of the experiment.
The training web service is used to generate new, trained models from new training data. The output of retraining is a .ilearner file in Azure Blob storage.
For detailed instructions on creating web service endpoints for retraining, refer to our documentation.
You can view the web service endpoints in the Azure Management Portal. These will be referenced in ADF Linked Services and Activities.
ML Endpoints
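Once the endpoints are visible in the portal, they can be referenced from ADF Linked Services. As a rough sketch in the ADF v1 JSON authoring format (the name, URLs, and API key below are hypothetical placeholders), the scoring-side AzureML Linked Service carries both the batch execution endpoint and the Update Resource endpoint; the training-side Linked Service looks the same minus the updateResourceEndpoint property:

```json
{
  "name": "ScoringEndpointLinkedService",
  "properties": {
    "type": "AzureML",
    "typeProperties": {
      "mlEndpoint": "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<scoring-service-id>/jobs",
      "apiKey": "<scoring-endpoint-api-key>",
      "updateResourceEndpoint": "https://management.azureml.net/workspaces/<workspace-id>/webservices/<service-id>/endpoints/<endpoint-name>"
    }
  }
}
```

The values in angle brackets come from the endpoint pages shown in the portal above.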
Retraining and updating an Azure ML model using ADF
The operationalized retraining and updating scenario in ADF consists of the following elements:
One or more storage Linked Services and Datasets for the training data, with the storage type matching the training experiment's input. For example, if the experiment is configured with a Web Service input, the training data should come from Azure Blob storage; if the data will be pulled from an Azure SQL table, the experiment should be configured with a Reader module. In your scenario, the training input might be produced by an ADF activity, such as a Hadoop process or a Copy Activity, or it might be generated by some external process.
One or more storage Linked Services and Azure Blob Datasets to receive the newly trained model .ilearner file(s).
AzureML Linked Service and AzureMLBatchExecution Activity to call the training web service endpoint. The training Datasets will be the inputs, and the .ilearner Dataset the output, of this Activity.
AzureML Linked Service and AzureMLUpdateResource Activity for the scoring experiment endpoint to be updated. A .ilearner Dataset will be the input to this Activity. If there are multiple models to be updated, there will be one Activity for each. If there are multiple endpoints to be updated, there will be a Linked Service and Activity for each.
Storage Linked Service and Dataset for the Activity output. The Azure ML Update Resource API call does not generate any output, but today in ADF, an output dataset is required to drive the Pipeline schedule.
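Putting these elements together, the retrain-and-update pipeline chains two activities: a batch execution call against the training endpoint, followed by an update resource call against the scoring endpoint. The sketch below uses the ADF v1 JSON authoring format; the activity, dataset, linked service, and trained-model names are hypothetical placeholders, not values from the source:

```json
{
  "name": "RetrainAndUpdatePipeline",
  "properties": {
    "activities": [
      {
        "name": "RetrainModel",
        "type": "AzureMLBatchExecution",
        "linkedServiceName": "TrainingEndpointLinkedService",
        "inputs": [ { "name": "TrainingData" } ],
        "outputs": [ { "name": "TrainedModelBlob" } ],
        "typeProperties": {
          "webServiceInput": "TrainingData",
          "webServiceOutputs": { "output1": "TrainedModelBlob" }
        }
      },
      {
        "name": "UpdateScoringModel",
        "type": "AzureMLUpdateResource",
        "linkedServiceName": "ScoringEndpointLinkedService",
        "inputs": [ { "name": "TrainedModelBlob" } ],
        "outputs": [ { "name": "PlaceholderOutputBlob" } ],
        "typeProperties": {
          "trainedModelName": "Trained Model",
          "trainedModelDatasetName": "TrainedModelBlob"
        }
      }
    ]
  }
}
```

Note the PlaceholderOutputBlob dataset on the second activity: the Update Resource call itself produces no output, but an output dataset is still required to drive the pipeline schedule.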
The output of an Azure Data Factory pipeline that uses the AzureMLBatchExecution activity to retrain a model is a model.ilearner file.
- The training web service is used to generate new, trained models from the latest training data. The retraining output is a .ilearner file stored in Azure Blob storage.
- An AzureML Linked Service and the AzureMLBatchExecution Activity are used to call the training web service endpoint.
- The training Datasets are the inputs to this Activity, and the .ilearner Dataset is its output.