Azure Language Services
Posted on Mon 15 September 2025 in ai
Create resources in Bicep
Before we make any calls to the Azure language service, we need to define our resources. You can of course do this through the UI, but ideally these things are done idempotently through pipelines so you get a repeatable result and can be confident in how the deployment will run regardless of which environment in your release phase you're releasing to. I'm a fan of Bicep, though I'm sure the same result could be achieved with HashiCorp's Terraform.
In Bicep we'll start by defining the storage account that will hold the training data our language service runs on. It's a fairly ordinary blob storage account with a container for training data; the only thing of note is that we need to define some CORS rules explicitly to allow access from the Language Studio portal.
resource languageServiceStorage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: '${environmentName}langstorage'
  location: environmentSettings.location
  tags: tags
  identity: {
    type: 'SystemAssigned'
  }
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
  properties: {
    allowBlobPublicAccess: false
    minimumTlsVersion: 'TLS1_2'
    accessTier: 'Hot'
  }

  resource blobStorage 'blobServices@2023-05-01' = {
    name: 'default'
    properties: {
      cors: {
        corsRules: [
          {
            allowedOrigins: ['https://language.cognitive.azure.com']
            allowedMethods: [
              'GET'
              'PUT'
              'POST'
              'HEAD'
              'OPTIONS'
            ]
            maxAgeInSeconds: 3600
            exposedHeaders: ['*']
            allowedHeaders: ['*']
          }
        ]
      }
    }

    resource trainingdata 'containers@2023-05-01' = {
      name: 'trainingdata'
      properties: {
        publicAccess: 'None'
      }
    }
  }
}
Once we've defined our storage account, let's define the language service itself. Using a custom subdomain is recommended; from memory there are some issues obtaining access tokens when interacting via a user-assigned managed identity if you use the regional endpoints. The 'userOwnedStorage' entry is what links your language service to your storage account.
resource langStudio 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: '${environmentName}languageservice'
  location: environmentSettings.location
  tags: tags
  identity: {
    type: 'SystemAssigned'
  }
  kind: 'TextAnalytics'
  sku: {
    name: 'S'
  }
  properties: {
    publicNetworkAccess: 'Enabled'
    customSubDomainName: '${environmentName}languageservice'
    userOwnedStorage: [
      {
        resourceId: languageServiceStorage.id
      }
    ]
  }
}
Our last step for now is to give our language service the correct role-based access to the storage account.
resource langToStorageRbac 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(languageServiceStorage.id, 'StorageBlobDataContributor', langStudio.id)
  scope: languageServiceStorage
  properties: {
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      'ba92f5b4-2d11-453d-a403-e96b0029c9fe' // Storage Blob Data Contributor
    )
    principalId: langStudio.identity.principalId
    principalType: 'ServicePrincipal'
  }
}
That should be us all set up to move on and create the project for whatever classification we want to do. It's worth noting at this point that if you're in the Azure infrastructure ecosystem, I thoroughly recommend using managed identities for your app resources. If you are doing this, remember that you'll need to add RBAC for your app service to connect to the language service, and you'll also need the Cognitive Services Language Owner role assigned to whatever identity you're using to call the text authoring API to create your project, import and train your data, and deploy your model.
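As a rough sketch with Az PowerShell (the resource group, resource name and principal ID below are placeholders for whatever your deployment actually uses), that authoring-side role assignment could look something like this:
# Placeholder values - substitute the identity that will call the text authoring API and your own resource names
$principalId      = '<object-id-of-the-identity-calling-the-authoring-api>'
$languageResource = Get-AzCognitiveServicesAccount -ResourceGroupName '<resource-group>' -Name '<env>languageservice'

New-AzRoleAssignment -ObjectId $principalId `
    -RoleDefinitionName 'Cognitive Services Language Owner' `
    -Scope $languageResource.Id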
PowerShell script to create a project in Azure Language Services
To create a project, we call the Azure text authoring API with a PATCH verb and provide a JSON body with some metadata about the project. If we set it to multilingual, we can use the language detection API later to check what language Azure thinks our input is. Thankfully we don't have to train it to detect the language; however, if you do expect to rely on this, it's a good idea to have training data in multiple languages.
# $endpoint is the language service's custom subdomain endpoint, e.g. https://<name>.cognitiveservices.azure.com
# $headers is the bearer-token header object we build further down in the training section
$projectName = 'myproject'
$projectUri = "$endpoint/language/authoring/analyze-text/projects/${projectName}?api-version=2022-05-01"
$projectBody = @{
    projectName               = $projectName
    description               = "Relevant or not"
    language                  = 'en'
    multilingual              = $true   # boolean, to match the project export format shown below
    projectKind               = "CustomSingleLabelClassification"
    storageInputContainerName = 'trainingdata'
} | ConvertTo-Json -Depth 5
$response = Invoke-RestMethod -Method Patch -Uri $projectUri -Headers $headers -Body $projectBody
Copy your training data to the storage account linked to Azure's language service
I like to use azcopy for this; you can copy to your storage account by obtaining a SAS token and automating it that way.
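As a minimal example (the storage account name and SAS token are placeholders; the container matches the Bicep above), the copy could look like:
# Placeholder account name and SAS token - generate a SAS with write access to the container
azcopy copy './trainingdata/*' 'https://<storageaccount>.blob.core.windows.net/trainingdata?<sas-token>'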
If you've pre-labelled it - call the import endpoint with a JSON body detailing each file name + label
This is a bit of a tricky part. You'll want a JSON body that looks something like the one below: the metadata for your project, plus an array pairing each file in your training data with its label.
{
  "projectFileVersion": "2022-05-01",
  "stringIndexType": "Utf16CodeUnit",
  "metadata": {
    "projectName": "myproject",
    "storageInputContainerName": "trainingdata",
    "projectKind": "CustomSingleLabelClassification",
    "description": "Relevant or not",
    "language": "en",
    "multilingual": true,
    "settings": {}
  },
  "assets": {
    "projectKind": "CustomSingleLabelClassification",
    "classes": [
      {
        "category": "relevant"
      },
      {
        "category": "not"
      }
    ],
The next entry is a long array pairing the documents in your training container with the label you want for each file. I'd suggest creating this programmatically (there's a sketch of that after this block); it'll be more than a little tedious otherwise.
"documents": [
{
"location": "001.txt",
"language": "en",
"class": {
"category": "relevant"
}
},
{
"location": "002.txt",
"language": "en",
"class": {
"category": "not"
}
},
.......
]
}
}
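Rather than hand-writing that JSON, here's a minimal sketch of generating the documents array and posting the whole thing to the authoring import endpoint. The labels.csv file (with 'file' and 'label' columns) is a hypothetical way of recording which label each training file has, and the snippet reuses the $endpoint, $headers and $projectName variables used elsewhere in the script.
# Hypothetical labels.csv with 'file' and 'label' columns mapping each training file to its class
$documents = Import-Csv './labels.csv' | ForEach-Object {
    @{
        location = $_.file
        language = 'en'
        class    = @{ category = $_.label }
    }
}

$importBody = @{
    projectFileVersion = '2022-05-01'
    stringIndexType    = 'Utf16CodeUnit'
    metadata           = @{
        projectName               = $projectName
        storageInputContainerName = 'trainingdata'
        projectKind               = 'CustomSingleLabelClassification'
        description               = 'Relevant or not'
        language                  = 'en'
        multilingual              = $true
        settings                  = @{}
    }
    assets             = @{
        projectKind = 'CustomSingleLabelClassification'
        classes     = @(@{ category = 'relevant' }, @{ category = 'not' })
        documents   = @($documents)
    }
} | ConvertTo-Json -Depth 10

# Import the labelled assets into the project
$importUri = "$endpoint/language/authoring/analyze-text/projects/$projectName/:import?api-version=2022-05-01"
Invoke-RestMethod -Method Post -Uri $importUri -Headers $headers -Body $importBody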
Train your model
Fairly simple here: we just need to make a REST call to the text authoring API, hitting the endpoint that kicks off training. Let's start by getting a bearer token to put in our call. I'm assuming you're using a managed identity to connect; otherwise you can obtain a key from the resource itself, but as always, managed identities save you from copying keys and secrets around.
$token = (ConvertFrom-SecureString (Get-AzAccessToken -ResourceUrl https://cognitiveservices.azure.com -AsSecureString).Token -AsPlainText)
Build up a headers object to send with your call...
$headers = @{
    "Authorization" = "Bearer $token"
    "Content-Type"  = "application/json"
}
Now let's build up the body to send. You can see here that we can define a split for how the training job will divide our labelled data between training the model and testing its outcome. For my jobs I've had roughly 1000 samples in my training data, so an 80/20 split works pretty well.
$modelLabel = "mybetamodel"
$deploymentName = "deploymentprime"

# kick off training job
Write-Host "`nStarting training..."
$trainUri = "$endpoint/language/authoring/analyze-text/projects/$projectName/:train?api-version=2022-05-01"
$trainBody = @{
    modelLabel            = $modelLabel
    trainingConfigVersion = "latest"
    evaluationOptions     = @{
        kind                    = "percentage"
        testingSplitPercentage  = 20
        trainingSplitPercentage = 80
    }
} | ConvertTo-Json
And finally, make the call itself to kick off the training job.
Invoke-RestMethod -Method Post -Uri $trainUri -Headers $headers -Body $trainBody
The training job can take quite some time to actually run; you can check for completion by polling the training jobs endpoint.
Start by getting the latest training job ID.
$jobUri = "$endpoint/language/authoring/analyze-text/projects/$projectName/train/jobs?api-version=2022-05-01"
$jobResponse = Invoke-RestMethod -Method Get -Uri $jobUri -Headers $headers
$latestJob = $jobResponse.value | Sort-Object -Property createdDateTime -Descending | Select-Object -First 1
$latestJobId = $latestJob.jobId
Write-Host "Latest training job ID: $latestJobId"
Write-Host "Initial status: $($latestJob.status)"
Then we can plug it into the job status endpoint like so, and poll until the job returns a status of succeeded. The job will take a while, so I'd recommend a fairly long polling interval of at least three minutes.
$jobStatusUri = "$endpoint/language/authoring/analyze-text/projects/$projectName/train/jobs/$($latestJobId)?api-version=2022-05-01"
while ($true) {
    $statusResponse = Invoke-RestMethod -Method Get -Uri $jobStatusUri -Headers $headers
    $status = $statusResponse.status
    Write-Host "Training job status: $status - Checked at $(Get-Date -Format "HH:mm:ss")"
    if ($status -eq "succeeded") {
        Write-Host "Training job succeeded."
        break
    }
    elseif ($status -in @("notStarted", "running")) {
        # the job can briefly report notStarted before it moves to running
        Start-Sleep -Seconds 180
    }
    else {
        throw "Training job failed or entered unexpected status: $status"
    }
}
Now that our training job has succeeded, we can call the deployment endpoint to get our model deployed and ready for our applications to use.
Deploy model
This one is actually fairly simple, which makes a nice change! It's just a small JSON body with your model label, sent to the deployment route for your project, which is identified in the URL we hit.
Write-Host "`nDeploying model..."
$deployUri = "$endpoint/language/authoring/analyze-text/projects/$projectName/deployments/$($deploymentName)?api-version=2022-05-01"
$deployBody = @{
    trainedModelLabel = $modelLabel
} | ConvertTo-Json
$response = Invoke-RestMethod -Method Put -Uri $deployUri -Headers $headers -Body $deployBody
Write-Host "Deployment request succeeded."
After this, you're ready to use your trained language classifier!
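To round the post off, here's a rough sketch of what calling the deployed classifier could look like from PowerShell, reusing the $endpoint, $headers, $projectName and $deploymentName values from above. The runtime API is asynchronous in the same way as the authoring calls (you submit a job, then poll the URL returned in the operation-location header); the sample text is a placeholder, and it's worth verifying the exact response shape against the API version you're targeting.
# Submit a single placeholder document to the deployed custom classification model
$analyzeUri = "$endpoint/language/analyze-text/jobs?api-version=2022-05-01"
$analyzeBody = @{
    displayName   = "Classify a document"
    analysisInput = @{
        documents = @(
            @{ id = "1"; language = "en"; text = "Some text to classify" }
        )
    }
    tasks = @(
        @{
            kind       = "CustomSingleLabelClassification"
            taskName   = "classify"
            parameters = @{
                projectName    = $projectName
                deploymentName = $deploymentName
            }
        }
    )
} | ConvertTo-Json -Depth 10

# The job URL comes back in the operation-location response header
$submit = Invoke-WebRequest -Method Post -Uri $analyzeUri -Headers $headers -Body $analyzeBody
$jobUrl = [string]$submit.Headers['operation-location']

# Poll until the job finishes, then dump whatever the task returned for our document
do {
    Start-Sleep -Seconds 5
    $result = Invoke-RestMethod -Method Get -Uri $jobUrl -Headers $headers
} while ($result.status -in @('notStarted', 'running'))

$result.tasks.items | ConvertTo-Json -Depth 10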