Intelligent Classifiers for Office 365 • Wortell

Today I had an interesting epiphany. I frequently browse the Microsoft and community blogs to stay in the know about new and improved features coming to Office 365. Recently, I’ve seen increasing investment in machine learning to support data governance and smarter collaboration. I noticed several posts indicating that intelligent classifiers are coming to Office 365, which will help to identify and label documents based on meaning and context. The epiphany was that this capability was already announced at Ignite 2018, and it will be awesome. This post is meant to provide some insight into what the capability includes, how it will work and why it is so awesome!

Being able to use machine learning to classify content more intelligently is something I have been waiting for for a very long time. Microsoft Azure already had the services available to support this, but I had not seen any built-in integration with Office 365 to help control the ever-growing amount of data in organizations. Naomi Moneypenny actually showed a bit of “Machine teaching” during the SharePoint Conference NA 2019 keynote (approx. 1 hour 32 minutes into the video).

This clearly indicates that Microsoft is adding machine learning abilities to work more intelligently with documents in Office 365. Although the demo doesn’t clearly show whether the trained models help to, e.g., automatically tag content in SharePoint libraries, it would not surprise me if they do. Going back to the epiphany: Microsoft has been working on more Advanced Data Governance capabilities that leverage machine learning.

Intelligent Classifiers

Organizations will soon be able to use built-in classifiers that recognize, e.g., contracts, resumes or job descriptions and apply retention labels to those documents. In addition (this is the awesome part), it will be possible to create tenant-specific classifiers that work on organization-specific concepts. So how is this different from the auto-labelling that is already available? The existing sensitive information types look for certain patterns based on regular expressions or functions. The other option is query based, using content search. Both are powerful and valuable, but they lack an understanding of meaning and context, and they cannot actively learn once published to process content within the tenant. That is where machine learning makes the difference. Training and optimizing classifiers on representative specimens of content builds a model that can intelligently classify existing and new content in the Office 365 tenant. Session BRK3223 from Ignite 2018 shows a great walk-through of what can be expected and provides the foundation for the remainder of this post.
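To make the difference between pattern matching and learning from samples a bit more concrete, here is a minimal sketch in Python. It is not how Office 365 implements either feature. The regular expression, the tiny naive Bayes classifier and the sample texts are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Pattern-based detection: a hypothetical regex for a 16-digit, card-like number.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def matches_pattern(text):
    """Return True if the text contains something shaped like the pattern."""
    return CARD_PATTERN.search(text) is not None


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class SampleTrainedClassifier:
    """A tiny multinomial naive Bayes model (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()

    def train(self, samples):
        """Learn word frequencies from (text, label) pairs."""
        for text, label in samples:
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for token in tokens:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training samples standing in for uploaded example documents.
SAMPLES = [
    ("This agreement is made between the supplier and the customer. "
     "The parties agree to the terms and liability clauses.", "contract"),
    ("Termination of this agreement requires written notice by either party.",
     "contract"),
    ("Experienced software engineer with skills in Python and a degree in "
     "computer science.", "resume"),
    ("Work experience: project manager. Education: bachelor degree. "
     "Skills: communication.", "resume"),
]

classifier = SampleTrainedClassifier()
classifier.train(SAMPLES)
```

The regex only fires on text with a specific shape, whereas the trained model assigns a category to a document it has never seen, based on the vocabulary of the samples. A real classifier would of course use far more samples and a far richer model than this.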

Classification Assistant

Soon, a new option called “Classification Assistant” will appear under “Classifications” in the “Office 365 Security & Compliance” portal. The capability is not yet available, but in a demo tenant I was able to navigate to the assistant via https://protection.office.com/?flight=EnableSupervisionVNext#/classificationAssistant. Creating new classifiers and training sets actually fails at this point, but that makes sense considering the capability has not been released yet.

From there it will be possible to create training sets, which can link to existing SharePoint Online sites where samples of specific content have been uploaded. The Classification Assistant is intended to help build tenant-specific classifiers based on these samples. Depending on the number of samples, training may take several hours.

After the training completes, the results can be reviewed and analyzed. Based on the demo in the session video, the capability seems to support content that is categorized into clusters using document library folders, as well as uncategorized content. What’s interesting in the review panel is the ability to provide feedback to the algorithms by excluding content that does not match the criteria of a specific category. This will help to improve the accuracy of the classifier when it is applied to content beyond the training set.
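The folder-based categories and the review step can be pictured with a small sketch. This is only my own illustration in Python and not the Classification Assistant’s API. The function names, the folder layout and the file names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_training_set(document_paths):
    """Group documents by top-level folder, which acts as the category.

    Documents that sit directly in the library root are 'uncategorized'.
    """
    training_set = defaultdict(list)
    for path in document_paths:
        parts = PurePosixPath(path).parts
        category = parts[0] if len(parts) > 1 else "uncategorized"
        training_set[category].append(path)
    return dict(training_set)


def apply_review_feedback(training_set, exclusions):
    """Return a copy of the training set without the documents a reviewer
    excluded from a category. The original training set is left untouched."""
    return {
        category: [doc for doc in docs
                   if doc not in exclusions.get(category, set())]
        for category, docs in training_set.items()
    }


# Hypothetical library content: one document clearly sits in the wrong folder.
paths = [
    "Contracts/master-services-agreement.docx",
    "Contracts/lunch-menu.docx",
    "Resumes/cv-jane.docx",
    "notes.txt",
]
training_set = build_training_set(paths)

# The reviewer excludes the document that does not match the category.
reviewed = apply_review_feedback(
    training_set, {"Contracts": {"Contracts/lunch-menu.docx"}}
)
```

Retraining on the reviewed set rather than the raw one is the idea behind the feedback step: samples that do not belong no longer pull the model in the wrong direction.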

Intelligently applying labels

Once the classifiers have been created and published, they become available as a new condition for automatically applying retention labels.

For each label, it will be possible to choose one or more classifiers to use for auto-labelling. In addition, the classifiers can be used for the supervision capabilities.

Conclusion

I could not be more excited about this capability coming to Office 365. It will be valuable to have a more intelligent solution to classify sensitive content and improve data governance by integrating classifiers with, e.g., retention labels and their underlying policies. There are still lots of details to find out about how this will work in practice, but I can’t wait to get hands-on and help to gain more insight into the “dark data” of our customer organizations. I’m also thinking back to the Machine Teaching demo from Naomi, especially the part on how this may help to automatically apply metadata in SharePoint libraries. If so, it might help to end the everlasting discussion about folders vs. metadata. It won’t matter anymore if folders are being used, as machine learning will help to apply metadata…as it should! The last thing I’m still wondering about is how I could have missed the Ignite 2018 session before. It’s a true gem!