Detecting PII with McGrathNicol
McGrathNicol is a professional services firm operating in Australia and New Zealand, specialising in Advisory and Restructuring. Its technology and cyber specialists assist clients who have suffered a data breach. In most cases the breach also exposes those clients to risk from the compromise of sensitive and personally identifiable information (PII).
This data must be identified, collected and analysed so that any potential for harm to Affected Persons can be promptly reported under the Privacy Act 1988 (Cth) and the Notifiable Data Breaches scheme.
Identifying PII or other sensitive data within large volumes of unstructured data can be time-consuming and labour-intensive when “traditional” methods are used. This data most commonly consists of emails, office documents, PDFs, photocopies, scans and photos of documents.
- Apply machine learning to automate the detection of PII within large corpora of compromised unstructured data.
- Reduce the number of documents that analysts must review manually by detecting PII with a high degree of accuracy.
- Where possible, extract certain data points, such as contact details of individuals whose PII is present, from the underlying compromised unstructured data.
- Eliiza’s data scientists and engineers developed and evaluated a range of techniques for PII detection within the McGrathNicol AWS environment, working alongside analysts performing a concurrent internal analysis process across the same data.
- Eliiza used Amazon Textract to extract the text and geometric data from each page to enable training a detection model.
- A classification model, using Natural Language Processing techniques, was trained to perform a binary classification of each document as PII sensitive or non-sensitive.
- Eliiza’s internally-developed field detection solution “Thea” was used to identify certain types of PII and extract contact details. Both models were hosted in Amazon SageMaker.
- The classification model was able to identify documents containing PII with an accuracy of 96%.
- Once PII was identified, Thea was able to extract the data with the following levels of accuracy:
  - Date of Birth – 97%
  - Driver’s Licence – 87%
  - Bank Account Details – 85%
  - Passport – 83%
- The solution gives McGrathNicol room to scale: combined with current procedures, it could reduce the human effort involved in data breach support and enable faster assessments for clients.
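The Textract step described above, extracting text and geometric data from each page, can be illustrated with a small sketch. It is not Eliiza's code. It assumes a response shaped like the one Amazon Textract returns from text detection (a `Blocks` list whose `LINE` entries carry `Text` and a `Geometry.BoundingBox` expressed as ratios of the page size). The function name `extract_lines` and the sample values are hypothetical, and the sample is hand-made so the sketch runs without AWS credentials.

```python
from typing import Any


def extract_lines(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect the text and bounding box of every LINE block in reading order."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "page": block.get("Page", 1),
            "text": block["Text"],
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    # Sort by page, then top-to-bottom, then left-to-right.
    lines.sort(key=lambda line: (line["page"], line["top"], line["left"]))
    return lines


def _box(left: float, top: float, width: float, height: float) -> dict[str, Any]:
    """Build a Geometry entry for the hand-made sample below."""
    return {"BoundingBox": {"Left": left, "Top": top,
                            "Width": width, "Height": height}}


# Hypothetical sample shaped like a Textract response; all values are invented.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1, "Geometry": _box(0.0, 0.0, 1.0, 1.0)},
        {"BlockType": "LINE", "Page": 1, "Text": "Licence No: 12345678",
         "Geometry": _box(0.10, 0.30, 0.40, 0.03)},
        {"BlockType": "LINE", "Page": 1, "Text": "Date of Birth: 01/02/1990",
         "Geometry": _box(0.10, 0.20, 0.45, 0.03)},
        {"BlockType": "WORD", "Page": 1, "Text": "Licence",
         "Geometry": _box(0.10, 0.30, 0.12, 0.03)},
    ]
}

for line in extract_lines(sample_response):
    print(line["page"], round(line["top"], 2), line["text"])
```

Keeping the bounding box alongside each line preserves the page layout, which is the kind of geometric signal a downstream field detection model could use in addition to the raw text.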
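To make the binary classification step and its accuracy figure concrete, here is a minimal sketch of classifying documents as sensitive or non-sensitive and scoring the result. The case study does not say which model or features Eliiza used, so this swaps in a simple multinomial Naive Bayes classifier over word counts purely for illustration. The class `NaiveBayes`, the helper `accuracy`, and the toy training texts are all assumptions, not part of the original solution.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts: list[str], labels: list[str]) -> "NaiveBayes":
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[c][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


def accuracy(model: NaiveBayes, texts: list[str], labels: list[str]) -> float:
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy training data, invented for illustration only.
train_texts = [
    "passport number and date of birth of the applicant",
    "driver licence number and bank account details",
    "bank account bsb and date of birth attached",
    "meeting agenda for the quarterly sales review",
    "please find the project timeline and budget attached",
    "lunch menu for the team offsite",
]
train_labels = ["sensitive"] * 3 + ["non-sensitive"] * 3

model = NaiveBayes().fit(train_texts, train_labels)

held_out_texts = ["copy of passport and bank account", "agenda for the sales meeting"]
held_out_labels = ["sensitive", "non-sensitive"]
print(accuracy(model, held_out_texts, held_out_labels))
```

Accuracy here is measured on documents the model did not see during training, and it is the same kind of measure as the 96% figure reported above. On a real corpus where sensitive documents are rare, precision and recall would be worth reporting alongside it, since a missed PII document is likely to be more costly than an unnecessary manual review.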