The Development of a Completely Unsupervised Machine Learning Pipeline for Security Analytics – from Ingestion to Analytics

Schedule Not Yet Finalized October 5, 2022

Jeff Schwartzentruber

Since data science applications proliferated in cyber security, approaches to threat detection have divided into two complementary camps: traditional and machine learning (ML). The traditional approach remains the predominant method in cyber security and is primarily based on identifying indicators of compromise via known signatures. ML applications, on the other hand, focus on deriving insights from the data using statistical methods that isolate malicious activity via supervised and unsupervised algorithms. However, as many data engineering practitioners will attest, one of the most significant issues affecting the efficacy of either approach is data ingestion and parsing, which requires significant operational overhead and expertise. These issues are further compounded by the lack of standardization across the log types and schemas used in industry. This session presents a novel method for circumventing many of these issues by applying natural language processing (NLP) to log parsing. The presentation starts with a brief overview of the issues concerning cyber security extract-transform-load (ETL) pipelines, followed by the state of the art in log NLP and its deficiencies as they relate to security data. We then present a new, fully unsupervised approach to security ETL pipelines that supports anomaly detection while attempting to mitigate the operational and technical challenges associated with engineering security data for cyber defence. This session will be of most interest to cyber security practitioners, data scientists and analysts, and data engineers who work hands-on with security data and the development of security analytics.
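
To make the idea of schema-free log parsing feeding unsupervised anomaly detection concrete, here is a minimal sketch in Python. It is not the method presented in this session, which the abstract does not detail. It is a simple, commonly used stand-in: mask variable tokens to recover a log "template", then flag templates that occur rarely. The masking rules, the function names `to_template` and `rare_templates`, the `max_count` threshold, and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules. Order matters: the more specific patterns
# (IP address, hex literal) must run before the generic number pattern.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens with placeholders and normalize spaces."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return " ".join(line.split())


def rare_templates(lines, max_count=1):
    """Return (line, template) pairs whose template occurs at most max_count times."""
    templates = [to_template(line) for line in lines]
    counts = Counter(templates)
    return [
        (line, template)
        for line, template in zip(lines, templates)
        if counts[template] <= max_count
    ]


# Made-up log lines, for illustration only.
SAMPLE_LOGS = [
    "Connection from 10.0.0.5 port 51234 accepted",
    "Connection from 10.0.0.7 port 40022 accepted",
    "Connection from 10.0.0.9 port 40100 accepted",
    "Kernel module loaded at 0x7fff12 by pid 4411",
]

if __name__ == "__main__":
    for line, template in rare_templates(SAMPLE_LOGS):
        print(f"rare template: {template!r} <- {line!r}")
```

No schema or labels are needed here: the three connection lines collapse into one template, while the kernel-module line stands alone and is flagged. Hand-written masks are the weak point of this sketch, since they encode exactly the kind of per-log-type expertise the session aims to avoid.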