John Seymour believes that machines can do a better job of classifying an ocean of malware – if we can just teach them properly.
Seymour, a data scientist at social media threat intelligence firm ZeroFOX, wants to turn computers into sorting tools that can help antivirus researchers. He has devoted his PhD studies to the topic, using machine learning, a branch of artificial intelligence that we have covered on the SecTor blog before. He’ll be talking about it in depth in his presentation at this October’s SecTor conference in Toronto.
“Malware analysts have to triage what they’re going to be looking at next,” he said. With around 390,000 new malicious programs appearing each day, any automatic classification can help researchers home in on the really interesting ones, rather than the latest simple variant of Conficker. One way to do that is to ask whether a sample is similar to any known malware family. If it isn’t, then perhaps it’s a new type of software attack that warrants a closer look.
Machine learning should be particularly good at spotting polymorphic malware, Seymour said. This is malicious software that alters its code to avoid detection, creating potentially thousands of strains of the same basic code.
“There is some underlying structure that is hard to change in that sort of way, and that is what machine learning is aiming to find,” he said.
Equipping machines to classify malware on their own is an uphill struggle, though. “The jury is still out on exactly how effective it will be, or whether it will be able to detect zero days.”
Show me the data
One challenge for companies is simply getting the data to feed to a machine learning system. Seymour categorizes machine learning into two types: unsupervised and supervised. Unsupervised systems look for patterns in data on their own, clustering things that look alike. Supervised systems must be trained using pre-labelled information, learning how to label things based on what people tell them.
Machine learning for malware classification would be supervised, Seymour said. He added that labelling malware into different categories can be difficult, because it’s hard to get the necessary data.
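The difference between the two types can be sketched with a few lines of Python. This is an illustration under stated assumptions, not anything from Seymour's research: each sample is reduced to two made-up numeric features, the unsupervised half is a simple greedy grouping by distance, and the supervised half is a nearest-centroid classifier. All names and values are hypothetical.

```python
import math


def cluster(points, radius):
    """Unsupervised: group points without any labels.

    Each point joins the first cluster whose founding member lies
    within `radius`; otherwise it starts a new cluster.
    """
    clusters = []
    for p in points:
        for c in clusters:
            if math.dist(p, c[0]) <= radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def train_centroids(labelled):
    """Supervised: average the feature vectors given for each label."""
    sums, counts = {}, {}
    for features, label in labelled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }


def predict(centroids, features):
    """Label a new sample with the nearest learned centroid."""
    return min(centroids, key=lambda lbl: math.dist(features, centroids[lbl]))


# Made-up two-feature samples (hypothetical data).
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
groups = cluster(points, radius=0.5)

labelled = list(zip(points, ["family_a", "family_a", "family_b", "family_b"]))
centroids = train_centroids(labelled)
guess = predict(centroids, (0.85, 0.85))
```

The clustering step finds two groups but cannot say what they are. The supervised step can name a new sample's family, but only because somebody supplied the labels first, which is exactly the data that is hard to obtain.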
“There are incentives not to share data about malware, because it is so hard to collect and label,” he said. “It becomes secret sauce.”
This problem is compounded by the fact that anti-malware companies tend not to call malware samples the same thing, anyway.
Nevertheless, there are tools and initiatives helping to unify naming and research. There are public repositories of virus samples out there, including VirusShare and VXHeaven. Seymour will talk about these during his presentation.
There have also been efforts to unify the naming and description of viruses. In 1991, the Computer Antivirus Research Organization (CARO) proposed a common naming scheme, but this fell out of use, not least because it was difficult to compare the scanning results produced by different vendors’ technology.
Later, MITRE launched the Common Malware Enumeration (CME) effort to establish a single naming convention for malware. This in turn gave way to its Malware Attribute Enumeration and Characterization (MAEC) language for describing malware attributes and properties.
Efforts like these are laudable, but in 2016 we still can’t be sure that two vendors are calling a particular virus strain the same thing, or classifying it in the same family.
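One pragmatic workaround for inconsistent names is to normalise each vendor's label and take a vote. The Python sketch below is a rough illustration of that idea only. The vendor names are placeholders, the label strings are illustrative rather than verbatim scanner output, and the alias table and list of generic tokens are assumptions that a real project would have to build and maintain.

```python
import re
from collections import Counter

# Assumed lists for illustration; real ones would be much longer.
GENERIC = {"win32", "worm", "trojan", "generic", "variant", "agent"}
ALIASES = {"downadup": "conficker", "kido": "conficker"}


def family_tokens(label):
    """Split a vendor label into candidate family names.

    Drops short tokens, pure numbers and generic words, then maps
    known aliases onto a single canonical name.
    """
    tokens = re.split(r"[^a-z0-9]+", label.lower())
    return [
        ALIASES.get(t, t)
        for t in tokens
        if len(t) >= 4 and not t.isdigit() and t not in GENERIC
    ]


def consensus_family(vendor_labels, min_votes=2):
    """Return the family name at least `min_votes` vendors agree on."""
    votes = Counter()
    for label in vendor_labels.values():
        for token in set(family_tokens(label)):
            votes[token] += 1
    if not votes:
        return None
    family, count = votes.most_common(1)[0]
    return family if count >= min_votes else None


# Illustrative labels from three placeholder vendors.
agreed = consensus_family({
    "vendor_a": "Worm:Win32/Conficker.B",
    "vendor_b": "W32.Downadup.B",
    "vendor_c": "Net-Worm.Win32.Kido.ih",
})
unclear = consensus_family({
    "vendor_a": "Trojan.Generic.123456",
    "vendor_b": "Win32.Agent.xyz",
})
```

With these inputs the three differently named detections collapse to `conficker`, while the two generic labels produce no consensus at all. A vote like this gives a training label of sorts, but it inherits every vendor's mistakes.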
Is it malware?
One of the biggest problems with automatically classifying malware is working out whether a file is malicious in the first place, Seymour explained. Humans may understand what malicious software looks like, but computers using machine learning to identify and classify malware need examples of both malicious and benign software, and it can be difficult to find examples of benign software that don’t introduce bias.
He pointed to one automated malware classification system proposed by Adobe, which he said contained flaws.
The team there used only executables installed as part of Windows for its benign dataset, so the classifier latched onto specific properties of Microsoft code in particular. Consequently, it began emitting false positives for legitimate apps that it didn’t think looked enough like that software.
“The issue they ran into was that their dataset wasn’t representative of malware and benign software in general,” he said.
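A toy Python experiment shows how this kind of bias surfaces. It does not reproduce the system Seymour described: the classifier is a one-nearest-neighbour stand-in, the two numeric features are invented, and every data point is hypothetical. The point is only the measurement, which is to check the false positive rate on benign software from a source the model never saw.

```python
import math


def nearest_label(training, features):
    """Label a sample with the label of its closest training example."""
    return min(training, key=lambda item: math.dist(item[0], features))[1]


def false_positive_rate(training, benign_samples):
    """Fraction of known-benign samples that get flagged as malicious."""
    flagged = sum(
        1 for f in benign_samples
        if nearest_label(training, f) == "malicious"
    )
    return flagged / len(benign_samples)


# Hypothetical training set: every benign example comes from one
# narrow source, so they all sit in the same corner of feature space.
training = [
    ((0.10, 0.10), "benign"),
    ((0.12, 0.08), "benign"),
    ((0.80, 0.80), "malicious"),
    ((0.70, 0.30), "malicious"),
]

# Benign files from the same narrow source look fine...
same_source = [(0.11, 0.09), (0.09, 0.12)]
# ...but benign files from other publishers look unfamiliar.
third_party = [(0.60, 0.35), (0.15, 0.10), (0.65, 0.40), (0.75, 0.70)]

fpr_same = false_positive_rate(training, same_source)
fpr_other = false_positive_rate(training, third_party)
```

Here the model makes no mistakes on benign files that resemble its training data, yet flags most of the third-party files, simply because "benign" was defined too narrowly. Evaluating only on the same source would have hidden the problem.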
So, malware classification is a problem in two halves. First, you have to decide whether a piece of software is malicious or not. Then you must classify it. The first half of the problem may end up not being a machine learning problem at all, Seymour points out, but if we can apply appropriate artificial intelligence techniques to the second, then it will help us in the ongoing struggle to manage and mitigate a torrent of malware variants.
If you’d like to learn more about this, then sign up for SecTor’s tenth annual security conference in Toronto this October 17-19, and come to hear John Seymour’s deep dive into machine learning and malware.