The opening event for the UCL Academic Centre of Excellence for Cyber Security Research in the 2015–2016 academic term featured three speakers: Earl Barr, whose work on approximating program equivalence has won several ACM distinguished paper awards; Mirco Musolesi of the Department of Geography, who holds a degree in computer science and is interested in analysing myriad types of data while protecting privacy; and Susan Landau, a professor at Worcester Polytechnic Institute, a visiting professor at UCL, and an expert on cyber security policy whose books include Privacy on the Line: The Politics of Wiretapping and Encryption (with Whitfield Diffie) and Surveillance or Security? The Risks Posed by New Wiretapping Technologies.
Detecting malware and IP theft through program similarity
Earl Barr is a member of the software systems engineering group and the Centre for Research on Evolution, Search, and Testing. His talk outlined his work using program similarity to determine whether two arbitrary programs have the same behaviour in two areas relevant to cyber security: malware and intellectual property theft in binaries (that is, code reused in violation of its licence).
Barr began by outlining his work on detecting malware, comparing the problem to that facing airport security personnel trying to find a terrorist among millions of passengers. The work begins with profiling: collect two “zoos” – one of known-benign programs and one of known malware – and then ask whether the program under consideration is more likely to belong to the benign zoo or the malware zoo.
Rather than study the structure of the binary, Barr views the program as a string of 0s and 1s, which may not coincide with the program’s instructions, and uses information theory to compute a measure of dissimilarity: the normalised compression distance (NCD). The NCD approximates a distance based on Kolmogorov complexity, the length of the shortest description of an object. Since Kolmogorov complexity is uncomputable, a real-world compression algorithm stands in for it; usefully, compression also ignores the details of the instruction set architecture for which the binary was written.
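As a rough illustration (a sketch, not Barr’s implementation), the NCD of two byte strings can be computed with any off-the-shelf compressor, and the zoo comparison then reduces to a nearest-neighbour test. The choice of zlib and the nearest-neighbour decision rule here are assumptions:

```python
import zlib

def c(data: bytes) -> int:
    # Compressed length: a practical stand-in for Kolmogorov complexity.
    # (zlib is an assumption; any decent compressor works for the sketch.)
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalised compression distance: near 0 for similar inputs,
    # near 1 for unrelated ones.
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(sample: bytes, benign_zoo: list[bytes], malware_zoo: list[bytes]) -> str:
    # Nearest-neighbour rule (an assumption): which zoo holds the most
    # similar binary?
    if min(ncd(sample, m) for m in malware_zoo) < min(ncd(sample, b) for b in benign_zoo):
        return "malware"
    return "benign"
```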
Using these techniques to analyse a malware zoo collected from sources such as Virus Watch, Barr was able to achieve a 95.7% accuracy rate. He believes that although this technique isn’t suitable for contemporary desktop anti-virus software, it opens a new front in the malware detection arms race. Still, Barr is aware that malware writers will rapidly develop countermeasures and his group is already investigating counter-countermeasures.
Malware writers have three avenues for evading detection: injecting new content that looks benign; encryption; and obfuscation. Adding new content threatens the malware’s viability: raising the NCD by 50% requires doubling the size of the malware. Encryption can be turned against the malware writer: applying a language model across the program reveals a distinctive saw-toothed pattern in which regions of low surprise and low entropy alternate with regions of high surprise and high entropy (that is, regions containing ciphertext). Obfuscation is still under study: the group takes three obfuscation engines available for Java and applies them repeatedly to Java malware. Measuring the NCD after each application shows that after 100 iterations the NCD approaches 1 (that is, the two items being compared are dissimilar), but two of the three engines begin to make errors after 200 applications. Unfortunately for malware writers, repeated obfuscation also causes the program to grow in size. The cost of obfuscation to malware writers may therefore be greater than the cost it imposes on white hats.
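The saw-tooth signal can be made concrete with a sliding-window entropy scan. This is a simplified stand-in for the language-model approach described above, and the window and step sizes are illustrative assumptions:

```python
import math

def shannon_entropy(window: bytes) -> float:
    # Bits per byte: 0 for constant data, approaching 8 for ciphertext.
    counts = [0] * 256
    for byte in window:
        counts[byte] += 1
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

def entropy_profile(binary: bytes, window: int = 256, step: int = 64) -> list[float]:
    # Slide a window over the binary; encrypted or packed regions appear
    # as plateaus of high entropy, producing the saw-toothed pattern.
    return [shannon_entropy(binary[i:i + window])
            for i in range(0, max(1, len(binary) - window + 1), step)]
```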
In the case of IP theft, suspect binaries are often huge, containing tens of thousands of functions. Reused code is typically borrowed and then obscured (for example, by compiler optimisation), usually out of laziness. To count as a match, the suspect code must be both syntactically and functionally similar (though this can happen innocently, even with clean-room development, when different developers implement the same function to the same specification). The work is simplified by filtering out functions under 100 instructions, because they’re not worth copying. Barr finds that it doesn’t take many tests to establish whether two functions are equivalent. However, some shorter functions are “reticent” – that is, they are false almost everywhere but true at a few points. In these cases, Barr uses microexecution, in which the researchers build the execution state the program expects at that point and treat that state as part of the input. Using the GNU core utilities and ten popular C projects sourced from GitHub, designating one randomly chosen function as “stolen”, and varying the compiler, optimisation level, and obfuscation engines, Barr found that at the most aggressive optimisation level the results had a mean precision of 88% and a recall (the proportion of true matches flagged) of 25%. Varying the compiler caused the most difficulty.
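Here is a minimal sketch of the random-testing idea, using Python callables in place of binary functions. The input domain and the number of trials are assumptions, and real binaries need microexecution to fabricate the state each function expects:

```python
import random

def probably_equivalent(f, g, trials: int = 100, seed: int = 0) -> bool:
    # Randomised equivalence testing: functions that agree on a modest
    # number of random inputs are very likely behaviourally equivalent.
    # The 32-bit input domain and trial count are assumptions.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-2**31, 2**31 - 1)
        if f(x) != g(x):
            return False  # one disagreement is a definite mismatch
    return True  # agreement on every sample: likely a match

# A function and a syntactically different but equivalent rewrite.
original = lambda x: x * 2
rewritten = lambda x: x << 1
print(probably_equivalent(original, rewritten))  # True
```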
The barrier to adoption for this system is that it is computationally expensive – it took weeks to run on Livermore’s computers – and either the recall rate is too low for the real world or IP theft is rarer than conventional wisdom suggests.
Understanding limits of location privacy
Mirco Musolesi is interested in identity and identification in the smartphone era and looks at the privacy implications of merging the many types of data smartphones collect from their embedded sensors – GPS, cameras, microphones, proximity, ambient light, gyroscope, accelerometer and their “social sensors” (social media apps). Merging open data with these many data streams, which represent our digital identity, makes it possible to test many long-held theories about human behaviour, mobility, and design.
As an example, Musolesi outlined his work with data derived from three location-based social networks (LBSNs): Foursquare, the now-defunct Brightkite (2004–2011), and Gowalla (launched 2007, acquired by Facebook in 2011). These and many other services, such as Twitter, encourage users to share location information. Musolesi wanted to find the limits of obfuscation: how hard is it for an attacker to identify individual users from their location information, and how many check-ins does the attacker need? Musolesi’s group studied four attack models. The trajectory-based approach identifies a user by considering the trajectory of spatio-temporal points formed by their check-ins. In the multinomial approach, a Bayes classification model exploits the fact that most users’ check-ins cluster around a limited set of GPS points to estimate the probability that any one check-in relates to a particular user. “Social smoothing” also takes into account the check-ins of a user’s friends within the social network. Finally, the hybrid model merges the spatial and frequency information into a single model.
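A toy version of the multinomial attack can be written as a Laplace-smoothed naive Bayes classifier over venue counts. The data layout and smoothing constant below are assumptions, and the published models also exploit the spatial and social structure that the other attacks use:

```python
from collections import Counter
from math import log

def train(checkins: dict[str, list[str]]) -> dict[str, Counter]:
    # checkins maps each known user to the venue IDs they have checked into.
    return {user: Counter(venues) for user, venues in checkins.items()}

def most_likely_user(model: dict[str, Counter], observed: list[str],
                     alpha: float = 1.0) -> str:
    # Score each candidate by the smoothed log-likelihood of the observed
    # check-ins under their venue distribution; alpha is an assumed
    # Laplace smoothing constant.
    vocab = {v for counts in model.values() for v in counts} | set(observed)
    def score(counts: Counter) -> float:
        total = sum(counts.values()) + alpha * len(vocab)
        return sum(log((counts[v] + alpha) / total) for v in observed)
    return max(model, key=lambda user: score(model[user]))

# Toy data: users are identified by where their check-ins cluster.
model = train({"alice": ["cafe", "cafe", "gym"], "bob": ["bar", "bar", "cafe"]})
print(most_likely_user(model, ["cafe", "gym"]))  # alice
```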
Using directed Hausdorff distances to compare the results of these four models showed that where only a few points are available the trajectory model was the most accurate for the Foursquare and Gowalla datasets, though not for Brightkite. Musolesi went on to ask whether some locations are more discriminative than others, which venues an attacker might want to monitor to optimise their success, and what strategy users should adopt when deciding whether to make a particular check-in public. Based on a dataset of check-ins in 17 core-based statistical areas (as defined by the US Office of Management and Budget) and assuming that the attacker can only access check-ins in specific categories (such as restaurants), the researchers found that venues in the travel category were the least discriminative because of their high user-to-venue ratio. The most discriminative was residence, followed by shops, which are highly discriminative if enough points are available. There turned out to be no correlation between a user’s entropy (whether the user checks in frequently from many or few venues) and the complexity of identifying them; collective, rather than individual, behaviour determines how hard an individual is to identify. How to ensure identity privacy remains an open question, along with attack models that consider sequences of check-ins and the influence of urban environments.
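For reference, the directed Hausdorff distance used to compare the models has a compact definition: the worst-case distance from a point of one set to its nearest neighbour in the other. A small numpy sketch (the coordinates are invented):

```python
import numpy as np

def directed_hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    # h(A, B): the worst-case distance from a point of A to its nearest
    # neighbour in B. Note h(A, B) != h(B, A) in general, hence "directed".
    diffs = a[:, None, :] - b[None, :, :]          # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # Euclidean distances
    return float(dists.min(axis=1).max())

# Invented 2-D points standing in for (latitude, longitude) pairs.
a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.0, 0.1], [2.0, 2.0]])
print(directed_hausdorff(a, b))
```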
Metadata and content: wiretap law applied to the Internet
UCL visiting professor Susan Landau’s talk outlined her forthcoming paper with Steve Bellovin, Matt Blaze, and Stephanie Pell, “It’s Too Complicated: The Technological Implications of IP-Based Communications on Content/Non-Content Distinctions and the Third-Party Doctrine”. The paper uncovers technical and legal problems in applying existing US wiretap law to Internet Protocol-based communications. The main conclusion: the laws in effect today, drafted with traditional telephony in mind, are a poor fit for IP-based communications.
As background, Landau noted her work with seven other expert co-authors on “Bulk Collection of Signals Intelligence: Technical Options”. This report, commissioned under the 2014 Presidential Policy Directive 28 and issued earlier this year, concluded that there is no technical alternative to NSA bulk collection that preserves the ability to delve into past communications, and that automated controls on usage can help enforce privacy protections. Laws covering wiretapping in the US must withstand court challenges under the Fourth Amendment, and are interpreted in its light; the amendment states: “The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no warrants shall issue, but upon probable cause, supported by oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.”
The following laws control wiretapping in the United States:
- Title III (1968) provides for wiretaps in criminal cases where telephony was used in planning one of a short list (originally 25, now nearly 100) of crimes, where there is probable cause, and where wiretapping is a last resort.
- Foreign Intelligence Surveillance Act (FISA, 1978) adds the requirement of probable cause that the target is the agent of a foreign power; it was later amended to include terrorist groups.
- Electronic Communications Privacy Act (1986) sets out rules for pen registers and trap-and-trace devices (“pen/traps”), which record the numbers dialled from a line and identify the numbers calling into it, respectively.
- The PATRIOT Act (2001) adds “lone-wolf” suspected terrorists to FISA.
Two key Supreme Court cases control the application of existing wiretap laws. The first, Katz v United States (1967), established that wiretapping constitutes a search and requires a warrant even in a public phone booth, because the Fourth Amendment protects people, not places. Title III was drafted in response, and sets out warrant procedures. The second, Smith v Maryland (1979), created the “third-party doctrine”, which holds that no reasonable expectation of privacy applies to information voluntarily conveyed to a third party – such as the address written on a letter or a dialled phone number. Protection for the contents of a package in the US goes back to the 18th century and, like the Fourth Amendment, derives from the country’s colonial history. Taken together, these cases form the basis for treating content differently from metadata.
In their research, Landau and her co-authors discovered that the current (2005) Department of Justice manual on electronic surveillance includes “dialing, routing, addressing, and signaling information” in its definition of non-content, and specifies that this includes IP addresses, port numbers, and “to” and “from” information included in the email header.
However, the public switched telephone network (PSTN) and IP communications are fundamentally different architectures. In IP-based communications, there is no dialling; routing does not map easily to identifiable end points (and a single email’s packets may follow many different paths), addressing information may mean one thing to computer protocols and another to the humans using them, and signalling means setting up an end-to-end TCP connection. Other problems with the distinctions created by Katz and Smith include: where a service occurs; whether non-content can reveal content; and whether non-content is content. Third-party doctrine similarly poses problems: users can’t be said to share information voluntarily or knowingly when they can’t tell if a third party is involved in a DNS request; URLs comprise both addressing information (the base domain) and content (the path specifying a page); and proxy servers introduce additional complications. Other commonly used technologies – NAT, firewalls, domain fronting, notification services, basic email headers that can reveal location, and packet lengths that can reveal the content of the communication – further blur the boundaries of the traditional distinction between content and metadata.
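The URL point is easy to see concretely: a single URL mixes addressing-like and content-like parts, as a quick split shows (the URL below is hypothetical):

```python
from urllib.parse import urlsplit

# A hypothetical URL: the domain looks like addressing (non-content),
# but the path and query name a specific page, which is much closer
# to content.
url = "https://example.com/health/clinics/hiv-testing?city=london"
parts = urlsplit(url)
print(parts.netloc)                    # example.com -> addressing-like
print(parts.path + "?" + parts.query)  # page + query -> content-like
```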
They conclude that the rules are too difficult to apply, that trying to apply the old rules to the Internet leads to inconsistent results, and that the notion that metadata is wholly distinguishable from content no longer works. The paper closes with recommendations for the DoJ, judges, and policy makers. For the DoJ, it recommends that “to” and “from” information should be treated as content, not signalling information. For policy makers, it recommends grounding the law more solidly in today’s technical realities without tying it too closely to today’s technologies, and being aware that “big data” may give law enforcement insight into individuals that traditionally required a warrant or court order. Third-party doctrine may have to be reconsidered, as Supreme Court Justice Sonia Sotomayor noted in US v Jones (2012).