What is Synthetic Data? The Good, the Bad, and the Ugly

Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of sensitive nature, and thus sharing them can endanger the privacy of users and organizations.

A possible alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data — more precisely, having similar statistical properties.

So how do you generate synthetic data? What is that useful for? What are the benefits and the risks? What are the fundamental limitations and the open research questions that remain unanswered?

All right, let’s go!

How To Safely Release Data?

Before discussing synthetic data, let’s first consider the “alternatives.”

Anonymization: Theoretically, one could remove personally identifiable information before sharing it. However, in practice, anonymization fails to provide realistic privacy guarantees because a malevolent actor often has auxiliary information that allows them to re-identify anonymized data. For example, when Netflix de-identified movie rankings (as part of a challenge seeking better recommendation systems), Arvind Narayanan and Vitaly Shmatikov de-anonymized a large chunk by cross-referencing them with public information on IMDb.

Continue reading What is Synthetic Data? The Good, the Bad, and the Ugly

Hiring Research Assistants and PhD students

We’re happy to announce that we have several open positions!

Privacy & machine learning

Emiliano De Cristofaro has at least one post-doc position in privacy and machine learning. The researcher will work with him and others in UCL’s InfoSec group. For a sample of our recent work in the field, please see Emiliano’s publications on this topic.

Please email jobs@emilianodc.com with questions or apply directly before 25 July 2019.

Note that we would be keen to hear from both PhD students looking for part-time research work, as well as people looking for longer-term full-time post-doctoral positions.

Web measurements

Multiple positions are available in the context of a project based at the Alan Turing Institute on cyberbullying and cyberhate, led by Emiliano De Cristofaro and Gareth Tyson. The project will primarily focus on measurements research, i.e., gathering and analysing various types of social datasets.

For a sample of our recent work in this space, please see Emiliano’s publications on this topic.

Again, we would be keen to hear from both PhD students looking for part-time research work, as well as people looking for longer-term full-time post-doctoral positions.

Please email edecristofaro@turing.ac.uk if you have questions.

PhD positions with Philipp Jovanovic

Members of the InfoSec group are always looking for talented PhD students to join their team. If you would like to investigate opportunities, please do check their website for details of their research interests and contact instructions. We are particularly happy to announce that Philipp Jovanovic will join our group as an Associate Professor starting in January 2020, and he is inviting applications for PhD students.

Philipp’s research interests broadly cover applied cryptography, privacy, and decentralised systems. His current work focuses on building scalable, privacy-preserving, decentralised protocols (such as ByzCoin, RandHound, OmniLedger, or Calypso). He has also worked on a wide variety of other security-related topics in the past, including design and analysis of symmetric cryptographic primitives, side-channel attacks and countermeasures, and the security analysis of systems deployed in the real world such as the Transport Layer Security (TLS) protocol or the Open Smart Grid Protocol (OSGP).

For an overview of his work, please visit Philipp’s website.

If you’re interested in working with Philipp as a PhD student, please email philipp@jovanovic.io.

New CDT in cybersecurity

We have several PhD positions funded through the new Centre for Doctoral Training in Cybersecurity (CDT). Please see the article about the CDT for more details and instructions to apply.

Memes are taking the alt-right’s message of hate mainstream

Unless you live under the proverbial rock, you surely have come across Internet memes a few times. Memes are basically viral images, videos, slogans, etc., which might morph and evolve but eventually enter popular culture. When thinking about memes, most people associate them with ironic or irreverent images, from Bad Luck Brian to classics like Grumpy Cats.

Bad Luck Brian (left) and Grumpy Cat (right) memes.

Unfortunately, not all memes are funny. Some might even look as innocuous as a frog but are in fact well-known symbols of hate. Ever since the 2016 US Presidential Election, memes have been increasingly associated with politics.

Pepe The Frog meme used in a Brexit-related context (left), Trump as Perseus beheading Hillary as Medusa (center), meme posted by Trump Jr. on Instagram (right).

But how exactly do memes originate, spread, and gain influence on mainstream media? To answer this question, our recent paper (“On the Origins of Memes by Means of Fringe Web Communities”) presents the largest scientific study of memes to date, using a dataset of 160 million images from various social networks. We show how “fringe” Web communities like 4chan’s “politically incorrect board” (/pol/) and certain “subreddits” like The_Donald are successful in generating and pushing a wide variety of racist, hateful, and politically charged memes.

Continue reading Memes are taking the alt-right’s message of hate mainstream

Measuring and Modeling the Vivino Wine Social Network

Over the past few years, food and drink have become an essential part of our social media footprints. This shouldn’t come as a surprise – after all, eating and drinking were social activities long before the first #foodporn hashtag on Instagram. In fact, scientific studies have showed that what we gobble up or gulp down is shaped by social and regional influences, and how we tend to mirror habits of people with shared social connections.

Nowadays, we have an unprecedented opportunity to study eating & drinking habits at scale, as people share more and more of that online, both on popular social networks like Instagram, Twitter, and Facebook, but also on “dedicated” apps like Yummly or Untappd.

Along these lines is our recent paper, “Of Wines and Reviews: Measuring and Modeling the Vivino Wine Social Network,” recently presented at the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2018) in Barcelona. The study – co-authored by former UCL undergraduate student Neema Kotonya, Italian wine journalist Paolo De Cristofaro, and UCL faculty Emiliano De Cristofaro –  presented a preliminary study showcasing big-data and social network analysis of how users worldwide consume, rate, and provide reviews of wines. We did so through the lens of Vivino, a popular wine social network. (And, yes, Paolo is Emiliano’s brother! 😊)

What is Vivino?

Vivino.com is an online community for wine enthusiasts, available both as a web and a mobile application. It was founded in 2009 by Heini Zachariassen, with his colleague Theis Sondergaard joining in 2010. In a nutshell, Vivino allows users to review and purchase wines through third-party vendors. The mobile app also provides a “wine scanner” functionality – i.e., users can upload pictures of wine labels and access reviews and details about the wine/winery.

But Vivino is actually a social network, as it allows wine enthusiasts to communicate with and follow each other, as well as share reviews and recommendations. As of September 2018, it had 32 million users, 9.7 million wines covering a multitude of wine styles, grapes, and geographical regions, as well as 103.7 million ratings and almost 35 million reviews.

Continue reading Measuring and Modeling the Vivino Wine Social Network

Caveat emptor: Privacy could turn UK’s genomic dream into a nightmare

Raise your hand if, over the past couple of years, you have not heard of whole genome sequencing (usually abbreviated as WGS), or at least read a sensational headline or two about how fast its costs are dropping. In a nutshell, WGS is used to determine an organism’s complete DNA sequence. But it is actually not the only way to analyze our DNA — in fact, genetic testing has been used in clinical settings for decades, e.g., to diagnose patients with known genetic conditions. Seven-time Wimbledon champion Pete Sampras is a beta-thalassemia carrier – a condition that affects the formation of beta-globin chains, ultimately leading to red blood cells not being formed correctly. Testing for thalassemia, usually triggered by family history or a blood test showing low mean corpuscular volume, is done with a number of simple in-vitro techniques.

The availability of affordable whole genome sequencing not only prompts new hopes toward the discovery and diagnosis of rare/unknown genetic conditions, but also enables researchers to better understand the relationship between the genome and predisposition to diseases, response to treatment, etc. Overall, progress makes it increasingly feasible to envision a not-so-distant future where individuals will undergo sequencing once, making their digitized genome easily available for doctors, clinicians, and third-parties. This would also allow us to use computational algorithms to analyze the genome as a whole, as opposed to expensive, slower, targeted in-vitro tests.

Along these lines is last week’s announcement by Prof. Dame Sally Davies, UK’s Chief Medical Officer, calling the NHS to deliver her “genomic dream” within five years, with whole genome sequencing becoming “as standard as blood tests and biopsies.” As detailed in her annual report, a large number of patients in the UK already undergo genetic testing at least once in their life, and for a wide range of reasons, including the aforementioned thalassemia diagnosis, screening for cancer predisposition triggered by high family incidence, or determining the best course of action in cancer treatment. So wouldn’t it make sense to sequence the genome once and keep the data available for life? My answer is yes, but with a number of bold and double underlined caveats.

The first one is with respect to the security concerns prompted by the need to store data of extreme sensitivity like genomic data. The genome obviously contains information about ethnic heritage and predisposition to diseases/conditions, possibly including mental disorders. Data breaches of sensitive information, including health and medical data, sadly happen on a daily basis. But certain security threats are actually specific to genomic data and much more worrisome. For instance, due to its hereditary nature, access to a genome essentially implies access to that of close relatives as well, including offspring, so one’s decision to publish/donate their genome is also being made for their siblings, kids/grandkids, etc. So sensitivity does not degrade over time, but persists long after a patient’s death. In fact, it might even increase, as new aspects of the genome are studied and discovered. As a consequence, Prof. Dame Davies’ dream could easily turn into a nightmare without adequate investments toward sound security measures, that involve both technical tools (such as upgrading of obsolete hardware) as well as education, awareness, and practices that do not simply shift burden onto clinicians and practitioners, but incorporate security in their design and not as an after-the-fact.

Another concern is with allowing researchers to use the genomic data collected by the NHS, along with medical history, for research purposes – e.g., to discover genetic mutations that are responsible for certain traits or diseases. This requires building a meaningful trust relationship between the NHS/Government and patients, which cannot happen without healing the wounds from recent incidents like the care.data debacle or Google DeepMind’s use of personal NHS records. Instead, the annual report seems to include security/anonymity promises we cannot realistically maintain, while, worse yet, promoting a rhetoric of greater good trumping privacy concerns, as well as seemingly pushing a choice between donating data and access to the best care. It is misleading to use terms like “de-identification” of genomic data as an effective protection tool, while proper anonymization is inherently impossible due to its peculiar combination of unique and hereditary features, as demonstrated by a wide array of scientific results. Rather, we should make it clear that data can never be fully anonymized, or protected with 100% guarantees.

Overall, I believe that patients should not be automatically enrolled in sequencing programs. Even if they are given an option to later withdraw, once the data is out there it is impossible to delete all copies of it. Rather, patients should voluntarily decide to join through an effective informed consent mechanism. This proves to be challenging against a background in which information that can be extracted/inferred from genomes may rapidly change: what if in the future a new mutation responsible for early on-set Alzheimer’s is discovered? What if the NHS is privatized? Encouraging results with respect to education and informed consent, however, do exist. For instance, the Personal Genome Project is a good example of effective strategies to help volunteers understand the risks and could be used to inform future NHS-run sequencing programs.

 

An edited version of this article was originally published on the BMJ.

A Longitudinal Measurement Study of 4chan’s Politically Incorrect Forum and its Effect on the Web

The discussion board site 4chan has been a part of the Internet’s dark underbelly since its creation, in 2003, by ‘moot’ (Christopher Poole). But recent events have brought it under the spotlight, making it a central figure in the outlandish 2016 US election campaign, with its links to the “alt-right” movement and its rhetoric of hate and racism. However, although 4chan is increasingly “covered” by the mainstream media, we know little about how it actually operates and how instrumental it is in spreading hate on other social platforms. A new study, with colleagues at UCL, Telefonica, and University of Rome now sheds light on 4chan and in particular, on /pol/, the “politically incorrect” board.

What is 4chan anyway?

4chan is an imageboard site, built around a typical bulletin-board model. An “original poster” creates a new thread by making a post, with one single image attached, to a board with a particular focus of interest. Other users can reply, with or without images. Some of 4chan’s most important aspects are anonymity (there is no identity associated with posts) and ephemerality (inactive threads are routinely deleted).

4chan currently features 69 boards, split into 7 high level categories, e.g. Japanese Culture or Adult. In our study, we focused on the /pol/ board, whose declared intended purpose is “discussion of news, world events, political issues, and other related topics”. Arguably, there are two main characteristics of /pol/ threads. One is its racist connotation, with the not-so-unusual aggressive tone, offensive and derogatory language, and links to the “alt-right” movement—a segment of right-wing ideologies supporting Donald Trump and rejecting mainstream conservatism as well as immigration, multiculturalism, and political correctness. The other characteristic is the fact that it generates a substantial amount of original content and “online” culture, ranging from  the “lolcats” memes to “pepe the frog.”

This figure below shows four examples of typical /pol/ threads:

/pol/ example threads
Examples of typical /pol/ threads. Thread (A) illustrates the derogatory use of “cuck” in response to a Bernie Sanders image, (B) a casual call for genocide with an image of a woman’s cleavage and a “humorous” response, (C) /pol/’s fears that a withdrawal of Hillary Clinton would guarantee Donald Trump’s loss, and (D) shows Kek the “god” of memes via which /pol/ believes influences reality.

Raids towards other services

Another aspect of /pol/ is its reputation for coordinating and organizing so-called “raids” on other social media platforms. Raids are somewhat similar to Distributed Denial of Service (DDoS) attacks, except that rather than aiming to interrupt the service at a network level, they attempt to disrupt the community by actively harassing users and/or taking over the conversation.

Continue reading A Longitudinal Measurement Study of 4chan’s Politically Incorrect Forum and its Effect on the Web

On the hunt for Facebook’s army of fake likes

As social networks are increasingly relied upon to engage with people worldwide, it is crucial to understand and counter fraudulent activities. One of these is “like farming” – the process of artificially inflating the number of Facebook page likes. To counter them, researchers worldwide have designed detection algorithms to distinguish between genuine likes and artificial ones generated by farm-controlled accounts. However, it turns out that more sophisticated farms can often evade detection tools, including those deployed by Facebook.

What is Like Farming?

Facebook pages allow their owners to publicize products and events and in general to get in touch with customers and fans. They can also promote them via targeted ads – in fact, more than 40 million small businesses reportedly have active pages, and almost 2 million of them use Facebook’s advertising platform.

At the same time, as the number of likes attracted by a Facebook page is considered a measure of its popularity, an ecosystem of so-called “like farms” has emerged that inflate the number of page likes. Farms typically do so either to later sell these pages to scammers at an increased resale/marketing value or as a paid service to page owners. Costs for like farms’ services are quite volatile, but they typically range between $10 and $100 per 100 likes, also depending on whether one wants to target specific regions — e.g., likes from US users are usually more expensive.

Screenshot from http://www.getmesomelikes.co.uk/
Screenshot from http://www.getmesomelikes.co.uk/

How do farms operate?

There are a number of possible way farms can operate, and ultimately this dramatically influences not only their cost but also how hard it is to detect them. One obvious way is to instruct fake accounts, however, opening a fake account is somewhat cumbersome, since Facebook now requires users to solve a CAPTCHA and/or enter a code received via SMS. Another strategy is to rely on compromised accounts, i.e., by controlling real accounts whose credentials have been illegally obtained from password leaks or through malware. For instance, fraudsters could obtain Facebook passwords through a malicious browser extension on the victim’s computer, by hijacking a Facebook app, via social engineering attacks, or finding credentials leaked from other websites (and dumped on underground forums) that are also valid on Facebook.

Continue reading On the hunt for Facebook’s army of fake likes

New EU Innovative Training Network project “Privacy & Us”

Last week, “Privacy & Us” — an Innovative Training Network (ITN) project funded by the EU’s Marie Skłodowska-Curie actions — held its kick-off meeting in Munich. Hosted in the nice and modern Wisschenschafts Zentrum campus by Uniscon, one of the project partners, principal investigators from seven different countries set out the plan for the next 48 months.

Privacy & Us really stands for “Privacy and Usability” and aims to conduct privacy research and, over the next 3 years, train thirteen Early Stage Researchers (ESRs) — i.e., PhD students — to be able to reason, design, and develop innovative solutions to privacy research challenges, not only from a technical point of view but also from the “human side”.

The project involves nine “beneficiaries”: Karlstads Universitet (Sweden), Goethe Universitaet Frankfurt (Germany), Tel Aviv University (Israel), Unabhängiges Landeszentrum für Datenschutz (Germany), Uniscon (Germany), University College London (UK), USECON (Austria), VASCO Innovation Center (UK), and Wirtschaft Universitat Wien (Austria), as well as seven partner organizations: the Austrian Data Protection Authority (Austria), Preslmayr Rechtsanwälte OG (Austria), Friedrich-Alexander University Erlangen (Germany), University of Bonn (Germany), the Bavarian Data Protection Authority (Germany), EveryWare Technologies (Italy), and Sentor MSS AB (Sweden).

The people behind Privacy & Us project at the kick-off meeting in Munich, December 2015
The people behind Privacy & Us project at the kick-off meeting in Munich, December 2015

The Innovative Training Networks are interdisciplinary and multidisciplinary in nature and promote, by design, a collaborative approach to research training. Funding is extremely competitive, with acceptance rate as low as 6%, and quite generous for the ESRs who often enjoy higher than usual salaries (exact numbers depend on the hosting country), plus 600 EUR/month mobility allowance and 500 EUR/month family allowance.

The students will start in August 2016 and will be trained to face both current and future challenges in the area of privacy and usability, spending a minimum of six months in secondment to another partner organization, and participating in several training and development activities.

Three studentships will be hosted at UCL,  under the supervision of Dr Emiliano De Cristofaro, Prof. Angela Sasse, Prof. Ann Blandford, and Dr Steven Murdoch. Specifically, one project will investigate how to securely and efficiently store genomic data, design and implementing privacy-preserving genomic testing, as well as support user-centered design of secure personal genomic applications. The second project will aim to better understand and support individuals’ decision-making around healthcare data disclosure, weighing up personal and societal costs and benefits of disclosure, and the third (with the VASCO Innovation Centre) will explore techniques for privacy-preserving authentication, namely, extending these to develop and evaluate innovative solutions for secure and usable authentication that respects user privacy.

Continue reading New EU Innovative Training Network project “Privacy & Us”

Measuring Internet Censorship

Norwegian writer Mette Newth once wrote that: “censorship has followed the free expressions of men and women like a shadow throughout history.” Indeed, as we develop innovative and more effective tools to gather and create information, new means to control, erase and censor that information evolve alongside it. But how do we study Internet censorship?

Organisations such as Reporters Without Borders, Freedom House, or the Open Net Initiative periodically report on the extent of censorship worldwide. But as countries that are fond of censorship are not particularly keen to share details, we must resort to probing filtered networks, i.e., generating requests from within them to see what gets blocked and what gets through. We cannot hope to record all the possible censorship-triggering events, so our understanding of what is or isn’t acceptable to the censor will only ever be partial. And of course it’s risky, or even outright illegal, to probe the censor’s limits within countries with strict censorship and surveillance programs.

This is why the leak of 600GB of logs from hardware appliances used to filter internet traffic in and out of Syria was a unique opportunity to examine the workings of a real-world internet censorship apparatus.

Leaked by the hacktivist group Telecomix, the logs cover a period of nine days in 2011, drawn from seven Blue Coat SG-9000 internet proxies. The sale of equipment like this to countries such as Syria is banned by the US and EU. California-based manufacturer Blue Coat Systems denied making the sales but confirmed the authenticity of the logs – and Dubai-based firm Computerlinks FZCO later settled on a US$2.8m fine for unlawful export. In 2013, researchers at the University of Toronto’s Citizen Lab demonstrated how authoritarian regimes in Saudi Arabia, UAE, Qatar, Yemen, Egypt and Kuwait all rely on US-made equipment like those from Blue Coat or McAfee’s SmartFilter software to perform filtering.

Continue reading Measuring Internet Censorship

MSc Information Security @UCL

As the next programme director of UCL’s MSc in Information Security, I have quickly realized that showcasing a group’s educational and teaching activities is no trivial task.

As academics, we learn over the years to make our research “accessible” to our funders, media outlets, blogs, and the likes. We are asked by the REF to explain why our research outputs should be considered world-leading and outstanding in their impacts. As security, privacy, and cryptography researchers, we repeatedly test our ability to talk to lawyers, bankers, entrepreneurs, and policy makers.

But how do you do good outreach when it comes to postgraduate education? Well, that’s a long-standing controversy. The Economist recently dedicated a long report on tertiary education and also discussed misaligned incentives in strategic decisions involving admissions, marketing, and rankings. Personally, I am particularly interested in exploring ways one can (attempt to) explain the value and relevance of a specialist masters programme in information security. What outlets can we rely on and how do we effectively engage, at the same time, current undergraduate students, young engineers, experienced professionals, and aspiring researchers? How can we shed light on our vision & mission to educate and train future information security experts?

So, together with my colleagues of UCL’s Information Security Group, I started toying with the idea of organizing events — both in the digital and the analog “world” — that could provide a better understanding of both our research and teaching activities. And I realized that, while difficult at first and certainly time-consuming, this is a noble, crucial, and exciting endeavor that deserves a broad discussion.

DSC_0016

Information Security: Trends and Challenges

Thanks to the great work of Steve Marchant, Sean Taylor, and Samantha Webb (now known as the “S3 team” :-)), on March 31st, we held what I hope is the first of many MSc ISec Open Day events. We asked two of our friends in industry — Alec Muffet (Facebook Security Evangelist) and Dr Richard Gold (Lead Security Analyst at Digital Shadows and former Cisco cloud web security expert) — and two of  our colleagues — Prof. Angela Sasse and Dr David Clark — to give short, provocative talks about what they believe trends and challenges in Information Security are. In fact, we even gave it a catchy name to the event: Information Security: Trends and Challenges.

Continue reading MSc Information Security @UCL