Social Media Analytics Technologies, Tactics, and Values: A Project Summary

I started this project hoping to understand how social media data is harvested, processed, analyzed, and used for marketing and political communications. From a programmatic point of view, nothing about the subject is particularly controversial. Coders write code, and machines do what we tell them to do. If I have a tool that can find out everything about you, I can do things with that information to improve your life in various ways. Or I can use that information to profit my company by stimulating you to buy things you don’t actually need. I might choose to target you with messages that inspire you to vote. Or I might leverage your resentments and fears, which I analyzed from your data, to motivate you to vote for or against a specific candidate.

I learned a great deal during this project about the data, the methods, and the tools for doing things with social media analytics. The first thing I tackled was research on common terminology in this area of data science: terms like sentiment analysis, opinion mining, machine learning, and natural language processing. I attended an introductory workshop on “Text Mining Concepts and Sources” presented by Information Sciences and Digital Humanities Librarian Dan Tracy. I searched the databases in the Information Sciences Virtual Library with some obvious search terms, then refined my searches using the source subject terms and database thesauri. I found and read dozens of sources to compile a top-twelve list. I then searched more broadly on the web, and discovered a number of increasingly effective open source tools used by researchers, along with proprietary commercial platforms that sell big data, analytics, and dashboard services.

There seem to be two main motivations for the research on social media analytics. Scholars want to use big data to do things like improving public health, or mapping relationships and nodes in social networks so as to understand roles, connections, and vectors of influence. Other researchers have a distinct commercial focus on consumer behavior and marketing applications.

But many of the sources I found seemed increasingly dated. Somewhere past the midpoint of this project, I began to feel like I was scratching the surface of something that continues to change shape. It’s as if the academic research can’t keep up with the real-world applications. In addition, too many of the papers are mostly literature reviews. I began to feel that I was writing yet another rehash.

That’s about the time I discovered some older research on the effect of strong ties in social media, and how targeted pro-social messaging can encourage pro-social behaviors. This research indicated that we aren’t influenced by everything we see on social media or by every Facebook friend, but we are influenced by those we consider close. Maybe not by much, but enough to make a meaningful difference in the aggregate. This gave me hope for something positive and actionable from the research.

Of course there’s a dark side to this force. Insert partisan politics and big money, and everything goes to hell. Would it surprise anyone to learn that Facebook, Google, Twitter, and other large tech corporations work directly with election campaigns to make “best use” of their analytics and platforms? Actually, this did surprise me, at least in the extent to which they all (to use a word) collude.

And now we get to the main reason I chose this topic for my final project. It’s almost like this technology was designed for a company like Cambridge Analytica. There’s no way I could write about social media analytics, and not situate it in context with how it’s being used in politics. But I found very few academic sources on the use of social media data in the 2016 U.S. presidential election, so I turned to journalistic sources.

I feel more strongly than ever that values are a central part of research, and are baked into every system and tool we build. Values are essential to our social and political responses to change, one of the big stressors of technology innovation. While compiling this bibliography I read a lot more research than I could include, and almost none of it reflected any consideration of how social media analytics might affect human beings, communities, politics, or the natural world. It should be expected that other people will insert their own values into applications of this technology, for good or ill. And that’s exactly what’s happening.


At the end of this project I found myself with more handfuls of straw and fewer needles than I’d hoped. The biggest technical find was the O’Reilly book Mining the Social Web by Matthew Russell, which provides an extremely thorough guide to using Python for mining and processing social media data. An updated edition is slated for publication this month, and my plan is to work through it over the summer. Only for research purposes, I promise.

Keeping a journal on WordPress was a great way to process the research and keep moving forward. Writing the posts while doing the research allowed me to contextualize the materials and tie things together. Sometimes it seemed like I was annotating the annotations, but I think the benefits outweigh the odd format.

That’s my story, and I’m sticking to it.

Finally, here are the annotations I sprinkled in a baker’s dozen of posts throughout this site: sources that gave me the beginnings of insight into the technical, business, and political world of social media analytics. I hope they’re useful to someone else as well.

Works Cited

Russell, Matthew A. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. 2nd ed., O’Reilly, 2013.

In Mining the Social Web, technologist and open source software advocate Matthew Russell provides a hands-on guide to accessing the APIs and analyzing the data of Twitter, Facebook, LinkedIn, Google+, GitHub, and other internet resources. The book presents a “guided tour of the social web,” with chapters focusing on each social network, and includes example code written in Python 2.7. The material is introduced gradually enough for non-programmers, but the specific techniques will be useful to intermediate and advanced programmers.

Russell provides a public GitHub repository dedicated to the book, including code examples, screenshots, and several screencasts hosted on Vimeo. The GitHub repo and the screencasts begin with how to set up a virtual server for use while reading the book, introducing readers to using Vagrant virtual environments. Somewhat confusingly, all example code is hosted on a separate wiki which includes numbered examples from each book chapter presented as HTML. The author also provides a blog with additional writings about topics covered in the book.

Most chapters focus on specific social media networks, providing guidance on their APIs and how to create an API connection. The use of IPython Notebook is detailed throughout, with instructions on how to search for specific content and topics, extract entities, conduct frequency analysis, detect patterns, and visualize data using histograms. Each chapter also provides recommended exercises and links to additional online resources. A chapter on data mining the web provides a useful introduction to web crawling and scraping, natural language processing, entity detection, and gisting. Another chapter on email explains how to process an email corpus using the Enron data set, converting it to a Unix mailbox, then JSON, and finally importing the data into MongoDB. Python queries and analysis tools are explored throughout.
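To make the flavor of these recipes concrete, here is a minimal sketch, in modern Python 3 rather than the book’s Python 2.7, of the kind of term-frequency analysis the book walks through. The sample posts are invented; this is not code from the book itself.

```python
import re
from collections import Counter

def term_frequencies(posts, top_n=5):
    """Tokenize a list of posts and count term frequencies."""
    words = []
    for post in posts:
        # Lowercase and keep only word-like tokens
        words.extend(re.findall(r"[a-z']+", post.lower()))
    return Counter(words).most_common(top_n)

posts = [
    "Mining the social web is fun",
    "The social web keeps changing",
    "APIs for the social web change often",
]
print(term_frequencies(posts, top_n=3))
# [('the', 3), ('social', 3), ('web', 3)]
```

In practice the input would come from an API client or a scraped corpus rather than a hard-coded list, but the counting step is the same.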

Since the publication of the 2nd edition in 2013, many aspects of social media network APIs have changed. Some of the book’s recommendations and code examples are therefore no longer valid. For example, the book references the Facebook Graph API 2.0, which was deprecated on August 7, 2016. Indeed, it would be impossible for a printed book to remain current with the frequent updates to the Facebook Graph API. That said, the concepts, recipes, and code examples presented in the book are useful as a foundation for building general skills in social media data mining and analysis. The book can also serve as an effective introduction to Python and programming in general. It may be hoped that the 3rd edition, scheduled for publication in May 2018 with an updated GitHub repo, will bring the code examples and references up to date.

Brooks, Ian. Personal interview. 4 Apr. 2018.

In his work as a research scientist at the University of Illinois iSchool, Ian Brooks is investigating ways of leveraging social media data to inform and improve public health outcomes. Brooks provided a guided tour of tools he’s currently using to download and analyze data from multiple social media networks, including Twitter, Facebook, Reddit, Google Plus, Yelp, YouTube, and Instagram. He is working with a team of scholars with expertise in computer science, history, and digital humanities, using a Crimson Hexagon application to search, view, process, and analyze up to 10,000 social media posts per batch. The Crimson Hexagon application provides a user interface for constructing a compound search of public posts on selected social media networks. Brooks says the application searches only the “front end” of the social networks, and does not access data through their APIs. Crimson Hexagon was founded as a for-profit “AI-powered consumer insights company” in 2007, and markets its applications and data products to a wide range of corporate, government, and educational entities. The University of Illinois licenses its applications for use by university-affiliated researchers and students.

Brooks uses the Crimson Hexagon tool to follow specific subject terms over time, so as to determine the context, sources, and emotional content of the terms as they are used in social media posts. For example, by following hashtags associated with the disease Ebola, he was able to follow public reaction to the arrival of Americans infected with Ebola in Dallas and New Hampshire, including the spread of misinformation about the cases and Ebola itself through social channels. Brooks noted that the World Health Organization had conducted an earlier public information campaign on Ebola, but the campaign had ended by the time the first cases of Ebola were reported on American soil. The social media data showed that a variety of non-credible sources were filling this vacuum with misinformation about Ebola. Understanding these information flows over time provides “actionable” public health responses, e.g. by recognizing when public communication campaigns from official sources like the WHO are needed to counteract misinformation.

Brooks also demonstrated his current research on public sentiment concerning the skin disease psoriasis. He explains that social media data appears more comprehensive and reliable than traditional clinical sources, which are limited both in number of responses and the willingness of patients to complete surveys. His analysis of the social media data shows that in 2010, people affected by psoriasis began expressing less negative sentiment about the disease, possibly indicating that more effective medical interventions were introduced at that time.

Crimson Hexagon provides a number of utilities for filtering and analyzing the data, including export to csv files. The application also provides its own sentiment analysis of the posts, displayed and filterable in various numerical and graphic representations. The posts searched by the application are downloadable for further processing and analysis. Brooks’ team has developed a number of Python scripts to identify and remove spam content, which he says is a major component of the data. They then use machine learning techniques such as support vector machine (SVM) algorithms and random forests to identify meaningful terms, entities, and expressions. An interesting aspect of studying social media is the personal nature of meaningful expressions, which requires adjustments to standard stopword lists, e.g. making sure not to exclude personal pronouns like “I” and “my.”
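The stopword adjustment Brooks describes can be sketched as follows; the word lists and tokens here are invented for illustration and are not taken from his actual pipeline.

```python
# A typical stopword list drops personal pronouns, but for social media
# sentiment they carry signal ("I", "my"), so we keep them.
STANDARD_STOPWORDS = {"the", "a", "is", "i", "my", "and", "of"}
KEEP_FOR_SENTIMENT = {"i", "my"}  # personal expressions matter
ADJUSTED_STOPWORDS = STANDARD_STOPWORDS - KEEP_FOR_SENTIMENT

def filter_tokens(tokens, stopwords):
    """Drop any token whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["I", "hate", "my", "psoriasis", "and", "the", "itching"]
print(filter_tokens(tokens, STANDARD_STOPWORDS))  # drops "I" and "my"
print(filter_tokens(tokens, ADJUSTED_STOPWORDS))  # keeps them
```

The design point is simply that preprocessing defaults tuned for formal text can erase exactly the first-person language that makes social media posts informative about personal experience.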

Brooks and his team are just beginning to use these tools as a foundation for improving decisions in public health interventions. He explains they are general purpose tools, and are also used by political campaigns and commercial entities for marketing purposes. Social media data and the tools to mine it and extract meaning are rapidly shifting the boundaries of what it is possible to know about how people actually think and communicate.

Kaur, Amandeep, and Vishal Gupta. “A Survey on Sentiment Analysis and Opinion Mining Techniques.” Journal of Emerging Technologies in Web Intelligence, vol. 5, no. 4, Nov. 2013, pp. 367–71.

For anyone seeking to use social media data to inform effective communication campaigns targeting social media users, it is essential to analyze their attitudes and opinions regarding the subjects of the communications. The term “sentiment analysis” is commonly used in reference to techniques for capturing public opinion on commercial products, political candidates and policy preferences, and various social events and movements. In this paper, Kaur and Gupta provide a survey of the literature on the subject, and an overview of the terminology and main techniques for extracting sentiment from social network data.

They use the term sentiment analysis to describe a general category of applications using Natural Language Processing (NLP) to extract and classify signifiers of opinion, mood, and emotion from a corpus of social media data. They identify this as a multidisciplinary problem for Artificial Intelligence, to “minimize the gap between human and computer,” and a process of “mining the text and classifying user sentiments, likes, dislikes, and wishes.”  They describe common approaches to sentiment analysis. These include “subjective lexicon,” wherein a text is analyzed based on a list of words that have been scored on a scale of positive, negative, or objective. N-Gram modeling is based on a given sequence of words, using terms such as “unigram,” “bigram,” “trigram,” etc., and/or combinations of such sequences to extract probable meaning. “Machine learning” is a general term for learning a language model from the features of a given text.
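A toy illustration of the subjective-lexicon and n-gram ideas might look like this; the lexicon and its scores are invented, and far smaller than real resources such as SentiWordNet.

```python
# Invented toy lexicon: +1 for positive words, -1 for negative words
LEXICON = {"comfortable": 1, "great": 1, "terrible": -1, "slow": -1}

def lexicon_score(text):
    """Sum the polarity of each word that appears in the lexicon."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "the car is comfortable but terrible on fuel"
print(lexicon_score(text))            # +1 (comfortable) - 1 (terrible) = 0
print(ngrams("the car is comfortable", 2))
```

The near-zero score on a mixed sentence hints at why the paper moves beyond single-word lexicons to multi-word expressions and n-gram models: polarity often lives in word sequences, not isolated words.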

The authors then describe the process of sentiment analysis. The first step is the generation of a sentiment lexicon for processing and classification. A number of techniques have been used, including a simple comparison of the words of an input text with a set of terms for which positive or negative associations have been assigned. They cite the development of SentiWordNet as a multi-language lexicon resource for extracting sentiment, and further techniques that combine single-word analysis with multi-word expressions, polarity-based word count within paragraphs, and rules based on antonym generation, among others.

They then discuss the categorization of text at the level of subjectivity or objectivity. Subjective statements contained in a social media post indicate opinion, and thus are stronger signals of personal sentiment. For example “the car is comfortable” is classifiable as subjective, whereas “Maruti launched a new car” is an objective statement of fact and less meaningful as a signal of sentiment about Maruti cars.

In addition to polarity, subjectivity, and objectivity, sentiment analysis may be concerned with more specific affective classes such as happy, anger, sad, and surprise. The authors mention ConceptNet as an example corpus for “sentic computing,” where states of sensitivity, attention, pleasantness, and aptitude are analyzed as a basis for sentiment classification.

The authors conclude with suggestions for further work on sentiment analysis research, including cross-language sentiment sense mapping, and resources for many more of the languages used in their home nation of India.

Varathan, Kasturi Dewi, et al. “Comparative Opinion Mining: A Review.” Journal of the Association for Information Science & Technology, vol. 68, no. 4, Apr. 2017, pp. 811–29.

In “Comparative Opinion Mining: A Review,” Varathan, Giachanou, and Crestani provide a review of recent research on a form of opinion mining known as comparative opinion mining. They define comparative opinion mining as “a subfield of opinion mining which deals with extracting information that is expressed in a comparative form (e.g. ‘paper X is better than the Y’).” Comparative opinion mining attempts to make use of the vast number of posts on social networks by billions of people where they express opinions and write reviews about movies, restaurants, music, cars, books, hotels, etc., often by comparing one product relative to another. The authors note that such comparative opinion information is especially useful in product marketing, and for “competitive intelligence” in identifying potential markets and risks.

The authors begin by describing opinion mining as an equivalent term with sentiment analysis, defined as methods for automatic detection of “opinionated information” and “polarity of opinion toward a specific target.” They cite Liu and Zhang in defining an opinion as “a subjective statement, view, attitude, emotion, or appraisal about an entity or an aspect of an entity from an opinion holder.” An entity is simply an abstract object “such as a product, person, event, organization” represented in some hierarchy of components and attributes. The authors argue that while opinion mining is useful in understanding what consumers and citizens think about products, services, or policies, it is often insufficient in revealing what people think about known alternatives. This is where comparative opinion mining can provide more actionable data.

After establishing this distinction, the authors provide an overview of comparative opinion mining techniques. They divide these into three classes: Machine Learning, Rule Mining, and Natural Language Processing. They describe each technique in great detail. For example, within machine learning, they discuss support vector machines (SVM), naive Bayes, conditional random fields, and supervised and unsupervised learning. Their section on rule mining identifies techniques for discovering meaningful signals in the patterns and structures of terms, such as comparative words and phrases. They describe NLP approaches to analyze language in two levels: syntactic analysis to parse the syntax of sentences, and semantic analysis to identify the meaning of the sentence content. In their discussion of semantic analysis they introduce the semantic networks WordNet and SenticNet, which provide accessible knowledge bases and lexical resources for detecting synonyms, comparative words, and sentiment. They also describe test collections of comparative language available to researchers seeking to develop better algorithms and tools, such as the J&L, JDPA, and Kessler14 data sets.
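As a crude sketch of the rule-mining idea, comparative sentences can be flagged by matching surface patterns. The real systems described in the review use part-of-speech tags and learned rules; the patterns below are invented and deliberately minimal.

```python
import re

# Invented, minimal rule set: a real comparative-opinion miner would
# combine a comparative-word lexicon with POS-tagged sequence rules.
COMPARATIVE_PATTERNS = [
    r"\bbetter than\b",
    r"\bworse than\b",
    r"\b(cheaper|faster|slower) than\b",
    r"\bcompared (to|with)\b",
]

def is_comparative(sentence):
    """Flag a sentence if any comparative surface pattern matches."""
    s = sentence.lower()
    return any(re.search(p, s) for p in COMPARATIVE_PATTERNS)

print(is_comparative("Camera X is better than camera Y"))  # True
print(is_comparative("I love this camera"))                # False
```

Even this trivial filter separates comparative opinions ("X is better than Y") from ordinary direct opinions ("I love X"), which is exactly the distinction the authors argue matters for competitive intelligence.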

The article does not cover specific software processing methodologies, but briefly discusses several open source preprocessing tools, such as GATE, Stanford CoreNLP, and CRF Toolbox. The authors conclude by suggesting further research in comparative opinion mining. And they observe that in the survey they conducted for this article, they found that more than 60 percent of the researchers currently working on comparative opinion mining are Chinese scholars. They suggest that sociological research might provide insight into why Chinese scholars are more interested in comparative opinion mining than other researchers.

Atzmueller, Martin. “Mining Social Media: Key Players, Sentiments, and Communities.” WIREs: Data Mining & Knowledge Discovery, vol. 2, no. 5, Sept. 2012, pp. 411–19.

In this article for Wiley’s WIREs publications, cognitive science and artificial intelligence researcher Martin Atzmueller explores methods for extracting information, patterns, and knowledge from social media data combined with ubiquitous embedded systems including RFID-based applications, sensor networks, and mobile devices. He introduces the basic terms of his research, including Social Network Analysis (SNA) wherein communities are identified and mapped into sets of nodes, strongly connected by identifiable interests or needs. He then describes ways to identify and characterize “key players” in social networks, defined as “actors that are important for the network in terms of connectivity, number of contacts, and the paths that are passing through the corresponding node.” He introduces the term “role mining” as a method of discovering actor profiles with certain features such as prestige and community importance. Characterizing the communities and roles provides a map for understanding how communications moderate community attitudes and behavior.

Atzmueller next describes methods of sentiment mining and analysis, which he defines as “extracting subjective information from textual data using NLP, linguistic methods, and text mining approaches.” He cites B. Liu’s key elements of sentiment analysis as the opinion holder, the object and features of the opinion, and the positive or negative opinion orientations. Machine learning techniques such as latent semantic analysis and support-vector machines (SVM) are commonly used for developing a sentiment classification for a given text corpus.

Following on the discussion of communities and key players, the author summarizes the concept of “community mining” wherein clusters or subgroups communicate with each other in a larger network. Pattern mining and statistical approaches are used to identify these densely-connected clusters.

The author then describes the open source tool VIKAMINE, and its use in the analysis and mining of social media communities and subgroups. He states that VIKAMINE has been effective in a broad range of social media analysis scenarios, including community mining, key actor and role characterization, and pattern mining and analytics. He concludes with a discussion of “reality mining,” wherein social media analysis is combined with “everyday” sensor information including smartphones and sensor networks.

Forouzandeh, Saman, et al. “Content Marketing through Data Mining on Facebook Social Network.” Webology, vol. 11, no. 1, June 2014, pp. 1–11.

In this introduction to their research on content marketing, Forouzandeh, Soltanpanah, and Soltanpanah discuss potential advantages of using data mining techniques to identify user interests and behavior on social networks. They note that many advertising and marketing messages don’t account for user preferences and are therefore ignored. To address this disconnect, they introduce the concept of “content marketing,” which they define as “a marketing process of creating and properly distributing content in order to attract, make communication with, and understand other people so they can be motivated to do beneficial activities.” Content marketing techniques seek to understand how individuals communicate with one another, so that those same channels can be used to distribute marketing information through which individuals influence each other.

The authors then report on a study they conducted of content marketing on Facebook. They used the Netvizz application to develop a friendship graph analyzed with Gephi software, and analyzed the communication behavior and interests of users using data mining techniques. (Note that Facebook recently implemented new limits on metadata that can be openly mined using some of these techniques, due to abuse of their terms of service by some actors.) A new Facebook account was created to connect with users, and Netvizz was used to extract Facebook Open Graph data such as Locale, Like_Count, Post_Count, and Post_Engagement_Count. The research determined that the latter two data fields were especially valuable in identifying users who wrote many posts, and whose posts attracted many likes and comments by other users. Influential users were identified in this manner. These users were asked to distribute marketing messages to other users, resulting in a larger number of shares and customer conversions.

The authors provide additional details on their workflow in identifying influential users, such as the use of Gephi software to map their Facebook connections of friends, posts, comments, and likes. This data was downloaded as an Excel file for further text analysis, and the users were scored based on measures of popularity and frequency of likes and comments on their posts. Based on these connections and behaviors, the researchers predicted which individuals would be most influential with other users in future marketing communications.
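The scoring step might be sketched like this; the field names echo the Open Graph fields mentioned above, but the weights and data are invented and do not reproduce the paper’s actual formula.

```python
# Hypothetical per-user metrics of the kind extracted via Netvizz
users = {
    "user_a": {"post_count": 120, "post_engagement_count": 900},
    "user_b": {"post_count": 40,  "post_engagement_count": 1500},
    "user_c": {"post_count": 10,  "post_engagement_count": 30},
}

def influence_score(metrics, w_posts=1.0, w_engagement=2.0):
    """Weighted sum of a user's activity and the engagement it attracts.

    The weights are arbitrary illustrations; engagement is weighted
    higher because the study found it more predictive of influence.
    """
    return (w_posts * metrics["post_count"]
            + w_engagement * metrics["post_engagement_count"])

ranked = sorted(users, key=lambda u: influence_score(users[u]), reverse=True)
print(ranked)  # most influential first
```

Here user_b outranks user_a despite posting less, because engagement counts more than raw posting volume, mirroring the study’s finding that posts attracting many likes and comments mark the influential users.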

In the next phase of the research, the authors provided a series of messages through the influencers to test the spread and reaction of their network. Using data mining techniques they determined which messages were most effective in generating positive reactions in the form of likes.

The authors suggest that content marketing can be combined with other marketing strategies such as viral marketing, where influential individuals are given products or services for free in exchange for influencing others. They conclude that content marketing has a comparative advantage over other forms of marketing in social networks, since the message is not presented as a direct commercial appeal. Instead the message is accepted from a trusted source, and thus overrides critical resistance to the marketing appeal.

Stieglitz, Stefan, et al. “Social Media Analytics – Challenges in Topic Discovery, Data Collection, and Data Preparation.” International Journal of Information Management, vol. 39, Dec. 2017, pp. 156–68.

In this 2017 article for the International Journal of Information Management, Stieglitz, Mirbabaie, Ross, and Neuberger discuss social media analytics research, and address a perceived gap in the literature on stages of data discovery, collection, and preparation. They address this gap by conducting an extensive literature analysis to identify key challenges and solutions. In presenting their findings, they suggest ways to extend existing social media research frameworks to better serve scholars and practitioners.

The authors argue that while many research papers have been published on social media network analytics and qualitative data, they often represent isolated case studies bound by a specific subject and time frame. Although the methods these studies use to extract useful information from social networks are often similar, there remains a lack of more comprehensive discussion of a general framework for social media analytics to guide future research. Much of the literature they review concerns specific methods for social media data analysis, such as opinion mining and social network analysis, but the authors assert that data analysis is only one step in the larger process of social media analytics. Their literature review therefore focuses on challenges faced by researchers in discovering topics, and during the process of collection and preparation of social media data for analysis.

One of those challenges is discovery of what to research. Most existing social media analysis models assume that topics of research are pre-defined, for example in political communications or crisis situations where specific topics are tracked. Such models may be useful in conducting a sentiment analysis on known issues and trends, but are less useful in discovering new issues and trends.

Given the size of “big data” in social media network communications, analytics is also challenged by the volume of storage space required, the velocity of data creation, the variety of data forms including unstructured and proprietary data, and the uncertain veracity of the data and its sources. The authors state that there has been little research on topic discovery that addresses these specific challenges.

In their literature review the authors classify the phases of social media analysis research, and usefully present the results in a table showing which have been addressed in papers by other scholars. Their analysis of existing research points to gaps they suggest should be addressed to better define a more generalizable model of social media analytics, and new methods and tools to solve the existing challenges of topic discovery in a world of big data.

Bond, Robert M., et al. “A 61-Million-Person Experiment in Social Influence and Political Mobilization.” Nature, vol. 489, no. 7415, Sept. 2012, pp. 295–98.

In a report published in Nature, the authors describe a randomized controlled study of responses to voter mobilization messages placed on the social media network Facebook on the voting day of November 2, 2010. The study targeted messages to more than 61 million Facebook users over the age of 17, and randomly assigned them to one of three groups. For the “social message group” a message was displayed at the top of their Facebook news feed urging the user to vote with a link to find their local polling place, along with a clickable “I Voted” button showing a count of Facebook users who had clicked the button, and up to six profile pictures of the user’s Facebook friends who had also clicked the button. The “information message group” was shown the same information with the exclusion of the faces of their Facebook friends. The control group received no voting message in their news feed.

The researchers were able to assess the impact of the messages by measuring three user actions: clicking on the “I Voted” button, clicking the link to the local polling place, and registering a vote in the election. They measured actual voting behavior by matching 6.3 million Facebook users to public voting records. Users in the social message group were 2.08 percent more likely to click the “I Voted” button and 0.26 percent more likely to click the local polling place link than members of the “information message” group. More importantly, examination of the voting records showed that members of the social message group were 0.39 percent more likely to actually vote than those in the information message and control groups.

The authors posit that the voting behavior of those in the social message group was influenced by strong social ties with their Facebook friends, whose faces were displayed in the message encouraging them to vote. To validate this they conducted a network analysis of “close” friends based on frequency of interaction among all Facebook friends. They found that while friends outside the “close” group had little or no effect on a user’s voting behavior, messages that included the profile picture of close friends increased the probability of the user voting. Close friends were also found to influence the users’ self-expression of voting behavior with the “I Voted” button, and in clicking the link to local polling place locations.

The study notes that while these increases in voting behaviors are small, they are the result of a single message displayed on voting day. In addition, in a voting population of 236 million people, small percentage increases add up to significantly more votes: in this case a total of more than 340,000 additional votes. Considering that in the 2000 presidential election George Bush defeated Al Gore by a total of 537 votes in Florida, it becomes evident that, as the authors put it, “online political mobilization works.”

Kreiss, Daniel, and Shannon C. McGregor. “Technology Firms Shape Political Communication: The Work of Microsoft, Facebook, Twitter, and Google With Campaigns During the 2016 U.S. Presidential Cycle.” Political Communication, vol. 35, no. 2, Oct. 2017, pp. 155–77.

In this article communications scholars Kreiss and McGregor present the first in-depth look at the role played by technology and social media companies in shaping political communications and digital strategy in the 2016 U.S. presidential election. They describe how a number of technology firms, including Facebook, Google, and Twitter, worked in close collaboration with campaign staff to plan and facilitate digital advertising buys, analyze voter demographics, and advise on messaging strategies. They conclude that social media firms have increasingly become active agents in political and electoral processes, motivated by their interests in advertising revenue, user engagement, and relationship-building to aid future lobbying efforts.

The study is based on several different data sources, including fieldwork at the 2016 Democratic National Convention; interviews with managers at the social media and technology firms; and extensive interviews with digital staff in the 2016 presidential campaigns. From this research it was evident that the firms have deployed staffing resources mirroring the partisan and policy characteristics of the campaigns, and have become active participants in shaping campaign messaging strategies using their platforms. In the 2016 election they helped the campaigns analyze voter demographics, interests, and behavior, and advised them on content strategies to reach and sway voters based on these characteristics.

The authors note that spending on digital advertising is an increasing portion of the $2 billion spent during the U.S. presidential campaign. The campaign of Donald Trump allocated 50 percent of its advertising budget to digital and social media, reflecting a recognition that social media platforms have become central to how people communicate about news and politics. Representatives of the firms acknowledged that while they were motivated by an interest in digital advertising revenue, they also saw benefit in helping candidates get elected who would be in a position to implement regulations favorable to the firms.

The study provides additional detail on how the Trump campaign in particular made use of social media firm staff and expertise to identify and target persuadable voters and demographic groups. Much of this expertise was in the form of helping the campaign “build ads that get results” which benefit both the campaign and the firm. Facebook worked directly with the Trump campaign digital and creative staff to test more than 100,000 variants of its ads and measure persuasive performance, a process likened by RNC Director of Advertising Gary Coby to “A/B testing on steroids.” After the election, Coby, who largely ran the Trump digital strategy, credited Facebook’s advisor as an “MVP” of the campaign.

The authors conclude with a discussion of the increasing power of technology firms and social media platforms in the functioning of American democracy and political discourse. They note that platforms like Facebook and Twitter are now “at the center of democratic processes, yet also beholden to market forces.” The platforms position themselves as neutral carriers of political content, but make editorial decisions based on growth in user numbers and engagement. And while their direct support of election campaigns reflects their interests in increasing advertising revenue, it may also lead to increasing political polarization, as more effectively targeted “click-bait” campaign ads further sensationalize political communications.

Channel 4 News. “Cambridge Analytica: Whistleblower reveals data grab of 50 million Facebook profiles.” YouTube, March 17, 2018,

In this video report, former Cambridge Analytica research director Chris Wylie explains how the company mined and used the social media data of 50 million Americans to target prospective voters with messages tailored to their fears and motivations. Wylie worked for Steve Bannon, who sought profile information on Facebook users for use in the Trump presidential campaign. Says Wylie, “Steve wanted weapons for his culture war…to change the culture of America.”

A great deal of research has been conducted over the past decade on the analysis of social media data. Machine learning techniques have been developed to extract meaningful information from this data, such as sentiment analysis, which determines positive and negative attitudes toward subjects and entities. Comparative opinion mining seeks to identify user preferences among alternative choices. Social network groupings can be mapped and key influencers identified within each group. A wide variety of demographic information and aspects of personality can be derived from data accessible via the Facebook Graph API. Content marketing can make use of this data to develop messages calibrated to influence specific users based on their profiles and the influence they have in their online social networks. Other academic research has determined that a small amount of leverage in the right places can have significant effects on voting behavior.
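To make the sentiment analysis idea concrete, here is a minimal lexicon-based scorer in plain Python. This is a toy sketch of the simplest end of the technique: production systems use trained machine learning models and lexicons of thousands of weighted terms, and the word lists below are invented for the example.

```python
# Minimal lexicon-based sentiment scorer. The word lists are invented
# for illustration; real systems use trained models and large lexicons.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "angry"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word share."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("I love this great candidate"))    # 1.0
print(sentiment_score("What a terrible, awful policy"))  # -1.0
```

A classifier like this only counts matching words; the machine learning approaches described above learn weights from labeled examples and handle negation, context, and sarcasm far better.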

Bannon’s idea was to leverage academic research on social media personality profiling, and replicate it on a massive scale for messaging operations in the 2016 campaign. Under Bannon as vice president of Cambridge Analytica, the company contracted with Aleksandr Kogan, a data scientist at the University of Cambridge who developed a Facebook app. The app offered to pay Facebook users to fill out a personality survey. These users agreed to share not just their survey responses but also their profile information. Even more crucially for Cambridge Analytica, the app also crawled each user’s contacts and mined the profiles of all their Facebook friends: an average of 300 additional people per app user. The reach was so extensive that Facebook later clarified that while the original estimate was that 50 million profiles were mined, in reality it was closer to 80 million profiles.

Tens of millions of Facebook records could thus be pulled within a short time, with no restrictions or limits from Facebook. Many news reports termed this a “breach,” but the term is inaccurate given the accessibility of the data via the Facebook Graph API. Aleksandr Kogan has been accused of violating Facebook’s terms of service, a claim he disputes. Facebook essentially exposed the profile data to researchers, and asked them to delete the data set when their research was complete. But Cambridge Analytica received a copy of the data, and it remains unclear if it was ever erased.

Facebook has since suspended Kogan, Cambridge Analytica, and Chris Wylie from the platform as the investigation continues.

For their part, Cambridge Analytica claims that the data wasn’t useful and made no difference in the presidential election, an assertion Chris Wylie disputes:

“It wasn’t some tiny pilot project, it was the core of what Cambridge Analytica became. It allowed us to move into the hearts and minds of American voters in a way that had never been done before.”

Wylie’s assertion echoes Cambridge Analytica’s claims during the 2016 election. Cambridge Analytica CEO Alexander Nix spoke openly about how the company was able to “microtarget” voters and deliver custom messages that resonate with their psychographic profiles. For example, “someone who is neurotic, is someone who is quite emotional, and might respond in this case to stimulus of fear.”

After the story broke in March of this year, Cambridge Analytica renounced the effectiveness of psychographic techniques to influence voters. The company claims that Aleksandr Kogan is responsible for any violation of Facebook policy or breach of Facebook users’ privacy. But Wylie says Cambridge Analytica did just what it claimed during the election:

“If I am studying you, and I have enough information about you, because you curated your entire self online and I capture that, I can anticipate what are your mental vulnerabilities. What cognitive biases might you display in certain situations? And I can exploit that.”

Wylie acknowledges that political campaigns have always attempted to persuade people to think a certain way and vote for their candidates. But he asserts there’s a difference between persuasion and manipulation. “This gets at the heart of why is it you’re taking this psychological approach,” he says. “Why do you need to study neuroticism in people? What’s going to make them fearful?”

Wylie himself was at the heart of whatever Cambridge Analytica did in the 2016 U.S. election. As Cambridge Analytica’s research director he had a hand in all of it. He now claims he was naive and is coming forward to own up to his mistakes.

Channel 4 News. “Cambridge Analytica Uncovered: Secret filming reveals election tricks.” YouTube, March 19, 2018,

In this segment of Channel 4 News’s investigative report on Cambridge Analytica, top executives of the company are captured on a hidden video camera discussing what they can do for a candidate running for office in Sri Lanka, who, unknown to them, had been made up by the Channel 4 team. The executives boast of their ability to seed the internet with just the right messages to undermine political opponents. They discuss campaign communication strategy based on research on messages leveraging people’s fears. And while they at first claim to always be truthful, they allude to the value of messages that may give people a mistaken view of the truth.

“We just put information into the bloodstream of the internet and then, and then watch it grow.” ~ Cambridge Analytica managing director Mark Turnbull

In making his case for hiring Cambridge Analytica, CEO Alexander Nix claims credit for the victory of Donald Trump in the 2016 U.S. presidential election. “We were able to use data to identify that there was very large quantities of persuadable voters there that could be influenced to vote for the Trump campaign.”

They pitch an approach to campaigning that seeks to mine people’s deepest fears, which can then be triggered by messaging strategies. They say Cambridge Analytica’s job is to dig deeper than anyone else into people’s “deep-seated underlying fears, concerns.”

“It’s no good fighting an election campaign on the facts, because actually it’s all about emotion.” ~ Cambridge Analytica managing director Mark Turnbull

This approach played out in the 2017 Kenyan general election, where Cambridge Analytica worked on behalf of the incumbent president Uhuru Kenyatta. The election was characterized by misinformation and violence. Ninety percent of Kenyans say they saw false stories spread on social media. Ads from unknown sources, but later attributed to Kenyatta, attacked his main rival Raila Odinga as utterly corrupt. Some targeted specific demographics with messages intended to appeal to their fears, such as an ad targeted to women claiming a sharp rise in the rate of maternal disease, implying the fault was Odinga’s.

Cambridge Analytica publicly denied it was involved with the negative ad campaigns, and indeed in any form with the Kenyan election. But in the Channel 4 News undercover video, managing director Mark Turnbull bragged that they ran the Kenyatta campaigns in 2013 and 2017. He claimed that Cambridge Analytica had written all the speeches and “staged the whole thing.”

A video advertisement for Cambridge Analytica, included in the Channel 4 News report, emphasizes the claim: “Political campaigns have changed. When elections are won by small but crucial numbers of votes, putting the right message in front of the right person at the right moment is more important than ever.”

Cambridge Analytica chief data officer Alex Tayler is seen in the undercover video explaining the use of social media analytics, segmenting, and targeting based on how people are likely to react to certain messages and images. Their sales pitch then turned to an offer of help with another kind of intelligence: spying on the other candidates.

Digging in the dirt

Turnbull describes the process of digging up dirt on opponents. He says they use former spies from agencies like MI5 and MI6, who now work for private intelligence-gathering companies. “They will find all the skeletons in his closet, quietly, discreetly, and give you a report.” This information is then released at the right moment to create maximum political damage. “It has to happen without anyone thinking that’s propaganda.”

To avoid revealing that Cambridge Analytica was involved in the campaign, they suggested contracting under a different name.

“I look forward to building a very long-term and secretive relationship with you.”

“We’re used to operating through different vehicles, in the shadows.” ~ Cambridge Analytica CEO Alexander Nix

As the fictional campaign representative asks what expertise Cambridge Analytica can bring to effectively dig up damaging information about the opponent, Nix replies: “Deep digging is interesting, but you know equally effective can be just to go and speak to the incumbents and to offer them a deal that’s too good to be true, and make sure that’s video recorded…instantly having video evidence of corruption, putting it on the internet.”

Nix also discussed Cambridge Analytica’s history of success with other tactics, such as sending girls to the candidate’s house. “We could bring some Ukrainians in…you know what I’m saying? They are very beautiful, I find that works very well.”

Fake IDs and websites

During the undercover video, Nix occasionally backs off from explicit claims of illegal or dishonest activities. He then spells out how Cambridge Analytica can go about hiding its activities using fake identities, for example posing as tourists. He says they can set up fake websites to spread messages. Turnbull describes setting up a front company to run a “very, very successful project” in an Eastern European country. “No one even knew they were there…they just ghosted in, then ghosted out.”

Channel 4 News. “Cambridge Analytica: Undercover Secrets of Trump’s Data Firm.” YouTube, March 20, 2018,

In this third part of Channel 4 News’s investigative report on Cambridge Analytica, CEO Alexander Nix discusses how the company was responsible for the election of Donald Trump as U.S. president. “We did all the research, all the data, all the analytics, all the targeting, we ran all the digital campaign, the television campaign,” Nix says, “and our data informed all the strategy.”

“Data Driven Behavior Change” ~ Cambridge Analytica advertisement

Some data on American voters isn’t difficult to acquire. The Trump campaign had names and email addresses of 230 million American voters. Other data sources provide many layers of personal information on income, gender, shopping habits, hobbies, and of course voter registration and participation. Use of this information in political campaigns is nothing new, but Cambridge Analytica claimed it was taking the aggregation and deployment of personal data to a new and much higher level.

In a series of four meetings between Cambridge Analytica executives and representatives of a fictional candidate for office in Sri Lanka, Channel 4 News recorded Cambridge Analytica’s claims on video.

Chief Data Officer Dr. Alex Tayler explained their strategy. “When you think about the fact that Donald Trump lost the popular vote by three million votes, but won the electoral college vote, that’s down to the data and the research. You did your rallies in the right locations, you moved more people out in those key swing states on election day, that’s how we won the election (by 40,000 votes in three states).”

The report explains that Cambridge Analytica grew from a British company specializing in military intelligence and psychological warfare. It began building psychological profiles from social media data, and came to the attention of American billionaire and computer scientist Robert Mercer, a champion of ultra-conservative political causes. Mercer arranged for Breitbart senior editor Steve Bannon to become a vice president of Cambridge Analytica, and began investing millions of dollars in the company.

Channel 4 reports that in June 2016 when the Trump campaign was on the ropes, the Mercers stepped in with a massive cash infusion and the support of Steve Bannon and Cambridge Analytica. Trump’s communication strategy became based on the Cambridge Analytica model as explained by CEO Nix: “It’s no good fighting an election campaign on the facts, because actually it’s all about emotion.”

In an era of massive campaign spending by candidates and third parties, Tayler explained how Cambridge Analytica works: communications are divided into mostly positive messaging from the official campaign, and negative messaging from allied Super PACs running behind the campaign. For example, the ubiquitous “Defeat Crooked Hillary” ads were paid for by PACs and third parties, but designed by Cambridge Analytica. They created hundreds of variations on this theme, and arranged for them to be spread on Facebook, YouTube, and Google. Advertisements using these creatives were paid for by the organization Make America Number 1, funded by Robert Mercer and his daughter Rebekah. Cambridge Analytica leveraged other “proxy” organizations aligned with conservative political causes to spread the same messages. Through these methods of distribution, the origin of the messages could remain hidden.

The scale of disinformation in the 2016 presidential campaign was unprecedented. And while Cambridge Analytica was driving the Trump strategy, it remains unclear if there was any connection between the firm and other disinformation campaigns emanating from Russia, which Facebook now says reached 126 million Americans. The role and activities of Cambridge Analytica, and any possible coordination with Russian actors in the election, are now part of the wide-ranging investigation by Special Counsel Robert Mueller. But Cambridge Analytica is incorporated in the UK, and CEO Alexander Nix says the company has no intention of giving U.S. investigators any information about its foreign clients.

The Channel 4 News report is inconclusive as to whether the firm’s claims for its data analytics and psychographic targeting prowess are merely marketing hype or something more. But it’s clear that strategies pioneered by Cambridge Analytica, using big data from social media to profile and target voters in key locations and demographics, secretly leveraging proxies to seed the internet with negative information, and “putting the right message in front of the right person at the right moment,” are likely to be with us during the next election and beyond.

Web Resources

Facebook. “The Graph API.”

Where would we be without the Facebook Graph API? Probably still in the Paris Climate Accord. Anyway, this was the source of data mined by the folks at Cambridge Analytica through the Facebook app created by Aleksandr Kogan, which crawled through 80 million Facebook profiles to build a dataset for sentiment analysis and psychographic messaging strategies. Facebook has since locked it down to some extent, but it’s still pretty useful for social network analysis and opinion mining.

Klipfolio. “Using Facebook’s Graph API Explorer to Retrieve Insights Data.” Klipfolio.com, 11 Apr. 2014,

This short tutorial is written for non-programmers and those unfamiliar with APIs. It provides step-by-step instructions for accessing and using the Graph API Explorer, setting up an access token, and retrieving insights from Facebook pages. Klipfolio is a commercial vendor that provides proprietary dashboard solutions for data analytics, and this blog post is couched in terms of feeding Facebook data into their product. It may still be useful to those who need a very basic introduction to the Facebook API and API Explorer.

Ranjan, Ravi. “How to Use Facebook Graph API and Extract Data Using Python?” Towards Data Science, 2016,

A data scientist explains how to extract data from the Facebook Graph API using Python. Ranjan walks through the process of getting an access token, which is required for making API calls. He references Graph API version 2.7, whereas the current version is 3.0, but the programming patterns are the same. (The Graph API Reference provides documentation for the current and past versions.) The guidance will be useful for anyone with some Python experience who is just beginning to explore what data can be mined from Facebook.
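The token-then-query pattern Ranjan describes can be sketched without touching the network. The helper below only assembles the request URL for a Graph API call; the token is a placeholder, and the field names are typical examples rather than a complete list.

```python
# Sketch of assembling a Graph API GET URL, the first step in the
# token-authenticated workflow Ranjan describes. No request is sent;
# the access token is a placeholder.
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com"

def build_graph_url(node, fields, token, version="v3.0"):
    """Build a Graph API GET URL for `node`, requesting `fields`."""
    query = urlencode({"fields": ",".join(fields), "access_token": token})
    return f"{GRAPH_ROOT}/{version}/{node}?{query}"

url = build_graph_url("me", ["id", "name", "posts"], "PLACEHOLDER_TOKEN")
print(url)
# https://graph.facebook.com/v3.0/me?fields=id%2Cname%2Cposts&access_token=PLACEHOLDER_TOKEN
```

In a real script this URL would be fetched with a library like `requests`, using a valid token generated in the Graph API Explorer as both tutorials describe.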

Ferrara, Emilio. “Data Science for Social Systems.” Accessed 10 May 2018.

This site is a comprehensive syllabus by Prof. Emilio Ferrara, Research Assistant Professor in the Department of Computer Science at the University of Southern California, covering “how to unleash the full power and potential of the Social Web for research and business application purposes!” Topics include machine learning, Natural Language Processing, sentiment analysis, topic modeling, network visualization, and recommender systems, among many other areas of social media processing and analysis. The course has a strong Python orientation.

Shaik, Afiz. “Facebook Data Analysis Using Python: Explore GraphAPI Part 2.” 2018. YouTube,

Shaik walks through the process of using Jupyter Notebook and Python 3 to mine and process Facebook Graph API data. After setting up the development environment using Anaconda, he explains the use of Facebook access tokens and API queries. He then demonstrates how to work with the Graph API Explorer to pull specific data in JSON format. This brief tutorial may be useful for those who prefer learning from video sources.
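The JSON that tutorials like Shaik’s pull from the Graph API arrives as a `data` list plus a `paging` object for fetching the next page of results. The snippet below parses a response of that shape; the payload itself is invented for illustration, and the truncated paging URL is a stand-in.

```python
# Parsing a Graph API-style JSON response: results live in a "data"
# list, with a "paging" object pointing at the next page. The sample
# payload here is invented for illustration.
import json

sample = json.loads("""
{
  "data": [
    {"id": "1", "message": "First post"},
    {"id": "2", "message": "Second post"}
  ],
  "paging": {"next": "https://graph.facebook.com/v3.0/..."}
}
""")

messages = [post.get("message", "") for post in sample["data"]]
next_page = sample.get("paging", {}).get("next")

print(messages)  # ['First post', 'Second post']
```

A harvesting script loops by fetching `next_page` until the `paging` key disappears, which is how tens of thousands of posts end up in one dataset.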

Spring. “Accessing Facebook Data.” Accessed 20 Apr. 2018.

This guide presents a “getting started” introduction to creating a web application that accesses Facebook data using Java. The guide walks through the requirements and the steps needed to develop working code. This resource will be most useful for those with at least intermediate programming experience, especially in Java.

bigdataenthusiast. “Mining Facebook Data Using R & Facebook API!” Data Enthusiast, Mar. 19, 2016.

A blog post by an enthusiastic programmer showing how to extract Facebook API data using the R programming language, and the Rfacebook package. The author provides a detailed, step-by-step guide using screenshots and code examples. Even with recent changes to the Facebook Graph API, the author’s basic approach should still be valid.

Conkwright, William. “How to Get Public Data from Facebook with PHP.” Will Conkwright, June 14, 2017.

Conkwright provides a guide to accessing Facebook API data using PHP, with an example of getting a “talking about” count for locations around Raleigh, North Carolina. First he shows how to use the Facebook API Explorer to generate queries. He explains the specifics of the query string with screenshots, and breaks down the query URL to show the parameters. He then shows how to retrieve Facebook data using a custom PHP function, and provides a link to a gist of the PHP code snippet on GitHub.

Computational Linguistics Research Group. “Pattern: Web Mining Module for Python, with Tools for Scraping, Natural Language Processing, Machine Learning, Network Analysis and Visualization.” Last commit 2017.

This resource is a repo on the GitHub account of the Computational Linguistics Research Group at the University of Antwerp. Pattern is a Python module with tools for data mining, Natural Language Processing, machine learning, and network analysis. It supports a variety of methods for extracting syntactic, semantic, and sentiment information, including n-gram search, clustering, and SVM. Pattern appears well documented and includes bundled examples. The main branch supports Python 2.7, but a Python 3 version is available in the development branch. The documentation includes code examples and several case studies.
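Pattern ships its own implementations, but the n-gram search it supports rests on a technique that fits in a few lines of plain Python. This generic sketch is not Pattern’s API; it just shows what word n-grams are and why counting them is useful for text mining.

```python
# Generic word n-gram extraction, the building block behind n-gram
# search in toolkits like Pattern. Plain Python, not Pattern's API.
from collections import Counter

def ngrams(text, n=2):
    """Return word n-grams from `text` as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "the right message in front of the right person at the right moment"
bigrams = Counter(ngrams(text, 2))
print(bigrams.most_common(1))  # [(('the', 'right'), 3)]
```

Frequent n-grams surface recurring phrases in a corpus, which downstream steps like clustering or sentiment scoring can then treat as features.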

GW Libraries. “Social Feed Manager.” Social Feed Manager, Accessed 10 May 2018.

This site from George Washington University Libraries offers code, documentation, and how-to articles related to the Social Feed Manager, an open source project that harvests social media data from a variety of sources. The project also maintains extensive documentation on readthedocs.

Routley, Nick. “The Multi-Billion Dollar Industry That Makes Its Living From Your Data.” Visual Capitalist, Apr. 14, 2018.

Finally, here’s a fun little guide for consumers on how big tech companies and data aggregators mine and monetize our personal information. The article serves as a reminder that Facebook is only one company in a large industry that consumes data and excretes all the products of contemporary marketing and financial management. The article covers the nature of personal digital profiles compiled by data brokers like Acxiom and Experian, and suggests ways consumers can limit the exposure of their data.

News Sources