While mining the Information Science Virtual Library for academic papers using the keywords “social media” AND “data mining,” I came across Matthew Russell’s O’Reilly book Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. The 2nd edition was published in October 2013, with a 3rd edition scheduled for publication next month. Because the book covers the specific techniques I’m after for data mining and analysis of social media, I decided to pull the trigger and buy the book right now.
The book is basically a tutorial on data mining social media sites using Python. Alas, all the source code it references is in Python 2.7 and I’ve been working with version 3.6, but that’s fine. It also covers using IPython Notebook (now Jupyter Notebook) and even begins with a quick guide to setting up a virtual server. I’ll probably wait to actually do that until I see what’s new in the 3rd edition. But the book definitely makes the final cut for my annotated bibliography. With that as a given, I thought it would be useful to get started with the first annotation:
Russell, Matthew A. “Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More.” 2nd edition, O’Reilly, 2013.
In Mining the Social Web, technologist and open source software advocate Matthew Russell provides a hands-on guide on how to access the APIs and analyze data from Twitter, Facebook, LinkedIn, Google+, GitHub, and other internet resources. The book presents a “guided tour of the social web,” with chapters focusing on each social network, and includes example code written in Python 2.7. The material is introduced gradually enough for non-programmers, but the specific techniques will be useful to intermediate and advanced programmers.
Russell provides a public GitHub repository dedicated to the book, including code examples, screenshots, and several screencasts hosted on Vimeo. The GitHub repo and the screencasts begin with how to set up a virtual server for use while reading the book, introducing readers to Vagrant virtual environments. Somewhat confusingly, all example code is hosted on a separate wiki, which presents the numbered examples from each book chapter as HTML. The author also provides a blog with additional writings about topics covered in the book.
Most chapters focus on specific social media networks, providing guidance on their APIs and how to create an API connection. The use of IPython Notebook is detailed throughout, with instructions on how to search for specific content and topics, extract entities, conduct frequency analysis, detect patterns, and visualize data using histograms. Each chapter also provides recommended exercises and links to additional online resources. A chapter on data mining the web provides a useful introduction to web crawling and scraping, natural language processing, entity detection, and gisting. Another chapter on email explains how to process an email corpus using the Enron data set, converting it to a Unix mailbox, then to JSON, and finally importing the data into MongoDB. Python queries and analysis tools are explored throughout.
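As a rough illustration of the email workflow described above, a minimal Python 3 sketch using only the standard library might look like the following. This is not the book's own code (which targets Python 2.7); the function names, field names, and sample addresses are hypothetical, and the MongoDB import step is left out.

```python
import json
import mailbox
import os
import tempfile
from collections import Counter


def build_sample_mbox(path):
    """Write a tiny Unix mbox file to `path` (made-up data, for illustration only)."""
    box = mailbox.mbox(path)
    samples = [
        ("alice@example.com", "Q3 numbers"),
        ("bob@example.com", "Lunch?"),
        ("alice@example.com", "Re: Q3 numbers"),
    ]
    for sender, subject in samples:
        msg = mailbox.mboxMessage()
        msg["From"] = sender
        msg["To"] = "team@example.com"
        msg["Subject"] = subject
        msg.set_payload("Hello")
        box.add(msg)
    box.close()  # close() flushes pending changes to disk


def mbox_to_records(path):
    """Convert each message in an mbox file to a JSON-serializable dict."""
    box = mailbox.mbox(path)
    try:
        records = []
        for msg in box:
            payload = msg.get_payload()
            records.append({
                "From": msg["From"],
                "To": msg["To"],
                "Subject": msg["Subject"],
                # Multipart bodies come back as a list; this sketch skips them.
                "Body": payload if isinstance(payload, str) else "",
            })
        return records
    finally:
        box.close()


def sender_counts(records):
    """Simple frequency analysis: how many messages each sender wrote."""
    return Counter(record["From"] for record in records)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = os.path.join(tmp, "sample.mbox")
        build_sample_mbox(mbox_path)
        records = mbox_to_records(mbox_path)
        print(json.dumps(records, indent=2))
        print(sender_counts(records).most_common())
```

Each dict produced by mbox_to_records could then be inserted into a database as a document; with MongoDB that would typically go through a driver such as pymongo, which is omitted here to keep the sketch dependency-free.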
Since the publication of the 2nd edition in 2013, many aspects of social media network APIs have changed. Some of the book’s recommendations and code examples are therefore no longer valid. For example, the book references the Facebook API 2.0, which was deprecated on August 7, 2016. Indeed, it would be impossible for a printed book to remain current with the frequent updates to the Facebook Graph API. That said, the concepts, recipes, and code examples presented in the book are useful as a foundation for building general skills in social media data mining and analysis. The book can also serve as an effective introduction to Python and programming in general. One hopes that the 3rd edition, scheduled for publication in May 2018 with an updated GitHub repo, will bring the code examples and references up to date.