
I recently looked at the whole Star Wars universe from a computational perspective, where I extracted and analyzed social networks from all seven films. The social network structure revealed some interesting differences between the individual films, especially between the original trilogy and the prequels. Here I’ll look at how we can represent and explore the same network using a Neo4j database.

You can read more about my original analysis in my two blog posts, which include additional social network analysis and my F# scripts for downloading and extracting the data.

The network

The social network was automatically constructed from the films' screenplays. The nodes in the network represent individual characters, and two characters are connected by a link if they both speak within the same movie scene. The network only includes characters that appear in at least two scenes and that are explicitly named in the screenplay (I excluded characters called "PILOT" or even "STAR DESTROYER TECHNICIAN"). I also separated the characters into two categories: Person and Droid.

The interactions create the basic structure of the network, where characters are connected by the SPEAKS_WITH relation. I also included the information about the individual movies using the APPEARS_IN relation. The following graph illustrates the general network structure:

[Figure: network relations]

Now we can set up the database using the social network from all seven movies combined.
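The setup statements are not reproduced here. As a rough sketch of the data model described above, a tiny hand-written subset could be created as follows. The labels, relation types and the name property are the ones used in the queries in this post. The particular characters and links are only an illustrative selection, and the names 'LEIA' and 'R2-D2' are assumptions about how the characters are spelled in the data.

```cypher
// Hypothetical mini-dataset: a hand-picked subset, not the full network
CREATE (ep4:Movie {name: 'Episode IV: A New Hope'})
CREATE (luke:Person {name: 'LUKE'})
CREATE (leia:Person {name: 'LEIA'})   // assumed spelling
CREATE (r2:Droid {name: 'R2-D2'})     // assumed spelling
// Each character is linked to the movie it appears in
CREATE (luke)-[:APPEARS_IN]->(ep4)
CREATE (leia)-[:APPEARS_IN]->(ep4)
CREATE (r2)-[:APPEARS_IN]->(ep4)
// One SPEAKS_WITH relationship per pair; the stored direction is arbitrary
CREATE (luke)-[:SPEAKS_WITH]->(leia)
CREATE (luke)-[:SPEAKS_WITH]->(r2);
```

Because Neo4j always stores a relationship with a direction, the queries that treat SPEAKS_WITH as symmetric match it without an arrow.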

Let’s look at some basic information from the database. The following query extracts all the films that are included in the network.

MATCH (m:Movie)
RETURN m.name

The database correctly contains all seven episodes of Star Wars.

Simple network properties

Now we can start comparing the individual episodes based on their network properties. Is the original trilogy different from the prequels in terms of its social network? And how does The Force Awakens compare to the rest?

Let’s start with looking at the number of characters in each of the episodes. The following query extracts characters and returns their count aggregated by the movie that they appear in.

MATCH (m:Movie)<-[:APPEARS_IN]-(character)
RETURN m.name AS movie, count(*) AS characters
ORDER BY m.name;

We can immediately see some differences between the movies. The original trilogy (Episodes IV to VI) has the smallest number of characters. On the other hand, Episode I: The Phantom Menace has the largest number of characters, almost twice as many.
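Since the characters are split into the Person and Droid categories, the same count can also be broken down by category. The following is a small sketch that assumes every character node carries exactly one label, so that the first entry of labels() is its category.

```cypher
// Count characters per movie and per category (Person or Droid)
// Assumption: each character node has exactly one label
MATCH (m:Movie)<-[:APPEARS_IN]-(character)
RETURN m.name AS movie, labels(character)[0] AS category, count(*) AS characters
ORDER BY movie, category;
```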

How many interactions are there between the characters? In this query, we extract all the links between two characters that both appear within the same movie and return their count for each movie. This is a simplification because I’m assuming that if two characters have a link between them, then they interact in every movie where they both appear.

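// SPEAKS_WITH is matched in one direction so that each link is counted once rather than twice
MATCH (m:Movie)<-[:APPEARS_IN]-(character)-[:SPEAKS_WITH]->(character2)-[:APPEARS_IN]->(m)
RETURN m.name AS movie, count(*) AS edges
ORDER BY m.name;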

The results tell a similar story: the films of the original trilogy have fewer links between characters, and their social networks are smaller. This corresponds to the tighter and more organized structure of the original films, which have a smaller number of characters that bind the story together more closely.

Extracting social network relations

We can also use the database to extract the social network for each of the Star Wars movies. The following query extracts all the characters that appear in a specific episode and all the interactions between them. I’m using Episode VII as an example:

MATCH network=(m)<-[:APPEARS_IN]-(character1)-[r]-(character2)-[:APPEARS_IN]->(m)
WHERE m.name='Episode VII: The Force Awakens'
RETURN character1, r, character2

We can see that there are parts of the network that correspond to the Dark side characters, including Snoke, General Hux and others. There is also a cluster of nodes representing Resistance pilots, who interact mainly with each other and with Poe. Let’s explore the network in more detail.
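As one way of looking closer at a single cluster, the neighbourhood of one character within the episode can be pulled out on its own. This is only a sketch: the name 'POE' is an assumption about how the character is stored in the database.

```cypher
// Characters that speak with Poe and also appear in Episode VII
// Assumption: Poe is stored under the name 'POE'
MATCH (m:Movie {name: 'Episode VII: The Force Awakens'})<-[:APPEARS_IN]-(poe {name: 'POE'})
MATCH (poe)-[r:SPEAKS_WITH]-(other)-[:APPEARS_IN]->(m)
RETURN poe, r, other;
```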

Importance of characters in the network

A basic measure of centrality in a social network is degree centrality. This is simply the number of connections each node has in the network. In our Star Wars network, this corresponds to the number of other characters that each character speaks with. The following query extracts the number of SPEAKS_WITH relations for each character and returns the top 10 results.

MATCH (ch1)-[:SPEAKS_WITH]-(ch2)
RETURN ch1.name AS name, count(*) AS degree
ORDER BY count(*) DESC LIMIT 10;

This result is strongly affected by the large social networks of the prequels. Anakin comes out at the top as the person that speaks with the largest number of other characters. And because of the prequels, even Jar Jar made it into the top 10.
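The degree here is meant as the number of distinct conversation partners. If the data happened to contain more than one SPEAKS_WITH relationship between the same pair of characters, count(*) would count that pair more than once. Whether this occurs depends on how the data was loaded, so the following variant is only a defensive sketch that counts distinct neighbours instead.

```cypher
// Degree as the number of distinct characters each character speaks with
MATCH (ch1)-[:SPEAKS_WITH]-(ch2)
RETURN ch1.name AS name, count(DISTINCT ch2) AS degree
ORDER BY degree DESC LIMIT 10;
```

If every pair is linked by a single relationship, both versions return the same ranking.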

We can instead look at who has the largest degree within the individual films. This query extracts the characters that speak to each other and appear within the same film, and counts the number of such connections for each character. I’m using the original Episode IV: A New Hope in the example.

MATCH (m)<-[:APPEARS_IN]-(ch1)-[:SPEAKS_WITH]-(ch2)-[:APPEARS_IN]->(m)
WHERE m.name='Episode IV: A New Hope'
RETURN ch1.name AS name, count(*) AS degree
ORDER BY count(*) DESC LIMIT 5;

Here, Luke Skywalker is the most central character, followed by Leia and the droids.
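The same per-film ranking can be computed for all episodes at once instead of one film at a time. The sketch below relies on the common behaviour that collect() keeps the order established by the preceding ORDER BY, and then takes the first five entries of each list.

```cypher
// Top five characters by degree within each movie
MATCH (m:Movie)<-[:APPEARS_IN]-(ch1)-[:SPEAKS_WITH]-(ch2)-[:APPEARS_IN]->(m)
WITH m, ch1, count(*) AS degree
ORDER BY degree DESC
// Collect the ranked characters per movie and keep the first five
WITH m, collect({name: ch1.name, degree: degree})[0..5] AS top
RETURN m.name AS movie, top
ORDER BY movie;
```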

We can also visualize the results and look at the specific interactions. For example, the following query extracts all the characters that interact with Luke Skywalker, together with the movies that they appear in.

MATCH path=(luke:Person {name: 'LUKE'})-[:SPEAKS_WITH]-(other)-[:APPEARS_IN]-(movie)
RETURN path

Here we can see that some of the characters cluster around specific episodes (these are the characters that appear only in that episode). Other characters, who interact with Luke and appear across several episodes, show up as more central nodes in the network.
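The visual impression can also be summarized as a table. As a small sketch, the following query lists each character that speaks with Luke together with the number of movies the character appears in, so that characters tied to a single episode can be told apart from those spanning several.

```cypher
// For each character that speaks with Luke, count the movies they appear in
MATCH (luke:Person {name: 'LUKE'})-[:SPEAKS_WITH]-(other)-[:APPEARS_IN]->(movie:Movie)
RETURN other.name AS name,
       count(DISTINCT movie) AS movies,
       collect(DISTINCT movie.name) AS titles
ORDER BY movies DESC, name;
```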

Summary

This GraphGist showed how to do simple social network analysis using the Star Wars social network that I extracted from the film scripts. We looked at how to extract and summarize sub-networks for individual episodes, and for specific characters. Overall, this was my first experience with Neo4j and it was very easy to create the network and extract interesting information from the database.