Determining Communities In Graphs Using Label Propagation Algorithm

Written by srivassid | Published 2020/12/21
Tech Story Tags: apache-spark | label-propogation-algorithm | python | community | knowledge-graph | graph | graph-database | cluster | web-monetization

TLDRvia the TL;DR App

Community detection is a topic that can be applied to datasets that have an inherent order among themselves.
For example, redwood, birch and oak are types of trees. Fish, algae and octopus are types of underwater creatures. These entities can be grouped under a common category, and the task of doing that is called as community detection.
Sometimes we have a dataset that is unstructured and we want to get some value out of it, in those cases community detection can be useful. In this article I will be talking about how to detect communities using LPA, label propagation algorithm in GraphFrames package of Spark. The code for the article can be found here.

Label Propagation Algorithm

Label Propagation Algorithm is a fast algorithm that detects communities in a graph. It does not require prior information about the communities, and it uses network structure to detect them. We label some initial communities and it takes it from there.
LPA was proposed by Raghavan in this paper, and it works by labelling the nodes, and propagating these labels throughout the network and forming communities based on the process of label propagation.
The idea is that a label will become dominant in a densely connected group of nodes, but will have trouble propagating in a sparsely connected region. When the algorithm finishes, the nodes with the same label are considered as part of the community.
The algorithm works as follows:
a) Every node is initialized with a unique label. 
b) Label are propagated through the network
c) At every iteration of the propagation, each node updates its label to the one that the maximum number of its neighbours belongs to.
d) LPA reaches convergence when each node has the majority label of its neighbours.
e) LPA stops if either convergence or the user defined maximum number of iterations are achieved. 

Implementation

Dataset and Processing

I am using the dataset provided by UMASS, it is a dataset that contains a word, and a reference to the wikipedia url about that word. There are 10 files, 600 MB each, and i am working with one file. It contains 18 million rows.
It is in the following format:
URL: ftp://217.219.170.14/Computer%20Group/Faani/vaset%20fani/second/sattari/word/2007/source/s%20crt.docx
MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube
MENTION computer monitor 1503 http://en.wikipedia.org/wiki/Computer_monitor
MENTION computer monitors 3066 http://en.wikipedia.org/wiki/Computer_monitor
MENTION plasma screens 2162 http://en.wikipedia.org/wiki/Plasma_screen
URL is the link to a particular topic, and mention contains the word that was mentioned in the article along with the Wikipedia link (also mentioned in the article). 
I used the following code to preprocess the data to arrive at a meaningful format: 
After preprocessing, i have the data in the following format:
ftp://ftp.pcbsd.org/pub/handbook 9.0/handbook_en_ver9.0.html,Zfs
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,BSD_UNIX
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Acpi
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Softupdates
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Compiz
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Pcx
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Journaling_file_system
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Nextstep
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Freebsd_jail
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Mac_address
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,IPv6_address
Basically, in this format, I have a URL followed by each word that was mentioned in the original file.

Applying LPA

Now that we have the data in the required format, we can apply LPA. I use the following code to do that.
The largest community is has 8500 nodes. Let us look at it’s contents:
As we can see, the largest community is about free photos. Let us look at another:
 Philip_Neame 
 Monsoon#Asia 
 Arab_Jew 
 Cave_of_Elijah 
 Shi’a_Islam 
 Culture_of_the_Song_Dynasty 
 Safavid_dynasty
 John_Romer_(Egyptologist)
 Bamyan_Province 
 Henry_Phillips_(horticulturist)
 Christ_Church,_Nazareth 
The second largest community is about dynasties or provinces, and people related to them. It has about 6000 nodes. There are some false positives as well, which means more iterations in LPA can probably give good results.
Third largest community is:
Hansi 
Vadodara_district 
Cox-Forbes_theory 
Puppis_(constellation) 
University_of_Alexandria
historical_scholarship 
Villuercas 
Center_of_the_universe 
genre
Velayuthampalayam 
and it is mostly about science. It has 4000 nodes, and by the looks of it, it can probably use more number of iterations as well.

Conclusion

In this post we went through what communities are, and how do we detect them. We used the algorithm Label Propagation algorithm to determine that. It works by initially labelling nodes, then iterating and propagating labels and determining the final label at convergence.
We then used a dataset that contains a URL and references to wikipedia text for several mentions of words in that url, and determined communities from them. Most popular community is about stock photos, then about dynasties and then science. 
I hope you liked my post.

Written by srivassid | Data Engineer
Published by HackerNoon on 2020/12/21