Innovative algorithmic techniques in cloud computing

Thumbnail Image
Πισπιρίγκος, Γεώργιος
Journal Title
Journal ISSN
Volume Title
In recent years, due to the universal spread of the internet, the amount of digitally available information continuously grows. Indicatively, in 2021 there are 7.83 billion people, 4.66 billions of whom are considered active Internet users 1 . In addition, over 100 million Gigabytes of data have yet been indexed, while approximately 2.5 billion Gigabytes of raw data are generated on daily basis 2 . Thus, as digital information massively grows, the need for compact data representations has never been more immediate. Among the rest data representations, information networks can undoubtedly be considered one of the most prominent, since they manage to harmonically combine different aspects of information in the same data entity. Specifically, this representation model subtly combines different data modalities, so as to conveniently incorporate the intrinsic inference and semantics. Therefore, by presenting each individual entity as a node and by denoting any kind of association with an interconnection edge, information graphs may adeptly formulate any functional system as one of interacting entities. The usage of this general-purpose abstraction has practically been proven beneficial and has widely been adopted in various scientific sectors - such as chemistry, biology, neuroscience, finance, linguistics, social sciences, software engineering, digital marketing, etc., - mostly because of its obvious interpretation. As a result, with the extended application of this hierarchical data representation in various important issues, - such as customer segmentation, epidemiology, political influence evolution, criminal identification, tissue/organ detection, etc.; - graph analysis has attracted significant scientific interest, triggering the conduction of an impressive amount of research. Focusing on social media, where in 2021 more than 4.14 billion people are considered active social media users 1 , it is obvious that the respective information can effectively be structured with information networks. Specifically, each user can compactly be presented as a vertex, while any kind of user interaction - such as “friendship”, “liking”, “following”, “sharing”, “expressing interest”, “re-tweeting”, etc.; - might be expressed with an interconnection edge between the corresponding vertices. From sentiment analysis, expert identification and mood analysis to recommendation systems, digital footprint analysis and viral marketing campaigns, graph theory lies under numerous substantial issues, aiming to introduce efficient information managing and data mining techniques. Undeniably, one of the most important network analysis topic is community detection. This problem principally aims to identify groups of highly similar entities, aka communities, by primarily leveraging the network topology [1-9] of the data graph under study. The necessity of this topic, as plainly explained in [10], concerns the deeper understanding of the subjacent network structure that could lead to the extraction of advantageous insights regarding the underlying dynamic processes. Therefore, with its appropriate generalization and its widespread adoption from numerous fields - such as recommendation systems, targeted market analysis, influence propagation, link prediction, opinion mining etc.; - this NP-hard graph analysis problem [1] has unquestionably become one of the most essential and challenging. In respect of sociology, community detection can alternatively be construed as the expression of the homophily effect [11,12], which reflect the natural human tendency to mostly associate and interact with groups of similar. Concentrating on social media, community detection can be differently interpreted as the identification of social media users’ groups that are, either directly or indirectly, connected with each other and tend to interact more often, comparing to users of different communities [13]. Consequently, by identifying the strongly related groups of individuals, any social network might be naturally decomposed to groups of highly interacting entities, the communities, which plainly disclose the given social graph's inner mechanics. Despite the lack of a broadly accepted definition [1-9], a community can be intuitively perceived considering its graph representation. Specifically, a set of vertices is expected to form a meaningful community, if only its intra-cluster and inter-cluster densities are bigger, and smaller respectively, to the average link density of the original graph [6]. However, as it is strongly underlined in [14], the optimal community hierarchy is generated when the inter-cluster and intra-cluster densities are comparatively better than expected, and not by the optimization of the individuals. In other words, a good graph division would not be the one in which the number of connections between communities is minimized, but the one in which there are fewer inter-connection edges than originally found. Because of this abstraction, community detection has practically become one of the most conceptually challenging and computationally demanding network analysis topics. Due to its profitable application in plenteous scientific areas, a numerous set of community detection algorithms [1-9] has already been published. From methods that are exclusively based on the repetitive calculation of a global network topology criterion, to alternatives inspired by discrete mathematics and physics, the pluralism of classic community detection approaches is indeed remarkable. The great majority of those methods and algorithms are originally designed to be generally applied to any information network. The classic community detection techniques - e.g., the Louvain [15] algorithm, the Girvan–Newman [16] algorithm, the Clauset–Newman–Moore [6] algorithm, the edge centrality optimization method [1], the geodesic edge betweenness [1] approach, etc.; - are basically recursive methods of high polynomial computational complexity aiming to optimize an iteratively modified and repetitively calculated set of global network topology metrics, in order to extract the underlying community hierarchy from any possible network. Nevertheless, those methods’ capacity has practically confirmed [1,4,8] to be profoundly limited in terms of scalability, outcome consistency, and overall reliability. As meticulously described in [1-9,17], the classic community detection solutions are principally static methods that neglect not only the information networks’ topological heterogeneities, but also their significantly variant subjacent community structure. Additionally, apart from the undisputed computational difficulties of global optimization processes, the actual size of today’s real-world social networks also acts as another major inhibitory factor. As plainly explained in [1], with their corresponding computational complexity being at least quadratic, the application of classic community detection algorithms in large-scale information networks, such as contemporary social media graphs, is prohibitive. In this regard, the primary intention of this thesis is the introduction of highly scalable and accurate community detection methodologies that would efficiently extract the underlying community hierarchy of modern, large-scale social information networks regardless of their size and density. Initially, by alternatively interpreting the node-oriented definition of community to its equivalent edge-oriented representation, the introduction of distributed community prediction takes place. The true purpose of these methodologies is to efficiently identify the subjacent community hierarchy of any large-scale social graph through prediction. This is achieved by classifying the imminent graph's edges to either the ones associating nodes of different communities, aka inter-connection edges, or of the same, aka intra-connection edges, solely based on plain network topology characteristics. The promising perspectives of the distributed community prediction have meticulously been analysed in the [18-20] research publications. All different proposed methodologies have thoroughly been examined on numerous real-life social networks and proven superior to various classic community detection methods in terms of performance, stability and accuracy. Furthermore, aiming to enhance the community identification process to further consider different aspects of social networks’ information, such as the intrinsic user profile information, the article [21] has been published. In this article, a distributed, hybrid, community detection methodology has been introduced that ably combined the local topological characteristics of each social media graph under study along with the existing user profile information, in order to unveil the subjacent community structure. The proposed hybrid community detection approach [21] has been extensively tested on various real-world social graphs, roundly compared to other classic divisive community detection algorithms and practically proven highly scalable and adequately accurate. Finally, the research publications [22,23] have plainly presented the beneficial application of community detection in other sectors such as the viral marketing and computational linguistics. Particularly, in case of [22], the Twitter Personality based Communicative Communities Extraction (T-PCCE) system has been introduced. This system [22], given a real-world Twitter social subgraph, aims to identify the communities of high internal information flows, by also considering the users’ personality traits. Regarding the case of [23], a word-sense disambiguation methodology has been introduced. Specifically, by employing classic community detection techniques and establishing the unprecedented concept of community coherence on the Wikipedia Entities’ semantic ontologies graph, this distributed methodology demonstrated impressive precision and general computational performance as regards the Wikification / text annotation problem. [1] Santo Fortunato: Community detection in graphs. CoRR abs/0906.0612 (2009) [2] Schaeffer, Satu. (2007). Graph Clustering. Computer Science Review. 1. 27-64. 10.1016/j.cosrev.2007.05.001. [3] Devi, J. & Eswaran, Poovammal. (2016). An Analysis of Overlapping Community Detection Algorithms in Social Networks. Procedia Computer Science. 89. 349-358. 10.1016/j.procs.2016.06.082. [4] Stavros Souravlas, Angelo Sifaleras, M. Tsintogianni, Stefanos Katsavounis: A classification of community detection methods in social networks: a survey. Int.J. Gen. Syst. 50(1): 63-91 (2021), doi: 10.1080/03081079.2020.1863394 [5] Cai, Q., Ma, L., Gong, M. and Tian, D. (2016) A survey on network community detection based on evolutionary computation, Int. J. Bio-Inspired Computation, Vol. 8, No. 2, pp.84–98, DOI: 10.1504/IJBIC.2016.076329. [6] Bisma S. Khan, Muaz A. Niazi: Network Community Detection: A Review and Visual Survey. CoRR abs/1708.00977 (2017) [7] Michele Coscia, Fosca Giannotti, Dino Pedreschi: A Classification for Community Discovery Methods in Complex Networks. CoRR abs/1206.3552 (2012) [8] Papadopoulos, S., Kompatsiaris, Y., Vakali, A. et al. Community detection in Social Media. Data Min Knowl Disc 24, 515–554 (2012). [9] Javed, M.A., Younis, M.S., Latif, S., Qadir, J., Baig, A., Community detection in networks: A multidisciplinary review, Journal of Network and Computer Applications (2018), doi: 10.1016/j.jnca.2018.02.011. [10] Andrea Lancichinetti, Mikko Kivelä, Jari Saramäki, Santo Fortunato: Characterizing the community structure of complex networks. CoRR abs/1005.4376 (2010) [11] Khediri, Nourhene & Karoui, Wafa. (2017). Community Detection in Social Network with Node Attributes Based on Formal Concept Analysis, 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications, 1346-1353. 10.1109/AICCSA.2017.200. [12] Yuan Li: Community Detection with Node Attributes and its Generalization. CoRR abs/1604.03601 (2016) [13] P. Held, B. Krause and R. Kruse, "Dynamic Clustering in Social Networks Using Louvain and Infomap Method," 2016 Third European Network Intelligence Conference (ENIC), 2016, pp. 61-68, doi: 10.1109/ENIC.2016.017. [14] Newman MEJ. Modularity and community structure in networks. PNAS June 6, 2006 103 (23) 8577-8582; [15] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre: Fast unfolding of community hierarchies in large networks. CoRR abs/0803.0476 (2008) [16] E.J. Newman, Mark & Girvan, Michelle. (2004). Finding and Evaluating Community Structure in Networks. Physical review. E, Statistical, nonlinear, and soft matter physics. 69. 026113. 10.1103/PhysRevE.69.026113. [17] Peel, L.; Larremore, D.B.; Clauset, A. The ground truth about metadata and community detection in networks. Sci. Adv. 2017, 3, e1602548, doi:10.1126/sciadv.1602548. [18] Makris, C.; Pettas, D.; Pispirigos, G. Distributed Community Prediction for Social Graphs Based on Louvain Algorithm. In IFIP International Conference on Artificial Intelligence Applications and Innovations; Springer: Cham, Switzerland, 2019; pp. 500–511. DOI: 10.1007/978-3-030-19823-7_42 [19] Makris, C.; Pispirigos, G.; Rizos, I.O. A Distributed Bagging Ensemble Methodology for Community Prediction in Social Networks. Information 2020, 11, 199. [20] Makris, C.; Pispirigos, G. Stacked Community Prediction: A Distributed Stacking-Based Community Extraction Methodology for Large Scale Social Networks. Big Data Cogn. Comput. 2021, 5, 14. [21] Konstantinos Georgiou, Christos Makris, Georgios Pispirigos: A Distributed Hybrid Community Detection Methodology for Social Networks. Algorithms 12(8): 175 (2019). [22] Eleanna Kafeza, Andreas Kanavos, Christos Makris, Georgios Pispirigos, Pantelis Vikatos: T-PCCE: Twitter Personality based Communicative Communities Extraction System for Big Data. IEEE Trans. Knowl. Data Eng. 32(8): 1625-1638 (2020). DOI: 10.1109/TKDE.2019.2906197 [23] Makris, C.; Pispirigos, G.; Simos, M.A. Text Semantic Annotation: A Distributed Methodology Based on Community Coherence. Algorithms 2020, 13, 160.
Cloud computing, Distributed processing, Social network analysis, Community detection, Community prediction, Distributed machine learning, Supervised learning, Classification, Bagging ensemble learning, Stacking ensemble learning, Word-sense disambiguation