New big data clustering algorithm with high accuracy and efficiency

The research team aimed to develop a ‘parallel’ k-medoids algorithm that guarantees not only high efficiency but also high accuracy. As in Figure 1, a parallel algorithm can be executed one piece at a time on many different machines and then combined together again at the end to get the correct result.
a year ago

Prof. Lee’s team is developing scalable data mining / machine learning algorithms in support of big data. The k-medoids algorithm is one of the best-known clustering algorithms. Many data mining textbooks introduce it after the k-means algorithm because both algorithms use the notion of representatives. In the k-medoids algorithm, the representatives – medoids – are always selected from actual data objects; on the other hand, in the k-means algorithm, the representatives are determined to be the means of data objects. This is analogous to the difference between the median and mean in statistics. Therefore, the k-medoids algorithm is more robust to noises or outliers than the k-means algorithm because a medoid is less influenced by noises or outliers than a mean.

In addition, the k-medoids algorithm can be applied to a complex data type whose distance measure is not the Euclidean distance. Despite this comparative advantage, the k-medoids algorithm is not as widely used for big data analytics as the k-means algorithm mainly because of its high computational complexity. Although the k-means algorithm has linear time complexity with respect to the number of objects, the most common realization of k-medoids clustering, the Partitioning Around Medoids (PAM) algorithm, has quadratic time complexity. In fact, major distributed machine learning platforms such as Apache Mahout and Apache Spark MLlib officially support only the k-means algorithm between the two algorithms.

Figure: The overall procedure for the parallel processing in the algorithm (credit: Korea Advanced Institute of Science and Technology (KAIST))

Therefore, the research team aimed to develop a ‘parallel’ k-medoids algorithm that guarantees not only high efficiency but also high accuracy. As in Figure 1, a parallel algorithm can be executed one piece at a time on many different machines and then combined together again at the end to get the correct result. The existing algorithm, PAM, incurs high computational cost because it performs a global search over entire data. They are the two main ingredients for high accuracy, but using them simultaneously reduces efficiency. Thus, as in Figure 2, the key idea is to apply these two components individually through two phases: parallel seeding and parallel refinement. In the first phase, the algorithm generates multiple random samples and runs a k-medoids algorithm (e.g., PAM), which performs a global search, on each sample in parallel.

Then, it selects the best among the sets of seeds, each of which is obtained from a sample, and feeds it to the next phase. The second phase aims at reducing the possible errors induced by sampling. The algorithm partitions the entire data based on the seeds and constructs initial clusters. Then, it updates the medoids in parallel locally at each of the initial clusters. The novelty of the algorithm lies in the smart integration of known techniques with formal justifications. The simplicity of the algorithm is a strong benefit because simple algorithms often make a big impact and gain widespread acceptance.

Figure 2. The key idea behind the proposed algorithm.

The research team conducted experiments on 12 Microsoft Azure D12v2 instances located in Japan. Each instance has four cores, 28 GB of main memory, and 200 GB of disk (SSD). All instances run on Ubuntu 14.04. Regarding accuracy, the clustering error is only 1.01~1.03 times that of the most accurate existing algorithm (PAM). At the same time, regarding efficiency, the elapsed time is as low as 52% of the fastest existing algorithm. Therefore, the proposed algorithm has been proven to achieve both high accuracy and high efficiency.

The algorithm was presented at KDD 2017, which is the top-tier conference in the field of data mining.

Source code:

Source: Korea Advanced Institute of Science and Technology (KAIST)

Read more


No comments to display.

Related posts

Call for Applications: Go Ignite Global Call

Start-ups working on solutions in IoT, Big Data Analytics, Cyber Security, Artificial Intelligence, 5G and Customer Experience Enhancement are encouraged to apply.
Application Deadline in 11 days

EU's Call for Proposals: Digital technologies for improved performance in cognitive production plants

Proposals need to develop new technologies to realise cognitive production plants, with improved efficiency and sustainability, by use of smart and networked sensor technologies, intelligent handling and online evaluation of various forms of data streams as well as new methods for self-organizing processes and process chains.
Application Deadline in 4 days

Increased Consumption of Plant-Based Protein Diets to Mitigate the Incidence of Type 2 Diabetes

The < a href = "">wheat protein market</a> is estimated at USD 2.04 Billion in 2017 and is projected to reach USD 2.58 Billion by 2022, at a CAGR of 4.8% from 2017. The wheat protein market has been largely driven by the growing demand for bakery products, the increasing popularity of plant-based foods, wheat protein being a suitable alternative for non-animal protein among vegans coupled with nutritional benefits for lactose-intolerant consumers.

Increase in Use of Crop Protection Products in Developing Countries Drives the Pesticide Inert Ingredients Market

The pesticide inert ingredients market is projected to reach USD 4.7 billion by 2023, from USD 3.5 billion in 2018, at a CAGR of 6.14% during the forecast period. The market is driven by factors such as the increasing demand for specific inert ingredients in pesticide formulation and capability of inert ingredients to improve the efficacy of pesticide application.

Pilot plant to turn sugarcane waste into biofuel and beer bottles

Their patented REACH technology, developed by US parent company Mercurius Biorefining, has the potential to convert sugarcane bagasse and other biomass into cost effective drop-in biofuels and bio-chemicals , as alternatives to fossil fuels.

Call for applications: The Entrepreneurship World Cup

The Entrepreneurship World Cup is more than just a global pitch competition with a shot at life-changing prizes. With 100,000 entrants from around the world, EWC elevates entrepreneurs – providing you with tools and resources to grow your venture. It doesn’t matter how far you’ve come – idea-stage, early-stage, growth-stage or beyond – EWC can put you on the right course. Leverage world-class content in the EWC Accelerator to: unleash your ideas, hone your pitching skills and engage with a global network of mentors. And, oh yeah, compete for those life-changing prizes, together with business opportunities and investment.
Application Deadline in 3 months

EU's Call for Proposals: EGNSS applications fostering digitisation

Actions should deliver new innovative applications, with commercial impact and a clear market uptake perspective (a Business Plan is required as part of the proposal). The proposed EGNSS applications may integrate digital technologies like Internet of Things (IoT), cloud computing, big data and robotics.
Application Deadline in 16 days

Growing Farm Labor Issues due to Higher Costs and Availability Drives the Smart Harvest Market Ma

The smart harvest market is projected to reach USD 15.6 billion by 2023, from USD 9.0 billion in 2018, at a CAGR of 11.81% during the forecast period. The market is driven by factors such as growing farm labor issues due to higher costs and availability, and cost efficiency benefits offered by smart harvest systems.

Call for applications: 2019 SEED Awards for Entrepreneurship in Sustainable Development

Are you part of a small and growing eco-inclusive enterprise that continues to deliver environmental, social and economic benefits to your target markets?
Application Deadline in 2 months

Review of Emerging Additive Manufacturing Technologies in 3D Printing of Cementitious Materials in the Construction Industry

Additive manufacturing is a fabrication technology that is rapidly revolutionizing the manufacturing and construction sectors. In this paper, a review of various prototyping technologies for printing cementitious materials and selected 3D printing techniques are presented in detail.