Data mining computations in the DAaS paradigm

Wendy Hui Wang
Assistant Professor
Department of Computer Science
Stevens Institute of Technology

With the fast growing of data volume, organizations recognize the need to derive insight from this deluge. Data mining is now established as a foundational data analytics discipline that is ever more valuable as datasets grow larger [2]. For example, by analyzing billions of transactions, supermarkets are able to derive promotions based on consumer purchases. As the sheer size of today’s data sets is increasingly crossing the petabyte barrier [1, 5, 9], utilizing the vast amount of data presents technical challenges for effective and efficient data mining. The solution of in-house data mining software is not satisfactory, as it either may not be able to address users’ specific analysis needs or fail to deliver analysis results that are easily deployable. Hiring in-house data mining and mining professionals may not be affordable for small and medium-sized organizations that have limited financial budget.

Therefore, outsourcing of data and computing services to a computational powerful service provider is becoming commonplace and essential. This motivates the data-analytics-as-a-service (DAaS) paradigm [1], aiming at enabling organizations with limited computational resources and/or data mining expertise to outsource their data mining needs to a third party service provider (server). Cloud computing, an emerging trend of provisioning scalable computing services, provides a natural solution for the DAaS paradigm. Recently, several academic and industrial organizations have started investigating and developing technologies and infrastructure for cloud-based data analytics services. For instance, big corporations like Amazon, Google and Microsoft are providing cloud-based data mining services in various forms. The paradigm of “analytics and management of data as a service” will presumably grow as popularity of cloud computing grows [4].

With the DAaS architecture, however, the data owner (client) no longer has direct control over the outsourced data and computations. This raises a few security and privacy issues. One of the main privacy issues is that the server has access to valuable data of the owner and may learn sensitive information from it. For example, by looking at the received transactions, the server (or an intruder who compromised the server) can learn which products are always co-purchased. However, both the transactions and the mining results are the property of the data owner and should remain safe from any third party including the server. This problem of protecting important private information of organizations/companies is referred to as “corporate privacy” [6]. Unlike personal privacy, which only considers the protection of the personal information recorded about individuals, corporate privacy requires that both the individual products and the patterns of the collection of products are regarded as corporate assets and thus must be protected.

One of our active research projects aims to devise encryption schemes for privacy-preserving data mining within a corporate privacy-preserving framework. The encryption schemes enable formal privacy guarantees to be proved, while providing acceptable data mining performance over large-scale, real-life transaction databases. The architecture behind our model is illustrated in Figure 1. The client/owner encrypts its data using an encrypt/decrypt (E/D) module, which can be essentially treated as a “black box” from its perspective. The E/D module is responsible for transforming the input data into an encrypted database. The server conducts data mining on the encrypted data and sends the results to the owner. The E/D module recovers the true identity of the returned patterns.

We consider the attacker as the service provider (server) itself or any party that compromises the server. There are two types of attackers: (1) the semi-honest attacker that executes the data mining computations faithfully and returns the correct data mining results to the client, but it is curious to extract additional information of the private input; and (2) the malicious attacker that tries to cheat the client by returning wrong data mining results. The attacker may possess some knowledge of the outsourced data, such as the frequency distribution of the outsourced datasets. Then the attacker can launch elementary statistical analysis attacks by leveraging the adversary knowledge to break the data transformation schemes. For instance, if the dataset is encrypted by a one-to-one substitution scheme (i.e., all identical plaintext values are mapped to the same ciphertext symbols), the attacker can launch the frequency analysis attack by mapping the ciphertext symbols to plaintext values based on their frequency. Besides the adversary knowledge of the data distribution, the attacker may also possess the knowledge of the data transformation schemes. It can exploit such knowledge to launch the known-scheme attack on the transformed data, trying to reverse-engineer the transformed data.

There are several challenges to design robust and efficient data transformation based privacy- preserving methods for data mining computations in the DAaS paradigm. First, it is well known that data transformation may result in some loss of effectiveness of the underlying data [3]. The privacy-preserving data transformation techniques should not degrade the quality of data mining results significantly. Second, the data transformation methods should be able to defend against the attacks based on the attacker’s adversary knowledge, e.g., the knowledge of outsourced data and/or the underlying transformation schemes. Third, given the client’s weak computational constraint, the complexity of the data transformation methods should be much cheaper than that of data mining computations. Furthermore, the data transformation methods should not lead to dramatic computational overhead of data mining at the server side that the client may not be able to afford.

In our recent work [7, 8], we focus on a specific data mining problem as association rule mining, which aims to find interesting patterns among products in transaction datasets. We devise encryption schemes for outsourcing of association rule mining such that formal privacy guarantees can be proven against the attacks conducted by the server using background knowledge, while keeping the resource requirements under control. In particular, based on a formal definition of the attack model, we defined our privacy model named k-privacy requiring that the probability that any cipher item/itemset is mapped to its plaintext item/itemset is no larger than 1/k, where k is a user-defined threshold. We developed an encryption scheme, called RobFrugal, which includes the E/D module to transform client data before it is shipped to the server. We designed a compact structure, called synopsis, to allow the E/D module to recover the true association rules without the access to the original dataset. We also provided the E/D module with an efficient strategy for incrementally maintaining the synopsis against updates of the transaction dataset. We conducted a formal analysis based on our attack model and proved that the RobFrugal scheme satisfies k-privacy. We validated our schema on a large real-world transaction data set from the Coop store chain in Italy. Our results show that our encryption schema is effective, scalable, and achieve the desired level of privacy.

.    [1]  D. J. Abadi. Data management in the cloud: Limitations and opportunities. IEEE Data Engineer- ing Bulletin, 32(1):3–12, 2009.
.    [2]  A. Agarwal, O. Chapelle, M. Dud ́ık, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.
.    [3]  C. C. Aggarwal and S. Y. Philip. An introduction to privacy-preserving data mining. In Privacy- Preserving Data Mining, pages 1–9. Springer, 2008.
.    [4]  R. Buyya, C. S. Yeo, and S. Venugopal. Market-oriented cloud computing: Vision, hype, and reality for delivering it services as computing utilities. In HPCC, 2008.
[5]T. Claburn. Yahoo scales its web analytics database to petabyte range.
.    [6]  C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In in National Science Foundation Workshop on Next Generation Data Mining, 2002.
.    [7]  F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and W. H. Wang. Privacy- preserving mining of association rules from outsourced transaction databases. In Proceedings of the Twentieth Symposium on Advanced Database Systems (SEBD), pages 233–242, 2012.
.    [8]  F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and W. H. Wang. Privacy- preserving mining of association rules from outsourced transaction databases. IEEE Systems Jour- nal, 7(3):385–395, 2013.
.    [9]  C. Monash. The 1-petabyte barrier is crumbling.