A Survey on effective similarity Search Models and Techniques for Big data Processing in Healthcare System


P. Shanmuga Sundari*, Dr. M. Subaji, Dr. J. Karthikeyan

School of Computer Science and Engineering, VIT University, Vellore, Tamil Nadu. India.

*Corresponding Author E-mail: sundari.sigamani@vit.ac.in



Traditional DBMSs handle well-structured data in which no element occurs twice, whereas repeated occurrences are quite natural in big data processing. Moreover, over the last decades many characteristics coupled with the data (such as volume, variety and value) have made searching complex for traditional database systems. Storing the data effectively makes it easier to process. The main objective of this paper is to survey how similarity can be found over large data, which requires effective and efficient processing of raw data within a satisfactory response time.


KEYWORDS: Big data, Similarity search, Healthcare systems, Models, Approaches.





The main objective of many real-world applications is “similarity search”, also known as nearest neighbour search, proximity search, or close item search. The search finds the item in a search (reference) database that is nearest to a query item under some distance measure; this item is called the nearest neighbour. If the reference database is very large, or the distance computation between the query item and a database item is expensive because of the hardware cost involved, finding the exact nearest neighbour becomes computationally infeasible. Thus, a lot of research effort has been devoted to approximate nearest neighbour search, which has been shown to be sufficient and useful for many practical problems. Applications that rely on finding similar data include duplicate detection [7, 6], plagiarism detection [9], record linkage [8], data cleaning [3, 2], string searching [1], adverse drug reaction detection [5] and collaborative filtering for recommendation systems [4].


Large volumes of data in structured, unstructured and semi-structured form make finding similar data a major challenge. In recent times, smart filters learn about interests, context and preferences to pick out the most relevant information. These smart filters need an effective big data solution to cope with this dimensionality problem. MapReduce [10] is one such method that supports distributed and scalable solutions; the MapReduce framework and Hadoop are very popular tools for big data problems. An existing real-time application, the European project SAPIR [11], provides a content-based method to analyze, index and search large amounts of images, video and music. It was developed as a large-scale architecture for indexing and searching an image collection according to the visual characteristics of its content.


The rest of the paper is organized as follows. Section 2: motivation of the survey; Section 3: role of similarity search in large-scale data; Section 4: challenges; Section 5: background study of different models; Section 6: techniques and tools; Section 7: discussion and conclusion.




Today, healthcare organizations are not capable of processing and analyzing the massive raw data produced by numerous sources such as clinical data, physicians’ prescription notes, lab test reports, medical tweets from social media, medical forums and electronic health records, even though these sources give access to a wealth of information. This situation has given rise to the Big Data problem. It is particularly true when the normally semi-structured or unstructured data is stored only in its raw form. According to [12], the distinctive characteristic is that the quantity of data available to organizations is rising sharply, while the proportion of data they can analyze or otherwise selectively use is in decline. In general, it is the volume, variety and velocity of current data that together define the Big Data phenomenon. Big data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations and other useful information [13]. A data-centric world makes treatment much easier, not only for the patient but also for the physician, by properly analyzing in a timely manner the historical data of patients with similar symptoms and treatments. The physician treats a new patient using knowledge drawn from the historical data of patients who were diagnosed with the same disease and symptoms. Finding similarity over a huge volume of unstructured data is a major challenge in healthcare systems. Classic examples are medical forums, blogs and health-related social media tweets, which are poorly structured texts, while the much larger image and video data are structured only for storage and display and are completely unstructured with respect to semantic content. In the modern era of big data, treatment is personalized depending on the user, preferences, objects and context. The goal is to find similar patterns or users with a short response time.



Searching becomes complex because the data are large in volume and usually unstructured, and because similarity must be found with a short response time and high retrieval accuracy. Similarity search has two traits. 1. The first is the effectiveness of expressing the resemblance among elements of a specific dataset. For example, content-based similarity search techniques for multimedia data normally extract some features (visual, audio, and others) from the data, and the search proceeds in the data space(s) of these extracted features. 2. The second is the efficiency of the actual similarity indexing and searching. The search performance, the storage efficiency and, in particular, the scalability of the solution are important for the management of large databases.




Although various models are used to improve retrieval, this paper focuses mainly on three models: the vector space model, the metric model and the non-metric model.


Vector space model:

The vector space model is purely an information retrieval method. It is partitioned into three phases. The first phase is document indexing, where terms are extracted from the raw data. The second phase is weighting of the indexed terms, which improves the retrieval of terms relevant to the user. The third phase is ranking the documents with respect to the user query according to some similarity measure.


Document indexing:

The vector space model [14] determines similarity by defining a vector that represents each document and a vector that represents the query. This requires a way of constructing a vector from the terms in a document and a method of measuring the closeness of any two vectors. Documents whose content, as measured by the terms in the text, corresponds most strongly to the content of the query are judged to be the most relevant. The traditional method of determining the closeness of two vectors uses the size of the angle between them; later, a similarity coefficient was calculated instead of the angle. The similarity coefficient can be calculated in different ways. For example, consider a document set with only two distinct terms, α and β. Every vector then has only two components: the first denotes occurrences of α and the second denotes occurrences of β. The simplest way of constructing a vector is to place a one in the corresponding vector component if the term appears and a zero if it does not. The similarity coefficient between the query and each document can be viewed as the distance from the query vector to the document vector. If, for instance, document one is represented by the same vector as the query, it has the highest rank in the result set. Binary term matching alone is not enough: a term can occur frequently in a document that is not relevant to the query, so term weights need to be calculated.
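The two-term example can be sketched in a few lines of Python. This is only an illustration: the whitespace tokenizer, the function name `binary_vector` and the sample texts are assumptions made here, not part of [14].

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the text, else 0."""
    tokens = set(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical vocabulary with the two terms alpha and beta.
vocabulary = ["alpha", "beta"]
query = binary_vector("alpha beta", vocabulary)
doc_one = binary_vector("alpha beta alpha", vocabulary)
doc_two = binary_vector("alpha", vocabulary)
```

Here `doc_one` has the same vector as the query, while `doc_two` matches on only one component. Note that the repeated term in `doc_one` leaves no trace in a binary vector, which is the limitation that term weighting addresses.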


Term weighting:

Term weighting indicates that one term is more important than another. Term weighting for the vector space model is based on three factors: the term frequency factor, the collection frequency factor and the length normalization factor. The weight is calculated by means of the inverse document frequency (IDF) corresponding to a given term. To construct a vector that corresponds to each document, consider the following definitions:


$n$ = number of distinct terms in the document collection

$tf_{ij}$ = number of occurrences of term $t_j$ in document $D_i$ [term frequency]

$df_j$ = number of documents that contain $t_j$

$idf_j = \log(d / df_j)$, where $d$ is the total number of documents [inverse document frequency]

The weighting factor $d_{ij}$ for a term in a document is defined as a combination of term frequency (tf) and inverse document frequency (idf). To calculate the value of the jth entry in the vector corresponding to document i, the following equation is used in [14]:

$$d_{ij} = tf_{ij} \times idf_j$$

where $d_{ij}$ is the weighting factor for the term in a document, $tf_{ij}$ is the term frequency and $idf_j$ is the inverse document frequency.
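As a rough sketch of how these definitions combine, the following Python computes one tf-idf vector per document. The base-10 logarithm, the whitespace tokenizer and the name `tf_idf_vectors` are assumptions chosen for illustration; the text above does not fix them.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) where vectors[i][j] = tf_ij * idf_j."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    d = len(tokenized)  # total number of documents
    # df_j: number of documents containing term t_j
    df = Counter(term for doc in tokenized for term in set(doc))
    # idf_j = log(d / df_j); base 10 is an assumption
    idf = {term: math.log10(d / df[term]) for term in vocabulary}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # tf_ij: occurrences of t_j in D_i
        vectors.append([tf[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


vocab, vecs = tf_idf_vectors(["fever cough", "fever rash", "fever cough cough"])
```

In this toy collection, "fever" occurs in every document, so its idf and therefore its weight is zero everywhere, while the rarer "rash" receives the largest idf.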


Similarity coefficient:

A similarity coefficient (SC) between a query Q and a document $D_i$ is given by the inner product of the two vectors. Since a query vector has the same length as a document vector, the same measure is often used to compute the similarity between two documents.

$$SC(Q, D_i) = \sum_{j=1}^{n} w_{qj} \times d_{ij}$$

where $w_{qj}$ is the weight of term $t_j$ in the query and $d_{ij}$ is the weight of term $t_j$ in document $D_i$.
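A minimal sketch of ranking by the inner-product similarity coefficient follows. The document identifiers, the weights and the helper names are hypothetical and serve only to make the computation concrete.

```python
def similarity_coefficient(query_weights, doc_weights):
    """Inner product of a query vector and a document vector of equal length."""
    if len(query_weights) != len(doc_weights):
        raise ValueError("query and document vectors must have the same length")
    return sum(w * d for w, d in zip(query_weights, doc_weights))


def rank_documents(query_weights, documents):
    """Return document ids ordered by decreasing similarity coefficient."""
    scored = [(similarity_coefficient(query_weights, weights), doc_id)
              for doc_id, weights in documents.items()]
    # Sort by score descending, breaking ties by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical term weights over a three-term vocabulary.
documents = {
    "D1": [0.2, 0.0, 0.5],
    "D2": [0.0, 0.9, 0.0],
    "D3": [0.4, 0.4, 0.4],
}
query = [1.0, 0.0, 1.0]
ranking = rank_documents(query, documents)
```

Because the inner product is not normalized, longer documents with many non-zero weights tend to score higher, which is one motivation for the cosine measure.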


Similarity measure:

Numerous measures for comparing a query vector with a document vector have been proposed. The most common is the cosine measure [14], where the cosine of the angle between the query and document vectors is given by:

$$SC(Q, D_i) = \frac{\sum_{j=1}^{n} w_{qj}\, d_{ij}}{\sqrt{\sum_{j=1}^{n} d_{ij}^{2} \; \sum_{j=1}^{n} w_{qj}^{2}}}$$

Because the query magnitude is the same for every document, dividing the inner product by the magnitude of the document vector alone gives the same relevance ranking as the cosine coefficient.
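The cosine measure, and the observation that normalizing by the document magnitude alone preserves the ranking, can be checked with a short sketch. The function names and the sample vectors are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def cosine(query, doc):
    """Cosine of the angle between query and document vectors."""
    denom = norm(query) * norm(doc)
    return dot(query, doc) / denom if denom else 0.0


def doc_normalized(query, doc):
    """Inner product divided by the document magnitude only."""
    n = norm(doc)
    return dot(query, doc) / n if n else 0.0


def rank(score, query, docs):
    """Indices of docs ordered by decreasing score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)


query = [1.0, 1.0, 0.0]
docs = [[2.0, 2.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
```

Both scoring functions order the sample documents identically, since they differ only by the constant factor `norm(query)`.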


Relevance feedback:

The idea behind relevance feedback is to take the results that are initially returned for a given query and use information about their relevance to run a new query whose results are more relevant. Feedback may be explicit, implicit or pseudo. Relevance information is exploited by using the contents of the relevant documents either to adjust the weights of terms in the original query or to add words to the query. According to [15], vector space approaches to information retrieval permit the user to search for concepts rather than specific words and order the results of the search according to their relative similarity to the query. Locality Sensitive Indexing [16] aims to find information similar to (rather than literally matching) other information. Two novel model architectures [16] were proposed for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared with the previously best-performing techniques based on different kinds of neural networks.
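One common way to adjust query weights from feedback is a Rocchio-style update. The text above does not name this method, so the sketch below is a swapped-in illustration; the coefficients `alpha`, `beta` and `gamma` and their default values are assumptions.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    n = len(query)

    def centroid(vectors):
        # Mean vector; a zero vector when no feedback of this kind is available.
        if not vectors:
            return [0.0] * n
        return [sum(v[j] for v in vectors) / len(vectors) for j in range(n)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    updated = [alpha * query[j] + beta * rel[j] - gamma * non[j] for j in range(n)]
    # Negative weights are clipped to zero, a common convention.
    return [max(0.0, w) for w in updated]


new_query = rocchio_update(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
    alpha=1.0, beta=0.5, gamma=0.5,
)
```

In the example, the second term enters the query with a positive weight because it occurs in the relevant documents, which corresponds to "adding words to the query". With pseudo feedback, the top-ranked results would simply be passed in as `relevant` without asking the user.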


Inverted index:

A naive implementation of the vector space model needs a lengthy sequential search. To avoid this time complexity, the inverted index is introduced. Each of the n terms is stored in a structure called the index. For each term, a pointer refers to a linked list called the posting list, which holds document identifiers together with term frequencies. This improves the run-time performance of information retrieval. An inverted index [16] associates an element with the entities that contain it. Instead of comparing every pair of entities in a collection, an inverted index makes it possible to compare only the entities that share at least one element. An inverted-index-based algorithm [16] indexes every set element to generate candidate pairs for similarity computation, with optimizations based on threshold information. Using cloud computing techniques, [17] implemented a self-care service that uses an inverted index to match records with the same disease symptoms and personal information. The symptom set and the medical record ID are mapped using the inverted index. Using the regular Lucene APIs, the medical records are transformed into Lucene documents, which are then stored as block files in the HDFS of the Hadoop cluster. Moreover, to allow online medical record searching, MapReduce tasks are initiated to build index files for each Lucene document. In the indexing stage, the index files are also stored in the HDFS of the Hadoop cluster, which achieves scalability.
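The posting-list idea can be sketched with an in-memory dictionary. This is a toy illustration, not the Lucene or HDFS implementation described above; the record contents and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to its posting list of (document id, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        term_counts = Counter(documents[doc_id].lower().split())
        for term, tf in sorted(term_counts.items()):
            index[term].append((doc_id, tf))
    return index


def candidates(index, query):
    """Ids of documents that share at least one term with the query."""
    ids = set()
    for term in query.lower().split():
        for doc_id, _ in index.get(term, []):
            ids.add(doc_id)
    return ids


# Hypothetical symptom records keyed by medical record id.
records = {1: "fever cough cough", 2: "rash fever", 3: "headache"}
index = build_inverted_index(records)
matches = candidates(index, "cough rash")
```

Only the candidate records need a full similarity computation; record 3 is never touched for this query, which is where the saving over a sequential scan comes from.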



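The metric space model [17] deals with unstructured data. A metric space M is a pair M = (D, d), where D is the domain of objects and d is a total distance function $d : D \times D \to \mathbb{R}$ satisfying the following postulates for all objects $x, y, z \in D$:

$$d(x, y) \ge 0 \quad \text{(non-negativity)}$$
$$d(x, y) = 0 \iff x = y \quad \text{(identity)}$$
$$d(x, y) = d(y, x) \quad \text{(symmetry)}$$
$$d(x, z) \le d(x, y) + d(y, z) \quad \text{(triangle inequality)}$$

The smaller the distance between two objects, the more similar they are. According to [17] there are two types of similarity queries. Let $I \subseteq D$ be a finite set of indexed objects.

Definition 1: Given an object $q \in D$ and a radius $r \ge 0$, the range query returns all objects $x \in I$ with $d(q, x) \le r$.

Definition 2: Given an object $q \in D$ and an integer $k \ge 1$, the k-nearest-neighbour query returns a set of k objects from I with the smallest distances to q.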
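The two standard similarity queries in a metric space, the range query and the k-nearest-neighbour query, can be sketched as a linear scan over the indexed objects. Index structures such as the M-tree exist to avoid this full scan; the function names and the one-dimensional sample data are assumptions for illustration.

```python
import heapq


def range_query(objects, distance, q, r):
    """All indexed objects within distance r of the query object q."""
    return [x for x in objects if distance(q, x) <= r]


def knn_query(objects, distance, q, k):
    """The k indexed objects with the smallest distance to q."""
    return heapq.nsmallest(k, objects, key=lambda x: distance(q, x))


def absolute_difference(a, b):
    # A simple metric on numbers, used only as an example.
    return abs(a - b)


indexed = [1, 4, 6, 10]
in_range = range_query(indexed, absolute_difference, q=5, r=1)
nearest = knn_query(indexed, absolute_difference, q=5, k=3)
```

Both functions take the distance as a parameter, so any function that satisfies the metric postulates can be plugged in without changing the query logic.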



A more generic and extensible kind of similarity searching was developed several years ago around the mathematical concept of metric space [19], [18]. The widely cited M-tree [20] was the first balanced, dynamic and disk-oriented tree search structure for generic metric data. Later it was extended to a distributed architecture [21]. Several distributed data structures for metric space searching, such as GHT [22], MCAN [23] and M-Chord [24], have been proposed. M-Chord [24] is a distributed peer-to-peer data structure for metric-based similarity searching. iDistance [25] maps the data space into a one-dimensional domain. The indexed data is split across distributed nodes, and the topology of the Chord protocol is used for navigation. Among these techniques, M-Chord gives better performance under several conditions. The Metric Index (M-Index) [26] is a distributed data structure for similarity data management that merges the valuable properties of structured peer-to-peer networks with the ability to search by similarity defined by a general metric. It uses the principle of M-Chord, but the centralized M-Index is more efficient than centralized versions of M-Chord. As an orthogonal approach, the metric model was applied to content-based image retrieval (CBIR) [27]. According to [27], searching can be done by visual content similarity. The approach is highly flexible and works on the principles of structured peer-to-peer networks. These techniques generally extract a signature of the image content, which is then used for indexing and searching.


Distance measure:

A distance measure quantifies the closeness of objects in a given domain. Different functions are used for different types of data. The most frequently used measures are the Minkowski distance, the Jaccard coefficient and the Hausdorff distance, which also differ in time complexity.
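The three measures named above can be written down directly. This is a plain sketch using their standard definitions; the Jaccard coefficient is a similarity, so the code uses one minus the coefficient as the distance, and the function names are chosen here for illustration.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between numeric vectors; p=1 is Manhattan, p=2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_distance(a, b):
    """1 minus the Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def hausdorff(set_a, set_b, distance):
    """Hausdorff distance between two finite, non-empty sets of objects."""
    def directed(source, target):
        # Largest distance from a source point to its nearest target point.
        return max(min(distance(s, t) for t in target) for s in source)
    return max(directed(set_a, set_b), directed(set_b, set_a))
```

The costs differ noticeably: Minkowski is linear in the vector length, Jaccard is roughly linear in the set sizes, and this direct Hausdorff computation needs a number of distance evaluations proportional to the product of the two set sizes.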



The usual approach to similarity search is to define a dissimilarity measure that satisfies the properties of a metric (strict positiveness, symmetry and the triangle inequality) and then to use it to query for similar objects in large data collections.



Many similarity functions do not satisfy some or all of the properties of the metric space model. Such cases need similarity searches that depend not only on the distance but also on other parameters such as time, user, query and the objects in the database [27]. These intrinsic characteristics play a major role in content-based retrieval in many applications. Content-based recommendation systems [28] recommend an item to a user based on a description of the item and a profile of the user’s interests.


Collaborative assessment and recommendation engine (CARE):

CARE applies a collaborative filtering model to a person’s medical history in order to identify high-likelihood future diagnoses. Collaborative filtering is a technique by which similar individuals are identified through a set of known shared preferences or attributes. Its primary focus is to recognize new preferences for an individual based on the non-shared interests of other similar individuals [29-32]. Figure 1 [33] shows an architectural diagram of the CARE algorithm, which consists of three major phases. For a patient p, the algorithm begins with an initial filtering of all patients in the database, keeping only those who have at least one disease in common with p. Using this subset of patients, the collaborative filtering step is performed. CARE’s collaborative filtering uses a binary coding, with 1 representing a present diagnosis and 0 an absent or undiagnosed one. Moreover, the inverse frequency of each diagnosis is used in order to give higher weight to less common diagnoses. CARE additionally incorporates a time factor to reflect the stage-wise progression of a disease in the patient’s medical history. A collaborative filtering model is produced for each set of matching patients, one for each disease of p. Finally, the results are aggregated into a ranked list of high-probability diseases for p. An item-based similarity algorithm [4] builds an item-item similarity table (e.g. using item-item correlation or cosine similarity between item vectors) for finding items that customers tend to purchase together.
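The filtering, weighting and ranking steps can be illustrated with a heavily simplified sketch. It is not the CARE algorithm of [33]: it omits the time factor and the per-disease models, and the weighting formula, similarity function, function name and sample histories are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def rank_future_diagnoses(target, histories):
    """Rank diagnoses not yet seen in `target` (a set of codes) using similar patients.

    histories maps a patient id to that patient's set of diagnosis codes
    (the binary coding: a code is either present or absent).
    """
    n = len(histories)
    freq = Counter(code for h in histories.values() for code in h)
    # Inverse-frequency weight: rarer diagnoses count more (assumed formula).
    weight = {code: math.log(n / freq[code]) + 1.0 for code in freq}

    # Phase 1: keep only patients sharing at least one diagnosis with the target.
    neighbours = [h for h in histories.values() if h & target]

    # Phase 2: weighted-overlap similarity, accumulated per candidate diagnosis.
    scores = defaultdict(float)
    for h in neighbours:
        shared = sum(weight[c] for c in h & target)
        union = sum(weight.get(c, 1.0) for c in h | target)
        similarity = shared / union
        for code in h - target:
            scores[code] += similarity

    # Phase 3: aggregate into a ranked list, highest score first.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


histories = {
    "a": {"diabetes", "hypertension", "retinopathy"},
    "b": {"diabetes", "neuropathy"},
    "c": {"asthma"},
}
ranked = rank_future_diagnoses({"diabetes", "hypertension"}, histories)
```

In this toy data, patient "a" overlaps more with the target than patient "b", so the diagnosis seen only in "a" is ranked first, and patient "c" is removed by the initial filter.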


Fig. 1:Implementation of CARE Algorithm [33].



HCube [39] is a data centre solution designed to reduce the number of hops needed to support similarity search. It concentrates on placing similar data on servers that are physically near each other. It is organized as a three-dimensional structure based on a combination of the Random Hyperplane Hashing function, a three-dimensional topology, the Gray space-filling curve and the XOR routing mechanism. Parallel data processing paradigms rely on partitioning and redistributing data for efficient query execution. A 3-stage approach [42] for end-to-end set-similarity joins balances the workload efficiently and reduces replication. It also controls the amount of data kept in main memory.


Locality sensitive hashing:

Locality Sensitive Hashing (LSH) [35] is a well-known technique in which data elements are hashed so that similar elements are placed in the same bucket with high probability. Placing similar elements in the same bucket reduces search hops and response time. Load balancing is achieved by distributing the data among the buckets during mapping. PLSH [36] distributes hash-table search across multiple cores and multiple nodes. It provides concurrent access to the same set of hash tables and maximizes cache locality and throughput by grouping the data. It uses LSH to reduce the index construction time considerably. A hybrid approach periodically merges an optimized LSH delta table into the main LSH structure to handle streaming data and the expiration of old documents. The RankReduce [37] method uses LSH together with a MapReduce implementation; the two are a natural match, since the hashing principle of LSH can be integrated efficiently into the mapping stage of MapReduce. Both techniques achieve high accuracy and performance.
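The bucketing idea can be sketched with random hyperplane hashing, one LSH family suited to cosine similarity. Choosing this family is an assumption made for illustration and is not necessarily the scheme used in [35], [36] or [37]; the function names, the number of bits and the seed are likewise hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_table(vectors, hyperplanes):
    """Group item keys into buckets by signature."""
    table = defaultdict(list)
    for key, vector in vectors.items():
        table[signature(vector, hyperplanes)].append(key)
    return table


def lookup(table, hyperplanes, vector):
    """Candidate neighbours: items that share the query's bucket."""
    return table.get(signature(vector, hyperplanes), [])


planes = make_hyperplanes(num_bits=8, dim=3, seed=42)
items = {"a": [1.0, 2.0, -0.5], "b": [-3.0, 0.5, 1.0]}
table = build_table(items, planes)
hits = lookup(table, planes, [2.0, 4.0, -1.0])
```

The query is a scaled copy of item "a", so it points in the same direction and always lands in the same bucket. Vectors separated by a small angle collide with high probability rather than with certainty, which is why practical systems use several hash tables.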



To increase the effectiveness of similarity search, the query must be evaluated against all the data and not just part of it. Searching and filtering over comprehensive data becomes more complex when features are extracted from a large data set. According to [40], “current tools and systems for distributed processing are typically considered either for 1) batch processing of a large data volume that is already stored in a distributed commodity hardware infrastructure (Hadoop-like systems based on the MapReduce processing paradigm) and 2) parallel and distributed processing of a given complex computational task (systems like Storm or S4). Neither of these approaches is fully sufficient for any big data problems, since the desired system must cover both these tasks at the same time”. Recently, the lambda architecture [43] (like Spark) was designed to support both batch processing and streaming in order to process data efficiently. To minimize infrastructure cost, the Hadoop Distributed File System distributes data across commodity hardware. It allows heterogeneous data sources to be processed by a unified computational task and increases the efficiency of searching over large amounts of data.



Among the models discussed for similarity search: 1. The vector space model is purely an information retrieval method in which indexing, scoring and term weights are computed. Because indexing a large data set is time-consuming, the inverted index is used, which applies mainly to document data. 2. The metric model is implemented with branch-and-bound techniques and must satisfy all the metric properties mentioned above; it can be applied only to numeric data. 3. The non-metric model deals with intrinsic characteristics such as objects, content, user preferences and context. Today, business intelligence personalizes services according to individual interests. Many machine learning and data mining algorithms find similarity matches among high-dimensional vectors that represent the training data, but the computation becomes very expensive. Similarity search is already well established in text-based retrieval, even at web scale. However, in order to deal with the variety, volume and veracity of Big Data, several challenging problems remain to be addressed. The major focus of this paper is the clarification of two basic similarity management challenges: searching and retrieval.

Table 1 lists the techniques and approaches used in the surveyed literature to solve the similarity search problem.


Table 1: List of techniques and approaches

| Author and Year | Method and Techniques | Test Bed (Data) | | |
|---|---|---|---|---|
| HCube [Villaça, R. S., et al. 2016] [41] | Storage and retrieval. Data center based approach with XOR based flat routing mechanism. | Adult data set from UCI repository | Efficient; reduces number of hops. | 4-dimensional HCube data model. |
| HDInsight4PSi [Mrozek, D., Daniłowicz, P., and Małysiak-Mrozek, B. (2016)] [44] | Parallelization technique: jCE, jFATCAT-rigid and jFATCAT-flexible | Protein structures from Protein Data Bank. | Reduces the searching time. | Analyze the data with unbalanced data and sequential data set. |
| B. Mohammadhossein [2015] [45] | Column based data storage. Query processing. | Patient data: information, symptoms, treatment. | Execution time and accuracy. | Offline only. |
| Lin, W. [2015] [46] | Information retrieval using inverted index. | | Concurrency and filtering of private information. | Flexible user input; addresses synonym issues. |
| ROSEFW-RF [Triguero, I. et al., 2015] [39] | Data mining and machine learning classification algorithms. | | Scalable and improved classification accuracy. | Improve classification performance in imbalanced data and apply undersampling and oversampling for instance reduction. |
| Kenney Ng [2014] | Dependency graph | | Scalability and accuracy | To improve computational performance. |
| Sundaram, N. (2013) [47] | Locality Sensitive Hashing | | Scalable, reliable, low latency and high throughput. | Highly dynamic |
| Baraglia, Ranieri, Gianmarco De Francisci Morales, and Claudio Lucchese (2010) [48] | Indexing and pairwise similarity computation over large collections of data using MapReduce. | AQUAINT-2 newswire text. | Reduces disk access and time. | |
| Vernica, R., Carey, M. J., and Li, C. (2010, June) [49] | A 3-stage approach for end-to-end set-similarity joins. | | Efficiently performs set-similarity join. | |
| Chan, L. W. C., et al. [2010] [50] | Machine learning classification. | EHR system from Hong Kong hospital | Accuracy, specificity and sensitivity | Uses various similarity measures to improve classification performance |
| Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard (2008) [51] | Similarity self-join with prefix filtering and similarity self-join with remainder file. | WT10g TREC web corpus | Scalable retrieval of similar information above a user-defined threshold value. | |
| Yan, R., Hauptmann, A. G., and Jin, R. (2003) [52] | Non-metric approach: pattern recognition, classification | TREC Video Retrieval Track | Negative pseudo-relevance feedback to improve the performance of content-based video retrieval. | Outliers might be misclassified |
| Badrul Sarwar [2001] [53] | Collaborative filtering technique: model-based approach | MovieLens | Item-based recommendation algorithms | Inefficient for dynamic data |



1.        Hadjieleftheriou, M., Chandel, A., Koudas, N., and Srivastava, D. (2008, April). Fast indexes and algorithms for set similarity selection queries. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on (pp. 267-276). IEEE

2.        Baraglia, R., De Francisci Morales, G., and Lucchese, C. (2010, December). Document similarity self-join with mapreduce. In Data Mining (ICDM), 2010 IEEE 10th International Conference on (pp. 731-736). IEEE.

3.        Elsayed, T., Lin, J., and Oard, D. W. (2008, June). Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (pp. 265-268). Association for Computational Linguistics

4.        Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. (2001, April). Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web (pp. 285-295). ACM.

5.        Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of biomedical informatics. 2015 Feb 28;53:196-207.

6.        Henzinger, M. (2006, August). Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 284- 291). ACM

7.        Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.

8.        Winkler, W. E. (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.

9.        Hoad, T. C., and Zobel, J. (2003). Methods for identifying versioned and plagiarized documents. Journal of the American society for information science and technology, 54(3), 203-215

10.     Dean, J., and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113

11.     http://sysrun.haifa.il.ibm.com/sapir/demos.html (Last accessed on 11 March 2016).

12.     Zikopoulos P, Eaton C (2006) Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Education

13.     Dhar V (2013) Data Science and Prediction. Commun ACM 56(12):64–73

14.     Grossman, D. A., and Frieder, O. (2012). Information retrieval: Algorithms and heuristics (Vol. 15). Springer Science and Business Media

15.     Letsche, T. A., and Berry, M. W. (1997). Large-scale information retrieval with latent semantic indexing. Information sciences, 100(1), 105-137.

16.     Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

17.     Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach. Advances in database systems, vol 32. Springer, New York

18.     Deza M, Deza E (2012) Encyclopedia of Distances. Springer

19.     O’Searcoid M (2006) Metric Spaces. Springer Undergraduate Mathematics Series. Springer.

20.     Ciaccia P, Patella M, Zezula P (1997) M-Tree: An efficient access method for similarity search in metric spaces. In: Proceedings of 23rd International Conference on Very Large Data Bases (VLDB ’97), vol 25, pp 426–435.

21.     Batko M, Novak D, Falchi F, Zezula P. Scalability comparison of peer-to-peer similarity search structures. Future Generation Computer Systems. 2008 Oct 31;24(8):834-48

22.     Batko M, Gennaro C, Zezula P. Similarity grid for searching in metric spaces. In Peer-to-Peer, Grid, and Service Orientation in Digital Library Architectures 2005 (pp. 25-44). Springer Berlin Heidelberg.

23.     Falchi F, Gennaro C, Zezula P. A content–addressable network for similarity search in metric spaces. In Databases, Information Systems, and Peer-to-Peer Computing 2007 (pp. 98-110). Springer Berlin Heidelberg.

24.     Novak, D., and Zezula, P. (2006, May). M-Chord: a scalable distributed similarity search structure. In Proceedings of the 1st international conference on Scalable information systems (p. 19). ACM

25.     H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS 2005), 30(2):364-397, 2005.

26.     Novak, D., Batko, M., and Zezula, P. (2012). Large-scale similarity data management with distributed metric index. Information Processing and Management, 48(5), 855-872.

27.     Bustos, B., and Skopal, T. (2011). Non-metric similarity search problems in very large collections. databases, 30(31), 32

28.     Pazzani, M. J., and Billsus, D. (2007). Content-based recommendation systems. In The adaptive web (pp. 325-341). Springer Berlin Heidelberg.

29.     D. Goldberg, D. Nichols, B.M. Oki, D. Terry, Using collaborative filtering to weave an information tapestry, Commun. ACM 35 (12) (1992) 61–70.

30.     P. Resnick, H.R. Varian, Recommender systems, Commun. ACM 40 (3) (1997) 56–58.

31.     J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1998, pp. 43–52.

32.     L. Duan, W.N. Street, E. Xu, Healthcare information systems: data mining methods in the creation of a clinical recommender system, Enterprise Inform. Syst. 5 (2) (2011) 169–181.

33.      D.A. Davis, N.V. Chawla, N.A. Christakis, A.-L. Barabási, Time to care: a collaborative engine for practical disease prediction, Data Min. Knowl. Discovery 20 (3) (2010) 388– 415

34.     N.V. Chawla, D.A. Davis, Bringing big data to personalized healthcare: a patient-centered framework, J. General Internal Med. 28 (3) (2013) 660–665.

35.     Gionis, A., Indyk, P., and Motwani, R. (1999, September). Similarity search in high dimensions via hashing. In VLDB (Vol. 99, No. 6, pp. 518-529)

36.     Sundaram N, Turmukhametova A, Satish N, Mostak T, Indyk P, Madden S, Dubey P. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment. 2013 Sep 1;6(14):1930-41.

37.     Stupar, A., Michel, S., and Schenkel, R. (2010, July). RankReduce: processing k-nearest neighbor queries on top of MapReduce. In Proceedings of the 8th Workshop on Large Scale Distributed Systems for Information Retrieval (pp. 13-18).

38.     Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J. M., and Herrera, F. (2015). ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, 87, 69-79.

39.     Villaça, R. S., Pasquini, R., de Paula, L. B. and Magalhães, M. F. (2016). HCube: routing and similarity search in Data Centers. Journal of Network and Computer Applications, 59, 386-398

40.     Marz N, Warren J (2014) In: Principles and best practices of scalable real time data systems. Manning Publications Co

41.     Mrozek D, Daniłowicz P, Małysiak-Mrozek B. HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HD Insight clusters in Microsoft Azure cloud. Information Sciences. 2016 Jul 1;349:77-101

42.     Barkhordari M, Niamanesh M. ScaDiPaSi: an effective scalable and distributable Map Reduce-based method to find patient similarity on huge healthcare networks. Big Data Research. 2015 Mar 31;2(1):19-27

43.     Lin W, Dou W, Zhou Z, Liu C. A cloud-based framework for Home-diagnosis service over big medical data. Journal of Systems and Software. 2015 Apr 30;102:192-20

44.     Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., and Dubey, P. (2013). Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment, 6(14), 1930-1941.

45.     Baraglia, R., De Francisci Morales, G., and Lucchese, C. (2010, December). Document similarity self-joins with mapreduce. In Data Mining (ICDM), 2010 IEEE 10th International Conference on (pp. 731-736). IEEE

46.     Vernica, R., Carey, M. J., and Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 495-506). ACM.

47.     Chan LW, Chan T, Cheng LF, Mak WS. Machine learning of patient similarity: A case study on predicting survival in cancer patient after locoregional chemotherapy. In Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on 2010 Dec 18 (pp. 467-470). IEEE.

48.     Elsayed, T., Lin, J., and Oard, D. W. (2008, June). Pair wise document similarity in large collections with Map Reduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (pp. 265-268). Association for Computational Linguistics.

49.     Yan, R., Hauptmann, A. G., and Jin, R. (2003, November). Negative pseudo-relevance feedback in content-based video retrieval. In Proceedings of the eleventh ACM international conference on Multimedia (pp. 343-346). ACM.

50.     Sarwar B, Karypis G, Konstan J, Riedl J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web 2001 Apr 1 (pp. 285-295). ACM.

51.     Ng K, Ghoting A, Steinhubl SR, Stewart WF, Malin B, Sun J. PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. Journal of biomedical informatics. 2014 Apr 30;48:160-70.

52.     Batko M, Falchi F, Lucchese C, Novak D, Perego R, Rabitti F, Sedmidubsky J, Zezula P. Building a web-scale image similarity search system. Multimedia Tools and Applications. 2010 May 1;47(3):599-629.







Received on 18.05.2017             Modified on 16.06.2017

Accepted on 02.07.2017            © RJPT All right reserved

Research J. Pharm. and Tech. 2017; 10(8): 2677-2684.

DOI: 10.5958/0974-360X.2017.00476.0