Massive Data Analysis- Challenges and Applications

We highlight a few trends in the massive data available to corporations, government agencies and researchers, and some examples of opportunities for turning this data into knowledge. We provide a brief overview of some of the state-of-the-art technologies in the massive data analysis landscape. Then, we describe in detail two applications from two diverse areas: recommendations in e-commerce and link discovery from biomedical literature. Finally, we present some challenges and open problems in the field of massive data analysis.

Massive Data Analysis- Challenges and Applications: Presentation Transcript

  • Massive data analysis: applications and challenges. Vijay Raghavan (University of Louisiana at Lafayette), Jayasimha Katukuri (eBay), Ying Xie (Kennesaw State University)
  • Agenda: Trends and Perspectives; Kinds of Big Data problems; Big Data Application scenarios; Current State of the Art; Big Data Applications- Examples; Big Data Analysis- Research Areas; Conclusions. 12/30/2013 2
  • Trends and Perspectives: In 2009, McKinsey estimated that nearly all sectors of the US economy had, on average, at least 200 terabytes of stored data per organization (for organizations with more than 1,000 employees). For example, Walmart’s customer transaction database was reported to be 110 terabytes in 2000; by 2004 it had grown to over half a petabyte (Schuman, 2004). About 80% of the data that organizations own, a growing share, can be classified as unstructured: for example, data held in emails, social media and multimedia.
  • Trends and Perspectives (contd.): Given average annual data growth of 59% (Pettey & Goasduff, 2011), the share of unstructured data will likely be much higher in a few years. Not only are more people connected to the Internet; the number of physical devices connected to it is also increasing significantly. Volume is not the only problem: variety and velocity are also issues we need to address (Russom, 2011).
  • Trends and Perspectives (contd.): Big Data: data that is complex in terms of volume, variety, velocity and/or its relation to other data, which makes it hard to handle using traditional database management systems or tools. “Through 2015, more than 85% of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.” (Gartner’s Top Predictions 2012). Analysts need to i) cope with massive data distributed across locations; ii) treat data as a resource to understand underlying phenomena (NRC Study, 2013).
  • The Meaning of Big Data – 3V’s. Big Volume: with simple and complex (SQL) analytics; scaling complex operations. Big Velocity: drink from the fire hose; beyond OLTP, NoSQL. Big Variety: large number of diverse data sources to integrate; beyond global schema-based approaches.
  • Velocity- Time to action vs. Value (Hackathorn, 2002)
  • Kinds of Big Data Problems (Davis, 2012)
  • Big Data – Big Analytics: Complex math operations (machine learning, clustering, trend detection, …), mostly specified as linear algebra on array data; in the stock market domain, the world of “quants”. A dozen or so common “inner loops”: matrix multiply, QR decomposition, SVD, linear regression.
  • Big Data – Big Analytics- An Example: Consider the closing price on all trading days for the last 5 years for two stocks A and B. What is the covariance between the two time series? (1/N) * sum_i (A_i – mean(A)) * (B_i – mean(B)). Now make it more challenging: all pairs of 4000 selected stocks (a 4000 x 1000 matrix). Hourly, instead of daily? All securities?
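The covariance on this slide can be sketched in a few lines of plain Python. This is only an illustration of the formula, not code from the talk: the function names and the toy price series are hypothetical. It also shows why the "all pairs" variant gets expensive: 4000 stocks give 4000 * 3999 / 2 = 7,998,000 pairs.

```python
from itertools import combinations


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def all_pairs_covariance(prices):
    """Covariance for every unordered pair of tickers.

    prices: dict mapping ticker -> list of closing prices (same length each).
    """
    return {
        (s, t): covariance(prices[s], prices[t])
        for s, t in combinations(sorted(prices), 2)
    }


# Hypothetical toy data, for illustration only.
prices = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
pairwise = all_pairs_covariance(prices)
```

At the scale the slide describes, the same computation would normally be written as a single matrix product on mean-centered data rather than a Python loop over pairs.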
  • Big Data Application Scenarios: Detecting anomalies or emerging events. Visa’s fraud detection program; HP’s compliance detection using its event management solution; detecting abnormal situations in the ICU; detecting server attacks, marketing keywords, environmental hazards; detecting terror and diseases; detecting national security risks (Singapore’s RAHS, Risk Assessment & Horizon Scanning, against disease and financial risk).
  • Big Data Application Scenarios: Predicting the near future and trend analysis. CRM: churn prediction; crime prevention by predicting likely locations of criminal activities; defect prediction (Volvo); Google Flu Trends; personalized recommender systems (Amazon); personalized labor support system (Germany, saving 10B euro).
  • Big Data Application Scenarios: Real-time analysis and decision support. CRM; healthcare applications in the ICU; marketing support; navigation service; real-time Q/A systems.
  • Big Data Application Scenarios: Pattern learning. Google’s automatic language translation; Apple’s Siri, Google Now; IBM Watson (Seton Healthcare Family uses Watson to learn from 2M patient data annually).
  • Current State of the Art: Rise of the cloud: big analytics as a service (Amazon DynamoDB, Google BigQuery, Windows Azure Tables). Hadoop: open source, the heart of big data analytics; HDFS does not index data; run big jobs using big files vs. small jobs as fast as possible; several variants (Cloudera, Amazon Elastic MapReduce, IBM InfoSphere).
  • Current State of the Art (contd.): Machine learning for massive data sets: Hadoop requires mappers and reducers to communicate with each other through a file system (HDFS). Some of the alternative technologies in this space are GraphLab (http://graphlab.org/) and Apache Spark (http://spark.incubator.apache.org/). Real-time analytics: Hadoop is not ideal for real-time analytics. Apache Storm (http://storm-project.net/) is one technology that tries to address real-time analytics.
  • Current State of the Art (contd.): In-memory analytics: focuses on the velocity part of big data. Oracle Exalytics In-Memory Machine, 1 terabyte RAM; SAS High-Performance Analytics (unstructured data); non-commercial: VoltDB.
  • Big Data Applications- Hypothesis Discovery
  • Motivation for Literature-based Hypotheses Discovery Systems: Biomedical research is divided into highly specialized fields and subfields, with poor communication between them. The rate of growth of publications makes it difficult for a researcher to derive connections between concepts from different research specialties. This growth is also an opportunity: literature-based discovery becomes more useful as more data makes statistical methods more reliable. Mining hidden connections among biomedical concepts from large amounts of scientific literature is one of the important goals pursued in this field [1]. Pfizer uses text mining software to gain a broader understanding before making major investments in specific compounds. It is estimated that $18 billion is spent per year on compounds that never reach the market, while $30 billion is spent reinventing what is already in the literature.
  • Hypothesis Discovery from Biomedical Literature: Example. Swanson found the hidden connection between “Fish Oil” and “Raynaud’s Disease” by finding concepts common to the document sets of “Fish Oil” and “Raynaud’s Disease” [4,5]. [diagram: Fish Oil is linked to Raynaud’s disease through the intermediate concepts high blood viscosity and platelet aggregation]
  • Link Discovery Methods in Biomedical Literature: The problem of hypotheses discovery in biomedical literature is similar to the link discovery problem. The existing approaches for hypotheses discovery have not explored the network topology features used in link discovery methods. The existing approaches do not provide an automated way of evaluating the results. Supervised learning methods have not been explored.
  • Proposed Method: Supervised Link Discovery. Concept network: model the whole Medline literature repository as a complex network of biomedical concepts. Generate labeled data automatically using concept networks corresponding to two different time periods. Extract a set of features from the concept network for concept pairs. Use a supervised learning approach to learn a model for link discovery.
  • Concept Network: Each node represents a biomedical concept. Node attributes: concept name, semantic type, related authors, and document frequency. Each edge represents an association between two concepts. Edge attributes: co-occurrence frequency.
  • Concept Network – Map-Reduce [diagram: Doc-1, Doc-2 and Doc-3 feed Mapper-1, Mapper-2 and Mapper-3; each mapper produces a CCM_local with key (c1, c2, year) and value co-count; Reducer-1 and Reducer-2 write the result to HDFS]
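A minimal single-process sketch of the map and reduce steps named on this slide, assuming that CCM_local is a per-mapper table of co-occurrence counts and that each document arrives as a list of concepts plus a publication year. The function names and the document representation are hypothetical; a real job would run on a Hadoop cluster rather than in one Python process.

```python
from collections import Counter
from itertools import combinations


def map_document(concepts, year):
    """Emit ((c1, c2, year), 1) for every unordered pair of distinct concepts."""
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        yield (c1, c2, year), 1


def run_mapper(docs):
    """Combine one mapper's emissions into a local co-occurrence table.

    docs: iterable of (list_of_concepts, year) tuples.
    """
    local = Counter()
    for concepts, year in docs:
        for key, count in map_document(concepts, year):
            local[key] += count
    return local


def reduce_counts(local_tables):
    """Sum the per-mapper tables into global co-occurrence counts."""
    total = Counter()
    for local in local_tables:
        total.update(local)  # Counter.update adds counts key by key
    return total


# Hypothetical documents split across two mappers.
mapper_1 = run_mapper([(["b", "a"], 2000), (["a", "b", "c"], 2000)])
mapper_2 = run_mapper([(["a", "b"], 2000), (["a", "b"], 2001)])
network = reduce_counts([mapper_1, mapper_2])
```

Sorting the concepts before pairing keeps (c1, c2) and (c2, c1) from being counted as different edges.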
  • Concept Network Statistics: Total number of concept pairs = 17,356,486. Total number of documents = 11,021,605. Total number of concepts = 165,674.
  • Automatic Generation of Labeled Concept Pairs: For each pair whose connection is strong in Gts, if it has no direct connection in Gtf, we assign positive to this pair. For each pair whose connection is weak in Gts, if it has no direct connection in Gtf, we assign negative to this pair. Select a random sample of the nodes in Gtf and generate concept pairs from the selected random sample. If a pair has no connection in both Gtf and Gts, we assign negative to it.
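The labeling rules can be sketched as follows. Several things here are assumptions rather than statements from the slides: that Gtf is the earlier snapshot and Gts the later one, that a "strong" connection means a co-occurrence count of at least min_support, and that each graph is a dict from a sorted concept pair to its count. All names are hypothetical.

```python
import random
from itertools import combinations


def edge(a, b):
    """Canonical key for an undirected concept pair."""
    return tuple(sorted((a, b)))


def label_pairs(g_tf, g_ts, min_support, sample_size, seed=0):
    """Label concept pairs as 'positive' or 'negative'.

    g_tf, g_ts: dicts mapping edge(a, b) -> co-occurrence count in the
    earlier and later snapshot respectively (assumed representation).
    """
    labels = {}
    # Rules 1 and 2: pairs present in Gts but not directly connected in Gtf.
    for pair, strength in g_ts.items():
        if pair in g_tf:
            continue
        labels[pair] = "positive" if strength >= min_support else "negative"
    # Rule 3: pairs from a random node sample with no connection in either graph.
    nodes = sorted({concept for pair in g_tf for concept in pair})
    rng = random.Random(seed)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    for a, b in combinations(sorted(sample), 2):
        pair = edge(a, b)
        if pair not in g_tf and pair not in g_ts:
            labels[pair] = "negative"
    return labels


# Hypothetical toy snapshots.
g_tf = {edge("a", "b"): 5, edge("c", "d"): 3}
g_ts = {edge("a", "b"): 9, edge("a", "c"): 6, edge("b", "d"): 1}
labels = label_pairs(g_tf, g_ts, min_support=4, sample_size=4)
```

Because the labels come from comparing two snapshots, no manual annotation is needed, which is the point the slide makes about automatic generation.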
  • Features: In addition to the commonly used network topological features, we extract the following features: Cycle Free Effective Conductance (CFEC); the Semantic-CFEC; the Author_List Jaccard.
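The slides do not define the Author_List Jaccard feature. A plausible reading, assumed here, is the standard Jaccard coefficient between the "related authors" sets of the two concepts in a pair. The function name and the author names below are hypothetical.

```python
def author_list_jaccard(authors_a, authors_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two author collections."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty author lists
    return len(a & b) / len(union)


# Hypothetical author lists for two concepts.
score = author_list_jaccard(["smith", "lee", "chen"], ["lee", "chen", "patel"])
```

A high score means the two concepts are studied by largely the same researchers, which may hint at a link that has not yet been published.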
  • Feature Extraction: For each labeled pair, we extract the set of features described before from the snapshot of the concept network Gtf. To scale feature extraction to a large number of labeled pairs, it is implemented on a Map-Reduce cluster. The distributed implementation can be described as follows: Trim Gtf so that it contains only edges with strength greater than or equal to the minimum support. Store the trimmed Gtf in each mapper’s main memory. Distribute the labeled pairs among the mappers. Each mapper extracts the features for a subset of concept pairs using the trimmed Gtf.
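The four steps above can be sketched in a single process as follows. This is an illustration under assumptions: the graph is a dict from a concept pair to its edge strength, the mappers are simulated by a loop, and a common-neighbor count stands in for the real feature set (CFEC and the other features are not implemented). All names are hypothetical.

```python
def trim(graph, min_support):
    """Keep only edges whose strength is at least min_support."""
    return {pair: w for pair, w in graph.items() if w >= min_support}


def adjacency(graph):
    """Build a neighbor-set index from an undirected edge dict."""
    adj = {}
    for a, b in graph:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def partition(pairs, n_mappers):
    """Deal the labeled pairs round-robin into n_mappers chunks."""
    return [pairs[i::n_mappers] for i in range(n_mappers)]


def mapper(pairs, adj):
    """Extract features for one chunk using the in-memory trimmed graph."""
    features = {}
    for a, b in pairs:
        common = adj.get(a, set()) & adj.get(b, set())
        features[(a, b)] = {"common_neighbors": len(common)}
    return features


def extract_features(graph, labeled_pairs, min_support, n_mappers):
    """Trim once, share the trimmed graph, then map over chunks of pairs."""
    adj = adjacency(trim(graph, min_support))
    features = {}
    for chunk in partition(labeled_pairs, n_mappers):
        features.update(mapper(chunk, adj))
    return features


# Hypothetical toy graph and labeled pairs.
graph = {("a", "x"): 5, ("b", "x"): 5, ("a", "y"): 1, ("b", "y"): 7}
pairs = [("a", "b"), ("x", "y")]
features = extract_features(graph, pairs, min_support=2, n_mappers=2)
```

Trimming first is what makes it practical to hold the whole graph in each mapper's memory, so only the labeled pairs need to be distributed.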
  • min_support: All the measures improve as we increase the value of the parameter ‘min_support’. As ‘min_support’ increases, there are fewer positive examples. 10-fold cross-validation is used in all the experiments.
  • Different Classifiers: SVM provided around 1.5%-2% better classification accuracy than decision trees.
  • Case Study [diagram of concepts: Tumor Necrosis Factor-alpha, Prostatic Neoplasms, Adenosine Triphosphate, NF-kappaB inhibitor alpha, Oligopeptides, Tetradecanoylphorbol Acetate]
  • Big Data Applications- Recommendations in e-Commerce
  • eBay Today
  • Introduction: Challenges in a dynamic marketplace like eBay. Huge inventory: several hundreds of millions; seller-defined listings; listings are short-lived. Wide variety: from electronics to unique collectibles; the majority are unstructured and without a product catalog. Listing quality: condition, price, shipping, etc. Seller trustworthiness. Goal for a recommendation system in eBay: address challenges associated with a dynamic marketplace; scalable and efficient; computationally intensive tasks during offline model generation; efficient online performance system.
  • Motivation – Pre-purchase: User couldn’t purchase a listing s/he showed interest in: placed a bid but lost the auction, or “watched” an item but someone else bought it before s/he was ready to buy. Similar Item Recommendation (SIR): recommend replacement items.
  • Motivation – Post-purchase: User just purchased an item. Related Item Recommendation (RIR): inspire incremental purchases; recommend complementary/related items.
  • System Architecture - Overview [diagram: Offline Model Generation (Clusters Model Generation and Related Clusters Model Generation) reads Inventory, Clickstream, Transactions and the Conceptual Knowledgebase from the Data Store and produces Item Clusters and Cluster-Cluster Relations; in the Real-time Performance System, a lost item goes to the Similar Items Recommender (SIR), ?similarTo(item), which returns similar items, and a bought item goes to the Related Items Recommender (RIR), ?relatedTo(item), which returns related items]
  • Data Store: Glue between offline and real-time systems. Raw data: inventory data, clickstream data, transaction data. Conceptual Knowledgebase: category tree; stop words, spell corrections, synonyms, etc.; term dictionary. Models: item clusters (“clarks women shoe pumps classics”, “authentic handmade amish quilt”); cluster-cluster relations (“samsung galaxy s4” – “samsung galaxy s4 screen protector”, “wolfgang puck electric pressure cooker” – “kitchenaid food processor”).
  • Model Generation - Clusters: Global clustering is not feasible: inventory size is in the several hundreds of millions, and the inventory is varied, ranging from electronic goods to unique collectibles. Partition input data by user queries: take advantage of users’ perspective of item similarity. Parallel distributed K-Means in Hadoop MapReduce. Feature set: title tokens, category hierarchy, attributes or concepts. Dedupe and merge overlapping clusters. 100X reduction in size over inventory with over 90% coverage. [diagram: items from Inventory, user queries from Clickstream, and concepts and categories from the Conceptual Knowledgebase feed Query-Recall Generation (query-to-items), followed by Cluster Generation, which writes new clusters to the Data Store]
  • Model Generation – Related Clusters: Transactional data: item-item co-purchase matrix. Cluster assignment. Cluster-cluster directed graph. Rank outgoing edges: collaborative filtering (edge strength, i.e. number of users with co-purchase); cluster-cluster content similarity. [diagram: bought item-item pairs from Transactions, clusters, and concepts and categories from the Conceptual Knowledgebase feed Cluster Assignment (bought cluster-cluster), followed by Cluster-to-Cluster Model Generation, which writes related cluster-cluster relations to the Data Store]
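A small sketch of the cluster-cluster graph described on this slide, under stated assumptions: purchases are given per user in purchase order, an edge points from the cluster bought earlier to the cluster bought later, and outgoing edges are ranked by edge strength alone (the content-similarity signal mentioned on the slide is left out). All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cluster_graph(purchases, item_to_cluster):
    """Return {(src_cluster, dst_cluster): number of users with that co-purchase}.

    purchases: dict user -> list of item ids in purchase order (assumed input).
    """
    users_per_edge = defaultdict(set)
    for user, items in purchases.items():
        clusters = [item_to_cluster[i] for i in items if i in item_to_cluster]
        # combinations preserves input order, so src was bought before dst.
        for src, dst in combinations(clusters, 2):
            if src != dst:
                users_per_edge[(src, dst)].add(user)
    return {e: len(users) for e, users in users_per_edge.items()}


def related_clusters(graph, cluster, top_k=3):
    """Rank outgoing edges of a cluster by strength, ties broken by name."""
    outgoing = [(dst, w) for (src, dst), w in graph.items() if src == cluster]
    outgoing.sort(key=lambda e: (-e[1], e[0]))
    return [dst for dst, _ in outgoing[:top_k]]


# Hypothetical cluster assignment and purchase histories.
item_to_cluster = {
    "i1": "phone",
    "i2": "screen protector",
    "i3": "case",
    "i4": "phone",
}
purchases = {
    "u1": ["i1", "i2"],
    "u2": ["i4", "i2", "i3"],
    "u3": ["i1", "i3"],
    "u4": ["i1", "i2"],
}
graph = build_cluster_graph(purchases, item_to_cluster)
top = related_clusters(graph, "phone")
```

Counting distinct users rather than raw co-purchases keeps one heavy buyer from dominating an edge, and working at cluster level lets short-lived listings share the same relation.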
  • Experimental Results: A/B tests comparing against legacy systems. SIR legacy system: completely online; naïve approach of using the seed item title as a search query. RIR legacy system: Chen, Y. and J.F. Canny, Recommending ephemeral items at web scale, ACM SIGIR 2011; collaborative filtering on stable representations of items. Significant improvements at the 90% confidence level: SIR resulted in 38.18% higher user engagement (CTR); RIR resulted in 10.5% higher CTR. Statistically significant improvement in site-wide business metrics from both SIR & RIR.
  • Recommendations in e-Commerce: Conclusions. Balance between similarity and quality is crucial in driving user engagement and conversion. Clusters of similar items in the inventory: local clustering in the coverage set of user queries. Offline models built using Map-Reduce: huge input datasets including inventory, clickstream and transactional data. Efficient real-time performance system. Currently deployed on ebay.com.
  • Big Data Analytics- Research Areas: Data representation, including transformations that reduce representational complexity. Computational complexity issues to characterize computational resource needs and tradeoffs. Statistical model-building in massive data settings having messy data validation issues. Sampling, both as data gathering and for data reduction. Methods to include humans in the data analysis loop.
  • Conclusions: Great opportunity in improving the functioning of many disciplines by leveraging the data and turning the data into knowledge. Requires an interdisciplinary approach to solving problems of massive data. A major need exists for software targeted to end users. Concerted effort is needed to educate students and the workforce in statistical thinking and computational thinking.
  • References
    Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.
    Schuman, E. (2004, October 13). At Wal-Mart, World's Largest Retail Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012, from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMartWorlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/
    Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends. Computer, 33(1), 117–119.
    Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving “Big Data” Challenge Involves More Than Just Managing Volumes of Data. Stamford: Gartner. Retrieved from http://www.gartner.com/it/page.jsp?id=1731916
  • References (cont’d)
    Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding Digital Universe. Director, 285(6). doi:10.1002/humu.21252
    Russom, P. (2011). Big Data Analytics. TDWI Research.
    Pettey, C. (2012, October 18). Gartner Identifies the Top 10 Strategic Technologies for 2012. Gartner.
    Hackathorn, R. (2002). Current practices in active data warehousing. Available: http://www.dmreview.com/whitepaper/WID489.pdf
    Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big Deal. Deloitte. Retrieved from http://www.deloitte.com/view/en_GX/global/insights/c22d83274d1b4310VgnVCM2000001b56f00aRCRD.htm
    Lee, P., & Steward, D. (2012). Technology, Media & Telecommunications Predictions 2012. Deloitte.
  • References (cont’d)
    NRC of the National Academies. Frontiers in Massive Data Analysis. The National Academies Press, Washington, D.C., 2013. Retrieved from http://www.nap.edu/catalog.php?record_id=18374
    Katukuri, J., Xie, Y., Raghavan, V., and Gupta, A. “Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks”. BMC Genomics, 13(Suppl 3):S5, 2012.
    Katukuri, J., Mukherjee, R., and Konik, T. “Large scale recommendations in a dynamic marketplace”. ACM RecSys (LSRS workshop), 2013.
  • References (cont’d)
    Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall Street Journal. Retrieved September 14, 2011, from http://online.wsj.com/article/SB10001424053111903532804576569133957145822.html
    Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS Blogs Home. Retrieved December 16, 2012, from http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-ofbig-data-problem-do-you-have/
    Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? Last retrieved December 16, 2012.
    Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’ Value Creation. Master Thesis, University of Amsterdam. Retrieved December 16, 2012, from http://nielsmouthaan.nl/big-data-thesis.pdf