We highlight a few trends in the massive data available to corporations, government agencies, and researchers, along with examples of the opportunities for turning this data into knowledge. We provide a brief overview of state-of-the-art technologies in the massive data analysis landscape. We then describe in detail two applications from two diverse areas: recommendations in e-commerce and link discovery from biomedical literature. Finally, we present some challenges and open problems in the field of massive data analysis.
Massive Data Analysis- Challenges and Applications
1. Massive data analysis:
applications and challenges
Vijay Raghavan
University of Louisiana at Lafayette
Jayasimha Katukuri
eBay
Ying Xie
Kennesaw State University
2. Agenda
Trends and Perspectives
Kinds of Big Data problems
Big Data Application scenarios
Current State of the Art
Big Data Applications- Examples
Big Data Analysis- Research Areas
Conclusions
12/30/2013
3. Trends and Perspectives
In 2009, McKinsey estimated that nearly all sectors in
the US economy had at least an average of 200
terabytes of stored data per organization (for
organizations with more than 1000 employees).
As an example, Walmart’s customer transaction
database was reported to be 110 terabytes in 2000.
By 2004 it increased to be over half a petabyte
(Schuman, 2004).
A growing share of the data that organizations own, about
80%, can be classified as unstructured: for example, data
packed in emails, social media, and multimedia.
4. Trends and Perspectives (Contd …)
Taking into account average annual data growth of
59% (Pettey & Goasduff, 2011), this percentage
(unstructured data) will likely be much higher in a
few years.
Not only is an increasing number of people
connected to the Internet; there is also a significant
increase in the number of physical devices connected
to the Internet.
Volume is not the only problem: the variety and
velocity of data are also issues we need to
look at (Russom, 2011).
5. Trends and Perspectives (Contd …)
Big Data: Data that is complex in terms of volume,
variety, velocity and/or its relation to other data,
which makes it hard to handle using traditional
database management systems or tools.
“Through 2015, more than 85% of Fortune 500
organizations will fail to effectively exploit big data
for competitive advantage.” (Gartner’s Top
Predictions 2012).
Analysts need to i) cope with massive data
distributed across locations; ii) treat data as a
resource to understand underlying phenomena (NRC
Study, 2013).
6. The Meaning of Big Data – 3V’s
Big Volume
- With simple and complex (SQL) analytics
- Scaling complex operations
Big Velocity
- Drink from the fire hose
- Beyond OLTP, NoSQL
Big Variety
- Large number of diverse data sources to integrate
- Beyond global schema-based approaches
8. Kinds of Big Data Problems
(Davis, 2012)
9. Big Data – Big Analytics
Complex math operations (machine learning,
clustering, trend detection, …)
mostly specified as linear algebra on array data
in the stock market domain, the world of
“quants”
A dozen or so common “inner loops”
Matrix multiply
QR decomposition
SVD decomposition
Linear regression
10. Big Data – Big Analytics- An Example
Consider the closing price on all trading days for the
last 5 years for two stocks A and B
What is the covariance between the two time-series?
(1/N) * sum over i of [(A_i - mean(A)) * (B_i - mean(B))]
Now make it more challenging …
All pairs of 4000 selected stocks: a 4000 x 1000 matrix
Hourly, instead of daily?
All securities?
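The covariance question on this slide can be sketched in a few lines. This is a minimal single-machine illustration, not a scalable implementation: the price series are invented, and the helper names (covariance, all_pairs_covariance) are hypothetical.

```python
from itertools import combinations


def mean(xs):
    return sum(xs) / len(xs)


def covariance(a, b):
    """Population covariance: (1/N) * sum_i (a_i - mean(a)) * (b_i - mean(b))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def all_pairs_covariance(prices):
    """Covariance for every unordered pair of stocks; prices maps name -> series."""
    return {
        (s, t): covariance(prices[s], prices[t])
        for s, t in combinations(sorted(prices), 2)
    }


# Hypothetical closing prices for three stocks over four trading days.
prices = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
pairwise = all_pairs_covariance(prices)
```

With 4000 stocks the same loop covers 4000 * 3999 / 2 = 7,998,000 pairs, which is why the all-pairs version is usually expressed as a matrix product rather than pair by pair.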
11. Big Data Application Scenarios: Detecting anomalies or emerging events
Visa’s fraud detection program
HP’s compliance detection using its event
management solution
Detecting abnormal situations in ICU
Detecting server attacks, marketing keywords,
environmental hazards
Detecting terror and diseases
Detecting national security risks (Singapore’s RAHS,
Risk Assessment & Horizon Scanning, against
disease and financial risk)
12. Big Data Application Scenarios: Predicting near future & Trend analysis
CRM: churn prediction
Crime prevention by predicting likely
locations of criminal activities
Defect prediction (Volvo)
Google Flu Trends
Personalized recommender systems (Amazon)
Personalized labor support system (Germany, saving 10B
euros)
13. Big Data Application Scenarios: Real-time analysis and Decision Support
CRM
Healthcare applications in ICU
Marketing support
Navigation service
Real-time Q/A systems
14. Big Data Application Scenarios: Pattern Learning
Google’s automatic language translation
Apple’s Siri, Google Now
IBM Watson (Seton Healthcare Family uses Watson to
learn from data on 2M patients annually)
15. Current State of the Art
Rise of the cloud
Big analytics as a service
Amazon DynamoDB, Google BigQuery, Windows Azure Tables
Hadoop: open source, the heart of big data analytics
HDFS does not index data
Run big jobs using big files vs. small jobs as fast as possible
Several variants: Cloudera, Amazon Elastic MapReduce, IBM
InfoSphere
16. Current State of the Art (contd.)
Machine learning for massive data sets
Hadoop requires mappers and reducers to communicate with
each other through a file system (HDFS). Some of the
alternative technologies in this space are:
GraphLab (http://graphlab.org/)
Apache Spark (http://spark.incubator.apache.org/)
Real-time analytics
Hadoop is not ideal for real-time analytics. Apache Storm
(http://storm-project.net/) is one technology that aims to
address real-time analytics.
17. Current State of the Art (contd.)
In-Memory analytics
Focuses on the velocity part of big data
Oracle Exalytics In-Memory machine, 1 terabyte RAM
SAS High-performance Analytics (unstructured data)
Non-commercial- VoltDB
19. Motivation for Literature-based
Hypotheses Discovery Systems
Biomedical research is divided into highly specialized fields and
subfields, with poor communication between them.
The rate of growth of publications makes it difficult for a researcher to
derive connections between concepts from different research
specialties. It is also an opportunity: literature-based discovery
becomes more useful as more data makes statistical methods more
reliable.
Mining hidden connections among biomedical concepts from large
amounts of scientific literature is one of the important goals pursued in
this field [1].
Pfizer uses text mining software to move to a broader understanding
before making major investments in specific compounds. It is
estimated that $18 billion is spent per year on compounds that never
reach market, while $30 billion is spent reinventing what is in the
literature.
20. Hypothesis Discovery from Biomedical
Literature : Example
Swanson found the hidden connection between “Fish Oil” and
“Raynaud's Disease” by finding the common concepts from the
document sets of “Fish Oil” and “Raynaud's Disease” [4,5].
[Figure: Fish Oil and Raynaud’s disease linked through the common concepts High blood viscosity and Platelet aggregation]
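The idea of finding common concepts between two document sets can be illustrated with a small sketch. It is a simplification for illustration only, not Swanson's actual procedure: the documents are invented, each document is reduced to a set of concept strings, and linking_concepts is a hypothetical helper name.

```python
def linking_concepts(docs, a_term, c_term):
    """Return concepts that co-occur with a_term and with c_term,
    using only documents that do not mention both terms together."""
    a_neighbors, c_neighbors = set(), set()
    for concepts in docs:
        concepts = set(concepts)
        if a_term in concepts and c_term in concepts:
            continue  # a direct link is already in the literature
        if a_term in concepts:
            a_neighbors |= concepts - {a_term}
        if c_term in concepts:
            c_neighbors |= concepts - {c_term}
    return a_neighbors & c_neighbors


# Hypothetical documents, each reduced to the concepts it mentions.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"raynaud's disease", "blood viscosity"},
    {"raynaud's disease", "platelet aggregation", "vasoconstriction"},
    {"fish oil", "triglycerides"},
]
links = linking_concepts(docs, "fish oil", "raynaud's disease")
```

Here links contains the two intermediate concepts shared by both document sets; concepts seen on only one side (triglycerides, vasoconstriction) are dropped.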
21. Link Discovery Methods in Biomedical
Literature
The problem of hypotheses discovery in biomedical
literature is similar to the link discovery problem.
The existing approaches for hypotheses discovery
have not explored the network topology features
used in the link discovery methods.
The existing approaches do not provide an
automated way of evaluating the results.
Supervised learning methods have not been
explored.
22. Proposed Method: Supervised Link
Discovery
Supervised Link Discovery
Concept Network : Model the whole Medline literature repository as
a complex network of biomedical concepts
Generate labeled data automatically using Concept Networks
corresponding to two different time periods.
Extract a set of features from the concept network for concept
pairs.
A supervised learning approach to learn a model for link discovery.
23. Concept Network
Each node represents a biomedical concept
Node Attributes:
concept name
semantic type,
related authors, and
document frequency
Each edge represents an association between two
concepts.
Edge Attributes:
Co-occurrence frequency
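A concept network of this shape can be sketched with two counters, one for node document frequency and one for edge co-occurrence frequency. This is a minimal illustration under stated assumptions: documents are already mapped to concept lists, the other node attributes (semantic type, related authors) are left out, and build_concept_network is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def build_concept_network(docs):
    """docs: iterable of concept collections, one per document.
    Returns (document_frequency, cooccurrence) counters."""
    doc_freq = Counter()
    cooccur = Counter()
    for concepts in docs:
        unique = sorted(set(concepts))
        doc_freq.update(unique)
        # Store each undirected edge once, under a sorted (u, v) key.
        cooccur.update(combinations(unique, 2))
    return doc_freq, cooccur


# Hypothetical documents with their extracted concepts.
docs = [
    ["aspirin", "headache"],
    ["aspirin", "headache", "fever"],
    ["fever"],
]
doc_freq, cooccur = build_concept_network(docs)
```

Keying edges by a sorted pair keeps the graph undirected without storing each edge twice.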
25. Concept Network Statistics
Total number of concept pairs = 17356486
Total number of documents = 11021605
Total number of concepts = 165674
26. Automatic Generation of Labeled Concept
Pairs
For each pair whose connection is strong in Gts,
if it has no direct connection in Gtf, we label the
pair positive.
For each pair whose connection is weak in Gts,
if it has no direct connection in Gtf, we label the
pair negative.
Select a random sample of the nodes in Gtf and generate
concept pairs from the selected random sample.
If a pair has no connection in either Gtf or Gts, we label
it negative.
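The labeling rules can be sketched as follows. Several points are assumptions made for illustration and are not stated on the slide: Gtf is taken to be the earlier snapshot and Gts the later one, "strong" is taken to mean a co-occurrence strength of at least min_support, and label_pairs, sample_size, and seed are hypothetical names.

```python
import random
from itertools import combinations


def label_pairs(g_tf, g_ts, min_support, sample_size, seed=0):
    """g_tf, g_ts: dicts mapping a sorted (u, v) pair to co-occurrence strength.
    Returns a dict mapping pair -> 1 (positive) or 0 (negative)."""
    labels = {}
    for pair, strength in g_ts.items():
        if pair in g_tf:
            continue  # already directly connected in the earlier snapshot
        # Assumption: "strong" means strength >= min_support, "weak" otherwise.
        labels[pair] = 1 if strength >= min_support else 0
    # Extra negatives: sampled pairs with no connection in either snapshot.
    nodes = sorted({n for pair in g_tf for n in pair})
    rng = random.Random(seed)
    sample = sorted(rng.sample(nodes, min(sample_size, len(nodes))))
    for pair in combinations(sample, 2):
        if pair not in g_tf and pair not in g_ts:
            labels[pair] = 0
    return labels


# Hypothetical snapshots of a four-concept network.
g_tf = {("a", "b"): 5, ("b", "c"): 3, ("c", "d"): 2}
g_ts = {("a", "b"): 9, ("a", "c"): 6, ("b", "d"): 1}
labels = label_pairs(g_tf, g_ts, min_support=4, sample_size=4)
```

In this toy input, ("a", "c") becomes positive (new and strong), ("b", "d") negative (new but weak), and ("a", "d") negative (never connected).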
27. Features
In addition to the commonly used network topological
features, we extract the following features:
Cycle Free Effective Conductance (CFEC)
The Semantic-CFEC
The Author_List Jaccard
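As an illustration of the last feature, the Jaccard coefficient of two author lists is the size of their intersection divided by the size of their union. The sketch below assumes each concept carries a plain collection of author names; the exact definition used by the authors may differ, and the names here are invented.

```python
def author_list_jaccard(authors_u, authors_v):
    """Jaccard coefficient of the author sets attached to two concepts."""
    a, b = set(authors_u), set(authors_v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical author lists for two concepts.
score = author_list_jaccard(["Lee", "Kim", "Roy"], ["Kim", "Roy", "Das"])
```

Two shared authors out of four distinct ones give a score of 0.5.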
28. Feature Extraction
For each labeled pair, we extract the set of
features described above from the snapshot of
the concept network Gtf.
To scale feature extraction to a large number of
labeled pairs, it is implemented on a
Map-Reduce cluster.
The distributed implementation of feature extraction
can be described in the following way:
Trim Gtf such that it only contains edges with strength greater than
or equal to the minimum support. Store the trimmed Gtf in each of
the mapper’s main memory.
Distribute the labeled pairs among the mappers. Each mapper
extracts the features for a subset of concept pairs using the
trimmed Gtf .
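The two steps above can be sketched on a single machine. This is a simulation under stated assumptions, not the authors' Map-Reduce job: the mappers are ordinary function calls over slices of the pair list, only two simple topological features (common neighbors and neighbor Jaccard) stand in for the full feature set, and all function names are hypothetical.

```python
def trim(graph, min_support):
    """Keep only edges whose strength is at least min_support."""
    return {pair: s for pair, s in graph.items() if s >= min_support}


def adjacency(graph):
    """Turn an edge dict keyed by (u, v) into a node -> neighbor-set map."""
    adj = {}
    for u, v in graph:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mapper(pairs, adj):
    """Emit (pair, features) for one mapper's share of the labeled pairs."""
    for u, v in pairs:
        nu, nv = adj.get(u, set()), adj.get(v, set())
        union = nu | nv
        yield (u, v), {
            "common_neighbors": len(nu & nv),
            "jaccard": len(nu & nv) / len(union) if union else 0.0,
        }


def extract_features(g_tf, labeled_pairs, min_support, n_mappers):
    adj = adjacency(trim(g_tf, min_support))  # held in every mapper's memory
    chunks = [labeled_pairs[i::n_mappers] for i in range(n_mappers)]
    features = {}
    for chunk in chunks:  # on a cluster, each chunk runs on its own mapper
        features.update(mapper(chunk, adj))
    return features


# Hypothetical earlier-snapshot graph and labeled pairs.
g_tf = {("a", "x"): 5, ("b", "x"): 4, ("a", "y"): 3, ("b", "y"): 1, ("b", "z"): 6}
features = extract_features(g_tf, [("a", "b"), ("x", "y")], min_support=3, n_mappers=2)
```

Because the trimmed graph is read-only and each pair is processed independently, no reducer is needed for this step; the work is a map-only job over the labeled pairs.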
29. min_support
All the measures improve as we increase the value of the
parameter ‘min_support’. As ‘min_support’ increases,
there are fewer positive examples.
10-fold cross-validation is used in all the experiments.
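For readers unfamiliar with the evaluation protocol, 10-fold cross-validation splits the labeled pairs into ten folds and tests on each fold in turn while training on the other nine. The sketch below shows only the index splitting; the shuffling seed and the name k_fold_indices are assumptions, and the classifier itself is out of scope.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Split 25 hypothetical labeled pairs into 10 folds.
splits = list(k_fold_indices(25))
```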
34. Introduction
Challenges in a dynamic marketplace like eBay
Huge inventory
Several hundreds of millions
Seller-defined listings
Listings are short-lived
Wide variety
From electronics to unique collectibles
Majority are unstructured and w/o a product catalog
Listing quality
Condition, price, shipping, etc
Seller trustworthiness
Goal for a Recommendation System in eBay
Address challenges associated with a dynamic marketplace
Scalable and efficient
Computationally intensive tasks during offline model generation
Efficient online performance system
35. Motivation – Pre-purchase
User couldn’t purchase a listing s/he showed interest in
Placed a bid but lost the auction
“Watched” an item but someone else bought it before s/he was
ready to buy
Similar Item Recommendation (SIR)
Recommend replacement items
36. Motivation – Post-purchase
User just purchased an item
Related Item Recommendation (RIR)
Inspire incremental purchases
Recommend complementary/related items
37. System Architecture - Overview
[Figure: system architecture. Offline Model Generation (Clusters Model Generation, Related Clusters Model Generation) is connected to the Data Store (Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, Cluster-Cluster Relations). In the Real-time Performance System, a lost item goes to the Similar Items Recommender (SIR) as ?similarTo(item), which returns similar items; a bought item goes to the Related Items Recommender (RIR) as ?relatedTo(item), which returns related items.]
38. Data Store
Glue between offline and real-time systems
Raw data
Inventory data
Clickstream data
Transaction data
Conceptual Knowledgebase
Category Tree
Stop words, spell corrections, synonyms, etc.
Term dictionary
Models
Item Clusters
“clarks women shoe pumps classics”
“authentic handmade amish quilt”
Cluster-Cluster Relations
“samsung galaxy s4” – “samsung galaxy s4 screen protector”
“wolfgang puck electric pressure cooker” – “kitchenaid food processor”
39. Model Generation - Clusters
Global clustering not feasible
Inventory size in several hundreds of millions
Varied inventory ranging from electronic goods to unique collectibles
Partition input data by user queries
Take advantage of users’ perspective of item similarity
Parallel distributed K-Means in Hadoop MapReduce
Feature set
Title tokens
Category hierarchy
Attributes or concepts
Dedupe and merge overlapping clusters
100X reduction in size over inventory with over 90% coverage
[Figure: Clusters Model Generation. Labels: Data Store (Inventory, Clickstream, Conceptual Knowledgebase, Clusters); items; user queries; concepts, categories; Query-Recall Generation; query-to-items; Cluster Generation; new clusters.]
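The idea of partitioning by user query and clustering within each partition can be sketched on one machine. This is an illustration under several assumptions, not the deployed pipeline: it runs plain Lloyd's k-means in a single process rather than on Hadoop, uses only binary title-token features, seeds centroids with the first k items, and all names and listing titles are hypothetical.

```python
def vectorize(titles):
    """Binary bag-of-words vectors over the title tokens."""
    token_sets = [set(t.split()) for t in titles]
    vocab = sorted(set().union(*token_sets))
    return [[1.0 if tok in toks else 0.0 for tok in vocab] for toks in token_sets]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_by_query(query_to_titles, k):
    """Cluster each query's recall set independently (local clustering)."""
    return {
        q: k_means(vectorize(titles), min(k, len(titles)))
        for q, titles in query_to_titles.items()
    }


# Hypothetical recall set for one user query.
recall = {
    "shoe": [
        "clarks women shoe pumps",
        "nike men running shoe",
        "clarks women shoe pumps classics",
        "nike men running shoe black",
    ]
}
clusters = cluster_by_query(recall, k=2)
```

Each query's recall set is small compared with the full inventory, so the per-query jobs are independent and can be spread across a cluster.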
40. Model Generation – Related Clusters
Transactional data
Item-Item co-purchase matrix
Cluster Assignment
Cluster-Cluster directed graph
Rank outgoing edges
Collaborative filtering
Edge strength, i.e. no. of users with co-purchase
Cluster-Cluster content similarity
[Figure: Related Clusters Model Generation. Labels: Data Store (Transactions, Clusters, Conceptual Knowledgebase, Cluster-Cluster Relations); bought item-item; concepts, categories; Cluster Assignment; bought cluster-cluster; Cluster-to-Cluster Model Generation; related cluster-cluster.]
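The cluster assignment and edge-ranking steps can be sketched as below. This covers only the collaborative-filtering signal (number of distinct users with a co-purchase); the content-similarity signal is left out, the record layout (user, bought item, later item) is an assumption, and related_clusters and top_n are hypothetical names.

```python
from collections import defaultdict


def related_clusters(purchases, item_to_cluster, top_n=3):
    """purchases: iterable of (user, bought_item, later_item) co-purchase records.
    Returns cluster -> list of related clusters, strongest edge first."""
    users = defaultdict(set)  # (src_cluster, dst_cluster) -> users with co-purchase
    for user, item_a, item_b in purchases:
        src, dst = item_to_cluster.get(item_a), item_to_cluster.get(item_b)
        if src is None or dst is None or src == dst:
            continue  # skip unassigned items and within-cluster purchases
        users[(src, dst)].add(user)
    outgoing = defaultdict(list)
    for (src, dst), who in users.items():
        outgoing[src].append((len(who), dst))
    # Rank outgoing edges by user count (descending), then by cluster name.
    return {
        src: [dst for n, dst in sorted(edges, key=lambda e: (-e[0], e[1]))[:top_n]]
        for src, edges in outgoing.items()
    }


# Hypothetical cluster assignments and co-purchase records.
item_to_cluster = {"i1": "phone", "i2": "screen protector", "i3": "case", "i4": "phone"}
purchases = [
    ("u1", "i1", "i2"),
    ("u2", "i4", "i2"),
    ("u2", "i1", "i3"),
    ("u1", "i1", "i2"),
    ("u3", "i1", "i4"),
]
related = related_clusters(purchases, item_to_cluster)
```

Counting distinct users rather than raw transactions keeps one heavy buyer from dominating an edge.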
41. Experimental Results
A/B Tests comparing against legacy systems
SIR legacy system
Completely online
Naïve approach of using seed item title as a search query
RIR legacy system
Chen, Y. and J.F. Canny, Recommending ephemeral items
at web scale, ACM SIGIR 2011
Collaborative Filtering on stable representations of items
Significant improvements at the 90% confidence level
SIR resulted in 38.18% higher user engagement (CTR)
RIR resulted in 10.5% higher CTR
Statistically significant improvement in site-wide business
metrics from both SIR & RIR
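One common way to check a CTR difference between two A/B arms is a two-proportion z-test. The slides do not say which test was used, so the sketch below is only an illustration: the click and view counts are invented, and two_proportion_z_test is a hypothetical helper.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (relative CTR lift of B over A, two-sided p-value)."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf, then the two-sided tail probability.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return (p_b - p_a) / p_a, p_value


# Hypothetical counts: control arm A (legacy) vs. treatment arm B (new system).
lift, p_value = two_proportion_z_test(1000, 100000, 1100, 100000)
significant_at_90 = p_value < 0.10
```

At a 90% confidence level the difference counts as significant when the two-sided p-value is below 0.10.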
42. Recommendations in e-Commerce: Conclusions
Balance between similarity and quality crucial in
driving user engagement and conversion
Clusters of similar items in the inventory
Local clustering in the coverage set of user
queries
Offline models built using Map-Reduce
Huge input datasets including inventory,
clickstream and transactional data
Efficient real-time performance system
Currently deployed on ebay.com
43. Big Data Analytics- Research Areas
Data representation, including transformations that
reduce representational complexity
Computational complexity issues to characterize
computational resource needs and tradeoffs
Statistical model-building in massive data settings
having messy data validation issues
Sampling- both as data gathering and for data
reduction
Methods to include humans in the data analysis loop
44. Conclusions
Great opportunity in improving the functioning of
many disciplines by leveraging the data and turning
the data into knowledge
Requires an interdisciplinary approach to solving
problems of massive data
A major need exists for software targeted to end
users
Concerted effort is needed to educate students and
the workforce in statistical thinking and
computational thinking
45. References
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C.,
& Byers, A. (2011). Big data: The Next Frontier for innovation,
Competition, and Productivity.
Schuman, E. (2004, October 13). At Wal-Mart, Worlds Largest Retail
Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012,
from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMartWorlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/
Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends.
Computer, 33(1), 117–119.
Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving
“Big Data” Challenge Involves More Than Just Managing
Volumes of Data. Stamford: Gartner. Retrieved from
http://www.gartner.com/it/page.jsp?id=1731916
46. References (cont’d)
Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding
Digital Universe. Director, 285(6). doi:10.1002/humu.21252
Russom, P. (2011). Big Data Analytics. TDWI Research.
Pettey, C. (2012, October 18). Gartner Identifies the Top 10
Strategic Technologies for 2012. Gartner.
Hackathorn, R. (2002). Current practices in active data
warehousing. available:
http://www.dmreview.com/whitepaper/WID489.pdf
Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big
Deal. Deloitte. Retrieved from
http://www.deloitte.com/view/en_GX/global/insights/c22d83274
d1b4310VgnVCM2000001b56f00aRCRD.htm
Lee, P., & Steward, D. (2012). Technology, Media &
Telecommunications Predictions 2012, (Deloitte).
47. References (cont’d)
NRC of the National Academies, Frontiers in Massive Data Analysis, The
National Academy Press, Washington, D.C., 2013. Retrieved from
http://www.nap.edu/catalog.php?record_id=18374
Katukuri, J., Xie, Y., Raghavan, V., and Gupta, A. “Hypotheses
generation as supervised link discovery with automated class labeling
on large-scale biomedical concept networks”, BMC Genomics, 13(Suppl
3):S5, 2012.
Katukuri, J., Mukherjee, R., and Konik, T. “Large scale
recommendations in a dynamic marketplace”. ACM RecSys (LSRS
workshop), 2013.
48. References (cont’d)
Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall
Street Journal. Retrieved September 14, 2011, from
http://online.wsj.com/article/SB10001424053111903532804576569133
957145822.html
Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS
Blogs Home. Retrieved December 16, 2012, from
http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-ofbig-data-problem-do-you-have/
Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers:
How Does Data-Driven Decision Making Affect Firm Performance? Last
retrieved on December 16, 2012.
Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’
Value Creation. Master Thesis, University of Amsterdam. Retrieved
December 16, 2012 from http://nielsmouthaan.nl/big-data-thesis.pdf