Massive Data Analysis- Challenges and Applications

Massive data analysis:
applications and challenges
Vijay Raghavan
University of Louisiana at Lafayette

Jayasimha Katukuri
eBay

Ying Xie
Kennesaw State University

Agenda
Trends and Perspectives
Kinds of Big Data problems
Big Data Application scenarios
Current State of the Art
Big Data Applications- Examples
Big Data Analysis- Research Areas
Conclusions

12/30/2013


Trends and Perspectives
In 2009, McKinsey estimated that nearly all sectors in
the US economy had at least an average of 200
terabytes of stored data per organization (for
organizations with more than 1000 employees).
As an example, Walmart’s customer transaction
database was reported to be 110 terabytes in 2000.
By 2004 it had grown to over half a petabyte
(Schuman, 2004).
About 80% of the data that organizations own, a growing share,
can be classified as unstructured: for example, data
packed in emails, social media and multimedia.

Trends and Perspectives (Contd …)
Given average annual data growth of
59% (Pettey & Goasduff, 2011), this percentage
(of unstructured data) will likely be much higher in a
few years.
Not only are more people connected to the Internet;
the number of physical devices connected
to the Internet is also increasing significantly.
Volume is not the only problem:
the variety and velocity of data are also issues we need to
look at (Russom, 2011).

Trends and Perspectives (Contd …)
Big Data: Data that is complex in terms of volume,
variety, velocity and/or its relation to other data,
which makes it hard to handle using traditional
database management tools.
“Through 2015, more than 85% of Fortune 500
organizations will fail to effectively exploit big data
for competitive advantage.” (Gartner’s Top
Predictions 2012).
Analysts need to i) cope with massive data
distributed across locations; ii) treat data as a
resource to understand underlying phenomena (NRC
Study, 2013).

The Meaning of Big Data – 3V’s
Big Volume
- With simple and complex (SQL) analytics
- Scaling complex operations

Big Velocity
- Drink from the fire hose
- Beyond OLTP, NoSQL

Big Variety
- Large number of diverse data sources to integrate
- Beyond Global Schema-based approaches


Velocity- Time to action vs. Value
(Hackathorn, 2002)


Kinds of Big Data Problems
(Davis, 2012)


Big Data – Big Analytics

Complex math operations (machine learning,
clustering, trend detection, …)
mostly specified as linear algebra on array data
in the stock market domain, the world of
“quants”
A dozen or so common “inner loops”
Matrix multiply
QR decomposition
Singular value decomposition (SVD)
Linear regression

Big Data – Big Analytics- An Example
Consider the closing price on all trading days for the
last 5 years for two stocks A and B
What is the covariance between the two time-series?
cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))

Now Make it more challenging …
All pairs of 4000 selected stocks- 4000 x 1000 matrix
Hourly, instead of daily?
All securities?
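
The covariance of two price series, and its extension to all pairs, can be sketched in plain Python. This is a minimal illustration that uses the 1/N (population) form given on the slide; the function names and the toy price series are hypothetical and not taken from the talk.

```python
from itertools import combinations


def mean(xs):
    return sum(xs) / len(xs)


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def all_pairs_covariance(prices):
    """Covariance for every unordered pair of stocks.

    `prices` maps a ticker to its price series. With S stocks this
    evaluates S * (S - 1) / 2 pairs, one full pass over two series each.
    """
    return {
        (s, t): covariance(prices[s], prices[t])
        for s, t in combinations(sorted(prices), 2)
    }


# Toy data, made up for illustration only.
toy_prices = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
pair_cov = all_pairs_covariance(toy_prices)
```

For 4000 stocks the pairwise loop covers 4000 * 3999 / 2 = 7,998,000 pairs, which is why the all-pairs case is usually expressed as a single matrix product on mean-centered data rather than as independent pairwise passes.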


Big Data Application Scenarios- Detecting anomalies or emerging events
Visa’s fraud detection program
HP’s compliance detection using its event
management solution
Detecting abnormal situations in ICU
Detecting server attacks, marketing keywords,
environmental hazards
Detecting terror and diseases
Detecting national security risks (Singapore’s RAHS,
Risk Assessment & Horizon Scanning, against
disease and financial risk)

Big Data Application Scenarios- Predicting near future & Trend analysis
CRM: churn prediction
Crime prevention by predicting likely
locations of criminal activities
Defect prediction (Volvo)
Google Flu Trends
Personalized recommender systems (Amazon)
Personalized labor support system (Germany, saving 10B
euro)


Big Data Application Scenarios- Real-time analysis and Decision Support
CRM
Healthcare applications in ICU
Marketing support
Navigation service
Real-time Q/A systems


Big Data Application Scenarios- Pattern Learning
Google’s automatic language translation
Apple’s Siri, Google Now
IBM Watson (Seton Healthcare Family uses Watson to
learn from 2M patient data annually)


Current State of the Art
Rise of the cloud
Big analytics as a service
Amazon DynamoDB, Google BigQuery, Windows Azure Tables

Hadoop (open source) is at the heart of big data analytics
HDFS does not index data
Geared to running big jobs on big files rather than small jobs as fast as possible
Several variants- Cloudera, Amazon Elastic MapReduce, IBM
InfoSphere


Current State of the Art (contd.)
Machine learning for massive data sets
Hadoop requires mappers and reducers to communicate with
each other through a file system (HDFS). Some of the
alternative technologies in this space are:
GraphLab (http://graphlab.org/)
Apache Spark (http://spark.incubator.apache.org/)

Real-time analytics
Hadoop is not ideal for real-time analytics. Apache Storm
(http://storm-project.net/) is one technology that tries to
address real-time analytics.

Current State of the Art (contd.)
In-Memory analytics
Focuses on the velocity part of big data
Oracle Exalytics In-Memory machine, 1 terabyte RAM
SAS High-performance Analytics (unstructured data)
Non-commercial- VoltDB


Big Data Applications- Hypothesis
Discovery


Motivation for Literature-based
Hypotheses Discovery Systems
Biomedical research is divided into highly specialized fields and
subfields, with poor communication between them.
The rate of growth of publications makes it difficult for a researcher to
derive connections between concepts from different research
specialties. It is also an opportunity: literature-based discovery
becomes more useful as the literature grows, because more data makes
statistical methods more reliable.
Mining hidden connections among biomedical concepts from large
amounts of scientific literature is one of the important goals pursued in
this field [1].
Pfizer uses text mining software to gain a broader understanding
before making major investments in specific compounds. It is
estimated that $18 billion is spent per year on compounds that never
reach the market, while $30 billion is spent reinventing what is already in the
literature.


Hypothesis Discovery from Biomedical
Literature : Example
Swanson found the hidden connection between “Fish Oil” and
“Raynaud's Disease” by finding the concepts common to the
document sets of “Fish Oil” and “Raynaud's Disease” [4,5].
[Figure: Fish Oil and Raynaud’s disease linked through the shared concepts high blood viscosity and platelet aggregation]
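
The idea of finding concepts shared by two document sets can be sketched as a set intersection. This is only an illustration of the principle, not Swanson's actual procedure: each document is modeled as a set of concept strings, the function name is hypothetical, and the toy documents (including the filler concepts) are made up.

```python
def linking_concepts(docs_a, docs_c):
    """Return the concepts that occur in both document sets.

    Each document is modeled as a set of concept strings.
    """
    concepts_a = set().union(*docs_a) if docs_a else set()
    concepts_c = set().union(*docs_c) if docs_c else set()
    return concepts_a & concepts_c


# Toy document sets (hypothetical). The two linking concepts follow the
# slide; the other concepts are filler for the example.
fish_oil_docs = [
    {"high blood viscosity", "triglycerides"},
    {"platelet aggregation"},
]
raynaud_docs = [
    {"high blood viscosity", "vasoconstriction"},
    {"platelet aggregation", "cold exposure"},
]
shared = linking_concepts(fish_oil_docs, raynaud_docs)
```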


Link Discovery Methods in Biomedical
Literature
The problem of hypotheses discovery in biomedical
literature is similar to the link discovery problem.
The existing approaches for hypotheses discovery
have not explored the network topology features
used in the link discovery methods.
The existing approaches do not provide an
automated way of evaluating the results.
Supervised learning methods have not been explored.

Proposed Method: Supervised Link
Discovery
Supervised Link Discovery
Concept Network : Model the whole Medline literature repository as
a complex network of biomedical concepts
Generate labeled data automatically using Concept Networks
corresponding to two different time periods.
Extract a set of features from the concept network for concept
pairs.
A supervised learning approach to learn a model for link discovery.


Concept Network
Each node represents a biomedical concept
Node Attributes:
concept name
semantic type,
related authors, and
document frequency

Each edge represents an association between two
concepts.
Edge Attributes:
Co-occurrence frequency

Concept Network – Map-Reduce
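[Figure: Doc-1, Doc-2, and Doc-3 are processed by Mapper-1, Mapper-2, and Mapper-3. Each mapper builds a CCM_local and emits records with key (c1, c2, year) and value co-count. Reducer-1 and Reducer-2 combine the records and write the result to HDFS.]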
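
The map and reduce steps for building the concept network can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the authors' Hadoop job: the function names are hypothetical, each document is assumed to arrive as a list of concepts plus a publication year, and the per-mapper counter plays the role of CCM_local from the figure.

```python
from collections import Counter
from itertools import combinations


def map_document(concepts, year):
    """Mapper: emit ((c1, c2, year), 1) for each concept pair in one document."""
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        yield (c1, c2, year), 1


def run_mapper(documents):
    """Combine one mapper's output into a local co-occurrence count."""
    local = Counter()
    for concepts, year in documents:
        for key, count in map_document(concepts, year):
            local[key] += count
    return local


def reduce_counts(local_counts):
    """Reducer: sum the local counts that share a (c1, c2, year) key."""
    total = Counter()
    for local in local_counts:
        total.update(local)
    return total


# Toy input split across two mappers (made-up documents).
split_1 = [
    (["fish oil", "platelet aggregation"], 1985),
    (["fish oil", "platelet aggregation", "blood viscosity"], 1985),
]
split_2 = [(["platelet aggregation", "fish oil"], 1985)]
co_counts = reduce_counts([run_mapper(split_1), run_mapper(split_2)])
```

Sorting the concepts before pairing gives each pair one canonical key, so the same pair emitted by different mappers is summed together. In a real cluster the framework would route keys to several reducers; a single reducer is used here for brevity.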

Concept Network Statistics
Total number of concept pairs = 17356486
Total number of documents = 11021605
Total number of concepts = 165674


Automatic Generation of Labeled Concept
Pairs
For each pair whose connection is strong in Gts,
if it has no direct connection in Gtf, we label the pair
positive.
For each pair whose connection is weak in Gts,
if it has no direct connection in Gtf, we label the pair
negative.
Select a random sample of the nodes in Gtf and generate
concept pairs from the selected random sample.
If a pair has no connection in either Gtf or Gts, we label it
negative.
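
The three labeling rules can be sketched as follows. Several points are assumptions rather than statements from the slides: Gtf is taken to be the earlier snapshot and Gts the later one, each graph is a dictionary from a sorted concept pair to its co-occurrence count, and "strong" is read as a count of at least min_support while "weak" is any smaller count. All function and parameter names are hypothetical.

```python
import random
from itertools import combinations


def pair_key(c1, c2):
    """Canonical (sorted) key for an unordered concept pair."""
    return tuple(sorted((c1, c2)))


def label_pairs(g_tf, g_ts, min_support, sample_size, seed=0):
    """Label concept pairs from two snapshots of the concept network.

    g_tf, g_ts: dicts mapping a sorted (c1, c2) tuple to a co-occurrence count.
    Returns a dict mapping pair -> True (positive) or False (negative).
    """
    labels = {}
    # Rules 1 and 2: pairs connected in g_ts with no direct connection in g_tf.
    for pair, strength in g_ts.items():
        if pair in g_tf:
            continue
        labels[pair] = strength >= min_support
    # Rule 3: pairs among a random node sample that are unconnected in both.
    nodes = sorted({c for pair in g_tf for c in pair})
    rng = random.Random(seed)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    for c1, c2 in combinations(sample, 2):
        pair = pair_key(c1, c2)
        if pair not in g_tf and pair not in g_ts:
            labels[pair] = False
    return labels


# Toy snapshots (made up for illustration).
toy_g_tf = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 2}
toy_g_ts = {("a", "b"): 9, ("a", "c"): 6, ("b", "d"): 1}
toy_labels = label_pairs(toy_g_tf, toy_g_ts, min_support=3, sample_size=4)
```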


Features
In addition to the commonly used network topological
features, we extract the following features:
Cycle Free Effective Conductance (CFEC)
The Semantic-CFEC
The Author_List Jaccard
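
Of these features, the Jaccard coefficient is the simplest to illustrate. The sketch below assumes the feature compares the related-author lists of the two concepts in a pair, which the slides suggest but do not spell out; the function name is hypothetical, and CFEC and Semantic-CFEC are not covered here.

```python
def author_list_jaccard(authors_a, authors_b):
    """Jaccard coefficient |A & B| / |A | B| of two author collections."""
    set_a, set_b = set(authors_a), set(authors_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty author lists
    return len(set_a & set_b) / len(union)
```

For example, author lists ["x", "y", "z"] and ["y", "z", "w"] share two of four distinct authors, so the coefficient is 0.5.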


Feature Extraction
For each labeled pair, we extract the set of
features described before from the snapshot of
the concept network Gtf.
To scale the feature extraction to a large number of
labeled pairs, feature extraction is implemented on a
Map-Reduce cluster.
The distributed implementation of feature extraction
can be described in the following way:
Trim Gtf such that it only contains edges with strength greater than
or equal to the minimum support. Store the trimmed Gtf in each
mapper’s main memory.
Distribute the labeled pairs among the mappers. Each mapper
extracts the features for a subset of concept pairs using the
trimmed Gtf.
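
The two steps of the distributed feature extraction can be sketched in a single process. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the round-robin split stands in for however the cluster distributes pairs, and common neighbors and neighbor Jaccard stand in for the full feature set (CFEC is not implemented here).

```python
from collections import defaultdict


def trim_graph(edges, min_support):
    """Keep only edges whose strength is >= min_support; return adjacency sets."""
    adjacency = defaultdict(set)
    for (c1, c2), strength in edges.items():
        if strength >= min_support:
            adjacency[c1].add(c2)
            adjacency[c2].add(c1)
    return dict(adjacency)


def partition(pairs, num_mappers):
    """Split the labeled pairs round-robin into one chunk per mapper."""
    return [pairs[i::num_mappers] for i in range(num_mappers)]


def extract_features(adjacency, pairs):
    """Mapper body: topological features for each pair from the trimmed graph."""
    features = {}
    for c1, c2 in pairs:
        n1 = adjacency.get(c1, set())
        n2 = adjacency.get(c2, set())
        union = n1 | n2
        features[(c1, c2)] = {
            "common_neighbors": len(n1 & n2),
            "neighbor_jaccard": len(n1 & n2) / len(union) if union else 0.0,
        }
    return features


# Toy graph and labeled pairs (made up for illustration).
toy_edges = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 1,
             ("a", "e"): 3, ("c", "e"): 3}
trimmed = trim_graph(toy_edges, min_support=3)
chunks = partition([("a", "c"), ("b", "d")], num_mappers=2)
all_features = {}
for chunk in chunks:  # each chunk would run on its own mapper
    all_features.update(extract_features(trimmed, chunk))
```

Because every mapper holds the whole trimmed graph in memory, each pair can be processed independently, and the mappers never need to exchange data.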

min_support

All the measures improve as we increase the value of the
parameter ‘min_support’. As ‘min_support’ increases,
there are fewer positive examples.
10-fold cross-validation is used in all the experiments.


Different Classifiers

SVM provided around 1.5%-2% better classification
accuracy than that of decision trees.

Case Study
[Figure: concept graph showing Tumor Necrosis Factor-alpha, Prostatic Neoplasms, Adenosine Triphosphate, NF-kappaB inhibitor alpha, Oligopeptides, and Tetradecanoylphorbol Acetate]

Big Data Applications- Recommendations
in e-Commerce


Introduction
Challenges in a dynamic marketplace like eBay
Huge inventory
Several hundreds of millions

Seller-defined listings
Listings are short-lived
Wide variety
From electronics to unique collectibles
The majority are unstructured and without a product catalog

Listing quality
Condition, price, shipping, etc

Seller trustworthiness

Goal for a Recommendation System in eBay
Address challenges associated with a dynamic marketplace
Scalable and efficient
Computationally intensive tasks during offline model generation
Efficient online performance system


Motivation – Pre-purchase
User couldn’t purchase a listing s/he showed interest in
Placed a bid but lost the auction
“Watched” an item but someone else bought it before s/he was
ready to buy

Similar Item Recommendation (SIR)
Recommend replacement items


Motivation – Post-purchase
User just purchased an item
Related Item Recommendation (RIR)
Inspire incremental purchases
Recommend complementary/related items


System Architecture - Overview
[Figure: system architecture with three parts. Offline Model Generation: Clusters Model Generation and Related Clusters Model Generation. The Data Store: Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, and Cluster-Cluster Relations. Real-time Performance System: the Similar Items Recommender (SIR) takes a lost item, issues ?similarTo(item), and returns similar items; the Related Items Recommender (RIR) takes a bought item, issues ?relatedTo(item), and returns related items.]

Data Store
[Figure: the Data Store, containing Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, and Cluster-Cluster Relations]

Glue between offline and real-time systems
Raw data
Inventory data
Clickstream data
Transaction data

Conceptual Knowledgebase
Category Tree
Stop words, spell corrections, synonyms, etc.
Term dictionary

Models
Item Clusters
“clarks women shoe pumps classics”
“authentic handmade amish quilt”

Cluster-Cluster Relations
“samsung galaxy s4” – “samsung galaxy s4 screen protector”
“wolfgang puck electric pressure cooker” – “kitchenaid food processor”

Model Generation - Clusters
Global clustering not feasible
Inventory size in several hundreds of millions
Varied inventory ranging from electronic goods to unique collectibles

Partition input data by user queries
Take advantage of users’ perspective of item similarity

Parallel distributed K-Means in Hadoop MapReduce
Feature set
Title tokens
Category hierarchy
Attributes or concepts

Dedupe and merge overlapping clusters
100X reduction in size over inventory with over 90% coverage

[Figure: Clusters Model Generation. The Data Store (Inventory, Clickstream, Conceptual Knowledgebase, Clusters) supplies items, user queries, concepts, and categories to Query-Recall Generation; its query-to-items output feeds Cluster Generation, which writes new clusters back to the Data Store.]
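
One way to read "parallel distributed K-Means in MapReduce" is that each round assigns points to their nearest centroid in the map step and averages each group in the reduce step. The sketch below shows that pattern in plain Python as an assumption about the general technique, not eBay's implementation: the function names are hypothetical, and items are simplified to numeric tuples rather than title tokens, categories, and attributes.

```python
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_map(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for each point."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        yield nearest, point


def kmeans_reduce(assignments, centroids):
    """Reducer: new centroid = mean of the points assigned to it."""
    groups = defaultdict(list)
    for index, point in assignments:
        groups[index].append(point)
    new_centroids = list(centroids)  # a centroid with no points is kept as-is
    for index, members in groups.items():
        new_centroids[index] = tuple(
            sum(dim) / len(members) for dim in zip(*members)
        )
    return new_centroids


def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds over the data partitions."""
    for _ in range(iterations):
        assignments = [pair for part in partitions
                       for pair in kmeans_map(part, centroids)]
        centroids = kmeans_reduce(assignments, centroids)
    return centroids


# Toy data in two partitions (made up for illustration).
toy_partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
final_centroids = kmeans(toy_partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Running the clustering separately within the item set recalled for each user query, as the slide describes, keeps every individual K-Means problem small even when the full inventory is not.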

Model Generation – Related Clusters
Transactional data
Item-Item co-purchase matrix

Cluster Assignment
Cluster-Cluster directed graph

Rank outgoing edges
Collaborative filtering
Edge strength, i.e., no. of users with co-purchase
Cluster-Cluster content similarity

[Figure: Related Clusters Model Generation. The Data Store (Transactions, Conceptual Knowledgebase, Clusters) supplies bought item-item pairs, clusters, concepts, and categories to Cluster Assignment; its bought cluster-cluster output feeds Cluster-to-Cluster Model Generation, which writes related cluster-cluster pairs back to the Data Store as Cluster-Cluster Relations.]
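
The related-clusters model can be sketched as a directed graph whose edge strength is the number of distinct users with a co-purchase. This is a simplified illustration under assumptions: the function names and the triple format are hypothetical, and the ranking uses edge strength only, whereas the slide also lists cluster-cluster content similarity.

```python
from collections import defaultdict


def build_cluster_graph(co_purchases, item_to_cluster):
    """Build a directed cluster-cluster graph from item-item co-purchases.

    co_purchases: iterable of (user, bought_item, later_item) triples.
    Edge strength is the number of distinct users behind the edge.
    """
    users_per_edge = defaultdict(set)
    for user, item_a, item_b in co_purchases:
        src = item_to_cluster.get(item_a)
        dst = item_to_cluster.get(item_b)
        if src is None or dst is None or src == dst:
            continue  # skip unassigned items and self-loops
        users_per_edge[(src, dst)].add(user)
    graph = defaultdict(dict)
    for (src, dst), users in users_per_edge.items():
        graph[src][dst] = len(users)
    return dict(graph)


def rank_related(graph, cluster, top_n=3):
    """Rank a cluster's outgoing edges by strength, strongest first."""
    edges = graph.get(cluster, {})
    return sorted(edges, key=lambda dst: (-edges[dst], dst))[:top_n]


# Toy data (made up; the "case" cluster name is hypothetical).
toy_item_to_cluster = {
    "i1": "samsung galaxy s4",
    "i4": "samsung galaxy s4",
    "i2": "samsung galaxy s4 screen protector",
    "i3": "samsung galaxy s4 case",
}
toy_co_purchases = [
    ("u1", "i1", "i2"),
    ("u2", "i4", "i2"),
    ("u2", "i1", "i3"),
    ("u1", "i1", "i2"),  # repeat purchase by the same user counts once
]
toy_graph = build_cluster_graph(toy_co_purchases, toy_item_to_cluster)
toy_related = rank_related(toy_graph, "samsung galaxy s4")
```

Working at the cluster level rather than the item level means a recommendation can still be served after the individual listings that produced the co-purchase signal have ended.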

Experimental Results
A/B Tests comparing against legacy systems
SIR legacy system
Completely online
Naïve approach of using seed item title as a search query

RIR legacy system
Chen, Y. and J.F. Canny, Recommending ephemeral items
at web scale, ACM SIGIR 2011
Collaborative Filtering on stable representations of items

Significant improvements at the 90% confidence level
SIR resulted in 38.18% higher user engagement (CTR)
RIR resulted in 10.5% higher CTR
Statistically significant improvement in site-wide business
metrics from both SIR & RIR


Recommendations in e-Commerce- Conclusions
Balance between similarity and quality crucial in
driving user engagement and conversion
Clusters of similar items in the inventory
Local clustering in the coverage set of user
queries

Offline models built using Map-Reduce
Huge input datasets including inventory,
clickstream and transactional data

Efficient real-time performance system
Currently deployed on ebay.com

Big Data Analytics- Research Areas
Data representation, including transformations that
reduce representational complexity
Computational complexity issues to characterize
computational resource needs and tradeoffs
Statistical model-building in massive data settings
having messy data validation issues
Sampling- both as data gathering and for data
reduction
Methods to include humans in the data analysis loop


Conclusions
Great opportunity in improving the functioning of
many disciplines by leveraging the data and turning
the data into knowledge
Requires an interdisciplinary approach to solving
problems of massive data
A major need exists for software targeted to end
users
Concerted effort is needed to educate students and
the workforce in statistical thinking and
computational thinking

References
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C.,
& Byers, A. (2011). Big data: The Next Frontier for innovation,
Competition, and Productivity.
Schuman, E. (2004, October 13). At Wal-Mart, Worlds Largest Retail
Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012,
from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMartWorlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/
Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends.
Computer, 33(1), 117–119.

Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving
“Big Data” Challenge Involves More Than Just Managing
Volumes of Data. Stamford: Gartner. Retrieved from
http://www.gartner.com/it/page.jsp?id=1731916

References (cont’d)
Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding
Digital Universe. Director, 285(6). doi:10.1002/humu.21252
Russom, P. (2011). Big Data Analytics. TDWI Research.
Pettey, C. (2012, October 18). Gartner Identifies the Top 10
Strategic Technologies for 2012. Gartner.
Hackathorn, R. (2002). Current practices in active data
warehousing. available:
http://www.dmreview.com/whitepaper/WID489.pdf
Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big
Deal. Deloitte. Retrieved from
http://www.deloitte.com/view/en_GX/global/insights/c22d83274d1b4310VgnVCM2000001b56f00aRCRD.htm
Lee, P., & Steward, D. (2012). Technology, Media &
Telecommunications Predictions 2012, (Deloitte).


NRC of the National Academies, Frontiers in Massive Data Analysis, The
National Academy Press, Washington, D.C., 2013. Retrieved from
http://www.nap.edu/catalog.php?record_id=18374
Katukuri, J., Xie, Y., Raghavan, V., and Gupta, A. “Hypotheses
generation as supervised link discovery with automated class labeling
on large-scale biomedical concept networks”, BMC Genomics, 13(Suppl
3):S5, 2012.
Katukuri, J., Mukherjee, R., and Konik, T. “Large scale
recommendations in a dynamic marketplace”. ACM RecSys (LSRS
workshop), 2013.


Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall
Street Journal. Retrieved September 14, 2011, from
http://online.wsj.com/article/SB10001424053111903532804576569133957145822.html
Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS
Blogs Home. Retrieved December 16, 2012, from
http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-ofbig-data-problem-do-you-have/
Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers:
How Does Data-Driven Decision Making Affect Firm Performance?
Retrieved December 16, 2012.
Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’
Value Creation. Master Thesis, University of Amsterdam. Retrieved
December 16, 2012 from http://nielsmouthaan.nl/big-data-thesis.pdf
