Massive data analysis:
applications and challenges
Vijay Raghavan
University of Louisiana at Lafayette

Jayasimha Katukuri
eBay

Ying Xie
Kennesaw State University
Agenda
Trends and Perspectives
Kinds of Big Data problems
Big Data Application scenarios
Current State of the Art
Big Data Applications- Examples
Big Data Analysis- Research Areas
Conclusions

12/30/2013

Trends and Perspectives
In 2009, McKinsey estimated that nearly all sectors in
the US economy had at least an average of 200
terabytes of stored data per organization (for
organizations with more than 1000 employees).
As an example, Walmart’s customer transaction
database was reported to be 110 terabytes in 2000.
By 2004 it increased to be over half a petabyte
(Schuman, 2004).
A growing share, roughly 80%, of the data organizations own can be
classified as unstructured data: for example, data
packed in emails, social media and multimedia.
Trends and Perspectives (Contd …)
Given average annual data growth of
59% (Pettey & Goasduff, 2011), this percentage
(unstructured data) will likely be much higher in a
few years.
Not only is an increasing number of people
connected to the Internet; there is also a significant
increase in the number of physical devices connected
to the Internet.
Beyond the volume of data becoming a problem,
variety and velocity are also issues we need to
look at (Russom, 2011).
Trends and Perspectives (Contd …)
Big Data: Data that is complex in terms of volume,
variety, velocity and/or its relation to other data,
which makes it hard to handle using traditional
database management systems or tools.
“Through 2015, more than 85% of Fortune 500
organizations will fail to effectively exploit big data
for competitive advantage.” (Gartner’s Top
Predictions 2012).
Analysts need to i) cope with massive data
distributed across locations; ii) treat data as a
resource to understand underlying phenomena (NRC
Study, 2013).
The Meaning of Big Data – 3V’s
Big Volume
- With simple and complex (SQL) analytics
- Scaling complex operations

Big Velocity
- Drink from the fire hose
- Beyond OLTP, NoSQL

Big Variety
- Large number of diverse data sources to integrate
- Beyond Global Schema-based approaches

Velocity- Time to action vs. Value
(Hackathorn, 2002)

Kinds of Big Data Problems
(Davis, 2012)

Big Data – Big Analytics

Complex math operations (machine learning,
clustering, trend detection, …)
mostly specified as linear algebra on array data
in the stock market domain, the world of
“quants”
A dozen or so common “inner loops”
Matrix multiply
QR decomposition
Singular value decomposition (SVD)
Linear regression
Big Data – Big Analytics- An Example
Consider choosing price on all trading days for the
last 5 years for two stocks A and B
What is the covariance between the two time-series?
cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))

Now make it more challenging …
All pairs of 4000 selected stocks- 4000 x 1000 matrix
Hourly, instead of daily?
All securities?
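
The covariance question above can be sketched in a few lines of Python. This is a minimal illustration, not a scalable implementation: the ticker names and price values are hypothetical, and the population form (dividing by N) is used because that is the form given on the slide.

```python
from itertools import combinations


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def all_pairs_covariance(prices):
    """Covariance for every unordered pair of tickers in a dict of series."""
    return {
        (s, t): covariance(prices[s], prices[t])
        for s, t in combinations(sorted(prices), 2)
    }


# Hypothetical closing prices for three stocks over four trading days.
prices = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
pairwise = all_pairs_covariance(prices)
```

With 4000 stocks the same loop would visit 4000 * 3999 / 2 = 7,998,000 pairs, which is why such workloads are usually expressed as a single matrix multiply rather than pair by pair.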

Big Data Application Scenarios- Detecting anomalies or emerging events
Visa’s fraud detection program
HP’s compliance detection using its event
management solution
Detecting abnormal situations in ICU
Detecting server attacks, marketing keywords,
environmental hazards
Detecting terror and diseases
Detecting national security risks (Singapore’s RAHS
(Risk Assessment & Horizon Scanning) against
disease, financial risk)
Big Data Application Scenarios- Predicting near future & Trend analysis
CRM: churn prediction
Crime prevention by predicting likely
locations of criminal activities
Defect prediction (Volvo)
Google Flu Trends
Personalized recommender systems (Amazon)
Personalized labor support system (Germany, saving 10B euro)

Big Data Application Scenarios- Real-time analysis and Decision Support
CRM
Healthcare applications in ICU
Marketing support
Navigation service
Real-time Q/A systems

Big Data Application Scenarios- Pattern Learning
Google’s automatic language translation
Apple’s Siri, Google Now
IBM Watson (Seton Healthcare Family uses Watson to
learn from the data of 2M patients annually)

Current State of the Art
Rise of the cloud
Big analytics as a service
Amazon DynamoDB, Google BigQuery, Windows Azure Tables

Hadoop, Open source- heart of big data analytics
HDFS does not index data
Run big jobs using big files vs. small jobs as fast as possible
Several variants- Cloudera, Amazon Elastic MapReduce, IBM
InfoSphere

Current State of the Art (contd.)
Machine learning for massive data sets
Hadoop requires mappers and reducers to communicate with
each other through a file system (HDFS). Some of the
alternative technologies in this space are:
GraphLab (http://graphlab.org/)
Apache Spark (http://spark.incubator.apache.org/)

Real-time analytics
Hadoop is not ideal for real-time analytics. Apache Storm
(http://storm-project.net/) is one technology that aims to
address real-time analytics.
Current State of the Art (contd.)
In-Memory analytics
Focuses on the velocity part of big data
Oracle Exalytics In-Memory machine, 1 terabyte RAM
SAS High-performance Analytics (unstructured data)
Non-commercial- VoltDB

Big Data Applications- Hypothesis
Discovery

Motivation for Literature-based
Hypotheses Discovery Systems
Biomedical research is divided into highly specialized fields and
subfields, with poor communication between them.
The rate of growth of publications makes it difficult for a researcher to
derive connections between concepts from different research
specialties. It is also an opportunity: literature-based discovery
becomes more useful as the literature grows, since more data means better
reliability in statistical methods.
Mining hidden connections among biomedical concepts from large
amounts of scientific literature is one of the important goals pursued in
this field [1].
Pfizer uses text mining software to move to a broader understanding
before making major investments in specific compounds. It is
estimated that $18 billion is spent per year on compounds that never
reach market, while $30 billion is spent reinventing what is in the
literature.

Hypothesis Discovery from Biomedical
Literature : Example
Swanson found the hidden connection between “Fish Oil” and
“Raynaud's Disease” by finding the common concepts from the
document sets of “Fish Oil” and “Raynaud's Disease” [4,5].
[Figure: Fish Oil and Raynaud’s disease connected through the common concepts High blood viscosity and Platelet aggregation]
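
The idea of finding common concepts between two document sets can be sketched as follows. This is only an illustration of the intersection step: the concept sets are toy data invented for the example, and ranking by the smaller of the two document frequencies is an assumption, not Swanson's procedure.

```python
from collections import Counter


def linking_concepts(docs_a, docs_c):
    """Return concepts found in both document sets, best supported first.

    Each document is given as an iterable of concept names. A concept is
    ranked by the smaller of its two document frequencies (an assumption
    made for this sketch).
    """
    freq_a = Counter(c for doc in docs_a for c in set(doc))
    freq_c = Counter(c for doc in docs_c for c in set(doc))
    shared = set(freq_a) & set(freq_c)
    return sorted(shared, key=lambda c: (-min(freq_a[c], freq_c[c]), c))


# Hypothetical concept annotations for two small document sets.
fish_oil_docs = [
    {"blood viscosity", "platelet aggregation"},
    {"blood viscosity", "triglycerides"},
]
raynaud_docs = [
    {"blood viscosity", "vasoconstriction"},
    {"platelet aggregation", "blood viscosity"},
]
links = linking_concepts(fish_oil_docs, raynaud_docs)
```

Concepts that appear in only one of the two sets (here "triglycerides" and "vasoconstriction") are dropped, leaving the candidate intermediate concepts.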

Link Discovery Methods in Biomedical
Literature
The problem of hypotheses discovery in biomedical
literature is similar to the link discovery problem.
The existing approaches for hypotheses discovery
have not explored the network topology features
used in the link discovery methods.
The existing approaches do not provide an
automated way of evaluating the results.
Supervised learning methods have not been explored.
Proposed Method: Supervised Link
Discovery
Supervised Link Discovery
Concept Network : Model the whole Medline literature repository as
a complex network of biomedical concepts
Generate labeled data automatically using Concept Networks
corresponding to two different time periods.
Extract a set of features from the concept network for concept
pairs.
A supervised learning approach to learn a model for link discovery.

Concept Network
Each node represents a biomedical concept
Node Attributes:
concept name
semantic type,
related authors, and
document frequency

Each edge represents an association between two
concepts.
Edge Attributes:
Co-occurrence frequency
Concept Network – Map-Reduce
[Figure: documents (Doc-1, Doc-2, Doc-3) are read by Mapper-1, Mapper-2 and Mapper-3, each holding a CCM_local and emitting Key: (c1, c2, year), Value: co-count; Reducer-1 and Reducer-2 aggregate the output into HDFS]
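
The map and reduce roles in this slide can be imitated in plain Python. This is a single-process sketch under stated assumptions: the function names and the sample documents are hypothetical, each document contributes a co-count of 1 per concept pair, and the per-mapper local aggregation (CCM_local) is left out for brevity.

```python
from collections import defaultdict
from itertools import combinations


def map_document(concepts, year):
    """Mapper: emit ((c1, c2, year), 1) for each concept pair in one document."""
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        yield (c1, c2, year), 1


def reduce_counts(emitted):
    """Reducer: sum the co-counts that share the same (c1, c2, year) key."""
    totals = defaultdict(int)
    for key, count in emitted:
        totals[key] += count
    return dict(totals)


# Hypothetical documents: (concepts mentioned, publication year).
docs = [
    (["fish oil", "blood viscosity"], 1985),
    (["blood viscosity", "fish oil", "platelet aggregation"], 1985),
    (["raynaud disease", "blood viscosity"], 1986),
]
emitted = [pair for concepts, year in docs for pair in map_document(concepts, year)]
co_counts = reduce_counts(emitted)
```

Sorting the concepts before pairing keeps (c1, c2) in one canonical order, so the same pair always lands on the same key regardless of the order in which a document mentions them.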
Concept Network Statistics
Total number of concept pairs = 17356486
Total number of documents = 11021605
Total number of concepts = 165674

Automatic Generation of Labeled Concept
Pairs
For each pair whose connection is strong in Gts,
if it has no direct connection in Gtf, we assign positive to this
pair.
For each pair whose connection is weak in Gts,
if it has no direct connection in Gtf, we assign negative to this
pair.
Select a random sample of the nodes in Gtf and generate
concept pairs from the selected random sample.
If a pair has no connection in either Gtf or Gts, we assign
negative to it.
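
The three labeling rules above can be sketched as one function. Several things here are assumptions made for illustration: Gtf is taken to be the earlier snapshot and Gts the later one, each snapshot is a dict from an unordered concept pair to its co-occurrence count, and "strong" is read as a count at or above `min_support` while "weak" is any lower non-zero count.

```python
import random
from itertools import combinations


def label_pairs(g_tf, g_ts, min_support, sample_size=0, seed=0):
    """Return {pair: 1 or 0} for concept pairs not directly connected in g_tf."""
    labels = {}
    # Rules 1 and 2: pairs that appear in the later snapshot only.
    for pair, strength in g_ts.items():
        if pair in g_tf:
            continue  # already directly connected in the earlier snapshot
        labels[pair] = 1 if strength >= min_support else 0
    # Rule 3: sampled pairs with no connection in either snapshot.
    nodes = sorted({concept for pair in g_tf for concept in pair})
    rng = random.Random(seed)
    sampled = rng.sample(nodes, min(sample_size, len(nodes)))
    for c1, c2 in combinations(sorted(sampled), 2):
        pair = frozenset((c1, c2))
        if pair not in g_tf and pair not in g_ts:
            labels[pair] = 0
    return labels


def edge(c1, c2):
    """Helper for this sketch: an unordered concept pair."""
    return frozenset((c1, c2))


# Hypothetical snapshots with co-occurrence counts.
g_tf = {edge("a", "b"): 5, edge("b", "c"): 3, edge("c", "d"): 2}
g_ts = {edge("a", "b"): 9, edge("a", "c"): 6, edge("b", "d"): 1}
labels = label_pairs(g_tf, g_ts, min_support=4, sample_size=4)
```

Because the labels come from comparing two time periods, no manual annotation is needed, which is the point of generating them automatically.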

Features
In addition to the commonly used network topological
features, we extract the following features:
Cycle Free Effective Conductance (CFEC)
The Semantic-CFEC
The Author_List Jaccard
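
Of the three features, the author-list Jaccard is the simplest to illustrate. The sketch below applies the standard Jaccard coefficient (intersection size over union size) to the author sets of two concepts; the function name and the author names are hypothetical, and returning 0.0 for two empty lists is a convention chosen for this example.

```python
def author_list_jaccard(authors_a, authors_b):
    """Jaccard coefficient between the author sets of two concepts."""
    set_a, set_b = set(authors_a), set(authors_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention for this sketch: no authors, no overlap
    return len(set_a & set_b) / len(union)


# Hypothetical author lists attached to two concept nodes.
score = author_list_jaccard(
    ["smith", "lee", "chen"],
    ["lee", "chen", "patel"],
)
```

Here two of the four distinct authors are shared, so the score is 0.5.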

Feature Extraction
For each of the labeled pairs, we extract the set of
features described before from the snapshot of
the concept network Gtf.
To scale feature extraction to a large number of
labeled pairs, feature extraction is implemented on a
Map-Reduce cluster.
The distributed implementation of feature extraction
can be described in the following way:
Trim Gtf such that it only contains edges with strength greater than
or equal to the minimum support. Store the trimmed Gtf in the
main memory of each mapper.
Distribute the labeled pairs among the mappers. Each mapper
extracts the features for a subset of concept pairs using the
trimmed Gtf.
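
The two steps above, trimming and distributing, can be sketched in a single process. Everything named here is hypothetical: the round-robin split stands in for however the cluster assigns pairs to mappers, and the common-neighbor count is used only as a stand-in for the real feature set.

```python
from collections import defaultdict


def trim_graph(edges, min_support):
    """Keep edges with strength >= min_support; return adjacency sets."""
    adjacency = defaultdict(set)
    for (c1, c2), strength in edges.items():
        if strength >= min_support:
            adjacency[c1].add(c2)
            adjacency[c2].add(c1)
    return dict(adjacency)


def partition(pairs, num_mappers):
    """Split the labeled pairs round-robin into one chunk per mapper."""
    return [pairs[i::num_mappers] for i in range(num_mappers)]


def mapper(pairs, adjacency):
    """Extract features for one chunk using the in-memory trimmed graph."""
    for c1, c2 in pairs:
        common = adjacency.get(c1, set()) & adjacency.get(c2, set())
        yield (c1, c2), {"common_neighbors": len(common)}


# Hypothetical edge strengths for a tiny concept network.
edges = {
    ("a", "b"): 5, ("b", "c"): 4, ("a", "d"): 1,
    ("c", "d"): 6, ("a", "e"): 3, ("e", "c"): 3,
}
trimmed = trim_graph(edges, min_support=3)
labeled_pairs = [("a", "c"), ("b", "d"), ("a", "d")]
features = {}
for chunk in partition(labeled_pairs, 2):
    features.update(mapper(chunk, trimmed))
```

Each mapper only reads the trimmed graph, so the chunks are independent and can be processed in any order.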
min_support

All the measures improve as we increase the value of the
parameter ‘min_support’. As we increase ‘min_support’,
there are fewer positive examples.
10-fold cross-validation is used in all the experiments.

Different Classifiers

SVM provided around 1.5%-2% better classification
accuracy than that of decision trees.
Case Study
[Figure: concept network linking Tumor Necrosis Factor-alpha, Prostatic Neoplasms, Adenosine Triphosphate, NF-kappaB inhibitor alpha, Oligopeptides and Tetradecanoylphorbol Acetate]
Big Data Applications- Recommendations
in e-Commerce

eBay Today

Introduction
Challenges in a dynamic marketplace like eBay
Huge inventory
Several hundreds of millions

Seller-defined listings
Listings are short-lived
Wide variety
From electronics to unique collectibles
Majority are unstructured and w/o a product catalog

Listing quality
Condition, price, shipping, etc

Seller trustworthiness

Goal for a Recommendation System in eBay
Address challenges associated with a dynamic marketplace
Scalable and efficient
Computationally intensive tasks during offline model generation
Efficient online performance system

Motivation – Pre-purchase
User couldn’t purchase a listing s/he showed interest in
Placed a bid but lost the auction
“Watched” an item but someone else bought it before s/he was
ready to buy

Similar Item Recommendation (SIR)
Recommend replacement items

Motivation – Post-purchase
User just purchased an item
Related Item Recommendation (RIR)
Inspire incremental purchases
Recommend complementary/related items

System Architecture - Overview
[Figure: architecture diagram with three parts. Offline Model Generation: Clusters Model Generation and Related Clusters Model Generation. The Data Store: Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, and Cluster-Cluster Relations. Real-time Performance System: the Similar Items Recommender (SIR) takes a Lost Item, answers ?similarTo(item), and returns Similar Items; the Related Items Recommender (RIR) takes a Bought Item, answers ?relatedTo(item), and returns Related Items]
Data Store
Glue between offline and real-time systems
Raw data
Inventory data
Clickstream data
Transaction data

Conceptual Knowledgebase
Category Tree
Stop words, spell corrections, synonyms, etc.
Term dictionary

Models
Item Clusters
“clarks women shoe pumps classics”
“authentic handmade amish quilt”

Cluster-Cluster Relations
“samsung galaxy s4” – “samsung galaxy s4 screen protector”
“wolfgang puck electric pressure cooker” – “kitchenaid food processor”

[Figure: the Data Store holds Inventory, Clickstream, Transactions, the Conceptual Knowledgebase, Clusters, and Cluster-Cluster Relations]
Model Generation - Clusters
Global clustering not feasible
Inventory size in several hundreds of millions
Varied inventory ranging from electronic goods to unique collectibles

Partition input data by user queries
Take advantage of users’ perspective of item similarity

Parallel distributed K-Means in Hadoop MapReduce
Feature set
Title tokens
Category hierarchy
Attributes or concepts

Dedupe and merge overlapping clusters
100X reduction in size over inventory with over 90% coverage

[Figure: Clusters Model Generation. From the Data Store, Inventory items, Clickstream user queries, and Conceptual Knowledgebase concepts and categories feed Query-Recall Generation; its query-to-items output feeds Cluster Generation, which writes new clusters to Clusters]
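
K-Means fits MapReduce because each iteration splits into an assignment step (map) and an averaging step (reduce). The sketch below shows that split in a single process; it is not the deployed system. The numeric points are hypothetical stand-ins for item feature vectors, since turning title tokens, categories and attributes into vectors is not described in the slides.

```python
from collections import defaultdict


def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, point


def kmeans_reduce(assigned, centroids):
    """Reduce step: each new centroid is the mean of its assigned points."""
    groups = defaultdict(list)
    for index, point in assigned:
        groups[index].append(point)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for index, points in groups.items():
        new_centroids[index] = tuple(sum(dim) / len(points) for dim in zip(*points))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(iterations):
        assigned = [kmeans_map(p, centroids) for p in points]
        updated = kmeans_reduce(assigned, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical 2-D feature vectors for four items in one query partition.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Running this separately on each query partition, rather than on the whole inventory, mirrors the local clustering described above.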

Model Generation – Related Clusters
Transactional data
Item-Item co-purchase matrix

Cluster Assignment
Cluster-Cluster directed graph

Rank outgoing edges
Collaborative filtering
Edge strength, i.e. no. of users with co-purchase
Cluster-Cluster content similarity

[Figure: Related Clusters Model Generation. From the Data Store, Transactions (bought item-item pairs), Clusters, and Conceptual Knowledgebase concepts and categories feed Cluster Assignment; the resulting bought cluster-cluster pairs feed Cluster-to-Cluster Model Generation, which writes related cluster-cluster pairs to Cluster-Cluster Relations]
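
The path from item-level co-purchases to ranked cluster-cluster edges can be sketched as below. This covers only the edge-strength signal (number of distinct users with a co-purchase); the content-similarity term is omitted. The user ids, item ids, the item-to-cluster mapping and the "phone case" cluster are hypothetical.

```python
from collections import defaultdict


def related_clusters(co_purchases, item_to_cluster):
    """Build a directed cluster-cluster graph ranked by distinct co-purchasers.

    co_purchases holds (user, bought_first, bought_next) item triples.
    """
    users_per_edge = defaultdict(set)
    for user, bought_first, bought_next in co_purchases:
        source = item_to_cluster.get(bought_first)
        target = item_to_cluster.get(bought_next)
        if source is None or target is None or source == target:
            continue  # skip unassigned items and within-cluster purchases
        users_per_edge[(source, target)].add(user)
    ranked = defaultdict(list)
    for (source, target), users in users_per_edge.items():
        ranked[source].append((target, len(users)))
    for source in ranked:
        # Strongest edge first; ties broken by cluster name.
        ranked[source].sort(key=lambda e: (-e[1], e[0]))
    return dict(ranked)


# Hypothetical cluster assignment and co-purchase log.
item_to_cluster = {
    "i1": "samsung galaxy s4",
    "i3": "samsung galaxy s4",
    "i2": "samsung galaxy s4 screen protector",
    "i4": "phone case",
}
co_purchases = [
    ("u1", "i1", "i2"),
    ("u2", "i3", "i2"),
    ("u3", "i1", "i4"),
    ("u1", "i1", "i2"),  # repeat purchase by the same user counts once
]
relations = related_clusters(co_purchases, item_to_cluster)
```

Working at the cluster level means a co-purchase of one short-lived listing still contributes evidence after that listing has expired.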

Experimental Results
A/B Tests comparing against legacy systems
SIR legacy system
Completely online
Naïve approach of using seed item title as a search query

RIR legacy system
Chen, Y. and J.F. Canny, Recommending ephemeral items
at web scale, ACM SIGIR 2011
Collaborative Filtering on stable representations of items

Significant improvements at the 90% confidence level
SIR resulted in 38.18% higher user engagement (CTR)
RIR resulted in 10.5% higher CTR
Statistically significant improvement in site-wide business
metrics from both SIR & RIR

Recommendations in e-Commerce- Conclusions
Balance between similarity and quality crucial in
driving user engagement and conversion
Clusters of similar items in the inventory
Local clustering in the coverage set of user
queries

Offline models built using Map-Reduce
Huge input datasets including inventory,
clickstream and transactional data

Efficient real-time performance system
Currently deployed on ebay.com
Big Data Analytics- Research Areas
Data representation, including transformations that
reduce representational complexity
Computational complexity issues to characterize
computational resource needs and tradeoffs
Statistical model-building in massive data settings
having messy data validation issues
Sampling- both as data gathering and for data
reduction
Methods to include humans in the data analysis loop

Conclusions
Great opportunity in improving the functioning of
many disciplines by leveraging the data and turning
the data into knowledge
Requires an interdisciplinary approach to solving
problems of massive data
A major need exists for software targeted to end
users
Concerted effort is needed to educate students and
the workforce in statistical thinking and
computational thinking
References
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C.,
& Byers, A. (2011). Big data: The Next Frontier for innovation,
Competition, and Productivity.
Schuman, E. (2004, October 13). At Wal-Mart, Worlds Largest Retail
Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012,
from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMartWorlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/
Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends.
Computer, 33(1), 117–119.

Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving
“Big Data” Challenge Involves More Than Just Managing
Volumes of Data. Stamford: Gartner. Retrieved from
http://www.gartner.com/it/page.jsp?id=1731916
References (cont’d)
Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding
Digital Universe. Director, 285(6). doi:10.1002/humu.21252
Russom, P. (2011). Big Data Analytics. TDWI Research.
Pettey, C. (2012, October 18). Gartner Identifies the Top 10
Strategic Technologies for 2012. Gartner.
Hackathorn, R. (2002). Current practices in active data
warehousing. available:
http://www.dmreview.com/whitepaper/WID489.pdf
Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big
Deal. Deloitte. Retrieved from
http://www.deloitte.com/view/en_GX/global/insights/c22d83274
d1b4310VgnVCM2000001b56f00aRCRD.htm
Lee, P., & Steward, D. (2012). Technology, Media &
Telecommunications Predictions 2012, (Deloitte).

References (cont’d)
NRC of the National Academies, Frontiers in Massive Data Analysis, The
National Academy Press, Washington, D.C., 2013. Retrieved from
http://www.nap.edu/catalog.php?record_id=18374
Katukuri, J., Xie, Y., Raghavan, V., and Gupta, A. “Hypotheses
generation as supervised link discovery with automated class labeling
on large-scale biomedical concept networks”, BMC Genomics, 13(Suppl
3):S5, 2012.
Katukuri, J., Mukherjee, R., and Konik, T. “Large scale
recommendations in a dynamic marketplace”. ACM RecSys (LSRS
workshop), 2013.

References (cont’d)
Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall
Street Journal. Retrieved September 14, 2011, from
http://online.wsj.com/article/SB10001424053111903532804576569133
957145822.html
Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS
Blogs Home. Retrieved December 16, 2012, from
http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-ofbig-data-problem-do-you-have/
Brynjolfsson, E. Lorin Hitt, Heekyung Kim (2011). Strength in Numbers:
How Does Data-Driven Decision Making Affect Firm Performance?, Last
Retrieved on December 16, 2012.
Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’
Value Creation. Master Thesis, University of Amsterdam. Retrieved
December 16, 2012 from http://nielsmouthaan.nl/big-data-thesis.pdf

Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Massive Data Analysis- Challenges and Applications

  • 1. Massive data analysis: applications and challenges. Vijay Raghavan (University of Louisiana at Lafayette), Jayasimha Katukuri (eBay), Ying Xie (Kennesaw State University)
  • 2. Agenda: Trends and Perspectives; Kinds of Big Data problems; Big Data Application scenarios; Current State of the Art; Big Data Applications: Examples; Big Data Analysis: Research Areas; Conclusions. 12/30/2013
  • 3. Trends and Perspectives: In 2009, McKinsey estimated that nearly all sectors in the US economy had, on average, at least 200 terabytes of stored data per organization (for organizations with more than 1000 employees). As an example, Walmart’s customer transaction database was reported to be 110 terabytes in 2000. By 2004 it had grown to over half a petabyte (Schuman, 2004). A growing share, about 80%, of the data organizations own can be classified as unstructured data, for example data packed in emails, social media and multimedia.
  • 4. Trends and Perspectives (Contd …): Taking into account an average annual data growth of 59% (Pettey & Goasduff, 2011), this percentage (unstructured data) will likely be much higher in a few years. Not only is an increasing number of people connected to the Internet; there is also a significant increase in the number of physical devices connected to it. Besides the volume of data becoming a problem, variety and velocity are also issues we need to look at (Russom, 2011).
  • 5. Trends and Perspectives (Contd …): Big Data: data that is complex in terms of volume, variety, velocity and/or its relation to other data, which makes it hard to handle using traditional database management tools. “Through 2015, more than 85% of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.” (Gartner’s Top Predictions 2012). Analysts need to i) cope with massive data distributed across locations; ii) treat data as a resource to understand underlying phenomena (NRC Study, 2013).
  • 6. The Meaning of Big Data – 3V’s: Big Volume: with simple and complex (SQL) analytics; scaling complex operations. Big Velocity: drink from the fire hose; beyond OLTP, NoSQL. Big Variety: large number of diverse data sources to integrate; beyond global schema-based approaches.
  • 7. Velocity: Time to action vs. Value (Hackathorn, 2002)
  • 8. Kinds of Big Data Problems (Davis, 2012)
  • 9. Big Data – Big Analytics: Complex math operations (machine learning, clustering, trend detection, …), mostly specified as linear algebra on array data; in the stock market domain, the world of “quants”. A dozen or so common “inner loops”: matrix multiply, QR decomposition, SVD, linear regression.
  • 10. Big Data – Big Analytics: An Example. Consider the closing price on all trading days for the last 5 years for two stocks A and B. What is the covariance between the two time series? $\mathrm{cov}(A, B) = \frac{1}{N} \sum_{i=1}^{N} (A_i - \bar{A})(B_i - \bar{B})$, where $\bar{A}$ and $\bar{B}$ are the means of A and B. Now make it more challenging: all pairs of 4000 selected stocks (a 4000 x 1000 matrix); hourly instead of daily; all securities?
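The covariance question on slide 10 can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenters' code: the function names and the toy price series are assumptions made for the example, and a real workload at this scale would use array libraries or a distributed engine.

```python
from itertools import combinations


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def all_pairs_covariance(prices):
    """Covariance for every unordered pair of stocks in a {ticker: series} dict."""
    return {
        (s, t): covariance(prices[s], prices[t])
        for s, t in combinations(sorted(prices), 2)
    }


# Hypothetical closing prices for three tickers over four trading days.
prices = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
pair_cov = all_pairs_covariance(prices)
```

With 4000 stocks the pairwise loop covers 4000 * 3999 / 2 = 7,998,000 pairs, which is why the slide frames the problem as a matrix operation rather than a pair-by-pair computation.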
  • 11. Big Data Application Scenarios: Detecting anomalies or emerging events. Visa’s fraud detection program; HP’s compliance detection using its event management solution; detecting abnormal situations in ICU; detecting server attacks, marketing keywords, environmental hazards; detecting terror and diseases; detecting national security risks (Singapore’s RAHS (Risk Assessment & Horizon Scanning) against disease, financial risk).
  • 12. Big Data Application Scenarios: Predicting near future & Trend analysis. CRM: churn prediction; crime prevention by predicting likely locations of criminal activities; defect prediction (Volvo); Google Flu Trends; personalized recommender systems (Amazon); personalized labor support system (Germany, saving 10B euro).
  • 13. Big Data Application Scenarios: Real-time analysis and Decision Support. CRM; healthcare applications in ICU; marketing support; navigation service; real-time Q/A systems.
  • 14. Big Data Application Scenarios: Pattern Learning. Google’s automatic language translation; Apple’s Siri, Google Now; IBM Watson (Seton Healthcare Family uses Watson to learn from the data of 2M patients annually).
  • 15. Current State of the Art: Rise of the cloud: big analytics as a service (Amazon DynamoDB, Google BigQuery, Windows Azure Tables). Hadoop: open source, the heart of big data analytics; HDFS does not index data; run big jobs using big files vs. small jobs as fast as possible; several variants (Cloudera, Amazon Elastic MapReduce, IBM InfoSphere).
  • 16. Current State of the Art (contd.): Machine learning for massive data sets: Hadoop requires mappers and reducers to communicate with each other through a file system (HDFS). Some of the alternative technologies in this space are GraphLab (http://graphlab.org/) and Apache Spark (http://spark.incubator.apache.org/). Real-time analytics: Hadoop is not ideal for real-time analytics. Apache Storm (http://storm-project.net/) is one technology that tries to address real-time analytics.
  • 17. Current State of the Art (contd.): In-memory analytics: focuses on the velocity part of big data. Oracle Exalytics In-Memory Machine, 1 terabyte RAM; SAS High-Performance Analytics (unstructured data); non-commercial: VoltDB.
  • 18. Big Data Applications: Hypothesis Discovery
  • 19. Motivation for Literature-based Hypotheses Discovery Systems: Biomedical research is divided into highly specialized fields and subfields, with poor communication between them. The rate of growth of publications makes it difficult for a researcher to derive connections between concepts from different research specialties. It also presents an opportunity: literature-based discovery becomes more useful because more data means more reliable statistical methods. Mining hidden connections among biomedical concepts from large amounts of scientific literature is one of the important goals pursued in this field [1]. Pfizer uses text mining software to move to a broader understanding before making major investments in specific compounds. It is estimated that $18 billion is spent per year on compounds that never reach market, while $30 billion is spent reinventing what is in the literature.
  • 20. Hypothesis Discovery from Biomedical Literature: Example. Swanson found the hidden connection between “Fish Oil” and “Raynaud's Disease” by finding the concepts common to the document sets of “Fish Oil” and “Raynaud's Disease” [4,5]. [Diagram: Fish Oil and Raynaud’s disease linked through the intermediate concepts “High blood viscosity” and “Platelet aggregation”]
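The idea on slide 20, finding concepts shared by two otherwise separate literatures, reduces to a set intersection. The sketch below assumes each document has already been reduced to a set of concept names; the function name and the toy documents are hypothetical and only reuse the concepts named on the slide.

```python
def bridging_concepts(docs_a, docs_c):
    """Return concepts that occur in both document sets, sorted for stable output."""
    concepts_a = set().union(*docs_a)
    concepts_c = set().union(*docs_c)
    return sorted(concepts_a & concepts_c)


# Toy document sets (hypothetical); each document is a set of concept names.
fish_oil_docs = [
    {"fish oil", "high blood viscosity"},
    {"fish oil", "platelet aggregation"},
]
raynaud_docs = [
    {"raynaud's disease", "high blood viscosity"},
    {"raynaud's disease", "platelet aggregation", "unrelated concept"},
]

shared = bridging_concepts(fish_oil_docs, raynaud_docs)
```

Neither literature mentions the other's topic directly; only the intermediate concepts appear on both sides, which is what makes the connection "hidden".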
  • 21. Link Discovery Methods in Biomedical Literature: The problem of hypotheses discovery in biomedical literature is similar to the link discovery problem. The existing approaches for hypotheses discovery have not explored the network topology features used in the link discovery methods. The existing approaches do not provide an automated way of evaluating the results. Supervised learning methods have not been explored.
  • 22. Proposed Method: Supervised Link Discovery. Concept Network: model the whole Medline literature repository as a complex network of biomedical concepts. Generate labeled data automatically using concept networks corresponding to two different time periods. Extract a set of features from the concept network for concept pairs. Use a supervised learning approach to learn a model for link discovery.
  • 23. Concept Network: Each node represents a biomedical concept. Node attributes: concept name, semantic type, related authors, and document frequency. Each edge represents an association between two concepts. Edge attributes: co-occurrence frequency.
  • 24. Concept Network – Map-Reduce [Diagram: documents Doc-1, Doc-2 and Doc-3 feed Mapper-1, Mapper-2 and Mapper-3; each mapper builds a CCM_local and emits Key: (c1, c2, year), Value: co-count; Reducer-1 and Reducer-2 write the results to HDFS]
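The key/value scheme in the diagram can be imitated on a single machine. The sketch below is an assumption-laden stand-in for the Hadoop job, not the authors' implementation: it treats two concepts as co-occurring when they appear in the same document, and the function names and sample documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def map_document(concepts, year):
    """Mapper: count each unordered concept pair of one document, keyed by year."""
    local = Counter()
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        local[(c1, c2, year)] += 1
    return local


def reduce_counts(mapper_outputs):
    """Reducer: sum the co-counts emitted for each (c1, c2, year) key."""
    total = Counter()
    for local in mapper_outputs:
        total.update(local)
    return total


# Hypothetical documents: (concepts found in the document, publication year).
docs = [
    (["fish oil", "platelet aggregation"], 1985),
    (["platelet aggregation", "fish oil", "blood viscosity"], 1985),
    (["blood viscosity", "fish oil"], 1986),
]
co_counts = reduce_counts(map_document(c, y) for c, y in docs)
```

Sorting the concepts before pairing keeps (c1, c2) and (c2, c1) under one key, so counts for the same pair end up at the same reducer.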
  • 25. Concept Network Statistics: Total number of concept pairs = 17356486; total number of documents = 11021605; total number of concepts = 165674.
  • 26. Automatic Generation of Labeled Concept Pairs: For each pair whose connection is strong in Gts, if it has no direct connection in Gtf, we assign positive to this pair. For each pair whose connection is weak in Gts, if it has no direct connection in Gtf, we assign negative to this pair. Select a random sample of the nodes in Gtf and generate concept pairs from the selected random sample. If a pair has no connection in both Gtf and Gts, we assign negative to it.
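The labeling rules on slide 26 can be written as one small function. This sketch rests on two assumptions that the slide does not state outright: Gtf is read as the earlier snapshot and Gts as the later one, and a connection counts as "strong" when its co-occurrence count reaches a `min_support` threshold. The function name and sample graphs are hypothetical.

```python
def label_pairs(candidate_pairs, g_tf, g_ts, min_support):
    """Label concept pairs from two snapshots given as {pair: co-occurrence count}.

    Pairs already connected in g_tf are skipped. The rest are labeled 1 when
    their connection in g_ts is strong (count >= min_support), and 0 when it is
    weak or absent.
    """
    labels = {}
    for pair in candidate_pairs:
        if g_tf.get(pair, 0) > 0:
            continue  # directly connected in the earlier snapshot
        labels[pair] = 1 if g_ts.get(pair, 0) >= min_support else 0
    return labels


# Hypothetical snapshots; keys are sorted concept pairs.
g_tf = {("a", "b"): 3}
g_ts = {("a", "b"): 5, ("a", "c"): 4, ("b", "c"): 1}
candidates = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

labels = label_pairs(candidates, g_tf, g_ts, min_support=2)
```

Here ("a", "c") becomes a positive example, ("b", "c") is negative because its later connection is weak, and ("c", "d") is negative because it is unconnected in both snapshots.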
  • 27. Features: In addition to the commonly used network topological features, we extract the following features: Cycle Free Effective Conductance (CFEC); the Semantic-CFEC; the Author_List Jaccard.
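Of the three features, the Author_List Jaccard is the simplest to illustrate. The slides do not give its exact definition, so the sketch below assumes the standard Jaccard coefficient applied to the sets of authors associated with each concept; the function name and author identifiers are hypothetical.

```python
def author_jaccard(authors_a, authors_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two author lists."""
    set_a, set_b = set(authors_a), set(authors_b)
    union = set_a | set_b
    if not union:
        return 0.0  # neither concept has any associated author
    return len(set_a & set_b) / len(union)


# Hypothetical author lists for two concepts.
score = author_jaccard(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```

A higher score means the two concepts are studied by overlapping groups of researchers, which is plausibly a signal that a link between them may appear later.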
  • 28. Feature Extraction: For each labeled pair, we extract the set of features described before from the snapshot of the concept network Gtf. To scale feature extraction to a large number of labeled pairs, it is implemented on a Map-Reduce cluster. The distributed implementation works as follows: (1) trim Gtf so that it only contains edges with strength greater than or equal to the minimum support; (2) store the trimmed Gtf in each mapper’s main memory; (3) distribute the labeled pairs among the mappers; (4) each mapper extracts the features for a subset of concept pairs using the trimmed Gtf.
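The four steps of the distributed feature extraction can be mimicked sequentially. The sketch below is only a single-process stand-in for the Map-Reduce job: the common-neighbor count replaces the real feature set, and all names and sample data are assumptions made for illustration.

```python
def trim_graph(edges, min_support):
    """Keep only edges whose strength is at least min_support."""
    return {pair: w for pair, w in edges.items() if w >= min_support}


def neighbors(graph, node):
    """Nodes directly connected to `node` in an undirected edge dict."""
    return {b if a == node else a for (a, b) in graph if node in (a, b)}


def extract_features(graph, pair):
    """Stand-in feature: number of common neighbors of the two concepts."""
    c1, c2 = pair
    return {"common_neighbors": len(neighbors(graph, c1) & neighbors(graph, c2))}


def run_mappers(graph, labeled_pairs, num_mappers):
    """Split pairs round-robin; every 'mapper' reads the same trimmed graph."""
    shards = [labeled_pairs[i::num_mappers] for i in range(num_mappers)]
    features = {}
    for shard in shards:  # sequential here; parallel on a real cluster
        for pair in shard:
            features[pair] = extract_features(graph, pair)
    return features


# Hypothetical edge strengths for G_tf.
g_tf = {("a", "x"): 5, ("b", "x"): 4, ("a", "y"): 1,
        ("b", "y"): 6, ("a", "z"): 3, ("b", "z"): 2}
trimmed = trim_graph(g_tf, min_support=2)
features = run_mappers(trimmed, [("a", "b"), ("x", "y")], num_mappers=2)
```

Replicating the trimmed graph to every mapper trades memory for the absence of cross-mapper communication, which only works because trimming shrinks the graph enough to fit in main memory.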
  • 29. min_support: All the measures improved as we increased the value of the parameter ‘min_support’. As ‘min_support’ increases, there are fewer positive examples. 10-fold cross-validation is used in all the experiments.
  • 30. Different Classifiers: SVM provided around 1.5%-2% better classification accuracy than decision trees.
  • 32. Big Data Applications: Recommendations in e-Commerce
  • 34. Introduction: Challenges in a dynamic marketplace like eBay. Huge inventory: several hundreds of millions; seller-defined listings; listings are short-lived. Wide variety: from electronics to unique collectibles; majority are unstructured and w/o a product catalog. Listing quality: condition, price, shipping, etc.; seller trustworthiness. Goal for a Recommendation System in eBay: address challenges associated with a dynamic marketplace; scalable and efficient; computationally intensive tasks during offline model generation; efficient online performance system.
  • 35. Motivation – Pre-purchase: User couldn’t purchase a listing s/he showed interest in: placed a bid but lost the auction; “watched” an item but someone else bought it before s/he was ready to buy. Similar Item Recommendation (SIR): recommend replacement items.
  • 36. Motivation – Post-purchase: User just purchased an item. Related Item Recommendation (RIR): inspire incremental purchases; recommend complementary/related items.
  • 37. System Architecture - Overview [Diagram: Offline Model Generation (Clusters Model Generation and Related Clusters Model Generation) reads Inventory, Clickstream, Transactions and the Conceptual Knowledgebase from the Data Store and produces Clusters and Cluster-Cluster Relations. In the Real-time Performance System, the Similar Items Recommender (SIR) answers ?similarTo(item) for a lost item with similar items, and the Related Items Recommender (RIR) answers ?relatedTo(item) for a bought item with related items.]
  • 38. Data Store: Glue between offline and real-time systems. Raw data: inventory data, clickstream data, transaction data. Conceptual Knowledgebase: category tree; stop words, spell corrections, synonyms, etc.; term dictionary. Models: Item Clusters (“clarks women shoe pumps classics”, “authentic handmade amish quilt”); Cluster-Cluster Relations (“samsung galaxy s4” – “samsung galaxy s4 screen protector”, “wolfgang puck electric pressure cooker” – “kitchenaid food processor”).
  • 39. Model Generation - Clusters: Global clustering is not feasible: inventory size is in the several hundreds of millions, and the inventory is varied, ranging from electronic goods to unique collectibles. Partition input data by user queries to take advantage of users’ perspective of item similarity. Parallel distributed K-Means in Hadoop MapReduce. Feature set: title tokens, category hierarchy, attributes or concepts. Dedupe and merge overlapping clusters. 100X reduction in size over inventory with over 90% coverage. [Diagram: Inventory, Clickstream and the Conceptual Knowledgebase in the Data Store supply items, user queries, concepts and categories to Query-Recall Generation; its query-to-items output feeds Cluster Generation, which writes new clusters back to Clusters.]
  • 40. Model Generation – Related Clusters: Transactional data: item-item co-purchase matrix. Cluster assignment. Cluster-Cluster directed graph. Rank outgoing edges: collaborative filtering (edge strength, i.e. number of users with co-purchase); cluster-cluster content similarity. [Diagram: Transactions (bought item-item), Clusters and the Conceptual Knowledgebase (concepts, categories) in the Data Store feed Cluster Assignment; the resulting bought cluster-cluster pairs feed Cluster-to-Cluster Model Generation, which writes related cluster-cluster pairs to Cluster-Cluster Relations.]
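The step from item-level co-purchases to a ranked cluster-cluster graph can be sketched as follows. This is an illustrative reading of the slide, not the deployed system: it ranks outgoing edges only by the number of distinct users with a co-purchase and leaves out the content-similarity signal. The function name, item identifiers and the "phone case" cluster are hypothetical; the other two cluster names come from slide 38.

```python
from collections import defaultdict


def build_cluster_graph(co_purchases, item_to_cluster):
    """Build ranked cluster-to-cluster edges from item-level co-purchases.

    co_purchases: iterable of (user, bought_item, later_item).
    Returns {source_cluster: [(target_cluster, num_users), ...]} with the
    strongest edge first.
    """
    users = defaultdict(set)
    for user, src_item, dst_item in co_purchases:
        src = item_to_cluster.get(src_item)
        dst = item_to_cluster.get(dst_item)
        if src is None or dst is None or src == dst:
            continue  # unassigned item, or purchase within one cluster
        users[(src, dst)].add(user)

    graph = defaultdict(list)
    for (src, dst), who in users.items():
        graph[src].append((dst, len(who)))
    for src in graph:
        graph[src].sort(key=lambda edge: (-edge[1], edge[0]))
    return dict(graph)


item_to_cluster = {
    "i1": "samsung galaxy s4",
    "i2": "samsung galaxy s4",
    "i3": "samsung galaxy s4 screen protector",
    "i4": "samsung galaxy s4 screen protector",
    "i5": "phone case",
}
co_purchases = [
    ("u1", "i1", "i3"),
    ("u2", "i2", "i4"),
    ("u2", "i2", "i3"),
    ("u3", "i1", "i5"),
]
related = build_cluster_graph(co_purchases, item_to_cluster)
```

Counting distinct users rather than transactions keeps one heavy buyer from dominating an edge, and working at the cluster level lets the relation survive after the individual short-lived listings expire.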
  • 41. Experimental Results: A/B tests comparing against legacy systems. SIR legacy system: completely online; naïve approach of using the seed item title as a search query. RIR legacy system: Chen, Y. and J.F. Canny, Recommending ephemeral items at web scale, ACM SIGIR 2011; collaborative filtering on stable representations of items. Significant improvements at the 90% confidence level: SIR resulted in 38.18% higher user engagement (CTR); RIR resulted in 10.5% higher CTR. Statistically significant improvement in site-wide business metrics from both SIR & RIR.
  • 42. Recommendations in e-Commerce: Conclusions. Balance between similarity and quality is crucial in driving user engagement and conversion. Clusters of similar items in the inventory: local clustering in the coverage set of user queries. Offline models built using Map-Reduce: huge input datasets including inventory, clickstream and transactional data. Efficient real-time performance system. Currently deployed on ebay.com.
  • 43. Big Data Analytics: Research Areas. Data representation, including transformations that reduce representational complexity. Computational complexity issues to characterize computational resource needs and tradeoffs. Statistical model-building in massive data settings having messy data validation issues. Sampling, both as data gathering and for data reduction. Methods to include humans in the data analysis loop.
  • 44. Conclusions: Great opportunity in improving the functioning of many disciplines by leveraging the data and turning the data into knowledge. Requires an interdisciplinary approach to solving problems of massive data. A major need exists for software targeted to end users. Concerted effort is needed to educate students and the workforce in statistical thinking and computational thinking.
  • 45. References: Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Schuman, E. (2004, October 13). At Wal-Mart, World's Largest Retail Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012, from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMartWorlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/ Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends. Computer, 33(1), 117–119. Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving “Big Data” Challenge Involves More Than Just Managing Volumes of Data. Stamford: Gartner. Retrieved from http://www.gartner.com/it/page.jsp?id=1731916
  • 46. References (cont’d): Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding Digital Universe. Director, 285(6). doi:10.1002/humu.21252 Russom, P. (2011). Big Data Analytics. TDWI Research. Pettey, C. (2012, October 18). Gartner Identifies the Top 10 Strategic Technologies for 2012. Gartner. Hackathorn, R. (2002). Current practices in active data warehousing. Available: http://www.dmreview.com/whitepaper/WID489.pdf Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big Deal. Deloitte. Retrieved from http://www.deloitte.com/view/en_GX/global/insights/c22d83274d1b4310VgnVCM2000001b56f00aRCRD.htm Lee, P., & Steward, D. (2012). Technology, Media & Telecommunications Predictions 2012. Deloitte.
  • 47. References (cont’d): NRC of the National Academies (2013). Frontiers in Massive Data Analysis. The National Academy Press, Washington, D.C. Retrieved from http://www.nap.edu/catalog.php?record_id=18374 Katukuri, J., Xie, Y., Raghavan, V., & Gupta, A. (2012). Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks. BMC Genomics, 13(Suppl 3):S5. Katukuri, J., Mukherjee, R., & Konik, T. (2013). Large scale recommendations in a dynamic marketplace. ACM RecSys (LSRS workshop).
  • 48. References (cont’d): Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall Street Journal. Retrieved September 14, 2011, from http://online.wsj.com/article/SB10001424053111903532804576569133957145822.html Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS Blogs Home. Retrieved December 16, 2012, from http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-ofbig-data-problem-do-you-have/ Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? Last retrieved on December 16, 2012. Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’ Value Creation. Master Thesis, University of Amsterdam. Retrieved December 16, 2012, from http://nielsmouthaan.nl/big-data-thesis.pdf