Daqing Zhao, PhD
Founder and Principal, Eureka Analytics
Business Intelligence Innovation Summit,
Chicago
5/23/2013
©Daqing Zhao All rights reserved
Frontiers of Big Data Business
Analytics, Patterns and Cases in
Online Marketing
Agenda
• Overview of big data analytics
• Insights of big data and analysis
• BI process on big data
• Lessons of model building
• Cases for behavioral profiles for predictive models
– Yahoo network segmentation
– Tribal Fusion display ads impression optimization
– University of Phoenix student retention and lead
optimization
• Case of Ask.com SEM algorithms
2
Daqing Zhao, PhD
• Big Data scientist with deep domain knowledge
• Academic training
– Analyzed molecular spectra on Cray supercomputers
– Determined, modeled, simulated molecular motions in 3D space
• Enjoy working with large data and large scale computing
• Worked on computational Internet marketing since 1999
3
New Book on Big Data Analytics
• In the book:
• Daqing Zhao, “Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing”
4
Big data, Big Opportunities
• Thanks to Moore’s law, on CPU, storage, network connections
• Too much data, too little knowledge
• Data, analytics changed every field many times over
• From science, government, to commerce
5
Big data characteristics
• Amount of data too big to handle using normal technology,
most data collected are dormant
• Raw data are stored, appended but not updated
• Formatted or free format data
• No aggregation for purpose of data reduction
• Individual customer level and individual event level data
• Sensor data
• Complete 360 degree view
• Process from raw data to get insights and build models
• Some business uses of big data: customer profile, event
prediction, automated decision machine, risk management,
wisdom of crowd
6
Things computers are good at
• Computers have perfect memory
– Every page view, click, transaction, every event,…
• Good at finding a needle in a haystack
– E.g., target abandoned shopping carts with promotions
– Clickers of this page in the last week
• Good at trade-offs among a large number of factors
– Female, 25-34, with child < 5, Asian, earning $30K, rent,
divorced, live in Calif., some college, Walmart,
Coupons.com, Monster.com, drive Camry, …
– Buyer of X or not?
7
Things computers on Internet are good at
• Platforms of crowdsourcing
– Google PageRank, Adwords, Picasa, Translate, …
• Data not previously looked at in aggregate
– Google PageRank/Translate, Amazon Find Product
• Data not previously created, or accumulated
– Social network data at LinkedIn, Facebook
– Amazon Customer Review, Yelp
– Twitter, Flickr
– Wikipedia, Youtube/Khan Academy, eHow, Udemy,
Yahoo/Answers
8
Computers make it possible
• Given data, find models and parameters
– Identify reproducible patterns in the data
– Provide simple picture of a large number of events
– Predict events in the future
• Simulations generate future events, given
assumptions and the current state
– Given a set of models, what future scenarios will look like
under a given set of conditions, the “what ifs”
• Robots, and agents
– Make decisions based on environment and goals, self
driving cars
9
Computers can’t do everything
• Data often have issues before being well analyzed
• Data often have no taxonomy and context
• In free-format data, relevant information needs to be
extracted
• Analyst has to define targets, construct predictors
• Analyst has to include critical predictive factors
• Analyst needs to add common sense
10
Every wrong data is wrong in its own way
• Some data are never collected or kept, deemed “too big” or “useless”, as in flood
control or purged log data
• Some data feeds to warehouse are incomplete
• Multiple definitions and inconsistent business rules, no
documentation
• Data incomplete due to business nature
– Sparse data
– Separate log in and log out data
– Credit card purchases versus cash
• Some flaws are easy to catch, such as missing or constant values
• Some flaws are hard to find, such as partially missing or incorrect values
11
Best Practices of Analysts
• Understand how the data are collected, what data
can and cannot be collected
• Balance cost of collecting data and optimize
modeling
• Use feedback loop to test hypotheses
• Do simulations to see if changes are reasonable
• Good ideas are not necessarily complicated ideas
• Focus on domain knowledge, not just data mining
tools
12
Best Practices of Analytics Managers
• Be well versed in analytics; understand analysts, their
behavior, their tests, their work and value
• Focus on domain knowledge, not just data mining
tools
• Focus on impact, not elegance in modeling
• Big data analytics differs from small-sample
statistics and needs to be learned on the job
• As activities become more technical, it gets harder to
recognize value and identify issues
– 2008: Financial crisis and credit derivatives
– Principal-agent problem
13
New Information Explosions
• Before ~1450, only the nobility had a few books
• After Gutenberg, information was limited by paper
and printing capacities
– People cried out loud there was too much information
– Then we had libraries, indexes, abstracts, book reviews,…
• Now information is limited by disks & cloud storage
– A person’s lifetime of spoken words can be stored on a thumb drive
– Soon everything can be stored
• Now: how do we make use of all the information?
– Search, crowdsourcing, Twitter, Wikipedia, YouTube,
big data and analysis algorithms, …
14
Paradigm Shift in Data Organization
• Mathematics is a way to efficiently use brain resources
– With pen and paper, only simple problems solvable
– Crude approximations, and samples for complicated ones
– Unreasonable effectiveness of mathematics – E. Wigner
• Now, algorithms are ways to efficiently use computing
resources
– Numerical solutions of complex equations
– Large scale simulations, full population databases
– Unreasonable effectiveness of data – P. Norvig
• Elegant, oversimplified models are less useful
15
Paradigm Shift in Knowledge
• Knowledge is power, by Francis Bacon
• Past: Drowning in information, starving for
knowledge, by John Naisbitt
• Now: Knowing how to extract knowledge is power
• Soon: Knowledge is abundant, and we seek
relevance
– Incl. personal finance, medical, political decisions
• Innovations are about connecting the dots
– Distances between the dots are getting smaller
– Leverage knowledge to make decisions, manage risks
16
Big Data problem
• Data size larger than what databases can handle
• Terabytes of data may take hours just to scan
• Solution requires a cloud of servers with local
storage
– Read, process and write intermediate results in
parallel
– Aggregate at the end
• Cloud computing can build models at scale
• Cloud often scales linearly with the number of servers
17
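The read, process and aggregate pattern on the slide above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: threads stand in for servers with local storage, and the names `count_events`, `aggregate` and `parallel_page_counts`, as well as the toy page-view events, are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_events(partition):
    """Process step: each worker scans only its own partition of raw events."""
    return Counter(event["page"] for event in partition)


def aggregate(partials):
    """Final step: merge the per-partition intermediate results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_page_counts(partitions, workers=4):
    # Threads are a stand-in for servers; each handles one partition at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_events, partitions))
    return aggregate(partials)


# Hypothetical raw event data, already split into three partitions.
partitions = [
    [{"page": "home"}, {"page": "cart"}],
    [{"page": "home"}, {"page": "home"}],
    [{"page": "cart"}, {"page": "checkout"}],
]
page_counts = parallel_page_counts(partitions)
```

Because each partition is processed independently and only the small intermediate counts are merged, adding workers shortens the scan roughly in proportion, which is the linear scaling the slide refers to.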
Modeling needs to scale
• Traditional predictive models take a long time to build
– Small data sets, samples expensive to collect
• Now data are cheap and models may degrade in weeks
– The dimension of the predictors is very large
– The number of categories is large
• Human interactive model building not scalable
• Reasons for target events are complex
• Without detailed analysis, it is unclear what drives the
event
• We need to rely on “out of sample testing” and “off the
shelf” modeling
18
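A minimal sketch of “out of sample testing” follows, assuming a simple random holdout and a deliberately plain stand-in model (the conversion rate per segment). The function names, the 30% holdout fraction and the synthetic rows are all assumptions made for illustration; the slides do not specify a method.

```python
import random


def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed, then cut into training and holdout sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_rate_model(train):
    """Stand-in 'off the shelf' model: conversion rate per segment."""
    totals, hits = {}, {}
    for segment, converted in train:
        totals[segment] = totals.get(segment, 0) + 1
        hits[segment] = hits.get(segment, 0) + converted
    overall = sum(hits.values()) / len(train)
    rates = {segment: hits[segment] / totals[segment] for segment in totals}
    return rates, overall


def mean_squared_error(model, rows):
    """Score the model on rows it may or may not have seen."""
    rates, overall = model
    errors = [(rates.get(segment, overall) - converted) ** 2
              for segment, converted in rows]
    return sum(errors) / len(errors)


# Synthetic (segment, converted) rows, for illustration only.
rows = [("A", 1 if i % 3 == 0 else 0) for i in range(60)]
rows += [("B", 1 if i % 10 == 0 else 0) for i in range(40)]

train, holdout = split_holdout(rows)
model = fit_rate_model(train)
in_sample_error = mean_squared_error(model, train)
out_of_sample_error = mean_squared_error(model, holdout)
```

The number to trust is `out_of_sample_error`: the model never saw the holdout rows, so a large gap between the two errors signals overfitting without anyone having to inspect the model by hand.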
Cloud computing
• We built a SAS cloud at University of Phoenix
– My invited SAS talk is available on the SAS web site
– Can process billions of impressions in minutes
• Hadoop clouds are used widely
– Open source software, Hive, Impala, Mahout
– Commodity servers and storage
• Clouds may have 100Ks of servers
– Find needle in a haystack in milliseconds
– Model computations that would usually take years
now finish in minutes
19
Big Data Centers
• Facebook and Google data centers use commodity servers
• Google uses 260 million watts, which can power 200K homes (NY Times)
• Data centers near the Columbia River at The Dalles, Oregon
20
Traditional BI pyramid
• Defines a sequence of efforts
• Most companies never get
beyond reporting and simple
analysis
• No full analysis or predictive
modeling is ever done
• Some data issues may not be
caught
• Limited insights hinder
optimal extraction of
knowledge
21
[Figure: baseline pyramid with a data maturity axis. Labels: Multidimensional Report, Standard Report, Segmentation, Predictive Modeling, Knowledge Discovery, Hadoop]
Analysis leads to better data quality
22
[Diagram labels: Raw data, Algorithms, Analysis, Reports, Business Rules, Algorithms, Predictive Models]
More analysis leads to better quality
23
[Diagram labels: Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms, Better data quality]
Data is most important
• In modeling, finding the key data is most important
– Identify the smoking gun
• Data transformations
– PageRank is a game-changing data transformation
– Wine.com case, wineRank
– Social graph is a key data transformation for credit
card fraud detection
24
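To make “PageRank as a data transformation” concrete, here is a minimal sketch of the textbook power-iteration form: raw link data goes in and a single score per page comes out, ready to be used as a predictor. It is not Google's production algorithm; the damping factor of 0.85, the iteration count and the four-page link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Turn a link graph {page: [pages it links to]} into one score per page."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page gets a small baseline share, then shares passed along links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
top_page = max(scores, key=scores.get)
```

The same transformation applies to any graph of relationships, which is why the slide groups it with a product ranking and a social graph for fraud detection.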
Modeling can go wrong
• Leakage in lead scoring model
– For example, use lead source to predict
conversion, when certain values of the field were
populated only for converters
• Display ads conversion model
– Construct data set by taking all converters and a
sample of non-converters
– Predict on page view profiles
– Problem: sample of non-converters included
customers who had no impressions of the ad
25
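The lead scoring leakage described above can often be caught with a simple scan before any model is built. The sketch below flags field values that appear only on converted records; the function name `leaky_values`, the `min_count` threshold and the sample leads are hypothetical, and a flagged value is only a suspect to investigate, not proof of leakage.

```python
from collections import defaultdict


def leaky_values(rows, target="converted", min_count=3):
    """Return (field, value) pairs seen at least min_count times, always on converters."""
    outcomes = defaultdict(list)
    for row in rows:
        for field, value in row.items():
            if field != target and value is not None:
                outcomes[(field, value)].append(row[target])
    return sorted(
        key for key, seen in outcomes.items()
        if len(seen) >= min_count and all(seen)
    )


# Hypothetical leads: "enrollment_form" is only ever written after a conversion.
leads = [
    {"lead_source": "enrollment_form", "state": "CA", "converted": 1},
    {"lead_source": "enrollment_form", "state": "TX", "converted": 1},
    {"lead_source": "enrollment_form", "state": "CA", "converted": 1},
    {"lead_source": "web", "state": "CA", "converted": 0},
    {"lead_source": "web", "state": "TX", "converted": 1},
    {"lead_source": "web", "state": "CA", "converted": 0},
    {"lead_source": "web", "state": "TX", "converted": 0},
    {"lead_source": None, "state": "CA", "converted": 0},
]
suspects = leaky_values(leads)
```

A value that perfectly predicts the target usually means the field was filled in after the outcome was known, so it would not be available at scoring time.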
Modeling lessons
• Yahoo DSL subscribers, one year contract
• If you try to model month to month retention, you
find high retention rate
– Because of contracts and penalties
• The correct way is to model retention at contract
expiry, using only the 1/12 of customers whose contracts expire in a given month
• For Yahoo email, if you look at quarter by quarter
retention, you find that those acquired early in the
first quarter have lower retention rate
– Because those customers have more time to churn
• A correct way is to use survival analysis
26
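As one concrete form of survival analysis, the Kaplan-Meier estimator handles customers who have been observed for different lengths of time, which is the problem with the quarter-by-quarter comparison above. This is a minimal sketch under assumed inputs (months observed per customer and a flag for whether churn was seen); the slides do not say which survival method was used.

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival probability)] at each time where churn was observed.

    durations: time each customer was observed (for example, in months)
    churned:   True if the customer churned at that time, False if still
               active when observation ended (censored)
    """
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        churn_count = 0
        leaving = 0
        # Group all customers whose observation ends at the same time t.
        while i < len(events) and events[i][0] == t:
            churn_count += events[i][1]
            leaving += 1
            i += 1
        if churn_count:
            survival *= 1.0 - churn_count / at_risk
            curve.append((t, survival))
        # Churned and censored customers both leave the at-risk pool.
        at_risk -= leaving
    return curve


# Hypothetical cohort: six customers, three of them censored.
durations = [1, 2, 2, 3, 3, 3]
churned = [True, True, False, True, False, False]
curve = kaplan_meier(durations, churned)
```

Customers acquired late in the quarter simply contribute shorter, censored observations, so early and late cohorts are compared at the same tenure rather than at the same calendar date.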
Conclusions
• For optimal modeling, domain knowledge is most important
• May require Big Data solutions to scale
• Identify key data and transformations
• Data are not reliable until they have been seriously analyzed
• Conduct deep analysis before developing BI reports
• Testing and optimizing in the real market is crucial
• Focus on customer experience not model complexity or
predictive accuracy
• “The best way to get good ideas is to have a lot of them”
– Linus Pauling
• Use a lot of common sense
27

Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...
Lowrate Call Girls In Sector 18 Noida ❤️8860477959 Escorts 100% Genuine Servi...lizamodels9
 
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...lizamodels9
 
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Mehrauli Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Big Data Analysis and Business Intelligence

  • 1. Daqing Zhao, PhD Founder and Principal, Eureka Analytics Business Intelligence Innovation Summit, Chicago 5/23/2013 ©Daqing Zhao All rights reserved Frontiers of Big Data Business Analytics, Patterns and Cases in Online Marketing
  • 2. Agenda • Overview of big data analytics • Insights of big data and analysis • BI process on big data • Lessons of model building • Cases for behavioral profiles for predictive models – Yahoo network segmentation – Tribal Fusion display ads impression optimization – University of Phoenix student retention and lead optimization • Case of Ask.com SEM algorithms
  • 3. Daqing Zhao, PhD • Big Data scientist with deep domain knowledge • Academic training – Analyzed molecular spectra on Cray supercomputers – Determined, modeled, simulated molecular motions in 3D space • Enjoys working with large data and large-scale computing • Has worked on computational Internet marketing since 1999
  • 4. New Book on Big Data Analytics • In the book: • Daqing Zhao: • Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing
  • 5. Big data, Big Opportunities • Thanks to Moore’s law in CPU, storage, and network connections • Too much data, too little knowledge • Data and analytics have changed every field many times over • From science and government to commerce
  • 6. Big data characteristics • Amount of data too big to handle using normal technology; most data collected are dormant • Raw data are stored and appended but not updated • Formatted or free-format data • No aggregation for the purpose of data reduction • Individual customer level and individual event level data • Sensor data • Complete 360-degree view • Process from raw data to get insights and build models • Some business uses of big data: customer profiles, event prediction, automated decision machines, risk management, wisdom of crowds
  • 7. Things computers are good at • Computers have perfect memory – Every page view, click, transaction, every event, … • Good at finding a needle in a haystack – E.g., target abandoned shopping carts with promotions – Clickers of this page in the last week • Good at trade-offs among a large number of factors – Female, 25-34, with child < 5, Asian, earning $30K, rents, divorced, lives in Calif., some college, Walmart, Coupons.com, Monster.com, drives a Camry, … – Buyer of X or not?
  • 8. Things computers on the Internet are good at • Platforms for crowdsourcing – Google PageRank, AdWords, Picasa, Translate, … • Data not previously looked at in aggregate – Google PageRank/Translate, Amazon Find Product • Data not previously created or accumulated – Social network data at LinkedIn, Facebook – Amazon Customer Review, Yelp – Twitter, Flickr – Wikipedia, YouTube/Khan Academy, eHow, Udemy, Yahoo/Answers
  • 9. Computers make it possible • Given data, find models and parameters – Identify reproducible patterns in the data – Provide a simple picture of a large number of events – Predict events in the future • Simulations generate future events, given assumptions and the current state – Given a set of models, what future scenarios will look like under a given set of conditions, the “what ifs” • Robots and agents – Make decisions based on environment and goals, e.g., self-driving cars
  • 10. Computers can’t do everything • Data often have issues until they are well analyzed • Data often have no taxonomy or context • With free-format data, relevant information needs to be extracted • Analyst has to define targets and construct predictors • Analyst has to include critical predictive factors • Analyst needs to add common sense
  • 11. Every wrong data set is wrong in its own way • Some data are not collected, being deemed “too big” or “useless”, as in flood control or purged log data • Some data feeds to the warehouse are incomplete • Multiple definitions and inconsistent business rules, no documentation • Data incomplete due to the nature of the business – Sparse data – Separate logged-in and logged-out data – Credit card purchases versus cash • Some flaws are easy to catch, such as missing or constant values • Some flaws are hard to find, such as partially missing or incorrect values
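The "easy to catch" flaws named on slide 11 (missing and constant values) can be screened for mechanically. The following is a minimal sketch, not the author's tooling; it assumes rows arrive as a list of dicts, and the function name `profile_columns` is hypothetical.

```python
def profile_columns(rows, columns):
    """Report the missing rate of each column and whether it is constant.

    Assumption for illustration: None and the empty string both count as missing.
    """
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        missing_rate = 1 - len(present) / len(values) if values else 0.0
        report[col] = {
            "missing_rate": missing_rate,
            # A column with at most one distinct non-missing value carries no signal.
            "constant": len(set(present)) <= 1,
        }
    return report
```

A check like this only finds the easy flaws; partially missing or incorrect values still need analysis against how the data were collected.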
  • 12. Best practices of analysts • Understand how the data are collected, and what data can and cannot be collected • Balance the cost of collecting data against the benefit to modeling • Use a feedback loop to test hypotheses • Do simulations to see if changes are reasonable • Good ideas are not necessarily complicated ideas • Focus on domain knowledge, not just data mining tools
  • 13. Best Practices of Analytics Managers • Be well versed in analytics; understand analysts, their behavior, their tests, their work and its value • Focus on domain knowledge, not just data mining tools • Focus on impact, not elegance in modeling • Big data analytics differs from small-sample statistics and needs to be learned on the job • As activities become more technical, it is harder to recognize value and identify issues – 2008: Financial crisis and credit derivatives – Principal-agent problem
  • 14. New Information Explosions • Before ~1450, only the nobility had a few books • After Gutenberg, information was limited by paper and printing capacities – People cried out that there was too much information – Then we had libraries, indexes, abstracts, book reviews, … • Now information is limited by disks & cloud storage – A person’s lifetime spoken words can be stored on a thumb drive – Soon everything can be stored • Now: how do we make use of all the information? – Search, crowdsourcing, Twitter, Wikipedia, YouTube, big data and analysis algorithms, …
  • 15. Paradigm Shift in Data Organization • Mathematics is a way to use brain resources efficiently – With pen and paper, only simple problems are solvable – Crude approximations and samples for complicated ones – Unreasonable effectiveness of mathematics – E. Wigner • Now, algorithms are ways to use computing resources efficiently – Numerical solutions of complex equations – Large-scale simulations, full population databases – Unreasonable effectiveness of data – P. Norvig • Elegant, oversimplified models are less useful
  • 16. Paradigm Shift in Knowledge • “Knowledge is power” (Francis Bacon) • Past: “Drowning in information, starving for knowledge” (John Naisbitt) • Now: knowing how to extract knowledge is power • Soon: there will be an abundance of knowledge, and the search will be for relevance – Incl. personal finance, medical, political decisions • Innovations are about connecting the dots – Distances between the dots are getting smaller – Leverage knowledge to make decisions, manage risks
  • 17. Big Data problem • Data size larger than what databases can handle • Scanning terabytes of data may take hours • Solution requires a cloud of servers with local storage – Read, process and write intermediate results in parallel – Aggregate at the end • Cloud computing can build models at scale • Cloud capacity often scales linearly with the number of servers
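The "process in parallel, aggregate at the end" pattern on slide 17 can be sketched on a single machine. This is an illustration under stated assumptions, not a description of any particular cluster: threads stand in for servers, each partition stands in for a server's local storage, and the names `count_events` and `parallel_page_counts` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_events(partition):
    """Per-partition step: each worker scans only its own slice of the events."""
    return Counter(event["page"] for event in partition)


def parallel_page_counts(partitions, workers=4):
    """Count page views per partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_events, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the final aggregation is cheap compared with the scans
    return total
```

Because the per-partition work is independent, adding workers shortens the scan, which is the intuition behind the roughly linear scaling the slide mentions.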
  • 18. Modeling needs to scale • Traditional predictive models take a long time to build – Small data sets, samples expensive to collect • Now data are cheap and models may degrade in weeks – The dimension of the predictors is very large – The number of categories is large • Human-interactive model building is not scalable • Reasons for target events are complex • Without detailed analysis, it is unclear what drives the event • We need to rely on “out of sample testing” and “off the shelf” modeling
  • 19. Cloud computing • We built a SAS cloud at University of Phoenix – An invited SAS talk of mine is available on the SAS web site – Can process billions of impressions in minutes • Hadoop clouds are used widely – Open source software, Hive, Impala, Mahout – Commodity servers and storage • Clouds may have hundreds of thousands of servers – Find a needle in a haystack in milliseconds – Model computations that would usually take years now finish in minutes
  • 20. Big Data Centers • Facebook and Google data centers use commodity servers • Google uses 260 million watts, enough to power 200K homes (NY Times) • Data centers near the Columbia River at The Dalles, Oregon
  • 21. Traditional BI pyramid • Defines a sequence of efforts • Most companies never get beyond reporting and simple analysis • Full analysis and predictive modeling are never done • Some data issues may not be caught • Limited insights hinder optimal extraction of knowledge • [Figure: Baseline Pyramid of data maturity, with levels labeled Multidimensional Report, Standard Report, Segmentation, Predictive Modeling, Knowledge Discovery]
  • 22. Hadoop Analysis leads to better data quality • [Diagram labels: Raw data, Algorithms, Analysis, Reports, Business Rules, Algorithms, Predictive Models]
  • 23. More analysis leads to better quality • [Diagram labels: Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms, Better data quality]
  • 24. Data are most important • In modeling, finding the key data is most important – Identify the smoking gun • Data transformations – PageRank is a game-changing data transformation – Wine.com case, wineRank – Social graph is a key data transformation for credit card fraud detection
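To make "PageRank is a data transformation" on slide 24 concrete: it turns a raw link graph into one numeric feature per node. The sketch below is a textbook power iteration, not Google's production algorithm and not the wineRank method, whose details the slides do not give; the damping factor 0.85 and the iteration count are conventional assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Transform a link graph into a score per node.

    `links` maps each node to the list of nodes it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The resulting score can then be used as a predictor like any other column, which is what makes the transformation valuable for modeling.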
  • 25. Modeling can go wrong • Leakage in a lead scoring model – For example, using lead source to predict conversion, when certain values of the field were populated only for converters • Display ads conversion model – Construct the data set by taking all converters and a sample of non-converters – Predict on page view profiles – Problem: the sample of non-converters included customers who had no impressions of the ad
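The lead-source leakage on slide 25 can be screened for before modeling by looking for field values that occur only on converters. This is a minimal sketch under assumptions: rows are dicts, the target is truthy for converters, and the function name `leaky_values` and the `min_count` threshold are hypothetical.

```python
from collections import defaultdict


def leaky_values(rows, target, field, min_count=2):
    """Return values of `field` seen at least `min_count` times and only on converters."""
    counts = defaultdict(lambda: [0, 0])  # value -> [non-converters, converters]
    for row in rows:
        counts[row.get(field)][1 if row[target] else 0] += 1
    return sorted(
        (value for value, (neg, pos) in counts.items() if neg == 0 and pos >= min_count),
        key=str,  # values may mix None and strings
    )
```

A flagged value is only a suspect: it may be a genuinely strong predictor, so the next step is to check when and how the field gets populated.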
  • 26. Modeling lessons • Yahoo DSL subscribers, one-year contract • If you try to model month-to-month retention, you find a high retention rate – Because of contracts and penalties • The correct way is to model retention at contract expiry, on only 1/12 of the customers • For Yahoo email, if you look at quarter-by-quarter retention, you find that those acquired early in the first quarter have a lower retention rate – Because those customers have had more time to churn • A correct way is to use survival analysis
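Slide 26 recommends survival analysis because customers are observed for different lengths of time. One standard tool is the Kaplan-Meier estimator, sketched here in plain Python; the slides do not say which method was actually used, so treat this as an illustration. `durations` are observed tenures and `churned` marks whether the tenure ended in churn (False means the customer was still active when last observed).

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at each time t where at least one churn occurred."""
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        churns = 0
        leaving = 0
        # Group all customers whose observed tenure equals t.
        while i < len(events) and events[i][0] == t:
            churns += 1 if events[i][1] else 0
            leaving += 1
            i += 1
        if churns:
            survival *= 1 - churns / at_risk
            curve.append((t, survival))
        # Churned and censored customers both leave the risk set after t.
        at_risk -= leaving
    return curve
```

Because still-active customers stay in the risk set only for as long as they were observed, cohorts acquired at different times become comparable, which addresses the quarter-by-quarter bias described on the slide.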
  • 27. Conclusions • For optimal modeling, domain knowledge is most important • May require Big Data solutions to scale • Identify key data and transformations • Data are not reliable until they have been seriously analyzed • Conduct deep analysis before developing BI reports • Testing and optimizing in the real market is crucial • Focus on customer experience, not model complexity or predictive accuracy • “The best way to get good ideas is to have a lot of them” – Linus Pauling • Use a lot of common sense