Big Data Analysis and Business Intelligence
Slides from talks given at the Business Analytics and Business Intelligence Summits, February 2013 in San Diego and May 2013 in Chicago.

Big Data Analysis and Business Intelligence: Presentation Transcript

  • Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing. Daqing Zhao, PhD, Founder and Principal, Eureka Analytics. Business Intelligence Innovation Summit, Chicago, 5/23/2013. © Daqing Zhao, all rights reserved.
  • Agenda • Overview of big data analytics • Insights of big data and analysis • BI process on big data • Lessons of model building • Cases of behavioral profiles for predictive models – Yahoo network segmentation – Tribal Fusion display ads impression optimization – University of Phoenix student retention and lead optimization • Case of Ask.com SEM algorithms
  • Daqing Zhao, PhD • Big Data scientist with deep domain knowledge • Academic training – Analyzed molecular spectra on Cray supercomputers – Determined, modeled, and simulated molecular motions in 3D space • Enjoys working with large data and large-scale computing • Has worked on computational Internet marketing since 1999
  • New Book on Big Data Analytics • In the book, by Daqing Zhao: Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing
  • Big Data, Big Opportunities • Thanks to Moore’s law in CPUs, storage, and network connections • Too much data, too little knowledge • Data and analytics have changed every field many times over • From science and government to commerce
  • Big data characteristics • Amount of data is too big to handle using normal technology; most data collected are dormant • Raw data are stored and appended but not updated • Formatted or free-format data • No aggregation for the purpose of data reduction • Individual customer-level and individual event-level data • Sensor data • Complete 360-degree view • Process from raw data to get insights and build models • Some business uses of big data: customer profiles, event prediction, automated decision machines, risk management, wisdom of the crowd
  • Things computers are good at • Computers have perfect memory – Every page view, click, transaction, every event, … • Good at finding a needle in a haystack – E.g., target abandoned shopping carts with promotions – Clickers of this page in the last week • Good at trade-offs among a large number of factors – Female, 25-34, with child < 5, Asian, earning $30K, rents, divorced, lives in Calif., some college, Walmart, Coupons.com, Monster.com, drives a Camry, … – Buyer of X or not?
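The "needle in a haystack" examples above amount to simple set operations over an individual-event log. A minimal sketch, assuming a hypothetical event schema (the field names and data are invented for illustration):

```python
# Hypothetical event log at the individual-event level: every page view,
# click, and transaction is kept, so "finding a needle" is just a filter.
from datetime import datetime, timedelta

events = [
    {"user": "u1", "type": "cart_add", "ts": datetime(2013, 5, 20, 10, 0)},
    {"user": "u1", "type": "purchase", "ts": datetime(2013, 5, 20, 10, 5)},
    {"user": "u2", "type": "cart_add", "ts": datetime(2013, 5, 21, 9, 0)},
]

def abandoned_carts(events, as_of):
    """Users who added to a cart in the last week but never purchased."""
    week_ago = as_of - timedelta(days=7)
    carted = {e["user"] for e in events
              if e["type"] == "cart_add" and e["ts"] >= week_ago}
    bought = {e["user"] for e in events if e["type"] == "purchase"}
    return carted - bought

print(abandoned_carts(events, datetime(2013, 5, 23)))  # {'u2'}
```

The same filter-over-raw-events shape covers "clickers of this page in the last week": change the event type and keep the time window.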
  • Things computers on the Internet are good at • Platforms of crowd sourcing – Google PageRank, AdWords, Picasa, Translate, … • Data not previously looked at in aggregate – Google PageRank/Translate, Amazon Find Product • Data not previously created or accumulated – Social network data at LinkedIn, Facebook – Amazon Customer Review, Yelp – Twitter, Flickr – Wikipedia, YouTube/Khan Academy, eHow, Udemy, Yahoo Answers
  • Computers make it possible • Given data, find models and parameters – Identify reproducible patterns in the data – Provide a simple picture of a large number of events – Predict events in the future • Simulations generate future events, given assumptions and the current state – Given a set of models, how future scenarios will look under a given set of conditions, “what ifs” • Robots and agents – Make decisions based on environment and goals, e.g., self-driving cars
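The "given data, find models and parameters" point can be illustrated with the simplest possible case: recovering the slope and intercept of a linear relationship by least squares. The data below are invented for illustration:

```python
# Fit parameters (a, b) of y ≈ a*x + b from observed pairs, then use the
# fitted model to predict a future point. Pure-Python least squares.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x, with noise
a, b = fit_line(xs, ys)
print(a, b)                 # fitted slope is close to 2, intercept near 0
```

Once the parameters are fitted, prediction of a future event is just `a * x + b` for a new `x`, which is the "predict events in the future" step in miniature.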
  • Computers can’t do everything • Data often have issues before being well analyzed • Data often have no taxonomy or context • In free-format data, relevant information needs to be extracted • The analyst has to define targets and construct predictors • The analyst has to include critical predictive factors • The analyst needs to add common sense
  • Every wrong data set is wrong in its own way • Some data are not collected, “too big” or “useless”, as in flood control, purged log data • Some data feeds to the warehouse are incomplete • Multiple definitions and inconsistent business rules, no documentation • Data incomplete due to the nature of the business – Sparse data – Separate logged-in and logged-out data – Credit card purchases versus cash • Some flaws are easy to catch, such as missing or constant values • Some flaws are hard to find, such as partially missing or incorrect values
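The "easy to catch" flaws named above, fields that are entirely missing or constant, can be screened for mechanically before any modeling. A minimal sketch (the column names and rows are hypothetical):

```python
# Screen a table for the two easy-to-catch flaws: all-missing columns and
# constant columns. Partially wrong values need deeper, domain-aware checks.

def flag_suspect_columns(rows):
    """rows: list of dicts with identical keys. Returns {column: reason}."""
    flags = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        if not non_null:
            flags[col] = "all missing"
        elif len(set(non_null)) == 1 and len(non_null) == len(values):
            flags[col] = "constant"
    return flags

rows = [
    {"id": 1, "source": None, "region": "US"},
    {"id": 2, "source": None, "region": "US"},
    {"id": 3, "source": None, "region": "US"},
]
print(flag_suspect_columns(rows))  # source: all missing, region: constant
```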
  • Best practices of the analyst • Understand how the data are collected, and what data can and cannot be collected • Balance the cost of collecting data and optimize modeling • Use a feedback loop to test hypotheses • Do simulations to see if changes are reasonable • Good ideas are not necessarily complicated ideas • Focus on domain knowledge, not just data mining tools
  • Best Practices of Analytics Managers • Well versed in analytics; understand analysts, their behavior, the tests, their work and value • Focus on domain knowledge, not just data mining tools • Focus on impact, not elegance in modeling • Big data analytics is different from small-sample statistics, and one needs to learn on the job • As activities become more technical, it is hard to recognize value and identify issues – 2008: financial crisis and credit derivatives – Principal-agent problem
  • New Information Explosions • Before ~1450, only the nobility had a few books • After Gutenberg, information was limited by paper and printing capacity – People cried out loud that there was too much information – Then we had libraries, indexes, abstracts, book reviews, … • Now information is limited by disk & cloud storage – A person’s lifetime of spoken words can be stored on a thumb drive – Soon everything can be stored • Now: how do we make use of all the information? – Search, crowd sourcing, Twitter, Wikipedia, YouTube, big data and analysis algorithms, …
  • Paradigm Shift in Data Organization • Mathematics is a way to efficiently use brain resources – With pen and paper, only simple problems are solvable – Crude approximations and samples for complicated ones – Unreasonable effectiveness of mathematics – E. Wigner • Now, algorithms are ways to efficiently use computing resources – Numerical solutions of complex equations – Large-scale simulations, full-population databases – Unreasonable effectiveness of data – P. Norvig • Elegant, oversimplified models are less useful
  • Paradigm Shift in Knowledge • Knowledge is power – Francis Bacon • Past: drowning in information, starving for knowledge – John Naisbitt • Now: knowing how to extract knowledge is power • Soon: there is an abundance of knowledge; the search is for relevance – Incl. personal finance, medical, political decisions • Innovations are about connecting the dots – Distances between the dots are getting smaller – Leverage knowledge to make decisions, manage risks
  • Big Data problem • Data size is larger than what databases can handle • Terabytes of data may take hours just to scan • The solution requires a cloud of servers with local storage – Read, process, and write intermediate results in parallel – Aggregate at the end • Cloud computing can build models at scale • A cloud often scales linearly with the number of servers
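The read-in-parallel, aggregate-at-the-end pattern described above is the essence of MapReduce. A sequential Python sketch of the same idea, with shards standing in for data stored locally on each server (the log and page names are invented):

```python
# Each "server" scans only its own shard and emits partial counts (map);
# the partial counts are merged at the end (reduce). On a real cluster the
# per-shard scans run in parallel; here they run in a loop for clarity.
from collections import Counter

def scan_shard(shard):
    """Map step: aggregate one server's local slice of the log."""
    return Counter(event["page"] for event in shard)

log = [{"page": p} for p in ["home", "cart", "home", "buy", "home", "cart"]]
shards = [log[0:2], log[2:4], log[4:6]]     # data stored locally per node
partials = [scan_shard(s) for s in shards]  # parallel on a real cluster
total = sum(partials, Counter())            # reduce: merge at the end
print(total["home"])  # 3
```

Because each shard is scanned independently, adding servers shortens the scan roughly linearly, which is the "scales linearly with the number of servers" point above.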
  • Modeling needs to scale • Traditional predictive models take a long time to build – Small data sets; samples expensive to collect • Now data are cheap and models may degrade in weeks – The dimension of predictors is very large – The number of categories is large • Human interactive model building is not scalable • Reasons for target events are complex • Without detailed analysis, it is unclear what drives the event • We need to rely on “out of sample testing” and “off the shelf” modeling
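"Out of sample testing" means holding out part of the data, fitting on the rest, and judging the model only on what it never saw. A minimal sketch; the data and the trivial majority-class "model" are invented for illustration:

```python
# Hold-out evaluation: the model is fitted on 'train' only and scored on
# 'test' only, so the accuracy estimate reflects unseen data.
import random

random.seed(2013)
data = [(i, 1 if i % 3 == 0 else 0) for i in range(100)]  # (id, label)
random.shuffle(data)
train, test = data[:70], data[70:]   # never touch 'test' while fitting

train_labels = [lbl for _, lbl in train]
majority = max(set(train_labels), key=train_labels.count)  # trivial model
accuracy = sum(lbl == majority for _, lbl in test) / len(test)
print(round(accuracy, 2))
```

The same split-fit-score loop works unchanged for any off-the-shelf model, which is what makes it usable when human interactive model building does not scale.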
  • Cloud computing • We built a SAS cloud at University of Phoenix – I have an invited SAS talk available on the SAS web site – Can process billions of impressions in minutes • Hadoop clouds are used widely – Open source software: Hive, Impala, Mahout – Commodity servers and storage • Clouds may have 100Ks of servers – Find a needle in a haystack in milliseconds – Model computations that would have taken years now finish in minutes
  • Big Data Centers • Facebook and Google data centers use commodity servers • Google uses 260 million watts, enough to power 200K homes – NY Times • Data centers near the Columbia River at The Dalles, Oregon
  • Traditional BI pyramid • Defines a sequence of efforts • Most companies never get beyond reporting and simple analysis • No full analysis and predictive modeling ever done • Some data issues may not be caught • Limited insights hinder optimal extraction of knowledge [Pyramid figure, data maturity from the baseline up: Standard Report, Multidimensional Report, Segmentation, Predictive Modeling, Knowledge Discovery]
  • Hadoop analysis leads to better data quality [Flow figure: raw data is processed by algorithms into analysis reports, business rules, and predictive models]
  • More analysis leads to better quality [Cycle figure: Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms, feeding back into better data quality]
  • Data most important • In modeling, finding the key data is most important – Identify the smoking gun • Data transformations – PageRank is a game-changing data transformation – Wine.com case, wineRank – The social graph is a key data transformation for credit card fraud detection
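PageRank, cited above as a game-changing transformation, turns a raw link graph into a per-page score. A toy power-iteration sketch; the graph and damping factor are illustrative, not from the slides:

```python
# Transform a raw link graph {page: [pages it links to]} into per-page
# scores by power iteration: repeatedly redistribute rank along outlinks,
# mixed with a uniform "teleport" term (damping 0.85).

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # c collects the most link weight
```

The output scores are exactly the kind of derived feature the slide means: a column computed from raw data that a model or ranking function can consume directly.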
  • Modeling can go wrong • Leakage in a lead scoring model – For example, using lead source to predict conversion, when certain values of the field were populated only for converters • Display ads conversion model – Construct the data set by taking all converters and a sample of non-converters – Predict on page view profiles – Problem: the sample of non-converters included customers who had no impressions of the ad
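The lead-source leakage described above is often visible in a simple audit: compare the field's fill rate between converters and non-converters. A near-perfect split is a red flag for leakage, not a great predictor. A sketch with invented data:

```python
# Audit a candidate predictor for leakage: if the field is populated only
# for one outcome class, it encodes the outcome rather than predicting it.

def fill_rate_by_outcome(rows, field, target):
    """Fraction of non-null values of `field` within each outcome class."""
    rates = {}
    for outcome in (0, 1):
        group = [r for r in rows if r[target] == outcome]
        rates[outcome] = sum(r[field] is not None for r in group) / len(group)
    return rates

leads = (
    [{"lead_source": "web", "converted": 1} for _ in range(50)] +
    [{"lead_source": None,  "converted": 0} for _ in range(50)]
)
print(fill_rate_by_outcome(leads, "lead_source", "converted"))
# {0: 0.0, 1: 1.0} -> the field leaks the outcome
```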
  • Modeling lessons • Yahoo DSL subscribers, one-year contracts • If you model month-to-month retention, you find a high retention rate – Because of contracts and penalties • The correct way is to model retention at contract expiry, only on 1/12 of the customers • For Yahoo email, if you look at quarter-by-quarter retention, you find that those acquired early in the first quarter have a lower retention rate – Because those customers have had more time to churn • A correct way is to use survival analysis
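The contract-expiry lesson can be shown with arithmetic. All numbers below are invented; the point is that the naive month-to-month rate hides the churn happening at expiry:

```python
# With one-year contracts, only ~1/12 of subscribers can churn in a given
# month, so month-to-month retention looks deceptively high. Measuring
# retention only among those at contract expiry gives the real rate.

customers = 1200                 # assume uniform over 12 contract months
at_expiry = customers // 12      # only these 100 can actually churn
churned = 30                     # hypothetical churners this month

naive_retention = 1 - churned / customers   # 0.975: looks great
true_retention = 1 - churned / at_expiry    # 0.70: the real expiry rate
print(naive_retention, true_retention)
```

The Yahoo email example is the mirror image of the same error: cohorts acquired earlier have had more time at risk, which is why the slide recommends survival analysis, a framework built around time-at-risk.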
  • Conclusions • For optimal modeling, domain knowledge is most important • May require Big Data solutions to scale • Identify key data and transformations • Data are not reliable until seriously analyzed • Conduct deep analysis before developing BI reports • Testing and optimizing in the real market is crucial • Focus on customer experience, not model complexity or predictive accuracy • “The best way to have a good idea is to have a lot of ideas” – Linus Pauling • Use a lot of common sense