Big Data Analysis and Business Intelligence


Slides from talks given at the Business Analytics and Business Intelligence Summits, February 2013 in San Diego and May 2013 in Chicago

Published in: Business, Technology, Education

  1. Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing
     Daqing Zhao, PhD, Founder and Principal, Eureka Analytics
     Business Intelligence Innovation Summit, Chicago, 5/23/2013
     ©Daqing Zhao. All rights reserved.
  2. Agenda
     • Overview of big data analytics
     • Insights of big data and analysis
     • BI process on big data
     • Lessons of model building
     • Cases for behavioral profiles for predictive models
       – Yahoo network segmentation
       – Tribal Fusion display ads impression optimization
       – University of Phoenix student retention and lead optimization
     • Case of SEM algorithms
  3. Daqing Zhao, PhD
     • Big Data scientist with deep domain knowledge
     • Academic training
       – Analyzed molecular spectra on Cray supercomputers
       – Determined, modeled, simulated molecular motions in 3D space
     • Enjoys working with large data and large-scale computing
     • Worked on computational Internet marketing since 1999
  4. New Book on Big Data Analytics
     • In the book:
     • Daqing Zhao: Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing
  5. Big Data, Big Opportunities
     • Thanks to Moore’s law in CPU, storage, and network connections
     • Too much data, too little knowledge
     • Data and analytics have changed every field many times over
     • From science and government to commerce
  6. Big data characteristics
     • Amount of data too big to handle with conventional technology; most collected data are dormant
     • Raw data are stored and appended, but not updated
     • Formatted or free-format data
     • No aggregation for the purpose of data reduction
     • Individual customer-level and individual event-level data
     • Sensor data
     • Complete 360-degree view
     • Process from raw data to get insights and build models
     • Some business uses of big data: customer profiles, event prediction, automated decision machines, risk management, wisdom of crowds
  7. Things computers are good at
     • Computers have perfect memory
       – Every page view, click, transaction, every event, …
     • Good at finding a needle in a haystack
       – E.g., target abandoned shopping carts with promotions
       – Clickers of this page in the last week
     • Good at trade-offs among a large number of factors
       – Female, 25-34, with child < 5, Asian, earning $30K, rents, divorced, lives in Calif., some college, Walmart, drives Camry, …
       – Buyer of X or not?
  8. Things computers on the Internet are good at
     • Platforms for crowdsourcing
       – Google PageRank, AdWords, Picasa, Translate, …
     • Data not previously looked at in aggregate
       – Google PageRank/Translate, Amazon Find Product
     • Data not previously created or accumulated
       – Social network data at LinkedIn, Facebook
       – Amazon customer reviews, Yelp
       – Twitter, Flickr
       – Wikipedia, YouTube/Khan Academy, eHow, Udemy, Yahoo Answers
  9. Computers make it possible
     • Given data, find models and parameters
       – Identify reproducible patterns in the data
       – Provide a simple picture of a large number of events
       – Predict events in the future
     • Simulations generate future events, given assumptions and the current state
       – Given a set of models, what future scenarios will look like under a given set of conditions (“what ifs”)
     • Robots and agents
       – Make decisions based on environment and goals, e.g., self-driving cars
  10. Computers can’t do everything
     • Data often have issues until they are well analyzed
     • Data often have no taxonomy and context
     • In free-format data, relevant information needs to be extracted
     • Analyst has to define targets and construct predictors
     • Analyst has to include critical predictive factors
     • Analyst needs to add common sense
  11. Every wrong data set is wrong in its own way
     • Some data are not collected because they are deemed “too big” or “useless”, as in flood control and purged log data
     • Some data feeds to the warehouse are incomplete
     • Multiple definitions and inconsistent business rules, no documentation
     • Data incomplete due to the nature of the business
       – Sparse data
       – Separate logged-in and logged-out data
       – Credit card purchases versus cash
     • Some flaws are easy to catch, such as missing or constant values
     • Some flaws are hard to find, such as partially missing or incorrect values
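The "easy to catch" flaws on the slide above (missing or constant values) can be screened for mechanically. The following is a minimal sketch, not the author's tooling: the function `profile_columns`, the column names, and the sample rows are all hypothetical, and empty strings are assumed to count as missing.

```python
from typing import Dict, List, Optional


def profile_columns(rows: List[Dict[str, Optional[str]]]) -> Dict[str, Dict[str, object]]:
    """Report the share of missing values and whether each column is constant."""
    columns = sorted({key for row in rows for key in row})
    report: Dict[str, Dict[str, object]] = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None and "" both mean "missing".
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "missing_share": 1 - len(present) / len(values) if values else 0.0,
            "is_constant": len(set(present)) <= 1,
        }
    return report


# Hypothetical event rows for illustration only.
rows = [
    {"user_id": "u1", "country": "US", "referrer": None},
    {"user_id": "u2", "country": "US", "referrer": "search"},
    {"user_id": "u3", "country": "US", "referrer": None},
    {"user_id": "u4", "country": "US", "referrer": ""},
]
report = profile_columns(rows)
for column, stats in report.items():
    print(column, stats)
```

A check like this only covers the easy cases. Partially missing or subtly incorrect fields, the hard cases on the slide, still need an analyst who knows how the data were collected.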
  12. Best practices of analysts
     • Understand how the data are collected, and what data can and cannot be collected
     • Balance the cost of collecting data against the gains from better modeling
     • Use a feedback loop to test hypotheses
     • Do simulations to see if changes are reasonable
     • Good ideas are not necessarily complicated ideas
     • Focus on domain knowledge, not just data mining tools
  13. Best Practices of Analytics Managers
     • Well versed in analytics; understand analysts, their behavior, the tests, their work, and its value
     • Focus on domain knowledge, not just data mining tools
     • Focus on impact, not elegance in modeling
     • Big data analytics differs from small-sample statistics and needs to be learned on the job
     • As activities become more technical, it is hard to recognize value and identify issues
       – 2008: Financial crisis and credit derivatives
       – Principal-agent problem
  14. New Information Explosions
     • Before ~1450, only the nobility had a few books
     • After Gutenberg, information was limited by paper and printing capacity
       – People cried out loud that there was too much information
       – Then we had libraries, indexes, abstracts, book reviews, …
     • Now information is limited by disks and cloud storage
       – A person’s lifetime of spoken words fits on a thumb drive
       – Soon everything can be stored
     • Now: how do we make use of all the information?
       – Search, crowdsourcing, Twitter, Wikipedia, YouTube, big data and analysis algorithms, …
  15. Paradigm Shift in Data Organization
     • Mathematics is a way to efficiently use brain resources
       – With pen and paper, only simple problems are solvable
       – Crude approximations and samples for complicated ones
       – “Unreasonable effectiveness of mathematics” (E. Wigner)
     • Now, algorithms are ways to efficiently use computing resources
       – Numerical solutions of complex equations
       – Large-scale simulations, full-population databases
       – “Unreasonable effectiveness of data” (P. Norvig)
     • Elegant, oversimplified models are less useful
  16. Paradigm Shift in Knowledge
     • “Knowledge is power” (Francis Bacon)
     • Past: “Drowning in information, starving for knowledge” (John Naisbitt)
     • Now: knowing how to extract knowledge is power
     • Soon: knowledge is abundant, and the search is for relevance
       – Incl. personal finance, medical, political decisions
     • Innovations are about connecting the dots
       – Distances between the dots are getting smaller
       – Leverage knowledge to make decisions, manage risks
  17. Big Data problem
     • Data size larger than what databases can handle
     • Terabytes of data may take hours just to scan
     • Solution requires a cloud of servers with local storage
       – Read, process, and write intermediate results in parallel
       – Aggregate at the end
     • Cloud computing can build models at scale
     • Cloud often scales linearly with the number of servers
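The "process partitions in parallel, aggregate at the end" pattern on the slide above can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for servers with local storage, the `user_id,page` line format is hypothetical, and `count_partition` and `count_all` are names invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def count_partition(lines: Iterable[str]) -> Counter:
    """Per-partition step: count page views per page within one partition."""
    counts: Counter = Counter()
    for line in lines:
        _user_id, page = line.split(",")
        counts[page] += 1
    return counts


def count_all(partitions: List[List[str]]) -> Counter:
    """Process partitions in parallel, then aggregate the partial counts."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for partial in pool.map(count_partition, partitions):
            total.update(partial)  # final aggregation step
    return total


# Hypothetical log partitions, one list per "server".
partitions = [
    ["u1,home", "u2,home", "u1,pricing"],
    ["u3,home", "u3,signup"],
    ["u2,pricing", "u4,home"],
]
totals = count_all(partitions)
print(totals)
```

Because each partition is counted independently, adding workers reduces wall-clock time roughly in proportion as long as the final aggregation stays small, which is the intuition behind the linear-scaling bullet.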
  18. Modeling needs to scale
     • Traditional predictive models take a long time to build
       – Small data sets, samples expensive to collect
     • Now data are cheap and models may degrade in weeks
       – The dimension of predictors is very large
       – The number of categories is large
     • Human-interactive model building is not scalable
     • Reasons for target events are complex
     • Without detailed analysis, it is unclear what drives the event
     • We need to rely on “out of sample testing” and “off the shelf” modeling
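As an illustration of the "out of sample testing" bullet above, the sketch below fits a deliberately simple model (conversion rate per segment) on one part of the data and scores it on records it has not seen. The segment names, sample data, and function names are hypothetical, and the Brier score is one reasonable choice of error measure rather than anything the slides prescribe.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (segment, converted as 0 or 1)


def split_holdout(records: List[Record], holdout_share: float = 0.3,
                  seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle reproducibly and set aside a holdout sample."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_share))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_rates(train: List[Record]) -> Dict[str, float]:
    """'Model': the observed conversion rate of each segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [conversions, count]
    for segment, converted in train:
        totals[segment][0] += converted
        totals[segment][1] += 1
    return {seg: conv / count for seg, (conv, count) in totals.items()}


def brier_score(rates: Dict[str, float], records: List[Record], default: float) -> float:
    """Mean squared error between predicted rate and actual outcome."""
    errors = [(rates.get(seg, default) - y) ** 2 for seg, y in records]
    return sum(errors) / len(errors)


# Hypothetical data: search leads convert more often than display leads.
records = ([("search", 1)] * 30 + [("search", 0)] * 70
           + [("display", 1)] * 5 + [("display", 0)] * 95)
train, holdout = split_holdout(records)
rates = fit_rates(train)
overall = sum(y for _, y in train) / len(train)
score = brier_score(rates, holdout, default=overall)
baseline = brier_score({}, holdout, default=overall)  # predicts the overall rate
print(f"holdout Brier score: {score:.4f} (baseline {baseline:.4f})")
```

The same harness works for any off-the-shelf model: only `fit_rates` changes, while the holdout comparison against a naive baseline stays the same.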
  19. Cloud computing
     • We built a SAS cloud at University of Phoenix
       – I have an invited SAS talk available on the SAS web site
       – Can process billions of impressions in minutes
     • Hadoop clouds are used widely
       – Open source software: Hive, Impala, Mahout
       – Commodity servers and storage
     • Clouds may have hundreds of thousands of servers
       – Find a needle in a haystack in milliseconds
       – Model computations that would usually take years now finish in minutes
  20. Big Data Centers
     • Facebook and Google data centers use commodity servers
     • Google uses 260 million watts, enough to power 200K homes (NY Times)
     • Data centers near the Columbia River at The Dalles, Oregon
  21. Traditional BI pyramid
     • Defines a sequence of efforts
     • Most companies never get beyond reporting and simple analysis
     • No full analysis or predictive modeling is ever done
     • Some data issues may not be caught
     • Limited insights hinder optimal extraction of knowledge
     [Figure: baseline data maturity pyramid; labels: Multidimensional Report, Standard Report, Segmentation, Predictive Modeling, Knowledge Discovery]
  22. Hadoop Analysis leads to better data quality
     [Figure: diagram; labels: Raw data, Algorithms, Analysis, Reports, Business Rules, Algorithms, Predictive Models]
  23. More analysis leads to better quality
     [Figure: diagram; labels: Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms, Better data quality]
  24. Data are most important
     • In modeling, finding the key data is most important
       – Identify the smoking gun
     • Data transformations
       – PageRank is a game-changing data transformation
       – Case: wineRank
       – Social graph is a key data transformation for credit card fraud detection
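To make the idea of PageRank as a data transformation concrete, the sketch below turns a raw link graph into a per-node score using textbook power iteration. It is a minimal illustration, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, and the four-page graph are assumptions chosen for the example.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Transform a link graph into one importance score per node."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: most links point at page "c".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The resulting score can then be used as a predictor in a downstream model, which is the sense in which the slide calls it a transformation of the data rather than a model in itself.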
  25. Modeling can go wrong
     • Leakage in a lead scoring model
       – For example, using lead source to predict conversion, when certain values of the field were populated only for converters
     • Display ads conversion model
       – Construct the data set by taking all converters and a sample of non-converters
       – Predict on page view profiles
       – Problem: the sample of non-converters included customers who had no impressions of the ad
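Both pitfalls on the slide above can be guarded against with simple checks before modeling. The sketch below is illustrative only: the field names (`lead_source`, `converted`, `impressions`), the sample records, and the `min_count` threshold are hypothetical. `leaky_values` flags field values that occur only among converters, and `exposed_only` restricts the non-converter pool to users who actually saw the ad.

```python
from collections import defaultdict
from typing import Dict, List


def leaky_values(leads: List[Dict], field: str, min_count: int = 3) -> List[str]:
    """Values of `field` seen at least `min_count` times and only on converters."""
    stats = defaultdict(lambda: [0, 0])  # value -> [converters, total]
    for lead in leads:
        value = lead.get(field)
        if value is None:
            continue
        stats[value][0] += int(lead["converted"])
        stats[value][1] += 1
    return sorted(value for value, (conv, total) in stats.items()
                  if total >= min_count and conv == total)


def exposed_only(users: List[Dict]) -> List[Dict]:
    """Keep only users with at least one ad impression before sampling."""
    return [user for user in users if user["impressions"] > 0]


# Hypothetical leads: "enrollment_form" is filled in only after conversion.
leads = (
    [{"lead_source": "web", "converted": c} for c in (1, 0, 0, 0)]
    + [{"lead_source": "partner", "converted": c} for c in (1, 1, 0)]
    + [{"lead_source": "enrollment_form", "converted": 1} for _ in range(3)]
)
flagged = leaky_values(leads, "lead_source")
print("suspicious values:", flagged)

# Hypothetical non-converters: u2 never saw the ad and should not be sampled.
users = [
    {"id": "u1", "impressions": 3, "converted": 0},
    {"id": "u2", "impressions": 0, "converted": 0},
]
pool = exposed_only(users)
print("eligible non-converters:", [user["id"] for user in pool])
```

A flagged value is a prompt to ask how and when the field is populated, not proof of leakage; a value can legitimately be a perfect predictor in a small sample.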
  26. Modeling lessons
     • Yahoo DSL subscribers, one-year contract
     • If you try to model month-to-month retention, you find a high retention rate
       – Because of contracts and penalties
     • The correct way is to model retention at contract expiry, only on the roughly 1/12 of customers whose contracts expire in a given month
     • For Yahoo email, if you look at quarter-by-quarter retention, you find that those acquired early in the first quarter have a lower retention rate
       – Because those customers have more time to churn
     • A correct way is to use survival analysis
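Survival analysis handles the "more time to churn" problem by counting each customer only for as long as they were actually observed. The sketch below uses the Kaplan-Meier estimator, one standard survival method; the slides do not say which method was used, and the durations and churn flags here are made-up illustration data.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[int], churned: List[bool]) -> List[Tuple[int, float]]:
    """Return (time, survival probability) at each time with observed churn.

    A customer with churned=False is censored: still active when last
    observed, so they count as at risk up to that time but not as a churn.
    """
    pairs = sorted(zip(durations, churned))
    at_risk = len(pairs)
    survival = 1.0
    curve: List[Tuple[int, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        while i < len(pairs) and pairs[i][0] == t:
            events += int(pairs[i][1])
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical tenures in months; False means still subscribed when observed.
durations = [1, 2, 2, 3, 4, 4]
churned = [True, True, False, True, False, False]
curve = kaplan_meier(durations, churned)
for month, probability in curve:
    print(month, round(probability, 3))
```

Customers acquired late simply contribute shorter, censored observation windows, so cohorts acquired at different times become comparable on the same retention curve.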
  27. Conclusions
     • For optimal modeling, domain knowledge is most important
     • May require Big Data solutions to scale
     • Identify key data and transformations
     • Data are not reliable until they have been seriously analyzed
     • Conduct deep analysis before developing BI reports
     • Testing and optimizing in the real market is crucial
     • Focus on customer experience, not model complexity or predictive accuracy
     • “The best way to get good ideas is to have a lot of them” (Linus Pauling)
     • Use a lot of common sense