- The document discusses big data analytics and business intelligence processes for analyzing large datasets.
- It provides an overview of big data characteristics and challenges, and how cloud computing enables analyzing massive amounts of data.
- Examples of big data analytics applications are described, including customer profiling, predictive modeling, and optimization of online marketing campaigns. Lessons are discussed for effective modeling, including the importance of domain expertise and identifying key data transformations and variables.
2. Agenda
• Overview of big data analytics
• Insights from big data and its analysis
• BI process on big data
• Lessons in model building
• Cases of behavioral profiles for predictive models
– Yahoo network segmentation
– Tribal Fusion display ads impression optimization
– University of Phoenix student retention and lead optimization
• Case of Ask.com SEM algorithms
3. Daqing Zhao, PhD
• Big Data scientist with deep domain knowledge
• Academic training
– Analyzed molecular spectra on Cray supercomputers
– Determined, modeled, simulated molecular motions in 3D space
• Enjoys working with large data and large-scale computing
• Worked on computational Internet marketing since 1999
4. New Book on Big Data Analytics
• Daqing Zhao, Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing
5. Big data, Big Opportunities
• Thanks to Moore’s law gains in CPUs, storage, and network connections
• Too much data, too little knowledge
• Data, analytics changed every field many times over
• From science, government, to commerce
6. Big data characteristics
• Amount of data too big to handle using conventional technology; most collected data lie dormant
• Raw data are stored, appended but not updated
• Formatted or free format data
• No aggregation for purpose of data reduction
• Individual customer level and individual event level data
• Sensor data
• Complete 360 degree view
• Process from raw data to get insights and build models
• Some business uses of big data: customer profiling, event prediction, automated decision machines, risk management, wisdom of crowds
7. Things computers are good at
• Computers have perfect memory
– Every page view, click, transaction, every event,…
• Good at finding a needle in a haystack
– E.g., target abandoned shopping carts with promotions
– Clickers of this page in the last week
• Good at trade-offs among a large number of factors
– Female, 25-34, with child < 5, Asian, earning $30K, rent, divorced, live in Calif., some college, Walmart, Coupons.com, Monster.com, drive Camry, …
– Buyer of X or not?
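To illustrate how a model trades off many factors at once, here is a minimal logistic-regression sketch in pure Python. The binary demographic features, the synthetic "buyer" rule, and all names are hypothetical, not from the talk:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression by full-batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Hypothetical binary factors: [age_25_34, has_young_child, some_college, renter]
random.seed(0)
X = [[random.randint(0, 1) for _ in range(4)] for _ in range(200)]
# Synthetic ground truth: young parents who rent tend to buy, plus ~10% noise.
y = [1 if x[1] + x[3] == 2 or random.random() < 0.1 else 0 for x in X]
w, b = train_logistic(X, y)

def predict(x):
    """Buyer of X or not, given the factor vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5

accuracy = sum(predict(x) == bool(yi) for x, yi in zip(X, y)) / len(X)
```

The point is not the model class but the mechanism: every factor contributes a weighted vote, and the computer balances all of them simultaneously.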
8. Things computers on Internet are good at
• Platforms of crowdsourcing
– Google PageRank, Adwords, Picasa, Translate, …
• Data not previously looked at in aggregate
– Google PageRank/Translate, Amazon Find Product
• Data not previously created, or accumulated
– Social network data at LinkedIn, Facebook
– Amazon Customer Review, Yelp
– Twitter, Flickr
– Wikipedia, YouTube/Khan Academy, eHow, Udemy, Yahoo/Answers
9. Computers make it possible
• Given data, find models and parameters
– Identify reproducible patterns in the data
– Provide simple picture of a large number of events
– Predict events in the future
• Simulations generate future events, given assumptions and the current state
– Given a set of models, what future scenarios will look like under a given set of conditions, the “what ifs”
• Robots and agents
– Make decisions based on environment and goals, e.g., self-driving cars
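The "what if" simulation idea can be sketched with a tiny Monte Carlo: given an assumed churn model and the current state, project how a customer base evolves under different scenarios. The churn rates and acquisition numbers below are made-up assumptions:

```python
import random

def simulate_customers(n0, monthly_churn, monthly_adds, months,
                       trials=200, seed=42):
    """Monte Carlo 'what if': average final customer count over many runs."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        n = n0
        for _ in range(months):
            # Each current customer independently survives the month...
            survivors = sum(rng.random() > monthly_churn for _ in range(n))
            # ...then new customers are acquired.
            n = survivors + monthly_adds
        finals.append(n)
    return sum(finals) / trials

# Two scenarios, identical except for the assumed churn rate.
base = simulate_customers(500, 0.05, 20, 12)
worse = simulate_customers(500, 0.10, 20, 12)
```

Comparing `base` against `worse` shows how a single changed assumption propagates into a different future scenario.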
10. Computers can’t do everything
• Data often have issues before being well analyzed
• Data often have no taxonomy and no context
• Relevant information needs to be extracted from free-format data
• The analyst has to define targets and construct predictors
• The analyst has to include critical predictive factors
• The analyst needs to add common sense
11. Every wrong dataset is wrong in its own way
• Some data are not collected, deemed “too big” or “useless”, as in flood control or purged log data
• Some data feeds to the warehouse are incomplete
• Multiple definitions and inconsistent business rules, with no documentation
• Data incomplete due to the nature of the business
– Sparse data
– Separate log-in and log-out data
– Credit card purchases versus cash
• Some flaws are easy to catch, such as missing or constant fields
• Some flaws are hard to find, such as partially missing or incorrect values
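The easy-to-catch flaws (fully missing or constant fields) can be screened automatically before modeling. A small sketch, assuming records arrive as dicts with None marking a missing value; the column names are hypothetical:

```python
def profile_columns(rows):
    """Screen columns for easy-to-catch flaws: missing and constant values."""
    report = {}
    for col in rows[0]:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "missing_rate": 1 - len(non_null) / len(values),
            # A column with at most one distinct non-null value carries no signal.
            "constant": len(set(non_null)) <= 1,
        }
    return report

# Hypothetical lead records; None marks a missing field.
rows = [
    {"user_id": 1, "lead_source": "ad", "country": "US"},
    {"user_id": 2, "lead_source": None, "country": "US"},
    {"user_id": 3, "lead_source": "ad", "country": "US"},
]
report = profile_columns(rows)
```

The harder flaws, partially missing or subtly incorrect values, still require the analyst's domain knowledge; no profiler catches those automatically.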
12. Best practices of analyst
• Understand how the data are collected, and what data can and cannot be collected
• Balance the cost of collecting data against the modeling payoff
• Use feedback loops to test hypotheses
• Run simulations to see whether changes are reasonable
• Good ideas are not necessarily complicated ideas
• Focus on domain knowledge, not just data mining tools
13. Best Practices of Analytics Managers
• Well versed in analytics: understand analysts, their behavior, their tests, their work, and their value
• Focus on domain knowledge, not just data mining tools
• Focus on impact, not elegance, in modeling
• Big Data analytics differs from small-sample statistics, and managers need to learn on the job
• As activities become more technical, it is hard to recognize value and identify issues
– 2008: financial crisis and credit derivatives
– The principal-agent problem
14. New Information Explosions
• Before ~1450, only the nobility had a few books
• After Gutenberg, information was limited by paper and printing capacity
– People complained there was too much information
– Then we had libraries, indexes, abstracts, book reviews, …
• Now information is limited by disk and cloud storage
– A person’s lifetime of spoken words fits on a thumb drive
– Soon everything can be stored
• Now: how do we make use of all the information?
– Search, crowdsourcing, Twitter, Wikipedia, YouTube, big data and analysis algorithms, …
15. Paradigm Shift in Data Organization
• Mathematics is a way to efficiently use brain resources
– With pen and paper, only simple problems solvable
– Crude approximations, and samples for complicated ones
– Unreasonable effectiveness of mathematics – E. Wigner
• Now, algorithms are ways to efficiently use computing resources
– Numerical solutions of complex equations
– Large scale simulations, full population databases
– Unreasonable effectiveness of data – P. Norvig
• Elegant, oversimplified models are less useful
16. Paradigm Shift in Knowledge
• Knowledge is power, by Francis Bacon
• Past: drowning in information, starving for knowledge, by John Naisbitt
• Now: knowing how to extract knowledge is power
• Soon: an abundance of knowledge, seeking relevance
– Incl. personal finance, medical, political decisions
• Innovations are about connecting the dots
– Distances between the dots are getting smaller
– Leverage knowledge to make decisions, manage risks
17. Big Data problem
• Data size is larger than what databases can handle
• Terabytes of data may take hours just to scan
• The solution requires a cloud of servers with local storage
– Read, process, and write intermediate results in parallel
– Aggregate at the end
• Cloud computing can build models at scale
• Clouds often scale linearly with the number of servers
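The read-in-parallel, aggregate-at-the-end pattern can be sketched as a toy map/reduce. Threads stand in for servers with local shards, and the event schema is hypothetical:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map step: each worker scans only its local shard and emits partial counts."""
    return Counter(event["page"] for event in chunk)

def reduce_counts(partials):
    """Reduce step: aggregate the partial results at the end."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Toy event log sharded across 4 workers (threads stand in for servers).
events = [{"page": "home"}] * 5 + [{"page": "cart"}] * 3
shards = [events[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, shards))
totals = reduce_counts(partials)
```

Because each shard is processed independently, adding workers (servers) shortens the scan roughly linearly, which is why clouds scale with the number of servers.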
18. Modeling needs to scale
• Traditional predictive models take a long time to build
– Small data sets; samples expensive to collect
• Now data are cheap, and models may degrade in weeks
– The dimensionality of predictors is very large
– The number of categories is large
• Human interactive model building is not scalable
• The reasons for target events are complex
• Without detailed analysis, it is unclear what drives the event
• We need to rely on “out of sample testing” and “off the shelf” modeling
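A minimal sketch of "out of sample testing": hold out rows the model never sees during fitting, and judge it only on those. The toy data and the threshold "model" are illustrative, not from the talk:

```python
import random

def holdout_split(rows, test_frac=0.3, seed=7):
    """Shuffle and split so the test set stays out of sample."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

# Toy data: label is 1 exactly when x > 50.
data = [(x, 1 if x > 50 else 0) for x in range(100)]
train, test = holdout_split(data)

# "Off the shelf" model: threshold at the smallest positive x seen in training.
threshold = min(x for x, y in train if y == 1)
test_acc = sum((x >= threshold) == bool(y) for x, y in test) / len(test)
```

The accuracy reported on `test` is the honest number; accuracy measured on `train` would be flattered by whatever the model memorized.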
19. Cloud computing
• We built a SAS cloud at University of Phoenix
– I have an invited SAS talk available at the SAS web site
– Can process billions of impressions in minutes
• Hadoop clouds are used widely
– Open source software: Hive, Impala, Mahout
– Commodity servers and storage
• Clouds may have 100Ks of servers
– Find a needle in a haystack in milliseconds
– Model computations that once took years now finish in minutes
20. Big Data Centers
• Facebook and Google data centers use commodity servers
• Google uses 260 million watts, enough to power 200K homes (NY Times)
• Data centers near the Columbia River at The Dalles, Oregon
21. Traditional BI pyramid
• Defines a sequence of efforts
• Most companies never get beyond reporting and simple analysis
• No full analysis or predictive modeling is ever done
• Some data issues may not be caught
• Limited insights hinder optimal extraction of knowledge
[Pyramid figure, ordered by data maturity: Standard Report → Multidimensional Report → Segmentation → Predictive Modeling → Knowledge Discovery; baseline pyramid]
22. Hadoop
Analysis leads to better data quality
[Diagram: raw data feeds analysis and algorithms, which produce reports, business rules, and predictive models]
23. More analysis leads to better quality
[Flow diagram: Data Collection → Exploratory Analysis → Predictive Modeling → Decision Algorithms → better data quality]
24. Data are most important
• In modeling, finding the key data matters most
– Identify the smoking gun
• Data transformations
– PageRank is a game-changing data transformation
– Wine.com case: wineRank
– The social graph is a key data transformation for credit card fraud detection
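Since PageRank is cited as a game-changing data transformation, here is a small power-iteration sketch over a toy link graph. The three-node graph and the constants are illustrative only:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Everyone starts each round with the teleportation share.
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = rank[u] / len(outs)
                for v in outs:
                    new[v] += damping * share
            else:  # Dangling node: spread its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Toy link graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The transformation turns raw link data into a single per-node score, exactly the kind of derived variable the slide argues can change the game.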
25. Modeling can go wrong
• Leakage in a lead scoring model
– For example, using lead source to predict conversion when certain values of the field were populated only for converters
• Display ads conversion model
– Construct the data set by taking all converters and a sample of non-converters
– Predict on page view profiles
– Problem: the sample of non-converters included customers who had no impressions of the ad
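A sketch of fixing the display-ads pitfall: sample non-converters only from users who actually received an impression, so exposure itself cannot leak into the label. All ids and names are illustrative:

```python
import random

def build_training_set(converters, non_converters, impressions,
                       sample_rate=0.1, seed=1):
    """Label converters 1 and a sample of *exposed* non-converters 0.

    Restricting to users in `impressions` keeps never-exposed customers out,
    so the model learns response to the ad rather than mere exposure.
    """
    rng = random.Random(seed)
    eligible = [u for u in non_converters if u in impressions]
    sampled = [u for u in eligible if rng.random() < sample_rate]
    return [(u, 1) for u in converters] + [(u, 0) for u in sampled]

# Toy ids: users 7 and 8 never saw the ad and must be excluded.
impressions = {1, 2, 3, 4, 5, 6}
converters = [1, 2]
non_converters = [3, 4, 5, 6, 7, 8]
training = build_training_set(converters, non_converters, impressions,
                              sample_rate=1.0)
```

With the naive construction, users 7 and 8 would appear as easy negatives, and the model would partly learn "was shown the ad" instead of "responded to the ad".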
26. Modeling lessons
• Yahoo DSL subscribers, one-year contract
• If you model month-to-month retention, you find a high retention rate
– Because of contracts and penalties
• The correct way is to model retention at contract expiry, on only 1/12 of the customers
• For Yahoo email, if you look at quarter-by-quarter retention, you find that those acquired early in the first quarter have a lower retention rate
– Because those customers have had more time to churn
• A correct way is to use survival analysis
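The survival-analysis fix can be sketched with a hand-rolled Kaplan-Meier estimator: customers still active at the observation cutoff are censored, leaving the risk set without being counted as churn, instead of being mistaken for loyal customers. The durations below are made up:

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve from (duration, churn-observed) pairs.

    Censored customers (observed=False, i.e. still active at cutoff) shrink
    the risk set but contribute no churn event.
    """
    # Sort by time; at ties, process churn events before censored exits.
    order = sorted(zip(durations, observed), key=lambda p: (p[0], not p[1]))
    at_risk = len(order)
    surv = 1.0
    curve = []
    for t, churned in order:
        if churned:
            surv *= (at_risk - 1) / at_risk
            curve.append((t, surv))
        at_risk -= 1
    return curve

# Months until churn; False means the customer was still active (censored).
curve = kaplan_meier([2, 3, 3, 5, 5], [True, True, False, True, False])
```

Naively dropping the censored customers (or counting them as retained forever) would bias the retention estimate, exactly the quarter-by-quarter artifact described above.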
27. Conclusions
• For optimal modeling, domain knowledge is most important
• May require Big Data solutions to scale
• Identify key data and transformations
• Data are not reliable until seriously analyzed
• Conduct deep analysis before developing BI reports
• Testing and optimizing in the real market is crucial
• Focus on customer experience, not model complexity or predictive accuracy
• “The best way to get good ideas is to have a lot of them”
– Linus Pauling
• Use a lot of common sense