Webinar - The Science of Segmentation: What Questions You Should be Asking Your Data?

Enterprise companies starting the transformation into a data-driven organization often wonder where to start. Companies have traditionally collected large amounts of data from sources such as operational systems. With the rise of big data, big data technologies and the Internet of Things (IoT), additional sources – such as sensor readings and social media posts – are rapidly becoming available. In order to effectively utilize both traditional sources and new ones, companies first need to join and view the data in a holistic context. After establishing a data lake to bring all data sources together in a single analytics environment, one of the first data science projects worth exploring is segmentation, which automatically identifies patterns.

In this DSC webinar, two Pivotal data scientists will discuss:

· What segmentation is
· Traditional approaches to segmentation
· How big data technologies are enabling advances in this field

They will also share some stories from past data science engagements, outline best practices and discuss the kinds of insights that can be derived from a big data approach to segmentation using both internal and external data sources.

Panelists:
Grace Gee, Data Scientist -- Pivotal
Jarrod Vawdrey, Data Scientist -- Pivotal

Hosted by:
Tim Matteson, Co-Founder -- Data Science Central

To learn more about data at Pivotal, visit http://www.pivotal.io/big-data
To view video, visit https://www.youtube.com/watch?v=svKLdMWusGA

Published in: Data & Analytics

  1. © 2015 Pivotal Software, Inc. All rights reserved.
  2. The Science of Segmentation: What Questions Should You Be Asking Your Data? April 14, 2015. Jarrod Vawdrey, Data Scientist @ Pivotal; Grace Gee, Data Scientist @ Pivotal
  3. Agenda
     • Typical State Of Companies New To Big Data Analytics
       – Benefits of Big Data technologies
     • When to Use Segmentation
       – Common business problems
       – Types of available data
     • Use Cases & Approaches To Segmentation
       – Common approaches
       – Best practices
  4. Typical State of Companies New to Big Data Analytics
  5. Typical State of Companies New to Analytics
     • Companies in the process of transforming into a data-driven organization often ask similar questions about where to start:
       – How do I make data available for my analysts?
       – What tools are needed to efficiently process and build models on my big data sets?
       – What data should I be collecting and archiving?
       – Where and how can I start to use all my data to quickly gain actionable insights and begin integrating data science into our organization's practices?
       – How do I leverage data to generate value for stakeholders?
       – How do I enable analysts and data scientists to be more effective?
  6. Common Business Challenges
     Data Availability
     • Disparate data sources
     • No integration of data across lines of business
     • Insufficient data
     • Unknown single source of truth
     Slow Time-to-Insight
     • Analytics architectures that are often outdated and focused on operational processes hamper the experimental nature of big data analytics
     • Lack of knowledge about analytics software for in-place processing of and computation on Big Data
     • Company organizational structure inhibits fast acquisition of data and communication of insights
  7. Big Data Technologies for Data-Driven Organizations
     • Data Lake: efficient, massively scalable Big Data storage platform
       – Store all data: we don't want to inhibit the ability to answer future questions
       – Save all (structured, unstructured, and semi-structured) types of data: we may not immediately know the "optimal" form to store data for analysis
       – Work with multiple types of data from one location
       – Centralized location of data accessible to all organizations
     • Agile Analytics Platform: purpose-built architecture for getting results and gaining insights quickly through parallel, in-place data analytics
       – No required sampling due to limited memory
       – No data movement
       – Scalable analytics
  8. Big Data Technologies for Data-Driven Organizations
     [Diagram: Platform-Driven Data Science]
     • Data Sources: Proprietary Structured Data; Proprietary Unstructured Data; Partner Data; Self Reporting (Google, Weblogs, Twitter); External Sources (Census, Nielsen, Weather, etc.); Sensors
     • Platform components: HAWQ; Greenplum DB; Pivotal HD (HDFS); GemFire XD; MADlib, PL/R, PL/Python, etc.
     • Other labels: Enterprise Apps; Reporting; Prioritized Operational Processes; Inventory Optimization; Demand Forecasting; Fraud Detection
  9. Segmentation: An important step for understanding data
     • What is segmentation?
       – Automatic grouping of entities based on a common set of features
       – Identification of patterns amongst similar entities
     • What is segmentation good for?
       – Identifying select features that greatly differentiate groups of entities
         • E.g. Identifying behaviors of high-profit suppliers and low-profit suppliers
       – Identifying similar characteristics amongst different groups
         • E.g. Identifying similar market segments to target
       – Predicting characteristics and behaviors of new or unknown entities
         • E.g. Inferring missing labels, predicting market response to new products
  10. Segmentation & Big Data Technologies: How cutting-edge Big Data technology enables faster insights
      • Segmentation problems often deal with:
        – Multiple data sources from multiple lines of business and external sources
        – BIG DATA, particularly from sensor data or transactions/point of sales
        – High-dimensional feature sets
      • Big Data technologies help make segmentation problems feasible and bring faster time-to-insight through:
        – Ability to leverage and integrate all relevant data sources, no matter how large → Data Lake
        – Using ALL data to train segmentation models rather than relying on samples or a subset of data that fits into memory → MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.
        – Quickly building segmentation models and scoring new entities through parallelized, in-place computation → MPP databases, Hadoop, HAWQ, MADlib, Spark, etc.
  11. When To Use Segmentation
  12. Common Business Problems (where segmentation can help)
      • Customer Micro-targeting: Identifying market segments and their purchasing behaviors
      • Operations & Logistics: Identifying behaviors of underperforming or outperforming stores, suppliers, delivery services, etc.
      • Fraud: Identifying normal and anomalous user behaviors within networks
      • Domain Resolution: Inferring labels or groups of similar web domains
  13. Data Used In Segmentation of Customers: Power in leveraging both internal and external datasets
      • Demographic profiles
      • Sensor data
      • Product metadata
      • Shipment data
      • Store metadata
      • Transactions and invoices
      • Delivery information
      • Marketing plans
      • External data: Census, Nielsen, social networks, etc.
  14. Gaining Additional Value From External Data
      Often companies do not or cannot collect sufficient data about their customers to construct a complete profile. Augmenting internal data with external sources allows companies to:
      • Develop a 360 degree customer view
      • Gain insights into how consumers are interacting with competitors
      • Improve accuracy of predictive models
      • Increase the value of internal data
      [Diagram: internal and external data sources. Labels shown: Point of sales, Transaction data, Web/Apps logs, Investments, Market basket, Loans, Traffic, Weather, IXI wealth complete, Haver time series, Dept. of Labor, CRM]
      Note: This list only represents a subset of data sources that should be considered.
  15. Example: Using Census Data to Build Family Profiles
      Consumer Packaged Goods (CPG) companies are often interested in building market profiles for micro-targeting to improve marketing strategies and supply chain planning.
      Hypothesis:
      • Not only are CPG companies interested in the individual consumer, but in the family profile as well
        – E.g. Consumption of child products is affected by family size
      Approach:
      • Census Public Use Microdata Sample (PUMS) files include person records and housing records which can be combined in segmentation models to build rich family profiles.
      [Chart: Households with Children*, shown as fraction of households. *Children as defined by a certain age group]
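
The approach on slide 15, combining person records with housing records, can be sketched roughly as below. This is a minimal illustration, not the presenters' code: the field names (`household_id`, `age`, `rooms`), the sample records, and the child age cutoff are all assumptions, since the slide only says children are "defined by a certain age group".

```python
from collections import defaultdict

# Hypothetical, hand-made records; real PUMS files carry many more fields.
housing_records = [
    {"household_id": "H1", "rooms": 5},
    {"household_id": "H2", "rooms": 3},
    {"household_id": "H3", "rooms": 4},
    {"household_id": "H4", "rooms": 2},
]
person_records = [
    {"household_id": "H1", "age": 38},
    {"household_id": "H1", "age": 36},
    {"household_id": "H1", "age": 7},
    {"household_id": "H2", "age": 67},
    {"household_id": "H3", "age": 29},
    {"household_id": "H3", "age": 3},
    {"household_id": "H3", "age": 1},
    {"household_id": "H4", "age": 45},
]

CHILD_MAX_AGE = 12  # assumption: the slide does not state the cutoff


def build_family_profiles(housing, persons, child_max_age=CHILD_MAX_AGE):
    """Join person records to housing records and summarise each household."""
    ages_by_household = defaultdict(list)
    for person in persons:
        ages_by_household[person["household_id"]].append(person["age"])

    profiles = {}
    for house in housing:
        ages = ages_by_household.get(house["household_id"], [])
        profiles[house["household_id"]] = {
            "rooms": house["rooms"],
            "family_size": len(ages),
            "num_children": sum(1 for a in ages if a <= child_max_age),
        }
    return profiles


def fraction_with_children(profiles):
    """Fraction of households that contain at least one child."""
    with_children = sum(1 for p in profiles.values() if p["num_children"] > 0)
    return with_children / len(profiles)


profiles = build_family_profiles(housing_records, person_records)
share = fraction_with_children(profiles)
print(profiles["H3"], share)
```

The per-household dictionaries produced here are the kind of "family profile" rows that could then be fed into a segmentation model alongside other features.
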
  16. Use Cases & Approaches to Segmentation
  17. Common Approaches for Implementing Segmentation
      Data Step
      • Identify join relationships across all data sources
      • Aggregate data to common granularity
      Feature Step
      • Identify and create features that can characterize the entities you want to segment, e.g. age, gender, types of last transactions, average time between visits, average spend, sensitivity to price change, etc.
      Model Step
      • Candidate algorithms: clustering strategies like k-means & hierarchical clustering, regression or hierarchical modeling and grouping by similar coefficients, ensemble methods, etc.
      Analysis of Results
      • Look at average features across clusters
      • Look at average cluster features vs. population average (e.g. to find anomalous behavior)
      • Identify common features amongst segments (e.g. opportunities for cross-sell/up-sell)
  18. Example: Profiling market and consumer segments
      • Objective: Identify characteristics of consumers that prefer certain brands or products
      • Common business challenges:
        – No integration of data amongst different lines of business
        – Internal data is not sufficient for building profiles
        – No information about which consumers are more/less profitable
      • Data sources:
        – Point of sales, demographic data, loyalty data, product and store metadata, external data
  19. Step 1: Consolidate Data Sources
      • Identify relationships and joins amongst all data sources
      • Clean data by removing outliers and imputing missing values if appropriate
        – For example, using the median or weighted average value for a state to impute a missing value for a county
      • Aggregate or select data to a common granularity that makes sense
        – For example, demographic profiles can be built at the zip code or county level, and store profiles can be built at the individual store, tier, or region level
      • Do gap analysis to determine the scope of data sufficient for analysis
        – For example, a certain subset of customers may be missing data for a large time period and should be scoped out
      [Chart: number of stores reporting over time]
      Using an MPP database like Greenplum, we can join tables with billions of rows in a little over a minute.
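
Two of the Step 1 tasks on slide 19, imputing a missing county value from the state median and a simple gap analysis of stores reporting over time, might look like the following in miniature. The records, field names, and the 75% coverage threshold are illustrative assumptions; in the setting the slides describe, this work would run inside an MPP database rather than in plain Python.

```python
from statistics import median

# Hypothetical county-level records; None marks a missing value.
counties = [
    {"state": "CA", "county": "A", "median_income": 60000},
    {"state": "CA", "county": "B", "median_income": 80000},
    {"state": "CA", "county": "C", "median_income": None},
    {"state": "TX", "county": "D", "median_income": 50000},
    {"state": "TX", "county": "E", "median_income": None},
]


def impute_with_state_median(rows, field):
    """Fill missing county values with the median of the same state's known values."""
    known = {}
    for row in rows:
        if row[field] is not None:
            known.setdefault(row["state"], []).append(row[field])
    state_median = {state: median(values) for state, values in known.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        if new_row[field] is None and row["state"] in state_median:
            new_row[field] = state_median[row["state"]]
        imputed.append(new_row)
    return imputed


# Hypothetical (store_id, week) pairs for weeks in which a store sent data.
reports = [("S1", 1), ("S1", 2), ("S1", 3), ("S1", 4),
           ("S2", 1), ("S2", 4),
           ("S3", 1), ("S3", 2), ("S3", 3), ("S3", 4)]


def stores_reporting_per_week(reports):
    """Count distinct stores reporting in each week (the series the chart plots)."""
    counts = {}
    for _store, week in set(reports):
        counts[week] = counts.get(week, 0) + 1
    return dict(sorted(counts.items()))


def stores_in_scope(reports, all_weeks, min_coverage=0.75):
    """Keep stores that reported in at least min_coverage of the weeks."""
    weeks_by_store = {}
    for store, week in reports:
        weeks_by_store.setdefault(store, set()).add(week)
    return sorted(store for store, weeks in weeks_by_store.items()
                  if len(weeks) / len(all_weeks) >= min_coverage)


imputed = impute_with_state_median(counties, "median_income")
weekly_counts = stores_reporting_per_week(reports)
in_scope = stores_in_scope(reports, all_weeks=[1, 2, 3, 4])
print(imputed[2], weekly_counts, in_scope)
```

Store S2 reports in only two of four weeks, so it falls below the assumed threshold and is scoped out, which mirrors the slide's example of dropping entities with long gaps.
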
  20. Step 2: Feature Engineering & Selection
      • Transactions & Point of Sales: Total sales, Change in sales, Price, Discount, Market basket
      • Store/Location: Geolocation, Weather
      • Product: Department/Type, Color, Size, Brand, Package
      • Demographic: Age, Gender, Income, Employment, Education, Family size, Marital status, Citizenship, Language
      • Loyalty: Status, Length of membership, Activity
      It's common for data scientists to generate hundreds of thousands of features.
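
As a small illustration of the feature step, the sketch below derives a few per-customer features of the kind named on slides 17 and 20 (total sales, average spend, average time between visits) from raw transactions. The transaction layout and field names are assumptions made for the example.

```python
from datetime import date

# Hypothetical transaction log; one row per purchase.
transactions = [
    {"customer": "C1", "date": date(2015, 1, 1), "amount": 20.0},
    {"customer": "C1", "date": date(2015, 1, 11), "amount": 40.0},
    {"customer": "C1", "date": date(2015, 1, 31), "amount": 30.0},
    {"customer": "C2", "date": date(2015, 2, 1), "amount": 100.0},
]


def customer_features(transactions):
    """Aggregate transactions to one feature row per customer."""
    by_customer = {}
    for t in transactions:
        by_customer.setdefault(t["customer"], []).append(t)

    features = {}
    for customer, rows in by_customer.items():
        rows = sorted(rows, key=lambda r: r["date"])
        amounts = [r["amount"] for r in rows]
        # Days between consecutive visits; empty for single-visit customers.
        gaps = [(later["date"] - earlier["date"]).days
                for earlier, later in zip(rows, rows[1:])]
        features[customer] = {
            "total_sales": sum(amounts),
            "avg_spend": sum(amounts) / len(amounts),
            "num_visits": len(rows),
            "avg_days_between_visits": sum(gaps) / len(gaps) if gaps else None,
        }
    return features


features = customer_features(transactions)
print(features["C1"])
```

Single-visit customers get `None` for the gap feature here; whether to impute, drop, or flag such cases is a modelling decision the slides leave open.
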
  21. Step 2: Feature Engineering & Selection
      In order to reduce feature dimensionality and account for unwanted bias due to the inclusion of highly correlated features, we can filter features using approaches such as:
      • Principal Component Analysis
        – Reduce the dimensionality of the feature space to a select number of principal components
      • Iterative pairwise correlation comparison
        – Calculate NxN pairwise correlations, where N is the number of features
        – Remove the feature that appears in the greatest number of correlated pairs (correlation coefficient greater than some threshold)
        – Iterate until no correlated pairs exist
      [Figure: Example subset of a feature correlation matrix]
      The large number of features requires an automated approach to feature selection.
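
The iterative pairwise correlation comparison from slide 21 can be sketched as follows (the PCA option is not covered here). Several details are assumptions the slide does not specify: Pearson correlation is used, the absolute value is compared with the threshold, the threshold is 0.9, and ties are broken by feature order. The feature names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Repeatedly drop the feature that sits in the most correlated pairs.

    features: dict mapping feature name -> list of values.
    Returns the names that survive, in their original order.
    """
    kept = list(features)
    while True:
        pairs = [(a, b)
                 for i, a in enumerate(kept)
                 for b in kept[i + 1:]
                 if abs(pearson(features[a], features[b])) > threshold]
        if not pairs:
            return kept
        counts = {}
        for a, b in pairs:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        # max() returns the first maximal item, so ties go to the earliest feature.
        worst = max(kept, key=lambda name: counts.get(name, 0))
        kept.remove(worst)


# Hypothetical feature columns: the first three move together, the last does not.
features = {
    "total_sales": [1, 2, 3, 4, 5],
    "units_sold": [2, 4, 6, 8, 10],
    "avg_price": [1.1, 2.0, 3.2, 3.9, 5.1],
    "store_age": [3, 1, 4, 1, 5],
}
kept = drop_correlated(features, threshold=0.9)
print(kept)
```

Each pass recomputes the pairwise correlations among the surviving features, which is quadratic in the feature count; with the hundreds of thousands of features mentioned on the previous slide, that is the part one would want to push into a parallel engine.
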
  22. Step 3: Build Models
      • Example: K-means Clustering
        1. Create a single feature vector for each entity, e.g. consumer
        2. Use k-means clustering to identify k consumer segments
           i. Try multiple training trials for multiple values of k
           ii. Use any one of a variety of techniques for selecting optimal k, e.g. silhouette coefficient
        3. Look at average features across segments to identify segment characteristics
        4. Look at purchasing behaviors of each segment to identify segment preferences
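
Steps 2.i and 2.ii on slide 22 can be sketched with a small, dependency-free k-means (Lloyd's algorithm) and a mean silhouette coefficient. This is a toy under stated assumptions, not the presenters' implementation: the farthest-point seeding, the use of different starting points as "trials", and the 2-D sample data are all choices made for the example. In the environment the slides describe, a parallel library such as MADlib would do this work.

```python
from math import dist


def init_centroids(points, k, start):
    """Farthest-point seeding: begin at points[start], then keep adding the
    point farthest from the centroids chosen so far (an assumed choice)."""
    centroids = [points[start % len(points)]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, start=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, inertia)."""
    centroids = init_centroids(points, k, start)
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, k_values, trials=3):
    """For each k keep the lowest-inertia trial, then pick k by silhouette."""
    scores, labelings = {}, {}
    for k in k_values:
        labels, _ = min((kmeans(points, k, start=s) for s in range(trials)),
                        key=lambda run: run[1])
        scores[k] = silhouette(points, labels)
        labelings[k] = labels
    best = max(scores, key=scores.get)
    return best, scores, labelings[best]


# Hypothetical feature vectors: three well-separated groups of consumers.
points = [(1, 1), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2),
          (5, 5), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),
          (9, 1), (9.2, 0.9), (8.8, 1.2), (9.1, 1.1)]
best_k, scores, best_labels = choose_k(points, k_values=[2, 3, 4, 5])
print(best_k, best_labels)
```

On this toy data the silhouette score peaks at three segments. With real features, scaling each feature before clustering usually matters, since k-means is distance-based.
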
  23. Step 4: Extract Business Value from Results
      • Segmentation models used to identify and profile consumer groups
        – Calculate descriptive statistics for each segment and compare to uncover previously hidden opportunities
          • Cross-sell/up-sell opportunities
          • Potential data issues or supply chain execution opportunities regarding unequal proportion of product shipment or inventory to regional preference
      • Rich set of reusable data assets made available for ongoing analysis & reporting
      [Heat map: Features 1, 2, 3, ... by Clusters 1 to 4, shaded from low value to high value, compared across clusters]
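
The comparison behind the heat map on slide 23, average feature values per segment set against the population average, might be computed along these lines. The feature names, values, and segment labels are invented for the example, and reporting the difference from the population mean is just one assumed way to express "low" versus "high".

```python
def segment_profiles(rows, labels, feature_names):
    """Per-segment feature means and their difference from the population mean."""
    n = len(rows)
    population = {f: sum(r[f] for r in rows) / n for f in feature_names}

    by_segment = {}
    for row, lab in zip(rows, labels):
        by_segment.setdefault(lab, []).append(row)

    profiles = {}
    for lab, members in sorted(by_segment.items()):
        profiles[lab] = {}
        for f in feature_names:
            mean = sum(m[f] for m in members) / len(members)
            profiles[lab][f] = {"mean": mean, "vs_population": mean - population[f]}
    return population, profiles


# Hypothetical consumer features and segment assignments from a clustering step.
rows = [
    {"avg_spend": 10, "visits": 2},
    {"avg_spend": 20, "visits": 4},
    {"avg_spend": 60, "visits": 10},
    {"avg_spend": 70, "visits": 12},
]
labels = [0, 0, 1, 1]

population, profiles = segment_profiles(rows, labels, ["avg_spend", "visits"])
for segment, stats in profiles.items():
    print(segment, stats)
```

Features whose segment mean sits far from the population mean are the natural candidates to characterize a segment, or to flag anomalous behavior as noted on slide 17.
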
  24. What Questions Should You Be Asking Your Data?
      • Are you collecting the right data & storing it in the right fashion?
      • Do you have the right technology to support your data and data science endeavors?
      • Where are the gaps in your data? How can external sources fill those gaps?
      • How can your data sources be joined or aggregated together to build rich feature sets?
      • How can you extract business value from your data?
      Segmentation will help you answer all of these questions!
  25. Thank You.